Project 5: PDF Assembly Language

Project Overview

Attribute             Details
Difficulty            Level 2: Intermediate
Time Estimate         2-3 weeks
Programming Language  Python (primary), Rust/C/Go (alternatives)
Knowledge Area        Document Processing / DSL Design
Prerequisites         Project 2 (PDF parser understanding)

What you’ll build: A human-readable “assembly language” for PDF that compiles to valid PDF files, with a disassembler that converts PDF back to this format.

Why it teaches PostScript→PDF: By creating a readable intermediate format, you’ll deeply understand PDF’s structure. The assembler/disassembler round-trip proves your understanding is complete.


Learning Objectives

By completing this project, you will:

  1. Design a domain-specific language (DSL) that maps 1:1 to PDF structures
  2. Implement an assembler that converts text to binary PDF
  3. Implement a disassembler that converts PDF to readable text
  4. Handle all PDF object types and compression
  5. Achieve round-trip fidelity (PDF → Assembly → PDF produces identical output)

The Core Question You’re Answering

“How can we make PDF’s complex binary structure human-readable and editable, while maintaining perfect round-trip fidelity?”

This is the same challenge that:

  • Assembly language solves for machine code
  • JSON/YAML solve for serialized data
  • SQL DDL solves for database schema

By building this system, you’ll understand:

  • What makes a good intermediate representation
  • How to balance human readability with technical completeness
  • The fundamental structures that make PDF work

Deep Theoretical Foundation

1. PDF as a Database

PDF files are essentially databases with a clever indexing scheme:

┌─────────────────────────────────────────────────────────────────┐
│                    PDF AS A DATABASE                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DATABASE CONCEPT        PDF EQUIVALENT                         │
│  ─────────────────       ──────────────────                    │
│  Table                   Object type (/Type /Page, etc.)       │
│  Row                     Indirect object (1 0 obj ... endobj)  │
│  Primary key             Object number (1, 2, 3, ...)          │
│  Foreign key             Reference (2 0 R)                     │
│  Index                   Cross-reference table (xref)          │
│  Binary blob             Stream object                          │
│                                                                 │
│  QUERIES:                                                       │
│  "Get page 5"     → Traverse Catalog→Pages→Kids[4]→Contents   │
│  "Get object 10"  → Look up in xref, seek to byte offset      │
│  "Get all fonts"  → Traverse page resources                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
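
The "Get page 5" style of query can be sketched as a key lookup followed by reference chasing. This is only an illustration: the in-memory store `objects`, the `("ref", n)` tuples, and the helper names are assumptions made for the example, and the page tree is assumed to be flat (every entry in /Kids is a page).

```python
# Hypothetical in-memory "database": object number -> dictionary.
# References are modeled as ("ref", object_number) tuples for illustration.
objects = {
    1: {"/Type": "/Catalog", "/Pages": ("ref", 2)},
    2: {"/Type": "/Pages", "/Kids": [("ref", 3), ("ref", 5)], "/Count": 2},
    3: {"/Type": "/Page", "/Parent": ("ref", 2), "/Contents": ("ref", 4)},
    4: {"/Length": 0},
    5: {"/Type": "/Page", "/Parent": ("ref", 2), "/Contents": ("ref", 6)},
    6: {"/Length": 0},
}


def resolve(value):
    """Follow a reference (the "foreign key") to the object it names."""
    if isinstance(value, tuple) and value[0] == "ref":
        return objects[value[1]]
    return value


def get_page(root_num, page_number):
    """Return the dictionary of the 1-based page_number.

    Assumes a flat page tree: every entry in /Kids is a /Page.
    """
    catalog = objects[root_num]
    pages = resolve(catalog["/Pages"])
    kids = pages["/Kids"]
    return resolve(kids[page_number - 1])


page = get_page(1, 2)
contents_ref = page["/Contents"]
```

Real page trees can nest /Pages nodes inside /Pages nodes, so a complete lookup would recurse and use /Count to skip whole subtrees.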

2. Why Assembly Language?

Assembly languages provide:

  1. 1:1 Mapping: Every assembly instruction maps to exactly one machine instruction (or PDF object)
  2. Human Readability: Mnemonics instead of binary opcodes
  3. Symbolic References: Labels instead of memory addresses (or object numbers)
  4. Comments: Documentation within the source

MACHINE CODE:          ASSEMBLY:              PDF ASSEMBLY:
─────────────         ──────────             ─────────────────
B8 0A 00 00 00        mov eax, 10            object 1
                                               type Catalog
                                               pages @pages_root
89 C3                 mov ebx, eax           end

48 01 D8              add rax, rbx           object 2 as pages_root
                                               type Pages
                                               kids [@page1]
                                             end

3. Design Choices

Object Representation:

OPTION 1: Close to PDF Syntax
─────────────────────────────
1 0 obj
  << /Type /Catalog
     /Pages 2 0 R >>
endobj

OPTION 2: Simplified Syntax
─────────────────────────────
object 1 0
  type Catalog
  pages 2 0 R
end

OPTION 3: Labeled Objects
─────────────────────────────
catalog:
  type Catalog
  pages @pages_root

pages_root:
  type Pages
  kids [@page1]
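
Option 3 needs a step that turns labels back into object numbers. A minimal two-pass sketch follows, assuming labeled objects arrive as (label, body) pairs and that numbers are handed out in declaration order; `resolve_labels` and the sample data are hypothetical names for this example only.

```python
import re

# Hypothetical labeled source: (label, body) pairs in declaration order.
labeled = [
    ("catalog", "type Catalog\npages @pages_root"),
    ("pages_root", "type Pages\nkids [@page1]\ncount 1"),
    ("page1", "type Page\nparent @pages_root"),
]


def resolve_labels(objects):
    """Two-pass resolution: number the labels, then rewrite @label uses."""
    # Pass 1: assign object numbers in declaration order, starting at 1.
    numbers = {label: i for i, (label, _) in enumerate(objects, start=1)}

    def to_ref(match):
        label = match.group(1)
        if label not in numbers:
            raise KeyError(f"undefined label: @{label}")
        return f"{numbers[label]} 0 R"

    # Pass 2: replace every @label with an indirect reference "N 0 R".
    return {
        numbers[label]: re.sub(r"@(\w+)", to_ref, body)
        for label, body in objects
    }


resolved = resolve_labels(labeled)
```

Two passes are needed because a label may be used before it is declared, as `@pages_root` is here. A text-level substitution like this would also rewrite an `@` inside a string literal, so a real assembler would more likely resolve labels on tokens.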

Stream Handling:

OPTION A: Inline (base64 for binary)
─────────────────────────────────────
object 4
  stream
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end

OPTION B: External File Reference
─────────────────────────────────────
object 4
  stream from "content.txt"
  filter FlateDecode
end

OPTION C: Automatic Compression
─────────────────────────────────────
object 4
  stream compress=true
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end
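
Options A and C can be sketched in a few lines of Python. The helper names and the `base64:` prefix below are assumptions for illustration, not part of a fixed syntax; the point is that /Length always describes the bytes written to the file, and that binary data needs a text-safe encoding when it is kept inline.

```python
import base64
import zlib


def encode_stream(data: bytes, compress: bool):
    """Return (dictionary, payload) for a stream, as in Option C.

    /Length always describes the bytes actually written to the file,
    that is, the compressed size when compression is on.
    """
    dictionary = {}
    if compress:
        data = zlib.compress(data)
        dictionary["/Filter"] = "/FlateDecode"
    dictionary["/Length"] = len(data)
    return dictionary, data


def is_texty(data: bytes) -> bool:
    """Heuristic: printable ASCII plus tab, newline, carriage return."""
    return all(b in (9, 10, 13) or 32 <= b < 127 for b in data)


def stream_body_for_assembly(data: bytes) -> str:
    """Option A: keep text inline, fall back to base64 for binary data."""
    if is_texty(data):
        return data.decode("ascii")
    # The "base64:" prefix is a made-up marker for this sketch.
    return "base64:" + base64.b64encode(data).decode("ascii")


content = b"BT\n/F1 12 Tf\n(Hello) Tj\nET\n"
plain_dict, plain = encode_stream(content, compress=False)
packed_dict, packed = encode_stream(content, compress=True)
```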

4. The 8 PDF Object Types

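Your assembly language must represent all of these. The PDF specification counts Integer and Real together as numeric objects and treats null as a type of its own, which is how it arrives at eight basic types; the table lists the two numeric forms separately because they lex differently.

Type        PDF Syntax                Assembly Syntax (Example)
Boolean     true, false               true, false
Integer     42                        42
Real        3.14159                   3.14159
String      (Hello) or <48656C6C6F>   "Hello" or hex:48656C6C6F
Name        /Type                     Type or /Type
Array       [1 2 3]                   [1, 2, 3]
Dictionary  << /Key Value >>          { Key: Value } or indented block
Stream      stream...endstream        stream...endstream
Null        null                      null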
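
The two string forms in the table carry the same bytes. A small sketch of converting between them, with hypothetical helper names, assuming the string content is already available as bytes:

```python
def to_hex_string(data: bytes) -> str:
    """Render bytes as a PDF hex string, e.g. <48656C6C6F>."""
    return "<" + data.hex().upper() + ">"


def from_hex_string(text: str) -> bytes:
    """Parse a PDF hex string.

    Whitespace between digits is ignored, and an odd number of digits
    is padded with a trailing 0, as the PDF syntax rules require.
    """
    digits = "".join(text.strip()[1:-1].split())
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


hello_hex = to_hex_string(b"Hello")
hello_bytes = from_hex_string("<48 65 6C 6C 6F>")
```

Emitting the hex form for any string that contains non-printable bytes is one way to keep the assembly file editable as plain text.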

Project Specification

Core Features

Assembler (Text → PDF)

# Basic usage
./pdfasm assemble input.pdfasm -o output.pdf

# With compression
./pdfasm assemble input.pdfasm -o output.pdf --compress

# Verbose output
./pdfasm assemble input.pdfasm -o output.pdf -v

Input format example:

; PDF Assembly Language - Hello World
version 1.7

object 1
  type Catalog
  pages 2 0 R
end

object 2
  type Pages
  kids [3 0 R]
  count 1
end

object 3
  type Page
  parent 2 0 R
  mediabox [0 0 612 792]
  contents 4 0 R
  resources {
    Font {
      F1 5 0 R
    }
  }
end

object 4
  stream
    BT
    /F1 24 Tf
    100 700 Td
    (Hello, PDF World!) Tj
    ET
  endstream
end

object 5
  type Font
  subtype Type1
  basefont Helvetica
end

Output: Valid PDF file that opens in any reader.

Disassembler (PDF → Text)

# Basic usage
./pdfasm disassemble input.pdf -o output.pdfasm

# Include comments
./pdfasm disassemble input.pdf -o output.pdfasm --annotate

# Decompress streams
./pdfasm disassemble input.pdf -o output.pdfasm --decompress

Output format: Human-readable assembly that can be reassembled.

Validator

# Validate PDF structure
./pdfasm validate input.pdf

# Validate assembly syntax
./pdfasm validate input.pdfasm --syntax

Interactive Explorer

./pdfasm explore input.pdf

> list               # List all objects
> show 3             # Show object 3
> page 1             # Show page 1 info
> stream 4           # Show content stream
> tree               # Show document tree
> quit

Solution Architecture

High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                    PDF ASSEMBLY TOOLKIT                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    ASSEMBLER                            │   │
│  │                                                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │   │
│  │  │   Lexer     │─▶│   Parser    │─▶│   Emitter   │     │   │
│  │  │             │  │             │  │             │     │   │
│  │  │ Tokenize    │  │ Build AST   │  │ Write PDF   │     │   │
│  │  │ .pdfasm     │  │ of objects  │  │ with xref   │     │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                            ↑ ↓                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   DISASSEMBLER                          │   │
│  │                                                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │   │
│  │  │ PDF Parser  │─▶│ Decompressor│─▶│ Formatter   │     │   │
│  │  │             │  │             │  │             │     │   │
│  │  │ Parse xref  │  │ Inflate     │  │ Pretty-print│     │   │
│  │  │ Read objects│  │ streams     │  │ to .pdfasm  │     │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    VALIDATOR                            │   │
│  │                                                          │   │
│  │  • Check required keys (Catalog has /Pages)             │   │
│  │  • Validate references (all refs point to valid objects)│   │
│  │  • Check stream lengths                                  │   │
│  │  • Verify xref offsets                                  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
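
The first two validator checks can be sketched over a plain dictionary of objects. The data model here (object number to dictionary, references as `("ref", n)` tuples) and the function names are assumptions for the example, not the project's PDFValidator.

```python
def iter_refs(value):
    """Yield the target object number of every reference inside a value."""
    if isinstance(value, tuple) and len(value) == 2 and value[0] == "ref":
        yield value[1]
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_refs(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def validate(objects, root_num):
    """Return a list of human-readable problems (empty when none found)."""
    errors = []
    root = objects.get(root_num)
    if root is None:
        errors.append(f"root object {root_num} is missing")
    elif "/Pages" not in root:
        errors.append("Catalog has no /Pages")
    for num, obj in sorted(objects.items()):
        for target in iter_refs(obj):
            if target not in objects:
                errors.append(
                    f"object {num} references missing object {target}")
    return errors


good = {
    1: {"/Type": "/Catalog", "/Pages": ("ref", 2)},
    2: {"/Type": "/Pages", "/Kids": [("ref", 3)], "/Count": 1},
    3: {"/Type": "/Page", "/Parent": ("ref", 2)},
}
problems = validate(good, 1)
```

PDF readers treat a reference to an undefined object as null rather than as a fatal error, so a real validator might report dangling references as warnings.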

Key Data Structures

from dataclasses import dataclass
from typing import Any, Union, List, Dict, Optional
from enum import Enum

class PDFObjectType(Enum):
    NULL = "null"
    BOOLEAN = "boolean"
    INTEGER = "integer"
    REAL = "real"
    STRING = "string"
    NAME = "name"
    ARRAY = "array"
    DICTIONARY = "dictionary"
    STREAM = "stream"
    REFERENCE = "reference"

@dataclass
class PDFReference:
    obj_num: int
    gen_num: int = 0

@dataclass
class PDFStream:
    dictionary: Dict[str, Any]  # typing.Any; the builtin any() is a function, not a type
    data: bytes
    is_compressed: bool = False

@dataclass
class PDFObject:
    obj_num: int
    gen_num: int
    value: Union[bool, int, float, str, bytes,
                 List['PDFObject'], Dict[str, 'PDFObject'],
                 PDFStream, PDFReference, None]

@dataclass
class PDFDocument:
    version: str
    objects: Dict[int, PDFObject]
    root_obj_num: int

    def get_object(self, num: int) -> Optional[PDFObject]:
        return self.objects.get(num)

    def resolve_ref(self, ref: PDFReference) -> Optional[PDFObject]:
        return self.objects.get(ref.obj_num)

Implementation Guide

Phase 1: Assembly Parser (Days 1-4)

Create a parser for the assembly language:

# pdfasm/lexer.py
import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator

class TokenType(Enum):
    # Keywords
    VERSION = auto()
    OBJECT = auto()
    END = auto()
    STREAM = auto()
    ENDSTREAM = auto()

    # Values
    INTEGER = auto()
    REAL = auto()
    STRING = auto()
    NAME = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()

    # Delimiters
    LBRACKET = auto()
    RBRACKET = auto()
    LBRACE = auto()
    RBRACE = auto()
    REF = auto()  # "R" suffix

    # Misc
    COMMENT = auto()
    NEWLINE = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType
    value: str
    line: int
    column: int

class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1

    def tokenize(self) -> Iterator[Token]:
        while self.pos < len(self.source):
            # Skip whitespace (except newlines). Commas and colons count as
            # optional separators, so [1, 2, 3] and { Key: Value } both lex.
            if self.source[self.pos] in ' \t\r,:':
                self.advance()
                continue

            # Newlines
            if self.source[self.pos] == '\n':
                yield Token(TokenType.NEWLINE, '\n', self.line, self.column)
                self.advance()
                self.line += 1
                self.column = 1
                continue

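            # Comments
            if self.source[self.pos] == ';':
                # Remember where the comment starts so the token reports
                # its first column, not the column after its last character
                start, start_col = self.pos, self.column
                while self.pos < len(self.source) and self.source[self.pos] != '\n':
                    self.advance()
                yield Token(TokenType.COMMENT, self.source[start:self.pos],
                           self.line, start_col)
                continue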

            # Strings
            if self.source[self.pos] == '"':
                yield self.read_string()
                continue

            # Numbers
            if self.source[self.pos].isdigit() or self.source[self.pos] == '-':
                yield self.read_number()
                continue

            # Delimiters
            if self.source[self.pos] == '[':
                yield Token(TokenType.LBRACKET, '[', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == ']':
                yield Token(TokenType.RBRACKET, ']', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == '{':
                yield Token(TokenType.LBRACE, '{', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == '}':
                yield Token(TokenType.RBRACE, '}', self.line, self.column)
                self.advance()
                continue

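            # Names and keywords
            if self.source[self.pos].isalpha() or self.source[self.pos] == '/':
                token = self.read_name_or_keyword()
                yield token
                if token.type is TokenType.STREAM:
                    # A stream body is raw PDF content such as "(Hello) Tj",
                    # not assembly. Capture it verbatim up to 'endstream' and
                    # pass it on as one STRING token (inline stream form only).
                    end = self.source.find('endstream', self.pos)
                    if end == -1:
                        raise SyntaxError(f"Unterminated stream at line {token.line}")
                    raw = self.source[self.pos:end]
                    body = Token(TokenType.STRING, raw, self.line, self.column)
                    self.line += raw.count('\n')
                    if '\n' in raw:
                        self.column = len(raw) - raw.rfind('\n')
                    else:
                        self.column += len(raw)
                    self.pos = end
                    yield body
                continue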

            raise SyntaxError(f"Unexpected character at line {self.line}: "
                            f"'{self.source[self.pos]}'")

        yield Token(TokenType.EOF, '', self.line, self.column)

    def advance(self):
        self.pos += 1
        self.column += 1

    def read_string(self) -> Token:
        start_line, start_col = self.line, self.column
        self.advance()  # Skip opening quote
        start = self.pos

        while self.pos < len(self.source) and self.source[self.pos] != '"':
            if self.source[self.pos] == '\\':
                self.advance()  # Skip escape
            self.advance()

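        if self.pos >= len(self.source):
            raise SyntaxError(f"Unterminated string starting at line {start_line}")

        value = self.source[start:self.pos]
        self.advance()  # Skip closing quote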
        return Token(TokenType.STRING, value, start_line, start_col)

    def read_number(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos

        if self.source[self.pos] == '-':
            self.advance()

        while self.pos < len(self.source) and self.source[self.pos].isdigit():
            self.advance()

        if self.pos < len(self.source) and self.source[self.pos] == '.':
            self.advance()
            while self.pos < len(self.source) and self.source[self.pos].isdigit():
                self.advance()
            return Token(TokenType.REAL, self.source[start:self.pos],
                        start_line, start_col)

        return Token(TokenType.INTEGER, self.source[start:self.pos],
                    start_line, start_col)

    def read_name_or_keyword(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos

        # Handle /Name syntax
        if self.source[self.pos] == '/':
            self.advance()

        while (self.pos < len(self.source) and
               (self.source[self.pos].isalnum() or self.source[self.pos] == '_')):
            self.advance()

        value = self.source[start:self.pos]

        # Check for keywords
        keywords = {
            'version': TokenType.VERSION,
            'object': TokenType.OBJECT,
            'end': TokenType.END,
            'stream': TokenType.STREAM,
            'endstream': TokenType.ENDSTREAM,
            'true': TokenType.TRUE,
            'false': TokenType.FALSE,
            'null': TokenType.NULL,
            'R': TokenType.REF,
        }

        token_type = keywords.get(value, TokenType.NAME)
        return Token(token_type, value, start_line, start_col)
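
The lexer emits integers and a separate REF token, so a later stage has to decide that `3 0 R` is one reference while `0 0 612 792` is four numbers. The Parser module is not shown in this guide; the following is only a sketch of that one decision, working on plain Python values instead of Token objects. `fold_references` and the `("ref", num, gen)` shape are hypothetical.

```python
def fold_references(tokens):
    """Collapse each `int int R` run into one ("ref", num, gen) item.

    `tokens` is a flat list of already-lexed values: ints, floats,
    strings, and the bare marker "R". Looking back from the "R" avoids
    the lookahead that would be needed when reading left to right.
    """
    out = []
    for tok in tokens:
        if (tok == "R" and len(out) >= 2
                and type(out[-1]) is int and type(out[-2]) is int):
            gen = out.pop()
            num = out.pop()
            out.append(("ref", num, gen))
        else:
            out.append(tok)
    return out


kids = fold_references([3, 0, "R"])
mediabox = fold_references([0, 0, 612, 792])
mixed = fold_references([1, 3, 0, "R", 4, 0, "R"])
```

Using `type(x) is int` instead of `isinstance` keeps booleans from being mistaken for object numbers, since `bool` is a subclass of `int` in Python.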

Phase 2: Assembly to PDF (Days 5-8)

# pdfasm/assembler.py
import zlib
from typing import Dict, List, Tuple

class PDFAssembler:
    def __init__(self, document: PDFDocument, compress: bool = False):
        self.doc = document
        self.compress = compress
        self.output = bytearray()
        self.offsets: Dict[int, int] = {}  # obj_num -> byte offset

    def assemble(self) -> bytes:
        # Write header
        self.write_line(f"%PDF-{self.doc.version}")
        self.write_line("%\xe2\xe3\xcf\xd3")  # Binary marker

        # Write objects in order
        for obj_num in sorted(self.doc.objects.keys()):
            self.write_object(obj_num)

        # Write xref table
        xref_offset = len(self.output)
        self.write_xref()

        # Write trailer
        self.write_trailer(xref_offset)

        return bytes(self.output)

    def write_line(self, line: str):
        self.output.extend(line.encode('latin-1'))
        self.output.extend(b'\n')

    def write_object(self, obj_num: int):
        obj = self.doc.objects[obj_num]

        # Record offset
        self.offsets[obj_num] = len(self.output)

        # Write object header
        self.write_line(f"{obj_num} {obj.gen_num} obj")

        # Write value
        if isinstance(obj.value, PDFStream):
            self.write_stream(obj.value)
        else:
            self.write_value(obj.value)

        self.write_line("endobj")

    def write_value(self, value) -> None:
        # Top-level arrays and dictionaries have their own writers; every
        # other value shares the formatting logic in value_to_string()
        if isinstance(value, list):
            self.write_array(value)
        elif isinstance(value, dict):
            self.write_dict(value)
        else:
            self.write_line(self.value_to_string(value))

    def write_array(self, arr: List):
        parts = []
        for item in arr:
            parts.append(self.value_to_string(item))
        self.write_line(f"[{' '.join(parts)}]")

    def write_dict(self, d: Dict):
        self.write_line("<<")
        for key, value in d.items():
            if not key.startswith('/'):
                key = '/' + key
            self.write_line(f"  {key} {self.value_to_string(value)}")
        self.write_line(">>")

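    def write_stream(self, stream: PDFStream):
        data = stream.data

        # Work on a copy so assembling never mutates the document, and
        # normalize keys to the leading-'/' form so '/Length' and 'Length'
        # cannot both end up in the output
        dictionary = {k if k.startswith('/') else '/' + k: v
                      for k, v in stream.dictionary.items()}

        # Compress if requested
        if self.compress and not stream.is_compressed and '/Filter' not in dictionary:
            data = zlib.compress(data)
            dictionary['/Filter'] = '/FlateDecode'

        # /Length counts the bytes as written, after any compression
        dictionary['/Length'] = len(data)

        # Write dictionary
        self.write_dict(dictionary)

        # Write stream
        self.write_line("stream")
        self.output.extend(data)
        if not data.endswith(b'\n'):
            self.output.extend(b'\n')
        self.write_line("endstream")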

    def value_to_string(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
        elif isinstance(value, float):
            return f"{value:.6g}"
        elif isinstance(value, str):
            if value.startswith('/'):
                return value
            return f"({self.escape_string(value)})"
        elif isinstance(value, PDFReference):
            return f"{value.obj_num} {value.gen_num} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.value_to_string(v) for v in value) + ']'
        elif isinstance(value, dict):
            parts = []
            for k, v in value.items():
                if not k.startswith('/'):
                    k = '/' + k
                parts.append(f"{k} {self.value_to_string(v)}")
            return '<< ' + ' '.join(parts) + ' >>'
        return str(value)

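    def escape_string(self, s: str) -> str:
        # Backslash first, so the escapes added afterwards are not doubled
        return (s.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')
                 .replace('\n', '\\n').replace('\r', '\\r'))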

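    def write_xref(self):
        # Object numbers need not be contiguous, so the table covers
        # 0..max and marks every unused number as free
        size = max(self.offsets, default=0) + 1
        free = [n for n in range(size) if n not in self.offsets]  # includes 0
        # Free entries form a linked list; the last one points back to 0
        next_free = dict(zip(free, free[1:] + [0]))

        self.write_line("xref")
        self.write_line(f"0 {size}")
        for obj_num in range(size):
            if obj_num in self.offsets:
                gen = self.doc.objects[obj_num].gen_num
                self.write_line(f"{self.offsets[obj_num]:010d} {gen:05d} n ")
            else:
                self.write_line(f"{next_free[obj_num]:010d} 65535 f ")

    def write_trailer(self, xref_offset: int):
        # /Size is one more than the highest object number in use
        size = max(self.offsets, default=0) + 1
        self.write_line("trailer")
        self.write_line(f"<< /Size {size} /Root {self.doc.root_obj_num} 0 R >>")
        self.write_line("startxref")
        self.write_line(str(xref_offset))
        self.write_line("%%EOF")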
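
A quick way to gain confidence in the emitter is to read the xref table back and confirm that every in-use entry points at the matching `N G obj` header. The sketch below is a standalone check under simplifying assumptions (a single classic xref table, no xref streams, no incremental updates); `check_xref` and `build_sample` are hypothetical helpers, not part of the assembler above.

```python
import re


def check_xref(pdf: bytes):
    """Return a list of problems found in a classic (non-stream) xref table."""
    errors = []
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", pdf)
    if not m:
        return ["no startxref/%%EOF footer"]
    pos = int(m.group(1))
    if pdf[pos:pos + 4] != b"xref":
        return [f"startxref {pos} does not point at 'xref'"]
    lines = pdf[pos:].splitlines()[1:]  # skip the 'xref' line itself
    i = 0
    # Each subsection starts with "first count" followed by count entries.
    while i < len(lines) and re.fullmatch(rb"\d+ \d+", lines[i].strip()):
        first, count = map(int, lines[i].split())
        for n, entry in enumerate(lines[i + 1:i + 1 + count], start=first):
            offset, gen, kind = entry.split()
            header = b"%d %d obj" % (n, int(gen))
            if kind == b"n" and not pdf[int(offset):].startswith(header):
                errors.append(f"object {n}: offset {int(offset)} is wrong")
        i += 1 + count
    return errors


def build_sample() -> bytes:
    """Assemble a two-object file by hand, recording offsets as we go."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = {}
    for num, body in [(1, b"<< /Type /Catalog /Pages 2 0 R >>"),
                      (2, b"<< /Type /Pages /Kids [] /Count 0 >>")]:
        offsets[num] = len(out)
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 3\n0000000000 65535 f \n"
    for num in (1, 2):
        out += b"%010d 00000 n \n" % offsets[num]
    out += (b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % xref_pos)
    return bytes(out)


sample = build_sample()
problems = check_xref(sample)
```

Offsets are byte positions, which is why the sample is built as bytes and measured with `len(out)` before each object is appended.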

Phase 3: PDF to Assembly (Days 9-12)

# pdfasm/disassembler.py
import zlib
from typing import Dict, List, Optional, Tuple

class PDFDisassembler:
    def __init__(self, pdf_data: bytes, decompress: bool = True, annotate: bool = True):
        self.data = pdf_data
        self.decompress = decompress
        self.annotate = annotate
        self.pos = 0

    def disassemble(self) -> str:
        lines = []

        # Parse PDF
        version = self.parse_version()
        lines.append(f"; PDF Disassembly")
        lines.append(f"; Original size: {len(self.data)} bytes")
        lines.append("")
        lines.append(f"version {version}")
        lines.append("")

        # Find xref and parse objects
        xref_offset = self.find_startxref()
        xref = self.parse_xref(xref_offset)
        trailer = self.parse_trailer()

        # Parse and output each object
        for obj_num, (offset, gen) in sorted(xref.items()):
            if offset == 0:  # Free object
                continue

            obj = self.parse_object_at(offset)
            lines.extend(self.format_object(obj_num, gen, obj))
            lines.append("")

        return '\n'.join(lines)

    def parse_version(self) -> str:
        # Find %PDF-X.X
        idx = self.data.find(b'%PDF-')
        if idx == -1:
            raise ValueError("Not a PDF file")
        end = self.data.find(b'\n', idx)
        return self.data[idx+5:end].decode('latin-1').strip()

    def find_startxref(self) -> int:
        # Search from end of file
        idx = self.data.rfind(b'startxref')
        if idx == -1:
            raise ValueError("Cannot find startxref")

        # Read the offset: the first whitespace-delimited token after the
        # keyword, whichever end-of-line convention the file uses
        rest = self.data[idx + len(b'startxref'):]
        return int(rest.split()[0])

    def parse_xref(self, offset: int) -> Dict[int, Tuple[int, int]]:
        self.pos = offset
        xref = {}

        # Skip "xref" keyword
        self.skip_whitespace()
        self.expect(b'xref')
        self.skip_whitespace()

        # Read subsections
        while True:
            line = self.read_line()
            if line.startswith(b'trailer'):
                self.pos -= len(line) + 1
                break

            parts = line.split()
            if len(parts) != 2:
                break

            first_obj = int(parts[0])
            count = int(parts[1])

            for i in range(count):
                entry = self.read_line()
                offset_str, gen_str, status = entry.split()
                offset = int(offset_str)
                gen = int(gen_str)

                if status == b'n':  # In use
                    xref[first_obj + i] = (offset, gen)

        return xref

    def parse_trailer(self) -> Dict:
        # Find trailer
        idx = self.data.rfind(b'trailer')
        self.pos = idx + 7
        self.skip_whitespace()
        return self.parse_dict()

    def parse_object_at(self, offset: int) -> any:
        self.pos = offset

        # Read "N G obj"
        obj_num = self.read_number()
        gen = self.read_number()
        self.skip_whitespace()
        self.expect(b'obj')
        self.skip_whitespace()

        # Parse value
        value = self.parse_value()

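        # Check for stream
        self.skip_whitespace()
        if self.peek(6) == b'stream':
            self.pos += 6
            # The keyword is followed by exactly one EOL (CRLF or LF).
            # Skipping more would eat data bytes that happen to be CR or LF.
            if self.data[self.pos:self.pos + 2] == b'\r\n':
                self.pos += 2
            elif self.data[self.pos:self.pos + 1] == b'\n':
                self.pos += 1

            # Get length from dictionary
            length = value.get('/Length', 0)
            if isinstance(length, tuple):  # Reference
                # Simplified: instead of resolving the reference, scan for
                # the closing keyword and drop the one EOL that precedes it
                end = self.data.find(b'endstream', self.pos)
                if end == -1:
                    raise ValueError("stream without endstream")
                if self.data[end - 2:end] == b'\r\n':
                    end -= 2
                elif self.data[end - 1:end] in (b'\n', b'\r'):
                    end -= 1
                length = max(end - self.pos, 0)

            stream_data = self.data[self.pos:self.pos + length]
            self.pos += length

            # Decompress if needed. Only a bare FlateDecode is handled: with
            # /DecodeParms (predictors) inflating alone does not restore the data.
            if (self.decompress and value.get('/Filter') == '/FlateDecode'
                    and '/DecodeParms' not in value):
                try:
                    stream_data = zlib.decompress(stream_data)
                except zlib.error:
                    pass  # Keep compressed
                else:
                    # The data is now raw, so the emitted dictionary must
                    # not claim it is still Flate-encoded
                    value = {k: v for k, v in value.items() if k != '/Filter'}

            return {'type': 'stream', 'dict': value, 'data': stream_data}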

        return value

    def format_object(self, obj_num: int, gen: int, obj) -> List[str]:
        lines = []

        if self.annotate:
            lines.append(f"; Object {obj_num}")

        lines.append(f"object {obj_num}")

        if isinstance(obj, dict) and obj.get('type') == 'stream':
            # Stream object
            for key, value in obj['dict'].items():
                if key == '/Length':
                    continue  # Will be calculated
                lines.append(f"  {key[1:]} {self.format_value(value)}")

            lines.append("  stream")
            # Indent stream content
            stream_text = obj['data'].decode('latin-1', errors='replace')
            for line in stream_text.split('\n'):
                lines.append(f"    {line}")
            lines.append("  endstream")
        elif isinstance(obj, dict):
            # Dictionary object
            for key, value in obj.items():
                lines.append(f"  {key[1:]} {self.format_value(value)}")
        else:
            lines.append(f"  {self.format_value(obj)}")

        lines.append("end")
        return lines

    def format_value(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
        elif isinstance(value, float):
            return f"{value:.6g}"
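        elif isinstance(value, str):
            if value.startswith('/'):
                return value[1:]  # Remove leading /
            # Escape backslashes and quotes so the lexer can read the string back
            escaped = value.replace('\\', '\\\\').replace('"', '\\"')
            return f'"{escaped}"'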
        elif isinstance(value, tuple):  # Reference
            return f"{value[0]} {value[1]} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.format_value(v) for v in value) + ']'
        elif isinstance(value, dict):
            parts = []
            for k, v in value.items():
                parts.append(f"{k[1:]}: {self.format_value(v)}")
            return '{ ' + ', '.join(parts) + ' }'
        return str(value)

    # ... helper methods for parsing ...
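
The listing above calls several cursor helpers (skip_whitespace, expect, peek, read_line, read_number) that are left out. As a rough idea of what they might look like, here is a standalone sketch; the `ByteCursor` class is hypothetical and is not the guide's implementation, only one plausible shape for those helpers.

```python
class ByteCursor:
    """Hypothetical cursor over PDF bytes; method names mirror the calls above."""

    WHITESPACE = b"\x00\t\n\x0c\r "  # the six PDF whitespace characters

    def __init__(self, data: bytes, pos: int = 0):
        self.data = data
        self.pos = pos

    def peek(self, n: int = 1) -> bytes:
        """Return the next n bytes without consuming them."""
        return self.data[self.pos:self.pos + n]

    def skip_whitespace(self) -> None:
        """Skip PDF whitespace and %-comments."""
        while self.pos < len(self.data):
            ch = self.data[self.pos]
            if ch in self.WHITESPACE:
                self.pos += 1
            elif ch == 0x25:  # '%' starts a comment that runs to end of line
                while (self.pos < len(self.data)
                       and self.data[self.pos] not in b"\r\n"):
                    self.pos += 1
            else:
                break

    def expect(self, keyword: bytes) -> None:
        """Consume keyword or raise if something else is there."""
        if self.peek(len(keyword)) != keyword:
            raise ValueError(f"expected {keyword!r} at offset {self.pos}")
        self.pos += len(keyword)

    def read_line(self) -> bytes:
        """Return the next line without its EOL (LF, CR, or CRLF)."""
        start = self.pos
        while self.pos < len(self.data) and self.data[self.pos] not in b"\r\n":
            self.pos += 1
        line = self.data[start:self.pos]
        if self.peek(2) == b"\r\n":
            self.pos += 2
        elif self.pos < len(self.data):
            self.pos += 1
        return line

    def read_number(self) -> int:
        """Read an unsigned integer, skipping leading whitespace."""
        self.skip_whitespace()
        start = self.pos
        while self.peek(1).isdigit():
            self.pos += 1
        if start == self.pos:
            raise ValueError(f"expected a number at offset {start}")
        return int(self.data[start:self.pos])


cursor = ByteCursor(b"  % note\n12 0 obj\r\n<< >>")
obj_num = cursor.read_number()
gen_num = cursor.read_number()
```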

Phase 4: CLI Tool (Days 13-14)

#!/usr/bin/env python3
# pdfasm/cli.py
import argparse
import sys
from pathlib import Path

from .lexer import Lexer
from .parser import Parser
from .assembler import PDFAssembler
from .disassembler import PDFDisassembler
from .validator import PDFValidator

def cmd_assemble(args):
    """Assemble .pdfasm to .pdf"""
    source = Path(args.input).read_text()

    # Parse assembly
    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    document = parser.parse()

    # Assemble to PDF
    assembler = PDFAssembler(document, compress=args.compress)
    pdf_data = assembler.assemble()

    # Write output
    Path(args.output).write_bytes(pdf_data)
    print(f"Assembled {args.input} to {args.output} ({len(pdf_data)} bytes)")

def cmd_disassemble(args):
    """Disassemble .pdf to .pdfasm"""
    pdf_data = Path(args.input).read_bytes()

    # Disassemble
    disasm = PDFDisassembler(
        pdf_data,
        decompress=args.decompress,
        annotate=args.annotate
    )
    assembly = disasm.disassemble()

    # Write output
    Path(args.output).write_text(assembly)
    print(f"Disassembled {args.input} to {args.output}")

def cmd_validate(args):
    """Validate PDF or assembly"""
    path = Path(args.input)

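    # --syntax forces assembly validation; otherwise decide by file extension
    if getattr(args, 'syntax', False) or path.suffix == '.pdfasm':
        # Validate assembly syntax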
        source = path.read_text()
        lexer = Lexer(source)
        parser = Parser(lexer.tokenize())
        try:
            document = parser.parse()
            print(f"Syntax OK: {len(document.objects)} objects")
        except SyntaxError as e:
            print(f"Syntax error: {e}")
            sys.exit(1)
    else:
        # Validate PDF structure
        pdf_data = path.read_bytes()
        validator = PDFValidator(pdf_data)
        errors, warnings = validator.validate()

        for err in errors:
            print(f"ERROR: {err}")
        for warn in warnings:
            print(f"WARNING: {warn}")

        if errors:
            sys.exit(1)
        print("PDF structure is valid")

def cmd_explore(args):
    """Interactive PDF exploration"""
    pdf_data = Path(args.input).read_bytes()
    disasm = PDFDisassembler(pdf_data)
    # The disassembler does not store the xref table, so build it once here
    xref = disasm.parse_xref(disasm.find_startxref())

    print(f"PDF Explorer - {args.input}")
    print("Commands: list, show <n>, page <n>, stream <n>, tree, quit")
    print()

    while True:
        try:
            cmd = input("> ").strip()
        except EOFError:
            break

        if cmd == 'quit':
            break
        elif cmd == 'list':
            # List all objects
            for obj_num in sorted(xref.keys()):
                print(f"  {obj_num}: ...")  # TODO: Show type
        elif cmd.startswith('show '):
            obj_num = int(cmd.split()[1])
            obj = disasm.parse_object_at(xref[obj_num][0])
            print(disasm.format_value(obj))
        # ... more commands

def main():
    parser = argparse.ArgumentParser(description='PDF Assembly Language Tool')
    subparsers = parser.add_subparsers(dest='command', required=True)

    # assemble command
    p_asm = subparsers.add_parser('assemble', help='Assemble .pdfasm to .pdf')
    p_asm.add_argument('input', help='Input .pdfasm file')
    p_asm.add_argument('-o', '--output', required=True, help='Output .pdf file')
    p_asm.add_argument('--compress', action='store_true', help='Compress streams')
    p_asm.add_argument('-v', '--verbose', action='store_true', help='Verbose output')
    p_asm.set_defaults(func=cmd_assemble)

    # disassemble command
    # store_true flags default to False; with default=True they could
    # never be switched off from the command line
    p_disasm = subparsers.add_parser('disassemble', help='Disassemble .pdf to .pdfasm')
    p_disasm.add_argument('input', help='Input .pdf file')
    p_disasm.add_argument('-o', '--output', required=True, help='Output .pdfasm file')
    p_disasm.add_argument('--decompress', action='store_true',
                         help='Decompress streams')
    p_disasm.add_argument('--annotate', action='store_true',
                         help='Add comments')
    p_disasm.set_defaults(func=cmd_disassemble)

    # validate command
    p_val = subparsers.add_parser('validate', help='Validate PDF or assembly')
    p_val.add_argument('input', help='Input file')
    p_val.add_argument('--syntax', action='store_true',
                       help='Treat the input as assembly and check its syntax')
    p_val.set_defaults(func=cmd_validate)

    # explore command
    p_exp = subparsers.add_parser('explore', help='Interactive PDF exploration')
    p_exp.add_argument('input', help='Input .pdf file')
    p_exp.set_defaults(func=cmd_explore)

    args = parser.parse_args()
    args.func(args)

if __name__ == '__main__':
    main()

Testing Strategy

Round-Trip Testing

The most important test: disassemble → reassemble → compare

# Test round-trip
./pdfasm disassemble original.pdf -o original.pdfasm
./pdfasm assemble original.pdfasm -o reassembled.pdf

# Compare
diff <(pdftotext original.pdf -) <(pdftotext reassembled.pdf -)

# Or use qpdf to normalize and compare
qpdf --qdf original.pdf - > orig.qdf
qpdf --qdf reassembled.pdf - > reasm.qdf
diff orig.qdf reasm.qdf

Unit Tests

def test_simple_document():
    source = """
    version 1.4

    object 1
      type Catalog
      pages 2 0 R
    end

    object 2
      type Pages
      kids [3 0 R]
      count 1
    end

    object 3
      type Page
      parent 2 0 R
      mediabox [0 0 612 792]
    end
    """

    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    doc = parser.parse()

    assert doc.version == "1.4"
    assert len(doc.objects) == 3
    assert doc.objects[1].value['/Type'] == '/Catalog'

def test_assemble_disassemble():
    # Create minimal document
    doc = PDFDocument(
        version="1.4",
        objects={...},
        root_obj_num=1
    )

    # Assemble
    assembler = PDFAssembler(doc)
    pdf_bytes = assembler.assemble()

    # Disassemble
    disasm = PDFDisassembler(pdf_bytes)
    assembly = disasm.disassemble()

    # Parse again
    lexer = Lexer(assembly)
    parser = Parser(lexer.tokenize())
    doc2 = parser.parse()

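    # Should be equivalent (minimal check: version and object count only;
    # compare the object values as well for a stricter test)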
    assert doc.version == doc2.version
    assert len(doc.objects) == len(doc2.objects)

Common Pitfalls

1. String Escaping

PDF strings can contain special characters:

def escape_pdf_string(s: str) -> str:
    return (s.replace('\\', '\\\\')
             .replace('(', '\\(')
             .replace(')', '\\)')
             .replace('\n', '\\n')
             .replace('\r', '\\r'))

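def unescape_pdf_string(s: str) -> str:
    # Decode the escape sequences of a PDF literal string
    simple = {'n': '\n', 'r': '\r', 't': '\t', 'b': '\b', 'f': '\f',
              '\\': '\\', '(': '(', ')': ')'}
    result = []
    i = 0
    while i < len(s):
        if s[i] != '\\' or i + 1 >= len(s):
            result.append(s[i])
            i += 1
            continue
        next_char = s[i + 1]
        if next_char in simple:
            result.append(simple[next_char])
            i += 2
        elif next_char in '01234567':
            # Octal escape: one to three octal digits (\ddd)
            j = i + 1
            while j < len(s) and j < i + 4 and s[j] in '01234567':
                j += 1
            result.append(chr(int(s[i + 1:j], 8) & 0xFF))
            i = j
        elif next_char in '\r\n':
            # Backslash + end-of-line is a line continuation: emit nothing
            i += 3 if s[i + 1:i + 3] == '\r\n' else 2
        else:
            # Unknown escape: drop the backslash but keep the character
            result.append(next_char)
            i += 2
    return ''.join(result)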
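
Escaping bugs tend to show up only on awkward inputs, so it helps to check the pair as a round trip. The following is a minimal sketch, not part of the project code: `find_round_trip_failures` and `TRICKY_STRINGS` are hypothetical names, and the sample list is only a starting point.

```python
from typing import Callable, Iterable, List

# Hypothetical sample inputs, each chosen to exercise one escaping rule.
TRICKY_STRINGS = [
    "plain text",
    "balanced (parens)",
    "unbalanced ) paren (",
    "back\\slash",
    "trailing backslash\\",
    "line one\nline two\r\n",
    "literal backslash-n: \\n",
    "",
]


def find_round_trip_failures(
    escape: Callable[[str], str],
    unescape: Callable[[str], str],
    samples: Iterable[str] = TRICKY_STRINGS,
) -> List[str]:
    """Return every sample for which unescape(escape(s)) != s."""
    return [s for s in samples if unescape(escape(s)) != s]
```

Calling `find_round_trip_failures(escape_pdf_string, unescape_pdf_string)` should return an empty list; any string it returns is an input the pair does not preserve.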

2. Object Number Assignment

When assembling, you need to assign object numbers correctly:

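def assign_object_numbers(objects: List[PDFObject]) -> Dict[str, int]:
    """Assign sequential object numbers, handling labeled references."""
    number_map = {}

    # Object 0 is reserved for the head of the xref free list, so start at 1
    for current_num, obj in enumerate(objects, start=1):
        if obj.label:
            # A repeated label would silently redirect earlier references
            if obj.label in number_map:
                raise ValueError(f"duplicate label: {obj.label!r}")
            number_map[obj.label] = current_num
        obj.obj_num = current_num

    return number_map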
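
The returned map only pays off once labeled references are rewritten into numeric ones in a second pass. Here is a minimal sketch of that pass. It assumes a hypothetical `@label` reference syntax and an object model made of plain dicts, lists, and strings; neither is prescribed by this project, so adapt it to your own AST.

```python
from typing import Any, Dict


def resolve_labels(value: Any, number_map: Dict[str, int]) -> Any:
    """Recursively replace '@label' strings with 'N 0 R' indirect references."""
    if isinstance(value, str) and value.startswith("@"):
        label = value[1:]
        if label not in number_map:
            raise KeyError(f"undefined label: {label!r}")
        # Freshly assembled objects all use generation 0
        return f"{number_map[label]} 0 R"
    if isinstance(value, dict):
        return {k: resolve_labels(v, number_map) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_labels(v, number_map) for v in value]
    return value  # numbers, names, booleans, None: unchanged


# Example data (hypothetical): labels as assign_object_numbers might return them
number_map = {"catalog": 1, "pages": 2, "page1": 3}
catalog = {"/Type": "/Catalog", "/Pages": "@pages"}
pages = {"/Type": "/Pages", "/Kids": ["@page1"], "/Count": 1}

print(resolve_labels(catalog, number_map))
print(resolve_labels(pages, number_map))
```

Running the two passes separately lets an object refer to a label that is defined later in the file.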

3. Stream Length

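The /Length value must equal the exact number of bytes written between the `stream` line and `endstream`, measured after compression:

def write_stream(self, stream: PDFStream):
    data = stream.data

    # Calculate length AFTER any processing
    if self.compress and 'Filter' not in stream.dictionary:
        data = zlib.compress(data)
        # Readers need /Filter to know the data is deflate-compressed
        # (use whatever name representation your object model expects)
        stream.dictionary['Filter'] = '/FlateDecode'

    stream.dictionary['Length'] = len(data)  # Exact byte count of `data`

    self.write_dict(stream.dictionary)
    self.write_line("stream")
    self.output.extend(data)
    # One EOL before endstream is expected and is NOT counted in /Length;
    # any other extra bytes written here would make /Length wrong
    self.output.extend(b"\n")
    self.write_line("endstream")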
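
A quick way to catch length bugs is to re-read what was written and confirm that `endstream` sits exactly where /Length says it should. The sketch below is standalone and illustrative: `serialize_stream` and `length_is_consistent` are hypothetical helpers, not part of the assembler class. It writes one end-of-line before `endstream`, which the PDF specification recommends and excludes from /Length.

```python
import re
import zlib


def serialize_stream(data: bytes, compress: bool = False) -> bytes:
    """Serialize a stream dictionary plus body; /Length is computed last."""
    entries = []
    if compress:
        data = zlib.compress(data)
        entries.append(b"/Filter /FlateDecode")
    # Length of the bytes actually written, i.e. after compression
    entries.append(b"/Length " + str(len(data)).encode("ascii"))
    header = b"<< " + b" ".join(entries) + b" >>\nstream\n"
    # The EOL before endstream is a separator and is not counted in /Length
    return header + data + b"\nendstream\n"


def length_is_consistent(serialized: bytes) -> bool:
    """Check that 'endstream' follows exactly /Length bytes after 'stream\\n'."""
    match = re.search(rb"/Length (\d+)", serialized)
    start = serialized.find(b"stream\n")
    if match is None or start == -1:
        return False
    start += len(b"stream\n")
    end = start + int(match.group(1))
    return serialized[end:].startswith(b"\nendstream")


raw = b"BT /F1 12 Tf (Hello) Tj ET"
print(length_is_consistent(serialize_stream(raw)))
print(length_is_consistent(serialize_stream(raw, compress=True)))
```

This check only handles a direct integer /Length; a disassembler also has to cope with /Length given as an indirect reference.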

Extensions

Level 1: Syntax Highlighting

Create syntax highlighting for popular editors (VS Code, vim, Sublime).

Level 2: Semantic Validation

Validate PDF semantics:

  • Catalog must have /Pages
  • Page objects must form a tree rooted at /Pages (each node reached once, /Parent links consistent)
  • References must point to valid objects
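
The three rules above can be sketched as a single validation pass. This is a rough illustration under stated assumptions rather than a complete validator: objects are modeled as a plain dict keyed by object number, references are `"N G R"` strings, and /Parent links are compared using generation 0. `iter_refs` and `validate` are hypothetical names.

```python
import re
from typing import Any, Dict, Iterator, List

REF_PATTERN = re.compile(r"^(\d+) (\d+) R$")


def iter_refs(value: Any) -> Iterator[int]:
    """Yield the object number of every 'N G R' reference nested in value."""
    if isinstance(value, str):
        match = REF_PATTERN.match(value)
        if match:
            yield int(match.group(1))
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_refs(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def validate(objects: Dict[int, dict], root: int) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    errors: List[str] = []

    # Rule: references must point to valid objects
    for num, obj in sorted(objects.items()):
        for target in iter_refs(obj):
            if target not in objects:
                errors.append(f"object {num}: dangling reference to object {target}")

    # Rule: catalog must have /Pages
    catalog = objects.get(root, {})
    if catalog.get("/Type") != "/Catalog":
        errors.append(f"object {root}: root is not a /Catalog")
    if "/Pages" not in catalog:
        errors.append(f"object {root}: catalog has no /Pages")
        return errors

    # Rule: pages must form a tree (no node reached twice, /Parent consistent)
    seen = set()
    stack = [(target, None) for target in iter_refs(catalog["/Pages"])]
    while stack:
        num, parent = stack.pop()
        node = objects.get(num)
        if node is None:
            continue  # already reported as a dangling reference
        if num in seen:
            errors.append(f"object {num}: appears more than once in the page tree")
            continue
        seen.add(num)
        if parent is not None and node.get("/Parent") != f"{parent} 0 R":
            errors.append(f"object {num}: /Parent does not point to object {parent}")
        for kid in iter_refs(node.get("/Kids", [])):
            stack.append((kid, num))

    return errors
```

Returning a list of messages instead of raising on the first problem lets the `validate` command report everything wrong with a file in one run.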

Level 3: Diff Tool

Create a diff tool that shows semantic differences between PDFs:

./pdfasm diff file1.pdf file2.pdf
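
One possible core for such a command is a recursive comparison of two parsed object tables. The sketch below is only a starting point and makes a strong simplifying assumption: both files use the same object numbers, so objects can be paired by number. A real tool would more likely pair objects by walking both documents from the catalog. `diff_value` and `diff_objects` are hypothetical names, and objects are modeled as plain dicts.

```python
from typing import Any, Dict, List


def diff_value(path: str, old: Any, new: Any) -> List[str]:
    """Describe how `new` differs from `old`; dictionaries are compared per key."""
    if isinstance(old, dict) and isinstance(new, dict):
        lines: List[str] = []
        for key in sorted(old.keys() | new.keys()):
            if key not in new:
                lines.append(f"- {path} {key}: {old[key]!r}")
            elif key not in old:
                lines.append(f"+ {path} {key}: {new[key]!r}")
            else:
                lines.extend(diff_value(f"{path} {key}", old[key], new[key]))
        return lines
    if old != new:
        return [f"~ {path}: {old!r} -> {new!r}"]
    return []


def diff_objects(a: Dict[int, Any], b: Dict[int, Any]) -> List[str]:
    """Compare two object tables keyed by object number."""
    lines: List[str] = []
    for num in sorted(a.keys() | b.keys()):
        if num not in b:
            lines.append(f"- object {num}")
        elif num not in a:
            lines.append(f"+ object {num}")
        else:
            lines.extend(diff_value(f"object {num}", a[num], b[num]))
    return lines


before = {1: {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]}}
after = {
    1: {"/Type": "/Page", "/MediaBox": [0, 0, 595, 842], "/Rotate": 90},
    2: {"/Type": "/Font"},
}
for line in diff_objects(before, after):
    print(line)
```

Because it compares parsed values rather than bytes, this kind of diff ignores whitespace, object ordering in the file, and xref layout, which is what makes it semantic.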

Level 4: Web Interface

Build a web-based PDF explorer with visualization.


Self-Assessment

Before considering this project complete:

  • Can parse and generate all 8 PDF object types (boolean, number, string, name, array, dictionary, stream, null)
  • Assembly → PDF produces valid, viewable PDFs
  • PDF → Assembly produces readable output
  • Round-trip preserves document structure
  • Handles compressed streams correctly
  • CLI tool is usable and well-documented
  • Can edit a PDF by modifying assembly and reassembling

Resources

Essential Reading

  • Domain-Specific Languages by Martin Fowler - DSL design
  • PDF Reference, sixth edition (version 1.7) by Adobe - PDF specification
  • Developing with PDF by Leonard Rosenthol - PDF internals

Tools

  • qpdf: PDF normalization and validation
  • pypdf/pikepdf: Python PDF libraries
  • pdftotext: Text extraction for comparison