Project 5: PDF Assembly Language

Project Overview

Attribute             Details
Difficulty            Level 2: Intermediate
Time Estimate         2-3 weeks
Programming Language  Python (primary), Rust/C/Go (alternatives)
Knowledge Area        Document Processing / DSL Design
Prerequisites         Project 2 (PDF parser understanding)

What you’ll build: A human-readable “assembly language” for PDF that compiles to valid PDF files, with a disassembler that converts PDF back to this format.

Why it teaches PostScript→PDF: By creating a readable intermediate format, you’ll deeply understand PDF’s structure. The assembler/disassembler round-trip proves your understanding is complete.


Learning Objectives

By completing this project, you will:

  1. Design a domain-specific language (DSL) that maps 1:1 to PDF structures
  2. Implement an assembler that converts text to binary PDF
  3. Implement a disassembler that converts PDF to readable text
  4. Handle all PDF object types and compression
  5. Achieve round-trip fidelity (PDF → Assembly → PDF produces identical output)

The Core Question You’re Answering

“How can we make PDF’s complex binary structure human-readable and editable, while maintaining perfect round-trip fidelity?”

This is the same challenge that:

  • Assembly language solves for machine code
  • JSON/YAML solve for serialized data
  • SQL DDL solves for database schema

By building this system, you’ll understand:

  • What makes a good intermediate representation
  • How to balance human readability with technical completeness
  • The fundamental structures that make PDF work

Deep Theoretical Foundation

1. PDF as a Database

PDF files are essentially databases with a clever indexing scheme:

┌─────────────────────────────────────────────────────────────────┐
│                    PDF AS A DATABASE                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DATABASE CONCEPT        PDF EQUIVALENT                         │
│  ────────────────        ──────────────                         │
│  Table                   Object type (/Type /Page, etc.)        │
│  Row                     Indirect object (1 0 obj ... endobj)   │
│  Primary key             Object number (1, 2, 3, ...)           │
│  Foreign key             Reference (2 0 R)                      │
│  Index                   Cross-reference table (xref)           │
│  Binary blob             Stream object                          │
│                                                                 │
│  QUERIES:                                                       │
│  "Get page 5"     → Traverse Catalog→Pages→Kids[4]→Contents     │
│  "Get object 10"  → Look up in xref, seek to byte offset        │
│  "Get all fonts"  → Traverse page resources                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
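
The queries in the table can be sketched in a few lines of Python. This is a minimal illustration rather than a real parser: it assumes the objects are already loaded into a plain dict keyed by object number, that references are written as ("ref", n) tuples (a convention invented for this sketch), and that the page tree is flat, so every page is a direct kid of the root Pages node.

```python
# Toy in-memory "PDF database": object number -> parsed dictionary.
# The ("ref", n) tuple convention is an assumption made for this sketch.
objects = {
    1: {"Type": "Catalog", "Pages": ("ref", 2)},
    2: {"Type": "Pages", "Kids": [("ref", 3), ("ref", 5)], "Count": 2},
    3: {"Type": "Page", "Parent": ("ref", 2), "Contents": ("ref", 4)},
    4: {"stream": b"BT (Page one) Tj ET"},
    5: {"Type": "Page", "Parent": ("ref", 2), "Contents": ("ref", 6)},
    6: {"stream": b"BT (Page two) Tj ET"},
}


def resolve(value):
    """Follow a foreign key (reference) to the row (object) it names."""
    if isinstance(value, tuple) and value[0] == "ref":
        return objects[value[1]]
    return value


def get_page(catalog_num, page_index):
    """'Get page N': Catalog -> Pages -> Kids[N] (flat page tree only)."""
    catalog = objects[catalog_num]
    pages = resolve(catalog["Pages"])
    return resolve(pages["Kids"][page_index])


def get_page_content(catalog_num, page_index):
    """Follow the page's Contents reference and return the stream bytes."""
    page = get_page(catalog_num, page_index)
    return resolve(page["Contents"])["stream"]
```

Real documents often nest Pages nodes, so a complete lookup would walk the tree using each node's Count rather than indexing Kids directly.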

2. Why Assembly Language?

Assembly languages provide:

  1. 1:1 Mapping: Every assembly instruction maps to exactly one machine instruction (or PDF object)
  2. Human Readability: Mnemonics instead of binary opcodes
  3. Symbolic References: Labels instead of memory addresses (or object numbers)
  4. Comments: Documentation within the source

MACHINE CODE:          ASSEMBLY:              PDF ASSEMBLY:
─────────────          ─────────              ─────────────
B8 0A 00 00 00         mov eax, 10            object 1
                                                type Catalog
                                                pages @pages_root
89 C3                  mov ebx, eax           end

48 01 D8               add rax, rbx           object 2 as pages_root
                                                type Pages
                                                kids [@page1]
                                              end
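
The symbolic references in the PDF ASSEMBLY column need a resolution step before a PDF can be written. One possible approach is a two-pass scan: collect every "object N as label" header, then rewrite each @label as "N 0 R". The function name resolve_labels is hypothetical, and the sketch assumes labels are only declared with the "as" form shown in the comparison and that "@" never appears inside a stream body or string.

```python
import re


def resolve_labels(source: str) -> str:
    """Replace @label references with 'N 0 R' using 'object N as label' headers."""
    # Pass 1: collect label -> object number from headers such as
    # "object 2 as pages_root".
    labels = {}
    header = re.compile(r"^\s*object\s+(\d+)\s+as\s+(\w+)\s*$", re.MULTILINE)
    for match in header.finditer(source):
        number, label = int(match.group(1)), match.group(2)
        if label in labels:
            raise ValueError(f"duplicate label: {label}")
        labels[label] = number

    # Pass 2: substitute every @label with an indirect reference.
    def substitute(match):
        label = match.group(1)
        if label not in labels:
            raise ValueError(f"undefined label: {label}")
        return f"{labels[label]} 0 R"

    return re.sub(r"@(\w+)", substitute, source)
```

Two passes are needed because a label may be used before the object that declares it appears, exactly as with forward jumps in a conventional assembler.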

3. Design Choices

Object Representation:

OPTION 1: Close to PDF Syntax
─────────────────────────────
1 0 obj
  << /Type /Catalog
     /Pages 2 0 R >>
endobj

OPTION 2: Simplified Syntax
─────────────────────────────
object 1 0
  type Catalog
  pages 2 0 R
end

OPTION 3: Labeled Objects
─────────────────────────────
catalog:
  type Catalog
  pages @pages_root

pages_root:
  type Pages
  kids [@page1]

Stream Handling:

OPTION A: Inline (base64 for binary)
─────────────────────────────────────
object 4
  stream
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end

OPTION B: External File Reference
─────────────────────────────────────
object 4
  stream from "content.txt"
  filter FlateDecode
end

OPTION C: Automatic Compression
─────────────────────────────────────
object 4
  stream compress=true
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end
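
Options A and C can be combined in one small helper. The sketch below is only one way to do it: encode_stream produces the dictionary entries for a body with or without Flate compression, and to_inline / from_inline fall back to base64 when the bytes are not plain text. The function names and the "base64:" prefix are assumptions made for this illustration, not part of the format described above.

```python
import base64
import zlib


def encode_stream(body: bytes, compress: bool = False):
    """Return (dictionary entries, bytes to write) for a stream body."""
    entries = {}
    data = body
    if compress:
        data = zlib.compress(body)
        entries["Filter"] = "/FlateDecode"
    # Length always describes the bytes actually written to the file.
    entries["Length"] = len(data)
    return entries, data


def is_texty(data: bytes) -> bool:
    """True if every byte is printable ASCII, tab, LF, or CR."""
    return all(32 <= b < 127 or b in (9, 10, 13) for b in data)


def to_inline(data: bytes) -> str:
    """Option A: keep text readable, base64-encode anything binary."""
    if is_texty(data):
        return data.decode("ascii")
    return "base64:" + base64.b64encode(data).decode("ascii")


def from_inline(text: str) -> bytes:
    """Inverse of to_inline (hypothetical 'base64:' prefix convention)."""
    if text.startswith("base64:"):
        return base64.b64decode(text[len("base64:"):])
    return text.encode("ascii")
```

A text body that itself begins with "base64:" would be misread by from_inline, so a real format would need an unambiguous marker, for example a keyword on the stream line.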

4. The 8 PDF Object Types

Your assembly language must represent all of these. (ISO 32000 counts Integer and Real together as numeric objects and treats null as the eighth basic type; the table lists the two numeric kinds separately for clarity.)

Type        PDF Syntax                 Assembly Syntax (Example)
Boolean     true, false                true, false
Integer     42                         42
Real        3.14159                    3.14159
String      (Hello) or <48656C6C6F>    "Hello" or hex:48656C6C6F
Name        /Type                      Type or /Type
Array       [1 2 3]                    [1 2 3]
Dictionary  << /Key Value >>           { Key Value } or indented block
Stream      stream...endstream         stream...endstream
Null        null                       null
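
The two string spellings in the table are interchangeable, which a short conversion sketch makes concrete. The helper names are hypothetical; the padding rule for an odd number of hex digits follows the PDF specification, which treats a missing final digit as 0.

```python
def to_hex_string(data: bytes) -> str:
    """Encode bytes as a PDF hexadecimal string, e.g. <48656C6C6F>."""
    return "<" + data.hex().upper() + ">"


def from_hex_string(text: str) -> bytes:
    """Decode a PDF hexadecimal string, ignoring whitespace between digits."""
    text = text.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("hex string must be wrapped in < >")
    digits = "".join(text[1:-1].split())
    if len(digits) % 2:
        digits += "0"  # an odd final digit is read as if followed by 0
    return bytes.fromhex(digits)
```

Hex strings are the safer choice in the disassembler output whenever a string contains bytes that are not printable ASCII.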

Project Specification

Core Features

Assembler (Text → PDF)

# Basic usage
./pdfasm assemble input.pdfasm -o output.pdf

# With compression
./pdfasm assemble input.pdfasm -o output.pdf --compress

# Verbose output
./pdfasm assemble input.pdfasm -o output.pdf -v

Input format example:

; PDF Assembly Language - Hello World
version 1.7

object 1
  type Catalog
  pages 2 0 R
end

object 2
  type Pages
  kids [3 0 R]
  count 1
end

object 3
  type Page
  parent 2 0 R
  mediabox [0 0 612 792]
  contents 4 0 R
  resources {
    Font {
      F1 5 0 R
    }
  }
end

object 4
  stream
    BT
    /F1 24 Tf
    100 700 Td
    (Hello, PDF World!) Tj
    ET
  endstream
end

object 5
  type Font
  subtype Type1
  basefont Helvetica
end

Output: Valid PDF file that opens in any reader.

Disassembler (PDF → Text)

# Basic usage
./pdfasm disassemble input.pdf -o output.pdfasm

# Include comments
./pdfasm disassemble input.pdf -o output.pdfasm --annotate

# Decompress streams
./pdfasm disassemble input.pdf -o output.pdfasm --decompress

Output format: Human-readable assembly that can be reassembled.

Validator

# Validate PDF structure
./pdfasm validate input.pdf

# Validate assembly syntax
./pdfasm validate input.pdfasm --syntax

Interactive Explorer

./pdfasm explore input.pdf

> list               # List all objects
> show 3             # Show object 3
> page 1             # Show page 1 info
> stream 4           # Show content stream
> tree               # Show document tree
> quit

Solution Architecture

High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                    PDF ASSEMBLY TOOLKIT                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    ASSEMBLER                             │   │
│  │                                                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │   │
│  │  │   Lexer     │─▶│   Parser    │─▶│   Emitter   │       │   │
│  │  │             │  │             │  │             │       │   │
│  │  │ Tokenize    │  │ Build AST   │  │ Write PDF   │       │   │
│  │  │ .pdfasm     │  │ of objects  │  │ with xref   │       │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                            ↑ ↓                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   DISASSEMBLER                           │   │
│  │                                                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │   │
│  │  │ PDF Parser  │─▶│ Decompressor│─▶│ Formatter   │       │   │
│  │  │             │  │             │  │             │       │   │
│  │  │ Parse xref  │  │ Inflate     │  │ Pretty-print│       │   │
│  │  │ Read objects│  │ streams     │  │ to .pdfasm  │       │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    VALIDATOR                             │   │
│  │                                                          │   │
│  │  • Check required keys (Catalog has /Pages)              │   │
│  │  • Validate references (all refs point to valid objects) │   │
│  │  • Check stream lengths                                  │   │
│  │  • Verify xref offsets                                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
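
Of the validator checks listed in the diagram, reference validation is the easiest to sketch in isolation. The example below assumes objects are held in a dict keyed by object number and that references are instances of a small Ref class; both the class and the function names are hypothetical stand-ins for whatever your parser produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as '2 0 R'."""
    obj_num: int
    gen_num: int = 0


def iter_refs(value):
    """Yield every Ref found anywhere inside a nested value."""
    if isinstance(value, Ref):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_refs(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_refs(item)


def find_dangling_refs(objects):
    """Return sorted (holder, target) pairs whose target object is missing."""
    problems = []
    for holder, value in objects.items():
        for ref in iter_refs(value):
            if ref.obj_num not in objects:
                problems.append((holder, ref.obj_num))
    return sorted(problems)
```

Note that the PDF specification treats a reference to an undefined object as a reference to null rather than as a hard error, so a validator may prefer to report dangling references as warnings.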

Key Data Structures

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union
from enum import Enum

class PDFObjectType(Enum):
    NULL = "null"
    BOOLEAN = "boolean"
    INTEGER = "integer"
    REAL = "real"
    STRING = "string"
    NAME = "name"
    ARRAY = "array"
    DICTIONARY = "dictionary"
    STREAM = "stream"
    REFERENCE = "reference"

@dataclass
class PDFReference:
    obj_num: int
    gen_num: int = 0

@dataclass
class PDFStream:
    dictionary: Dict[str, Any]  # typing.Any, not the builtin any() function
    data: bytes
    is_compressed: bool = False

@dataclass
class PDFObject:
    obj_num: int
    gen_num: int
    value: Union[bool, int, float, str, bytes,
                 List['PDFObject'], Dict[str, 'PDFObject'],
                 PDFStream, PDFReference, None]

@dataclass
class PDFDocument:
    version: str
    objects: Dict[int, PDFObject]
    root_obj_num: int

    def get_object(self, num: int) -> Optional[PDFObject]:
        return self.objects.get(num)

    def resolve_ref(self, ref: PDFReference) -> Optional[PDFObject]:
        return self.objects.get(ref.obj_num)
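
One limitation of the structures above is that names and strings are both plain str values, told apart only by a leading slash, so a string that happens to start with "/" would be written as a name. A possible remedy, sketched here under the assumption that you are free to add a type, is a small str subclass for names. PDFName and to_pdf_syntax are hypothetical and not part of the code above.

```python
class PDFName(str):
    """A PDF name, stored without its leading slash (hypothetical helper)."""


def to_pdf_syntax(value) -> str:
    """Serialize a name or a literal string to PDF syntax."""
    # PDFName is a str subclass, so it must be tested before plain str.
    if isinstance(value, PDFName):
        return "/" + value
    if isinstance(value, str):
        escaped = (value.replace("\\", "\\\\")
                        .replace("(", "\\(")
                        .replace(")", "\\)"))
        return "(" + escaped + ")"
    raise TypeError(f"unsupported value: {value!r}")
```

This sketch does not apply the #xx escaping that names containing spaces or delimiters require.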

Implementation Guide

Phase 1: Assembly Parser (Days 1-4)

Create a parser for the assembly language:

# pdfasm/lexer.py
import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator

class TokenType(Enum):
    # Keywords
    VERSION = auto()
    OBJECT = auto()
    END = auto()
    STREAM = auto()
    ENDSTREAM = auto()

    # Values
    INTEGER = auto()
    REAL = auto()
    STRING = auto()
    NAME = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()

    # Delimiters
    LBRACKET = auto()
    RBRACKET = auto()
    LBRACE = auto()
    RBRACE = auto()
    REF = auto()  # "R" suffix

    # Misc
    COMMENT = auto()
    NEWLINE = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType
    value: str
    line: int
    column: int

class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1

    def tokenize(self) -> Iterator[Token]:
        while self.pos < len(self.source):
            # Skip whitespace (except newlines)
            if self.source[self.pos] in ' \t\r':
                self.advance()
                continue

            # Newlines
            if self.source[self.pos] == '\n':
                yield Token(TokenType.NEWLINE, '\n', self.line, self.column)
                self.advance()
                self.line += 1
                self.column = 1
                continue

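            # Comments
            if self.source[self.pos] == ';':
                # Remember where the comment starts so the token reports
                # its first column rather than the end of the line
                start, start_col = self.pos, self.column
                while self.pos < len(self.source) and self.source[self.pos] != '\n':
                    self.advance()
                yield Token(TokenType.COMMENT, self.source[start:self.pos],
                           self.line, start_col)
                continue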

            # Strings
            if self.source[self.pos] == '"':
                yield self.read_string()
                continue

            # Numbers
            if self.source[self.pos].isdigit() or self.source[self.pos] == '-':
                yield self.read_number()
                continue

            # Delimiters
            if self.source[self.pos] == '[':
                yield Token(TokenType.LBRACKET, '[', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == ']':
                yield Token(TokenType.RBRACKET, ']', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == '{':
                yield Token(TokenType.LBRACE, '{', self.line, self.column)
                self.advance()
                continue
            if self.source[self.pos] == '}':
                yield Token(TokenType.RBRACE, '}', self.line, self.column)
                self.advance()
                continue

            # Names and keywords
            if self.source[self.pos].isalpha() or self.source[self.pos] == '/':
                yield self.read_name_or_keyword()
                continue

            raise SyntaxError(f"Unexpected character at line {self.line}: "
                            f"'{self.source[self.pos]}'")

        yield Token(TokenType.EOF, '', self.line, self.column)

    def advance(self):
        self.pos += 1
        self.column += 1

    def read_string(self) -> Token:
        start_line, start_col = self.line, self.column
        self.advance()  # Skip opening quote
        start = self.pos

        while self.pos < len(self.source) and self.source[self.pos] != '"':
            if self.source[self.pos] == '\\':
                self.advance()  # Skip the backslash so an escaped quote does not end the string
            self.advance()

        if self.pos >= len(self.source):
            raise SyntaxError(f"Unterminated string starting at line {start_line}")

        value = self.source[start:self.pos]  # Escape sequences are kept raw
        self.advance()  # Skip closing quote
        return Token(TokenType.STRING, value, start_line, start_col)

    def read_number(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos
        token_type = TokenType.INTEGER

        if self.source[self.pos] == '-':
            self.advance()

        while self.pos < len(self.source) and self.source[self.pos].isdigit():
            self.advance()

        if self.pos < len(self.source) and self.source[self.pos] == '.':
            token_type = TokenType.REAL
            self.advance()
            while self.pos < len(self.source) and self.source[self.pos].isdigit():
                self.advance()

        text = self.source[start:self.pos]
        if not any(ch.isdigit() for ch in text):
            # A lone '-' or '-.' is not a number
            raise SyntaxError(f"Invalid number at line {start_line}: '{text}'")

        return Token(token_type, text, start_line, start_col)

    def read_name_or_keyword(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos

        # Handle /Name syntax
        if self.source[self.pos] == '/':
            self.advance()

        while (self.pos < len(self.source) and
               (self.source[self.pos].isalnum() or self.source[self.pos] == '_')):
            self.advance()

        value = self.source[start:self.pos]

        # Check for keywords
        keywords = {
            'version': TokenType.VERSION,
            'object': TokenType.OBJECT,
            'end': TokenType.END,
            'stream': TokenType.STREAM,
            'endstream': TokenType.ENDSTREAM,
            'true': TokenType.TRUE,
            'false': TokenType.FALSE,
            'null': TokenType.NULL,
            'R': TokenType.REF,
        }

        token_type = keywords.get(value, TokenType.NAME)
        return Token(token_type, value, start_line, start_col)
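
The lexer above has no rule for the raw text between stream and endstream, so the "(" in a line such as (Hello, PDF World!) Tj would raise SyntaxError. One way around this, sketched below, is to capture stream bodies line by line before tokenizing the rest of the object. The function extract_stream_body is hypothetical, and the sketch assumes that endstream always sits alone on its own line and that the body's common indentation is not significant.

```python
import textwrap


def extract_stream_body(lines, start):
    """Return (body_bytes, next_index) for the stream opening at lines[start].

    lines[start] must begin with the keyword 'stream' (options such as
    'compress=true' may follow it). next_index points just past 'endstream'.
    """
    if lines[start].split()[:1] != ["stream"]:
        raise SyntaxError(f"line {start + 1}: expected 'stream'")

    body = []
    for index in range(start + 1, len(lines)):
        if lines[index].strip() == "endstream":
            # Remove the indentation shared by all body lines.
            text = textwrap.dedent("\n".join(body))
            return text.encode("latin-1"), index + 1
        body.append(lines[index])

    raise SyntaxError(f"line {start + 1}: 'stream' without 'endstream'")
```

The parser can then splice the returned bytes into a PDFStream and resume normal tokenizing at next_index.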

Phase 2: Assembly to PDF (Days 5-8)

# pdfasm/assembler.py
import zlib
from typing import Dict, List, Tuple

class PDFAssembler:
    def __init__(self, document: PDFDocument, compress: bool = False):
        self.doc = document
        self.compress = compress
        self.output = bytearray()
        self.offsets: Dict[int, int] = {}  # obj_num -> byte offset

    def assemble(self) -> bytes:
        # Write header
        self.write_line(f"%PDF-{self.doc.version}")
        self.write_line("%\xe2\xe3\xcf\xd3")  # Binary marker

        # Write objects in order
        for obj_num in sorted(self.doc.objects.keys()):
            self.write_object(obj_num)

        # Write xref table
        xref_offset = len(self.output)
        self.write_xref()

        # Write trailer
        self.write_trailer(xref_offset)

        return bytes(self.output)

    def write_line(self, line: str):
        self.output.extend(line.encode('latin-1'))
        self.output.extend(b'\n')

    def write_object(self, obj_num: int):
        obj = self.doc.objects[obj_num]

        # Record offset
        self.offsets[obj_num] = len(self.output)

        # Write object header
        self.write_line(f"{obj_num} {obj.gen_num} obj")

        # Write value
        if isinstance(obj.value, PDFStream):
            self.write_stream(obj.value)
        else:
            self.write_value(obj.value)

        self.write_line("endobj")

    def write_value(self, value) -> None:
        # Arrays and dictionaries have dedicated writers; every other value
        # shares its formatting with value_to_string so the two cannot drift
        if isinstance(value, list):
            self.write_array(value)
        elif isinstance(value, dict):
            self.write_dict(value)
        else:
            self.write_line(self.value_to_string(value))

    def write_array(self, arr: List):
        parts = []
        for item in arr:
            parts.append(self.value_to_string(item))
        self.write_line(f"[{' '.join(parts)}]")

    def write_dict(self, d: Dict):
        self.write_line("<<")
        for key, value in d.items():
            if not key.startswith('/'):
                key = '/' + key
            self.write_line(f"  {key} {self.value_to_string(value)}")
        self.write_line(">>")

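    def write_stream(self, stream: PDFStream):
        data = stream.data

        # Work on a copy with slash-free keys so the document is not mutated
        # and '/Length' and 'Length' cannot both end up in the output
        dictionary = {key.lstrip('/'): value
                      for key, value in stream.dictionary.items()}

        # Compress if requested, but never stack FlateDecode on top of data
        # that already declares a filter
        if self.compress and not stream.is_compressed and 'Filter' not in dictionary:
            data = zlib.compress(data)
            dictionary['Filter'] = '/FlateDecode'

        dictionary['Length'] = len(data)

        # Write dictionary
        self.write_dict(dictionary)

        # Write stream
        self.write_line("stream")
        self.output.extend(data)
        if not data.endswith(b'\n'):
            self.output.extend(b'\n')  # EOL before endstream, not counted in Length
        self.write_line("endstream")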

    def value_to_string(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
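        elif isinstance(value, float):
            # PDF reals cannot use exponent notation, which the 'g' format
            # produces for very large or very small values
            return f"{value:.6f}".rstrip('0').rstrip('.')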
        elif isinstance(value, str):
            if value.startswith('/'):
                return value
            return f"({self.escape_string(value)})"
        elif isinstance(value, PDFReference):
            return f"{value.obj_num} {value.gen_num} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.value_to_string(v) for v in value) + ']'
        elif isinstance(value, dict):
            parts = []
            for k, v in value.items():
                if not k.startswith('/'):
                    k = '/' + k
                parts.append(f"{k} {self.value_to_string(v)}")
            return '<< ' + ' '.join(parts) + ' >>'
        return str(value)

    def escape_string(self, s: str) -> str:
        return s.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')

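    def write_xref(self):
        # Object numbers may have gaps, so emit one entry per number from 0
        # up to the highest in use. Missing numbers become free entries that
        # are chained together, with the last one linking back to object 0.
        size = max(self.offsets, default=0) + 1
        free = [n for n in range(size) if n == 0 or n not in self.offsets]
        next_free = dict(zip(free, free[1:] + [0]))

        self.write_line("xref")
        self.write_line(f"0 {size}")
        for obj_num in range(size):
            # Each entry is exactly 20 bytes including the line ending
            if obj_num in self.offsets:
                gen = self.doc.objects[obj_num].gen_num
                self.write_line(f"{self.offsets[obj_num]:010d} {gen:05d} n ")
            else:
                gen = 65535 if obj_num == 0 else 0
                self.write_line(f"{next_free[obj_num]:010d} {gen:05d} f ")

    def write_trailer(self, xref_offset: int):
        # /Size is one more than the highest object number, not the object count
        size = max(self.offsets, default=0) + 1
        self.write_line("trailer")
        self.write_line(f"<< /Size {size} /Root {self.doc.root_obj_num} 0 R >>")
        self.write_line("startxref")
        self.write_line(str(xref_offset))
        self.write_line("%%EOF")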

Phase 3: PDF to Assembly (Days 9-12)

# pdfasm/disassembler.py
import zlib
from typing import Dict, List, Optional, Tuple

class PDFDisassembler:
    def __init__(self, pdf_data: bytes, decompress: bool = True, annotate: bool = True):
        self.data = pdf_data
        self.decompress = decompress
        self.annotate = annotate
        self.pos = 0

    def disassemble(self) -> str:
        lines = []

        # Parse PDF
        version = self.parse_version()
        lines.append(f"; PDF Disassembly")
        lines.append(f"; Original size: {len(self.data)} bytes")
        lines.append("")
        lines.append(f"version {version}")
        lines.append("")

        # Find xref and parse objects
        xref_offset = self.find_startxref()
        xref = self.parse_xref(xref_offset)
        trailer = self.parse_trailer()

        # Parse and output each object
        for obj_num, (offset, gen) in sorted(xref.items()):
            if offset == 0:  # Free object
                continue

            obj = self.parse_object_at(offset)
            lines.extend(self.format_object(obj_num, gen, obj))
            lines.append("")

        return '\n'.join(lines)

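    def parse_version(self) -> str:
        # Find %PDF-X.X
        idx = self.data.find(b'%PDF-')
        if idx == -1:
            raise ValueError("Not a PDF file")
        # The header line may end in LF, CR, or CRLF, so cut at the first
        # whitespace instead of searching for '\n'
        fields = self.data[idx + 5:idx + 16].split()
        if not fields:
            raise ValueError("Missing PDF version")
        return fields[0].decode('latin-1')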

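    def find_startxref(self) -> int:
        # Search from end of file
        idx = self.data.rfind(b'startxref')
        if idx == -1:
            raise ValueError("Cannot find startxref")

        # Read the offset: the first whitespace-delimited token after the
        # keyword, which works for LF, CR, and CRLF line endings alike
        tokens = self.data[idx + len(b'startxref'):].split(None, 1)
        if not tokens or not tokens[0].isdigit():
            raise ValueError("Malformed startxref offset")
        return int(tokens[0])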

    def parse_xref(self, offset: int) -> Dict[int, Tuple[int, int]]:
        self.pos = offset
        xref = {}

        # Skip "xref" keyword
        self.skip_whitespace()
        self.expect(b'xref')
        self.skip_whitespace()

        # Read subsections
        while True:
            line = self.read_line()
            if line.startswith(b'trailer'):
                self.pos -= len(line) + 1
                break

            parts = line.split()
            if len(parts) != 2:
                break

            first_obj = int(parts[0])
            count = int(parts[1])

            for i in range(count):
                entry = self.read_line()
                offset_str, gen_str, status = entry.split()
                offset = int(offset_str)
                gen = int(gen_str)

                if status == b'n':  # In use
                    xref[first_obj + i] = (offset, gen)

        return xref

    def parse_trailer(self) -> Dict:
        # Find trailer
        idx = self.data.rfind(b'trailer')
        self.pos = idx + 7
        self.skip_whitespace()
        return self.parse_dict()

    def parse_object_at(self, offset: int) -> object:
        self.pos = offset

        # Read "N G obj"
        obj_num = self.read_number()
        gen = self.read_number()
        self.skip_whitespace()
        self.expect(b'obj')
        self.skip_whitespace()

        # Parse value
        value = self.parse_value()

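        # Check for stream
        self.skip_whitespace()
        if self.peek(6) == b'stream':
            self.pos += 6
            # The keyword is followed by exactly one EOL (CRLF or LF).
            # Skipping more would eat data that starts with newline bytes.
            if self.peek(2) == b'\r\n':
                self.pos += 2
            elif self.peek(1) in (b'\n', b'\r'):
                self.pos += 1

            # Get length from dictionary
            length = value.get('/Length')
            if not isinstance(length, int):
                # Missing or indirect /Length: fall back to scanning for the
                # endstream keyword and dropping the single EOL before it.
                # Binary data containing b'endstream' would defeat this.
                end = self.data.find(b'endstream', self.pos)
                if end == -1:
                    raise ValueError(f"Unterminated stream at offset {offset}")
                if end - self.pos >= 2 and self.data[end - 2:end] == b'\r\n':
                    end -= 2
                elif end > self.pos and self.data[end - 1:end] in (b'\n', b'\r'):
                    end -= 1
                length = end - self.pos

            stream_data = self.data[self.pos:self.pos + length]
            self.pos += length

            # Decompress if needed (single FlateDecode filter without
            # predictor parameters only)
            if (self.decompress and value.get('/Filter') == '/FlateDecode'
                    and '/DecodeParms' not in value):
                try:
                    stream_data = zlib.decompress(stream_data)
                except zlib.error:
                    pass  # Keep compressed
                else:
                    # The body is now plain, so stop advertising the filter;
                    # otherwise reassembly would label raw data as compressed
                    value = {k: v for k, v in value.items() if k != '/Filter'}

            return {'type': 'stream', 'dict': value, 'data': stream_data}

        return value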

    def format_object(self, obj_num: int, gen: int, obj) -> List[str]:
        lines = []

        if self.annotate:
            lines.append(f"; Object {obj_num}")

        lines.append(f"object {obj_num}")

        if isinstance(obj, dict) and obj.get('type') == 'stream':
            # Stream object
            for key, value in obj['dict'].items():
                if key == '/Length':
                    continue  # Will be calculated
                lines.append(f"  {key[1:]} {self.format_value(value)}")

            lines.append("  stream")
            # Indent stream content
            stream_text = obj['data'].decode('latin-1', errors='replace')
            for line in stream_text.split('\n'):
                lines.append(f"    {line}")
            lines.append("  endstream")
        elif isinstance(obj, dict):
            # Dictionary object
            for key, value in obj.items():
                lines.append(f"  {key[1:]} {self.format_value(value)}")
        else:
            lines.append(f"  {self.format_value(obj)}")

        lines.append("end")
        return lines

    def format_value(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
        elif isinstance(value, float):
            return f"{value:.6g}"
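        elif isinstance(value, str):
            if value.startswith('/'):
                return value[1:]  # Remove leading /
            # Escape backslashes and quotes so the lexer can read the string back
            escaped = value.replace('\\', '\\\\').replace('"', '\\"')
            return f'"{escaped}"'
        elif isinstance(value, tuple):  # Reference
            return f"{value[0]} {value[1]} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.format_value(v) for v in value) + ']'
        elif isinstance(value, dict):
            # Whitespace-separated "Key value" pairs: the lexer has no
            # ':' or ',' tokens, so emitting them would break reassembly
            parts = [f"{k[1:]} {self.format_value(v)}" for k, v in value.items()]
            return '{ ' + ' '.join(parts) + ' }'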
        return str(value)

    # ... helper methods for parsing ...
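
The helper methods are left out above. As an illustration of what one of them might involve, here is a sketch of decoding a PDF literal string with balanced parentheses, backslash escapes, and octal codes. parse_literal_string is a hypothetical standalone function, not the elided implementation, and it does not normalize bare CR or CRLF inside the string to LF as the specification requires.

```python
def parse_literal_string(data: bytes, pos: int):
    """Decode the literal string starting at data[pos] == '('.

    Returns (decoded_bytes, index_after_closing_paren).
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError("expected '('")
    escapes = {ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8,
               ord("f"): 12, ord("("): 40, ord(")"): 41, ord("\\"): 92}
    out = bytearray()
    depth = 1
    pos += 1
    while pos < len(data):
        byte = data[pos]
        if byte == 0x5C:  # backslash starts an escape sequence
            pos += 1
            if pos >= len(data):
                break
            nxt = data[pos]
            if nxt in escapes:
                out.append(escapes[nxt])
                pos += 1
            elif 0x30 <= nxt <= 0x37:  # octal code, one to three digits
                end = pos
                while (end < len(data) and end < pos + 3
                       and 0x30 <= data[end] <= 0x37):
                    end += 1
                out.append(int(data[pos:end], 8) & 0xFF)
                pos = end
            elif nxt in (10, 13):  # backslash + EOL is a line continuation
                pos += 1
                if nxt == 13 and data[pos:pos + 1] == b"\n":
                    pos += 1
            else:  # unknown escape: the backslash is ignored
                out.append(nxt)
                pos += 1
        elif byte == 0x28:  # nested '(' is kept and must be balanced
            depth += 1
            out.append(byte)
            pos += 1
        elif byte == 0x29:  # ')'
            depth -= 1
            if depth == 0:
                return bytes(out), pos + 1
            out.append(byte)
            pos += 1
        else:
            out.append(byte)
            pos += 1
    raise ValueError("unterminated literal string")
```

The function returns bytes rather than str because PDF strings carry no encoding of their own; the caller decides how to display them.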

Phase 4: CLI Tool (Days 13-14)

#!/usr/bin/env python3
# pdfasm/cli.py
import argparse
import sys
from pathlib import Path

from .lexer import Lexer
from .parser import Parser
from .assembler import PDFAssembler
from .disassembler import PDFDisassembler
from .validator import PDFValidator

def cmd_assemble(args):
    """Assemble .pdfasm to .pdf"""
    source = Path(args.input).read_text()

    # Parse assembly
    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    document = parser.parse()

    # Assemble to PDF
    assembler = PDFAssembler(document, compress=args.compress)
    pdf_data = assembler.assemble()

    # Write output
    Path(args.output).write_bytes(pdf_data)
    print(f"Assembled {args.input} to {args.output} ({len(pdf_data)} bytes)")

def cmd_disassemble(args):
    """Disassemble .pdf to .pdfasm"""
    pdf_data = Path(args.input).read_bytes()

    # Disassemble
    disasm = PDFDisassembler(
        pdf_data,
        decompress=args.decompress,
        annotate=args.annotate
    )
    assembly = disasm.disassemble()

    # Write output
    Path(args.output).write_text(assembly)
    print(f"Disassembled {args.input} to {args.output}")

def cmd_validate(args):
    """Validate PDF or assembly"""
    path = Path(args.input)

    if path.suffix == '.pdfasm':
        # Validate assembly syntax
        source = path.read_text()
        try:
            # Lexing and parsing can both raise SyntaxError, so both stay
            # inside the try block
            lexer = Lexer(source)
            parser = Parser(lexer.tokenize())
            document = parser.parse()
            print(f"Syntax OK: {len(document.objects)} objects")
        except SyntaxError as e:
            print(f"Syntax error: {e}")
            sys.exit(1)
    else:
        # Validate PDF structure
        pdf_data = path.read_bytes()
        validator = PDFValidator(pdf_data)
        errors, warnings = validator.validate()

        for err in errors:
            print(f"ERROR: {err}")
        for warn in warnings:
            print(f"WARNING: {warn}")

        if errors:
            sys.exit(1)
        print("PDF structure is valid")

def cmd_explore(args):
    """Interactive PDF exploration"""
    pdf_data = Path(args.input).read_bytes()
    disasm = PDFDisassembler(pdf_data)

    print(f"PDF Explorer - {args.input}")
    print("Commands: list, show <n>, page <n>, stream <n>, tree, quit")
    print()

    while True:
        try:
            cmd = input("> ").strip()
        except EOFError:
            break

        if cmd == 'quit':
            break
        elif cmd == 'list':
            # List all objects
            for obj_num in sorted(disasm.xref.keys()):
                print(f"  {obj_num}: ...")  # TODO: Show type
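        elif cmd.startswith('show '):
            try:
                obj_num = int(cmd.split()[1])
                offset = disasm.xref[obj_num][0]
            except (IndexError, ValueError, KeyError):
                # Bad number or unknown object: report it and keep the session alive
                print("Usage: show <n>, where <n> is an existing object number")
                continue
            obj = disasm.parse_object_at(offset)
            print(disasm.format_value(obj))
        # ... more commands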

def main():
    parser = argparse.ArgumentParser(description='PDF Assembly Language Tool')
    subparsers = parser.add_subparsers(dest='command', required=True)

    # assemble command
    p_asm = subparsers.add_parser('assemble', help='Assemble .pdfasm to .pdf')
    p_asm.add_argument('input', help='Input .pdfasm file')
    p_asm.add_argument('-o', '--output', required=True, help='Output .pdf file')
    p_asm.add_argument('--compress', action='store_true', help='Compress streams')
    p_asm.set_defaults(func=cmd_assemble)

    # disassemble command
    p_disasm = subparsers.add_parser('disassemble', help='Disassemble .pdf to .pdfasm')
    p_disasm.add_argument('input', help='Input .pdf file')
    p_disasm.add_argument('-o', '--output', required=True, help='Output .pdfasm file')
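    # store_true with default=True can never be switched off, so expose
    # the negative flags instead; both options still default to True
    p_disasm.add_argument('--no-decompress', dest='decompress', action='store_false',
                         help='Keep streams compressed (default: decompress)')
    p_disasm.add_argument('--no-annotate', dest='annotate', action='store_false',
                         help='Omit comments (default: add comments)')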
    p_disasm.set_defaults(func=cmd_disassemble)

    # validate command
    p_val = subparsers.add_parser('validate', help='Validate PDF or assembly')
    p_val.add_argument('input', help='Input file')
    p_val.set_defaults(func=cmd_validate)

    # explore command
    p_exp = subparsers.add_parser('explore', help='Interactive PDF exploration')
    p_exp.add_argument('input', help='Input .pdf file')
    p_exp.set_defaults(func=cmd_explore)

    args = parser.parse_args()
    args.func(args)

if __name__ == '__main__':
    main()

Testing Strategy

Round-Trip Testing

The most important test: disassemble → reassemble → compare.

# Test round-trip
./pdfasm disassemble original.pdf -o original.pdfasm
./pdfasm assemble original.pdfasm -o reassembled.pdf

# Compare extracted text (a quick check; it does not prove the structure matches)
diff <(pdftotext original.pdf -) <(pdftotext reassembled.pdf -)

# Or use qpdf to normalize and compare
qpdf --qdf original.pdf - > orig.qdf
qpdf --qdf reassembled.pdf - > reasm.qdf
diff orig.qdf reasm.qdf

Unit Tests

def test_simple_document():
    source = """
    version 1.4

    object 1
      type Catalog
      pages 2 0 R
    end

    object 2
      type Pages
      kids [3 0 R]
      count 1
    end

    object 3
      type Page
      parent 2 0 R
      mediabox [0 0 612 792]
    end
    """

    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    doc = parser.parse()

    assert doc.version == "1.4"
    assert len(doc.objects) == 3
    assert doc.objects[1].value['/Type'] == '/Catalog'

def test_assemble_disassemble():
    # Create minimal document
    doc = PDFDocument(
        version="1.4",
        objects={...},
        root_obj_num=1
    )

    # Assemble
    assembler = PDFAssembler(doc)
    pdf_bytes = assembler.assemble()

    # Disassemble
    disasm = PDFDisassembler(pdf_bytes)
    assembly = disasm.disassemble()

    # Parse again
    lexer = Lexer(assembly)
    parser = Parser(lexer.tokenize())
    doc2 = parser.parse()

    # Should be equivalent
    assert doc.version == doc2.version
    assert len(doc.objects) == len(doc2.objects)

Common Pitfalls

1. String Escaping

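PDF literal strings are delimited by parentheses, so backslashes and unbalanced parentheses inside them must be escaped. Escaping every parenthesis is the simple, safe choice: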

def escape_pdf_string(s: str) -> str:
    return (s.replace('\\', '\\\\')
             .replace('(', '\\(')
             .replace(')', '\\)')
             .replace('\n', '\\n')
             .replace('\r', '\\r'))

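def unescape_pdf_string(s: str) -> str:
    # Handle the escape sequences defined for PDF literal strings
    simple = {'n': '\n', 'r': '\r', 't': '\t', 'b': '\b', 'f': '\f',
              '\\': '\\', '(': '(', ')': ')'}
    result = []
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            next_char = s[i + 1]
            if next_char in simple:
                result.append(simple[next_char])
                i += 2
            elif next_char in '01234567':
                # Octal escape: one to three octal digits (\ddd)
                j = i + 1
                while j < len(s) and j < i + 4 and s[j] in '01234567':
                    j += 1
                result.append(chr(int(s[i + 1:j], 8) & 0xFF))
                i = j
            elif next_char in '\r\n':
                # Backslash + end-of-line is a line continuation: emit nothing
                i += 3 if s[i + 1:i + 3] == '\r\n' else 2
            else:
                # Unknown escape: drop the backslash but keep the character
                result.append(next_char)
                i += 2
        else:
            result.append(s[i])
            i += 1
    return ''.join(result)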

2. Object Number Assignment

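When assembling, every object needs a unique number, and each label must map to the number its object receives so that labeled references can be resolved later:

from typing import Dict, List

def assign_object_numbers(objects: List[PDFObject]) -> Dict[str, int]: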
    """Assign sequential object numbers, handling labeled references."""
    number_map = {}
    current_num = 1

    for obj in objects:
        if obj.label:
            number_map[obj.label] = current_num
        obj.obj_num = current_num
        current_num += 1

    return number_map
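
Once the label map exists, labeled references still have to be rewritten into numeric ones. The following is a minimal sketch of that second pass, not the project's required design: `LabelRef` and `resolve_labels` are hypothetical names, and representing a resolved reference as the string "N 0 R" is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class LabelRef:
    """Hypothetical placeholder for a reference written by label."""
    label: str


def resolve_labels(value: Any, number_map: Dict[str, int]) -> Any:
    """Recursively replace LabelRef placeholders with 'N 0 R' strings."""
    if isinstance(value, LabelRef):
        if value.label not in number_map:
            raise KeyError(f"undefined label: {value.label}")
        return f"{number_map[value.label]} 0 R"
    if isinstance(value, dict):
        return {k: resolve_labels(v, number_map) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_labels(v, number_map) for v in value]
    return value  # numbers, names, strings: nothing to resolve


# Usage: the map is what assign_object_numbers() would return
number_map = {"catalog": 1, "pages": 2, "page1": 3}
pages = {"/Type": "/Pages", "/Kids": [LabelRef("page1")], "/Count": 1}
resolved = resolve_labels(pages, number_map)
```

Running the pass only after all numbers are assigned lets a label be referenced before the object that defines it appears in the source.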

3. Stream Length

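The /Length value must equal the number of bytes actually written between the stream and endstream keywords, measured after any compression:

import zlib

def write_stream(self, stream: PDFStream):
    data = stream.data

    # Calculate length AFTER any processing
    if self.compress:
        data = zlib.compress(data)
        # A compressed stream must declare its filter, or readers will
        # treat the bytes as raw content (write the name in whatever
        # form your object model uses)
        stream.dictionary['Filter'] = '/FlateDecode'

    stream.dictionary['Length'] = len(data)  # Exact length

    self.write_dict(stream.dictionary)
    self.write_line("stream")
    self.output.extend(data)
    # One end-of-line before "endstream" is expected and is NOT counted
    # in /Length; don't write anything else after the data
    self.output.extend(b"\n")
    self.write_line("endstream")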
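
A cheap way to catch length bugs early is to serialize a stream and then check the declared /Length against the bytes that were really written. The sketch below is standalone and does not use the assembler class; `build_stream` and `declared_length_matches` are hypothetical helper names.

```python
import re
import zlib


def build_stream(content: bytes, compress: bool = False) -> bytes:
    """Serialize a stream with /Length computed from the bytes actually written."""
    entries = []
    if compress:
        content = zlib.compress(content)
        entries.append(b"/Filter /FlateDecode")
    entries.append(b"/Length %d" % len(content))
    header = b"<< " + b" ".join(entries) + b" >>\nstream\n"
    # The end-of-line before "endstream" is not part of the counted data
    return header + content + b"\nendstream\n"


def declared_length_matches(serialized: bytes) -> bool:
    """Compare /Length with the byte count between the stream keywords."""
    declared = int(re.search(rb"/Length (\d+)", serialized).group(1))
    start = serialized.index(b"stream\n") + len(b"stream\n")
    end = serialized.rindex(b"\nendstream")
    return declared == end - start


raw = b"BT /F1 12 Tf (Hello) Tj ET"
plain = build_stream(raw)
packed = build_stream(raw, compress=True)
print(declared_length_matches(plain), declared_length_matches(packed))
```

The check measures from the first `stream` line to the last `endstream`, so it stays correct even if the compressed payload happens to contain either keyword.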

Extensions

Level 1: Syntax Highlighting

Create syntax highlighting for popular editors (VS Code, vim, Sublime).

Level 2: Semantic Validation

Validate PDF semantics:

  • Catalog must have /Pages
  • Pages must be in a tree
  • References must point to valid objects
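
The three rules above can be sketched as a single pass over the parsed objects. This is an illustrative outline under stated assumptions, not a complete validator: it assumes objects are plain dicts keyed by object number, names are strings such as "/Catalog", and references are strings of the form "N G R". `iter_refs` and `validate_semantics` are hypothetical names.

```python
import re
from typing import Any, Dict, Iterator, List

REF = re.compile(r"^(\d+) (\d+) R$")


def iter_refs(value: Any) -> Iterator[int]:
    """Yield the object number of every 'N G R' string nested inside value."""
    if isinstance(value, str):
        m = REF.match(value)
        if m:
            yield int(m.group(1))
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_refs(v)
    elif isinstance(value, list):
        for v in value:
            yield from iter_refs(v)


def validate_semantics(objects: Dict[int, dict], root: int) -> List[str]:
    errors = []
    # References must point to valid objects
    for num, obj in objects.items():
        for target in iter_refs(obj):
            if target not in objects:
                errors.append(f"object {num}: dangling reference to {target}")
    # Catalog must have /Pages
    catalog = objects.get(root, {})
    if catalog.get("/Type") != "/Catalog":
        errors.append(f"object {root}: root is not a /Catalog")
    if "/Pages" not in catalog:
        errors.append(f"object {root}: Catalog has no /Pages")
        return errors
    # Pages must be in a tree: walk /Kids, reject revisits and wrong /Parent
    seen = set()
    stack = [(n, None) for n in iter_refs(catalog["/Pages"])]
    while stack:
        num, parent = stack.pop()
        if num in seen:
            errors.append(f"object {num}: reached twice in page tree")
            continue
        seen.add(num)
        node = objects.get(num)
        if node is None:
            continue  # already reported as a dangling reference
        if parent is not None and list(iter_refs(node.get("/Parent"))) != [parent]:
            errors.append(f"object {num}: /Parent does not point to {parent}")
        if node.get("/Type") == "/Pages":
            stack.extend((kid, num) for kid in iter_refs(node.get("/Kids", [])))
    return errors


doc = {
    1: {"/Type": "/Catalog", "/Pages": "2 0 R"},
    2: {"/Type": "/Pages", "/Kids": ["3 0 R"], "/Count": 1},
    3: {"/Type": "/Page", "/Parent": "2 0 R", "/MediaBox": [0, 0, 612, 792]},
}
print(validate_semantics(doc, root=1))
```

Returning a list of messages rather than raising on the first problem matches how the `validate` command reports errors and warnings together.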

Level 3: Diff Tool

Create a diff tool that shows semantic differences between PDFs:

./pdfasm diff file1.pdf file2.pdf
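
One possible core for such a command is a per-object, per-key comparison of the two parsed documents. The sketch below assumes each document has already been reduced to a dict of object number to dictionary; `diff_objects` is a hypothetical name and the output format is only a suggestion.

```python
from typing import Dict, List


def diff_objects(old: Dict[int, dict], new: Dict[int, dict]) -> List[str]:
    """Report per-object, per-key differences between two parsed documents."""
    lines = []
    for num in sorted(old.keys() | new.keys()):
        if num not in new:
            lines.append(f"- object {num} removed")
        elif num not in old:
            lines.append(f"+ object {num} added")
        else:
            a, b = old[num], new[num]
            for key in sorted(a.keys() | b.keys()):
                if a.get(key) != b.get(key):
                    lines.append(
                        f"~ object {num} {key}: {a.get(key)!r} -> {b.get(key)!r}"
                    )
    return lines


old_doc = {1: {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]}, 2: {"/Type": "/Font"}}
new_doc = {1: {"/Type": "/Page", "/MediaBox": [0, 0, 595, 842]}, 3: {"/Type": "/Font"}}
for line in diff_objects(old_doc, new_doc):
    print(line)
```

Because objects are matched by number, a file that was merely renumbered will show up as many removals and additions; matching objects by their position in the page tree would be a natural refinement.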

Level 4: Web Interface

Build a web-based PDF explorer with visualization.


Self-Assessment

Before considering this project complete:

  • Can parse and generate all 8 PDF object types
  • Assembly → PDF produces valid, viewable PDFs
  • PDF → Assembly produces readable output
  • Round-trip preserves document structure
  • Handles compressed streams correctly
  • CLI tool is usable and well-documented
  • Can edit a PDF by modifying assembly and reassembling

Resources

Essential Reading

  • Domain-Specific Languages by Martin Fowler - DSL design
  • PDF Reference Manual 1.7 by Adobe - PDF specification
  • Developing with PDF by Leonard Rosenthol - PDF internals

Tools

  • qpdf: PDF normalization and validation
  • pypdf/pikepdf: Python PDF libraries
  • pdftotext: Text extraction for comparison