Project 5: PDF Assembly Language
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 2-3 weeks |
| Programming Language | Python (primary), Rust/C/Go (alternatives) |
| Knowledge Area | Document Processing / DSL Design |
| Prerequisites | Project 2 (PDF parser understanding) |
What you'll build: A human-readable "assembly language" for PDF that compiles to valid PDF files, with a disassembler that converts PDF back to this format.
Why it teaches PostScript→PDF: By creating a readable intermediate format, you'll deeply understand PDF's structure. The assembler/disassembler round-trip proves your understanding is complete.
Learning Objectives
By completing this project, you will:
- Design a domain-specific language (DSL) that maps 1:1 to PDF structures
- Implement an assembler that converts text to binary PDF
- Implement a disassembler that converts PDF to readable text
- Handle all PDF object types and compression
- Achieve round-trip fidelity (PDF → Assembly → PDF produces identical output)
The Core Question You're Answering
"How can we make PDF's complex binary structure human-readable and editable, while maintaining perfect round-trip fidelity?"
This is the same challenge that:
- Assembly language solves for machine code
- JSON/YAML solve for serialized data
- SQL DDL solves for database schema
By building this system, you'll understand:
- What makes a good intermediate representation
- How to balance human readability with technical completeness
- The fundamental structures that make PDF work
Deep Theoretical Foundation
1. PDF as a Database
PDF files are essentially databases with a clever indexing scheme:
| Database Concept | PDF Equivalent |
|---|---|
| Table | Object type (/Type /Page, etc.) |
| Row | Indirect object (1 0 obj ... endobj) |
| Primary key | Object number (1, 2, 3, ...) |
| Foreign key | Reference (2 0 R) |
| Index | Cross-reference table (xref) |
| Binary blob | Stream object |
Queries:
- "Get page 5" → Traverse Catalog → Pages → Kids[4] → Contents
- "Get object 10" → Look up in xref, seek to byte offset
- "Get all fonts" → Traverse page resources
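The database analogy can be made concrete with a small in-memory sketch. The `Ref` class, the `objects` table, and the helper names are illustrative assumptions rather than part of any PDF library, and the page tree is assumed to be flat (no nested /Pages nodes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as `2 0 R`."""
    num: int
    gen: int = 0


# The "table": object number (primary key) -> object value (row).
objects = {
    1: {"/Type": "/Catalog", "/Pages": Ref(2)},
    2: {"/Type": "/Pages", "/Kids": [Ref(3), Ref(4)], "/Count": 2},
    3: {"/Type": "/Page", "/Parent": Ref(2), "/Contents": Ref(5)},
    4: {"/Type": "/Page", "/Parent": Ref(2), "/Contents": Ref(6)},
    5: b"BT (page one) Tj ET",
    6: b"BT (page two) Tj ET",
}


def resolve(value):
    """Follow a foreign key (reference) to the row it points at."""
    return objects[value.num] if isinstance(value, Ref) else value


def get_page(catalog_num, page_number):
    """'Get page N': Catalog -> Pages -> Kids[N-1] (flat page tree assumed)."""
    pages = resolve(objects[catalog_num]["/Pages"])
    return resolve(pages["/Kids"][page_number - 1])


def objects_of_type(type_name):
    """Scan every dictionary object for a matching /Type, like a table scan."""
    return sorted(num for num, obj in objects.items()
                  if isinstance(obj, dict) and obj.get("/Type") == type_name)


page2 = get_page(1, 2)
print(resolve(page2["/Contents"]))
print(objects_of_type("/Page"))
```

In a real file the `objects` dictionary is replaced by the xref table plus a seek to each byte offset, but the lookups keep the same shape.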
2. Why Assembly Language?
Assembly languages provide:
- 1:1 Mapping: Every assembly instruction maps to exactly one machine instruction (or PDF object)
- Human Readability: Mnemonics instead of binary opcodes
- Symbolic References: Labels instead of memory addresses (or object numbers)
- Comments: Documentation within the source
MACHINE CODE:       ASSEMBLY:          PDF ASSEMBLY:
-------------       ---------          -------------
B8 0A 00 00 00      mov eax, 10        object 1
                                         type Catalog
                                         pages @pages_root
89 C3               mov ebx, eax       end
48 01 D8            add rax, rbx       object 2 as pages_root
                                         type Pages
                                         kids [@page1]
                                       end
3. Design Choices
Object Representation:
OPTION 1: Close to PDF Syntax
-----------------------------
1 0 obj
<< /Type /Catalog
   /Pages 2 0 R >>
endobj
OPTION 2: Simplified Syntax
---------------------------
object 1 0
  type Catalog
  pages 2 0 R
end
OPTION 3: Labeled Objects
-------------------------
catalog:
  type Catalog
  pages @pages_root
pages_root:
  type Pages
  kids [@page1]
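Labels only work if every `@label` can be turned back into an object number, including forward references. A minimal two-pass sketch of that resolution follows; the function name `resolve_labels` and the decision to number labels in order of appearance are assumptions made for illustration, and the sketch would also rewrite an `@word` that happens to occur inside a stream body.

```python
import re

LABEL_DEF = re.compile(r"([A-Za-z_][A-Za-z0-9_]*):")
LABEL_REF = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def resolve_labels(source: str) -> str:
    """Rewrite `label:` headers and `@label` uses into numbered form."""
    lines = source.splitlines()

    # Pass 1: give every label definition the next free object number.
    numbers = {}
    for line in lines:
        match = LABEL_DEF.fullmatch(line.strip())
        if match:
            label = match.group(1)
            if label in numbers:
                raise ValueError(f"duplicate label: {label}")
            numbers[label] = len(numbers) + 1

    def substitute(match):
        label = match.group(1)
        if label not in numbers:
            raise KeyError(f"undefined label: {label}")
        return f"{numbers[label]} 0 R"

    # Pass 2: all numbers are known, so forward references resolve too.
    out = []
    for line in lines:
        match = LABEL_DEF.fullmatch(line.strip())
        if match:
            out.append(f"object {numbers[match.group(1)]}")
        else:
            out.append(LABEL_REF.sub(substitute, line))
    return "\n".join(out)


example = """\
catalog:
  type Catalog
  pages @pages_root
pages_root:
  type Pages
  kids [@page1]
page1:
  type Page
  parent @pages_root
"""
print(resolve_labels(example))
```

The two passes are what allow `pages @pages_root` to appear before `pages_root:` is defined.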
Stream Handling:
OPTION A: Inline (base64 for binary)
------------------------------------
object 4
  stream
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end
OPTION B: External File Reference
---------------------------------
object 4
  stream from "content.txt"
  filter FlateDecode
end
OPTION C: Automatic Compression
-------------------------------
object 4
  stream compress=true
    BT
    /F1 12 Tf
    (Hello) Tj
    ET
  endstream
end
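Option A implies a decision per stream: embed the bytes as readable text, or fall back to base64 when they are binary. One possible sketch of that decision is below. The function names and the `"text"`/`"base64"` tags are assumptions, and carriage returns are deliberately treated as binary so that newline translation in text files cannot alter the data.

```python
import base64
from typing import Tuple


def encode_stream_body(data: bytes) -> Tuple[str, str]:
    """Return (encoding, text) for embedding stream bytes in assembly source."""
    # Tab, LF, and printable ASCII are safe to show verbatim.
    printable = all(b in (9, 10) or 32 <= b < 127 for b in data)
    if printable:
        return "text", data.decode("ascii")
    return "base64", base64.b64encode(data).decode("ascii")


def decode_stream_body(encoding: str, text: str) -> bytes:
    """Inverse of encode_stream_body."""
    if encoding == "text":
        return text.encode("ascii")
    if encoding == "base64":
        return base64.b64decode(text)
    raise ValueError(f"unknown stream encoding: {encoding}")


content = b"BT\n/F1 12 Tf\n(Hello) Tj\nET"
print(encode_stream_body(content)[0])          # content stream stays readable
print(encode_stream_body(b"\x89PNG\x00")[0])   # binary data is base64-encoded
```

Whatever tagging scheme you choose, the property to preserve is that decoding returns exactly the original bytes.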
4. The 8 PDF Object Types
Your assembly language must represent all of these. The PDF specification counts integers and reals together as numeric objects and includes the null object, which is how it arrives at eight basic types:
| Type | PDF Syntax | Assembly Syntax (Example) |
|---|---|---|
| Boolean | true, false | true, false |
| Integer | 42 | 42 |
| Real | 3.14159 | 3.14159 |
| String | (Hello) or <48656C6C6F> | "Hello" or hex:48656C6C6F |
| Name | /Type | Type or /Type |
| Array | [1 2 3] | [1, 2, 3] |
| Dictionary | << /Key Value >> | { Key: Value } or indented block |
| Stream | stream...endstream | stream...endstream |
| Null | null | null |
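The two string spellings in the table map onto the two PDF string forms. A small sketch of that conversion is shown here; the function names are hypothetical, and the quoted form is taken literally (escape sequences inside the quotes are not interpreted in this sketch).

```python
def asm_string_to_bytes(token: str) -> bytes:
    """Turn an assembly string token into the bytes it denotes."""
    if token.startswith("hex:"):
        return bytes.fromhex(token[4:])
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return token[1:-1].encode("latin-1")
    raise ValueError(f"not a string token: {token!r}")


def bytes_to_pdf_string(data: bytes) -> str:
    """Emit a literal string when printable, otherwise a hex string."""
    if all(32 <= b < 127 for b in data):
        text = data.decode("ascii")
        # Backslash first, so the escapes added for parentheses survive.
        for ch in "\\()":
            text = text.replace(ch, "\\" + ch)
        return f"({text})"
    return "<" + data.hex().upper() + ">"


print(bytes_to_pdf_string(asm_string_to_bytes('"Hello"')))
print(bytes_to_pdf_string(asm_string_to_bytes("hex:48656C6C6F")))
print(bytes_to_pdf_string(b"\x00\xff"))
```

Both `"Hello"` and `hex:48656C6C6F` denote the same five bytes, so the emitter is free to pick whichever PDF form is more readable.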
Project Specification
Core Features
Assembler (Text → PDF)
# Basic usage
./pdfasm assemble input.pdfasm -o output.pdf
# With compression
./pdfasm assemble input.pdfasm -o output.pdf --compress
# Verbose output
./pdfasm assemble input.pdfasm -o output.pdf -v
Input format example:
; PDF Assembly Language - Hello World
version 1.7
object 1
  type Catalog
  pages 2 0 R
end
object 2
  type Pages
  kids [3 0 R]
  count 1
end
object 3
  type Page
  parent 2 0 R
  mediabox [0 0 612 792]
  contents 4 0 R
  resources {
    Font {
      F1 5 0 R
    }
  }
end
object 4
  stream
    BT
    /F1 24 Tf
    100 700 Td
    (Hello, PDF World!) Tj
    ET
  endstream
end
object 5
  type Font
  subtype Type1
  basefont Helvetica
end
Output: Valid PDF file that opens in any reader.
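The example writes keys in lowercase (`mediabox`, `basefont`), while PDF names are case-sensitive (`/MediaBox`, `/BaseFont`), so something has to translate between the two. One way to do that, sketched under the assumption of a fixed lookup table, is shown below. `KNOWN_KEYS`, `to_pdf_key`, and `to_asm_key` are hypothetical names, and the table covers only the keys used in the example.

```python
# Lowercase assembly keyword -> exact PDF name (illustrative subset).
KNOWN_KEYS = {
    "type": "/Type", "subtype": "/Subtype", "pages": "/Pages",
    "kids": "/Kids", "count": "/Count", "parent": "/Parent",
    "mediabox": "/MediaBox", "contents": "/Contents",
    "resources": "/Resources", "basefont": "/BaseFont",
}


def to_pdf_key(word: str) -> str:
    """Assembly key -> PDF name. Unknown words keep their spelling."""
    if word.startswith("/"):
        return word  # explicit /Name is passed through untouched
    return KNOWN_KEYS.get(word.lower(), "/" + word)


def to_asm_key(name: str) -> str:
    """PDF name -> assembly key, chosen so that to_pdf_key() restores it."""
    bare = name[1:]
    mapped = KNOWN_KEYS.get(bare.lower())
    if mapped == name:
        return bare.lower()
    if mapped is not None:
        return name  # would collide with a keyword, so keep the slash
    return bare


for name in ("/MediaBox", "/F1", "/type"):
    print(name, "->", to_asm_key(name), "->", to_pdf_key(to_asm_key(name)))
```

Keeping the explicit `/Name` form as an escape hatch is what makes the mapping reversible for names that differ from a keyword only by case.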
Disassembler (PDF → Text)
# Basic usage
./pdfasm disassemble input.pdf -o output.pdfasm
# Include comments
./pdfasm disassemble input.pdf -o output.pdfasm --annotate
# Decompress streams
./pdfasm disassemble input.pdf -o output.pdfasm --decompress
Output format: Human-readable assembly that can be reassembled.
Validator
# Validate PDF structure
./pdfasm validate input.pdf
# Validate assembly syntax
./pdfasm validate input.pdfasm --syntax
Interactive Explorer
./pdfasm explore input.pdf
> list # List all objects
> show 3 # Show object 3
> page 1 # Show page 1 info
> stream 4 # Show content stream
> tree # Show document tree
> quit
Solution Architecture
High-Level Design
PDF ASSEMBLY TOOLKIT

ASSEMBLER
  +--------------+    +--------------+    +--------------+
  | Lexer        |--->| Parser       |--->| Emitter      |
  |              |    |              |    |              |
  | Tokenize     |    | Build AST    |    | Write PDF    |
  | .pdfasm      |    | of objects   |    | with xref    |
  +--------------+    +--------------+    +--------------+

DISASSEMBLER
  +--------------+    +--------------+    +--------------+
  | PDF Parser   |--->| Decompressor |--->| Formatter    |
  |              |    |              |    |              |
  | Parse xref   |    | Inflate      |    | Pretty-print |
  | Read objects |    | streams      |    | to .pdfasm   |
  +--------------+    +--------------+    +--------------+

VALIDATOR
  • Check required keys (Catalog has /Pages)
  • Validate references (all refs point to valid objects)
  • Check stream lengths
  • Verify xref offsets
Key Data Structures
from dataclasses import dataclass
from typing import Any, Union, List, Dict, Optional
from enum import Enum


class PDFObjectType(Enum):
    NULL = "null"
    BOOLEAN = "boolean"
    INTEGER = "integer"
    REAL = "real"
    STRING = "string"
    NAME = "name"
    ARRAY = "array"
    DICTIONARY = "dictionary"
    STREAM = "stream"
    REFERENCE = "reference"


@dataclass
class PDFReference:
    obj_num: int
    gen_num: int = 0


@dataclass
class PDFStream:
    dictionary: Dict[str, Any]
    data: bytes
    is_compressed: bool = False


@dataclass
class PDFObject:
    obj_num: int
    gen_num: int
    # Nested arrays and dictionaries hold plain values, not PDFObject wrappers
    value: Union[bool, int, float, str, bytes,
                 List[Any], Dict[str, Any],
                 PDFStream, PDFReference, None]


@dataclass
class PDFDocument:
    version: str
    objects: Dict[int, PDFObject]
    root_obj_num: int

    def get_object(self, num: int) -> Optional[PDFObject]:
        return self.objects.get(num)

    def resolve_ref(self, ref: PDFReference) -> Optional[PDFObject]:
        # Objects are keyed by number only; the generation is not checked
        return self.objects.get(ref.obj_num)
Implementation Guide
Phase 1: Assembly Parser (Days 1-4)
Create a parser for the assembly language:
# pdfasm/lexer.py
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator


class TokenType(Enum):
    # Keywords
    VERSION = auto()
    OBJECT = auto()
    END = auto()
    STREAM = auto()
    ENDSTREAM = auto()
    # Values
    INTEGER = auto()
    REAL = auto()
    STRING = auto()
    NAME = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()
    # Delimiters
    LBRACKET = auto()
    RBRACKET = auto()
    LBRACE = auto()
    RBRACE = auto()
    REF = auto()  # "R" suffix
    # Misc
    COMMENT = auto()
    NEWLINE = auto()
    EOF = auto()


@dataclass
class Token:
    type: TokenType
    value: str
    line: int
    column: int
class Lexer:
    # Single-character delimiter tokens
    DELIMITERS = {
        '[': TokenType.LBRACKET,
        ']': TokenType.RBRACKET,
        '{': TokenType.LBRACE,
        '}': TokenType.RBRACE,
    }

    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1

    def tokenize(self) -> Iterator[Token]:
        # NOTE: text between `stream` and `endstream` is PDF content-stream
        # syntax, not assembly syntax, and has to be captured raw before
        # this loop sees it (for example, `(Hello) Tj` would fail here).
        while self.pos < len(self.source):
            ch = self.source[self.pos]
            # Skip whitespace (except newlines)
            if ch in ' \t\r':
                self.advance()
                continue
            # Newlines
            if ch == '\n':
                yield Token(TokenType.NEWLINE, '\n', self.line, self.column)
                self.advance()
                self.line += 1
                self.column = 1
                continue
            # Comments run to the end of the line
            if ch == ';':
                start, start_col = self.pos, self.column
                while self.pos < len(self.source) and self.source[self.pos] != '\n':
                    self.advance()
                # Report the column where the comment starts, not where it ends
                yield Token(TokenType.COMMENT, self.source[start:self.pos],
                            self.line, start_col)
                continue
            # Strings
            if ch == '"':
                yield self.read_string()
                continue
            # Numbers
            if ch.isdigit() or ch == '-':
                yield self.read_number()
                continue
            # Delimiters
            if ch in self.DELIMITERS:
                yield Token(self.DELIMITERS[ch], ch, self.line, self.column)
                self.advance()
                continue
            # Names and keywords
            if ch.isalpha() or ch == '/':
                yield self.read_name_or_keyword()
                continue
            raise SyntaxError(f"Unexpected character at line {self.line}: "
                              f"'{ch}'")
        yield Token(TokenType.EOF, '', self.line, self.column)

    def advance(self):
        self.pos += 1
        self.column += 1
    def read_string(self) -> Token:
        start_line, start_col = self.line, self.column
        self.advance()  # Skip opening quote
        start = self.pos
        while self.pos < len(self.source) and self.source[self.pos] != '"':
            if self.source[self.pos] == '\\':
                self.advance()  # Skip escape
            self.advance()
        if self.pos >= len(self.source):
            raise SyntaxError(f"Unterminated string starting at line {start_line}")
        # The value keeps its escape sequences; decoding them is left to the parser
        value = self.source[start:self.pos]
        self.advance()  # Skip closing quote
        return Token(TokenType.STRING, value, start_line, start_col)

    def read_number(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos
        if self.source[self.pos] == '-':
            self.advance()
        digits_start = self.pos
        while self.pos < len(self.source) and self.source[self.pos].isdigit():
            self.advance()
        if self.pos == digits_start:
            raise SyntaxError(f"Expected digits after '-' at line {start_line}")
        if self.pos < len(self.source) and self.source[self.pos] == '.':
            self.advance()
            while self.pos < len(self.source) and self.source[self.pos].isdigit():
                self.advance()
            return Token(TokenType.REAL, self.source[start:self.pos],
                         start_line, start_col)
        return Token(TokenType.INTEGER, self.source[start:self.pos],
                     start_line, start_col)

    def read_name_or_keyword(self) -> Token:
        start_line, start_col = self.line, self.column
        start = self.pos
        # Handle /Name syntax
        if self.source[self.pos] == '/':
            self.advance()
        while (self.pos < len(self.source) and
               (self.source[self.pos].isalnum() or self.source[self.pos] == '_')):
            self.advance()
        value = self.source[start:self.pos]
        # Check for keywords
        keywords = {
            'version': TokenType.VERSION,
            'object': TokenType.OBJECT,
            'end': TokenType.END,
            'stream': TokenType.STREAM,
            'endstream': TokenType.ENDSTREAM,
            'true': TokenType.TRUE,
            'false': TokenType.FALSE,
            'null': TokenType.NULL,
            'R': TokenType.REF,
        }
        token_type = keywords.get(value, TokenType.NAME)
        return Token(token_type, value, start_line, start_col)
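The lexer above tokenizes assembly syntax, but the body of a `stream ... endstream` block is PDF content-stream syntax and would trip it up (parentheses, for instance, are not lexer tokens). One possible approach, sketched here as an assumption rather than the required design, is a pre-pass that lifts stream bodies out before lexing. `extract_streams` is a hypothetical helper. It only recognizes a line consisting of the single word `stream`, it dedents bodies with `textwrap.dedent`, and it would be confused by a body that itself contains a line reading `endstream`.

```python
import textwrap
from typing import List, Tuple


def extract_streams(source: str) -> Tuple[str, List[str]]:
    """Blank out stream bodies and return them separately, in order.

    Body lines are replaced by empty lines so that line numbers in later
    lexer error messages still match the original file.
    """
    kept: List[str] = []
    bodies: List[str] = []
    current = None  # list of body lines while inside a stream, else None
    for line in source.split("\n"):
        word = line.strip()
        if current is None:
            kept.append(line)
            if word == "stream":
                current = []
        elif word == "endstream":
            bodies.append(textwrap.dedent("\n".join(current)))
            current = None
            kept.append(line)
        else:
            current.append(line)
            kept.append("")
    if current is not None:
        raise SyntaxError("stream without matching endstream")
    return "\n".join(kept), bodies


source = "object 4\n  stream\n    BT\n    (Hi) Tj\n    ET\n  endstream\nend"
stripped, bodies = extract_streams(source)
print(bodies[0])
```

The parser can then pair the k-th `stream` token with `bodies[k]`. Dedenting changes leading whitespace, which is harmless for content streams but would need to be avoided for byte-exact round trips.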
Phase 2: Assembly to PDF (Days 5-8)
# pdfasm/assembler.py
import zlib
from typing import Dict, List

# Assumed module name for the dataclasses from "Key Data Structures"
from .model import PDFDocument, PDFReference, PDFStream


class PDFAssembler:
    def __init__(self, document: PDFDocument, compress: bool = False):
        self.doc = document
        self.compress = compress
        self.output = bytearray()
        self.offsets: Dict[int, int] = {}  # obj_num -> byte offset

    def assemble(self) -> bytes:
        # Write header
        self.write_line(f"%PDF-{self.doc.version}")
        # Four bytes >= 0x80 tell transfer tools the file is binary
        self.write_line("%\xe2\xe3\xcf\xd3")
        # Write objects in order
        for obj_num in sorted(self.doc.objects.keys()):
            self.write_object(obj_num)
        # Write xref table
        xref_offset = len(self.output)
        self.write_xref()
        # Write trailer
        self.write_trailer(xref_offset)
        return bytes(self.output)

    def write_line(self, line: str):
        self.output.extend(line.encode('latin-1'))
        self.output.extend(b'\n')

    def write_object(self, obj_num: int):
        obj = self.doc.objects[obj_num]
        # Record offset
        self.offsets[obj_num] = len(self.output)
        # Write object header
        self.write_line(f"{obj_num} {obj.gen_num} obj")
        # Write value
        if isinstance(obj.value, PDFStream):
            self.write_stream(obj.value)
        else:
            self.write_value(obj.value)
        self.write_line("endobj")
    def write_value(self, value) -> None:
        # Arrays and dictionaries get their own layout; every scalar shares
        # the formatting rules in value_to_string()
        if isinstance(value, list):
            self.write_array(value)
        elif isinstance(value, dict):
            self.write_dict(value)
        else:
            self.write_line(self.value_to_string(value))

    def write_array(self, arr: List):
        parts = [self.value_to_string(item) for item in arr]
        self.write_line(f"[{' '.join(parts)}]")

    def write_dict(self, d: Dict):
        self.write_line("<<")
        for key, value in d.items():
            if not key.startswith('/'):
                key = '/' + key
            self.write_line(f"  {key} {self.value_to_string(value)}")
        self.write_line(">>")
    def write_stream(self, stream: PDFStream):
        data = stream.data
        # Work on a copy (minus any stale Length) so assembling does not
        # mutate the document
        dictionary = {k: v for k, v in stream.dictionary.items()
                      if k.lstrip('/') != 'Length'}
        # Compress if requested
        if self.compress and not stream.is_compressed:
            data = zlib.compress(data)
            dictionary['Filter'] = '/FlateDecode'
        # /Length counts the stream bytes only, after any compression
        dictionary['Length'] = len(data)
        # Write dictionary
        self.write_dict(dictionary)
        # Write stream
        self.write_line("stream")
        self.output.extend(data)
        # The end-of-line before "endstream" is not part of /Length
        if not data.endswith(b'\n'):
            self.output.extend(b'\n')
        self.write_line("endstream")

    def value_to_string(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):  # must precede int: bool is an int subclass
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
        elif isinstance(value, float):
            # Fixed-point only: PDF reals do not allow exponent notation
            return f"{value:.6f}".rstrip('0').rstrip('.')
        elif isinstance(value, str):
            # Convention: a leading '/' marks a name, anything else is a string
            if value.startswith('/'):
                return value
            return f"({self.escape_string(value)})"
        elif isinstance(value, PDFReference):
            return f"{value.obj_num} {value.gen_num} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.value_to_string(v) for v in value) + ']'
        elif isinstance(value, dict):
            parts = []
            for k, v in value.items():
                if not k.startswith('/'):
                    k = '/' + k
                parts.append(f"{k} {self.value_to_string(v)}")
            return '<< ' + ' '.join(parts) + ' >>'
        return str(value)

    def escape_string(self, s: str) -> str:
        # Backslash first, so the escapes added for parentheses survive
        return s.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')
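    def write_xref(self):
        # One subsection covering 0..max; gaps in the numbering become free
        # entries chained from object 0 (the tail of the chain points to 0)
        size = max(self.offsets, default=0) + 1
        free = [n for n in range(size) if n not in self.offsets]
        next_free = dict(zip(free, free[1:] + [0]))
        self.write_line("xref")
        self.write_line(f"0 {size}")
        for num in range(size):
            # Each entry is exactly 20 bytes including the trailing " \n"
            if num in self.offsets:
                gen = self.doc.objects[num].gen_num
                self.write_line(f"{self.offsets[num]:010d} {gen:05d} n ")
            else:
                gen = 65535 if num == 0 else 0
                self.write_line(f"{next_free[num]:010d} {gen:05d} f ")

    def write_trailer(self, xref_offset: int):
        # /Size is the highest object number plus one, not the object count
        size = max(self.offsets, default=0) + 1
        self.write_line("trailer")
        self.write_line(f"<< /Size {size} /Root {self.doc.root_obj_num} 0 R >>")
        self.write_line("startxref")
        self.write_line(str(xref_offset))
        self.write_line("%%EOF")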
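Because every xref entry stores a raw byte offset, an off-by-one anywhere in the emitter produces a file that some readers reject and others silently repair. A quick self-check is sketched below. `check_xref` is a hypothetical helper that assumes a classic xref table (not a cross-reference stream) and the 20-byte entry layout.

```python
import re
from typing import List

# offset(10) SP generation(5) SP flag, then a two-byte end-of-line
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])(?: \n| \r|\r\n)")


def check_xref(pdf: bytes) -> List[str]:
    """Return a list of problems found in the file's xref table."""
    problems: List[str] = []
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", pdf)
    if tail is None:
        return ["no startxref/%%EOF trailer found"]
    pos = int(tail.group(1))
    if not pdf.startswith(b"xref", pos):
        return [f"startxref {pos} does not point at 'xref'"]
    cursor = pos + len(b"xref")
    while True:
        # Subsection header: "first count"; anything else ends the table.
        header = re.match(rb"\s*(\d+) (\d+)[ ]*\r?\n", pdf[cursor:cursor + 40])
        if header is None:
            break
        first, count = int(header.group(1)), int(header.group(2))
        cursor += header.end()
        for i in range(count):
            entry = ENTRY.fullmatch(pdf[cursor:cursor + 20])
            cursor += 20
            if entry is None:
                problems.append(f"object {first + i}: malformed xref entry")
                continue
            if entry.group(3) == b"n":
                offset, gen = int(entry.group(1)), int(entry.group(2))
                expected = b"%d %d obj" % (first + i, gen)
                if not pdf.startswith(expected, offset):
                    problems.append(
                        f"object {first + i}: offset {offset} is not "
                        f"'{expected.decode()}'")
    return problems
```

Running this over the assembler's own output in a unit test catches offset bugs long before a PDF reader does.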
Phase 3: PDF to Assembly (Days 9-12)
# pdfasm/disassembler.py
import zlib
from typing import Dict, List, Optional, Tuple


class PDFDisassembler:
    def __init__(self, pdf_data: bytes, decompress: bool = True, annotate: bool = True):
        self.data = pdf_data
        self.decompress = decompress
        self.annotate = annotate
        self.pos = 0
        self.xref: Dict[int, Tuple[int, int]] = {}  # filled by disassemble()

    def disassemble(self) -> str:
        lines = []
        # Parse PDF
        version = self.parse_version()
        lines.append("; PDF Disassembly")
        lines.append(f"; Original size: {len(self.data)} bytes")
        lines.append("")
        lines.append(f"version {version}")
        lines.append("")
        # Find xref and parse objects
        xref_offset = self.find_startxref()
        self.xref = self.parse_xref(xref_offset)
        trailer = self.parse_trailer()  # holds /Root and /Info; not emitted yet
        # Parse and output each object
        for obj_num, (offset, gen) in sorted(self.xref.items()):
            if offset == 0:  # Free object
                continue
            obj = self.parse_object_at(offset)
            lines.extend(self.format_object(obj_num, gen, obj))
            lines.append("")
        return '\n'.join(lines)
    def parse_version(self) -> str:
        # Find %PDF-X.X
        idx = self.data.find(b'%PDF-')
        if idx == -1:
            raise ValueError("Not a PDF file")
        # splitlines() copes with LF, CR, and CRLF line endings
        header = self.data[idx:idx + 32].splitlines()[0]
        return header[5:].decode('latin-1').strip()

    def find_startxref(self) -> int:
        # Search from end of file
        idx = self.data.rfind(b'startxref')
        if idx == -1:
            raise ValueError("Cannot find startxref")
        # The offset is the first token after the keyword, whatever the
        # line-ending style
        tokens = self.data[idx + len(b'startxref'):].split()
        if not tokens or not tokens[0].isdigit():
            raise ValueError("Malformed startxref")
        return int(tokens[0])
    def parse_xref(self, offset: int) -> Dict[int, Tuple[int, int]]:
        # Handles classic xref tables only, not cross-reference streams
        self.pos = offset
        xref = {}
        # Skip "xref" keyword
        self.skip_whitespace()
        self.expect(b'xref')
        self.skip_whitespace()
        # Read subsections: a "first count" header followed by count entries
        while True:
            line = self.read_line()
            if line.startswith(b'trailer'):
                self.pos -= len(line) + 1
                break
            parts = line.split()
            if len(parts) != 2:
                break
            first_obj = int(parts[0])
            count = int(parts[1])
            for i in range(count):
                entry = self.read_line()
                offset_str, gen_str, status = entry.split()
                if status == b'n':  # In use; free ('f') entries are skipped
                    xref[first_obj + i] = (int(offset_str), int(gen_str))
        return xref

    def parse_trailer(self) -> Dict:
        # Find trailer
        idx = self.data.rfind(b'trailer')
        if idx == -1:
            raise ValueError("Cannot find trailer")
        self.pos = idx + len(b'trailer')
        self.skip_whitespace()
        return self.parse_dict()
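    def parse_object_at(self, offset: int) -> object:
        self.pos = offset
        # Read "N G obj"
        obj_num = self.read_number()
        gen = self.read_number()
        self.skip_whitespace()
        self.expect(b'obj')
        self.skip_whitespace()
        # Parse value
        value = self.parse_value()
        # A dictionary followed by the "stream" keyword is a stream object
        self.skip_whitespace()
        if self.peek(6) != b'stream':
            return value
        self.pos += 6
        # Exactly one EOL (CRLF or LF) follows the keyword; skipping more
        # would eat stream bytes that happen to be newlines
        if self.peek(2) == b'\r\n':
            self.pos += 2
        elif self.peek(1) == b'\n':
            self.pos += 1
        # Get length from dictionary
        length = value.get('/Length', 0)
        if isinstance(length, tuple):
            # Indirect /Length: fall back to scanning for "endstream".
            # Heuristic only, since binary data could contain the keyword.
            end = self.data.find(b'endstream', self.pos)
            if end == -1:
                raise ValueError(f"Unterminated stream in object {obj_num}")
            stream_data = self.data[self.pos:end]
            # Drop the single EOL that precedes "endstream"
            if stream_data.endswith(b'\r\n'):
                stream_data = stream_data[:-2]
            elif stream_data.endswith((b'\n', b'\r')):
                stream_data = stream_data[:-1]
        else:
            stream_data = self.data[self.pos:self.pos + length]
        self.pos += len(stream_data)
        # Decompress if needed
        decoded = False
        if self.decompress and value.get('/Filter') == '/FlateDecode':
            try:
                stream_data = zlib.decompress(stream_data)
                decoded = True
            except zlib.error:
                pass  # Keep compressed
        return {'type': 'stream', 'dict': value, 'data': stream_data,
                'decoded': decoded}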
    def format_object(self, obj_num: int, gen: int, obj) -> List[str]:
        lines = []
        if self.annotate:
            lines.append(f"; Object {obj_num}")
        lines.append(f"object {obj_num}")
        if isinstance(obj, dict) and obj.get('type') == 'stream':
            # Stream object. /Length is recalculated on assembly, and filter
            # entries no longer describe the data once it has been inflated.
            skip = {'/Length'}
            if obj.get('decoded'):
                skip |= {'/Filter', '/DecodeParms'}
            for key, value in obj['dict'].items():
                if key in skip:
                    continue
                lines.append(f"  {key[1:]} {self.format_value(value)}")
            lines.append("  stream")
            # Indent stream content; latin-1 maps every byte to one character
            stream_text = obj['data'].decode('latin-1')
            for line in stream_text.split('\n'):
                lines.append(f"    {line}")
            lines.append("  endstream")
        elif isinstance(obj, dict):
            # Dictionary object
            for key, value in obj.items():
                lines.append(f"  {key[1:]} {self.format_value(value)}")
        else:
            lines.append(f"  {self.format_value(obj)}")
        lines.append("end")
        return lines
    def format_value(self, value) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, int):
            return str(value)
        elif isinstance(value, float):
            # Fixed-point only: the lexer does not read exponent notation
            return f"{value:.6f}".rstrip('0').rstrip('.')
        elif isinstance(value, str):
            if value.startswith('/'):
                return value[1:]  # Remove leading /
            # Escape backslashes and quotes so the lexer can read it back
            escaped = value.replace('\\', '\\\\').replace('"', '\\"')
            return f'"{escaped}"'
        elif isinstance(value, tuple):  # Reference
            return f"{value[0]} {value[1]} R"
        elif isinstance(value, list):
            return '[' + ' '.join(self.format_value(v) for v in value) + ']'
        elif isinstance(value, dict):
            # Space-separated pairs: the lexer has no ':' or ',' tokens
            parts = []
            for k, v in value.items():
                parts.append(f"{k[1:]} {self.format_value(v)}")
            return '{ ' + ' '.join(parts) + ' }'
        return str(value)

    # ... helper methods for parsing ...
Phase 4: CLI Tool (Days 13-14)
#!/usr/bin/env python3
# pdfasm/cli.py
import argparse
import sys
from pathlib import Path

from .lexer import Lexer
from .parser import Parser
from .assembler import PDFAssembler
from .disassembler import PDFDisassembler
from .validator import PDFValidator


def cmd_assemble(args):
    """Assemble .pdfasm to .pdf"""
    source = Path(args.input).read_text(encoding='utf-8')
    # Parse assembly
    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    document = parser.parse()
    if getattr(args, 'verbose', False):
        print(f"Parsed {len(document.objects)} objects")
    # Assemble to PDF
    assembler = PDFAssembler(document, compress=args.compress)
    pdf_data = assembler.assemble()
    # Write output
    Path(args.output).write_bytes(pdf_data)
    print(f"Assembled {args.input} to {args.output} ({len(pdf_data)} bytes)")


def cmd_disassemble(args):
    """Disassemble .pdf to .pdfasm"""
    pdf_data = Path(args.input).read_bytes()
    # Disassemble
    disasm = PDFDisassembler(
        pdf_data,
        decompress=args.decompress,
        annotate=args.annotate
    )
    assembly = disasm.disassemble()
    # Write output
    Path(args.output).write_text(assembly, encoding='utf-8')
    print(f"Disassembled {args.input} to {args.output}")
def cmd_validate(args):
    """Validate PDF or assembly"""
    path = Path(args.input)
    # --syntax forces assembly validation regardless of the file extension
    if getattr(args, 'syntax', False) or path.suffix == '.pdfasm':
        # Validate assembly syntax
        source = path.read_text(encoding='utf-8')
        lexer = Lexer(source)
        parser = Parser(lexer.tokenize())
        try:
            document = parser.parse()
            print(f"Syntax OK: {len(document.objects)} objects")
        except SyntaxError as e:
            print(f"Syntax error: {e}")
            sys.exit(1)
    else:
        # Validate PDF structure
        pdf_data = path.read_bytes()
        validator = PDFValidator(pdf_data)
        errors, warnings = validator.validate()
        for err in errors:
            print(f"ERROR: {err}")
        for warn in warnings:
            print(f"WARNING: {warn}")
        if errors:
            sys.exit(1)
        print("PDF structure is valid")
def cmd_explore(args):
    """Interactive PDF exploration"""
    pdf_data = Path(args.input).read_bytes()
    disasm = PDFDisassembler(pdf_data)
    # Load the cross-reference table up front; the commands index into it
    xref = disasm.parse_xref(disasm.find_startxref())
    print(f"PDF Explorer - {args.input}")
    print("Commands: list, show <n>, page <n>, stream <n>, tree, quit")
    print()
    while True:
        try:
            cmd = input("> ").strip()
        except EOFError:
            break
        if cmd == 'quit':
            break
        elif cmd == 'list':
            # List all objects
            for obj_num in sorted(xref):
                print(f"  {obj_num}: ...")  # TODO: Show type
        elif cmd.startswith('show '):
            try:
                obj_num = int(cmd.split()[1])
                offset, gen = xref[obj_num]
            except (ValueError, KeyError):
                print("  No such object")
                continue
            obj = disasm.parse_object_at(offset)
            print('\n'.join(disasm.format_object(obj_num, gen, obj)))
        # ... more commands
def main():
    parser = argparse.ArgumentParser(description='PDF Assembly Language Tool')
    subparsers = parser.add_subparsers(dest='command', required=True)
    # assemble command
    p_asm = subparsers.add_parser('assemble', help='Assemble .pdfasm to .pdf')
    p_asm.add_argument('input', help='Input .pdfasm file')
    p_asm.add_argument('-o', '--output', required=True, help='Output .pdf file')
    p_asm.add_argument('--compress', action='store_true', help='Compress streams')
    p_asm.add_argument('-v', '--verbose', action='store_true', help='Verbose output')
    p_asm.set_defaults(func=cmd_assemble)
    # disassemble command
    p_disasm = subparsers.add_parser('disassemble', help='Disassemble .pdf to .pdfasm')
    p_disasm.add_argument('input', help='Input .pdf file')
    p_disasm.add_argument('-o', '--output', required=True, help='Output .pdfasm file')
    # store_true with default=True can never be switched off, so use
    # BooleanOptionalAction (Python 3.9+), which also adds --no-decompress
    # and --no-annotate
    p_disasm.add_argument('--decompress', action=argparse.BooleanOptionalAction,
                          default=True, help='Decompress streams')
    p_disasm.add_argument('--annotate', action=argparse.BooleanOptionalAction,
                          default=True, help='Add comments')
    p_disasm.set_defaults(func=cmd_disassemble)
    # validate command
    p_val = subparsers.add_parser('validate', help='Validate PDF or assembly')
    p_val.add_argument('input', help='Input file')
    p_val.add_argument('--syntax', action='store_true',
                       help='Treat the input as assembly and check its syntax')
    p_val.set_defaults(func=cmd_validate)
    # explore command
    p_exp = subparsers.add_parser('explore', help='Interactive PDF exploration')
    p_exp.add_argument('input', help='Input .pdf file')
    p_exp.set_defaults(func=cmd_explore)
    args = parser.parse_args()
    args.func(args)


if __name__ == '__main__':
    main()
Testing Strategy
Round-Trip Testing
The most important test: disassemble → reassemble → compare
# Test round-trip
./pdfasm disassemble original.pdf -o original.pdfasm
./pdfasm assemble original.pdfasm -o reassembled.pdf
# Compare
diff <(pdftotext original.pdf -) <(pdftotext reassembled.pdf -)
# Or use qpdf to normalize and compare
qpdf --qdf original.pdf - > orig.qdf
qpdf --qdf reassembled.pdf - > reasm.qdf
diff orig.qdf reasm.qdf
Unit Tests
def test_simple_document():
    source = """
version 1.4
object 1
  type Catalog
  pages 2 0 R
end
object 2
  type Pages
  kids [3 0 R]
  count 1
end
object 3
  type Page
  parent 2 0 R
  mediabox [0 0 612 792]
end
"""
    lexer = Lexer(source)
    parser = Parser(lexer.tokenize())
    doc = parser.parse()
    assert doc.version == "1.4"
    assert len(doc.objects) == 3
    assert doc.objects[1].value['/Type'] == '/Catalog'


def test_assemble_disassemble():
    # Create minimal document
    doc = PDFDocument(
        version="1.4",
        objects={...},
        root_obj_num=1
    )
    # Assemble
    assembler = PDFAssembler(doc)
    pdf_bytes = assembler.assemble()
    # Disassemble
    disasm = PDFDisassembler(pdf_bytes)
    assembly = disasm.disassemble()
    # Parse again
    lexer = Lexer(assembly)
    parser = Parser(lexer.tokenize())
    doc2 = parser.parse()
    # Should be equivalent
    assert doc.version == doc2.version
    assert len(doc.objects) == len(doc2.objects)
Common Pitfalls
1. String Escaping
PDF strings can contain special characters:
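def escape_pdf_string(s: str) -> str:
    # Backslash must be escaped first
    return (s.replace('\\', '\\\\')
             .replace('(', '\\(')
             .replace(')', '\\)')
             .replace('\n', '\\n')
             .replace('\r', '\\r'))


def unescape_pdf_string(s: str) -> str:
    # Handle escape sequences
    simple = {'n': '\n', 'r': '\r', 't': '\t', 'b': '\b', 'f': '\f',
              '\\': '\\', '(': '(', ')': ')'}
    result = []
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            next_char = s[i + 1]
            if next_char in simple:
                result.append(simple[next_char])
                i += 2
            elif next_char in '01234567':
                # Octal escape: one to three digits
                j = i + 1
                while j < len(s) and j < i + 4 and s[j] in '01234567':
                    j += 1
                result.append(chr(int(s[i + 1:j], 8) & 0xFF))
                i = j
            elif next_char == '\n':
                i += 2  # Backslash-newline is a line continuation
            else:
                result.append(next_char)  # Unknown escape: drop the backslash
                i += 2
        else:
            result.append(s[i])
            i += 1
    return ''.join(result)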
2. Object Number Assignment
When assembling, you need to assign object numbers correctly:
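def assign_object_numbers(objects: List[PDFObject]) -> Dict[str, int]:
    """Assign sequential object numbers, handling labeled references."""
    number_map = {}
    current_num = 1
    for obj in objects:
        # `label` is an optional attribute set by the parser for labeled
        # objects; it is not a field of the PDFObject dataclass shown earlier
        label = getattr(obj, 'label', None)
        if label:
            number_map[label] = current_num
        obj.obj_num = current_num
        current_num += 1
    return number_map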
3. Stream Length
The /Length value must exactly match the stream content:
def write_stream(self, stream: PDFStream):
    data = stream.data
    # Calculate length AFTER any processing
    if self.compress and not stream.is_compressed:
        data = zlib.compress(data)
        stream.dictionary['Filter'] = '/FlateDecode'
    stream.dictionary['Length'] = len(data)  # Exact length
    self.write_dict(stream.dictionary)
    self.write_line("stream")
    self.output.extend(data)
    # The end-of-line before "endstream" is expected by readers but must
    # NOT be counted in /Length
    self.output.extend(b'\n')
    self.write_line("endstream")
Extensions
Level 1: Syntax Highlighting
Create syntax highlighting for popular editors (VS Code, vim, Sublime).
Level 2: Semantic Validation
Validate PDF semantics:
- Catalog must have /Pages
- Pages must be in a tree
- References must point to valid objects
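These three checks can be sketched over a plain object table. The representation below is an assumption for illustration: `objects` maps object numbers to parsed values, and references are `(num, gen)` tuples, as in the disassembler's `format_value`. The function names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def iter_refs(value) -> Iterator[Tuple[int, int]]:
    """Yield every (num, gen) reference nested anywhere inside a value."""
    if isinstance(value, tuple):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_refs(item)


def validate(objects: Dict[int, object], root_num: int) -> List[str]:
    errors: List[str] = []
    # References must point to valid objects.
    for num, obj in sorted(objects.items()):
        for ref in iter_refs(obj):
            if ref[0] not in objects:
                errors.append(
                    f"object {num}: dangling reference {ref[0]} {ref[1]} R")
    # Catalog must have /Pages.
    catalog = objects.get(root_num)
    if not isinstance(catalog, dict) or catalog.get("/Type") != "/Catalog":
        errors.append("root is not a /Catalog dictionary")
        return errors
    pages = catalog.get("/Pages")
    if not isinstance(pages, tuple):
        errors.append("catalog has no /Pages reference")
        return errors
    # Pages must form a tree: no node may be reached twice.
    seen = set()
    stack = [pages[0]]
    while stack:
        num = stack.pop()
        if num in seen:
            errors.append(f"page tree revisits object {num}")
            continue
        seen.add(num)
        node = objects.get(num)
        if isinstance(node, dict) and node.get("/Type") == "/Pages":
            stack.extend(kid[0] for kid in node.get("/Kids", [])
                         if isinstance(kid, tuple))
    return errors
```

Returning a list of messages rather than raising on the first problem fits the `ERROR:` and `WARNING:` reporting style of the `validate` command.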
Level 3: Diff Tool
Create a diff tool that shows semantic differences between PDFs:
./pdfasm diff file1.pdf file2.pdf
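A semantic diff compares parsed objects instead of bytes, so differences in offsets or compression do not show up as changes. A minimal sketch follows, assuming both files have already been parsed into dictionaries keyed by object number. `diff_objects` and its output format are hypothetical, and the sketch assumes the two files use the same object numbering.

```python
from typing import Dict, List, Sequence


def diff_objects(a: Dict[int, object], b: Dict[int, object],
                 ignore: Sequence[str] = ("/Length",)) -> List[str]:
    """Describe added, removed, and changed objects between two tables."""
    changes: List[str] = []
    for num in sorted(set(a) | set(b)):
        if num not in a:
            changes.append(f"+ object {num}")
        elif num not in b:
            changes.append(f"- object {num}")
        else:
            left, right = a[num], b[num]
            if isinstance(left, dict) and isinstance(right, dict):
                # Compare key by key so the report names what changed.
                for key in sorted(set(left) | set(right)):
                    if key in ignore:
                        continue
                    if left.get(key) != right.get(key):
                        changes.append(
                            f"~ object {num} {key}: "
                            f"{left.get(key)!r} -> {right.get(key)!r}")
            elif left != right:
                changes.append(f"~ object {num}")
    return changes


old = {1: {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792], "/Length": 10},
       2: "removed"}
new = {1: {"/Type": "/Page", "/MediaBox": [0, 0, 595, 842], "/Length": 99},
       3: "added"}
for line in diff_objects(old, new):
    print(line)
```

Ignoring `/Length` by default reflects the fact that recompression changes it without changing what the document says.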
Level 4: Web Interface
Build a web-based PDF explorer with visualization.
Self-Assessment
Before considering this project complete:
- Can parse and generate all 8 PDF object types
- Assembly → PDF produces valid, viewable PDFs
- PDF → Assembly produces readable output
- Round-trip preserves document structure
- Handles compressed streams correctly
- CLI tool is usable and well-documented
- Can edit a PDF by modifying assembly and reassembling
Resources
Essential Reading
- Domain Specific Languages by Martin Fowler - DSL design
- PDF Reference Manual 1.7 by Adobe - PDF specification
- Developing with PDF by Leonard Rosenthol - PDF internals
Tools
- qpdf: PDF normalization and validation
- pypdf/pikepdf: Python PDF libraries
- pdftotext: Text extraction for comparison