Project 2: PDF File Parser & Renderer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 3-4 weeks |
| Programming Language | C |
| Knowledge Area | Document Formats / Compression |
| Prerequisites | Project 1 or equivalent understanding of graphics state |
What you'll build: A tool that parses PDF files, extracts their structure, and renders pages to images.
Why it teaches PostScript→PDF: You'll see that PDF is essentially "frozen PostScript" - the same drawing operations exist, but in a declarative structure rather than executable code. You'll understand what Ghostscript produces when it converts PS to PDF.
Learning Objectives
By completing this project, you will:
- Parse binary file formats with mixed ASCII and binary content
- Navigate object graphs using indirect references and cross-reference tables
- Decompress streams using zlib (Flate/Deflate)
- Execute PDF operators (nearly identical to PostScript)
- Understand PDF's random-access architecture and why it exists
The Core Question You're Answering
"What IS a PDF file, really? How does 'frozen PostScript' work without an interpreter?"
PDF isn't just "a document format". It's a carefully designed data structure that solves a specific problem: How do you distribute PostScript documents without requiring an interpreter?
The answer: Separate structure from content.
PostScript (dynamic):           PDF (static):
┌─────────────────────────┐     ┌─────────────────────────┐
│ /box {                  │     │ 4 0 obj                 │
│   /h exch def           │     │ << /Length 44 >>        │
│   /w exch def           │     │ stream                  │
│   0 0 moveto            │     │ 0 0 m                   │
│   w 0 lineto            │     │ 100 0 l                 │
│   w h lineto            │     │ 100 50 l                │
│   0 h lineto            │     │ 0 50 l                  │
│   closepath             │     │ h                       │
│ } def                   │     │ f                       │
│                         │     │ endstream               │
│ 100 50 box fill         │     │ endobj                  │
│                         │     │                         │
│ REQUIRES INTERPRETER    │     │ JUST PARSE & RENDER     │
└─────────────────────────┘     └─────────────────────────┘
The profound difference:
| Aspect | PostScript | PDF |
|---|---|---|
| Execution | Requires Turing-complete interpreter | Simple operator parsing |
| Page Access | Must execute from page 1 to get to page 100 | Jump directly via xref |
| Security | Can hang, crash, or execute malicious code | Deterministic, no loops |
| File Size | Can be tiny (code generates content) | Larger but compressed |
By building this parser, you'll internalize: PDF traded flexibility for reliability and random access.
Deep Theoretical Foundation
1. PDF File Structure
Every PDF file has four major sections:
PDF FILE LAYOUT
┌─ HEADER ─────────────────────────────────────────────────
│ %PDF-1.7
│ %âãÏÓ  (binary marker for tools that check for binary)
└──────────────────────────────────────────────────────────
                            │
┌─ BODY ───────────────────────────────────────────────────
│ 1 0 obj << /Type /Catalog ... >> endobj
│ 2 0 obj << /Type /Pages ... >> endobj
│ 3 0 obj << /Type /Page ... >> endobj
│ 4 0 obj << /Length 187 >> stream ... endstream endobj
│ ...
└──────────────────────────────────────────────────────────
                            │
┌─ CROSS-REFERENCE TABLE (xref) ───────────────────────────
│ xref
│ 0 5
│ 0000000000 65535 f
│ 0000000015 00000 n   ← Object 1 at byte 15
│ 0000000074 00000 n   ← Object 2 at byte 74
│ 0000000133 00000 n   ← Object 3 at byte 133
│ 0000000250 00000 n   ← Object 4 at byte 250
└──────────────────────────────────────────────────────────
                            │
┌─ TRAILER ────────────────────────────────────────────────
│ trailer
│ << /Size 5 /Root 1 0 R >>
│ startxref
│ 379                  ← Byte offset of xref
│ %%EOF
└──────────────────────────────────────────────────────────
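The header is the easiest section to check in code. The following is a minimal sketch, assuming the first bytes of the file are already in memory; `PdfHeaderInfo` and `parse_pdf_header` are hypothetical names introduced here, and the "four or more bytes with the high bit set" test follows the usual convention for the binary marker comment rather than a strict requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type for this sketch (not part of the project's structs). */
typedef struct {
    int major;
    int minor;
    bool has_binary_marker;
} PdfHeaderInfo;

/* Parse "%PDF-M.N" at the start of buf, then look for a comment on the
 * second line that contains at least four bytes >= 128.
 * Returns false if buf does not start with a PDF header. */
bool parse_pdf_header(const unsigned char *buf, size_t len, PdfHeaderInfo *out) {
    assert(buf != NULL && out != NULL);
    if (len < 8 || memcmp(buf, "%PDF-", 5) != 0) return false;

    /* Copy the first few bytes into a terminated string for sscanf. */
    char head[16] = {0};
    size_t n = len < sizeof(head) - 1 ? len : sizeof(head) - 1;
    memcpy(head, buf, n);
    if (sscanf(head, "%%PDF-%d.%d", &out->major, &out->minor) != 2) return false;

    /* Skip to the start of the second line (LF, CR, or CRLF endings). */
    size_t pos = 0;
    while (pos < len && buf[pos] != '\n' && buf[pos] != '\r') pos++;
    while (pos < len && (buf[pos] == '\n' || buf[pos] == '\r')) pos++;

    out->has_binary_marker = false;
    if (pos < len && buf[pos] == '%') {
        int high = 0;
        for (size_t i = pos + 1; i < len && buf[i] != '\n' && buf[i] != '\r'; i++) {
            if (buf[i] >= 128) high++;
        }
        out->has_binary_marker = (high >= 4);
    }
    return true;
}
```

A file without the marker is still a valid PDF, so the sketch reports its absence instead of failing.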
2. PDF Object Types
PDF has eight basic object types. Integers and reals are both numeric objects and count as one type; the eighth type is null:
| Type | Syntax | Example |
|---|---|---|
| Boolean | true / false | true |
| Number (integer) | Digits | 42, -17 |
| Number (real) | Decimal | 3.14159, -2.5 |
| String | Parens or hex | (Hello), <48656C6C6F> |
| Name | Slash prefix | /Type, /Page |
| Array | Square brackets | [1 2 3], [/Name 123 (text)] |
| Dictionary | Double angle | << /Key1 value1 /Key2 value2 >> |
| Stream | Dict + binary | << /Length 44 >> stream...endstream |
| Null | Keyword | null |
3. Indirect Objects and References
The power of PDF comes from indirect objects:
DIRECT OBJECTS:
<< /Type /Page /MediaBox [0 0 612 792] >>
↑ Everything is inline
INDIRECT OBJECTS:
3 0 obj                              ← Object number 3, generation 0
<< /Type /Page /Contents 4 0 R >>    ← References object 4
endobj
4 0 obj
<< /Length 187 >>
stream
... content ...
endstream
endobj
Why indirect objects?
- Reuse: Same object can be referenced multiple times
- Random access: Jump to any object via xref table
- Incremental updates: Add new objects without rewriting file
4. The Cross-Reference Table
The xref table is what makes random access fast: it maps every object number to a byte offset, so a reader can seek straight to any object.
xref
0 5                  ← Entries for objects 0-4 (5 objects total)
0000000000 65535 f   ← Object 0: free (linked list head)
0000000015 00000 n   ← Object 1: in use at byte 15
0000000074 00000 n   ← Object 2: in use at byte 74
0000000133 00000 n   ← Object 3: in use at byte 133
0000000250 00000 n   ← Object 4: in use at byte 250
Format: OOOOOOOOOO GGGGG S
        │          │     │
        Offset     Gen   Status (n=in use, f=free)
        (10 digits)(5)   (1)
Each entry is exactly 20 bytes long, including its end-of-line sequence.
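Because the entry layout is fixed-width, a single entry can be validated column by column. This is a minimal sketch under that assumption; `XRefEntrySketch` and `parse_xref_entry` are hypothetical names, and the offset is stored as `long long` so that a ten-digit value fits on platforms where `long` is 32 bits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry type for this sketch. */
typedef struct {
    long long offset;
    int generation;
    bool in_use;
} XRefEntrySketch;

/* Parse one entry of the form "OOOOOOOOOO GGGGG S" from the first 18 bytes
 * of line. Returns false if any column has the wrong kind of character. */
bool parse_xref_entry(const char *line, size_t len, XRefEntrySketch *out) {
    assert(line != NULL && out != NULL);
    if (len < 18) return false;
    for (int i = 0; i < 10; i++) {
        if (!isdigit((unsigned char)line[i])) return false;
    }
    if (line[10] != ' ') return false;
    for (int i = 11; i < 16; i++) {
        if (!isdigit((unsigned char)line[i])) return false;
    }
    if (line[16] != ' ') return false;
    if (line[17] != 'n' && line[17] != 'f') return false;

    /* Copy each numeric column into its own terminated buffer. */
    char off[11], gen[6];
    memcpy(off, line, 10);
    off[10] = '\0';
    memcpy(gen, line + 11, 5);
    gen[5] = '\0';
    out->offset = strtoll(off, NULL, 10);
    out->generation = (int)strtol(gen, NULL, 10);
    out->in_use = (line[17] == 'n');
    return true;
}
```

Strict column checks like these reject damaged tables early, which is usually preferable to silently seeking to a wrong offset.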
How to find xref: Read from end of file:
- Find %%EOF
- Read backward to find startxref
- Read the byte offset after startxref
- Seek to that offset → xref table
5. Document Structure
PDF documents form a tree:
PDF DOCUMENT TREE
──────────────────────────────────────────────────────────
CATALOG (object 1)
<< /Type /Catalog
   /Pages 2 0 R
   /Outlines 10 0 R    (optional bookmarks)
   /Metadata 15 0 R    (optional XMP metadata)
>>
      │
      ▼
PAGES TREE (object 2)
<< /Type /Pages
   /Kids [3 0 R 6 0 R 9 0 R]
   /Count 3
>>
      │
   ┌──┴──────┬─────────┐
   ▼         ▼         ▼
 PAGE 1    PAGE 2    PAGE 3
(obj 3)   (obj 6)   (obj 9)
   │
   ▼
CONTENT STREAM (object 4)
<< /Length 187 /Filter /FlateDecode >>
stream
... compressed drawing operators ...
endstream

RESOURCES (object 5)
<< /Font << /F1 7 0 R >>
   /XObject << /Im1 8 0 R >> >>
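In real files the /Kids array may contain further /Pages nodes, so finding page N means walking the tree depth-first. The sketch below shows that traversal on a simplified, already-resolved node type; `PageNode` and `collect_pages` are hypothetical names, and the depth limit is an assumed guard against malformed files whose /Kids form a cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a resolved /Pages or /Page dictionary. */
typedef struct PageNode {
    int is_leaf;              /* 1 for /Type /Page, 0 for /Type /Pages */
    int obj_num;              /* object number of this node */
    struct PageNode **kids;   /* resolved /Kids entries (NULL for leaves) */
    size_t kid_count;
} PageNode;

/* Depth-first walk in /Kids order. Stores up to cap leaf object numbers in
 * out and returns the running total of leaves seen, which may exceed cap.
 * depth_left bounds the recursion. */
size_t collect_pages(const PageNode *node, int depth_left,
                     int *out, size_t cap, size_t found) {
    assert(out != NULL);
    if (node == NULL || depth_left <= 0) return found;
    if (node->is_leaf) {
        if (found < cap) out[found] = node->obj_num;
        return found + 1;
    }
    for (size_t i = 0; i < node->kid_count; i++) {
        found = collect_pages(node->kids[i], depth_left - 1, out, cap, found);
    }
    return found;
}
```

The order of the collected leaves is the page order of the document, so index 0 of the output is page 1.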
6. Content Streams: PostScript Inside PDF
Content streams contain the actual drawing commands. They look almost exactly like PostScript:
PostScript:                   PDF Content Stream:
───────────────────────────   ───────────────────────────
newpath                       (implicit)
100 100 moveto                100 100 m
200 100 lineto                200 100 l
200 200 lineto                200 200 l
100 200 lineto                100 200 l
closepath                     h
0.5 setgray                   0.5 g
fill                          f
PDF Operator Reference (subset):
| Category | Operators | Description |
|---|---|---|
| Path | m (moveto), l (lineto), c (curveto), h (closepath) | Build paths |
| Paint | S (stroke), f (fill), B (fill+stroke) | Paint paths |
| Color | g (setgray fill), G (setgray stroke), rg/RG (RGB) | Set colors |
| State | q (gsave), Q (grestore), cm (concat matrix) | Manage state |
| Text | BT (begin text), ET (end text), Tf (font), Tj (show) | Render text |
7. Compression: Flate/Deflate
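Most PDF streams are compressed with FlateDecode (zlib):
#include <stdlib.h>
#include <zlib.h>
// One-shot inflate. Returns NULL if the data is corrupt or expands past the
// estimate; Phase 4 shows a version that grows the buffer instead.
unsigned char* decompress_stream(unsigned char* compressed,
                                 size_t comp_len,
                                 size_t* out_len) {
    // Estimate output size (usually 3-10x larger)
    size_t estimated = comp_len * 10 + 64;
    unsigned char* output = malloc(estimated);
    if (output == NULL) return NULL;
    z_stream z = {0};
    z.next_in = compressed;
    z.avail_in = (uInt)comp_len;
    z.next_out = output;
    z.avail_out = (uInt)estimated;
    if (inflateInit(&z) != Z_OK) {
        free(output);
        return NULL;
    }
    int ret = inflate(&z, Z_FINISH);
    inflateEnd(&z);
    if (ret != Z_STREAM_END) {  // Corrupt data or buffer too small
        free(output);
        return NULL;
    }
    *out_len = z.total_out;
    return output;
}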
Project Specification
Core Features
Your PDF parser must:
- Parse PDF Header
  - Extract PDF version (e.g., "1.7")
  - Detect binary marker
- Find and Parse Cross-Reference Table
  - Locate via startxref (read from end of file)
  - Parse xref entries (byte offset, generation, status)
  - Support both xref tables and xref streams
- Parse Trailer
  - Extract /Root (catalog reference)
  - Extract /Size (number of xref entries: highest object number plus one)
- Dereference Indirect Objects
  - Given N 0 R, find object N in xref, seek to its offset, parse it
- Parse All Object Types
  - Booleans, integers, reals, strings, names, arrays, dictionaries, streams
- Decompress Content Streams
  - Support FlateDecode filter (zlib)
  - Handle /Length for stream boundaries
- Parse Content Stream Operators
  - Tokenize operators and operands
  - Execute graphics state machine
  - Track path construction and colors
- Render to Output
  - Generate PNG or SVG from parsed content
Command-Line Interface
# Dump PDF structure
./pdf_parser document.pdf --dump-structure
# Extract and display content stream operators
./pdf_parser document.pdf --extract-operators
# Render page to PNG
./pdf_parser document.pdf --render output.png
# Validate PDF structure
./pdf_parser document.pdf --validate
Solution Architecture
High-Level Design
PDF PARSER
┌─ FILE READER ────────────────────────────────────────────
│ - Memory-map file (or load into buffer)
│ - Provide seek/read operations
│ - Handle binary + ASCII content
└──────────────────────────────────────────────────────────
                            │
                            ▼
┌─ CROSS-REFERENCE PARSER ─────────────────────────────────
│ - Find trailer from end of file
│ - Parse xref table or xref stream
│ - Build object offset lookup table
└──────────────────────────────────────────────────────────
                            │
                            ▼
┌─ OBJECT PARSER ──────────────────────────────────────────
│ - Tokenize PDF syntax
│ - Parse dictionaries, arrays, streams
│ - Dereference indirect objects
│ - Decompress streams with zlib
└──────────────────────────────────────────────────────────
                            │
                            ▼
┌─ DOCUMENT NAVIGATOR ─────────────────────────────────────
│ - Follow Catalog → Pages → Page
│ - Extract content streams
│ - Resolve resource dictionaries
└──────────────────────────────────────────────────────────
                            │
                            ▼
┌─ CONTENT STREAM INTERPRETER ─────────────────────────────
│ - Parse PostScript-like operators
│ - Maintain graphics state
│ - Build path and text operations
└──────────────────────────────────────────────────────────
                            │
                            ▼
┌─ RENDERER ───────────────────────────────────────────────
│ - Convert graphics operations to pixels
│ - Output PNG (via Cairo or libpng)
│ - Or output SVG (text-based)
└──────────────────────────────────────────────────────────
Key Data Structures
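// Headers needed by the declarations below
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
// PDF Object representation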
typedef enum {
PDF_NULL,
PDF_BOOL,
PDF_INT,
PDF_REAL,
PDF_STRING,
PDF_NAME,
PDF_ARRAY,
PDF_DICT,
PDF_STREAM,
PDF_REF
} PDFObjectType;
typedef struct PDFObject {
PDFObjectType type;
union {
bool boolean;
long integer;
double real;
struct {
char* data;
size_t length;
} string;
char* name;
struct {
struct PDFObject** items;
size_t count;
} array;
struct {
char** keys;
struct PDFObject** values;
size_t count;
} dict;
struct {
struct PDFObject* dict;
unsigned char* data;
size_t length;
} stream;
struct {
int obj_num;
int gen_num;
} ref;
} value;
} PDFObject;
// Cross-reference entry
typedef struct {
long offset; // Byte offset in file
int generation; // Generation number
bool in_use; // true = n (in use), false = f (free)
} XRefEntry;
// PDF Document
typedef struct {
FILE* file;
char* data; // Memory-mapped content (optional)
size_t size;
char version[8]; // "1.7", "2.0", etc.
XRefEntry* xref;
size_t xref_size;
PDFObject* trailer;
PDFObject* catalog;
PDFObject* pages;
} PDFDocument;
// Graphics state for rendering
typedef struct {
double ctm[6];
double fill_color[4]; // RGBA
double stroke_color[4];
double line_width;
// Current path
struct {
double x, y;
int type; // 0=move, 1=line, 2=curve
} path[10000];
size_t path_len;
} GraphicsState;
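The six-element `ctm` array holds the matrix [a b c d e f], which maps a point as x' = a·x + c·y + e and y' = b·x + d·y + f. The sketch below shows how the `cm` operator could update it and how a path point is transformed; the function names are hypothetical, and the multiplication order (new matrix first, then the existing CTM) reflects how PDF defines `cm`.

```c
#include <assert.h>
#include <string.h>

/* out = m1 x m2 for PDF matrices stored as [a b c d e f].
 * A temporary is used so that out may alias either input. */
void matrix_multiply(const double m1[6], const double m2[6], double out[6]) {
    double r[6];
    r[0] = m1[0] * m2[0] + m1[1] * m2[2];
    r[1] = m1[0] * m2[1] + m1[1] * m2[3];
    r[2] = m1[2] * m2[0] + m1[3] * m2[2];
    r[3] = m1[2] * m2[1] + m1[3] * m2[3];
    r[4] = m1[4] * m2[0] + m1[5] * m2[2] + m2[4];
    r[5] = m1[4] * m2[1] + m1[5] * m2[3] + m2[5];
    memcpy(out, r, sizeof r);
}

/* The cm operator: CTM' = M x CTM. */
void concat_ctm(double ctm[6], const double m[6]) {
    assert(ctm != NULL && m != NULL);
    matrix_multiply(m, ctm, ctm);
}

/* Map a user-space point through the CTM. */
void transform_point(const double ctm[6], double x, double y,
                     double *dx, double *dy) {
    assert(dx != NULL && dy != NULL);
    *dx = ctm[0] * x + ctm[2] * y + ctm[4];
    *dy = ctm[1] * x + ctm[3] * y + ctm[5];
}
```

For example, `1 0 0 1 10 20 cm` followed by `2 0 0 2 0 0 cm` leaves the CTM at [2 0 0 2 10 20], so the user-space point (1, 1) lands at (12, 22).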
Implementation Guide
Phase 1: File Reading and xref Location (Week 1, Days 1-2)
Start by finding the cross-reference table:
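// Find startxref by reading from end of file
long find_startxref(FILE* f) {
    // Read the last 1024 bytes, or the whole file if it is shorter
    if (fseek(f, 0, SEEK_END) != 0) return -1;
    long size = ftell(f);
    if (size < 0) return -1;
    long tail = size < 1024 ? size : 1024;
    if (fseek(f, size - tail, SEEK_SET) != 0) return -1;
    char buf[1025];
    size_t n = fread(buf, 1, (size_t)tail, f);
    buf[n] = '\0';  // sscanf below needs a terminated string
    // Search backward so the last "startxref" in the file wins
    for (long i = (long)n - 9; i >= 0; i--) {
        if (memcmp(buf + i, "startxref", 9) == 0) {
            // Read the offset value after it
            long offset;
            if (sscanf(buf + i + 9, "%ld", &offset) != 1) return -1;
            return offset;
        }
    }
    return -1;  // Not found
}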
// Parse the xref table
void parse_xref_table(PDFDocument* doc, long xref_offset) {
    fseek(doc->file, xref_offset, SEEK_SET);
    char line[256];
    if (fgets(line, sizeof(line), doc->file) == NULL ||
        strncmp(line, "xref", 4) != 0) {
        return;  // Not a classic table (possibly an xref stream)
    }
    while (true) {
        // Read subsection header: "first_obj count"
        int first_obj, count;
        if (fscanf(doc->file, "%d %d", &first_obj, &count) != 2) break;
        if (first_obj < 0 || count < 0) break;
        // Expand xref table if needed, zeroing the new entries
        size_t needed = (size_t)first_obj + (size_t)count;
        if (needed > doc->xref_size) {
            XRefEntry* grown = realloc(doc->xref, needed * sizeof(XRefEntry));
            if (grown == NULL) return;
            memset(grown + doc->xref_size, 0,
                   (needed - doc->xref_size) * sizeof(XRefEntry));
            doc->xref = grown;
            doc->xref_size = needed;
        }
        // Read entries
        for (int i = 0; i < count; i++) {
            long offset;
            int gen;
            char status;
            if (fscanf(doc->file, "%ld %d %c", &offset, &gen, &status) != 3) return;
            doc->xref[first_obj + i].offset = offset;
            doc->xref[first_obj + i].generation = gen;
            doc->xref[first_obj + i].in_use = (status == 'n');
        }
        // Check for "trailer" keyword; on a match the file position is
        // left just after it, ready for the trailer dictionary
        long pos = ftell(doc->file);
        if (fscanf(doc->file, "%255s", line) != 1) break;
        if (strcmp(line, "trailer") == 0) break;
        fseek(doc->file, pos, SEEK_SET);
    }
}
Phase 2: Object Parser (Week 1, Days 3-5)
Parse PDF objects:
// Skip whitespace and comments
void skip_whitespace(FILE* f) {
int c;
while ((c = fgetc(f)) != EOF) {
if (c == '%') {
// Skip comment until newline
while ((c = fgetc(f)) != EOF && c != '\n' && c != '\r');
} else if (!isspace(c)) {
ungetc(c, f);
return;
}
}
}
// Parse a single PDF object
PDFObject* parse_object(FILE* f) {
skip_whitespace(f);
int c = fgetc(f);
// Boolean
if (c == 't') {
char buf[5];
buf[0] = c;
fread(buf + 1, 1, 3, f);
if (memcmp(buf, "true", 4) == 0) {
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_BOOL;
obj->value.boolean = true;
return obj;
}
}
if (c == 'f') {
char buf[6];
buf[0] = c;
fread(buf + 1, 1, 4, f);
if (memcmp(buf, "false", 5) == 0) {
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_BOOL;
obj->value.boolean = false;
return obj;
}
}
// Name
if (c == '/') {
char name[256];
size_t i = 0;
// A name ends at whitespace or at any PDF delimiter.
// #xx escape sequences are kept as written, not decoded.
while ((c = fgetc(f)) != EOF && !isspace(c) &&
strchr("/[]<>(){}%", c) == NULL) {
if (i < sizeof(name) - 1) name[i++] = (char)c;  // Truncate overlong names
}
name[i] = '\0';
if (c != EOF) ungetc(c, f);
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_NAME;
obj->value.name = strdup(name);
return obj;
}
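// String
if (c == '(') {
char str[4096];
size_t i = 0;
int paren_depth = 1;
// Stop right after the matching ')' so the next token is not consumed
while (paren_depth > 0 && (c = fgetc(f)) != EOF) {
if (c == '\\') {
// Escaped character: \( \) and \\ are kept literally, \n \r \t are
// translated; octal \ddd and line continuations are not handled here
if ((c = fgetc(f)) == EOF) break;
if (c == 'n') c = '\n';
else if (c == 'r') c = '\r';
else if (c == 't') c = '\t';
} else if (c == '(') {
paren_depth++;
} else if (c == ')') {
if (--paren_depth == 0) break;
}
if (i < sizeof(str) - 1) str[i++] = (char)c;  // Truncate overlong strings
}
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_STRING;
// Copy by length: PDF strings may contain embedded NUL bytes
obj->value.string.data = malloc(i + 1);
memcpy(obj->value.string.data, str, i);
obj->value.string.data[i] = '\0';
obj->value.string.length = i;
return obj;
}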
// Array
if (c == '[') {
PDFObject** items = NULL;
size_t count = 0;
size_t capacity = 0;
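while (true) {
skip_whitespace(f);
c = fgetc(f);
if (c == ']' || c == EOF) break;  // EOF means an unterminated array
ungetc(c, f);
if (count >= capacity) {
capacity = capacity ? capacity * 2 : 8;
items = realloc(items, capacity * sizeof(PDFObject*));
}
PDFObject* item = parse_object(f);
if (item == NULL) break;  // Unparseable element: stop instead of looping forever
items[count++] = item;
}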
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_ARRAY;
obj->value.array.items = items;
obj->value.array.count = count;
return obj;
}
// Dictionary
if (c == '<') {
c = fgetc(f);
if (c == '<') {
// Dictionary
char** keys = NULL;
PDFObject** values = NULL;
size_t count = 0;
size_t capacity = 0;
while (true) {
skip_whitespace(f);
c = fgetc(f);
if (c == EOF) break;  // Unterminated dictionary
if (c == '>') {
fgetc(f); // Skip second '>'
break;
}
ungetc(c, f);
if (count >= capacity) {
capacity = capacity ? capacity * 2 : 8;
keys = realloc(keys, capacity * sizeof(char*));
values = realloc(values, capacity * sizeof(PDFObject*));
}
// Parse key (must be a name)
PDFObject* key_obj = parse_object(f);
if (key_obj == NULL) break;  // Malformed dictionary
if (key_obj->type != PDF_NAME) {
free_object(key_obj);
break;
}
keys[count] = strdup(key_obj->value.name);
free_object(key_obj);
// Parse value
values[count] = parse_object(f);
count++;
}
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_DICT;
obj->value.dict.keys = keys;
obj->value.dict.values = values;
obj->value.dict.count = count;
return obj;
}
// Hex string
// ...
}
// Number or reference
if (isdigit(c) || c == '-' || c == '+' || c == '.') {
ungetc(c, f);
// Try to parse number
long pos = ftell(f);
double num;
fscanf(f, "%lf", &num);
// Check if this might be a reference
skip_whitespace(f);
int gen;
char r;
if (fscanf(f, "%d %c", &gen, &r) == 2 && r == 'R') {
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_REF;
obj->value.ref.obj_num = (int)num;
obj->value.ref.gen_num = gen;
return obj;
}
// Just a number
fseek(f, pos, SEEK_SET);
fscanf(f, "%lf", &num);
PDFObject* obj = calloc(1, sizeof(PDFObject));
if (num == (long)num) {
obj->type = PDF_INT;
obj->value.integer = (long)num;
} else {
obj->type = PDF_REAL;
obj->value.real = num;
}
return obj;
}
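// Null
if (c == 'n') {
char buf[4];
buf[0] = (char)c;
if (fread(buf + 1, 1, 3, f) == 3 && memcmp(buf, "null", 4) == 0) {
PDFObject* obj = calloc(1, sizeof(PDFObject));
obj->type = PDF_NULL;
return obj;
}
}
return NULL;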
}
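The hex-string branch is left open in the parser above. One possible way to decode the characters between `<` and `>` is sketched below; `decode_hex_string` and `hex_value` are hypothetical helpers, not part of the original code. The sketch skips whitespace between digits and treats a missing final digit as 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode the text found between '<' and '>' into out.
 * Returns the number of bytes written, or -1 on an invalid character
 * or when out is too small. */
long decode_hex_string(const char *src, size_t src_len,
                       unsigned char *out, size_t cap) {
    assert(src != NULL && out != NULL);
    size_t n = 0;
    int high = -1;  /* pending high nibble, or -1 if none */
    for (size_t i = 0; i < src_len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (isspace(c)) continue;
        int v = hex_value(c);
        if (v < 0) return -1;
        if (high < 0) {
            high = v;
            continue;
        }
        if (n >= cap) return -1;
        out[n++] = (unsigned char)(high * 16 + v);
        high = -1;
    }
    if (high >= 0) {  /* odd digit count: final low nibble is 0 */
        if (n >= cap) return -1;
        out[n++] = (unsigned char)(high * 16);
    }
    return (long)n;
}
```

With this helper, `<48656C6C6F>` decodes to the five bytes of "Hello", and `<901FA>` decodes to 0x90 0x1F 0xA0.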
Phase 3: Object Dereferencing (Week 2, Days 1-2)
// Get an object by its number
PDFObject* get_object(PDFDocument* doc, int obj_num) {
    if (obj_num < 0 || (size_t)obj_num >= doc->xref_size ||
        !doc->xref[obj_num].in_use) {
        return NULL;
    }
    long offset = doc->xref[obj_num].offset;
    if (fseek(doc->file, offset, SEEK_SET) != 0) return NULL;
    // Read "N G obj" (%3s also copes with "obj<<" written without a space)
    int num, gen;
    char keyword[4];
    if (fscanf(doc->file, "%d %d %3s", &num, &gen, keyword) != 3 ||
        num != obj_num || strcmp(keyword, "obj") != 0) {
        return NULL;
    }
    PDFObject* obj = parse_object(doc->file);
    if (obj == NULL) return NULL;
    // Check for stream
    skip_whitespace(doc->file);
    long pos = ftell(doc->file);
    char buf[16];
    if (obj->type == PDF_DICT &&
        fscanf(doc->file, "%15s", buf) == 1 && strcmp(buf, "stream") == 0) {
        // Skip past newline
        int c;
        while ((c = fgetc(doc->file)) != EOF && c != '\n');
        long data_start = ftell(doc->file);
        // Get stream length
        PDFObject* length_obj = dict_get(obj, "Length");
        if (length_obj != NULL && length_obj->type == PDF_REF) {
            // Resolving the reference moves the file position, so restore it
            length_obj = get_object(doc, length_obj->value.ref.obj_num);
            fseek(doc->file, data_start, SEEK_SET);
        }
        if (length_obj == NULL || length_obj->type != PDF_INT ||
            length_obj->value.integer < 0) {
            return NULL;
        }
        size_t length = (size_t)length_obj->value.integer;
        // Read stream data
        unsigned char* data = malloc(length ? length : 1);
        if (data == NULL || fread(data, 1, length, doc->file) != length) {
            free(data);
            return NULL;
        }
        // Convert to stream object
        PDFObject* stream_obj = calloc(1, sizeof(PDFObject));
        stream_obj->type = PDF_STREAM;
        stream_obj->value.stream.dict = obj;
        stream_obj->value.stream.data = data;
        stream_obj->value.stream.length = length;
        return stream_obj;
    }
    fseek(doc->file, pos, SEEK_SET);
    return obj;
}
// Dereference if it's a reference
PDFObject* deref(PDFDocument* doc, PDFObject* obj) {
if (obj->type == PDF_REF) {
return get_object(doc, obj->value.ref.obj_num);
}
return obj;
}
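The code above calls `dict_get`, which is not shown in this guide. A minimal sketch of what such a helper might look like follows. To stay self-contained it declares a trimmed-down copy of `PDFObject` with only the fields it needs; in the real project the full definition from Key Data Structures would be used instead. Accepting a stream object and an optional leading slash are assumptions made for convenience.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed-down copy of the PDFObject layout, for this sketch only. */
typedef enum { PDF_INT, PDF_DICT, PDF_STREAM } PDFObjectType;

typedef struct PDFObject {
    PDFObjectType type;
    union {
        long integer;
        struct {
            char **keys;
            struct PDFObject **values;
            size_t count;
        } dict;
        struct {
            struct PDFObject *dict;
        } stream;
    } value;
} PDFObject;

/* Look up key in a dictionary, or in the dictionary of a stream.
 * Returns NULL if obj is neither, or if the key is absent. */
PDFObject *dict_get(PDFObject *obj, const char *key) {
    assert(key != NULL);
    if (obj == NULL) return NULL;
    if (obj->type == PDF_STREAM) obj = obj->value.stream.dict;
    if (obj == NULL || obj->type != PDF_DICT) return NULL;
    if (key[0] == '/') key++;  /* accept "/Length" as well as "Length" */
    for (size_t i = 0; i < obj->value.dict.count; i++) {
        if (strcmp(obj->value.dict.keys[i], key) == 0) {
            return obj->value.dict.values[i];
        }
    }
    return NULL;
}
```

A linear scan is adequate here because most PDF dictionaries hold only a handful of keys. The returned value may itself be a reference, so callers typically pass it through `deref`.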
Phase 4: Stream Decompression (Week 2, Days 3-4)
#include <stdlib.h>
#include <zlib.h>
unsigned char* decompress_flate(unsigned char* data, size_t len, size_t* out_len) {
    // Start with estimated size (never zero, so inflate always has room)
    size_t out_size = len * 4 + 64;
    unsigned char* output = malloc(out_size);
    if (output == NULL) return NULL;
    z_stream z = {0};
    z.next_in = data;
    z.avail_in = (uInt)len;
    z.next_out = output;
    z.avail_out = (uInt)out_size;
    if (inflateInit(&z) != Z_OK) {
        free(output);
        return NULL;
    }
    while (true) {
        int ret = inflate(&z, Z_NO_FLUSH);
        if (ret == Z_STREAM_END) break;
        if (ret != Z_OK) {  // Corrupt or truncated data
            inflateEnd(&z);
            free(output);
            return NULL;
        }
        // Need more output space
        if (z.avail_out == 0) {
            size_t new_size = out_size * 2;
            unsigned char* grown = realloc(output, new_size);
            if (grown == NULL) {
                inflateEnd(&z);
                free(output);
                return NULL;
            }
            output = grown;
            z.next_out = output + out_size;
            z.avail_out = (uInt)(new_size - out_size);
            out_size = new_size;
        }
    }
    inflateEnd(&z);
    *out_len = z.total_out;
    return output;
}
unsigned char* get_stream_data(PDFDocument* doc, PDFObject* stream, size_t* out_len) {
PDFObject* filter = dict_get(stream->value.stream.dict, "Filter");
if (filter == NULL) {
// No compression
*out_len = stream->value.stream.length;
unsigned char* copy = malloc(*out_len);
memcpy(copy, stream->value.stream.data, *out_len);
return copy;
}
if (filter->type == PDF_NAME && strcmp(filter->value.name, "FlateDecode") == 0) {
return decompress_flate(stream->value.stream.data,
stream->value.stream.length,
out_len);
}
// Other filters: LZWDecode, ASCII85Decode, etc.
fprintf(stderr, "Unsupported filter\n");
return NULL;
}
Phase 5: Content Stream Parsing (Week 2, Day 5 - Week 3, Day 2)
typedef enum {
OP_MOVETO, OP_LINETO, OP_CURVETO, OP_CLOSEPATH,
OP_STROKE, OP_FILL, OP_FILL_STROKE,
OP_GSAVE, OP_GRESTORE, OP_CONCAT,
OP_SETGRAY, OP_SETRGB, OP_SETCMYK,
OP_BEGINTEXT, OP_ENDTEXT, OP_SETFONT, OP_SHOWTEXT
} OpCode;
typedef struct {
OpCode op;
double operands[6];
int num_operands;
char* string_operand;
} Operation;
// Parse content stream into operations
Operation* parse_content_stream(unsigned char* data, size_t len, size_t* num_ops) {
Operation* ops = malloc(10000 * sizeof(Operation));
size_t count = 0;
double operand_stack[100];
int stack_ptr = 0;
size_t pos = 0;
while (pos < len) {
// Skip whitespace
while (pos < len && isspace(data[pos])) pos++;
if (pos >= len) break;
// Check for number
if (isdigit(data[pos]) || data[pos] == '-' || data[pos] == '+' || data[pos] == '.') {
char num_buf[64];
size_t i = 0;
while (pos < len && (isdigit(data[pos]) || data[pos] == '.' || data[pos] == '-' || data[pos] == '+')) {
if (i < sizeof(num_buf) - 1) num_buf[i++] = (char)data[pos];  // Truncate overlong tokens
pos++;
}
num_buf[i] = '\0';
// Drop extra operands rather than overflow the fixed-size stack
if (stack_ptr < 100) operand_stack[stack_ptr++] = atof(num_buf);
continue;
}
// Check for string
if (data[pos] == '(') {
pos++;
char str_buf[1024];
int i = 0;
int depth = 1;
while (pos < len && depth > 0) {
if (data[pos] == '(') depth++;
else if (data[pos] == ')') depth--;
if (depth > 0) str_buf[i++] = data[pos];
pos++;
}
str_buf[i] = '\0';
// Store string for next operator
continue;
}
// Check for operator
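if (isalpha(data[pos]) || data[pos] == '\'' || data[pos] == '"') {
char op_buf[32];
size_t i = 0;
while (pos < len && (isalpha(data[pos]) || data[pos] == '*' || data[pos] == '\'' || data[pos] == '"')) {
if (i < sizeof(op_buf) - 1) op_buf[i++] = (char)data[pos];  // Truncate overlong tokens
pos++;
}
op_buf[i] = '\0';
// Map operator name to OpCode
if (count >= 10000) break;  // The ops array allocated above is full
Operation* op = &ops[count++];
op->num_operands = 0;
op->string_operand = NULL;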
if (strcmp(op_buf, "m") == 0) {
op->op = OP_MOVETO;
op->operands[1] = operand_stack[--stack_ptr]; // y
op->operands[0] = operand_stack[--stack_ptr]; // x
op->num_operands = 2;
}
else if (strcmp(op_buf, "l") == 0) {
op->op = OP_LINETO;
op->operands[1] = operand_stack[--stack_ptr];
op->operands[0] = operand_stack[--stack_ptr];
op->num_operands = 2;
}
else if (strcmp(op_buf, "h") == 0) {
op->op = OP_CLOSEPATH;
}
else if (strcmp(op_buf, "S") == 0) {
op->op = OP_STROKE;
}
else if (strcmp(op_buf, "f") == 0 || strcmp(op_buf, "F") == 0) {
op->op = OP_FILL;
}
else if (strcmp(op_buf, "q") == 0) {
op->op = OP_GSAVE;
}
else if (strcmp(op_buf, "Q") == 0) {
op->op = OP_GRESTORE;
}
else if (strcmp(op_buf, "g") == 0) {
op->op = OP_SETGRAY;
op->operands[0] = operand_stack[--stack_ptr];
op->num_operands = 1;
}
else if (strcmp(op_buf, "rg") == 0) {
op->op = OP_SETRGB;
op->operands[2] = operand_stack[--stack_ptr]; // b
op->operands[1] = operand_stack[--stack_ptr]; // g
op->operands[0] = operand_stack[--stack_ptr]; // r
op->num_operands = 3;
}
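// ... more operators
else {
count--;  // Unknown operator: release the slot reserved above
}
stack_ptr = 0; // Clear operand stack after each operator
continue;
}
// Names (/F1), arrays, hex strings and anything else not handled above:
// skip one byte so the loop always makes progress
pos++;
}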
*num_ops = count;
return ops;
}
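The parser above pops operands without checking how many are on the stack, so a malformed stream such as a bare `m` would index below the start of `operand_stack`. One way to guard against this is a lookup table of expected operand counts, checked before any pop. The sketch below is an assumption about how that could be organized; `OperatorInfo`, `lookup_operator`, and `can_execute` are hypothetical names, and the table covers only a subset of operators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: operator keyword and numeric operand count. */
typedef struct {
    const char *name;
    int operand_count;
} OperatorInfo;

static const OperatorInfo OPERATORS[] = {
    {"m", 2}, {"l", 2}, {"c", 6}, {"h", 0}, {"re", 4},   /* path construction */
    {"S", 0}, {"f", 0}, {"F", 0}, {"B", 0}, {"n", 0},    /* painting */
    {"q", 0}, {"Q", 0}, {"cm", 6}, {"w", 1},             /* graphics state */
    {"g", 1}, {"G", 1}, {"rg", 3}, {"RG", 3},            /* gray and RGB color */
    {"k", 4}, {"K", 4},                                  /* CMYK color */
};

/* Find an operator by keyword; returns NULL if it is not in the table. */
const OperatorInfo *lookup_operator(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(OPERATORS) / sizeof(OPERATORS[0]); i++) {
        if (strcmp(OPERATORS[i].name, name) == 0) return &OPERATORS[i];
    }
    return NULL;
}

/* Returns 1 if the operator is known and the stack holds enough operands. */
int can_execute(const char *name, int stack_depth) {
    const OperatorInfo *info = lookup_operator(name);
    return info != NULL && stack_depth >= info->operand_count;
}
```

Calling `can_execute(op_buf, stack_ptr)` before the `strcmp` chain would let the parser skip both unknown operators and operators with too few operands.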
Phase 6: Rendering (Week 3, Days 3-5)
// Simple SVG output
void render_to_svg(Operation* ops, size_t num_ops, const char* filename) {
    FILE* f = fopen(filename, "w");
    if (f == NULL) {
        perror(filename);
        return;
    }
    fprintf(f, "<?xml version=\"1.0\"?>\n");
    fprintf(f, "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"612\" height=\"792\">\n");
    // PDF's origin is bottom-left with y up; SVG's is top-left with y down
    fprintf(f, " <g transform=\"translate(0,792) scale(1,-1)\">\n");
    double fill_r = 0, fill_g = 0, fill_b = 0;
    double stroke_r = 0, stroke_g = 0, stroke_b = 0;
    char path_data[65536] = "";
    for (size_t i = 0; i < num_ops; i++) {
        Operation* op = &ops[i];
        // snprintf truncates at the end of path_data instead of overflowing
        size_t used = strlen(path_data);
        switch (op->op) {
        case OP_MOVETO:
            snprintf(path_data + used, sizeof(path_data) - used, "M %.2f %.2f ",
                     op->operands[0], op->operands[1]);
            break;
        case OP_LINETO:
            snprintf(path_data + used, sizeof(path_data) - used, "L %.2f %.2f ",
                     op->operands[0], op->operands[1]);
            break;
        case OP_CLOSEPATH:
            snprintf(path_data + used, sizeof(path_data) - used, "Z ");
            break;
case OP_SETGRAY:
fill_r = fill_g = fill_b = op->operands[0];
break;
case OP_SETRGB:
fill_r = op->operands[0];
fill_g = op->operands[1];
fill_b = op->operands[2];
break;
case OP_FILL:
if (strlen(path_data) > 0) {
fprintf(f, " <path d=\"%s\" fill=\"rgb(%d,%d,%d)\" stroke=\"none\"/>\n",
path_data,
(int)(fill_r * 255),
(int)(fill_g * 255),
(int)(fill_b * 255));
path_data[0] = '\0';
}
break;
case OP_STROKE:
if (strlen(path_data) > 0) {
fprintf(f, " <path d=\"%s\" fill=\"none\" stroke=\"rgb(%d,%d,%d)\"/>\n",
path_data,
(int)(stroke_r * 255),
(int)(stroke_g * 255),
(int)(stroke_b * 255));
path_data[0] = '\0';
}
break;
default:
break;
}
}
fprintf(f, " </g>\n");
fprintf(f, "</svg>\n");
fclose(f);
}
Testing Strategy
Test with Hand-Crafted PDFs
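Create the simplest valid PDF. The offsets assume LF line endings, and each xref entry line must end with a space before the newline so that it is exactly 20 bytes long: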
cat > test_minimal.pdf << 'EOF'
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 21 >>
stream
100 100 m 200 200 l S
endstream
endobj
xref
0 5
0000000000 65535 f 
0000000009 00000 n 
0000000058 00000 n 
0000000115 00000 n 
0000000202 00000 n 
trailer
<< /Size 5 /Root 1 0 R >>
startxref
273
%%EOF
EOF
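Counting byte offsets by hand is error-prone, so it can help to generate the test file programmatically and let `ftell` record where each object starts. The following is a minimal sketch of that idea; `write_minimal_pdf` is a hypothetical helper, and it writes the same four objects as the hand-crafted file above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a one-page PDF to path, recording each object's byte offset with
 * ftell() so the xref table is correct by construction.
 * Returns 0 on success, -1 on failure. */
int write_minimal_pdf(const char *path) {
    assert(path != NULL);
    const char *content = "100 100 m 200 200 l S";
    const char *objects[] = {
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>",
    };
    long offsets[5] = {0};

    FILE *f = fopen(path, "wb");  /* binary mode: no newline translation */
    if (f == NULL) return -1;

    fputs("%PDF-1.4\n", f);
    for (int i = 0; i < 3; i++) {
        offsets[i + 1] = ftell(f);
        fprintf(f, "%d 0 obj\n%s\nendobj\n", i + 1, objects[i]);
    }
    offsets[4] = ftell(f);
    fprintf(f, "4 0 obj\n<< /Length %zu >>\nstream\n%s\nendstream\nendobj\n",
            strlen(content), content);

    long xref_offset = ftell(f);
    /* Each entry is 20 bytes: 10 + space + 5 + space + flag + space + LF. */
    fputs("xref\n0 5\n0000000000 65535 f \n", f);
    for (int i = 1; i <= 4; i++) {
        fprintf(f, "%010ld 00000 n \n", offsets[i]);
    }
    fprintf(f, "trailer\n<< /Size 5 /Root 1 0 R >>\nstartxref\n%ld\n%%%%EOF\n",
            xref_offset);
    return fclose(f) == 0 ? 0 : -1;
}
```

The same pattern extends to larger test files: add objects to the array and the offsets stay correct without any manual counting.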
Validate with Industry Tools
# Check PDF structure
qpdf --check output.pdf
# Extract info
pdfinfo output.pdf
# Compare text extraction
pdftotext reference.pdf - > ref.txt
pdftotext output.pdf - > out.txt
diff ref.txt out.txt
Compare Rendering with Ghostscript
# Render with Ghostscript
gs -sDEVICE=png16m -r72 -o gs_output.png test.pdf
# Render with your parser
./pdf_parser test.pdf --render my_output.png
# Visual comparison with ImageMagick
compare gs_output.png my_output.png diff.png
Common Pitfalls
1. xref Offset Calculation
The xref table offsets are byte offsets from the start of the file. Make sure to:
- Count bytes exactly (including newlines)
- Handle both \n and \r\n line endings
2. Stream Length Indirection
Stream /Length can be a reference:
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
...
endstream
endobj
5 0 obj
187
endobj
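You must dereference to get the actual length, then seek back to the start of the stream data, because resolving the reference moves the file position.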
3. Object Number 0
Object 0 is always free (the head of the free list). Your xref array should start at index 0 but object 0 is never actually used.
4. Multiple xref Tables (Incremental Updates)
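PDFs can have multiple xref tables (appended at the end by incremental updates). Each trailer may contain /Prev, the byte offset of the previous xref section. Follow the chain from the newest section back to the oldest and merge them so that the newest entry for each object number wins.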
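The merge rule can be sketched as follows, assuming sections are applied newest first (the one `startxref` points at, then each /Prev in turn). `MergedEntry`, its `seen` flag, and `merge_xref_section` are hypothetical names for illustration. The flag records that some newer section already defined the object, so older entries are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical merged-table entry; 'seen' marks objects already defined
 * by a newer xref section. */
typedef struct {
    long offset;
    int generation;
    bool in_use;
    bool seen;
} MergedEntry;

/* Apply one xref subsection to the merged table. Call this for the newest
 * section first, then for each /Prev section in turn. */
void merge_xref_section(MergedEntry *table, size_t table_size,
                        size_t first_obj,
                        const MergedEntry *entries, size_t count) {
    assert(table != NULL && entries != NULL);
    for (size_t i = 0; i < count; i++) {
        size_t n = first_obj + i;
        if (n >= table_size || table[n].seen) continue;  /* newer entry wins */
        table[n] = entries[i];
        table[n].seen = true;
    }
}
```

Keeping a separate `seen` flag matters because a newer section can mark an object as free, and that deletion must also override the older in-use entry.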
Extensions and Challenges
Level 1: More Compression Filters
- LZWDecode
- ASCII85Decode
- RunLengthDecode
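Of these three, RunLengthDecode is the simplest to start with. Each run begins with a length byte: 0-127 means "copy the next length+1 bytes literally", 129-255 means "repeat the next byte 257-length times", and 128 marks the end of data. A minimal sketch, with `run_length_decode` as a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Decode RunLengthDecode data from in into out.
 * Returns the number of bytes written, or -1 if the input is truncated
 * or out is too small. */
long run_length_decode(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t i = 0, n = 0;
    while (i < in_len) {
        unsigned char len = in[i++];
        if (len == 128) break;                 /* end-of-data marker */
        if (len < 128) {                       /* literal run of len+1 bytes */
            size_t run = (size_t)len + 1;
            if (in_len - i < run || cap - n < run) return -1;
            for (size_t k = 0; k < run; k++) out[n++] = in[i++];
        } else {                               /* repeat next byte 257-len times */
            size_t run = 257 - (size_t)len;
            if (i >= in_len || cap - n < run) return -1;
            unsigned char b = in[i++];
            for (size_t k = 0; k < run; k++) out[n++] = b;
        }
    }
    return (long)n;
}
```

For example, the input bytes `02 'a' 'b' 'c' FD 'x' 80` decode to "abcxxxx": a three-byte literal run followed by 'x' repeated 257 - 253 = 4 times.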
Level 2: Font Handling
- Parse Type 1 and TrueType font programs
- Extract glyph outlines
- Render text properly
Level 3: Images
- Decode inline images
- Handle DCTDecode (JPEG)
- Support JPEG2000
Level 4: Interactive Features
- Parse annotations
- Handle form fields
- Support links and bookmarks
Self-Assessment Checklist
Before moving to Project 3, verify you can:
- Locate and parse the xref table from any PDF
- Dereference indirect object references
- Parse all 8 PDF object types correctly
- Decompress FlateDecode streams
- Navigate Catalog → Pages → Page hierarchy
- Extract and parse content streams
- Explain the difference between PDF and PostScript operators
- Render simple shapes from PDF content streams
Resources
Essential Reading
- PDF Reference Manual 1.7 by Adobe (free PDF) - The specification
- Developing with PDF by Leonard Rosenthol - Practical guide
- PDF Explained by John Whitington - Clear introduction
Tools
- qpdf: PDF manipulation and validation
- pdftk: PDF toolkit
- pdfinfo: Basic PDF information
- Ghostscript: Widely used PostScript/PDF interpreter, useful as a rendering reference
Libraries
- zlib: Stream decompression
- Cairo: Rendering (optional)
- libpng: PNG output (optional)