Project 2: PDF File Parser & Renderer

Project Overview

Attribute	Details
Difficulty	Level 3: Advanced
Time Estimate	3-4 weeks
Programming Language	C
Knowledge Area	Document Formats / Compression
Prerequisites	Project 1 or equivalent understanding of graphics state

What you’ll build: A tool that parses PDF files, extracts their structure, and renders pages to images.

Why it teaches PostScript→PDF: You’ll see that PDF is essentially “frozen PostScript” - the same drawing operations exist, but in a declarative structure rather than executable code. You’ll understand what Ghostscript produces when it converts PS to PDF.

Learning Objectives

By completing this project, you will:

Parse binary file formats with mixed ASCII and binary content
Navigate object graphs using indirect references and cross-reference tables
Decompress streams using zlib (Flate/Deflate)
Execute PDF operators (nearly identical to PostScript)
Understand PDF’s random-access architecture and why it exists

The Core Question You’re Answering

“What IS a PDF file, really? How does ‘frozen PostScript’ work without an interpreter?”

PDF isn’t just “a document format”—it’s a carefully designed data structure that solves a specific problem: How do you distribute PostScript documents without requiring an interpreter?

The answer: Separate structure from content.

PostScript (dynamic):               PDF (static):
┌─────────────────────────┐        ┌─────────────────────────┐
│ /box {                  │        │ 4 0 obj                 │
│   /h exch def           │        │ << /Length 44 >>        │
│   /w exch def           │        │ stream                  │
│   0 0 moveto            │        │ 0 0 m                   │
│   w 0 lineto            │        │ 100 0 l                 │
│   w h lineto            │        │ 100 50 l                │
│   0 h lineto            │        │ 0 50 l                  │
│   closepath             │        │ h                       │
│ } def                   │        │ f                       │
│                         │        │ endstream               │
│ 100 50 box fill         │        │ endobj                  │
│                         │        │                         │
│ REQUIRES INTERPRETER    │        │ JUST PARSE & RENDER     │
└─────────────────────────┘        └─────────────────────────┘

The profound difference:

Aspect	PostScript	PDF
Execution	Requires Turing-complete interpreter	Simple operator parsing
Page Access	Must execute from page 1 to get to page 100	Jump directly via xref
Security	Can hang, crash, or execute malicious code	Content streams have no loops or procedures, so interpreting one always terminates
File Size	Can be tiny (code generates content)	Larger but compressed

By building this parser, you’ll internalize: PDF traded flexibility for reliability and random access.

Deep Theoretical Foundation

1. PDF File Structure

Every PDF file has four major sections:

┌─────────────────────────────────────────────────────────────────┐
│                        PDF FILE LAYOUT                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                      HEADER                              │   │
│  │  %PDF-1.7                                               │   │
│  │  %âãÏÓ  (binary marker for tools that check for binary) │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              ↓                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                      BODY                                │   │
│  │  1 0 obj << /Type /Catalog ... >> endobj                │   │
│  │  2 0 obj << /Type /Pages ... >> endobj                  │   │
│  │  3 0 obj << /Type /Page ... >> endobj                   │   │
│  │  4 0 obj << /Length 187 >> stream ... endstream endobj  │   │
│  │  ...                                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              ↓                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              CROSS-REFERENCE TABLE (xref)                │   │
│  │  xref                                                    │   │
│  │  0 5                                                     │   │
│  │  0000000000 65535 f                                     │   │
│  │  0000000015 00000 n    ← Object 1 at byte 15            │   │
│  │  0000000074 00000 n    ← Object 2 at byte 74            │   │
│  │  0000000133 00000 n    ← Object 3 at byte 133           │   │
│  │  0000000250 00000 n    ← Object 4 at byte 250           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              ↓                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                      TRAILER                             │   │
│  │  trailer                                                 │   │
│  │  << /Size 5 /Root 1 0 R >>                             │   │
│  │  startxref                                               │   │
│  │  379                        ← Byte offset of xref       │   │
│  │  %%EOF                                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
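
As a first concrete step, the header line can be checked with a few lines of C. This is a minimal sketch, assuming the first bytes of the file are already in memory; `parse_pdf_header` is a hypothetical helper name, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Parse "%PDF-M.N" from the first bytes of a file.
 * Returns 1 and fills major/minor on success, 0 otherwise.
 * (Hypothetical helper for illustration.) */
int parse_pdf_header(const char *buf, size_t len, int *major, int *minor) {
    char head[16];
    size_t n = len < sizeof(head) - 1 ? len : sizeof(head) - 1;

    memcpy(head, buf, n);
    head[n] = '\0';                      /* sscanf needs a terminated string */

    if (strncmp(head, "%PDF-", 5) != 0) return 0;
    return sscanf(head + 5, "%d.%d", major, minor) == 2;
}
```

Calling it on the bytes `%PDF-1.7` yields major 1 and minor 7. A stricter parser might also confirm that the second line is a comment containing bytes above 127.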

2. PDF Object Types

PDF defines eight basic object types. The specification counts integers and reals together as one numeric type and includes the null object, so the table below has nine rows:

Type	Syntax	Example
Boolean	`true` / `false`	`true`
Integer	Digits	`42`, `-17`
Real	Decimal	`3.14159`, `-2.5`
String	Parens or hex	`(Hello)`, `<48656C6C6F>`
Name	Slash prefix	`/Type`, `/Page`
Array	Square brackets	`[1 2 3]`, `[/Name 123 (text)]`
Dictionary	Double angle	`<< /Key1 value1 /Key2 value2 >>`
Stream	Dict + binary	`<< /Length 44 >> stream...endstream`
Null	Keyword	`null`

3. Indirect Objects and References

The power of PDF comes from indirect objects:

DIRECT OBJECTS:
  << /Type /Page /MediaBox [0 0 612 792] >>
  ↑ Everything is inline

INDIRECT OBJECTS:
  3 0 obj                           ← Object number 3, generation 0
  << /Type /Page /Contents 4 0 R >> ← References object 4
  endobj

  4 0 obj
  << /Length 187 >>
  stream
  ... content ...
  endstream
  endobj

Why indirect objects?

Reuse: Same object can be referenced multiple times
Random access: Jump to any object via xref table
Incremental updates: Add new objects without rewriting file

4. The Cross-Reference Table

The xref table maps every object number to its byte offset, so a reader can seek straight to any object without scanning the file:

xref
0 5                     ← Entries for objects 0-4 (5 objects total)
0000000000 65535 f      ← Object 0: free (linked list head)
0000000015 00000 n      ← Object 1: in use at byte 15
0000000074 00000 n      ← Object 2: in use at byte 74
0000000133 00000 n      ← Object 3: in use at byte 133
0000000250 00000 n      ← Object 4: in use at byte 250

Format: OOOOOOOOOO GGGGG S
        ↑          ↑     ↑
        Offset     Gen   Status (n=in use, f=free)
        (10 digits)(5)   (1)
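
Because every entry in a classic xref table is exactly 20 bytes (including a two-byte line ending), it can be validated field by field instead of being read with `fscanf`. The sketch below assumes the 20 bytes are already in a buffer; `XRefLine` and `parse_xref_entry` are hypothetical names used only for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long long offset;   /* byte offset (for free entries: next free object number) */
    int generation;
    int in_use;         /* 1 for 'n', 0 for 'f' */
} XRefLine;             /* hypothetical name */

/* Parse one fixed-width entry "OOOOOOOOOO GGGGG S" plus 2-byte EOL.
 * Returns 1 on success, 0 if the layout is wrong. */
int parse_xref_entry(const char entry[20], XRefLine *out) {
    char digits[11];
    int i;

    for (i = 0; i < 10; i++)
        if (!isdigit((unsigned char)entry[i])) return 0;
    for (i = 11; i < 16; i++)
        if (!isdigit((unsigned char)entry[i])) return 0;
    if (entry[10] != ' ' || entry[16] != ' ') return 0;
    if (entry[17] != 'n' && entry[17] != 'f') return 0;

    memcpy(digits, entry, 10);
    digits[10] = '\0';
    out->offset = strtoll(digits, NULL, 10);

    memcpy(digits, entry + 11, 5);
    digits[5] = '\0';
    out->generation = (int)strtol(digits, NULL, 10);

    out->in_use = (entry[17] == 'n');
    return 1;
}
```

The fixed width is also what makes random access into the table itself possible: entry `k` of a subsection starts `20 * k` bytes after the first entry.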

How to find the xref table, reading from the end of the file:

1. Find %%EOF
2. Read backward to find startxref
3. Read the byte offset after startxref
4. Seek to that offset → xref table

5. Document Structure

PDF documents form a tree:

┌─────────────────────────────────────────────────────────────────┐
│                    PDF DOCUMENT TREE                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CATALOG (object 1)                                             │
│  << /Type /Catalog                                              │
│     /Pages 2 0 R                                                │
│     /Outlines 10 0 R     (optional bookmarks)                  │
│     /Metadata 15 0 R     (optional XMP metadata)               │
│  >>                                                             │
│           │                                                     │
│           ▼                                                     │
│  PAGES TREE (object 2)                                          │
│  << /Type /Pages                                                │
│     /Kids [3 0 R 6 0 R 9 0 R]                                  │
│     /Count 3                                                    │
│  >>                                                             │
│           │                                                     │
│     ┌─────┼─────┐                                              │
│     ▼     ▼     ▼                                              │
│  PAGE 1  PAGE 2  PAGE 3                                        │
│  (obj 3) (obj 6) (obj 9)                                       │
│     │                                                           │
│     ▼                                                           │
│  CONTENT STREAM (object 4)                                      │
│  << /Length 187 /Filter /FlateDecode >>                        │
│  stream                                                         │
│  ... compressed drawing operators ...                           │
│  endstream                                                      │
│                                                                 │
│  RESOURCES (object 5)                                           │
│  << /Font << /F1 7 0 R >>                                      │
│     /XObject << /Im1 8 0 R >> >>                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
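
The diagram shows a flat page tree, but /Kids entries may themselves be /Pages nodes, so locating page N generally needs a recursive walk. The following is a sketch over a hypothetical in-memory `PageNode` type (not the `PDFObject` used later in this guide), assuming the tree has already been loaded.

```c
#include <stddef.h>

/* Hypothetical in-memory node: a /Pages node has kids, a /Page leaf has none. */
typedef struct PageNode {
    int is_leaf;                /* 1 for /Type /Page, 0 for /Type /Pages */
    int obj_num;                /* object number, for identification only */
    struct PageNode **kids;
    size_t kid_count;
} PageNode;

/* Depth-first walk in /Kids order. Returns the leaf whose zero-based
 * index is `target`, or NULL. `*seen` counts leaves visited so far and
 * must start at 0. */
const PageNode *find_page(const PageNode *node, size_t target, size_t *seen) {
    size_t i;

    if (node->is_leaf) {
        return ((*seen)++ == target) ? node : NULL;
    }
    for (i = 0; i < node->kid_count; i++) {
        const PageNode *hit = find_page(node->kids[i], target, seen);
        if (hit != NULL) return hit;
    }
    return NULL;
}
```

A faster variant could use each intermediate node's /Count to skip whole subtrees; this version visits every leaf before the target, which is simpler to get right first.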

6. Content Streams: PostScript Inside PDF

Content streams contain the actual drawing commands. They look almost exactly like PostScript:

PostScript:                     PDF Content Stream:
─────────────────────────────   ─────────────────────────────
0.5 setgray                     0.5 g
newpath                         (implicit)
100 100 moveto                  100 100 m
200 100 lineto                  200 100 l
200 200 lineto                  200 200 l
100 200 lineto                  100 200 l
closepath                       h
fill                            f

PDF Operator Reference (subset):

Category	Operators	Description
Path	`m` (moveto), `l` (lineto), `c` (curveto), `h` (closepath)	Build paths
Paint	`S` (stroke), `f` (fill), `B` (fill+stroke)	Paint paths
Color	`g` (setgray fill), `G` (setgray stroke), `rg`/`RG` (RGB)	Set colors
State	`q` (gsave), `Q` (grestore), `cm` (concat matrix)	Manage state
Text	`BT` (begin text), `ET` (end text), `Tf` (font), `Tj` (show)	Render text
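
One way to turn the table above into code is a lookup table that records how many numeric operands each operator consumes, so the interpreter can detect a short operand stack before popping. This is a sketch only: `OpInfo`, `OP_TABLE`, and `operand_count` are hypothetical names, and `Tf` and `Tj` are left out because they take a name or string operand rather than numbers alone.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;   /* operator as written in the content stream */
    int operands;       /* number of numeric operands it consumes */
} OpInfo;

static const OpInfo OP_TABLE[] = {
    {"m", 2}, {"l", 2}, {"c", 6}, {"h", 0},      /* path construction */
    {"S", 0}, {"f", 0}, {"B", 0},                /* painting */
    {"g", 1}, {"G", 1}, {"rg", 3}, {"RG", 3},    /* color */
    {"q", 0}, {"Q", 0}, {"cm", 6},               /* graphics state */
    {"BT", 0}, {"ET", 0},                        /* text object brackets */
};

/* Return the operand count for `name`, or -1 if it is not in the table. */
int operand_count(const char *name) {
    size_t i;
    for (i = 0; i < sizeof(OP_TABLE) / sizeof(OP_TABLE[0]); i++) {
        if (strcmp(OP_TABLE[i].name, name) == 0) return OP_TABLE[i].operands;
    }
    return -1;
}
```

Operator names are case-sensitive (`g` sets the fill gray, `G` the stroke gray), which is why the lookup uses `strcmp` rather than a case-insensitive comparison.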

7. Compression: Flate/Deflate

Most PDF streams are compressed with FlateDecode (zlib):

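#include <stdlib.h>
#include <zlib.h>

// One-shot inflate into a fixed-size guess. Returns NULL on any failure,
// including a guess that turns out too small (Phase 4 shows a version
// that grows the buffer instead).
unsigned char* decompress_stream(unsigned char* compressed,
                                  size_t comp_len,
                                  size_t* out_len) {
    // Guess the output size; text-like streams often expand several times over
    size_t estimated = comp_len * 10 + 64;
    unsigned char* output = malloc(estimated);
    if (output == NULL) return NULL;

    z_stream z = {0};
    z.next_in = compressed;
    z.avail_in = (uInt)comp_len;
    z.next_out = output;
    z.avail_out = (uInt)estimated;

    if (inflateInit(&z) != Z_OK) {
        free(output);
        return NULL;
    }
    int ret = inflate(&z, Z_FINISH);
    inflateEnd(&z);

    // With Z_FINISH, only Z_STREAM_END means the whole stream fit
    if (ret != Z_STREAM_END) {
        free(output);
        return NULL;
    }

    *out_len = z.total_out;
    return output;
}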
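
Before handing bytes to zlib, it can help to confirm that they start with a zlib header at all, for example when debugging a wrong /Length or a wrong stream offset. The sketch below applies the header rules from RFC 1950; `looks_like_zlib` is a hypothetical helper name and needs no zlib linkage.

```c
#include <stddef.h>

/* Return 1 if the buffer begins with a plausible zlib (RFC 1950) header. */
int looks_like_zlib(const unsigned char *data, size_t len) {
    unsigned cmf, flg;

    if (len < 2) return 0;
    cmf = data[0];
    flg = data[1];

    if ((cmf & 0x0F) != 8) return 0;        /* compression method 8 = deflate */
    if ((cmf >> 4) > 7) return 0;           /* window size field must be <= 7 */
    return ((cmf << 8) | flg) % 31 == 0;    /* header check: multiple of 31 */
}
```

Most Flate streams in practice begin with the byte 0x78. A failed check usually means the parser is pointing at the wrong bytes rather than that the data is corrupt.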

Project Specification

Core Features

Your PDF parser must:

Parse PDF Header
- Extract PDF version (e.g., “1.7”)
- Detect binary marker
Find and Parse Cross-Reference Table
- Locate via startxref (read from end of file)
- Parse xref entries (byte offset, generation, status)
- Support both xref tables and xref streams
Parse Trailer
- Extract /Root (catalog reference)
- Extract /Size (number of objects)
Dereference Indirect Objects
- Given a reference N G R (G is the generation number, usually 0), find object N in xref, seek to its offset, parse it
Parse All Object Types
- Booleans, integers, reals, strings, names, arrays, dictionaries, streams
Decompress Content Streams
- Support FlateDecode filter (zlib)
- Handle /Length for stream boundaries
Parse Content Stream Operators
- Tokenize operators and operands
- Execute graphics state machine
- Track path construction and colors
Render to Output
- Generate PNG or SVG from parsed content

Command-Line Interface

# Dump PDF structure
./pdf_parser document.pdf --dump-structure

# Extract and display content stream operators
./pdf_parser document.pdf --extract-operators

# Render page to PNG
./pdf_parser document.pdf --render output.png

# Validate PDF structure
./pdf_parser document.pdf --validate
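
The four invocations above share one shape: input file first, then a mode flag, with an extra output path for rendering. A minimal sketch of that dispatch might look like the following; the `Mode` enum and `parse_args` are hypothetical names, and a real tool would likely print a usage message when `MODE_NONE` comes back.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    MODE_NONE, MODE_DUMP, MODE_OPERATORS, MODE_RENDER, MODE_VALIDATE
} Mode;

/* Expects argv[1] = input PDF, argv[2] = flag, argv[3] = output (render only).
 * Returns MODE_NONE when the arguments match none of the documented forms. */
Mode parse_args(int argc, char **argv, const char **input, const char **output) {
    if (argc < 3) return MODE_NONE;

    *input = argv[1];
    *output = NULL;

    if (strcmp(argv[2], "--dump-structure") == 0) return MODE_DUMP;
    if (strcmp(argv[2], "--extract-operators") == 0) return MODE_OPERATORS;
    if (strcmp(argv[2], "--validate") == 0) return MODE_VALIDATE;
    if (strcmp(argv[2], "--render") == 0 && argc >= 4) {
        *output = argv[3];
        return MODE_RENDER;
    }
    return MODE_NONE;
}
```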

Solution Architecture

High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                      PDF PARSER                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │                  FILE READER                              │ │
│  │                                                           │ │
│  │  - Memory-map file (or load into buffer)                 │ │
│  │  - Provide seek/read operations                          │ │
│  │  - Handle binary + ASCII content                         │ │
│  └───────────────────────────────────────────────────────────┘ │
│                            ↓                                    │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │              CROSS-REFERENCE PARSER                       │ │
│  │                                                           │ │
│  │  - Find trailer from end of file                         │ │
│  │  - Parse xref table or xref stream                       │ │
│  │  - Build object offset lookup table                      │ │
│  └───────────────────────────────────────────────────────────┘ │
│                            ↓                                    │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │                 OBJECT PARSER                             │ │
│  │                                                           │ │
│  │  - Tokenize PDF syntax                                   │ │
│  │  - Parse dictionaries, arrays, streams                   │ │
│  │  - Dereference indirect objects                          │ │
│  │  - Decompress streams with zlib                          │ │
│  └───────────────────────────────────────────────────────────┘ │
│                            ↓                                    │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │              DOCUMENT NAVIGATOR                           │ │
│  │                                                           │ │
│  │  - Follow Catalog → Pages → Page                         │ │
│  │  - Extract content streams                               │ │
│  │  - Resolve resource dictionaries                         │ │
│  └───────────────────────────────────────────────────────────┘ │
│                            ↓                                    │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │           CONTENT STREAM INTERPRETER                      │ │
│  │                                                           │ │
│  │  - Parse PostScript-like operators                       │ │
│  │  - Maintain graphics state                               │ │
│  │  - Build path and text operations                        │ │
│  └───────────────────────────────────────────────────────────┘ │
│                            ↓                                    │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │                   RENDERER                                │ │
│  │                                                           │ │
│  │  - Convert graphics operations to pixels                 │ │
│  │  - Output PNG (via Cairo or libpng)                      │ │
│  │  - Or output SVG (text-based)                            │ │
│  └───────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Data Structures

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// PDF Object representation
typedef enum {
    PDF_NULL,
    PDF_BOOL,
    PDF_INT,
    PDF_REAL,
    PDF_STRING,
    PDF_NAME,
    PDF_ARRAY,
    PDF_DICT,
    PDF_STREAM,
    PDF_REF
} PDFObjectType;

typedef struct PDFObject {
    PDFObjectType type;
    union {
        bool boolean;
        long integer;
        double real;
        struct {
            char* data;
            size_t length;
        } string;
        char* name;
        struct {
            struct PDFObject** items;
            size_t count;
        } array;
        struct {
            char** keys;
            struct PDFObject** values;
            size_t count;
        } dict;
        struct {
            struct PDFObject* dict;
            unsigned char* data;
            size_t length;
        } stream;
        struct {
            int obj_num;
            int gen_num;
        } ref;
    } value;
} PDFObject;

// Cross-reference entry
typedef struct {
    long offset;        // Byte offset in file
    int generation;     // Generation number
    bool in_use;        // true = n (in use), false = f (free)
} XRefEntry;

// PDF Document
typedef struct {
    FILE* file;
    char* data;         // Memory-mapped content (optional)
    size_t size;

    char version[8];    // "1.7", "2.0", etc.

    XRefEntry* xref;
    size_t xref_size;

    PDFObject* trailer;
    PDFObject* catalog;
    PDFObject* pages;
} PDFDocument;

// Graphics state for rendering
typedef struct {
    double ctm[6];
    double fill_color[4];    // RGBA
    double stroke_color[4];
    double line_width;

    // Current path
    struct {
        double x, y;
        int type;  // 0=move, 1=line, 2=curve
    } path[10000];
    size_t path_len;
} GraphicsState;
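
The `ctm[6]` field holds the six numbers `[a b c d e f]` of the current transformation matrix. As a sketch of how the `cm` operator and point transformation could work on that layout (the function names `ctm_concat` and `ctm_apply` are hypothetical), the new matrix is the operand matrix multiplied onto the existing CTM, and a point maps as x' = a·x + c·y + e, y' = b·x + d·y + f:

```c
/* Concatenate m onto ctm in place: ctm = m x ctm.
 * This is the effect of "a b c d e f cm" in a content stream. */
void ctm_concat(double ctm[6], const double m[6]) {
    double r[6];
    int i;

    r[0] = m[0] * ctm[0] + m[1] * ctm[2];
    r[1] = m[0] * ctm[1] + m[1] * ctm[3];
    r[2] = m[2] * ctm[0] + m[3] * ctm[2];
    r[3] = m[2] * ctm[1] + m[3] * ctm[3];
    r[4] = m[4] * ctm[0] + m[5] * ctm[2] + ctm[4];
    r[5] = m[4] * ctm[1] + m[5] * ctm[3] + ctm[5];

    for (i = 0; i < 6; i++) ctm[i] = r[i];
}

/* Map a user-space point through the CTM into device space. */
void ctm_apply(const double ctm[6], double x, double y,
               double *out_x, double *out_y) {
    *out_x = ctm[0] * x + ctm[2] * y + ctm[4];
    *out_y = ctm[1] * x + ctm[3] * y + ctm[5];
}
```

For example, starting from the identity, `1 0 0 1 10 20 cm` followed by `2 0 0 2 0 0 cm` maps the point (1, 1) to (12, 22): the later scale applies first, then the earlier translation. Since `q` must save this matrix along with the colors, keeping the large `path` array outside the saved state would make each save cheaper.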

Implementation Guide

Phase 1: File Reading and xref Location (Week 1, Days 1-2)

Start by finding the cross-reference table:

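// Find startxref by reading from end of file.
// The file must be opened in binary mode ("rb"); needs <stdio.h>, <string.h>.
long find_startxref(FILE* f) {
    // Read the last 1024 bytes, or the whole file if it is shorter
    if (fseek(f, 0, SEEK_END) != 0) return -1;
    long size = ftell(f);
    if (size < 0) return -1;
    long tail = size < 1024 ? size : 1024;
    if (fseek(f, size - tail, SEEK_SET) != 0) return -1;

    char buf[1025];
    size_t n = fread(buf, 1, (size_t)tail, f);
    buf[n] = '\0';  // sscanf below needs a terminated string

    // Search backward so the last "startxref" wins
    // (incremental updates append newer ones)
    for (long i = (long)n - 9; i >= 0; i--) {
        if (memcmp(buf + i, "startxref", 9) == 0) {
            // Read the offset value after it
            long offset;
            if (sscanf(buf + i + 9, "%ld", &offset) != 1) return -1;
            return offset;
        }
    }
    return -1;  // Not found
}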

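// Parse the xref table.
// Assumes doc->xref starts as NULL and doc->xref_size as 0.
void parse_xref_table(PDFDocument* doc, long xref_offset) {
    fseek(doc->file, xref_offset, SEEK_SET);

    char line[256];
    if (!fgets(line, sizeof(line), doc->file) ||
        strncmp(line, "xref", 4) != 0) {
        return;  // Not a classic table (the file may use an xref stream)
    }

    while (true) {
        // Read subsection header: "first_obj count"
        int first_obj, count;
        if (fscanf(doc->file, "%d %d", &first_obj, &count) != 2) break;
        if (first_obj < 0 || count < 0) break;

        // Expand xref table if needed, zeroing the new entries
        size_t needed = (size_t)first_obj + (size_t)count;
        if (needed > doc->xref_size) {
            XRefEntry* grown = realloc(doc->xref, needed * sizeof(XRefEntry));
            if (grown == NULL) return;
            memset(grown + doc->xref_size, 0,
                   (needed - doc->xref_size) * sizeof(XRefEntry));
            doc->xref = grown;
            doc->xref_size = needed;
        }

        // Read entries
        for (int i = 0; i < count; i++) {
            long offset;
            int gen;
            char status;
            if (fscanf(doc->file, "%ld %d %c", &offset, &gen, &status) != 3) {
                return;  // Truncated or malformed entry
            }

            doc->xref[first_obj + i].offset = offset;
            doc->xref[first_obj + i].generation = gen;
            doc->xref[first_obj + i].in_use = (status == 'n');
        }

        // Check for "trailer" keyword (bounded read: line holds 256 bytes)
        long pos = ftell(doc->file);
        if (fscanf(doc->file, "%255s", line) != 1) break;
        if (strcmp(line, "trailer") == 0) break;
        fseek(doc->file, pos, SEEK_SET);
    }
}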

Phase 2: Object Parser (Week 1, Days 3-5)

Parse PDF objects:

// Skip whitespace and comments
void skip_whitespace(FILE* f) {
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '%') {
            // Skip comment until newline
            while ((c = fgetc(f)) != EOF && c != '\n' && c != '\r');
        } else if (!isspace(c) && c != '\0') {  // PDF also treats NUL as whitespace
            ungetc(c, f);
            return;
        }
    }
}

// Parse a single PDF object
PDFObject* parse_object(FILE* f) {
    skip_whitespace(f);

    int c = fgetc(f);

    // Keywords: true, false, null
    if (c == 't' || c == 'f' || c == 'n') {
        const char* word = (c == 't') ? "true" : (c == 'f') ? "false" : "null";
        size_t rest = strlen(word) - 1;
        char buf[8] = {0};
        buf[0] = (char)c;
        if (fread(buf + 1, 1, rest, f) != rest || strcmp(buf, word) != 0) {
            return NULL;  // Not a keyword this parser knows
        }
        PDFObject* obj = calloc(1, sizeof(PDFObject));
        obj->type = (c == 'n') ? PDF_NULL : PDF_BOOL;
        obj->value.boolean = (c == 't');
        return obj;
    }

    // Name
    if (c == '/') {
        char name[256];
        size_t i = 0;
        // A name ends at whitespace or at any PDF delimiter character.
        // #xx escapes inside names are not decoded here.
        while ((c = fgetc(f)) != EOF && !isspace(c) &&
               strchr("/[]<>(){}%", c) == NULL) {
            if (i < sizeof(name) - 1) name[i++] = (char)c;  // Truncate overlong names
        }
        name[i] = '\0';
        if (c != EOF) ungetc(c, f);

        PDFObject* obj = calloc(1, sizeof(PDFObject));
        obj->type = PDF_NAME;
        obj->value.name = strdup(name);
        return obj;
    }

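    // String
    if (c == '(') {
        char str[4096];
        size_t i = 0;
        int paren_depth = 1;
        // Test the depth first so nothing after the closing ')' is consumed
        while (paren_depth > 0 && (c = fgetc(f)) != EOF) {
            if (c == '\\') {
                // A backslash escapes the next byte, so \( and \) do not
                // change the depth. Escapes are stored raw, not decoded.
                int next = fgetc(f);
                if (next == EOF) break;
                if (i < sizeof(str) - 2) {
                    str[i++] = (char)c;
                    str[i++] = (char)next;
                }
                continue;
            }
            if (c == '(') paren_depth++;
            else if (c == ')') paren_depth--;

            if (paren_depth > 0 && i < sizeof(str) - 1) str[i++] = (char)c;
        }
        str[i] = '\0';

        PDFObject* obj = calloc(1, sizeof(PDFObject));
        obj->type = PDF_STRING;
        // Copy by length: strings may hold NUL bytes, which strdup would cut off
        obj->value.string.data = malloc(i + 1);
        memcpy(obj->value.string.data, str, i + 1);
        obj->value.string.length = i;
        return obj;
    }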

    // Array
    if (c == '[') {
        PDFObject** items = NULL;
        size_t count = 0;
        size_t capacity = 0;

        while (true) {
            skip_whitespace(f);
            c = fgetc(f);
            if (c == ']' || c == EOF) break;
            ungetc(c, f);

            PDFObject* item = parse_object(f);
            if (item == NULL) break;  // Malformed element: stop instead of looping forever

            if (count >= capacity) {
                capacity = capacity ? capacity * 2 : 8;
                items = realloc(items, capacity * sizeof(PDFObject*));
            }
            items[count++] = item;
        }

        PDFObject* obj = calloc(1, sizeof(PDFObject));
        obj->type = PDF_ARRAY;
        obj->value.array.items = items;
        obj->value.array.count = count;
        return obj;
    }

    // Dictionary
    if (c == '<') {
        c = fgetc(f);
        if (c == '<') {
            // Dictionary
            char** keys = NULL;
            PDFObject** values = NULL;
            size_t count = 0;
            size_t capacity = 0;

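            while (true) {
                skip_whitespace(f);
                c = fgetc(f);
                if (c == EOF) break;  // Unterminated dictionary
                if (c == '>') {
                    fgetc(f);  // Skip second '>'
                    break;
                }
                ungetc(c, f);

                if (count >= capacity) {
                    capacity = capacity ? capacity * 2 : 8;
                    keys = realloc(keys, capacity * sizeof(char*));
                    values = realloc(values, capacity * sizeof(PDFObject*));
                }

                // Parse key (must be a name); stop on malformed input.
                // free_object is an assumed helper that releases a PDFObject.
                PDFObject* key_obj = parse_object(f);
                if (key_obj == NULL || key_obj->type != PDF_NAME) {
                    if (key_obj != NULL) free_object(key_obj);
                    break;
                }
                keys[count] = strdup(key_obj->value.name);
                free_object(key_obj);

                // Parse value
                values[count] = parse_object(f);
                if (values[count] == NULL) {
                    free(keys[count]);
                    break;
                }
                count++;
            }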

            PDFObject* obj = calloc(1, sizeof(PDFObject));
            obj->type = PDF_DICT;
            obj->value.dict.keys = keys;
            obj->value.dict.values = values;
            obj->value.dict.count = count;
            return obj;
        }
        // Hex string
        // ...
    }

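    // Number or reference
    if (isdigit(c) || c == '-' || c == '+' || c == '.') {
        // Collect the token by hand: PDF numbers are plain decimals, while
        // fscanf("%lf") would also accept forms such as "1e5" or "inf"
        char tok[64];
        size_t i = 0;
        bool is_real = false;
        do {
            if (c == '.') is_real = true;
            if (i < sizeof(tok) - 1) tok[i++] = (char)c;
            c = fgetc(f);
        } while (c != EOF && (isdigit(c) || c == '.'));
        tok[i] = '\0';
        if (c != EOF) ungetc(c, f);

        // An unsigned integer followed by "G R" is an indirect reference
        if (!is_real && isdigit((unsigned char)tok[0])) {
            long pos = ftell(f);
            int gen;
            char r;
            if (fscanf(f, "%d %c", &gen, &r) == 2 && r == 'R') {
                PDFObject* obj = calloc(1, sizeof(PDFObject));
                obj->type = PDF_REF;
                obj->value.ref.obj_num = (int)strtol(tok, NULL, 10);
                obj->value.ref.gen_num = gen;
                return obj;
            }
            fseek(f, pos, SEEK_SET);  // Not a reference: undo the lookahead
        }

        // Just a number; "1.0" stays a real because the token had a '.'
        PDFObject* obj = calloc(1, sizeof(PDFObject));
        if (is_real) {
            obj->type = PDF_REAL;
            obj->value.real = strtod(tok, NULL);
        } else {
            obj->type = PDF_INT;
            obj->value.integer = strtol(tok, NULL, 10);
        }
        return obj;
    }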

    return NULL;
}

Phase 3: Object Dereferencing (Week 2, Days 1-2)

// Get an object by its number
PDFObject* get_object(PDFDocument* doc, int obj_num) {
    if (obj_num < 0 || (size_t)obj_num >= doc->xref_size ||
        !doc->xref[obj_num].in_use) {
        return NULL;
    }

    long offset = doc->xref[obj_num].offset;
    fseek(doc->file, offset, SEEK_SET);

    // Read "N G obj". Matching "obj" literally also handles "1 0 obj<<",
    // where no whitespace follows the keyword; %n is only stored on a match.
    int num, gen;
    int matched = -1;
    if (fscanf(doc->file, "%d %d obj%n", &num, &gen, &matched) != 2 ||
        matched < 0 || num != obj_num) {
        return NULL;
    }

    PDFObject* obj = parse_object(doc->file);
    if (obj == NULL) return NULL;

    // Check for stream
    skip_whitespace(doc->file);
    long pos = ftell(doc->file);
    char buf[16];

    if (fscanf(doc->file, "%15s", buf) == 1 && strcmp(buf, "stream") == 0) {
        // Skip past newline
        int c;
        while ((c = fgetc(doc->file)) != EOF && c != '\n');
        long data_start = ftell(doc->file);

        // Get stream length (dict_get: assumed helper, looks up a key in a dict)
        PDFObject* length_obj = dict_get(obj, "Length");
        if (length_obj != NULL && length_obj->type == PDF_REF) {
            // Resolving the reference seeks elsewhere in the file
            length_obj = get_object(doc, length_obj->value.ref.obj_num);
        }
        if (length_obj == NULL || length_obj->type != PDF_INT ||
            length_obj->value.integer < 0) {
            return NULL;
        }
        size_t length = (size_t)length_obj->value.integer;

        // Seek back to the data before reading it
        fseek(doc->file, data_start, SEEK_SET);
        unsigned char* data = malloc(length ? length : 1);
        if (data == NULL || fread(data, 1, length, doc->file) != length) {
            free(data);
            return NULL;
        }

        // Convert to stream object
        PDFObject* stream_obj = calloc(1, sizeof(PDFObject));
        stream_obj->type = PDF_STREAM;
        stream_obj->value.stream.dict = obj;
        stream_obj->value.stream.data = data;
        stream_obj->value.stream.length = length;

        return stream_obj;
    }

    fseek(doc->file, pos, SEEK_SET);
    return obj;
}

// Dereference if it's a reference
PDFObject* deref(PDFDocument* doc, PDFObject* obj) {
    if (obj != NULL && obj->type == PDF_REF) {
        return get_object(doc, obj->value.ref.obj_num);
    }
    return obj;
}
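
The code in this phase calls `dict_get`, which the guide does not define. A lookup over parallel key and value arrays could be sketched as below. To keep the example compilable on its own it uses a stand-in `DictView` struct with the same three fields as the `dict` member of `PDFObject`; both `DictView` and `dict_lookup` are hypothetical names, and keys are assumed to be stored without the leading slash, as `parse_object` produces them.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for PDFObject's dict member: parallel key/value arrays. */
typedef struct {
    char **keys;
    void **values;
    size_t count;
} DictView;

/* Linear search by key; returns the value pointer or NULL if absent. */
void *dict_lookup(const DictView *d, const char *key) {
    size_t i;

    if (d == NULL || key == NULL) return NULL;
    for (i = 0; i < d->count; i++) {
        if (strcmp(d->keys[i], key) == 0) return d->values[i];
    }
    return NULL;
}
```

A linear scan is usually adequate because PDF dictionaries tend to be small. A `dict_get` for this project would presumably apply the same loop to `obj->value.dict` after checking that `obj->type` is `PDF_DICT`.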

Phase 4: Stream Decompression (Week 2, Days 3-4)

#include <zlib.h>

unsigned char* decompress_flate(unsigned char* data, size_t len, size_t* out_len) {
    // Start with estimated size (never zero, so inflate always has room)
    size_t out_size = len * 4 + 64;
    unsigned char* output = malloc(out_size);
    if (output == NULL) return NULL;

    z_stream z = {0};
    z.next_in = data;
    z.avail_in = len;
    z.next_out = output;
    z.avail_out = out_size;

    if (inflateInit(&z) != Z_OK) {
        free(output);
        return NULL;
    }

    while (true) {
        int ret = inflate(&z, Z_NO_FLUSH);

        if (ret == Z_STREAM_END) break;

        if (ret != Z_OK) {
            inflateEnd(&z);
            free(output);
            return NULL;
        }

        // Need more output space
        if (z.avail_out == 0) {
            size_t new_size = out_size * 2;
            unsigned char* grown = realloc(output, new_size);
            if (grown == NULL) {
                inflateEnd(&z);
                free(output);
                return NULL;
            }
            output = grown;
            z.next_out = output + out_size;
            z.avail_out = new_size - out_size;
            out_size = new_size;
        }
    }

    inflateEnd(&z);
    *out_len = z.total_out;
    return output;
}

unsigned char* get_stream_data(PDFDocument* doc, PDFObject* stream, size_t* out_len) {
    PDFObject* filter = dict_get(stream->value.stream.dict, "Filter");

    if (filter == NULL) {
        // No compression
        *out_len = stream->value.stream.length;
        unsigned char* copy = malloc(*out_len);
        memcpy(copy, stream->value.stream.data, *out_len);
        return copy;
    }

    if (filter->type == PDF_NAME && strcmp(filter->value.name, "FlateDecode") == 0) {
        return decompress_flate(stream->value.stream.data,
                                stream->value.stream.length,
                                out_len);
    }

    // Not handled here: other filters (LZWDecode, ASCII85Decode, etc.) and
    // filter arrays such as [/ASCII85Decode /FlateDecode]
    fprintf(stderr, "Unsupported filter\n");
    return NULL;
}

Phase 5: Content Stream Parsing (Week 2, Day 5 - Week 3, Day 2)

typedef enum {
    OP_MOVETO, OP_LINETO, OP_CURVETO, OP_CLOSEPATH,
    OP_STROKE, OP_FILL, OP_FILL_STROKE,
    OP_GSAVE, OP_GRESTORE, OP_CONCAT,
    OP_SETGRAY, OP_SETRGB, OP_SETCMYK,
    OP_BEGINTEXT, OP_ENDTEXT, OP_SETFONT, OP_SHOWTEXT
} OpCode;

typedef struct {
    OpCode op;
    double operands[6];
    int num_operands;
    char* string_operand;
} Operation;

// Parse content stream into operations
Operation* parse_content_stream(unsigned char* data, size_t len, size_t* num_ops) {
    Operation* ops = malloc(10000 * sizeof(Operation));
    size_t count = 0;

    double operand_stack[100];
    int stack_ptr = 0;

    size_t pos = 0;
    while (pos < len) {
        // Skip whitespace
        while (pos < len && isspace(data[pos])) pos++;
        if (pos >= len) break;

        // Check for number
        if (isdigit(data[pos]) || data[pos] == '-' || data[pos] == '+' || data[pos] == '.') {
            char num_buf[64];
            int i = 0;
            while (pos < len && (isdigit(data[pos]) || data[pos] == '.' || data[pos] == '-' || data[pos] == '+')) {
                num_buf[i++] = data[pos++];
            }
            num_buf[i] = '\0';
            operand_stack[stack_ptr++] = atof(num_buf);
            continue;
        }

        // Check for string
        if (data[pos] == '(') {
            pos++;
            char str_buf[1024];
            int i = 0;
            int depth = 1;
            while (pos < len && depth > 0) {
                if (data[pos] == '(') depth++;
                else if (data[pos] == ')') depth--;
                if (depth > 0) str_buf[i++] = data[pos];
                pos++;
            }
            str_buf[i] = '\0';
            // Store string for next operator
            continue;
        }

        // Check for operator
        if (isalpha(data[pos]) || data[pos] == '\'' || data[pos] == '"') {
            char op_buf[32];
            int i = 0;
            while (pos < len && (isalpha(data[pos]) || data[pos] == '*' || data[pos] == '\'' || data[pos] == '"')) {
                op_buf[i++] = data[pos++];
            }
            op_buf[i] = '\0';

            // Map operator name to OpCode
            Operation* op = &ops[count++];
            op->num_operands = 0;

            if (strcmp(op_buf, "m") == 0) {
                op->op = OP_MOVETO;
                op->operands[1] = operand_stack[--stack_ptr];  // y
                op->operands[0] = operand_stack[--stack_ptr];  // x
                op->num_operands = 2;
            }
            else if (strcmp(op_buf, "l") == 0) {
                op->op = OP_LINETO;
                op->operands[1] = operand_stack[--stack_ptr];
                op->operands[0] = operand_stack[--stack_ptr];
                op->num_operands = 2;
            }
            else if (strcmp(op_buf, "h") == 0) {
                op->op = OP_CLOSEPATH;
            }
            else if (strcmp(op_buf, "S") == 0) {
                op->op = OP_STROKE;
            }
            else if (strcmp(op_buf, "f") == 0 || strcmp(op_buf, "F") == 0) {
                op->op = OP_FILL;
            }
            else if (strcmp(op_buf, "q") == 0) {
                op->op = OP_GSAVE;
            }
            else if (strcmp(op_buf, "Q") == 0) {
                op->op = OP_GRESTORE;
            }
            else if (strcmp(op_buf, "g") == 0) {
                op->op = OP_SETGRAY;
                op->operands[0] = operand_stack[--stack_ptr];
                op->num_operands = 1;
            }
            else if (strcmp(op_buf, "rg") == 0) {
                op->op = OP_SETRGB;
                op->operands[2] = operand_stack[--stack_ptr];  // b
                op->operands[1] = operand_stack[--stack_ptr];  // g
                op->operands[0] = operand_stack[--stack_ptr];  // r
                op->num_operands = 3;
            }
            // ... more operators; add a final else for unrecognized names,
            // otherwise op->op is left uninitialized for them

            stack_ptr = 0;  // Clear operand stack after each operator
        }
    }

    *num_ops = count;
    return ops;
}
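
The parser above pops operands with `operand_stack[--stack_ptr]` and trusts the content stream to supply enough of them. A malformed stream such as a lone `m` would index before the start of the array. The following is a minimal sketch of a checked push and pop, assuming operands are stored as `double`; the names `OperandStack`, `push_operand`, `pop_operands`, and the capacity `MAX_OPERANDS` are hypothetical and not part of the parser shown above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPERANDS 32  /* assumed capacity, not taken from the parser above */

typedef struct {
    double values[MAX_OPERANDS];
    int    count;
} OperandStack;

/* Push one operand; returns 0 on success, -1 if the stack is full. */
int push_operand(OperandStack* s, double v) {
    assert(s != NULL);
    if (s->count >= MAX_OPERANDS) return -1;
    s->values[s->count++] = v;
    return 0;
}

/* Pop the top n operands into out[0..n-1], in the order they appeared in the
 * content stream. Returns 0 on success, or -1 (leaving the stack untouched)
 * if fewer than n operands are available. */
int pop_operands(OperandStack* s, int n, double* out) {
    assert(s != NULL && out != NULL);
    if (n < 0 || s->count < n) return -1;
    for (int i = 0; i < n; i++) {
        out[i] = s->values[s->count - n + i];
    }
    s->count -= n;
    return 0;
}
```

With a helper like this, the `m` branch could call `pop_operands(&stack, 2, op->operands)` and skip the operator when it returns -1, rather than reading garbage.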

Phase 6: Rendering (Week 3, Days 3-5)

// Simple SVG output
void render_to_svg(Operation* ops, size_t num_ops, const char* filename) {
    FILE* f = fopen(filename, "w");
    if (!f) {
        perror(filename);
        return;
    }

    fprintf(f, "<?xml version=\"1.0\"?>\n");
    fprintf(f, "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"612\" height=\"792\">\n");
    fprintf(f, "  <g transform=\"translate(0,792) scale(1,-1)\">\n");

    double fill_r = 0, fill_g = 0, fill_b = 0;
    double stroke_r = 0, stroke_g = 0, stroke_b = 0;
    char path_data[65536] = "";

    for (size_t i = 0; i < num_ops; i++) {
        Operation* op = &ops[i];

        switch (op->op) {
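            case OP_MOVETO: {
                // Append with an explicit bound so long paths cannot overflow path_data
                size_t used = strlen(path_data);
                snprintf(path_data + used, sizeof(path_data) - used, "M %.2f %.2f ",
                         op->operands[0], op->operands[1]);
                break;
            }

            case OP_LINETO: {
                size_t used = strlen(path_data);
                snprintf(path_data + used, sizeof(path_data) - used, "L %.2f %.2f ",
                         op->operands[0], op->operands[1]);
                break;
            }

            case OP_CLOSEPATH:
                strncat(path_data, "Z ", sizeof(path_data) - strlen(path_data) - 1);
                break;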

            case OP_SETGRAY:
                fill_r = fill_g = fill_b = op->operands[0];
                break;

            case OP_SETRGB:
                fill_r = op->operands[0];
                fill_g = op->operands[1];
                fill_b = op->operands[2];
                break;

            case OP_FILL:
                if (strlen(path_data) > 0) {
                    fprintf(f, "    <path d=\"%s\" fill=\"rgb(%d,%d,%d)\" stroke=\"none\"/>\n",
                            path_data,
                            (int)(fill_r * 255),
                            (int)(fill_g * 255),
                            (int)(fill_b * 255));
                    path_data[0] = '\0';
                }
                break;

            case OP_STROKE:
                if (strlen(path_data) > 0) {
                    fprintf(f, "    <path d=\"%s\" fill=\"none\" stroke=\"rgb(%d,%d,%d)\"/>\n",
                            path_data,
                            (int)(stroke_r * 255),
                            (int)(stroke_g * 255),
                            (int)(stroke_b * 255));
                    path_data[0] = '\0';
                }
                break;

            default:
                break;
        }
    }

    fprintf(f, "  </g>\n");
    fprintf(f, "</svg>\n");
    fclose(f);
}
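
The renderer above ignores `OP_GSAVE` and `OP_GRESTORE`, so a color set between `q` and `Q` leaks into whatever is drawn afterwards. One way to handle this is a small stack of saved states. The sketch below assumes the graphics state is just the fill and stroke colors tracked by `render_to_svg`; the names `GraphicsState`, `GraphicsStateStack`, `gs_save`, `gs_restore`, and the depth limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define GS_STACK_MAX 32  /* assumed nesting limit */

typedef struct {
    double fill_r, fill_g, fill_b;
    double stroke_r, stroke_g, stroke_b;
} GraphicsState;

typedef struct {
    GraphicsState saved[GS_STACK_MAX];
    int depth;
} GraphicsStateStack;

/* q: save a copy of the current state. Returns -1 if nesting is too deep. */
int gs_save(GraphicsStateStack* st, const GraphicsState* current) {
    assert(st != NULL && current != NULL);
    if (st->depth >= GS_STACK_MAX) return -1;
    st->saved[st->depth++] = *current;
    return 0;
}

/* Q: restore the most recently saved state. Returns -1 on an unbalanced Q. */
int gs_restore(GraphicsStateStack* st, GraphicsState* current) {
    assert(st != NULL && current != NULL);
    if (st->depth == 0) return -1;
    *current = st->saved[--st->depth];
    return 0;
}
```

In the `switch`, `case OP_GSAVE` would call `gs_save` and `case OP_GRESTORE` would call `gs_restore`. A real graphics state also carries the transformation matrix, line width, and clipping path, which this sketch leaves out.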

Testing Strategy

Test with Hand-Crafted PDFs

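Create the simplest valid PDF. Save it with \n line endings, because the xref offsets are byte counts that change if the line endings do: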

cat > test_minimal.pdf << 'EOF'
%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 21 >>
stream
100 100 m 200 200 l S
endstream
endobj
xref
0 5
0000000000 65535 f 
0000000009 00000 n 
0000000058 00000 n 
0000000115 00000 n 
0000000202 00000 n 
trailer
<< /Size 5 /Root 1 0 R >>
startxref
273
%%EOF
EOF

Validate with Industry Tools

# Check PDF structure
qpdf --check output.pdf

# Extract info
pdfinfo output.pdf

# Compare text extraction
pdftotext reference.pdf - > ref.txt
pdftotext output.pdf - > out.txt
diff ref.txt out.txt

Compare Rendering with Ghostscript

# Render with Ghostscript
gs -sDEVICE=png16m -r72 -o gs_output.png test.pdf

# Render with your parser
./pdf_parser test.pdf --render my_output.png

# Visual comparison with ImageMagick
compare gs_output.png my_output.png diff.png

Common Pitfalls

1. xref Offset Calculation

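The xref table offsets are byte offsets from the start of the file. Make sure to:

Count bytes exactly (including newlines)
Handle \n, \r, and \r\n line endings
Keep each xref entry exactly 20 bytes long, including its end-of-line marker (with a one-byte line ending, a space precedes it)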
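
When hand-crafting test files, it helps to let the computer do the byte counting. The sketch below scans a buffer for the line that begins an object and returns its byte offset, which you can compare against the xref entry. It assumes generation number 0 and treats both \n and \r as line boundaries; the function name `find_object_offset` is hypothetical. It can be fooled by binary stream data that happens to contain the same text, so treat it as a checking aid and not as a parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the byte offset of the line starting with "<obj_num> 0 obj",
 * or -1 if no such line exists in buf[0..len-1]. */
long find_object_offset(const char* buf, size_t len, int obj_num) {
    assert(buf != NULL);
    char header[32];
    int hlen = snprintf(header, sizeof(header), "%d 0 obj", obj_num);
    if (hlen < 0 || (size_t)hlen >= sizeof(header)) return -1;

    for (size_t i = 0; i + (size_t)hlen <= len; i++) {
        /* Only match at the start of the file or right after a line ending */
        int at_line_start = (i == 0) || buf[i - 1] == '\n' || buf[i - 1] == '\r';
        if (at_line_start && memcmp(buf + i, header, (size_t)hlen) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

For a file that starts with `%PDF-1.4` followed by a single \n, object 1 is found at offset 9. The same file saved with \r\n line endings puts it at offset 10, which is why the line-ending choice matters.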

2. Stream Length Indirection

A stream's /Length can be an indirect reference instead of a direct integer:

4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
...
endstream
endobj
5 0 obj
187
endobj

You must dereference it to get the actual length before you read the stream data.
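
A first step is telling the two forms apart. The sketch below works on the raw text of a stream dictionary and reports whether /Length is a direct integer or an `N G R` reference. It is a simplification that assumes the dictionary is available as a NUL-terminated string; in a full parser you would inspect the already-parsed dictionary object instead. `LengthValue` and `parse_length` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int  is_ref;  /* 1 if the value is an indirect reference "N G R" */
    long value;   /* direct length, or the object number when is_ref is 1 */
    int  gen;     /* generation number when is_ref is 1 */
} LengthValue;

/* Find /Length in the dictionary text. Returns 0 on success, -1 if the key
 * is missing or has no numeric value. */
int parse_length(const char* dict, LengthValue* out) {
    assert(dict != NULL && out != NULL);
    const char* p = dict;
    while ((p = strstr(p, "/Length")) != NULL) {
        p += strlen("/Length");
        /* Skip keys that merely start with /Length, such as /Length1 */
        if (isalnum((unsigned char)*p)) continue;

        long a = 0;
        int  b = 0;
        char c = 0;
        int n = sscanf(p, "%ld %d %c", &a, &b, &c);
        if (n == 3 && c == 'R') {          /* "5 0 R": indirect reference */
            out->is_ref = 1; out->value = a; out->gen = b;
            return 0;
        }
        if (n >= 1) {                      /* "44": direct integer */
            out->is_ref = 0; out->value = a; out->gen = 0;
            return 0;
        }
        return -1;
    }
    return -1;
}
```

When `is_ref` is 1, look up object `value` through the xref table, parse the integer it contains, and use that as the stream length.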

3. Object Number 0

Object 0 is always free: it is the head of the free list and has generation number 65535. Your xref array should start at index 0, but object 0 never holds an actual object.

4. Multiple xref Tables (Incremental Updates)

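A PDF that has been incrementally updated has multiple xref sections, each appended at the end of the file. Each trailer's /Prev entry gives the byte offset of the previous xref section. Start from the last startxref, follow the /Prev chain, and merge the sections so that the newest entry for each object wins.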
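
One simple merge strategy is to process sections from newest to oldest and only fill table slots that are still empty. The sketch below shows that rule in isolation, assuming each section has already been parsed into an array of entries; `XrefEntry` and `xref_merge_section` are hypothetical names, and a `type` of 0 is used here to mean "not yet seen".

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    long offset;  /* byte offset of the object in the file */
    int  gen;     /* generation number */
    char type;    /* 'n' in use, 'f' free, 0 = slot not yet filled */
} XrefEntry;

/* Merge one parsed xref section into the combined table. Call this for the
 * newest section first, then for each older section reached through /Prev.
 * Slots that are already filled are left alone, so newer entries win.
 * Returns the number of slots that were newly filled. */
size_t xref_merge_section(XrefEntry* table, size_t table_size,
                          const XrefEntry* section, size_t first_obj,
                          size_t count) {
    assert(table != NULL && section != NULL);
    size_t added = 0;
    for (size_t i = 0; i < count; i++) {
        size_t obj = first_obj + i;
        if (obj >= table_size) break;        /* ignore entries beyond /Size */
        if (table[obj].type != 0) continue;  /* a newer definition exists */
        table[obj] = section[i];
        added++;
    }
    return added;
}
```

The table must be zero-initialized before the first call. If an update rewrote object 3, the newest section supplies its new offset and the older section's entry for object 3 is skipped.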

Extensions and Challenges

Level 1: More Compression Filters

LZWDecode
ASCII85Decode
RunLengthDecode
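
Of these three, RunLengthDecode is the quickest to get working and makes a good warm-up. The encoding is a sequence of runs: a length byte from 0 to 127 means "copy the next length + 1 bytes literally", a length byte from 129 to 255 means "repeat the next byte 257 - length times", and 128 marks end of data. The following is a minimal sketch of a decoder under those rules; the function name and the caller-supplied output buffer are assumptions about how you might structure your own filter code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode RunLengthDecode data from in[0..in_len-1] into out.
 * Returns the number of decoded bytes, or -1 if the input is truncated
 * or the output buffer is too small. */
long run_length_decode(const unsigned char* in, size_t in_len,
                       unsigned char* out, size_t out_cap) {
    assert(in != NULL && out != NULL);
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        unsigned char n = in[ip++];
        if (n == 128) break;                  /* end-of-data marker */
        if (n < 128) {                        /* literal run of n + 1 bytes */
            size_t run = (size_t)n + 1;
            if (ip + run > in_len || op + run > out_cap) return -1;
            memcpy(out + op, in + ip, run);
            ip += run;
            op += run;
        } else {                              /* repeat next byte 257 - n times */
            size_t run = 257 - (size_t)n;
            if (ip >= in_len || op + run > out_cap) return -1;
            memset(out + op, in[ip++], run);
            op += run;
        }
    }
    return (long)op;
}
```

For example, the seven input bytes `2 'a' 'b' 'c' 254 'x' 128` decode to the six bytes `abcxxx`.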

Level 2: Font Handling

Parse Type 1 and TrueType font programs
Extract glyph outlines
Render text properly

Level 3: Images

Decode inline images
Handle DCTDecode (JPEG)
Support JPEG2000

Level 4: Interactive Features

Parse annotations
Handle form fields
Support links and bookmarks

Self-Assessment Checklist

Before moving to Project 3, verify you can:

Locate and parse the xref table from any PDF
Dereference indirect object references
Parse all 8 PDF object types correctly
Decompress FlateDecode streams
Navigate Catalog → Pages → Page hierarchy
Extract and parse content streams
Explain the difference between PDF and PostScript operators
Render simple shapes from PDF content streams

Resources

Essential Reading

PDF Reference Manual 1.7 by Adobe (free PDF) - The specification
Developing with PDF by Leonard Rosenthol - Practical guide
PDF Explained by John Whitington - Clear introduction

Tools

qpdf: PDF manipulation and validation
pdftk: PDF toolkit
pdfinfo: Basic PDF information
Ghostscript: PostScript/PDF interpreter, useful as a rendering reference

Libraries

zlib: Stream decompression
Cairo: Rendering (optional)
libpng: PNG output (optional)