Project 6: Document Processing Pipeline (Capstone)
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 months |
| Programming Language | C (core), Python (web interface) |
| Knowledge Area | Full Stack / Systems Design |
| Prerequisites | All previous projects |
What you'll build: A complete document processing system that accepts PostScript, PDF, or a custom markup language as input, processes it through a unified internal representation, and outputs PDF, SVG, PNG, or printer commands.
Why this is the ultimate test: This mirrors what production systems like Ghostscript, Cairo, and print servers actually do. You'll understand why these systems are architected the way they are.
Learning Objectives
By completing this capstone project, you will:
- Design a unified graphics model that captures PS and PDF semantics
- Implement multiple input parsers feeding one representation
- Implement multiple output backends from one representation
- Build production-quality software with error handling, logging, and testing
- Create a usable interface (CLI and optionally web)
- Understand the architecture of professional document processors
The Core Question You're Answering
"How do you build a document processing pipeline that can handle any input format and produce any output format?"
This is the fundamental architecture question behind:
- Ghostscript: PostScript/PDF → PDF/PNG/printer
- Cairo: Abstract graphics → PDF/SVG/PNG/X11
- Print servers: Application → printer language
- Browser engines: HTML/CSS → screen/PDF
The answer is the Intermediate Representation (IR) pattern:
┌───────────────────────────────────────────────────────────────┐
│            THE INTERMEDIATE REPRESENTATION PATTERN            │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   INPUTS                     IR               OUTPUTS         │
│   ──────                     ──               ───────         │
│                                                               │
│   PostScript ────┐                         ┌────▶ PDF         │
│                  │                         │                  │
│   PDF ───────────┼───▶ Graphics Model ─────┼────▶ SVG         │
│                  │                         │                  │
│   Custom Markup ─┘                         ├────▶ PNG         │
│                                            │                  │
│                                            └────▶ Printer     │
│                                                               │
│   N inputs + M outputs = N+M implementations                  │
│   (Not N×M implementations!)                                  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Deep Theoretical Foundation
1. The Graphics Model
Your unified graphics model must capture all operations from all input formats:
┌───────────────────────────────────────────────────────────────┐
│                    UNIFIED GRAPHICS MODEL                     │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  DOCUMENT                                                     │
│  ├── Metadata (title, author, creation date)                  │
│  ├── Resources (fonts, images, color profiles)                │
│  └── Pages[]                                                  │
│      └── Page                                                 │
│          ├── Size (width, height)                             │
│          ├── Resources (fonts, images)                        │
│          └── Operations[]                                     │
│              ├── Path Operations                              │
│              │   ├── MoveTo(x, y)                             │
│              │   ├── LineTo(x, y)                             │
│              │   ├── CurveTo(x1,y1, x2,y2, x3,y3)             │
│              │   └── ClosePath()                              │
│              ├── Paint Operations                             │
│              │   ├── Stroke(path, color, width)               │
│              │   ├── Fill(path, color, rule)                  │
│              │   └── Clip(path, rule)                         │
│              ├── Text Operations                              │
│              │   ├── SetFont(name, size)                      │
│              │   ├── ShowText(string, x, y)                   │
│              │   └── ShowGlyphs(glyphs[], positions[])        │
│              ├── Image Operations                             │
│              │   └── DrawImage(image, matrix)                 │
│              ├── State Operations                             │
│              │   ├── Save()                                   │
│              │   ├── Restore()                                │
│              │   ├── SetColor(color)                          │
│              │   ├── SetLineWidth(width)                      │
│              │   └── ConcatMatrix(matrix)                     │
│              └── Group Operations                             │
│                  ├── BeginGroup(transparency)                 │
│                  └── EndGroup()                               │
│                                                               │
└───────────────────────────────────────────────────────────────┘
2. Pipeline Architecture
┌───────────────────────────────────────────────────────────────┐
│                      PROCESSING PIPELINE                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Stage 1: INPUT PARSING                                       │
│  ──────────────────────                                       │
│                                                               │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐       │
│  │  PS Parser   │   │  PDF Parser  │   │ Markup Parser│       │
│  │              │   │              │   │              │       │
│  │ Tokenize     │   │ Parse xref   │   │ Parse custom │       │
│  │ Execute      │   │ Dereference  │   │ syntax       │       │
│  │ Capture ops  │   │ Extract ops  │   │ Build ops    │       │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘       │
│         │                  │                  │               │
│         └──────────────────┼──────────────────┘               │
│                            ▼                                  │
│  Stage 2: GRAPHICS MODEL                                      │
│  ───────────────────────                                      │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                    Document Object                     │   │
│  │                                                        │   │
│  │  • Page list with operations                           │   │
│  │  • Font references                                     │   │
│  │  • Image data                                          │   │
│  │  • Metadata                                            │   │
│  └─────────────────────────┬──────────────────────────────┘   │
│                            │                                  │
│       ┌──────────────┬─────┴────────┬──────────────┐          │
│       ▼              ▼              ▼              ▼          │
│  Stage 3: OUTPUT RENDERING                                    │
│  ─────────────────────────                                    │
│                                                               │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│  │PDF Writer│   │SVG Writer│   │PNG Render│   │PCL Writer│    │
│  │          │   │          │   │          │   │          │    │
│  │Generate  │   │Generate  │   │Rasterize │   │Generate  │    │
│  │PDF objs  │   │SVG XML   │   │to bitmap │   │printer   │    │
│  │+ xref    │   │          │   │          │   │commands  │    │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘    │
│                                                               │
└───────────────────────────────────────────────────────────────┘
3. The Device Abstraction
Like Ghostscript, your system uses a device abstraction:
// Abstract output device interface
#include <stdbool.h>

typedef struct Device {
    // Device identification
    const char* name;
    DeviceType type;              // VECTOR, RASTER, STREAM

    // Device capabilities
    bool supports_color;
    bool supports_transparency;
    int max_resolution;

    // Device procedures (each returns 0 on success, nonzero on error)
    int (*open)(struct Device* dev, const char* output);
    int (*close)(struct Device* dev);

    // Graphics operations
    int (*begin_page)(struct Device* dev, double width, double height);
    int (*end_page)(struct Device* dev);
    int (*set_color)(struct Device* dev, Color color);
    int (*set_line_width)(struct Device* dev, double width);
    int (*set_transform)(struct Device* dev, Matrix matrix);
    int (*move_to)(struct Device* dev, double x, double y);
    int (*line_to)(struct Device* dev, double x, double y);
    int (*curve_to)(struct Device* dev, double x1, double y1,
                    double x2, double y2, double x3, double y3);
    int (*close_path)(struct Device* dev);
    int (*stroke)(struct Device* dev);
    int (*fill)(struct Device* dev, FillRule rule);
    int (*draw_text)(struct Device* dev, const char* text,
                     double x, double y, Font* font);
    int (*draw_image)(struct Device* dev, Image* img, Matrix transform);
    int (*save)(struct Device* dev);
    int (*restore)(struct Device* dev);

    // Device-specific data
    void* private_data;
} Device;
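The point of the function-pointer table is that one loop can replay a page's operation list on any backend. The sketch below shows that dispatch under simplifying assumptions: `Operation` and `Device` are cut down to four operations so the block compiles on its own, and `replay_operations`, `count_xy`, and `count_noarg` are hypothetical names that do not appear elsewhere in this guide.

```c
#include <stddef.h>

/* Reduced stand-ins for the guide's types, so the sketch compiles alone. */
typedef enum { OP_MOVE_TO, OP_LINE_TO, OP_CLOSE_PATH, OP_STROKE } OpType;

typedef struct {
    OpType type;
    double x, y;                 /* used by OP_MOVE_TO and OP_LINE_TO only */
} Operation;

typedef struct Device Device;
struct Device {
    int (*move_to)(Device* dev, double x, double y);
    int (*line_to)(Device* dev, double x, double y);
    int (*close_path)(Device* dev);
    int (*stroke)(Device* dev);
    void* private_data;
};

/* Replay an operation list on any device. Returns 0 on success, or the
 * first nonzero code reported by a device procedure. */
int replay_operations(Device* dev, const Operation* ops, size_t count) {
    for (size_t i = 0; i < count; i++) {
        int rc;
        switch (ops[i].type) {
        case OP_MOVE_TO:    rc = dev->move_to(dev, ops[i].x, ops[i].y); break;
        case OP_LINE_TO:    rc = dev->line_to(dev, ops[i].x, ops[i].y); break;
        case OP_CLOSE_PATH: rc = dev->close_path(dev); break;
        case OP_STROKE:     rc = dev->stroke(dev); break;
        default:            rc = -1; break;   /* unknown operation */
        }
        if (rc != 0) return rc;
    }
    return 0;
}

/* A trivial "counting" device: each call increments the int that
 * private_data points to. Handy for testing the dispatcher itself. */
int count_xy(Device* dev, double x, double y) {
    (void)x; (void)y;
    ++*(int*)dev->private_data;
    return 0;
}

int count_noarg(Device* dev) {
    ++*(int*)dev->private_data;
    return 0;
}
```

A real `device_render_document` would wrap a loop like this in `begin_page`/`end_page` calls for each page; the counting device is only there so the dispatcher can be exercised before any real backend exists.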
Project Specification
Core Features
Input Formats
- PostScript (subset)
  - Path operations
  - Paint operations
  - Transformations
  - Basic text
- PDF (subset)
  - Object parsing
  - Content stream interpretation
  - Basic fonts
- Custom Markup Language
  - Simple, readable syntax for document creation
  - Example:
    page 612x792
      rect 100 100 200 300 fill=#ff0000
      text "Hello" at 200 500 font="Helvetica" size=24
      line from 0 0 to 612 792 stroke=#000000 width=2
    endpage
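Because the markup is line-based, a first parser can lean on `sscanf`. The sketch below handles only the `rect` line from the example. It assumes the four numbers mean x, y, width, height (the example does not say), and `DmlRect` and `dml_parse_rect` are hypothetical names.

```c
#include <stdio.h>

/* Parsed form of one "rect" line. Field meanings are an assumption. */
typedef struct {
    double x, y, w, h;
    unsigned r, g, b;            /* fill color, each 0..255 */
} DmlRect;

/* Parse a line such as: rect 100 100 200 300 fill=#ff0000
 * Returns 1 and fills *out on success; returns 0 and leaves *out
 * untouched if the line is not a well-formed rect. */
int dml_parse_rect(const char* line, DmlRect* out) {
    DmlRect tmp;
    int n = sscanf(line, "rect %lf %lf %lf %lf fill=#%2x%2x%2x",
                   &tmp.x, &tmp.y, &tmp.w, &tmp.h,
                   &tmp.r, &tmp.g, &tmp.b);
    if (n != 7) return 0;        /* all seven fields must convert */
    *out = tmp;
    return 1;
}
```

A fuller parser would read the keyword first and dispatch to one such function per element (`rect`, `text`, `line`), which keeps each grammar rule small and separately testable.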
Output Formats
- PDF
  - Valid PDF 1.4+ output
  - Optional compression
- SVG
  - Vector graphics output
  - Viewable in browsers
- PNG
  - Rasterized output
  - Configurable DPI
- Printer (optional)
  - PCL or ESC/P commands
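For the PNG backend, "configurable DPI" comes down to one conversion: PostScript and PDF measure in points (72 per inch), so a length in points maps to `points * dpi / 72` pixels. A minimal sketch, where `points_to_pixels` is a hypothetical helper and rounding up (so the page is never cropped) is a design assumption:

```c
/* Convert a length in points (1/72 inch) to whole pixels at the given
 * DPI, rounding up so a fractional last pixel is not lost. */
int points_to_pixels(double points, int dpi) {
    double exact = points * dpi / 72.0;
    int whole = (int)exact;               /* truncates toward zero */
    return (exact > whole) ? whole + 1 : whole;
}
```

At 300 DPI a US Letter page (612 x 792 points) therefore needs a 2550 x 3300 pixel bitmap, which is worth keeping in mind when sizing raster buffers.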
Command-Line Interface
# Convert PostScript to PDF
./docpipe convert input.ps -o output.pdf
# Convert PDF to PNG at 300 DPI
./docpipe convert input.pdf -o output.png --dpi 300
# Convert custom markup to SVG
./docpipe convert input.dml -o output.svg
# Specify input/output formats explicitly
./docpipe convert --from=postscript --to=pdf input.ps output.pdf
# Multi-page to single pages
./docpipe convert input.pdf -o page_%d.png
# List supported formats
./docpipe formats
Web Interface (Optional)
┌───────────────────────────────────────────────────────────────┐
│  Document Processing Pipeline                            [?]  │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │  Upload Document                                        │  │
│  │  ───────────────                                        │  │
│  │           [ Drag & drop or click to select ]           │  │
│  │                                                         │  │
│  │  Supported: .ps, .pdf, .dml                             │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
│  Output Format: [PDF ▼]   DPI: [150]   Pages: [All ▼]         │
│                                                               │
│                          [ Convert ]                          │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │  Preview                                                │  │
│  │  ───────                                                │  │
│  │                                                         │  │
│  │         [ Rendered preview will appear here ]          │  │
│  │                                                         │  │
│  │  [< Page 1 of 3 >]                      [Download PDF]  │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Solution Architecture
High-Level Design
┌───────────────────────────────────────────────────────────────┐
│                 DOCUMENT PROCESSING PIPELINE                  │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                       CLI / Web API                       │ │
│ │                                                           │ │
│ │  docpipe convert input.ps -o output.pdf                   │ │
│ │  POST /api/convert { file, output_format }                │ │
│ └─────────────────────────────┬─────────────────────────────┘ │
│                               │                               │
│                               ▼                               │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                      FORMAT DETECTOR                      │ │
│ │                                                           │ │
│ │  Detect input format from extension or magic bytes        │ │
│ │  Select appropriate parser                                │ │
│ └─────────────────────────────┬─────────────────────────────┘ │
│                               │                               │
│            ┌──────────────────┼──────────────────┐            │
│            ▼                  ▼                  ▼            │
│     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│     │ PS Parser   │    │ PDF Parser  │    │ DML Parser  │     │
│     │             │    │             │    │             │     │
│     │ Stack-based │    │ Object-     │    │ Line-based  │     │
│     │ interpreter │    │ oriented    │    │ parser      │     │
│     └──────┬──────┘    └──────┬──────┘    └──────┬──────┘     │
│            │                  │                  │            │
│            └──────────────────┼──────────────────┘            │
│                               ▼                               │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                      GRAPHICS MODEL                       │ │
│ │                                                           │ │
│ │  Document                                                 │ │
│ │  ├── metadata: {...}                                      │ │
│ │  ├── fonts: [Font, ...]                                   │ │
│ │  ├── images: [Image, ...]                                 │ │
│ │  └── pages: [                                             │ │
│ │        Page { size, operations: [Op, ...] }               │ │
│ │      ]                                                    │ │
│ └─────────────────────────────┬─────────────────────────────┘ │
│                               │                               │
│         ┌─────────────────┬───┴───────────┬───────────┐       │
│         ▼                 ▼               ▼           ▼       │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────┐    ...      │
│  │ PDF Device  │   │ SVG Device  │   │PNG Device│             │
│  │             │   │             │   │          │             │
│  │ Generate    │   │ Generate    │   │Rasterize │             │
│  │ PDF objects │   │ SVG XML     │   │+ encode  │             │
│  └──────┬──────┘   └──────┬──────┘   └────┬─────┘             │
│         │                 │               │                   │
│         └─────────────────┴───┬───────────┘                   │
│                               ▼                               │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                        OUTPUT FILE                        │ │
│ └───────────────────────────────────────────────────────────┘ │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Key Data Structures
#include <stddef.h>   // size_t
#include <time.h>     // time_t

// Enumerations referenced below. The enumerator names are illustrative;
// the values come from the comments on the fields that use them.
typedef enum { COLORSPACE_RGB, COLORSPACE_CMYK, COLORSPACE_GRAY } ColorSpace;
typedef enum { FILL_NONZERO, FILL_EVEN_ODD } FillRule;
typedef enum { IMAGE_RAW, IMAGE_JPEG, IMAGE_PNG } ImageFormat;

// Color representation
typedef struct {
    ColorSpace space;                 // RGB, CMYK, Gray
    union {
        struct { double r, g, b; } rgb;
        struct { double c, m, y, k; } cmyk;
        double gray;
    } value;
    double alpha;
} Color;

// 2D transformation matrix
typedef struct {
    double a, b, c, d, tx, ty;
} Matrix;

// Path segment
typedef enum {
    PATH_MOVE, PATH_LINE, PATH_CURVE, PATH_CLOSE
} PathSegmentType;

typedef struct {
    PathSegmentType type;
    double x, y;
    double x1, y1, x2, y2;            // Control points for curves
} PathSegment;

// Path (growable array of segments)
typedef struct {
    PathSegment* segments;
    size_t count;
    size_t capacity;
} Path;

// Graphics operation
typedef enum {
    OP_MOVE_TO, OP_LINE_TO, OP_CURVE_TO, OP_CLOSE_PATH,
    OP_STROKE, OP_FILL, OP_CLIP,
    OP_SET_COLOR, OP_SET_LINE_WIDTH, OP_SET_TRANSFORM,
    OP_DRAW_TEXT, OP_DRAW_IMAGE,
    OP_SAVE, OP_RESTORE
} OpType;

typedef struct {
    OpType type;                      // Selects which union member is valid
    union {
        struct { double x, y; } point;
        struct { double x1, y1, x2, y2, x3, y3; } curve;
        struct { Color color; } set_color;
        struct { double width; } set_line_width;
        struct { Matrix matrix; } set_transform;
        struct { char* text; double x, y; int font_id; } draw_text;
        struct { int image_id; Matrix transform; } draw_image;
        struct { FillRule rule; } fill;   // Can also carry the rule for OP_CLIP
    } data;
} Operation;

// Page
typedef struct {
    double width, height;
    Operation* operations;
    size_t op_count;
    size_t op_capacity;
} Page;

// Font resource
typedef struct {
    int id;
    char* name;
    char* family;
    double size;
    unsigned char* data;              // Embedded font data
    size_t data_len;
} Font;

// Image resource
typedef struct {
    int id;
    int width, height;
    int channels;
    unsigned char* data;
    size_t data_len;
    ImageFormat format;               // RAW, JPEG, PNG
} Image;

// Document (the unified representation)
typedef struct {
    char* title;
    char* author;
    time_t creation_date;
    Font* fonts;
    size_t font_count;
    Image* images;
    size_t image_count;
    Page* pages;
    size_t page_count;
} Document;
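The six-field `Matrix` follows the PostScript/PDF layout `[a b c d tx ty]`, where a point maps as x' = a·x + c·y + tx and y' = b·x + d·y + ty. A small sketch of the two helpers most backends end up needing; the function names are hypothetical, and `matrix_multiply(m, n)` is defined here to mean "apply m first, then n".

```c
typedef struct {
    double a, b, c, d, tx, ty;
} Matrix;

/* Translation by (dx, dy). */
Matrix matrix_translate(double dx, double dy) {
    Matrix m = { 1, 0, 0, 1, dx, dy };
    return m;
}

/* Scaling by (sx, sy) about the origin. */
Matrix matrix_scale(double sx, double sy) {
    Matrix m = { sx, 0, 0, sy, 0, 0 };
    return m;
}

/* Combined transform that applies m first and then n. */
Matrix matrix_multiply(Matrix m, Matrix n) {
    Matrix r;
    r.a  = m.a  * n.a + m.b  * n.c;
    r.b  = m.a  * n.b + m.b  * n.d;
    r.c  = m.c  * n.a + m.d  * n.c;
    r.d  = m.c  * n.b + m.d  * n.d;
    r.tx = m.tx * n.a + m.ty * n.c + n.tx;
    r.ty = m.tx * n.b + m.ty * n.d + n.ty;
    return r;
}

/* Map the point (x, y) through m, writing the result to (*ox, *oy). */
void matrix_apply(Matrix m, double x, double y, double* ox, double* oy) {
    *ox = m.a * x + m.c * y + m.tx;
    *oy = m.b * x + m.d * y + m.ty;
}
```

For example, translating by (10, 20) and then scaling by 2 sends the point (1, 1) to (22, 42). Order matters: scaling first and translating second would give (12, 22) instead.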
Implementation Guide
Phase 1: Graphics Model (Week 1)
Start with the core data structures:
// graphics.h - Graphics model API

// Document lifecycle
Document* doc_create(void);
void doc_destroy(Document* doc);

// Resources (each returns the new resource id)
Page* doc_add_page(Document* doc, double width, double height);
int doc_add_font(Document* doc, const char* name, const char* family);
int doc_add_image(Document* doc, int width, int height,
                  const unsigned char* data, size_t len);

// Path construction
void page_move_to(Page* page, double x, double y);
void page_line_to(Page* page, double x, double y);
void page_curve_to(Page* page, double x1, double y1,
                   double x2, double y2, double x3, double y3);
void page_close_path(Page* page);

// Painting
void page_stroke(Page* page);
void page_fill(Page* page, FillRule rule);

// Graphics state
void page_set_color(Page* page, Color color);
void page_set_line_width(Page* page, double width);
void page_set_transform(Page* page, Matrix m);

// Text and images
void page_draw_text(Page* page, const char* text,
                    double x, double y, int font_id);
void page_draw_image(Page* page, int image_id, Matrix transform);

// State stack
void page_save(Page* page);
void page_restore(Page* page);
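Every `page_*` function above ends up appending one `Operation` to the page's array, so it helps to write that append once. The sketch below uses capacity doubling with `realloc`. It is an assumption-laden reduction: `Operation` is trimmed to three fields, and `page_append` and `page_add_line_to` are hypothetical names that return an error code, unlike the `void` functions declared above.

```c
#include <stdlib.h>

/* Trimmed stand-ins so the sketch compiles on its own. */
typedef enum { OP_MOVE_TO, OP_LINE_TO, OP_STROKE } OpType;

typedef struct {
    OpType type;
    double x, y;
} Operation;

typedef struct {
    double width, height;
    Operation* operations;
    size_t op_count;
    size_t op_capacity;
} Page;

/* Append one operation, doubling the array when it is full.
 * Returns 0 on success, -1 if memory could not be allocated
 * (the page is left unchanged in that case). */
int page_append(Page* page, Operation op) {
    if (page->op_count == page->op_capacity) {
        size_t new_cap = page->op_capacity ? page->op_capacity * 2 : 16;
        Operation* grown = realloc(page->operations, new_cap * sizeof *grown);
        if (!grown) return -1;
        page->operations = grown;
        page->op_capacity = new_cap;
    }
    page->operations[page->op_count++] = op;
    return 0;
}

/* Example wrapper: record a LineTo operation. */
int page_add_line_to(Page* page, double x, double y) {
    Operation op = { OP_LINE_TO, x, y };
    return page_append(page, op);
}
```

Doubling keeps appends amortized constant time, which matters once a parser is emitting hundreds of thousands of operations for a single page.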
Phase 2: Output Devices (Weeks 2-3)
Implement the device interface for each output format:
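// device.h - Device abstraction
#include <stdio.h>
#include <stdlib.h>

typedef struct Device Device;

// Create devices (each returns NULL on failure)
Device* pdf_device_create(const char* output_path);
Device* svg_device_create(const char* output_path);
Device* png_device_create(const char* output_path, int dpi);

// Common device operations
void device_destroy(Device* dev);
int device_render_document(Device* dev, Document* doc);

// --- PDF Device Implementation ---
// Buffer and buffer_append() are assumed helpers for a growable byte
// buffer; they are not defined in this guide.
typedef struct {
    Device base;        // Must stay first so Device* and PDFDevice* casts are valid
    FILE* file;
    Buffer* content;    // Content stream of the page being built
    long* offsets;      // Byte offset of each object, needed for the xref table
    int num_objects;
    int next_obj_num;
    // PDF-specific state
} PDFDevice;

static int pdf_end_page(Device* dev);   // Defined with the remaining operations

static int pdf_begin_page(Device* dev, double width, double height) {
    PDFDevice* pdf = (PDFDevice*)dev;
    (void)pdf; (void)width; (void)height;
    // Start accumulating content stream
    return 0;
}

static int pdf_stroke(Device* dev) {
    PDFDevice* pdf = (PDFDevice*)dev;
    // Add "S" (the PDF stroke operator) to the content stream
    buffer_append(pdf->content, "S\n");
    return 0;
}

// ... implement all operations ...

Device* pdf_device_create(const char* output_path) {
    PDFDevice* pdf = calloc(1, sizeof(PDFDevice));
    if (!pdf) return NULL;

    pdf->base.name = "pdfwrite";
    pdf->base.begin_page = pdf_begin_page;
    pdf->base.end_page = pdf_end_page;
    pdf->base.stroke = pdf_stroke;
    // ... set all function pointers ...

    // Binary mode keeps byte offsets exact on every platform
    pdf->file = fopen(output_path, "wb");
    if (!pdf->file) {
        free(pdf);
        return NULL;
    }
    return (Device*)pdf;
}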
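The `offsets` array exists because a PDF file ends with a cross-reference table listing the byte offset of every object. The sketch below writes a complete single-page PDF to show the bookkeeping. It is a minimal illustration under stated assumptions: `write_minimal_pdf` is a hypothetical function, the object numbering (1 catalog, 2 page tree, 3 page, 4 content stream) is hard-coded, and the stream must be opened in binary mode so `ftell` reports true byte offsets.

```c
#include <stdio.h>
#include <string.h>

/* Write a one-page PDF whose content stream is `content`.
 * Returns 0 on success, -1 if any write failed. */
int write_minimal_pdf(FILE* f, double width, double height, const char* content) {
    long offsets[5] = {0};            /* offsets[n] = byte offset of object n */
    size_t content_len = strlen(content);

    fprintf(f, "%%PDF-1.4\n");

    offsets[1] = ftell(f);
    fprintf(f, "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n");

    offsets[2] = ftell(f);
    fprintf(f, "2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n");

    offsets[3] = ftell(f);
    fprintf(f, "3 0 obj\n<< /Type /Page /Parent 2 0 R /Resources << >> "
               "/MediaBox [0 0 %g %g] /Contents 4 0 R >>\nendobj\n",
            width, height);

    offsets[4] = ftell(f);
    fprintf(f, "4 0 obj\n<< /Length %zu >>\nstream\n", content_len);
    fwrite(content, 1, content_len, f);
    fprintf(f, "\nendstream\nendobj\n");

    /* Cross-reference table: every entry is exactly 20 bytes long. */
    long xref_pos = ftell(f);
    fprintf(f, "xref\n0 5\n");
    fprintf(f, "0000000000 65535 f \n");          /* object 0 is always free */
    for (int i = 1; i <= 4; i++) {
        fprintf(f, "%010ld 00000 n \n", offsets[i]);
    }

    fprintf(f, "trailer\n<< /Size 5 /Root 1 0 R >>\n");
    fprintf(f, "startxref\n%ld\n%%%%EOF\n", xref_pos);

    return ferror(f) ? -1 : 0;
}
```

Calling it with the content string `"100 100 m 200 200 l S"` draws the same diagonal line the tests in this guide use. In the real device, `pdf_end_page` would flush the accumulated content buffer as a stream object and record its offset in the same way.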
Phase 3: Input Parsers (Weeks 4-6)
Implement parsers that produce the graphics model:
// parser.h - Input format parsers
#include <stdio.h>

typedef enum {
    FORMAT_UNKNOWN,
    FORMAT_POSTSCRIPT,
    FORMAT_PDF,
    FORMAT_DML
} InputFormat;

InputFormat detect_format(const char* filename, const unsigned char* data, size_t len);
Document* parse_postscript(const char* filename);
Document* parse_pdf(const char* filename);
Document* parse_dml(const char* filename);

// Unified parse function: sniff the header, then hand off to one parser.
// Returns NULL if the file cannot be opened or the format is unknown.
Document* parse_document(const char* filename) {
    FILE* f = fopen(filename, "rb");
    if (!f) {
        fprintf(stderr, "Cannot open %s\n", filename);
        return NULL;
    }

    unsigned char header[32];
    // fread may return fewer than 32 bytes for very small files,
    // so pass the real count on rather than sizeof(header)
    size_t n = fread(header, 1, sizeof(header), f);
    fclose(f);

    InputFormat format = detect_format(filename, header, n);
    switch (format) {
    case FORMAT_POSTSCRIPT:
        return parse_postscript(filename);
    case FORMAT_PDF:
        return parse_pdf(filename);
    case FORMAT_DML:
        return parse_dml(filename);
    default:
        fprintf(stderr, "Unknown format: %s\n", filename);
        return NULL;
    }
}
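`detect_format` is declared above but not shown. One plausible body, sketched here, checks magic bytes first and falls back to the file extension: PDF files begin with `%PDF-` and PostScript files begin with `%!`. The custom markup has no signature in this guide, so recognizing it by the `.dml` extension alone is an assumption.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    FORMAT_UNKNOWN,
    FORMAT_POSTSCRIPT,
    FORMAT_PDF,
    FORMAT_DML
} InputFormat;

/* Identify the input format from the first bytes of the file, falling
 * back to the filename extension when no known signature matches. */
InputFormat detect_format(const char* filename, const unsigned char* data, size_t len) {
    /* Magic bytes are more trustworthy than a user-supplied name. */
    if (len >= 5 && memcmp(data, "%PDF-", 5) == 0) return FORMAT_PDF;
    if (len >= 2 && memcmp(data, "%!", 2) == 0)    return FORMAT_POSTSCRIPT;

    /* Fallback: compare the text after the last dot. */
    const char* ext = filename ? strrchr(filename, '.') : NULL;
    if (ext) {
        if (strcmp(ext, ".pdf") == 0) return FORMAT_PDF;
        if (strcmp(ext, ".ps") == 0)  return FORMAT_POSTSCRIPT;
        if (strcmp(ext, ".dml") == 0) return FORMAT_DML;
    }
    return FORMAT_UNKNOWN;
}
```

Checking content before the extension means a PDF uploaded as `report.ps` is still routed to the PDF parser, which is the safer behavior for the web interface.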
Phase 4: CLI Tool (Week 7)
// main.c - Command-line interface
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include "graphics.h"
#include "parser.h"
#include "device.h"

static const char* g_prog = "docpipe";   // Program name shown in usage text

void print_usage(const char* prog) {
    printf("Usage: %s convert [options] input [output]\n", prog);
    printf("\nOptions:\n");
    printf("  -o, --output=FILE  Output file (instead of the second argument)\n");
    printf("  --from=FORMAT      Input format (ps, pdf, dml)\n");
    printf("  --to=FORMAT        Output format (pdf, svg, png)\n");
    printf("  --dpi=N            DPI for raster output (default: 150)\n");
    printf("  --pages=RANGE      Pages to convert (e.g., 1-5, 2,4,6)\n");
    printf("  -v, --verbose      Verbose output\n");
    printf("  -h, --help         Show this help\n");
}

int cmd_convert(int argc, char** argv) {
    int dpi = 150;
    int verbose = 0;
    const char* output = NULL;
    // from_format and pages are parsed here but not yet applied;
    // wire them into parse_document and the render loop as you go.
    const char* from_format = NULL;
    const char* to_format = NULL;
    const char* pages = "all";

    static struct option long_options[] = {
        {"output",  required_argument, 0, 'o'},
        {"from",    required_argument, 0, 'f'},
        {"to",      required_argument, 0, 't'},
        {"dpi",     required_argument, 0, 'd'},
        {"pages",   required_argument, 0, 'p'},
        {"verbose", no_argument,       0, 'v'},
        {"help",    no_argument,       0, 'h'},
        {0, 0, 0, 0}
    };

    int c;
    while ((c = getopt_long(argc, argv, "o:f:t:d:p:vh", long_options, NULL)) != -1) {
        switch (c) {
        case 'o': output = optarg; break;
        case 'f': from_format = optarg; break;
        case 't': to_format = optarg; break;
        case 'd':
            dpi = atoi(optarg);
            if (dpi <= 0) {
                fprintf(stderr, "Error: Invalid DPI: %s\n", optarg);
                return 1;
            }
            break;
        case 'p': pages = optarg; break;
        case 'v': verbose = 1; break;
        case 'h': print_usage(g_prog); return 0;
        default:  print_usage(g_prog); return 1;   // Unknown option
        }
    }
    (void)from_format;
    (void)pages;

    if (optind >= argc) {
        fprintf(stderr, "Error: Missing input file\n");
        print_usage(g_prog);
        return 1;
    }
    const char* input = argv[optind++];

    // The output comes from -o if given, else from the next argument
    if (!output) {
        if (optind >= argc) {
            fprintf(stderr, "Error: Missing output file\n");
            print_usage(g_prog);
            return 1;
        }
        output = argv[optind];
    }

    if (verbose) {
        printf("Converting %s to %s\n", input, output);
    }

    // Parse input
    Document* doc = parse_document(input);
    if (!doc) {
        fprintf(stderr, "Error: Failed to parse %s\n", input);
        return 1;
    }

    // Create output device
    Device* dev = NULL;
    if (to_format && strcmp(to_format, "pdf") == 0) {
        dev = pdf_device_create(output);
    } else if (to_format && strcmp(to_format, "svg") == 0) {
        dev = svg_device_create(output);
    } else if (to_format && strcmp(to_format, "png") == 0) {
        dev = png_device_create(output, dpi);
    } else {
        // Detect from extension
        const char* ext = strrchr(output, '.');
        if (ext && strcmp(ext, ".pdf") == 0) {
            dev = pdf_device_create(output);
        } else if (ext && strcmp(ext, ".svg") == 0) {
            dev = svg_device_create(output);
        } else if (ext && strcmp(ext, ".png") == 0) {
            dev = png_device_create(output, dpi);
        }
    }
    if (!dev) {
        fprintf(stderr, "Error: Unknown output format or cannot create %s\n", output);
        doc_destroy(doc);
        return 1;
    }

    // Render
    int result = device_render_document(dev, doc);

    // Cleanup
    device_destroy(dev);
    doc_destroy(doc);

    if (result == 0 && verbose) {
        printf("Successfully created %s\n", output);
    }
    return result;
}

int main(int argc, char** argv) {
    g_prog = argv[0];
    if (argc < 2) {
        print_usage(g_prog);
        return 1;
    }
    if (strcmp(argv[1], "convert") == 0) {
        // Shift argv so getopt sees "convert" as the program name
        return cmd_convert(argc - 1, argv + 1);
    } else if (strcmp(argv[1], "formats") == 0) {
        printf("Input formats: ps, pdf, dml\n");
        printf("Output formats: pdf, svg, png\n");
        return 0;
    } else {
        fprintf(stderr, "Unknown command: %s\n", argv[1]);
        print_usage(g_prog);
        return 1;
    }
}
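The `--pages` option accepts specs such as `1-5` or `2,4,6`, but the CLI above only stores the string. One way to interpret it is sketched below: the spec is turned into a per-page selection array. `parse_page_range` is a hypothetical helper, and the choices to accept `all`, to reject reversed ranges, and to silently ignore page numbers beyond the document are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Set selected[i] = 1 for every 1-based page number i+1 named in spec.
 * Accepts "all", single pages, ranges, and comma-separated mixes such
 * as "1-3,5". Returns 0 on success, -1 if the spec is malformed. */
int parse_page_range(const char* spec, unsigned char* selected, size_t page_count) {
    memset(selected, 0, page_count);
    if (*spec == '\0') return -1;
    if (strcmp(spec, "all") == 0) {
        memset(selected, 1, page_count);
        return 0;
    }

    const char* p = spec;
    while (*p) {
        char* end;
        long first = strtol(p, &end, 10);
        if (end == p || first < 1) return -1;      /* no digits, or page 0 */
        long last = first;
        p = end;

        if (*p == '-') {                           /* range: first-last */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first) return -1;
            p = end;
        }

        /* Pages past the end of the document are ignored. */
        for (long n = first; n <= last && (size_t)n <= page_count; n++) {
            selected[n - 1] = 1;
        }

        if (*p == ',') {
            p++;
            if (*p == '\0') return -1;             /* trailing comma */
        } else if (*p != '\0') {
            return -1;                             /* unexpected character */
        }
    }
    return 0;
}
```

The render loop can then skip any page whose `selected` entry is 0, which keeps page filtering out of the individual devices.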
Phase 5: Web Interface (Week 8)
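# web/app.py - Flask web interface
from flask import Flask, request, jsonify, send_file
import io
import os
import subprocess
import tempfile

app = Flask(__name__)

OUTPUT_EXTENSIONS = {'pdf': '.pdf', 'svg': '.svg', 'png': '.png'}


@app.route('/')
def index():
    return app.send_static_file('index.html')


@app.route('/api/convert', methods=['POST'])
def convert():
    if 'file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
    file = request.files['file']
    output_format = request.form.get('format', 'pdf')

    # Validate user input before touching the disk
    if output_format not in OUTPUT_EXTENSIONS:
        return jsonify({'error': f'Unsupported format: {output_format}'}), 400
    try:
        dpi = int(request.form.get('dpi', '150'))
    except ValueError:
        return jsonify({'error': 'DPI must be an integer'}), 400
    if dpi <= 0:
        return jsonify({'error': 'DPI must be positive'}), 400

    # Save uploaded file
    suffix = os.path.splitext(file.filename or '')[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        file.save(tmp)
        input_path = tmp.name

    # Create output file (mkstemp avoids the race condition in mktemp)
    output_ext = OUTPUT_EXTENSIONS[output_format]
    fd, output_path = tempfile.mkstemp(suffix=output_ext)
    os.close(fd)

    try:
        # Run converter (options first, then input and output paths)
        cmd = ['./docpipe', 'convert', '--to', output_format]
        if output_format == 'png':
            cmd.extend(['--dpi', str(dpi)])
        cmd.extend([input_path, output_path])
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return jsonify({'error': result.stderr}), 500
        with open(output_path, 'rb') as f:
            data = f.read()
    finally:
        # Clean up both temporary files, whether or not conversion succeeded
        os.unlink(input_path)
        os.unlink(output_path)

    # Return output file from memory, since the temporary file is gone
    return send_file(
        io.BytesIO(data),
        as_attachment=True,
        download_name=f'converted{output_ext}'
    )


@app.route('/api/formats')
def formats():
    return jsonify({
        'input': ['ps', 'pdf', 'dml'],
        'output': ['pdf', 'svg', 'png']
    })


if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True, port=8080)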
Phase 6: Testing and Polish (Weeks 9-10)
Create comprehensive tests:
// test_graphics.c - Graphics model tests
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>   // system
#include <unistd.h>   // unlink
#include "graphics.h"
#include "device.h"
#include "parser.h"

void test_document_creation(void) {
    Document* doc = doc_create();
    assert(doc != NULL);
    assert(doc->page_count == 0);
    doc_destroy(doc);
    printf("PASS: test_document_creation\n");
}

void test_page_operations(void) {
    Document* doc = doc_create();
    Page* page = doc_add_page(doc, 612, 792);
    page_move_to(page, 100, 100);
    page_line_to(page, 200, 200);
    page_stroke(page);

    // Each call should have appended exactly one operation, in order
    assert(page->op_count == 3);
    assert(page->operations[0].type == OP_MOVE_TO);
    assert(page->operations[1].type == OP_LINE_TO);
    assert(page->operations[2].type == OP_STROKE);
    doc_destroy(doc);
    printf("PASS: test_page_operations\n");
}

void test_pdf_output(void) {
    Document* doc = doc_create();
    Page* page = doc_add_page(doc, 612, 792);
    page_move_to(page, 100, 100);
    page_line_to(page, 200, 200);
    page_stroke(page);

    Device* dev = pdf_device_create("test_output.pdf");
    assert(dev != NULL);
    int result = device_render_document(dev, doc);
    device_destroy(dev);
    assert(result == 0);

    // Validate output with external tool (requires qpdf on the PATH)
    int check = system("qpdf --check test_output.pdf > /dev/null 2>&1");
    assert(check == 0);

    unlink("test_output.pdf");
    doc_destroy(doc);
    printf("PASS: test_pdf_output\n");
}

void test_conversion_ps_to_pdf(void) {
    // Create test PostScript file
    FILE* f = fopen("test_input.ps", "w");
    assert(f != NULL);
    fprintf(f, "%%!PS-Adobe-3.0\n");
    fprintf(f, "100 100 moveto\n");
    fprintf(f, "200 200 lineto\n");
    fprintf(f, "stroke\n");
    fprintf(f, "showpage\n");
    fclose(f);

    Document* doc = parse_postscript("test_input.ps");
    assert(doc != NULL);
    assert(doc->page_count == 1);

    Device* dev = pdf_device_create("test_output.pdf");
    assert(dev != NULL);
    int result = device_render_document(dev, doc);
    device_destroy(dev);
    doc_destroy(doc);
    assert(result == 0);

    // Validate
    int check = system("qpdf --check test_output.pdf > /dev/null 2>&1");
    assert(check == 0);

    unlink("test_input.ps");
    unlink("test_output.pdf");
    printf("PASS: test_conversion_ps_to_pdf\n");
}

int main(void) {
    test_document_creation();
    test_page_operations();
    test_pdf_output();
    test_conversion_ps_to_pdf();
    return 0;
}
Testing Strategy
Unit Tests
- Test each data structure
- Test each operation
- Test each device
Integration Tests
- Test complete pipelines (input → output)
- Test all format combinations
- Compare with reference tools (Ghostscript, Cairo)
Visual Regression Tests
- Render test documents
- Compare images pixel-by-pixel
- Flag regressions
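The pixel-by-pixel comparison can be sketched as a function over two decoded bitmaps of equal size. Exact equality tends to be too strict once anti-aliasing is involved, so this version takes a per-channel tolerance. `count_differing_pixels` is a hypothetical name, and the interleaved 8-bit channel layout is an assumption about how your PNG device stores pixels.

```c
#include <stddef.h>
#include <stdlib.h>

/* Count the pixels whose value differs by more than `tolerance` in at
 * least one channel. Both buffers hold width*height pixels with
 * `channels` interleaved 8-bit samples each. */
size_t count_differing_pixels(const unsigned char* a, const unsigned char* b,
                              size_t width, size_t height, size_t channels,
                              int tolerance) {
    size_t differing = 0;
    for (size_t i = 0; i < width * height; i++) {
        for (size_t c = 0; c < channels; c++) {
            int delta = abs((int)a[i * channels + c] - (int)b[i * channels + c]);
            if (delta > tolerance) {
                differing++;
                break;            /* count each pixel at most once */
            }
        }
    }
    return differing;
}
```

A regression test can then fail when the count exceeds a small budget, for example a fraction of a percent of the page, rather than on the first differing pixel.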
Performance Tests
- Measure conversion time
- Profile memory usage
- Test with large documents
Common Pitfalls
1. Coordinate System Mismatches
Different formats use different coordinate systems:
- PostScript/PDF: Origin at bottom-left
- SVG: Origin at top-left
- PNG: Origin at top-left
Handle transformations consistently.
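One consistent approach is to keep the graphics model in the PostScript/PDF convention (origin at bottom-left, y pointing up) and let each top-left device apply a single flip. For a page of height H the flip is the matrix `[1 0 0 -1 0 H]`, which maps y to H - y. A minimal sketch, where `flip_y_matrix` and `flip_apply` are hypothetical names:

```c
typedef struct {
    double a, b, c, d, tx, ty;
} Matrix;

/* Matrix that converts bottom-left-origin coordinates to
 * top-left-origin coordinates for a page of the given height. */
Matrix flip_y_matrix(double page_height) {
    Matrix m = { 1, 0, 0, -1, 0, page_height };
    return m;
}

/* Map (x, y) through m using the PostScript/PDF convention. */
void flip_apply(Matrix m, double x, double y, double* ox, double* oy) {
    *ox = m.a * x + m.c * y + m.tx;
    *oy = m.b * x + m.d * y + m.ty;
}
```

On a 792-point page the point (100, 100) lands at (100, 692). SVG's `matrix(a b c d e f)` transform uses the same six-number ordering, so an SVG device could emit the flip once on a wrapping group. Note that a whole-page flip also mirrors text glyphs, so text usually needs a compensating flip of its own.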
2. Font Handling Complexity
Fonts are the hardest part:
- Font subsetting
- Glyph mapping
- Embedded vs. system fonts
Start with simple font references, add complexity later.
3. Memory Management
Documents can be large:
- Use streaming where possible
- Free resources promptly
- Consider memory pools
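A memory pool fits this workload well because everything belonging to a page can be released at the same moment. The sketch below is a fixed-size bump allocator, one simple form of pool. All names (`Arena`, `arena_init`, `arena_alloc`, `arena_reset`, `arena_destroy`) are hypothetical, and the 16-byte alignment is an assumption that suits common platforms.

```c
#include <stddef.h>
#include <stdlib.h>

#define ARENA_ALIGN 16   /* assumed to be sufficient for any stored type */

typedef struct {
    unsigned char* base;  /* start of the backing block */
    size_t size;          /* total bytes available */
    size_t used;          /* bytes handed out so far */
} Arena;

/* Allocate the backing block. Returns 0 on success, -1 on failure. */
int arena_init(Arena* a, size_t size) {
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Hand out n bytes, aligned to ARENA_ALIGN. Returns NULL when the
 * arena is full; individual allocations are never freed. */
void* arena_alloc(Arena* a, size_t n) {
    size_t start = (a->used + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    if (start > a->size || n > a->size - start) return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Discard every allocation at once, keeping the block for reuse. */
void arena_reset(Arena* a) {
    a->used = 0;
}

/* Release the backing block. */
void arena_destroy(Arena* a) {
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Allocating a page's path segments and strings from one arena and calling `arena_reset` after `end_page` replaces thousands of individual `free` calls with a single assignment.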
4. Thread Safety
For the web interface:
- Don't share mutable state
- Use separate processes for conversion
- Handle concurrent requests
Extensions
Level 1: More Input Formats
- Add SVG input
- Add HTML input (with simple CSS)
Level 2: More Output Formats
- Add TIFF output
- Add EPS output
- Add PCL for printers
Level 3: Advanced Features
- PDF encryption
- PDF/A compliance
- Transparency groups
Level 4: Optimization
- Parallel rendering
- GPU acceleration
- Caching
Self-Assessment
Before considering this capstone complete:
- Can convert PostScript to PDF, SVG, and PNG
- Can convert PDF to PNG
- Custom markup language works as documented
- CLI tool handles all options correctly
- Web interface uploads and downloads correctly
- All tests pass
- Performance is acceptable for multi-page documents
- Documentation is complete
Resources
Architecture References
- Cairo Graphics Library: https://cairographics.org/
- Ghostscript Architecture: Study the source
- PDF.js (Mozilla): JavaScript PDF renderer
Books
- Software Architecture in Practice by Bass, Clements & Kazman
- Designing Data-Intensive Applications by Martin Kleppmann
- Computer Graphics: Principles and Practice by Foley et al.
Libraries
- Cairo: 2D graphics library
- FreeType: Font rendering
- libpng/libjpeg: Image encoding
- zlib: Compression