Project 4: Compile a Language to WASM

Project 4: Compile a Language to WASM

Build a compiler that translates a simple programming language into WebAssembly binary format


Project Overview

Attribute Value
Difficulty Advanced
Time Estimate 1 month+
Languages C (primary), Rust, Go, TypeScript
Prerequisites Projects 1-2, basic parsing knowledge
Main Reference Engineering a Compiler by Cooper & Torczon
Knowledge Area Compilers, Code Generation, Language Design

Learning Objectives

After completing this project, you will be able to:

  1. Design a simple language - Define syntax and semantics for a compilable language
  2. Build a lexer and parser - Transform source text into an Abstract Syntax Tree (AST)
  3. Generate stack machine code - Emit WASM instructions from high-level constructs
  4. Emit valid WASM binaries - Produce correct .wasm files with proper structure
  5. Implement type checking - Ensure programs match WASMโ€™s type constraints
  6. Handle control flow compilation - Transform if/while/for into WASM blocks and branches
  7. Understand compiler architecture - See how real compilers target WASM

Conceptual Foundation

1. Why Build a Compiler to WASM?

Compiling to WASM gives you the โ€œproduction viewโ€ of WebAssembly:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚              The Compiler Writer's Perspective                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                      โ”‚
โ”‚  Source Code          AST              IR              .wasm        โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€      โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€      โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€      โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€        โ”‚
โ”‚  func add(a, b)   FuncDecl        f0: [i32,i32]   00 61 73 6d       โ”‚
โ”‚    return a + b     โ”‚params       โ†’ [i32]        01 00 00 00       โ”‚
โ”‚                     โ”‚  โ”” a: i32     local.get 0   01 07 01 60       โ”‚
โ”‚                     โ”‚  โ”” b: i32     local.get 1   ...               โ”‚
โ”‚                     โ”‚body          i32.add                         โ”‚
โ”‚                     โ”” BinOp(+)     end                              โ”‚
โ”‚                         โ”‚ a                                         โ”‚
โ”‚                         โ”‚ b                                         โ”‚
โ”‚                                                                      โ”‚
โ”‚  You learn:                                                          โ”‚
โ”‚  โ€ข Why WASM has certain constraints (they make compilation easy!)   โ”‚
โ”‚  โ€ข Why structured control flow matters (no goto = simpler codegen)  โ”‚
โ”‚  โ€ข Why only 4 types (uniform representation, easy type checking)    โ”‚
โ”‚  โ€ข Why stack machine (postfix order matches evaluation order)       โ”‚
โ”‚                                                                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The Compiler Writer's Perspective - showing the transformation from source code through AST and IR to WASM binary

Real-world impact: Every language that targets WASM (C/C++, Rust, Go, AssemblyScript, Grain) goes through this process. Youโ€™re learning what LLVM does.

2. The Source Language: MiniLang

Design a small but complete language. Hereโ€™s a suggested specification:

MiniLang Grammar (EBNF):

program     = { function } ;
function    = "func" IDENT "(" params ")" [ ":" type ] block ;
params      = [ param { "," param } ] ;
param       = IDENT ":" type ;
type        = "i32" | "i64" | "f32" | "f64" | "void" ;

block       = "{" { statement } "}" ;
statement   = var_decl | assignment | if_stmt | while_stmt
            | return_stmt | expr_stmt ;

var_decl    = "let" IDENT [ ":" type ] "=" expr ";" ;
assignment  = IDENT "=" expr ";" ;
if_stmt     = "if" expr block [ "else" block ] ;
while_stmt  = "while" expr block ;
return_stmt = "return" [ expr ] ";" ;
expr_stmt   = expr ";" ;

expr        = comparison ;
comparison  = addition { ( "==" | "!=" | "<" | ">" | "<=" | ">=" ) addition } ;
addition    = term { ( "+" | "-" ) term } ;
term        = factor { ( "*" | "/" | "%" ) factor } ;
factor      = unary ;
unary       = [ "-" | "!" ] primary ;
primary     = NUMBER | IDENT | call | "(" expr ")" ;
call        = IDENT "(" [ expr { "," expr } ] ")" ;

IDENT       = letter { letter | digit | "_" } ;
NUMBER      = digit { digit } [ "." digit { digit } ] ;

Example MiniLang program:

func factorial(n: i32): i32 {
    if n < 2 {
        return 1;
    }
    return n * factorial(n - 1);
}

func main(): i32 {
    let result: i32 = factorial(10);
    return result;
}

3. Compiler Pipeline Overview

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      Compiler Pipeline                                 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                        โ”‚
โ”‚    Source           Tokens           AST             WASM             โ”‚
โ”‚    โ”Œโ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”           โ”‚
โ”‚    โ”‚.miniโ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚Tokenโ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ AST โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚.wasmโ”‚           โ”‚
โ”‚    โ”‚file โ”‚  Lexer  โ”‚List โ”‚  Parser โ”‚Tree โ”‚  CodeGenโ”‚file โ”‚           โ”‚
โ”‚    โ””โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”˜           โ”‚
โ”‚                                        โ”‚                              โ”‚
โ”‚                                        โ–ผ                              โ”‚
โ”‚                                   Type Check                          โ”‚
โ”‚                                   (optional but                       โ”‚
โ”‚                                    recommended)                       โ”‚
โ”‚                                                                        โ”‚
โ”‚  Phase Details:                                                        โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€                                                        โ”‚
โ”‚  Lexer:    "func add" โ†’ [FUNC, IDENT("add")]                         โ”‚
โ”‚  Parser:   Tokens โ†’ FuncDecl { name: "add", params: [...], body }    โ”‚
โ”‚  TypeChk:  Verify types match, infer where needed                    โ”‚
โ”‚  CodeGen:  AST โ†’ WASM instructions โ†’ Binary encoding                 โ”‚
โ”‚                                                                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Compiler Pipeline - showing the flow from source file through lexer, parser, type checker to WASM output

4. The Lexer (Tokenizer)

Convert character stream to token stream:

typedef enum {
    // Keywords
    TOK_FUNC, TOK_LET, TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN,
    TOK_I32, TOK_I64, TOK_F32, TOK_F64, TOK_VOID,

    // Symbols
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACE, TOK_RBRACE,
    TOK_COMMA, TOK_COLON, TOK_SEMICOLON,
    TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH, TOK_PERCENT,
    TOK_EQ, TOK_EQEQ, TOK_NE, TOK_LT, TOK_GT, TOK_LE, TOK_GE,
    TOK_BANG,

    // Literals
    TOK_NUMBER,
    TOK_IDENT,

    // Special
    TOK_EOF,
} TokenType;

typedef struct {
    TokenType type;
    const char* start;
    int length;
    int line;
} Token;

Lexer pseudocode:

next_token():
  skip_whitespace()
  skip_comments()

  if at_end(): return EOF

  start = current

  if is_digit(peek()):
    while is_digit(peek()): advance()
    if peek() == '.' and is_digit(peek_next()):
      advance()  // consume '.'
      while is_digit(peek()): advance()
    return NUMBER

  if is_alpha(peek()):
    while is_alphanumeric(peek()): advance()
    return check_keyword() or IDENT

  // Single/double character tokens
  switch advance():
    '(': return LPAREN
    ')': return RPAREN
    '{': return LBRACE
    // ... etc
    '=': return match('=') ? EQEQ : EQ
    '<': return match('=') ? LE : LT
    // ... etc

5. The Parser (Syntax Analysis)

Build an Abstract Syntax Tree from tokens. Use recursive descent:

// AST Node types
typedef enum {
    NODE_PROGRAM,
    NODE_FUNC_DECL,
    NODE_VAR_DECL,
    NODE_ASSIGNMENT,
    NODE_IF,
    NODE_WHILE,
    NODE_RETURN,
    NODE_BINARY,
    NODE_UNARY,
    NODE_CALL,
    NODE_NUMBER,
    NODE_IDENT,
} NodeType;

typedef struct ASTNode {
    NodeType type;
    union {
        struct { /* program data */ } program;
        struct { char* name; Param* params; int param_count;
                 Type return_type; struct ASTNode* body; } func;
        struct { char* name; Type type; struct ASTNode* init; } var_decl;
        struct { char* name; struct ASTNode* value; } assignment;
        struct { struct ASTNode* condition;
                 struct ASTNode* then_branch;
                 struct ASTNode* else_branch; } if_stmt;
        struct { struct ASTNode* condition;
                 struct ASTNode* body; } while_stmt;
        struct { struct ASTNode* value; } return_stmt;
        struct { char op; struct ASTNode* left;
                 struct ASTNode* right; } binary;
        struct { char op; struct ASTNode* operand; } unary;
        struct { char* name; struct ASTNode** args;
                 int arg_count; } call;
        double number;
        char* ident;
    };
} ASTNode;

Recursive descent parser pattern:

// Parse with precedence: low to high
// comparison โ†’ addition โ†’ term โ†’ factor โ†’ unary โ†’ primary

ASTNode* parse_expression() {
    return parse_comparison();
}

ASTNode* parse_comparison() {
    ASTNode* left = parse_addition();

    while (match(TOK_EQEQ) || match(TOK_NE) ||
           match(TOK_LT) || match(TOK_GT) ||
           match(TOK_LE) || match(TOK_GE)) {
        Token op = previous();
        ASTNode* right = parse_addition();
        left = make_binary(op, left, right);
    }

    return left;
}

ASTNode* parse_addition() {
    ASTNode* left = parse_term();

    while (match(TOK_PLUS) || match(TOK_MINUS)) {
        Token op = previous();
        ASTNode* right = parse_term();
        left = make_binary(op, left, right);
    }

    return left;
}

// ... continue for term, factor, unary, primary ...

ASTNode* parse_primary() {
    if (match(TOK_NUMBER)) {
        return make_number(strtod(previous().start, NULL));
    }

    if (match(TOK_IDENT)) {
        char* name = copy_string(previous());
        if (match(TOK_LPAREN)) {
            // Function call
            return parse_call(name);
        }
        return make_ident(name);
    }

    if (match(TOK_LPAREN)) {
        ASTNode* expr = parse_expression();
        consume(TOK_RPAREN, "Expected ')' after expression");
        return expr;
    }

    error("Expected expression");
}

6. Type Checking

WASM has strict types. Your compiler must verify type consistency:

typedef enum { TYPE_I32, TYPE_I64, TYPE_F32, TYPE_F64, TYPE_VOID } Type;

// Symbol table entry
typedef struct {
    char* name;
    Type type;
    bool is_param;
    int local_index;
} Symbol;

// Type checker
Type check_expression(ASTNode* node, SymbolTable* scope) {
    switch (node->type) {
        case NODE_NUMBER:
            // Infer type from value
            if (is_integer(node->number)) return TYPE_I32;
            return TYPE_F64;

        case NODE_IDENT: {
            Symbol* sym = lookup(scope, node->ident);
            if (!sym) error("Undefined variable: %s", node->ident);
            return sym->type;
        }

        case NODE_BINARY: {
            Type left = check_expression(node->binary.left, scope);
            Type right = check_expression(node->binary.right, scope);
            if (left != right) {
                error("Type mismatch: %s vs %s", type_name(left), type_name(right));
            }
            // Comparison operators return i32 (boolean)
            if (is_comparison(node->binary.op)) return TYPE_I32;
            return left;
        }

        case NODE_CALL: {
            FuncInfo* func = lookup_func(node->call.name);
            // Check argument count and types
            if (node->call.arg_count != func->param_count) {
                error("Wrong number of arguments");
            }
            for (int i = 0; i < node->call.arg_count; i++) {
                Type arg_type = check_expression(node->call.args[i], scope);
                if (arg_type != func->param_types[i]) {
                    error("Argument type mismatch");
                }
            }
            return func->return_type;
        }

        // ... other cases ...
    }
}

7. Code Generation: The Heart of the Compiler

Transform AST into WASM instructions. The key insight: expressions compile to stack operations naturally.

Expression: a + b * c

Parse tree:        Evaluation order:       WASM:
    +              1. get a                local.get $a
   / \             2. get b                local.get $b
  a   *            3. get c                local.get $c
     / \           4. multiply             i32.mul
    b   c          5. add                  i32.add

The tree's post-order traversal IS the WASM instruction sequence!

Code generator structure:

typedef struct {
    // Output buffer for instructions
    uint8_t* code;
    int code_len;
    int code_cap;

    // Local variable tracking
    Symbol* locals;
    int local_count;
    int next_local_idx;

    // Function being compiled
    FuncInfo* current_func;
} CodeGen;

void emit_byte(CodeGen* cg, uint8_t byte) {
    if (cg->code_len >= cg->code_cap) {
        cg->code_cap *= 2;
        cg->code = realloc(cg->code, cg->code_cap);
    }
    cg->code[cg->code_len++] = byte;
}

void emit_leb128(CodeGen* cg, uint32_t value) {
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0) byte |= 0x80;
        emit_byte(cg, byte);
    } while (value != 0);
}

void emit_sleb128(CodeGen* cg, int32_t value) {
    bool more = true;
    while (more) {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if ((value == 0 && !(byte & 0x40)) ||
            (value == -1 && (byte & 0x40))) {
            more = false;
        } else {
            byte |= 0x80;
        }
        emit_byte(cg, byte);
    }
}

8. Compiling Expressions

Post-order traversal generates correct stack code:

void compile_expression(CodeGen* cg, ASTNode* node) {
    switch (node->type) {
        case NODE_NUMBER:
            emit_byte(cg, 0x41);  // i32.const
            emit_sleb128(cg, (int32_t)node->number);
            break;

        case NODE_IDENT: {
            Symbol* sym = lookup(cg, node->ident);
            emit_byte(cg, 0x20);  // local.get
            emit_leb128(cg, sym->local_index);
            break;
        }

        case NODE_BINARY:
            // Post-order: left, right, operator
            compile_expression(cg, node->binary.left);
            compile_expression(cg, node->binary.right);

            // Emit operator
            switch (node->binary.op) {
                case '+': emit_byte(cg, 0x6a); break;  // i32.add
                case '-': emit_byte(cg, 0x6b); break;  // i32.sub
                case '*': emit_byte(cg, 0x6c); break;  // i32.mul
                case '/': emit_byte(cg, 0x6d); break;  // i32.div_s
                case '%': emit_byte(cg, 0x6f); break;  // i32.rem_s
                case '<': emit_byte(cg, 0x48); break;  // i32.lt_s
                case '>': emit_byte(cg, 0x4a); break;  // i32.gt_s
                // ... etc
            }
            break;

        case NODE_CALL: {
            // Compile arguments (they go on stack)
            for (int i = 0; i < node->call.arg_count; i++) {
                compile_expression(cg, node->call.args[i]);
            }
            // Call function
            int func_idx = lookup_func_idx(cg, node->call.name);
            emit_byte(cg, 0x10);  // call
            emit_leb128(cg, func_idx);
            break;
        }

        case NODE_UNARY:
            compile_expression(cg, node->unary.operand);
            if (node->unary.op == '-') {
                // Negate: push 0, swap, subtract
                emit_byte(cg, 0x41);  // i32.const
                emit_byte(cg, 0x00);  // 0
                // Stack: [operand, 0]
                // We need: [0, operand]
                // Actually, easier to do: 0 - operand
                // But we already have operand on stack...
                // Better approach: multiply by -1 for floats, or use i32.sub from 0

                // Simple approach: push 0 first, then subtract
                // But operand is already on stack. Use local for temp.
                // Actually, even simpler for i32: use xor with -1 and add 1 (two's complement)
                // Or: i32.const 0; swap positions; i32.sub

                // Cleanest: emit as (0 - x)
                // This requires we compile differently:
            }
            break;
    }
}

9. Compiling Control Flow

This is where WASMโ€™s structured control flow shines:

If statements:

void compile_if(CodeGen* cg, ASTNode* node) {
    // Compile condition (leaves i32 on stack)
    compile_expression(cg, node->if_stmt.condition);

    if (node->if_stmt.else_branch) {
        // if-else
        emit_byte(cg, 0x04);  // if
        emit_byte(cg, 0x40);  // void block type (or result type)

        compile_block(cg, node->if_stmt.then_branch);

        emit_byte(cg, 0x05);  // else

        compile_block(cg, node->if_stmt.else_branch);

        emit_byte(cg, 0x0b);  // end
    } else {
        // if without else
        emit_byte(cg, 0x04);  // if
        emit_byte(cg, 0x40);  // void

        compile_block(cg, node->if_stmt.then_branch);

        emit_byte(cg, 0x0b);  // end
    }
}

While loops:

void compile_while(CodeGen* cg, ASTNode* node) {
    // WASM pattern for while:
    // block $exit
    //   loop $continue
    //     <condition>
    //     i32.eqz          ;; invert for "branch if false"
    //     br_if $exit
    //     <body>
    //     br $continue
    //   end
    // end

    emit_byte(cg, 0x02);  // block
    emit_byte(cg, 0x40);  // void

    emit_byte(cg, 0x03);  // loop
    emit_byte(cg, 0x40);  // void

    // Condition
    compile_expression(cg, node->while_stmt.condition);
    emit_byte(cg, 0x45);  // i32.eqz (invert)
    emit_byte(cg, 0x0d);  // br_if
    emit_leb128(cg, 1);   // break to outer block (depth 1)

    // Body
    compile_block(cg, node->while_stmt.body);

    // Loop back
    emit_byte(cg, 0x0c);  // br
    emit_leb128(cg, 0);   // continue to loop (depth 0)

    emit_byte(cg, 0x0b);  // end loop
    emit_byte(cg, 0x0b);  // end block
}

Return statements:

void compile_return(CodeGen* cg, ASTNode* node) {
    if (node->return_stmt.value) {
        compile_expression(cg, node->return_stmt.value);
    }
    emit_byte(cg, 0x0f);  // return
}

10. Emitting the WASM Binary

Finally, assemble everything into a valid .wasm file:

void emit_module(CodeGen* cg, Program* program) {
    // Header
    emit_bytes(cg, "\0asm", 4);    // Magic
    emit_u32_le(cg, 1);            // Version

    // Type section
    emit_type_section(cg, program);

    // Function section
    emit_function_section(cg, program);

    // Export section
    emit_export_section(cg, program);

    // Code section
    emit_code_section(cg, program);
}

void emit_type_section(CodeGen* cg, Program* program) {
    ByteBuffer section = new_buffer();

    // Count types
    emit_leb128(&section, program->func_count);

    for (int i = 0; i < program->func_count; i++) {
        FuncDecl* f = program->funcs[i];

        emit_byte(&section, 0x60);  // func type

        // Params
        emit_leb128(&section, f->param_count);
        for (int j = 0; j < f->param_count; j++) {
            emit_byte(&section, type_to_byte(f->params[j].type));
        }

        // Results
        if (f->return_type == TYPE_VOID) {
            emit_leb128(&section, 0);
        } else {
            emit_leb128(&section, 1);
            emit_byte(&section, type_to_byte(f->return_type));
        }
    }

    // Write section
    emit_byte(cg, 0x01);  // Type section ID
    emit_leb128(cg, section.len);
    emit_bytes(cg, section.data, section.len);
}

void emit_code_section(CodeGen* cg, Program* program) {
    ByteBuffer section = new_buffer();

    emit_leb128(&section, program->func_count);

    for (int i = 0; i < program->func_count; i++) {
        FuncDecl* f = program->funcs[i];

        // Compile function body
        ByteBuffer body = new_buffer();

        // Local declarations
        emit_local_declarations(&body, f);

        // Instructions
        compile_function_body(&body, f);

        emit_byte(&body, 0x0b);  // end

        // Write body size + body
        emit_leb128(&section, body.len);
        emit_bytes(&section, body.data, body.len);
    }

    // Write section
    emit_byte(cg, 0x0a);  // Code section ID
    emit_leb128(cg, section.len);
    emit_bytes(cg, section.data, section.len);
}

Project Specification

Required Features

Your compiler must support:

Types:

  • i32 (32-bit integers)
  • void (for functions with no return)

Statements:

  • Variable declarations: let x: i32 = 5;
  • Assignments: x = 10;
  • If statements: if condition { ... } else { ... }
  • While loops: while condition { ... }
  • Return: return expr;

Expressions:

  • Integer literals: 42
  • Variables: x
  • Binary operators: +, -, *, /, %, ==, !=, <, >, <=, >=
  • Function calls: factorial(n - 1)
  • Parentheses: (a + b) * c

Functions:

  • Multiple functions per file
  • Parameters and return values
  • Recursion

Input/Output

Input: .mini source file Output: .wasm binary file

$ ./minilang factorial.mini -o factorial.wasm
Compiled factorial.mini -> factorial.wasm (247 bytes)

$ wasmtime factorial.wasm --invoke factorial 10
3628800

Success Criteria

  1. Hello World: Compile and run func main(): i32 { return 42; }
  2. Arithmetic: Compile func add(a: i32, b: i32): i32 { return a + b; }
  3. Conditionals: Compile and run a function with if/else
  4. Loops: Compile and run sum-to-n using a while loop
  5. Recursion: Compile and run factorial recursively
  6. Multiple functions: Call one function from another
  7. Validation: Output passes wasm-validate

Solution Architecture

Compiler Structure

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Compiler Architecture                             โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                      โ”‚
โ”‚  main.c                                                              โ”‚
โ”‚    โ”‚                                                                 โ”‚
โ”‚    โ”œโ”€โ”€โ–ถ lexer.c                                                      โ”‚
โ”‚    โ”‚      โ”” next_token() โ†’ Token stream                             โ”‚
โ”‚    โ”‚                                                                 โ”‚
โ”‚    โ”œโ”€โ”€โ–ถ parser.c                                                     โ”‚
โ”‚    โ”‚      โ”” parse() โ†’ AST                                           โ”‚
โ”‚    โ”‚                                                                 โ”‚
โ”‚    โ”œโ”€โ”€โ–ถ checker.c (optional but recommended)                        โ”‚
โ”‚    โ”‚      โ”” check() โ†’ Type-annotated AST                            โ”‚
โ”‚    โ”‚                                                                 โ”‚
โ”‚    โ””โ”€โ”€โ–ถ codegen.c                                                    โ”‚
โ”‚           โ”œ compile_function()                                       โ”‚
โ”‚           โ”œ compile_statement()                                      โ”‚
โ”‚           โ”œ compile_expression()                                     โ”‚
โ”‚           โ”” emit_module() โ†’ .wasm bytes                             โ”‚
โ”‚                                                                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Compiler Architecture - showing the hierarchical structure from main.c to lexer, parser, checker, and codegen modules

Suggested File Organization

minilang/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ main.c              # Entry point, CLI
โ”‚   โ”œโ”€โ”€ lexer.c             # Tokenization
โ”‚   โ”œโ”€โ”€ lexer.h
โ”‚   โ”œโ”€โ”€ parser.c            # AST construction
โ”‚   โ”œโ”€โ”€ parser.h
โ”‚   โ”œโ”€โ”€ ast.h               # AST node definitions
โ”‚   โ”œโ”€โ”€ checker.c           # Type checking
โ”‚   โ”œโ”€โ”€ checker.h
โ”‚   โ”œโ”€โ”€ codegen.c           # WASM generation
โ”‚   โ”œโ”€โ”€ codegen.h
โ”‚   โ”œโ”€โ”€ wasm.c              # Binary encoding helpers
โ”‚   โ”œโ”€โ”€ wasm.h
โ”‚   โ””โ”€โ”€ common.h            # Shared utilities
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ arithmetic.mini
โ”‚   โ”œโ”€โ”€ factorial.mini
โ”‚   โ”œโ”€โ”€ fibonacci.mini
โ”‚   โ”œโ”€โ”€ control_flow.mini
โ”‚   โ””โ”€โ”€ run_tests.sh
โ”œโ”€โ”€ Makefile
โ””โ”€โ”€ README.md

Implementation Guide

Phase 1: Lexer (Days 1-3)

Goal: Tokenize MiniLang source code

// Test input
const char* src = "func add(a: i32, b: i32): i32 { return a + b; }";

// Expected tokens
// FUNC, IDENT("add"), LPAREN, IDENT("a"), COLON, I32, COMMA,
// IDENT("b"), COLON, I32, RPAREN, COLON, I32, LBRACE,
// RETURN, IDENT("a"), PLUS, IDENT("b"), SEMICOLON, RBRACE, EOF

Steps:

  1. Define token types enum
  2. Implement skip_whitespace() and skip_comments()
  3. Implement number scanning
  4. Implement identifier scanning with keyword lookup
  5. Implement single and double character operators
  6. Test on sample programs

Checkpoint: All tokens extracted from factorial.mini.

Phase 2: Parser (Days 4-10)

Goal: Build AST from tokens

Steps:

  1. Define AST node structures
  2. Implement parse_program() โ†’ list of functions
  3. Implement parse_function() โ†’ function declaration
  4. Implement parse_block() โ†’ list of statements
  5. Implement statement parsers (if, while, return, etc.)
  6. Implement expression parser with precedence
  7. Add error recovery (synchronize on semicolons)

Testing approach:

// Print AST for debugging
void print_ast(ASTNode* node, int indent) {
    // Recursively print tree structure
}

// Verify structure
assert(ast->type == NODE_PROGRAM);
assert(ast->program.func_count == 2);
assert(strcmp(ast->program.funcs[0]->name, "factorial") == 0);

Checkpoint: Parse factorial.mini into correct AST.

Phase 3: Type Checker (Days 11-14)

Goal: Verify type consistency

Steps:

  1. Build symbol table (function for scope)
  2. Implement type inference for expressions
  3. Check assignment compatibility
  4. Check function call argument types
  5. Verify return type matches declaration
  6. Report clear error messages

Checkpoint: Reject let x: i32 = "hello"; with error.

Phase 4: Code Generator - Expressions (Days 15-21)

Goal: Compile expressions to WASM

Steps:

  1. Implement emit_byte(), emit_leb128(), emit_sleb128()
  2. Compile integer constants (i32.const)
  3. Compile variable references (local.get)
  4. Compile binary operations (arithmetic, comparison)
  5. Compile function calls

Test:

func add(a: i32, b: i32): i32 {
    return a + b;
}

Should produce working WASM.

Checkpoint: add(5, 3) returns 8.

Phase 5: Code Generator - Control Flow (Days 22-28)

Goal: Compile if/while/return

Steps:

  1. Compile if-else to WASM if/else/end
  2. Compile while to block/loop/br/br_if
  3. Handle nested control structures
  4. Implement return (handle early return)

Test:

func sum_to(n: i32): i32 {
    let sum: i32 = 0;
    let i: i32 = 1;
    while i <= n {
        sum = sum + i;
        i = i + 1;
    }
    return sum;
}

Checkpoint: sum_to(100) returns 5050.

Phase 6: Binary Emission (Days 29-35)

Goal: Produce valid .wasm files

Steps:

  1. Emit module header (magic, version)
  2. Emit type section (function signatures)
  3. Emit function section (type indices)
  4. Emit export section
  5. Emit code section (function bodies)
  6. Validate with wasm-validate

Checkpoint: Output passes validation and runs in wasmtime.

Phase 7: Polish (Week 5+)

Goal: Production-quality compiler

Steps:

  1. Add error messages with line numbers
  2. Support all comparison operators
  3. Add unary minus and not
  4. Support local variable declarations mid-function
  5. Add memory operations (stretch goal)

Testing Strategy

Unit Tests

Test each phase independently:

// Lexer test
void test_lexer() {
    Lexer l = init_lexer("1 + 2");
    assert(next_token(&l).type == TOK_NUMBER);
    assert(next_token(&l).type == TOK_PLUS);
    assert(next_token(&l).type == TOK_NUMBER);
    assert(next_token(&l).type == TOK_EOF);
}

// Parser test
void test_parser() {
    ASTNode* ast = parse("func f(): i32 { return 1 + 2; }");
    assert(ast->type == NODE_PROGRAM);
    assert(ast->program.funcs[0]->body->stmts[0]->type == NODE_RETURN);
}

// Codegen test
void test_codegen() {
    uint8_t expected[] = {0x41, 0x01, 0x41, 0x02, 0x6a};  // i32.const 1; i32.const 2; i32.add
    ASTNode* expr = parse_expr("1 + 2");
    ByteBuffer buf = compile_expr(expr);
    assert(memcmp(buf.data, expected, 5) == 0);
}

Integration Tests

Compile and run complete programs:

#!/bin/bash
# tests/run_tests.sh

for f in tests/*.mini; do
    name=$(basename "$f" .mini)
    ./minilang "$f" -o "/tmp/$name.wasm"

    # Validate
    wasm-validate "/tmp/$name.wasm" || { echo "FAIL: $name (invalid)"; continue; }

    # Run and check output
    result=$(wasmtime "/tmp/$name.wasm" --invoke main)
    expected=$(head -1 "$f" | grep -oP '// expect: \K.*')

    if [ "$result" = "$expected" ]; then
        echo "PASS: $name"
    else
        echo "FAIL: $name (expected $expected, got $result)"
    fi
done

Test Programs

// tests/arithmetic.mini
// expect: 8
func main(): i32 {
    return 5 + 3;
}

// tests/factorial.mini
// expect: 3628800
func factorial(n: i32): i32 {
    if n < 2 {
        return 1;
    }
    return n * factorial(n - 1);
}

func main(): i32 {
    return factorial(10);
}

// tests/fibonacci.mini
// expect: 55
func fib(n: i32): i32 {
    if n < 2 {
        return n;
    }
    return fib(n - 1) + fib(n - 2);
}

func main(): i32 {
    return fib(10);
}

// tests/while_loop.mini
// expect: 5050
func main(): i32 {
    let sum: i32 = 0;
    let i: i32 = 1;
    while i <= 100 {
        sum = sum + i;
        i = i + 1;
    }
    return sum;
}

Common Pitfalls

1. Off-by-One in Precedence

Expression parsing precedence must be correct:

a + b * c    should parse as    a + (b * c)
a == b + c  should parse as    a == (b + c)

Lower precedence = called first in recursive descent.

2. Local Index Calculation

Parameters and locals share the same index space:

func f(a: i32, b: i32): i32 {  // a=0, b=1
    let x: i32 = 0;             // x=2
    let y: i32 = 0;             // y=3
    return a + x;               // local.get 0 + local.get 2
}

3. Block Type Annotations

WASM blocks need type annotations:

;; Void block (no result)
(block
  ...
)
;; Bytecode: 0x02 0x40

;; Block returning i32
(block (result i32)
  i32.const 42
)
;; Bytecode: 0x02 0x7f

Use 0x40 for void, 0x7f for i32, etc.

4. While Loop Break Depth

When breaking out of a while loop, the depth depends on nesting:

block $exit      ;; depth 1 from inside loop
  loop $continue ;; depth 0 from inside loop
    br_if 1      ;; break to $exit
    br 0         ;; continue to $continue
  end
end

5. Function Index Offset

If you have imports, function indices are offset:

;; With 2 imports:
;; Import 0, Import 1, Func 0 (index 2), Func 1 (index 3)

call 2  ;; calls your first function

6. LEB128 Signed vs Unsigned

i32.const uses signed LEB128 (negative numbers):

// i32.const -1
emit_byte(0x41);
emit_sleb128(-1);  // NOT emit_leb128!

Extensions

1. Add More Types

Support i64, f32, f64:

func pi(): f64 {
    return 3.14159265359;
}

func big(): i64 {
    return 9223372036854775807;
}

2. Add Arrays via Memory

Use linear memory for arrays:

func sum_array(ptr: i32, len: i32): i32 {
    let sum: i32 = 0;
    let i: i32 = 0;
    while i < len {
        sum = sum + load_i32(ptr + i * 4);
        i = i + 1;
    }
    return sum;
}

3. Add String Literals

Store in data section, pass as pointers:

func main(): i32 {
    print("Hello, World!");  // ptr to data section
    return 0;
}

4. Add Imports

Support calling host functions:

import "console" "log" func log(x: i32);

func main(): i32 {
    log(42);  // calls imported function
    return 0;
}

5. Add Optimizations

Implement simple optimizations:

  • Constant folding: 1 + 2 โ†’ 3
  • Dead code elimination: Remove unreachable code after return
  • Strength reduction: x * 2 โ†’ x << 1

6. Self-Hosting

The ultimate goal: write MiniLang in MiniLang and compile it to WASM. This requires adding strings, memory, and file I/O.


Real-World Connections

How Real Compilers Target WASM

LLVM (Clang, Rust, etc.):

  • LLVM IR โ†’ WASM backend
  • Your project is a simplified version of this path
  • Study llc -march=wasm32 output

AssemblyScript:

  • TypeScript-like โ†’ WASM
  • Direct compilation without LLVM
  • Similar architecture to your project

Emscripten:

  • C/C++ โ†’ LLVM โ†’ WASM
  • Adds JavaScript glue code
  • Full POSIX emulation

Debug Info and Source Maps

Production compilers emit DWARF debug info:

;; Custom section with debug info
(custom "name" ...)
(custom ".debug_info" ...)

Your compiler could add function names for debugging.

Optimization Opportunities

WASM runtimes can optimize, but compilers can help:

  • Use local.tee instead of local.set + local.get
  • Use block arity to avoid extra operations
  • Inline small functions

Self-Assessment Checklist

Compiler Knowledge

  • Explain the difference between lexing, parsing, and code generation
  • Describe how operator precedence is implemented in recursive descent
  • Explain why ASTs are useful intermediate representations
  • Draw the AST for a complex expression

WASM Code Generation

  • Explain why post-order traversal produces correct stack code
  • Generate correct code for nested if/else statements
  • Generate correct code for while loops with break conditions
  • Handle function calls with correct stack discipline

Binary Format

  • Emit valid LEB128 for both positive and negative numbers
  • Structure a complete WASM module with all required sections
  • Calculate correct section sizes

Testing

  • Verify output with wasm-validate
  • Run programs in wasmtime and verify results
  • Debug compilation errors systematically

Resources

Primary References

Books

  • Engineering a Compiler by Cooper & Torczon - Comprehensive compiler theory
  • Modern Compiler Implementation by Appel - Alternative textbook
  • WebAssembly: The Definitive Guide - WASM-specific details

Reference Implementations

Tools


Key Insights

Expressions are post-order, control flow is structured. These two facts make WASM an excellent compilation target. Your ASTโ€™s shape naturally maps to the instruction sequence.

The binary format is the easy part. Most compiler complexity is in parsing and type checking. Once you have a correct AST, code generation is mechanical.

Type checking prevents runtime errors. WASMโ€™s validation guarantees that well-formed modules canโ€™t have stack underflows or type mismatches. Your type checker provides the same guarantee for your source language.

Building a compiler teaches you what languages can and canโ€™t do. After this project, youโ€™ll understand why certain language features exist and how theyโ€™re implemented.


The Core Question Youโ€™re Answering

For a compiler targeting WASM, the fundamental question is:

โ€œHow do you translate high-level language constructs into the constrained world of stack-based bytecode?โ€

This question unpacks into several critical sub-questions:

  1. Expression Translation: How do infix expressions like a + b * c become postfix stack operations?
  2. Scope and Binding: How do variable names map to numbered local slots?
  3. Control Flow Linearization: How do nested if/while structures become flat sequences of blocks and branches?
  4. Type Erasure and Preservation: Which type information survives to runtime vs. is checked at compile time?

The answer reveals why WASM was designed the way it was: every design decision makes compilation straightforward while maintaining safety and performance.

Source Expression:     result = (a + b) * c - d

Your compiler must ask:

1. Where does 'result' live?     โ†’ Local slot assignment
2. What type is each operand?    โ†’ Type checking phase
3. What evaluation order?        โ†’ Post-order AST traversal
4. How to encode in binary?      โ†’ LEB128 encoding, opcodes
5. How to structure the module?  โ†’ Sections, type indices

The "constrained world" of WASM isn't limitingโ€”it's clarifying.
Every constraint eliminates ambiguity you'd have to handle yourself.

Concepts You Must Understand First

Before writing a single line of compiler code, you need mental models for these concepts:

1. Lexical Analysis and Tokenization

What it is: Converting a character stream into meaningful tokens.

Key insight: The lexer is a finite state machine. Each state transition consumes characters and potentially emits a token.

Input:  "func add(a: i32)"
Output: [FUNC, IDENT("add"), LPAREN, IDENT("a"), COLON, I32, RPAREN]

The lexer doesn't care about meaningโ€”just pattern recognition.

Book reference: Compilers: Principles, Techniques, and Tools (Dragon Book), Chapter 3 covers lexical analysis with DFA theory.

2. Parsing (Recursive Descent or PEG)

What it is: Transforming a token stream into a structured representation according to a grammar.

Key insight: Recursive descent parsers mirror the grammar structure directly. Each grammar rule becomes a function.

Grammar:            Parser code:
expr = term         fn parse_expr() {
     { "+" term }       let left = parse_term();
                        while match(PLUS) {
                            let right = parse_term();
                            left = make_binary(PLUS, left, right);
                        }
                        return left;
                    }

Book reference: Crafting Interpreters by Robert Nystrom, Chapters 5-6 provide the clearest recursive descent explanation.

3. Abstract Syntax Trees (AST)

What it is: A tree representation that captures the hierarchical structure of the program, stripped of syntactic sugar.

Key insight: The ASTโ€™s shape determines the code generation order. Post-order traversal naturally produces stack machine code.

Expression: a + b * c

     (+)              Post-order traversal:
    /   \             1. Visit a โ†’ local.get $a
   a    (*)           2. Visit b โ†’ local.get $b
       /   \          3. Visit c โ†’ local.get $c
      b     c         4. Process (*) โ†’ i32.mul
                      5. Process (+) โ†’ i32.add

Book reference: Language Implementation Patterns by Terence Parr, Pattern 8 (Tree Grammar) and Pattern 9 (Tree Pattern Matcher).

4. Stack-Based Code Generation

What it is: Emitting instructions for a virtual machine that uses an operand stack rather than registers.

Key insight: The operand stack is implicit in your recursive calls. Each compile_expression() call leaves exactly one value on the stack.

Invariant: Every expression leaves exactly ONE value on the stack
           (or zero for void expressions)

compile(a + b):
  compile(a)    ; stack: [a]
  compile(b)    ; stack: [a, b]
  emit(i32.add) ; stack: [a+b]

Book reference: Engineering a Compiler by Cooper & Torczon, Chapter 7 (Code Shape) covers stack-based evaluation.

5. Expression Evaluation Order

What it is: The sequence in which subexpressions are computed to ensure correct results.

Key insight: Left-to-right evaluation with operator precedence is encoded in the AST structure, then extracted via post-order traversal.

a - b - c parses as ((a - b) - c), not (a - (b - c))

       (-)           Post-order: a, b, -, c, -
      /   \          Which gives: (a - b) - c โœ“
    (-)    c
   /   \             If parsed wrong: a, b, c, -, -
  a     b            Would give: a - (b - c) โœ—

Book reference: Writing a C Compiler by Nora Sandler, Chapter 2 covers expression parsing with associativity.

6. Variable Allocation (Locals vs Memory)

What it is: Deciding where each variable lives at runtimeโ€”in a local slot or in linear memory.

Key insight: WASM locals are typed, indexed slots. The compiler must assign each source variable a unique index.

func example(a: i32, b: i32): i32 {
    let x: i32 = 0;
    let y: i32 = 0;
    ...
}

Local index allocation:
  $a โ†’ 0 (parameter)
  $b โ†’ 1 (parameter)
  $x โ†’ 2 (local)
  $y โ†’ 3 (local)

The symbol table maps names to indices.

Book reference: Modern Compiler Implementation in C by Andrew Appel, Chapter 6 covers activation records and local allocation.

7. Type Checking and Inference

What it is: Ensuring that operations receive operands of the correct type, and inferring types where not explicitly stated.

Key insight: Type checking happens in a separate pass before code generation. By the time you emit code, you know every expressionโ€™s type.

Type checking example:

let x: i32 = 5;      โœ“ 5 is i32, matches declaration
let y = 3.14;        โœ“ Infer y: f64 from literal
x = y;               โœ— Cannot assign f64 to i32
foo(x, y);           Check: foo's param types match argument types

Book reference: Types and Programming Languages by Benjamin Pierce, Chapters 8-11 for foundational type theory.


Questions to Guide Your Design

Use these questions as checkpoints during implementation:

Converting Expressions to Stack Operations

  1. How do I ensure correct operator precedence in the AST?
    • Lower precedence operators become parents of higher precedence ones
    • a + b * c โ†’ Add(a, Mul(b, c)), not Mul(Add(a, b), c)
  2. How do I handle left vs right associativity?
    • Left-associative: a - b - c โ†’ Sub(Sub(a, b), c)
    • Right-associative: a = b = c โ†’ Assign(a, Assign(b, c))
  3. What invariant does each compile_expression() call maintain?
    • Answer: Exactly one value pushed to the stack (for non-void expressions)
  4. How do I generate code for short-circuit operators (&&, ||)?
    • They require control flow, not just stack operations
    • a && b must skip evaluating b if a is false

Handling Variables and Scoping

  1. How do I distinguish parameters from local variables?
    • Both use local.get/local.set with different index ranges
    • Parameters: indices 0 to param_count-1
    • Locals: indices param_count and above
  2. How do I handle shadowing (same name in nested scope)?
    • Stack of symbol tables, or single table with scope depth
    • Inner scopeโ€™s binding shadows outer scopeโ€™s
  3. What happens to local indices when a variable goes out of scope?
    • WASM doesnโ€™t have dynamic local allocationโ€”all locals declared at function start
    • You can reuse indices for non-overlapping scopes (optimization)

Compiling Control Flow

  1. How does if/else map to WASMโ€™s structured control flow?
    (if (condition)
      (then ...)
      (else ...))
    
  2. How do I compile while with WASMโ€™s block/loop/br?
    (block $exit
      (loop $continue
        (br_if $exit (i32.eqz (condition)))
        ...body...
        (br $continue)))
    
  3. How do I handle nested loops with break and continue?
    • Track depth of enclosing blocks
    • break โ†’ br to outer block, continue โ†’ br to inner loop

Function Prologue/Epilogue

  1. What goes in the function prologue?
    • Local variable declarations (count and types)
    • Initialization of locals to zero (WASM does this automatically)
  2. What goes in the function epilogue?
    • Ensure return value is on stack
    • The final end instruction
  3. How do I handle early returns?
    • The return instruction exits immediately
    • Stack must have correct return type (or be empty for void)

Thinking Exercise

Hand-Compile an Expression

Task: Convert (a + b) * c - d to WAT stack operations, assuming all variables are i32 locals.

Step 1: Draw the AST

        (-)
       /   \
     (*)    d
    /   \
  (+)    c
 /   \
a     b

Step 2: Post-order traversal

Visit order: a, b, (+), c, (*), d, (-)

Step 3: Emit instructions for each visit

;; Visit a
local.get $a        ;; stack: [a]

;; Visit b
local.get $b        ;; stack: [a, b]

;; Process (+)
i32.add             ;; stack: [(a+b)]

;; Visit c
local.get $c        ;; stack: [(a+b), c]

;; Process (*)
i32.mul             ;; stack: [(a+b)*c]

;; Visit d
local.get $d        ;; stack: [(a+b)*c, d]

;; Process (-)
i32.sub             ;; stack: [(a+b)*c - d]

Step 4: Verify with concrete values

Let a=2, b=3, c=4, d=5:

  • Expected: (2 + 3) * 4 - 5 = 5 * 4 - 5 = 20 - 5 = 15
  • Trace: [2] โ†’ [2,3] โ†’ [5] โ†’ [5,4] โ†’ [20] โ†’ [20,5] โ†’ [15] โœ“

Challenge Yourself

Now try these:

  1. a < b && c > d (short-circuit required!)
  2. if (x > 0) { y = x; } else { y = -x; } (if as expression)
  3. factorial(n - 1) * n (function call + arithmetic)

The Interview Questions Theyโ€™ll Ask

Conceptual Questions

Q1: โ€œExplain the phases of a compiler and what each does.โ€

Expected answer: Lexical analysis (tokenization), parsing (AST construction), semantic analysis (type checking, symbol resolution), optimization (optional), code generation. Each phase transforms the representation: characters โ†’ tokens โ†’ AST โ†’ annotated AST โ†’ target code.

Q2: โ€œWhy use an AST instead of directly generating code during parsing?โ€

Expected answer: Separation of concernsโ€”parsing handles syntax, later phases handle semantics. AST allows multiple passes (type checking, optimization, codegen). Also enables targeting multiple backends from the same AST.

Q3: โ€œHow would you implement operator precedence in a recursive descent parser?โ€

Expected answer: Each precedence level gets its own parsing function. Lower precedence functions call higher precedence functions. For example, parse_addition() calls parse_multiplication(), ensuring multiplication binds tighter.

Code Generation Questions

Q4: โ€œHow do you compile a while loop for a stack-based VM?โ€

Expected answer: Use nested block/loop structure:

block $exit
  loop $continue
    <condition>
    br_if $exit (if false)
    <body>
    br $continue
  end
end

The br $continue jumps to loop start; br_if $exit exits when condition is false.

Q5: โ€œWhatโ€™s the difference between local.get/local.set and memory loads/stores?โ€

Expected answer: Locals are typed slots on the call frameโ€”fast, limited number, no address. Memory is linear byte arrayโ€”can store arbitrary data structures, addressable, requires explicit load/store with offset.

Q6: โ€œHow do you handle function calls in a stack-based compiler?โ€

Expected answer: Push arguments left-to-right onto the stack, then emit call $func_idx. Callee expects arguments as its first N locals. Return value (if any) is left on callerโ€™s stack.

AST Traversal Questions

Q7: โ€œImplement a post-order tree traversal that generates stack machine code.โ€

def codegen(node):
    match node:
        case BinaryOp(op, left, right):
            codegen(left)   # leaves value on stack
            codegen(right)  # leaves value on stack
            emit(op_to_instr[op])  # consumes 2, pushes 1
        case Number(value):
            emit(f"i32.const {value}")
        case Variable(name):
            emit(f"local.get ${name}")

Q8: โ€œHow would you type-check a function call expression?โ€

Expected answer:

  1. Look up function in symbol table
  2. Check argument count matches parameter count
  3. Type-check each argument expression
  4. Verify each argument type matches corresponding parameter type
  5. Return the functionโ€™s return type as the expressionโ€™s type

System Design Questions

Q9: โ€œDesign a symbol table that handles nested scopes.โ€

Expected answer: Stack of hash tables. Each scope pushes a new table. Lookup searches from top to bottom. Define adds to topmost table. When scope ends, pop the table.

Q10: โ€œHow would you add error recovery to your parser?โ€

Expected answer: Synchronization on statement boundaries. When an error is detected, skip tokens until reaching a synchronization point (like ; or }), report the error, then continue parsing. Collect all errors rather than stopping at first.


Hints in Layers

For when youโ€™re stuck, reveal hints progressively:

Layer 1: Getting Started

Hint: How do I structure the compiler?

Start with the simplest possible pipeline:

main.c:
  source = read_file(argv[1])
  tokens = lex(source)
  ast = parse(tokens)
  wasm_bytes = codegen(ast)
  write_file(output, wasm_bytes)

Get each phase working before moving to the next. Use print statements liberally.

Hint: What's the minimum viable program?

Start with func main(): i32 { return 42; }

This requires:

  • Lexer: FUNC, IDENT, LPAREN, RPAREN, COLON, I32, LBRACE, RETURN, NUMBER, SEMICOLON, RBRACE
  • Parser: parse_function, parse_return, parse_number
  • Codegen: emit module header, type section, function section, export section, code section with i32.const 42; end

Layer 2: Lexer Struggles

Hint: My lexer is getting complicated

Keep the lexer stateless except for position. Use a switch statement for single-character tokens, special cases for two-character tokens (==, <=, etc.), and separate functions for numbers and identifiers.

Token next_token() {
    skip_whitespace();
    if (at_end()) return make_token(TOK_EOF);

    char c = advance();

    if (is_alpha(c)) return identifier();
    if (is_digit(c)) return number();

    switch (c) {
        case '(': return make_token(TOK_LPAREN);
        case ')': return make_token(TOK_RPAREN);
        case '=': return match('=') ? make_token(TOK_EQEQ) : make_token(TOK_EQ);
        // ...
    }
}

Layer 3: Parser Struggles

Hint: Operator precedence is wrong

Remember: lower precedence = called first.

Precedence (lowest to highest):
1. || (or)
2. && (and)
3. == != (equality)
4. < > <= >= (comparison)
5. + - (addition)
6. * / % (multiplication)
7. - ! (unary)
8. () function call (primary)

parse_expression() calls parse_or()
parse_or() calls parse_and()
parse_and() calls parse_equality()
... and so on
Hint: Left recursion is breaking my parser

Convert left recursion to a while loop:

// Instead of: expr = expr "+" term | term
// Which would infinitely recurse

ASTNode* parse_expr() {
    ASTNode* left = parse_term();
    while (match(TOK_PLUS) || match(TOK_MINUS)) {
        Token op = previous();
        ASTNode* right = parse_term();
        left = make_binary(op, left, right);
    }
    return left;
}

Layer 4: Code Generation Struggles

Hint: My WASM binary is invalid

Common issues:

  1. Wrong section order: Type (1), Import (2), Function (3), Table (4), Memory (5), Global (6), Export (7), Start (8), Element (9), Code (10), Data (11)
  2. Size mismatch: Section size doesnโ€™t match actual content
  3. LEB128 encoding: Use signed for constants, unsigned for indices
  4. Missing end byte (0x0b): Every function, block, loop, if must end with 0x0b
Hint: My while loop doesn't work

The pattern:

block $exit      ;; 0x02 0x40
  loop $cont     ;; 0x03 0x40
    <condition>
    i32.eqz      ;; 0x45 (invert condition)
    br_if $exit  ;; 0x0d 0x01 (depth 1 = outer block)
    <body>
    br $cont     ;; 0x0c 0x00 (depth 0 = this loop)
  end            ;; 0x0b
end              ;; 0x0b

Key: br to a loop jumps to the start, br to a block jumps to the end.

Layer 5: Deep Issues

Hint: Early return doesn't work correctly

WASMโ€™s return instruction (0x0f) exits the function immediately. But you need to ensure the stack has the right type.

For functions returning i32:

if (condition)
  i32.const 0
  return        ;; OK: stack has i32
end
i32.const 1     ;; Normal path continues

For void functions, donโ€™t push anything before return.

Hint: Nested control flow has wrong branch depths

Track block depth during codegen:

typedef struct {
    int block_depth;
    // ...
} CodeGen;

void compile_while(CodeGen* cg, ...) {
    emit(cg, 0x02);  // block
    cg->block_depth++;
    emit(cg, 0x03);  // loop
    cg->block_depth++;

    // br_if to exit: depth is 1 (back to outer block)
    // br to continue: depth is 0 (back to loop start)

    emit(cg, 0x0b);  // end loop
    cg->block_depth--;
    emit(cg, 0x0b);  // end block
    cg->block_depth--;
}

Books That Will Help

Book Author(s) Relevance Key Chapters
Writing a C Compiler Nora Sandler Hands-on compiler construction with modern approach All chaptersโ€”follows a similar project structure
Engineering a Compiler Keith Cooper & Linda Torczon Comprehensive theory with practical focus Ch. 4 (Parsing), Ch. 7 (Code Shape), Ch. 13 (Instruction Selection)
Language Implementation Patterns Terence Parr Pattern-based approach to language tools Part II (Analyzing Languages), Part III (Interpretation)
Compilers: Principles and Practice Aho, Lam, Sethi, Ullman The โ€œDragon Bookโ€โ€”foundational theory Ch. 2 (A Simple Syntax-Directed Translator), Ch. 6 (Intermediate-Code Generation)
Crafting Interpreters Robert Nystrom Most accessible introduction, free online Part II (Tree-Walk Interpreter)โ€”applicable to compilers too
Modern Compiler Implementation in C Andrew Appel Alternative to Dragon Book, more implementation-focused Ch. 6 (Activation Records), Ch. 7 (Translation to IR)

Reading Order Recommendation

  1. Start with Crafting Interpreters (free online) for fundamentals
  2. Then read Writing a C Compiler for a complete project walkthrough
  3. Reference Engineering a Compiler Chapter 7 for code generation theory
  4. Consult Language Implementation Patterns when stuck on specific patterns

Real-World Outcome

When you complete this project, your compiler will take source code and produce working WebAssembly. Hereโ€™s what the end-to-end process looks like:

Input: MiniLang Source File

File: factorial.mini

// Compute factorial recursively
func factorial(n: i32): i32 {
    if n < 2 {
        return 1;
    }
    return n * factorial(n - 1);
}

// Compute factorial iteratively for comparison
func factorial_iter(n: i32): i32 {
    let result: i32 = 1;
    let i: i32 = 2;
    while i <= n {
        result = result * i;
        i = i + 1;
    }
    return result;
}

func main(): i32 {
    return factorial(10);
}

Compilation Process

$ ./minilang factorial.mini -o factorial.wasm --emit-wat
Lexing... 47 tokens
Parsing... AST with 3 functions
Type checking... OK
Code generation...
  factorial: 23 bytes
  factorial_iter: 41 bytes
  main: 8 bytes
Emitting binary...

Output:
  factorial.wasm (198 bytes)
  factorial.wat  (generated WAT for debugging)

$ wasm-validate factorial.wasm
OK

$ wasmtime factorial.wasm --invoke factorial 10
3628800

$ wasmtime factorial.wasm --invoke factorial_iter 10
3628800

Generated WAT (Human-Readable)

File: factorial.wat

(module
  ;; Type section: function signatures
  (type $t0 (func (param i32) (result i32)))  ;; factorial, factorial_iter
  (type $t1 (func (result i32)))               ;; main

  ;; Function declarations
  (func $factorial (type $t0) (param $n i32) (result i32)
    ;; if n < 2
    local.get $n
    i32.const 2
    i32.lt_s
    if (result i32)
      ;; return 1
      i32.const 1
    else
      ;; return n * factorial(n - 1)
      local.get $n
      local.get $n
      i32.const 1
      i32.sub
      call $factorial
      i32.mul
    end
  )

  (func $factorial_iter (type $t0) (param $n i32) (result i32)
    (local $result i32)
    (local $i i32)

    ;; let result = 1
    i32.const 1
    local.set $result

    ;; let i = 2
    i32.const 2
    local.set $i

    ;; while i <= n
    block $exit
      loop $continue
        local.get $i
        local.get $n
        i32.gt_s        ;; i > n means exit
        br_if $exit

        ;; result = result * i
        local.get $result
        local.get $i
        i32.mul
        local.set $result

        ;; i = i + 1
        local.get $i
        i32.const 1
        i32.add
        local.set $i

        br $continue
      end
    end

    ;; return result
    local.get $result
  )

  (func $main (type $t1) (result i32)
    i32.const 10
    call $factorial
  )

  ;; Export main for runtime invocation
  (export "factorial" (func $factorial))
  (export "factorial_iter" (func $factorial_iter))
  (export "main" (func $main))
)

Binary Output Analysis

$ xxd factorial.wasm | head -20
00000000: 0061 736d 0100 0000 0110 0360 017f 017f  .asm.......`....
00000010: 6000 017f 0304 0300 0001 0722 0366 6163  `..........".fac
00000020: 746f 7269 616c 0000 0e66 6163 746f 7269  torial...factori
00000030: 616c 5f69 7465 7200 0104 6d61 696e 0002  al_iter...main..
00000040: 0a56 0317 0020 0041 0248 0440 4101 0520  .V... .A.H.@A..
...

Section breakdown:
  00-03: Magic number (0x00 0x61 0x73 0x6d = "\0asm")
  04-07: Version (0x01 0x00 0x00 0x00 = 1)
  08-17: Type section (01) - function signatures
  18-1b: Function section (03) - type indices
  1c-43: Export section (07) - exported names
  44-c5: Code section (0a) - function bodies

Integration with JavaScript

// Load and run the compiled WASM
const wasmBuffer = await fetch('factorial.wasm');
const wasmModule = await WebAssembly.instantiate(await wasmBuffer.arrayBuffer());
const { factorial, factorial_iter, main } = wasmModule.instance.exports;

// Use the compiled functions
console.log(factorial(10));       // 3628800
console.log(factorial_iter(10));  // 3628800
console.log(main());              // 3628800

// Benchmark: your compiled code runs at near-native speed
console.time('factorial');
for (let i = 0; i < 100000; i++) factorial(20);
console.timeEnd('factorial');  // ~50ms for 100k calls

Compiler Test Suite Output

$ ./run_tests.sh
Running test suite...

[PASS] arithmetic.mini: 5 + 3 = 8
[PASS] precedence.mini: 2 + 3 * 4 = 14
[PASS] comparison.mini: 5 > 3 = 1
[PASS] if_else.mini: max(3, 7) = 7
[PASS] while_loop.mini: sum_to(100) = 5050
[PASS] factorial.mini: factorial(10) = 3628800
[PASS] fibonacci.mini: fib(10) = 55
[PASS] multiple_funcs.mini: calling helper functions
[PASS] nested_if.mini: nested conditionals
[PASS] nested_while.mini: nested loops

All 10 tests passed!

Validation:
  All outputs pass wasm-validate โœ“
  All outputs run in wasmtime โœ“
  All outputs run in Node.js โœ“

This is what youโ€™ve built: a complete pipeline from high-level source code to portable, efficient binary that runs anywhere WebAssembly is supported.


After this project, youโ€™ll see every programming language differentlyโ€”as a layer of abstraction over something like WASM. Project 5 (WASI) shows how WASM programs interact with the outside world.