Project 2: Build a WASM Binary Parser
Decode the .wasm binary format byte-by-byte and understand what makes a WebAssembly module
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Languages | C (primary), Rust, Python, TypeScript |
| Prerequisites | Project 1 (WAT), binary/hex familiarity, file I/O |
| Main Reference | WebAssembly Specification §5 (Binary Format) |
| Knowledge Area | Binary Parsing, WebAssembly Internals |
Learning Objectives
After completing this project, you will be able to:
- Parse LEB128 encoding - Decode variable-length integers used throughout WASM
- Identify all section types - Know the 12 standard sections and their purposes
- Decode function signatures - Understand the type section's structure
- Read import/export tables - Parse how modules connect to their environment
- Disassemble code sections - Turn bytecode back into readable instructions
- Validate module structure - Know the ordering and format requirements
Conceptual Foundation
1. Why Binary Format Matters
When you write WAT and compile with wat2wasm, the output is a .wasm file: a binary encoding of your module. Understanding this format lets you:
- Debug mysterious WASM failures by inspecting the actual bytes
- Build tools (validators, optimizers, debuggers)
- Write compilers that output WASM directly
- Understand runtime performance (binary is what actually executes)
The binary format IS WebAssembly. WAT is just a convenience for humans.
2. The Magic Number and Version
Every WASM file begins with 8 bytes:
Offset Bytes Meaning
------------------------------
0x00 00 61 73 6D Magic number: "\0asm"
0x04 01 00 00 00 Version: 1 (little-endian u32)
Why "\0asm"?
- Starts with a null byte to distinguish it from text files
- "asm" is a nod to the original asm.js project
- Makes files easily identifiable: file mymodule.wasm reports "WebAssembly"
Version 1: Every deployed WASM binary carries version 1 in this field. Later revisions of the specification (such as WebAssembly 2.0) add features without changing the binary version number.
3. LEB128: Variable-Length Integers
WASM uses LEB128 (Little Endian Base 128) to encode integers compactly:
- Small numbers use few bytes: 0-127 fit in 1 byte
- Large numbers use more bytes: up to 5 bytes for u32, 10 for u64
- Each byte's high bit indicates "more bytes follow"
Unsigned LEB128 Algorithm
Reading uLEB128:
result = 0
shift = 0
loop:
byte = read_byte()
result |= (byte & 0x7F) << shift // Low 7 bits contribute to value
if (byte & 0x80) == 0: // High bit clear = done
break
shift += 7
return result
Example: Decoding 0xE5 0x8E 0x26
Byte 1: 0xE5 = 1110 0101
- High bit set (1): more bytes coming
- Value bits: 110 0101 = 0x65
Byte 2: 0x8E = 1000 1110
- High bit set (1): more bytes coming
- Value bits: 000 1110 = 0x0E
Byte 3: 0x26 = 0010 0110
- High bit clear (0): done
- Value bits: 010 0110 = 0x26
Assembly:
result = 0x65 | (0x0E << 7) | (0x26 << 14)
= 0x65 | 0x700 | 0x98000
= 624485
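The algorithm and worked example above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name decode_uleb128 and its (value, next_pos) return convention are assumptions made for this sketch.

```python
def decode_uleb128(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 integer starting at data[pos].

    Returns (value, next_pos). The name and return shape are
    illustrative choices, not part of any WASM API.
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if (byte & 0x80) == 0:             # high bit clear: this was the last byte
            return result, pos
        shift += 7


value, next_pos = decode_uleb128(bytes([0xE5, 0x8E, 0x26]))
print(value, next_pos)  # 624485 3
```

Returning the next position alongside the value lets the caller keep reading from where the integer ended, which is how every later section parser consumes its input.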
Signed LEB128 (sLEB128)
Similar, but the final byte's second-highest bit (bit 6) is the sign bit. If it is set, extend the result with 1s:
If the final byte has bit 6 set and the result needs sign extension:
result |= (~0 << shift) // Fill high bits with 1s; shift must already include the final byte's 7 bits
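A hedged Python sketch of the signed variant follows. Python integers are unbounded, so instead of OR-ing in high 1 bits the sketch subtracts 2^shift, which has the same effect; the name decode_sleb128 is an assumption for illustration. Note that shift is advanced before the continuation bit is tested, so it counts every payload bit read.

```python
def decode_sleb128(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a signed LEB128 integer; returns (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7                     # count this byte's bits before testing for the end
        if (byte & 0x80) == 0:
            break
    if byte & 0x40:                    # bit 6 of the final byte is the sign bit
        result -= 1 << shift           # equivalent to filling all higher bits with 1s
    return result, pos


print(decode_sleb128(bytes([0x7F])))              # (-1, 1)
print(decode_sleb128(bytes([0x9B, 0xF1, 0x59])))  # (-624485, 3)
```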
4. Section Structure
After the header, a WASM module is a sequence of sections. Each section:
┌─────────────┬──────────────┬────────────────────┐
│ Section ID  │ Section Size │ Section Data       │
│ (1 byte)    │ (uLEB128)    │ (size bytes)       │
└─────────────┴──────────────┴────────────────────┘

Section IDs:
| ID | Name | Purpose |
|---|---|---|
| 0 | Custom | Name sections, debug info, extensions |
| 1 | Type | Function signatures (param/result types) |
| 2 | Import | Functions/memory/tables/globals from host |
| 3 | Function | Maps function index → type index |
| 4 | Table | Indirect call tables (for function pointers) |
| 5 | Memory | Linear memory declarations |
| 6 | Global | Global variable declarations |
| 7 | Export | Names exposed to host |
| 8 | Start | Optional startup function index |
| 9 | Element | Table initialization data |
| 10 | Code | Function bodies |
| 11 | Data | Memory initialization data |
| 12 | Data Count | Count of data segments (for validation) |
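Critical rule: Non-custom sections must appear in ID order, with one exception: the Data Count section (ID 12) sits between Element (9) and Code (10). Custom sections (ID 0) can appear anywhere and may repeat. Duplicate non-custom sections are invalid.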
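The ordering rule can be checked with a small rank table. The sketch below is one possible approach under the assumption that the parser has already collected the section IDs in file order; check_section_order is a hypothetical helper name.

```python
def check_section_order(section_ids: list[int]) -> None:
    """Raise ValueError if known sections are duplicated or out of order."""
    # Required order of the non-custom sections. Data Count (12) comes
    # between Element (9) and Code (10), so numeric order alone is not enough.
    order = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 10, 11]
    rank = {sid: i for i, sid in enumerate(order)}
    last = -1
    for sid in section_ids:
        if sid == 0:
            continue                   # custom sections may appear anywhere
        if sid not in rank:
            raise ValueError(f"unknown section id {sid}")
        if rank[sid] <= last:
            raise ValueError(f"section {sid} is duplicated or out of order")
        last = rank[sid]


check_section_order([1, 3, 0, 5, 7, 10])   # passes silently
```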
5. The Type Section (ID = 1)
Contains function signatures (types). Structure:
┌────────────┬─────────────────────────────────────────────┐
│ num_types  │ type[0], type[1], ..., type[num_types-1]    │
│ (uLEB128)  │                                             │
└────────────┴─────────────────────────────────────────────┘
Each type:
┌────────┬────────────┬────────────────┬─────────────┬────────────────┐
│ 0x60   │ num_params │ param_types... │ num_results │ result_types...│
│ (func) │ (uLEB128)  │ (n bytes)      │ (uLEB128)   │ (m bytes)      │
└────────┴────────────┴────────────────┴─────────────┴────────────────┘

Type encoding:
| Byte | Type |
|---|---|
| 0x7F | i32 |
| 0x7E | i64 |
| 0x7D | f32 |
| 0x7C | f64 |
| 0x70 | funcref |
| 0x6F | externref |
Example: (func (param i32 i32) (result i32)) encodes as:
60 ; function type marker
02 ; 2 parameters
7F 7F ; both are i32
01 ; 1 result
7F ; i32
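To make the layout concrete, here is a minimal Python sketch that decodes one function type from raw bytes. The helper names (read_uleb128, parse_func_type) and the returned tuple shape are assumptions for illustration only.

```python
VALUE_TYPES = {0x7F: "i32", 0x7E: "i64", 0x7D: "f32", 0x7C: "f64",
               0x70: "funcref", 0x6F: "externref"}


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]               # IndexError here means truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def read_type_vector(data: bytes, pos: int) -> tuple[list[str], int]:
    """Read a count followed by that many one-byte value types."""
    count, pos = read_uleb128(data, pos)
    if pos + count > len(data):
        raise ValueError("type vector runs past end of input")
    names = []
    for b in data[pos:pos + count]:
        if b not in VALUE_TYPES:
            raise ValueError(f"unknown value type 0x{b:02X}")
        names.append(VALUE_TYPES[b])
    return names, pos + count


def parse_func_type(data: bytes, pos: int = 0):
    """Return (params, results, next_pos) for one type-section entry."""
    if data[pos] != 0x60:
        raise ValueError(f"expected 0x60 func marker, got 0x{data[pos]:02X}")
    params, pos = read_type_vector(data, pos + 1)
    results, pos = read_type_vector(data, pos)
    return params, results, pos


print(parse_func_type(bytes.fromhex("60027F7F017F")))
# (['i32', 'i32'], ['i32'], 6)
```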
6. The Function Section (ID = 3)
Maps each function in the code section to a type index:
┌────────────┬─────────────────────────────────────────────┐
│ num_funcs  │ type_idx[0], type_idx[1], ...               │
│ (uLEB128)  │ (each is uLEB128)                           │
└────────────┴─────────────────────────────────────────────┘
Why separate from Code?
- Allows streaming compilation: know all types before seeing bodies
- Enables forward references and mutual recursion
Function indexing: Imported functions come first. If you have 3 imports, your first local function is index 3.
7. The Import Section (ID = 2)
┌─────────────┬─────────────────────────────────────────────┐
│ num_imports │ import[0], import[1], ...                   │
└─────────────┴─────────────────────────────────────────────┘
Each import:
┌────────────────┬──────────────────┬─────────┬──────────────┐
│ module_name    │ import_name      │ kind    │ description  │
│ (string)       │ (string)         │ (byte)  │ (varies)     │
└────────────────┴──────────────────┴─────────┴──────────────┘
Strings: length (uLEB128) followed by UTF-8 bytes
Kind:
0x00 = function (followed by type index)
0x01 = table
0x02 = memory
0x03 = global
8. The Export Section (ID = 7)
Each export:
┌───────────────┬─────────┬───────────────┐
│ export_name   │ kind    │ index         │
│ (string)      │ (byte)  │ (uLEB128)     │
└───────────────┴─────────┴───────────────┘
Kind: same as imports (0x00=func, 0x01=table, 0x02=memory, 0x03=global)
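As a rough illustration of the export layout, the following Python sketch parses the payload of an export section (the bytes after the section ID and size). The function names and the tuple representation are assumptions; the sample bytes are hand-assembled for two exports, "add" (func 0) and "memory" (memory 0).

```python
KINDS = {0x00: "func", 0x01: "table", 0x02: "memory", 0x03: "global"}


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def read_string(data: bytes, pos: int) -> tuple[str, int]:
    """Length-prefixed UTF-8 string: uLEB128 length, then raw bytes."""
    length, pos = read_uleb128(data, pos)
    if pos + length > len(data):
        raise ValueError("string runs past end of section")
    return data[pos:pos + length].decode("utf-8"), pos + length


def parse_exports(payload: bytes) -> list[tuple[str, str, int]]:
    pos = 0
    count, pos = read_uleb128(payload, pos)
    exports = []
    for _ in range(count):
        name, pos = read_string(payload, pos)
        kind_byte = payload[pos]
        pos += 1
        if kind_byte not in KINDS:
            raise ValueError(f"unknown export kind 0x{kind_byte:02X}")
        index, pos = read_uleb128(payload, pos)
        exports.append((name, KINDS[kind_byte], index))
    if pos != len(payload):
        raise ValueError("trailing bytes after last export")
    return exports


sample = bytes.fromhex("02" "03616464" "00" "00" "066d656d6f7279" "02" "00")
print(parse_exports(sample))  # [('add', 'func', 0), ('memory', 'memory', 0)]
```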
9. The Code Section (ID = 10)
The heart of WASM: function bodies.
┌────────────┬───────────────────────────────────────────────┐
│ num_funcs  │ func_body[0], func_body[1], ...               │
└────────────┴───────────────────────────────────────────────┘
Each func_body:
┌───────────────┬──────────────┬────────────────────────────┐
│ body_size     │ locals       │ instructions               │
│ (uLEB128)     │ (see below)  │ (sequence of opcodes)      │
└───────────────┴──────────────┴────────────────────────────┘
Locals:
┌────────────┬────────────────────────────────────────────────┐
│ num_groups │ (count, type) pairs                            │
└────────────┴────────────────────────────────────────────────┘
For example: 2 locals of type i32, 1 local of type i64:
02 ; 2 groups
02 7F ; 2 × i32
01 7E ; 1 × i64
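The locals encoding above can be decoded with a short loop. This Python sketch keeps the (count, type) groups as pairs rather than expanding them into one entry per local, since a hostile file can declare a very large count in a few bytes; parse_locals is a hypothetical helper name.

```python
VALUE_TYPES = {0x7F: "i32", 0x7E: "i64", 0x7D: "f32", 0x7C: "f64"}


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def parse_locals(data: bytes, pos: int = 0):
    """Return ([(count, type_name), ...], next_pos)."""
    num_groups, pos = read_uleb128(data, pos)
    groups = []
    for _ in range(num_groups):
        count, pos = read_uleb128(data, pos)
        type_byte = data[pos]
        pos += 1
        if type_byte not in VALUE_TYPES:
            raise ValueError(f"unknown local type 0x{type_byte:02X}")
        groups.append((count, VALUE_TYPES[type_byte]))
    return groups, pos


print(parse_locals(bytes.fromhex("02027F017E")))
# ([(2, 'i32'), (1, 'i64')], 5)
```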
10. Instruction Encoding (Opcodes)
Instructions are single bytes (sometimes followed by immediates):
| Opcode | Instruction | Immediates |
|---|---|---|
| 0x00 | unreachable | none |
| 0x01 | nop | none |
| 0x02 | block | block type |
| 0x03 | loop | block type |
| 0x04 | if | block type |
| 0x05 | else | none |
| 0x0B | end | none |
| 0x0C | br | label index |
| 0x0D | br_if | label index |
| 0x10 | call | function index |
| 0x20 | local.get | local index |
| 0x21 | local.set | local index |
| 0x22 | local.tee | local index |
| 0x28 | i32.load | memarg |
| 0x36 | i32.store | memarg |
| 0x41 | i32.const | sLEB128 value |
| 0x42 | i64.const | sLEB128 value |
| 0x6A | i32.add | none |
| 0x6B | i32.sub | none |
| 0x6C | i32.mul | none |
memarg: Memory operations take (align, offset) both as uLEB128.
block type: Either 0x40 (void) or a value type (0x7F for i32, etc.)
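The opcode table translates directly into a table-driven decoder. The sketch below covers only the opcodes listed above and treats block types as a single byte; the table layout, the immediate-kind tags ("u", "s", "memarg", "blocktype"), and the function names are all assumptions for illustration.

```python
VALTYPES = {0x7F: "i32", 0x7E: "i64", 0x7D: "f32", 0x7C: "f64"}

# opcode -> (mnemonic, immediate kind)
OPCODES = {
    0x00: ("unreachable", None), 0x01: ("nop", None),
    0x02: ("block", "blocktype"), 0x03: ("loop", "blocktype"),
    0x04: ("if", "blocktype"), 0x05: ("else", None), 0x0B: ("end", None),
    0x0C: ("br", "u"), 0x0D: ("br_if", "u"), 0x10: ("call", "u"),
    0x20: ("local.get", "u"), 0x21: ("local.set", "u"), 0x22: ("local.tee", "u"),
    0x28: ("i32.load", "memarg"), 0x36: ("i32.store", "memarg"),
    0x41: ("i32.const", "s"), 0x42: ("i64.const", "s"),
    0x6A: ("i32.add", None), 0x6B: ("i32.sub", None), 0x6C: ("i32.mul", None),
}


def read_leb128(data: bytes, pos: int, signed: bool = False) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]               # IndexError signals truncated input
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:
            break
    if signed and (byte & 0x40):
        result -= 1 << shift
    return result, pos


def disassemble(code: bytes) -> list[str]:
    """Decode the instruction bytes of one function body into text lines."""
    pos, lines = 0, []
    while pos < len(code):             # loop on position, not on seeing 0x0B
        opcode = code[pos]
        pos += 1
        if opcode not in OPCODES:
            raise ValueError(f"unsupported opcode 0x{opcode:02X} at {pos - 1}")
        name, kind = OPCODES[opcode]
        if kind == "u":
            value, pos = read_leb128(code, pos)
            lines.append(f"{name} {value}")
        elif kind == "s":
            value, pos = read_leb128(code, pos, signed=True)
            lines.append(f"{name} {value}")
        elif kind == "memarg":
            align, pos = read_leb128(code, pos)
            offset, pos = read_leb128(code, pos)
            lines.append(f"{name} align={align} offset={offset}")
        elif kind == "blocktype":
            bt = code[pos]
            pos += 1
            lines.append(name if bt == 0x40
                         else f"{name} (result {VALTYPES.get(bt, hex(bt))})")
        else:
            lines.append(name)
    return lines


print(disassemble(bytes.fromhex("200020016A0B")))
# ['local.get 0', 'local.get 1', 'i32.add', 'end']
```

Keeping the immediates in a table means that supporting a new opcode is usually a one-line change rather than a new branch.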
11. Module Anatomy Diagram
┌──────────────────────────────────────────────────────────────┐
│                        WASM Module                           │
├──────────────────────────────────────────────────────────────┤
│ Magic Number (4 bytes): 00 61 73 6D                          │
│ Version (4 bytes): 01 00 00 00                               │
├──────────────────────────────────────────────────────────────┤
│ Section 1 (Type):                                            │
│   ├── Number of types                                        │
│   └── Type definitions (function signatures)                 │
├──────────────────────────────────────────────────────────────┤
│ Section 2 (Import):                                          │
│   ├── Number of imports                                      │
│   └── Import entries (module, name, kind, description)       │
├──────────────────────────────────────────────────────────────┤
│ Section 3 (Function):                                        │
│   ├── Number of functions                                    │
│   └── Type indices (which signature each function uses)      │
├──────────────────────────────────────────────────────────────┤
│ Section 5 (Memory):                                          │
│   └── Memory limits (initial, optional max)                  │
├──────────────────────────────────────────────────────────────┤
│ Section 7 (Export):                                          │
│   ├── Number of exports                                      │
│   └── Export entries (name, kind, index)                     │
├──────────────────────────────────────────────────────────────┤
│ Section 10 (Code):                                           │
│   ├── Number of function bodies                              │
│   └── Function bodies                                        │
│       ├── Body size                                          │
│       ├── Local declarations                                 │
│       └── Instructions (ending with 0x0B = end)              │
└──────────────────────────────────────────────────────────────┘

Project Specification
Deliverables
Build a command-line tool that parses .wasm files and prints their structure:
$ ./wasmparser example.wasm
WASM Module
Version: 1
Section: Type (1)
[0] (i32, i32) -> (i32)
[1] (i32) -> ()
Section: Import (2)
[0] "env"."print" : func type=1
Section: Function (3)
[0] type=0
[1] type=0
Section: Memory (5)
[0] min=1, max=none
Section: Export (7)
[0] "add" : func 1
[1] "memory" : memory 0
Section: Code (10)
[0] locals: 0
20 00 local.get 0
20 01 local.get 1
6A i32.add
0B end
[1] locals: 1 × i32
41 00 i32.const 0
21 02 local.set 2
...
Functional Requirements
- Parse header: Validate magic number and version
- Parse all 12 section types (or at least Type, Import, Function, Memory, Export, Code)
- Decode LEB128: Both signed and unsigned variants
- Parse type section: Show all function signatures
- Parse import section: Show module, name, kind, and type
- Parse export section: Show name, kind, and index
- Parse code section: Disassemble to readable instructions
- Handle errors gracefully: Invalid files shouldn't crash
Minimum Viable Product
Start with:
- Header parsing
- Section enumeration (just ID and size)
- Type section parsing
- Code section basic disassembly
Then expand to remaining sections.
Success Criteria
- Parses official WASM test suite files without crashing
- Correctly decodes LEB128 values (test with edge cases)
- Output matches wasm-objdump -x for section structure
- Disassembly matches wasm2wat output for simple modules
- Handles empty sections and missing optional sections
Solution Architecture
High-Level Design
┌─────────────────────────────────────────────────────────────┐
│                         wasmparser                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────────┐  │
│  │    Reader    │───▶│    Parser    │───▶│    Printer    │  │
│  │ (byte stream)│    │(decode logic)│    │(format output)│  │
│  └──────────────┘    └──────────────┘    └───────────────┘  │
│                                                             │
│  Reader provides:    Parser builds:      Printer shows:     │
│  - read_byte()       - Module struct     - Human text       │
│  - read_bytes(n)     - Section list      - Or JSON          │
│  - read_uleb128()    - Types list        - Or other         │
│  - read_sleb128()    - Functions                            │
│  - position          - etc.                                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Data Structures
Module {
version: u32
sections: Vec<Section>
}
Section {
id: u8
size: u32
data: SectionData // enum or variant type
}
SectionData =
| TypeSection { types: Vec<FuncType> }
| ImportSection { imports: Vec<Import> }
| FunctionSection { type_indices: Vec<u32> }
| MemorySection { memories: Vec<MemoryType> }
| ExportSection { exports: Vec<Export> }
| CodeSection { bodies: Vec<FuncBody> }
| CustomSection { name: String, bytes: Vec<u8> }
| ...
FuncType {
params: Vec<ValueType>
results: Vec<ValueType>
}
Import {
module_name: String
import_name: String
kind: ImportKind
}
FuncBody {
locals: Vec<(u32, ValueType)> // (count, type) pairs
instructions: Vec<Instruction>
}
Instruction {
opcode: u8
immediates: Immediates // varies by instruction
}
Reader Interface
trait WasmReader {
fn read_byte(&mut self) -> Result<u8>
fn read_bytes(&mut self, n: usize) -> Result<Vec<u8>>
fn read_u32_le(&mut self) -> Result<u32> // for header
fn read_uleb128(&mut self) -> Result<u64>
fn read_sleb128(&mut self) -> Result<i64>
fn read_string(&mut self) -> Result<String> // length-prefixed
fn position(&self) -> usize
fn remaining(&self) -> usize
}
Parsing Flow
parse_module():
1. Read and validate magic number
2. Read and store version
3. While bytes remaining:
a. Read section id (1 byte)
b. Read section size (uLEB128)
c. Read section_size bytes
d. Dispatch to section parser based on id
e. Verify exactly section_size bytes were consumed
parse_type_section(bytes):
1. Read number of types
2. For each type:
a. Read 0x60 (func type marker)
b. Read param count, then that many type bytes
c. Read result count, then that many type bytes
parse_code_section(bytes):
1. Read number of function bodies
2. For each body:
a. Read body size
b. Read locals (count, then groups of (count, type))
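c. Read instructions until body_size bytes are consumed (the final byte must be 0x0B, end; nested blocks also close with 0x0B, so do not stop at the first one)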
parse_instruction():
1. Read opcode byte
2. Based on opcode, read immediates:
- 0x41 (i32.const): read sLEB128
- 0x20 (local.get): read uLEB128 (index)
- 0x02 (block): read block type
- etc.
Implementation Guide
Phase 1: File Reading and Header (Day 1)
Goal: Read a .wasm file and validate the header
Steps:
- Read entire file into a byte buffer
- Check first 4 bytes are 00 61 73 6D
- Read next 4 bytes as little-endian u32 (should be 1)
- Print "Valid WASM module, version X"
Hint: Little-endian u32 from bytes b0 b1 b2 b3:
value = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
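For reference, the Phase 1 steps might look like this in Python. It is a minimal sketch: parse_header is a hypothetical name, and rejecting every version other than 1 is a choice made for this sketch rather than something the steps above require.

```python
import struct


def parse_header(data: bytes) -> int:
    """Validate the 8-byte WASM header and return the version number."""
    if len(data) < 8:
        raise ValueError("file too short for a WASM header")
    if data[:4] != b"\x00asm":
        raise ValueError(f"bad magic number: {data[:4].hex(' ')}")
    # "<I" = little-endian unsigned 32-bit integer, read at offset 4
    (version,) = struct.unpack_from("<I", data, 4)
    if version != 1:
        raise ValueError(f"unsupported version {version}")
    return version


header = bytes.fromhex("0061736d01000000")   # the bytes of an empty module
print(f"Valid WASM module, version {parse_header(header)}")
```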
Testing: Create a minimal WAT file, compile it, parse it:
(module)
wat2wasm minimal.wat -o minimal.wasm
./wasmparser minimal.wasm
Phase 2: LEB128 Decoding (Day 1-2)
Goal: Implement robust LEB128 decoding
Unsigned LEB128 algorithm:
result = 0
shift = 0
while true:
byte = read_byte()
result |= (byte & 0x7F) << shift
if (byte & 0x80) == 0:
break
shift += 7
return result
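Signed LEB128 addition:
// In the signed variant, do shift += 7 BEFORE testing the continuation bit,
// so that shift counts every payload bit read, including the final byte's.
// After the loop, if sign extension needed:
if shift < 64 and (byte & 0x40) != 0:
result |= -(1 << shift) // Sign extend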
Test cases:
| Bytes | Unsigned | Signed |
|---|---|---|
| 0x00 | 0 | 0 |
| 0x7F | 127 | -1 |
| 0x80 0x01 | 128 | 128 |
| 0xFF 0x7F | 16383 | -1 |
| 0xE5 0x8E 0x26 | 624485 | 624485 |
| 0x9B 0xF1 0x59 | - | -624485 |
Phase 3: Section Enumeration (Day 2)
Goal: List all sections without parsing contents
Algorithm:
while position < file_length:
section_id = read_byte()
section_size = read_uleb128()
print("Section", section_id, "size", section_size)
skip(section_size) // advance position
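A Python rendering of the enumeration loop is sketched below. The sample bytes are hand-assembled for a small module with Type, Function, Memory, Export, and Code sections and are an assumption of this sketch, as are the names list_sections and SECTION_NAMES.

```python
SECTION_NAMES = {0: "Custom", 1: "Type", 2: "Import", 3: "Function", 4: "Table",
                 5: "Memory", 6: "Global", 7: "Export", 8: "Start", 9: "Element",
                 10: "Code", 11: "Data", 12: "Data Count"}


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def list_sections(data: bytes) -> list[tuple[int, str, int]]:
    """Return (offset, name, size) for each section, skipping the contents."""
    pos = 8                                    # skip magic number and version
    sections = []
    while pos < len(data):
        offset = pos
        section_id = data[pos]
        pos += 1
        size, pos = read_uleb128(data, pos)
        if pos + size > len(data):
            raise ValueError(f"section at {offset:#x} runs past end of file")
        name = SECTION_NAMES.get(section_id, f"Unknown({section_id})")
        sections.append((offset, name, size))
        pos += size                            # advance over the payload
    return sections


SAMPLE = bytes.fromhex(
    "0061736d01000000"                         # header
    "01070160027f7f017f"                       # Type
    "03020100"                                 # Function
    "0503010001"                               # Memory
    "071002036164640000066d656d6f72790200"     # Export
    "0a09010700200020016a0b"                   # Code
)
for offset, name, size in list_sections(SAMPLE):
    print(f"{offset:#06x}  {name:<10} size={size}")
```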
Testing: Parse a compiled WAT with multiple elements:
(module
(memory 1)
(func (export "test") (param i32) (result i32)
local.get 0
)
)
Expected sections: Type, Function, Memory, Export, Code
Phase 4: Type Section Parser (Day 3)
Goal: Parse and display function signatures
Hint for structure:
Type section:
count (uLEB128)
for each:
0x60 (function type marker)
param_count (uLEB128)
param_types[param_count] (one byte each)
result_count (uLEB128)
result_types[result_count] (one byte each)
Type byte decoding:
0x7F → "i32"
0x7E → "i64"
0x7D → "f32"
0x7C → "f64"
Output format:
Type Section (1):
[0] (i32, i32) -> (i32)
[1] () -> ()
Phase 5: Import and Export Sections (Day 4)
Goal: Parse imports and exports
String reading: Length-prefixed UTF-8
length = read_uleb128()
bytes = read_bytes(length)
string = utf8_decode(bytes)
Import kind dispatch:
kind = read_byte()
match kind:
0x00 (function): type_index = read_uleb128()
0x01 (table): parse table type
0x02 (memory): parse memory limits
0x03 (global): parse global type
Export format:
name = read_string()
kind = read_byte()
index = read_uleb128()
Phase 6: Code Section and Disassembly (Day 5-7)
Goal: Parse function bodies and print instructions
Function body structure:
body_size = read_uleb128()
// Now read exactly body_size bytes
locals_count = read_uleb128()
for locals_count times:
count = read_uleb128() // how many of this type
type = read_byte() // what type
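// Rest is instructions. The body's final byte is 0x0B (end), but nested
// blocks also close with 0x0B, so loop on position rather than on the opcode.
body_end = body_start + body_size // body_start = position right after reading body_size
while position < body_end:
parse_instruction()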
Instruction parsing (start with these):
| Opcode | Name | Immediates |
|---|---|---|
| 0x00 | unreachable | - |
| 0x01 | nop | - |
| 0x0B | end | - |
| 0x10 | call | func_idx (uLEB128) |
| 0x20 | local.get | local_idx (uLEB128) |
| 0x21 | local.set | local_idx (uLEB128) |
| 0x41 | i32.const | value (sLEB128) |
| 0x6A | i32.add | - |
| 0x6B | i32.sub | - |
| 0x6C | i32.mul | - |
Output format:
Code Section (10):
[0] 5 bytes, locals: 0
20 00 local.get 0
20 01 local.get 1
6A i32.add
0B end
Phase 7: Full Integration (Day 8+)
Goal: Handle all sections, robust error handling
- Add remaining sections (Table, Memory limits, Global, Element, Data)
- Add control flow instructions (block, loop, if, br, br_if)
- Add memory instructions with memarg parsing
- Improve error messages with byte offsets
- Add validation (check section ordering, size consistency)
Testing Strategy
Unit Tests for LEB128
// Unsigned
assert(parse_uleb128([0x00]) == 0)
assert(parse_uleb128([0x7F]) == 127)
assert(parse_uleb128([0x80, 0x01]) == 128)
assert(parse_uleb128([0xE5, 0x8E, 0x26]) == 624485)
// Signed
assert(parse_sleb128([0x00]) == 0)
assert(parse_sleb128([0x7F]) == -1)
assert(parse_sleb128([0x80, 0x01]) == 128)
assert(parse_sleb128([0x9B, 0xF1, 0x59]) == -624485)
Integration Tests
Create WAT files, compile them, parse them, verify output:
# Create test cases
echo '(module)' > empty.wat
echo '(module (func (export "f")))' > func.wat
echo '(module (func (param i32) (result i32) local.get 0))' > identity.wat
# Compile
wat2wasm empty.wat
wat2wasm func.wat
wat2wasm identity.wat
# Parse and verify
./wasmparser empty.wasm
./wasmparser func.wasm
./wasmparser identity.wasm
Comparison Testing
Compare your output to wasm-objdump:
wasm-objdump -x example.wasm > expected.txt
./wasmparser example.wasm > actual.txt
diff expected.txt actual.txt
Fuzzing (Advanced)
Use the official WASM test suite:
git clone https://github.com/aspect-it/aspect-it/aspect-it/aspect-itspec-repo
for f in *.wasm; do
./wasmparser "$f" || echo "Failed: $f"
done
Common Pitfalls and Debugging
Pitfall 1: LEB128 Off-by-One
Symptom: Values slightly wrong, especially around powers of 2
Cause: Shift calculation or termination condition wrong
Fix: Test with values at boundaries: 0, 127, 128, 255, 256, 16383, 16384
Pitfall 2: Forgetting Section Size Constraint
Symptom: Parsing goes off the rails after first section
Cause: Not tracking bytes consumed within section
Fix: Record position before section, verify position == start + size after
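One way to enforce the size constraint is to hand each section parser a reader that cannot see past the section's end. The Reader class below is a hypothetical sketch of that idea, not the interface the project requires.

```python
class Reader:
    """Cursor over a byte buffer, restricted to the window [pos, end)."""

    def __init__(self, data: bytes, start: int = 0, end=None):
        self.data = data
        self.pos = start
        self.end = len(data) if end is None else end

    def read_byte(self) -> int:
        if self.pos >= self.end:
            raise ValueError(f"read past end of region at offset {self.pos:#x}")
        byte = self.data[self.pos]
        self.pos += 1
        return byte

    def sub_reader(self, size: int) -> "Reader":
        """Carve out the next `size` bytes as a child reader and skip over them."""
        if self.pos + size > self.end:
            raise ValueError(f"region of {size} bytes at {self.pos:#x} is too large")
        child = Reader(self.data, self.pos, self.pos + size)
        self.pos += size
        return child

    def expect_end(self) -> None:
        """Call after parsing a section: every byte must have been consumed."""
        if self.pos != self.end:
            raise ValueError(f"{self.end - self.pos} unread byte(s) at {self.pos:#x}")


outer = Reader(bytes([0x03, 0x02, 0x01, 0x00, 0xAA]))
section_id = outer.read_byte()                 # 0x03
section = outer.sub_reader(outer.read_byte())  # size byte is 0x02
section.read_byte()
section.read_byte()
section.expect_end()                           # passes: exactly 2 bytes consumed
print(hex(outer.read_byte()))                  # 0xaa, the parent resumes after the section
```

Because the parent skips the section up front, a buggy section parser cannot desynchronize the rest of the parse; it can only fail inside its own window.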
Pitfall 3: Signed vs Unsigned Confusion
Symptom: Negative numbers look huge positive
Cause: Using uLEB128 where sLEB128 needed (e.g., i32.const)
Fix: i32.const and i64.const use signed; indices use unsigned
Pitfall 4: String Encoding
Symptom: Import/export names garbled
Cause: Not reading length prefix, or encoding issues
Fix: Strings are length-prefixed (uLEB128) followed by raw UTF-8 bytes
Pitfall 5: Nested Control Structures
Symptom: Disassembly wrong for if/block/loop
Cause: Not tracking nesting depth
Fix: block/loop/if push depth, end pops depth. Track for proper indentation.
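Depth tracking for indentation can be as simple as a counter. The sketch below assumes the instructions have already been decoded into mnemonic strings; indent_instructions is a hypothetical helper, and treating else as "close then reopen" is one reasonable formatting choice among several.

```python
def indent_instructions(mnemonics: list[str], unit: str = "  ") -> list[str]:
    """Indent decoded instructions according to block nesting depth."""
    depth = 0
    lines = []
    for text in mnemonics:
        op = text.split()[0]
        if op in ("end", "else"):
            depth = max(depth - 1, 0)      # closers print at the opener's level
        lines.append(unit * depth + text)
        if op in ("block", "loop", "if", "else"):
            depth += 1                     # following instructions are nested
    return lines


body = ["block", "local.get 0", "if", "nop", "else", "nop", "end", "end", "end"]
print("\n".join(indent_instructions(body)))
```

The final end in the example belongs to the function body itself, which is why the depth is clamped at zero instead of going negative.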
Debugging with xxd
xxd example.wasm | head -20
Shows hex dump. Match against your parser's position to find issues.
Debugging with wasm-objdump
wasm-objdump -h example.wasm # Headers and section list
wasm-objdump -x example.wasm # Full details
wasm-objdump -d example.wasm # Disassembly
Compare output to find where your parser diverges.
Extensions and Challenges
Extension 1: JSON Output
Add --json flag for machine-readable output. Useful for tooling.
Extension 2: Hex Dump Mode
Show raw bytes alongside disassembly:
000010: 20 00 local.get 0
000012: 20 01 local.get 1
000014: 6A i32.add
Extension 3: Validation
Check module validity:
- Sections in order?
- Type indices in range?
- Import counts match?
- Stack types balance?
Extension 4: Round-Trip
Parse WASM, emit WAT:
./wasmparser --emit-wat example.wasm > roundtrip.wat
wat2wasm roundtrip.wat -o roundtrip.wasm
diff example.wasm roundtrip.wasm # Should be identical (or semantically equivalent)
Extension 5: Custom Section Parser
Parse the โnameโ custom section to show function and local names:
Function names:
[0] "add"
[1] "multiply"
Real-World Connections
How Browsers Parse WASM
Browsers like Chrome and Firefox have highly optimized WASM parsers:
- Streaming compilation: Begin compiling while downloading
- Parallel parsing: Multiple threads parse different sections
- Memory mapping: Large files mapped directly, not copied
Your parser teaches the same concepts, just without these optimizations.
How wabt Works
The wasm2wat tool from wabt does exactly what you're building:
- Parse binary with LEB128 decoding
- Build in-memory representation
- Pretty-print as WAT
Study wabt's source code to see production patterns.
Compiler Output Analysis
Compilers like Emscripten and Rust/wasm32 produce WASM. Your parser lets you:
- Understand what code patterns compile to
- Debug ABI issues
- Optimize output by understanding size costs
Security Research
WASM parsers are security-critical. Malformed WASM has been used in exploits:
- Buffer overflows from bad section sizes
- Integer overflows in LEB128
- Memory corruption from invalid instructions
Your parser teaches you to think about these boundaries.
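As one concrete example of boundary thinking, a u32 decoder can cap the encoded length and reject values that do not fit in 32 bits. This is a hedged sketch; read_u32 is a hypothetical name, and a strict validator may apply further checks beyond these two.

```python
def read_u32(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 value that must fit in 32 bits."""
    result = 0
    for i in range(5):                         # ceil(32 / 7) = 5 bytes at most
        if pos >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:
            if result > 0xFFFFFFFF:            # fifth byte carried too many bits
                raise ValueError("LEB128 value does not fit in u32")
            return result, pos
    raise ValueError("LEB128 value longer than 5 bytes")


print(read_u32(bytes.fromhex("ffffffff0f")))   # (4294967295, 5)
```

In a language with fixed-width integers, the same checks also prevent the shift itself from overflowing, which is where real-world bugs of this kind tend to come from.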
Real-World Outcome
When your parser is complete, you will have a tool that produces detailed, professional output for any WebAssembly binary. Here is exactly what running your parser should produce:
Example 1: Simple Add Function
Input WAT file (add.wat):
(module
(func $add (export "add") (param i32 i32) (result i32)
local.get 0
local.get 1
i32.add
)
(memory (export "memory") 1)
)
Command and Output:
$ wat2wasm add.wat -o add.wasm
$ ./wasmparser add.wasm
================================================================================
WASM BINARY PARSER v1.0
================================================================================
File: add.wasm
Size: 55 bytes
--------------------------------------------------------------------------------
HEADER ANALYSIS
--------------------------------------------------------------------------------
Offset Bytes Description
------ ----- -----------
0x00 00 61 73 6D Magic number: "\0asm" [VALID]
0x04 01 00 00 00 Version: 1 (little-endian u32)
--------------------------------------------------------------------------------
SECTION MAP
--------------------------------------------------------------------------------
ID Name Offset Size Description
-- ---- ------ ---- -----------
1 Type 0x08 7 Function signatures
3 Function 0x11 2 Function-to-type mappings
5 Memory 0x15 3 Linear memory declarations
7 Export 0x1A 16 Exported names
10 Code 0x2C 9 Function bodies
--------------------------------------------------------------------------------
SECTION 1: TYPE
--------------------------------------------------------------------------------
Count: 1 function type(s)
[0] func (i32, i32) -> (i32)
Encoding: 60 02 7F 7F 01 7F
^ ^ ^--^ ^ ^
| | | | +-- result: i32
| | | +-- 1 result
| | +-- params: i32, i32
| +-- 2 params
+-- func type marker
--------------------------------------------------------------------------------
SECTION 3: FUNCTION
--------------------------------------------------------------------------------
Count: 1 function(s)
[0] type_idx=0 -> (i32, i32) -> (i32)
--------------------------------------------------------------------------------
SECTION 5: MEMORY
--------------------------------------------------------------------------------
Count: 1 memory declaration(s)
[0] limits: min=1 page (64KB), max=unlimited
Encoding: 00 01
^ ^
| +-- minimum pages
+-- flags (no maximum)
--------------------------------------------------------------------------------
SECTION 7: EXPORT
--------------------------------------------------------------------------------
Count: 2 export(s)
[0] "add" -> func[0]
Kind: 0x00 (function)
Index: 0
[1] "memory" -> memory[0]
Kind: 0x02 (memory)
Index: 0
--------------------------------------------------------------------------------
SECTION 10: CODE
--------------------------------------------------------------------------------
Count: 1 function body(ies)
[0] Function $add
Body size: 7 bytes
Locals: (none)
Disassembly:
~~~~~~~~~~~~
Offset Bytes Instruction Stack Effect
------ ----- ----------- ------------
0x00 20 00 local.get 0 [] -> [i32]
0x02 20 01 local.get 1 [i32] -> [i32, i32]
0x04 6A i32.add [i32, i32] -> [i32]
0x05 0B end [i32] -> [i32]
================================================================================
SUMMARY
================================================================================
Total sections: 5
Function types: 1
Functions: 1 (0 imported, 1 defined)
Exports: 2
Memory: 1 page minimum
Validation: PASSED
Parse completed in 0.3ms
Example 2: Module with Imports
Input WAT file (with_imports.wat):
(module
(import "env" "log" (func $log (param i32)))
(import "env" "memory" (memory 1))
(func $greet (export "greet") (param i32)
(local i32)
local.get 0
i32.const 100
i32.add
local.set 1
local.get 1
call $log
)
)
Command and Output:
$ wat2wasm with_imports.wat -o with_imports.wasm
$ ./wasmparser with_imports.wasm --verbose
================================================================================
WASM BINARY PARSER v1.0
================================================================================
File: with_imports.wasm
Size: 77 bytes
--------------------------------------------------------------------------------
HEADER ANALYSIS
--------------------------------------------------------------------------------
Offset Bytes Description
------ ----- -----------
0x00 00 61 73 6D Magic number: "\0asm" [VALID]
0x04 01 00 00 00 Version: 1 (little-endian u32)
--------------------------------------------------------------------------------
SECTION 1: TYPE
--------------------------------------------------------------------------------
Count: 1 function type(s)
[0] func (i32) -> ()
Used by: import "env"."log", function $greet
Note: Identical inline signatures share a single type entry
--------------------------------------------------------------------------------
SECTION 2: IMPORT
--------------------------------------------------------------------------------
Count: 2 import(s)
[0] "env"."log" : func
Type index: 0
Signature: (i32) -> ()
Assigned function index: 0
Encoding breakdown:
03 65 6E 76 ; module name length=3, "env"
03 6C 6F 67 ; import name length=3, "log"
00 ; kind: function
00 ; type index: 0
[1] "env"."memory" : memory
Limits: min=1 page, max=none
Assigned memory index: 0
--------------------------------------------------------------------------------
SECTION 3: FUNCTION
--------------------------------------------------------------------------------
Count: 1 function(s)
[0] type_idx=0 -> (i32) -> ()
Note: This is function index 1 (after 1 imported function)
--------------------------------------------------------------------------------
SECTION 7: EXPORT
--------------------------------------------------------------------------------
Count: 1 export(s)
[0] "greet" -> func[1]
--------------------------------------------------------------------------------
SECTION 10: CODE
--------------------------------------------------------------------------------
Count: 1 function body(ies)
[0] Function index 1 (export: "greet")
Body size: 16 bytes
Locals: 1 group(s)
- 1 x i32 (local index: 1, after 1 param)
Disassembly:
~~~~~~~~~~~~
Offset Bytes Instruction Immediate Stack
------ ----- ----------- --------- -----
0x00 20 00 local.get idx=0 [] -> [i32]
0x02 41 E4 00 i32.const 100 [i32] -> [i32,i32]
(LEB128: E4 00 = 100)
0x05 6A i32.add [i32,i32] -> [i32]
0x06 21 01 local.set idx=1 [i32] -> []
0x08 20 01 local.get idx=1 [] -> [i32]
0x0A 10 00 call func_idx=0 [i32] -> []
(calls: "env"."log")
0x0C 0B end [] -> []
================================================================================
INDEX SPACES (after all imports processed)
================================================================================
Function indices:
[0] import "env"."log" : (i32) -> ()
[1] $greet : (i32) -> () [exported as "greet"]
Memory indices:
[0] import "env"."memory" : 1 page min
================================================================================
Parse completed in 0.4ms
Example 3: Hex Dump Mode
$ ./wasmparser add.wasm --hex
WASM Binary Hex Dump: add.wasm
==============================
Header:
00000000: 00 61 73 6D 01 00 00 00 .asm....
Section 1 (Type) - 7 bytes:
00000008: 01 07 01 60 02 7F 7F 01 7F ...`.....
Section 3 (Function) - 2 bytes:
00000011: 03 02 01 00 ....
Section 5 (Memory) - 3 bytes:
00000015: 05 03 01 00 01 .....
Section 7 (Export) - 16 bytes:
0000001A: 07 10 02 03 61 64 64 00 00 06 6D 65 6D 6F 72 79 ....add...memory
0000002A: 02 00 ..
Section 10 (Code) - 9 bytes:
0000002C: 0A 09 01 07 00 20 00 20 01 6A 0B ..... . .j.
Example 4: JSON Output Mode
$ ./wasmparser add.wasm --json | jq .
{
"magic": "00 61 73 6D",
"version": 1,
"sections": [
{
"id": 1,
"name": "type",
"offset": 8,
"size": 7,
"types": [
{
"index": 0,
"params": ["i32", "i32"],
"results": ["i32"]
}
]
},
{
"id": 3,
"name": "function",
"offset": 17,
"size": 2,
"functions": [
{"index": 0, "type_index": 0}
]
},
{
"id": 5,
"name": "memory",
"offset": 21,
"size": 3,
"memories": [
{"index": 0, "min": 1, "max": null}
]
},
{
"id": 7,
"name": "export",
"offset": 26,
"size": 16,
"exports": [
{"name": "add", "kind": "function", "index": 0},
{"name": "memory", "kind": "memory", "index": 0}
]
},
{
"id": 10,
"name": "code",
"offset": 44,
"size": 9,
"bodies": [
{
"index": 0,
"size": 7,
"locals": [],
"instructions": [
{"offset": 0, "opcode": "0x20", "mnemonic": "local.get", "immediate": 0},
{"offset": 2, "opcode": "0x20", "mnemonic": "local.get", "immediate": 1},
{"offset": 4, "opcode": "0x6A", "mnemonic": "i32.add"},
{"offset": 5, "opcode": "0x0B", "mnemonic": "end"}
]
}
]
}
],
"summary": {
"total_sections": 5,
"function_count": 1,
"import_count": 0,
"export_count": 2,
"valid": true
}
}
Example 5: Error Handling
$ ./wasmparser corrupted.wasm
================================================================================
WASM BINARY PARSER v1.0
================================================================================
File: corrupted.wasm
Size: 25 bytes
--------------------------------------------------------------------------------
HEADER ANALYSIS
--------------------------------------------------------------------------------
Offset Bytes Description
------ ----- -----------
0x00 00 61 73 6D Magic number: "\0asm" [VALID]
0x04 01 00 00 00 Version: 1 (little-endian u32)
--------------------------------------------------------------------------------
PARSE ERROR
--------------------------------------------------------------------------------
Error: Invalid section ordering at offset 0x10
Expected: Section ID > 3 (or a custom section, ID 0)
Found: Section ID 1 (Type)
Context: Section 3 (Function) was already parsed at offset 0x08.
Known sections must appear at most once and in ascending ID order.
Only custom sections (ID 0) may appear out of order.
Partial parse results available above this error.
Exit code: 1
The Core Question You're Answering
How does a compact binary format encode program structure, and what design tradeoffs enable fast parsing?
This question sits at the heart of systems programming. As you build your parser, you're uncovering answers to fundamental design decisions:
- Why variable-length encoding? LEB128 saves bytes for small values (common in indices and sizes) at the cost of complexity. A module with 1000 small functions saves kilobytes compared to fixed 4-byte integers.
- Why strict section ordering? Single-pass parsing. By the time a function body arrives in the Code section, every type, import, and function declaration it can reference has already been read, so a streaming compiler can compile each body as it arrives. No seeking backward, no multi-pass algorithms.
- Why separate Function and Code sections? Forward reference resolution. The Function section declares all type signatures upfront, so the Code section can validate call instructions immediately without lookahead.
- Why magic numbers and version fields? Fast rejection of invalid files. A parser can fail in 8 bytes instead of parsing garbage and failing deep inside.
Reflect on this: Every byte in the format exists for a reason. When you encounter something that seems redundant or complex, ask: "What problem does this solve?" The answer usually involves space efficiency, parse speed, or validation simplicity.
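The "fail in 8 bytes" point can be made concrete with a header check. The following is a minimal sketch under stated assumptions: `check_header` is a hypothetical name, the whole file is assumed to be in memory as a byte slice, and only version 1 is accepted.

```rust
/// Validates the 8-byte preamble and returns the version on success.
/// Hypothetical helper; error strings are illustrative only.
fn check_header(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err(format!("file too short: {} bytes", bytes.len()));
    }
    // Magic number: a null byte followed by the ASCII letters "asm".
    if &bytes[0..4] != b"\0asm" {
        return Err("bad magic number".to_string());
    }
    // The version is a fixed-width little-endian u32, not LEB128.
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    if version != 1 {
        return Err(format!("unsupported version {}", version));
    }
    Ok(version)
}
```

Because the check touches only the first 8 bytes, a PDF or PNG handed to the parser by mistake is rejected before any section decoding begins.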
Concepts You Must Understand First
Before writing code, ensure you have solid mental models for these concepts:
- LEB128 variable-length encoding - The algorithm for reading integers where small values use fewer bytes, with continuation bits indicating "more data follows"
- Binary file structure and byte-level I/O - How files are organized as sequences of bytes, and how to read them in chunks while tracking position
- Hexadecimal notation and bit manipulation - Converting between hex, binary, and decimal; using AND, OR, and shift operations to extract fields from bytes
- Magic numbers and file format identification - How files begin with signature bytes that identify their format (PDF starts with %PDF, PNG with \x89PNG, WASM with \0asm)
- Type encodings and tagged unions - How a single byte can indicate which variant follows (e.g., 0x60 means "function type", 0x7F means "i32")
- Length-prefixed vs. delimiter-terminated data - WASM uses length prefixes (know size before reading) rather than delimiters (scan for terminator), enabling single-pass parsing
- Index spaces and forward references - How WASM assigns indices to types, functions, memories, etc., and why imports affect the numbering
Book Reference: See "Computer Systems: A Programmer's Perspective" chapters 2 and 7 for binary representations and linking concepts. The "Practical Binary Analysis" book covers file format parsing in depth.
Questions to Guide Your Design
Before implementing, think through these design questions:
Reading the byte stream:
- How will you track your current position in the file?
- What happens if you try to read past the end of the file?
- Should you read the whole file into memory, or stream from disk?
Decoding LEB128:
- How do you know when to stop reading bytes?
- What's the maximum number of bytes you might read for a u32? For a u64?
- How does signed LEB128 differ, and when do you need sign extension?
Parsing sections:
- How will you verify you consumed exactly the number of bytes a section claims?
- What should happen if you encounter an unknown section ID (like ID 15)?
- How will you handle custom sections that can appear anywhere?
Validating structure:
- How do you enforce that sections appear in ascending ID order?
- What if a section appears twice?
- Should you validate that function indices are in range, or defer to a separate validation pass?
Decoding instructions:
- How will you map opcode bytes to instruction names?
- What's your strategy for handling the ~200 different opcodes?
- How will you track nesting depth for block/loop/if structures?
Output format:
- Will you build an in-memory AST first, or print while parsing?
- How will you handle alignment and formatting for readable output?
- What information is essential vs. nice-to-have?
Thinking Exercise
Before writing any code, manually decode this WASM binary by hand. This exercise builds intuition that no amount of coding can replace.
The binary (32 bytes):
00 61 73 6D 01 00 00 00 01 07 01 60 02 7F 7F 01
7F 03 02 01 00 0A 09 01 07 00 20 00 20 01 6A 0B
Your task:
- Identify the header (first 8 bytes):
  - What is the magic number?
  - What version is this module?
- Find the first section (starts at byte 8):
  - What is the section ID?
  - Decode the section size (it's a single-byte LEB128)
  - What are the raw bytes of the section content?
- Parse the Type section content:
  - How many types are declared?
  - What is the marker byte for a function type?
  - How many parameters? What types?
  - How many results? What types?
  - Write out the signature in WAT form: (func (param ...) (result ...))
- Find the second section:
  - What is the section ID?
  - What is the size?
  - This is the Function section. How many functions does it declare?
  - What type index does function 0 use?
- Find the third section (the Code section):
  - What is the section ID?
  - What is the size?
  - How many function bodies?
  - What is the body size for function 0?
  - How many local variable groups?
- Decode each instruction:
  - Byte 0x20 = ?
  - Byte 0x00 = ?
  - Byte 0x20 = ?
  - Byte 0x01 = ?
  - Byte 0x6A = ?
  - Byte 0x0B = ?
- Reconstruct the WAT: Write the complete WAT that would produce this binary.
Hint for checking your work: The module contains a single function that adds two i32 parameters.
The Interview Questions They'll Ask
Binary parsing skills translate directly to interview questions. Here's what you'll be prepared to answer:
LEB128 and Variable-Length Encoding:
- "Implement a function to decode unsigned LEB128 from a byte stream."
- "What is the maximum value that can be stored in 3 bytes of LEB128?"
- "Why might a format use variable-length encoding instead of fixed-size integers?"
- "Given bytes 0xE5 0x8E 0x26, what unsigned integer does this represent?"
Binary File Formats:
- "How would you design a binary format for a configuration file?"
- "What are the tradeoffs between length-prefixed and delimiter-terminated strings?"
- "Why do file formats use magic numbers?"
- "How would you make a binary format extensible for future versions?"
Parsing and Validation:
- "How would you implement a streaming parser that can process data as it arrives?"
- "What's the difference between syntax errors and semantic errors in a binary format?"
- "How would you gracefully handle corrupted or truncated input?"
- "Design a data structure to represent a parsed AST for a binary format."
Systems Programming:
- "What's the difference between big-endian and little-endian byte order?"
- "How would you memory-map a large file for efficient parsing?"
- "What security considerations apply when parsing untrusted binary input?"
- "How would you implement fuzzing for a binary parser?"
WebAssembly Specific:
- "Explain the relationship between the Function section and Code section."
- "Why must WASM sections appear in a specific order?"
- "What is the purpose of the Data Count section added in WASM 2.0?"
- "How does WASM's binary format enable streaming compilation?"
Hints in Layers
If you get stuck, reveal hints progressively. Try to solve problems yourself before looking.
Layer 1: Getting Started
Hint: File reading structure
Create a struct to wrap your byte buffer with position tracking:
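struct ByteReader<'a> {
    data: &'a [u8], // borrowed file contents (a reference field needs a lifetime)
    pos: usize,     // index of the next unread byte
}

impl<'a> ByteReader<'a> {
    /// Returns the next byte and advances, or None at end of input.
    fn read_byte(&mut self) -> Option<u8> {
        if self.pos < self.data.len() {
            let b = self.data[self.pos];
            self.pos += 1;
            Some(b)
        } else {
            None
        }
    }
}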
All other read operations build on read_byte().
Hint: LEB128 termination condition
The high bit (0x80) of each byte indicates continuation:
- If byte & 0x80 != 0: more bytes follow
- If byte & 0x80 == 0: this is the last byte
Extract the 7 value bits with byte & 0x7F.
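Putting the continuation bit and the 7 value bits together, an unsigned decoder might look like the sketch below. It is one possible shape, not the only one: `read_uleb128_u32` is a hypothetical name, it works on a plain slice rather than the `ByteReader` above, and it rejects encodings that are too long or too large for a u32.

```rust
/// Decodes an unsigned LEB128 value that must fit in a u32.
/// Returns the value and the number of bytes consumed.
fn read_uleb128_u32(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut result: u32 = 0;
    let mut shift = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        if i == 5 {
            return None; // a u32 never needs more than 5 bytes
        }
        let low = (byte & 0x7F) as u32;
        if shift == 28 && low > 0x0F {
            return None; // only 4 bits are left in the fifth byte
        }
        result |= low << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 1)); // high bit clear: last byte
        }
        shift += 7;
    }
    None // input ended while the continuation bit was still set
}
```

For instance, the single byte `0x7F` decodes to 127, while 128 needs two bytes (`0x80 0x01`): the first contributes seven zero bits and signals continuation, the second contributes `1 << 7`.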
Layer 2: Section Parsing
Hint: Section boundary tracking
After reading a section's size, record where its content starts, then compare against that position once parsing is done:
let section_size = reader.read_uleb128() as usize;
let content_start = reader.pos; // first byte of the section content
// Parse content...
let bytes_consumed = reader.pos - content_start;
assert!(bytes_consumed == section_size, "Section size mismatch!");
This catches bugs where you read too many or too few bytes.
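The same bookkeeping also lets you list sections without decoding them at all, which is a useful first milestone. The sketch below is an illustration under assumptions: `SectionInfo`, `read_uleb`, and `list_sections` are hypothetical names, the header is assumed to have been checked already, and the LEB128 reader is deliberately minimal.

```rust
/// One section header: ID, offset of the content, and content size.
#[derive(Debug, PartialEq)]
struct SectionInfo {
    id: u8,
    content_offset: usize,
    size: usize,
}

/// Minimal unsigned LEB128 reader for sizes (at most 5 bytes).
fn read_uleb(data: &[u8], pos: &mut usize) -> Option<usize> {
    let mut result = 0usize;
    let mut shift = 0;
    loop {
        let byte = *data.get(*pos)?;
        *pos += 1;
        result |= ((byte & 0x7F) as usize) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
        if shift > 28 {
            return None; // longer than a u32 encoding allows
        }
    }
}

/// Walks the sections after the 8-byte header, skipping their content.
fn list_sections(data: &[u8]) -> Result<Vec<SectionInfo>, String> {
    let mut pos = 8; // skip magic and version
    let mut sections = Vec::new();
    while pos < data.len() {
        let id = data[pos];
        pos += 1;
        let size = read_uleb(data, &mut pos)
            .ok_or_else(|| format!("bad section size near offset {:#x}", pos))?;
        let end = pos
            .checked_add(size)
            .filter(|&e| e <= data.len())
            .ok_or_else(|| format!("section {} overruns the file", id))?;
        sections.push(SectionInfo { id, content_offset: pos, size });
        pos = end; // a full parser would decode here and check it stops at `end`
    }
    Ok(sections)
}
```

Jumping straight to `end` is what length prefixes buy you: the walker never needs to understand a section to get past it.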
Hint: Handling unknown sections
If you encounter a section ID you don't recognize:
if section_id > 12 {
    // Unknown section - skip it entirely
    reader.skip(section_size);
    continue;
}
This keeps a dump tool usable on modules produced for future WASM versions. A strict validator would instead reject an unknown section ID as malformed.
Layer 3: Instruction Decoding
Hint: Opcode dispatch pattern
Use a match/switch on the opcode byte. Group by category:
match opcode {
    // Control flow
    0x00 => Instruction::Unreachable,
    0x01 => Instruction::Nop,
    0x02 => Instruction::Block(read_block_type()),
    0x0B => Instruction::End,
    // Variables
    0x20 => Instruction::LocalGet(read_uleb128()),
    0x21 => Instruction::LocalSet(read_uleb128()),
    // Constants
    0x41 => Instruction::I32Const(read_sleb128()), // Note: signed!
    // Arithmetic (no immediates)
    0x6A => Instruction::I32Add,
    0x6B => Instruction::I32Sub,
    _ => Instruction::Unknown(opcode),
}
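The "signed!" note on `i32.const` matters because the same bytes mean different things under the two encodings: `0x7F` is 127 as unsigned LEB128 but -1 as signed. A signed decoder could be sketched as follows; `read_sleb128_i32` is a hypothetical name, and the sketch accumulates into an i64 so the range check at the end stays simple.

```rust
/// Decodes a signed LEB128 value that must fit in an i32.
/// Returns the value and the number of bytes consumed.
fn read_sleb128_i32(bytes: &[u8]) -> Option<(i32, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0;
    for (i, &byte) in bytes.iter().take(5).enumerate() {
        result |= ((byte & 0x7F) as i64) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Bit 6 of the final byte is the sign bit: extend it upward.
            if byte & 0x40 != 0 {
                result |= -1i64 << shift;
            }
            if result < i32::MIN as i64 || result > i32::MAX as i64 {
                return None; // does not fit in 32 bits
            }
            return Some((result as i32, i + 1));
        }
    }
    None // truncated, or longer than 5 bytes
}
```

So `i32.const -1` appears in the code section as `0x41 0x7F`, and the value 64 needs two bytes (`0xC0 0x00`) because a lone `0x40` would have its sign bit set and decode to -64.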
Hint: Memory instruction immediates
Memory operations like i32.load and i32.store have two immediates:
// memarg = (align, offset)
let align = read_uleb128(); // log2 of alignment (0=1, 1=2, 2=4, 3=8)
let offset = read_uleb128(); // byte offset from address
// Effective address = stack_value + offset
The alignment is a hint for optimization; the offset is semantic.
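As a concrete illustration, here is a sketch that reads a memarg and computes the effective address. The names `MemArg`, `read_uleb`, and `read_memarg` are hypothetical, and the LEB128 helper skips the overflow check on the fifth byte for brevity. The address is computed in u64 because base plus offset is not meant to wrap around at 32 bits.

```rust
/// The two immediates of a load or store instruction.
#[derive(Debug, PartialEq)]
struct MemArg {
    align_log2: u32, // alignment hint, stored as an exponent
    offset: u32,     // constant byte offset added to the address operand
}

/// Minimal unsigned LEB128 reader (at most 5 bytes, no overflow check).
fn read_uleb(data: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result = 0u32;
    for shift in (0..35).step_by(7) {
        let byte = *data.get(*pos)?;
        *pos += 1;
        result |= ((byte & 0x7F) as u32) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
    }
    None
}

/// Reads align then offset, in that order.
fn read_memarg(data: &[u8], pos: &mut usize) -> Option<MemArg> {
    let align_log2 = read_uleb(data, pos)?;
    let offset = read_uleb(data, pos)?;
    Some(MemArg { align_log2, offset })
}

impl MemArg {
    /// Alignment in bytes, or None if the exponent is out of range.
    fn alignment_bytes(&self) -> Option<u32> {
        1u32.checked_shl(self.align_log2)
    }

    /// Effective address = address operand + offset, without wrapping.
    fn effective_address(&self, base: u32) -> u64 {
        base as u64 + self.offset as u64
    }
}
```

For the byte sequence `0x28 0x02 0x08` (an `i32.load` with 4-byte alignment and offset 8), calling `read_memarg` just past the opcode yields `align_log2 = 2` and `offset = 8`.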
Layer 4: Advanced Topics
Hint: Block type encoding
Block types use a special encoding:
- 0x40: void (empty block type, no value produced)
- 0x7F, 0x7E, 0x7D, 0x7C: value type (block produces that type)
- 0x00+: type index (signed LEB128, positive = index into type section)
The type index form allows blocks with parameters and multiple results.
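The reason these cases never collide is that `0x40` and the value-type bytes all have bit 6 set, so read as single-byte signed LEB128 they are negative numbers, while a type index is always non-negative. A decoder along those lines might look like this sketch. `BlockType` and `read_block_type` are hypothetical names, and only the four numeric value types listed above are handled; reference and vector types are left out.

```rust
#[derive(Debug, PartialEq)]
enum BlockType {
    Empty,
    Value(u8),      // raw value-type byte, e.g. 0x7F for i32
    TypeIndex(u32), // index into the type section
}

/// Decodes a block type at `pos`, advancing `pos` only on success.
fn read_block_type(data: &[u8], pos: &mut usize) -> Option<BlockType> {
    let first = *data.get(*pos)?;
    match first {
        0x40 => {
            *pos += 1;
            Some(BlockType::Empty)
        }
        0x7C..=0x7F => {
            *pos += 1;
            Some(BlockType::Value(first))
        }
        _ => {
            // Otherwise: a signed LEB128 that must come out non-negative.
            let mut result: i64 = 0;
            let mut shift = 0;
            for i in 0..5 {
                let byte = *data.get(*pos + i)?;
                result |= ((byte & 0x7F) as i64) << shift;
                shift += 7;
                if byte & 0x80 == 0 {
                    if byte & 0x40 != 0 || result > u32::MAX as i64 {
                        return None; // negative or too large for an index
                    }
                    *pos += i + 1;
                    return Some(BlockType::TypeIndex(result as u32));
                }
            }
            None // more than 5 bytes
        }
    }
}
```

One consequence worth noticing: type index 64 cannot be written as the single byte `0x40`, since that byte already means "empty". It is encoded as `0xC0 0x00` instead.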
Hint: Nested block tracking
For proper disassembly indentation, track nesting depth:
let mut depth: usize = 0;
for instr in instructions {
    match instr {
        Block | Loop | If => {
            print_indented(depth, instr);
            depth += 1;
        }
        Else => {
            // Else sits at the same level as its If
            print_indented(depth.saturating_sub(1), instr);
        }
        End => {
            // saturating_sub keeps the function's final `end` at depth 0
            depth = depth.saturating_sub(1);
            print_indented(depth, instr);
        }
        _ => print_indented(depth, instr),
    }
}
Books That Will Help
| Book | Author(s) | Why It's Relevant |
|---|---|---|
| Practical Binary Analysis | Dennis Andriesse | The definitive guide to understanding binary formats. Covers ELF, PE, and general principles of parsing executable formats. Chapters on disassembly directly apply to WASM bytecode. |
| Computer Systems: A Programmer's Perspective | Bryant & O'Hallaron | Foundation for understanding how programs are represented in binary. Chapter 2 on data representation explains bit manipulation; Chapter 7 on linking explains symbol tables and relocations (similar to WASM imports/exports). |
| Low-Level Programming: C, Assembly, and Program Execution | Igor Zhirkov | Teaches the mindset for byte-level programming. Understanding x86 binary encoding helps appreciate WASM's simpler design. |
| Crafting Interpreters | Robert Nystrom | While focused on text parsing, the bytecode chapter (Part III) shows how to design instruction encodings, directly applicable to WASM disassembly. |
| The Art of WebAssembly | Rick Battagline | WebAssembly-specific book that covers the binary format from a practical perspective. Good companion for understanding the "why" behind format decisions. |
| Compilers: Principles, Techniques, and Tools | Aho, Lam, Sethi, Ullman | The Dragon Book's chapters on intermediate representations and code generation explain why WASM is structured the way it is. |
Resources
Specifications
- WebAssembly Binary Format - Definitive reference
- LEB128 on Wikipedia - Algorithm details
Reference Implementations
- wabt source - C++ reference
- parity-wasm - Rust implementation
- aspect-it/aspect-it - Community docs
Tools
- wasm-objdump - From wabt, for comparison
- xxd - Hex dump utility
- wasm-validate - Check if WASM is valid
Self-Assessment Checklist
Before moving to Project 3, verify you can:
- Implement LEB128 encoding/decoding from scratch
- List all 12 section types and their purposes
- Explain why sections must be ordered
- Parse any function signature from the type section
- Decode the code section's instruction stream
- Use wasm-objdump to verify your parser's output
- Handle malformed input without crashing
Conceptual Questions
- Why does WASM use LEB128 instead of fixed-size integers?
- What information is in the Function section vs. the Code section?
- How do you know where a function body ends?
- Why must imports be declared before functions?
- How would you find which type a particular function uses?
Next: P03: Build a WASM Interpreter - execute the bytecode you've learned to parse