Project 3: Thumb Instruction Encoder/Decoder
Encode and decode a subset of Thumb instructions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 10-16 hours |
| Main Programming Language | Python or C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 4 |
| Business Potential | Level 2 |
| Prerequisites | Bitwise operations, Concept 3: Instruction Encoding |
| Key Topics | bitfields, immediates, endianness |
1. Learning Objectives
By completing this project, you will:
- Translate ARM concepts into observable outputs you can verify.
- Explain why each toolchain or hardware step is necessary.
- Detect and fix at least one realistic failure mode.
- Communicate the result clearly in a technical review or interview.
2. All Theory Needed (Per-Concept Breakdown)
Instruction Encoding
Fundamentals
Every assembly instruction is encoded into bits. Encoding determines which registers and immediates are accessible, how large constants can be, and which addressing modes are legal. Thumb encodings prioritize compactness and energy efficiency for microcontrollers, while AArch64 uses fixed 32-bit instruction widths to simplify decode and improve pipeline predictability. Understanding encoding explains why some instructions are missing or require multi-instruction sequences and why certain address calculations must be split into steps.
Deep Dive
Instruction encoding is the “physics” of assembly. A mnemonic like MOV is a human label for a specific bit pattern; if that pattern cannot fit your operands, the assembler will either reject the instruction or, for pseudo-instructions, emit a different instruction sequence. Thumb mixes 16-bit and 32-bit encodings (the 32-bit forms arrived with Thumb-2), so the register and immediate fields of the 16-bit forms are small: most of them use 3-bit register fields that reach only r0-r7. This is why Cortex-M code favors the low registers and why large constants must be loaded via literal pools or multi-step sequences. AArch64, in contrast, uses 32-bit instructions exclusively, providing more predictable decoding and a larger register file. The trade-off is code density versus decode simplicity.
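To make the "bit pattern" idea concrete, here is a minimal sketch of encoding the 16-bit move-immediate form shown in this guide's diagram (opcode bits 00100, a 3-bit Rd, an 8-bit immediate). The function name `encode_mov_imm8` is an illustrative choice, not part of any toolchain. Unified assembler syntax writes this form as MOVS because it updates the flags; the sketch keeps the guide's MOV spelling.

```python
def encode_mov_imm8(rd, imm):
    """Encode the 16-bit Thumb 'MOV Rd, #imm8' form: 00100 | Rd(3) | imm8(8)."""
    if not 0 <= rd <= 7:
        # The Rd field is 3 bits wide, so only r0-r7 can be named.
        raise ValueError(f"Rd must be r0-r7 for this encoding, got r{rd}")
    if not 0 <= imm <= 0xFF:
        # The immediate field is 8 bits wide and unsigned.
        raise ValueError(f"immediate {imm} does not fit in 8 bits")
    return (0b00100 << 11) | (rd << 8) | imm


print(hex(encode_mov_imm8(0, 42)))  # 0x202a
```

The two range checks are the encoding limits expressed as code: a value that fails them has no bit pattern in this form.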
Addressing modes compound encoding constraints. Load/store architectures like ARM separate arithmetic from memory access: you compute addresses in registers, then load or store. But the address computation itself is limited by encoding fields. For example, immediate offsets might be limited to a certain bit width or require alignment. When you understand the bit fields, you can predict when the assembler will need to generate extra instructions, which in turn affects performance and size. This is especially important in microcontrollers where code size is constrained and instruction fetches may come from slow flash.
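As one hedged illustration of an offset limit, the sketch below checks whether `LDR Rt, [Rn, #offset]` fits the 16-bit form, which stores a 5-bit word offset (so the byte offset must be a multiple of 4 between 0 and 124) and 3-bit register fields. The helper names are hypothetical, and the SP-relative and PC-relative load forms are deliberately left out.

```python
def ldr_imm_fits_16bit(rt, rn, offset):
    """True if 'LDR Rt, [Rn, #offset]' fits the 16-bit imm5 form."""
    low_regs = 0 <= rt <= 7 and 0 <= rn <= 7
    # imm5 holds offset / 4, so the offset must be word-aligned and <= 31 * 4.
    return low_regs and offset % 4 == 0 and 0 <= offset <= 124


def encode_ldr_imm5(rt, rn, offset):
    """Encode the 16-bit form: 01101 | imm5 | Rn(3) | Rt(3)."""
    if not ldr_imm_fits_16bit(rt, rn, offset):
        raise ValueError("operands need a wider encoding or an extra instruction")
    return (0b01101 << 11) | ((offset >> 2) << 6) | (rn << 3) | rt


print(hex(encode_ldr_imm5(0, 1, 4)))   # 0x6848
print(ldr_imm_fits_16bit(0, 1, 128))   # False: offset is past the imm5 range
```

A `False` result is the point where an assembler would have to pick a 32-bit encoding (where the core has one) or the programmer would have to compute the address in a register first.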
Encoding also interacts with endianness and instruction alignment. Many ARM cores require instructions to be aligned to 2 or 4 bytes depending on the ISA. Misalignment results in faults or unintended behavior. The assembler handles alignment for you, but if you build a binary layout manually (for example in a boot image), you must respect alignment rules to avoid hard-to-debug startup failures. This is why toolchain awareness is an essential complement to ISA knowledge.
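A small sketch of both concerns, assuming a little-endian target (the usual configuration for Cortex-M) and using Python's standard `struct` module; the helper names are illustrative:

```python
import struct


def halfword_to_bytes(halfword):
    """Serialize one 16-bit Thumb instruction as little-endian bytes."""
    return struct.pack("<H", halfword)


def is_thumb_aligned(address):
    """Thumb instructions must start on a 2-byte boundary."""
    return address % 2 == 0


print(halfword_to_bytes(0x202A).hex())  # 2a20
print(is_thumb_aligned(0x08000001))     # False
```

The low byte (0x2A) comes first in memory, which is why a hex dump of the instruction 0x202A reads `2A 20`.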
Finally, encoding shapes the patterns you see in disassembly. A sequence of machine code bytes can decode to different instructions depending on execution state. This is a common pitfall in reverse engineering: decoding AArch64 bytes as Thumb or ARM32 yields nonsense. Knowing the encoding class prevents misinterpretation and helps you verify that your build pipeline is producing the ISA you intended.
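Even inside Thumb state, a decoder has to decide how many bytes an instruction occupies before it can interpret them. The sketch below assumes the standard Thumb rule that a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 begins a 32-bit instruction; the function name is illustrative.

```python
def thumb_instruction_length(first_halfword):
    """Return the instruction length in bytes (2 or 4) from its first halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


print(thumb_instruction_length(0x202A))  # 2
print(thumb_instruction_length(0xFFFF))  # 4
```

A decoder limited to 16-bit encodings can use this check to reject a value such as 0xFFFF early instead of misreading it as a complete instruction.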
How this fits into the projects
- Core to P03 (Thumb Instruction Encoder/Decoder) and P01 (Toolchain Pipeline Explorer).
Definitions & key terms
- Encoding: Bit-level representation of an instruction.
- Addressing mode: How an instruction specifies the location of its operands.
- Thumb/Thumb-2: Compact ARM instruction encodings (16-bit, plus 32-bit forms added by Thumb-2); the only instruction set available on M-profile cores.
- A64: 32-bit fixed-length instruction encoding for AArch64.
Mental model diagram
Thumb Instruction Encoding Examples:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
16-bit Thumb instruction: MOV r0, #42
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 0│ 1│ 0│ 0│ Rd │ imm8 (immediate) │
│ 0│ 0│ 1│ 0│ 0│0│0│0│ 0│ 0│ 1│ 0│ 1│ 0│ 1│ 0│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
│r0 │ 42 = 0x2A
│ │
▼ ▼
Encodes to: 0x202A (little-endian: 2A 20)
32-bit Thumb-2 instruction: LDR r0, [r1, #offset] (when the offset exceeds 124 or is not a multiple of 4)
┌──────────────────────────────────────────────────────────────────┐
│ First halfword (16 bits) │ Second halfword (16 bits) │
│ encoding prefix + Rn │ Rt + imm12 offset │
└──────────────────────────────────────────────────────────────────┘
Common Thumb Instructions You'll Use:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Data Movement:
MOV Rd, #imm8 Move 8-bit immediate to register
MOV Rd, Rm Move register to register
LDR Rt, [Rn, #off] Load word from memory
STR Rt, [Rn, #off] Store word to memory
PUSH {reglist} Push registers to stack
POP {reglist} Pop registers from stack
Arithmetic:
ADD Rd, Rn, #imm3 Add 3-bit immediate
ADD Rd, #imm8 Add 8-bit immediate to Rd
SUB Rd, Rn, #imm3 Subtract 3-bit immediate
SUBS Rd, Rn, Rm Subtract with flags update
Logic:
AND Rd, Rm Bitwise AND
ORR Rd, Rm Bitwise OR
EOR Rd, Rm Bitwise XOR (exclusive OR)
LSL Rd, Rm, #imm5 Logical shift left
LSR Rd, Rm, #imm5 Logical shift right
Control Flow:
B label Unconditional branch
BEQ label Branch if equal (Z=1)
BNE label Branch if not equal (Z=0)
BL function Branch with link (function call)
BX Rm Branch to address in register
BLX Rm Branch with link to address in register
LIMITATION: Cortex-M0+ is MISSING many instructions!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ No hardware divide (UDIV, SDIV) → Must use software division
✗ No bit-field instructions (BFI) → Must use shift/mask sequences
✗ No conditional execution (IT block) → Must use branches
✗ Limited addressing modes → Can't do [Rn, Rm, LSL #2]
✗ No saturation arithmetic → Must check overflow manually
How it works (step-by-step, with invariants and failure modes)
- The assembler maps mnemonics to encoding templates for the target ISA.
- Register and immediate fields are packed into fixed bit positions.
- If a value doesn’t fit, the assembler emits a sequence or errors out.
- Failure mode: decoding with the wrong ISA yields invalid instructions or faults.
Minimal concrete example (pseudo, not runnable)
ENCODE(op=ADD, rd=R0, rn=R1, imm=5)
→ [opcode bits][imm bits][rn bits][rd bits]
→ 0001110 101 001 000 = 0x1D48
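A runnable counterpart to the pseudo-example, as a sketch for the 16-bit `ADD Rd, Rn, #imm3` form. In that encoding the fields sit, from the most significant bit down, as opcode 0001110, imm3, Rn, Rd. The function name is a hypothetical choice for this guide.

```python
def encode_add_imm3(rd, rn, imm):
    """Encode 16-bit Thumb 'ADD Rd, Rn, #imm3': 0001110 | imm3 | Rn(3) | Rd(3)."""
    for name, reg in (("Rd", rd), ("Rn", rn)):
        if not 0 <= reg <= 7:
            raise ValueError(f"{name} must be r0-r7, got r{reg}")
    if not 0 <= imm <= 7:
        raise ValueError(f"immediate {imm} does not fit in 3 bits")
    return (0b0001110 << 9) | (imm << 6) | (rn << 3) | rd


print(hex(encode_add_imm3(rd=0, rn=1, imm=5)))  # 0x1d48
```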
Common misconceptions
- “Assembler will always accept my operands” → Encoding limits still apply.
- “Instruction length doesn’t matter” → It affects alignment and memory layout.
Check-your-understanding questions
- Why does Thumb use shorter encodings than AArch64?
- What happens if an immediate is too large for its field?
- Why can decoding with the wrong execution state break disassembly?
Check-your-understanding answers
- Thumb optimizes for code density, which matters on M-profile parts with small and often slow flash; the cost is a decoder that must handle mixed 16-bit and 32-bit instruction lengths.
- The assembler emits a sequence or reports an error because it cannot fit the value.
- The same bytes map to different instruction sets depending on state, so decoding mismatches yield nonsense.
Real-world applications
- Building encoders/decoders for tooling and reverse engineering.
- Size-sensitive firmware builds for microcontrollers.
Where you’ll apply it
- This project: see §3.1 and §5.4 in P03-thumb-encoder-decoder.md
- P03 Thumb Instruction Encoder/Decoder
- P01 Toolchain Pipeline Explorer
References
- Arm A-profile overview (AArch64, instruction model).
Key insights
Encoding constraints explain most “mysterious” assembler errors.
Summary
Instruction encodings are the boundary between human mnemonics and machine reality; mastering them unlocks predictability.
Homework/Exercises to practice the concept
- Choose a Thumb instruction and manually identify which bits encode the register fields.
- Explain why a large constant may require multiple instructions.
Solutions to the homework/exercises
- The register fields are fixed bit slices in the instruction encoding; their size limits which registers are directly addressable.
- If the immediate field is too small, the assembler must build the constant through multiple steps.
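The first exercise can be checked mechanically. The sketch below slices fields out of a halfword with a small helper (`bits` is an illustrative name) and applies it to 0x1D48, the 16-bit encoding of `ADD r0, r1, #5`:

```python
def bits(value, hi, lo):
    """Return value[hi:lo], both bit positions inclusive."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


hw = 0x1D48
fields = {
    "opcode": bits(hw, 15, 9),  # 0b0001110
    "imm3": bits(hw, 8, 6),
    "rn": bits(hw, 5, 3),
    "rd": bits(hw, 2, 0),
}
print(fields)  # {'opcode': 14, 'imm3': 5, 'rn': 1, 'rd': 0}
```

Each register field is 3 bits wide, which is exactly why this form cannot name r8 or above.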
3. Project Specification
3.1 What You Will Build
A CLI that maps between mnemonics and 16-bit encodings for a chosen subset.
3.2 Functional Requirements
- Requirement 1: Encode at least 6 Thumb instructions
- Requirement 2: Decode 16-bit values into mnemonics
- Requirement 3: Validate immediate ranges and register fields
3.3 Non-Functional Requirements
- Clear error messages for invalid encodings
3.4 Example Usage / Output
$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A
$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1
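One possible shape for the decode side of this transcript is sketched below. It handles only the move-immediate form, so everything else (including 0xFFFF) takes the error path. The function names and the use of a returned integer as the exit code are assumptions for illustration, not a required design.

```python
def decode_halfword(hw):
    """Decode one 16-bit value; only 'MOV Rd, #imm8' is supported in this sketch."""
    if not 0 <= hw <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    if (hw >> 11) == 0b00100:
        rd = (hw >> 8) & 0x7
        imm8 = hw & 0xFF
        # Keeps this guide's MOV spelling; unified syntax would print MOVS.
        return f"MOV r{rd}, #{imm8}"
    raise ValueError("unsupported or illegal opcode")


def main(argv):
    """Return a process exit code: 0 on success, 1 on any error."""
    try:
        hw = int(argv[0], 0)  # base 0 accepts 0x..., 0b..., and decimal input
        print(decode_halfword(hw))
        return 0
    except (IndexError, ValueError) as exc:
        print(f"error: {exc}")
        return 1


main(["0x202A"])  # prints: MOV r0, #42
main(["0xFFFF"])  # prints: error: unsupported or illegal opcode
```

In a real CLI the return value of `main` would be passed to `sys.exit` so that scripts and tests can observe the exit code.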
3.5 Data Formats / Schemas / Protocols
- Mnemonic input: opcode + operands
- Output: binary and hex
3.6 Edge Cases
- Immediate too large
- Invalid register index
3.7 Real World Outcome
This is the golden reference for success:
- You can manually validate machine code bytes against a reference table.
3.7.1 How to Run (Copy/Paste)
- Build: follow the toolchain steps defined in this guide
- Run: use the CLI examples in §3.4 with fixed inputs
- Expected directory: project root
3.7.2 Golden Path Demo (Deterministic)
Run with a fixed input set and confirm output matches §3.4 exactly.
3.7.3 Exact Terminal Transcript
$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A
$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Input Layer │───▶│ Core Logic │───▶│ Output Layer │
└──────────────┘ └──────────────┘ └──────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Input Parser | Validate and normalize input | Strict error handling |
| Core Engine | Perform the main computation | Deterministic paths |
| Reporter | Produce user-facing output | Stable formatting |
4.3 Data Structures (No Full Code)
Record Entry {
name: string
fields: list
notes: text
}
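One way the record above could be made concrete for this project is a table of encoding templates, each with a mask/match pair that identifies the instruction and a list of named bit fields. This is a sketch under that assumption; the class names, the `mask`/`match` attributes, and the two sample entries are illustrative additions, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    hi: int  # most significant bit position, inclusive
    lo: int  # least significant bit position, inclusive

    def extract(self, hw):
        return (hw >> self.lo) & ((1 << (self.hi - self.lo + 1)) - 1)


@dataclass(frozen=True)
class Template:
    name: str
    mask: int    # which bits identify the instruction
    match: int   # the value those bits must have
    fields: tuple
    notes: str = ""


TEMPLATES = (
    Template("MOV_imm8", 0xF800, 0x2000,
             (Field("rd", 10, 8), Field("imm8", 7, 0))),
    Template("ADD_imm3", 0xFE00, 0x1C00,
             (Field("imm3", 8, 6), Field("rn", 5, 3), Field("rd", 2, 0))),
)


def decode(hw):
    """Return (template name, field values) or None if no template matches."""
    for template in TEMPLATES:
        if hw & template.mask == template.match:
            return template.name, {f.name: f.extract(hw) for f in template.fields}
    return None


print(decode(0x202A))  # ('MOV_imm8', {'rd': 0, 'imm8': 42})
print(decode(0xFFFF))  # None
```

Keeping the table as data means the encoder, the decoder, and the tests can all iterate over the same source of truth.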
4.4 Algorithm Overview
Key Algorithm: Core Flow
- Parse input and validate parameters.
- Execute the core transformation or analysis.
- Emit deterministic output or error summary.
Complexity Analysis:
- Time: O(n) in the size of input records
- Space: O(n) for stored mappings and logs
5. Implementation Guide
5.1 Development Environment Setup
# Install toolchain and verify versions
toolchain --version
5.2 Project Structure
project-root/
├── src/
│ ├── core
│ └── io
├── tests/
│ └── fixtures
├── docs/
└── README.md
5.3 The Core Question You’re Answering
“Encode and decode a subset of Thumb instructions.”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Instruction Encoding
- What is the key invariant you must preserve?
5.5 Questions to Guide Your Design
- Data Flow
- How does input become output?
- Which steps must be deterministic?
- Validation
- What is the simplest test that proves correctness?
- How will you detect regressions?
5.6 Thinking Exercise
Trace the Critical Path
Write a step-by-step trace of the most important workflow in this project.
Questions to answer:
- Where could a subtle bug hide?
- What would you log to prove correctness?
5.7 The Interview Questions They’ll Ask
- “What is the core invariant this project relies on?”
- “How would you debug a failure in this workflow?”
- “What trade-offs did you make in design?”
- “How does this map to real hardware or toolchains?”
- “How do you prove your output is correct?”
5.8 Hints in Layers
Hint 1: Start small
Focus on the smallest input that still demonstrates the concept.
Hint 2: Make output deterministic
Fix inputs and produce stable logs before expanding functionality.
Hint 3: Validate against a known reference
Compare with a known-good output or specification.
Hint 4: Add instrumentation
Log internal steps so you can verify each phase explicitly.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Core concept | “ARM Assembly Language” by William Hohl | Ch. 3-5 |
| Binary formats | “Linkers and Loaders” by John R. Levine | Ch. 1-3 |
5.10 Implementation Phases
Phase 1: Foundation (2-4 hours)
Goals:
- Establish a minimal working pipeline
- Validate one end-to-end path
Tasks:
- Build the smallest viable input and output
- Verify outputs against a reference
Checkpoint: Output matches expected golden path
Phase 2: Core Functionality (4-8 hours)
Goals:
- Implement main logic and validation
- Add structured error handling
Tasks:
- Implement the core transformation
- Add deterministic reporting
Checkpoint: Core tests pass reliably
Phase 3: Polish & Edge Cases (2-4 hours)
Goals:
- Cover edge cases
- Improve output clarity
Tasks:
- Add negative tests
- Document limitations
Checkpoint: All edge cases handled gracefully
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Input format | Free-form vs structured | Structured | Easier validation |
| Output format | Human vs machine | Both | Supports verification and tooling |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate core logic | Field parsing, bounds checks |
| Integration Tests | Validate full flow | End-to-end CLI runs |
| Edge Case Tests | Validate boundaries | Immediate 255 vs 256, register r7 vs r8 |
6.2 Critical Test Cases
- Golden path: Fixed input produces known output.
- Invalid input: Error path triggers correct exit code.
- Boundary case: Maximum supported value handled correctly.
6.3 Test Data
Input: fixed seed or fixed fixture
Expected: exact output text from §3.4
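A fixture-driven check might look like the sketch below. It assumes a move-immediate encoder and a report formatter that reproduces the two output lines from §3.4; all names here are hypothetical, and the fixtures cover one golden case plus the boundary and rejection cases listed in §3.6.

```python
def encode_mov_imm8(rd, imm):
    """Encode 16-bit Thumb 'MOV Rd, #imm8': 00100 | Rd(3) | imm8(8)."""
    if not 0 <= rd <= 7:
        raise ValueError("invalid register index")
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate too large")
    return (0b00100 << 11) | (rd << 8) | imm


def format_report(value):
    """Render the binary and hex lines in the layout used by the transcript."""
    return (f"encoding: 0b{value >> 11:05b} {(value >> 8) & 0x7:03b} {value & 0xFF:08b}\n"
            f"hex: 0x{value:04X}")


GOLDEN = [((0, 42), 0x202A), ((0, 0), 0x2000), ((7, 255), 0x27FF)]
REJECTED = [(8, 0), (0, 256), (0, -1)]


def run_fixtures():
    """Run all fixtures; return the number of checks that passed."""
    passed = 0
    for args, expected in GOLDEN:
        assert encode_mov_imm8(*args) == expected, args
        passed += 1
    for args in REJECTED:
        try:
            encode_mov_imm8(*args)
        except ValueError:
            passed += 1
        else:
            raise AssertionError(f"{args} should have been rejected")
    return passed


print(format_report(encode_mov_imm8(0, 42)))
print(run_fixtures())  # 6
```

Because the inputs are fixed, the same six checks pass or fail identically on every run, which is what makes regressions visible.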
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Misaligned assumptions | Unexpected output | Re-check invariants |
| Missing validation | Silent failures | Add explicit checks |
| Non-determinism | Flaky output | Fix inputs and seeds |
7.2 Debugging Strategies
- Trace everything: Log each step with stable ordering
- Compare against reference: Use known-good outputs
7.3 Performance Traps
- Avoid repeated parsing of the same input; cache results when possible
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one extra output format
- Add a help screen with examples
8.2 Intermediate Extensions
- Add a verification mode that compares two outputs
- Add structured JSON output
8.3 Advanced Extensions
- Add a batch mode for large inputs
- Add cross-target comparisons (M vs A profile)
9. Real-World Connections
9.1 Industry Applications
- Firmware bring-up: use the same checks to validate early boot images
- Security audits: analyze binaries for ABI or control-flow correctness
9.2 Related Open Source Projects
- binutils: source of many ARM tooling workflows
- QEMU: emulator used for ARM testing
9.3 Interview Relevance
- Explains why ARM behavior differs across profiles
- Demonstrates toolchain literacy and debugging rigor
10. Resources
10.1 Essential Reading
- “ARM Assembly Language” by William Hohl - practical instruction usage
- “Linkers and Loaders” by John R. Levine - binary layout
10.2 Video Resources
- ARM architecture overview talks and lectures
10.3 Tools & Documentation
- GNU binutils documentation
- Arm developer documentation
10.4 Related Projects in This Series
- This project connects with: P01-toolchain-pipeline-explorer.md, P02-register-stack-visualizer.md, P04-mmio-memory-map-notebook.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core concept without notes
- I can explain why my design choices were necessary
- I can describe one realistic failure mode
11.2 Implementation
- All functional requirements are met
- Tests pass deterministically
- Edge cases are documented
11.3 Growth
- I can describe what I would improve next time
- I can explain this project in an interview
12. Submission / Completion Criteria
Minimum Viable Completion:
- Core functionality works on reference inputs
- Deterministic golden path is documented
- At least one failure path is demonstrated
Full Completion:
- All minimum criteria plus:
- Edge cases are covered with tests
- Output format is stable and documented
Excellence (Going Above & Beyond):
- Add a comparison against a second target
- Provide a short write-up of lessons learned