Project 3: Thumb Instruction Encoder/Decoder

Encode and decode a subset of Thumb instructions.

Quick Reference

Attribute                          Value
Difficulty                         Level 3
Time Estimate                      10-16 hours
Main Programming Language          Python or C
Alternative Programming Languages  Rust, Go
Coolness Level                     Level 4
Business Potential                 Level 2
Prerequisites                      Bitwise operations, Concept 3: Instruction Encoding
Key Topics                         bitfields, immediates, endianness

1. Learning Objectives

By completing this project, you will:

  1. Translate ARM concepts into observable outputs you can verify.
  2. Explain why each toolchain or hardware step is necessary.
  3. Detect and fix at least one realistic failure mode.
  4. Communicate the result clearly in a technical review or interview.

2. All Theory Needed (Per-Concept Breakdown)

Instruction Encoding

Fundamentals

Every assembly instruction is encoded into bits. Encoding determines which registers and immediates are accessible, how large constants can be, and which addressing modes are legal. Thumb encodings prioritize compactness and energy efficiency for microcontrollers, while AArch64 uses fixed 32-bit instruction widths to simplify decode and improve pipeline predictability. Understanding encoding explains why some instructions are missing or require multi-instruction sequences and why certain address calculations must be split into steps.

Deep Dive

Instruction encoding is the “physics” of assembly. A mnemonic like MOV is a human label for a specific bit pattern; if that pattern cannot fit your operands, the assembler will either refuse or emit a different instruction sequence. Thumb uses a mix of 16-bit and 32-bit encodings (Thumb-2), so the fields in the 16-bit forms are small: most of them carry 3-bit register fields that reach only r0-r7. This is why Cortex-M code favors the low registers and why large constants must be loaded via literal pools or multi-step sequences. AArch64, in contrast, uses 32-bit instructions exclusively, providing more predictable decoding and a richer register space. The trade-off is code density versus decode simplicity.
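One way to see the mixed 16/32-bit scheme in practice is to classify a halfword by its top five bits. The sketch below assumes the standard Thumb rule that 0b11101, 0b11110 and 0b11111 in bits [15:11] mark the first half of a 32-bit encoding; the function name `thumb_instruction_length` is hypothetical and not part of any toolchain.

```python
def thumb_instruction_length(first_halfword: int) -> int:
    """Return the instruction size in bytes (2 or 4) implied by the first halfword."""
    if not 0 <= first_halfword <= 0xFFFF:
        raise ValueError("halfword must fit in 16 bits")
    top5 = first_halfword >> 11
    # These three prefixes introduce a 32-bit encoding; everything else is 16-bit.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


print(thumb_instruction_length(0x202A))  # 16-bit MOV r0, #42
print(thumb_instruction_length(0xF8D1))  # first halfword of a 32-bit LDR with Rn = r1
```

A decoder limited to 16-bit encodings can use a check like this to reject values such as 0xFFFF early instead of misreading them.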

Addressing modes compound encoding constraints. Load/store architectures like ARM separate arithmetic from memory access: you compute addresses in registers, then load or store. But the address computation itself is limited by encoding fields. For example, immediate offsets might be limited to a certain bit width or require alignment. When you understand the bit fields, you can predict when the assembler will need to generate extra instructions, which in turn affects performance and size. This is especially important in microcontrollers where code size is constrained and instruction fetches may come from slow flash.
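As a concrete instance of an offset limited by both width and alignment, the following sketch encodes the 16-bit form of LDR Rt, [Rn, #offset], whose layout is 01101 imm5 Rn Rt with the offset stored as imm5 = offset / 4. The function name `encode_ldr_imm_t1` and the error wording are assumptions made for illustration.

```python
def encode_ldr_imm_t1(rt: int, rn: int, offset: int) -> int:
    """Encode LDR Rt, [Rn, #offset] in the 16-bit form 01101 imm5 Rn Rt."""
    if not (0 <= rt <= 7 and 0 <= rn <= 7):
        raise ValueError("16-bit LDR can only name r0-r7")
    if offset % 4 != 0:
        raise ValueError("offset must be a multiple of 4")
    if not 0 <= offset <= 124:
        raise ValueError("offset must be in 0..124 for the 16-bit form")
    imm5 = offset // 4  # the field stores the offset in words, not bytes
    return (0b01101 << 11) | (imm5 << 6) | (rn << 3) | rt


print(hex(encode_ldr_imm_t1(0, 1, 4)))  # LDR r0, [r1, #4]
```

An offset of 128 or 6 raises an error here; a real assembler would instead switch to a wider encoding if the target supports one, or report the operand as out of range.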

Encoding also interacts with endianness and instruction alignment. Thumb instructions must be aligned to 2 bytes, while A32 and A64 instructions must be aligned to 4 bytes. On a little-endian target each Thumb halfword is stored low byte first, so 0x202A appears in memory as 2A 20. Misalignment results in faults or unintended behavior. The assembler handles alignment for you, but if you build a binary layout manually (for example in a boot image), you must respect alignment rules to avoid hard-to-debug startup failures. This is why toolchain awareness is an essential complement to ISA knowledge.
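To make the byte order concrete, here is a minimal sketch that serializes Thumb halfwords for a little-endian target using the standard `struct` module. The helper name `halfwords_to_bytes` is hypothetical; the second call assumes a 32-bit LDR r0, [r1, #128] whose two halfwords are 0xF8D1 and 0x0080.

```python
import struct


def halfwords_to_bytes(*halfwords: int) -> bytes:
    """Serialize Thumb halfwords in memory order for a little-endian target."""
    # "<H" packs one unsigned 16-bit value, low byte first.
    return b"".join(struct.pack("<H", hw) for hw in halfwords)


print(halfwords_to_bytes(0x202A).hex(" "))          # 16-bit MOV r0, #42
print(halfwords_to_bytes(0xF8D1, 0x0080).hex(" "))  # 32-bit encoding: two halfwords in order
```

Note that a 32-bit Thumb encoding is written as two little-endian halfwords in sequence, not as one little-endian 32-bit word.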

Finally, encoding shapes the patterns you see in disassembly. A sequence of machine code bytes can decode to different instructions depending on execution state. This is a common pitfall in reverse engineering: decoding AArch64 bytes as Thumb or ARM32 yields nonsense. Knowing the encoding class prevents misinterpretation and helps you verify that your build pipeline is producing the ISA you intended.

How this fits on projects

  • Core to P03 (Thumb Instruction Encoder/Decoder) and P01 (Toolchain Pipeline Explorer).

Definitions & key terms

  • Encoding: Bit-level representation of an instruction.
  • Addressing mode: How an instruction specifies the location of its operands.
  • Thumb/Thumb-2: Compact 16-bit and mixed 16/32-bit ARM instruction encodings; the only instruction set that M-profile cores execute.
  • A64: 32-bit fixed-length instruction encoding for AArch64.

Mental model diagram

Thumb Instruction Encoding Examples:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

16-bit Thumb instruction: MOV r0, #42
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 0│ 1│ 0│ 0│ Rd  │     imm8 (immediate)    │
│ 0│ 0│ 1│ 0│ 0│0│0│0│ 0│ 0│ 1│ 0│ 1│ 0│ 1│ 0│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
              │r0 │         42 = 0x2A
              │   │
              ▼   ▼
          Encodes to: 0x202A (little-endian: 2A 20)


32-bit Thumb-2 instruction: LDR r0, [r1, #offset]  (when offset > 124, the 16-bit limit)
┌──────────────────────────────────────────────────────────────────┐
│ First halfword (16 bits)  │  Second halfword (16 bits)           │
│   encoding prefix + Rn    │     Rt + imm12 offset                │
└──────────────────────────────────────────────────────────────────┘


Common Thumb Instructions You'll Use:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Data Movement:
  MOV  Rd, #imm8        Move 8-bit immediate to register
  MOV  Rd, Rm           Move register to register
  LDR  Rt, [Rn, #off]   Load word from memory
  STR  Rt, [Rn, #off]   Store word to memory
  PUSH {reglist}        Push registers to stack
  POP  {reglist}        Pop registers from stack

Arithmetic:
  ADD  Rd, Rn, #imm3    Add 3-bit immediate
  ADD  Rd, #imm8        Add 8-bit immediate to Rd
  SUB  Rd, Rn, #imm3    Subtract 3-bit immediate
  SUBS Rd, Rn, Rm       Subtract with flags update

Logic:
  AND  Rd, Rm           Bitwise AND
  ORR  Rd, Rm           Bitwise OR
  EOR  Rd, Rm           Bitwise XOR (exclusive OR)
  LSL  Rd, Rm, #imm5    Logical shift left
  LSR  Rd, Rm, #imm5    Logical shift right

Control Flow:
  B    label            Unconditional branch
  BEQ  label            Branch if equal (Z=1)
  BNE  label            Branch if not equal (Z=0)
  BL   function         Branch with link (function call)
  BX   Rm               Branch to address in register
  BLX  Rm               Branch with link to address in register


LIMITATION: Cortex-M0+ is MISSING many instructions!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ No hardware divide (UDIV, SDIV)     → Must use software division
✗ No bit-field instructions (BFI)     → Must use shift/mask sequences
✗ No conditional execution (IT block) → Must use branches
✗ Limited addressing modes            → Can't do [Rn, Rm, LSL #2]
✗ No saturation arithmetic            → Must check overflow manually

Thumb Instruction Encoding
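The MOV r0, #42 diagram above can be reproduced in a few lines. This is a minimal sketch assuming the 00100 Rd imm8 layout shown in the diagram; `encode_mov_imm8` is a hypothetical helper name.

```python
def encode_mov_imm8(rd: int, imm8: int) -> int:
    """Encode MOV Rd, #imm8 as 00100 Rd(3) imm8(8)."""
    if not 0 <= rd <= 7:
        raise ValueError("Rd must be r0-r7 for this encoding")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must be in 0..255")
    return (0b00100 << 11) | (rd << 8) | imm8


word = encode_mov_imm8(0, 42)
# Print the three fields separately, then the packed halfword.
print(f"0b{word >> 11:05b} {(word >> 8) & 7:03b} {word & 0xFF:08b}")
print(f"0x{word:04X}")
```

The two printed lines correspond to the opcode, Rd and imm8 fields and to the 0x202A value in the diagram.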

How it works (step-by-step, with invariants and failure modes)

  1. The assembler maps mnemonics to encoding templates for the target ISA.
  2. Register and immediate fields are packed into fixed bit positions.
  3. If a value doesn’t fit, the assembler emits a sequence or errors out.
  4. Failure mode: decoding with the wrong ISA yields invalid instructions or faults.

Minimal concrete example (pseudo, not runnable)

ENCODE(op=ADD, rd=R0, rn=R1, imm=5)
→ [opcode bits][imm bits][rn bits][rd bits]
→ 0001110 101 001 000 = 0x1D48
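A runnable counterpart to the pseudo example might look like the sketch below. It assumes the 16-bit ADD Rd, Rn, #imm3 layout 0001110 imm3 Rn Rd; the name `encode_add_imm3` is hypothetical.

```python
def encode_add_imm3(rd: int, rn: int, imm3: int) -> int:
    """Encode ADD Rd, Rn, #imm3 as 0001110 imm3(3) Rn(3) Rd(3)."""
    for name, value in (("rd", rd), ("rn", rn), ("imm3", imm3)):
        # Every operand of this form lives in a 3-bit field.
        if not 0 <= value <= 7:
            raise ValueError(f"{name}={value} does not fit in 3 bits")
    return (0b0001110 << 9) | (imm3 << 6) | (rn << 3) | rd


print(f"0x{encode_add_imm3(rd=0, rn=1, imm3=5):04X}")
```

Because all three operands share the same 3-bit width, one range check covers both the register fields and the immediate.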

Common misconceptions

  • “Assembler will always accept my operands” → Encoding limits still apply.
  • “Instruction length doesn’t matter” → It affects alignment and memory layout.

Check-your-understanding questions

  1. Why does Thumb use shorter encodings than AArch64?
  2. What happens if an immediate is too large for its field?
  3. Why can decoding with the wrong execution state break disassembly?

Check-your-understanding answers

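  1. Thumb optimizes for code density: smaller instructions reduce flash footprint and fetch bandwidth on M-profile parts. The cost is a mixed 16/32-bit decode, whereas AArch64's fixed 32-bit width keeps decode simple.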
  2. The assembler emits a sequence or reports an error because it cannot fit the value.
  3. The same bytes map to different instruction sets depending on state, so decoding mismatches yield nonsense.

Real-world applications

  • Building encoders/decoders for tooling and reverse engineering.
  • Size-sensitive firmware builds for microcontrollers.

Where you’ll apply it

  • This project: see §3.1 and §5.4 in P03-thumb-encoder-decoder.md
  • P03 Thumb Instruction Encoder/Decoder
  • P01 Toolchain Pipeline Explorer

References

  • Arm A-profile overview (AArch64, instruction model).

Key insights

Encoding constraints explain most “mysterious” assembler errors.

Summary

Instruction encodings are the boundary between human mnemonics and machine reality; mastering them unlocks predictability.

Homework/Exercises to practice the concept

  1. Choose a Thumb instruction and manually identify which bits encode the register fields.
  2. Explain why a large constant may require multiple instructions.

Solutions to the homework/exercises

  1. The register fields are fixed bit slices in the instruction encoding; their size limits which registers are directly addressable.
  2. If the immediate field is too small, the assembler must build the constant through multiple steps.
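For the first exercise, a small bit-slice helper makes the field boundaries explicit. The sketch below is illustrative only: `bits` is a hypothetical helper, and the example value 0x1D48 assumes the 16-bit ADD Rd, Rn, #imm3 layout (Rd in bits 2:0, Rn in bits 5:3, imm3 in bits 8:6).

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value[hi:lo] (inclusive bit positions) as an integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


insn = 0x1D48  # ADD r0, r1, #5
rd, rn, imm3 = bits(insn, 2, 0), bits(insn, 5, 3), bits(insn, 8, 6)
print(f"Rd=r{rd} Rn=r{rn} imm3={imm3}")
```

A 3-bit slice can only hold values 0 through 7, which is the reason these forms cannot name r8 or higher.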

3. Project Specification

3.1 What You Will Build

A CLI that maps between mnemonics and 16-bit encodings for a chosen subset.

3.2 Functional Requirements

  1. Encode at least 6 Thumb instructions.
  2. Decode 16-bit values into mnemonics.
  3. Validate immediate ranges and register fields.

3.3 Non-Functional Requirements

  • Clear error messages for invalid encodings

3.4 Example Usage / Output

$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A

$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1
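One possible shape for the decode side of the CLI is sketched below with `argparse`. It is an assumption-laden illustration, not the reference implementation: it recognizes only MOV Rd, #imm8, the success output format is invented here, and `main` takes an explicit argument list so the two transcript cases can be exercised directly.

```python
import argparse
import sys


def decode(halfword: int) -> str:
    """Decode one 16-bit value; only MOV Rd, #imm8 is recognized in this sketch."""
    if halfword >> 11 == 0b00100:
        return f"MOV r{(halfword >> 8) & 0x7}, #{halfword & 0xFF}"
    raise ValueError("unsupported or illegal opcode")


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="thumb-decode")
    parser.add_argument("value", help="16-bit encoding, e.g. 0x202A")
    args = parser.parse_args(argv)
    try:
        halfword = int(args.value, 0)  # base 0 accepts 0x..., 0b... and decimal
        if not 0 <= halfword <= 0xFFFF:
            raise ValueError("value does not fit in 16 bits")
        print(decode(halfword))
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0


# A real script would end with: raise SystemExit(main())
print("exit code:", main(["0x202A"]))
print("exit code:", main(["0xFFFF"]))
```

Returning the exit code from `main` rather than calling `sys.exit` inside it keeps the function easy to test.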

3.5 Data Formats / Schemas / Protocols

  • Mnemonic input: opcode + operands
  • Output: binary and hex

3.6 Edge Cases

  • Immediate too large
  • Invalid register index
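Both edge cases can be caught while parsing operands, before any bits are packed. The following sketch assumes operands written as `r0`-`r7` and `#value`; the function names and error messages are hypothetical.

```python
import re


def parse_low_register(text: str) -> int:
    """Parse 'r0'..'r7'; reject anything a 3-bit register field cannot hold."""
    match = re.fullmatch(r"[rR](\d+)", text.strip())
    if match is None:
        raise ValueError(f"invalid register {text!r}: expected r0-r7")
    index = int(match.group(1))
    if index > 7:
        raise ValueError(f"invalid register {text!r}: this encoding only reaches r0-r7")
    return index


def parse_immediate(text: str, width: int) -> int:
    """Parse '#value' and check that it fits an unsigned field of `width` bits."""
    stripped = text.strip()
    if not stripped.startswith("#"):
        raise ValueError(f"invalid immediate {text!r}: expected a leading '#'")
    value = int(stripped[1:], 0)  # raises ValueError on malformed numbers
    limit = (1 << width) - 1
    if not 0 <= value <= limit:
        raise ValueError(f"immediate {value} out of range 0..{limit} for imm{width}")
    return value


print(parse_low_register("r0"), parse_immediate("#42", 8))
```

Naming the allowed range in the message supports the requirement for clear error messages in §3.3.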

3.7 Real World Outcome

This is the golden reference for success:

  • You can manually validate machine code bytes against a reference table.

3.7.1 How to Run (Copy/Paste)

  • Build: follow the toolchain steps defined in this guide
  • Run: use the CLI examples in §3.4 with fixed inputs
  • Expected directory: project root

3.7.2 Golden Path Demo (Deterministic)

Run with a fixed input set and confirm output matches §3.4 exactly.

3.7.3 Exact Terminal Transcript

$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A

$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Input Layer  │───▶│ Core Logic   │───▶│ Output Layer │
└──────────────┘     └──────────────┘     └──────────────┘

4.2 Key Components

Component Responsibility Key Decisions
Input Parser Validate and normalize input Strict error handling
Core Engine Perform the main computation Deterministic paths
Reporter Produce user-facing output Stable formatting

4.3 Data Structures (No Full Code)

Record Entry {
  name: string
  fields: list
  notes: text
}
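One way to make the record above concrete is a pair of dataclasses, where `fields` becomes a tuple of bit-slice descriptions. Everything beyond `name`, `fields` and `notes` (the `mask` and `match` values, the `Field` type, the `pack` and `unpack` methods) is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Field:
    name: str
    lo: int     # position of the least-significant bit
    width: int  # number of bits


@dataclass(frozen=True)
class Encoding:
    name: str
    mask: int   # bits that identify the instruction
    match: int  # required value of those bits
    fields: Tuple[Field, ...]
    notes: str = ""

    def pack(self, **values: int) -> int:
        word = self.match
        for f in self.fields:
            value = values[f.name]
            if not 0 <= value < (1 << f.width):
                raise ValueError(f"{f.name}={value} does not fit in {f.width} bits")
            word |= value << f.lo
        return word

    def unpack(self, word: int) -> Dict[str, int]:
        if word & self.mask != self.match:
            raise ValueError(f"0x{word:04X} is not a {self.name} encoding")
        return {f.name: (word >> f.lo) & ((1 << f.width) - 1) for f in self.fields}


MOV_IMM8 = Encoding("MOV", 0xF800, 0x2000,
                    (Field("rd", 8, 3), Field("imm8", 0, 8)),
                    notes="16-bit move immediate")
print(hex(MOV_IMM8.pack(rd=0, imm8=42)), MOV_IMM8.unpack(0x202A))
```

Describing each instruction as data means the encoder and decoder share one definition, so they cannot drift apart.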

4.4 Algorithm Overview

Key Algorithm: Core Flow

  1. Parse input and validate parameters.
  2. Execute the core transformation or analysis.
  3. Emit deterministic output or error summary.

Complexity Analysis:

  • Time: O(n) in the size of input records
  • Space: O(n) for stored mappings and logs
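The core flow above can be sketched as a table-driven decoder: each entry holds a mask, a match value and a formatter, and decoding is a linear scan over the table. The three entries and the output format are assumptions chosen for illustration; a real subset would list every supported encoding.

```python
# (mask, match, formatter) triples; the first matching entry wins.
TABLE = [
    (0xF800, 0x2000, lambda h: f"MOV r{(h >> 8) & 7}, #{h & 0xFF}"),
    (0xFE00, 0x1C00, lambda h: f"ADD r{h & 7}, r{(h >> 3) & 7}, #{(h >> 6) & 7}"),
    (0xF800, 0x6800,
     lambda h: f"LDR r{h & 7}, [r{(h >> 3) & 7}, #{((h >> 6) & 0x1F) * 4}]"),
]


def decode(halfword: int) -> str:
    # Step 1: validate the input.
    if not 0 <= halfword <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    # Step 2: find the first template whose fixed bits match.
    for mask, match, fmt in TABLE:
        if halfword & mask == match:
            # Step 3: emit deterministic text.
            return fmt(halfword)
    raise ValueError("unsupported or illegal opcode")


for value in (0x202A, 0x1D48, 0x6848):
    print(f"0x{value:04X} -> {decode(value)}")
```

With only a handful of entries a linear scan is sufficient; the table order matters only if two masks could match the same value.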

5. Implementation Guide

5.1 Development Environment Setup

# Verify the tools for your chosen language (Python or C)
python3 --version
cc --version

5.2 Project Structure

project-root/
├── src/
│   ├── core
│   └── io
├── tests/
│   └── fixtures
├── docs/
└── README.md

5.3 The Core Question You’re Answering

“Encode and decode a subset of Thumb instructions.”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Instruction Encoding
    • What is the key invariant you must preserve?

5.5 Questions to Guide Your Design

  1. Data Flow
    • How does input become output?
    • Which steps must be deterministic?
  2. Validation
    • What is the simplest test that proves correctness?
    • How will you detect regressions?

5.6 Thinking Exercise

Trace the Critical Path

Write a step-by-step trace of the most important workflow in this project.

Questions to answer:

  • Where could a subtle bug hide?
  • What would you log to prove correctness?

5.7 The Interview Questions They’ll Ask

  1. “What is the core invariant this project relies on?”
  2. “How would you debug a failure in this workflow?”
  3. “What trade-offs did you make in design?”
  4. “How does this map to real hardware or toolchains?”
  5. “How do you prove your output is correct?”

5.8 Hints in Layers

Hint 1: Start small
Focus on the smallest input that still demonstrates the concept.

Hint 2: Make output deterministic
Fix inputs and produce stable logs before expanding functionality.

Hint 3: Validate against a known reference
Compare with a known-good output or specification.

Hint 4: Add instrumentation
Log internal steps so you can verify each phase explicitly.

5.9 Books That Will Help

Topic           Book                                     Chapter
Core concept    “ARM Assembly Language” by William Hohl  Ch. 3-5
Binary formats  “Linkers and Loaders” by John R. Levine  Ch. 1-3

5.10 Implementation Phases

Phase 1: Foundation (2-4 hours)

Goals:

  • Establish a minimal working pipeline
  • Validate one end-to-end path

Tasks:

  1. Build the smallest viable input and output
  2. Verify outputs against a reference

Checkpoint: Output matches expected golden path

Phase 2: Core Functionality (4-8 hours)

Goals:

  • Implement main logic and validation
  • Add structured error handling

Tasks:

  1. Implement the core transformation
  2. Add deterministic reporting

Checkpoint: Core tests pass reliably

Phase 3: Polish & Edge Cases (2-4 hours)

Goals:

  • Cover edge cases
  • Improve output clarity

Tasks:

  1. Add negative tests
  2. Document limitations

Checkpoint: All edge cases handled gracefully

5.11 Key Implementation Decisions

Decision       Options                  Recommendation  Rationale
Input format   Free-form vs structured  Structured      Easier validation
Output format  Human vs machine         Both            Supports verification and tooling

6. Testing Strategy

6.1 Test Categories

Category           Purpose              Examples
Unit Tests         Validate core logic  Field parsing, bounds checks
Integration Tests  Validate full flow   End-to-end CLI runs
Edge Case Tests    Validate boundaries  Empty input, invalid flags

6.2 Critical Test Cases

  1. Golden path: Fixed input produces known output.
  2. Invalid input: Error path triggers correct exit code.
  3. Boundary case: Maximum supported value handled correctly.
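The three cases above, plus a round-trip check, might be written as plain assertion functions like the sketch below. It is self-contained for illustration: the `encode_mov_imm8` and `decode_mov_imm8` helpers are hypothetical stand-ins for your own encoder and decoder.

```python
def encode_mov_imm8(rd: int, imm8: int) -> int:
    if not 0 <= rd <= 7 or not 0 <= imm8 <= 0xFF:
        raise ValueError("operand out of range")
    return (0b00100 << 11) | (rd << 8) | imm8


def decode_mov_imm8(halfword: int) -> tuple:
    if halfword >> 11 != 0b00100:
        raise ValueError("unsupported or illegal opcode")
    return (halfword >> 8) & 7, halfword & 0xFF


def test_golden_path():
    assert encode_mov_imm8(0, 42) == 0x202A


def test_invalid_input():
    try:
        decode_mov_imm8(0xFFFF)
    except ValueError:
        return
    raise AssertionError("0xFFFF should be rejected")


def test_boundary():
    assert encode_mov_imm8(7, 255) == 0x27FF


def test_round_trip():
    # The operand space is small enough to check exhaustively.
    for rd in range(8):
        for imm8 in range(256):
            assert decode_mov_imm8(encode_mov_imm8(rd, imm8)) == (rd, imm8)


for test in (test_golden_path, test_invalid_input, test_boundary, test_round_trip):
    test()
    print(f"{test.__name__}: ok")
```

Exhaustive round-trip testing is practical here because a 16-bit encoding space has at most 65,536 values.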

6.3 Test Data

Input: fixed seed or fixed fixture
Expected: exact output text from §3.4

7. Common Pitfalls & Debugging

7.1 Common Pitfalls

Pitfall                 Symptom            Solution
Misaligned assumptions  Unexpected output  Re-check invariants
Missing validation      Silent failures    Add explicit checks
Non-determinism         Flaky output       Fix inputs and seeds

7.2 Debugging Strategies

  • Trace everything: Log each step with stable ordering
  • Compare against reference: Use known-good outputs

7.3 Performance Traps

  • Avoid repeated parsing of the same input; cache results when possible

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add one extra output format
  • Add a help screen with examples

8.2 Intermediate Extensions

  • Add a verification mode that compares two outputs
  • Add structured JSON output

8.3 Advanced Extensions

  • Add a batch mode for large inputs
  • Add cross-target comparisons (M vs A profile)

9. Real-World Connections

9.1 Industry Applications

  • Firmware bring-up: use the same checks to validate early boot images
  • Security audits: analyze binaries for ABI or control-flow correctness

9.2 Related Open Source Projects

  • binutils: source of many ARM tooling workflows
  • QEMU: emulator used for ARM testing

9.3 Interview Relevance

  • Explains why ARM behavior differs across profiles
  • Demonstrates toolchain literacy and debugging rigor

10. Resources

10.1 Essential Reading

  • “ARM Assembly Language” by William Hohl - practical instruction usage
  • “Linkers and Loaders” by John R. Levine - binary layout

10.2 Video Resources

  • ARM architecture overview talks and lectures

10.3 Tools & Documentation

  • GNU binutils documentation
  • Arm developer documentation
  • This project connects with: P01-toolchain-pipeline-explorer.md, P02-register-stack-visualizer.md, P04-mmio-memory-map-notebook.md

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the core concept without notes
  • I can explain why my design choices were necessary
  • I can describe one realistic failure mode

11.2 Implementation

  • All functional requirements are met
  • Tests pass deterministically
  • Edge cases are documented

11.3 Growth

  • I can describe what I would improve next time
  • I can explain this project in an interview

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Core functionality works on reference inputs
  • Deterministic golden path is documented
  • At least one failure path is demonstrated

Full Completion:

  • All minimum criteria plus:
  • Edge cases are covered with tests
  • Output format is stable and documented

Excellence (Going Above & Beyond):

  • Add a comparison against a second target
  • Provide a short write-up of lessons learned