Project 3: Thumb Instruction Encoder/Decoder

Encode and decode a subset of Thumb instructions.

Quick Reference

Attribute	Value
Difficulty	Level 3
Time Estimate	10-16 hours
Main Programming Language	Python or C
Alternative Programming Languages	Rust, Go
Coolness Level	Level 4
Business Potential	Level 2
Prerequisites	Bitwise operations, Concept 3: Instruction Encoding
Key Topics	bitfields, immediates, endianness

1. Learning Objectives

By completing this project, you will:

Translate ARM concepts into observable outputs you can verify.
Explain why each toolchain or hardware step is necessary.
Detect and fix at least one realistic failure mode.
Communicate the result clearly in a technical review or interview.

2. All Theory Needed (Per-Concept Breakdown)

Instruction Encoding

Fundamentals: Every assembly instruction is encoded into bits. Encoding determines which registers and immediates are accessible, how large constants can be, and which addressing modes are legal. Thumb encodings prioritize compactness and energy efficiency for microcontrollers, while AArch64 uses fixed 32-bit instruction widths to simplify decode and improve pipeline predictability. Understanding encoding explains why some instructions are missing or require multi-instruction sequences and why certain address calculations must be split into steps.

Deep Dive: Instruction encoding is the “physics” of assembly. A mnemonic like MOV is a human label for a specific bit pattern; if that pattern cannot fit your operands, the assembler will either refuse or emit a different instruction sequence. Thumb uses a mix of 16-bit and 32-bit encodings (Thumb-2), and in the 16-bit forms the register and immediate fields are small: most register fields are 3 bits wide and reach only r0-r7. This is why Cortex-M code favors the low registers and why large constants must be loaded via literal pools or multi-step sequences. AArch64, in contrast, uses 32-bit instructions exclusively, providing more predictable decoding and a richer register space. The trade-off is code density versus decode simplicity.

Addressing modes compound encoding constraints. Load/store architectures like ARM separate arithmetic from memory access: you compute addresses in registers, then load or store. But the address computation itself is limited by encoding fields. For example, immediate offsets might be limited to a certain bit width or require alignment. When you understand the bit fields, you can predict when the assembler will need to generate extra instructions, which in turn affects performance and size. This is especially important in microcontrollers where code size is constrained and instruction fetches may come from slow flash.
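As a concrete instance of an offset limited by its field, here is a minimal Python sketch of the 16-bit Thumb word load LDR Rt, [Rn, #offset], whose 5-bit immediate is scaled by 4. The function name `encode_ldr_imm` is hypothetical and chosen only for illustration.

```python
def encode_ldr_imm(rt: int, rn: int, offset: int) -> int:
    """Encode 16-bit Thumb LDR Rt, [Rn, #offset] (word load, imm5 scaled by 4)."""
    if not (0 <= rt <= 7 and 0 <= rn <= 7):
        raise ValueError("16-bit LDR only reaches low registers r0-r7")
    if offset % 4 != 0:
        raise ValueError("word offset must be a multiple of 4")
    if not 0 <= offset <= 124:
        raise ValueError("offset must be in 0..124 for the 16-bit form")
    imm5 = offset // 4  # the field stores the offset in words, not bytes
    # Layout, most significant bit first: 01101 | imm5 | Rn | Rt
    return (0b01101 << 11) | (imm5 << 6) | (rn << 3) | rt


print(hex(encode_ldr_imm(0, 1, 4)))  # 0x6848
```

An offset of 128, or one that is not word-aligned, cannot be expressed in this form, which is exactly the case where an assembler has to choose a different encoding or an extra instruction.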

Encoding also interacts with endianness and instruction alignment. Many ARM cores require instructions to be aligned to 2 or 4 bytes depending on the ISA. Misalignment results in faults or unintended behavior. The assembler handles alignment for you, but if you build a binary layout manually (for example in a boot image), you must respect alignment rules to avoid hard-to-debug startup failures. This is why toolchain awareness is an essential complement to ISA knowledge.
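The byte order and alignment rules can be sketched in a few lines of Python. This assumes the little-endian halfword layout used in this guide's MOV example (0x202A stored as 2A 20); the helper names are hypothetical.

```python
def halfword_to_bytes(hw: int) -> bytes:
    """Serialize one 16-bit Thumb halfword in little-endian order."""
    return hw.to_bytes(2, "little")


def bytes_to_halfword(data: bytes, offset: int = 0) -> int:
    """Read one halfword back, rejecting misaligned or truncated reads."""
    if offset % 2 != 0:
        raise ValueError("Thumb instructions are halfword-aligned")
    if offset + 2 > len(data):
        raise ValueError("not enough bytes for a halfword")
    return int.from_bytes(data[offset:offset + 2], "little")


print(halfword_to_bytes(0x202A).hex())          # 2a20
print(hex(bytes_to_halfword(b"\x2a\x20")))      # 0x202a
```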

Finally, encoding shapes the patterns you see in disassembly. A sequence of machine code bytes can decode to different instructions depending on execution state. This is a common pitfall in reverse engineering: decoding AArch64 bytes as Thumb or ARM32 yields nonsense. Knowing the encoding class prevents misinterpretation and helps you verify that your build pipeline is producing the ISA you intended.
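The same risk exists inside Thumb itself: a decoder must decide whether a halfword is a complete 16-bit instruction or the first half of a 32-bit one before interpreting any fields. The sketch below uses the Thumb rule that a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit instruction. The function names are hypothetical.

```python
def thumb_length(first_halfword: int) -> int:
    """Return the byte length (2 or 4) of the Thumb instruction starting here."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a list of halfwords into tuples, one tuple per instruction."""
    out, i = [], 0
    while i < len(halfwords):
        n = thumb_length(halfwords[i]) // 2
        if i + n > len(halfwords):
            raise ValueError("truncated 32-bit instruction")
        out.append(tuple(halfwords[i:i + n]))
        i += n
    return out


stream = [0x202A, 0xF8D1, 0x0080, 0xE7FE]
print([len(ins) for ins in split_instructions(stream)])  # [1, 2, 1]
```

A decoder limited to 16-bit encodings can use `thumb_length` to report a 32-bit prefix as unsupported instead of misreading it.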

How this fits on projects

Core to P03 (Thumb Instruction Encoder/Decoder) and P01 (Toolchain Pipeline Explorer).

Definitions & key terms

Encoding: Bit-level representation of an instruction.
Addressing mode: How an instruction specifies the location of its operands.
Thumb/Thumb-2: Compact ARM instruction encodings (16-bit, plus 32-bit forms in Thumb-2); the only instruction set M-profile cores execute.
A64: 32-bit fixed-length instruction encoding for AArch64.

Mental model diagram

Thumb Instruction Encoding Examples:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

16-bit Thumb instruction: MOV r0, #42
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 0│ 1│ 0│ 0│ Rd  │     imm8 (immediate)    │
│ 0│ 0│ 1│ 0│ 0│0│0│0│ 0│ 0│ 1│ 0│ 1│ 0│ 1│ 0│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
              │r0 │         42 = 0x2A
              │   │
              ▼   ▼
          Encodes to: 0x202A (little-endian: 2A 20)


32-bit Thumb-2 instruction: LDR r0, [r1, #offset]  (when the offset does not fit the 16-bit form: above 124 bytes or not a multiple of 4)
┌──────────────────────────────────────────────────────────────────┐
│ First halfword (16 bits)  │  Second halfword (16 bits)           │
│   encoding prefix + Rn    │     Rt + imm12 offset                │
└──────────────────────────────────────────────────────────────────┘


Common Thumb Instructions You'll Use:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Data Movement:
  MOV  Rd, #imm8        Move 8-bit immediate to register
  MOV  Rd, Rm           Move register to register
  LDR  Rt, [Rn, #off]   Load word from memory
  STR  Rt, [Rn, #off]   Store word to memory
  PUSH {reglist}        Push registers to stack
  POP  {reglist}        Pop registers from stack

Arithmetic:
  ADD  Rd, Rn, #imm3    Add 3-bit immediate
  ADD  Rd, #imm8        Add 8-bit immediate to Rd
  SUB  Rd, Rn, #imm3    Subtract 3-bit immediate
  SUBS Rd, Rn, Rm       Subtract with flags update

Logic:
  AND  Rd, Rm           Bitwise AND
  ORR  Rd, Rm           Bitwise OR
  EOR  Rd, Rm           Bitwise XOR (exclusive OR)
  LSL  Rd, Rm, #imm5    Logical shift left
  LSR  Rd, Rm, #imm5    Logical shift right

Control Flow:
  B    label            Unconditional branch
  BEQ  label            Branch if equal (Z=1)
  BNE  label            Branch if not equal (Z=0)
  BL   function         Branch with link (function call)
  BX   Rm               Branch to address in register
  BLX  Rm               Branch with link to address in register


LIMITATION: Cortex-M0+ is MISSING many instructions!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ No hardware divide (UDIV, SDIV)     → Must use software division
✗ No bit-field instructions (BFI)     → Must use shift/mask sequences
✗ No conditional execution (IT block) → Must use branches
✗ Limited addressing modes            → Can't do [Rn, Rm, LSL #2]
✗ No saturation arithmetic            → Must check overflow manually

How it works (step-by-step, with invariants and failure modes)

The assembler maps mnemonics to encoding templates for the target ISA.
Register and immediate fields are packed into fixed bit positions.
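If a value doesn’t fit, the assembler picks a wider encoding if the target has one, expands a pseudo-instruction such as LDR Rd, =const into a literal-pool load, or errors out.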
Failure mode: decoding with the wrong ISA yields invalid instructions or faults.
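The packing step and its invariants can be sketched as a small Python helper. It assumes every template is described as (value, width) pairs listed from the most significant field down; `pack_fields` is a hypothetical name, not part of any toolchain.

```python
def pack_fields(fields):
    """Pack (value, width) pairs, most significant field first, into 16 bits."""
    word = 0
    total = 0
    for value, width in fields:
        # Invariant 1: every value must fit its field.
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        total += width
    # Invariant 2: the fields must cover the halfword exactly.
    if total != 16:
        raise ValueError(f"fields cover {total} bits, expected 16")
    return word


# Immediate move to r0 of 42: opcode 00100, Rd (3 bits), imm8
print(hex(pack_fields([(0b00100, 5), (0, 3), (42, 8)])))  # 0x202a
```

Keeping both checks in one place means each instruction template only has to list its fields, and range errors surface with the field width in the message.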

Minimal concrete example (pseudo, not runnable)

ENCODE(op=ADDS, rd=R0, rn=R1, imm3=5)
→ [opcode 0001110][imm3 101][rn 001][rd 000] = 0x1D48
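A runnable counterpart for the 16-bit three-operand add with a 3-bit immediate might look like the following. In this encoding the fields run, from the most significant bit down, opcode (0001110), imm3, Rn, Rd. The function name is hypothetical.

```python
def encode_adds_imm3(rd: int, rn: int, imm3: int) -> int:
    """Encode 16-bit Thumb ADDS Rd, Rn, #imm3."""
    for name, value in (("rd", rd), ("rn", rn), ("imm3", imm3)):
        if not 0 <= value <= 7:
            raise ValueError(f"{name}={value} does not fit in 3 bits")
    return (0b0001110 << 9) | (imm3 << 6) | (rn << 3) | rd


hw = encode_adds_imm3(rd=0, rn=1, imm3=5)
print(f"{hw:016b}")  # 0001110101001000
print(hex(hw))       # 0x1d48
```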

Common misconceptions

“Assembler will always accept my operands” → Encoding limits still apply.
“Instruction length doesn’t matter” → It affects alignment and memory layout.

Check-your-understanding questions

Why does Thumb use shorter encodings than AArch64?
What happens if an immediate is too large for its field?
Why can decoding with the wrong execution state break disassembly?

Check-your-understanding answers

Thumb optimizes for code density, which matters on M-profile parts with small, slow flash. A64 gives up some density for a fixed 32-bit width that simplifies decode.
The assembler emits a sequence or reports an error because it cannot fit the value.
The same bytes map to different instruction sets depending on state, so decoding mismatches yield nonsense.

Real-world applications

Building encoders/decoders for tooling and reverse engineering.
Size-sensitive firmware builds for microcontrollers.

Where you’ll apply it

This project: see §3.1 and §5.4 in P03-thumb-encoder-decoder.md
P03 Thumb Instruction Encoder/Decoder
P01 Toolchain Pipeline Explorer

References

Arm A-profile overview (AArch64, instruction model).

Key insights: Encoding constraints explain most “mysterious” assembler errors.

Summary: Instruction encodings are the boundary between human mnemonics and machine reality; mastering them unlocks predictability.

Homework/Exercises to practice the concept

Choose a Thumb instruction and manually identify which bits encode the register fields.
Explain why a large constant may require multiple instructions.

Solutions to the homework/exercises

The register fields are fixed bit slices in the instruction encoding; their size limits which registers are directly addressable.
If the immediate field is too small, the assembler must build the constant through multiple steps.
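For the first exercise, a short Python sketch can confirm a manual reading of the bits. It slices the halfword 0x202A from the mental model diagram into its opcode, Rd, and imm8 fields; the helper `bits` is a hypothetical name.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of value as an unsigned integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


hw = 0x202A
opcode = bits(hw, 15, 11)
rd = bits(hw, 10, 8)
imm8 = bits(hw, 7, 0)
print(f"opcode={opcode:05b} rd=r{rd} imm8={imm8}")  # opcode=00100 rd=r0 imm8=42
```

The 3-bit width of the Rd slice is what restricts this form to r0-r7.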

3. Project Specification

3.1 What You Will Build

A CLI that maps between mnemonics and 16-bit encodings for a chosen subset.

3.2 Functional Requirements

Requirement 1: Encode at least 6 Thumb instructions
Requirement 2: Decode 16-bit values into mnemonics
Requirement 3: Validate immediate ranges and register fields

3.3 Non-Functional Requirements

Clear error messages for invalid encodings

3.4 Example Usage / Output

$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A

$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1
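One way to produce the encode output shown above is sketched below for the single immediate-move form. The regular expression, the function name `render_mov`, and the error wording are assumptions made for illustration; the sketch accepts both the MOV spelling used in this guide and the flag-setting MOVS spelling, since both name the same 16-bit encoding here.

```python
import re

# Matches e.g. "MOV r0, #42" or "movs r3, #7" (decimal immediates only).
MOV_IMM = re.compile(r"^\s*MOVS?\s+r(\d+)\s*,\s*#(\d+)\s*$", re.IGNORECASE)


def render_mov(text: str) -> str:
    """Return the two report lines for an immediate move, or raise ValueError."""
    match = MOV_IMM.match(text)
    if match is None:
        raise ValueError("unsupported or malformed instruction")
    rd, imm = int(match.group(1)), int(match.group(2))
    if rd > 7:
        raise ValueError("register out of range (r0-r7)")
    if imm > 255:
        raise ValueError("immediate out of range (0-255)")
    hw = (0b00100 << 11) | (rd << 8) | imm
    # Print the fields separately so the grouping matches the bit layout.
    return f"encoding: 0b{0b00100:05b} {rd:03b} {imm:08b}\nhex: 0x{hw:04X}"


print(render_mov("MOV r0, #42"))
```

Formatting each field on its own keeps the printed grouping tied to the encoding template rather than to a fixed string slice.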

3.5 Data Formats / Schemas / Protocols

Mnemonic input: opcode + operands
Output: binary and hex

3.6 Edge Cases

Immediate too large
Invalid register index
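Both edge cases can be caught while parsing operands, before any bits are packed. The following is a minimal sketch; the function names, the sp/lr/pc aliases, and the message texts are assumptions rather than a required interface.

```python
REG_ALIASES = {"sp": 13, "lr": 14, "pc": 15}


def parse_register(token: str, low_only: bool = True) -> int:
    """Parse 'r0'..'r15' (or an alias) and enforce the field's register range."""
    name = token.strip().lower()
    if name in REG_ALIASES:
        index = REG_ALIASES[name]
    elif name.startswith("r") and name[1:].isdigit():
        index = int(name[1:])
    else:
        raise ValueError(f"invalid register name: {token!r}")
    limit = 7 if low_only else 15
    if index > limit:
        raise ValueError(f"register r{index} out of range (r0-r{limit})")
    return index


def parse_immediate(token: str, width: int) -> int:
    """Parse '#42' or '#0x2A' and check that it fits an unsigned field."""
    text = token.strip()
    if not text.startswith("#"):
        raise ValueError(f"immediate must start with '#': {token!r}")
    try:
        value = int(text[1:], 0)  # base 0 accepts decimal and 0x-prefixed hex
    except ValueError:
        raise ValueError(f"invalid immediate: {token!r}") from None
    if not 0 <= value < (1 << width):
        raise ValueError(f"immediate {value} does not fit in {width} bits")
    return value


print(parse_register("r0"), parse_immediate("#0x2A", 8))  # 0 42
```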

3.7 Real World Outcome

This is the golden reference for success:

You can manually validate machine code bytes against a reference table.

3.7.1 How to Run (Copy/Paste)

Build: follow the toolchain steps defined in this guide
Run: use the CLI examples in §3.4 with fixed inputs
Expected directory: project root

3.7.2 Golden Path Demo (Deterministic)

Run with a fixed input set and confirm output matches §3.4 exactly.

3.7.3 If CLI: Exact Terminal Transcript

$ thumb-encode "MOV r0, #42"
encoding: 0b00100 000 00101010
hex: 0x202A

$ thumb-decode 0xFFFF
error: unsupported or illegal opcode
exit code: 1

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Input Layer  │───▶│ Core Logic   │───▶│ Output Layer │
└──────────────┘     └──────────────┘     └──────────────┘

4.2 Key Components

Component	Responsibility	Key Decisions
Input Parser	Validate and normalize input	Strict error handling
Core Engine	Perform the main computation	Deterministic paths
Reporter	Produce user-facing output	Stable formatting

4.3 Data Structures (No Full Code)

Record Entry {
  name: string
  fields: list
  notes: text
}
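In Python, the record above could be expressed with dataclasses. This sketch adds two attributes that the record does not list, `mask` and `value`, as an assumed way to identify an opcode; all class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    hi: int  # most significant bit of the field
    lo: int  # least significant bit of the field


@dataclass(frozen=True)
class Encoding:
    name: str
    mask: int      # assumption: bits that identify the opcode
    value: int     # assumption: required value of those bits
    fields: tuple  # tuple of Field
    notes: str = ""

    def matches(self, hw: int) -> bool:
        return (hw & self.mask) == self.value

    def extract(self, hw: int) -> dict:
        return {f.name: (hw >> f.lo) & ((1 << (f.hi - f.lo + 1)) - 1)
                for f in self.fields}


MOVS_IMM8 = Encoding("MOVS", 0xF800, 0x2000,
                     (Field("rd", 10, 8), Field("imm8", 7, 0)),
                     "16-bit form, low registers only")

print(MOVS_IMM8.matches(0x202A), MOVS_IMM8.extract(0x202A))
```

Keeping the bit positions in data rather than in code lets the encoder and the decoder share one description of each instruction.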

4.4 Algorithm Overview

Key Algorithm: Core Flow

Parse input and validate parameters.
Execute the core transformation or analysis.
Emit deterministic output or error summary.

Complexity Analysis:

Time: O(n) in the size of input records
Space: O(n) for stored mappings and logs
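For the decode direction, the core flow can be sketched as a linear scan over a table of (mask, value, formatter) entries. The table contents, the function names, and the return-code convention are assumptions for illustration. The sketch prints the flag-setting spellings MOVS and ADDS, whereas the CLI examples in this guide write MOV.

```python
# Each entry: (mask, value, formatter). Only three 16-bit forms are covered.
TABLE = [
    (0xF800, 0x2000, lambda hw: f"MOVS r{(hw >> 8) & 7}, #{hw & 0xFF}"),
    (0xFE00, 0x1C00,
     lambda hw: f"ADDS r{hw & 7}, r{(hw >> 3) & 7}, #{(hw >> 6) & 7}"),
    (0xF800, 0x6800,
     lambda hw: f"LDR r{hw & 7}, [r{(hw >> 3) & 7}, #{((hw >> 6) & 0x1F) * 4}]"),
]


def decode(hw: int) -> str:
    """Parse/validate, transform, and return text, or raise ValueError."""
    if not 0 <= hw <= 0xFFFF:
        raise ValueError("value is not a 16-bit halfword")
    for mask, value, fmt in TABLE:
        if (hw & mask) == value:
            return fmt(hw)
    raise ValueError("unsupported or illegal opcode")


def main(argv) -> int:
    """Return the process exit code: 0 on success, 1 on any error."""
    try:
        print(decode(int(argv[1], 0)))
        return 0
    except (IndexError, ValueError) as exc:
        print(f"error: {exc}")
        return 1


print(decode(0x202A))  # MOVS r0, #42
```

Calling `main(["thumb-decode", "0xFFFF"])` prints the error line and returns 1, which a wrapper script could pass to `sys.exit`.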

5. Implementation Guide

5.1 Development Environment Setup

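# Install a toolchain and verify versions
# (assumes Python 3 and the GNU Arm Embedded toolchain; substitute your own)
python3 --version
arm-none-eabi-objdump --version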

5.2 Project Structure

project-root/
├── src/
│   ├── core
│   └── io
├── tests/
│   └── fixtures
├── docs/
└── README.md

5.3 The Core Question You’re Answering

“Encode and decode a subset of Thumb instructions.”

5.4 Concepts You Must Understand First

Stop and research these before coding:

Instruction Encoding
- What is the key invariant you must preserve?

5.5 Questions to Guide Your Design

Data Flow
- How does input become output?
- Which steps must be deterministic?
Validation
- What is the simplest test that proves correctness?
- How will you detect regressions?

5.6 Thinking Exercise

Trace the Critical Path

Write a step-by-step trace of the most important workflow in this project.

Questions to answer:

Where could a subtle bug hide?
What would you log to prove correctness?

5.7 The Interview Questions They’ll Ask

“What is the core invariant this project relies on?”
“How would you debug a failure in this workflow?”
“What trade-offs did you make in design?”
“How does this map to real hardware or toolchains?”
“How do you prove your output is correct?”

5.8 Hints in Layers

Hint 1: Start small. Focus on the smallest input that still demonstrates the concept.

Hint 2: Make output deterministic. Fix inputs and produce stable logs before expanding functionality.

Hint 3: Validate against a known reference. Compare with a known-good output or specification.

Hint 4: Add instrumentation. Log internal steps so you can verify each phase explicitly.

5.9 Books That Will Help

Topic	Book	Chapter
Core concept	“ARM Assembly Language” by William Hohl	Ch. 3-5
Binary formats	“Linkers and Loaders” by John R. Levine	Ch. 1-3

5.10 Implementation Phases

Phase 1: Foundation (2-4 hours)

Goals:

Establish a minimal working pipeline
Validate one end-to-end path

Tasks:

1. Build the smallest viable input and output
2. Verify outputs against a reference

Checkpoint: Output matches expected golden path

Phase 2: Core Functionality (4-8 hours)

Goals:

Implement main logic and validation
Add structured error handling

Tasks:

1. Implement the core transformation
2. Add deterministic reporting

Checkpoint: Core tests pass reliably

Phase 3: Polish & Edge Cases (2-4 hours)

Goals:

Cover edge cases
Improve output clarity

Tasks:

1. Add negative tests
2. Document limitations

Checkpoint: All edge cases handled gracefully

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Input format	Free-form vs structured	Structured	Easier validation
Output format	Human vs machine	Both	Supports verification and tooling

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate core logic	Field parsing, bounds checks
Integration Tests	Validate full flow	End-to-end CLI runs
Edge Case Tests	Validate boundaries	Empty input, invalid flags

6.2 Critical Test Cases

Golden path: Fixed input produces known output.
Invalid input: Error path triggers correct exit code.
Boundary case: Maximum supported value handled correctly.

6.3 Test Data

Input: fixed seed or fixed fixture
Expected: exact output text from §3.4
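The three critical cases can be expressed as plain assertions. The sketch below defines a tiny immediate-move encoder and decoder inline so that it stands alone; in a real project these would be imported from the tool under test. All names are hypothetical.

```python
def encode_movs(rd: int, imm8: int) -> int:
    if not 0 <= rd <= 7:
        raise ValueError("register out of range (r0-r7)")
    if not 0 <= imm8 <= 255:
        raise ValueError("immediate out of range (0-255)")
    return (0b00100 << 11) | (rd << 8) | imm8


def decode_movs(hw: int):
    if (hw & 0xF800) != 0x2000:
        raise ValueError("unsupported or illegal opcode")
    return (hw >> 8) & 7, hw & 0xFF


def run_checks() -> str:
    # Golden path: fixed input produces the known output.
    assert encode_movs(0, 42) == 0x202A
    # Boundary case plus exhaustive round trip over the whole operand space.
    assert encode_movs(7, 255) == 0x27FF
    for rd in range(8):
        for imm in range(256):
            assert decode_movs(encode_movs(rd, imm)) == (rd, imm)
    # Invalid input: each bad operand must raise, never encode silently.
    for bad in ((8, 0), (0, 256), (-1, 0)):
        try:
            encode_movs(*bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted invalid operands {bad}")
    return "ok"


print(run_checks())  # ok
```

Because a 16-bit form has at most 65,536 bit patterns, exhaustive round-trip checks like this are cheap and catch shifted or swapped fields immediately.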

7. Common Pitfalls & Debugging

Pitfall	Symptom	Solution
Misaligned assumptions	Unexpected output	Re-check invariants
Missing validation	Silent failures	Add explicit checks
Non-determinism	Flaky output	Fix inputs and seeds

7.2 Debugging Strategies

Trace everything: Log each step with stable ordering
Compare against reference: Use known-good outputs

7.3 Performance Traps

Avoid repeated parsing of the same input; cache results when possible

8. Extensions & Challenges

8.1 Beginner Extensions

Add one extra output format
Add a help screen with examples

8.2 Intermediate Extensions

Add a verification mode that compares two outputs
Add structured JSON output

8.3 Advanced Extensions

Add a batch mode for large inputs
Add cross-target comparisons (M vs A profile)

9. Real-World Connections

9.1 Industry Applications

Firmware bring-up: use the same checks to validate early boot images
Security audits: analyze binaries for ABI or control-flow correctness

9.2 Related Open Source Projects

binutils: source of many ARM tooling workflows
QEMU: emulator used for ARM testing

9.3 Interview Relevance

Explains why ARM behavior differs across profiles
Demonstrates toolchain literacy and debugging rigor

10. Resources

10.1 Essential Reading

“ARM Assembly Language” by William Hohl - practical instruction usage
“Linkers and Loaders” by John R. Levine - binary layout

10.2 Video Resources

ARM architecture overview talks and lectures

10.3 Tools & Documentation

GNU binutils documentation
Arm developer documentation

10.4 Related Projects in This Series

This project connects with: P01-toolchain-pipeline-explorer.md, P02-register-stack-visualizer.md, P04-mmio-memory-map-notebook.md

11. Self-Assessment Checklist

11.1 Understanding

I can explain the core concept without notes
I can explain why my design choices were necessary
I can describe one realistic failure mode

11.2 Implementation

All functional requirements are met
Tests pass deterministically
Edge cases are documented

11.3 Growth

I can describe what I would improve next time
I can explain this project in an interview

12. Submission / Completion Criteria

Minimum Viable Completion:

Core functionality works on reference inputs
Deterministic golden path is documented
At least one failure path is demonstrated

Full Completion:

All minimum criteria plus:
Edge cases are covered with tests
Output format is stable and documented

Excellence (Going Above & Beyond):

Add a comparison against a second target
Provide a short write-up of lessons learned
