Sprint: X86-64 Assembly Mastery - Real World Projects

Goal: Develop a first-principles understanding of x86-64 assembly as a living system: architectural state, instruction encoding, memory addressing, calling conventions, and the OS interface. You will learn how source ideas become machine behavior, how ABIs glue code together, and how performance emerges from microarchitecture. By the end, you will be able to analyze binaries, reason about correctness at the instruction level, and build tools that reveal what the CPU is actually doing. You will also be able to design, implement, and validate real-world low-level tooling that is useful for debugging, reverse engineering, performance work, and systems programming.

Introduction

  • What is x86-64 assembly? It is the human-facing notation for the 64-bit extension of the x86 instruction set that modern PCs and servers execute. It specifies operations on registers, memory, and control flow as the CPU sees them, constrained by the architecture and ABI.
  • What problem does it solve today? It provides the ground truth for performance, debugging, reverse engineering, and systems programming when high-level abstractions fail or when you must interoperate across compilers, runtimes, or OS boundaries.
  • What will you build across the projects? Binary inspectors, instruction decoders, ABI visualizers, syscall and exception flow tools, and performance probes that make the invisible visible.
  • What is in scope vs out of scope? In scope: user-mode x86-64, System V and Windows ABIs, ELF/PE formats, instruction encoding, and performance reasoning. Out of scope: full OS kernels, hypervisors, and vendor-specific microcode details.

Big-picture diagram:

               HIGH-LEVEL INTENT
                        |
                        v
          +---------------------------+
          |  Source / Pseudocode      |
          +---------------------------+
                        |
                        v
          +---------------------------+
          |  Compiler / Assembler     |
          +---------------------------+
                        |
                        v
          +---------------------------+
          |  Object File (ELF/PE)     |
          +---------------------------+
                        |
                        v
          +---------------------------+
          |  Linker + Loader + ABI    |
          +---------------------------+
                        |
                        v
          +---------------------------+
          |  CPU (x86-64 ISA)         |
          |  Registers / Flags / Mem  |
          +---------------------------+
                        |
                        v
          +---------------------------+
          |  Observable Outcomes      |
          |  (perf, traces, crashes)  |
          +---------------------------+

How to Use This Guide

  • Read the Theory Primer first. Treat it like a short technical book that builds the mental model you will reuse in every project.
  • Pick a learning path that fits your background. If you are new to assembly, start with the ABI and tooling projects. If you are experienced, jump to instruction encoding and performance.
  • After each project, validate your progress using the Definition of Done and the observable outputs provided. If you cannot reproduce the expected output shape, return to the concept chapters.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • Comfortable with C or another systems language (pointers, memory, stack frames)
  • Basic understanding of CPU concepts (registers, memory, instructions)
  • Familiarity with Linux or Windows command line tools
  • Recommended Reading: “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 1, 3

Helpful But Not Required

  • Compiler basics (learn during Projects 9-10)
  • Operating systems concepts (learn during Projects 7-8)
  • Debugging symbols and DWARF (learn during Project 6)

Self-Assessment Questions

  1. Can you explain how a function call becomes a stack frame and return address?
  2. Can you describe the difference between virtual memory and physical memory?
  3. Can you interpret a hexdump as little-endian data?

Development Environment Setup

Required Tools:

  • Linux or macOS terminal (or Windows with WSL2)
  • GNU binutils (objdump, readelf, nm) or LLVM tools (llvm-objdump, llvm-readobj)
  • A debugger (gdb or lldb)

Recommended Tools:

  • A hex viewer (xxd or hexdump)
  • perf or VTune-like profiler (optional)

Testing Your Setup:

$ objdump --version
GNU objdump (GNU Binutils) [version output]

Time Investment

  • Simple projects: 4-8 hours each
  • Moderate projects: 10-20 hours each
  • Complex projects: 20-40 hours each
  • Total sprint: 3-6 months

Important Reality Check

Assembly forces you to confront details that high-level languages hide. Progress can feel slow because every small error is visible in the output. That is expected. The goal is not speed; it is precision and confidence.

Big Picture / Mental Model

x86-64 work sits at the intersection of architecture, ABI, and OS contracts. You are not only learning instructions, you are learning the agreements that make programs executable and composable.

          +-------------------+      +-------------------+
          |  Architecture     |      |       ABI         |
          |  (ISA + State)    |      | (Calling, Syscall)|
          +---------+---------+      +---------+---------+
                    |                          |
                    |                          |
                    v                          v
          +------------------------------------------------+
          |             Binary Representation              |
          |  Encoding, Object Files, Relocations, Symbols  |
          +----------------------+-------------------------+
                                 |
                                 v
          +------------------------------------------------+
          |            OS + Loader + Runtime               |
          |  Privilege, Syscalls, Exceptions, Unwinding    |
          +----------------------+-------------------------+
                                 |
                                 v
          +------------------------------------------------+
          |               Observable Behavior              |
          |   Performance, Crashes, Debug Traces, Output   |
          +------------------------------------------------+

Theory Primer

Concept 1: Architectural State and Execution Modes

Fundamentals

The x86-64 architecture defines a precise set of architectural state: general-purpose registers, flags, instruction pointer, and control registers. x86-64 is a 64-bit extension of the x86 family; it adds 64-bit registers and a 64-bit address space while preserving backward compatibility. In practice, the CPU runs in long mode when executing 64-bit code. Long mode includes both 64-bit submode and compatibility submode for 32-bit code, which is why the same processor can run modern OSes and legacy binaries. Understanding architectural state is foundational: every instruction is just a transformation of state, and every bug is a mismatch between assumed state and actual state. Official architecture references describe the state, registers, and execution modes in detail, and those documents are the ground truth for everything you will do in this guide. (Sources: Intel SDM, Microsoft x64 architecture docs)

Deep Dive

Architectural state is the contract between software and hardware. On x86-64, this contract includes 16 general-purpose registers (GPRs), instruction pointer (RIP), flags (RFLAGS), vector registers, and a collection of control and model-specific registers. Most assembly you write or analyze uses the GPRs, RIP, and RFLAGS. A key idea is that x86-64 is not a clean break from x86; it is an extension. The architecture introduces 64-bit registers by extending the existing 32-bit ones. The 64-bit submode uses RIP-relative addressing as a first-class form of memory reference, enabling position-independent code. Long mode also changes segmentation behavior: segmentation is mostly disabled for code and data, with flat 64-bit addressing, while FS/GS remain available for thread-local storage and certain OS conventions. Compatibility submode exists to run 32-bit code, which uses the legacy 32-bit register view, limited address space, and different calling conventions. This duality matters because it affects how you reason about binaries, how tools interpret instruction encodings, and how the OS sets up execution.

The register file has subtleties that matter in real code. The 64-bit GPRs can be accessed as 32-bit, 16-bit, and 8-bit sub-registers. Writes to the 32-bit sub-registers zero-extend into the full 64-bit register, which is a performance and correctness feature. Writes to 8-bit and 16-bit sub-registers do not zero-extend and can create partial-register dependencies. That can cause performance penalties on some microarchitectures, which is why compilers prefer 32-bit writes when possible. The RFLAGS register contains condition codes and control flags such as the zero flag, carry flag, and direction flag. Understanding which instructions modify which flags is critical when you analyze branches and conditionals. Even if you are not writing assembly, reading disassembly requires you to track how flags and registers evolve.
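
The zero-extension rule is easy to observe directly. Below is a minimal sketch in C with GCC/Clang inline assembly, assuming an x86-64 Linux or macOS machine; the constant values are arbitrary and chosen only to make the upper half visible.

// Minimal sketch (assumes GCC or Clang inline asm on x86-64): a 32-bit write to
// EAX clears bits 63:32 of RAX, so the printed value has a zeroed upper half.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t out;
    __asm__ volatile(
        "movq $-1, %%rax\n\t"          /* rax = 0xFFFFFFFFFFFFFFFF  */
        "movl $0x11223344, %%eax"      /* 32-bit write zero-extends */
        : "=a"(out));                  /* result read back from rax */
    printf("rax after 32-bit write: %#018" PRIx64 "\n", out);
    return 0;
}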

Execution mode is another layer of architectural state. The CPU can run in real mode, protected mode, or long mode, each with different addressing and privilege semantics. For x86-64 user-mode work, you primarily live in long mode under an OS that manages paging and privilege transitions. The OS configures control registers, enables paging, and establishes the ABI conventions that user code must follow. That is why architecture knowledge is always paired with ABI knowledge; the architecture defines what is possible, the ABI defines what is expected.

Finally, x86-64 is a CISC architecture with variable-length instructions and a complex encoding scheme. This is part of the architectural state because the instruction pointer advances by the decoded length of each instruction, and the decoder depends on the correct interpretation of prefixes and operand sizes. When you study architectural state, keep the decoder in mind because it is how the CPU interprets the instruction stream. The decode rules are defined in the vendor manuals and are not optional. You cannot reason about control flow without understanding how the instruction pointer moves, and you cannot reason about side effects without knowing which registers and flags are architectural vs microarchitectural.

How this fits on projects

  • Projects 1-4 build tools that show and validate architectural state transitions.
  • Projects 5-8 require precise understanding of registers, flags, and execution mode to explain control flow and syscalls.

Definitions & key terms

  • Architectural state: The CPU state visible and defined by the ISA.
  • Long mode: 64-bit execution mode that enables 64-bit addressing and registers.
  • Compatibility submode: 32-bit execution within long mode.
  • RIP: The instruction pointer in 64-bit mode.
  • RFLAGS: The status and control flags register.

Mental model diagram

+------------------------------+
|     Architectural State      |
+------------------------------+
| GPRs: RAX..R15               |
| RIP (instruction pointer)    |
| RFLAGS (status/control)      |
| SIMD regs (XMM/YMM/ZMM)      |
| Control regs (CR0/CR3/CR4)   |
+--------------+---------------+
               |
               v
+------------------------------+
|   Execution Mode (Long)      |
|  - 64-bit submode            |
|  - 32-bit compat submode     |
+--------------+---------------+
               |
               v
+------------------------------+
|   Instruction Decoder        |
|  (interprets byte stream)    |
+------------------------------+

How it works

  1. OS configures control registers to enable long mode and paging.
  2. CPU fetches instruction bytes at RIP.
  3. Decoder interprets prefixes and operand sizes based on mode.
  4. Instruction executes, updating registers and RFLAGS.
  5. RIP advances by decoded instruction length.

Invariants and failure modes:

  • Invariant: RIP always points to the next instruction boundary.
  • Failure: incorrect decoding changes RIP and corrupts control flow.
  • Invariant: mode determines operand sizes and address sizes.
  • Failure: mode confusion leads to misinterpreting data as code.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
STATE:
  REG_A = 5
  REG_B = 7

INSTRUCTION STREAM:
  LOAD64 REG_TMP, [ADDR_X]
  ADD64  REG_A, REG_B
  CMP64  REG_A, REG_TMP
  JUMP_IF_ZERO LABEL_OK

Common misconceptions

  • “x86-64 is totally different from x86.” It is an extension with compatibility.
  • “All registers are independent.” Sub-register writes can affect full registers.
  • “Flags are only for comparisons.” Many arithmetic instructions update flags.

Check-your-understanding questions

  1. Why does writing a 32-bit sub-register zero-extend the 64-bit register?
  2. What does compatibility mode allow in long mode?
  3. Why is RIP-relative addressing important for position-independent code?

Check-your-understanding answers

  1. It simplifies dependency tracking and enables efficient zero-extension.
  2. It allows 32-bit code to run on a 64-bit CPU under a 64-bit OS.
  3. It allows code to reference nearby data without absolute addresses.

Real-world applications

  • Reverse engineering compiled binaries
  • Debugging crashes and register corruption
  • Building profilers and tracers

Where you will apply it Projects 1, 2, 3, 4, 5, 7

References

  • Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
  • Microsoft x64 architecture documentation
  • “The Art of 64-Bit Assembly, Volume 1” by Randall Hyde - Ch. 1-3

Key insights Architectural state is the smallest truth you can trust when everything else is uncertain.

Summary You cannot reason about x86-64 without knowing what the CPU state is and how execution mode shapes instruction meaning.

Homework/Exercises to practice the concept

  • List all architectural registers you can name and group them by purpose.
  • Draw the state transitions of a simple conditional branch.

Solutions to the homework/exercises

  • Group into GPRs, RIP, RFLAGS, SIMD, control registers.
  • Show state before compare, flags after compare, and RIP change on branch.

Concept 2: Data Representation, Memory, and Addressing

Fundamentals

x86-64 is a byte-addressed, little-endian architecture. Data representation determines how values appear in memory, how loads and stores reconstruct those values, and how alignment affects performance. Memory addressing is not just “base + offset”; it is a rich set of forms including base, index, scale, and displacement, plus RIP-relative addressing in 64-bit mode. These addressing forms are part of the ISA and are a primary tool for compilers. Virtual memory adds another layer: the addresses you see in registers are virtual, translated by page tables configured by the OS. When you write or analyze assembly, you are always navigating both representation and translation. Official architecture references and ABI specifications describe these addressing forms and constraints. (Sources: Intel SDM, Microsoft x64 architecture docs)

Deep Dive

Data representation is the mapping between abstract values and physical bytes. On x86-64, signed integers are represented in two's complement, and multi-byte values are stored in little-endian order. That means the least significant byte sits at the lowest memory address. When you inspect memory dumps, the order will appear reversed relative to the human-readable hex. This matters for debugging and binary analysis; it also matters for writing correct parsing and serialization logic.
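
To see this concretely, here is a minimal C sketch (no assumptions beyond running on a little-endian x86-64 target) that dumps the in-memory bytes of a 64-bit constant.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint64_t value = 0x1122334455667788ULL;
    unsigned char bytes[8];
    memcpy(bytes, &value, sizeof value);   /* copy the in-memory representation */
    for (int i = 0; i < 8; i++)
        printf("%02x ", bytes[i]);         /* prints 88 77 66 55 44 33 22 11 on x86-64 */
    printf("\n");
    return 0;
}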

Memory addressing is a key differentiator between x86-64 and many simpler ISAs. The architecture supports effective addresses of the form base + index * scale + displacement, where scale can be 1, 2, 4, or 8. This lets the CPU calculate addresses for arrays and structures in a single instruction, which is why compiler output often uses complex addressing instead of explicit multiply or add instructions. In long mode, RIP-relative addressing is widely used for position-independent code; it allows the instruction stream to refer to nearby constants and jump tables without absolute addresses. That is why you will see references relative to RIP rather than absolute pointers in modern binaries.
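
The scaled-index form is easiest to appreciate from the compiler's side. The sketch below is ordinary C; the disassembly mentioned in the comment is what compilers typically emit under the SysV convention, not a guaranteed output.

#include <stdint.h>

/* base[i] has effective address base + i*8, which fits a single addressing form.
   With -O2 under SysV (base in RDI, i in RSI, result in RAX), compilers typically
   emit something like:  mov rax, qword ptr [rdi + rsi*8]                         */
int64_t get_element(const int64_t *base, long i) {
    return base[i];
}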

Virtual memory is the next layer of meaning. The addresses in registers are virtual; they are translated to physical addresses using a page table hierarchy. As a result, two different processes can have the same virtual address mapping to different physical memory. The OS enforces protection and isolation through page permissions. When you read assembly, you see the virtual addresses. The mapping is invisible unless you consult page tables or OS introspection tools, which is why memory corruption bugs can appear non-deterministic; they might read valid memory but the wrong mapping.

Alignment is another subtlety. Many instructions perform better when data is aligned to its natural width (for example, 8-byte aligned for 64-bit values). Misaligned loads are supported in x86-64 but can be slower or cause extra microarchitectural work. ABI conventions often require stack alignment to 16 bytes at call boundaries, which ensures that SIMD operations and stack-based data are aligned. This alignment rule is part of the ABI, not just a performance hint.
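
A small C sketch of natural alignment; the only assumption is a C11 compiler, and the misaligned pointer is formed but never dereferenced.

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Natural alignment of a 64-bit integer is 8 bytes on x86-64. */
    printf("alignof(uint64_t) = %zu\n", alignof(uint64_t));
    alignas(8) unsigned char buf[16];
    uint64_t *p = (uint64_t *)(buf + 1);   /* deliberately misaligned by one byte */
    printf("8-byte aligned? %s\n", ((uintptr_t)p % 8 == 0) ? "yes" : "no");
    /* The SysV ABI additionally requires RSP to be 16-byte aligned at call sites. */
    return 0;
}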

Addressing modes also influence instruction encoding. The ModR/M and SIB bytes encode the base, index, scale, and displacement. Some combinations are invalid or have special meaning (for example, certain base/index fields imply RIP-relative addressing or a displacement-only form). Understanding this encoding is critical for building decoders and for interpreting bytes in memory. It is also how you can verify that a disassembler is correct: the addressing mode can be inferred from the encoding and compared to the textual rendering.

Finally, consider how data representation affects control flow and calling conventions. Arguments passed by reference are simply addresses; the ABI does not enforce type. That means assembly must interpret the bytes correctly, or the program will behave incorrectly even if the instruction sequence is “valid.” This is where assembly becomes a discipline: you must know what the bytes mean, and that meaning is not written anywhere except in the ABI and the program’s logic.

How this fits on projects

  • Projects 2-4 are explicitly about effective address calculation and RIP-relative forms.
  • Projects 9-10 require precise understanding of data layout and alignment inside ELF/PE sections.

Definitions & key terms

  • Little-endian: Least significant byte at lowest address.
  • Effective address: The computed address used by a memory instruction.
  • RIP-relative: Addressing relative to the instruction pointer.
  • Virtual memory: The address space seen by a process, mapped to physical memory.
  • Alignment: Address boundary that improves correctness or performance.

Mental model diagram

VALUE -> BYTES -> VIRTUAL ADDRESS -> PAGE TABLE -> PHYSICAL ADDRESS

      +---------+          +------------------+
      |  Value  |  encode  |  Byte Sequence   |
      +---------+          +---------+--------+
                                    |
                                    v
                         +----------------------+
                         |  Effective Address   |
                         | base + index*scale + |
                         |      displacement    |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Virtual Address    |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Page Translation   |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Physical Address   |
                         +----------------------+

How it works

  1. Program computes effective address from base/index/scale/disp.
  2. CPU uses that effective address as a virtual address.
  3. MMU translates virtual to physical using page tables.
  4. Data is loaded or stored in little-endian byte order.

Invariants and failure modes:

  • Invariant: Effective address is computed before translation.
  • Failure: Misinterpreting endianness yields wrong values.
  • Invariant: ABI defines alignment at call boundaries.
  • Failure: Misalignment can break SIMD assumptions or slow down code.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
# Compute address of element i in an array of 8-byte elements
EFFECTIVE_ADDRESS = BASE_PTR + INDEX * 8 + OFFSET
LOAD64 REG_X, [EFFECTIVE_ADDRESS]

Common misconceptions

  • “x86-64 is big-endian.” It is little-endian by default.
  • “All addresses are physical.” User code uses virtual addresses.
  • “Alignment is optional.” It is required by ABI for some operations.

Check-your-understanding questions

  1. Why does little-endian matter when reading a hexdump?
  2. What is the difference between effective and virtual address?
  3. Why do compilers use base+index*scale addressing?

Check-your-understanding answers

  1. The byte order is reversed relative to human-readable hex.
  2. Effective is computed by the instruction; virtual is then translated.
  3. It encodes array indexing in a single instruction.

Real-world applications

  • Debugging pointer arithmetic errors
  • Building instruction decoders and disassemblers
  • Understanding how compilers lay out data

Where you will apply it Projects 2, 3, 4, 9, 10

References

  • Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
  • Microsoft x64 architecture documentation
  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 3

Key insights Memory is not just bytes; it is a layered mapping between representation and address translation.

Summary Effective addressing and data layout are the glue between values in your head and bytes in memory.

Homework/Exercises to practice the concept

  • Convert a 64-bit integer into its little-endian byte sequence.
  • Compute effective addresses for an array with different indices.

Solutions to the homework/exercises

  • List the bytes from least significant to most significant.
  • Use base + index * element_size + offset.

Concept 3: Instruction Encoding and Decoding

Fundamentals

x86-64 instructions are variable-length and encoded as a sequence of bytes that may include prefixes, an opcode, ModR/M and SIB bytes, displacement, and immediates. Unlike fixed-width ISAs, instruction length is determined by decoding. This makes decoding complex but flexible. It is also why disassemblers can become confused when they start decoding at the wrong offset. The authoritative definition of instruction encoding and formats is in the vendor manuals, and any tool that handles x86-64 must follow those rules to be correct. (Source: Intel SDM)

Deep Dive

Instruction encoding is the bridge between human-readable assembly and raw bytes. x86-64 uses a layered encoding scheme that evolved over decades. Most instructions start with optional prefixes that modify operand size, address size, or provide specialized semantics. In 64-bit mode, the REX prefix extends the register set and selects 64-bit operand size. After prefixes comes the opcode, which identifies the basic instruction. Some instructions use a single-byte opcode; others use opcode escape bytes or multi-byte opcodes for extended instruction sets.

The ModR/M byte is a core part of encoding. It specifies the addressing mode and, in many cases, which registers are used. If the ModR/M indicates memory addressing with an index register, the SIB byte is present and encodes scale, index, and base. Then a displacement may follow. Immediate values, if present, appear at the end of the encoding. The length of an instruction therefore depends on which of these elements appear, which is why decoding must be performed left-to-right in a strict order.
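
The bit layout of the ModR/M and SIB bytes is fixed, so splitting them into fields is the natural first step for a decoder. Here is a minimal sketch in C; the field positions follow the encoding rules described above, while the struct and function names are only illustrative.

#include <stdio.h>

/* ModR/M: mod = bits 7-6, reg = bits 5-3, rm = bits 2-0.
   SIB:    scale = bits 7-6, index = bits 5-3, base = bits 2-0. */
struct modrm_fields { unsigned mod, reg, rm; };

static struct modrm_fields split_modrm(unsigned char b) {
    struct modrm_fields f = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return f;
}

int main(void) {
    struct modrm_fields f = split_modrm(0x05);
    /* In 64-bit mode, mod=00 with rm=101 means RIP-relative addressing with a disp32. */
    printf("mod=%u reg=%u rm=%u\n", f.mod, f.reg, f.rm);
    return 0;
}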

In 64-bit mode, REX prefixes are both powerful and subtle. They extend the register specifiers so you can address the full set of 16 GPRs. They also indicate whether the operand size is 64-bit. That means a missing REX prefix can change the meaning of an instruction even if the opcode and ModR/M bytes are the same. This is a common source of decoding errors in custom tooling. Another source of complexity is that some instructions have implicit operands or fixed registers, so the encoding does not explicitly list all registers involved.

Modern extensions introduce additional prefix schemes like VEX and EVEX for SIMD. These prefixes can replace older opcode sequences and encode vector register width, masking, and other features. From a decoding standpoint, these prefixes are distinct instruction classes with their own rules. That is why instruction decoders are often table-driven: they need to map prefix and opcode combinations to specific instruction behaviors.

The practical consequence is that a decoder must be deterministic and validated against known-good outputs. Tools like objdump or llvm-objdump are correct because they implement the official encoding tables. Your own decoder should not guess. You should test it against real binaries, but also against synthetic byte sequences that exercise edge cases: prefix combinations, displacement lengths, and registers beyond the first eight.
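
One way to make "do not guess" concrete is a harness that feeds known encodings through your decoder and compares the computed lengths against values you have verified with objdump. The sketch below bundles a deliberately tiny length decoder that handles only the listed cases (NOP, RET, and two MOV forms); it is illustrative, not a real decoder.

#include <stdio.h>
#include <stddef.h>

/* Toy length decoder: optional REX prefix, then a handful of opcodes. */
static size_t decode_length(const unsigned char *p) {
    size_t len = 0;
    if ((p[len] & 0xF0) == 0x40) len++;         /* REX prefix (0x40-0x4F)        */
    unsigned char op = p[len++];
    if (op == 0x90 || op == 0xC3) return len;   /* nop, ret: no ModR/M           */
    if (op == 0x89 || op == 0x8B) {             /* mov r/m64,r64 / mov r64,r/m64 */
        unsigned char modrm = p[len++];
        unsigned mod = modrm >> 6, rm = modrm & 7;
        if (mod != 3 && rm == 4) len++;         /* SIB byte present              */
        if (mod == 1) len += 1;                 /* disp8                         */
        else if (mod == 2) len += 4;            /* disp32                        */
        else if (mod == 0 && rm == 5) len += 4; /* RIP-relative disp32           */
        return len;
    }
    return 0;                                   /* unsupported in this sketch    */
}

struct test_case { unsigned char bytes[8]; size_t expect; const char *text; };

int main(void) {
    static const struct test_case cases[] = {
        { { 0x90 },                         1, "nop" },
        { { 0xC3 },                         1, "ret" },
        { { 0x48, 0x89, 0xC3 },             3, "mov rbx, rax" },
        { { 0x48, 0x8B, 0x05, 0, 0, 0, 0 }, 7, "mov rax, [rip+disp32]" },
    };
    int failed = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        size_t got = decode_length(cases[i].bytes);
        printf("%-22s expected %zu got %zu\n", cases[i].text, cases[i].expect, got);
        if (got != cases[i].expect) failed = 1;
    }
    return failed;
}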

Instruction encoding is also where security tools operate. Many binary analysis techniques rely on correctly decoding instruction boundaries. If the decoder is off by one byte, the entire control flow graph can be wrong. That is why you will build tooling that validates decoding and cross-checks against known good references. Understanding encoding is not optional if you want to work in malware analysis, reverse engineering, or binary instrumentation.

How this fits on projects

  • Projects 3 and 4 directly build instruction decoding and boundary validation.
  • Projects 9 and 10 rely on decoding relocations and code references.

Definitions & key terms

  • Prefix: Byte that modifies instruction meaning or operand size.
  • Opcode: The core instruction identifier.
  • ModR/M: Encodes registers and addressing mode.
  • SIB: Scale-Index-Base encoding for complex addresses.
  • Displacement: Constant added to an address.
  • Immediate: Constant operand embedded in the instruction.

Mental model diagram

[ Prefixes ] [ Opcode ] [ ModR/M ] [ SIB ] [ Disp ] [ Imm ]
     |            |         |         |       |       |
     v            v         v         v       v       v
  modifiers    instruction  regs   addr     offset  constant

How it works

  1. Read optional prefixes and record operand/address size.
  2. Read opcode byte(s) and map to instruction class.
  3. If required, read ModR/M and determine register/memory form.
  4. If ModR/M indicates, read SIB.
  5. Read displacement and immediate fields by size.
  6. Compute final instruction length and semantics.

Invariants and failure modes:

  • Invariant: Decoding order is strict and deterministic.
  • Failure: Skipping a required prefix shifts the decode boundary.
  • Invariant: ModR/M determines whether SIB is present.
  • Failure: Incorrect ModR/M parsing breaks address calculation.

Minimal concrete example (pseudo-encoding)

# PSEUDOCODE ONLY
BYTES: [PFX][OP][MR][SIB][DISP32]
DECODE:
  prefix = PFX
  opcode = OP
  addressing = decode_modrm(MR, SIB, DISP32)
  length = 1 + 1 + 1 + 1 + 4

Common misconceptions

  • “Instruction length is fixed.” It is variable-length and decoder-driven.
  • “Opcode alone defines the instruction.” Prefixes and ModR/M change meaning.
  • “Decoding is trivial.” It is table-driven and full of edge cases.

Check-your-understanding questions

  1. Why is decoding order critical in x86-64?
  2. What does the ModR/M byte control?
  3. Why do decoders need tables rather than simple logic?

Check-your-understanding answers

  1. Because prefixes and opcode length determine where fields begin.
  2. It chooses register vs memory forms and which registers are used.
  3. Because many encodings are irregular and context-dependent.

Real-world applications

  • Building disassemblers and binary analyzers
  • Static malware analysis and reverse engineering
  • JIT and instrumentation tooling

Where you will apply it Projects 3, 4, 9, 10

References

  • Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
  • “Modern X86 Assembly Language Programming” by Daniel Kusswurm - Ch. 2-4

Key insights Decoding is the gatekeeper: if the bytes are misread, everything else is wrong.

Summary Instruction encoding is the rulebook that allows bytes to become meaning.

Homework/Exercises to practice the concept

  • Write out the fields you would expect to decode from a byte stream.
  • Identify which fields would change if you referenced an extended register (R8-R15).

Solutions to the homework/exercises

  • Prefix, opcode, ModR/M, optional SIB, displacement, immediate.
  • A REX prefix is needed to reach the extended registers R8-R15.

Concept 4: Control Flow, Stack, and Calling Conventions

Fundamentals

Control flow in x86-64 is built from instruction pointer changes: calls, returns, jumps, and conditional branches. The stack provides a structured way to save state, pass arguments, and return values. Calling conventions define the contract between caller and callee: which registers hold arguments, which registers must be preserved, how the stack is aligned, and where return values appear. The two dominant 64-bit ABIs are System V AMD64 (used by Linux and macOS) and the Windows x64 convention. They share ideas but differ in register usage, red zone rules, and stack shadow space. These conventions are defined in official ABI documents and OS documentation. (Sources: System V AMD64 ABI, Microsoft x64 calling convention docs)

Deep Dive

The stack is a contiguous region of memory that grows downward. Every call pushes a return address and often creates a stack frame for local storage and saved registers. A function prologue typically adjusts the stack pointer, saves callee-saved registers, and optionally sets up a frame pointer. The epilogue reverses these steps and returns to the caller. This is a convention, not a requirement; optimized code can omit the frame pointer, and leaf functions may never touch the stack at all. Still, understanding the standard layout is critical for debugging, unwinding, and ABI interoperability.

System V AMD64 ABI defines the primary calling convention for UNIX-like systems. The first six integer or pointer arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9); additional arguments are passed on the stack. Integer return values are placed in RAX. The stack must be aligned to a 16-byte boundary at call sites. The ABI also defines a red zone: a 128-byte region below the stack pointer that the kernel and signal handlers will not clobber, allowing leaf functions to use it without adjusting the stack. This subtle rule impacts code generation and is why some stack adjustments are optional in leaf functions. The ABI also specifies which registers are caller-saved and callee-saved. Caller-saved registers can be clobbered by the callee; callee-saved registers must be preserved across the call. Understanding these rules allows you to read disassembly and reconstruct the calling context.
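
A quick way to see this register assignment in practice is to compile a tiny function and read its disassembly. A minimal sketch, assuming a SysV platform (Linux or macOS) and any C compiler; the file name and build command are just examples.

/* Build and inspect:  cc -O2 -c args.c && objdump -d args.o
   Under SysV AMD64, a..f arrive in RDI, RSI, RDX, RCX, R8, R9; g is read
   from the caller's stack; the sum is returned in RAX.                   */
long sum7(long a, long b, long c, long d, long e, long f, long g) {
    return a + b + c + d + e + f + g;
}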

Windows x64 uses a different register assignment and requires a 32-byte shadow space (home space) on the stack for the first four arguments, even if they are passed in registers. The callee can use this space to spill arguments, and the caller must allocate it. Windows x64 does not define a red zone, so leaf functions cannot safely use space below the stack pointer. This means that calling convention errors between platforms are common sources of crashes when interfacing with mixed environments or when porting low-level code.

Control flow also includes indirect calls and jumps, which use register or memory operands. These are common in virtual dispatch, function pointers, and dynamic linking. Understanding the ABI rules helps you determine whether a given register contains a valid function pointer or an argument. The stack alignment rule is critical for SIMD operations; misalignment can cause crashes or performance penalties. The ABI is therefore both a functional contract and a performance contract.

Unwinding is another piece of the puzzle. Debuggers and exception handlers need to reconstruct call stacks after a crash or during stack walking. This uses metadata (DWARF call-frame information on UNIX-like systems; unwind data in the PE .pdata/.xdata sections, plus PDB symbols, on Windows) that describes how to restore registers and adjust the stack. Even if you are not writing that metadata, you must follow the ABI so that compilers and tools can generate it correctly.

How this fits on projects

  • Projects 5 and 6 are focused on calling conventions, stack layout, and unwinding.
  • Projects 7-8 rely on accurate control flow interpretation to map syscalls and exceptions.

Definitions & key terms

  • Calling convention: Contract between caller and callee.
  • Callee-saved: Registers preserved by the callee.
  • Caller-saved: Registers that may be clobbered by the callee.
  • Red zone: Stack space below SP reserved for leaf functions (SysV).
  • Shadow space: Stack space reserved for argument spilling (Windows).

Mental model diagram

CALLER STACK (higher addresses at top; stack grows downward)
+----------------------------+
| argN (stack)               |
| ...                        |
| return address             | <-- pushed by call
| saved regs / locals        |
+----------------------------+ <-- current SP
| red zone (SysV only,       |
|   128 bytes below SP)      |
+----------------------------+

CALLER REGISTERS (SysV)
ARG1..ARG6 -> RDI, RSI, RDX, RCX, R8, R9
RETURN      -> RAX

How it works

  1. Caller places arguments in registers/stack per ABI.
  2. Caller ensures stack alignment and reserves shadow space (Windows).
  3. Call instruction pushes return address and jumps to callee.
  4. Callee saves required registers and sets up locals.
  5. Callee returns value in return register.
  6. Callee restores registers and returns to caller.

Invariants and failure modes:

  • Invariant: Stack alignment at call boundaries.
  • Failure: Misalignment breaks SIMD usage or ABI compliance.
  • Invariant: Callee-saved registers are preserved.
  • Failure: Clobbered callee-saved registers corrupt callers.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
CALLER:
  ARG_REG1 = VAL1
  ARG_REG2 = VAL2
  ALIGN_STACK_16
  CALL FUNC_X
  RESULT = RET_REG

Common misconceptions

  • “Calling conventions are optional.” They are required for interoperability.
  • “Stack alignment only matters for performance.” It can break correctness.
  • “Red zone exists everywhere.” It is SysV-only and not on Windows.

Check-your-understanding questions

  1. Why does Windows require shadow space?
  2. Why do ABIs define caller-saved vs callee-saved registers?
  3. What breaks if stack alignment is wrong?

Check-your-understanding answers

  1. It gives the callee guaranteed spill space for register arguments.
  2. It establishes a clear contract and enables efficient codegen.
  3. SIMD operations and ABI compliance can fail or crash.

Real-world applications

  • ABI debugging and crash analysis
  • Interfacing assembly with C/C++
  • Reverse engineering function boundaries

Where you will apply it Projects 5, 6, 7

References

  • System V AMD64 ABI Draft 0.99.7
  • Microsoft x64 calling convention documentation
  • “The Art of 64-Bit Assembly, Volume 1” by Randall Hyde - Ch. 4-6

Key insights ABIs are the social contract that makes assembly code usable across compilers and OSes.

Summary Control flow and the stack are simple ideas, but ABIs make them reliable and interoperable.

Homework/Exercises to practice the concept

  • Draw a stack frame for a function with 2 arguments and 3 locals.
  • Identify which registers must be preserved in SysV and Windows.

Solutions to the homework/exercises

  • Include return address, saved registers, locals, and alignment padding.
  • SysV callee-saved: RBX, RBP, RSP, R12-R15. Windows x64 nonvolatile: RBX, RBP, RDI, RSI, RSP, R12-R15, and XMM6-XMM15.

Concept 5: System Interface - Privilege, Exceptions, and Syscalls

Fundamentals

User-mode code does not directly control the hardware. It runs at a lower privilege level and must use system calls to request services. The CPU provides hardware mechanisms for privilege transitions, exceptions, and interrupts. The OS configures these mechanisms, defining how user code enters the kernel and how the kernel returns. In x86-64, the syscall instruction is a primary gateway to the kernel in 64-bit mode. Exception and interrupt handling is governed by descriptor tables and privilege checks. Understanding these boundaries is essential for debugging crashes, interpreting signals, and writing assembly that interacts with the OS. (Sources: Intel SDM, System V AMD64 ABI)

Deep Dive

Privilege levels on x86-64 are often summarized as rings, with ring 0 for the kernel and ring 3 for user code. Although the architecture supports four rings, most modern OSes use two. The transition from user to kernel is tightly controlled; user code cannot just jump into kernel space. Instead, it uses system call instructions (such as syscall) that trigger a controlled transition. The CPU saves certain state, switches to a privileged stack, and transfers control to a kernel entry point configured by the OS. The kernel then validates arguments, performs the requested service, and returns to user mode, restoring state.

Exceptions and interrupts are different but related. Exceptions are synchronous events triggered by the current instruction (for example, divide-by-zero, invalid opcode, or page fault). Interrupts are asynchronous events triggered by hardware or timers. Both use the interrupt descriptor table (IDT) to locate handlers. The CPU pushes an error code or context onto the stack and changes privilege levels if required. This is why an exception can appear to “teleport” control flow; it is a hardware-driven branch with strict rules. For assembly programmers, this means that any instruction can potentially trigger an exception, and that the OS may deliver a signal or exception to the process.

System call ABIs define which registers carry system call numbers and arguments. On Linux x86-64, the system call number goes in RAX, arguments are passed in RDI, RSI, RDX, R10, R8, and R9, and the return value comes back in RAX. The syscall instruction itself clobbers RCX and R11, which it uses to save the return RIP and RFLAGS. This is defined by the ABI and the kernel conventions. If you do not respect those rules, the kernel may interpret your arguments incorrectly, leading to crashes or undefined behavior. On Windows, the user-mode system call interface is not stable in the same way; documented system calls are wrapped by higher-level APIs. The assembly-level calling convention is still defined, but its details are more platform-specific.
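
Below is a minimal sketch of issuing a raw write(2) on Linux x86-64 with GCC/Clang inline assembly; it exists only to make the register contract visible, and in real code you would call the libc wrapper instead.

#include <stdio.h>

/* Raw syscall sketch (Linux x86-64 only): number in RAX, args in RDI/RSI/RDX;
   RCX and R11 are clobbered by the syscall instruction; RAX returns the
   result or a negative errno value.                                        */
static long raw_write(int fd, const void *buf, unsigned long len) {
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(1L),        /* __NR_write is 1 on x86-64 */
                       "D"((long)fd),  /* RDI */
                       "S"(buf),       /* RSI */
                       "d"(len)        /* RDX */
                     : "rcx", "r11", "memory");
    return ret;
}

int main(void) {
    long n = raw_write(1, "hello via syscall\n", 18);
    printf("syscall returned %ld\n", n);
    return 0;
}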

Signals and exceptions are often observed through user-space tools. For example, a segmentation fault is the result of a page fault exception that the OS translates into a signal. Understanding which instruction caused the fault and why is a core assembly skill. That analysis requires knowledge of instruction side effects, address translation, and privilege transitions.

In short, the system interface is the boundary between your assembly code and the OS. It is also the boundary between defined behavior and crashes. By mastering the rules of syscalls and exceptions, you gain the ability to reason about crashes, inspect kernel interactions, and build low-level tools such as tracers and sandboxes.

How this fits on projects

  • Projects 7 and 8 focus on syscall conventions and exception/interrupt flow.
  • Projects 9 and 10 touch privilege when analyzing loaders and relocation behavior.

Definitions & key terms

  • Privilege level: CPU execution ring (user vs kernel).
  • System call: Controlled transition to kernel services.
  • Exception: Synchronous fault triggered by an instruction.
  • Interrupt: Asynchronous event handled by the CPU/OS.
  • IDT: Interrupt Descriptor Table (maps vectors to handlers).

Mental model diagram

USER MODE (Ring 3)
   |
   | syscall / exception
   v
KERNEL MODE (Ring 0)
   |
   | return-from-syscall / iret
   v
USER MODE (Ring 3)

How it works

  1. User code issues a syscall instruction.
  2. CPU switches privilege and jumps to kernel entry.
  3. Kernel validates and executes the requested service.
  4. Kernel returns to user mode, restoring registers.

Invariants and failure modes:

  • Invariant: User code cannot jump directly into kernel space.
  • Failure: Incorrect syscall argument registers yield wrong behavior.
  • Invariant: Exceptions transfer control via IDT.
  • Failure: Misconfigured IDT or invalid instruction causes crash.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
SYSCALL_NUM = OPEN_FILE
ARG1 = PTR_PATH
ARG2 = FLAGS
ARG3 = MODE
SYSCALL
RET = RESULT_REG

Common misconceptions

  • “Syscalls are just function calls.” They are privilege transitions.
  • “Exceptions only happen on errors.” They can be used for control flow.
  • “Windows syscalls are stable.” They are not part of the public ABI.

Check-your-understanding questions

  1. What is the difference between an exception and an interrupt?
  2. Why are syscalls a controlled entry to the kernel?
  3. Which registers are clobbered by a syscall on Linux?

Check-your-understanding answers

  1. Exceptions are synchronous; interrupts are asynchronous.
  2. To enforce security and isolation between user and kernel.
  3. RCX and R11 (the CPU uses them to save the return RIP and RFLAGS); RAX receives the return value. Save them before the syscall if you still need their contents.

Real-world applications

  • Writing syscall tracers
  • Debugging crashes and segmentation faults
  • Building sandboxes and seccomp-like policies

Where you will apply it Projects 7, 8, 9

References

  • Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
  • System V AMD64 ABI Draft 0.99.7
  • “The Linux Programming Interface” by Michael Kerrisk - Ch. 3, 4

Key insights The OS boundary is the most important boundary you will ever cross in assembly.

Summary Syscalls and exceptions define how user code interacts with the kernel, and misunderstanding them leads to crashes.

Homework/Exercises to practice the concept

  • Map a high-level API call to the system call boundary it ultimately uses.
  • Draw a timeline of events for a page fault leading to a signal.

Solutions to the homework/exercises

  • Identify the system call number and arguments used by the API.
  • Show fault, kernel handler, signal dispatch, and user handler.

Concept 6: Object Files, Linking, and Relocations

Fundamentals

Object files are containers for machine code, data, symbols, and relocation information. The linker merges object files into executables or shared libraries, resolving symbols and applying relocations. x86-64 systems primarily use ELF on Linux and other UNIX-like systems, Mach-O on macOS, and PE on Windows. The System V ABI defines the general ABI and the x86-64 psABI specifies details for ELF. Understanding sections, symbols, and relocations is required for interpreting binaries, debugging linking errors, and building tooling that inspects executables. (Sources: System V gABI, System V AMD64 ABI, Linux Foundation refspecs)

Deep Dive

The object file is the bridge between assembly and execution. It holds code and data in sections, along with metadata that tells the linker how to connect references across compilation units. A symbol is a named addressable entity: a function, a global variable, or a section. Relocations are placeholders that tell the linker or loader to adjust addresses when the final layout is known. Without relocation, code would need fixed addresses and would not be portable or shareable.

In ELF, sections such as .text, .data, and .bss hold code and data. The section headers describe offsets, sizes, and flags. The symbol table maps symbol names to section offsets and attributes. Relocation entries refer to symbols and specify how to patch the code or data. The relocation types indicate whether a relocation is absolute, PC-relative, or uses a GOT/PLT indirection. The x86-64 psABI defines these relocation types and their semantics, which is essential for interpreting dynamic linking and position-independent code.
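
A minimal sketch of reading the ELF header with the system's <elf.h> (Linux); the struct and macro names come from that header, while the program itself is only illustrative.

#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64) {
        fprintf(stderr, "not a readable 64-bit ELF file\n");
        fclose(f);
        return 1;
    }
    /* e_machine is EM_X86_64 (62) for x86-64; section headers start at e_shoff. */
    printf("machine=%u type=%u sections=%u section header offset=%llu\n",
           eh.e_machine, eh.e_type, eh.e_shnum, (unsigned long long)eh.e_shoff);
    fclose(f);
    return 0;
}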

The linker performs symbol resolution: it decides which definition of a symbol to use and patches references accordingly. Static linking resolves all symbols at link time. Dynamic linking defers some resolution to runtime, using the dynamic loader and data structures such as the Global Offset Table (GOT) and Procedure Linkage Table (PLT). The GOT holds addresses of global symbols and is updated by the loader. The PLT provides a stub that jumps through the GOT, enabling lazy binding. This mechanism is central to shared libraries and is a common target for debugging and security analysis.

PE on Windows uses a different layout but similar ideas: sections, import tables, and relocation entries. The import table lists external functions that must be resolved at load time. Base relocations allow the loader to rebase the executable if it cannot be loaded at the preferred address. While the details differ, the mental model is the same: object files are templates, and the loader fills in the addresses.

Relocations also matter for reverse engineering. If you see a relocation against a symbol, you know that the code depends on that symbol even if the address is not fixed. This is how tools recover call graphs and identify external dependencies. It is also how you can locate jump tables, vtables, and other data-driven control flow constructs.

Finally, debugging symbols and unwind info live alongside code in the object file. These sections are optional but invaluable for debugging and profiling. Understanding them helps you build tools that can show file/line information, variable locations, and call stacks, even in optimized binaries.

How this fits on projects

  • Projects 9 and 10 are focused on ELF/PE parsing and relocation resolution.
  • Project 6 uses unwind metadata to visualize stack frames.

Definitions & key terms

  • Object file: Compiled code and metadata before linking.
  • Section: A region of an object file with a specific purpose.
  • Symbol: Named reference to code or data.
  • Relocation: A patch applied by the linker or loader.
  • GOT/PLT: Indirection tables for dynamic linking.

Mental model diagram

SOURCE -> OBJECT (.o) -> LINKER -> EXECUTABLE -> LOADER -> RUNNING

OBJECT:
  [ .text ] [ .data ] [ .bss ] [ .symtab ] [ .rel.* ]

LINKER:
  resolve symbols + apply relocations

LOADER:
  map segments + resolve dynamic relocations

How it works

  1. Compiler/assembler emits object file with symbols and relocations.
  2. Linker merges sections and resolves symbols.
  3. Linker applies relocations or marks them for runtime.
  4. Loader maps segments and resolves dynamic relocations.

Invariants and failure modes:

  • Invariant: Relocations reference valid symbols.
  • Failure: Missing symbols cause link errors or runtime crashes.
  • Invariant: Section permissions match content (code vs data).
  • Failure: Incorrect permissions can cause execution faults.

Minimal concrete example (pseudo-structure)

# PSEUDOCODE ONLY
SECTION .text:
  CALL SYMBOL_F
RELOCATION:
  at .text+0x10 -> SYMBOL_F (PC_REL)

Common misconceptions

  • “Linking is just concatenation.” It includes symbol resolution and relocation.
  • “GOT/PLT is only for performance.” It is required for dynamic linking.
  • “Object files are the final executable.” They are templates.

Check-your-understanding questions

  1. Why do relocation entries exist at all?
  2. What problem does the PLT solve?
  3. How does a loader differ from a linker?

Check-your-understanding answers

  1. Addresses are not final until link or load time.
  2. It enables lazy binding of external functions.
  3. The linker combines objects; the loader maps and relocates at runtime.

Real-world applications

  • Diagnosing link errors and symbol conflicts
  • Reverse engineering binary dependencies
  • Building binary inspection tools

Where you will apply it Projects 6, 9, 10

References

  • System V ABI / gABI (Linux Foundation refspecs)
  • System V AMD64 ABI Draft 0.99.7
  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 7

Key insights Relocations and symbols are the invisible glue that makes binaries runnable.

Summary Understanding object files and linking turns binaries from opaque blobs into structured systems.

Homework/Exercises to practice the concept

  • Sketch the sections of a minimal object file and label their purpose.
  • Explain how a call to an external function is resolved.

Solutions to the homework/exercises

  • Include .text, .data, .bss, .symtab, and relocation sections.
  • Linker or loader patches a placeholder using relocation info.

Concept 7: Performance, Microarchitecture, and SIMD

Fundamentals

Performance on x86-64 is shaped by microarchitecture: pipelines, caches, branch prediction, and execution ports. The ISA defines what is correct, but microarchitecture determines how fast it runs. SIMD instructions operate on vectors in XMM/YMM/ZMM registers to process multiple data elements in parallel. The ABI and compiler conventions influence whether values are kept in registers or spilled to memory, which directly impacts performance. Understanding latency vs throughput and cache behavior is essential for reasoning about assembly-level performance. (Sources: Intel SDM, Microsoft x64 architecture docs)

Deep Dive

Microarchitecture is the hidden layer between instructions and performance. Modern x86-64 CPUs decode instructions into micro-operations, schedule them across multiple execution units, and reorder them to maximize throughput while preserving architectural correctness. The result is that two instruction sequences with identical semantics can have very different performance characteristics. This is why assembly-level performance work is about dependency chains, not just counting instructions.

Pipeline stages include fetch, decode, dispatch, execute, and retire. Stalls occur when the pipeline cannot make progress due to data hazards or resource conflicts. Data hazards arise when an instruction depends on the result of a previous instruction that has not completed. This creates a dependency chain that determines latency. Throughput is how many independent operations can be completed per cycle when there are no dependencies. Good performance comes from breaking dependency chains and keeping execution units busy.
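
The sketch below contrasts a single floating-point accumulator (one long dependency chain) with four independent accumulators over the same data. The clock_gettime timer, array size, and iteration counts are arbitrary choices, and the measured gap varies by CPU and compiler; build with optimization but without -ffast-math so the compiler cannot reassociate the single-chain sum.

#include <stdio.h>
#include <time.h>

#define N    4096        /* 32 KB of doubles: fits in L1 on most x86-64 cores */
#define REPS 100000

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;

    double t0 = now_sec();
    double a = 0.0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            a += data[i];                       /* one serial add chain    */
    double t1 = now_sec();

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i += 4) {        /* four independent chains */
            s0 += data[i];     s1 += data[i + 1];
            s2 += data[i + 2]; s3 += data[i + 3];
        }
    double b = s0 + s1 + s2 + s3;
    double t2 = now_sec();

    printf("sums %.0f / %.0f  one chain: %.3fs  four chains: %.3fs\n",
           a, b, t1 - t0, t2 - t1);
    return 0;
}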

Cache behavior is often the dominant factor. The memory hierarchy includes L1, L2, and L3 caches, each with different latencies. If a load misses the cache, it can stall the pipeline for many cycles. This is why data layout and access patterns are critical. Assembly-level optimizations often focus on improving locality, aligning data, and prefetching. Misaligned accesses may be allowed but can cause extra microarchitectural work.

Branch prediction is another major factor. Conditional branches that are hard to predict cause pipeline flushes, which wastes cycles. Assembly programmers sometimes use conditional move or predicated instructions to avoid unpredictable branches. However, these trade off branch penalties for extra execution work, so the right choice depends on workload characteristics.

SIMD expands the architectural state with vector registers and instructions that operate on multiple elements at once. SSE and AVX provide 128-bit and 256-bit registers, and AVX-512 provides 512-bit. The ABI defines how these registers are used for passing floating-point and vector arguments. SIMD is powerful but introduces alignment and instruction selection constraints. For example, some vector loads require alignment, and some operations have different latency/throughput characteristics. Understanding these details helps you interpret compiler-generated vector code and write hand-optimized sequences when necessary.
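
A minimal sketch of SIMD from C using SSE2 intrinsics from <emmintrin.h> (baseline on every x86-64 compiler); the unaligned load and store variants are used deliberately so alignment is not a correctness concern in this example.

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    double a[4] = { 1.0, 2.0, 3.0, 4.0 };
    double b[4] = { 10.0, 20.0, 30.0, 40.0 };
    double c[4];

    /* Each iteration adds two doubles at once in a 128-bit XMM register. */
    for (int i = 0; i < 4; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);   /* unaligned packed load */
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_add_pd(va, vb);    /* packed double add     */
        _mm_storeu_pd(&c[i], vc);
    }
    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}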

Performance measurement requires careful methodology. Microbenchmarks must isolate the instruction sequence of interest, warm up caches, and avoid noise from frequency scaling or OS scheduling. Tools like perf or VTune can help, but you should also be able to reason from first principles: count dependencies, identify loads that may miss in cache, and understand how the CPU might reorder operations.

Ultimately, performance at the assembly level is a negotiation between the ISA contract and the microarchitecture’s capabilities. You will learn to interpret disassembly not just as a functional artifact but as a performance story: why the compiler arranged operations in a certain order, which registers are reused, and where stalls might occur.

How this fits on projects

  • Project 11 explores SIMD and vector semantics.
  • Project 12 focuses on microbenchmarks, cache behavior, and branch prediction.

Definitions & key terms

  • Latency: Time for a dependent operation to complete.
  • Throughput: Rate of operations per cycle when independent.
  • Cache miss: Access that must fetch from a slower memory level.
  • Branch prediction: CPU guess of branch direction.
  • SIMD: Single Instruction, Multiple Data processing.

Mental model diagram

INSTRUCTION STREAM
  |
  v
DECODE -> uOPS -> SCHEDULER -> EXEC UNITS -> RETIRE
                 ^      |
                 |      v
               CACHES  MEMORY

How it works

  1. Instructions decode into micro-operations.
  2. Scheduler issues independent uops to execution units.
  3. Data dependencies determine latency chains.
  4. Cache misses stall dependent operations.
  5. Branch mispredicts flush the pipeline.

Invariants and failure modes:

  • Invariant: Architectural state updates in order at retirement.
  • Failure: Mispredicted branches waste cycles and reduce throughput.
  • Invariant: SIMD registers require proper data alignment assumptions.
  • Failure: Misaligned vector operations can cause slowdowns or faults.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
VEC_LOAD  VREG0, [PTR]
VEC_ADD   VREG0, VREG1
VEC_STORE [PTR], VREG0

Common misconceptions

  • “More instructions always means slower.” Dependency chains matter more.
  • “SIMD is always faster.” It depends on data layout and alignment.
  • “Cache is just memory.” Cache behavior dominates performance.

Check-your-understanding questions

  1. Why can a small number of dependent instructions be slower than many independent ones?
  2. What causes a pipeline flush?
  3. Why does alignment matter for SIMD?

Check-your-understanding answers

  1. Dependencies create latency chains that block parallelism.
  2. Branch misprediction or exceptions cause flushes.
  3. Misalignment can force extra micro-ops or penalties.

Real-world applications

  • Performance tuning and optimization
  • SIMD-heavy workloads (multimedia, ML kernels)
  • Profiling and bottleneck analysis

Where you will apply it Projects 11, 12

References

  • Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
  • “Computer Architecture” by Hennessy and Patterson - Ch. 2, 3, 5
  • “Inside the Machine” by Jon Stokes - Ch. 4-6

Key insights Performance is an emergent property of dependencies, caches, and prediction.

Summary You optimize assembly by understanding how the CPU actually executes it, not just what it means.

Homework/Exercises to practice the concept

  • Sketch the dependency graph of a simple arithmetic chain.
  • Predict where cache misses might occur in a loop over a large array.

Solutions to the homework/exercises

  • Dependencies form a linear chain; parallelize by splitting independent ops.
  • Misses occur when the array exceeds cache size and lacks locality.

Glossary

  • ABI: Application Binary Interface; the contract between compiled code and the OS/toolchain.
  • Callee-saved register: Register the callee must preserve across calls.
  • Caller-saved register: Register the caller must save if it needs the value later.
  • ELF: Executable and Linkable Format used on many UNIX-like systems.
  • GOT/PLT: Indirection tables used for dynamic linking.
  • Long mode: 64-bit execution mode for x86-64.
  • ModR/M: Instruction encoding byte describing registers and addressing.
  • RIP-relative: Addressing relative to the instruction pointer.
  • Shadow space: Stack space reserved for argument spilling on Windows x64.
  • Syscall: Controlled entry to the kernel from user space.

Why X86-64 Assembly Matters

Modern systems are built on x86-64. If you want to understand performance, debugging, security, or binary compatibility, you need to understand x86-64 assembly.

Modern motivation and real-world use cases

  • Performance analysis of server workloads and low-level profiling
  • Reverse engineering, malware analysis, and exploit development
  • ABI debugging for mixed-language or mixed-compiler systems

Real-world statistics and impact

  • Steam Hardware Survey (December 2025): Windows 11 64-bit accounts for 70.83% and Windows 10 64-bit for 26.70% of surveyed systems, showing near-universal 64-bit Windows usage on mainstream PCs.

Context and evolution (brief)

  • x86-64 (AMD64) extended x86 with 64-bit registers and addressing while retaining compatibility.
  • Intel adopted a near-identical extension (Intel 64), making x86-64 the de facto standard for PCs and many servers.

Old vs new approach (ASCII diagram)

32-bit era                        64-bit era
+--------------------+           +------------------------------+
| 4 GB address space |           | Vast virtual address space   |
| 8 GPRs             |           | 16 GPRs                      |
| Limited ABI regs   |           | Register-based ABI           |
+--------------------+           +------------------------------+

Concept Summary Table

Concept Cluster | What You Need to Internalize
Architectural State and Modes | Registers, flags, and execution modes define the CPU state you must track to reason about behavior.
Data Representation and Addressing | Little-endian layout, effective addresses, and virtual memory define how bytes become values.
Instruction Encoding | Variable-length instruction decoding and prefixes determine meaning and boundaries.
Control Flow and Calling Conventions | Stack discipline and ABI contracts make function calls interoperable.
System Interface | Privilege transitions, syscalls, and exceptions define how user code interacts with the OS.
Object Files and Linking | Sections, symbols, and relocations turn object files into executables.
Performance and SIMD | Microarchitecture, caches, and SIMD determine real-world speed.

Project-to-Concept Map

Project | Concepts Applied
Project 1 | Architectural State, Data Representation
Project 2 | Data Representation, Addressing
Project 3 | Instruction Encoding
Project 4 | Instruction Encoding, Addressing
Project 5 | Control Flow, ABI
Project 6 | Control Flow, Object Files
Project 7 | System Interface, ABI
Project 8 | System Interface
Project 9 | Object Files, Addressing
Project 10 | Object Files, Instruction Encoding
Project 11 | Performance, SIMD
Project 12 | Performance, Microarchitecture

Deep Dive Reading by Concept

Concept | Book and Chapter | Why This Matters
Architectural State and Modes | “The Art of 64-Bit Assembly, Volume 1” - Ch. 1-3 | Introduces registers, flags, and execution model.
Data Representation and Addressing | “Computer Systems: A Programmer’s Perspective” - Ch. 3 | Explains machine-level representation and address computation.
Instruction Encoding | “Modern X86 Assembly Language Programming” - Ch. 2-4 | Covers instruction formats and decoding structure.
Control Flow and Calling Conventions | “The Art of 64-Bit Assembly, Volume 1” - Ch. 4-6 | Details stack frames, calls, and ABI usage.
System Interface | “The Linux Programming Interface” - Ch. 3-5 | Explains syscalls, errors, and OS boundary.
Object Files and Linking | “Computer Systems: A Programmer’s Perspective” - Ch. 7 | Covers linking, relocation, and ELF structure.
Performance and SIMD | “Computer Architecture” (Hennessy, Patterson) - Ch. 2, 3, 5 | Provides pipeline and cache performance model.

Quick Start: Your First 48 Hours

Day 1:

  1. Read the Theory Primer chapters on Architectural State, Addressing, and Instruction Encoding.
  2. Start Project 1 and build the register/flags simulator output.

Day 2:

  1. Validate Project 1 against the Definition of Done.
  2. Read the Core Question and Pitfalls sections for Projects 2 and 3.

Path 1: The Systems Beginner

  • Project 1 -> Project 2 -> Project 5 -> Project 7 -> Project 9

Path 2: The Reverse Engineer

  • Project 3 -> Project 4 -> Project 6 -> Project 10 -> Project 12

Path 3: The Performance Engineer

  • Project 2 -> Project 11 -> Project 12 -> Project 5

Success Metrics

  • You can explain any disassembled basic block in terms of register, memory, and flag changes.
  • You can identify the calling convention and stack alignment of a function by inspection.
  • You can explain how a symbol reference is resolved from object file to runtime address.
  • You can design a microbenchmark that isolates a single performance question.

Appendix: Tooling and Debugging Cheat Sheet

Core tools and what they reveal

  • objdump / llvm-objdump: disassembly and section info
  • readelf / llvm-readobj: ELF headers, symbols, relocations
  • nm: symbol tables
  • gdb / lldb: registers, stack frames, breakpoints

Common failure signatures

  • Segmentation fault: invalid address, protection violation, or null pointer
  • Illegal instruction: bad decode or unsupported instruction
  • Stack corruption: ABI mismatch or misaligned stack

Minimal debugging workflow

  1. Reproduce the failure with a deterministic input.
  2. Capture registers, stack pointer, and instruction pointer at fault.
  3. Inspect disassembly around the fault address.
  4. Validate ABI assumptions (argument registers, stack alignment).

Project Overview Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor
1. Register and Flags Simulator | Level 3 | Weekend | High | ★★★☆☆
2. Addressing Mode Calculator | Level 3 | Weekend | Medium | ★★★☆☆
3. Instruction Encoding Decoder | Level 4 | 2-3 weeks | High | ★★★★☆
4. RIP-Relative Disassembly Explorer | Level 4 | 2-3 weeks | High | ★★★★☆
5. Calling Convention Visualizer | Level 4 | 2 weeks | High | ★★★★☆
6. Stack Frame and Unwind Explorer | Level 4 | 2-3 weeks | High | ★★★★☆
7. Linux Syscall ABI Tracer | Level 4 | 2 weeks | High | ★★★☆☆
8. Exception and Interrupt Flow Visualizer | Level 4 | 2-3 weeks | High | ★★★☆☆
9. ELF/PE Loader Map Explorer | Level 4 | 3-4 weeks | High | ★★★★☆
10. PLT/GOT Relocation Resolver | Level 4 | 3-4 weeks | High | ★★★★☆
11. SIMD Lane Analyzer | Level 4 | 2-3 weeks | Medium | ★★★★☆
12. Microbenchmark Cache and Branch Lab | Level 4 | 3-4 weeks | High | ★★★★★

Project List

The following projects guide you from architectural fundamentals to real-world binary analysis and performance engineering.

Project 1: Register and Flags Simulator

  • File: P01-register-and-flags-simulator.md
  • Main Programming Language: Python or C (text-based simulator)
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3
  • Business Potential: 1
  • Difficulty: Level 3
  • Knowledge Area: CPU architecture
  • Software or Tool: None (custom simulator)
  • Main Book: “The Art of 64-Bit Assembly, Volume 1”

What you will build: A CLI simulator that executes a tiny pseudo-instruction set and prints register/flag transitions.

Why it teaches x86-64: You build the mental model of architectural state changes without needing real code.

Core challenges you will face:

  • Register modeling -> Architectural state
  • Flag updates -> Condition codes
  • State tracing -> Debugging mental model

Real World Outcome

Your simulator produces a trace that looks like a real CPU state dump but uses a simplified pseudo-ISA.

$ x64sim --program demo.trace

STEP 0 REG_A=0x0000000000000005 REG_B=0x0000000000000007 FLAGS: Z=0 N=0 C=0 O=0

STEP 1 OP=ADD64 REG_A, REG_B REG_A=0x000000000000000C REG_B=0x0000000000000007 FLAGS: Z=0 N=0 C=0 O=0

STEP 2 OP=CMP64 REG_A, 0x000000000000000C FLAGS: Z=1 N=0 C=0 O=0

The Core Question You Are Answering

“How does a single instruction change the CPU state, and how do flags encode that change?”

This teaches you to think in terms of state transitions, which is the basis of reading disassembly.

Concepts You Must Understand First

  1. Architectural State
    • What registers and flags exist?
    • Book Reference: “The Art of 64-Bit Assembly, Volume 1” - Ch. 1-3
  2. Data Representation
    • How do integers appear in registers?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3

Questions to Guide Your Design

  1. State Modeling
    • How will you represent registers and flags in memory?
    • How will you update flags deterministically?
  2. Trace Format
    • What trace format will you accept as input?
    • How will you render each step for human readability?

Thinking Exercise

Trace the Flags

Draw a table of REG_A, REG_B, and flags after each pseudo-instruction in a three-step program. Explain why the zero flag changes when it does.

Questions to answer:

  • Which operations update flags in your simulator?
  • What does a flag mean when no arithmetic occurred?

The Interview Questions They Will Ask

  1. “How do flags influence conditional branches?”
  2. “Why do some instructions update flags and others do not?”
  3. “What is the difference between signed and unsigned comparisons at the flag level?”
  4. “How would you simulate overflow in 64-bit arithmetic?”
  5. “Why are partial register updates tricky?”

Hints in Layers

Hint 1: Starting Point Design a small struct that holds registers and flags, and a function that applies one instruction.

Hint 2: Next Level Implement arithmetic as pure functions that return both a value and a flag set.

Hint 3: Technical Details Use a table-driven dispatch: opcode -> handler. Keep handlers small and deterministic.

Hint 4: Tools/Debugging Create a golden trace file and compare output with a diff tool after each change.
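
If you want a concrete starting shape, here is a minimal Python sketch of the table-driven design from Hints 1-3. The type, register, and flag names are invented for this guide, and only a single ADD64 handler is shown.

# Sketch: table-driven pseudo-ISA step function (names are illustrative)
from dataclasses import dataclass, field

MASK64 = (1 << 64) - 1

@dataclass
class CpuState:
    regs: dict = field(default_factory=lambda: {"REG_A": 0, "REG_B": 0})
    flags: dict = field(default_factory=lambda: {"Z": 0, "N": 0, "C": 0})

def set_flags(state, result, carry):
    state.flags["Z"] = int(result == 0)
    state.flags["N"] = int(result >> 63)        # top bit = sign in two's complement
    state.flags["C"] = int(carry)

def op_add64(state, dst, src):
    raw = state.regs[dst] + state.regs[src]
    result = raw & MASK64                       # wrap to 64 bits
    state.regs[dst] = result
    set_flags(state, result, raw > MASK64)

HANDLERS = {"ADD64": op_add64}                  # opcode -> handler dispatch table

def step(state, opcode, *operands):
    HANDLERS[opcode](state, *operands)
    return state

s = CpuState()
s.regs["REG_A"], s.regs["REG_B"] = 5, 7
step(s, "ADD64", "REG_A", "REG_B")
print(s.regs["REG_A"], s.flags)                 # 12 {'Z': 0, 'N': 0, 'C': 0}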

Books That Will Help

Topic Book Chapter
Register model “The Art of 64-Bit Assembly, Volume 1” Ch. 1-3
Data representation “Computer Systems: A Programmer’s Perspective” Ch. 3

Common Pitfalls and Debugging

Problem 1: “Flags look wrong after subtraction”

  • Why: Signed vs unsigned overflow rules were mixed.
  • Fix: Implement carry and overflow separately.
  • Quick test: Run a trace with a known overflow and compare flags.

Definition of Done

  • Simulator runs a multi-step program and prints state per step
  • Flag updates match documented rules in your design spec
  • Trace output is deterministic and diffable
  • At least 10 test traces cover edge cases

Project 2: Addressing Mode Calculator

  • File: P02-addressing-mode-calculator.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 3
  • Business Potential: 1
  • Difficulty: Level 3
  • Knowledge Area: Memory addressing
  • Software or Tool: None (custom calculator)
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you will build: A CLI tool that computes effective addresses from base, index, scale, and displacement.

Why it teaches x86-64: Address calculation is the heart of memory operations and compiler output.

Core challenges you will face:

  • Addressing formulas -> Effective address calculation
  • Little-endian view -> Data representation
  • Alignment checks -> ABI constraints

Real World Outcome

$ x64addr --base 0x1000 --index 0x20 --scale 8 --disp 0x18

EFFECTIVE_ADDRESS = 0x0000000000001118
ALIGNMENT: 8-byte aligned
RIP_RELATIVE: false

The Core Question You Are Answering

“How does the CPU compute the address used by a memory instruction?”

This builds intuition for reading disassembly and understanding compiler output.

Concepts You Must Understand First

  1. Addressing Modes
    • What is base + index * scale + displacement?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
  2. Alignment
    • Why does alignment matter for performance?
    • Book Reference: “Write Great Code, Volume 1” - Ch. 6

Questions to Guide Your Design

  1. Input Model
    • How will users specify base/index/scale/disp?
    • How will you handle missing components?
  2. Validation
    • How will you detect invalid scales?
    • How will you report alignment conditions?

Thinking Exercise

Address Walkthrough

Given a base of 0x1000 and an index of 0x3, compute addresses for scale 1,2,4,8 and a displacement of 0x20. Explain which are aligned for 8-byte data.

Questions to answer:

  • Which combination yields the smallest effective address?
  • How does alignment change when scale changes?

The Interview Questions They Will Ask

  1. “What addressing modes does x86-64 support?”
  2. “Why does RIP-relative addressing exist?”
  3. “How does alignment affect performance?”
  4. “What happens if you use a misaligned address?”
  5. “How would you compute an address for a struct field?”

Hints in Layers

Hint 1: Starting Point Represent the address components as integers and compute a single formula.

Hint 2: Next Level Add validation for scale values and optional fields.

Hint 3: Technical Details Treat missing base or index as zero. Keep output in canonical 64-bit hex.

Hint 4: Tools/Debugging Cross-check a few results by hand and confirm the tool matches.
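
A minimal sketch of the core calculation, assuming missing components default to zero as in Hint 3. The function names are invented for this guide.

# Sketch: effective address and alignment check
def effective_address(base=0, index=0, scale=1, disp=0):
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & ((1 << 64) - 1)   # canonical 64-bit wrap

def is_aligned(addr, size=8):
    return addr % size == 0

ea = effective_address(base=0x1000, index=0x20, scale=8, disp=0x18)
print(hex(ea), "aligned(8):", is_aligned(ea, 8))             # 0x1118 aligned(8): True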

Books That Will Help

Topic Book Chapter
Addressing modes “Computer Systems: A Programmer’s Perspective” Ch. 3
Data layout “Write Great Code, Volume 1” Ch. 6

Common Pitfalls and Debugging

Problem 1: “Addresses are wrong when index is missing”

  • Why: Missing index treated as garbage instead of zero.
  • Fix: Default missing fields to zero.
  • Quick test: Base-only input should equal base + disp.

Definition of Done

  • Computes effective addresses for all valid scales
  • Validates and rejects invalid scale values
  • Reports alignment for common sizes
  • Has at least 20 test cases

Project 3: Instruction Encoding Decoder

  • File: P03-instruction-encoding-decoder.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: Instruction encoding
  • Software or Tool: None (custom decoder)
  • Main Book: “Modern X86 Assembly Language Programming”

What you will build: A decoder that parses byte sequences into prefix/opcode/ModR/M/SIB fields and prints a structured view.

Why it teaches x86-64: Decoding is the foundation of disassemblers and binary analyzers.

Core challenges you will face:

  • Prefix parsing -> Instruction semantics
  • ModR/M decoding -> Addressing modes
  • Boundary detection -> Control flow correctness

Real World Outcome

$ x64decode --bytes "48 8B 05 11 22 33 44"

INSTRUCTION:
  PREFIX: REX.W
  OPCODE: 0x8B
  MODRM: mod=00 reg=000 rm=101
  DISP: 0x44332211
  LENGTH: 7 bytes

The Core Question You Are Answering

“How do raw bytes become a structured instruction?”

Concepts You Must Understand First

  1. Instruction Encoding
    • What are the fields of an x86-64 instruction?
    • Book Reference: “Modern X86 Assembly Language Programming” - Ch. 2-4
  2. Addressing Modes
    • How do ModR/M and SIB select addressing forms?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3

Questions to Guide Your Design

  1. Decoding Order
    • How will you detect prefixes vs opcodes?
    • How will you decide if ModR/M is required?
  2. Extensibility
    • How will you add new opcode classes later?
    • How will you detect invalid encodings?

Thinking Exercise

Decode by Hand

Take a short byte sequence and annotate which bytes are prefix, opcode, ModR/M, and displacement. Explain how you determined the instruction length.

Questions to answer:

  • Which bytes are optional?
  • What happens if you start decoding at the wrong offset?

The Interview Questions They Will Ask

  1. “Why is x86-64 decoding harder than fixed-width ISAs?”
  2. “What does the ModR/M byte do?”
  3. “How do prefixes change instruction meaning?”
  4. “How would you validate a decoder?”
  5. “What does it mean when a decoder loses sync?”

Hints in Layers

Hint 1: Starting Point Build a small parser that consumes bytes in order and records fields.

Hint 2: Next Level Use a minimal opcode table for a subset of instructions.

Hint 3: Technical Details Represent the instruction as a struct with optional fields. Emit a JSON view.

Hint 4: Tools/Debugging Compare your output with a known disassembler for the same bytes.
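
As a hedged sketch of the byte-consuming order, the Python snippet below decodes only the single pattern shown in the Real World Outcome above (a REX prefix, opcode 0x8B, and a ModR/M byte selecting RIP-relative addressing with a disp32). A real decoder needs a full opcode table; the function name here is invented for this guide.

# Sketch: decode REX prefix, opcode, ModR/M, and disp32 for one MOV pattern
def decode(byte_string):
    data = bytes.fromhex(byte_string.replace(" ", ""))
    i = 0
    rex = None
    if 0x40 <= data[i] <= 0x4F:          # single REX prefix, if present
        rex = data[i]
        i += 1
    opcode = data[i]; i += 1
    modrm = data[i]; i += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    disp = None
    if mod == 0b00 and rm == 0b101:      # RIP-relative form: 32-bit displacement follows
        disp = int.from_bytes(data[i:i + 4], "little")
        i += 4
    return {"rex_w": bool(rex and rex & 0x08), "opcode": hex(opcode),
            "mod": mod, "reg": reg, "rm": rm,
            "disp": hex(disp) if disp is not None else None, "length": i}

print(decode("48 8B 05 11 22 33 44"))
# {'rex_w': True, 'opcode': '0x8b', 'mod': 0, 'reg': 0, 'rm': 5, 'disp': '0x44332211', 'length': 7}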

Books That Will Help

Topic Book Chapter
Instruction format “Modern X86 Assembly Language Programming” Ch. 2-4
Low-level encoding “Write Great Code, Volume 2” Ch. 3-5

Common Pitfalls and Debugging

Problem 1: “Length is wrong for some instructions”

  • Why: Prefix handling or ModR/M presence is incorrect.
  • Fix: Implement a clear state machine for decoding order.
  • Quick test: Use byte sequences with and without ModR/M.

Definition of Done

  • Parses prefixes, opcode, ModR/M, SIB, displacement, immediate
  • Emits structured output with instruction length
  • Handles at least 10 instruction patterns
  • Validated against a known disassembler on 50 samples

Project 4: RIP-Relative Disassembly Explorer

  • File: P04-rip-relative-disassembly-explorer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: Addressing, decoding
  • Software or Tool: objdump/llvm-objdump
  • Main Book: “Modern X86 Assembly Language Programming”

What you will build: A tool that identifies RIP-relative references in a binary and resolves them to section/label names.

Why it teaches x86-64: RIP-relative addressing is a defining feature of 64-bit code and PIC.

Core challenges you will face:

  • Instruction decoding -> Prefix and ModR/M parsing
  • Effective address computation -> RIP-relative resolution
  • Binary format awareness -> Section boundaries

Real World Outcome

$ x64ripscan demo.bin

RIP-RELATIVE REFERENCES
  0x0000000000401050 -> .rodata+0x2A ("CONST_STRING")
  0x0000000000401074 -> .data+0x10 (GLOBAL_COUNTER)

SUMMARY
  Total RIP-relative refs: 12

The Core Question You Are Answering

“How does position-independent code find its data without absolute addresses?”

Concepts You Must Understand First

  1. RIP-relative addressing
    • How is effective address computed from RIP?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
  2. Instruction encoding
    • How do you detect RIP-relative forms in ModR/M?
    • Book Reference: “Modern X86 Assembly Language Programming” - Ch. 2-4

Questions to Guide Your Design

  1. Binary scanning
    • Which sections should you scan for code?
    • How will you avoid decoding data as code?
  2. Symbol resolution
    • How will you map addresses to section names?
    • What if symbols are stripped?

Thinking Exercise

RIP Math

Given a RIP at 0x400100 and a displacement of 0x20, compute the target address. Explain why the displacement is relative to the next instruction.

Questions to answer:

  • Why is RIP-relative addressing stable under relocation?
  • How does it enable shared libraries?

The Interview Questions They Will Ask

  1. “Why is RIP-relative addressing common in 64-bit binaries?”
  2. “How do you compute the target address?”
  3. “How can disassemblers get confused around RIP-relative data?”
  4. “What happens when symbols are stripped?”
  5. “How does PIC relate to ASLR?”

Hints in Layers

Hint 1: Starting Point Use your decoder from Project 3 to find ModR/M patterns that imply RIP-relative.

Hint 2: Next Level Compute target address as RIP_of_next_instruction + displacement.

Hint 3: Technical Details Use ELF section headers to map targets to section names.

Hint 4: Tools/Debugging Compare a few results with objdump output for validation.
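
The core arithmetic from Hint 2 is small enough to sketch directly; the values below are hypothetical and the helper name is invented for this guide.

# Sketch: RIP-relative target computation (hypothetical values)
def rip_target(instr_addr, instr_len, disp32):
    # The displacement is measured from the address of the NEXT instruction
    # (and is signed in real encodings).
    return instr_addr + instr_len + disp32

# A 7-byte instruction at 0x400100 with disp32 = 0x20 targets 0x400127.
print(hex(rip_target(0x400100, 7, 0x20)))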

Books That Will Help

Topic Book Chapter
PIC and addressing “Computer Systems: A Programmer’s Perspective” Ch. 3
Disassembly “Modern X86 Assembly Language Programming” Ch. 5

Common Pitfalls and Debugging

Problem 1: “Targets are off by instruction length”

  • Why: Using current RIP instead of next instruction RIP.
  • Fix: Add the decoded instruction length to RIP.
  • Quick test: Verify on a short, known sequence.

Definition of Done

  • Identifies RIP-relative references in .text
  • Resolves targets to sections or symbols
  • Correctly computes target addresses
  • Produces a summary report

Project 5: Calling Convention Visualizer

  • File: P05-calling-convention-visualizer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: ABI, control flow
  • Software or Tool: None (custom visualizer)
  • Main Book: “The Art of 64-Bit Assembly, Volume 1”

What you will build: A tool that visualizes argument passing and stack layout for SysV and Windows x64 calls.

Why it teaches x86-64: The ABI is the glue between functions and binaries.

Core challenges you will face:

  • ABI mapping -> Register assignment
  • Stack alignment -> Call boundary rules
  • Cross-ABI comparison -> SysV vs Windows

Real World Outcome

$ x64abi --sysv --args 8 --locals 3

CALL LAYOUT (SysV)
  ARG_REGS: A1->REG_A A2->REG_B A3->REG_C A4->REG_D A5->REG_E A6->REG_F
  STACK_ARGS: ARG7, ARG8
  STACK ALIGNMENT: 16-byte at call
  RED ZONE: enabled

$ x64abi --win64 --args 8 --locals 3

CALL LAYOUT (Win64)
  ARG_REGS: A1->REG_C A2->REG_D A3->REG_E A4->REG_F
  STACK_ARGS: ARG5..ARG8
  SHADOW SPACE: 32 bytes reserved
  RED ZONE: disabled

The Core Question You Are Answering

“How do different ABIs move arguments and manage the stack, and why does it matter?”

Concepts You Must Understand First

  1. Calling conventions
    • Which registers are used for arguments?
    • Book Reference: “The Art of 64-Bit Assembly, Volume 1” - Ch. 4-6
  2. Stack alignment
    • Why does the stack need 16-byte alignment?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3

Questions to Guide Your Design

  1. Visualization
    • How will you represent the stack frame in output?
    • How will you show preserved vs volatile registers?
  2. Rules engine
    • How will you encode ABI rules for each platform?
    • How will you handle variadic or mixed argument types?

Thinking Exercise

ABI Walkthrough

Sketch the register and stack layout for a function with 8 integer arguments on both SysV and Windows.

Questions to answer:

  • Which arguments spill to the stack?
  • Where is alignment padding needed?

The Interview Questions They Will Ask

  1. “How do SysV and Windows x64 calling conventions differ?”
  2. “What is the red zone and why does it exist?”
  3. “Why does Windows require shadow space?”
  4. “Which registers are callee-saved?”
  5. “How do variadic arguments affect the ABI?”

Hints in Layers

Hint 1: Starting Point Create a static table for each ABI: arg registers, caller-saved, callee-saved.

Hint 2: Next Level Compute the stack layout with alignment rules and show a diagram.

Hint 3: Technical Details Model the stack as a list of slots with offsets from SP.

Hint 4: Tools/Debugging Cross-check with ABI documentation examples and known compiler output.
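
A minimal sketch of the static rule tables from Hint 1, limited to integer arguments. The register names are the real SysV and Win64 ones (rather than the pseudo names in the sample output); the table and function names are invented for this guide.

# Sketch: ABI rule tables for integer arguments
SYSV_INT_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
WIN64_INT_ARGS = ["rcx", "rdx", "r8", "r9"]

def map_int_args(n_args, abi="sysv"):
    regs = SYSV_INT_ARGS if abi == "sysv" else WIN64_INT_ARGS
    in_regs = {f"arg{i + 1}": regs[i] for i in range(min(n_args, len(regs)))}
    on_stack = [f"arg{i + 1}" for i in range(len(regs), n_args)]
    return in_regs, on_stack

print(map_int_args(8, "sysv"))    # args 1-6 in registers, args 7-8 on the stack
print(map_int_args(8, "win64"))   # args 1-4 in registers, args 5-8 on the stack
                                  # (Win64 also reserves 32 bytes of shadow space)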

Books That Will Help

Topic Book Chapter
ABI fundamentals “The Art of 64-Bit Assembly, Volume 1” Ch. 4-6
Stack layout “Computer Systems: A Programmer’s Perspective” Ch. 3

Common Pitfalls and Debugging

Problem 1: “Arguments are mapped to wrong registers”

  • Why: Mixing SysV and Windows rules.
  • Fix: Keep separate rule tables and tests for each ABI.
  • Quick test: Use a known 4-arg example and verify output.

Definition of Done

  • Visualizes SysV and Windows argument registers
  • Computes stack argument slots and alignment
  • Shows red zone and shadow space differences
  • Produces consistent output for multiple cases

Project 6: Stack Frame and Unwind Explorer

  • File: P06-stack-frame-unwind-explorer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: ABI, debugging
  • Software or Tool: readelf, dwarfdump (or llvm-dwarfdump)
  • Main Book: “The Art of Debugging with GDB”

What you will build: A tool that reads unwind metadata and reconstructs a call stack view.

Why it teaches x86-64: Unwind info exposes the real stack discipline used by compilers.

Core challenges you will face:

  • Debug metadata parsing -> Object file layout
  • Stack frame reconstruction -> ABI compliance
  • Edge cases -> Omitted frame pointers

Real World Outcome

$ x64unwind demo.elf

FRAME 0: func_current SP=0x7fffffffe0c0 BP=0x7fffffffe0e0 SAVED: REG_B, REG_C

FRAME 1: func_parent SP=0x7fffffffe110 BP=0x7fffffffe130 SAVED: REG_B

FRAME 2: main SP=0x7fffffffe160 BP=0x7fffffffe180

The Core Question You Are Answering

“How can a debugger reconstruct a call stack without executing the code?”

Concepts You Must Understand First

  1. Stack frames
    • How do frames store return addresses and saved registers?
    • Book Reference: “The Art of 64-Bit Assembly, Volume 1” - Ch. 4-6
  2. Object file metadata
    • Where is unwind info stored in ELF?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 7

Questions to Guide Your Design

  1. Metadata parsing
    • Which sections hold unwind data?
    • How will you interpret rules for stack pointer recovery?
  2. Visualization
    • How will you present each frame to be human-readable?
    • How will you handle missing frame pointers?

Thinking Exercise

Frame Reconstruction

Given a stack pointer and a saved base pointer, draw the frame layout and indicate where the return address lives.

Questions to answer:

  • How would a debugger find the previous frame?
  • Why do compilers sometimes omit frame pointers?

The Interview Questions They Will Ask

  1. “What is unwind metadata used for?”
  2. “How does a debugger walk a stack without frame pointers?”
  3. “Why would a compiler omit a frame pointer?”
  4. “What is the difference between .eh_frame and .debug_frame?”
  5. “How does exception unwinding differ from stack tracing?”

Hints in Layers

Hint 1: Starting Point Use readelf or llvm-dwarfdump to locate unwind sections and inspect them.

Hint 2: Next Level Build a parser that maps unwind rules to register restore operations.

Hint 3: Technical Details Simulate the unwind process using a synthetic register snapshot.

Hint 4: Tools/Debugging Compare your reconstructed frames with gdb backtraces.
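
Before parsing real unwind metadata, it can help to prototype the simpler frame-pointer walk it generalizes. This is a sketch only: read_u64, the addresses, and the fake memory image are all invented, and real unwinders follow .eh_frame rules rather than assuming a saved-RBP chain.

# Sketch: walking a frame-pointer chain (assumes a standard push rbp; mov rbp, rsp prologue)
def walk_frames(rbp, read_u64, max_frames=16):
    frames = []
    while rbp and len(frames) < max_frames:
        return_addr = read_u64(rbp + 8)   # [rbp+8] holds the return address
        saved_rbp = read_u64(rbp)         # [rbp] holds the caller's saved RBP
        frames.append({"rbp": hex(rbp), "return_addr": hex(return_addr)})
        rbp = saved_rbp
    return frames

# Tiny fake memory image (addresses and values are invented):
mem = {0x7000: 0x7100, 0x7008: 0x401234, 0x7100: 0x0, 0x7108: 0x401000}
print(walk_frames(0x7000, lambda addr: mem[addr]))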

Books That Will Help

Topic Book Chapter
Debugging “The Art of Debugging with GDB” Ch. 5-7
Linking and symbols “Computer Systems: A Programmer’s Perspective” Ch. 7

Common Pitfalls and Debugging

Problem 1: “Stack frames look inconsistent”

  • Why: Unwind rules were misinterpreted or missing.
  • Fix: Cross-check with gdb and confirm register restore rules.
  • Quick test: Use a binary compiled with frame pointers enabled.

Definition of Done

  • Parses unwind metadata and reconstructs frames
  • Matches gdb backtrace for a test binary
  • Handles functions without frame pointers
  • Outputs a clean frame summary

Project 7: Linux Syscall ABI Tracer

  • File: P07-linux-syscall-abi-tracer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: System interface
  • Software or Tool: ptrace, strace (reference only)
  • Main Book: “The Linux Programming Interface”

What you will build: A tracer that logs syscall numbers and argument registers at runtime (conceptually similar to strace but narrower).

Why it teaches x86-64: Syscall ABIs expose the exact boundary between user code and kernel services.

Core challenges you will face:

  • Syscall register mapping -> ABI specifics
  • Process tracing -> OS interaction
  • Argument decoding -> Data representation

Real World Outcome

$ x64syscall-trace ./demo_app

SYSCALL 0x01 ARG1=0x0000000000000001 ARG2=0x0000000000402000 ARG3=0x0000000000000012 RESULT=0x0000000000000012

SYSCALL 0x02 ARG1=0x0000000000403000 ARG2=0x0000000000000000 ARG3=0x0000000000000000 RESULT=0x0000000000000003

The Core Question You Are Answering

“What exactly crosses the boundary when user code asks the kernel for help?”

Concepts You Must Understand First

  1. Syscall ABI
    • Which registers carry syscall number and arguments?
    • Book Reference: “The Linux Programming Interface” - Ch. 3-4
  2. Privilege transitions
    • Why do syscalls change CPU mode?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Ch. 10

Questions to Guide Your Design

  1. Tracing model
    • Will you trace at entry, exit, or both?
    • How will you map syscall numbers to names?
  2. Safety
    • How will you avoid corrupting the target process state?
    • How will you handle errors and signals?

Thinking Exercise

Syscall Boundary

Sketch the register state at syscall entry and exit. Explain which registers are preserved and which are clobbered.

Questions to answer:

  • Which registers are safe to use after syscall?
  • How would you detect an error return?

The Interview Questions They Will Ask

  1. “Which registers carry Linux syscall arguments on x86-64?”
  2. “What does syscall clobber?”
  3. “How does a syscall differ from a function call?”
  4. “How would you trace syscalls without ptrace?”
  5. “Why is syscall ABI different from the C ABI?”

Hints in Layers

Hint 1: Starting Point Use ptrace or a similar mechanism to inspect registers at syscall entry/exit.

Hint 2: Next Level Maintain a map of syscall numbers to names for readability.

Hint 3: Technical Details Log only the argument registers defined by the ABI and ignore the rest.

Hint 4: Tools/Debugging Compare your output with a standard strace run.
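
A small sketch of the pieces that are safe to state up front: the Linux x86-64 syscall argument register order (note r10 in place of rcx) and a name table for a few syscall numbers. The tracer plumbing (ptrace attach and stepping) is deliberately left out; the formatter and table names are invented for this guide.

# Sketch: Linux x86-64 syscall register order and a tiny log formatter
SYSCALL_ARG_REGS = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]   # args; the number goes in rax
SYSCALL_NAMES = {0: "read", 1: "write", 2: "open", 60: "exit"}

def format_entry(nr, regs):
    name = SYSCALL_NAMES.get(nr, f"syscall_{nr}")
    args = " ".join(f"{r}=0x{regs.get(r, 0):x}" for r in SYSCALL_ARG_REGS[:3])
    return f"{name}({nr}) {args}"

print(format_entry(1, {"rdi": 1, "rsi": 0x402000, "rdx": 0x12}))
# write(1) rdi=0x1 rsi=0x402000 rdx=0x12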

Books That Will Help

Topic Book Chapter
Syscalls “The Linux Programming Interface” Ch. 3-4
OS boundary “Operating Systems: Three Easy Pieces” Ch. 10

Common Pitfalls and Debugging

Problem 1: “Arguments are garbage”

  • Why: Using the wrong register mapping for syscall ABI.
  • Fix: Follow the ABI register order strictly.
  • Quick test: Trace a known syscall like write and compare output.

Definition of Done

  • Logs syscall numbers and argument registers correctly
  • Distinguishes entry and exit values
  • Produces stable output across runs
  • Matches a reference trace for a test program

Project 8: Exception and Interrupt Flow Visualizer

  • File: P08-exception-interrupt-flow-visualizer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: CPU/OS interaction
  • Software or Tool: None (simulation)
  • Main Book: “Operating Systems: Three Easy Pieces”

What you will build: A visualizer that simulates exception and interrupt delivery paths.

Why it teaches x86-64: Exceptions and interrupts explain most crashes and OS behavior.

Core challenges you will face:

  • Event modeling -> Synchronous vs asynchronous
  • Privilege transition -> Ring changes
  • Stack switching -> Kernel stack usage

Real World Outcome

$ x64trapviz --event page_fault

EVENT: PAGE_FAULT
  MODE: user -> kernel
  STACK SWITCH: yes
  PUSHED: SS, RSP, RFLAGS, CS, RIP, ERROR_CODE
  HANDLER: vector 14

The Core Question You Are Answering

“What actually happens when an exception or interrupt occurs?”

Concepts You Must Understand First

  1. Exceptions and interrupts
    • How do synchronous and asynchronous events differ?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Ch. 7, 10
  2. Privilege transitions
    • How does CPU mode change on a fault?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 8

Questions to Guide Your Design

  1. Event model
    • Which events will you simulate (page fault, divide error, timer)?
    • What state must be saved on entry?
  2. Output design
    • How will you show stack transitions and handler vectors?
    • How will you show return to user mode?

Thinking Exercise

Fault Timeline

Draw the sequence of events from a page fault in user code to a signal delivered to the process.

Questions to answer:

  • What state must the CPU save to resume later?
  • How does the OS decide which signal to deliver?

The Interview Questions They Will Ask

  1. “What is the difference between an exception and an interrupt?”
  2. “What happens on a page fault?”
  3. “Why does the CPU switch stacks on an exception?”
  4. “How does an OS return from an interrupt?”
  5. “How do signals relate to exceptions?”

Hints in Layers

Hint 1: Starting Point Create a simple state machine with states: user, kernel, handler.

Hint 2: Next Level Model what the CPU pushes on the stack for each event.

Hint 3: Technical Details Include the vector number and error code in your model.

Hint 4: Tools/Debugging Validate against OS manuals and debugger traces for faults.
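
A toy event table makes the state machine from Hints 1-3 concrete. This is a simulation sketch only: the vectors for page fault and divide error are the standard ones, the timer vector depends on how the OS programs the interrupt controller, and the model ignores IST stacks and nested interrupts.

# Sketch: toy exception/interrupt delivery model (a simplification, not real IDT logic)
EVENTS = {
    "page_fault":   {"vector": 14, "synchronous": True,  "error_code": True},
    "divide_error": {"vector": 0,  "synchronous": True,  "error_code": False},
    "timer":        {"vector": 32, "synchronous": False, "error_code": False},  # typical remapped IRQ0
}

def deliver(event, from_user=True):
    e = EVENTS[event]
    pushed = ["SS", "RSP", "RFLAGS", "CS", "RIP"]   # 64-bit mode pushes the full frame
    if e["error_code"]:
        pushed.append("ERROR_CODE")
    return {
        "event": event.upper(),
        "vector": e["vector"],
        "mode": "user -> kernel" if from_user else "kernel -> kernel",
        "stack_switch": from_user,                  # switch to the kernel stack on privilege change
        "trigger": "synchronous" if e["synchronous"] else "asynchronous",
        "pushed": pushed,
    }

print(deliver("page_fault"))
print(deliver("timer"))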

Books That Will Help

Topic Book Chapter
Exceptions “Operating Systems: Three Easy Pieces” Ch. 7, 10
Control flow “Computer Systems: A Programmer’s Perspective” Ch. 8

Common Pitfalls and Debugging

Problem 1: “Interrupts look identical to exceptions”

  • Why: Asynchronous vs synchronous triggers were not modeled.
  • Fix: Include a trigger type and show timing differences.
  • Quick test: Simulate a timer interrupt vs divide error.

Definition of Done

  • Models exceptions and interrupts distinctly
  • Shows privilege and stack transitions
  • Includes handler vector details
  • Produces clear, deterministic output

Project 9: ELF/PE Loader Map Explorer

  • File: P09-elf-pe-loader-map-explorer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 2
  • Difficulty: Level 4
  • Knowledge Area: Object files, loaders
  • Software or Tool: readelf, objdump
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you will build: A tool that maps ELF or PE headers to a virtual memory layout diagram.

Why it teaches x86-64: Executables are not magic; the loader maps segments into memory with permissions and alignment.

Core challenges you will face:

  • Header parsing -> File format understanding
  • Segment mapping -> Virtual address reasoning
  • Permission modeling -> Security and correctness

Real World Outcome

$ x64map demo.elf

SEGMENT MAP
  .text   VA=0x400000 SIZE=0x1000 PERM=R-X
  .rodata VA=0x401000 SIZE=0x0800 PERM=R--
  .data   VA=0x402000 SIZE=0x0400 PERM=RW-
  .bss    VA=0x403000 SIZE=0x0200 PERM=RW-

The Core Question You Are Answering

“How does the loader transform an on-disk file into a running memory image?”

Concepts You Must Understand First

  1. ELF/PE headers
    • What do program headers and section headers represent?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 7
  2. Virtual memory
    • How are segments mapped into memory?
    • Book Reference: “Operating Systems: Three Easy Pieces” - Ch. 13

Questions to Guide Your Design

  1. Format choice
    • Will you support ELF only or both ELF and PE?
    • How will you detect file type?
  2. Visualization
    • How will you render the memory map?
    • How will you show permissions and alignment gaps?

Thinking Exercise

Segment Layout

Draw a memory map with .text, .rodata, .data, and .bss. Show which are executable and which are writable.

Questions to answer:

  • Why should code be non-writable?
  • Why is .bss not stored on disk?

The Interview Questions They Will Ask

  1. “What is the difference between ELF sections and segments?”
  2. “Why does the loader care about program headers?”
  3. “What is the purpose of .bss?”
  4. “How do permissions affect security?”
  5. “Why can executables be relocated in memory?”

Hints in Layers

Hint 1: Starting Point Parse only the minimal header fields you need for mapping.

Hint 2: Next Level Build a table of segments with virtual address, size, and permissions.

Hint 3: Technical Details Align segments to page boundaries in your visualization.

Hint 4: Tools/Debugging Compare with readelf -l output for a known binary.
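
A sketch of the minimal parsing from Hint 1, assuming a 64-bit little-endian ELF and using only field offsets defined by the ELF64 specification; no validation or error handling is shown, and the file name in the comment is the hypothetical demo binary.

# Sketch: list PT_LOAD segments from a 64-bit little-endian ELF with the struct module
import struct

PT_LOAD = 1
PF_FLAGS = [(4, "R"), (2, "W"), (1, "X")]   # p_flags permission bits

def load_segments(path):
    with open(path, "rb") as f:
        data = f.read()
    e_phoff = struct.unpack_from("<Q", data, 0x20)[0]       # program header table offset
    e_phentsize = struct.unpack_from("<H", data, 0x36)[0]   # entry size (56 for ELF64)
    e_phnum = struct.unpack_from("<H", data, 0x38)[0]       # number of entries
    segments = []
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, p_flags, p_offset, p_vaddr, _, p_filesz, p_memsz, p_align = \
            struct.unpack_from("<IIQQQQQQ", data, off)
        if p_type == PT_LOAD:
            perm = "".join(ch if p_flags & bit else "-" for bit, ch in PF_FLAGS)
            segments.append((hex(p_vaddr), hex(p_memsz), perm))
    return segments

# for vaddr, size, perm in load_segments("demo.elf"): print(vaddr, size, perm)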

Books That Will Help

Topic Book Chapter
Linking and loading “Computer Systems: A Programmer’s Perspective” Ch. 7
Virtual memory “Operating Systems: Three Easy Pieces” Ch. 13

Common Pitfalls and Debugging

Problem 1: “Segments overlap or appear out of order”

  • Why: Misinterpreting file offsets vs virtual addresses.
  • Fix: Use program headers for mapping, not section headers.
  • Quick test: Compare with loader output for a simple binary.

Definition of Done

  • Parses ELF or PE headers correctly
  • Produces a virtual memory map diagram
  • Indicates segment permissions
  • Matches readelf/objdump output on test binaries

Project 10: PLT/GOT Relocation Resolver

  • File: P10-plt-got-relocation-resolver.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 2
  • Difficulty: Level 4
  • Knowledge Area: Linking, relocation
  • Software or Tool: readelf, objdump
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you will build: A tool that inspects PLT/GOT entries and resolves which symbols they correspond to.

Why it teaches x86-64: Dynamic linking is central to real-world binaries and relies on relocation mechanics.

Core challenges you will face:

  • Relocation parsing -> ABI rules
  • Symbol resolution -> Linking model
  • Runtime mapping -> PLT/GOT semantics

Real World Outcome

$ x64plt demo.elf

PLT/GOT RESOLUTION
  PLT[0] -> _dl_runtime_resolve
  PLT[1] -> puts
  PLT[2] -> malloc
  GOT[1] -> puts@GLIBC
  GOT[2] -> malloc@GLIBC

The Core Question You Are Answering

“How does a call to an external function become a concrete address at runtime?”

Concepts You Must Understand First

  1. Relocations
    • What do relocation entries represent?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 7
  2. Dynamic linking
    • What are PLT and GOT used for?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 7

Questions to Guide Your Design

  1. Data extraction
    • Which sections contain PLT and GOT information?
    • How will you map relocation entries to symbols?
  2. Output format
    • How will you display the mapping in a readable form?
    • How will you handle stripped binaries?

Thinking Exercise

Resolution Timeline

Draw a timeline showing when a PLT entry is first resolved and how subsequent calls behave.

Questions to answer:

  • What changes in the GOT after resolution?
  • Why is lazy binding useful?

The Interview Questions They Will Ask

  1. “What is the PLT and why does it exist?”
  2. “How does the GOT get populated?”
  3. “What is lazy binding?”
  4. “What happens if relocations are missing?”
  5. “How does ASLR affect dynamic linking?”

Hints in Layers

Hint 1: Starting Point Parse relocation sections that target the PLT and GOT.

Hint 2: Next Level Use symbol tables to map relocation entries to names.

Hint 3: Technical Details Track the relationship between PLT stubs and GOT slots.

Hint 4: Tools/Debugging Compare your output with readelf -r and objdump -d.
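
Lazy binding is easier to internalize with a toy model of a GOT slot before and after the first call. This is purely conceptual: the names and addresses are invented, and no real relocation parsing happens here.

# Sketch: a toy model of lazy binding (conceptual only, not real linker code)
got = {"puts": "plt_resolver_stub", "malloc": "plt_resolver_stub"}   # initial GOT state
library = {"puts": 0x7F0000001000, "malloc": 0x7F0000002000}          # invented addresses

def call_via_plt(symbol):
    target = got[symbol]
    if target == "plt_resolver_stub":   # first call: the resolver runs...
        got[symbol] = library[symbol]   # ...and patches the GOT slot
        target = got[symbol]
    return target                       # later calls jump straight to the real address

print(hex(call_via_plt("puts")), got["puts"])   # resolver fires once, GOT now patched
print(hex(call_via_plt("puts")))                # second call uses the patched slot directly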

Books That Will Help

Topic Book Chapter
Dynamic linking “Computer Systems: A Programmer’s Perspective” Ch. 7
Object file format “Linkers and Loaders” by John Levine Ch. 3-5

Common Pitfalls and Debugging

Problem 1: “Symbols do not resolve”

  • Why: Stripped binaries or missing symbol table.
  • Fix: Fall back to dynamic symbol table or note unknowns.
  • Quick test: Use a binary with known imports.

Definition of Done

  • Parses PLT and GOT sections
  • Maps relocations to symbol names
  • Explains lazy binding behavior
  • Works on at least two real binaries

Project 11: SIMD Lane Analyzer

  • File: P11-simd-lane-analyzer.md
  • Main Programming Language: Python or C
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: SIMD, performance
  • Software or Tool: None (custom analyzer)
  • Main Book: “Modern X86 Assembly Language Programming”

What you will build: A tool that simulates SIMD lane operations on vectors and shows lane-by-lane transformations.

Why it teaches x86-64: SIMD is central to modern performance and requires understanding vector semantics.

Core challenges you will face:

  • Vector modeling -> SIMD semantics
  • Alignment rules -> ABI constraints
  • Lane visualization -> Debugging vector logic

Real World Outcome

$ x64simd --op VEC_ADD --lanes 4 --input "[1,2,3,4]" --input2 "[10,20,30,40]"

VEC_ADD RESULT
  LANE0: 11  LANE1: 22  LANE2: 33  LANE3: 44

ALIGNMENT CHECK: 16-byte aligned

The Core Question You Are Answering

“How do SIMD instructions transform multiple values at once, and what constraints matter?”

Concepts You Must Understand First

  1. SIMD registers
    • How do lanes map onto vector registers?
    • Book Reference: “Modern X86 Assembly Language Programming” - Ch. 9-10
  2. Alignment
    • Why do some vector loads require alignment?
    • Book Reference: “Computer Architecture” (Hennessy, Patterson) - Ch. 5

Questions to Guide Your Design

  1. Vector model
    • How will you represent lanes and element sizes?
    • How will you handle different vector widths?
  2. Operations
    • Which vector operations will you model first?
    • How will you show lane-wise results?

Thinking Exercise

Lane Mapping

Given a 128-bit vector with 4 lanes of 32-bit values, draw how values map to bytes in memory.

Questions to answer:

  • How does endianness affect lane order?
  • How does alignment affect a vector load?

The Interview Questions They Will Ask

  1. “What is SIMD and why is it faster?”
  2. “How do vector lanes map to memory?”
  3. “What happens on a misaligned vector load?”
  4. “How do compilers auto-vectorize loops?”
  5. “What is the difference between SSE and AVX?”

Hints in Layers

Hint 1: Starting Point Represent a vector as an array of lanes and implement lane-wise operations.

Hint 2: Next Level Add alignment checks and report when an operation would be slow.

Hint 3: Technical Details Allow different element sizes and show byte-level representation.

Hint 4: Tools/Debugging Use known vector examples from documentation and verify output.
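
A minimal sketch of the lane model from Hint 1, with per-lane wraparound and a simple alignment check; the function names are invented for this guide.

# Sketch: lane-wise vector add with independent wraparound per element
def vec_add(a, b, elem_bits=32):
    if len(a) != len(b):
        raise ValueError("lane counts must match")
    mask = (1 << elem_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]   # each lane wraps independently

def is_vector_aligned(addr, vector_bytes=16):
    return addr % vector_bytes == 0

print(vec_add([1, 2, 3, 4], [10, 20, 30, 40]))      # [11, 22, 33, 44]
print(is_vector_aligned(0x1000), is_vector_aligned(0x1004))   # True False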

Books That Will Help

Topic Book Chapter
SIMD basics “Modern X86 Assembly Language Programming” Ch. 9-10
Cache and alignment “Computer Architecture” (Hennessy, Patterson) Ch. 5

Common Pitfalls and Debugging

Problem 1: “Lane order is reversed”

  • Why: Endianness was misapplied to lane indexing.
  • Fix: Define lane order explicitly and stick to it.
  • Quick test: Use a simple ascending vector and verify mapping.

Definition of Done

  • Simulates at least 5 vector operations
  • Shows lane-wise transformation
  • Includes alignment checks
  • Produces deterministic output

Project 12: Microbenchmark Cache and Branch Lab

  • File: P12-microbenchmark-cache-branch-lab.md
  • Main Programming Language: C or C++ (with assembly inspection)
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4
  • Business Potential: 1
  • Difficulty: Level 4
  • Knowledge Area: Performance, microarchitecture
  • Software or Tool: perf, time, objdump
  • Main Book: “Computer Architecture” (Hennessy, Patterson)

What you will build: A microbenchmark suite that measures cache effects and branch prediction behavior, and correlates results with disassembly.

Why it teaches x86-64: It connects assembly-level code to real performance behavior.

Core challenges you will face:

  • Benchmark design -> Noise reduction
  • Disassembly correlation -> Instruction-level analysis
  • Cache modeling -> Memory hierarchy intuition

Real World Outcome

$ x64bench --case cache_stride

CASE: cache_stride STRIDE=64 bytes ITER=10000000 RESULT: 1.35 ns/iter (L1 hit)

CASE: cache_stride STRIDE=4096 bytes ITER=10000000 RESULT: 12.8 ns/iter (L3/memory)

The Core Question You Are Answering

“How do cache behavior and branch prediction show up in real timing results?”

Concepts You Must Understand First

  1. Cache hierarchy
    • Why does stride affect cache hits?
    • Book Reference: “Computer Architecture” (Hennessy, Patterson) - Ch. 5
  2. Branch prediction
    • What causes mispredict penalties?
    • Book Reference: “Inside the Machine” by Jon Stokes - Ch. 4-6

Questions to Guide Your Design

  1. Measurement strategy
    • How will you reduce noise (warmup, pinning, repeat)?
    • How will you report confidence intervals?
  2. Disassembly analysis
    • How will you map results to the instruction sequence?
    • Which counters or metrics will you collect?

Thinking Exercise

Predict the Curve

Sketch the expected latency curve as array size grows beyond L1, L2, and L3 caches.

Questions to answer:

  • Where should the knee points appear?
  • What would a branch-mispredict curve look like?

The Interview Questions They Will Ask

  1. “Why does memory access latency jump with stride?”
  2. “How does branch prediction affect tight loops?”
  3. “What is the difference between latency and throughput?”
  4. “How do you design a reliable microbenchmark?”
  5. “Why is assembly inspection useful in performance work?”

Hints in Layers

Hint 1: Starting Point Start with a simple loop and measure time per iteration.

Hint 2: Next Level Vary stride length and measure latency changes.

Hint 3: Technical Details Pin the process to a core and disable frequency scaling if possible.

Hint 4: Tools/Debugging Use perf stat to gather branch and cache miss counters.
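
Even though the benchmark itself should be written in C or C++, the measurement discipline can be sketched in a few lines. This Python harness shows the warmup/repeat/median pattern only; Python timings will not expose the cache and branch effects the compiled benchmark is designed to reveal.

# Sketch: noise-aware measurement harness (methodology only)
import statistics
import time

def measure(fn, warmup=3, repeats=10):
    for _ in range(warmup):            # warm caches and predictors before recording
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return {"median_ns": statistics.median(samples),
            "stdev_ns": statistics.pstdev(samples)}

print(measure(lambda: sum(range(100_000))))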

Books That Will Help

Topic Book Chapter
Caches “Computer Architecture” (Hennessy, Patterson) Ch. 5
Pipelines “Inside the Machine” by Jon Stokes Ch. 4-6

Common Pitfalls and Debugging

Problem 1: “Results are noisy and inconsistent”

  • Why: OS scheduling and CPU frequency changes.
  • Fix: Pin to a core and repeat runs with warmups.
  • Quick test: Compare variance across 10 repeated runs.

Definition of Done

  • Measures cache effects with clear latency tiers
  • Demonstrates branch predict vs mispredict behavior
  • Includes disassembly correlation notes
  • Results are reproducible within a small variance

Project Comparison Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor
1. Register and Flags Simulator | Level 3 | Weekend | High | ★★★☆☆
2. Addressing Mode Calculator | Level 3 | Weekend | Medium | ★★★☆☆
3. Instruction Encoding Decoder | Level 4 | 2-3 weeks | High | ★★★★☆
4. RIP-Relative Disassembly Explorer | Level 4 | 2-3 weeks | High | ★★★★☆
5. Calling Convention Visualizer | Level 4 | 2 weeks | High | ★★★★☆
6. Stack Frame and Unwind Explorer | Level 4 | 2-3 weeks | High | ★★★★☆
7. Linux Syscall ABI Tracer | Level 4 | 2 weeks | High | ★★★☆☆
8. Exception and Interrupt Flow Visualizer | Level 4 | 2-3 weeks | High | ★★★☆☆
9. ELF/PE Loader Map Explorer | Level 4 | 3-4 weeks | High | ★★★★☆
10. PLT/GOT Relocation Resolver | Level 4 | 3-4 weeks | High | ★★★★☆
11. SIMD Lane Analyzer | Level 4 | 2-3 weeks | Medium | ★★★★☆
12. Microbenchmark Cache and Branch Lab | Level 4 | 3-4 weeks | High | ★★★★★

Recommendation

  • If you are new to x86-64: start with Project 1. It builds the state-transition mental model without overwhelming tooling.
  • If you are a reverse engineer: start with Project 3. Instruction decoding is the core of binary analysis.
  • If you want performance mastery: focus on Project 12 after completing Project 11.

Final Overall Project: The Binary Truth Toolkit

The Goal: Combine Projects 3, 4, 9, and 10 into a single toolkit that can decode instructions, resolve RIP-relative references, and explain dynamic linking behavior for a given binary.

  1. Parse the binary into sections and segments.
  2. Decode instruction boundaries and detect RIP-relative references.
  3. Resolve PLT/GOT entries and map external dependencies.

Success Criteria: Given a stripped binary, the toolkit produces a clear report that explains where code lives, where data lives, and how external calls are resolved.

From Learning to Production: What Is Next

Your Project | Production Equivalent | Gap to Fill
Instruction Decoder | Commercial disassembler | Full instruction coverage, vendor updates
Syscall Tracer | strace / dtrace | Robust process handling and syscall database
ELF/PE Explorer | objdump / readelf | Complete format support and edge cases
Microbenchmark Lab | perf / VTune | High-fidelity counters, automation, reporting

Summary

This learning path covers x86-64 through 12 hands-on projects.

# | Project Name | Main Language | Difficulty | Time Estimate
1 | Register and Flags Simulator | Python or C | Level 3 | Weekend
2 | Addressing Mode Calculator | Python or C | Level 3 | Weekend
3 | Instruction Encoding Decoder | Python or C | Level 4 | 2-3 weeks
4 | RIP-Relative Disassembly Explorer | Python or C | Level 4 | 2-3 weeks
5 | Calling Convention Visualizer | Python or C | Level 4 | 2 weeks
6 | Stack Frame and Unwind Explorer | Python or C | Level 4 | 2-3 weeks
7 | Linux Syscall ABI Tracer | Python or C | Level 4 | 2 weeks
8 | Exception and Interrupt Flow Visualizer | Python or C | Level 4 | 2-3 weeks
9 | ELF/PE Loader Map Explorer | Python or C | Level 4 | 3-4 weeks
10 | PLT/GOT Relocation Resolver | Python or C | Level 4 | 3-4 weeks
11 | SIMD Lane Analyzer | Python or C | Level 4 | 2-3 weeks
12 | Microbenchmark Cache and Branch Lab | C/C++ | Level 4 | 3-4 weeks

Expected Outcomes

  • You can decode x86-64 instructions and explain their effect on architectural state.
  • You can map ABI rules to concrete stack frames and calling conventions.
  • You can reason about dynamic linking and loader behavior in real binaries.

Additional Resources and References

Books

  • “The Art of 64-Bit Assembly, Volume 1” by Randall Hyde - Practical x86-64 foundations
  • “Modern X86 Assembly Language Programming” by Daniel Kusswurm - Encoding and instruction forms
  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - ABI, linking, and system view