Project 4: RIP-Relative Disassembly Explorer

A tool that identifies RIP-relative references in a binary and resolves them to section/label names.

Quick Reference

Attribute	Value
Difficulty	Level 4
Time Estimate	2-3 weeks
Main Programming Language	Python or C
Alternative Programming Languages	Rust, Go
Coolness Level	Level 4
Business Potential	1
Prerequisites	Data Representation, Memory, and Addressing; Instruction Encoding and Decoding
Key Topics	Data Representation, Memory, and Addressing; Instruction Encoding and Decoding

1. Learning Objectives

By completing this project, you will:

Explain why exploring RIP-relative references reveals key x86-64 behaviors.
Build a deterministic tool with clear, inspectable output.
Validate correctness against a golden reference output.
Connect the tool output to ABI and architecture rules.
Explain why RIP-relative addressing is a defining feature of 64-bit code and PIC.

2. All Theory Needed (Per-Concept Breakdown)

Data Representation, Memory, and Addressing

Fundamentals x86-64 is a byte-addressed, little-endian architecture. Data representation determines how values appear in memory, how loads and stores reconstruct those values, and how alignment affects performance. Memory addressing is not just “base + offset”; it is a rich set of forms including base, index, scale, and displacement, plus RIP-relative addressing in 64-bit mode. These addressing forms are part of the ISA and are a primary tool for compilers. Virtual memory adds another layer: the addresses you see in registers are virtual, translated by page tables configured by the OS. When you write or analyze assembly, you are always navigating both representation and translation. Official architecture references and ABI specifications describe these addressing forms and constraints. (Sources: Intel SDM, Microsoft x64 architecture docs)

Deep Dive Data representation is the mapping between abstract values and physical bytes. On x86-64, integers are typically two’s complement and stored in little-endian order. That means the least significant byte sits at the lowest memory address. When you inspect memory dumps, the order will appear reversed relative to the human-readable hex. This matters for debugging and binary analysis; it also matters for writing correct parsing and serialization logic.
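
As a small illustration of the byte order described above, the following sketch encodes a 64-bit value as little-endian bytes and decodes it again. The sample value is arbitrary and chosen only so each byte is easy to spot.

```python
import struct

value = 0x1122334455667788

# Encode as 8 bytes, least significant byte first (little-endian).
le_bytes = value.to_bytes(8, byteorder="little")
print(le_bytes.hex(" "))  # 88 77 66 55 44 33 22 11

# Decode the same bytes back; "<Q" means little-endian unsigned 64-bit.
(decoded,) = struct.unpack("<Q", le_bytes)
print(hex(decoded))  # 0x1122334455667788
```

This is the same order a hexdump of memory shows, which is why the bytes look reversed next to the value written as a number.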

Memory addressing is a key differentiator between x86-64 and many simpler ISAs. The architecture supports effective addresses of the form base + index * scale + displacement, where scale can be 1, 2, 4, or 8. This lets the CPU calculate addresses for arrays and structures in a single instruction, which is why compiler output often uses complex addressing instead of explicit multiply or add instructions. In long mode, RIP-relative addressing is widely used for position-independent code; it allows the instruction stream to refer to nearby constants and jump tables without absolute addresses. That is why you will see references relative to RIP rather than absolute pointers in modern binaries.

Virtual memory is the next layer of meaning. The addresses in registers are virtual; they are translated to physical addresses using a page table hierarchy. As a result, two different processes can have the same virtual address mapping to different physical memory. The OS enforces protection and isolation through page permissions. When you read assembly, you see the virtual addresses. The mapping is invisible unless you consult page tables or OS introspection tools. This is one reason memory corruption bugs can appear non-deterministic: a stray pointer may land on a page that happens to be mapped and silently read or write the wrong data instead of faulting.
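
To make the page table hierarchy more concrete, here is a minimal sketch that splits a virtual address into the table indices used by 4-level paging with 4 KiB pages (9 bits per level, 12-bit page offset). The function name and dictionary keys are illustrative only, and the sketch does not model large pages or 5-level paging.

```python
def split_virtual_address(va):
    """Split a virtual address into 4-level paging indices (4 KiB pages)."""
    return {
        "pml4": (va >> 39) & 0x1FF,   # bits 47..39
        "pdpt": (va >> 30) & 0x1FF,   # bits 38..30
        "pd": (va >> 21) & 0x1FF,     # bits 29..21
        "pt": (va >> 12) & 0x1FF,     # bits 20..12
        "offset": va & 0xFFF,         # bits 11..0
    }

parts = split_virtual_address(0x401050)
print(parts)  # {'pml4': 0, 'pdpt': 0, 'pd': 2, 'pt': 1, 'offset': 80}
```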

Alignment is another subtlety. Many instructions perform better when data is aligned to its natural width (for example, 8-byte aligned for 64-bit values). Misaligned loads are supported in x86-64 but can be slower or cause extra microarchitectural work. ABI conventions often require stack alignment to 16 bytes at call boundaries, which ensures that SIMD operations and stack-based data are aligned. This alignment rule is part of the ABI, not just a performance hint.
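
The alignment checks mentioned above reduce to simple arithmetic. The helpers below are a sketch with hypothetical names; they assume the alignment is a power of two.

```python
def is_aligned(addr, alignment):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0

def align_up(addr, alignment):
    """Round addr up to the next multiple of a power-of-two alignment."""
    return (addr + alignment - 1) & ~(alignment - 1)

print(is_aligned(0x1008, 8))        # True
print(is_aligned(0x1008, 16))       # False
print(hex(align_up(0x7FFC1238, 16)))  # 0x7ffc1240
```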

Addressing modes also influence instruction encoding. The ModR/M and SIB bytes encode the base, index, scale, and displacement. Some field values have special meaning rather than naming a register: in 64-bit mode, mod=00 with r/m=101 selects RIP-relative addressing with a 32-bit displacement, r/m=100 (with mod other than 11) signals that a SIB byte follows, and a SIB base of 101 with mod=00 selects a displacement-only form with no base register. Understanding this encoding is critical for building decoders and for interpreting bytes in memory. It is also how you can verify that a disassembler is correct: the addressing mode can be inferred from the encoding and compared to the textual rendering.
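
The special cases in the ModR/M and SIB bytes can be sketched as a small classifier. The function names and the returned labels are hypothetical; the sketch assumes 64-bit mode and ignores everything except the mod, r/m, and SIB base fields.

```python
def split_modrm(byte):
    """Return the (mod, reg, rm) bit fields of a ModR/M byte."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

def classify_memory_form(modrm, sib=None):
    """Classify the operand form selected by ModR/M (and SIB, if present)."""
    mod, _reg, rm = split_modrm(modrm)
    if mod == 0b11:
        return "register"
    if rm == 0b100:
        if sib is None:
            raise ValueError("r/m=100 requires a SIB byte")
        if mod == 0b00 and (sib & 0b111) == 0b101:
            return "sib-disp32-no-base"
        return "sib"
    if mod == 0b00 and rm == 0b101:
        return "rip-relative"
    return "base-register"

print(classify_memory_form(0x05))        # rip-relative
print(classify_memory_form(0x45))        # base-register ([rbp+disp8])
print(classify_memory_form(0x04, 0x25))  # sib-disp32-no-base
```

A scanner for this project only needs the "rip-relative" case, but the other branches show why the check cannot be a simple test on the r/m field alone.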

Finally, consider how data representation affects control flow and calling conventions. Arguments passed by reference are simply addresses; the ABI does not enforce type. That means assembly must interpret the bytes correctly, or the program will behave incorrectly even if the instruction sequence is “valid.” This is where assembly becomes a discipline: you must know what the bytes mean, and that meaning is not written anywhere except in the ABI and the program’s logic.

How this fits on projects

Projects 2-4 are explicitly about effective address calculation and RIP-relative forms.
Projects 9-10 require precise understanding of data layout and alignment inside ELF/PE sections.

Definitions & key terms

Little-endian: Least significant byte at lowest address.
Effective address: The computed address used by a memory instruction.
RIP-relative: Addressing relative to the instruction pointer.
Virtual memory: The address space seen by a process, mapped to physical memory.
Alignment: Requirement that an address be a multiple of a given power of two; some instructions and ABI rules depend on it for correctness or performance.

Mental model diagram

VALUE -> BYTES -> VIRTUAL ADDRESS -> PAGE TABLE -> PHYSICAL ADDRESS

      +---------+          +------------------+
      |  Value  |  encode  |  Byte Sequence   |
      +---------+          +---------+--------+
                                    |
                                    v
                         +----------------------+
                         |  Effective Address   |
                         | base + index*scale + |
                         |      displacement    |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Virtual Address    |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Page Translation   |
                         +----------+-----------+
                                    |
                                    v
                         +----------------------+
                         |   Physical Address   |
                         +----------------------+

How it works

Program computes effective address from base/index/scale/disp.
CPU uses that effective address as a virtual address.
MMU translates virtual to physical using page tables.
Data is loaded or stored in little-endian byte order.

Invariants and failure modes:

Invariant: Effective address is computed before translation.
Failure: Misinterpreting endianness yields wrong values.
Invariant: ABI defines alignment at call boundaries.
Failure: Misalignment can break SIMD assumptions or slow down code.

Minimal concrete example (pseudo-assembly, not real code)

# PSEUDOCODE ONLY
# Compute address of element i in an array of 8-byte elements
EFFECTIVE_ADDRESS = BASE_PTR + INDEX * 8 + OFFSET
LOAD64 REG_X, [EFFECTIVE_ADDRESS]
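
The pseudocode above can be turned into a runnable sketch. The function name `effective_address` is hypothetical; the result is wrapped to 64 bits on the assumption that address arithmetic is modulo 2**64.

```python
MASK64 = (1 << 64) - 1

def effective_address(base, index, scale, disp):
    """Compute base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK64

# Element 3 of an array of 8-byte elements starting at 0x1000.
print(hex(effective_address(0x1000, 3, 8, 0)))   # 0x1018
# A negative displacement simply subtracts.
print(hex(effective_address(0x1000, 0, 1, -8)))  # 0xff8
```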

Common misconceptions

“x86-64 is big-endian.” It is little-endian.
“All addresses are physical.” User code uses virtual addresses.
“Alignment is optional.” The ABI requires 16-byte stack alignment at call boundaries, and aligned SIMD loads and stores fault on misaligned addresses.

Check-your-understanding questions

Why does little-endian matter when reading a hexdump?
What is the difference between effective and virtual address?
Why do compilers use base+index*scale addressing?

Check-your-understanding answers

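Multi-byte values are stored least significant byte first, so the bytes in the dump read in reverse order compared with the value written as a hex number.
The effective address is what the instruction computes from its operands; in 64-bit mode it serves as the virtual address (apart from FS/GS segment bases), which the MMU then translates to a physical address.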
It encodes array indexing in a single instruction.

Real-world applications

Debugging pointer arithmetic errors
Building instruction decoders and disassemblers
Understanding how compilers lay out data

Where you will apply it Projects 2, 3, 4, 9, 10

References

Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
Microsoft x64 architecture documentation
“Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron - Ch. 3

Key insights Memory is not just bytes; it is a layered mapping between representation and address translation.

Summary Effective addressing and data layout are the glue between values in your head and bytes in memory.

Homework/Exercises to practice the concept

Convert a 64-bit integer into its little-endian byte sequence.
Compute effective addresses for an array with different indices.

Solutions to the homework/exercises

List the bytes from least significant to most significant.
Use base + index * element_size + offset.

Instruction Encoding and Decoding

Fundamentals x86-64 instructions are variable-length and encoded as a sequence of bytes that may include prefixes, an opcode, ModR/M and SIB bytes, displacement, and immediates. Unlike fixed-width ISAs, instruction length is determined by decoding. This makes decoding complex but flexible. It is also why disassemblers can become confused when they start decoding at the wrong offset. The authoritative definition of instruction encoding and formats is in the vendor manuals, and any tool that handles x86-64 must follow those rules to be correct. (Source: Intel SDM)

Deep Dive Instruction encoding is the bridge between human-readable assembly and raw bytes. x86-64 uses a layered encoding scheme that evolved over decades. Most instructions start with optional prefixes that modify operand size, address size, or provide specialized semantics. In 64-bit mode, the REX prefix extends the register set and selects 64-bit operand size. After prefixes comes the opcode, which identifies the basic instruction. Some instructions use a single-byte opcode; others use opcode escape bytes or multi-byte opcodes for extended instruction sets.

The ModR/M byte is a core part of encoding. It specifies the addressing mode and, in many cases, which registers are used. If the ModR/M byte selects a memory operand (mod is not 11) and its r/m field is 100, a SIB byte follows and encodes scale, index, and base. Then a displacement may follow. Immediate values, if present, appear at the end of the encoding. The length of an instruction therefore depends on which of these elements appear, which is why decoding must be performed left-to-right in a strict order.
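
How the ModR/M byte determines the bytes that follow it can be sketched as below. The function name `addressing_tail` is hypothetical; the sketch assumes 64-bit mode and covers only the SIB and displacement fields, not immediates.

```python
def addressing_tail(modrm, sib=None):
    """Return (has_sib, disp_size_in_bytes) for a ModR/M byte."""
    mod, rm = modrm >> 6, modrm & 0b111
    if mod == 0b11:
        return False, 0            # register operand: nothing follows
    has_sib = rm == 0b100
    if mod == 0b01:
        return has_sib, 1          # disp8
    if mod == 0b10:
        return has_sib, 4          # disp32
    # mod == 00: no displacement except for two special cases
    if rm == 0b101:
        return False, 4            # RIP-relative disp32
    if has_sib:
        if sib is None:
            raise ValueError("r/m=100 requires a SIB byte")
        return True, 4 if (sib & 0b111) == 0b101 else 0
    return False, 0

print(addressing_tail(0x05))        # (False, 4)
print(addressing_tail(0x44))        # (True, 1)
print(addressing_tail(0x04, 0x88))  # (True, 0)
```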

In 64-bit mode, REX prefixes are both powerful and subtle. They extend the register specifiers so you can address the full set of 16 GPRs. They also indicate whether the operand size is 64-bit. That means a missing REX prefix can change the meaning of an instruction even if the opcode and ModR/M bytes are the same. This is a common source of decoding errors in custom tooling. Another source of complexity is that some instructions have implicit operands or fixed registers, so the encoding does not explicitly list all registers involved.
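
A REX prefix is a single byte in the range 0x40-0x4F whose low four bits are the W, R, X, and B flags. The sketch below decodes those bits and shows how one of them extends a 3-bit register field to 4 bits; the helper names are hypothetical.

```python
def decode_rex(byte):
    """Return the REX flags as a dict, or None if byte is not a REX prefix."""
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": bool(byte & 0b1000),  # 64-bit operand size
        "R": bool(byte & 0b0100),  # extends ModR/M reg
        "X": bool(byte & 0b0010),  # extends SIB index
        "B": bool(byte & 0b0001),  # extends ModR/M r/m or SIB base
    }

def extended_register(field3, ext_bit):
    """Combine a 3-bit register field with its REX extension bit."""
    return (int(ext_bit) << 3) | field3

print(decode_rex(0x48))  # {'W': True, 'R': False, 'X': False, 'B': False}
print(decode_rex(0x8B))  # None
print(extended_register(0b001, True))  # 9, i.e. R9
```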

Modern extensions introduce additional prefix schemes like VEX and EVEX for SIMD. These prefixes can replace older opcode sequences and encode vector register width, masking, and other features. From a decoding standpoint, these prefixes are distinct instruction classes with their own rules. That is why instruction decoders are often table-driven: they need to map prefix and opcode combinations to specific instruction behaviors.

The practical consequence is that a decoder must be deterministic and validated against known-good outputs. Tools like objdump or llvm-objdump make reliable references because they are built from the official encoding tables and are exercised on a very wide range of binaries. Your own decoder should not guess. You should test it against real binaries, but also against synthetic byte sequences that exercise edge cases: prefix combinations, displacement lengths, and registers beyond the first eight.

Instruction encoding is also where security tools operate. Many binary analysis techniques rely on correctly decoding instruction boundaries. If the decoder is off by one byte, the entire control flow graph can be wrong. That is why you will build tooling that validates decoding and cross-checks against known good references. Understanding encoding is not optional if you want to work in malware analysis, reverse engineering, or binary instrumentation.

How this fits on projects

Projects 3 and 4 directly build instruction decoding and boundary validation.
Projects 9 and 10 rely on decoding relocations and code references.

Definitions & key terms

Prefix: Byte that modifies instruction meaning or operand size.
Opcode: The core instruction identifier.
ModR/M: Encodes registers and addressing mode.
SIB: Scale-Index-Base encoding for complex addresses.
Displacement: Constant added to an address.
Immediate: Constant operand embedded in the instruction.

Mental model diagram

[ Prefixes ] [ Opcode ] [ ModR/M ] [ SIB ] [ Disp ] [ Imm ]
     |            |         |         |       |       |
     v            v         v         v       v       v
  modifiers    instruction  regs   addr     offset  constant

How it works

Read optional prefixes and record operand/address size.
Read opcode byte(s) and map to instruction class.
If required, read ModR/M and determine register/memory form.
If ModR/M indicates, read SIB.
Read displacement and immediate fields by size.
Compute final instruction length and semantics.

Invariants and failure modes:

Invariant: Decoding order is strict and deterministic.
Failure: Skipping a required prefix shifts the decode boundary.
Invariant: ModR/M determines whether SIB is present.
Failure: Incorrect ModR/M parsing breaks address calculation.

Minimal concrete example (pseudo-encoding)

# PSEUDOCODE ONLY
BYTES: [PFX][OP][MR][SIB][DISP32]
DECODE:
  prefix = PFX
  opcode = OP
  addressing = decode_modrm(MR, SIB, DISP32)
  length = 1 + 1 + 1 + 1 + 4
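
As one concrete instance of the layout above, the sketch below decodes a single fixed shape, `[REX][opcode][ModR/M][SIB][disp32]`, and nothing else. The function name and the returned field names are hypothetical, and the sample bytes are assumed to encode `mov rax, [rax+rcx*4+0x10]`.

```python
import struct

SCALES = (1, 2, 4, 8)

def decode_fixed_layout(code):
    """Decode [REX][opcode][ModR/M][SIB][disp32]; not a general decoder."""
    if len(code) < 8:
        raise ValueError("need at least 8 bytes")
    rex, opcode, modrm, sib = code[0], code[1], code[2], code[3]
    if rex & 0xF0 != 0x40:
        raise ValueError("expected a REX prefix")
    if modrm >> 6 != 0b10 or modrm & 0b111 != 0b100:
        raise ValueError("expected mod=10, r/m=100 (SIB + disp32)")
    (disp,) = struct.unpack_from("<i", code, 4)  # signed little-endian
    return {
        "opcode": opcode,
        "reg": (((rex >> 2) & 1) << 3) | ((modrm >> 3) & 0b111),
        "scale": SCALES[sib >> 6],
        # Index 4 without REX.X would mean "no index"; not handled here.
        "index": (((rex >> 1) & 1) << 3) | ((sib >> 3) & 0b111),
        "base": ((rex & 1) << 3) | (sib & 0b111),
        "disp": disp,
        "length": 1 + 1 + 1 + 1 + 4,
    }

fields = decode_fixed_layout(bytes.fromhex("488b848810000000"))
print(fields)
```

Register numbers in the result follow the hardware numbering (0 is RAX, 1 is RCX, and so on).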

Common misconceptions

“Instruction length is fixed.” It is variable-length and decoder-driven.
“Opcode alone defines the instruction.” Prefixes and ModR/M change meaning.
“Decoding is trivial.” It is table-driven and full of edge cases.

Check-your-understanding questions

Why is decoding order critical in x86-64?
What does the ModR/M byte control?
Why do decoders need tables rather than simple logic?

Check-your-understanding answers

Because prefixes and opcode length determine where fields begin.
It chooses register vs memory forms and which registers are used.
Because many encodings are irregular and context-dependent.

Real-world applications

Building disassemblers and binary analyzers
Static malware analysis and reverse engineering
JIT and instrumentation tooling

Where you will apply it Projects 3, 4, 9, 10

References

Intel 64 and IA-32 Architectures Software Developer’s Manual (Intel)
“Modern X86 Assembly Language Programming” by Daniel Kusswurm - Ch. 2-4

Key insights Decoding is the gatekeeper: if the bytes are misread, everything else is wrong.

Summary Instruction encoding is the rulebook that allows bytes to become meaning.

Homework/Exercises to practice the concept

Write out the fields you would expect to decode from a byte stream.
Identify which fields would change if you referenced one of the extended registers (R8-R15).

Solutions to the homework/exercises

Prefix, opcode, ModR/M, optional SIB, displacement, immediate.
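A REX prefix is needed to reach R8-R15: its R, X, or B bit supplies the fourth bit of the ModR/M reg field, the SIB index field, or the r/m or SIB base field.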

3. Project Specification

3.1 What You Will Build

A tool that identifies RIP-relative references in a binary and resolves them to section/label names.

Why this teaches x86-64: RIP-relative addressing is a defining feature of 64-bit code and PIC.

Included:

Deterministic CLI output for a fixed input
Clear mapping between inputs and architectural meaning
A small test suite with edge cases

Excluded:

Full compiler or full disassembler coverage
Production-grade UI or packaging

3.2 Functional Requirements

Deterministic Output: Same input yields identical output.
Architecture-Aware: Output references ABI/ISA rules where relevant.
Validation Mode: Provide a compare mode against a golden output.

3.3 Non-Functional Requirements

Performance: Fast enough for small inputs and interactive use.
Reliability: Handles malformed inputs with clear errors.
Usability: Outputs are readable and documented.

3.4 Example Usage / Output

$ x64ripscan demo.bin

RIP-RELATIVE REFERENCES
0x0000000000401050 -> .rodata+0x2A  ("CONST_STRING")
0x0000000000401074 -> .data+0x10    (GLOBAL_COUNTER)

SUMMARY
Total RIP-relative refs: 12

3.5 Data Formats / Schemas / Protocols

Input format: line-oriented text or hex bytes (documented in README)
Output format: stable, human-readable report with labeled fields
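
One way to keep the report stable is to build every line from a single format string. The sketch below reproduces the two reference lines from the example output in section 3.4; the function name is hypothetical and the 12-character target column is an assumption inferred from those two lines.

```python
def render_reference(source, section, offset, label):
    """Format one RIP-relative reference as a fixed-layout report line."""
    target = f"{section}+0x{offset:X}"
    # 16 hex digits for the source address; target padded to 12 columns.
    return f"0x{source:016X} -> {target:<12}  {label}"

print(render_reference(0x401050, ".rodata", 0x2A, '("CONST_STRING")'))
print(render_reference(0x401074, ".data", 0x10, "(GLOBAL_COUNTER)"))
```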

3.6 Edge Cases

Empty input or missing fields
Invalid numeric values or malformed hex
Inputs that exercise maximum/minimum bounds

3.7 Real World Outcome

This section is your golden reference. Match it exactly.

3.7.1 How to Run (Copy/Paste)

Build: (if needed) make or equivalent
Run: P04-rip-relative-disassembly-explorer with sample input
Working directory: project root

3.7.2 Golden Path Demo (Deterministic)

Run with the provided demo input and confirm output matches the transcript.

3.7.3 If CLI: exact terminal transcript

$ x64ripscan demo.bin

RIP-RELATIVE REFERENCES
0x0000000000401050 -> .rodata+0x2A  ("CONST_STRING")
0x0000000000401074 -> .data+0x10    (GLOBAL_COUNTER)

SUMMARY
Total RIP-relative refs: 12

4. Solution Architecture

4.1 High-Level Design

INPUT -> PARSER -> MODEL -> RENDERER -> REPORT

4.2 Key Components

Component	Responsibility	Key Decisions
Parser	Turn input into structured records	Strict vs permissive parsing
Model	Apply ISA/ABI rules	Deterministic state transitions
Renderer	Produce readable output	Stable formatting

4.3 Data Structures (No Full Code)

Record: holds one instruction/event with decoded fields
State: represents register/flag or address state
Report: list of formatted output lines

4.4 Algorithm Overview

Key Algorithm: Parse and Evaluate

Parse input into records.
Apply rules to update state.
Render the state and summary output.

Complexity Analysis:

Time: O(n) over input records
Space: O(n) for report output

5. Implementation Guide

5.1 Development Environment Setup

# Ensure basic tools are installed
# build-essential or clang, plus objdump/readelf if needed

5.2 Project Structure

project-root/
├── src/
│   ├── main.*
│   ├── parser.*
│   └── model.*
├── tests/
│   └── test_cases.*
└── README.md

5.3 The Core Question You’re Answering

How does position-independent code find its data without absolute addresses?

5.4 Concepts You Must Understand First

RIP-relative addressing
- How is the effective address computed from RIP?
- Book Reference: “Computer Systems: A Programmer’s Perspective” - Ch. 3
Instruction encoding
- How do you detect RIP-relative forms in ModR/M?
- Book Reference: “Modern X86 Assembly Language Programming” - Ch. 2-4

5.5 Questions to Guide Your Design

Binary scanning
- Which sections should you scan for code?
- How will you avoid decoding data as code?
Symbol resolution
- How will you map addresses to section names?
- What if symbols are stripped?
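
For the first design question, one possible approach is to scan only sections whose ELF flags mark them as executable. The sketch below assumes section headers have already been parsed into dictionaries; the record layout and sample sections are hypothetical, while the flag values are the standard ELF `SHF_ALLOC` and `SHF_EXECINSTR` constants.

```python
SHF_ALLOC = 0x2       # section occupies memory at run time
SHF_EXECINSTR = 0x4   # section contains executable instructions

def code_sections(sections):
    """Return names of sections that are both loaded and executable."""
    wanted = SHF_ALLOC | SHF_EXECINSTR
    return [s["name"] for s in sections if s["flags"] & wanted == wanted]

# Hypothetical, already-parsed section headers.
sections = [
    {"name": ".text", "flags": 0x6},
    {"name": ".rodata", "flags": 0x2},
    {"name": ".data", "flags": 0x3},
    {"name": ".plt", "flags": 0x6},
]
print(code_sections(sections))  # ['.text', '.plt']
```

Filtering by flags does not stop data embedded inside an executable section from being decoded as code, so the second question still needs its own answer.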

5.6 Thinking Exercise

RIP Math

Given a RIP at 0x400100 and a displacement of 0x20, compute the target address. Explain why the displacement is relative to the next instruction.

Questions to answer:

Why is RIP-relative addressing stable under relocation?
How does it enable shared libraries?
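
The arithmetic in this exercise can be checked with a short sketch. The function name is hypothetical, and the 7-byte instruction length in the second call is an assumption used only to show the difference between the address of the instruction and the address of the next one.

```python
MASK64 = (1 << 64) - 1

def rip_relative_target(next_rip, disp32):
    """Add a sign-extended little-endian disp32 to the next-instruction address."""
    disp = int.from_bytes(disp32, "little", signed=True)
    return (next_rip + disp) & MASK64

# If 0x400100 is already the address of the next instruction:
print(hex(rip_relative_target(0x400100, bytes.fromhex("20000000"))))      # 0x400120
# If 0x400100 is the address of an (assumed) 7-byte instruction:
print(hex(rip_relative_target(0x400100 + 7, bytes.fromhex("20000000"))))  # 0x400127
# A negative displacement points backwards:
print(hex(rip_relative_target(0x400100, bytes.fromhex("e0ffffff"))))      # 0x4000e0
```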

5.7 The Interview Questions They’ll Ask

“Why is RIP-relative addressing common in 64-bit binaries?”
“How do you compute the target address?”
“How can disassemblers get confused around RIP-relative data?”
“What happens when symbols are stripped?”
“How does PIC relate to ASLR?”

5.8 Hints in Layers

Hint 1: Starting Point Use your decoder from Project 3 to find ModR/M patterns that imply RIP-relative.

Hint 2: Next Level Compute the target address as the address of the next instruction plus the sign-extended 32-bit displacement.

Hint 3: Technical Details Use ELF section headers to map targets to section names.

Hint 4: Tools/Debugging Compare a few results with objdump output for validation.
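
Hint 3 can be sketched as a lookup over section address ranges. The section table below is hypothetical sample data rather than values read from a real ELF file, and the function name is illustrative.

```python
from bisect import bisect_right

# Hypothetical section table: (start_address, size, name), sorted by start.
SECTIONS = sorted([
    (0x401000, 0x200, ".text"),
    (0x402000, 0x100, ".rodata"),
    (0x404000, 0x40, ".data"),
])
STARTS = [start for start, _size, _name in SECTIONS]

def resolve(addr):
    """Map an address to 'section+0xOFFSET', or '<unmapped>' if outside all sections."""
    i = bisect_right(STARTS, addr) - 1  # last section starting at or before addr
    if i >= 0:
        start, size, name = SECTIONS[i]
        if addr < start + size:
            return f"{name}+0x{addr - start:X}"
    return "<unmapped>"

print(resolve(0x40202A))  # .rodata+0x2A
print(resolve(0x404010))  # .data+0x10
print(resolve(0x403000))  # <unmapped>
```

This works without symbols, which is why section-relative names are a reasonable fallback for stripped binaries.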

5.9 Books That Will Help

Topic	Book	Chapter
PIC and addressing	“Computer Systems: A Programmer’s Perspective”	Ch. 3
Disassembly	“Modern X86 Assembly Language Programming”	Ch. 5

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals:

Parse input format
Produce a minimal output

Tasks:

1. Define input grammar and example files.
2. Implement a minimal parser and renderer.

Checkpoint: Golden output matches a small input.

Phase 2: Core Functionality (1 week)

Goals:

Implement full rule set
Add validation and errors

Tasks:

1. Implement rule engine for core cases.
2. Add error handling for invalid inputs.

Checkpoint: All core tests pass.

Phase 3: Polish & Edge Cases (2-3 days)

Goals:

Add edge-case coverage
Improve output readability

Tasks:

1. Add edge-case tests.
2. Refine output formatting and summary.

Checkpoint: Output matches golden transcript for all cases.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Input format	Text, JSON	Text	Easiest to audit and diff
Output format	Plain text, JSON	Plain text	Matches CLI tooling

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate parsing and rule application	Valid/invalid inputs
Integration Tests	End-to-end output comparison	Golden transcripts
Edge Case Tests	Stress unusual inputs	Empty input, max values

6.2 Critical Test Cases

Minimal Input: One record, verify output.
Boundary Values: Largest/smallest values.
Malformed Input: Ensure clean error messages.

6.3 Test Data

INPUT: sample_min.txt
EXPECTED: matches golden transcript

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Wrong assumptions	Output mismatches	Re-read ABI/ISA rules
Off-by-one parsing	Missing fields	Add explicit length checks
Ambiguous output	Hard to verify	Add labels and separators

Project-specific pitfalls

Problem 1: “Targets are off by instruction length”

Why: Using current RIP instead of next instruction RIP.
Fix: Add the decoded instruction length to the instruction's own address first, then add the displacement.
Quick test: Verify on a short, known sequence.
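
A quick test of this kind might look like the sketch below. The load address 0x401000 is an assumption, and the bytes are assumed to encode `mov rax, [rip+0xff9]` (REX.W, opcode 8B, ModR/M 05, then a 32-bit displacement).

```python
code = bytes.fromhex("488b05f90f0000")  # 7-byte instruction
insn_addr = 0x401000                    # assumed load address

# The displacement is the last four bytes, little-endian and signed.
disp = int.from_bytes(code[3:7], "little", signed=True)

wrong = insn_addr + disp               # forgets the instruction length
right = insn_addr + len(code) + disp   # relative to the next instruction

print(hex(wrong))  # 0x401ff9
print(hex(right))  # 0x402000
```

The two results differ by exactly the instruction length, which is the signature of this bug.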

7.2 Debugging Strategies

Golden diffing: Use diff to compare outputs line by line.
State logging: Print intermediate state after each step.
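
Golden diffing can also be done inside the test suite. The sketch below uses the standard library `difflib` module; the function name and the sample strings are hypothetical.

```python
import difflib

def diff_against_golden(actual, golden):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        golden.splitlines(), actual.splitlines(),
        fromfile="golden", tofile="actual", lineterm=""))

golden = "RIP-RELATIVE REFERENCES\nTotal RIP-relative refs: 2"
actual = "RIP-RELATIVE REFERENCES\nTotal RIP-relative refs: 3"

for line in diff_against_golden(actual, golden):
    print(line)
```

An empty result can serve as the pass condition for the compare mode described in the functional requirements.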

7.3 Performance Traps

Avoid over-optimizing; correctness and determinism matter most.

8. Extensions & Challenges

8.1 Beginner Extensions

Add a new input case and golden output
Add a summary line with counts

8.2 Intermediate Extensions

Add JSON output mode
Add validation warnings for suspicious inputs

8.3 Advanced Extensions

Support additional ABI or instruction variants
Integrate with a real binary to collect inputs

9. Real-World Connections

9.1 Industry Applications

Profilers and tracers: Use similar decoding and state models.
Security analysis: Use precise ABI knowledge to interpret crashes.

9.2 Related Open Source Projects

objdump: reference tool for binary inspection.
llvm-objdump: LLVM-based disassembly and inspection.

9.3 Interview Relevance

ABI and calling conventions are common systems interview topics.
Explaining decoding and linking demonstrates low-level fluency.

10. Resources

10.1 Essential Reading

Intel 64 and IA-32 Architectures Software Developer’s Manual - ISA reference
System V AMD64 ABI Draft 0.99.7 - calling convention rules

10.2 Video Resources

Vendor and university lectures on x86-64 and ABIs (search official channels)
