Project 8: File Signature (Magic Number) Identifier
- File: P08-file-signature-id.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 3 (See REFERENCE.md)
- Business Potential: Level 2 (See REFERENCE.md)
- Difficulty: Level 3 (See REFERENCE.md)
- Knowledge Area: File Forensics
- Software or Tool: CLI
- Main Book: “Practical Binary Analysis”
What you will build: A command-line tool that reads bytes at rule-specified offsets and identifies each file's type by matching those bytes against magic-number patterns.
Why it teaches binary/hex: File identification relies on magic-number rules evaluated against byte sequences.
Core challenges you will face:
- Binary file I/O -> Encoding & Forensics
- Pattern matching -> Bits/Bytes/Nibbles
- Rule-driven parsing -> Encoding & Forensics
Real World Outcome
$ magicid samples/
file.bin: unknown
image.bin: matched rule "PNG"
doc.bin: matched rule "PDF"
The Core Question You Are Answering
“How can you identify a file by its bytes instead of its name?”
Concepts You Must Understand First
- Magic patterns
- What do magic files test?
- Book Reference: “Practical Binary Analysis” - Ch. 3
- Offsets and types
- Why do rules specify offsets and data types?
- Book Reference: “The C Programming Language” - Ch. 7
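One way to picture "offsets and types" is as fields of a rule record. The sketch below assumes the simplest possible type, a raw byte string; the `sig_rule` struct and `rule_matches` function are hypothetical names chosen for illustration, not part of any existing tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rule layout: field names are illustrative only. */
typedef struct {
    size_t         offset;   /* where in the file the test starts */
    const uint8_t *pattern;  /* raw bytes expected at that offset */
    size_t         length;   /* number of bytes to compare */
    const char    *label;    /* name reported on a match */
} sig_rule;

/* Returns 1 if the rule matches the buffer, 0 otherwise.
   The bounds check is ordered so that offset + length is never
   computed, which avoids overflow on very large offsets. */
static int rule_matches(const sig_rule *r, const uint8_t *buf, size_t buf_len)
{
    if (r->length > buf_len || r->offset > buf_len - r->length)
        return 0;
    return memcmp(buf + r->offset, r->pattern, r->length) == 0;
}
```

A richer design would add a type field (for example, a 16-bit or 32-bit integer with a byte order) so a rule can compare values rather than raw bytes.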
Questions to Guide Your Design
- Rule format
- Will you define your own rule syntax or parse an existing one?
- Matching
- How will you handle overlapping patterns?
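One possible answer to the overlapping-patterns question is "the most specific rule wins", approximated here by pattern length. This is only a sketch of that policy under the assumption that longer patterns are more specific; `sig_rule` and `best_match` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rule record: offset + bytes + label. */
typedef struct {
    size_t      offset;
    const char *bytes;
    size_t      length;
    const char *label;
} sig_rule;

/* Returns the matching rule with the longest pattern, or NULL if none match.
   Ties keep the rule that appears first in the table. */
static const sig_rule *best_match(const sig_rule *rules, size_t count,
                                  const uint8_t *buf, size_t len)
{
    const sig_rule *best = NULL;
    for (size_t i = 0; i < count; i++) {
        const sig_rule *r = &rules[i];
        if (r->length > len || r->offset > len - r->length)
            continue;                              /* rule reaches past the data */
        if (memcmp(buf + r->offset, r->bytes, r->length) != 0)
            continue;                              /* bytes differ */
        if (best == NULL || r->length > best->length)
            best = r;                              /* prefer the longer pattern */
    }
    return best;
}
```

With rules for both `GIF8` and `GIF89a`, a GIF89a file matches both and this policy reports the longer one. Other policies, such as explicit priorities or reporting every match, are equally defensible.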
Thinking Exercise
Offset Reasoning
Imagine a signature rule that checks bytes 0-3. Why might another rule check bytes at offset 512?
Questions to answer:
- What does an offset mean in the file layout?
- How do you avoid false positives?
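As a concrete case of a signature that does not sit at the start of the file, POSIX tar archives carry the string "ustar" in the header's magic field at offset 257. The sketch below checks for it; `looks_like_ustar` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* POSIX tar stores "ustar" in the header's magic field at byte offset 257. */
#define TAR_MAGIC_OFFSET 257u

/* Hypothetical helper: returns 1 if buf looks like a ustar header, else 0. */
static int looks_like_ustar(const uint8_t *buf, size_t len)
{
    static const char magic[] = "ustar";
    size_t n = sizeof magic - 1;           /* 5 bytes, terminator excluded */

    if (len < TAR_MAGIC_OFFSET + n)
        return 0;                          /* too short to hold the field */
    return memcmp(buf + TAR_MAGIC_OFFSET, magic, n) == 0;
}
```

Note that a 5-byte test at a fixed offset is still fairly weak evidence; longer patterns or a second confirming test reduce the chance of a false positive.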
The Interview Questions They Will Ask
- “How does file(1) identify file types?”
- “What is a magic number?”
- “Why are offsets part of signature rules?”
- “How would you design a signature database?”
- “How do you handle endian-specific fields in a signature?”
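For the endianness question, a common approach is to assemble multi-byte values from individual bytes instead of casting the buffer to an integer pointer, so the result does not depend on the host CPU. The helper names below are hypothetical; this is a minimal sketch of the technique.

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes stored least-significant byte first. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Assemble a 32-bit value from 4 bytes stored most-significant byte first. */
static uint32_t read_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}
```

A rule can then state which reader applies to a field. TIFF is a familiar example of why this matters: its header begins with either `II` (little-endian) or `MM` (big-endian), and the fields that follow must be read accordingly.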
Hints in Layers
Hint 1: Starting Point
Start with a small JSON or text file of rules: offset + bytes + label.
Hint 2: Next Level
Read only as many bytes as the rules require (the largest offset plus pattern length across all rules) instead of reading the whole file.
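Hint 2 can be sketched as two small steps: compute how many leading bytes the rule set needs, then read at most that many. The `rule_span` struct and both function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of a rule: only the parts that affect how much to read. */
typedef struct {
    size_t offset;
    size_t length;
} rule_span;

/* Largest (offset + length) across all rules, i.e. the prefix size to read.
   Assumes offset + length does not overflow size_t. */
static size_t bytes_needed(const rule_span *rules, size_t count)
{
    size_t need = 0;
    for (size_t i = 0; i < count; i++) {
        size_t end = rules[i].offset + rules[i].length;
        if (end > need)
            need = end;
    }
    return need;
}

/* Reads up to `need` bytes from the start of `path` into `buf`.
   Returns the number of bytes actually read (0 if the file cannot be opened).
   A short count is normal for files smaller than `need`. */
static size_t read_prefix(const char *path, unsigned char *buf, size_t need)
{
    FILE *f = fopen(path, "rb");   /* "rb": binary mode, no newline translation */
    if (f == NULL)
        return 0;
    size_t got = fread(buf, 1, need, f);
    fclose(f);
    return got;
}
```

If a few rules use very large offsets, seeking to each offset with `fseek` may be cheaper than reading one long prefix; which is better depends on the rule set.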
Hint 3: Technical Details
Pseudocode:
    for each rule:
        read bytes at offset
        if bytes match pattern: report label
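The pseudocode above could be turned into C roughly as follows. This is a sketch with a hard-coded rule table standing in for rules loaded from a file; the `sig_rule` struct, the `RULES` table, and `identify` are hypothetical names. It reports the first rule that matches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rule record: offset + bytes + label. */
typedef struct {
    size_t      offset;
    const char *bytes;    /* may hold non-printable bytes, so length is explicit */
    size_t      length;
    const char *label;
} sig_rule;

/* Small built-in table for illustration; the real tool would load these. */
static const sig_rule RULES[] = {
    {   0, "\x89PNG\r\n\x1a\n", 8, "PNG" },
    {   0, "%PDF-",             5, "PDF" },
    { 257, "ustar",             5, "TAR" },
};

/* Returns the label of the first matching rule, or NULL if nothing matches. */
static const char *identify(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        const sig_rule *r = &RULES[i];
        /* Skip rules whose byte range falls outside the data we have. */
        if (r->length > len || r->offset > len - r->length)
            continue;
        if (memcmp(buf + r->offset, r->bytes, r->length) == 0)
            return r->label;
    }
    return NULL;
}
```

A caller would print `matched rule "<label>"` when `identify` returns a label and `unknown` when it returns `NULL`, mirroring the sample output earlier in this document.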
Hint 4: Tools/Debugging
Compare your results with the output of file(1) for the same inputs.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File signatures | “Practical Binary Analysis” | Ch. 3 |
| Binary I/O | “The C Programming Language” | Ch. 7 |
Common Pitfalls and Debugging
Problem 1: “Matches are inconsistent”
- Why: You are not reading at the correct offset, or the file was opened in text mode, which can alter bytes on some platforms.
- Fix: Print the bytes you compare and verify offset math.
- Quick test: Create a file with a known pattern at offset 0.
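To follow the "print the bytes you compare" advice, a small formatter that renders the bytes at a given offset as hex is often enough. `hex_at` is a hypothetical helper written for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes up to `n` bytes of `buf`, starting at `offset`, into `dst` as
   space-separated lowercase hex ("89 50 4e 47"). Stops early at the end of
   the buffer or when `dst` is full. Returns the number of characters written,
   not counting the terminating NUL. */
static size_t hex_at(char *dst, size_t dst_len,
                     const uint8_t *buf, size_t buf_len,
                     size_t offset, size_t n)
{
    size_t used = 0;

    if (dst_len == 0)
        return 0;
    dst[0] = '\0';
    for (size_t i = 0; i < n && offset < buf_len && i < buf_len - offset; i++) {
        if (dst_len - used < 4)            /* room for " xx" plus NUL */
            break;
        used += (size_t)snprintf(dst + used, dst_len - used,
                                 i ? " %02x" : "%02x",
                                 (unsigned)buf[offset + i]);
    }
    return used;
}
```

Printing this string next to the rule's expected pattern makes an off-by-one offset visible immediately, because the expected bytes appear shifted by one position.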
Definition of Done
- Loads signature rules from a file
- Tests multiple offsets per file
- Matches at least 5 known file types
- Includes a false-positive test set