Project 8: File Signature (Magic Number) Identifier

  • File: P08-file-signature-id.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Python
  • Coolness Level: Level 3 (See REFERENCE.md)
  • Business Potential: Level 2 (See REFERENCE.md)
  • Difficulty: Level 3 (See REFERENCE.md)
  • Knowledge Area: File Forensics
  • Software or Tool: CLI
  • Main Book: “Practical Binary Analysis”

What you will build: A command-line tool that reads bytes at rule-specified offsets and identifies each file's type by matching those bytes against known magic-number patterns.

Why it teaches binary/hex: File identification relies on magic-number rules evaluated against byte sequences.

Core challenges you will face:

  • Binary file I/O -> Encoding & Forensics
  • Pattern matching -> Bits/Bytes/Nibbles
  • Rule-driven parsing -> Encoding & Forensics

Real World Outcome

$ magicid samples/
file.bin: unknown
image.bin: matched rule "PNG"
doc.bin: matched rule "PDF"

The Core Question You Are Answering

“How can you identify a file by its bytes instead of its name?”

Concepts You Must Understand First

  1. Magic patterns
    • What do magic files test?
    • Book Reference: “Practical Binary Analysis” - Ch. 3
  2. Offsets and types
    • Why do rules specify offsets and data types?
    • Book Reference: “The C Programming Language” - Ch. 7
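One reason rules carry a data type as well as an offset: a multi-byte field only has a value once you fix its width and byte order. The sketch below is a minimal illustration, not a prescribed design; the helper names `read_u32_be` and `read_u32_le` are hypothetical. It decodes the same four bytes both ways and refuses to read past the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: decode a 32-bit field from buf at offset off.
   Each returns 0 on success, or -1 if the field would run past the buffer. */
static int read_u32_be(const uint8_t *buf, size_t len, size_t off, uint32_t *out)
{
    if (off > len || len - off < 4)
        return -1;
    /* Big-endian: the first byte is the most significant. */
    *out = ((uint32_t)buf[off] << 24) | ((uint32_t)buf[off + 1] << 16) |
           ((uint32_t)buf[off + 2] << 8) | (uint32_t)buf[off + 3];
    return 0;
}

static int read_u32_le(const uint8_t *buf, size_t len, size_t off, uint32_t *out)
{
    if (off > len || len - off < 4)
        return -1;
    /* Little-endian: the first byte is the least significant. */
    *out = (uint32_t)buf[off] | ((uint32_t)buf[off + 1] << 8) |
           ((uint32_t)buf[off + 2] << 16) | ((uint32_t)buf[off + 3] << 24);
    return 0;
}
```

For the PNG signature bytes 89 50 4E 47, the big-endian read yields 0x89504E47 and the little-endian read yields 0x474E5089. Assembling the value from individual bytes, rather than casting the buffer to a `uint32_t *`, keeps the result independent of the host machine's byte order and alignment rules.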

Questions to Guide Your Design

  1. Rule format
    • Will you define your own rule syntax or parse an existing one?
  2. Matching
    • How will you handle overlapping patterns?
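Both design questions can be explored with a small in-memory model. The sketch below assumes a hypothetical `struct rule` (offset + bytes + label) and one possible policy for overlapping patterns: when several rules match, the one with the longest pattern wins because it is the most specific. Other policies, such as explicit priorities or reporting every match, are just as valid.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rule layout: offset + bytes + label. */
struct rule {
    size_t offset;
    const uint8_t *pattern;
    size_t length;
    const char *label;
};

/* Return the matching rule with the longest pattern, or NULL if none match.
   buf holds the file's bytes (or a prefix of them); len is its size. */
static const struct rule *best_match(const struct rule *rules, size_t nrules,
                                     const uint8_t *buf, size_t len)
{
    const struct rule *best = NULL;
    for (size_t i = 0; i < nrules; i++) {
        const struct rule *r = &rules[i];
        if (r->offset > len || len - r->offset < r->length)
            continue; /* rule would read past the available data */
        if (memcmp(buf + r->offset, r->pattern, r->length) != 0)
            continue; /* bytes differ */
        if (best == NULL || r->length > best->length)
            best = r; /* prefer the more specific (longer) pattern */
    }
    return best;
}
```

With one rule for the four bytes `GIF8` and another for the six bytes `GIF89a`, a file starting with `GIF89a` matches both rules and is reported under the longer one, while a file starting with `GIF87a` falls back to the shorter rule.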

Thinking Exercise

Offset Reasoning

Imagine a signature rule that checks bytes 0-3. Why might another rule check bytes at offset 512?

Questions to answer:

  • What does an offset mean in the file layout?
  • How do you avoid false positives?

The Interview Questions They Will Ask

  1. “How does file(1) identify file types?”
  2. “What is a magic number?”
  3. “Why are offsets part of signature rules?”
  4. “How would you design a signature database?”
  5. “How do you handle endian-specific fields in a signature?”

Hints in Layers

Hint 1: Starting Point Start with a small JSON or text file of rules: offset + bytes + label.
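If you choose the text-file route, one possible line format is `<decimal offset> <hex bytes> <label>`, for example `0 89504E47 PNG`. That format, the `struct parsed_rule` type, and the `parse_rule_line` function below are all assumptions made for illustration, not a standard. The sketch parses one line and rejects odd-length, over-long, or non-hex patterns.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATTERN 16 /* longest pattern, in bytes, this sketch accepts */
#define MAX_LABEL 32

struct parsed_rule {
    unsigned long offset;
    uint8_t pattern[MAX_PATTERN];
    size_t length;
    char label[MAX_LABEL];
};

/* Parse one line such as "0 89504E47 PNG".
   Returns 0 on success, -1 on malformed input. */
static int parse_rule_line(const char *line, struct parsed_rule *out)
{
    char hex[2 * MAX_PATTERN + 1];
    int pos = 0;

    /* Offset, then up to 32 hex digits; %n records how far parsing got. */
    if (sscanf(line, "%lu %32[0-9A-Fa-f]%n", &out->offset, hex, &pos) != 2)
        return -1;
    /* The hex token must end here; otherwise it was too long or malformed. */
    if (!isspace((unsigned char)line[pos]))
        return -1;
    if (sscanf(line + pos, "%31s", out->label) != 1)
        return -1;

    size_t digits = strlen(hex);
    if (digits % 2 != 0)
        return -1; /* a pattern must consist of whole bytes */
    for (size_t i = 0; i < digits; i += 2) {
        unsigned int byte = 0;
        sscanf(hex + i, "%2x", &byte); /* two hex digits make one byte */
        out->pattern[i / 2] = (uint8_t)byte;
    }
    out->length = digits / 2;
    return 0;
}
```

Calling `parse_rule_line("257 7573746172 TAR", &r)` fills `r` with offset 257 and the five bytes of the ASCII string `ustar`. A real rule file would likely also need comments, blank-line handling, and labels that contain spaces.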

Hint 2: Next Level Read only as many bytes as the farthest-reaching rule needs (its offset plus its pattern length) instead of reading the whole file.
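A minimal sketch of this idea, assuming a hypothetical `struct rule` layout: compute the largest `offset + length` across all rules, then read just that many bytes from the start of each file. The function names `bytes_needed` and `read_prefix` are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical rule layout: offset + bytes + label. */
struct rule {
    size_t offset;
    const uint8_t *pattern;
    size_t length;
    const char *label;
};

/* Smallest prefix that covers every rule: the largest offset + length. */
static size_t bytes_needed(const struct rule *rules, size_t nrules)
{
    size_t need = 0;
    for (size_t i = 0; i < nrules; i++) {
        size_t end = rules[i].offset + rules[i].length;
        if (end > need)
            need = end;
    }
    return need;
}

/* Read up to `need` bytes from the start of `path` into `buf`.
   Stores the count actually read in *got (smaller for short files).
   Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_prefix(const char *path, uint8_t *buf, size_t need, size_t *got)
{
    FILE *fp = fopen(path, "rb"); /* "rb": binary mode, no newline translation */
    if (fp == NULL)
        return -1;
    *got = fread(buf, 1, need, fp);
    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : 0;
}
```

With a 4-byte rule at offset 0 and a 5-byte rule at offset 257, `bytes_needed` returns 262, so each file costs a single 262-byte read. Because `*got` can be smaller than `need`, every later comparison still has to check that the rule fits inside the bytes actually read.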

Hint 3: Technical Details Pseudocode:

for each rule:
  read bytes at offset
  if bytes match pattern: report label
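The pseudocode above can be sketched in C roughly as follows. This is one straightforward reading of it, not the only one: it seeks to each rule's offset, reads exactly the pattern's length, and reports the first rule that matches. The `struct rule` layout, the `MAX_PATTERN` limit, and the `identify` function name are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATTERN 16 /* longest pattern this sketch will compare */

/* Hypothetical rule layout: offset + bytes + label. */
struct rule {
    size_t offset;
    const uint8_t *pattern;
    size_t length;
    const char *label;
};

/* Return the label of the first rule that matches, or NULL for "unknown".
   fp must be open for reading in binary mode. */
static const char *identify(FILE *fp, const struct rule *rules, size_t nrules)
{
    uint8_t window[MAX_PATTERN];

    for (size_t i = 0; i < nrules; i++) {
        const struct rule *r = &rules[i];
        if (r->length == 0 || r->length > MAX_PATTERN)
            continue; /* skip rules this sketch cannot evaluate */
        if (fseek(fp, (long)r->offset, SEEK_SET) != 0)
            continue; /* could not move to the offset */
        if (fread(window, 1, r->length, fp) != r->length)
            continue; /* file ends before the pattern would */
        if (memcmp(window, r->pattern, r->length) == 0)
            return r->label;
    }
    return NULL;
}
```

Checking the return value of `fread` matters here: a short read means the file is smaller than `offset + length`, and comparing a partially filled buffer is a common source of false matches.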

Hint 4: Tools/Debugging Compare results with the file command for the same inputs.

Books That Will Help

Topic           | Book                          | Chapter
File signatures | “Practical Binary Analysis”   | Ch. 3
Binary I/O      | “The C Programming Language”  | Ch. 7

Common Pitfalls and Debugging

Problem 1: “Matches are inconsistent”

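  • Why: You are probably not reading at the offset you think you are, for example because of an off-by-one in the offset math or an unchecked short read near the end of the file.
  • Fix: Print the bytes you compare next to the bytes the rule expects, and verify the offset math.
  • Quick test: Create a file with a known pattern at offset 0 and confirm that the rule for that pattern matches.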
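To make the "print the bytes you compare" step concrete, here is a small debugging sketch. The helper names `format_hex` and `log_compare` are hypothetical; the idea is simply to log the bytes read and the bytes expected side by side for each rule.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write n bytes as uppercase hex pairs ("89 50 4E 47") into out.
   Output is truncated at a byte boundary if out is too small. */
static void format_hex(const uint8_t *bytes, size_t n, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz == 0)
        return;
    out[0] = '\0';
    /* Each byte needs up to 3 characters plus the terminating NUL. */
    for (size_t i = 0; i < n && used + 4 <= outsz; i++)
        used += (size_t)snprintf(out + used, outsz - used,
                                 i ? " %02X" : "%02X", bytes[i]);
}

/* Log both sides of a comparison so a wrong offset is visible at a glance. */
static void log_compare(FILE *log, size_t offset, const uint8_t *got,
                        const uint8_t *want, size_t n)
{
    char a[64], b[64];
    format_hex(got, n, a, sizeof a);
    format_hex(want, n, b, sizeof b);
    fprintf(log, "offset %zu: got [%s] want [%s]\n", offset, a, b);
}
```

For a correctly read PNG header, `log_compare(stderr, 0, got, want, 4)` prints `offset 0: got [89 50 4E 47] want [89 50 4E 47]`. If the "got" side shows the expected bytes shifted by one position, the offset math is the likely culprit.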

Definition of Done

  • Loads signature rules from a file
  • Tests multiple offsets per file
  • Matches at least 5 known file types
  • Includes a false-positive test set