Project 6: ASCII/UTF-8 Byte Inspector

  • File: P06-ascii-utf8-inspector.md
  • Main Programming Language: C or Python
  • Alternative Programming Languages: Rust, Go, JavaScript
  • Coolness Level: Level 2 (See REFERENCE.md)
  • Business Potential: Level 1 (See REFERENCE.md)
  • Difficulty: Level 2 (See REFERENCE.md)
  • Knowledge Area: Encoding
  • Software or Tool: CLI
  • Main Book: “The C Programming Language”

What you will build: A tool that reads a file and prints the byte offset, hex value, and ASCII/UTF-8 interpretation.

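Why it teaches binary/hex: ASCII uses only 7 bits per character, and UTF-8 encodes each code point in 1-4 octets while leaving ASCII bytes unchanged. Reading the hex shows exactly where single-byte ASCII stops and multi-byte sequences begin.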

Core challenges you will face:

  • Encoding detection -> Encoding & Forensics
  • Byte inspection -> Bits/Bytes/Nibbles
  • UTF-8 validation -> Encoding & Forensics

Real World Outcome

$ textinspect sample.txt
00000000  48 65 6c 6c 6f 0a   ASCII: Hello.

The Core Question You Are Answering

“What do bytes mean when I claim they are text?”

Concepts You Must Understand First

  1. ASCII
    • Why is it 7-bit?
    • Book Reference: “The C Programming Language” - Ch. 7
  2. UTF-8
    • How does it encode 1-4 octets?
    • Book Reference: “The C Programming Language” - Ch. 7 (covers byte-level I/O only; the book predates UTF-8)
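
To make the "1-4 octets" claim concrete, the sketch below encodes four sample characters with Python's built-in `str.encode` and prints how many octets each one takes. The sample characters and the helper name `utf8_hex` are illustrative choices, not part of the project specification.

```python
# Sample code points chosen to need 1, 2, 3, and 4 octets respectively.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ")

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  {len(encoded)} octet(s)  {utf8_hex(ch)}")

# ASCII compatibility: an ASCII character has the same byte in both encodings.
assert "A".encode("utf-8") == "A".encode("ascii")
```

The first row is a single byte below 0x80, which is why any pure ASCII file is already valid UTF-8.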

Questions to Guide Your Design

  1. Output format
    • How will you align hex and text columns?
  2. Validation
    • Will you reject invalid UTF-8 sequences or mark them?
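
The reject-or-mark choice can be tried out before writing your own validator. This sketch leans on Python's built-in decoder rather than a hand-written one; the function name `inspect_text` and its `strict` flag are assumptions made for illustration.

```python
def inspect_text(data: bytes, strict: bool = True) -> str:
    """Decode data as UTF-8, either rejecting or marking invalid bytes."""
    if strict:
        # Policy 1 (reject): raises UnicodeDecodeError at the first bad sequence.
        return data.decode("utf-8")
    # Policy 2 (mark): substitutes U+FFFD, the replacement character.
    return data.decode("utf-8", errors="replace")

bad = b"Hi\xff!"  # 0xFF never appears in valid UTF-8

print(repr(inspect_text(bad, strict=False)))
try:
    inspect_text(bad)
except UnicodeDecodeError as err:
    print(f"rejected at byte offset {err.start}")
```

Marking keeps the offsets of later bytes meaningful, which tends to suit an inspector; rejecting suits a pass/fail validator.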

Thinking Exercise

ASCII in Hex

Write the hex values for the string “Hi”, then mark the highest byte value that still counts as ASCII in UTF-8.

Questions to answer:

  • Why does UTF-8 preserve ASCII bytes?
  • How would you display a non-ASCII byte?

The Interview Questions They Will Ask

  1. “Why is ASCII compatible with UTF-8?”
  2. “How do you validate UTF-8 byte sequences?”
  3. “Why is ASCII only 7-bit?”
  4. “What should a hexdump show for control characters?”
  5. “How do you display non-printable bytes?”

Hints in Layers

Hint 1: Starting Point
Print offset, hex bytes, and a ‘.’ for non-printable bytes.

Hint 2: Next Level
Implement a minimal UTF-8 validator that checks leading byte patterns.
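
One possible shape for the leading-byte check is sketched below. The function name `sequence_length` is hypothetical. It classifies only the first byte of a sequence and does not look at the continuation bytes that follow.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length a leading byte announces, or 0 if it cannot start one."""
    if lead < 0x80:            # 0xxxxxxx: single-byte ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes (above 0xF4 exceeds U+10FFFF)
        return 4
    return 0                   # continuation byte (10xxxxxx) or a never-valid byte

for byte in (0x48, 0xC3, 0xE2, 0xF0, 0x80, 0xFF):
    print(f"{byte:02x} -> {sequence_length(byte)}")
```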

Hint 3: Technical Details
Pseudocode:

for each byte:
  if 0x20 <= byte <= 0x7E: print ASCII char
  else: print '.'
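
The pseudocode above can be grown into a runnable Python sketch that also prints the offset and hex columns. The names `format_line` and `dump` and the 16-bytes-per-row default are assumptions; the column layout is one reasonable choice and does not reproduce the sample output's "ASCII:" label.

```python
def format_line(offset: int, chunk: bytes, width: int = 16) -> str:
    """Format one row: 8-digit hex offset, hex bytes, then printable text."""
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    # Printable ASCII is 0x20..0x7E; every other byte is shown as '.'.
    text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
    pad = width * 3 - 1  # pad short rows so the text column stays aligned
    return f"{offset:08x}  {hex_part:<{pad}}  {text}"

def dump(data: bytes, width: int = 16) -> None:
    """Print data one row at a time."""
    for offset in range(0, len(data), width):
        print(format_line(offset, data[offset:offset + width], width))

dump(b"Hello\n")
```

To inspect a real file, read it in binary mode, for example `dump(open(path, "rb").read())`, so that no decoding happens before the bytes are examined.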

Hint 4: Tools/Debugging
Compare your output with xxd for the same file.

Books That Will Help

Topic         | Book                          | Chapter
I/O and text  | “The C Programming Language”  | Ch. 7

Common Pitfalls and Debugging

Problem 1: “My UTF-8 validator rejects valid text”

  • Why: You mis-handle continuation bytes.
  • Fix: Ensure continuation bytes begin with 10xxxxxx.
  • Quick test: Validate a pure ASCII file, then a file containing a multi-byte character such as “é” (bytes c3 a9); both should pass.
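
A minimal sketch of the continuation-byte check follows, with hypothetical names `is_continuation` and `first_invalid_offset`. It verifies only the leading-byte ranges and the 10xxxxxx pattern. It deliberately omits the stricter second-byte rules (overlong three- and four-byte forms, surrogates, values above U+10FFFF), so it accepts some sequences that a full validator would reject.

```python
def is_continuation(byte: int) -> bool:
    """True when the top two bits are 10, i.e. the byte matches 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first malformed sequence, or -1 if none is found."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            need = 0            # ASCII: no continuation bytes
        elif 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            return i            # stray continuation byte or never-valid lead
        tail = data[i + 1 : i + 1 + need]
        # A truncated tail or a non-10xxxxxx byte makes the sequence malformed.
        if len(tail) < need or not all(is_continuation(b) for b in tail):
            return i
        i += 1 + need
    return -1

print(first_invalid_offset(b"Hello\n"))
print(first_invalid_offset("\u00e9".encode("utf-8")))
print(first_invalid_offset(b"A\xc3\x28"))
```

Returning an offset instead of a boolean makes it easy to point at the failing byte in the hex column.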

Definition of Done

  • Prints offset, hex, and ASCII columns
  • Handles non-printable bytes consistently
  • Validates UTF-8 sequences
  • Offsets and hex bytes agree with xxd output for ASCII files