Project 6: ASCII/UTF-8 Byte Inspector
- File: P06-ascii-utf8-inspector.md
- Main Programming Language: C or Python
- Alternative Programming Languages: Rust, Go, JavaScript
- Coolness Level: Level 2 (See REFERENCE.md)
- Business Potential: Level 1 (See REFERENCE.md)
- Difficulty: Level 2 (See REFERENCE.md)
- Knowledge Area: Encoding
- Software or Tool: CLI
- Main Book: “The C Programming Language”
What you will build: A tool that reads a file and prints the byte offset, hex value, and ASCII/UTF-8 interpretation.
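Why it teaches binary/hex: ASCII is a 7-bit code, and UTF-8 encodes each code point in 1 to 4 octets while keeping every ASCII character as the same single byte. Telling the two apart means reading the high bits of each byte, which is easiest in hex and binary.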
Core challenges you will face:
- Encoding detection -> Encoding & Forensics
- Byte inspection -> Bits/Bytes/Nibbles
- UTF-8 validation -> Encoding & Forensics
Real World Outcome
$ textinspect sample.txt
00000000 48 65 6c 6c 6f 0a ASCII: Hello.
The Core Question You Are Answering
“What do bytes mean when I claim they are text?”
Concepts You Must Understand First
- ASCII
- Why is it 7-bit?
- Book Reference: “The C Programming Language” - Ch. 7
- UTF-8
- How does it encode 1-4 octets?
- Book Reference: “The C Programming Language” - Ch. 7
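The 1-4 octet question can be explored by encoding a code point by hand. The sketch below is illustrative only: `encode_utf8` is a hypothetical helper name, and it checks itself against Python's built-in encoder rather than claiming to be a reference implementation.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample per sequence length: U+0041, U+00E9, U+20AC, U+1F600.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = encode_utf8(ord(ch))
    assert encoded == ch.encode("utf-8")  # cross-check with the built-in encoder
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Only the first branch produces a byte below 0x80, which is why a pure ASCII file is already valid UTF-8.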
Questions to Guide Your Design
- Output format
- How will you align hex and text columns?
- Validation
- Will you reject invalid UTF-8 sequences or mark them?
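To get a feel for the reject-or-mark choice before writing your own validator, Python's built-in decoder supports both behaviours. This is a sketch of the two policies, not the project's required design; the function names and the sample bytes are made up for illustration.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Reject policy: return the offset where strict decoding fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def mark_invalid(data: bytes) -> str:
    """Mark policy: undecodable bytes are replaced with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# A valid two-byte sequence (c3 a9) followed by a stray 0xff.
sample = b"Hi \xc3\xa9 \xff!"
print(first_invalid_offset(sample))
print(ascii(mark_invalid(sample)))
```

Rejecting gives a precise offset to report. Marking keeps the dump going, which is usually what you want from an inspector that is meant to show damaged files.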
Thinking Exercise
ASCII in Hex
Write the hex values for the string “Hi”, then mark where the ASCII range ends in UTF-8: the last byte value that still encodes a character on its own.
Questions to answer:
- Why does UTF-8 preserve ASCII bytes?
- How would you display a non-ASCII byte?
The Interview Questions They Will Ask
- “Why is ASCII compatible with UTF-8?”
- “How do you validate UTF-8 byte sequences?”
- “Why is ASCII only 7-bit?”
- “What should a hexdump show for control characters?”
- “How do you display non-printable bytes?”
Hints in Layers
Hint 1: Starting Point
Print offset, hex bytes, and a ‘.’ for non-printable bytes.
Hint 2: Next Level
Implement a minimal UTF-8 validator that checks leading byte patterns.
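A minimal validator along the lines of Hint 2 might look like the sketch below. The function names are hypothetical, and "minimal" is meant literally: it checks the leading byte and the 10xxxxxx shape of each following byte, but it does not reject overlong three- and four-byte forms, encoded surrogates, or values above U+10FFFF that start with 0xF4.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a leading byte, or 0 if it cannot start one."""
    if lead <= 0x7F:            # 0xxxxxxx: ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx (leads above 0xF4 exceed U+10FFFF)
        return 4
    return 0                    # continuation byte or invalid leading byte


def is_valid_utf8_minimal(data: bytes) -> bool:
    """Check leading-byte patterns and continuation bytes only."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0 or i + n > len(data):
            return False        # bad lead, or sequence truncated at end of input
        # Every byte after the lead must look like 10xxxxxx.
        if any((b & 0xC0) != 0x80 for b in data[i + 1:i + n]):
            return False
        i += n
    return True


print(is_valid_utf8_minimal(b"Hello\n"))
print(is_valid_utf8_minimal(b"\xc3\x28"))
```

Returning a boolean keeps the sketch short. An inspector would more likely return the failing offset so the dump can flag it.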
Hint 3: Technical Details
Pseudocode:
for each byte:
    if 0x20 <= byte <= 0x7E: print ASCII char
    else: print '.'
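The pseudocode can be turned into a small line formatter. The sketch below reproduces the sample line from Real World Outcome. The names `format_line` and `dump` and the 16-byte row width are assumptions, and it makes no attempt to pad a short final row so that the text column stays aligned.

```python
def format_line(offset: int, chunk: bytes) -> str:
    """Format one row: 8-digit hex offset, hex bytes, then the ASCII view."""
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    # Printable ASCII is 0x20..0x7E; everything else is shown as '.'.
    text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
    return f"{offset:08x} {hex_part} ASCII: {text_part}"


def dump(data: bytes, width: int = 16):
    """Yield one formatted row per `width` bytes of input."""
    for offset in range(0, len(data), width):
        yield format_line(offset, data[offset:offset + width])


for line in dump(b"Hello\n"):
    print(line)
# prints: 00000000 48 65 6c 6c 6f 0a ASCII: Hello.
```

To inspect a real file, open it in binary mode (`open(path, "rb")`) and pass the bytes to `dump`. Text mode would decode the file before you ever see the raw bytes.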
Hint 4: Tools/Debugging
Compare your output with xxd for the same file.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| I/O and text | “The C Programming Language” | Ch. 7 |
Common Pitfalls and Debugging
Problem 1: “My UTF-8 validator rejects valid text”
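- Why: Continuation bytes are mishandled, for example by testing them against the wrong bit mask or by consuming the wrong number of them after the leading byte.
- Fix: Check that every byte after the leading byte matches 10xxxxxx, i.e. (byte & 0xC0) == 0x80, and that the number of continuation bytes matches what the leading byte announces.
- Quick test: Validate a pure ASCII file; it should pass. Then validate a file containing a multi-byte character such as “é” (c3 a9); it should also pass.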
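When the validator misbehaves, it can help to print how each byte is being classified. The helper below is a debugging sketch with a made-up name, `classify`. It tests bit masks only, so it labels bytes such as 0xC0, 0xC1, and 0xF5-0xF7 as leading bytes even though they never occur in valid UTF-8.

```python
def classify(byte: int) -> str:
    """Classify a byte by its high bits (mask test only, see caveat above)."""
    if (byte & 0x80) == 0x00:
        return "ascii"          # 0xxxxxxx
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx
    if (byte & 0xE0) == 0xC0:
        return "lead-2"         # 110xxxxx
    if (byte & 0xF0) == 0xE0:
        return "lead-3"         # 1110xxxx
    if (byte & 0xF8) == 0xF0:
        return "lead-4"         # 11110xxx
    return "invalid"            # 11111xxx


# "a", U+00E9 and U+20AC: one-, two- and three-byte sequences.
for b in "a\u00e9\u20ac".encode("utf-8"):
    print(f"{b:02x} {b:08b} {classify(b)}")
```

If a byte you expect to be a continuation shows up under another label, the mask or the comparison value in your own check is the first place to look.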
Definition of Done
- Prints offset, hex, and ASCII columns
- Handles non-printable bytes consistently
- Validates UTF-8 sequences
- Matches xxd output layout for ASCII files