Project 5: Build Your Own grep (Simplified)
Implement a simplified grep to understand regex engines and streaming search.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Python (Alternatives: Go, Rust) |
| Prerequisites | Regex basics, file I/O |
| Key Topics | regex matching, streaming, CLI UX |
1. Learning Objectives
By completing this project, you will:
- Implement a basic regex matcher (subset).
- Stream input from files and STDIN safely.
- Support grep-like flags and exit codes.
- Understand performance tradeoffs in regex engines.
- Produce predictable output suitable for pipelines.
2. Theoretical Foundation
2.1 Core Concepts
- Regex Engines: NFA backtracking is simplest to implement but can be slow.
- Anchoring:
^and$control position-based matches. - Character Classes:
[]expand matching to sets of characters. - Streaming Search: Line-by-line matching avoids memory blowups.
2.2 Why This Matters
Building your own grep forces you to understand regex mechanics and why certain patterns cause performance issues.
2.3 Common Misconceptions
- “Regex is just substring search”: It is formal pattern matching.
- “All regex behave the same”: engines differ in backtracking and performance.
3. Project Specification
3.1 What You Will Build
A CLI named mygrep that:
- Accepts a pattern and file list
- Supports
-i,-n,-v - Matches lines using your own regex engine (subset)
3.2 Functional Requirements
- Input: Files or STDIN.
- Pattern Matching: Implement literal,
.,*,?, and[]. - Flags: case-insensitive, line numbers, invert match.
- Exit Codes: 0 if match, 1 if no match, 2 on error.
3.3 Non-Functional Requirements
- Correctness: Match semantics must be consistent.
- Performance: Handle large files line-by-line.
- Compatibility: Output should match grep style.
3.4 Example Usage / Output
$ mygrep "error.*timeout" app.log
[2025-01-01] error: timeout after 30s
3.5 Real World Outcome
When run on logs, mygrep filters matching lines exactly like grep. It works in pipelines and returns correct exit codes for automation.
4. Solution Architecture
4.1 High-Level Design
+---------+ +---------+ +---------+
| Reader | -> | Matcher | -> | Output |
+---------+ +---------+ +---------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Reader | Stream input | buffered read |
| Matcher | Regex engine | backtracking NFA |
| Output | Print matches | grep-style format |
4.3 Data Structures
class Token:
kind: str
value: str
4.4 Algorithm Overview
Key Algorithm: Backtracking matcher
- Tokenize pattern.
- For each line, attempt match at each position.
- Support
*and?with recursion/backtracking.
Complexity Analysis:
- Time: potentially exponential in worst case
- Space: O(P) for pattern tokens
5. Implementation Guide
5.1 Development Environment Setup
python3 -m venv .venv
source .venv/bin/activate
5.2 Project Structure
mygrep/
├── mygrep.py
├── matcher.py
└── README.md
5.3 The Core Question You Are Answering
“How does grep decide whether a line matches a pattern, and why can that be slow?”
5.4 Concepts You Must Understand First
- Regex operators and precedence
- Backtracking vs deterministic matching
- Line-oriented processing
5.5 Questions to Guide Your Design
- How do you handle
*without infinite loops? - Should matching be greedy or minimal?
- How do you support
-iefficiently?
5.6 Thinking Exercise
Take the pattern a*b and text aaac. Trace how your matcher backtracks.
5.7 The Interview Questions They Will Ask
- Why can regex be exponential?
- How do you implement character classes?
- How do you match from every line position?
5.8 Hints in Layers
Hint 1: Implement literal matching first.
Hint 2: Add . and ? before *.
Hint 3: Add character classes last.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Regex theory | “Mastering Regular Expressions” | Ch. 1-4 |
| NFA/DFA | “Introduction to Automata Theory” | Ch. 2 |
5.10 Implementation Phases
Phase 1: Literal matching
- basic substring search
Phase 2: Regex support
- add ., *, ?, []
Phase 3: CLI polish
- flags and exit codes
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Engine | backtracking NFA | backtracking | simpler to implement |
| Flags | subset vs full grep | subset | manageable scope |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Matcher | pattern cases |
| Integration Tests | CLI | file input |
| Edge Cases | empty lines | no matches |
6.2 Critical Test Cases
a*bmatchesaaab.^foomatches start only.-vreturns non-matching lines.
7. Common Pitfalls and Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Infinite loops with * | hang | add recursion limit |
| Wrong precedence | false matches | explicit parsing |
| Slow performance | timeouts | restrict features |
8. Extensions and Challenges
8.1 Beginner Extensions
- Add
--countflag
8.2 Intermediate Extensions
- Add
+quantifier
8.3 Advanced Extensions
- Implement DFA-based matcher
9. Real-World Connections
- Search tools (ripgrep, ack)
- Log analysis pipelines
10. Resources
- Python
redocs for comparison - GNU grep source code
11. Self-Assessment Checklist
- I can explain backtracking regex
- I can implement character classes
12. Submission / Completion Criteria
Minimum Viable Completion:
- Literal matching + file input
Full Completion:
- Regex subset + flags
Excellence (Going Above and Beyond):
- DFA engine or performance improvements