Project 5: Build Your Own grep (Simplified)

Implement a simplified grep to understand regex engines and streaming search.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 2-3 weeks
Language Python (Alternatives: Go, Rust)
Prerequisites Regex basics, file I/O
Key Topics regex matching, streaming, CLI UX

1. Learning Objectives

By completing this project, you will:

  1. Implement a basic regex matcher (subset).
  2. Stream input from files and STDIN safely.
  3. Support grep-like flags and exit codes.
  4. Understand performance tradeoffs in regex engines.
  5. Produce predictable output suitable for pipelines.

2. Theoretical Foundation

2.1 Core Concepts

  • Regex Engines: NFA backtracking is simplest to implement but can be slow.
  • Anchoring: ^ and $ control position-based matches.
  • Character Classes: [] expand matching to sets of characters.
  • Streaming Search: Line-by-line matching avoids memory blowups.

2.2 Why This Matters

Building your own grep forces you to understand regex mechanics and why certain patterns cause performance issues.

2.3 Common Misconceptions

  • “Regex is just substring search”: It is formal pattern matching.
  • “All regex behave the same”: engines differ in backtracking and performance.

3. Project Specification

3.1 What You Will Build

A CLI named mygrep that:

  • Accepts a pattern and file list
  • Supports -i, -n, -v
  • Matches lines using your own regex engine (subset)

3.2 Functional Requirements

  1. Input: Files or STDIN.
  2. Pattern Matching: Implement literal, ., *, ?, and [].
  3. Flags: case-insensitive, line numbers, invert match.
  4. Exit Codes: 0 if match, 1 if no match, 2 on error.

3.3 Non-Functional Requirements

  • Correctness: Match semantics must be consistent.
  • Performance: Handle large files line-by-line.
  • Compatibility: Output should match grep style.

3.4 Example Usage / Output

$ mygrep "error.*timeout" app.log
[2025-01-01] error: timeout after 30s

3.5 Real World Outcome

When run on logs, mygrep filters matching lines exactly like grep. It works in pipelines and returns correct exit codes for automation.


4. Solution Architecture

4.1 High-Level Design

+---------+    +---------+    +---------+
| Reader  | -> | Matcher | -> | Output  |
+---------+    +---------+    +---------+

4.2 Key Components

Component Responsibility Key Decisions
Reader Stream input buffered read
Matcher Regex engine backtracking NFA
Output Print matches grep-style format

4.3 Data Structures

class Token:
    kind: str
    value: str

4.4 Algorithm Overview

Key Algorithm: Backtracking matcher

  1. Tokenize pattern.
  2. For each line, attempt match at each position.
  3. Support * and ? with recursion/backtracking.

Complexity Analysis:

  • Time: potentially exponential in worst case
  • Space: O(P) for pattern tokens

5. Implementation Guide

5.1 Development Environment Setup

python3 -m venv .venv
source .venv/bin/activate

5.2 Project Structure

mygrep/
├── mygrep.py
├── matcher.py
└── README.md

5.3 The Core Question You Are Answering

“How does grep decide whether a line matches a pattern, and why can that be slow?”

5.4 Concepts You Must Understand First

  1. Regex operators and precedence
  2. Backtracking vs deterministic matching
  3. Line-oriented processing

5.5 Questions to Guide Your Design

  1. How do you handle * without infinite loops?
  2. Should matching be greedy or minimal?
  3. How do you support -i efficiently?

5.6 Thinking Exercise

Take the pattern a*b and text aaac. Trace how your matcher backtracks.

5.7 The Interview Questions They Will Ask

  1. Why can regex be exponential?
  2. How do you implement character classes?
  3. How do you match from every line position?

5.8 Hints in Layers

Hint 1: Implement literal matching first.

Hint 2: Add . and ? before *.

Hint 3: Add character classes last.

5.9 Books That Will Help

Topic Book Chapter
Regex theory “Mastering Regular Expressions” Ch. 1-4
NFA/DFA “Introduction to Automata Theory” Ch. 2

5.10 Implementation Phases

Phase 1: Literal matching

  • basic substring search

Phase 2: Regex support

  • add ., *, ?, []

Phase 3: CLI polish

  • flags and exit codes

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Engine backtracking NFA backtracking simpler to implement
Flags subset vs full grep subset manageable scope

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Matcher pattern cases
Integration Tests CLI file input
Edge Cases empty lines no matches

6.2 Critical Test Cases

  1. a*b matches aaab.
  2. ^foo matches start only.
  3. -v returns non-matching lines.

7. Common Pitfalls and Debugging

Pitfall Symptom Solution
Infinite loops with * hang add recursion limit
Wrong precedence false matches explicit parsing
Slow performance timeouts restrict features

8. Extensions and Challenges

8.1 Beginner Extensions

  • Add --count flag

8.2 Intermediate Extensions

  • Add + quantifier

8.3 Advanced Extensions

  • Implement DFA-based matcher

9. Real-World Connections

  • Search tools (ripgrep, ack)
  • Log analysis pipelines

10. Resources

  • Python re docs for comparison
  • GNU grep source code

11. Self-Assessment Checklist

  • I can explain backtracking regex
  • I can implement character classes

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Literal matching + file input

Full Completion:

  • Regex subset + flags

Excellence (Going Above and Beyond):

  • DFA engine or performance improvements