Project 5: Build Your Own grep (Simplified)
Implement a small grep-like tool that matches patterns against lines and demonstrates regex engine fundamentals.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C or Rust |
| Alternative Programming Languages | Go, Python |
| Coolness Level | Level 4: Systems Internals |
| Business Potential | Level 2: Learning / Education |
| Prerequisites | Basic C/Rust, comfort with recursion, regex basics |
| Key Topics | line I/O, regex parsing, NFA/DFA, matching algorithms |
1. Learning Objectives
By completing this project, you will:
- Read input line-by-line and process streams efficiently.
- Parse a regex pattern into an internal representation.
- Implement a simple NFA-based matching engine.
- Handle basic regex operators with correct semantics.
- Match lines and produce correct grep-like exit codes.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Line-Oriented I/O and Streaming Semantics
Fundamentals
grep is fundamentally a line-oriented filter: it reads input line by line, tests each line against a pattern, and outputs matches. Line-oriented I/O is efficient because it avoids loading entire files into memory. It also aligns with the Unix pipeline model, where each line is a record. Understanding line-oriented I/O means understanding buffering, newline handling, and end-of-file behavior. Your implementation needs to handle lines of arbitrary length, including the final line that may not end with a newline. This concept ensures that your grep behaves correctly on real-world input. It also clarifies why streaming tools can process gigabytes of input with constant memory.
Deep Dive into the concept
Streaming I/O means processing data incrementally as it arrives. In C, this often means using getline() or a custom buffered reader; in Rust, it might be std::io::BufRead::read_line. The key property is that you do not need to know the total input size in advance. This is critical for grep: it should work on files larger than memory and on stdin streams.
Line handling is subtle. A line is defined by a newline delimiter, but the final line of a file may not end in a newline. POSIX tools typically treat the last line as a valid record. Your implementation should do the same. When using getline, it returns the line including the newline character; you must decide whether to keep or strip it. grep generally matches against the line without the newline, but outputs the line with its newline to preserve formatting. This means you should strip the newline for matching but keep it for output, or store both versions.
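A minimal sketch of that split, assuming a `getline`-style buffer where `len` includes any trailing newline (`match_len` is a hypothetical helper name, not part of any standard API):

```c
#include <stddef.h>

/* Hypothetical helper: the length the matcher should see, excluding a
   trailing newline if present. The original buffer is printed unchanged,
   so output preserves the input's formatting. */
size_t match_len(const char *line, size_t len) {
    if (len > 0 && line[len - 1] == '\n')
        return len - 1;
    return len;
}
```

The matcher then operates on the first `match_len` bytes while output writes the full buffer, keeping the original newline intact.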
Buffering affects performance. Reading one character at a time is slow; using buffered input reduces syscalls. Most standard libraries already buffer, but if you implement your own reader, you should use a buffer (e.g., 4KB or 8KB). This also affects how you handle long lines that exceed the buffer size. A robust implementation should grow the buffer dynamically as needed. In C, this means reallocating; in Rust, you can use a String that grows.
Streaming also affects error handling. If an input file cannot be opened, your tool should report the error and continue with other files, similar to grep. The exit code should reflect whether any errors occurred. This behavior is important for scripting: grep returns 2 on errors, 1 if no matches, and 0 if matches are found. Your implementation must track these outcomes across multiple inputs. This is not just a CLI detail; it shapes how your tool behaves in pipelines.
Finally, line-oriented processing interacts with encoding. grep typically operates on bytes, not Unicode code points. This means your matching engine should treat the input as bytes or ASCII. If you use a Unicode-aware language like Rust or Python, be careful not to inadvertently normalize or reject invalid UTF-8. For a simplified grep, you can restrict to UTF-8 and document it, or treat input as raw bytes. The simpler approach is to treat input as bytes and avoid complex Unicode semantics.
Another practical concern is line length. Some logs or minified files contain very long lines that can exceed default buffer sizes. A robust line reader should grow its buffer dynamically, and it should avoid quadratic behavior when appending chunks. If you implement your own reader, use a doubling strategy for buffer growth to keep amortized linear time. Also consider binary data: grep traditionally treats binary files as a special case, sometimes printing "Binary file matches" instead of raw output. For a simplified tool, you can treat binary as bytes and still match, but you should document this behavior so users are not surprised when non-printable characters appear in output.
How this fits in the project
This concept is the foundation of your grep implementation. Correct line handling ensures your matcher sees the intended data and your output preserves the input structure.
Definitions & key terms
- Record: One line of input.
- Buffering: Reading data in chunks to reduce syscalls.
- EOF: End-of-file; indicates no more input.
- Exit status: Numeric code indicating success, no match, or error.
Mental model diagram (ASCII)
stdin/file -> [line reader] -> line -> [matcher] -> output if match
How it works (step-by-step, with invariants and failure modes)
- Open each input file (or stdin).
- Read a line into a buffer.
- Strip trailing newline for matching.
- Run pattern matcher.
- If match, print original line.
- Repeat until EOF.
Invariant: Each input line is tested exactly once. Failure modes: truncated long lines, missing final line, unhandled file errors.
Minimal concrete example
// C example using getline (POSIX); handles lines of arbitrary length
#include <stdio.h>
#include <stdlib.h>

char *line = NULL;
size_t n = 0;
ssize_t len;
while ((len = getline(&line, &n, stdin)) != -1) {
    // match against line (strip trailing '\n' first), print if matched
}
free(line);
Common misconceptions
- “Lines always end with newline” -> Last line may not.
- “Reading byte-by-byte is fine” -> It’s too slow for large files.
- “Exit code doesn’t matter” -> It matters for scripts and pipelines.
Check-your-understanding questions
- Why must the last line be processed even without a newline?
- How does buffering improve performance?
- What exit code should grep return if no matches are found?
- Why is treating input as bytes sometimes safer than Unicode?
Check-your-understanding answers
- POSIX tools treat the final partial line as a valid record.
- It reduces system calls by reading in chunks.
- Exit code 1.
- Byte-level processing avoids Unicode decoding errors and matches grep behavior.
Real-world applications
- Text scanning in large log files.
- Data pipelines that rely on grep-like filters.
Where you’ll apply it
- See §3.2 for input handling requirements.
- See §3.7 for CLI transcript and exit codes.
- Also used in: P01 Log Analyzer for stream processing.
References
- POSIX getline() and fgets() behavior
- “The C Programming Language” by Kernighan & Ritchie
Key insights
Line-oriented streaming is the backbone of grep; correct line handling is more important than it looks.
Summary
A grep implementation must process input line by line, preserve formatting, and handle EOF correctly. Buffering and exit codes are part of the contract.
Homework/Exercises to practice the concept
- Implement a line reader and test it on a file without a trailing newline.
- Measure performance difference between byte-by-byte and buffered reading.
- Implement exit status logic for match/no-match/error.
Solutions to the homework/exercises
- Use getline and ensure the last line is processed even without \n.
- Compare runtimes using time on a large file.
- Track found_match and had_error flags to set the exit code.
2.2 Regex Engine Fundamentals (NFA/DFA and Thompson Construction)
Fundamentals
Regex engines can be implemented using nondeterministic finite automata (NFA) or deterministic finite automata (DFA). grep-like tools often use NFA-style engines because they are simpler to build and can support a range of regex features with acceptable performance. Thompson construction is a method to convert a regex into an NFA using epsilon transitions. Understanding NFA/DFA trade-offs helps you explain why your simplified grep is structured the way it is and how it differs from backtracking engines in some languages. At a high level, regex matching is a finite-state problem, which makes automata theory directly applicable.
Deep Dive into the concept
A regex describes a regular language. The formal model for regular languages is a finite automaton. A deterministic finite automaton (DFA) has exactly one next state for each input symbol, while a nondeterministic finite automaton (NFA) can have multiple possible next states. In practice, DFAs are fast because matching is O(n) with a single state, but building a DFA can cause an exponential blowup in the number of states (state explosion). NFAs are more compact but require tracking sets of active states during matching.
Thompson construction is a canonical algorithm that builds an NFA from a regex. Each regex operator corresponds to a small NFA fragment: concatenation connects fragments end-to-start, alternation creates a split with epsilon transitions, and repetition (*) creates a loop with epsilon edges. The resulting NFA has epsilon transitions that consume no input. During matching, you compute the epsilon-closure (the set of states reachable via epsilon transitions) and advance the set of active states as you consume input characters. This is the basis of the classic NFA simulation used in many grep-like tools.
For a simplified grep, you can implement a Thompson NFA that supports operators like . (any char), *, +, ?, character classes [abc], and anchors ^ and $. The algorithm works as follows: parse the regex into a syntax tree or postfix representation, build an NFA, then simulate it against each input line. You do not need to build a full DFA; the NFA simulation is sufficient and easier to implement.
Understanding the difference between NFA simulation and backtracking is important. Backtracking engines explore possible matches by recursion, which can lead to exponential time in certain patterns. NFA simulation avoids this by tracking all possible states at once, guaranteeing linear time with respect to the input length (though with a factor for the number of NFA states). This is why many grep implementations are efficient even on large inputs. Your simplified grep should aim for NFA simulation rather than naive backtracking, even if you implement it in a simple way.
The concept of “leftmost-longest” matching in POSIX regex is also relevant. POSIX requires the leftmost-longest match, which affects how alternatives are resolved. Implementing this exactly is complex. For a simplified grep, you can use a leftmost-first match (first match found by NFA), but you should document the limitation. This is acceptable for a learning project, but understanding the POSIX requirement helps you appreciate the complexity of real grep implementations.
Implementation details matter for performance. When simulating an NFA, you maintain a set of active states. A naive implementation might use a list and scan it repeatedly for duplicates, which can be slow. A simple optimization is to tag states with a generation counter so you can avoid duplicates without clearing large arrays. Another approach is to store active states in a bitset indexed by state ID, which allows fast membership checks. These techniques are common in production regex engines and provide a taste of how algorithmic theory turns into practical performance engineering.
If you later want multi-line mode, you can extend the engine to treat \n as a regular character, but document that this changes matching semantics.
How this fits in the project
This concept defines the core matching engine. Your grep’s correctness and performance depend on the automaton model you choose.
Definitions & key terms
- NFA: Nondeterministic finite automaton.
- DFA: Deterministic finite automaton.
- Epsilon transition: A state transition that consumes no input.
- Thompson construction: Algorithm to build an NFA from a regex.
- State explosion: Exponential growth in DFA states.
Mental model diagram (ASCII)
Regex: a(b|c)*d
NFA:
(start) --a--> (q) --eps--> (split) --b--> back to (q)
                |              \
                |               +--c--> back to (q)
                +--eps--> (d) --> (accept)
How it works (step-by-step, with invariants and failure modes)
- Parse regex into postfix form.
- Build NFA fragments for each operator.
- Combine fragments into a full NFA with start/end states.
- For each input line, compute epsilon-closure of start state.
- Advance active states for each character.
- If any active state is accept at end, match succeeds.
Invariant: The set of active states represents all possible matches at that point. Failure modes: incorrect epsilon-closure, missing transitions for operators.
Minimal concrete example
Regex: ab*
NFA:
(start) -a-> (s1) -b-> (s1)   [b loops on s1]
(s1) -eps-> (accept)
Common misconceptions
- “NFA is slower than DFA” -> NFA simulation can be efficient for typical patterns.
- “Backtracking is simpler and safe” -> It can be exponential time.
- “All regex engines behave the same” -> POSIX vs PCRE differ.
Check-your-understanding questions
- Why can DFAs cause state explosion?
- What role do epsilon transitions play in Thompson NFA?
- Why is NFA simulation usually safer than backtracking?
- What does leftmost-longest mean in POSIX regex?
Check-your-understanding answers
- Alternation and repetition can create exponential numbers of combined states.
- They allow moving between fragments without consuming input.
- It tracks all possibilities at once, avoiding exponential backtracking.
- It selects the earliest match, and among those, the longest match.
Real-world applications
- grep-like text search tools.
- Input validation systems.
- Intrusion detection rules.
Where you’ll apply it
- See §4.4 for the NFA-based matching algorithm.
- See §5.10 Phase 2 for implementing Thompson construction.
- Also used in: P01 Log Analyzer for regex reasoning.
References
- Russ Cox, “Regular Expression Matching Can Be Simple And Fast”
- “Compilers: Principles, Techniques, and Tools” (automata theory)
Key insights
A regex engine is a finite automaton problem. Thompson NFA gives you power without the DFA explosion.
Summary
Thompson construction builds a compact NFA that can be simulated linearly. This is the core of a simplified grep.
Homework/Exercises to practice the concept
- Draw the NFA for regex a(b|c)d.
- Implement epsilon-closure for a small NFA graph.
- Simulate the NFA on the string “abbd” and track active states.
Solutions to the homework/exercises
- Use a split from a into b and c branches, then join into d.
- Use a stack/queue to traverse epsilon edges.
- Track state sets after each character; accept if final state reachable.
2.3 Regex Parsing and Tokenization
Fundamentals
Before you can match, you must parse the pattern. Parsing turns a string like a(b|c)*d into a structure that the engine can interpret. This involves tokenization (recognizing literals, operators, and character classes) and handling precedence (concatenation before alternation, repetition before concatenation). A simplified grep can parse into postfix (Reverse Polish) notation using the shunting-yard algorithm, which makes NFA construction straightforward. Correct parsing ensures the engine interprets patterns as intended. Without clear parsing rules, even a perfect matcher will produce incorrect results. Clear error handling for malformed patterns is part of parsing. A small parser bug can invalidate every match.
Deep Dive into the concept
Regex parsing is a miniature compiler problem. The first step is tokenization: scanning the pattern character by character and emitting tokens like LITERAL(‘a’), STAR(‘*’), OR(‘|’), LPAREN, RPAREN, CHAR_CLASS, and ANY(‘.’). Escapes (\) change the meaning of the next character, so the tokenizer must treat them as literals. Character classes ([abc] or [a-z]) require special handling: they are multi-character tokens that may include ranges and negation ([^a-z]). For a simplified grep, you can support basic classes and ranges, and ignore more advanced features like POSIX character classes inside brackets.
After tokenization, you must handle implicit concatenation. In regex, concatenation has higher precedence than alternation, but is not represented by an explicit operator. The common approach is to insert an explicit CONCAT token between tokens that should be concatenated. For example, ab becomes a CONCAT b. This transformation is critical for parsing with precedence rules.
Once you have tokens and explicit CONCAT operators, you can use the shunting-yard algorithm to convert the infix regex into postfix notation. The precedence order is typically: repetition (*, +, ?) highest, CONCAT next, and OR lowest. Postfix notation makes NFA construction simple: you process tokens from left to right, push fragments on a stack, and apply operators by popping fragments and combining them.
Parsing also affects error handling. Your tool should detect invalid patterns (unbalanced parentheses, dangling operators, invalid ranges) and report them with a clear error message. This is part of being a good CLI citizen. In a simplified implementation, you can reject patterns that include unsupported features (like backreferences) with a friendly error.
Another important aspect is escaping and literal handling. . is special unless escaped; \. should match a literal dot. Similarly, * and + are operators unless escaped. Your tokenizer must track whether the previous character was a backslash. Edge cases such as a trailing backslash should be treated as an error.
Finally, consider anchors ^ and $. In grep, these are special only at the start or end of the pattern (or line). In a simplified parser, you can treat ^ and $ as special tokens if they appear at the beginning or end; otherwise treat them as literals. Document this behavior clearly.
When inserting explicit CONCAT tokens, it helps to model the parser as a small state machine that tracks what kinds of tokens can follow others. For example, a literal followed by another literal implies concatenation, but a literal followed by | does not. This logic is easy to get wrong, so write unit tests for the concatenation inserter itself. Also, be careful with character classes that include escaped characters like \] or \-; your tokenizer should treat these as part of the class rather than as class terminators. Clear error messages for malformed classes (like an empty []) make the tool easier to debug.
If you support ranges inside classes (like a-z), decide how to interpret locale and case. A simple, deterministic approach is to treat ranges as ASCII codepoint intervals. That keeps behavior predictable and avoids locale surprises. Document the choice so users understand the limits of the simplified engine.
How this fits in the project
Parsing defines the semantics of your regex engine. Without correct parsing, your matcher will behave unpredictably.
Definitions & key terms
- Tokenization: Converting a string into a sequence of tokens.
- Concatenation operator: Explicit operator inserted between adjacent tokens.
- Shunting-yard: Algorithm to convert infix to postfix notation.
- Postfix (RPN): Operator notation that simplifies NFA construction.
- Escape: Backslash that turns a metacharacter into a literal.
Mental model diagram (ASCII)
regex: a(b|c)*d
tokens -> a ( b | c ) * d
insert CONCAT -> a . ( b | c ) * . d
postfix -> a b c | * . d .
How it works (step-by-step, with invariants and failure modes)
- Tokenize the regex into literals and operators.
- Insert CONCAT tokens where appropriate.
- Use shunting-yard to produce postfix.
- Validate parentheses and operator placement.
- Pass postfix to NFA constructor.
Invariant: The postfix sequence is a valid expression tree. Failure modes: unbalanced parentheses, invalid escapes, unsupported tokens.
Minimal concrete example
Pattern: ab|c
Tokens: a b | c
Insert CONCAT: a . b | c
Postfix: a b . c |
Common misconceptions
- “Regex parsing is trivial” -> It requires precedence and concatenation handling.
- “All regex features are needed” -> A simplified grep can omit advanced features.
- “Escapes are just literals” -> Escapes change token meaning and must be tracked.
Check-your-understanding questions
- Why do you need to insert CONCAT tokens?
- What is the precedence order of regex operators?
- How does shunting-yard help in regex parsing?
- What should happen if the pattern ends with a backslash?
Check-your-understanding answers
- Concatenation is implicit and must be made explicit for parsing.
- Repetition highest, concatenation next, alternation lowest.
- It converts infix to postfix, which simplifies NFA construction.
- It should be treated as an error (dangling escape).
Real-world applications
- Building custom search tools.
- Understanding how regex libraries parse patterns.
Where you’ll apply it
- See §4.3 for data structures used in the parser.
- See §5.10 Phase 2 for the tokenizer and parser implementation.
- Also used in: P03 Codebase Refactoring Toolkit for pattern precision.
References
- Russ Cox regex articles
- “Engineering a Compiler” (parsing basics)
Key insights
Parsing determines meaning. A correct parser is the difference between a working grep and a buggy one.
Summary
Regex parsing requires tokenization, explicit concatenation, precedence handling, and error detection. Once parsed into postfix, building an NFA becomes straightforward.
Homework/Exercises to practice the concept
- Implement a tokenizer that recognizes literals, *, +, ?, |, and parentheses.
- Write a function to insert CONCAT tokens.
- Convert a(b|c)d to postfix by hand.
Solutions to the homework/exercises
- Track escapes and bracketed character classes.
- Insert CONCAT after literals, closing parens, and repetition operators like * when the next token is a literal or open paren.
- Postfix: a b c | . d .
2.4 CLI Semantics, Match Reporting, and Exit Codes
Fundamentals
grep is as much about its CLI contract as it is about matching. It returns exit code 0 if any match is found, 1 if no matches, and 2 if an error occurred. It supports options like -n for line numbers and -v for inverted matches. A simplified grep should implement a subset of these options consistently. CLI semantics are crucial because grep is often used in scripts; incorrect exit codes break pipelines. This concept ensures your tool behaves like a well-behaved Unix utility. Even small inconsistencies in flags or exit codes can cause automation to fail silently. Consistency is the feature.
Deep Dive into the concept
CLI design is about predictable behavior. grep defaults to printing matching lines and returns 0 if at least one match is found. This makes it usable in conditional chains like if grep -q pattern file; then .... The -q flag suppresses output and returns immediately on the first match. Even if you don’t implement -q, you should at least implement correct exit codes and document them. -n adds line numbers to output, which is helpful for refactoring and debugging. -v inverts the match, outputting lines that do NOT match. Implementing -v is straightforward: just negate the match result.
Multiple files introduce additional complexity. grep prefixes each output line with the filename when multiple files are searched. A simplified grep can adopt a simpler rule: if more than one file is provided, prefix with filename:. This makes output unambiguous. If you only support a single file or stdin, document that limitation. When reading from stdin, grep treats the input as a single stream; line numbers count from 1. This is easy to implement but must be explicitly tested.
Exit codes require careful tracking. If any file fails to open, grep sets the overall exit code to 2 even if matches were found in other files. This behavior is important in scripts: errors should be visible. You should implement this by tracking three flags: found_match, had_error, and processed_any. At the end, return 2 if had_error, else 0 if found_match, else 1. This is the canonical grep behavior.
For deterministic behavior, the output should be stable. If you process files in the order given, output lines should appear in that order. For stdin, there is no file prefix. If you add a --count option, output should be the count and not the lines, but the exit code should still reflect match/no-match. If you do not implement --count, document it.
CLI semantics are also about error messages. Use stderr for errors, stdout for matches. This is the standard Unix contract and allows piping outputs safely. For example, mygrep pattern file | wc -l should count matches without being polluted by error messages. When reporting invalid regex, print a clear message and exit with code 2.
Another subtlety is early exit behavior. In -q (quiet) mode, grep stops reading input after the first match, which can make pipelines faster. Even if you do not implement -q, you should design your matcher so it could short-circuit easily. Similarly, --count mode would require reading all input but suppressing line output. These variants all rely on the same match loop but change output behavior, which is a good exercise in separating matching logic from output formatting.
Output formatting also affects downstream tools. For example, if you print file names and line numbers, separators should be consistent (file:line:content) so tools can parse them. This is why grep uses a stable format and does not add extra punctuation. Building the habit of stable, parseable output is part of writing Unix-friendly tools.
How this fits in the project
This concept ensures your grep tool behaves correctly in pipelines and scripts. It is the difference between a toy and a usable utility.
Definitions & key terms
- Exit code: Numeric result of program execution.
- Inverted match: Matching lines that do not match the pattern.
- Filename prefix: Output format when scanning multiple files.
- Quiet mode: Stop after first match without output.
Mental model diagram (ASCII)
args -> [parse flags] -> [process files] -> stdout (matches) + stderr (errors)
                            |                     |
                            v                     v
                        exit code            line format
How it works (step-by-step, with invariants and failure modes)
- Parse CLI flags and pattern.
- For each input file, read line by line.
- Apply matcher; optionally invert.
- Print matches with optional line numbers and file prefix.
- Track match/error flags and set exit code.
Invariant: Errors are printed to stderr, not stdout. Failure modes: wrong exit codes, mixed stderr/stdout, incorrect line numbers.
Minimal concrete example
$ ./mygrep -n "error" app.log
12:error: failed to connect
Common misconceptions
- “Exit code 0 means success only” -> For grep, it means matches found.
- “Errors can be printed to stdout” -> They must go to stderr.
- “Line numbers start at 0” -> They start at 1.
Check-your-understanding questions
- What exit code should you return if no matches are found?
- How should errors be reported relative to stdout?
- What should happen if one of multiple files cannot be opened?
- Why is deterministic file ordering important?
Check-your-understanding answers
- Exit code 1.
- Errors go to stderr; matches go to stdout.
- Exit code should be 2 even if other files matched.
- It makes output predictable and testable.
Real-world applications
- Scripted checks (grep -q in CI).
- Search tools in developer workflows.
Where you’ll apply it
- See §3.2 for CLI requirements.
- See §3.7 for exact terminal transcripts and exit codes.
- Also used in: P06 Personal DevOps Toolkit as a subcommand.
References
- POSIX grep exit status definitions
- grep(1) manual page
Key insights
Correct CLI semantics turn a regex engine into a usable Unix tool.
Summary
Implementing grep is not just about matching; it is about exit codes, stderr/stdout separation, and predictable output.
Homework/Exercises to practice the concept
- Implement -n and verify line numbering.
- Implement -v and test inverted matches.
- Verify exit codes for match/no-match/error cases.
Solutions to the homework/exercises
- Track a line counter starting at 1 and prefix output.
- Invert the match result before deciding to print.
- Track flags and return 0/1/2 accordingly.
3. Project Specification
3.1 What You Will Build
A simplified mygrep tool that:
- Reads from files or stdin.
- Supports a subset of regex operators (., *, +, ?, |, [], ^, $).
- Prints matching lines (with optional -n line numbers).
- Supports -v for inverted matching.
- Returns grep-compatible exit codes.
Included:
- NFA-based matcher.
- Deterministic behavior.
Excluded:
- Backreferences, lookahead, full POSIX compliance.
3.2 Functional Requirements
- Input handling: Read from stdin and one or more files.
- Regex parsing: Parse and validate patterns.
- Matching: Match patterns against lines with NFA simulation.
- Options: Support -n and -v.
- Exit codes: 0/1/2 semantics.
- Error reporting: Invalid pattern or file errors to stderr.
3.3 Non-Functional Requirements
- Performance: Handle large files efficiently.
- Reliability: Deterministic output for tests.
- Usability: Clear usage message.
3.4 Example Usage / Output
$ ./mygrep -n "ERR(OR)?" app.log
12:ERROR: disk full
17:ERR: disk warning
3.5 Data Formats / Schemas / Protocols
Input: raw text lines.
Output format (default):
[optional filename:][line_number:]line
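A sketch of emitting that format into a buffer (`format_match` is a hypothetical helper; `show_name` would be set when more than one file is searched, and `lineno` when -n is given):

```c
#include <stdio.h>
#include <stddef.h>

/* Build one output record: [filename:][line_number:]line.
   Returns the number of characters written (truncation ignored
   in this sketch). */
int format_match(char *buf, size_t cap, const char *fname,
                 int show_name, long lineno, const char *line) {
    size_t n = 0;
    if (show_name)
        n += (size_t)snprintf(buf + n, cap - n, "%s:", fname);
    if (lineno > 0)
        n += (size_t)snprintf(buf + n, cap - n, "%ld:", lineno);
    n += (size_t)snprintf(buf + n, cap - n, "%s", line);
    return (int)n;
}
```

Keeping the separators fixed as colons preserves the stable `file:line:content` shape that downstream tools expect to parse.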
3.6 Edge Cases
- Empty pattern.
- Lines without newline at EOF.
- Invalid regex (unbalanced parentheses).
- Binary files (treat as bytes).
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
./mygrep -n "ERROR" sample/app.log
3.7.2 Golden Path Demo (Deterministic)
$ ./mygrep -n "ERROR" sample/app.log
3:ERROR: failed to connect
7:ERROR: timeout
3.7.3 Failure Demo (Deterministic)
$ ./mygrep "(unclosed" sample/app.log
ERROR: invalid pattern (unbalanced parentheses)
exit code: 2
3.7.4 If CLI: exact terminal transcript
$ ./mygrep -v "INFO" sample/app.log
WARN: slow request
ERROR: failed to connect
$ echo $?
0
Exit codes:
- 0: Match found.
- 1: No matches.
- 2: Error (invalid pattern, file missing).
4. Solution Architecture
4.1 High-Level Design
pattern -> [parser] -> NFA
input -> [line reader] -> [matcher] -> output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Parser | Tokenize and parse regex | shunting-yard to postfix |
| NFA Builder | Build Thompson NFA | epsilon transitions |
| Matcher | Simulate NFA on each line | set of active states |
| CLI | Parse flags and set exit codes | grep-compatible semantics |
4.3 Data Structures (No Full Code)
State { transitions, epsilon, is_accept }
Stack<Fragment> for NFA construction
Vec<State> for NFA graph
4.4 Algorithm Overview
Key Algorithm: NFA Simulation
- Build NFA from regex.
- For each line, compute epsilon-closure of start state.
- For each character, advance to next state set.
- If accept state in set at end, match.
Complexity Analysis:
- Time: O(n * s) where n is input length and s is number of states.
- Space: O(s).
5. Implementation Guide
5.1 Development Environment Setup
# C example
cc -Wall -O2 -o mygrep main.c
5.2 Project Structure
mygrep/
├── src/
│ ├── main.c
│ ├── lexer.c
│ ├── parser.c
│ ├── nfa.c
│ └── matcher.c
├── tests/
│ └── fixtures/
└── README.md
5.3 The Core Question You’re Answering
“What does grep actually do internally when it decides a line matches a regex?”
5.4 Concepts You Must Understand First
- Line-based streaming I/O
- Regex parsing and concatenation rules
- Thompson NFA construction
- grep exit code semantics
5.5 Questions to Guide Your Design
- Which regex operators will you support?
- How will you represent NFA states and transitions?
- What behavior do you choose for anchors ^ and $?
5.6 Thinking Exercise
Draw the NFA for a(b|c)*d and simulate it by hand on the string abbd.
5.7 The Interview Questions They’ll Ask
- What is the difference between an NFA and DFA?
- Why can backtracking regex engines be slow?
- How does grep decide its exit codes?
5.8 Hints in Layers
Hint 1: Start with literal matching
Implement substring search without regex operators.
Hint 2: Add ‘.’ and ‘*’
Build a recursive matcher, then migrate to NFA.
Hint 3: Add character classes
Represent classes as sets of allowed characters.
Hint 4: Add alternation
Use the Thompson split fragment for |.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Regex theory | “Mastering Regular Expressions” | Ch. 4-6 |
| Automata | “Compilers” (Dragon Book) | Automata chapters |
| Systems I/O | “Computer Systems: A Programmer’s Perspective” | I/O chapters |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals: Line reader and literal matching.
Tasks:
- Implement line reader and CLI parsing.
- Implement simple literal substring search.
Checkpoint: mygrep LITERAL FILE works on a sample file.
Phase 2: Core Regex Engine (6-7 days)
Goals: Parser and NFA matcher.
Tasks:
- Implement tokenizer and shunting-yard parser.
- Build Thompson NFA and simulate it.
Checkpoint: Patterns with ., *, and | match correctly.
Phase 3: Polish & Options (3-4 days)
Goals: Add -n, -v, error handling.
Tasks:
- Implement line numbering and inverted match.
- Implement error reporting and exit codes.
Checkpoint: CLI behavior matches grep semantics.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Engine type | backtracking vs NFA | NFA | predictable performance |
| Parsing | recursive descent vs shunting-yard | shunting-yard | simpler for regex |
| Input type | bytes vs UTF-8 | bytes | closer to grep behavior |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Parser correctness | postfix output |
| Integration Tests | End-to-end matching | fixtures with expected matches |
| Edge Case Tests | invalid patterns | unbalanced parentheses |
6.2 Critical Test Cases
- a*b matches aaab.
- ^abc$ matches only a full line.
- An invalid regex returns exit code 2.
6.3 Test Data
INFO: startup
ERROR: failed
WARN: slow
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong precedence | incorrect matches | implement explicit CONCAT precedence |
| Missing epsilon closure | false negatives | compute closure before each step |
| Wrong exit codes | scripts fail | track match/no-match/error flags |
7.2 Debugging Strategies
- Print the postfix regex to validate parsing.
- Dump NFA state graph for small patterns.
- Compare against grep -E on small inputs.
7.3 Performance Traps
Avoid recursion-based backtracking for * and |; use NFA state sets.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a -q quiet mode.
- Add a -c count-only mode.
8.2 Intermediate Extensions
- Add {m,n} repetition.
- Add character class negation [^...].
8.3 Advanced Extensions
- Add DFA compilation for performance.
- Add POSIX leftmost-longest semantics.
9. Real-World Connections
9.1 Industry Applications
- Search tools in build systems.
- Intrusion detection log scanning.
9.2 Related Open Source Projects
- ripgrep: modern grep replacement.
- RE2: safe regex engine.
9.3 Interview Relevance
- Automata theory applied to real tools.
- Performance trade-offs in regex engines.
10. Resources
10.1 Essential Reading
- Russ Cox regex articles
- “Mastering Regular Expressions” by Jeffrey Friedl
10.2 Video Resources
- “Regex engines explained” (lecture)
10.3 Tools & Documentation
- regex(7) manual page
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain NFA vs DFA.
- I can describe how my parser works.
- I can justify exit code behavior.
11.2 Implementation
- All functional requirements are met.
- Tests pass for core regex operators.
- Exit codes match grep semantics.
11.3 Growth
- I can explain how to extend this engine.
- I can describe performance trade-offs.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Literal matching and line-based I/O.
- Exit code semantics implemented.
- Basic regex operators supported.
Full Completion:
- NFA engine with alternation and classes.
- -n and -v options.
- Golden tests pass.
Excellence (Going Above & Beyond):
- DFA compilation or leftmost-longest semantics.
- Additional operators and performance benchmarks.