Project 4: my_grep - The Pattern Matcher
Build a
grep-style tool that searches input for patterns, first withstrstr, then with POSIX regex.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C (Alternatives: Rust, Go) |
| Prerequisites | Line-based I/O, basic algorithms |
| Key Topics | String search, regex, flags, streaming |
1. Learning Objectives
By completing this project, you will:
- Read and process input line-by-line using
getline(). - Implement a simple substring search using
strstr. - Integrate POSIX regex with
regcompandregexec. - Add CLI flags that change matching behavior.
2. Theoretical Foundation
2.1 Core Concepts
- Line-buffered processing:
grepmatches on lines.getline()reads until newline and grows buffers dynamically. - Substring search:
strstr()uses a naive algorithm; it is fine for small patterns and simple use. - Regex engines: POSIX regex provides a state machine for pattern matching, supporting classes and repetition.
2.2 Why This Matters
grep is the workhorse of debugging and systems operations. Building it teaches you to build fast filters and to reason about text as a stream.
2.3 Historical Context / Background
The original grep (global regular expression print) popularized regex in Unix workflows. It showed that simple tools could become powerful when combined with pipes.
2.4 Common Misconceptions
- “Regex is just fancy substring search”: Regex engines parse and execute patterns as automata.
- “
getline()returns a newline”: It includes the newline when present, which affects matching and printing.
3. Project Specification
3.1 What You Will Build
A command-line tool my_grep that searches for a pattern in files or stdin. It should support simple string matching and a regex mode, plus flags like -i and -v.
3.2 Functional Requirements
- Search stdin when no files are provided.
- Search multiple files and print matching lines.
- Support
-ifor case-insensitive matching. - Support
-vto invert matches. - Support regex mode using
regex.h.
3.3 Non-Functional Requirements
- Performance: Handle large files with streaming reads.
- Reliability: Cleanly handle invalid regex patterns.
- Usability: Match behavior should align with basic
grepexpectations.
3.4 Example Usage / Output
# Search for a string in a file
$ ./my_grep "main" my_ls.c
int main(int argc, char **argv) {
# Invert match and ignore case from stdin
$ echo -e "Hello\nWorld\nhello" | ./my_grep -v -i "hello"
World
3.5 Real World Outcome
Run ./my_grep -i error log.txt and you get a list of lines containing case-insensitive matches. When you add -v, the output flips to show everything that does not match. This is the exact workflow used to filter logs and source trees with GNU grep.
4. Solution Architecture
4.1 High-Level Design
input -> getline -> matcher -> output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Input reader | Read lines from stdin/files | Use getline() |
| Matcher | String or regex match | Flag-controlled |
| Option parser | Handle -i, -v, -E |
Manual parse |
| Output writer | Print matching lines | Preserve line endings |
4.3 Data Structures
typedef struct {
int ignore_case;
int invert_match;
int use_regex;
regex_t regex;
} GrepOptions;
4.4 Algorithm Overview
Key Algorithm: Line scan and match
- Read a line with
getline(). - Check match using either
strstr()orregexec(). - Apply invert flag.
- Print line if it matches.
Complexity Analysis:
- Time: O(n * m) for naive substring search
- Space: O(k) for line buffer
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -O2 -o my_grep my_grep.c
5.2 Project Structure
my_grep/
├── src/
│ └── my_grep.c
├── tests/
│ └── test_grep.sh
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
“How do I turn a stream of bytes into meaningful matches without loading everything into memory?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
getline()semantics- Who allocates the buffer?
- How do you free it?
- String search
- How does
strstr()behave? - Why is naive search sometimes enough?
- How does
- POSIX regex
- What does
regcomp()return? - How do
REG_ICASEandREG_NOSUBwork?
- What does
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Should you print filenames when multiple files are provided?
- How will you handle invalid regex patterns?
- How do you preserve original line endings?
5.6 Thinking Exercise
Build a Manual Regex Check
Take the pattern h.llo and the input hello. What does the regex engine need to do differently than strstr()?
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “Why is
getline()useful forgrep?” - “What is the difference between a regex and substring search?”
- “How would you implement
-vinvert match?”
5.8 Hints in Layers
Hint 1: Start with strstr
Ignore regex and flags, just match a fixed substring.
Hint 2: Add -i
Normalize to lowercase before matching.
Hint 3: Add regex mode
Use regex.h and compile once, execute per line.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Regex basics | “Mastering Regular Expressions” | Ch. 1-3 |
| POSIX I/O | “The Linux Programming Interface” | Ch. 48 |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
Goals:
- Read lines and match with
strstr
Tasks:
- Implement stdin reading.
- Print matching lines.
Checkpoint: ./my_grep "foo" file.txt works.
Phase 2: Core Functionality (4-6 days)
Goals:
- Add flags
-iand-v
Tasks:
- Parse options.
- Modify match logic.
Checkpoint: -i and -v behave like GNU grep.
Phase 3: Regex Mode (3-5 days)
Goals:
- Add
regex.hmatching
Tasks:
- Compile pattern with
regcomp(). - Execute with
regexec()per line.
Checkpoint: Regex patterns match correctly.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Match engine | strstr vs regex |
Both | Teach progression |
| Case handling | Lowercase copy vs regex flag | Regex flag | Less memory duplication |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Matcher logic | Fixed strings |
| Integration Tests | Compare with grep |
diff outputs |
| Edge Case Tests | Long lines, invalid regex | Huge log lines |
6.2 Critical Test Cases
- Case-insensitive match:
-iworks. - Invert match:
-vprints non-matching lines. - Invalid regex: Program reports error and exits non-zero.
6.3 Test Data
Hello
world
HELLO
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
Forgetting to free getline buffer |
Memory leak | free(line) after loop |
| Misusing regex flags | Wrong matches | Use REG_ICASE for -i |
| Printing extra newlines | Double-spaced output | Preserve original line endings |
7.2 Debugging Strategies
- Compare output with GNU
grepon small files. - Print regex compilation errors using
regerror().
7.3 Performance Traps
Lowercasing every line can be expensive. Use regex flags for case-insensitive matching when possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
-nto print line numbers. - Add
-cto print count of matches.
8.2 Intermediate Extensions
- Add
-rto search directories recursively. - Add
-Efor extended regex.
8.3 Advanced Extensions
- Implement Boyer-Moore or KMP search.
- Add colored match highlighting.
9. Real-World Connections
9.1 Industry Applications
- Log filtering: Extract errors or warnings.
- Code search: Find patterns across codebases.
9.2 Related Open Source Projects
- GNU grep: Reference implementation.
9.3 Interview Relevance
String matching and streaming I/O are common interview topics for systems roles.
10. Resources
10.1 Essential Reading
- “Mastering Regular Expressions” by Jeffrey E. F. Friedl - Ch. 1-3
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 48
10.2 Video Resources
- Regex fundamentals lectures (any OS or text processing series)
10.3 Tools & Documentation
man 3 getline: Dynamic line inputman 3 regex: POSIX regex API
10.4 Related Projects in This Series
- my_ls: Metadata and filesystem traversal.
- simple shell: Uses
grepin pipelines you build.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how
getline()manages memory. - I can describe the difference between regex and substring search.
- I can explain how
-vchanges matching.
11.2 Implementation
my_grepworks with stdin and files.- Flags
-iand-vbehave correctly. - Regex mode handles errors correctly.
11.3 Growth
- I can implement an improved search algorithm.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Matches a fixed string from stdin.
- Works on a single file.
Full Completion:
- Supports
-i,-v, and regex mode. - Handles errors cleanly.
Excellence (Going Above & Beyond):
- Adds recursive search and color highlighting.
- Includes performance comparisons of search algorithms.
This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.