Project 4: my_grep - The Pattern Matcher

Build a grep-style tool that searches input for patterns, first with strstr, then with POSIX regex.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 1-2 weeks
Language C (Alternatives: Rust, Go)
Prerequisites Line-based I/O, basic algorithms
Key Topics String search, regex, flags, streaming

1. Learning Objectives

By completing this project, you will:

  1. Read and process input line-by-line using getline().
  2. Implement a simple substring search using strstr.
  3. Integrate POSIX regex with regcomp and regexec.
  4. Add CLI flags that change matching behavior.

2. Theoretical Foundation

2.1 Core Concepts

  • Line-buffered processing: grep matches on lines. getline() reads until newline and grows buffers dynamically.
  • Substring search: strstr() uses a naive algorithm; it is fine for small patterns and simple use.
  • Regex engines: POSIX regex provides a state machine for pattern matching, supporting classes and repetition.

2.2 Why This Matters

grep is the workhorse of debugging and systems operations. Building it teaches you to build fast filters and to reason about text as a stream.

2.3 Historical Context / Background

The original grep (global regular expression print) popularized regex in Unix workflows. It showed that simple tools could become powerful when combined with pipes.

2.4 Common Misconceptions

  • “Regex is just fancy substring search”: Regex engines parse and execute patterns as automata.
  • getline() returns a newline”: It includes the newline when present, which affects matching and printing.

3. Project Specification

3.1 What You Will Build

A command-line tool my_grep that searches for a pattern in files or stdin. It should support simple string matching and a regex mode, plus flags like -i and -v.

3.2 Functional Requirements

  1. Search stdin when no files are provided.
  2. Search multiple files and print matching lines.
  3. Support -i for case-insensitive matching.
  4. Support -v to invert matches.
  5. Support regex mode using regex.h.

3.3 Non-Functional Requirements

  • Performance: Handle large files with streaming reads.
  • Reliability: Cleanly handle invalid regex patterns.
  • Usability: Match behavior should align with basic grep expectations.

3.4 Example Usage / Output

# Search for a string in a file
$ ./my_grep "main" my_ls.c
int main(int argc, char **argv) {

# Invert match and ignore case from stdin
$ echo -e "Hello\nWorld\nhello" | ./my_grep -v -i "hello"
World

3.5 Real World Outcome

Run ./my_grep -i error log.txt and you get a list of lines containing case-insensitive matches. When you add -v, the output flips to show everything that does not match. This is the exact workflow used to filter logs and source trees with GNU grep.


4. Solution Architecture

4.1 High-Level Design

input -> getline -> matcher -> output

4.2 Key Components

Component Responsibility Key Decisions
Input reader Read lines from stdin/files Use getline()
Matcher String or regex match Flag-controlled
Option parser Handle -i, -v, -E Manual parse
Output writer Print matching lines Preserve line endings

4.3 Data Structures

typedef struct {
    int ignore_case;
    int invert_match;
    int use_regex;
    regex_t regex;
} GrepOptions;

4.4 Algorithm Overview

Key Algorithm: Line scan and match

  1. Read a line with getline().
  2. Check match using either strstr() or regexec().
  3. Apply invert flag.
  4. Print line if it matches.

Complexity Analysis:

  • Time: O(n * m) for naive substring search
  • Space: O(k) for line buffer

5. Implementation Guide

5.1 Development Environment Setup

cc -Wall -Wextra -O2 -o my_grep my_grep.c

5.2 Project Structure

my_grep/
├── src/
│   └── my_grep.c
├── tests/
│   └── test_grep.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“How do I turn a stream of bytes into meaningful matches without loading everything into memory?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. getline() semantics
    • Who allocates the buffer?
    • How do you free it?
  2. String search
    • How does strstr() behave?
    • Why is naive search sometimes enough?
  3. POSIX regex
    • What does regcomp() return?
    • How do REG_ICASE and REG_NOSUB work?

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Should you print filenames when multiple files are provided?
  2. How will you handle invalid regex patterns?
  3. How do you preserve original line endings?

5.6 Thinking Exercise

Build a Manual Regex Check

Take the pattern h.llo and the input hello. What does the regex engine need to do differently than strstr()?

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why is getline() useful for grep?”
  2. “What is the difference between a regex and substring search?”
  3. “How would you implement -v invert match?”

5.8 Hints in Layers

Hint 1: Start with strstr Ignore regex and flags, just match a fixed substring.

Hint 2: Add -i Normalize to lowercase before matching.

Hint 3: Add regex mode Use regex.h and compile once, execute per line.

5.9 Books That Will Help

Topic Book Chapter
Regex basics “Mastering Regular Expressions” Ch. 1-3
POSIX I/O “The Linux Programming Interface” Ch. 48

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals:

  • Read lines and match with strstr

Tasks:

  1. Implement stdin reading.
  2. Print matching lines.

Checkpoint: ./my_grep "foo" file.txt works.

Phase 2: Core Functionality (4-6 days)

Goals:

  • Add flags -i and -v

Tasks:

  1. Parse options.
  2. Modify match logic.

Checkpoint: -i and -v behave like GNU grep.

Phase 3: Regex Mode (3-5 days)

Goals:

  • Add regex.h matching

Tasks:

  1. Compile pattern with regcomp().
  2. Execute with regexec() per line.

Checkpoint: Regex patterns match correctly.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Match engine strstr vs regex Both Teach progression
Case handling Lowercase copy vs regex flag Regex flag Less memory duplication

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Matcher logic Fixed strings
Integration Tests Compare with grep diff outputs
Edge Case Tests Long lines, invalid regex Huge log lines

6.2 Critical Test Cases

  1. Case-insensitive match: -i works.
  2. Invert match: -v prints non-matching lines.
  3. Invalid regex: Program reports error and exits non-zero.

6.3 Test Data

Hello
world
HELLO

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Forgetting to free getline buffer Memory leak free(line) after loop
Misusing regex flags Wrong matches Use REG_ICASE for -i
Printing extra newlines Double-spaced output Preserve original line endings

7.2 Debugging Strategies

  • Compare output with GNU grep on small files.
  • Print regex compilation errors using regerror().

7.3 Performance Traps

Lowercasing every line can be expensive. Use regex flags for case-insensitive matching when possible.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add -n to print line numbers.
  • Add -c to print count of matches.

8.2 Intermediate Extensions

  • Add -r to search directories recursively.
  • Add -E for extended regex.

8.3 Advanced Extensions

  • Implement Boyer-Moore or KMP search.
  • Add colored match highlighting.

9. Real-World Connections

9.1 Industry Applications

  • Log filtering: Extract errors or warnings.
  • Code search: Find patterns across codebases.
  • GNU grep: Reference implementation.

9.3 Interview Relevance

String matching and streaming I/O are common interview topics for systems roles.


10. Resources

10.1 Essential Reading

  • “Mastering Regular Expressions” by Jeffrey E. F. Friedl - Ch. 1-3
  • “The Linux Programming Interface” by Michael Kerrisk - Ch. 48

10.2 Video Resources

  • Regex fundamentals lectures (any OS or text processing series)

10.3 Tools & Documentation

  • man 3 getline: Dynamic line input
  • man 3 regex: POSIX regex API
  • my_ls: Metadata and filesystem traversal.
  • simple shell: Uses grep in pipelines you build.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how getline() manages memory.
  • I can describe the difference between regex and substring search.
  • I can explain how -v changes matching.

11.2 Implementation

  • my_grep works with stdin and files.
  • Flags -i and -v behave correctly.
  • Regex mode handles errors correctly.

11.3 Growth

  • I can implement an improved search algorithm.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Matches a fixed string from stdin.
  • Works on a single file.

Full Completion:

  • Supports -i, -v, and regex mode.
  • Handles errors cleanly.

Excellence (Going Above & Beyond):

  • Adds recursive search and color highlighting.
  • Includes performance comparisons of search algorithms.

This guide was generated from LEARN_GNU_TOOLS_DEEP_DIVE.md. For the complete learning path, see the parent directory.