Project 1: Naive String Search (Baseline)
Build a correct, deterministic, brute-force string matcher you can use as the performance and correctness baseline for every other project.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 2-3 hours |
| Main Programming Language | C (Alternatives: Rust, Zig, Go) |
| Alternative Programming Languages | Rust, Zig, Go |
| Coolness Level | Level 2: Solid Foundation |
| Business Potential | Level 1: Internal Tooling |
| Prerequisites | C basics, pointers, arrays, CLI args |
| Key Topics | Naive string matching, complexity, byte scanning, overlapping matches |
1. Learning Objectives
By completing this project, you will:
- Implement a correct O(n*m) substring search and explain its worst-case behavior.
- Produce deterministic output with stable ordering and fixed timestamps for tests.
- Handle overlapping matches and edge cases (empty pattern, pattern longer than text).
- Build a baseline benchmark harness for later comparisons.
- Explain why naive search can still be fast for short patterns and small files.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Naive String Matching
Fundamentals
Naive string matching compares the pattern against the text at every possible alignment. If the pattern has length m and the text has length n, there are n - m + 1 alignments. At each alignment, you compare the pattern bytes from left to right until you hit a mismatch or reach the end. This is the simplest correct algorithm for substring search and establishes the baseline for all optimizations. It is useful precisely because it is brutally straightforward: it does no preprocessing, keeps no state beyond the current indices, and is easy to reason about for correctness. That makes it the perfect reference implementation for validating more complex algorithms such as Boyer-Moore, Aho-Corasick, and regex engines.
In systems work, you need a baseline not just for performance but for semantics. The naive algorithm defines what it means for a match to exist at a position: every pattern byte equals the corresponding text byte at that offset. There is no skipping, no heuristic, no probabilistic prefilter. That makes it the ideal oracle for fuzzing and regression tests. If an optimized matcher disagrees with the naive baseline on deterministic inputs, the optimized version is wrong, not the baseline.
Deep Dive into the concept
The naive algorithm’s worst case is O(n*m). This occurs when the pattern shares long prefixes with many positions in the text, such as pattern “aaaaab” in text “aaaaaaaaaaaaaaaaab”. At each alignment, the algorithm compares nearly all m characters before failing on the last comparison, leading to roughly (n - m + 1) * (m - 1) comparisons. This is the canonical example of why preprocessing and skipping heuristics matter. But the naive algorithm can still be fast in practice for small m because its inner loop is tight and predictable: a simple byte compare plus a branch. Modern branch predictors and caches handle this well for many workloads.
Understanding overlaps is critical. If the pattern is “ana” and the text is “banana”, matches appear at positions 1 and 3. A naive implementation that jumps ahead by m after finding a match will miss the overlap. For correctness, you must define match reporting as “every alignment with full equality” and advance by exactly one position after each alignment. This matters later in regex engines and multi-pattern matchers, where overlapping matches are common and expected.
Finally, naive search connects directly to how bytes live in memory. Strings are byte arrays; the algorithm is literally comparing bytes at fixed offsets. That is why optimized search can be seen as reorganizing the exact same comparisons, not changing the meaning of a match. This mental model helps you debug off-by-one errors, boundary conditions, and output ordering bugs.
How this fits on projects
This baseline appears in Section 3.4 Example Usage, Section 5.10 Phase 1, and is used as the correctness oracle in Projects 2 through 15.
Definitions & key terms
- alignment -> placing the pattern at a specific text offset for comparison
- overlap -> matches that share characters but start at different positions
- worst case -> input that maximizes comparisons (e.g., repeated prefixes)
- oracle -> a trusted reference implementation for validation
Mental model diagram (ASCII)
text: b a n a n a
pattern: a n a
^^^^^ match at offset 1
a n a
^^^^^ match at offset 3
How it works (step-by-step)
- For each offset i from 0 to n - m:
- Compare pattern[0] with text[i]. If mismatch, move to i+1.
- If match, compare pattern[1] with text[i+1], and so on.
- If all m bytes match, record a match at i.
- Advance to i+1 to allow overlaps.
Minimal concrete example
int naive_search(const unsigned char* text, size_t n,
const unsigned char* pat, size_t m,
size_t* out, size_t out_cap) {
if (m == 0 || m > n) return 0;
size_t count = 0;
for (size_t i = 0; i + m <= n; i++) {
size_t j = 0;
while (j < m && text[i + j] == pat[j]) j++;
if (j == m && count < out_cap) out[count++] = i;
}
return (int)count;
}
Common misconceptions
- Misconception: After a match, you should jump by m. Correction: That skips overlapping matches.
- Misconception: Naive search is always slow. Correction: For very short patterns, it can be competitive.
Check-your-understanding questions
- Why does the pattern “aaaaab” produce worst-case behavior in a text of “aaaaaa…”?
- What happens when the pattern is empty? What should your program return?
- Predict how many matches exist for pattern “aa” in text “aaaa”.
- Why is it safe to use the naive algorithm as a correctness oracle?
Check-your-understanding answers
- Each alignment matches almost all characters before failing at the last one.
- Define it explicitly; for this project, treat empty pattern as error.
- Matches at offsets 0, 1, and 2 (overlapping).
- It directly implements the definition of substring equality.
Real-world applications
- Baseline correctness testing for optimized search tools
- Educational tools and algorithm validation
- Tiny embedded systems with very small patterns
Where you’ll apply it
- This project: Section 3.4 Example Usage and Section 6.2 Critical Test Cases.
- Also used in: P02-boyer-moore-string-search, P03-aho-corasick-multi-pattern-matching, P15-build-your-own-ripgrep.
References
- “Introduction to Algorithms” (CLRS), Chapter 32.1
- “Algorithms” (Sedgewick & Wayne), String Processing
Key insights
The naive algorithm is slow only because it forgets; its simplicity makes it the perfect oracle.
Summary
Naive search is the simplest correct substring matcher. Its clarity and determinism make it essential even when you later build far faster systems.
Homework/Exercises to practice the concept
- Implement a version that reports matches with line and column numbers.
- Modify the algorithm to count comparisons and print the total.
Solutions to the homework/exercises
- Track line breaks and map offsets to line/column in a second pass.
- Increment a counter on each byte comparison and report it.
3. Project Specification
3.1 What You Will Build
A CLI tool naive-search that takes a pattern and one or more files, scans byte-by-byte, and prints every match with file path and byte offset. It must be deterministic, handle overlaps, and provide a JSON output mode for testability. It is intentionally slow but always correct.
3.2 Functional Requirements
- Search one file:
naive-search PATTERN file.txtprints all matches. - Search multiple files:
naive-search PATTERN dir/file1 dir/file2. - Overlapping matches: must report overlaps in order.
- Binary detection: if NUL byte appears, mark file as binary and skip unless
--binary. - Output formats: text (default) and JSON (
--json). - Deterministic mode:
--fixed-tsprints fixed timestamps (all zeros). - Exit codes: 0 success, 1 IO error, 2 invalid args.
3.3 Non-Functional Requirements
- Performance: correctness over speed; still finish 10MB file in under 2s on modern hardware.
- Reliability: never crash on short files or empty input.
- Usability: clear error messages and
--helpoutput.
3.4 Example Usage / Output
$ ./naive-search "ana" banana.txt
banana.txt:1:1: ana
banana.txt:1:3: ana
$ ./naive-search --json "TODO" notes.txt
{"file":"notes.txt","byte_offset":120,"line":5,"col":7,"match":"TODO"}
3.5 Data Formats / Schemas / Protocols
Text output
<file>:<line>:<col>: <matching line>
JSON output
{
"file": "notes.txt",
"byte_offset": 120,
"line": 5,
"col": 7,
"match": "TODO"
}
3.6 Edge Cases
- Pattern longer than file
- Empty file
- Pattern with embedded NUL
- File without trailing newline
- Very large single line (>1MB)
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./naive-search "error" ./logs/sample.log
./naive-search --json "TODO" ./notes.txt
3.7.2 Golden Path Demo (Deterministic)
- Use a fixed input file
fixtures/banana.txtand--fixed-tsto make output stable.
3.7.3 CLI Transcript (Success + Failure)
$ ./naive-search --fixed-ts "ana" fixtures/banana.txt
fixtures/banana.txt:1:2: banana
fixtures/banana.txt:1:4: banana
exit code: 0
$ ./naive-search
error: missing PATTERN and FILE arguments
exit code: 2
4. Solution Architecture
4.1 High-Level Design
+-----------+ +--------------+ +-----------------+
| CLI Parse | --> | File Reader | --> | Naive Matcher |
+-----------+ +--------------+ +--------+--------+
|
v
+-------+--------+
| Formatter |
+----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | CLI parser | parse args, flags | simple positional args + flags | | Reader | read file into buffer | read() with fixed chunk size | | Matcher | compare pattern to text | byte-by-byte, allow overlaps | | Formatter | text/JSON output | stable ordering for tests |
4.3 Data Structures (No Full Code)
struct Match {
size_t offset;
size_t line;
size_t col;
};
4.4 Algorithm Overview
Naive Scan
- Load file into buffer (or stream in chunks with overlap).
- For each position i, compare pattern bytes sequentially.
- If full match, record offset and map to line/col.
Complexity Analysis
- Time: O(n*m) worst-case
- Space: O(n) if full file buffer, O(m) if streaming
5. Implementation Guide
5.1 Development Environment Setup
cc -O2 -Wall -Wextra -o naive-search src/main.c
5.2 Project Structure
naive-search/
├── src/
│ ├── main.c
│ ├── matcher.c
│ └── formatter.c
├── tests/
│ └── test_matches.txt
├── fixtures/
│ └── banana.txt
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
“What does it mean, precisely, for a pattern to match a string, and how expensive is that definition?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Byte arrays and bounds: how to avoid reading past the buffer.
- Overlapping matches: why skipping can be wrong.
- Line/column mapping: how to translate byte offsets to lines.
5.5 Questions to Guide Your Design
- How will you avoid off-by-one errors at the end of the file?
- Will you scan the file once or load it entirely into memory?
- How will you detect and handle binary files?
5.6 Thinking Exercise
Write the matches for pattern “aba” in text “ababa” by hand. How many matches do you get and why?
5.7 The Interview Questions They’ll Ask
- Why is naive search O(n*m)?
- How would you handle overlapping matches?
- When is naive search fast enough in practice?
5.8 Hints in Layers
Hint 1: Start with a function that returns the first match only. Hint 2: Add a loop that continues from the next byte to allow overlaps. Hint 3: Add a line/column map by precomputing newline offsets.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | String matching basics | CLRS | Ch. 32.1 | | C arrays and pointers | K&R | Ch. 5 | | Debugging off-by-one | “The Practice of Programming” | Ch. 5 |
5.10 Implementation Phases
Phase 1: Foundation (1 hour)
- Goals: parse CLI args, read file, compare bytes.
- Checkpoint: program prints first match or “no match”.
Phase 2: Core Functionality (1 hour)
- Goals: report all matches, handle overlaps, line/col mapping.
- Checkpoint: matches on
banana.txtmatch expected output.
Phase 3: Polish & Edge Cases (1 hour)
- Goals: JSON output, binary detection, proper exit codes.
- Checkpoint: tests pass for error cases.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | File I/O | read all vs stream | read all (v1) | simplicity and deterministic output | | Output | text vs JSON | both | JSON helps tests |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit | matcher correctness | small strings with overlaps | | Integration | CLI output | golden file comparisons | | Edge | boundary conditions | empty file, pattern longer |
6.2 Critical Test Cases
- Pattern “ana” in “banana” returns offsets 1 and 3.
- Pattern longer than text returns no matches.
- Empty pattern triggers usage error (exit code 2).
6.3 Test Data
text: "aaaa"
pattern: "aa"
expected offsets: 0,1,2
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |———|———|———-| | Off-by-one | missing last match | ensure i + m <= n | | Skipping overlaps | fewer matches | advance by 1 not m | | Wrong line/col | incorrect output | precompute newline positions |
7.2 Debugging Strategies
- Print indices during matching on tiny inputs.
- Compare output against a reference Python script.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
-nto print line numbers only. - Add
-oto print only matching substring.
8.2 Intermediate Extensions
- Implement streaming search with chunk overlap.
- Add case-insensitive search.
8.3 Advanced Extensions
- Add SIMD prefiltering (prepare for Project 7).
- Build a microbenchmark harness.
9. Real-World Connections
9.1 Industry Applications
- Baseline correctness tests in search engines
- Text processing pipelines in CI/CD tools
9.2 Related Open Source Projects
- GNU grep (baseline correctness checks)
- ripgrep (uses naive validation in tests)
9.3 Interview Relevance
- Explaining time complexity and overlap handling
- Describing why optimized algorithms are needed
10. Resources
10.1 Essential Reading
- “Introduction to Algorithms” (CLRS), String Matching chapter
- “The Practice of Programming” (Kernighan & Pike), debugging sections
10.2 Tools & Documentation
hyperfinefor benchmarkingvalgrindorasanfor memory checks
10.3 Related Projects in This Series
- Previous: none
- Next: P02-boyer-moore-string-search
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the worst-case of naive search.
- I can explain why overlaps must be considered.
11.2 Implementation
- Output matches the golden fixtures.
- Exit codes match the spec.
11.3 Growth
- I can describe at least one optimization idea for the baseline.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Correct matches on provided fixtures
- CLI usage and help output
- Deterministic output mode
Full Completion:
- JSON output + binary detection
- Benchmark harness
Excellence (Going Above & Beyond):
- Streaming mode with chunk overlap
- Profiling report with comparison to
grep