Project 1: Naive String Search (Baseline)

Build a correct, deterministic, brute-force string matcher you can use as the performance and correctness baseline for every other project.

Quick Reference

Attribute	Value
Difficulty	Level 1: Beginner
Time Estimate	2-3 hours
Main Programming Language	C (Alternatives: Rust, Zig, Go)
Alternative Programming Languages	Rust, Zig, Go
Coolness Level	Level 2: Solid Foundation
Business Potential	Level 1: Internal Tooling
Prerequisites	C basics, pointers, arrays, CLI args
Key Topics	Naive string matching, complexity, byte scanning, overlapping matches

1. Learning Objectives

By completing this project, you will:

Implement a correct O(n*m) substring search and explain its worst-case behavior.
Produce deterministic output with stable ordering and fixed timestamps for tests.
Handle overlapping matches and edge cases (empty pattern, pattern longer than text).
Build a baseline benchmark harness for later comparisons.
Explain why naive search can still be fast for short patterns and small files.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Naive String Matching

Fundamentals

Naive string matching compares the pattern against the text at every possible alignment. If the pattern has length m and the text has length n, there are n - m + 1 alignments. At each alignment, you compare the pattern bytes from left to right until you hit a mismatch or reach the end. This is the simplest correct algorithm for substring search and establishes the baseline for all optimizations. It is useful precisely because it is brutally straightforward: it does no preprocessing, keeps no state beyond the current indices, and is easy to reason about for correctness. That makes it the perfect reference implementation for validating more complex algorithms such as Boyer-Moore, Aho-Corasick, and regex engines.

In systems work, you need a baseline not just for performance but for semantics. The naive algorithm defines what it means for a match to exist at a position: every pattern byte equals the corresponding text byte at that offset. There is no skipping, no heuristic, no probabilistic prefilter. That makes it the ideal oracle for fuzzing and regression tests. If an optimized matcher disagrees with the naive baseline on deterministic inputs, the optimized version is wrong, not the baseline.

Deep Dive into the concept

The naive algorithm’s worst case is O(n*m). This occurs when the pattern shares long prefixes with many positions in the text, such as pattern “aaaaab” in text “aaaaaaaaaaaaaaaaab”. At each alignment, the algorithm compares nearly all m characters before failing on the last comparison, leading to roughly (n - m + 1) * (m - 1) comparisons. This is the canonical example of why preprocessing and skipping heuristics matter. But the naive algorithm can still be fast in practice for small m because its inner loop is tight and predictable: a simple byte compare plus a branch. Modern branch predictors and caches handle this well for many workloads.

Understanding overlaps is critical. If the pattern is “ana” and the text is “banana”, matches appear at positions 1 and 3. A naive implementation that jumps ahead by m after finding a match will miss the overlap. For correctness, you must define match reporting as “every alignment with full equality” and advance by exactly one position after each alignment. This matters later in regex engines and multi-pattern matchers, where overlapping matches are common and expected.

Finally, naive search connects directly to how bytes live in memory. Strings are byte arrays; the algorithm is literally comparing bytes at fixed offsets. That is why optimized search can be seen as reorganizing the exact same comparisons, not changing the meaning of a match. This mental model helps you debug off-by-one errors, boundary conditions, and output ordering bugs.

How this fits on projects

This baseline appears in Section 3.4 Example Usage, Section 5.10 Phase 1, and is used as the correctness oracle in Projects 2 through 15.

Definitions & key terms

alignment -> placing the pattern at a specific text offset for comparison
overlap -> matches that share characters but start at different positions
worst case -> input that maximizes comparisons (e.g., repeated prefixes)
oracle -> a trusted reference implementation for validation

Mental model diagram (ASCII)

text:    b a n a n a
pattern:   a n a
           ^^^^^ match at offset 1
            a n a
              ^^^^^ match at offset 3

How it works (step-by-step)

For each offset i from 0 to n - m:
Compare pattern[0] with text[i]. If mismatch, move to i+1.
If match, compare pattern[1] with text[i+1], and so on.
If all m bytes match, record a match at i.
Advance to i+1 to allow overlaps.

Minimal concrete example

int naive_search(const unsigned char* text, size_t n,
                 const unsigned char* pat, size_t m,
                 size_t* out, size_t out_cap) {
    if (m == 0 || m > n) return 0;
    size_t count = 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && text[i + j] == pat[j]) j++;
        if (j == m && count < out_cap) out[count++] = i;
    }
    return (int)count;
}

Common misconceptions

Misconception: After a match, you should jump by m. Correction: That skips overlapping matches.
Misconception: Naive search is always slow. Correction: For very short patterns, it can be competitive.

Check-your-understanding questions

Why does the pattern “aaaaab” produce worst-case behavior in a text of “aaaaaa…”?
What happens when the pattern is empty? What should your program return?
Predict how many matches exist for pattern “aa” in text “aaaa”.
Why is it safe to use the naive algorithm as a correctness oracle?

Check-your-understanding answers

Each alignment matches almost all characters before failing at the last one.
Define it explicitly; for this project, treat empty pattern as error.
Matches at offsets 0, 1, and 2 (overlapping).
It directly implements the definition of substring equality.

Real-world applications

Baseline correctness testing for optimized search tools
Educational tools and algorithm validation
Tiny embedded systems with very small patterns

Where you’ll apply it

This project: Section 3.4 Example Usage and Section 6.2 Critical Test Cases.
Also used in: P02-boyer-moore-string-search, P03-aho-corasick-multi-pattern-matching, P15-build-your-own-ripgrep.

References

“Introduction to Algorithms” (CLRS), Chapter 32.1
“Algorithms” (Sedgewick & Wayne), String Processing

Key insights

The naive algorithm is slow only because it forgets; its simplicity makes it the perfect oracle.

Summary

Naive search is the simplest correct substring matcher. Its clarity and determinism make it essential even when you later build far faster systems.

Homework/Exercises to practice the concept

Implement a version that reports matches with line and column numbers.
Modify the algorithm to count comparisons and print the total.

Solutions to the homework/exercises

Track line breaks and map offsets to line/column in a second pass.
Increment a counter on each byte comparison and report it.

3. Project Specification

3.1 What You Will Build

A CLI tool naive-search that takes a pattern and one or more files, scans byte-by-byte, and prints every match with file path and byte offset. It must be deterministic, handle overlaps, and provide a JSON output mode for testability. It is intentionally slow but always correct.

3.2 Functional Requirements

Search one file: naive-search PATTERN file.txt prints all matches.
Search multiple files: naive-search PATTERN dir/file1 dir/file2.
Overlapping matches: must report overlaps in order.
Binary detection: if NUL byte appears, mark file as binary and skip unless --binary.
Output formats: text (default) and JSON (--json).
Deterministic mode: --fixed-ts prints fixed timestamps (all zeros).
Exit codes: 0 success, 1 IO error, 2 invalid args.

3.3 Non-Functional Requirements

Performance: correctness over speed; still finish 10MB file in under 2s on modern hardware.
Reliability: never crash on short files or empty input.
Usability: clear error messages and --help output.

3.4 Example Usage / Output

$ ./naive-search "ana" banana.txt
banana.txt:1:1: ana
banana.txt:1:3: ana

$ ./naive-search --json "TODO" notes.txt
{"file":"notes.txt","byte_offset":120,"line":5,"col":7,"match":"TODO"}

3.5 Data Formats / Schemas / Protocols

Text output

<file>:<line>:<col>: <matching line>

JSON output

{
  "file": "notes.txt",
  "byte_offset": 120,
  "line": 5,
  "col": 7,
  "match": "TODO"
}

3.6 Edge Cases

Pattern longer than file
Empty file
Pattern with embedded NUL
File without trailing newline
Very large single line (>1MB)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./naive-search "error" ./logs/sample.log
./naive-search --json "TODO" ./notes.txt

3.7.2 Golden Path Demo (Deterministic)

Use a fixed input file fixtures/banana.txt and --fixed-ts to make output stable.

3.7.3 CLI Transcript (Success + Failure)

$ ./naive-search --fixed-ts "ana" fixtures/banana.txt
fixtures/banana.txt:1:2: banana
fixtures/banana.txt:1:4: banana
exit code: 0

$ ./naive-search
error: missing PATTERN and FILE arguments
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+     +--------------+     +-----------------+
| CLI Parse | --> | File Reader  | --> | Naive Matcher   |
+-----------+     +--------------+     +--------+--------+
                                              |
                                              v
                                      +-------+--------+
                                      | Formatter     |
                                      +----------------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct Match {
    size_t offset;
    size_t line;
    size_t col;
};

4.4 Algorithm Overview

Naive Scan

Load file into buffer (or stream in chunks with overlap).
For each position i, compare pattern bytes sequentially.
If full match, record offset and map to line/col.

Complexity Analysis

Time: O(n*m) worst-case
Space: O(n) if full file buffer, O(m) if streaming

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o naive-search src/main.c

5.2 Project Structure

naive-search/
├── src/
│   ├── main.c
│   ├── matcher.c
│   └── formatter.c
├── tests/
│   └── test_matches.txt
├── fixtures/
│   └── banana.txt
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

“What does it mean, precisely, for a pattern to match a string, and how expensive is that definition?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

Byte arrays and bounds: how to avoid reading past the buffer.
Overlapping matches: why skipping can be wrong.
Line/column mapping: how to translate byte offsets to lines.

5.5 Questions to Guide Your Design

How will you avoid off-by-one errors at the end of the file?
Will you scan the file once or load it entirely into memory?
How will you detect and handle binary files?

5.6 Thinking Exercise

Write the matches for pattern “aba” in text “ababa” by hand. How many matches do you get and why?

5.7 The Interview Questions They’ll Ask

Why is naive search O(n*m)?
How would you handle overlapping matches?
When is naive search fast enough in practice?

5.8 Hints in Layers

Hint 1: Start with a function that returns the first match only. Hint 2: Add a loop that continues from the next byte to allow overlaps. Hint 3: Add a line/column map by precomputing newline offsets.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | String matching basics | CLRS | Ch. 32.1 | | C arrays and pointers | K&R | Ch. 5 | | Debugging off-by-one | “The Practice of Programming” | Ch. 5 |

5.10 Implementation Phases

Phase 1: Foundation (1 hour)

Goals: parse CLI args, read file, compare bytes.
Checkpoint: program prints first match or “no match”.

Phase 2: Core Functionality (1 hour)

Goals: report all matches, handle overlaps, line/col mapping.
Checkpoint: matches on banana.txt match expected output.

Phase 3: Polish & Edge Cases (1 hour)

Goals: JSON output, binary detection, proper exit codes.
Checkpoint: tests pass for error cases.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Pattern “ana” in “banana” returns offsets 1 and 3.
Pattern longer than text returns no matches.
Empty pattern triggers usage error (exit code 2).

6.3 Test Data

text: "aaaa"
pattern: "aa"
expected offsets: 0,1,2

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Print indices during matching on tiny inputs.
Compare output against a reference Python script.

8. Extensions & Challenges

8.1 Beginner Extensions

Add -n to print line numbers only.
Add -o to print only matching substring.

8.2 Intermediate Extensions

Implement streaming search with chunk overlap.
Add case-insensitive search.

8.3 Advanced Extensions

Add SIMD prefiltering (prepare for Project 7).
Build a microbenchmark harness.

9. Real-World Connections

9.1 Industry Applications

Baseline correctness tests in search engines
Text processing pipelines in CI/CD tools

GNU grep (baseline correctness checks)
ripgrep (uses naive validation in tests)

9.3 Interview Relevance

Explaining time complexity and overlap handling
Describing why optimized algorithms are needed

10. Resources

10.1 Essential Reading

“Introduction to Algorithms” (CLRS), String Matching chapter
“The Practice of Programming” (Kernighan & Pike), debugging sections

10.2 Tools & Documentation

hyperfine for benchmarking
valgrind or asan for memory checks

Previous: none
Next: P02-boyer-moore-string-search

11. Self-Assessment Checklist

11.1 Understanding

I can explain the worst-case of naive search.
I can explain why overlaps must be considered.

11.2 Implementation

Output matches the golden fixtures.
Exit codes match the spec.

11.3 Growth

I can describe at least one optimization idea for the baseline.

12. Submission / Completion Criteria

Minimum Viable Completion:

Correct matches on provided fixtures
CLI usage and help output
Deterministic output mode

Full Completion:

JSON output + binary detection
Benchmark harness

Excellence (Going Above & Beyond):

Streaming mode with chunk overlap
Profiling report with comparison to grep