Project 3: Aho-Corasick Multi-Pattern Matching

Build a multi-pattern matcher using a trie + failure links, capable of finding thousands of literals in one pass.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Security / Data tooling |
| Prerequisites | P01, basic data structures (tries, queues) |
| Key Topics | Trie construction, failure links, automaton traversal, multi-pattern search |

1. Learning Objectives

By completing this project, you will:

  1. Build a trie for thousands of patterns and compute failure links.
  2. Implement a streaming automaton that matches all patterns in one pass.
  3. Report overlapping and nested matches correctly.
  4. Compare multi-pattern performance against running Boyer-Moore repeatedly.
  5. Explain why Aho-Corasick is linear in the input size.

2. All Theory Needed (Per-Concept Breakdown)

Fundamentals

Aho-Corasick turns a set of patterns into a single finite automaton that can be run over the text once. The core structure is a trie: each node represents a prefix of some pattern, and edges represent next characters. To handle mismatches efficiently, the algorithm adds failure links: a node's failure link points to the node representing the longest proper suffix of its string that is also a prefix of some pattern. When a mismatch occurs, you follow the failure link and continue, just like the failure function in KMP but generalized to many patterns.

This transforms multi-pattern matching into a single pass over the text with O(n + total pattern length + number of matches) total time. Every input byte causes amortized constant work: each failure-link step undoes a depth increase that an earlier byte already paid for.

Deep Dive into the concept

Failure link construction uses BFS over the trie. The root’s children have failure links pointing to the root. For each deeper node, you follow its parent’s failure link to find the longest suffix that can be extended by the current character; if none exists, the failure link falls back to the root. This creates a fallback chain that avoids re-scanning text. Additionally, each node can have output links: a list of patterns that end at that node or along its failure chain. This is how you report multiple matches at a single position (e.g., “she” and “he” both ending at one index while scanning “ushers”).

The linear-time guarantee is significant for search tools: unlike backtracking regex, Aho-Corasick cannot explode exponentially. It is ideal for keyword scanning, intrusion detection, and log filtering where you match many fixed strings.

Memory trade-offs matter. A naive trie with 256-way edges per node is large (with 4-byte indices, about 1 KiB per node). A space-optimized implementation uses sparse edge lists or a vector of (byte, next) pairs, which shrinks memory but turns each transition into a short scan. Your project should start with a simpler dense or map-based representation, then explore the trade-offs.

How this fits into other projects

This automaton is the heart of multi-literal filters, used in literal extraction (P06) and Teddy-style SIMD prefilters (P09). It also underpins many security scanners.

Definitions & key terms

  • trie -> prefix tree of patterns
  • failure link -> fallback pointer on mismatch
  • output -> list of pattern IDs that end at a node
  • automaton -> state machine for matching patterns

Mental model diagram (ASCII)

patterns: he, she, his, hers

Trie:
(root)
 ├─ h ─┬─ e* ─ r ─ s*
 │     └─ i ─ s*
 └─ s ─ h ─ e*

(* marks a node where a pattern ends)

Failure links (not drawn) point each node to its longest proper suffix that is also a trie prefix.

How it works (step-by-step)

  1. Insert all patterns into a trie and mark terminal nodes.
  2. Build failure links using BFS from root.
  3. For each input byte, follow trie edges; on mismatch, follow failure links.
  4. When you reach a node with outputs, emit matches.

Minimal concrete example

struct Node {
    int next[256];   /* child index per byte; 0 = no edge */
    int fail;        /* failure link */
    int out[4];      /* pattern IDs ending here (fixed cap for illustration) */
    int out_len;
};

Common misconceptions

  • Misconception: Aho-Corasick is only for huge pattern sets. Correction: It can be worthwhile even for a few dozen patterns.
  • Misconception: Output is only at terminal nodes. Correction: You must emit outputs along the failure chain.

Check-your-understanding questions

  1. Why do we use BFS to build failure links?
  2. What happens when multiple patterns end at one node?
  3. How does Aho-Corasick avoid re-scanning text?

Check-your-understanding answers

  1. Because failure links depend on shorter prefixes which are processed earlier.
  2. The node stores multiple pattern IDs; you emit all of them.
  3. Failure links reuse previous matched prefix information.

Real-world applications

  • Intrusion detection (Snort signatures)
  • Spam filters with keyword lists
  • Fast log scanning for many tokens

Where you’ll apply it

In P06 (literal extraction) and P09 (Teddy-style SIMD prefilters), which build on this automaton as their multi-literal core.

References

  • Aho & Corasick (1975), “Efficient string matching”
  • “Algorithms on Strings” by Crochemore

Key insights

Aho-Corasick makes multi-pattern search a single linear pass by building a failure-linked trie.

Summary

Aho-Corasick is the standard tool for matching many literals at once with linear-time guarantees.

Homework/Exercises to practice the concept

  1. Build a trie for 5 patterns and draw failure links by hand.
  2. Trace the automaton on a short input and list matches.

Solutions to the homework/exercises

  1. Use BFS: root children fail to root; deeper nodes follow parent fail edges.
  2. Every time you hit a node, emit its outputs and those of its failure chain.

3. Project Specification

3.1 What You Will Build

A CLI tool ac-search that loads a set of patterns, builds an Aho-Corasick automaton, and scans files to report all matches (pattern ID, offset, and line/col). Includes a benchmark mode comparing against repeated Boyer-Moore searches.

3.2 Functional Requirements

  1. Pattern input: read patterns from a file (--patterns patterns.txt).
  2. Automaton build: construct trie + failure links.
  3. Search: stream through file and emit all matches.
  4. Output: text and JSON formats.
  5. Benchmark: compare with repeated single-pattern search.

3.3 Non-Functional Requirements

  • Performance: linear scan should outperform repeated BM by 2x+ for 1,000 patterns.
  • Reliability: handle overlapping and nested matches correctly.
  • Usability: clear pattern count and stats.

3.4 Example Usage / Output

$ ./ac-search --patterns patterns.txt logs.txt
logs.txt:120:5: error
logs.txt:121:1: timeout

3.5 Data Formats / Schemas / Protocols

Patterns file

error
timeout
panic

Match output

<file>:<line>:<col>: <pattern>

3.6 Edge Cases

  • Patterns that are prefixes of each other
  • Empty lines in pattern file
  • Large pattern set (10k+)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./ac-search --patterns fixtures/patterns.txt fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures and output ordering by offset then pattern ID.

3.7.3 CLI Transcript (Success + Failure)

$ ./ac-search --patterns fixtures/patterns.txt fixtures/logs.txt
fixtures/logs.txt:3:5: error
fixtures/logs.txt:5:1: timeout
exit code: 0

$ ./ac-search --patterns missing.txt fixtures/logs.txt
error: patterns file not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+------------+   +----------------+   +-----------------+
| Pattern IO |-->| Trie/Fail Build|-->| Stream Matcher  |
+------------+   +----------------+   +--------+--------+
                                                |
                                                v
                                       +--------+-------+
                                       | Formatter      |
                                       +----------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Trie builder | insert patterns | choose dense vs sparse edges |
| Failure builder | BFS links | queue-based construction |
| Matcher | state transitions | emit outputs in order |

4.3 Data Structures (No Full Code)

struct Node {
    int next[256];           /* child index per byte; 0 = no edge */
    int fail;                /* failure link */
    int* out; int out_len;   /* pattern IDs ending at this node */
};

4.4 Algorithm Overview

  1. Insert patterns into trie.
  2. Compute failure links with BFS.
  3. Stream over text; follow transitions/failures.
  4. Emit outputs for current node + failure chain.

Complexity Analysis

  • Build: O(total pattern length)
  • Search: O(n + matches)

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o ac-search src/main.c

5.2 Project Structure

ac-search/
├── src/
│   ├── main.c
│   ├── trie.c
│   ├── failure.c
│   └── matcher.c
├── fixtures/
│   ├── patterns.txt
│   └── logs.txt
└── Makefile

5.3 The Core Question You’re Answering

“How can I search for thousands of strings in one pass without backtracking?”

5.4 Concepts You Must Understand First

  1. Trie insertion and prefix representation.
  2. BFS traversal for failure links.
  3. Output emission for nested matches.

5.5 Questions to Guide Your Design

  1. How will you store outputs for nodes with multiple patterns?
  2. How do you avoid duplicate outputs when following failure chains?
  3. What memory layout gives you acceptable speed?

5.6 Thinking Exercise

Construct the automaton for patterns “he”, “she”, and “hers” and trace the text “ushers”.

5.7 The Interview Questions They’ll Ask

  1. Why is Aho-Corasick linear time?
  2. What is the purpose of failure links?

5.8 Hints in Layers

Hint 1: Start with a dense 256-edge trie for simplicity.

Hint 2: Build failure links with a queue (BFS).

Hint 3: Emit outputs from the current node and its failure chain.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Aho-Corasick | “Algorithms on Strings” | AC section |
| String matching | CLRS | Ch. 32 |

5.10 Implementation Phases

Phase 1: Trie Build (3 days)

  • Insert patterns and mark terminal nodes.
  • Checkpoint: trie paths match patterns.

Phase 2: Failure Links (3 days)

  • Build failure links using BFS.
  • Checkpoint: failure links match hand-drawn example.

Phase 3: Streaming Matcher (3 days)

  • Scan files, emit matches, add JSON output.
  • Checkpoint: outputs match naive multi-search baseline.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Edge storage | dense array vs map | dense array (v1) | simpler and fast |
| Output storage | vector vs linked list | vector | compact and cache-friendly |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | trie + failure correctness | small pattern sets |
| Integration | match output | fixtures |
| Edge | prefixes | “a”, “ab”, “abc” |

6.2 Critical Test Cases

  1. Patterns with shared prefixes.
  2. Overlapping outputs at same index.
  3. Large pattern file (10k) does not crash.

6.3 Test Data

patterns: he, she, his, hers
text: ushers
expected: she, he, hers

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Missing outputs | fewer matches | emit failure-chain outputs |
| Wrong fail links | incorrect matches | compare with hand-drawn trie |
| Excess memory | high RAM | use compact node storage |

7.2 Debugging Strategies

  • Print failure links for tiny pattern sets.
  • Validate by comparing with repeated naive search.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive mode.
  • Report pattern IDs instead of strings.

8.2 Intermediate Extensions

  • Sparse edge storage.
  • Stream patterns from file without loading all into memory.

8.3 Advanced Extensions

  • Integrate SIMD prefilters before the automaton.
  • Add multi-threaded scanning of large files.

9. Real-World Connections

9.1 Industry Applications

  • IDS and malware scanners
  • Log monitoring platforms
  • Hyperscan (multi-pattern engine)
  • aho-corasick Rust crate

9.2 Interview Relevance

  • Discussing automata construction and linear-time guarantees

10. Resources

10.1 Essential Reading

  • Aho & Corasick paper
  • “Algorithms on Strings” AC chapter

10.2 Tools & Documentation

  • hyperfine for benchmarks
  • valgrind for memory checks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain failure links and why they work.
  • I can reason about output emission.

11.2 Implementation

  • Matches equal baseline across fixtures.
  • Pattern loading and output are deterministic.

11.3 Growth

  • I can describe trade-offs between dense and sparse tries.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Trie + failure links
  • Correct match outputs for small fixtures

Full Completion:

  • Benchmark vs repeated BM
  • JSON output mode

Excellence (Going Above & Beyond):

  • Sparse edges + memory analysis
  • Parallel scanning of large files