Project 3: Aho-Corasick Multi-Pattern Matching

Build a multi-pattern matcher using a trie + failure links, capable of finding thousands of literals in one pass.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: Security / Data tooling |
| Prerequisites | P01, basic data structures (tries, queues) |
| Key Topics | Trie construction, failure links, automaton traversal, multi-pattern search |

1. Learning Objectives

By completing this project, you will:

  1. Build a trie for thousands of patterns and compute failure links.
  2. Implement a streaming automaton that matches all patterns in one pass.
  3. Report overlapping and nested matches correctly.
  4. Compare multi-pattern performance against running Boyer-Moore repeatedly.
  5. Explain why Aho-Corasick is linear in the input size.

2. All Theory Needed (Per-Concept Breakdown)

Fundamentals

Aho-Corasick turns a set of patterns into a single finite automaton that can be run over the text once. The core structure is a trie: each node represents a prefix of some pattern, and edges represent next characters. To handle mismatches efficiently, the algorithm adds failure links: a node's failure link points to the node representing the longest proper suffix of its string that is also a prefix of some pattern. When a mismatch occurs, you follow the failure link and continue, just like the failure function in KMP but generalized to many patterns.

This transforms multi-pattern matching into a single pass over the text with O(n + total pattern length + number of matches) total time. Every input byte causes amortized constant work: each failure-link step undoes a depth increase that an earlier byte already paid for.

Deep Dive into the concept

Failure link construction uses BFS over the trie. The root’s children have failure links pointing to the root. For each deeper node, you follow its parent’s failure link to find the longest suffix that can be extended by the current character; if none exists, the failure link falls back to the root. This creates a fallback chain that avoids re-scanning text. Additionally, each node can have output links: a list of patterns that end at that node or along its failure chain. This is how you report multiple matches at a single position (e.g., “she” and “he” both ending at one index while scanning “ushers”).

The linear-time guarantee is significant for search tools: unlike backtracking regex, Aho-Corasick cannot explode exponentially. It is ideal for keyword scanning, intrusion detection, and log filtering where you match many fixed strings.

Memory trade-offs matter. A naive trie with 256-way edges per node is large (with 4-byte indices, about 1 KiB per node). A space-optimized implementation uses sparse edge lists or a vector of (byte, next) pairs, which shrinks memory but turns each transition into a short scan. Your project should start with a simpler dense or map-based representation, then explore the trade-offs.

How this fits into other projects

This automaton is the heart of multi-literal filters, used in literal extraction (P06) and Teddy-style SIMD prefilters (P09). It also underpins many security scanners.

Definitions & key terms

  • trie -> prefix tree of patterns
  • failure link -> fallback pointer on mismatch
  • output -> list of pattern IDs that end at a node
  • automaton -> state machine for matching patterns

Mental model diagram (ASCII)

patterns: he, she, his, hers

Trie:
(root)
 ├─ h ─┬─ e* ─ r ─ s*
 │     └─ i ─ s*
 └─ s ─ h ─ e*

(* marks a node where a pattern ends)

Failure links (not drawn) point each node to its longest proper suffix that is also a trie prefix.

How it works (step-by-step)

  1. Insert all patterns into a trie and mark terminal nodes.
  2. Build failure links using BFS from root.
  3. For each input byte, follow trie edges; on mismatch, follow failure links.
  4. When you reach a node with outputs, emit matches.

Minimal concrete example

struct Node {
    int next[256];   /* child index per byte; 0 = no edge */
    int fail;        /* failure link */
    int out[4];      /* pattern IDs ending here (fixed cap for illustration) */
    int out_len;
};

Common misconceptions

  • Misconception: Aho-Corasick is only for huge pattern sets. Correction: It can be worthwhile even for a few dozen patterns.
  • Misconception: Output is only at terminal nodes. Correction: You must emit outputs along the failure chain.

Check-your-understanding questions

  1. Why do we use BFS to build failure links?
  2. What happens when multiple patterns end at one node?
  3. How does Aho-Corasick avoid re-scanning text?

Check-your-understanding answers

  1. Because failure links depend on shorter prefixes which are processed earlier.
  2. The node stores multiple pattern IDs; you emit all of them.
  3. Failure links reuse previous matched prefix information.

Real-world applications

  • Intrusion detection (Snort signatures)
  • Spam filters with keyword lists
  • Fast log scanning for many tokens

Where you’ll apply it

In P06 (literal extraction) and P09 (Teddy-style SIMD prefilters), which build on this automaton as their multi-literal core.

References

  • Aho & Corasick (1975), “Efficient string matching”
  • “Algorithms on Strings” by Crochemore

Key insights

Aho-Corasick makes multi-pattern search a single linear pass by building a failure-linked trie.

Summary

Aho-Corasick is the standard tool for matching many literals at once with linear-time guarantees.

Homework/Exercises to practice the concept

  1. Build a trie for 5 patterns and draw failure links by hand.
  2. Trace the automaton on a short input and list matches.

Solutions to the homework/exercises

  1. Use BFS: root children fail to root; deeper nodes follow parent fail edges.
  2. Every time you hit a node, emit its outputs and those of its failure chain.

3. Project Specification

3.1 What You Will Build

A CLI tool ac-search that loads a set of patterns, builds an Aho-Corasick automaton, and scans files to report all matches (pattern ID, offset, and line/col). Includes a benchmark mode comparing against repeated Boyer-Moore searches.

3.2 Functional Requirements

  1. Pattern input: read patterns from a file (--patterns patterns.txt).
  2. Automaton build: construct trie + failure links.
  3. Search: stream through file and emit all matches.
  4. Output: text and JSON formats.
  5. Benchmark: compare with repeated single-pattern search.

3.3 Non-Functional Requirements

  • Performance: linear scan should outperform repeated BM by 2x+ for 1,000 patterns.
  • Reliability: handle overlapping and nested matches correctly.
  • Usability: clear pattern count and stats.

3.4 Example Usage / Output

$ ./ac-search --patterns patterns.txt logs.txt
logs.txt:120:5: error
logs.txt:121:1: timeout

3.5 Data Formats / Schemas / Protocols

Patterns file

error
timeout
panic

Match output

<file>:<line>:<col>: <pattern>

3.6 Edge Cases

  • Patterns that are prefixes of each other
  • Empty lines in pattern file
  • Large pattern set (10k+)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./ac-search --patterns fixtures/patterns.txt fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed fixtures and output ordering by offset then pattern ID.

3.7.3 CLI Transcript (Success + Failure)

$ ./ac-search --patterns fixtures/patterns.txt fixtures/logs.txt
fixtures/logs.txt:3:5: error
fixtures/logs.txt:5:1: timeout
exit code: 0

$ ./ac-search --patterns missing.txt fixtures/logs.txt
error: patterns file not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+------------+   +----------------+   +-----------------+
| Pattern IO |-->| Trie/Fail Build|-->| Stream Matcher  |
+------------+   +----------------+   +--------+--------+
                                                |
                                                v
                                       +--------+-------+
                                       | Formatter      |
                                       +----------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Trie builder | insert patterns | choose dense vs sparse edges |
| Failure builder | BFS links | queue-based construction |
| Matcher | state transitions | emit outputs in order |

4.3 Data Structures (No Full Code)

struct Node {
    int next[256];           /* child index per byte; 0 = no edge */
    int fail;                /* failure link */
    int* out; int out_len;   /* pattern IDs ending at this node */
};

4.4 Algorithm Overview

  1. Insert patterns into trie.
  2. Compute failure links with BFS.
  3. Stream over text; follow transitions/failures.
  4. Emit outputs for current node + failure chain.

Complexity Analysis

  • Build: O(total pattern length)
  • Search: O(n + matches)

5. Implementation Guide

5.1 Development Environment Setup

cc -O2 -Wall -Wextra -o ac-search src/main.c

5.2 Project Structure

ac-search/
├── src/
│   ├── main.c
│   ├── trie.c
│   ├── failure.c
│   └── matcher.c
├── fixtures/
│   ├── patterns.txt
│   └── logs.txt
└── Makefile

5.3 The Core Question You’re Answering

“How can I search for thousands of strings in one pass without backtracking?”

5.4 Concepts You Must Understand First

  1. Trie insertion and prefix representation.
  2. BFS traversal for failure links.
  3. Output emission for nested matches.

5.5 Questions to Guide Your Design

  1. How will you store outputs for nodes with multiple patterns?
  2. How do you avoid duplicate outputs when following failure chains?
  3. What memory layout gives you acceptable speed?

5.6 Thinking Exercise

Construct the automaton for patterns “he”, “she”, and “hers” and trace the text “ushers”.

5.7 The Interview Questions They’ll Ask

  1. Why is Aho-Corasick linear time?
  2. What is the purpose of failure links?

5.8 Hints in Layers

Hint 1: Start with a dense 256-edge trie for simplicity.

Hint 2: Build failure links with a queue (BFS).

Hint 3: Emit outputs from the current node and its failure chain.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Aho-Corasick | “Algorithms on Strings” | AC section |
| String matching | CLRS | Ch. 32 |

5.10 Implementation Phases

Phase 1: Trie Build (3 days)

  • Insert patterns and mark terminal nodes.
  • Checkpoint: trie paths match patterns.

Phase 2: Failure Links (3 days)

  • Build failure links using BFS.
  • Checkpoint: failure links match hand-drawn example.

Phase 3: Streaming Matcher (3 days)

  • Scan files, emit matches, add JSON output.
  • Checkpoint: outputs match naive multi-search baseline.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Edge storage | dense array vs map | dense array (v1) | simpler and fast |
| Output storage | vector vs linked list | vector | compact and cache-friendly |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | trie + failure correctness | small pattern sets |
| Integration | match output | fixtures |
| Edge | prefixes | “a”, “ab”, “abc” |

6.2 Critical Test Cases

  1. Patterns with shared prefixes.
  2. Overlapping outputs at same index.
  3. Large pattern file (10k) does not crash.

6.3 Test Data

patterns: he, she, his, hers
text: ushers
expected: she, he, hers

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Missing outputs | fewer matches | emit failure-chain outputs |
| Wrong fail links | incorrect matches | compare with hand-drawn trie |
| Excess memory | high RAM | use compact node storage |

7.2 Debugging Strategies

  • Print failure links for tiny pattern sets.
  • Validate by comparing with repeated naive search.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive mode.
  • Report pattern IDs instead of strings.

8.2 Intermediate Extensions

  • Sparse edge storage.
  • Stream patterns from file without loading all into memory.

8.3 Advanced Extensions

  • Integrate SIMD prefilters before the automaton.
  • Add multi-threaded scanning of large files.

9. Real-World Connections

9.1 Industry Applications

  • IDS and malware scanners
  • Log monitoring platforms
  • Hyperscan (multi-pattern engine)
  • aho-corasick Rust crate

9.2 Interview Relevance

  • Discussing automata construction and linear-time guarantees

10. Resources

10.1 Essential Reading

  • Aho & Corasick paper
  • “Algorithms on Strings” AC chapter

10.2 Tools & Documentation

  • hyperfine for benchmarks
  • valgrind for memory checks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain failure links and why they work.
  • I can reason about output emission.

11.2 Implementation

  • Matches equal baseline across fixtures.
  • Pattern loading and output are deterministic.

11.3 Growth

  • I can describe trade-offs between dense and sparse tries.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Trie + failure links
  • Correct match outputs for small fixtures

Full Completion:

  • Benchmark vs repeated BM
  • JSON output mode

Excellence (Going Above & Beyond):

  • Sparse edges + memory analysis
  • Parallel scanning of large files