Project 9: Teddy Algorithm (Multi-Literal SIMD)

Build a Teddy-style multi-literal SIMD prefilter that can scan for dozens of literals at once using bit masks.

Quick Reference

Attribute	Value
Difficulty	Level 5: Master
Time Estimate	3-4 weeks
Main Programming Language	Rust (Alternatives: C++)
Alternative Programming Languages	C++
Coolness Level	Level 5: Pure Magic
Business Potential	Level 4: Infra / Security
Prerequisites	P07-P08, bitsets, SIMD intrinsics
Key Topics	SIMD masks, multi-literal prefiltering, candidate verification

1. Learning Objectives

By completing this project, you will:

Implement a Teddy-style SIMD prefilter for multiple literals.
Build a mask table that maps bytes to candidate literal positions.
Verify candidates with exact literal comparison.
Measure speedups over Aho-Corasick and scalar matching.
Explain the trade-offs of Teddy vs automata.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Teddy Multi-Literal Prefiltering

Fundamentals

Teddy is a SIMD-based algorithm for quickly finding many literals. The key idea is to use multiple SIMD registers and bitmasks to represent which literals could match based on a few selected bytes. For example, if each literal is length 3, you can use two or three byte positions and create masks that indicate which literals have a given byte at that position. You then scan the text with SIMD, gather masks for each byte in the window, and intersect them to get candidate literals. Those candidates are then verified with exact comparison.

This is not a full automaton like Aho-Corasick; it is a fast prefilter that reduces candidate positions dramatically. It is particularly effective when you have many literals with common lengths and can afford a verification step.

Deep Dive into the concept

A common Teddy layout uses 3 byte positions per literal. For each position, you build a 256-entry table of bitmasks. Each mask is a bitset where bit i indicates literal i has that byte at that position. While scanning, you load three SIMD windows (offsets p0, p1, p2), look up the masks for each byte, and AND them together. A non-zero result means some literal(s) might match at that position. You then verify each candidate literal at that position.

The choice of positions matters. Ideally you choose positions that maximize discrimination, such as first, middle, and last bytes. For longer literals, you can pack more information, or split into groups. Teddy trades memory (mask tables) and verification cost for SIMD speed. It shines when you have many medium-length literals and need very high throughput.

In ripgrep, Teddy is used as a prefilter for literal sets extracted from regexes. It avoids the state explosion or overhead of full automata while still being very fast. Correctness depends on the verification step; the prefilter can produce false positives but never false negatives if constructed correctly.

How this fits on projects

This prefilter is the fastest path for multi-literal regex acceleration (P06) and is an important option in P15.

Definitions & key terms

Teddy -> SIMD prefilter using masks for multi-literal search
mask table -> maps byte value to a bitset of possible literals
verification -> exact literal compare after prefilter
false positive -> candidate reported by prefilter but not a true match

Mental model diagram (ASCII)

literals: [foo, bar, baz]
positions: 0,1,2
mask[0]['f'] = 001
mask[0]['b'] = 110
mask[1]['o'] = 001
...
scan text bytes -> mask0 & mask1 & mask2 -> candidates

How it works (step-by-step)

Group literals by length (or pick a fixed length window).
Choose byte positions per literal (e.g., 0, 1, 2).
Build mask tables for each position and byte value.
SIMD scan text, gather masks, AND them to get candidates.
Verify candidates with full literal comparison.

Minimal concrete example

let m0 = table0[byte0] & table1[byte1] & table2[byte2];
if m0 != 0 { verify_candidates(m0, pos); }

Common misconceptions

Misconception: Teddy alone guarantees matches. Correction: Teddy is a prefilter; verification is required.
Misconception: More positions always better. Correction: More positions increase table size and complexity.

Check-your-understanding questions

Why does Teddy require verification?
What does the mask table represent?
How do you choose positions for each literal?

Check-your-understanding answers

The mask intersection can yield false positives.
A mapping from byte to which literals contain it at a given position.
Choose positions with high discrimination (first, last, rare bytes).

Real-world applications

ripgrep multi-literal prefilters
malware scanning and keyword detection

Where you’ll apply it

This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
Also used in: P15-build-your-own-ripgrep.

References

Teddy algorithm notes in ripgrep and regex-automata
SIMD substring search literature

Key insights

Teddy turns multi-literal search into a SIMD mask intersection problem.

Summary

Teddy is a high-speed prefilter that trades a little memory and verification work for massive throughput.

Homework/Exercises to practice the concept

Build mask tables for three literals of length 3 by hand.
Simulate one SIMD window and compute candidate masks.

Solutions to the homework/exercises

Create a 256-entry table for each position with bits for each literal.
AND the masks and list which literal bits remain.

3. Project Specification

3.1 What You Will Build

A library + CLI teddy-search that loads a list of literals, builds Teddy mask tables, scans files with SIMD, and verifies matches. It should support grouping by length and a configurable number of positions.

3.2 Functional Requirements

Literal input: read literals from file and group by length.
Mask table build: for chosen positions.
SIMD scan: compute candidate masks per window.
Verification: exact compare for candidate literals.
Stats: report candidate rate and false positives.

3.3 Non-Functional Requirements

Performance: faster than Aho-Corasick on large literal sets.
Reliability: no false negatives.
Usability: clear output of which literals matched.

3.4 Example Usage / Output

$ ./teddy-search --patterns literals.txt logs.txt
logs.txt:120: bar
logs.txt:130: baz
Stats: candidates=1024 verified=20

3.5 Data Formats / Schemas / Protocols

Patterns file with one literal per line.

3.6 Edge Cases

Literals shorter than the position count
Highly similar literals (many false positives)
Very large literal sets (100k+)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- --patterns fixtures/lits.txt fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed literals and fixtures; report deterministic candidate counts.

3.7.3 CLI Transcript (Success + Failure)

$ ./teddy-search --patterns fixtures/lits.txt fixtures/logs.txt
fixtures/logs.txt:3: foo
fixtures/logs.txt:8: bar
exit code: 0

$ ./teddy-search --patterns missing.txt fixtures/logs.txt
error: patterns file not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+------------+   +-----------+   +-----------+   +-----------+
| Literal IO |-->| Mask Build|-->| SIMD Scan |-->| Verify     |
+------------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

4.3 Data Structures (No Full Code)

struct MaskTable { table: [[u64; 256]; POSITIONS]; }

4.4 Algorithm Overview

Load literals, group by length.
Build mask tables for each position.
Scan text with SIMD windows.
AND masks to get candidates and verify.

Complexity Analysis

Build: O(total literal length)
Scan: O(n/vec_width) + verification cost

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

teddy-search/
├── src/
│   ├── main.rs
│   ├── masks.rs
│   ├── scan.rs
│   └── verify.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How can I filter many literals at once using SIMD without building a full automaton?”

5.4 Concepts You Must Understand First

Bitset representation of literal sets.
SIMD loads and mask extraction.
Verification pipelines.

5.5 Questions to Guide Your Design

How many positions per literal should you use?
How will you handle literals of different lengths?
How will you keep verification fast?

5.6 Thinking Exercise

Given 8 literals, how many bits are needed in each mask entry?

5.7 The Interview Questions They’ll Ask

Why does Teddy need verification?
How does it compare to Aho-Corasick?

5.8 Hints in Layers

Hint 1: Start with literals of fixed length 3. Hint 2: Add grouping by length. Hint 3: Add auto position selection.

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Fixed-Length Teddy (1 week)

Implement masks and SIMD scan for fixed length.
Checkpoint: correct matches on fixtures.

Phase 2: Grouping + Verification (1 week)

Support multiple length buckets and verification.
Checkpoint: results match baseline search.

Phase 3: Benchmarking (1 week)

Compare against Aho-Corasick and BM.
Checkpoint: speedup on large literal sets.

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

Literals with same prefix produce candidates.
No false negatives (compare with naive).
Mask intersection correctness.

6.3 Test Data

literals: foo, bar, baz
text: "xxfooyybar"
expected: foo, bar

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

Log mask values for a small literal set.
Compare candidate positions with naive scan.

8. Extensions & Challenges

8.1 Beginner Extensions

Add case-insensitive mode.
Support literals with different encodings.

8.2 Intermediate Extensions

Auto-select byte positions based on entropy.
Add AVX2 path.

8.3 Advanced Extensions

Integrate with literal extraction and regex verification.
Build a hybrid Teddy + Aho-Corasick engine.

9. Real-World Connections

9.1 Industry Applications

High-throughput log scanning
IDS and signature matching

ripgrep (Teddy prefilter)
Hyperscan (multi-pattern matching)

9.3 Interview Relevance

Discussing SIMD prefilter design and trade-offs

10. Resources

10.1 Essential Reading

ripgrep regex-automata docs
SIMD substring search articles

10.2 Tools & Documentation

perf for profiling
cargo bench for benchmarks

Previous: P08-simd-substring-search-memmem
Next: P10-parallel-directory-traversal

11. Self-Assessment Checklist

11.1 Understanding

I can explain mask tables and candidate selection.
I can explain why verification is required.

11.2 Implementation

Matches agree with baseline on fixtures.
Candidate rate is measured and reported.

11.3 Growth

I can describe when Teddy outperforms Aho-Corasick.

12. Submission / Completion Criteria

Minimum Viable Completion:

Fixed-length Teddy with SIMD scan + verification

Full Completion:

Grouping by length + benchmark stats

Excellence (Going Above & Beyond):

Auto position selection and AVX2 support
Integration with regex literal extraction