Project 9: Teddy Algorithm (Multi-Literal SIMD)

Build a Teddy-style multi-literal SIMD prefilter that can scan for dozens of literals at once using bit masks.

Quick Reference

Attribute Value
Difficulty Level 5: Master
Time Estimate 3-4 weeks
Main Programming Language Rust (Alternatives: C++)
Alternative Programming Languages C++
Coolness Level Level 5: Pure Magic
Business Potential Level 4: Infra / Security
Prerequisites P07-P08, bitsets, SIMD intrinsics
Key Topics SIMD masks, multi-literal prefiltering, candidate verification

1. Learning Objectives

By completing this project, you will:

  1. Implement a Teddy-style SIMD prefilter for multiple literals.
  2. Build a mask table that maps bytes to candidate literal positions.
  3. Verify candidates with exact literal comparison.
  4. Measure speedups over Aho-Corasick and scalar matching.
  5. Explain the trade-offs of Teddy vs automata.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Teddy Multi-Literal Prefiltering

Fundamentals

Teddy is a SIMD-based algorithm for quickly finding many literals. The key idea is to use multiple SIMD registers and bitmasks to represent which literals could match based on a few selected bytes. For example, if each literal is length 3, you can use two or three byte positions and create masks that indicate which literals have a given byte at that position. You then scan the text with SIMD, gather masks for each byte in the window, and intersect them to get candidate literals. Those candidates are then verified with exact comparison.

This is not a full automaton like Aho-Corasick; it is a fast prefilter that reduces candidate positions dramatically. It is particularly effective when you have many literals with common lengths and can afford a verification step.

Deep Dive into the concept

A common Teddy layout uses 3 byte positions per literal. For each position, you build a 256-entry table of bitmasks. Each mask is a bitset where bit i indicates literal i has that byte at that position. While scanning, you load three SIMD windows (offsets p0, p1, p2), look up the masks for each byte, and AND them together. A non-zero result means some literal(s) might match at that position. You then verify each candidate literal at that position.

The choice of positions matters. Ideally you choose positions that maximize discrimination, such as first, middle, and last bytes. For longer literals, you can pack more information, or split into groups. Teddy trades memory (mask tables) and verification cost for SIMD speed. It shines when you have many medium-length literals and need very high throughput.

In ripgrep, Teddy is used as a prefilter for literal sets extracted from regexes. It avoids the state explosion or overhead of full automata while still being very fast. Correctness depends on the verification step; the prefilter can produce false positives but never false negatives if constructed correctly.

How this fits on projects

This prefilter is the fastest path for multi-literal regex acceleration (P06) and is an important option in P15.

Definitions & key terms

  • Teddy -> SIMD prefilter using masks for multi-literal search
  • mask table -> maps byte value to a bitset of possible literals
  • verification -> exact literal compare after prefilter
  • false positive -> candidate reported by prefilter but not a true match

Mental model diagram (ASCII)

literals: [foo, bar, baz]
positions: 0,1,2
mask[0]['f'] = 001
mask[0]['b'] = 110
mask[1]['o'] = 001
...
scan text bytes -> mask0 & mask1 & mask2 -> candidates

How it works (step-by-step)

  1. Group literals by length (or pick a fixed length window).
  2. Choose byte positions per literal (e.g., 0, 1, 2).
  3. Build mask tables for each position and byte value.
  4. SIMD scan text, gather masks, AND them to get candidates.
  5. Verify candidates with full literal comparison.

Minimal concrete example

let m0 = table0[byte0] & table1[byte1] & table2[byte2];
if m0 != 0 { verify_candidates(m0, pos); }

Common misconceptions

  • Misconception: Teddy alone guarantees matches. Correction: Teddy is a prefilter; verification is required.
  • Misconception: More positions always better. Correction: More positions increase table size and complexity.

Check-your-understanding questions

  1. Why does Teddy require verification?
  2. What does the mask table represent?
  3. How do you choose positions for each literal?

Check-your-understanding answers

  1. The mask intersection can yield false positives.
  2. A mapping from byte to which literals contain it at a given position.
  3. Choose positions with high discrimination (first, last, rare bytes).

Real-world applications

  • ripgrep multi-literal prefilters
  • malware scanning and keyword detection

Where you’ll apply it

References

  • Teddy algorithm notes in ripgrep and regex-automata
  • SIMD substring search literature

Key insights

Teddy turns multi-literal search into a SIMD mask intersection problem.

Summary

Teddy is a high-speed prefilter that trades a little memory and verification work for massive throughput.

Homework/Exercises to practice the concept

  1. Build mask tables for three literals of length 3 by hand.
  2. Simulate one SIMD window and compute candidate masks.

Solutions to the homework/exercises

  1. Create a 256-entry table for each position with bits for each literal.
  2. AND the masks and list which literal bits remain.

3. Project Specification

3.1 What You Will Build

A library + CLI teddy-search that loads a list of literals, builds Teddy mask tables, scans files with SIMD, and verifies matches. It should support grouping by length and a configurable number of positions.

3.2 Functional Requirements

  1. Literal input: read literals from file and group by length.
  2. Mask table build: for chosen positions.
  3. SIMD scan: compute candidate masks per window.
  4. Verification: exact compare for candidate literals.
  5. Stats: report candidate rate and false positives.

3.3 Non-Functional Requirements

  • Performance: faster than Aho-Corasick on large literal sets.
  • Reliability: no false negatives.
  • Usability: clear output of which literals matched.

3.4 Example Usage / Output

$ ./teddy-search --patterns literals.txt logs.txt
logs.txt:120: bar
logs.txt:130: baz
Stats: candidates=1024 verified=20

3.5 Data Formats / Schemas / Protocols

Patterns file with one literal per line.

3.6 Edge Cases

  • Literals shorter than the position count
  • Highly similar literals (many false positives)
  • Very large literal sets (100k+)

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- --patterns fixtures/lits.txt fixtures/logs.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed literals and fixtures; report deterministic candidate counts.

3.7.3 CLI Transcript (Success + Failure)

$ ./teddy-search --patterns fixtures/lits.txt fixtures/logs.txt
fixtures/logs.txt:3: foo
fixtures/logs.txt:8: bar
exit code: 0

$ ./teddy-search --patterns missing.txt fixtures/logs.txt
error: patterns file not found
exit code: 1

4. Solution Architecture

4.1 High-Level Design

+------------+   +-----------+   +-----------+   +-----------+
| Literal IO |-->| Mask Build|-->| SIMD Scan |-->| Verify     |
+------------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Grouper | group literals by length | fixed length buckets | | Mask builder | build tables per position | bitset per literal | | Scanner | SIMD window scan | reuse P07 intrinsics | | Verifier | exact compare | memcmp |

4.3 Data Structures (No Full Code)

struct MaskTable { table: [[u64; 256]; POSITIONS]; }

4.4 Algorithm Overview

  1. Load literals, group by length.
  2. Build mask tables for each position.
  3. Scan text with SIMD windows.
  4. AND masks to get candidates and verify.

Complexity Analysis

  • Build: O(total literal length)
  • Scan: O(n/vec_width) + verification cost

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

teddy-search/
├── src/
│   ├── main.rs
│   ├── masks.rs
│   ├── scan.rs
│   └── verify.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How can I filter many literals at once using SIMD without building a full automaton?”

5.4 Concepts You Must Understand First

  1. Bitset representation of literal sets.
  2. SIMD loads and mask extraction.
  3. Verification pipelines.

5.5 Questions to Guide Your Design

  1. How many positions per literal should you use?
  2. How will you handle literals of different lengths?
  3. How will you keep verification fast?

5.6 Thinking Exercise

Given 8 literals, how many bits are needed in each mask entry?

5.7 The Interview Questions They’ll Ask

  1. Why does Teddy need verification?
  2. How does it compare to Aho-Corasick?

5.8 Hints in Layers

Hint 1: Start with literals of fixed length 3. Hint 2: Add grouping by length. Hint 3: Add auto position selection.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | SIMD search | 0x80 articles | SIMD substring search | | Bitsets | “Hacker’s Delight” | bit tricks |

5.10 Implementation Phases

Phase 1: Fixed-Length Teddy (1 week)

  • Implement masks and SIMD scan for fixed length.
  • Checkpoint: correct matches on fixtures.

Phase 2: Grouping + Verification (1 week)

  • Support multiple length buckets and verification.
  • Checkpoint: results match baseline search.

Phase 3: Benchmarking (1 week)

  • Compare against Aho-Corasick and BM.
  • Checkpoint: speedup on large literal sets.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Positions | 2 vs 3 vs 4 | 3 | good balance | | Bitset size | u64 vs Vec | Vec | supports many literals |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | mask generation | known literals | | Integration | matches vs baseline | fixture files | | Performance | candidate rates | large literal sets |

6.2 Critical Test Cases

  1. Literals with same prefix produce candidates.
  2. No false negatives (compare with naive).
  3. Mask intersection correctness.

6.3 Test Data

literals: foo, bar, baz
text: "xxfooyybar"
expected: foo, bar

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Incorrect mask build | missing matches | unit test mask tables | | Wrong position offsets | false negatives | verify candidate byte positions | | Overflow in bitsets | dropped literals | use enough bits per literal |

7.2 Debugging Strategies

  • Log mask values for a small literal set.
  • Compare candidate positions with naive scan.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive mode.
  • Support literals with different encodings.

8.2 Intermediate Extensions

  • Auto-select byte positions based on entropy.
  • Add AVX2 path.

8.3 Advanced Extensions

  • Integrate with literal extraction and regex verification.
  • Build a hybrid Teddy + Aho-Corasick engine.

9. Real-World Connections

9.1 Industry Applications

  • High-throughput log scanning
  • IDS and signature matching
  • ripgrep (Teddy prefilter)
  • Hyperscan (multi-pattern matching)

9.3 Interview Relevance

  • Discussing SIMD prefilter design and trade-offs

10. Resources

10.1 Essential Reading

  • ripgrep regex-automata docs
  • SIMD substring search articles

10.2 Tools & Documentation

  • perf for profiling
  • cargo bench for benchmarks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain mask tables and candidate selection.
  • I can explain why verification is required.

11.2 Implementation

  • Matches agree with baseline on fixtures.
  • Candidate rate is measured and reported.

11.3 Growth

  • I can describe when Teddy outperforms Aho-Corasick.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Fixed-length Teddy with SIMD scan + verification

Full Completion:

  • Grouping by length + benchmark stats

Excellence (Going Above & Beyond):

  • Auto position selection and AVX2 support
  • Integration with regex literal extraction