Project 13: Fuzzy String Matcher (fzf-style)
Build a fast fuzzy matcher that scores candidate strings based on subsequence matches and ranking heuristics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | Go (Alternatives: Rust, C++) |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: UX tooling |
| Prerequisites | basic algorithms, string scoring |
| Key Topics | subsequence matching, scoring heuristics, DP optimization |
1. Learning Objectives
By completing this project, you will:
- Implement subsequence-based fuzzy matching.
- Design scoring heuristics for ranking results.
- Optimize scoring for large candidate sets.
- Compare your results with fzf scoring behavior.
- Explain trade-offs between speed and optimal scoring.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Fuzzy Matching and Scoring Heuristics
Fundamentals
Fuzzy matching is about finding approximate matches, often by checking whether a query is a subsequence of a candidate string. For example, query “src” matches “source” by selecting s, r, c in order. The quality of a match is not just whether it exists, but how well it aligns: contiguous matches are better than scattered ones; matches at word boundaries or path segments are often better. A scoring function encodes these preferences.
The simplest algorithm scans the candidate and tries to match query characters in order. It yields a boolean match and possibly positions. To rank candidates, you assign scores based on positions: contiguous runs, early matches, word boundary bonuses, and penalties for gaps.
Deep Dive into the concept
fzf uses a scoring model that favors contiguous matches and word boundaries. A typical scoring function includes:
- Bonus for matches at the start of a word or path segment.
- Bonus for consecutive matches.
- Penalty for large gaps between matched characters.
- Slight penalty for late matches.
A DP-based approach can compute an optimal score, but it is more expensive. fzf provides two algorithms: a fast greedy method and a slower optimal method. For large candidate sets, you want a fast heuristic that is “good enough” but stable.
You also need to support case-insensitive matching and normalized paths. For example, query “mrg” should match “my_rust_grep” strongly due to word boundaries. These heuristics are the difference between a tool that feels smart and one that feels random.
How this fits on projects
This matcher is the core of the interactive fuzzy finder (P14) and a common feature in modern CLIs.
Definitions & key terms
- subsequence -> characters appear in order, not necessarily contiguous
- scoring heuristic -> rule that assigns match quality score
- word boundary -> transition between path segments or case changes
- DP -> dynamic programming for optimal scoring
Mental model diagram (ASCII)
query: s r c
text: s o u r c e
match: s . . r c .
score: contiguous bonus + early bonus
How it works (step-by-step)
- Normalize query and candidate (case fold if needed).
- Scan candidate to see if query is a subsequence.
- Record match positions and compute score.
- Sort candidates by score.
Minimal concrete example
func match(q, s string) (score int, ok bool) { /* subsequence + score */ }
Common misconceptions
- Misconception: fuzzy matching is edit distance. Correction: fzf-style is subsequence-based, not Levenshtein.
- Misconception: higher score always means longer match. Correction: scoring is heuristic and context-dependent.
Check-your-understanding questions
- Why are word boundary bonuses important?
- What is the difference between subsequence and substring?
- Why might DP be too slow for large candidate sets?
Check-your-understanding answers
- They make matches align with human expectations.
- Subsequence allows gaps; substring requires contiguous chars.
- DP is O(n*m) per candidate and expensive at scale.
Real-world applications
- fzf and interactive pickers
- IDE file search
Where you’ll apply it
- This project: Section 4.4 Algorithm Overview and Section 6.2 Tests.
- Also used in: P14-interactive-fuzzy-finder-full-fzf-clone.
References
- fzf source code (algo.go)
- Smith-Waterman style scoring (for comparison)
Key insights
A good fuzzy matcher is mostly about scoring that matches human intuition.
Summary
Fuzzy matching balances correctness, scoring quality, and speed for large candidate sets.
Homework/Exercises to practice the concept
- Compare scoring for “src” against “source” and “subdir/src”.
- Implement a word boundary bonus and observe rank changes.
Solutions to the homework/exercises
- “source” should score high due to contiguous match; path matches may get boundary bonus.
- Add bonus when match aligns with ‘/’ or ‘_’ or camelCase transition.
3. Project Specification
3.1 What You Will Build
A library + CLI fuzzy-score that scores a query against a list of candidate strings and outputs the top-K matches with scores.
3.2 Functional Requirements
- Subsequence match: determine if query matches candidate.
- Scoring: implement bonuses for contiguous matches and boundaries.
- Sorting: output top-K results by score.
- Case-insensitive: optional flag.
- Performance: handle 100k candidates quickly.
3.3 Non-Functional Requirements
- Performance: 100k candidates scored in <50ms on modern hardware.
- Reliability: deterministic ranking for equal scores.
- Usability: clear output and flags.
3.4 Example Usage / Output
$ ./fuzzy-score --query "src" --top 3 fixtures/candidates.txt
1) src/main.rs (score=120)
2) source/config.yml (score=98)
3) subdir/src/lib.c (score=92)
3.5 Data Formats / Schemas / Protocols
Candidates file: one string per line.
3.6 Edge Cases
- Empty query
- Very short candidates
- Candidates with Unicode characters
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
./fuzzy-score --query "src" --top 5 fixtures/candidates.txt
3.7.2 Golden Path Demo (Deterministic)
Use fixed candidates and deterministic tie-breakers (lexicographic).
3.7.3 CLI Transcript (Success + Failure)
$ ./fuzzy-score --query "abc" --top 2 fixtures/candidates.txt
1) abc.txt (score=130)
2) a_b_c.txt (score=90)
exit code: 0
$ ./fuzzy-score --query "" fixtures/candidates.txt
error: empty query not allowed
exit code: 2
4. Solution Architecture
4.1 High-Level Design
+-----------+ +-----------+ +-----------+ +-----------+
| CLI Parse |-->| Scorer |-->| Top-K |-->| Formatter |
+-----------+ +-----------+ +-----------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Matcher | subsequence check | greedy scan | | Scorer | compute score | bonuses + penalties | | Top-K | maintain best results | min-heap |
4.3 Data Structures (No Full Code)
type Result struct { Text string; Score int }
4.4 Algorithm Overview
- Read candidates.
- For each candidate, check subsequence and compute score.
- Maintain top-K results.
- Sort and output.
Complexity Analysis
- Time: O(N * L) where L is candidate length
- Space: O(K)
5. Implementation Guide
5.1 Development Environment Setup
go build -o fuzzy-score ./cmd/fuzzy-score
5.2 Project Structure
fuzzy-score/
├── cmd/
│ └── fuzzy-score/
│ └── main.go
├── internal/
│ ├── matcher.go
│ └── scorer.go
└── fixtures/
5.3 The Core Question You’re Answering
“How do I rank fuzzy matches so results feel intuitive and consistent?”
5.4 Concepts You Must Understand First
- Subsequence matching logic.
- Scoring heuristics and bonuses.
- Top-K selection with heap.
5.5 Questions to Guide Your Design
- How should you score contiguous matches?
- How should you score word boundaries?
- How will you handle ties deterministically?
5.6 Thinking Exercise
Given query “mk” and candidate “my_kernel”, how should it score vs “make”?
5.7 The Interview Questions They’ll Ask
- Why use subsequence matching instead of edit distance?
- How do you keep fuzzy matching fast on large sets?
5.8 Hints in Layers
Hint 1: Start with a simple greedy matcher. Hint 2: Add bonuses for contiguous matches. Hint 3: Add boundary bonuses for ‘/’, ‘_’ and camelCase.
5.9 Books That Will Help
| Topic | Book | Chapter | |——-|——|———| | Algorithms | “Algorithms” (Sedgewick) | sorting, heaps | | Search UX | “Designing Interfaces” | search patterns |
5.10 Implementation Phases
Phase 1: Basic Matcher (3 days)
- Implement subsequence check and boolean result.
- Checkpoint: correct matches on fixtures.
Phase 2: Scoring (3 days)
- Add bonuses and penalties.
- Checkpoint: ranking looks reasonable.
Phase 3: Performance (3 days)
- Add top-K heap and benchmarks.
- Checkpoint: <50ms on 100k candidates.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Scoring | greedy vs DP | greedy | faster, simpler | | Top-K | sort all vs heap | heap | memory efficient |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples | |———-|———|———-| | Unit | scoring rules | boundary bonuses | | Integration | ranking output | fixtures | | Performance | large candidate set | 100k lines |
6.2 Critical Test Cases
- Contiguous match scores higher than scattered.
- Word boundary match beats internal match.
- Empty query is rejected.
6.3 Test Data
query: "src"
list: ["src/main.rs", "source.txt", "s r c"]
expected: src/main.rs first
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution | |———|———|———-| | Scores feel random | poor ranking | tune bonuses and penalties | | Slow performance | lag on 100k items | use heap and early exit | | Case handling bugs | missed matches | normalize case consistently |
7.2 Debugging Strategies
- Print score breakdown per candidate.
- Compare ranking with fzf on the same list.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
--tie-breakflag. - Add highlighting of matched characters.
8.2 Intermediate Extensions
- Implement DP-based optimal scoring.
- Add path-aware scoring (directory depth).
8.3 Advanced Extensions
- SIMD-accelerated subsequence matching.
- Learned scoring model from user feedback.
9. Real-World Connections
9.1 Industry Applications
- Terminal fuzzy finders
- IDE quick open
9.2 Related Open Source Projects
- fzf algorithm
- skim (Rust fuzzy finder)
9.3 Interview Relevance
- Discussing scoring heuristics and trade-offs
10. Resources
10.1 Essential Reading
- fzf source code (algo.go)
- skim algorithm docs
10.2 Tools & Documentation
go test -benchfor microbenchmarks
10.3 Related Projects in This Series
- Previous: P12-line-oriented-search-with-simd-line-counting
- Next: P14-interactive-fuzzy-finder-full-fzf-clone
11. Self-Assessment Checklist
11.1 Understanding
- I can explain subsequence matching.
- I can justify scoring heuristics.
11.2 Implementation
- Ranking is deterministic and stable.
- Performance meets target on 100k candidates.
11.3 Growth
- I can compare greedy vs optimal scoring trade-offs.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Greedy fuzzy matcher with scoring
Full Completion:
- Boundary bonuses + top-K output
Excellence (Going Above & Beyond):
- DP scoring mode
- Highlighted match output