Project 13: Fuzzy String Matcher (fzf-style)

Build a fast fuzzy matcher that scores candidate strings based on subsequence matches and ranking heuristics.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: UX tooling |
| Prerequisites | basic algorithms, string scoring |
| Key Topics | subsequence matching, scoring heuristics, DP optimization |

1. Learning Objectives

By completing this project, you will:

  1. Implement subsequence-based fuzzy matching.
  2. Design scoring heuristics for ranking results.
  3. Optimize scoring for large candidate sets.
  4. Compare your results with fzf scoring behavior.
  5. Explain trade-offs between speed and optimal scoring.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Fuzzy Matching and Scoring Heuristics

Fundamentals

Fuzzy matching is about finding approximate matches, often by checking whether a query is a subsequence of a candidate string. For example, query “src” matches “source” by selecting s, r, c in order. The quality of a match is not just whether it exists, but how well it aligns: contiguous matches are better than scattered ones; matches at word boundaries or path segments are often better. A scoring function encodes these preferences.

The simplest algorithm scans the candidate and tries to match query characters in order. It yields a boolean match and possibly positions. To rank candidates, you assign scores based on positions: contiguous runs, early matches, word boundary bonuses, and penalties for gaps.
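That greedy scan can be sketched as follows; the function name and the case-folding choice are illustrative, not prescribed by the project:

```go
package main

import (
	"fmt"
	"strings"
)

// subsequence reports whether q is a subsequence of s (case-folded)
// and returns the matched byte positions in s. It scans byte-wise,
// which is fine for ASCII; a full version would iterate runes.
func subsequence(q, s string) ([]int, bool) {
	q, s = strings.ToLower(q), strings.ToLower(s)
	positions := make([]int, 0, len(q))
	i := 0 // index into q
	for j := 0; j < len(s) && i < len(q); j++ {
		if s[j] == q[i] {
			positions = append(positions, j)
			i++
		}
	}
	return positions, i == len(q)
}

func main() {
	pos, ok := subsequence("src", "source")
	fmt.Println(pos, ok) // [0 3 4] true
}
```

The positions are what a scorer consumes next: the boolean alone cannot distinguish a tight match from a scattered one.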

Deep Dive into the concept

fzf uses a scoring model that favors contiguous matches and word boundaries. A typical scoring function includes:

  • Bonus for matches at the start of a word or path segment.
  • Bonus for consecutive matches.
  • Penalty for large gaps between matched characters.
  • Slight penalty for late matches.

A DP-based approach can compute an optimal score, but it is more expensive. fzf provides two algorithms: a fast greedy method and a slower optimal method. For large candidate sets, you want a fast heuristic that is “good enough” but stable.

You also need to support case-insensitive matching and normalized paths. For example, query “mrg” should match “my_rust_grep” strongly due to word boundaries. These heuristics are the difference between a tool that feels smart and one that feels random.
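One way to encode these preferences is a scorer over the match positions; the bonus and penalty constants below are illustrative placeholders, not fzf's actual values:

```go
package main

import "fmt"

// score rates a set of match positions in text. The constants are
// illustrative; real tools tune them against user expectations.
func score(text string, positions []int) int {
	const (
		boundaryBonus    = 16 // match at index 0 or right after a separator
		consecutiveBonus = 8  // match adjacent to the previous match
		gapPenalty       = 1  // per skipped character between matches
	)
	s := 0
	prev := -2
	for _, p := range positions {
		if p == 0 || isSeparator(text[p-1]) {
			s += boundaryBonus
		}
		if p == prev+1 {
			s += consecutiveBonus
		} else if prev >= 0 {
			s -= (p - prev - 1) * gapPenalty
		}
		prev = p
	}
	return s
}

func isSeparator(b byte) bool {
	return b == '/' || b == '_' || b == '-' || b == '.' || b == ' '
}

func main() {
	fmt.Println(score("source", []int{0, 3, 4}))      // 22: "src" scattered in "source"
	fmt.Println(score("src/main.rs", []int{0, 1, 2})) // 32: contiguous match at a boundary
}
```

Note how the contiguous boundary match outranks the scattered one, matching the intuition described above.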

How this fits into the project series

This matcher is the core of the interactive fuzzy finder (P14) and a common feature in modern CLIs.

Definitions & key terms

  • subsequence -> characters appear in order, not necessarily contiguous
  • scoring heuristic -> rule that assigns match quality score
  • word boundary -> transition between path segments or case changes
  • DP -> dynamic programming for optimal scoring

Mental model diagram (ASCII)

query:  s r c
text:   s o u r c e
match:  s . . r c .
score:  contiguous bonus + early bonus

How it works (step-by-step)

  1. Normalize query and candidate (case fold if needed).
  2. Scan candidate to see if query is a subsequence.
  3. Record match positions and compute score.
  4. Sort candidates by score.

Minimal concrete example

func match(q, s string) (score int, ok bool) { /* subsequence + score */ }

Common misconceptions

  • Misconception: fuzzy matching is edit distance. Correction: fzf-style is subsequence-based, not Levenshtein.
  • Misconception: higher score always means longer match. Correction: scoring is heuristic and context-dependent.

Check-your-understanding questions

  1. Why are word boundary bonuses important?
  2. What is the difference between subsequence and substring?
  3. Why might DP be too slow for large candidate sets?

Check-your-understanding answers

  1. They make matches align with human expectations.
  2. Subsequence allows gaps; substring requires contiguous chars.
  3. DP is O(n*m) per candidate and expensive at scale.

Real-world applications

  • fzf and interactive pickers
  • IDE file search

Where you’ll apply it

In the interactive fuzzy finder (P14) and in any CLI feature that needs ranked completion.

References

  • fzf source code (algo.go)
  • Smith-Waterman style scoring (for comparison)

Key insights

A good fuzzy matcher is mostly about scoring that matches human intuition.

Summary

Fuzzy matching balances correctness, scoring quality, and speed for large candidate sets.

Homework/Exercises to practice the concept

  1. Compare scoring for “src” against “source” and “subdir/src”.
  2. Implement a word boundary bonus and observe rank changes.

Solutions to the homework/exercises

  1. “source” should score high due to contiguous match; path matches may get boundary bonus.
  2. Add bonus when match aligns with ‘/’ or ‘_’ or camelCase transition.
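The boundary check from solution 2 might look like this; the separator set and function name are assumptions, not part of the spec:

```go
package main

import (
	"fmt"
	"unicode"
)

// isBoundary reports whether position i in runes starts a "word":
// the first character, one following a separator, or a lower->upper
// camelCase transition.
func isBoundary(runes []rune, i int) bool {
	if i == 0 {
		return true
	}
	prev, cur := runes[i-1], runes[i]
	switch prev {
	case '/', '_', '-', '.', ' ':
		return true
	}
	return unicode.IsLower(prev) && unicode.IsUpper(cur)
}

func main() {
	r := []rune("myRust_grep")
	for i, c := range r {
		if isBoundary(r, i) {
			fmt.Printf("boundary at %d (%c)\n", i, c)
		}
	}
}
```

Operating on runes rather than bytes keeps the check correct for the Unicode edge case listed in section 3.6.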

3. Project Specification

3.1 What You Will Build

A library plus a CLI, fuzzy-score, that scores a query against a list of candidate strings and outputs the top-K matches with scores.

3.2 Functional Requirements

  1. Subsequence match: determine if query matches candidate.
  2. Scoring: implement bonuses for contiguous matches and boundaries.
  3. Sorting: output top-K results by score.
  4. Case-insensitive: optional flag.
  5. Performance: handle 100k candidates quickly.

3.3 Non-Functional Requirements

  • Performance: 100k candidates scored in <50ms on modern hardware.
  • Reliability: deterministic ranking for equal scores.
  • Usability: clear output and flags.

3.4 Example Usage / Output

$ ./fuzzy-score --query "src" --top 3 fixtures/candidates.txt
1) src/main.rs (score=120)
2) source/config.yml (score=98)
3) subdir/src/lib.c (score=92)

3.5 Data Formats / Schemas / Protocols

Candidates file: one string per line.

3.6 Edge Cases

  • Empty query
  • Very short candidates
  • Candidates with Unicode characters

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./fuzzy-score --query "src" --top 5 fixtures/candidates.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed candidates and deterministic tie-breakers (lexicographic).
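The lexicographic tie-break can live directly in the sort comparator; a sketch reusing the Result shape from section 4.3:

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	Text  string
	Score int
}

// rank sorts by score descending, breaking ties lexicographically so
// equal-score candidates always appear in the same order across runs.
func rank(rs []Result) {
	sort.Slice(rs, func(i, j int) bool {
		if rs[i].Score != rs[j].Score {
			return rs[i].Score > rs[j].Score
		}
		return rs[i].Text < rs[j].Text
	})
}

func main() {
	rs := []Result{{"b.txt", 90}, {"a.txt", 90}, {"top.txt", 120}}
	rank(rs)
	fmt.Println(rs) // [{top.txt 120} {a.txt 90} {b.txt 90}]
}
```

sort.Slice is not a stable sort, but because the comparator is a total order (no two distinct entries compare equal), the output is still deterministic.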

3.7.3 CLI Transcript (Success + Failure)

$ ./fuzzy-score --query "abc" --top 2 fixtures/candidates.txt
1) abc.txt (score=130)
2) a_b_c.txt (score=90)
exit code: 0

$ ./fuzzy-score --query "" fixtures/candidates.txt
error: empty query not allowed
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| CLI Parse |-->| Scorer    |-->| Top-K     |-->| Formatter |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Matcher | subsequence check | greedy scan |
| Scorer | compute score | bonuses + penalties |
| Top-K | maintain best results | min-heap |

4.3 Data Structures (No Full Code)

type Result struct { Text string; Score int }

4.4 Algorithm Overview

  1. Read candidates.
  2. For each candidate, check subsequence and compute score.
  3. Maintain top-K results.
  4. Sort and output.
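Step 3's top-K maintenance can use container/heap: keep the K best seen so far in a min-heap whose root is the weakest, evicting it when a better candidate arrives. A minimal sketch (the Result type mirrors section 4.3):

```go
package main

import (
	"container/heap"
	"fmt"
)

type Result struct {
	Text  string
	Score int
}

// minHeap orders by ascending score, so the root is the weakest result.
type minHeap []Result

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(Result)) }
func (h *minHeap) Pop() any {
	old := *h
	r := old[len(old)-1]
	*h = old[:len(old)-1]
	return r
}

// topK returns the k highest-scoring results, best first.
func topK(results []Result, k int) []Result {
	h := &minHeap{}
	for _, r := range results {
		if h.Len() < k {
			heap.Push(h, r)
		} else if r.Score > (*h)[0].Score {
			(*h)[0] = r
			heap.Fix(h, 0)
		}
	}
	out := make([]Result, 0, h.Len())
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(Result))
	}
	// Pop yields weakest-first; reverse for best-first output.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	rs := []Result{{"a", 10}, {"b", 90}, {"c", 50}, {"d", 70}}
	fmt.Println(topK(rs, 2)) // [{b 90} {d 70}]
}
```

This keeps memory at O(K) regardless of how many candidates stream through, which is the rationale behind the heap recommendation in section 5.11.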

Complexity Analysis

  • Time: O(N * L) where N is the candidate count and L is average candidate length, plus O(N log K) for top-K maintenance
  • Space: O(K)

5. Implementation Guide

5.1 Development Environment Setup

go build -o fuzzy-score ./cmd/fuzzy-score

5.2 Project Structure

fuzzy-score/
├── cmd/
│   └── fuzzy-score/
│       └── main.go
├── internal/
│   ├── matcher.go
│   └── scorer.go
└── fixtures/

5.3 The Core Question You’re Answering

“How do I rank fuzzy matches so results feel intuitive and consistent?”

5.4 Concepts You Must Understand First

  1. Subsequence matching logic.
  2. Scoring heuristics and bonuses.
  3. Top-K selection with heap.

5.5 Questions to Guide Your Design

  1. How should you score contiguous matches?
  2. How should you score word boundaries?
  3. How will you handle ties deterministically?

5.6 Thinking Exercise

Given query “mk” and candidate “my_kernel”, how should it score vs “make”?

5.7 The Interview Questions They’ll Ask

  1. Why use subsequence matching instead of edit distance?
  2. How do you keep fuzzy matching fast on large sets?

5.8 Hints in Layers

  • Hint 1: Start with a simple greedy matcher.
  • Hint 2: Add bonuses for contiguous matches.
  • Hint 3: Add boundary bonuses for ‘/’, ‘_’ and camelCase transitions.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Algorithms | “Algorithms” (Sedgewick) | sorting, heaps |
| Search UX | “Designing Interfaces” | search patterns |

5.10 Implementation Phases

Phase 1: Basic Matcher (3 days)

  • Implement subsequence check and boolean result.
  • Checkpoint: correct matches on fixtures.

Phase 2: Scoring (3 days)

  • Add bonuses and penalties.
  • Checkpoint: ranking looks reasonable.

Phase 3: Performance (3 days)

  • Add top-K heap and benchmarks.
  • Checkpoint: <50ms on 100k candidates.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Scoring | greedy vs DP | greedy | faster, simpler |
| Top-K | sort all vs heap | heap | memory efficient |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | scoring rules | boundary bonuses |
| Integration | ranking output | fixtures |
| Performance | large candidate set | 100k lines |

6.2 Critical Test Cases

  1. Contiguous match scores higher than scattered.
  2. Word boundary match beats internal match.
  3. Empty query is rejected.

6.3 Test Data

query: "src"
list: ["src/main.rs", "source.txt", "s r c"]
expected: src/main.rs first

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Scores feel random | poor ranking | tune bonuses and penalties |
| Slow performance | lag on 100k items | use heap and early exit |
| Case handling bugs | missed matches | normalize case consistently |

7.2 Debugging Strategies

  • Print score breakdown per candidate.
  • Compare ranking with fzf on the same list.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --tie-break flag.
  • Add highlighting of matched characters.

8.2 Intermediate Extensions

  • Implement DP-based optimal scoring.
  • Add path-aware scoring (directory depth).

8.3 Advanced Extensions

  • SIMD-accelerated subsequence matching.
  • Learned scoring model from user feedback.

9. Real-World Connections

9.1 Industry Applications

  • Terminal fuzzy finders
  • IDE quick open

9.2 Open Source Examples

  • fzf algorithm
  • skim (Rust fuzzy finder)

9.3 Interview Relevance

  • Discussing scoring heuristics and trade-offs

10. Resources

10.1 Essential Reading

  • fzf source code (algo.go)
  • skim algorithm docs

10.2 Tools & Documentation

  • go test -bench for microbenchmarks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain subsequence matching.
  • I can justify scoring heuristics.

11.2 Implementation

  • Ranking is deterministic and stable.
  • Performance meets target on 100k candidates.

11.3 Growth

  • I can compare greedy vs optimal scoring trade-offs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Greedy fuzzy matcher with scoring

Full Completion:

  • Boundary bonuses + top-K output

Excellence (Going Above & Beyond):

  • DP scoring mode
  • Highlighted match output