Project 13: Fuzzy String Matcher (fzf-style)

Build a fast fuzzy matcher that scores candidate strings based on subsequence matches and ranking heuristics.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 4: Hardcore |
| Business Potential | Level 3: UX tooling |
| Prerequisites | basic algorithms, string scoring |
| Key Topics | subsequence matching, scoring heuristics, DP optimization |

1. Learning Objectives

By completing this project, you will:

  1. Implement subsequence-based fuzzy matching.
  2. Design scoring heuristics for ranking results.
  3. Optimize scoring for large candidate sets.
  4. Compare your results with fzf scoring behavior.
  5. Explain trade-offs between speed and optimal scoring.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Fuzzy Matching and Scoring Heuristics

Fundamentals

Fuzzy matching is about finding approximate matches, often by checking whether a query is a subsequence of a candidate string. For example, query “src” matches “source” by selecting s, r, c in order. The quality of a match is not just whether it exists, but how well it aligns: contiguous matches are better than scattered ones; matches at word boundaries or path segments are often better. A scoring function encodes these preferences.

The simplest algorithm scans the candidate and tries to match query characters in order. It yields a boolean match and possibly positions. To rank candidates, you assign scores based on positions: contiguous runs, early matches, word boundary bonuses, and penalties for gaps.
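That greedy scan can be sketched as follows; the function name and the case-folding choice are illustrative, not prescribed by the project:

```go
package main

import (
	"fmt"
	"strings"
)

// subsequence reports whether q is a subsequence of s (case-folded)
// and returns the matched byte positions in s. It scans byte-wise,
// which is fine for ASCII; a full version would iterate runes.
func subsequence(q, s string) ([]int, bool) {
	q, s = strings.ToLower(q), strings.ToLower(s)
	positions := make([]int, 0, len(q))
	i := 0 // index into q
	for j := 0; j < len(s) && i < len(q); j++ {
		if s[j] == q[i] {
			positions = append(positions, j)
			i++
		}
	}
	return positions, i == len(q)
}

func main() {
	pos, ok := subsequence("src", "source")
	fmt.Println(pos, ok) // [0 3 4] true
}
```

The positions are what a scorer consumes next: the boolean alone cannot distinguish a tight match from a scattered one.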

Deep Dive into the concept

fzf uses a scoring model that favors contiguous matches and word boundaries. A typical scoring function includes:

  • Bonus for matches at the start of a word or path segment.
  • Bonus for consecutive matches.
  • Penalty for large gaps between matched characters.
  • Slight penalty for late matches.

A DP-based approach can compute an optimal score, but it is more expensive. fzf provides two algorithms: a fast greedy method and a slower optimal method. For large candidate sets, you want a fast heuristic that is “good enough” but stable.

You also need to support case-insensitive matching and normalized paths. For example, query “mrg” should match “my_rust_grep” strongly due to word boundaries. These heuristics are the difference between a tool that feels smart and one that feels random.
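One way to encode these preferences is a scorer over the match positions; the bonus and penalty constants below are illustrative placeholders, not fzf's actual values:

```go
package main

import "fmt"

// score rates a set of match positions in text. The constants are
// illustrative; real tools tune them against user expectations.
func score(text string, positions []int) int {
	const (
		boundaryBonus    = 16 // match at index 0 or right after a separator
		consecutiveBonus = 8  // match adjacent to the previous match
		gapPenalty       = 1  // per skipped character between matches
	)
	s := 0
	prev := -2
	for _, p := range positions {
		if p == 0 || isSeparator(text[p-1]) {
			s += boundaryBonus
		}
		if p == prev+1 {
			s += consecutiveBonus
		} else if prev >= 0 {
			s -= (p - prev - 1) * gapPenalty
		}
		prev = p
	}
	return s
}

func isSeparator(b byte) bool {
	return b == '/' || b == '_' || b == '-' || b == '.' || b == ' '
}

func main() {
	fmt.Println(score("source", []int{0, 3, 4}))      // 22: "src" scattered in "source"
	fmt.Println(score("src/main.rs", []int{0, 1, 2})) // 32: contiguous match at a boundary
}
```

Note how the contiguous boundary match outranks the scattered one, matching the intuition described above.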

How this fits into the project series

This matcher is the core of the interactive fuzzy finder (P14) and a common feature in modern CLIs.

Definitions & key terms

  • subsequence -> characters appear in order, not necessarily contiguous
  • scoring heuristic -> rule that assigns match quality score
  • word boundary -> transition between path segments or case changes
  • DP -> dynamic programming for optimal scoring

Mental model diagram (ASCII)

query:  s r c
text:   s o u r c e
match:  s . . r c .
score:  contiguous bonus + early bonus

How it works (step-by-step)

  1. Normalize query and candidate (case fold if needed).
  2. Scan candidate to see if query is a subsequence.
  3. Record match positions and compute score.
  4. Sort candidates by score.

Minimal concrete example

func match(q, s string) (score int, ok bool) { /* subsequence + score */ }

Common misconceptions

  • Misconception: fuzzy matching is edit distance. Correction: fzf-style is subsequence-based, not Levenshtein.
  • Misconception: higher score always means longer match. Correction: scoring is heuristic and context-dependent.

Check-your-understanding questions

  1. Why are word boundary bonuses important?
  2. What is the difference between subsequence and substring?
  3. Why might DP be too slow for large candidate sets?

Check-your-understanding answers

  1. They make matches align with human expectations.
  2. Subsequence allows gaps; substring requires contiguous chars.
  3. DP is O(n*m) per candidate and expensive at scale.

Real-world applications

  • fzf and interactive pickers
  • IDE file search

Where you’ll apply it

In the interactive fuzzy finder (P14) and in any CLI feature that needs ranked completion.

References

  • fzf source code (algo.go)
  • Smith-Waterman style scoring (for comparison)

Key insights

A good fuzzy matcher is mostly about scoring that matches human intuition.

Summary

Fuzzy matching balances correctness, scoring quality, and speed for large candidate sets.

Homework/Exercises to practice the concept

  1. Compare scoring for “src” against “source” and “subdir/src”.
  2. Implement a word boundary bonus and observe rank changes.

Solutions to the homework/exercises

  1. “source” should score high due to contiguous match; path matches may get boundary bonus.
  2. Add bonus when match aligns with ‘/’ or ‘_’ or camelCase transition.
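The boundary check from solution 2 might look like this; the separator set and function name are assumptions, not part of the spec:

```go
package main

import (
	"fmt"
	"unicode"
)

// isBoundary reports whether position i in runes starts a "word":
// the first character, one following a separator, or a lower->upper
// camelCase transition.
func isBoundary(runes []rune, i int) bool {
	if i == 0 {
		return true
	}
	prev, cur := runes[i-1], runes[i]
	switch prev {
	case '/', '_', '-', '.', ' ':
		return true
	}
	return unicode.IsLower(prev) && unicode.IsUpper(cur)
}

func main() {
	r := []rune("myRust_grep")
	for i, c := range r {
		if isBoundary(r, i) {
			fmt.Printf("boundary at %d (%c)\n", i, c)
		}
	}
}
```

Operating on runes rather than bytes keeps the check correct for the Unicode edge case listed in section 3.6.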

3. Project Specification

3.1 What You Will Build

A library plus a CLI, fuzzy-score, that scores a query against a list of candidate strings and outputs the top-K matches with scores.

3.2 Functional Requirements

  1. Subsequence match: determine if query matches candidate.
  2. Scoring: implement bonuses for contiguous matches and boundaries.
  3. Sorting: output top-K results by score.
  4. Case-insensitive: optional flag.
  5. Performance: handle 100k candidates quickly.

3.3 Non-Functional Requirements

  • Performance: 100k candidates scored in <50ms on modern hardware.
  • Reliability: deterministic ranking for equal scores.
  • Usability: clear output and flags.

3.4 Example Usage / Output

$ ./fuzzy-score --query "src" --top 3 fixtures/candidates.txt
1) src/main.rs (score=120)
2) source/config.yml (score=98)
3) subdir/src/lib.c (score=92)

3.5 Data Formats / Schemas / Protocols

Candidates file: one string per line.

3.6 Edge Cases

  • Empty query
  • Very short candidates
  • Candidates with Unicode characters

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./fuzzy-score --query "src" --top 5 fixtures/candidates.txt

3.7.2 Golden Path Demo (Deterministic)

Use fixed candidates and deterministic tie-breakers (lexicographic).
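The lexicographic tie-break can live directly in the sort comparator; a sketch reusing the Result shape from section 4.3:

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	Text  string
	Score int
}

// rank sorts by score descending, breaking ties lexicographically so
// equal-score candidates always appear in the same order across runs.
func rank(rs []Result) {
	sort.Slice(rs, func(i, j int) bool {
		if rs[i].Score != rs[j].Score {
			return rs[i].Score > rs[j].Score
		}
		return rs[i].Text < rs[j].Text
	})
}

func main() {
	rs := []Result{{"b.txt", 90}, {"a.txt", 90}, {"top.txt", 120}}
	rank(rs)
	fmt.Println(rs) // [{top.txt 120} {a.txt 90} {b.txt 90}]
}
```

sort.Slice is not a stable sort, but because the comparator is a total order (no two distinct entries compare equal), the output is still deterministic.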

3.7.3 CLI Transcript (Success + Failure)

$ ./fuzzy-score --query "abc" --top 2 fixtures/candidates.txt
1) abc.txt (score=130)
2) a_b_c.txt (score=90)
exit code: 0

$ ./fuzzy-score --query "" fixtures/candidates.txt
error: empty query not allowed
exit code: 2

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +-----------+   +-----------+
| CLI Parse |-->| Scorer    |-->| Top-K     |-->| Formatter |
+-----------+   +-----------+   +-----------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Matcher | subsequence check | greedy scan |
| Scorer | compute score | bonuses + penalties |
| Top-K | maintain best results | min-heap |

4.3 Data Structures (No Full Code)

type Result struct { Text string; Score int }

4.4 Algorithm Overview

  1. Read candidates.
  2. For each candidate, check subsequence and compute score.
  3. Maintain top-K results.
  4. Sort and output.
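Step 3's top-K maintenance can use container/heap: keep the K best seen so far in a min-heap whose root is the weakest, evicting it when a better candidate arrives. A minimal sketch (the Result type mirrors section 4.3):

```go
package main

import (
	"container/heap"
	"fmt"
)

type Result struct {
	Text  string
	Score int
}

// minHeap orders by ascending score, so the root is the weakest result.
type minHeap []Result

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(Result)) }
func (h *minHeap) Pop() any {
	old := *h
	r := old[len(old)-1]
	*h = old[:len(old)-1]
	return r
}

// topK returns the k highest-scoring results, best first.
func topK(results []Result, k int) []Result {
	h := &minHeap{}
	for _, r := range results {
		if h.Len() < k {
			heap.Push(h, r)
		} else if r.Score > (*h)[0].Score {
			(*h)[0] = r
			heap.Fix(h, 0)
		}
	}
	out := make([]Result, 0, h.Len())
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(Result))
	}
	// Pop yields weakest-first; reverse for best-first output.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	rs := []Result{{"a", 10}, {"b", 90}, {"c", 50}, {"d", 70}}
	fmt.Println(topK(rs, 2)) // [{b 90} {d 70}]
}
```

This keeps memory at O(K) regardless of how many candidates stream through, which is the rationale behind the heap recommendation in section 5.11.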

Complexity Analysis

  • Time: O(N * L) where N is the candidate count and L is average candidate length, plus O(N log K) for top-K maintenance
  • Space: O(K)

5. Implementation Guide

5.1 Development Environment Setup

go build -o fuzzy-score ./cmd/fuzzy-score

5.2 Project Structure

fuzzy-score/
├── cmd/
│   └── fuzzy-score/
│       └── main.go
├── internal/
│   ├── matcher.go
│   └── scorer.go
└── fixtures/

5.3 The Core Question You’re Answering

“How do I rank fuzzy matches so results feel intuitive and consistent?”

5.4 Concepts You Must Understand First

  1. Subsequence matching logic.
  2. Scoring heuristics and bonuses.
  3. Top-K selection with heap.

5.5 Questions to Guide Your Design

  1. How should you score contiguous matches?
  2. How should you score word boundaries?
  3. How will you handle ties deterministically?

5.6 Thinking Exercise

Given query “mk” and candidate “my_kernel”, how should it score vs “make”?

5.7 The Interview Questions They’ll Ask

  1. Why use subsequence matching instead of edit distance?
  2. How do you keep fuzzy matching fast on large sets?

5.8 Hints in Layers

  • Hint 1: Start with a simple greedy matcher.
  • Hint 2: Add bonuses for contiguous matches.
  • Hint 3: Add boundary bonuses for ‘/’, ‘_’ and camelCase transitions.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Algorithms | “Algorithms” (Sedgewick) | sorting, heaps |
| Search UX | “Designing Interfaces” | search patterns |

5.10 Implementation Phases

Phase 1: Basic Matcher (3 days)

  • Implement subsequence check and boolean result.
  • Checkpoint: correct matches on fixtures.

Phase 2: Scoring (3 days)

  • Add bonuses and penalties.
  • Checkpoint: ranking looks reasonable.

Phase 3: Performance (3 days)

  • Add top-K heap and benchmarks.
  • Checkpoint: <50ms on 100k candidates.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Scoring | greedy vs DP | greedy | faster, simpler |
| Top-K | sort all vs heap | heap | memory efficient |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit | scoring rules | boundary bonuses |
| Integration | ranking output | fixtures |
| Performance | large candidate set | 100k lines |

6.2 Critical Test Cases

  1. Contiguous match scores higher than scattered.
  2. Word boundary match beats internal match.
  3. Empty query is rejected.

6.3 Test Data

query: "src"
list: ["src/main.rs", "source.txt", "s r c"]
expected: src/main.rs first

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Scores feel random | poor ranking | tune bonuses and penalties |
| Slow performance | lag on 100k items | use heap and early exit |
| Case handling bugs | missed matches | normalize case consistently |

7.2 Debugging Strategies

  • Print score breakdown per candidate.
  • Compare ranking with fzf on the same list.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --tie-break flag.
  • Add highlighting of matched characters.

8.2 Intermediate Extensions

  • Implement DP-based optimal scoring.
  • Add path-aware scoring (directory depth).

8.3 Advanced Extensions

  • SIMD-accelerated subsequence matching.
  • Learned scoring model from user feedback.

9. Real-World Connections

9.1 Industry Applications

  • Terminal fuzzy finders
  • IDE quick open

9.2 Open Source Examples

  • fzf algorithm
  • skim (Rust fuzzy finder)

9.3 Interview Relevance

  • Discussing scoring heuristics and trade-offs

10. Resources

10.1 Essential Reading

  • fzf source code (algo.go)
  • skim algorithm docs

10.2 Tools & Documentation

  • go test -bench for microbenchmarks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain subsequence matching.
  • I can justify scoring heuristics.

11.2 Implementation

  • Ranking is deterministic and stable.
  • Performance meets target on 100k candidates.

11.3 Growth

  • I can compare greedy vs optimal scoring trade-offs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Greedy fuzzy matcher with scoring

Full Completion:

  • Boundary bonuses + top-K output

Excellence (Going Above & Beyond):

  • DP scoring mode
  • Highlighted match output