Project 5: Build a String Interning System

Implement a string interner that deduplicates strings and returns lifetime-bound references or stable handles.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 1-2 weeks
Main Programming Language Rust (Alternatives: C++ for comparison)
Alternative Programming Languages C++, Java
Coolness Level High
Business Potential Medium
Prerequisites Lifetimes, ownership, hashing, arenas
Key Topics Lifetime-tied references, stable addresses, interning tables

1. Learning Objectives

By completing this project, you will:

  1. Design a string interner API that avoids dangling references.
  2. Implement deduplication using hash maps and stable storage.
  3. Compare handle-based vs reference-based interning.
  4. Measure memory savings from interning.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Lifetimes and Stored References

Fundamentals

A string interner stores owned strings and returns references to them. The returned &str must not outlive the storage. In Rust, this means the lifetime of the returned reference must be tied to the interner itself. If the interner is moved or dropped, references become invalid. Therefore, the interner must provide a stable address for stored strings and expose lifetimes correctly in its API.

Deep Dive into the concept

A string interner is a perfect case study in Rust lifetimes. The interner owns a collection of strings; each time you intern a string, you either return an existing reference or store a new string and return a reference to it. The crucial rule is: the returned reference must be valid as long as the interner lives. In Rust, you express this with a lifetime parameter: fn intern<'a>(&'a mut self, s: &str) -> &'a str. The 'a ties the output reference to the borrow of self, ensuring the compiler enforces that the interner outlives the reference.

However, lifetimes alone do not guarantee safety if the storage can move. If you store strings in a Vec<String> and then push a new string, the Vec might reallocate, moving all strings. That would invalidate existing references, even though the lifetimes are correct. This is why you need stable storage. There are three common approaches: (1) use Box<str> stored in a Vec<Box<str>> and never reallocate the boxes themselves (the boxes are stable on the heap); (2) use an arena allocator that never moves allocated strings; or (3) return handles (indices) rather than references. Each approach has trade-offs in ergonomics and performance.

The handle approach is the most robust: you return a StringId (like a u32) and allow users to resolve it to a &str when needed. This avoids borrow-checker issues and allows the interner to grow freely. The downside is that it’s less ergonomic and requires an extra lookup. The reference approach is more ergonomic but requires careful storage design. Box<str> is a good compromise: the Box itself never moves the allocation, even if the Vec reorders or grows. The Vec may move the Box pointers, but the heap allocations remain stable, so references to the contents remain valid as long as the Box is alive.

Another lifetime subtlety is self-referential structures. You might be tempted to store &str references alongside the owned String in the same struct for faster lookup, but this is unsafe unless you pin the struct, because moving it would invalidate references. Instead, use hash maps keyed by String or Box<str>, or use indices as keys. Rust’s HashMap stores keys by value, so if you store Box<str> as the key, the string data is stable.

Interning also interacts with borrowing rules. If you return &str, you cannot mutably borrow the interner while the reference is alive (because you already borrowed it to get the reference). This can make APIs cumbersome. Many designs provide a two-phase API: one phase for interning (mutable), one phase for lookup (immutable), or they use handles to avoid borrowing. Understanding these borrow conflicts is part of this project’s learning goal.

How this fit on projects

This concept is applied in §3.1 (what you build), §4.2 (storage choices), and §5.10 (phase 2). It connects to Project 1 (arena allocation) and Project 7 (pinning for self-referential safety).

Definitions & key terms

  • Interning: Deduplicating strings by storing one copy.
  • Stable address: Memory location that does not change.
  • Handle: Opaque identifier for a stored value.
  • Lifetime tie: Output reference bound to the interner’s lifetime.

Mental model diagram (ASCII)

Interner
  |-- Vec<Box<str>>  -> stable heap strings
  |-- HashMap<String, Id>
return &str tied to &self

How it works (step-by-step)

  1. Hash the input string.
  2. If already present, return existing reference/handle.
  3. Otherwise, allocate a new owned string in stable storage.
  4. Insert into the map and return reference/handle.

Minimal concrete example

pub fn intern<'a>(&'a mut self, s: &str) -> &'a str { ... }

Common misconceptions

  • Vec<String> is stable.” It can reallocate and move strings.
  • “Lifetimes guarantee memory stability.” They only prove scope, not movement.

Check-your-understanding questions

  1. Why is Vec<String> unsafe for returning &str without care?
  2. How does Box<str> help?
  3. What is the trade-off of using handles?

Check-your-understanding answers

  1. Reallocation can move strings, invalidating references.
  2. The boxed allocation stays stable on the heap.
  3. Handles are less ergonomic but more flexible and safer.

Real-world applications

  • Compiler symbol tables.
  • Query plan identifiers in databases.

Where you’ll apply it

References

  • TRPL Ch. 10 (lifetimes).
  • Rustonomicon “Unbounded Lifetimes”.

Key insights

The interner is safe only if storage never moves or you return handles instead of references.

Summary

Lifetimes prove scope, but stability of storage is what makes &str safe to return.

Homework/Exercises to practice the concept

  1. Implement an interner that returns handles, not references.
  2. Add a method that converts a handle back to &str.

Solutions to the homework/exercises

  1. Store strings in a Vec<Box<str>> and return index.
  2. Index into the vector and return &str.

2.2 Hashing, Deduplication, and Map Design

Fundamentals

String interning uses a hash map to deduplicate strings. The map keys are the string contents; the values are either references or handles. A good hash function minimizes collisions. In Rust, HashMap uses a secure default hasher; for performance you may use FxHash or ahash in non-adversarial contexts.

Deep Dive into the concept

Deduplication relies on hashing. When you intern a string, you compute its hash and check if it already exists in the table. If it does, return the existing reference or handle. If not, store it. The key design choice is how you represent keys and values to avoid unnecessary allocations. A common pattern is HashMap<Box<str>, usize>, where the Box<str> is the owned key and the value is an index into a vector of the same boxes. This avoids duplication of storage while keeping keys stable.

Another pattern is HashMap<String, &'a str>, but this is hard in Rust because the map would need to store references into itself, which is self-referential and unsafe. Avoid that pattern unless you use Pin and unsafe code. Therefore, the canonical design is to store owned strings as keys and return references to them by borrowing from the Box<str>.

Hashing also affects performance. For a high-throughput interner, you want to avoid re-hashing strings multiple times. You can use HashMap::entry to compute the hash once. For performance benchmarking, you should fix the input set to make results deterministic.

Collisions are handled by the hash map internally, but you must still be mindful of worst-case scenarios. In adversarial contexts, using a secure hash is important. If performance is the goal and inputs are trusted, a faster hasher may be acceptable.

How this fit on projects

This concept is used in §3.2 (requirements), §4.4 (data structures), and §6 (testing). It connects to Project 6 (iterators over interned values).

Definitions & key terms

  • Hash map: Data structure mapping keys to values via hashing.
  • Deduplication: Storing one copy of identical strings.
  • Hasher: Algorithm that computes hash codes.

Mental model diagram (ASCII)

HashMap
"alpha" -> id 0
"beta"  -> id 1
Vec<Box<str>>: ["alpha", "beta"]

How it works (step-by-step)

  1. Compute hash of input string.
  2. Look up in map.
  3. If exists, return existing handle.
  4. If not, insert and return new handle.

Minimal concrete example

match map.entry(s.to_owned().into_boxed_str()) { ... }

Common misconceptions

  • “Hashing is free.” It can dominate performance for large strings.
  • “Keys can be &str referencing the map.” That is self-referential.

Check-your-understanding questions

  1. Why not store &str keys referencing the map’s own storage?
  2. What does entry help with?

Check-your-understanding answers

  1. It would be self-referential and unsafe to move.
  2. It avoids double hashing and simplifies insertion.

Real-world applications

  • Symbol tables in compilers.
  • Log processing pipelines.

Where you’ll apply it

References

  • Rust std HashMap docs.

Key insights

A fast interner is mostly about minimizing allocations and hashes.

Summary

Deduplication is a map design problem: choose key/value representations that avoid self-references.

Homework/Exercises to practice the concept

  1. Benchmark HashMap with default hasher vs ahash.
  2. Count collisions by instrumenting the map.

Solutions to the homework/exercises

  1. Use fixed input and compare elapsed time.
  2. Track number of probes if implementing your own map.

2.3 Stable Storage and Pinning Alternatives

Fundamentals

Stable storage means the memory location of stored strings never changes. Box<str> provides stable heap allocations. An arena allocator can also provide stability. Pinning is a more advanced approach that guarantees a value will not move, but it introduces complexity. For an interner, stable heap allocations are usually enough.

Deep Dive into the concept

If you store strings in a Vec<String>, the String objects move when the vector reallocates. The heap allocations that the String points to might also be moved if the String itself moves and you later take references to its buffer. This makes returning &str unsafe. The Box<str> approach avoids this: you allocate the string directly on the heap and store a Box<str> in the vector or map. The Box can move, but the heap allocation remains at a stable address, so references to its contents remain valid.

Arena allocation is another option. You can allocate string bytes into an arena and then store &str slices referencing the arena. This is fast and memory-efficient, but you must ensure the arena never moves and is not dropped before the references. This connects to Project 1. The interner then becomes a thin wrapper around the arena plus a hash map of string slices or hashes.

Pinning is generally unnecessary for a string interner but is instructive. If you attempted to build a self-referential interner that stores references into its own String fields, you would need to pin the interner to prevent moves. This is complex and often avoided in favor of stable heap allocations. This project should mention pinning as an advanced extension, but not require it.

How this fit on projects

This concept connects to Project 1 (arena allocation) and Project 7 (pinning). It is used in §4.2 storage design and §5.10 Phase 2.

Definitions & key terms

  • Stable storage: Memory that does not move during the object’s lifetime.
  • **Box**: Heap-allocated string slice.
  • Pin: Marker that prevents moves.

Mental model diagram (ASCII)

Vec<Box<str>>
  box -> heap "alpha" (stable)
  box -> heap "beta"  (stable)

How it works (step-by-step)

  1. Allocate string as Box<str>.
  2. Store in vector or map.
  3. Return &str pointing into the box.

Minimal concrete example

let b: Box<str> = s.to_owned().into_boxed_str();
let r: &str = &b;

Common misconceptions

  • “Box moves the data.” The box moves; the allocation stays.
  • “Pin is required for any reference.” It is only needed for self-references.

Check-your-understanding questions

  1. Why is Box<str> stable?
  2. When would pinning be required?

Check-your-understanding answers

  1. The heap allocation doesn’t move when the box is moved.
  2. When you store references into the same struct that might move.

Real-world applications

  • Interning tables in compilers.
  • Caches of deduplicated strings.

Where you’ll apply it

References

  • Rustonomicon “Pinning”.

Key insights

Stable heap allocations are the simplest safe foundation for string interning.

Summary

Use stable storage or handles; avoid self-referential designs unless you can pin safely.

Homework/Exercises to practice the concept

  1. Implement a Box<str>-based interner.
  2. Add a handle-based API and compare ergonomics.

Solutions to the homework/exercises

  1. Store boxes in a vector, return &str slices.
  2. Return indices and add lookup method.

3. Project Specification

3.1 What You Will Build

A crate string_interner that stores unique strings and returns either &str references tied to the interner lifetime or stable StringId handles, with a CLI demo and memory benchmarks.

3.2 Functional Requirements

  1. intern(&mut self, &str) returns existing or new string reference.
  2. intern_id(&mut self, &str) returns stable handle.
  3. resolve(id) returns &str.
  4. CLI demo shows deduplication and memory savings.

3.3 Non-Functional Requirements

  • Safety: No dangling references.
  • Performance: Deduplication reduces memory.

3.4 Example Usage / Output

$ cargo run --example interner_demo
interned: "alpha" (id=0)
interned: "alpha" (id=0)
unique strings: 1
memory saved: 999 duplicates -> 1 allocation
exit code: 0

3.5 Data Formats / Schemas / Protocols

  • HashMap<Box<str>, StringId> + Vec<Box<str>>.

3.6 Edge Cases

  • Empty string.
  • Large strings.
  • High collision inputs.

3.7 Real World Outcome

Deterministic demo with success and failure cases.

3.7.1 How to Run (Copy/Paste)

cargo run --example interner_demo

3.7.2 Golden Path Demo (Deterministic)

  • Input list with repeated strings.
  • Expect unique count to match known set.

3.7.3 CLI Transcript (Success)

$ cargo run --example interner_demo -- --input data/strings.txt
unique strings: 120
saved: 880 duplicates
exit code: 0

3.7.4 Failure Demo (Missing File)

$ cargo run --example interner_demo -- --input missing.txt
error: input file not found
exit code: 2

4. Solution Architecture

4.1 High-Level Design

Interner
  |-- HashMap<Box<str>, Id>
  |-- Vec<Box<str>>

4.2 Key Components

Component Responsibility Key Decisions
Interner API surface reference vs handle
StringId Handle type newtype wrapper
CLI demo deterministic input fixed data set

4.4 Data Structures (No Full Code)

struct Interner {
    map: HashMap<Box<str>, StringId>,
    store: Vec<Box<str>>,
}

4.4 Algorithm Overview

Key Algorithm: Intern

  1. Look up string in map.
  2. If present, return existing id.
  3. Otherwise, allocate, insert, return new id.

Complexity Analysis:

  • Time: O(1) average per insert
  • Space: O(n) unique strings

5. Implementation Guide

5.1 Development Environment Setup

cargo new string_interner
cd string_interner

5.2 Project Structure

string_interner/
├── src/
│   ├── lib.rs
│   └── interner.rs
├── examples/
│   └── interner_demo.rs
└── benches/
    └── interner_bench.rs

5.3 The Core Question You’re Answering

“How can I return references to stored strings without violating lifetimes or moving memory?”

5.4 Concepts You Must Understand First

  1. Lifetimes tied to &self.
  2. Stable storage with Box<str> or arenas.
  3. Hash map entry API.

5.5 Questions to Guide Your Design

  1. Will you expose references or handles?
  2. How will you avoid self-referential storage?
  3. What hashing strategy will you use?

5.6 Thinking Exercise

Design two APIs: intern -> &str and intern_id -> StringId. Compare ergonomics.

5.7 The Interview Questions They’ll Ask

  1. “Why is string interning useful?”
  2. “How do you ensure returned references remain valid?”
  3. “What are the trade-offs between handles and references?”

5.8 Hints in Layers

Hint 1: Use Box<str> for stable storage. Hint 2: Use HashMap::entry to avoid double hashing. Hint 3: Add handle-based API for flexibility.

5.9 Books That Will Help

| Topic | Book | Chapter | |—|—|—| | Lifetimes | TRPL | Ch. 10 | | Ownership patterns | Rust for Rustaceans | Ownership chapters |

5.10 Implementation Phases

Phase 1: Basic Interner (2-3 days)

  • Implement intern_id and storage.

Phase 2: Reference API (2 days)

  • Add intern returning &str.

Phase 3: Benchmarks (2 days)

  • Measure memory savings.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |—|—|—|—| | Storage | Vec vs Vec<Box> | Box | stable addresses | | API | handles vs references | both | flexibility |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |—|—|—| | Unit Tests | dedup correctness | intern same string twice | | Integration Tests | CLI demo | deterministic output | | Edge Tests | empty string | handle id 0 |

6.2 Critical Test Cases

  1. Interning same string returns same id.
  2. References remain valid after multiple inserts.
  3. Memory savings computed correctly.

6.3 Test Data

strings: ["alpha", "beta", "alpha", "gamma"]

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |—|—|—| | Using Vec | Dangling refs | Use Box | | Self-referential map | Compiler errors | Avoid &str keys | | Mutable borrow conflicts | Borrow checker errors | Use handles or separate phases |

7.2 Debugging Strategies

  • Add pointer address printing to confirm stability.

7.3 Performance Traps

  • Hashing dominates for large inputs.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add case-insensitive interning.

8.2 Intermediate Extensions

  • Implement arena-backed interner.

8.3 Advanced Extensions

  • Add interner persistence to disk.

9. Real-World Connections

9.1 Industry Applications

  • Compiler symbol tables.
  • Language servers.
  • string_cache crate.

9.3 Interview Relevance

  • Explain lifetime safety with storage stability.

10. Resources

10.1 Essential Reading

  • TRPL Ch. 10.
  • Rustonomicon “Unbounded Lifetimes”.

10.2 Video Resources

  • RustConf talk on lifetimes.

10.3 Tools & Documentation

  • hashbrown crate for HashMap internals.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why references must be tied to the interner.
  • I can explain stable storage.

11.2 Implementation

  • Deduplication works.
  • References remain valid.

11.3 Growth

  • I can compare handle-based and reference-based designs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Interner deduplicates strings.
  • CLI demo shows deterministic output and failure case.

Full Completion:

  • Both reference and handle APIs implemented.

Excellence (Going Above & Beyond):

  • Arena-backed interner or persistent storage.