Project 7: Hash Table From Scratch
Implement a hash table with collision resolution, resizing, and tests.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2 weeks |
| Language | C (Alternatives: Rust, Go, Python) |
| Prerequisites | Projects 5-6, modular arithmetic |
| Key Topics | Hashing, collisions, amortized analysis |
1. Learning Objectives
By completing this project, you will:
- Implement a hash table with separate chaining or open addressing.
- Design and test hash functions for strings and integers.
- Implement resizing based on load factor.
- Analyze expected vs worst-case performance.
2. Theoretical Foundation
2.1 Core Concepts
- Hash function: Maps keys to bucket indices; should distribute well.
- Collision resolution: Separate chaining or open addressing (linear, quadratic, double hashing).
- Load factor: Ratio of items to buckets; drives resizing.
- Amortized analysis: Resizing makes average insertion O(1).
2.2 Why This Matters
Hash tables power dictionaries, caches, symbol tables, and databases. Understanding collisions and resizing is essential for reliable performance.
2.3 Historical Context / Background
Hashing became widespread with the rise of symbol tables and compiler design, where fast lookups are critical.
2.4 Common Misconceptions
- Misconception: Hash tables are always O(1). Reality: Worst-case can be O(n).
- Misconception: Any hash function is fine. Reality: Poor hashes cause clustering.
3. Project Specification
3.1 What You Will Build
A hash table library that supports insert, lookup, delete, and iteration, with configurable collision resolution and resizing.
3.2 Functional Requirements
- Insert: Add or update a key/value pair.
- Lookup: Return value or not-found.
- Delete: Remove key/value pair.
- Resize: Expand when load factor exceeds threshold.
- Stats: Report collisions and load factor.
3.3 Non-Functional Requirements
- Performance: O(1) average insert and lookup.
- Reliability: No memory leaks; valgrind clean.
- Usability: Clear API for key types.
3.4 Example Usage / Output
$ ./hash-demo
> put apple 3
> put banana 5
> get apple
3
> stats
size=2 capacity=8 load=0.25 collisions=1
3.5 Real World Outcome
You will have a fully functional hash table with metrics showing how collisions and resizing impact performance. You can use it as a base for caches or symbol tables.
4. Solution Architecture
4.1 High-Level Design
┌───────────┐ ┌────────────┐ ┌──────────────┐
│ CLI/Test │────▶│ Hash Table │────▶│ Hash Function│
└───────────┘ └────────────┘ └──────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| hash.c | Core operations | Chaining vs open addressing |
| hashfn.c | Hash functions | FNV-1a or djb2 |
| stats.c | Metrics | Track collisions per insert |
4.3 Data Structures
typedef struct Entry {
char* key;
int value;
struct Entry* next;
} Entry;
typedef struct {
Entry** buckets;
size_t capacity;
size_t size;
} HashTable;
4.4 Algorithm Overview
Key Algorithm: Insert with Separate Chaining
- Hash key to bucket index.
- Traverse bucket list to update or append.
- Resize if load factor exceeds threshold.
Complexity Analysis:
- Time: O(1) average, O(n) worst
- Space: O(n)
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -o hash-demo src/*.c
valgrind ./hash-demo
5.2 Project Structure
hash-table/
├── src/
│ ├── hash.c
│ ├── hash.h
│ ├── hashfn.c
│ ├── demo.c
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do hash functions and collision strategies determine real-world performance?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Modulo arithmetic and bucket selection
- Collision strategies
- Load factor and resizing
5.5 Questions to Guide Your Design
- Will you store keys as strings only or support generic keys?
- What load factor triggers a resize?
- How will you count collisions consistently?
5.6 Thinking Exercise
Take the keys “cat”, “tac”, “act”. Compute hash values and see if they collide with your hash function.
5.7 The Interview Questions They’ll Ask
- Explain how a hash table works.
- Why do we resize when load factor is high?
- How does separate chaining differ from open addressing?
5.8 Hints in Layers
Hint 1: Start with chaining Chaining is simpler and avoids tombstones.
Hint 2: Use a prime capacity Prime bucket counts reduce clustering for poor hashes.
Hint 3: Rehash on resize Reinsert all entries into a new bucket array.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Hashing | “The Joys of Hashing” | Ch. 1-4 |
| Collision resolution | “Introduction to Algorithms” | Ch. 11 |
5.10 Implementation Phases
Phase 1: Foundation (4-5 days)
Goals:
- Implement basic hash table with chaining
- Add insert and lookup
Tasks:
- Implement hash function.
- Build insert/lookup for string keys.
Checkpoint: Insert and lookup work for simple keys.
Phase 2: Resizing (3-4 days)
Goals:
- Add delete
- Add resizing based on load factor
Tasks:
- Implement delete in chain.
- Rehash into new buckets.
Checkpoint: Table resizes and still returns correct values.
Phase 3: Metrics and Polish (3-4 days)
Goals:
- Track collisions
- Add CLI demo and stats
Tasks:
- Count collisions per insert.
- Add CLI commands for stats.
Checkpoint: Stats match expected load factors.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Collision strategy | chaining, open addressing | chaining | Simpler deletes |
| Hash function | FNV-1a, djb2 | FNV-1a | Better distribution |
| Resize threshold | 0.5-0.9 | 0.75 | Common trade-off |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Insert/lookup/delete | Basic key sets |
| Integration Tests | Resize correctness | Insert 1000 keys |
| Edge Case Tests | Collisions | Keys with same hash |
6.2 Critical Test Cases
- Insert same key twice updates value.
- Delete removes key from chain.
- Resize preserves all keys.
6.3 Test Data
{"a":1, "b":2, "c":3}
keys designed to collide
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Forgetting to duplicate strings | Use-after-free | strdup keys |
| Not updating size on delete | Wrong load factor | Decrement size |
| Incorrect rehash | Missing keys after resize | Recompute indices |
7.2 Debugging Strategies
- Log bucket indices on insert.
- Use valgrind to detect leaks.
7.3 Performance Traps
Using a poor hash function can turn O(1) into O(n). Test distribution early.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add iteration over keys.
- Add load factor warnings.
8.2 Intermediate Extensions
- Add open addressing mode.
- Add generic value types with void pointers.
8.3 Advanced Extensions
- Build an LRU cache with hash + list.
- Add hashing for binary keys.
9. Real-World Connections
9.1 Industry Applications
- Symbol tables in compilers
- Key-value stores and caches
9.2 Related Open Source Projects
- uthash: Single-header C hash tables.
- redis: Hashes for field/value structures.
9.3 Interview Relevance
Hash tables are central to many interview problems (two-sum, anagrams, caches).
10. Resources
10.1 Essential Reading
- “The Joys of Hashing” by Thomas Mailund - Ch. 1-4
- “Introduction to Algorithms” - Ch. 11
10.2 Video Resources
- Hash table visualization videos
10.3 Tools and Documentation
- Valgrind: https://valgrind.org
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how a hash function maps keys to buckets.
- I can explain why resizing is needed.
- I can compare chaining and open addressing.
11.2 Implementation
- All operations work correctly.
- Resizing preserves data.
11.3 Growth
- I can reason about collision behavior in practice.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Insert, lookup, delete
- Basic tests
Full Completion:
- Resizing with load factor threshold
- Collision metrics
Excellence (Going Above and Beyond):
- Open addressing variant
- LRU cache on top
This guide was generated from LEARN_ALGORITHMS_DEEP_DIVE.md. For the complete learning path, see the parent directory.