Project 5: Hash Table

Build a hash table with open addressing or chaining, then use it as a dictionary and cache.

Quick Reference

Attribute       Value
Difficulty      Advanced
Time Estimate   2 weeks
Language        C (Alternatives: Rust, C++)
Prerequisites   Projects 1-4, strong pointers
Key Topics      Hashing, collisions, load factor, resizing

1. Learning Objectives

By completing this project, you will:

  1. Implement a hash table with insert, delete, find.
  2. Compare collision strategies (chaining vs open addressing).
  3. Tune load factor and resize policies.
  4. Use a hash table to build a simple LRU cache.

2. Theoretical Foundation

2.1 Core Concepts

  • Hash function: Maps keys to uniform bucket indices.
  • Collision handling: Multiple keys may hash to the same index.
  • Load factor: n / capacity impacts speed and memory.
  • Amortized resizing: Grow table to maintain O(1) average ops.
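To make "maps keys to uniform bucket indices" concrete, here is a minimal sketch of FNV-1a, one of the string hashes recommended later in this guide; the function names are illustrative, not prescribed:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a: a simple, well-distributed string hash. */
static uint64_t fnv1a(const char *key) {
    uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
    for (const char *p = key; *p; p++) {
        h ^= (uint64_t)(unsigned char)*p;
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

/* Map a key to a bucket index for a table with `capacity` buckets. */
static size_t bucket_index(const char *key, size_t capacity) {
    return (size_t)(fnv1a(key) % capacity);
}
```

Note the two properties from 5.4: the same key always yields the same index (determinism), and the xor-multiply mixing spreads nearby keys across buckets (distribution).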

2.2 Why This Matters

Hash tables power databases, compilers, caches, and language runtimes. Understanding their trade-offs helps you choose the right design and avoid pathological performance.

2.3 Historical Context / Background

Hashing became practical in the 1950s and is now a standard tool for constant-time lookup in modern systems.

2.4 Common Misconceptions

  • “O(1) always”: Worst-case is O(n) if hashing is poor.
  • “Any hash function works”: Distribution quality matters.

3. Project Specification

3.1 What You Will Build

  • A hash table supporting string keys and integer values.
  • Collision handling with either chaining or open addressing.
  • An LRU cache using a hash table + doubly-linked list.

3.2 Functional Requirements

  1. Insert, find, delete operations.
  2. Resize when load factor exceeds threshold.
  3. Support iteration over keys.
  4. Optional: LRU cache with fixed capacity.
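Requirement 2 (resize on load factor) reduces to a small predicate. A sketch of one way to express it; the 0.75 threshold and doubling policy match the recommendations later in this guide, but both are tunable assumptions:

```c
#include <stddef.h>
#include <stdbool.h>

#define MAX_LOAD 0.75  /* common default; tune per section 5.11 */

/* Decide whether the table should grow before accepting one more entry. */
static bool needs_resize(size_t size, size_t capacity) {
    return (double)(size + 1) / (double)capacity > MAX_LOAD;
}

/* Typical growth policy: double the bucket count. */
static size_t next_capacity(size_t capacity) {
    return capacity * 2;
}
```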

3.3 Non-Functional Requirements

  • Performance: Average O(1) operations.
  • Reliability: Proper handling of collisions and deletes.
  • Usability: CLI demo for dictionary and cache behavior.

3.4 Example Usage / Output

$ ./hashtable_demo
Insert alice=30
Insert bob=25
Find bob -> 25
Delete alice
Load factor: 0.77 -> resize to 128 buckets

3.5 Real World Outcome

You run the demo and insert key/value pairs. The program prints the bucket index, collision count, and when it resizes. When you enable LRU mode, it shows cache hits, misses, and evictions as you exceed capacity.


4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────────┐     ┌────────────────┐
│ Hash Table   │────▶│ Collision Policy │────▶│ LRU Cache      │
└──────────────┘     └──────────────────┘     └────────────────┘

4.2 Key Components

Component   Responsibility         Key Decisions
Hash fn     Map keys to indices    Use FNV-1a or djb2
Buckets     Store entries          Chaining vs open addressing
Resizer     Maintain load factor   Double size at 0.75
LRU list    Track recency          Doubly-linked list

4.3 Data Structures

typedef struct Entry {
    char* key;
    int value;
    struct Entry* next;  // for chaining
} Entry;

typedef struct {
    Entry** buckets;   // array of chain heads
    size_t capacity;   // number of buckets
    size_t size;       // number of stored entries
} HashTable;

4.4 Algorithm Overview

Key Algorithm: Insert with Chaining

  1. Hash key to index.
  2. Scan chain for existing key.
  3. Insert new entry at head if absent.
  4. Resize if load factor exceeds threshold.

Complexity Analysis:

  • Time: O(1) average, O(n) worst-case
  • Space: O(n + capacity)
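Steps 1-3 can be sketched directly in C, assuming the Entry/HashTable layout from 4.3 (step 4, resizing, is omitted here for brevity; djb2 stands in for whichever hash you choose):

```c
#include <stdlib.h>
#include <string.h>

typedef struct Entry {
    char *key;
    int value;
    struct Entry *next;          /* chain link */
} Entry;

typedef struct {
    Entry **buckets;
    size_t capacity;
    size_t size;
} HashTable;

/* djb2 string hash. */
static size_t hash_str(const char *key) {
    size_t h = 5381;
    for (const char *p = key; *p; p++)
        h = h * 33 + (unsigned char)*p;
    return h;
}

/* 1. hash to index; 2. scan chain; 3. insert at head if absent. */
static int ht_insert(HashTable *ht, const char *key, int value) {
    size_t i = hash_str(key) % ht->capacity;
    for (Entry *e = ht->buckets[i]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->value = value;                  /* duplicate key: update */
            return 0;
        }
    }
    Entry *e = malloc(sizeof *e);
    if (!e) return -1;
    e->key = malloc(strlen(key) + 1);
    if (!e->key) { free(e); return -1; }
    strcpy(e->key, key);
    e->next = ht->buckets[i];                  /* head insertion */
    ht->buckets[i] = e;
    e->value = value;
    ht->size++;
    return 0;
}

static Entry *ht_find(HashTable *ht, const char *key) {
    size_t i = hash_str(key) % ht->capacity;
    for (Entry *e = ht->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return e;
    return NULL;
}
```

Head insertion keeps the insert itself O(1); only the duplicate-key scan walks the chain.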

5. Implementation Guide

5.1 Development Environment Setup

clang -Wall -Wextra -g -o hashtable_demo src/*.c

5.2 Project Structure

project-root/
├── src/
│   ├── hash.c
│   ├── hashtable.c
│   ├── lru.c
│   └── main.c
├── tests/
│   └── test_hashtable.c
└── Makefile

5.3 The Core Question You’re Answering

“How can I trade extra memory to make lookup nearly constant time?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Hash function properties
    • Uniform distribution
    • Determinism
  2. Collision handling
    • Separate chaining vs open addressing
  3. Load factor
    • Why performance drops as it grows

5.5 Questions to Guide Your Design

  1. Which collision strategy will you implement first?
  2. How will you handle deletion in open addressing?
  3. What load factor threshold makes sense?

5.6 Thinking Exercise

Take 10 keys, hash them with a simple hash, and map to 8 buckets. Count collisions and estimate load factor impact.
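If you want to check your hand-worked answer, a small sketch like this counts how many inserts land in an already-occupied bucket (djb2 is used as the "simple hash"; by pigeonhole, 10 keys into 8 buckets always produce at least 2 such collisions):

```c
/* djb2 hash, as suggested elsewhere in this guide. */
static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Bucket `n` keys into `nbuckets` slots; fill `counts` with per-bucket
 * totals and return the number of inserts that hit an occupied bucket. */
int count_collisions(const char **keys, int n, int nbuckets, int *counts) {
    for (int b = 0; b < nbuckets; b++) counts[b] = 0;
    int collisions = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)(djb2(keys[i]) % (unsigned long)nbuckets);
        if (counts[b] > 0) collisions++;   /* bucket already occupied */
        counts[b]++;
    }
    return collisions;
}
```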

5.7 The Interview Questions They’ll Ask

  1. “What happens when two keys hash to the same bucket?”
  2. “Why does load factor matter?”
  3. “How does LRU caching use a hash table?”

5.8 Hints in Layers

Hint 1: Start with chaining. It is simpler, and deletion is straightforward.

Hint 2: Add resizing early. Without it, performance degrades as the table fills.

Hint 3: Use a real hash function. FNV-1a is short and reliable for strings.

5.9 Books That Will Help

Topic         Book                               Chapter
Hash tables   CLRS                               Ch. 11
C pointers    C Programming: A Modern Approach   Ch. 11

5.10 Implementation Phases

Phase 1: Foundation (3 days)

Goals:

  • Hash function and basic insert/find

Tasks:

  1. Implement string hashing.
  2. Insert into buckets with chaining.

Checkpoint: Insert and find operations work on small sets.

Phase 2: Core Functionality (4 days)

Goals:

  • Deletion and resizing

Tasks:

  1. Implement delete with chain update.
  2. Add resize when load factor > 0.75.

Checkpoint: Table grows and maintains data.
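Phase 2's delete is where the chain update usually goes wrong. One sketch, assuming the Entry/HashTable layout from 4.3: walking the chain through a pointer-to-pointer handles deleting the head of the chain without a special case:

```c
#include <stdlib.h>
#include <string.h>

typedef struct Entry { char *key; int value; struct Entry *next; } Entry;
typedef struct { Entry **buckets; size_t capacity; size_t size; } HashTable;

static size_t hash_str(const char *key) {
    size_t h = 5381;
    for (const char *p = key; *p; p++) h = h * 33 + (unsigned char)*p;
    return h;
}

/* Unlink via a pointer-to-pointer: `link` is either the bucket slot or
 * some entry's `next` field, so splicing out works uniformly. */
static int ht_delete(HashTable *ht, const char *key) {
    Entry **link = &ht->buckets[hash_str(key) % ht->capacity];
    while (*link) {
        Entry *e = *link;
        if (strcmp(e->key, key) == 0) {
            *link = e->next;   /* splice out */
            free(e->key);
            free(e);
            ht->size--;
            return 1;          /* deleted */
        }
        link = &e->next;
    }
    return 0;                  /* not found */
}
```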

Phase 3: Polish & Edge Cases (3 days)

Goals:

  • LRU cache demo
  • Metrics

Tasks:

  1. Add doubly-linked list for recency.
  2. Print collisions and load factor.

Checkpoint: Cache evicts least-recently used key.
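The Phase 3 mechanics fit in one compact sketch. To keep it short, this version uses integer keys and a fixed bucket count; every name here is illustrative, and a real implementation would add error handling and key copying:

```c
#include <stdlib.h>

#define NBUCKETS 16

typedef struct Node {
    int key, value;
    struct Node *hnext;          /* hash-chain link */
    struct Node *prev, *next;    /* recency list links (MRU at head) */
} Node;

typedef struct {
    Node *buckets[NBUCKETS];     /* hash table: key -> node */
    Node *head, *tail;           /* doubly-linked recency list */
    int size, capacity;
} LRU;

static Node **chain(LRU *c, int key) { return &c->buckets[(unsigned)key % NBUCKETS]; }

static void list_unlink(LRU *c, Node *n) {
    if (n->prev) n->prev->next = n->next; else c->head = n->next;
    if (n->next) n->next->prev = n->prev; else c->tail = n->prev;
}

static void list_push_front(LRU *c, Node *n) {
    n->prev = NULL; n->next = c->head;
    if (c->head) c->head->prev = n; else c->tail = n;
    c->head = n;
}

static Node *hash_find(LRU *c, int key) {
    for (Node *n = *chain(c, key); n; n = n->hnext)
        if (n->key == key) return n;
    return NULL;
}

/* Get: a hit moves the node to the front (most recently used). Returns 1 on hit. */
int lru_get(LRU *c, int key, int *out) {
    Node *n = hash_find(c, key);
    if (!n) return 0;
    list_unlink(c, n);
    list_push_front(c, n);
    *out = n->value;
    return 1;
}

/* Put: insert or update; evict the tail (least recently used) when full. */
void lru_put(LRU *c, int key, int value) {
    Node *n = hash_find(c, key);
    if (n) { n->value = value; list_unlink(c, n); list_push_front(c, n); return; }
    if (c->size == c->capacity) {
        Node *victim = c->tail;              /* least recently used */
        list_unlink(c, victim);
        Node **link = chain(c, victim->key); /* remove from hash chain too */
        while (*link != victim) link = &(*link)->hnext;
        *link = victim->hnext;
        free(victim);
        c->size--;
    }
    n = malloc(sizeof *n);
    n->key = key; n->value = value;
    n->hnext = *chain(c, key);
    *chain(c, key) = n;
    list_push_front(c, n);
    c->size++;
}
```

The hash table makes lookup O(1) while the list makes "find the least-recently-used entry" O(1); neither structure can do both jobs alone.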

5.11 Key Implementation Decisions

Decision             Options                     Recommendation   Rationale
Collision strategy   Chaining, open addressing   Chaining         Simpler deletes, easier to debug
Resize threshold     0.5, 0.75, 0.9              0.75             Common balance of space/time
Hash fn              djb2, FNV-1a                FNV-1a           Good distribution for strings

6. Testing Strategy

6.1 Test Categories

Category            Purpose              Examples
Unit Tests          Hash + insert/find   Colliding keys
Integration Tests   Resizing             Insert 1000 keys
Edge Case Tests     Deletion             Delete head of chain

6.2 Critical Test Cases

  1. Insert duplicate key updates value.
  2. Resize preserves all keys.
  3. Delete removes key and leaves others intact.

6.3 Test Data

Keys: "a", "b", "c", "aa", "bb", "cc"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                          Symptom             Solution
Forgetting to rehash on resize   Missing keys        Recompute indices
Poor hash function               Many collisions     Switch hash fn
Memory leaks                     valgrind warnings   Free keys and entries
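The first pitfall deserves a concrete sketch: bucket indices depend on capacity, so after growing the array every entry must be rehashed into its new position. Assuming the Entry/HashTable layout from section 4.3:

```c
#include <stdlib.h>
#include <string.h>

typedef struct Entry { char *key; int value; struct Entry *next; } Entry;
typedef struct { Entry **buckets; size_t capacity; size_t size; } HashTable;

static size_t hash_str(const char *key) {
    size_t h = 5381;
    for (const char *p = key; *p; p++) h = h * 33 + (unsigned char)*p;
    return h;
}

/* Grow the bucket array and RE-HASH every entry; old indices are
 * meaningless once the capacity changes. */
static int ht_resize(HashTable *ht, size_t new_capacity) {
    Entry **fresh = calloc(new_capacity, sizeof *fresh);
    if (!fresh) return -1;
    for (size_t i = 0; i < ht->capacity; i++) {
        Entry *e = ht->buckets[i];
        while (e) {
            Entry *next = e->next;
            size_t j = hash_str(e->key) % new_capacity;  /* recompute index */
            e->next = fresh[j];
            fresh[j] = e;
            e = next;
        }
    }
    free(ht->buckets);
    ht->buckets = fresh;
    ht->capacity = new_capacity;
    return 0;
}
```

Entries are moved, not copied, so resizing allocates only the new bucket array.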

7.2 Debugging Strategies

  • Print bucket lengths after inserts.
  • Track collision counts and load factor.

7.3 Performance Traps

  • Excessive collisions turn O(1) into O(n).

8. Extensions & Challenges

8.1 Beginner Extensions

  • Implement open addressing (linear probing).
  • Add a contains API.
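For the first extension, the core of linear probing fits in a few lines. This sketch is deliberately simplified: fixed capacity, no deletion (and so no tombstones), and keys stored by pointer rather than copied:

```c
#include <stddef.h>
#include <string.h>

#define CAP 8   /* fixed table size for the sketch */

typedef struct { const char *key; int value; } Slot;  /* key == NULL: empty */

static size_t hash_str2(const char *key) {
    size_t h = 5381;
    for (const char *p = key; *p; p++) h = h * 33 + (unsigned char)*p;
    return h;
}

/* Linear probing: on collision, step to the next slot (wrapping around)
 * until we find the key or an empty slot. */
static int probe_insert(Slot table[CAP], const char *key, int value) {
    size_t i = hash_str2(key) % CAP;
    for (size_t step = 0; step < CAP; step++, i = (i + 1) % CAP) {
        if (table[i].key == NULL || strcmp(table[i].key, key) == 0) {
            table[i].key = key;
            table[i].value = value;
            return 0;
        }
    }
    return -1;  /* table full */
}

static int probe_find(const Slot table[CAP], const char *key, int *out) {
    size_t i = hash_str2(key) % CAP;
    for (size_t step = 0; step < CAP; step++, i = (i + 1) % CAP) {
        if (table[i].key == NULL) return 0;   /* hit empty slot: absent */
        if (strcmp(table[i].key, key) == 0) { *out = table[i].value; return 1; }
    }
    return 0;
}
```

Notice why the intermediate extension exists: once you delete a slot, "hit an empty slot means absent" breaks for keys probed past it, which is what tombstones fix.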

8.2 Intermediate Extensions

  • Add tombstones and rehashing for open addressing.
  • Make hash table generic with void* values.

8.3 Advanced Extensions

  • Implement Robin Hood hashing.
  • Add incremental resizing.

9. Real-World Connections

9.1 Industry Applications

  • Databases: hash indexes for equality lookups.
  • Caches: O(1) access to hot data.
  • Redis: dictionary implementation.
  • CPython: open addressing hash table.

9.2 Interview Relevance

  • Hash table design and collision handling are frequent interview topics.

10. Resources

10.1 Essential Reading

  • CLRS - Ch. 11
  • Algorithms, 4th Edition by Sedgewick - hashing chapter

10.2 Video Resources

  • “Hash Tables” - algorithms courses

10.3 Tools & Documentation

  • Valgrind: https://valgrind.org/

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain collisions and load factor.
  • I can justify resize thresholds.
  • I can compare chaining vs open addressing.

11.2 Implementation

  • All operations pass tests.
  • Resizing preserves entries.
  • LRU cache behaves correctly.

11.3 Growth

  • I can swap in a better hash function.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Insert/find/delete operations with chaining.
  • Load factor tracking.

Full Completion:

  • Resizing and LRU cache demo.

Excellence (Going Above & Beyond):

  • Open addressing with Robin Hood hashing.

This guide was generated from DATA_STRUCTURES_FROM_FIRST_PRINCIPLES.md. For the complete learning path, see the parent directory README.