Project 8: Lock Contention and Concurrency Profiler
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Primary Language | C |
| Alternative Languages | Go, Rust, C++ |
| Knowledge Area | Concurrency and Contention |
| Tools Required | perf, tracing tools, pthreads |
| Primary Reference | "The Art of Multiprocessor Programming" by Herlihy & Shavit |
Learning Objectives
By completing this project, you will be able to:
- Measure lock hold time and wait time separately and accurately
- Identify contention hotspots that limit scalability
- Visualize lock dependency graphs showing thread relationships
- Calculate throughput impact of contention on parallel efficiency
- Apply lock optimization techniques (lock striping, RCU, lock-free)
- Diagnose priority inversion and convoy effects
Deep Theoretical Foundation
Why Locks Kill Performance
Locks serialize access to shared resources. When multiple threads compete for one lock:
Thread execution timeline with lock contention:
Time →  0    1    2    3    4    5    6    7    8    9
T1: [====LOCK HELD====][ ]
T2: [WAIT][WAIT][WAIT][====LOCK HELD====][ ]
T3: [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][==LOCK]
T4: [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT]
[====] = Doing useful work
[WAIT] = Blocked on lock
With 4 threads but 1 lock, maximum parallelism is 1. Threads spend more time waiting than working.
Amdahl's Law for Contention
The serialized portion limits speedup:
Speedup = 1 / (S + P/N)
Where:
S = serial fraction (lock-protected code)
P = parallel fraction (1 - S)
N = number of threads
Example: 5% serial, 8 threads
Speedup = 1 / (0.05 + 0.95/8) = 1 / (0.05 + 0.119) = 5.9x
With 5% serial code, maximum speedup is 5.9x even with 8 threads!
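To check these numbers, and later to compare measured scaling against the ideal, a few lines of C suffice. The helper below (amdahl_speedup is just an illustrative name, not part of the tool) prints the predicted speedup curve for a 5% serial fraction:

#include <stdio.h>

/* Predicted speedup for serial fraction s and n threads (Amdahl's law). */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    for (int n = 1; n <= 16; n *= 2)
        printf("%2d threads: %.2fx\n", n, amdahl_speedup(0.05, n));
    return 0;
}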
Lock Metrics Explained
Hold Time: How long a thread holds the lock
- Longer hold time = more waiting for others
- Target: microseconds, not milliseconds
Wait Time: How long a thread waits before acquiring the lock
- High wait time indicates contention
- Directly impacts tail latency
Contention Rate: Fraction of acquisition attempts that find the lock already held
- Rate > 50% indicates hot lock
- Rate > 90% indicates severe bottleneck
Lock Convoy: Phenomenon where threads line up behind slow holder
- One slow thread delays entire queue
- Unfair locks exacerbate convoys
Types of Locking Bugs
1. Contention Hotspot
// BAD: Single global lock
pthread_mutex_t global_lock;
void increment(int id) {
    pthread_mutex_lock(&global_lock); // All threads compete here
    counters[id]++;
    pthread_mutex_unlock(&global_lock);
}

// GOOD: Per-element lock
pthread_mutex_t counter_locks[N];
void increment(int id) {
    pthread_mutex_lock(&counter_locks[id]); // No contention
    counters[id]++;
    pthread_mutex_unlock(&counter_locks[id]);
}
2. Priority Inversion
High-priority thread H waits for lock held by low-priority L
Medium-priority M preempts L (it doesn't need the lock)
Result: H waits for M, even though H > M priority
Solution: Priority inheritance - L temporarily gets H's priority
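POSIX exposes priority inheritance as a mutex attribute. A minimal sketch of creating such a mutex (error handling omitted; availability depends on the platform):

#include <pthread.h>

pthread_mutex_t pi_mutex;

/* The holder of this mutex temporarily inherits the priority of the
 * highest-priority thread blocked on it. */
void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}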
3. Lock Ordering Deadlock
Thread 1: lock(A) → lock(B)
Thread 2: lock(B) → lock(A)
If both hold first lock, neither can proceed
Solution: Always acquire locks in consistent order
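One common convention, sketched below, is to order acquisitions by lock address so every thread takes the same path; the lock_pair helper is illustrative, not a library function:

#include <pthread.h>
#include <stdint.h>

/* Acquire two mutexes in a globally consistent order (by address),
 * so no two threads can each hold the other's first lock. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}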
4. False Sharing
struct ThreadData {
    int counter;   // Thread 0 writes here
    // No padding
    int other;     // Thread 1 writes here
};
// Both variables share one cache line (64 bytes)
// Writing one invalidates the other's cached copy
// Solution: Pad to cache line boundary
struct ThreadData {
    int counter;
    char padding[60];  // Pushes 'other' onto the next cache line (assumes 4-byte int)
    int other;
};
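Manual padding is fragile: it hard-codes the int size and the 64-byte line. With C11 the same separation can be expressed with alignment; this is an alternative sketch, not the only way to do it:

#include <stdalign.h>

/* Each hot field gets its own 64-byte cache line via alignment,
 * so writes by one thread do not invalidate the other's line. */
struct ThreadData {
    alignas(64) int counter;  // written by thread 0
    alignas(64) int other;    // written by thread 1
};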
Complete Project Specification
What You're Building
A lock profiling toolkit called lock_prof that:
- Measures lock hold and wait times per-lock and per-thread
- Identifies contention hotspots with ranking
- Visualizes lock access patterns as timeline and graph
- Recommends optimizations based on patterns
- Compares throughput at different thread counts
Functional Requirements
lock_prof run --workload <name> --threads <n> --duration <sec>
lock_prof analyze --data <file.csv>
lock_prof timeline --data <file.csv> --output <timeline.html>
lock_prof contention-map --data <file.csv> --output <map.svg>
lock_prof scale-test --workload <name> --max-threads <n>
Example Output
Lock Contention Report
───────────────────────────────────────────────────
Workload: database_query
Threads: 8
Duration: 60 seconds
Lock Summary:
───────────────────────────────────────────────────
Lock Name         Acquisitions   Hold Time      Wait Time
                                 (avg/p99)      (avg/p99)
connection_pool        45,231    0.8/4.2 ms     8.1/42.7 ms
query_cache           892,451    12/87 µs       0.4/2.1 µs
stats_lock            156,789    3/18 µs        0.02/0.1 ms
Contention Analysis:
───────────────────────────────────────────────────
Lock: connection_pool (CRITICAL CONTENTION)
Contention rate: 73.2%
Wait/Hold ratio: 10.1x ← Waiting 10x longer than holding!
Threads blocked: avg 5.8 / max 8
Impact: 42% of total latency from this lock
Recommendation:
โข Pool size: 4 connections for 8 threads (UNDERSIZED)
โข Consider: Increase pool to 8+ connections
โข Consider: Connection multiplexing
Lock: query_cache (MODERATE CONTENTION)
Contention rate: 31.4%
Wait/Hold ratio: 0.03x ← Acceptable
Threads blocked: avg 0.8 / max 3
Recommendation:
โข Lock hold time is low, contention manageable
โข Consider: Read-write lock (many reads, few writes)
Scalability Analysis:
───────────────────────────────────────────────────
Threads   Throughput   Efficiency   Bottleneck
   1      1000 op/s      100%       -
   2      1850 op/s      92.5%      connection_pool
   4      2890 op/s      72.3%      connection_pool
   8      3120 op/s      39.0%      connection_pool
Scaling efficiency drops at 4+ threads due to connection_pool
Theoretical max throughput: 5200 op/s (with infinite pool)
Solution Architecture
Component Design
┌─────────────────────────────────────────────────────────┐
│                      CLI Interface                       │
│          Thread count, workload configuration            │
└────────────────────────────┬────────────────────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │   Workload   │  │     Lock     │  │   Tracing    │
   │  Generator   │  │   Wrappers   │  │    Engine    │
   └───────┬──────┘  └───────┬──────┘  └───────┬──────┘
           │                 │                 │
           └─────────────────┼─────────────────┘
                             ▼
             ┌───────────────────────────────┐
             │     Lock Event Collector      │
             │  Acquire/Release timestamps   │
             └───────────────┬───────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │  Hold Time   │  │  Wait Time   │  │  Contention  │
   │  Calculator  │  │  Calculator  │  │   Analyzer   │
   └──────────────┘  └──────────────┘  └──────────────┘
Instrumented Lock Wrapper
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define MAX_THREADS 64

typedef struct {
    pthread_mutex_t mutex;
    const char *name;
    atomic_uint_fast64_t acquire_count;
    atomic_uint_fast64_t contention_count; // Acquisitions that found the lock already held
    atomic_uint_fast64_t total_hold_ns;
    atomic_uint_fast64_t total_wait_ns;
    atomic_uint_fast64_t max_hold_ns;
    atomic_uint_fast64_t max_wait_ns;
} profiled_mutex_t;

typedef struct {
    profiled_mutex_t *lock;
    int thread_id;
    uint64_t request_time;
    uint64_t acquire_time;
    uint64_t release_time;
} lock_event_t;

// Monotonic nanosecond timestamp; every thread must use this same clock.
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Per-thread acquire timestamp, used to compute hold time at unlock.
static uint64_t thread_local_acquire_time[MAX_THREADS];

void profiled_lock(profiled_mutex_t *m, int tid) {
    uint64_t request_time = now_ns();
    // Try to acquire first - a failed trylock means the lock was contended
    if (pthread_mutex_trylock(&m->mutex) != 0) {
        atomic_fetch_add(&m->contention_count, 1);
        pthread_mutex_lock(&m->mutex); // Now block until the lock is free
    }
    uint64_t acquire_time = now_ns();
    uint64_t wait_time = acquire_time - request_time;
    atomic_fetch_add(&m->total_wait_ns, wait_time);
    atomic_fetch_add(&m->acquire_count, 1);
    // Remember when this thread got the lock so unlock can compute hold time
    thread_local_acquire_time[tid] = acquire_time;
}

void profiled_unlock(profiled_mutex_t *m, int tid) {
    uint64_t release_time = now_ns();
    uint64_t hold_time = release_time - thread_local_acquire_time[tid];
    atomic_fetch_add(&m->total_hold_ns, hold_time);
    pthread_mutex_unlock(&m->mutex);
}
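In a workload thread the wrappers simply replace the raw pthread calls. A minimal usage sketch, where work() is a hypothetical stand-in for the real critical-section body:

void work(void);               // hypothetical stand-in for the protected operation

profiled_mutex_t stats_lock;   // counters start at zero; mutex initialized below

void init_stats_lock(void) {
    pthread_mutex_init(&stats_lock.mutex, NULL);
    stats_lock.name = "stats_lock";
}

void *worker(void *arg) {
    int tid = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        profiled_lock(&stats_lock, tid);
        work();                          // critical section
        profiled_unlock(&stats_lock, tid);
    }
    return NULL;
}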
Analysis Functions
#include <stdio.h>
#include <stdlib.h>

// Fraction of acquisitions that found the lock already held
double contention_rate(profiled_mutex_t *m) {
    if (m->acquire_count == 0) return 0;
    return (double)m->contention_count / m->acquire_count;
}

// How much longer threads wait for this lock than they hold it
double wait_hold_ratio(profiled_mutex_t *m) {
    if (m->total_hold_ns == 0) return 0;
    return (double)m->total_wait_ns / m->total_hold_ns;
}

// qsort comparator: locks with the most total wait time first
static int compare_wait_time(const void *a, const void *b) {
    const profiled_mutex_t *la = a, *lb = b;
    if (la->total_wait_ns < lb->total_wait_ns) return 1;
    if (la->total_wait_ns > lb->total_wait_ns) return -1;
    return 0;
}

// Rank locks by contention impact (run after worker threads have finished)
void rank_contention_hotspots(profiled_mutex_t *locks, int n) {
    // Sort by total wait time, the best proxy for impact on throughput
    qsort(locks, n, sizeof(profiled_mutex_t), compare_wait_time);
    for (int i = 0; i < n; i++) {
        printf("%s: %.1f%% contention, %.2fx wait/hold\n",
               locks[i].name,
               contention_rate(&locks[i]) * 100,
               wait_hold_ratio(&locks[i]));
    }
}
Phased Implementation Guide
Phase 1: Instrumented Locks (Days 1-3)
Goal: Create lock wrappers that track timing.
Steps:
- Implement profiled_mutex_t structure
- Create lock/unlock wrappers with timing
- Use atomic operations for counters
- Test with simple multi-threaded workload
- Verify timing accuracy against wall clock
Validation: Hold + wait times sum correctly.
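A quick way to validate the timing path is to hold an instrumented lock for a known duration and check that the recorded hold time lands in the right range. A minimal single-threaded sketch, with generous bounds to tolerate scheduler noise:

#include <assert.h>
#include <unistd.h>

void test_hold_time_accuracy(void) {
    profiled_mutex_t m = { .name = "test" };   // counters start at zero
    pthread_mutex_init(&m.mutex, NULL);

    profiled_lock(&m, 0);
    usleep(5000);                              // ~5 ms inside the critical section
    profiled_unlock(&m, 0);

    // Recorded hold time should be roughly the sleep duration
    assert(m.total_hold_ns >= 4000000 && m.total_hold_ns < 100000000);
    assert(m.acquire_count == 1 && m.contention_count == 0);
}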
Phase 2: Contention Detection (Days 4-5)
Goal: Detect and count contention events.
Steps:
- Use trylock to detect contention
- Track per-thread wait times
- Calculate contention rate
- Identify hot locks
Validation: High-contention workload shows expected rates.
Phase 3: Scalability Analysis (Days 6-8)
Goal: Measure throughput at different thread counts.
Steps:
- Run workload at 1, 2, 4, 8, 16 threads
- Calculate throughput at each level
- Compute parallel efficiency
- Identify scaling bottlenecks
Validation: Matches Amdahl's law predictions.
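A sketch of the scaling loop for this phase, assuming a run_workload(threads, seconds) function (a placeholder name for whatever the workload generator exposes) that returns the number of completed operations:

#include <stdio.h>

// Hypothetical: runs the chosen workload and returns completed operations
extern long run_workload(int threads, int seconds);

void scale_test(int max_threads, int seconds) {
    double base = 0;
    for (int n = 1; n <= max_threads; n *= 2) {
        long ops = run_workload(n, seconds);
        double throughput = (double)ops / seconds;
        if (n == 1) base = throughput;
        // Parallel efficiency: measured speedup divided by thread count
        double efficiency = throughput / (base * n) * 100.0;
        printf("%2d threads: %8.0f op/s  %5.1f%% efficiency\n",
               n, throughput, efficiency);
    }
}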
Phase 4: Visualization (Days 9-11)
Goal: Generate lock timeline and contention graphs.
Steps:
- Log lock events with timestamps
- Generate timeline visualization (HTML/SVG)
- Create contention heat map
- Show thread blocking relationships
Validation: Timeline accurately shows blocking.
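Dumping the raw events as CSV keeps visualization separate from collection; the timeline or heat map can then be rendered from the file. A minimal sketch (the column layout is just one reasonable choice):

#include <stdio.h>

// Write collected events to CSV for the timeline/heat-map step
void dump_events_csv(const char *path, lock_event_t *events, size_t count) {
    FILE *f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "lock,thread,request_ns,acquire_ns,release_ns\n");
    for (size_t i = 0; i < count; i++) {
        fprintf(f, "%s,%d,%llu,%llu,%llu\n",
                events[i].lock->name, events[i].thread_id,
                (unsigned long long)events[i].request_time,
                (unsigned long long)events[i].acquire_time,
                (unsigned long long)events[i].release_time);
    }
    fclose(f);
}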
Phase 5: Recommendations Engine (Days 12-14)
Goal: Suggest optimizations based on patterns.
Steps:
- Classify lock usage patterns
- Match patterns to known solutions
- Generate actionable recommendations
- Estimate improvement potential
Validation: Recommendations are sensible for test cases.
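The recommendation rules can start as simple thresholds over the metrics already collected. The cutoffs below are illustrative, not calibrated:

#include <stdio.h>

// Very simple rule-based classifier over the per-lock metrics
void recommend(profiled_mutex_t *m) {
    double rate  = contention_rate(m);
    double ratio = wait_hold_ratio(m);

    if (rate > 0.5 && ratio > 5.0)
        printf("%s: severe bottleneck - consider striping, sharding, or a redesign\n", m->name);
    else if (rate > 0.5)
        printf("%s: hot lock with short holds - consider finer-grained locking\n", m->name);
    else if (ratio > 1.0)
        printf("%s: long waits - shorten the critical section or use a read-write lock\n", m->name);
    else
        printf("%s: contention is acceptable\n", m->name);
}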
Testing Strategy
Synthetic Workloads
1. Single Hot Lock
// All threads compete for one lock
pthread_mutex_t hot_lock;
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&hot_lock);
    work();
    pthread_mutex_unlock(&hot_lock);
}
Expected: High contention, poor scaling.
2. Striped Locks
// Each thread has own lock
pthread_mutex_t locks[N_THREADS];
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&locks[thread_id]);
    work();
    pthread_mutex_unlock(&locks[thread_id]);
}
Expected: Zero contention, linear scaling.
3. Reader-Writer Imbalance
// 95% reads, 5% writes
pthread_rwlock_t rwlock;
if (random() % 100 < 95) {   // take the read path ~95% of the time
    pthread_rwlock_rdlock(&rwlock);
    read_data();
    pthread_rwlock_unlock(&rwlock);
} else {
    pthread_rwlock_wrlock(&rwlock);
    write_data();
    pthread_rwlock_unlock(&rwlock);
}
Expected: Moderate contention on writes.
Correctness Tests
- Race condition detection: Use ThreadSanitizer
- Timing accuracy: Compare with external timer
- Counter overflow: Handle long-running tests
Common Pitfalls and Debugging
Pitfall 1: Probe Effect (Observer Changes Outcome)
Symptom: Contention disappears when profiling.
Cause: Profiling overhead changes timing.
Solution: Use minimal overhead tracing:
// BAD: String formatting in the hot path
pthread_mutex_lock(&m->mutex);
printf("Thread %d acquired lock %s\n", tid, m->name);

// GOOD: Just store the event, format later
// (keep event_count per-thread or atomic so recording is itself race-free)
lock_events[event_count++] = (lock_event_t){
    .lock = m, .thread_id = tid, .acquire_time = now_ns()
};
Pitfall 2: Non-Atomic Counter Updates
Symptom: Counter values are wrong/negative.
Cause: Multiple threads updating same counter.
Solution:
// BAD: Race condition
m->total_hold_ns += hold_time;
// GOOD: Atomic operation
atomic_fetch_add(&m->total_hold_ns, hold_time);
// ALSO GOOD: Per-thread counters, aggregate later
thread_local_hold_ns[tid] += hold_time;
Pitfall 3: Measuring a Lock That's Never Contended
Symptom: All locks show 0% contention.
Cause: Workload doesn't have enough parallelism.
Solution: Ensure work inside critical section is substantial:
// BAD: No real work, lock cycles too fast
pthread_mutex_lock(&m);
counter++;
pthread_mutex_unlock(&m);
// GOOD: Simulate real work
pthread_mutex_lock(&m);
for (int i = 0; i < 1000; i++) {
    data[i] = compute(data[i]);
}
pthread_mutex_unlock(&m);
Pitfall 4: Clock Skew Between Threads
Symptom: Negative wait times.
Cause: Different threads using different time sources.
Solution: Always use CLOCK_MONOTONIC, same for all threads.
Extensions and Challenges
Extension 1: Lock Hierarchy Verification
Build checker that:
- Records lock acquisition order per thread
- Detects potential deadlock cycles
- Reports violations of lock ordering
Extension 2: RCU Comparison
Implement Read-Copy-Update pattern:
- Compare with reader-writer lock
- Measure read-side overhead
- Show write-side cost trade-off
Extension 3: Lock-Free Comparison
For counter workload, compare:
- Mutex-protected counter
- Atomic increment
- Thread-local counters with periodic merge
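For the counter comparison, a sketch of the three variants under test; the global names and layout here are assumptions about how the benchmark is organized:

#include <pthread.h>
#include <stdatomic.h>

pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
long mutex_counter;                 // variant 1: mutex-protected counter
atomic_long atomic_counter;         // variant 2: single atomic counter
_Thread_local long local_counter;   // variant 3: per-thread, merged after join

void increment_mutex(void) {
    pthread_mutex_lock(&counter_mutex);
    mutex_counter++;
    pthread_mutex_unlock(&counter_mutex);
}

void increment_atomic(void) {
    atomic_fetch_add(&atomic_counter, 1);
}

void increment_local(void) {
    local_counter++;   // each thread adds its total to a shared sum at the end
}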
Challenge: Real System Profiling
Profile actual multi-threaded application:
- Instrument its lock usage
- Identify real contention hotspots
- Implement and validate optimization
Real-World Connections
Industry Patterns
- Java synchronized: Biased locking, lock elision
- Linux kernel: Spinlocks, RCU, seqlocks
- Database systems: Two-phase locking, MVCC
- Go runtime: Mutex with spinning, channel-based sync
Famous Lock Optimizations
- glibc pthread_mutex: Futex-based, minimal syscalls
- WebKit WTF::Lock: Parking lot, efficient wait
- Folly SharedMutex: Distributed reader counts
- mcs_lock: Fair queuing, cache-friendly
Self-Assessment Checklist
Before considering this project complete, verify:
- You can measure hold time and wait time separately
- You correctly identify which locks are contended
- Scalability test shows expected Amdahl's law behavior
- Timeline visualization accurately shows blocking
- Recommendations make sense for synthetic workloads
- Profiling overhead is < 10% of workload time
- You understand when to use each lock optimization
Resources
Essential Reading
- "The Art of Multiprocessor Programming" by Herlihy & Shavit
- "Systems Performance" by Gregg, Chapter 6
- "Operating Systems: Three Easy Pieces" by Arpaci-Dusseau, Chapter 26
Reference
- POSIX Threads (pthreads) man pages
- Linux futex documentation
- "Is Parallel Programming Hard?" (Paul McKenney)
Tools
- pthread profilers (mutrace, lockdep)
- ThreadSanitizer for race detection
- perf lock: Built-in Linux lock profiler