Project 8: Lock Contention and Concurrency Profiler

Project Overview

Attribute               Details
Difficulty              Advanced
Time Estimate           1-2 weeks
Primary Language        C
Alternative Languages   Go, Rust, C++
Knowledge Area          Concurrency and Contention
Tools Required          perf, tracing tools, pthreads
Primary Reference       "The Art of Multiprocessor Programming" by Herlihy & Shavit

Learning Objectives

By completing this project, you will be able to:

  1. Measure lock hold time and wait time separately and accurately
  2. Identify contention hotspots that limit scalability
  3. Visualize lock dependency graphs showing thread relationships
  4. Calculate throughput impact of contention on parallel efficiency
  5. Apply lock optimization techniques (lock striping, RCU, lock-free)
  6. Diagnose priority inversion and convoy effects

Deep Theoretical Foundation

Why Locks Kill Performance

Locks serialize access to shared resources. When multiple threads compete for one lock:

Thread execution timeline with lock contention:
Time →  0    1    2    3    4    5    6    7    8    9
T1:    [====LOCK HELD====][                        ]
T2:    [WAIT][WAIT][WAIT][====LOCK HELD====][     ]
T3:    [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][==LOCK]
T4:    [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT]

[====] = Doing useful work
[WAIT] = Blocked on lock

With 4 threads but 1 lock, maximum parallelism is 1. Threads spend more time waiting than working.

Amdahl's Law for Contention

The serialized portion limits speedup:

Speedup = 1 / (S + P/N)

Where:
  S = serial fraction (lock-protected code)
  P = parallel fraction (1 - S)
  N = number of threads

Example: 5% serial, 8 threads
  Speedup = 1 / (0.05 + 0.95/8) = 1 / (0.05 + 0.119) = 5.9x

With 5% serial code, 8 threads give at most a 5.9x speedup; even infinitely many threads cannot exceed 1/S = 20x!
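
A quick way to build intuition is to tabulate the prediction. The sketch below is a plain C program that evaluates the formula; the 5% serial fraction and the thread counts are illustrative values, not measurements.

#include <stdio.h>

// Amdahl's law: predicted speedup for serial fraction s on n threads
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.05;                        // 5% of work under the lock (illustrative)
    int threads[] = {1, 2, 4, 8, 16, 64};
    for (int i = 0; i < 6; i++)
        printf("%2d threads -> %.2fx speedup\n", threads[i], amdahl_speedup(s, threads[i]));
    return 0;
}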

Lock Metrics Explained

Hold Time: Duration lock is held by a thread

  • Longer hold time = more waiting for others
  • Target: microseconds, not milliseconds

Wait Time: Duration thread waits before acquiring lock

  • High wait time indicates contention
  • Directly impacts tail latency

Contention Rate: Attempts to acquire already-held lock

  • Rate > 50% indicates hot lock
  • Rate > 90% indicates severe bottleneck

Lock Convoy: Phenomenon where threads line up behind slow holder

  • One slow thread delays entire queue
  • Unfair locks exacerbate convoys

Types of Locking Bugs

1. Contention Hotspot

// BAD: Single global lock
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

void increment(int id) {
    pthread_mutex_lock(&global_lock);  // All threads compete here
    counters[id]++;
    pthread_mutex_unlock(&global_lock);
}

// GOOD: Per-element lock
pthread_mutex_t counter_locks[N];

void increment(int id) {
    pthread_mutex_lock(&counter_locks[id]);  // Contention only when two threads hit the same element
    counters[id]++;
    pthread_mutex_unlock(&counter_locks[id]);
}

2. Priority Inversion

High-priority thread H waits for lock held by low-priority L
Medium-priority M preempts L (it doesn't need the lock)
Result: H waits for M, even though H > M priority

Solution: Priority inheritance - L temporarily gets H's priority
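
POSIX exposes priority inheritance through mutex attributes. A minimal sketch, assuming the platform supports _POSIX_THREAD_PRIO_INHERIT (Linux with glibc does); make_pi_mutex is a helper name introduced here for illustration:

#include <pthread.h>

// Create a mutex with priority inheritance: a low-priority holder is
// temporarily boosted to the priority of its highest-priority waiter.
int make_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}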

3. Lock Ordering Deadlock

Thread 1: lock(A) → lock(B)
Thread 2: lock(B) → lock(A)

If both hold first lock, neither can proceed
Solution: Always acquire locks in consistent order
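
One common way to enforce a consistent order is to sort by lock address before acquiring. A small sketch; lock_pair_ordered is a hypothetical helper, not a pthreads API:

#include <pthread.h>
#include <stdint.h>

// Acquire two mutexes in a globally consistent order (lowest address first),
// so no two threads can ever hold them in opposite orders.
void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a == b) {                       // same lock passed twice: take it once
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}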

4. False Sharing

struct ThreadData {
    int counter;  // Thread 0 writes here
    // No padding
    int other;    // Thread 1 writes here
};

// Both variables on same cache line (64 bytes)
// Writing one invalidates the other's cache
// Solution: Pad to cache line boundary
struct ThreadData {
    int counter;
    char padding[60];  // Fill out the 64-byte cache line (assumes 4-byte int)
    int other;
};
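
Hand-counted padding breaks silently if a field's size changes. An alternative sketch using C11 alignment, assuming 64-byte cache lines (verify the line size on your machine, e.g. via sysconf on Linux); the struct name is changed only to avoid clashing with the version above:

#include <stdalign.h>

#define CACHE_LINE 64   // assumed cache line size

// Each field gets its own cache line, regardless of field sizes
struct PaddedThreadData {
    alignas(CACHE_LINE) int counter;   // thread 0 writes here
    alignas(CACHE_LINE) int other;     // thread 1 writes here
};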

Complete Project Specification

What You're Building

A lock profiling toolkit called lock_prof that:

  1. Measures lock hold and wait times per-lock and per-thread
  2. Identifies contention hotspots with ranking
  3. Visualizes lock access patterns as timeline and graph
  4. Recommends optimizations based on patterns
  5. Compares throughput at different thread counts

Functional Requirements

lock_prof run --workload <name> --threads <n> --duration <sec>
lock_prof analyze --data <file.csv>
lock_prof timeline --data <file.csv> --output <timeline.html>
lock_prof contention-map --data <file.csv> --output <map.svg>
lock_prof scale-test --workload <name> --max-threads <n>

Example Output

Lock Contention Report
═══════════════════════════════════════════════════════
Workload: database_query
Threads: 8
Duration: 60 seconds

Lock Summary:
─────────────────────────────────────────────────────────
Lock Name        Acquisitions   Hold Time     Wait Time
                                (avg/p99)     (avg/p99)
connection_pool  45,231         0.8/4.2 ms    8.1/42.7 ms
query_cache      892,451        12/87 µs      0.4/2.1 ms
stats_lock       156,789        3/18 µs       0.02/0.1 ms

Contention Analysis:
─────────────────────────────────────────────────────────
Lock: connection_pool (CRITICAL CONTENTION)
  Contention rate: 73.2%
  Wait/Hold ratio: 10.1x  ← Waiting 10x longer than holding!
  Threads blocked: avg 5.8 / max 8
  Impact: 42% of total latency from this lock

  Recommendation:
  โ€ข Pool size: 4 connections for 8 threads (UNDERSIZED)
  โ€ข Consider: Increase pool to 8+ connections
  โ€ข Consider: Connection multiplexing

Lock: query_cache (MODERATE CONTENTION)
  Contention rate: 31.4%
  Wait/Hold ratio: 0.03x  ← Acceptable
  Threads blocked: avg 0.8 / max 3

  Recommendation:
  โ€ข Lock hold time is low, contention manageable
  โ€ข Consider: Read-write lock (many reads, few writes)

Scalability Analysis:
─────────────────────────────────────────────────────────
Threads  Throughput  Efficiency  Bottleneck
1        1000 op/s   100%        -
2        1850 op/s   92.5%       connection_pool
4        2890 op/s   72.3%       connection_pool
8        3120 op/s   39.0%       connection_pool

Scaling efficiency drops at 4+ threads due to connection_pool
Theoretical max throughput: 5200 op/s (with infinite pool)

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                    CLI Interface                              │
│   Thread count, workload configuration                        │
└───────────────────────────┬──────────────────────────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Workload    │  │ Lock        │  │ Tracing     │
    │ Generator   │  │ Wrappers    │  │ Engine      │
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           └────────────────┼────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │      Lock Event Collector     │
            │  Acquire/Release timestamps   │
            └───────────────┬───────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Hold Time   │  │ Wait Time   │  │ Contention  │
    │ Calculator  │  │ Calculator  │  │ Analyzer    │
    └─────────────┘  └─────────────┘  └─────────────┘

Instrumented Lock Wrapper

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    pthread_mutex_t mutex;
    const char *name;
    // Counters are _Atomic so the atomic_fetch_add calls below are well-defined
    _Atomic uint64_t acquire_count;
    _Atomic uint64_t contention_count;  // Times the lock was already held on acquire
    _Atomic uint64_t total_hold_ns;
    _Atomic uint64_t total_wait_ns;
    _Atomic uint64_t max_hold_ns;
    _Atomic uint64_t max_wait_ns;
} profiled_mutex_t;

typedef struct {
    profiled_mutex_t *lock;
    int thread_id;
    uint64_t request_time;
    uint64_t acquire_time;
    uint64_t release_time;
} lock_event_t;

#define MAX_THREADS 64   // upper bound on worker threads for this project

// Monotonic timestamp in nanoseconds; every thread must use this same clock
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// One acquire-timestamp slot per thread (assumes a thread holds at most one
// profiled lock at a time; use a small per-thread stack if that's not true)
static uint64_t thread_local_acquire_time[MAX_THREADS];

void profiled_lock(profiled_mutex_t *m, int tid) {
    uint64_t request_time = now_ns();

    // Try to acquire first - if that fails, the lock was contended
    if (pthread_mutex_trylock(&m->mutex) != 0) {
        atomic_fetch_add(&m->contention_count, 1);
        pthread_mutex_lock(&m->mutex);  // Now block until it is free
    }

    uint64_t acquire_time = now_ns();
    uint64_t wait_time = acquire_time - request_time;

    atomic_fetch_add(&m->total_wait_ns, wait_time);
    atomic_fetch_add(&m->acquire_count, 1);

    // Store acquire time so the unlock wrapper can compute hold time
    thread_local_acquire_time[tid] = acquire_time;
}

void profiled_unlock(profiled_mutex_t *m, int tid) {
    uint64_t release_time = now_ns();
    uint64_t hold_time = release_time - thread_local_acquire_time[tid];

    atomic_fetch_add(&m->total_hold_ns, hold_time);

    pthread_mutex_unlock(&m->mutex);
}

Analysis Functions

#include <stdio.h>
#include <stdlib.h>   // qsort

// Calculate contention rate
double contention_rate(profiled_mutex_t *m) {
    if (m->acquire_count == 0) return 0;
    return (double)m->contention_count / m->acquire_count;
}

// Calculate wait/hold ratio
double wait_hold_ratio(profiled_mutex_t *m) {
    if (m->total_hold_ns == 0) return 0;
    return (double)m->total_wait_ns / m->total_hold_ns;
}

// Comparator for qsort: descending by total wait time (largest impact first)
static int compare_wait_time(const void *a, const void *b) {
    const profiled_mutex_t *la = a, *lb = b;
    if (lb->total_wait_ns > la->total_wait_ns) return 1;
    if (lb->total_wait_ns < la->total_wait_ns) return -1;
    return 0;
}

// Rank locks by contention impact (run after the worker threads have stopped,
// since qsort moves the structs and the mutexes must not be in use)
void rank_contention_hotspots(profiled_mutex_t *locks, int n) {
    // Sort by total wait time (impact on throughput)
    qsort(locks, n, sizeof(profiled_mutex_t), compare_wait_time);

    for (int i = 0; i < n; i++) {
        printf("%s: %.1f%% contention, %.2fx wait/hold\n",
               locks[i].name,
               contention_rate(&locks[i]) * 100,
               wait_hold_ratio(&locks[i]));
    }
}

Phased Implementation Guide

Phase 1: Instrumented Locks (Days 1-3)

Goal: Create lock wrappers that track timing.

Steps:

  1. Implement profiled_mutex_t structure
  2. Create lock/unlock wrappers with timing
  3. Use atomic operations for counters
  4. Test with simple multi-threaded workload
  5. Verify timing accuracy against wall clock

Validation: Hold + wait times sum correctly.
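
A minimal harness for this phase, assuming the profiled_mutex_t wrappers and now_ns() sketched earlier; the thread and iteration counts are arbitrary:

#include <pthread.h>
#include <stdio.h>

#define TEST_THREADS 4
#define TEST_ITERS   100000

static profiled_mutex_t test_lock = { .mutex = PTHREAD_MUTEX_INITIALIZER,
                                      .name  = "test_lock" };

static void *hammer(void *arg) {
    int tid = (int)(long)arg;
    for (int i = 0; i < TEST_ITERS; i++) {
        profiled_lock(&test_lock, tid);
        // tiny critical section: under contention, wait time should dominate hold time
        profiled_unlock(&test_lock, tid);
    }
    return NULL;
}

int main(void) {
    pthread_t t[TEST_THREADS];
    for (long i = 0; i < TEST_THREADS; i++)
        pthread_create(&t[i], NULL, hammer, (void *)i);
    for (int i = 0; i < TEST_THREADS; i++)
        pthread_join(t[i], NULL);

    printf("acquisitions=%llu contended=%llu avg_hold=%llu ns avg_wait=%llu ns\n",
           (unsigned long long)test_lock.acquire_count,
           (unsigned long long)test_lock.contention_count,
           (unsigned long long)(test_lock.total_hold_ns / test_lock.acquire_count),
           (unsigned long long)(test_lock.total_wait_ns / test_lock.acquire_count));
    return 0;
}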

Phase 2: Contention Detection (Days 4-5)

Goal: Detect and count contention events.

Steps:

  1. Use trylock to detect contention
  2. Track per-thread wait times
  3. Calculate contention rate
  4. Identify hot locks

Validation: High-contention workload shows expected rates.

Phase 3: Scalability Analysis (Days 6-8)

Goal: Measure throughput at different thread counts.

Steps:

  1. Run workload at 1, 2, 4, 8, 16 threads
  2. Calculate throughput at each level
  3. Compute parallel efficiency
  4. Identify scaling bottlenecks

Validation: Matches Amdahl's law predictions.
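
The efficiency column follows directly from the throughput measurements: efficiency(N) = throughput(N) / (N * throughput(1)). A small sketch of the computation, using the illustrative numbers from the example report above:

#include <stdio.h>

// Parallel efficiency: observed speedup divided by ideal speedup
static double parallel_efficiency(double tput1, double tputN, int n) {
    return tputN / (n * tput1);
}

int main(void) {
    int    threads[] = {1, 2, 4, 8};
    double tput[]    = {1000, 1850, 2890, 3120};   // ops/sec (illustrative)
    for (int i = 0; i < 4; i++)
        printf("%d threads: %.0f op/s, efficiency %.1f%%\n",
               threads[i], tput[i],
               100.0 * parallel_efficiency(tput[0], tput[i], threads[i]));
    return 0;
}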

Phase 4: Visualization (Days 9-11)

Goal: Generate lock timeline and contention graphs.

Steps:

  1. Log lock events with timestamps
  2. Generate timeline visualization (HTML/SVG)
  3. Create contention heat map
  4. Show thread blocking relationships

Validation: Timeline accurately shows blocking.
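
One convenient intermediate format is a flat CSV of lock events that both the timeline and contention-map commands can read. A sketch, assuming the lock_event_t records from the architecture section have been collected into an array; dump_events_csv is a name introduced here for illustration:

#include <stdio.h>

// Dump collected lock events as CSV, one row per acquire/release pair.
// The HTML/SVG generators can then be driven entirely from this file.
void dump_events_csv(const char *path, const lock_event_t *events, size_t n) {
    FILE *f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "lock,thread,request_ns,acquire_ns,release_ns\n");
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%s,%d,%llu,%llu,%llu\n",
                events[i].lock->name, events[i].thread_id,
                (unsigned long long)events[i].request_time,
                (unsigned long long)events[i].acquire_time,
                (unsigned long long)events[i].release_time);
    fclose(f);
}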

Phase 5: Recommendations Engine (Days 12-14)

Goal: Suggest optimizations based on patterns.

Steps:

  1. Classify lock usage patterns
  2. Match patterns to known solutions
  3. Generate actionable recommendations
  4. Estimate improvement potential

Validation: Recommendations are sensible for test cases.
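
A first cut at the classifier can simply threshold the metrics already collected; the cutoffs below mirror the ones used earlier in this document and are starting points, not hard rules:

#include <stdio.h>

// Rough pattern matcher over the per-lock statistics
void recommend(profiled_mutex_t *m) {
    double rate  = contention_rate(m);
    double ratio = wait_hold_ratio(m);

    if (rate > 0.90)
        printf("%s: SEVERE bottleneck - split the lock (striping) or redesign the data structure\n", m->name);
    else if (rate > 0.50)
        printf("%s: hot lock - shrink the critical section or move work outside it\n", m->name);
    else if (ratio > 1.0)
        printf("%s: waits exceed holds - check for slow holders (I/O or page faults under the lock)\n", m->name);
    else
        printf("%s: contention looks manageable\n", m->name);
}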


Testing Strategy

Synthetic Workloads

1. Single Hot Lock

// All threads compete for one lock
pthread_mutex_t hot_lock = PTHREAD_MUTEX_INITIALIZER;
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&hot_lock);
    work();
    pthread_mutex_unlock(&hot_lock);
}

Expected: High contention, poor scaling.

2. Striped Locks

// Each thread has own lock
pthread_mutex_t locks[N_THREADS];
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&locks[thread_id]);
    work();
    pthread_mutex_unlock(&locks[thread_id]);
}

Expected: Zero contention, linear scaling.

3. Reader-Writer Imbalance

// 95% reads, 5% writes
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
if (drand48() < 0.95) {              // drand48() returns a double in [0, 1)
    pthread_rwlock_rdlock(&rwlock);
    read_data();
    pthread_rwlock_unlock(&rwlock);
} else {
    pthread_rwlock_wrlock(&rwlock);
    write_data();
    pthread_rwlock_unlock(&rwlock);
}

Expected: Moderate contention on writes.

Correctness Tests

  1. Race condition detection: Use ThreadSanitizer
  2. Timing accuracy: Compare with external timer
  3. Counter overflow: Handle long-running tests

Common Pitfalls and Debugging

Pitfall 1: Probe Effect (Observer Changes Outcome)

Symptom: Contention disappears when profiling.

Cause: Profiling overhead changes timing.

Solution: Use minimal overhead tracing:

// BAD: String formatting in the hot path
pthread_mutex_lock(&m->mutex);
printf("Thread %d acquired lock %s\n", tid, m->name);

// GOOD: Just store a compact event, format it after the run
lock_events[event_count++] = (lock_event_t){
    .lock = m, .thread_id = tid, .acquire_time = now_ns()
};
// (event_count must itself be per-thread or atomic, or it becomes a new race)

Pitfall 2: Non-Atomic Counter Updates

Symptom: Counter values are wrong/negative.

Cause: Multiple threads updating same counter.

Solution:

// BAD: Race condition
m->total_hold_ns += hold_time;

// GOOD: Atomic operation
atomic_fetch_add(&m->total_hold_ns, hold_time);

// ALSO GOOD: Per-thread counters, aggregate later
thread_local_hold_ns[tid] += hold_time;

Pitfall 3: Measuring a Lock That's Never Contended

Symptom: All locks show 0% contention.

Cause: Workload doesn't have enough parallelism.

Solution: Ensure work inside critical section is substantial:

// BAD: No real work, lock cycles too fast
pthread_mutex_lock(&m);
counter++;
pthread_mutex_unlock(&m);

// GOOD: Simulate real work
pthread_mutex_lock(&m);
for (int i = 0; i < 1000; i++) {
    data[i] = compute(data[i]);
}
pthread_mutex_unlock(&m);

Pitfall 4: Clock Skew Between Threads

Symptom: Negative wait times.

Cause: Different threads using different time sources.

Solution: Always use CLOCK_MONOTONIC, same for all threads.


Extensions and Challenges

Extension 1: Lock Hierarchy Verification

Build checker that:

  • Records lock acquisition order per thread
  • Detects potential deadlock cycles
  • Reports violations of lock ordering
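
A minimal sketch of the recorder, assuming each lock has been assigned a small integer id; it only flags direct two-lock order conflicts, and a full checker would search for longer cycles:

#include <stdbool.h>

#define MAX_LOCKS 64
#define MAX_HELD  16

// held_before[a][b] is set when some thread acquired lock b while holding lock a.
// Updates race benignly in this sketch; a production checker would use atomics.
static bool held_before[MAX_LOCKS][MAX_LOCKS];
static _Thread_local int held_stack[MAX_HELD];
static _Thread_local int held_top;

// Call right after acquiring lock `id`
void order_on_acquire(int id) {
    for (int i = 0; i < held_top; i++)
        held_before[held_stack[i]][id] = true;
    if (held_top < MAX_HELD)
        held_stack[held_top++] = id;
}

// Call right before releasing lock `id`
void order_on_release(int id) {
    for (int i = held_top - 1; i >= 0; i--)
        if (held_stack[i] == id) {
            held_stack[i] = held_stack[--held_top];
            break;
        }
}

// Offline check: an edge in both directions means two threads took the
// same pair of locks in opposite orders - a potential deadlock.
bool has_order_violation(void) {
    for (int a = 0; a < MAX_LOCKS; a++)
        for (int b = a + 1; b < MAX_LOCKS; b++)
            if (held_before[a][b] && held_before[b][a])
                return true;
    return false;
}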

Extension 2: RCU Comparison

Implement Read-Copy-Update pattern:

  • Compare with reader-writer lock
  • Measure read-side overhead
  • Show write-side cost trade-off

Extension 3: Lock-Free Comparison

For counter workload, compare:

  • Mutex-protected counter
  • Atomic increment
  • Thread-local counters with periodic merge
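
A sketch of the three variants under test; the thread count and the padding factor (8 slots of 8 bytes = one 64-byte line per thread) are assumptions made here to keep the per-thread counters on separate cache lines:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NT 8   // number of worker threads (illustrative)

// 1. Mutex-protected counter: every increment serializes on one lock
static pthread_mutex_t cnt_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t locked_counter;

void inc_locked(void) {
    pthread_mutex_lock(&cnt_lock);
    locked_counter++;
    pthread_mutex_unlock(&cnt_lock);
}

// 2. Atomic increment: no lock, but the cache line still bounces between cores
static atomic_uint_fast64_t atomic_counter;

void inc_atomic(void) {
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}

// 3. Thread-local slots merged at the end: no sharing during the hot loop
static uint64_t local_counter[NT * 8];   // 8 slots apart = one cache line per thread

void inc_local(int tid) {
    local_counter[tid * 8]++;
}

uint64_t merge_local(void) {
    uint64_t sum = 0;
    for (int t = 0; t < NT; t++)
        sum += local_counter[t * 8];
    return sum;
}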

Challenge: Real System Profiling

Profile actual multi-threaded application:

  • Instrument its lock usage
  • Identify real contention hotspots
  • Implement and validate optimization

Real-World Connections

Industry Patterns

  1. Java synchronized: Biased locking, lock elision
  2. Linux kernel: Spinlocks, RCU, seqlocks
  3. Database systems: Two-phase locking, MVCC
  4. Go runtime: Mutex with spinning, channel-based sync

Famous Lock Optimizations

  • glibc pthread_mutex: Futex-based, minimal syscalls
  • WebKit WTF::Lock: Parking lot, efficient wait
  • Folly SharedMutex: Distributed reader counts
  • MCS locks: Fair queuing, cache-friendly spinning

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can measure hold time and wait time separately
  • You correctly identify which locks are contended
  • Scalability test shows expected Amdahl's law behavior
  • Timeline visualization accurately shows blocking
  • Recommendations make sense for synthetic workloads
  • Profiling overhead is < 10% of workload time
  • You understand when to use each lock optimization

Resources

Essential Reading

  • "The Art of Multiprocessor Programming" by Herlihy & Shavit
  • "Systems Performance" by Gregg, Chapter 6
  • "Operating Systems: Three Easy Pieces" by Arpaci-Dusseau, Chapter 26

Reference

  • POSIX Threads (pthreads) man pages
  • Linux futex documentation
  • "Is Parallel Programming Hard?" (Paul McKenney)

Tools

  • pthread profilers (mutrace, lockdep)
  • ThreadSanitizer for race detection
  • perf lock: Built-in Linux lock profiler