Project 8: Lock Contention and Concurrency Profiler

Project Overview

Attribute               Details
Difficulty              Advanced
Time Estimate           1-2 weeks
Primary Language        C
Alternative Languages   Go, Rust, C++
Knowledge Area          Concurrency and Contention
Tools Required          perf, tracing tools, pthreads
Primary Reference       "The Art of Multiprocessor Programming" by Herlihy & Shavit

Learning Objectives

By completing this project, you will be able to:

  1. Measure lock hold time and wait time separately and accurately
  2. Identify contention hotspots that limit scalability
  3. Visualize lock dependency graphs showing thread relationships
  4. Calculate throughput impact of contention on parallel efficiency
  5. Apply lock optimization techniques (lock striping, RCU, lock-free)
  6. Diagnose priority inversion and convoy effects

Deep Theoretical Foundation

Why Locks Kill Performance

Locks serialize access to shared resources. When multiple threads compete for one lock:

Thread execution timeline with lock contention:
Time →  0    1    2    3    4    5    6    7    8    9
T1:    [====LOCK HELD====][                        ]
T2:    [WAIT][WAIT][WAIT][====LOCK HELD====][     ]
T3:    [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][==LOCK]
T4:    [WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT][WAIT]

[====] = Doing useful work
[WAIT] = Blocked on lock

With 4 threads but 1 lock, maximum parallelism is 1. Threads spend more time waiting than working.

Amdahl's Law for Contention

The serialized portion limits speedup:

Speedup = 1 / (S + P/N)

Where:
  S = serial fraction (lock-protected code)
  P = parallel fraction (1 - S)
  N = number of threads

Example: 5% serial, 8 threads
  Speedup = 1 / (0.05 + 0.95/8) = 1 / (0.05 + 0.119) = 5.9x

With 5% serial code, 8 threads give at most a 5.9x speedup; even infinitely many threads cannot exceed 1/S = 20x!
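
A quick way to build intuition is to tabulate the prediction. The sketch below is a plain C program that evaluates the formula; the 5% serial fraction and the thread counts are illustrative values, not measurements.

#include <stdio.h>

// Amdahl's law: predicted speedup for serial fraction s on n threads
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.05;                        // 5% of work under the lock (illustrative)
    int threads[] = {1, 2, 4, 8, 16, 64};
    for (int i = 0; i < 6; i++)
        printf("%2d threads -> %.2fx speedup\n", threads[i], amdahl_speedup(s, threads[i]));
    return 0;
}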

Lock Metrics Explained

Hold Time: Duration lock is held by a thread

  • Longer hold time = more waiting for others
  • Target: microseconds, not milliseconds

Wait Time: Duration thread waits before acquiring lock

  • High wait time indicates contention
  • Directly impacts tail latency

Contention Rate: Attempts to acquire already-held lock

  • Rate > 50% indicates hot lock
  • Rate > 90% indicates severe bottleneck

Lock Convoy: Phenomenon where threads line up behind slow holder

  • One slow thread delays entire queue
  • Unfair locks exacerbate convoys

Types of Locking Bugs

1. Contention Hotspot

// BAD: Single global lock
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

void increment(int id) {
    pthread_mutex_lock(&global_lock);  // All threads compete here
    counters[id]++;
    pthread_mutex_unlock(&global_lock);
}

// GOOD: Per-element lock
pthread_mutex_t counter_locks[N];

void increment(int id) {
    pthread_mutex_lock(&counter_locks[id]);  // Contention only when two threads hit the same element
    counters[id]++;
    pthread_mutex_unlock(&counter_locks[id]);
}

2. Priority Inversion

High-priority thread H waits for lock held by low-priority L
Medium-priority M preempts L (it doesn't need the lock)
Result: H waits for M, even though H > M priority

Solution: Priority inheritance - L temporarily gets H's priority
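
POSIX exposes priority inheritance through mutex attributes. A minimal sketch, assuming the platform supports _POSIX_THREAD_PRIO_INHERIT (Linux with glibc does); make_pi_mutex is a helper name introduced here for illustration:

#include <pthread.h>

// Create a mutex with priority inheritance: a low-priority holder is
// temporarily boosted to the priority of its highest-priority waiter.
int make_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}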

3. Lock Ordering Deadlock

Thread 1: lock(A) → lock(B)
Thread 2: lock(B) → lock(A)

If both hold first lock, neither can proceed
Solution: Always acquire locks in consistent order
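
One common way to enforce a consistent order is to sort by lock address before acquiring. A small sketch; lock_pair_ordered is a hypothetical helper, not a pthreads API:

#include <pthread.h>
#include <stdint.h>

// Acquire two mutexes in a globally consistent order (lowest address first),
// so no two threads can ever hold them in opposite orders.
void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a == b) {                       // same lock passed twice: take it once
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}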

4. False Sharing

struct ThreadData {
    int counter;  // Thread 0 writes here
    // No padding
    int other;    // Thread 1 writes here
};

// Both variables on same cache line (64 bytes)
// Writing one invalidates the other's cache
// Solution: Pad to cache line boundary
struct ThreadData {
    int counter;
    char padding[60];  // Fill out the 64-byte cache line (assumes 4-byte int)
    int other;
};
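
Hand-counted padding breaks silently if a field's size changes. An alternative sketch using C11 alignment, assuming 64-byte cache lines (verify the line size on your machine, e.g. via sysconf on Linux); the struct name is changed only to avoid clashing with the version above:

#include <stdalign.h>

#define CACHE_LINE 64   // assumed cache line size

// Each field gets its own cache line, regardless of field sizes
struct PaddedThreadData {
    alignas(CACHE_LINE) int counter;   // thread 0 writes here
    alignas(CACHE_LINE) int other;     // thread 1 writes here
};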

Complete Project Specification

What You're Building

A lock profiling toolkit called lock_prof that:

  1. Measures lock hold and wait times per-lock and per-thread
  2. Identifies contention hotspots with ranking
  3. Visualizes lock access patterns as timeline and graph
  4. Recommends optimizations based on patterns
  5. Compares throughput at different thread counts

Functional Requirements

lock_prof run --workload <name> --threads <n> --duration <sec>
lock_prof analyze --data <file.csv>
lock_prof timeline --data <file.csv> --output <timeline.html>
lock_prof contention-map --data <file.csv> --output <map.svg>
lock_prof scale-test --workload <name> --max-threads <n>

Example Output

Lock Contention Report
═══════════════════════════════════════════════════════
Workload: database_query
Threads: 8
Duration: 60 seconds

Lock Summary:
─────────────────────────────────────────────────────────
Lock Name        Acquisitions   Hold Time     Wait Time
                                (avg/p99)     (avg/p99)
connection_pool  45,231         0.8/4.2 ms    8.1/42.7 ms
query_cache      892,451        12/87 µs      0.4/2.1 ms
stats_lock       156,789        3/18 µs       0.02/0.1 ms

Contention Analysis:
─────────────────────────────────────────────────────────
Lock: connection_pool (CRITICAL CONTENTION)
  Contention rate: 73.2%
  Wait/Hold ratio: 10.1x  ← Waiting 10x longer than holding!
  Threads blocked: avg 5.8 / max 8
  Impact: 42% of total latency from this lock

  Recommendation:
  โ€ข Pool size: 4 connections for 8 threads (UNDERSIZED)
  โ€ข Consider: Increase pool to 8+ connections
  โ€ข Consider: Connection multiplexing

Lock: query_cache (MODERATE CONTENTION)
  Contention rate: 31.4%
  Wait/Hold ratio: 0.03x  ← Acceptable
  Threads blocked: avg 0.8 / max 3

  Recommendation:
  โ€ข Lock hold time is low, contention manageable
  โ€ข Consider: Read-write lock (many reads, few writes)

Scalability Analysis:
─────────────────────────────────────────────────────────
Threads  Throughput  Efficiency  Bottleneck
1        1000 op/s   100%        -
2        1850 op/s   92.5%       connection_pool
4        2890 op/s   72.3%       connection_pool
8        3120 op/s   39.0%       connection_pool

Scaling efficiency drops at 4+ threads due to connection_pool
Theoretical max throughput: 5200 op/s (with infinite pool)

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                    CLI Interface                              │
│   Thread count, workload configuration                        │
└───────────────────────────┬──────────────────────────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Workload    │  │ Lock        │  │ Tracing     │
    │ Generator   │  │ Wrappers    │  │ Engine      │
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           └────────────────┼────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │      Lock Event Collector     │
            │  Acquire/Release timestamps   │
            └───────────────┬───────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Hold Time   │  │ Wait Time   │  │ Contention  │
    │ Calculator  │  │ Calculator  │  │ Analyzer    │
    └─────────────┘  └─────────────┘  └─────────────┘

Instrumented Lock Wrapper

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    pthread_mutex_t mutex;
    const char *name;
    // Counters are _Atomic so the atomic_fetch_add calls below are well-defined
    _Atomic uint64_t acquire_count;
    _Atomic uint64_t contention_count;  // Times the lock was already held on acquire
    _Atomic uint64_t total_hold_ns;
    _Atomic uint64_t total_wait_ns;
    _Atomic uint64_t max_hold_ns;
    _Atomic uint64_t max_wait_ns;
} profiled_mutex_t;

typedef struct {
    profiled_mutex_t *lock;
    int thread_id;
    uint64_t request_time;
    uint64_t acquire_time;
    uint64_t release_time;
} lock_event_t;

#define MAX_THREADS 64   // upper bound on worker threads for this project

// Monotonic timestamp in nanoseconds; every thread must use this same clock
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// One acquire-timestamp slot per thread (assumes a thread holds at most one
// profiled lock at a time; use a small per-thread stack if that's not true)
static uint64_t thread_local_acquire_time[MAX_THREADS];

void profiled_lock(profiled_mutex_t *m, int tid) {
    uint64_t request_time = now_ns();

    // Try to acquire first - if that fails, the lock was contended
    if (pthread_mutex_trylock(&m->mutex) != 0) {
        atomic_fetch_add(&m->contention_count, 1);
        pthread_mutex_lock(&m->mutex);  // Now block until it is free
    }

    uint64_t acquire_time = now_ns();
    uint64_t wait_time = acquire_time - request_time;

    atomic_fetch_add(&m->total_wait_ns, wait_time);
    atomic_fetch_add(&m->acquire_count, 1);

    // Store acquire time so the unlock wrapper can compute hold time
    thread_local_acquire_time[tid] = acquire_time;
}

void profiled_unlock(profiled_mutex_t *m, int tid) {
    uint64_t release_time = now_ns();
    uint64_t hold_time = release_time - thread_local_acquire_time[tid];

    atomic_fetch_add(&m->total_hold_ns, hold_time);

    pthread_mutex_unlock(&m->mutex);
}

Analysis Functions

#include <stdio.h>
#include <stdlib.h>   // qsort

// Calculate contention rate
double contention_rate(profiled_mutex_t *m) {
    if (m->acquire_count == 0) return 0;
    return (double)m->contention_count / m->acquire_count;
}

// Calculate wait/hold ratio
double wait_hold_ratio(profiled_mutex_t *m) {
    if (m->total_hold_ns == 0) return 0;
    return (double)m->total_wait_ns / m->total_hold_ns;
}

// Comparator for qsort: descending by total wait time (largest impact first)
static int compare_wait_time(const void *a, const void *b) {
    const profiled_mutex_t *la = a, *lb = b;
    if (lb->total_wait_ns > la->total_wait_ns) return 1;
    if (lb->total_wait_ns < la->total_wait_ns) return -1;
    return 0;
}

// Rank locks by contention impact (run after the worker threads have stopped,
// since qsort moves the structs and the mutexes must not be in use)
void rank_contention_hotspots(profiled_mutex_t *locks, int n) {
    // Sort by total wait time (impact on throughput)
    qsort(locks, n, sizeof(profiled_mutex_t), compare_wait_time);

    for (int i = 0; i < n; i++) {
        printf("%s: %.1f%% contention, %.2fx wait/hold\n",
               locks[i].name,
               contention_rate(&locks[i]) * 100,
               wait_hold_ratio(&locks[i]));
    }
}

Phased Implementation Guide

Phase 1: Instrumented Locks (Days 1-3)

Goal: Create lock wrappers that track timing.

Steps:

  1. Implement profiled_mutex_t structure
  2. Create lock/unlock wrappers with timing
  3. Use atomic operations for counters
  4. Test with simple multi-threaded workload
  5. Verify timing accuracy against wall clock

Validation: Hold + wait times sum correctly.
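
A minimal harness for this phase, assuming the profiled_mutex_t wrappers and now_ns() sketched earlier; the thread and iteration counts are arbitrary:

#include <pthread.h>
#include <stdio.h>

#define TEST_THREADS 4
#define TEST_ITERS   100000

static profiled_mutex_t test_lock = { .mutex = PTHREAD_MUTEX_INITIALIZER,
                                      .name  = "test_lock" };

static void *hammer(void *arg) {
    int tid = (int)(long)arg;
    for (int i = 0; i < TEST_ITERS; i++) {
        profiled_lock(&test_lock, tid);
        // tiny critical section: under contention, wait time should dominate hold time
        profiled_unlock(&test_lock, tid);
    }
    return NULL;
}

int main(void) {
    pthread_t t[TEST_THREADS];
    for (long i = 0; i < TEST_THREADS; i++)
        pthread_create(&t[i], NULL, hammer, (void *)i);
    for (int i = 0; i < TEST_THREADS; i++)
        pthread_join(t[i], NULL);

    printf("acquisitions=%llu contended=%llu avg_hold=%llu ns avg_wait=%llu ns\n",
           (unsigned long long)test_lock.acquire_count,
           (unsigned long long)test_lock.contention_count,
           (unsigned long long)(test_lock.total_hold_ns / test_lock.acquire_count),
           (unsigned long long)(test_lock.total_wait_ns / test_lock.acquire_count));
    return 0;
}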

Phase 2: Contention Detection (Days 4-5)

Goal: Detect and count contention events.

Steps:

  1. Use trylock to detect contention
  2. Track per-thread wait times
  3. Calculate contention rate
  4. Identify hot locks

Validation: High-contention workload shows expected rates.

Phase 3: Scalability Analysis (Days 6-8)

Goal: Measure throughput at different thread counts.

Steps:

  1. Run workload at 1, 2, 4, 8, 16 threads
  2. Calculate throughput at each level
  3. Compute parallel efficiency
  4. Identify scaling bottlenecks

Validation: Matches Amdahl's law predictions.
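
The efficiency column follows directly from the throughput measurements: efficiency(N) = throughput(N) / (N * throughput(1)). A small sketch of the computation, using the illustrative numbers from the example report above:

#include <stdio.h>

// Parallel efficiency: observed speedup divided by ideal speedup
static double parallel_efficiency(double tput1, double tputN, int n) {
    return tputN / (n * tput1);
}

int main(void) {
    int    threads[] = {1, 2, 4, 8};
    double tput[]    = {1000, 1850, 2890, 3120};   // ops/sec (illustrative)
    for (int i = 0; i < 4; i++)
        printf("%d threads: %.0f op/s, efficiency %.1f%%\n",
               threads[i], tput[i],
               100.0 * parallel_efficiency(tput[0], tput[i], threads[i]));
    return 0;
}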

Phase 4: Visualization (Days 9-11)

Goal: Generate lock timeline and contention graphs.

Steps:

  1. Log lock events with timestamps
  2. Generate timeline visualization (HTML/SVG)
  3. Create contention heat map
  4. Show thread blocking relationships

Validation: Timeline accurately shows blocking.
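
One convenient intermediate format is a flat CSV of lock events that both the timeline and contention-map commands can read. A sketch, assuming the lock_event_t records from the architecture section have been collected into an array; dump_events_csv is a name introduced here for illustration:

#include <stdio.h>

// Dump collected lock events as CSV, one row per acquire/release pair.
// The HTML/SVG generators can then be driven entirely from this file.
void dump_events_csv(const char *path, const lock_event_t *events, size_t n) {
    FILE *f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "lock,thread,request_ns,acquire_ns,release_ns\n");
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%s,%d,%llu,%llu,%llu\n",
                events[i].lock->name, events[i].thread_id,
                (unsigned long long)events[i].request_time,
                (unsigned long long)events[i].acquire_time,
                (unsigned long long)events[i].release_time);
    fclose(f);
}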

Phase 5: Recommendations Engine (Days 12-14)

Goal: Suggest optimizations based on patterns.

Steps:

  1. Classify lock usage patterns
  2. Match patterns to known solutions
  3. Generate actionable recommendations
  4. Estimate improvement potential

Validation: Recommendations are sensible for test cases.
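
A first cut at the classifier can simply threshold the metrics already collected; the cutoffs below mirror the ones used earlier in this document and are starting points, not hard rules:

#include <stdio.h>

// Rough pattern matcher over the per-lock statistics
void recommend(profiled_mutex_t *m) {
    double rate  = contention_rate(m);
    double ratio = wait_hold_ratio(m);

    if (rate > 0.90)
        printf("%s: SEVERE bottleneck - split the lock (striping) or redesign the data structure\n", m->name);
    else if (rate > 0.50)
        printf("%s: hot lock - shrink the critical section or move work outside it\n", m->name);
    else if (ratio > 1.0)
        printf("%s: waits exceed holds - check for slow holders (I/O or page faults under the lock)\n", m->name);
    else
        printf("%s: contention looks manageable\n", m->name);
}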


Testing Strategy

Synthetic Workloads

1. Single Hot Lock

// All threads compete for one lock
pthread_mutex_t hot_lock = PTHREAD_MUTEX_INITIALIZER;
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&hot_lock);
    work();
    pthread_mutex_unlock(&hot_lock);
}

Expected: High contention, poor scaling.

2. Striped Locks

// Each thread has own lock
pthread_mutex_t locks[N_THREADS];
for (int i = 0; i < 1000000; i++) {
    pthread_mutex_lock(&locks[thread_id]);
    work();
    pthread_mutex_unlock(&locks[thread_id]);
}

Expected: Zero contention, linear scaling.

3. Reader-Writer Imbalance

// 95% reads, 5% writes
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
if (drand48() < 0.95) {              // drand48() returns a double in [0, 1)
    pthread_rwlock_rdlock(&rwlock);
    read_data();
    pthread_rwlock_unlock(&rwlock);
} else {
    pthread_rwlock_wrlock(&rwlock);
    write_data();
    pthread_rwlock_unlock(&rwlock);
}

Expected: Moderate contention on writes.

Correctness Tests

  1. Race condition detection: Use ThreadSanitizer
  2. Timing accuracy: Compare with external timer
  3. Counter overflow: Handle long-running tests

Common Pitfalls and Debugging

Pitfall 1: Probe Effect (Observer Changes Outcome)

Symptom: Contention disappears when profiling.

Cause: Profiling overhead changes timing.

Solution: Use minimal overhead tracing:

// BAD: String formatting in the hot path
pthread_mutex_lock(&m->mutex);
printf("Thread %d acquired lock %s\n", tid, m->name);

// GOOD: Just store a compact event, format it after the run
lock_events[event_count++] = (lock_event_t){
    .lock = m, .thread_id = tid, .acquire_time = now_ns()
};
// (event_count must itself be per-thread or atomic, or it becomes a new race)

Pitfall 2: Non-Atomic Counter Updates

Symptom: Counter values are wrong/negative.

Cause: Multiple threads updating same counter.

Solution:

// BAD: Race condition
m->total_hold_ns += hold_time;

// GOOD: Atomic operation
atomic_fetch_add(&m->total_hold_ns, hold_time);

// ALSO GOOD: Per-thread counters, aggregate later
thread_local_hold_ns[tid] += hold_time;

Pitfall 3: Measuring a Lock That's Never Contended

Symptom: All locks show 0% contention.

Cause: Workload doesn't have enough parallelism.

Solution: Ensure work inside critical section is substantial:

// BAD: No real work, lock cycles too fast
pthread_mutex_lock(&m);
counter++;
pthread_mutex_unlock(&m);

// GOOD: Simulate real work
pthread_mutex_lock(&m);
for (int i = 0; i < 1000; i++) {
    data[i] = compute(data[i]);
}
pthread_mutex_unlock(&m);

Pitfall 4: Clock Skew Between Threads

Symptom: Negative wait times.

Cause: Different threads using different time sources.

Solution: Always use CLOCK_MONOTONIC, same for all threads.


Extensions and Challenges

Extension 1: Lock Hierarchy Verification

Build checker that:

  • Records lock acquisition order per thread
  • Detects potential deadlock cycles
  • Reports violations of lock ordering
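
A minimal sketch of the recorder, assuming each lock has been assigned a small integer id; it only flags direct two-lock order conflicts, and a full checker would search for longer cycles:

#include <stdbool.h>

#define MAX_LOCKS 64
#define MAX_HELD  16

// held_before[a][b] is set when some thread acquired lock b while holding lock a.
// Updates race benignly in this sketch; a production checker would use atomics.
static bool held_before[MAX_LOCKS][MAX_LOCKS];
static _Thread_local int held_stack[MAX_HELD];
static _Thread_local int held_top;

// Call right after acquiring lock `id`
void order_on_acquire(int id) {
    for (int i = 0; i < held_top; i++)
        held_before[held_stack[i]][id] = true;
    if (held_top < MAX_HELD)
        held_stack[held_top++] = id;
}

// Call right before releasing lock `id`
void order_on_release(int id) {
    for (int i = held_top - 1; i >= 0; i--)
        if (held_stack[i] == id) {
            held_stack[i] = held_stack[--held_top];
            break;
        }
}

// Offline check: an edge in both directions means two threads took the
// same pair of locks in opposite orders - a potential deadlock.
bool has_order_violation(void) {
    for (int a = 0; a < MAX_LOCKS; a++)
        for (int b = a + 1; b < MAX_LOCKS; b++)
            if (held_before[a][b] && held_before[b][a])
                return true;
    return false;
}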

Extension 2: RCU Comparison

Implement Read-Copy-Update pattern:

  • Compare with reader-writer lock
  • Measure read-side overhead
  • Show write-side cost trade-off

Extension 3: Lock-Free Comparison

For counter workload, compare:

  • Mutex-protected counter
  • Atomic increment
  • Thread-local counters with periodic merge
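
A sketch of the three variants under test; the thread count and the padding factor (8 slots of 8 bytes = one 64-byte line per thread) are assumptions made here to keep the per-thread counters on separate cache lines:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NT 8   // number of worker threads (illustrative)

// 1. Mutex-protected counter: every increment serializes on one lock
static pthread_mutex_t cnt_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t locked_counter;

void inc_locked(void) {
    pthread_mutex_lock(&cnt_lock);
    locked_counter++;
    pthread_mutex_unlock(&cnt_lock);
}

// 2. Atomic increment: no lock, but the cache line still bounces between cores
static atomic_uint_fast64_t atomic_counter;

void inc_atomic(void) {
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}

// 3. Thread-local slots merged at the end: no sharing during the hot loop
static uint64_t local_counter[NT * 8];   // 8 slots apart = one cache line per thread

void inc_local(int tid) {
    local_counter[tid * 8]++;
}

uint64_t merge_local(void) {
    uint64_t sum = 0;
    for (int t = 0; t < NT; t++)
        sum += local_counter[t * 8];
    return sum;
}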

Challenge: Real System Profiling

Profile actual multi-threaded application:

  • Instrument its lock usage
  • Identify real contention hotspots
  • Implement and validate optimization

Real-World Connections

Industry Patterns

  1. Java synchronized: Biased locking, lock elision
  2. Linux kernel: Spinlocks, RCU, seqlocks
  3. Database systems: Two-phase locking, MVCC
  4. Go runtime: Mutex with spinning, channel-based sync

Famous Lock Optimizations

  • glibc pthread_mutex: Futex-based, minimal syscalls
  • WebKit WTF::Lock: Parking lot, efficient wait
  • Folly SharedMutex: Distributed reader counts
  • MCS locks: Fair queuing, cache-friendly spinning

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can measure hold time and wait time separately
  • You correctly identify which locks are contended
  • Scalability test shows expected Amdahl's law behavior
  • Timeline visualization accurately shows blocking
  • Recommendations make sense for synthetic workloads
  • Profiling overhead is < 10% of workload time
  • You understand when to use each lock optimization

Resources

Essential Reading

  • "The Art of Multiprocessor Programming" by Herlihy & Shavit
  • "Systems Performance" by Gregg, Chapter 6
  • "Operating Systems: Three Easy Pieces" by Arpaci-Dusseau, Chapter 26

Reference

  • POSIX Threads (pthreads) man pages
  • Linux futex documentation
  • "Is Parallel Programming Hard?" (Paul McKenney)

Tools

  • pthread profilers (mutrace, lockdep)
  • ThreadSanitizer for race detection
  • perf lock: Built-in Linux lock profiler