Project 8: Parallel Sort Benchmark Suite
Quick Reference
| Attribute | Details |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Primary Language | C++ |
| Alternative Languages | Rust, Java, Go |
| Knowledge Area | Parallel Algorithms / Performance Analysis |
| Tools Required | C++17 compiler with parallel STL, Google Benchmark (optional) |
| Primary Reference | “C++ Concurrency in Action, Second Edition” by Anthony Williams, Chapter 10.3 |
Learning Objectives
By completing this project, you will be able to:
- Implement parallel merge sort using std::async with proper task decomposition
- Understand the THRESHOLD concept for deciding when to switch from parallel to sequential
- Compare execution policies (seq, par, par_unseq) and explain when each excels
- Design statistically sound benchmarks with warmup, multiple runs, and proper measurement
- Analyze how input distributions affect sorting performance (random, sorted, reverse, nearly sorted, duplicates)
- Predict the break-even point where parallel sorting becomes faster than sequential
- Explain Amdahl’s Law in the context of sorting algorithms
Theoretical Foundation
Why Sorting Is the Canonical Parallel Algorithm
Sorting is ideal for teaching parallelism because:
- Clear decomposition: Divide-and-conquer naturally splits work
- Well-understood sequential algorithms: Baseline is known
- Measurable speedup: Easy to compare parallel vs sequential
- Real-world importance: Sorting is everywhere (databases, search, data processing)
Core Concepts
Divide-and-Conquer Parallelism
Merge sort naturally parallelizes:
Unsorted Array [8 elements]
|
+--------------+--------------+
| |
[4 elements] [4 elements]
Sort in parallel Sort in parallel
| |
+-------------+---------------+
|
Merge results
The key insight: the two halves are independent, so they can be sorted on different cores simultaneously.
The Parallel Merge Sort Algorithm
parallel_merge_sort(array, depth):
    if array.size < THRESHOLD or depth == 0:
        std::sort(array)                               // Sequential fallback
        return
    mid = array.size / 2
    // Sort the two halves in parallel
    left_future = std::async(parallel_merge_sort, left_half, depth - 1)
    parallel_merge_sort(right_half, depth - 1)         // Current thread sorts the right half
    left_future.wait()                                 // Wait for the async half to finish
    std::inplace_merge(array.begin, array.begin + mid, array.end)   // Merge the sorted halves
Why depth limit? Without any limit, the recursion would spawn O(n) tasks for n elements. With depth = log2(num_cores), we spawn just enough tasks to utilize all cores.
Why THRESHOLD? For small arrays, the overhead of spawning async tasks exceeds the benefit. The THRESHOLD prevents this.
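One way to pick the depth argument is to derive it from the core count at startup. The sketch below is illustrative (the helper name initial_depth is not from the reference); it assumes std::thread::hardware_concurrency() returns a meaningful value.

#include <cmath>
#include <thread>

// Illustrative sketch: ceil(log2(cores)) levels of splitting produce at least
// one leaf task per hardware thread.
inline int initial_depth() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;   // Fallback when the runtime cannot report a value
    return static_cast<int>(std::ceil(std::log2(static_cast<double>(cores))));
}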
Execution Policies in C++17
C++17 introduced three execution policies for standard algorithms:
| Policy | Meaning | When to Use |
|---|---|---|
| std::execution::seq | Sequential (default) | Small data, predictable timing |
| std::execution::par | Parallel | Large data, compute-bound |
| std::execution::par_unseq | Parallel + vectorized | Large data, simple comparisons |
Example usage:
std::sort(std::execution::par, data.begin(), data.end());
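For reference, a minimal self-contained comparison of the three policies might look like the sketch below. It is illustrative only and assumes a toolchain where &lt;execution&gt; is available (libstdc++ with TBB, or MSVC).

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<int64_t> base(10'000'000);
    std::mt19937_64 gen(42);
    std::uniform_int_distribution<int64_t> value_dist;
    std::generate(base.begin(), base.end(), [&] { return value_dist(gen); });

    // Time std::sort under each execution policy on a fresh copy of the data
    auto time_sort = [&](auto policy) {
        auto data = base;
        auto start = std::chrono::steady_clock::now();
        std::sort(policy, data.begin(), data.end());
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(end - start).count();
    };

    std::cout << "seq:       " << time_sort(std::execution::seq) << " ms\n";
    std::cout << "par:       " << time_sort(std::execution::par) << " ms\n";
    std::cout << "par_unseq: " << time_sort(std::execution::par_unseq) << " ms\n";
}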
The THRESHOLD Decision
Below a certain size, parallelism hurts performance:
Time vs Size:

  Time │ *   Parallel
       │  *
       │   *                  Sequential
       │    *       ..........
       │     *......
       │  ....  * <-- Break-even point
       │ .         *    *    *
       └─────────────────────────────
                                 Size
For size < THRESHOLD: Sequential wins (less overhead)
For size > THRESHOLD: Parallel wins (more cores utilized)
Typical THRESHOLD values: 10,000 - 100,000 elements
The optimal THRESHOLD depends on:
- Element size and comparison cost
- Number of cores
- Memory bandwidth
- Cache sizes
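As a starting point only, one rough heuristic (purely illustrative, not from the reference) is to size THRESHOLD so that a sequential leaf fits comfortably in a per-core cache, then tune it empirically with the --find-threshold mode.

#include <cstddef>

// Hypothetical heuristic: one leaf roughly fills a per-core L2 cache.
// e.g., 512 KiB / 8-byte elements ~= 65,536 elements, within the 10K-100K range above.
constexpr std::size_t initial_threshold(std::size_t element_size,
                                        std::size_t l2_cache_bytes = 512 * 1024) {
    std::size_t guess = l2_cache_bytes / element_size;
    return guess < 10'000 ? 10'000 : guess;   // Never below the task-overhead floor
}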
Historical Context
- 1945: John von Neumann invents merge sort for the EDVAC computer
- 1964: Ken Batcher's bitonic sort, the first practical parallel sorting network
- 1970s: Parallel merge algorithms developed for PRAM models
- 2017: C++17 brings parallel algorithms to the standard library
- Today: GPU-accelerated sorting achieves billions of elements per second
Common Misconceptions
Misconception 1: “Parallel sort is always faster” Reality: For small arrays (< 10K elements), the overhead of thread management makes parallel slower than sequential.
Misconception 2: “Speedup equals number of cores” Reality: Memory bandwidth, synchronization overhead, and Amdahl’s Law limit speedup. 8 cores typically yield 5-7x speedup.
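To see why, a quick Amdahl's Law calculation makes the ceiling concrete. The snippet below is a simple illustration that assumes a 95% parallelizable fraction; the real fraction depends on the merge phase and memory bandwidth.

#include <iostream>

// Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N), where p is the parallel fraction.
int main() {
    double p = 0.95;                        // Assumed: 95% of the work parallelizes
    for (int n : {2, 4, 8, 16, 64}) {
        double speedup = 1.0 / ((1.0 - p) + p / n);
        std::cout << n << " cores -> " << speedup << "x max speedup\n";
    }
    // With p = 0.95, even infinitely many cores cap out at 1 / (1 - p) = 20x.
}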
Misconception 3: “par_unseq is always better than par”
Reality: par_unseq helps when comparisons are cheap (integers). For complex objects or expensive comparisons, the difference is minimal.
Misconception 4: “All input distributions benefit equally from parallelism” Reality: Nearly-sorted data is fast even sequentially (adaptive algorithms). Random data shows the biggest parallel speedup.
Project Specification
What You’re Building
A comprehensive benchmark suite called sort_benchmark that:
- Compares multiple sorting approaches:
  - std::sort (sequential)
  - std::sort(std::execution::par) (parallel)
  - std::sort(std::execution::par_unseq) (parallel + vectorized)
  - Custom parallel_merge_sort implementation
- Tests multiple input distributions:
  - Random
  - Already sorted
  - Reverse sorted
  - Nearly sorted (90% sorted, 10% random swaps)
  - Many duplicates (only 10 unique values)
- Analyzes across sizes: 1K, 10K, 100K, 1M, 10M, 100M elements
- Reports statistical measures: median, mean, standard deviation, min, max
Functional Requirements
sort_benchmark --size <n> --distribution <name> --iterations <n>
sort_benchmark --full # Run complete benchmark suite
sort_benchmark --find-threshold # Find optimal THRESHOLD value
sort_benchmark --compare-cores # Test with different thread counts
sort_benchmark --report --output <file> # Generate markdown report
Example Output
Parallel Sort Benchmark Suite
═══════════════════════════════════════════════════════════════════════════════
System Configuration:
CPU: AMD Ryzen 9 5900X (12 cores, 24 threads)
RAM: 64 GB DDR4-3200
Compiler: GCC 13.1 with -O3 -march=native
C++ Standard: C++17
Data Type: int64_t (8 bytes per element)
═══════════════════════════════════════════════════════════════════════════════
=== Size: 1,000 elements ===
Distribution: Random
Iterations: 100 (after 10 warmup runs)
Algorithm Median Mean Std Dev Min Max
─────────────────────────────────────────────────────────────────────────
std::sort 0.018ms 0.019ms 0.002ms 0.017ms 0.025ms
std::sort(par) 0.142ms 0.148ms 0.018ms 0.128ms 0.195ms
std::sort(par_unseq) 0.138ms 0.145ms 0.016ms 0.125ms 0.188ms
parallel_merge_sort 0.156ms 0.162ms 0.021ms 0.142ms 0.212ms
Winner: std::sort (7.9x faster than parallel)
Analysis: Parallel overhead dominates at this size
=== Size: 100,000 elements ===
Distribution: Random
Iterations: 50 (after 5 warmup runs)
Algorithm Median Mean Std Dev Min Max
─────────────────────────────────────────────────────────────────────────
std::sort 6.82ms 6.91ms 0.15ms 6.65ms 7.32ms
std::sort(par) 1.18ms 1.22ms 0.08ms 1.12ms 1.45ms
std::sort(par_unseq) 1.14ms 1.18ms 0.07ms 1.08ms 1.38ms
parallel_merge_sort 1.32ms 1.38ms 0.09ms 1.25ms 1.62ms
Winner: std::sort(par_unseq), 6.0x speedup over sequential
Analysis: Parallel algorithms now significantly faster
=== Size: 10,000,000 elements ===
Distribution Analysis (std::sort(par)):
─────────────────────────────────────────────────────────────────────────
Distribution Time Speedup vs Random Notes
random 145ms 1.0x baseline Typical case
sorted 18ms 8.1x faster Adaptive algorithm detects order
reverse 19ms 7.6x faster Easy to detect, flip
nearly_sorted 28ms 5.2x faster Few swaps needed
many_duplicates 98ms 1.5x faster Less comparison branching
=== THRESHOLD Analysis ===
Finding optimal parallel/sequential cutoff for parallel_merge_sort...
THRESHOLD Time @ 10M elements Efficiency
─────────────────────────────────────────────────────────────────────────
1,000 892ms Poor (too much task overhead)
5,000 312ms Moderate
10,000 156ms Good
25,000 148ms Excellent ✓ (optimal)
50,000 152ms Good
100,000 168ms Moderate (underutilizing cores)
1,000,000 285ms Poor (too sequential)
Recommended THRESHOLD: 25,000 elements
=== Scaling Analysis (100M elements, random) ===
─────────────────────────────────────────────────────────────────────────
Threads Time Speedup Efficiency
1 8.45s 1.0x 100%
2 4.38s 1.93x 96.5%
4 2.31s 3.66x 91.5%
8 1.28s 6.60x 82.5%
12 0.98s 8.62x 71.8%
24 0.92s 9.18x 38.3%
Analysis:
- Near-linear scaling up to physical core count (12)
- Hyperthreading provides minimal benefit for sorting
- Theoretical max speedup: 12x, achieved: 8.62x (71.8% efficiency)
- Bottleneck: Memory bandwidth during merge phase
=== Key Findings ===
1. Break-even point: ~50,000 elements (below this, sequential is faster)
2. Optimal speedup: 6-9x on 12 cores
3. Best distribution for parallelism: Random (most work to do)
4. par_unseq vs par: 3-5% improvement on integers
5. Custom parallel_merge_sort: Within 15% of std::sort(par)
═══════════════════════════════════════════════════════════════════════════════
Benchmark Correctness Requirements
Your benchmarks must be statistically valid:
- Warmup runs: Execute 3-10 iterations before measuring (populate caches, let CPU frequency scaling settle)
- Fresh data each run: Generate new random data to avoid CPU pattern learning
- Prevent dead code elimination: Actually use the sorted result
- Multiple iterations: At least 10-50 runs for statistical significance
- Report variance: Std dev reveals measurement stability
- Flush caches between algorithms: Fair comparison
Solution Architecture
Component Design
┌─────────────────────────────────────────────────────────────────────────┐
│ CLI Interface │
│ --size, --distribution, --iterations, --full, --report │
└────────────────────────────────────┬────────────────────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Data Generator │ │ Algorithm Suite │ │ Benchmark Engine │
│ │ │ │ │ │
│ - Random │ │ - std::sort │ │ - Timer │
│ - Sorted │ │ - std::sort(par) │ │ - Statistics │
│ - Reverse │ │ - par_unseq │ │ - Warmup │
│ - Nearly sorted │ │ - Parallel merge │ │ - Iterations │
│ - Many duplicates │ └────────┬──────────┘ └────────┬─────────┘
└────────┬──────────┘ │ │
│ │ │
└────────────────────────┼────────────────────────┘
│
▼
┌─────────────────────────────┐
│ Result Aggregator │
│ │
│ - Median/Mean/StdDev │
│ - Speedup calculations │
│ - Distribution comparisons │
│ - Report generation │
└─────────────────────────────┘
Parallel Merge Sort Visualization
parallel_merge_sort([8, 3, 1, 7, 4, 6, 5, 2], depth=2)
Level 0: [8, 3, 1, 7, 4, 6, 5, 2]
|
+-----------------+-----------------+
| |
Level 1: std::async Current thread
[8, 3, 1, 7] [4, 6, 5, 2]
| |
+-------+-------+ +-------+-------+
| | | |
Level 2: [8, 3] [1, 7] [4, 6] [5, 2]
| | | |
[3, 8] [1, 7] [4, 6] [2, 5]
| | | |
+-------+-------+ +-------+-------+
| |
[1, 3, 7, 8] [2, 4, 5, 6]
| |
+-----------+-----------+
|
[1, 2, 3, 4, 5, 6, 7, 8]
Key insight:
- Level 1: Two tasks run in parallel (one via async, one in current thread)
- Level 2: Sequential (depth exhausted, below THRESHOLD)
- Merge operations are sequential (harder to parallelize efficiently)
Key Data Structures
// Input distribution types
enum class Distribution {
Random,
Sorted,
Reverse,
NearlySorted,
ManyDuplicates
};
// Algorithm variants to benchmark
enum class Algorithm {
StdSort, // std::sort (sequential)
StdSortPar, // std::sort(std::execution::par, ...)
StdSortParUnseq, // std::sort(std::execution::par_unseq, ...)
ParallelMergeSort // Custom implementation
};
// Single benchmark result
struct BenchmarkResult {
Algorithm algorithm;
Distribution distribution;
size_t size;
std::vector<double> timings_ms; // All iterations
double median() const;
double mean() const;
double stddev() const;
double min() const;
double max() const;
};
// Parallel merge sort configuration
struct ParallelMergeSortConfig {
size_t threshold = 25000; // Switch to sequential below this
int max_depth = 8; // Limit task spawning depth
};
// Complete benchmark configuration
struct BenchmarkConfig {
std::vector<size_t> sizes = {1000, 10000, 100000, 1000000, 10000000};
std::vector<Distribution> distributions = {
Distribution::Random,
Distribution::Sorted,
Distribution::Reverse,
Distribution::NearlySorted,
Distribution::ManyDuplicates
};
std::vector<Algorithm> algorithms = {
Algorithm::StdSort,
Algorithm::StdSortPar,
Algorithm::StdSortParUnseq,
Algorithm::ParallelMergeSort
};
size_t warmup_iterations = 5;
size_t measured_iterations = 20;
ParallelMergeSortConfig parallel_config;
};
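The statistics members declared on BenchmarkResult are straightforward to fill in. The sketch below shows one reasonable implementation (middle element of the sorted timings as the median, sample standard deviation); it assumes timings_ms holds at least two measurements.

#include <algorithm>
#include <cmath>
#include <numeric>

double BenchmarkResult::mean() const {
    return std::accumulate(timings_ms.begin(), timings_ms.end(), 0.0) / timings_ms.size();
}

double BenchmarkResult::median() const {
    auto sorted = timings_ms;                 // Copy so the raw timings stay untouched
    std::sort(sorted.begin(), sorted.end());
    return sorted[sorted.size() / 2];
}

double BenchmarkResult::stddev() const {
    double m = mean();
    double sq_sum = 0.0;
    for (double t : timings_ms) sq_sum += (t - m) * (t - m);
    return std::sqrt(sq_sum / (timings_ms.size() - 1));   // Sample standard deviation
}

double BenchmarkResult::min() const {
    return *std::min_element(timings_ms.begin(), timings_ms.end());
}

double BenchmarkResult::max() const {
    return *std::max_element(timings_ms.begin(), timings_ms.end());
}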
Core Algorithm Implementation
// Template for parallel merge sort with configurability
// Requires: <algorithm>, <functional>, <future>, <iterator>
template<typename RandomIt, typename Compare = std::less<>>
void parallel_merge_sort(RandomIt first, RandomIt last,
Compare comp = Compare{},
size_t threshold = 25000,
int depth = 8) {
const auto size = std::distance(first, last);
// Base case: use sequential sort for small arrays or depth exhausted
if (size < static_cast<ptrdiff_t>(threshold) || depth == 0) {
std::sort(first, last, comp);
return;
}
// Find midpoint
auto mid = first + size / 2;
// Sort first half asynchronously
auto left_future = std::async(std::launch::async,
[=, &comp]() {
parallel_merge_sort(first, mid, comp, threshold, depth - 1);
}
);
// Sort second half in current thread
parallel_merge_sort(mid, last, comp, threshold, depth - 1);
// Wait for left half to complete
left_future.get();
// Merge the sorted halves
std::inplace_merge(first, mid, last, comp);
}
// Data generation functions
// Requires: <algorithm>, <limits>, <numeric>, <random>, <vector>
template<typename T>
std::vector<T> generate_data(size_t size, Distribution dist) {
std::vector<T> data(size);
std::random_device rd;
std::mt19937_64 gen(rd());
switch (dist) {
case Distribution::Random: {
    std::uniform_int_distribution<T> value_dist(   // Renamed to avoid shadowing `dist`
        std::numeric_limits<T>::min(),
        std::numeric_limits<T>::max()
    );
    std::generate(data.begin(), data.end(),
                  [&]() { return value_dist(gen); });
break;
}
case Distribution::Sorted: {
std::iota(data.begin(), data.end(), 0);
break;
}
case Distribution::Reverse: {
std::iota(data.rbegin(), data.rend(), 0);
break;
}
case Distribution::NearlySorted: {
std::iota(data.begin(), data.end(), 0);
// Swap 10% of elements randomly
std::uniform_int_distribution<size_t> idx_dist(0, size - 1);
for (size_t i = 0; i < size / 10; ++i) {
std::swap(data[idx_dist(gen)], data[idx_dist(gen)]);
}
break;
}
case Distribution::ManyDuplicates: {
    std::uniform_int_distribution<T> value_dist(0, 9); // Only 10 unique values
    std::generate(data.begin(), data.end(),
                  [&]() { return value_dist(gen); });
break;
}
}
return data;
}
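Putting the two pieces together, a minimal smoke test (assuming the definitions above are visible in the same translation unit) could look like this:

#include <algorithm>
#include <cassert>
#include <iostream>

int main() {
    // Generate 1M random int64_t values, sort them with the custom parallel sort,
    // and verify the result with std::is_sorted.
    auto data = generate_data<int64_t>(1'000'000, Distribution::Random);
    parallel_merge_sort(data.begin(), data.end());
    assert(std::is_sorted(data.begin(), data.end()));
    std::cout << "parallel_merge_sort: OK (" << data.size() << " elements)\n";
}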
Implementation Guide
Phase 1: Basic Sequential and Parallel Sorting (Days 1-2)
Goal: Get all sorting algorithms working correctly.
Steps:
- Set up a C++17 project with parallel STL support
- Implement parallel_merge_sort with configurable THRESHOLD
- Create wrapper functions for std::sort with each execution policy
- Verify correctness: std::is_sorted() after each sort
- Test with small arrays first (1,000 elements)
Compiler Setup:
# GCC (libstdc++ parallel algorithms dispatch to TBB)
g++ -std=c++17 -O3 -march=native main.cpp -o sort_benchmark -ltbb
# Clang with libstdc++ (same TBB requirement)
clang++ -std=c++17 -O3 -march=native main.cpp -o sort_benchmark -ltbb
# MSVC (parallel algorithms are built in)
cl /std:c++17 /O2 /EHsc main.cpp
Validation: All algorithms produce sorted output.
Phase 2: Data Generators (Day 3)
Goal: Create all input distributions.
Steps:
- Implement generate_data<T>() for each distribution type
- Use std::mt19937_64 (seed it with a fixed value if you need reproducible runs)
- Template on element type for flexibility
- Test that each distribution produces the expected pattern
Validation:
- Random: Elements uniformly distributed
- Sorted: std::is_sorted() returns true
- Reverse: Sorted in descending order
- Nearly sorted: Mostly ascending with some inversions
- Many duplicates: Only 10 unique values present
Phase 3: Benchmark Harness (Days 4-5)
Goal: Accurate timing with statistical analysis.
Critical Benchmarking Rules:
// Requires <chrono>, <vector>, <algorithm> and (optionally) <benchmark/benchmark.h>
// 1. High-resolution timer
using Clock = std::chrono::high_resolution_clock;
// 2. Warmup runs (populate caches, let CPU frequency scaling settle)
for (size_t i = 0; i < warmup_iterations; ++i) {
    auto data_copy = generate_data<int64_t>(size, dist); // Fresh data!
    sort_algorithm(data_copy.begin(), data_copy.end());
    benchmark::DoNotOptimize(data_copy.data()); // Prevent DCE
}
// 3. Measured runs with fresh data each time
std::vector<double> timings;
for (size_t i = 0; i < iterations; ++i) {
auto data_copy = generate_data<int64_t>(size, dist);
auto start = Clock::now();
sort_algorithm(data_copy.begin(), data_copy.end());
auto end = Clock::now();
benchmark::DoNotOptimize(data_copy.data());
auto duration = std::chrono::duration<double, std::milli>(end - start);
timings.push_back(duration.count());
}
// 4. Report median (robust to outliers)
std::sort(timings.begin(), timings.end());
double median = timings[timings.size() / 2];
Why fresh data each iteration?
- CPU learns patterns in data (branch prediction)
- Sorted data is already sorted on second run (trivially fast)
- Represents real-world scenario (sorting new data each time)
Phase 4: THRESHOLD Finder (Day 6)
Goal: Find optimal THRESHOLD for parallel_merge_sort.
Algorithm:
size_t find_optimal_threshold(size_t test_size = 10'000'000) {
std::vector<size_t> candidates = {
1000, 5000, 10000, 25000, 50000, 100000, 250000, 500000, 1000000
};
size_t best_threshold = candidates[0];
double best_time = std::numeric_limits<double>::max();
for (size_t threshold : candidates) {
// Run benchmark with this threshold
double median_time = benchmark_parallel_merge_sort(
test_size, Distribution::Random, threshold
);
if (median_time < best_time) {
best_time = median_time;
best_threshold = threshold;
}
std::cout << "THRESHOLD " << threshold
<< ": " << median_time << "ms\n";
}
return best_threshold;
}
Validation: Optimal THRESHOLD typically between 10K-100K.
Phase 5: Scaling Analysis (Day 7)
Goal: Measure speedup vs core count.
Implementation:
void analyze_scaling(size_t size) {
// Test with different thread counts
// Note: Controlling std::sort(par) threads requires TBB configuration
// (#include <tbb/global_control.h>); baseline_time is the 1-thread median measured first
std::vector<int> thread_counts = {1, 2, 4, 8, 12, 16, 24};
for (int threads : thread_counts) {
// Limit hardware concurrency (implementation-specific)
// For Intel TBB:
tbb::global_control gc(
tbb::global_control::max_allowed_parallelism,
threads
);
double time = benchmark_algorithm(
size, Distribution::Random, Algorithm::StdSortPar
);
double speedup = baseline_time / time;
double efficiency = speedup / threads * 100;
std::cout << threads << " threads: "
<< time << "ms, "
<< speedup << "x speedup, "
<< efficiency << "% efficiency\n";
}
}
Expected results:
- Linear scaling up to physical cores
- Diminishing returns with hyperthreading
- Memory bandwidth becomes bottleneck at high core counts
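For the custom parallel_merge_sort, the thread count is controlled indirectly through the depth parameter. A small helper like the hypothetical one below (not part of the reference code) maps a target thread count to a depth: ceil(log2(threads)) levels of splitting give at least that many leaf tasks.

// Hypothetical helper: smallest depth with 2^depth >= threads.
inline int depth_for_threads(int threads) {
    int depth = 0;
    while ((1 << depth) < threads) ++depth;
    return depth;
}

// Example: depth_for_threads(12) == 4, so pass depth = 4 to parallel_merge_sort
// when comparing its scaling against a 12-thread std::sort(par) run.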
Testing Strategy
Correctness Tests
void test_correctness() {
// 1. All algorithms produce sorted output
for (auto algo : all_algorithms) {
auto data = generate_data<int64_t>(10000, Distribution::Random);
sort_with_algorithm(data, algo);
assert(std::is_sorted(data.begin(), data.end()));
}
// 2. All algorithms produce same result
for (auto dist : all_distributions) {
auto baseline = generate_data<int64_t>(10000, dist);
auto expected = baseline;
std::sort(expected.begin(), expected.end());
for (auto algo : all_algorithms) {
auto data = baseline; // Same input
sort_with_algorithm(data, algo);
assert(data == expected);
}
}
// 3. Edge cases
std::vector<int64_t> empty{};
std::vector<int64_t> single{42};
std::vector<int64_t> two{5, 3};
for (auto algo : all_algorithms) {
// Should not crash, should produce sorted output
sort_with_algorithm(empty, algo);
sort_with_algorithm(single, algo);
sort_with_algorithm(two, algo);
}
std::cout << "All correctness tests passed!\n";
}
Performance Sanity Checks
void test_performance_sanity() {
constexpr size_t large_size = 10'000'000;
// 1. Parallel should be faster than sequential for large data
double seq_time = benchmark(large_size, Distribution::Random, Algorithm::StdSort);
double par_time = benchmark(large_size, Distribution::Random, Algorithm::StdSortPar);
assert(par_time < seq_time);
std::cout << "Parallel is " << (seq_time / par_time) << "x faster for "
<< large_size << " elements\n";
// 2. Sorted data should be faster than random
double sorted_time = benchmark(large_size, Distribution::Sorted, Algorithm::StdSort);
double random_time = benchmark(large_size, Distribution::Random, Algorithm::StdSort);
assert(sorted_time < random_time);
// 3. Custom implementation should be within 2x of std::sort(par)
double custom_time = benchmark(large_size, Distribution::Random,
Algorithm::ParallelMergeSort);
assert(custom_time < par_time * 2);
std::cout << "All performance sanity checks passed!\n";
}
Benchmark Stability Test
void test_benchmark_stability() {
// Run same benchmark 10 times, check variance
std::vector<double> runs;
for (int i = 0; i < 10; ++i) {
double time = benchmark(1'000'000, Distribution::Random,
Algorithm::StdSortPar);
runs.push_back(time);
}
double mean = calculate_mean(runs);
double stddev = calculate_stddev(runs);
double cv = stddev / mean; // Coefficient of variation
std::cout << "CV = " << (cv * 100) << "%\n";
assert(cv < 0.10); // Less than 10% variation
std::cout << "Benchmark is stable\n";
}
Common Pitfalls and Debugging
Pitfall 1: Parallel Is Slower Than Sequential
Symptom: std::sort(par) takes longer than std::sort even for large arrays.
Diagnosis:
- Is parallel STL actually enabled?
- Is the data size large enough?
- Are you accidentally sorting already-sorted data?
Solutions:
// 1. Check if TBB is linked
#if __has_include(<tbb/parallel_sort.h>)
std::cout << "TBB available\n";
#endif
// 2. Check minimum size
if (size < 50000) {
std::cout << "Warning: Size may be too small for parallel benefit\n";
}
// 3. Generate fresh random data for each iteration
auto data = generate_data<int64_t>(size, Distribution::Random); // Fresh each time!
Pitfall 2: Inconsistent Benchmark Results
Symptom: Same benchmark gives wildly different results each run.
Causes and Solutions:
// Problem 1: Not enough iterations
// Solution: Run more iterations, report median
// Problem 2: No warmup
// Solution: Add warmup runs before measuring
for (int i = 0; i < 5; ++i) {
// Warmup: don't record these
auto data = generate_data<int64_t>(size, dist);
std::sort(data.begin(), data.end());
}
// Now start measuring
// Problem 3: Background processes
// Solution: Use median (robust to outliers), run multiple times
// Problem 4: Turbo boost inconsistency
// Solution: Either disable turbo boost or warmup CPU first
volatile long long sum = 0;
for (int i = 0; i < 1000000000; ++i) sum += i; // CPU warmup (lets clocks ramp up)
Pitfall 3: Dead Code Elimination
Symptom: Benchmark shows impossibly fast times (nanoseconds for millions of elements).
Cause: Compiler optimized away the sort because result wasn’t used.
Solution:
// BAD: Compiler may eliminate the sort
auto data = generate_data(size, Distribution::Random);
auto start = Clock::now();
std::sort(data.begin(), data.end());
auto end = Clock::now();
// data never used after this...
// GOOD: Prevent optimization
#include <benchmark/benchmark.h> // Google Benchmark
auto data = generate_data(size, Distribution::Random);
auto start = Clock::now();
std::sort(data.begin(), data.end());
auto end = Clock::now();
benchmark::DoNotOptimize(data.data()); // Tell compiler data is used
benchmark::ClobberMemory(); // Memory barrier
// Or manually use the result
volatile int64_t first = data[0];
volatile int64_t last = data[size-1];
Pitfall 4: Sorting Same Data Multiple Times
Symptom: After first iteration, sort is instantaneous.
Cause: Sorting already-sorted data (adaptive algorithms detect this).
Solution:
// BAD: data stays sorted after first iteration
auto data = generate_data(size, Distribution::Random);
for (int i = 0; i < iterations; ++i) {
auto start = Clock::now();
std::sort(data.begin(), data.end()); // Already sorted after first run!
auto end = Clock::now();
}
// GOOD: Fresh data each iteration
for (int i = 0; i < iterations; ++i) {
auto data = generate_data(size, Distribution::Random); // Fresh!
auto start = Clock::now();
std::sort(data.begin(), data.end());
auto end = Clock::now();
}
// ALTERNATIVE: Use a copy
auto original = generate_data(size, Distribution::Random);
for (int i = 0; i < iterations; ++i) {
auto data = original; // Copy before sorting
// ...
}
Pitfall 5: THRESHOLD Too Low or High
Symptom: Custom parallel_merge_sort is much slower than std::sort(par).
Diagnosis:
THRESHOLD too low (e.g., 1,000):
- Spawns far more tasks than there are cores
- Large overhead from std::async task creation
- Each task does almost no work
THRESHOLD too high (e.g., 10M):
- Barely any parallelism
- Essentially running a sequential sort
- Not utilizing multiple cores
Solution: Run the THRESHOLD finder:
// Binary search for optimal THRESHOLD (assumes a single valley in the timing curve;
// average several runs per measurement so noise does not mislead the search)
size_t low = 1000, high = 1000000;
while (high - low > 1000) {
size_t mid = (low + high) / 2;
double time_mid = benchmark_with_threshold(mid);
double time_higher = benchmark_with_threshold(mid + 1000);
if (time_mid < time_higher) {
high = mid;
} else {
low = mid;
}
}
std::cout << "Optimal THRESHOLD: " << (low + high) / 2 << "\n";
Extensions and Challenges
Extension 1: GPU-Accelerated Sorting
Compare CPU parallel sort with GPU sorting:
- Use CUDA/Thrust for NVIDIA GPUs
- Measure transfer overhead
- Find break-even point where GPU wins
#include <thrust/sort.h>
#include <thrust/device_vector.h>
thrust::device_vector<int64_t> d_data = h_data; // h_data: host-side std::vector<int64_t>
thrust::sort(d_data.begin(), d_data.end());
Extension 2: Custom Comparison Functions
Benchmark with expensive comparators:
// Simple comparison (baseline)
std::sort(data.begin(), data.end());
// String comparison (expensive)
std::vector<std::string> strings = generate_strings(size);
std::sort(strings.begin(), strings.end());
// Custom object with multiple fields
struct Record { int64_t id; std::string name; double value; };
std::sort(records.begin(), records.end(),
[](const Record& a, const Record& b) {
return std::tie(a.name, a.id) < std::tie(b.name, b.id);
});
Extension 3: External Sorting
When data doesn’t fit in memory:
- Implement chunked merge sort
- Sort chunks in parallel
- Merge chunks from disk
- Benchmark I/O vs CPU time
Extension 4: Radix Sort Comparison
For integer data, compare with non-comparison sorts:
void radix_sort(std::vector<uint64_t>& data);
// Expected: Radix sort is O(n*k) vs O(n log n)
// Should be faster for large integer arrays
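For reference, a compact LSD (byte-at-a-time) radix sort sketch for uint64_t follows. It is one straightforward, untuned variant and uses a temporary buffer of the same size as the input.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort over 8-bit digits: 8 stable counting-sort passes for 64-bit keys.
void radix_sort(std::vector<uint64_t>& data) {
    std::vector<uint64_t> buffer(data.size());
    for (int pass = 0; pass < 8; ++pass) {
        const int shift = pass * 8;
        std::array<std::size_t, 257> count{};            // count[d + 1] counts digit d
        for (uint64_t v : data)
            ++count[((v >> shift) & 0xFF) + 1];
        for (int d = 0; d < 256; ++d)                     // Prefix sums -> start offsets
            count[d + 1] += count[d];
        for (uint64_t v : data)                           // Stable scatter into the buffer
            buffer[count[(v >> shift) & 0xFF]++] = v;
        data.swap(buffer);                                // Even number of passes: result ends in data
    }
}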
Challenge: Parallel Merge
The merge phase is currently sequential. Implement parallel merge:
// Standard sequential merge:
std::inplace_merge(first, mid, last);
// Your challenge: parallel merge
// Hint: Divide output range, use binary search to find split points
template<typename RandomIt>
void parallel_merge(RandomIt first, RandomIt mid, RandomIt last) {
// 1. Find median of larger half
// 2. Binary search for its position in smaller half
// 3. These define split points
// 4. Recursively merge the two sub-problems in parallel
}
This can improve merge sort speedup from ~6x to ~10x on many-core systems.
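A common way to realize the hint is to merge into a scratch buffer rather than strictly in place: split on the median of the larger run, binary-search its position in the smaller run, and recurse on the two independent sub-merges. The sketch below follows that scheme under stated assumptions (the cutoff value and names are illustrative, and unlike the in-place signature above it writes into an output buffer that the caller copies back).

#include <algorithm>
#include <future>
#include <iterator>
#include <utility>

// Parallel merge of sorted ranges [first1, last1) and [first2, last2) into `out`.
// `out` must point to a scratch buffer with room for both ranges.
template<typename It, typename OutIt>
void parallel_merge_into(It first1, It last1, It first2, It last2, OutIt out, int depth) {
    auto n1 = std::distance(first1, last1);
    auto n2 = std::distance(first2, last2);
    if (n1 < n2) {                                   // Ensure the first range is the larger one
        std::swap(first1, first2);
        std::swap(last1, last2);
        std::swap(n1, n2);
    }
    if (n1 == 0) return;                             // Both ranges empty
    if (depth == 0 || n1 + n2 < 10'000) {            // Small input or depth exhausted: sequential merge
        std::merge(first1, last1, first2, last2, out);
        return;
    }
    auto mid1 = first1 + n1 / 2;                     // Median of the larger range
    auto mid2 = std::lower_bound(first2, last2, *mid1);
    auto out_mid = out + (mid1 - first1) + (mid2 - first2);
    auto left = std::async(std::launch::async, [=] {
        parallel_merge_into(first1, mid1, first2, mid2, out, depth - 1);
    });
    parallel_merge_into(mid1, last1, mid2, last2, out_mid, depth - 1);
    left.get();
}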
Resources
Essential Reading
| Resource | Topics | Chapters |
|---|---|---|
| “C++ Concurrency in Action” - Williams | Parallel algorithms, std::async | Chapter 10.3 |
| “Algorithms, Fourth Edition” - Sedgewick | Merge sort, analysis | Chapter 2.2 |
| Intel TBB Documentation | parallel_sort implementation | Reference |
Reference Documentation
Tools
- Google Benchmark: Robust microbenchmarking framework
- perf stat: CPU performance counters
- Intel VTune: Detailed parallel analysis
- Compiler Explorer: See generated assembly for different policies
Papers
- “Is Parallel Programming Hard?” - Paul McKenney
- “Parallel Merge Sort” - Richard Cole, 1988 (O(log n)-time parallel merge sort)
Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- You can explain why parallel sorting has overhead for small arrays
- You can calculate theoretical maximum speedup using Amdahl’s Law
- You understand why THRESHOLD exists and how to find optimal values
- You can explain the difference between par and par_unseq
- You understand why fresh data is needed for each benchmark iteration
Implementation
- All sorting algorithms produce correct, sorted output
- parallel_merge_sort uses std::async correctly with depth limiting
- Benchmark harness includes warmup and multiple iterations
- Statistical measures (median, stddev) are calculated correctly
- All five input distributions are implemented and tested
Analysis
- You found the break-even point for your hardware
- You measured speedup at different core counts
- You identified which distribution benefits most from parallelism
- You compared your parallel_merge_sort to std::sort(par)
- You can explain why speedup is less than the number of cores
Benchmarking Practices
- Fresh data generated for each iteration
- Dead code elimination prevented
- Warmup runs executed before measurement
- Results are reproducible across multiple runs
- Variance is reported alongside median/mean
Submission/Completion Criteria
Your project is complete when you can:
- Demonstrate correct sorting for all algorithms and distributions
- Show benchmark results with statistical measures
- Identify the break-even point where parallel becomes faster
- Explain your THRESHOLD choice with experimental data
- Analyze scaling behavior (speedup vs cores)
- Answer these questions:
- At what size does parallel sort beat sequential on your hardware?
- What speedup do you achieve on your largest test case?
- Which distribution shows the biggest parallel advantage? Why?
- Why doesn’t speedup equal number of cores?
- What limits further speedup (Amdahl’s Law, memory bandwidth, etc.)?
Deliverables
- Source code with all algorithms and benchmarks
- Benchmark report (text or markdown) with your findings
- Brief write-up answering the five questions above
- Scaling chart (optional but recommended)
The Interview Questions They’ll Ask
Having completed this project, you should be prepared for:
- “Explain how you would parallelize a sorting algorithm.”
- Describe divide-and-conquer with parallel halves
- Mention THRESHOLD to avoid over-parallelization
- Discuss that merge is sequential (bottleneck)
- “What’s Amdahl’s Law and how does it apply to parallel sorting?”
- Speedup limited by sequential portion
- Merge phase is inherently sequential (limits speedup)
- Memory bandwidth is another limiter
- “When would you NOT use parallel sorting?”
- Small data sets (< 50K elements typically)
- Already sorted or nearly sorted data
- Memory-constrained environments
- When latency matters more than throughput
- “How do you benchmark correctly?”
- Warmup runs
- Fresh data each iteration
- Prevent dead code elimination
- Report median and variance
- Multiple iterations
- “What’s the difference between par and par_unseq?”
- par: Parallel across threads
- par_unseq: Parallel + vectorized (SIMD)
- par_unseq may be faster for simple comparisons
- par_unseq restricts what you can do in the comparison function