Project 8: Parallel Sort Benchmark Suite

Quick Reference

Attribute Details
Difficulty Intermediate
Time Estimate 1 week
Primary Language C++
Alternative Languages Rust, Java, Go
Knowledge Area Parallel Algorithms / Performance Analysis
Tools Required C++17 compiler with parallel STL, Google Benchmark (optional)
Primary Reference “C++ Concurrency in Action, Second Edition” by Anthony Williams, Chapter 10.3

Learning Objectives

By completing this project, you will be able to:

  1. Implement parallel merge sort using std::async with proper task decomposition
  2. Understand the THRESHOLD concept for deciding when to switch from parallel to sequential
  3. Compare execution policies (seq, par, par_unseq) and explain when each excels
  4. Design statistically sound benchmarks with warmup, multiple runs, and proper measurement
  5. Analyze how input distributions affect sorting performance (random, sorted, reverse, nearly sorted, duplicates)
  6. Predict the break-even point where parallel sorting becomes faster than sequential
  7. Explain Amdahl’s Law in the context of sorting algorithms

Theoretical Foundation

Why Sorting Is the Canonical Parallel Algorithm

Sorting is ideal for teaching parallelism because:

  1. Clear decomposition: Divide-and-conquer naturally splits work
  2. Well-understood sequential algorithms: Baseline is known
  3. Measurable speedup: Easy to compare parallel vs sequential
  4. Real-world importance: Sorting is everywhere (databases, search, data processing)

Core Concepts

Divide-and-Conquer Parallelism

Merge sort naturally parallelizes:

                    Unsorted Array [8 elements]
                           |
            +--------------+--------------+
            |                             |
      [4 elements]                  [4 elements]
      Sort in parallel              Sort in parallel
            |                             |
            +-------------+---------------+
                          |
                    Merge results

The key insight: sorting the halves are independent operations. They can run on different cores simultaneously.

The Parallel Merge Sort Algorithm

parallel_merge_sort(array, depth):
    if (array.size < THRESHOLD or depth == 0):
        return std::sort(array)          // Sequential fallback

    mid = array.size / 2

    // Sort halves in parallel
    left_future = std::async(parallel_merge_sort, left_half, depth-1)
    parallel_merge_sort(right_half, depth-1)  // Current thread does one half

    left_future.wait()  // Wait for parallel half

    std::inplace_merge(left_half, mid, right_half)

Why depth limit? Without it, we’d spawn O(n) tasks for n elements. With depth = log2(num_cores), we spawn just enough tasks to utilize all cores.

Why THRESHOLD? For small arrays, the overhead of spawning async tasks exceeds the benefit. The THRESHOLD prevents this.

Execution Policies in C++17

C++17 introduced three execution policies for standard algorithms:

Policy Meaning When to Use
std::execution::seq Sequential (default) Small data, predictable timing
std::execution::par Parallel Large data, compute-bound
std::execution::par_unseq Parallel + Vectorized Large data, simple comparisons

Example usage:

std::sort(std::execution::par, data.begin(), data.end());

The THRESHOLD Decision

Below a certain size, parallelism hurts performance:

Time vs Size:

Performance  │     * Parallel
             │    *
             │   *    Sequential
             │  * ...............
             │ *   Break-even
             │*    point
             └────────────────────────
                        Size

For size < THRESHOLD: Sequential wins (less overhead)
For size > THRESHOLD: Parallel wins (more cores utilized)

Typical THRESHOLD values: 10,000 - 100,000 elements

The optimal THRESHOLD depends on:

  • Element size and comparison cost
  • Number of cores
  • Memory bandwidth
  • Cache sizes

Historical Context

1945: John von Neumann invented merge sort for EDVAC computer 1964: Ken Batcher’s bitonic sort - first practical parallel sorting 1970s: Parallel merge algorithms developed for PRAM models 2017: C++17 brings parallel algorithms to the standard library Today: GPU-accelerated sorting achieves billions of elements/second

Common Misconceptions

Misconception 1: “Parallel sort is always faster” Reality: For small arrays (< 10K elements), the overhead of thread management makes parallel slower than sequential.

Misconception 2: “Speedup equals number of cores” Reality: Memory bandwidth, synchronization overhead, and Amdahl’s Law limit speedup. 8 cores typically yield 5-7x speedup.

Misconception 3: “par_unseq is always better than par” Reality: par_unseq helps when comparisons are cheap (integers). For complex objects or expensive comparisons, the difference is minimal.

Misconception 4: “All input distributions benefit equally from parallelism” Reality: Nearly-sorted data is fast even sequentially (adaptive algorithms). Random data shows the biggest parallel speedup.


Project Specification

What You’re Building

A comprehensive benchmark suite called sort_benchmark that:

  1. Compares multiple sorting approaches:
    • std::sort (sequential)
    • std::sort(std::execution::par) (parallel)
    • std::sort(std::execution::par_unseq) (parallel + vectorized)
    • Custom parallel_merge_sort implementation
  2. Tests multiple input distributions:
    • Random
    • Already sorted
    • Reverse sorted
    • Nearly sorted (90% sorted, 10% random swaps)
    • Many duplicates (only 10 unique values)
  3. Analyzes across sizes: 1K, 10K, 100K, 1M, 10M, 100M elements

  4. Reports statistical measures: median, mean, standard deviation, min, max

Functional Requirements

sort_benchmark --size <n> --distribution <name> --iterations <n>
sort_benchmark --full                     # Run complete benchmark suite
sort_benchmark --find-threshold           # Find optimal THRESHOLD value
sort_benchmark --compare-cores            # Test with different thread counts
sort_benchmark --report --output <file>   # Generate markdown report

Example Output

Parallel Sort Benchmark Suite
═══════════════════════════════════════════════════════════════════════════════
System Configuration:
  CPU: AMD Ryzen 9 5900X (12 cores, 24 threads)
  RAM: 64 GB DDR4-3200
  Compiler: GCC 13.1 with -O3 -march=native
  C++ Standard: C++17
  Data Type: int64_t (8 bytes per element)

═══════════════════════════════════════════════════════════════════════════════

=== Size: 1,000 elements ===
Distribution: Random
Iterations: 100 (after 10 warmup runs)

Algorithm              Median     Mean       Std Dev    Min        Max
─────────────────────────────────────────────────────────────────────────
std::sort              0.018ms    0.019ms    0.002ms    0.017ms    0.025ms
std::sort(par)         0.142ms    0.148ms    0.018ms    0.128ms    0.195ms
std::sort(par_unseq)   0.138ms    0.145ms    0.016ms    0.125ms    0.188ms
parallel_merge_sort    0.156ms    0.162ms    0.021ms    0.142ms    0.212ms

Winner: std::sort (7.9x faster than parallel)
Analysis: Parallel overhead dominates at this size

=== Size: 100,000 elements ===
Distribution: Random
Iterations: 50 (after 5 warmup runs)

Algorithm              Median     Mean       Std Dev    Min        Max
─────────────────────────────────────────────────────────────────────────
std::sort              6.82ms     6.91ms     0.15ms     6.65ms     7.32ms
std::sort(par)         1.18ms     1.22ms     0.08ms     1.12ms     1.45ms
std::sort(par_unseq)   1.14ms     1.18ms     0.07ms     1.08ms     1.38ms
parallel_merge_sort    1.32ms     1.38ms     0.09ms     1.25ms     1.62ms

Winner: std::sort(par_unseq), 6.0x speedup over sequential
Analysis: Parallel algorithms now significantly faster

=== Size: 10,000,000 elements ===

Distribution Analysis (std::sort(par)):
─────────────────────────────────────────────────────────────────────────
Distribution      Time         Speedup vs Random   Notes
random            145ms        1.0x baseline       Typical case
sorted            18ms         8.1x faster         Adaptive algorithm detects order
reverse           19ms         7.6x faster         Easy to detect, flip
nearly_sorted     28ms         5.2x faster         Few swaps needed
many_duplicates   98ms         1.5x faster         Less comparison branching

=== THRESHOLD Analysis ===
Finding optimal parallel/sequential cutoff for parallel_merge_sort...

THRESHOLD    Time @ 10M elements   Efficiency
─────────────────────────────────────────────────────────────────────────
1,000        892ms                 Poor (too much task overhead)
5,000        312ms                 Moderate
10,000       156ms                 Good
25,000       148ms                 Excellent ✓ (optimal)
50,000       152ms                 Good
100,000      168ms                 Moderate (underutilizing cores)
1,000,000    285ms                 Poor (too sequential)

Recommended THRESHOLD: 25,000 elements

=== Scaling Analysis (100M elements, random) ===
─────────────────────────────────────────────────────────────────────────
Threads   Time        Speedup     Efficiency
1         8.45s       1.0x        100%
2         4.38s       1.93x       96.5%
4         2.31s       3.66x       91.5%
8         1.28s       6.60x       82.5%
12        0.98s       8.62x       71.8%
24        0.92s       9.18x       38.3%

Analysis:
- Near-linear scaling up to physical core count (12)
- Hyperthreading provides minimal benefit for sorting
- Theoretical max speedup: 12x, achieved: 8.62x (71.8% efficiency)
- Bottleneck: Memory bandwidth during merge phase

=== Key Findings ===
1. Break-even point: ~50,000 elements (below this, sequential is faster)
2. Optimal speedup: 6-9x on 12 cores
3. Best distribution for parallelism: Random (most work to do)
4. par_unseq vs par: 3-5% improvement on integers
5. Custom parallel_merge_sort: Within 15% of std::sort(par)

═══════════════════════════════════════════════════════════════════════════════

Benchmark Correctness Requirements

Your benchmarks must be statistically valid:

  1. Warmup runs: Execute 3-10 iterations before measuring (populate caches, JIT effects)
  2. Fresh data each run: Generate new random data to avoid CPU pattern learning
  3. Prevent dead code elimination: Actually use the sorted result
  4. Multiple iterations: At least 10-50 runs for statistical significance
  5. Report variance: Std dev reveals measurement stability
  6. Flush caches between algorithms: Fair comparison

Solution Architecture

Component Design

┌─────────────────────────────────────────────────────────────────────────┐
│                           CLI Interface                                   │
│   --size, --distribution, --iterations, --full, --report                 │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          │                          │                          │
          ▼                          ▼                          ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Data Generator  │    │  Algorithm Suite  │    │ Benchmark Engine │
│                   │    │                   │    │                  │
│ - Random          │    │ - std::sort       │    │ - Timer          │
│ - Sorted          │    │ - std::sort(par)  │    │ - Statistics     │
│ - Reverse         │    │ - par_unseq       │    │ - Warmup         │
│ - Nearly sorted   │    │ - Parallel merge  │    │ - Iterations     │
│ - Many duplicates │    └────────┬──────────┘    └────────┬─────────┘
└────────┬──────────┘             │                        │
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │     Result Aggregator        │
                    │                              │
                    │ - Median/Mean/StdDev         │
                    │ - Speedup calculations       │
                    │ - Distribution comparisons   │
                    │ - Report generation          │
                    └─────────────────────────────┘

Parallel Merge Sort Visualization

parallel_merge_sort([8, 3, 1, 7, 4, 6, 5, 2], depth=2)

Level 0:                    [8, 3, 1, 7, 4, 6, 5, 2]
                                      |
                    +-----------------+-----------------+
                    |                                   |
Level 1:     std::async               Current thread
             [8, 3, 1, 7]             [4, 6, 5, 2]
                    |                       |
            +-------+-------+       +-------+-------+
            |               |       |               |
Level 2:  [8, 3]         [1, 7]   [4, 6]         [5, 2]
            |               |       |               |
          [3, 8]         [1, 7]   [4, 6]         [2, 5]
            |               |       |               |
            +-------+-------+       +-------+-------+
                    |                       |
                [1, 3, 7, 8]           [2, 4, 5, 6]
                    |                       |
                    +-----------+-----------+
                                |
                        [1, 2, 3, 4, 5, 6, 7, 8]

Key insight:
- Level 1: Two tasks run in parallel (one via async, one in current thread)
- Level 2: Sequential (depth exhausted, below THRESHOLD)
- Merge operations are sequential (harder to parallelize efficiently)

Key Data Structures

// Input distribution types
enum class Distribution {
    Random,
    Sorted,
    Reverse,
    NearlySorted,
    ManyDuplicates
};

// Algorithm variants to benchmark
enum class Algorithm {
    StdSort,           // std::sort (sequential)
    StdSortPar,        // std::sort(std::execution::par, ...)
    StdSortParUnseq,   // std::sort(std::execution::par_unseq, ...)
    ParallelMergeSort  // Custom implementation
};

// Single benchmark result
struct BenchmarkResult {
    Algorithm algorithm;
    Distribution distribution;
    size_t size;
    std::vector<double> timings_ms;  // All iterations

    double median() const;
    double mean() const;
    double stddev() const;
    double min() const;
    double max() const;
};

// Parallel merge sort configuration
struct ParallelMergeSortConfig {
    size_t threshold = 25000;    // Switch to sequential below this
    int max_depth = 8;           // Limit task spawning depth
};

// Complete benchmark configuration
struct BenchmarkConfig {
    std::vector<size_t> sizes = {1000, 10000, 100000, 1000000, 10000000};
    std::vector<Distribution> distributions = {
        Distribution::Random,
        Distribution::Sorted,
        Distribution::Reverse,
        Distribution::NearlySorted,
        Distribution::ManyDuplicates
    };
    std::vector<Algorithm> algorithms = {
        Algorithm::StdSort,
        Algorithm::StdSortPar,
        Algorithm::StdSortParUnseq,
        Algorithm::ParallelMergeSort
    };
    size_t warmup_iterations = 5;
    size_t measured_iterations = 20;
    ParallelMergeSortConfig parallel_config;
};

Core Algorithm Implementation

// Template for parallel merge sort with configurability
template<typename RandomIt, typename Compare = std::less<>>
void parallel_merge_sort(RandomIt first, RandomIt last,
                         Compare comp = Compare{},
                         size_t threshold = 25000,
                         int depth = 8) {
    const auto size = std::distance(first, last);

    // Base case: use sequential sort for small arrays or depth exhausted
    if (size < static_cast<ptrdiff_t>(threshold) || depth == 0) {
        std::sort(first, last, comp);
        return;
    }

    // Find midpoint
    auto mid = first + size / 2;

    // Sort first half asynchronously
    auto left_future = std::async(std::launch::async,
        [=, &comp]() {
            parallel_merge_sort(first, mid, comp, threshold, depth - 1);
        }
    );

    // Sort second half in current thread
    parallel_merge_sort(mid, last, comp, threshold, depth - 1);

    // Wait for left half to complete
    left_future.get();

    // Merge the sorted halves
    std::inplace_merge(first, mid, last, comp);
}

// Data generation functions
template<typename T>
std::vector<T> generate_data(size_t size, Distribution dist) {
    std::vector<T> data(size);
    std::random_device rd;
    std::mt19937_64 gen(rd());

    switch (dist) {
        case Distribution::Random: {
            std::uniform_int_distribution<T> dist(
                std::numeric_limits<T>::min(),
                std::numeric_limits<T>::max()
            );
            std::generate(data.begin(), data.end(),
                         [&]() { return dist(gen); });
            break;
        }
        case Distribution::Sorted: {
            std::iota(data.begin(), data.end(), 0);
            break;
        }
        case Distribution::Reverse: {
            std::iota(data.rbegin(), data.rend(), 0);
            break;
        }
        case Distribution::NearlySorted: {
            std::iota(data.begin(), data.end(), 0);
            // Swap 10% of elements randomly
            std::uniform_int_distribution<size_t> idx_dist(0, size - 1);
            for (size_t i = 0; i < size / 10; ++i) {
                std::swap(data[idx_dist(gen)], data[idx_dist(gen)]);
            }
            break;
        }
        case Distribution::ManyDuplicates: {
            std::uniform_int_distribution<T> dist(0, 9);  // Only 10 unique values
            std::generate(data.begin(), data.end(),
                         [&]() { return dist(gen); });
            break;
        }
    }
    return data;
}

Implementation Guide

Phase 1: Basic Sequential and Parallel Sorting (Days 1-2)

Goal: Get all sorting algorithms working correctly.

Steps:

  1. Set up a C++17 project with parallel STL support
  2. Implement parallel_merge_sort with configurable THRESHOLD
  3. Create wrapper functions for std::sort with each execution policy
  4. Verify correctness: std::is_sorted() after each sort
  5. Test with small arrays first (1000 elements)

Compiler Setup:

# GCC
g++ -std=c++17 -O3 -march=native -ltbb main.cpp -o sort_benchmark

# Clang (with libc++)
clang++ -std=c++17 -O3 -march=native -ltbb main.cpp -o sort_benchmark

# MSVC
cl /std:c++17 /O2 /EHsc main.cpp

Validation: All algorithms produce sorted output.

Phase 2: Data Generators (Day 3)

Goal: Create all input distributions.

Steps:

  1. Implement generate_data<T>() for each distribution type
  2. Use std::mt19937_64 for reproducible randomness
  3. Template on element type for flexibility
  4. Test each distribution produces expected patterns

Validation:

  • Random: Elements uniformly distributed
  • Sorted: std::is_sorted() returns true
  • Reverse: Sorted in descending order
  • Nearly sorted: Mostly ascending with some inversions
  • Many duplicates: Only 10 unique values present

Phase 3: Benchmark Harness (Days 4-5)

Goal: Accurate timing with statistical analysis.

Critical Benchmarking Rules:

// 1. High-resolution timer
using Clock = std::chrono::high_resolution_clock;

// 2. Warmup runs (cache population, JIT)
for (size_t i = 0; i < warmup_iterations; ++i) {
    auto data_copy = generate_data(size, dist);  // Fresh data!
    sort_algorithm(data_copy.begin(), data_copy.end());
    benchmark::DoNotOptimize(data_copy.data());  // Prevent DCE
}

// 3. Measured runs with fresh data each time
std::vector<double> timings;
for (size_t i = 0; i < iterations; ++i) {
    auto data_copy = generate_data(size, dist);

    auto start = Clock::now();
    sort_algorithm(data_copy.begin(), data_copy.end());
    auto end = Clock::now();

    benchmark::DoNotOptimize(data_copy.data());

    auto duration = std::chrono::duration<double, std::milli>(end - start);
    timings.push_back(duration.count());
}

// 4. Report median (robust to outliers)
std::sort(timings.begin(), timings.end());
double median = timings[timings.size() / 2];

Why fresh data each iteration?

  • CPU learns patterns in data (branch prediction)
  • Sorted data is already sorted on second run (trivially fast)
  • Represents real-world scenario (sorting new data each time)

Phase 4: THRESHOLD Finder (Day 6)

Goal: Find optimal THRESHOLD for parallel_merge_sort.

Algorithm:

size_t find_optimal_threshold(size_t test_size = 10'000'000) {
    std::vector<size_t> candidates = {
        1000, 5000, 10000, 25000, 50000, 100000, 250000, 500000, 1000000
    };

    size_t best_threshold = candidates[0];
    double best_time = std::numeric_limits<double>::max();

    for (size_t threshold : candidates) {
        // Run benchmark with this threshold
        double median_time = benchmark_parallel_merge_sort(
            test_size, Distribution::Random, threshold
        );

        if (median_time < best_time) {
            best_time = median_time;
            best_threshold = threshold;
        }

        std::cout << "THRESHOLD " << threshold
                  << ": " << median_time << "ms\n";
    }

    return best_threshold;
}

Validation: Optimal THRESHOLD typically between 10K-100K.

Phase 5: Scaling Analysis (Day 7)

Goal: Measure speedup vs core count.

Implementation:

void analyze_scaling(size_t size) {
    // Test with different thread counts
    // Note: Controlling std::sort(par) threads requires TBB configuration

    std::vector<int> thread_counts = {1, 2, 4, 8, 12, 16, 24};

    for (int threads : thread_counts) {
        // Limit hardware concurrency (implementation-specific)
        // For Intel TBB:
        tbb::global_control gc(
            tbb::global_control::max_allowed_parallelism,
            threads
        );

        double time = benchmark_algorithm(
            size, Distribution::Random, Algorithm::StdSortPar
        );

        double speedup = baseline_time / time;
        double efficiency = speedup / threads * 100;

        std::cout << threads << " threads: "
                  << time << "ms, "
                  << speedup << "x speedup, "
                  << efficiency << "% efficiency\n";
    }
}

Expected results:

  • Linear scaling up to physical cores
  • Diminishing returns with hyperthreading
  • Memory bandwidth becomes bottleneck at high core counts

Testing Strategy

Correctness Tests

void test_correctness() {
    // 1. All algorithms produce sorted output
    for (auto algo : all_algorithms) {
        auto data = generate_data<int64_t>(10000, Distribution::Random);
        sort_with_algorithm(data, algo);
        assert(std::is_sorted(data.begin(), data.end()));
    }

    // 2. All algorithms produce same result
    for (auto dist : all_distributions) {
        auto baseline = generate_data<int64_t>(10000, dist);
        auto expected = baseline;
        std::sort(expected.begin(), expected.end());

        for (auto algo : all_algorithms) {
            auto data = baseline;  // Same input
            sort_with_algorithm(data, algo);
            assert(data == expected);
        }
    }

    // 3. Edge cases
    std::vector<int64_t> empty{};
    std::vector<int64_t> single{42};
    std::vector<int64_t> two{5, 3};

    for (auto algo : all_algorithms) {
        // Should not crash, should produce sorted output
        sort_with_algorithm(empty, algo);
        sort_with_algorithm(single, algo);
        sort_with_algorithm(two, algo);
    }

    std::cout << "All correctness tests passed!\n";
}

Performance Sanity Checks

void test_performance_sanity() {
    constexpr size_t large_size = 10'000'000;

    // 1. Parallel should be faster than sequential for large data
    double seq_time = benchmark(large_size, Distribution::Random, Algorithm::StdSort);
    double par_time = benchmark(large_size, Distribution::Random, Algorithm::StdSortPar);
    assert(par_time < seq_time);
    std::cout << "Parallel is " << (seq_time / par_time) << "x faster for "
              << large_size << " elements\n";

    // 2. Sorted data should be faster than random
    double sorted_time = benchmark(large_size, Distribution::Sorted, Algorithm::StdSort);
    double random_time = benchmark(large_size, Distribution::Random, Algorithm::StdSort);
    assert(sorted_time < random_time);

    // 3. Custom implementation should be within 2x of std::sort(par)
    double custom_time = benchmark(large_size, Distribution::Random,
                                   Algorithm::ParallelMergeSort);
    assert(custom_time < par_time * 2);

    std::cout << "All performance sanity checks passed!\n";
}

Benchmark Stability Test

void test_benchmark_stability() {
    // Run same benchmark 10 times, check variance
    std::vector<double> runs;
    for (int i = 0; i < 10; ++i) {
        double time = benchmark(1'000'000, Distribution::Random,
                               Algorithm::StdSortPar);
        runs.push_back(time);
    }

    double mean = calculate_mean(runs);
    double stddev = calculate_stddev(runs);
    double cv = stddev / mean;  // Coefficient of variation

    std::cout << "CV = " << (cv * 100) << "%\n";
    assert(cv < 0.10);  // Less than 10% variation

    std::cout << "Benchmark is stable\n";
}

Common Pitfalls and Debugging

Pitfall 1: Parallel Is Slower Than Sequential

Symptom: std::sort(par) takes longer than std::sort even for large arrays.

Diagnosis:

  1. Is parallel STL actually enabled?
  2. Is the data size large enough?
  3. Are you accidentally sorting already-sorted data?

Solutions:

// 1. Check if TBB is linked
#if __has_include(<tbb/parallel_sort.h>)
std::cout << "TBB available\n";
#endif

// 2. Check minimum size
if (size < 50000) {
    std::cout << "Warning: Size may be too small for parallel benefit\n";
}

// 3. Generate fresh random data for each iteration
auto data = generate_data<int64_t>(size, Distribution::Random);  // Fresh each time!

Pitfall 2: Inconsistent Benchmark Results

Symptom: Same benchmark gives wildly different results each run.

Causes and Solutions:

// Problem 1: Not enough iterations
// Solution: Run more iterations, report median

// Problem 2: No warmup
// Solution: Add warmup runs before measuring
for (int i = 0; i < 5; ++i) {
    // Warmup: don't record these
    auto data = generate_data(size, dist);
    std::sort(data.begin(), data.end());
}
// Now start measuring

// Problem 3: Background processes
// Solution: Use median (robust to outliers), run multiple times

// Problem 4: Turbo boost inconsistency
// Solution: Either disable turbo boost or warmup CPU first
volatile int sum = 0;
for (int i = 0; i < 1000000000; ++i) sum += i;  // CPU warmup

Pitfall 3: Dead Code Elimination

Symptom: Benchmark shows impossibly fast times (nanoseconds for millions of elements).

Cause: Compiler optimized away the sort because result wasn’t used.

Solution:

// BAD: Compiler may eliminate the sort
auto data = generate_data(size, Distribution::Random);
auto start = Clock::now();
std::sort(data.begin(), data.end());
auto end = Clock::now();
// data never used after this...

// GOOD: Prevent optimization
#include <benchmark/benchmark.h>  // Google Benchmark

auto data = generate_data(size, Distribution::Random);
auto start = Clock::now();
std::sort(data.begin(), data.end());
auto end = Clock::now();

benchmark::DoNotOptimize(data.data());  // Tell compiler data is used
benchmark::ClobberMemory();              // Memory barrier

// Or manually use the result
volatile int64_t first = data[0];
volatile int64_t last = data[size-1];

Pitfall 4: Sorting Same Data Multiple Times

Symptom: After first iteration, sort is instantaneous.

Cause: Sorting already-sorted data (adaptive algorithms detect this).

Solution:

// BAD: data stays sorted after first iteration
auto data = generate_data(size, Distribution::Random);
for (int i = 0; i < iterations; ++i) {
    auto start = Clock::now();
    std::sort(data.begin(), data.end());  // Already sorted after first run!
    auto end = Clock::now();
}

// GOOD: Fresh data each iteration
for (int i = 0; i < iterations; ++i) {
    auto data = generate_data(size, Distribution::Random);  // Fresh!
    auto start = Clock::now();
    std::sort(data.begin(), data.end());
    auto end = Clock::now();
}

// ALTERNATIVE: Use a copy
auto original = generate_data(size, Distribution::Random);
for (int i = 0; i < iterations; ++i) {
    auto data = original;  // Copy before sorting
    // ...
}

Pitfall 5: THRESHOLD Too Low or High

Symptom: Custom parallel_merge_sort is much slower than std::sort(par).

Diagnosis:

THRESHOLD too low (1000):
- Spawning millions of tasks
- Huge overhead from std::async
- Each task does almost no work

THRESHOLD too high (10M):
- Barely any parallelism
- Essentially running sequential sort
- Not utilizing multiple cores

Solution: Run the THRESHOLD finder:

// Binary search for optimal THRESHOLD
size_t low = 1000, high = 1000000;
while (high - low > 1000) {
    size_t mid = (low + high) / 2;
    double time_mid = benchmark_with_threshold(mid);
    double time_higher = benchmark_with_threshold(mid + 1000);

    if (time_mid < time_higher) {
        high = mid;
    } else {
        low = mid;
    }
}
std::cout << "Optimal THRESHOLD: " << (low + high) / 2 << "\n";

Extensions and Challenges

Extension 1: GPU-Accelerated Sorting

Compare CPU parallel sort with GPU sorting:

  • Use CUDA/Thrust for NVIDIA GPUs
  • Measure transfer overhead
  • Find break-even point where GPU wins
#include <thrust/sort.h>
#include <thrust/device_vector.h>

thrust::device_vector<int64_t> d_data = h_data;
thrust::sort(d_data.begin(), d_data.end());

Extension 2: Custom Comparison Functions

Benchmark with expensive comparators:

// Simple comparison (baseline)
std::sort(data.begin(), data.end());

// String comparison (expensive)
std::vector<std::string> strings = generate_strings(size);
std::sort(strings.begin(), strings.end());

// Custom object with multiple fields
struct Record { int64_t id; std::string name; double value; };
std::sort(records.begin(), records.end(),
          [](const Record& a, const Record& b) {
              return std::tie(a.name, a.id) < std::tie(b.name, b.id);
          });

Extension 3: External Sorting

When data doesn’t fit in memory:

  • Implement chunked merge sort
  • Sort chunks in parallel
  • Merge chunks from disk
  • Benchmark I/O vs CPU time

Extension 4: Radix Sort Comparison

For integer data, compare with non-comparison sorts:

void radix_sort(std::vector<uint64_t>& data);

// Expected: Radix sort is O(n*k) vs O(n log n)
// Should be faster for large integer arrays

Challenge: Parallel Merge

The merge phase is currently sequential. Implement parallel merge:

// Standard sequential merge:
std::inplace_merge(first, mid, last);

// Your challenge: parallel merge
// Hint: Divide output range, use binary search to find split points
void parallel_merge(RandomIt first, RandomIt mid, RandomIt last) {
    // 1. Find median of larger half
    // 2. Binary search for its position in smaller half
    // 3. These define split points
    // 4. Recursively merge the two sub-problems in parallel
}

This can improve merge sort speedup from ~6x to ~10x on many-core systems.


Resources

Essential Reading

Resource Topics Chapters
“C++ Concurrency in Action” - Williams Parallel algorithms, std::async Chapter 10.3
“Algorithms, Fourth Edition” - Sedgewick Merge sort, analysis Chapter 2.2
Intel TBB Documentation parallel_sort implementation Reference

Reference Documentation

Tools

  • Google Benchmark: Robust microbenchmarking framework
  • perf stat: CPU performance counters
  • Intel VTune: Detailed parallel analysis
  • Compiler Explorer: See generated assembly for different policies

Papers

  • “Is Parallel Programming Hard?” - Paul McKenney
  • “Parallel Merge Sort” - Cole, 1988 (O(log n) parallel merge)

Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • You can explain why parallel sorting has overhead for small arrays
  • You can calculate theoretical maximum speedup using Amdahl’s Law
  • You understand why THRESHOLD exists and how to find optimal values
  • You can explain the difference between par and par_unseq
  • You understand why fresh data is needed for each benchmark iteration

Implementation

  • All sorting algorithms produce correct, sorted output
  • parallel_merge_sort uses std::async correctly with depth limiting
  • Benchmark harness includes warmup and multiple iterations
  • Statistical measures (median, stddev) are calculated correctly
  • All five input distributions are implemented and tested

Analysis

  • You found the break-even point for your hardware
  • You measured speedup at different core counts
  • You identified which distribution benefits most from parallelism
  • You compared your parallel_merge_sort to std::sort(par)
  • You can explain why speedup is less than number of cores

Benchmarking Practices

  • Fresh data generated for each iteration
  • Dead code elimination prevented
  • Warmup runs executed before measurement
  • Results are reproducible across multiple runs
  • Variance is reported alongside median/mean

Submission/Completion Criteria

Your project is complete when you can:

  1. Demonstrate correct sorting for all algorithms and distributions
  2. Show benchmark results with statistical measures
  3. Identify the break-even point where parallel becomes faster
  4. Explain your THRESHOLD choice with experimental data
  5. Analyze scaling behavior (speedup vs cores)
  6. Answer these questions:
    • At what size does parallel sort beat sequential on your hardware?
    • What speedup do you achieve on your largest test case?
    • Which distribution shows the biggest parallel advantage? Why?
    • Why doesn’t speedup equal number of cores?
    • What limits further speedup (Amdahl’s Law, memory bandwidth, etc.)?

Deliverables

  1. Source code with all algorithms and benchmarks
  2. Benchmark report (text or markdown) with your findings
  3. Brief write-up answering the five questions above
  4. Scaling chart (optional but recommended)

The Interview Questions They’ll Ask

Having completed this project, you should be prepared for:

  1. “Explain how you would parallelize a sorting algorithm.”
    • Describe divide-and-conquer with parallel halves
    • Mention THRESHOLD to avoid over-parallelization
    • Discuss that merge is sequential (bottleneck)
  2. “What’s Amdahl’s Law and how does it apply to parallel sorting?”
    • Speedup limited by sequential portion
    • Merge phase is inherently sequential (limits speedup)
    • Memory bandwidth is another limiter
  3. “When would you NOT use parallel sorting?”
    • Small data sets (< 50K elements typically)
    • Already sorted or nearly sorted data
    • Memory-constrained environments
    • When latency matters more than throughput
  4. “How do you benchmark correctly?”
    • Warmup runs
    • Fresh data each iteration
    • Prevent dead code elimination
    • Report median and variance
    • Multiple iterations
  5. “What’s the difference between par and par_unseq?”
    • par: Parallel across threads
    • par_unseq: Parallel + vectorized (SIMD)
    • par_unseq may be faster for simple comparisons
    • par_unseq restricts what you can do in the comparison function