P07: Parallel Image Processor
Build an image processing pipeline that uses C++17 parallel algorithms (std::execution::par) to apply filters, transformations, and effects across multiple cores with minimal custom threading code.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C++ (C++17 or later) |
| Prerequisites | Projects 1-2 completed, STL algorithms, basic image concepts |
| Primary Book | “C++ Concurrency in Action, 2nd Ed” by Anthony Williams, Ch. 10 |
| Knowledge Area | Parallel Algorithms / STL Execution Policies |
Learning Objectives
After completing this project, you will be able to:
- Use C++17 parallel algorithms - Apply std::execution::par and par_unseq to STL algorithms
- Understand execution policies - Know when to use seq, par, and par_unseq and their implications
- Recognize embarrassingly parallel problems - Identify operations that benefit from parallelization
- Apply Amdahl’s Law - Calculate theoretical maximum speedup and understand its limitations
- Measure parallel overhead - Determine when parallelism helps vs. hurts performance
- Handle neighborhood operations - Implement double-buffering for parallel convolution
- Process images efficiently - Design cache-friendly data structures for image processing
- Benchmark parallel code - Create fair, reproducible performance comparisons
The Core Question You’re Answering
“When does adding std::execution::par to an algorithm actually make it faster?”
The promise of C++17 parallel algorithms is seductive: add a single parameter and your code runs on multiple cores. But the reality is more nuanced:
- For large, computationally intensive operations, parallelism can provide near-linear speedup
- For small data sets or simple operations, the overhead of thread management exceeds the benefit
- For operations with data dependencies, parallelism may be impossible or require careful design
Before coding, internalize this: Parallelism is not free. Every parallel invocation incurs overhead for thread management, synchronization, and cache coherence. Your job is to understand when the work per element justifies that overhead.
Theoretical Foundation
Core Concepts
Execution Policies Explained
C++17 introduced execution policies to control how STL algorithms execute:
Execution Policy Hierarchy:
std::execution::seq Standard sequential execution
│ - Same as traditional algorithms
│ - No parallelism, no vectorization
│ - Fully deterministic order
│
std::execution::par Parallel execution
│ - Operations may execute on multiple threads
│ - Order is unspecified
│ - Must be safe for concurrent execution
│ - No vectorization guarantees
│
std::execution::par_unseq Parallel + vectorized execution
- Operations may interleave on single thread (SIMD)
- AND execute on multiple threads
- Most restrictive requirements:
* No mutex locks
* No memory allocation in callbacks
* No thread-local access
Critical Insight: par_unseq allows operations to be interleaved within a single thread via SIMD instructions. This means your callback cannot acquire locks (potential deadlock if same thread tries to acquire twice) or allocate memory (heap locks).
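The differences are easiest to see side by side. A minimal sketch applying the same transform under each policy (note: with GCC/libstdc++ the parallel policies require linking against TBB, e.g. `-ltbb`):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// The same element-wise operation under all three policies. The result
// is identical; what changes is how the library may schedule the calls
// and what your callback is allowed to do.
std::vector<int> square_all(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    auto sq = [](int x) { return x * x; };
    // seq: single thread, deterministic order
    std::transform(std::execution::seq, in.begin(), in.end(), out.begin(), sq);
    // par: may use multiple threads; the callback must be thread-safe
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(), sq);
    // par_unseq: may also vectorize; no locks or allocation in the callback
    std::transform(std::execution::par_unseq, in.begin(), in.end(), out.begin(), sq);
    return out;
}
```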
Embarrassingly Parallel Problems
Some problems decompose naturally into independent subproblems:
Embarrassingly Parallel Operations (ideal for std::execution::par):
Point Operations (each pixel independent):
┌────────────────────────────────────────────────────────┐
│ Input Pixel Operation Output Pixel │
│ ─────────── ───────── ──────────── │
│ pixel[0] ───► brightness ───► result[0] (thread 1)│
│ pixel[1] ───► brightness ───► result[1] (thread 2)│
│ pixel[2] ───► brightness ───► result[2] (thread 3)│
│ ... │
│ pixel[N] ───► brightness ───► result[N] (thread N)│
│ │
│ No data dependencies between iterations! │
└────────────────────────────────────────────────────────┘
Neighborhood Operations (require double-buffering):
┌────────────────────────────────────────────────────────┐
│ READ from input buffer WRITE to output buffer │
│ ───────────────────── ───────────────────── │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ neighbors of [0]│ ───► │ result[0] │ │
│ │ neighbors of [1]│ ───► │ result[1] │ │
│ │ neighbors of [2]│ ───► │ result[2] │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ Reading and writing to SAME buffer = race condition! │
│ Must use separate input and output buffers. │
└────────────────────────────────────────────────────────┘
Why This Matters: Real-World Image Processing
Image processing is a cornerstone of modern computing:
- Social Media: Every photo uploaded goes through filters, resizing, format conversion
- Medical Imaging: MRI, CT scans require enhancement and analysis
- Computer Vision: Autonomous vehicles process thousands of frames per second
- Photography: RAW processing, batch editing for professionals
- Streaming: Real-time filters for video calls
A 4K image (3840 x 2160) has over 8 million pixels. Processing each pixel even 10 times faster saves substantial time when handling thousands of images.
Amdahl’s Law and Overhead
Amdahl’s Law Formula
1
Speedup = ─────────────────────────
(1 - P) + P/N
Where:
P = Fraction of program that can be parallelized (0 to 1)
N = Number of processors
Example calculations:
If 90% of work is parallelizable (P = 0.9):
N = 2 cores: Speedup = 1 / (0.1 + 0.9/2) = 1 / 0.55 = 1.82x
N = 4 cores: Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.08x
N = 8 cores: Speedup = 1 / (0.1 + 0.9/8) = 1 / 0.2125 = 4.71x
N = 16 cores: Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.156 = 6.4x
N = infinity: Speedup = 1 / 0.1 = 10x (theoretical maximum!)
If 99% is parallelizable (P = 0.99):
N = 8 cores: Speedup = 1 / (0.01 + 0.99/8) = 1 / 0.134 = 7.47x
N = infinity: Speedup = 1 / 0.01 = 100x
Key Insight: Even with infinite processors, speedup is limited by the sequential portion.
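The formula is simple enough to encode directly; a small helper (name illustrative) for checking the tables above:

```cpp
// Amdahl's Law: maximum speedup when a fraction p of the work is
// parallelizable across n processors. The (1 - p) sequential part
// bounds the speedup no matter how large n grows.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```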
Amdahl's Law Visualization:
Speedup
^
10│ .................... P=99%
│ ......
│ ......
8│ .....
│ ..
│ ..
6│ . ..................P=90%
│ . ......
│ . .....
4│. .....
│. ...
│. ... .................P=75%
2│. .. ......
│... ......
│. ......
1└──────────────────────────────────────────────────►
1 2 4 8 16 32 64 128 256 Cores
The Overhead Reality
Parallelism has costs that Amdahl’s Law doesn’t capture:
Parallel Overhead Components:
1. Thread Management
┌─────────────────────────────────────────────┐
│ • Thread creation/destruction │
│ • Task queue management │
│ • Work distribution │
│ • Synchronization at completion │
│ │
│ Typical overhead: 1-100 microseconds │
└─────────────────────────────────────────────┘
2. Cache Coherence
┌─────────────────────────────────────────────┐
│ Core 0 Cache Core 1 Cache Core 2 Cache│
│ [data] [data] [data] │
│ │ │ │ │
│ └───────────────┼───────────────┘ │
│ │ │
│ Memory Bus Traffic │
│ │
│ When cores write to nearby memory, │
│ cache lines must be invalidated/synchronized│
└─────────────────────────────────────────────┘
3. False Sharing
┌─────────────────────────────────────────────┐
│ Cache Line (64 bytes): │
│ ┌────────────────────────────────────────┐ │
│ │ Thread 0 │ Thread 1 │ Thread 2 │ ... │ │
│ │ writes │ writes │ writes │ │ │
│ └────────────────────────────────────────┘ │
│ │
│ Even if threads write different elements, │
│ if they share a cache line, performance │
│ degrades due to cache line bouncing. │
└─────────────────────────────────────────────┘
When Parallel Hurts Performance:
Work per element vs. Overhead:
Low work/element High work/element
(brightness adjust) (blur, convolution)
───────────────── ─────────────────
Small image Overhead >> Work Overhead > Work
(64x64) SLOWER with par! Maybe break-even
Large image Overhead < Work Overhead << Work
(4096x4096) Modest speedup Near-linear speedup
Common Misconceptions
Misconception 1: “More threads = more speed”
- Reality: Speedup plateaus and can even decrease due to overhead
- Beyond your core count, threads compete for CPU time
Misconception 2: “par_unseq is always faster than par”
- Reality: par_unseq has stricter requirements (no locks, no allocation)
- If your operation doesn’t benefit from SIMD, par may be equivalent
Misconception 3: “Parallel algorithms handle synchronization for me”
- Reality: The algorithm parallelizes, but YOUR callback must be thread-safe
- Shared state, output dependencies, and side effects are your responsibility
Misconception 4: “I can parallelize any loop with std::transform”
- Reality: Some patterns don’t fit transform (dependencies between iterations)
- Neighborhood operations need double-buffering or row-based parallelization
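To make Misconception 3 concrete: here is a callback that would be correct sequentially but is a data race under par with a plain `int` counter. A `std::atomic` is one fix (a parallel `std::count_if` would be the idiomatic one); this sketch assumes a TBB-linked libstdc++ or equivalent:

```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>

// Count values above a threshold. With a plain `int` counter this
// lambda would race under std::execution::par; the atomic makes the
// shared update safe (relaxed ordering is enough for a pure counter).
int count_above(const std::vector<int>& values, int threshold) {
    std::atomic<int> count{0};
    std::for_each(std::execution::par, values.begin(), values.end(),
                  [&count, threshold](int v) {
                      if (v > threshold)
                          count.fetch_add(1, std::memory_order_relaxed);
                  });
    return count.load();
}
```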
Project Specification
What You’ll Build
A command-line image processing tool that:
- Loads images in PPM format (simple text-based format, easy to parse)
- Applies various filters using parallel algorithms
- Benchmarks sequential vs. parallel performance
- Reports speedup and warns when parallelism hurts
Real World Outcome
$ ./parallel_image --input photo.ppm --output processed.ppm
Loading photo.ppm (4096x3072, 12.6 megapixels)...
Loaded in 0.8s
Applying filters with std::execution::par:
1. Brightness adjustment (+20%)
Sequential: 245ms
Parallel: 42ms
Speedup: 5.8x
2. Gaussian blur (5x5 kernel)
Sequential: 1,840ms
Parallel: 285ms
Speedup: 6.5x
3. Edge detection (Sobel)
Sequential: 892ms
Parallel: 138ms
Speedup: 6.5x
4. Color quantization (16 colors)
Sequential: 3,200ms
Parallel: 520ms
Speedup: 6.2x
Total processing time:
Sequential: 6,177ms
Parallel: 985ms
Overall speedup: 6.3x on 8 cores
Saved processed.ppm
$ ./parallel_image --benchmark --small
Loading tiny.ppm (64x64, 4096 pixels)...
Warning: For small images, parallel overhead exceeds benefits
Sequential: 0.8ms
Parallel: 2.1ms (SLOWER due to thread overhead!)
Benchmark Output Format
┌─────────────────────────────────────────────────────────────────────────────┐
│ PARALLEL IMAGE PROCESSOR │
│ Benchmark Results │
├─────────────────────────────────────────────────────────────────────────────┤
│ Image: landscape.ppm (4096 x 3072 pixels = 12,582,912 pixels) │
│ System: 8 cores, 16 threads │
│ Compiler: g++ 12.2 with -O3 -march=native │
├─────────────────────────────────────────────────────────────────────────────┤
│ Filter │ Sequential │ Parallel │ Speedup │ Efficiency │
├──────────────────┼────────────┼──────────┼─────────┼───────────────────────┤
│ Brightness │ 245 ms │ 42 ms │ 5.8x │ 73% (5.8/8 cores) │
│ Contrast │ 251 ms │ 44 ms │ 5.7x │ 71% │
│ Invert │ 198 ms │ 35 ms │ 5.7x │ 71% │
│ Grayscale │ 312 ms │ 52 ms │ 6.0x │ 75% │
│ Box Blur 3x3 │ 892 ms │ 138 ms │ 6.5x │ 81% │
│ Gaussian 5x5 │ 1,840 ms │ 285 ms │ 6.5x │ 81% │
│ Sobel Edge │ 892 ms │ 138 ms │ 6.5x │ 81% │
│ Sharpen │ 724 ms │ 112 ms │ 6.5x │ 81% │
│ Color Quantize │ 3,200 ms │ 520 ms │ 6.2x │ 78% │
├──────────────────┼────────────┼──────────┼─────────┼───────────────────────┤
│ TOTAL │ 8,554 ms │ 1,366 ms │ 6.3x │ 79% average │
└─────────────────────────────────────────────────────────────────────────────┘
Observations:
- Neighborhood operations (blur, edge) show best scaling due to higher work/pixel
- Point operations (brightness, invert) have lower efficiency due to memory bandwidth
- Theoretical maximum speedup on 8 cores: 8.0x
- Achieved: 6.3x (79% parallel efficiency)
Solution Architecture
System Design
┌─────────────────────────────────────────────────────────────────────────────┐
│ PARALLEL IMAGE PROCESSOR │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ CLI Parser │────>│ Image Loader │────>│ Image Buffer │ │
│ │ (arguments) │ │ (PPM format) │ │ (vector<Pixel>)│ │
│ └─────────────────┘ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ FILTER PIPELINE │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Point Operations (embarrassingly parallel) │ │ │
│ │ │ │ │ │
│ │ │ std::transform(std::execution::par, │ │ │
│ │ │ pixels.begin(), pixels.end(), │ │ │
│ │ │ output.begin(), │ │ │
│ │ │ [](Pixel p) { return adjust_brightness(p); }); │ │ │
│ │ │ │ │ │
│ │ │ • Brightness: p.r += delta (clamped) │ │ │
│ │ │ • Contrast: (p.r - 128) * factor + 128 │ │ │
│ │ │ • Invert: 255 - p.r │ │ │
│ │ │ • Grayscale: 0.299*r + 0.587*g + 0.114*b │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Neighborhood Operations (need double-buffering) │ │ │
│ │ │ │ │ │
│ │ │ // Create row iterators for parallel row processing │ │ │
│ │ │ std::for_each(std::execution::par, │ │ │
│ │ │ row_indices.begin(), row_indices.end(), │ │ │
│ │ │ [&](size_t row) { │ │ │
│ │ │ process_row(input, output, row, kernel); │ │ │
│ │ │ }); │ │ │
│ │ │ │ │ │
│ │ │ • Blur: Average of 3x3 or 5x5 neighborhood │ │ │
│ │ │ • Edge: Sobel gradient magnitude │ │ │
│ │ │ • Sharpen: Center - weighted neighbors │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Global Operations (need synchronization or reduction) │ │ │
│ │ │ │ │ │
│ │ │ // First pass: parallel histogram computation │ │ │
│ │ │ std::for_each(std::execution::par, │ │ │
│ │ │ chunks.begin(), chunks.end(), │ │ │
│ │ │ compute_local_histogram); │ │ │
│ │ │ │ │ │
│ │ │ // Merge histograms (sequential or parallel reduce) │ │ │
│ │ │ auto global_hist = merge_histograms(local_hists); │ │ │
│ │ │ │ │ │
│ │ │ // Second pass: parallel application │ │ │
│ │ │ std::transform(std::execution::par, ..., apply_lut); │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ BENCHMARKING │ │
│ │ │ │
│ │ for each filter: │ │
│ │ time_seq = measure(filter, std::execution::seq) │ │
│ │ time_par = measure(filter, std::execution::par) │ │
│ │ speedup = time_seq / time_par │ │
│ │ efficiency = speedup / num_cores │ │
│ │ │ │
│ │ Report results with analysis │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Image Saver │────>│ Output File │ │
│ │ (PPM format) │ │ (.ppm) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Image Buffer and Parallel Transform
Image Memory Layout (Row-Major, Contiguous):
Memory Address: 0 1 2 3 4 5 6 7 8 ...
┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ P0 │ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │ P8 │ ...
└────┴────┴────┴────┴────┴────┴────┴────┴────┘
Row 0 │ Row 1 │ Row 2
◄───────────────────────┤───────────────────►│
Each Pixel P = { uint8_t r, g, b } (3 bytes)
Access pattern:
Pixel at (x, y) = pixels[y * width + x]
Point Operation Parallelization:
Thread 0 Thread 1 Thread 2 Thread 3
│ │ │ │
▼ ▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ pixels │ │ pixels │ │ pixels │ │ pixels │
│ [0..N/4] │ │ [N/4..N/2] │ │ [N/2..3N/4]│ │ [3N/4..N] │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ output │ │ output │ │ output │ │ output │
│ [0..N/4] │ │ [N/4..N/2] │ │ [N/2..3N/4]│ │ [3N/4..N] │
└────────────┘ └────────────┘ └────────────┘ └────────────┘
No overlap = No data race = Perfect parallelism
Neighborhood Operation Parallelization (Row-Based):
INPUT BUFFER (read-only)
┌────────────────────────────────┐
Row 0 │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 1 Thread 0 │ ████████████████████████████ ◄─┼─── reads rows 0,1,2
Row 2 │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 3 Thread 1 │ ████████████████████████████ ◄─┼─── reads rows 2,3,4
Row 4 │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 5 Thread 2 │ ████████████████████████████ ◄─┼─── reads rows 4,5,6
Row 6 │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
... │ ... │
└────────────────────────────────┘
│
▼ Each thread writes to
OUTPUT BUFFER (write, no overlap)
┌────────────────────────────────┐
Row 0 │ │
Row 1 Thread 0 │ ████████████████████████████ ◄─┼─── writes row 1 only
Row 2 │ │
Row 3 Thread 1 │ ████████████████████████████ ◄─┼─── writes row 3 only
Row 4 │ │
Row 5 Thread 2 │ ████████████████████████████ ◄─┼─── writes row 5 only
Row 6 │ │
... │ ... │
└────────────────────────────────┘
Reading neighbors is fine (input unchanged).
Writing to different rows prevents races.
Key Components
// Core data structures
struct Pixel {
uint8_t r, g, b;
};
struct Image {
std::vector<Pixel> pixels; // Contiguous memory
size_t width;
size_t height;
Pixel& at(size_t x, size_t y) {
return pixels[y * width + x];
}
const Pixel& at(size_t x, size_t y) const {
return pixels[y * width + x];
}
};
// Kernel for convolution operations
template<size_t N>
struct Kernel {
std::array<std::array<float, N>, N> weights;
float sum() const {
float s = 0;
for (auto& row : weights)
for (float w : row)
s += w;
return s;
}
void normalize() {
float s = sum();
if (s != 0) {
for (auto& row : weights)
for (float& w : row)
w /= s;
}
}
};
// Common kernels
constexpr Kernel<3> BOX_BLUR_3x3 = {{
{{1.0f/9, 1.0f/9, 1.0f/9}},
{{1.0f/9, 1.0f/9, 1.0f/9}},
{{1.0f/9, 1.0f/9, 1.0f/9}}
}};
constexpr Kernel<3> SOBEL_X = {{
{{-1, 0, 1}},
{{-2, 0, 2}},
{{-1, 0, 1}}
}};
constexpr Kernel<3> SOBEL_Y = {{
{{-1, -2, -1}},
{{ 0, 0, 0}},
{{ 1, 2, 1}}
}};
constexpr Kernel<3> SHARPEN = {{
{{ 0, -1, 0}},
{{-1, 5, -1}},
{{ 0, -1, 0}}
}};
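The Gaussian 5x5 used later isn't listed above; one standard choice is the outer product of the binomial row {1, 4, 6, 4, 1}, which sums to 256. Shown here as a plain array so the weights are checkable on their own; it drops into the Kernel<5> template the same way:

```cpp
#include <array>

// Binomial approximation of a 5x5 Gaussian. Row/column weights are
// {1, 4, 6, 4, 1}; the outer product sums to 256, so each entry is
// divided by 256 to keep overall brightness unchanged.
constexpr std::array<std::array<float, 5>, 5> GAUSSIAN_5x5 = {{
    {{ 1/256.f,  4/256.f,  6/256.f,  4/256.f,  1/256.f}},
    {{ 4/256.f, 16/256.f, 24/256.f, 16/256.f,  4/256.f}},
    {{ 6/256.f, 24/256.f, 36/256.f, 24/256.f,  6/256.f}},
    {{ 4/256.f, 16/256.f, 24/256.f, 16/256.f,  4/256.f}},
    {{ 1/256.f,  4/256.f,  6/256.f,  4/256.f,  1/256.f}}
}};
```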
Data Structures
Module Structure:
parallel_image/
├── src/
│ ├── main.cpp # CLI entry point
│ ├── image.hpp # Image and Pixel structures
│ ├── image.cpp # Image I/O (PPM format)
│ ├── filters.hpp # Filter function declarations
│ ├── filters_point.cpp # Point operations (brightness, contrast, etc.)
│ ├── filters_conv.cpp # Convolution operations (blur, edge, etc.)
│ ├── filters_global.cpp # Global operations (histogram, quantize)
│ ├── kernels.hpp # Predefined convolution kernels
│ ├── benchmark.hpp # Timing and reporting utilities
│ └── benchmark.cpp
├── tests/
│ ├── test_filters.cpp # Unit tests for filters
│ ├── test_parallel.cpp # Tests verifying parallel correctness
│ └── test_images/
│ ├── gradient.ppm # Simple gradient for testing
│ ├── checkerboard.ppm # Pattern for blur testing
│ └── edges.ppm # Known edges for edge detection
├── CMakeLists.txt
└── README.md
Implementation Guide
Phase 1: PPM Image Loading and Saving (1-2 hours)
PPM (Portable Pixel Map) is one of the simplest image formats: the P3 variant is pure ASCII text with pixel values.
PPM Format Specification:
P3 # Magic number (P3 = ASCII RGB)
4 3 # Width Height
255 # Maximum color value
255 0 0 # Row 0: Red, Green, Blue, White
0 255 0
0 0 255
255 255 255
0 0 0 # Row 1: Black, Cyan, Magenta, Yellow
0 255 255
255 0 255
255 255 0
128 128 128 # Row 2: Gray x4
128 128 128
128 128 128
128 128 128
Code Structure (implement yourself):
// image.hpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
struct Pixel {
uint8_t r, g, b;
// Useful for algorithms
bool operator==(const Pixel& other) const {
return r == other.r && g == other.g && b == other.b;
}
};
struct Image {
std::vector<Pixel> pixels;
size_t width = 0;
size_t height = 0;
size_t size() const { return pixels.size(); }
bool empty() const { return pixels.empty(); }
Pixel& at(size_t x, size_t y);
const Pixel& at(size_t x, size_t y) const;
// Load from PPM file
static Image load_ppm(const std::string& filename);
// Save to PPM file
void save_ppm(const std::string& filename) const;
};
Verification:
// Create a test image, save it, load it, verify it matches
Image img;
img.width = 2;
img.height = 2;
img.pixels = {{255,0,0}, {0,255,0}, {0,0,255}, {255,255,255}};
img.save_ppm("test.ppm");
Image loaded = Image::load_ppm("test.ppm");
assert(loaded.width == img.width);
assert(loaded.height == img.height);
assert(loaded.pixels == img.pixels);
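The loader is the part worth writing yourself (whitespace and `#` comments make it interesting); for reference, a minimal P3 writer with no error handling can be this short (a sketch, with Pixel repeated so the snippet stands alone):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Pixel { std::uint8_t r, g, b; };

// Minimal P3 (ASCII) writer: header, then one "r g b" triple per line.
// No comment handling or error checking -- a sketch, not the assignment.
void save_ppm(const std::string& filename,
              const std::vector<Pixel>& pixels,
              std::size_t width, std::size_t height) {
    std::ofstream out(filename);
    out << "P3\n" << width << " " << height << "\n255\n";
    for (const Pixel& p : pixels)
        out << int(p.r) << " " << int(p.g) << " " << int(p.b) << "\n";
}
```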
Phase 2: Point Operations with Sequential Execution (2-3 hours)
Start with sequential implementations to ensure correctness.
Point Operations to Implement:
- Brightness: Add constant to all channels
- Contrast: Scale around midpoint (128)
- Invert: Subtract each channel from 255
- Grayscale: Convert RGB to single luminance value
// filters.hpp
#include <algorithm>  // std::clamp
#include <cstdint>
#include <execution>
#include "image.hpp"
// Clamp helper - essential for image processing
inline uint8_t clamp(int value) {
    return static_cast<uint8_t>(std::clamp(value, 0, 255));
}
// Brightness adjustment
void adjust_brightness(Image& img, int delta);
// Contrast adjustment
void adjust_contrast(Image& img, float factor);
// Invert colors
void invert(Image& img);
// Convert to grayscale
void grayscale(Image& img);
Sequential Implementation Pattern:
void adjust_brightness(Image& img, int delta) {
std::transform(
img.pixels.begin(),
img.pixels.end(),
img.pixels.begin(),
[delta](Pixel p) {
return Pixel{
clamp(p.r + delta),
clamp(p.g + delta),
clamp(p.b + delta)
};
}
);
}
Verification:
// Test brightness on known values
Image img;
img.pixels = {{100, 100, 100}};
adjust_brightness(img, 50);
assert(img.pixels[0].r == 150);
// Test clamping
img.pixels = {{250, 250, 250}};
adjust_brightness(img, 50);
assert(img.pixels[0].r == 255); // Clamped, not 300
// Test invert
img.pixels = {{0, 128, 255}};
invert(img);
assert(img.pixels[0].r == 255);
assert(img.pixels[0].g == 127);
assert(img.pixels[0].b == 0);
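Grayscale follows the same transform pattern; a sketch using the Rec. 601 luminance weights from the objectives, operating on the pixel vector directly (Pixel repeated here so the snippet stands alone):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Pixel { std::uint8_t r, g, b; };

// Rec. 601 luma: Y = 0.299 R + 0.587 G + 0.114 B, written back to all
// three channels so the image stays RGB but looks gray. The + 0.5f
// rounds to nearest instead of truncating.
void grayscale(std::vector<Pixel>& pixels) {
    std::transform(pixels.begin(), pixels.end(), pixels.begin(),
        [](Pixel p) {
            auto y = static_cast<std::uint8_t>(
                0.299f * p.r + 0.587f * p.g + 0.114f * p.b + 0.5f);
            return Pixel{y, y, y};
        });
}
```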
Phase 3: Add Parallel Execution Policies (2-3 hours)
Now add execution policies and create benchmark infrastructure.
Parallel Version:
template<typename ExecutionPolicy>
void adjust_brightness(ExecutionPolicy&& policy, Image& img, int delta) {
std::transform(
std::forward<ExecutionPolicy>(policy),
img.pixels.begin(),
img.pixels.end(),
img.pixels.begin(),
[delta](Pixel p) {
return Pixel{
clamp(p.r + delta),
clamp(p.g + delta),
clamp(p.b + delta)
};
}
);
}
// Usage:
adjust_brightness(std::execution::seq, img, 50); // Sequential
adjust_brightness(std::execution::par, img, 50); // Parallel
Benchmark Infrastructure:
#include <chrono>
#include <functional>
#include <string>
struct BenchmarkResult {
std::string filter_name;
double sequential_ms;
double parallel_ms;
double speedup;
double efficiency;
};
template<typename Func>
double measure_ms(Func&& f, int iterations = 5) {
// Warmup
f();
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i) {
f();
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double, std::milli>(end - start);
return duration.count() / iterations;
}
BenchmarkResult benchmark_filter(
const std::string& name,
std::function<void()> seq_version,
std::function<void()> par_version,
int num_cores
) {
double seq_ms = measure_ms(seq_version);
double par_ms = measure_ms(par_version);
return BenchmarkResult{
name,
seq_ms,
par_ms,
seq_ms / par_ms,
(seq_ms / par_ms) / num_cores
};
}
Verification:
// Parallel should produce same results as sequential
Image img1 = load_test_image();
Image img2 = img1; // Copy
adjust_brightness(std::execution::seq, img1, 50);
adjust_brightness(std::execution::par, img2, 50);
assert(img1.pixels == img2.pixels); // Results must match!
Phase 4: Neighborhood Operations with Double-Buffering (3-4 hours)
Neighborhood operations read from surrounding pixels, so you cannot write in-place.
Double-Buffering Pattern:
Image apply_convolution(const Image& input, const Kernel<3>& kernel) {
Image output;
output.width = input.width;
output.height = input.height;
output.pixels.resize(input.width * input.height);
// Process interior pixels only (skip 1-pixel border)
// You'll handle borders separately or use padding
for (size_t y = 1; y < input.height - 1; ++y) {
for (size_t x = 1; x < input.width - 1; ++x) {
float r = 0, g = 0, b = 0;
// Apply kernel
for (int ky = -1; ky <= 1; ++ky) {
for (int kx = -1; kx <= 1; ++kx) {
const Pixel& p = input.at(x + kx, y + ky);
float w = kernel.weights[ky + 1][kx + 1];
r += p.r * w;
g += p.g * w;
b += p.b * w;
}
}
output.at(x, y) = Pixel{
clamp(static_cast<int>(r)),
clamp(static_cast<int>(g)),
clamp(static_cast<int>(b))
};
}
}
return output;
}
Row-Based Parallel Convolution:
template<typename ExecutionPolicy>
Image apply_convolution(ExecutionPolicy&& policy,
const Image& input,
const Kernel<3>& kernel) {
Image output;
output.width = input.width;
output.height = input.height;
output.pixels.resize(input.width * input.height);
// Create vector of row indices to parallelize
std::vector<size_t> row_indices;
for (size_t y = 1; y < input.height - 1; ++y) {
row_indices.push_back(y);
}
// Process rows in parallel
std::for_each(
std::forward<ExecutionPolicy>(policy),
row_indices.begin(),
row_indices.end(),
[&input, &output, &kernel](size_t y) {
    // Each thread processes one complete row. Writing only to row y
    // of `output` means no two threads touch the same memory.
    for (size_t x = 1; x < input.width - 1; ++x) {
        float r = 0, g = 0, b = 0;
        for (int ky = -1; ky <= 1; ++ky) {
            for (int kx = -1; kx <= 1; ++kx) {
                const Pixel& p = input.at(x + kx, y + ky);
                float w = kernel.weights[ky + 1][kx + 1];
                r += p.r * w;
                g += p.g * w;
                b += p.b * w;
            }
        }
        output.at(x, y) = Pixel{
            clamp(static_cast<int>(r)),
            clamp(static_cast<int>(g)),
            clamp(static_cast<int>(b))
        };
    }
}
);
// Handle border pixels (copy or pad)
// ...
return output;
}
Filters to Implement:
- Box Blur (3x3): Simple averaging
- Gaussian Blur (5x5): Weighted averaging with Gaussian distribution
- Sobel Edge Detection: Gradient magnitude from horizontal and vertical Sobel
- Sharpen: Emphasize center relative to neighbors
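For Sobel, the two kernel responses are combined per channel into a gradient magnitude. A sketch of that final step (the name sobel_magnitude is illustrative; some implementations use |gx| + |gy| as a cheaper approximation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Combine horizontal (gx) and vertical (gy) Sobel responses into a
// gradient magnitude, clamped into the 0-255 pixel range.
std::uint8_t sobel_magnitude(float gx, float gy) {
    float mag = std::sqrt(gx * gx + gy * gy);
    return static_cast<std::uint8_t>(std::clamp(mag, 0.0f, 255.0f));
}
```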
Verification:
// Blur uniform image = same image
Image uniform;
uniform.pixels = std::vector<Pixel>(100 * 100, {128, 128, 128});
uniform.width = 100;
uniform.height = 100;
Image blurred = apply_convolution(std::execution::seq, uniform, BOX_BLUR_3x3);
// Interior pixels should still be 128 (blur of uniform = uniform)
assert(blurred.at(50, 50).r == 128);
// Edge detection on uniform = near zero
Image edges = apply_edge_detection(std::execution::seq, uniform);
assert(edges.at(50, 50).r < 5); // Should be nearly zero
Phase 5: Global Operations (2-3 hours)
Some operations need information from the entire image first.
Histogram Equalization Example:
// Two-pass algorithm:
// Pass 1: Compute histogram (can be parallelized with local histograms)
// Pass 2: Apply lookup table (embarrassingly parallel)
struct Histogram {
std::array<size_t, 256> r{};
std::array<size_t, 256> g{};
std::array<size_t, 256> b{};
Histogram& operator+=(const Histogram& other) {
for (int i = 0; i < 256; ++i) {
r[i] += other.r[i];
g[i] += other.g[i];
b[i] += other.b[i];
}
return *this;
}
};
Histogram compute_histogram(const Image& img) {
Histogram h;
for (const Pixel& p : img.pixels) {
h.r[p.r]++;
h.g[p.g]++;
h.b[p.b]++;
}
return h;
}
// Parallel version using std::reduce with custom reduction
template<typename ExecutionPolicy>
Histogram compute_histogram_parallel(ExecutionPolicy&& policy, const Image& img) {
// Chunk the image and compute local histograms
size_t num_chunks = std::max<size_t>(1, std::thread::hardware_concurrency()); // may report 0
size_t chunk_size = img.pixels.size() / num_chunks;
std::vector<size_t> chunk_starts;
for (size_t i = 0; i < num_chunks; ++i) {
chunk_starts.push_back(i * chunk_size);
}
std::vector<Histogram> local_histograms(num_chunks);
std::for_each(
std::forward<ExecutionPolicy>(policy),
chunk_starts.begin(),
chunk_starts.end(),
[&img, &local_histograms, chunk_size, num_chunks](size_t start) {
size_t chunk_idx = start / chunk_size;
size_t end = (chunk_idx == num_chunks - 1)
? img.pixels.size()
: start + chunk_size;
Histogram& h = local_histograms[chunk_idx];
for (size_t i = start; i < end; ++i) {
const Pixel& p = img.pixels[i];
h.r[p.r]++;
h.g[p.g]++;
h.b[p.b]++;
}
}
);
// Merge histograms (can also be parallelized but usually fast enough)
Histogram result;
for (const Histogram& h : local_histograms) {
result += h;
}
return result;
}
Color Quantization:
// Reduce image to N colors using simple median-cut or k-means
// This is more complex - implement a basic version first
void quantize_colors(Image& img, int num_colors) {
// Simple approach: reduce precision per channel
// More advanced: k-means clustering on color space
int levels_per_channel = std::cbrt(num_colors); // truncates: 16 colors -> 2 levels/channel (8 colors); crude but fine for a first pass
int step = 256 / levels_per_channel;
std::transform(
std::execution::par,
img.pixels.begin(),
img.pixels.end(),
img.pixels.begin(),
[step](Pixel p) {
return Pixel{
static_cast<uint8_t>((p.r / step) * step + step / 2),
static_cast<uint8_t>((p.g / step) * step + step / 2),
static_cast<uint8_t>((p.b / step) * step + step / 2)
};
}
);
}
Phase 6: Comprehensive Benchmarking (2-3 hours)
Create a benchmark suite that tests various image sizes and operations.
void run_full_benchmark(const std::string& image_path) {
Image img = Image::load_ppm(image_path);
int num_cores = std::thread::hardware_concurrency();
std::cout << "Image: " << image_path << "\n";
std::cout << "Size: " << img.width << " x " << img.height
<< " = " << img.size() << " pixels\n";
std::cout << "Cores: " << num_cores << "\n\n";
std::vector<BenchmarkResult> results;
// Point operations
{
Image copy = img;
auto seq = [&copy]() {
adjust_brightness(std::execution::seq, copy, 20);
};
auto par = [&copy]() {
adjust_brightness(std::execution::par, copy, 20);
};
results.push_back(benchmark_filter("Brightness", seq, par, num_cores));
}
// Add more filters...
// Neighborhood operations
{
auto seq = [&img]() {
return apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);
};
auto par = [&img]() {
return apply_convolution(std::execution::par, img, BOX_BLUR_3x3);
};
results.push_back(benchmark_filter("Box Blur 3x3", seq, par, num_cores));
}
// Print results table
print_benchmark_table(results);
// Analyze and warn about overhead
for (const auto& r : results) {
if (r.speedup < 1.0) {
std::cout << "WARNING: " << r.filter_name
<< " is SLOWER with parallelism ("
<< r.speedup << "x)\n";
}
}
}
Test Different Image Sizes:
void benchmark_scaling() {
std::vector<std::pair<int, int>> sizes = {
{64, 64}, // 4K pixels
{256, 256}, // 65K pixels
{1024, 1024}, // 1M pixels
{2048, 2048}, // 4M pixels
{4096, 4096} // 16M pixels
};
std::cout << "Finding parallel crossover point...\n\n";
std::cout << "Size\t\tSeq (ms)\tPar (ms)\tSpeedup\n";
std::cout << "----\t\t--------\t--------\t-------\n";
for (auto [w, h] : sizes) {
Image img = create_test_image(w, h);
auto seq = [&img]() {
adjust_brightness(std::execution::seq, img, 20);
};
auto par = [&img]() {
adjust_brightness(std::execution::par, img, 20);
};
double seq_ms = measure_ms(seq);
double par_ms = measure_ms(par);
std::cout << w << "x" << h << "\t\t"
<< std::fixed << std::setprecision(2)
<< seq_ms << "\t\t"
<< par_ms << "\t\t"
<< seq_ms / par_ms << "x\n";
}
}
Phase 7: CLI and Integration (1-2 hours)
Create a polished command-line interface.
#include <iostream>
#include <string>
int main(int argc, char* argv[]) {
if (argc < 2) {
std::cout << "Usage: parallel_image [options]\n\n"
<< "Options:\n"
<< " --input <file.ppm> Input image\n"
<< " --output <file.ppm> Output image\n"
<< " --filter <name> Filter to apply\n"
<< " --benchmark Run benchmarks\n"
<< " --small Use small test images\n"
<< "\n"
<< "Filters:\n"
<< " brightness <delta> Adjust brightness\n"
<< " contrast <factor> Adjust contrast\n"
<< " invert Invert colors\n"
<< " grayscale Convert to grayscale\n"
<< " blur Box blur 3x3\n"
<< " gaussian Gaussian blur 5x5\n"
<< " edge Sobel edge detection\n"
<< " sharpen Sharpen\n"
<< " quantize <colors> Reduce to N colors\n";
return 0;
}
// Parse arguments and process...
}
Questions to Guide Your Design
Algorithm Design
- Why use std::transform instead of a raw parallel loop?
- Think about: iterator requirements, algorithm semantics, portability
- When should you use par_unseq vs. just par?
- Think about: SIMD requirements, lock-free guarantees, allocator use
- How do you handle image borders in convolution?
- Options: skip borders, pad with zeros, reflect edges, wrap around
Performance Considerations
- What is the minimum image size where parallelism helps?
- This depends on your system - measure it!
- Why might point operations show lower parallel efficiency than convolutions?
- Think about: work per pixel, memory bandwidth, CPU intensity
- How does cache line size affect parallel performance?
- Consider: false sharing, memory access patterns
Design Decisions
- Why use a contiguous `vector<Pixel>` instead of `vector<vector<Pixel>>`?
- Think about: cache locality, memory allocation, iterator compatibility
- Should the Pixel struct have padding?
- Consider: sizeof(Pixel), alignment, SIMD implications
- For histogram computation, is it better to use thread-local histograms or atomic operations?
- Trade-offs: contention vs. memory usage vs. final merge cost
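A sketch of the thread-local-histogram strategy, written with plain `std::thread` rather than execution policies so the mechanics stay visible (names are illustrative): each worker fills a private 256-bin histogram over its slice, and the partials are summed at the end, trading a cheap merge for zero contention.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Histogram = std::array<std::uint64_t, 256>;

Histogram histogram_parallel(const std::vector<std::uint8_t>& data,
                             std::size_t n_threads = 4) {
    std::vector<Histogram> partials(n_threads, Histogram{});
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                ++partials[t][data[i]];          // private bins: no races
        });
    }
    for (auto& w : workers) w.join();
    Histogram result{};
    for (const auto& p : partials)               // cheap sequential merge
        for (std::size_t b = 0; b < 256; ++b) result[b] += p[b];
    return result;
}
```

The atomic alternative needs no merge but puts every increment through contended shared memory; for image-sized inputs the merge cost (256 adds per thread) is usually negligible by comparison.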
Thinking Exercise
Before implementing, work through this scenario by hand:
Scenario: You have a 1000x1000 image (1 million pixels) and 4 cores. Each pixel takes 100 nanoseconds to process for brightness adjustment.
Questions:
- Sequential time: 1,000,000 pixels * 100 ns = ?
- Answer: 100 milliseconds
- Theoretical parallel time (perfect scaling): 100 ms / 4 = ?
- Answer: 25 milliseconds
- With 1ms thread startup overhead and assuming perfect work distribution:
- Answer: 25ms + 1ms = 26ms, speedup = 100/26 = 3.85x (96% efficiency)
- If there’s 5% sequential overhead (loading, saving, etc.):
- Sequential: 5ms + 100ms = 105ms
- Parallel: 5ms + 25ms = 30ms
- Speedup = 105/30 = 3.5x. Amdahl's Law agrees exactly once the sequential fraction is measured against the total: f = 5/105 ≈ 0.048, so 1/(0.048 + 0.952/4) = 3.5x. (Plugging in f = 0.05 gives 3.48x, slightly off, because the 5 ms is 5% of the 100 ms of parallelizable work, not of the 105 ms total.)
- For a 10x10 image (100 pixels):
- Sequential: 100 * 100ns = 10 microseconds = 0.01ms
- Parallel overhead (1ms) >> work (0.01ms)
- Parallel will be ~100x slower!
Key Insight: The crossover point where parallelism helps depends on:
- Work per element
- Thread overhead
- Number of cores
- Memory bandwidth
For your implementation, find this crossover point empirically!
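The arithmetic in the exercise can be sanity-checked with a tiny helper (a sketch; `amdahl_speedup` is a hypothetical name):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: f is the sequential fraction of TOTAL runtime,
// n is the number of cores. Returns the maximum achievable speedup.
double amdahl_speedup(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}
```

With the true sequential fraction f = 5/105 on 4 cores, the formula reproduces the 3.5x speedup computed above; with f = 0 it collapses to the ideal n-fold speedup.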
Testing Strategy
Unit Tests
// test_filters.cpp
#include <cassert>
#include "filters.hpp"
void test_brightness_clamping() {
Image img;
img.pixels = {{250, 10, 128}};
img.width = 1;
img.height = 1;
adjust_brightness(std::execution::seq, img, 50);
assert(img.pixels[0].r == 255); // Clamped from 300
assert(img.pixels[0].g == 60); // 10 + 50
assert(img.pixels[0].b == 178); // 128 + 50
}
void test_invert_roundtrip() {
Image img;
img.pixels = {{0, 128, 255}, {100, 150, 200}};
img.width = 2;
img.height = 1;
Image original = img;
invert(img);
invert(img);
assert(img.pixels == original.pixels); // Double invert = identity
}
void test_blur_uniform() {
// Blurring a uniform image should produce the same uniform image
Image img;
img.width = 10;
img.height = 10;
img.pixels = std::vector<Pixel>(100, {100, 100, 100});
Image blurred = apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);
// Check interior pixels (not borders)
for (size_t y = 1; y < img.height - 1; ++y) {
for (size_t x = 1; x < img.width - 1; ++x) {
assert(blurred.at(x, y).r == 100);
}
}
}
void test_edge_detection_uniform() {
// Edge detection on uniform image should produce near-zero values
Image img;
img.width = 10;
img.height = 10;
img.pixels = std::vector<Pixel>(100, {128, 128, 128});
Image edges = apply_edge_detection(std::execution::seq, img);
for (size_t y = 1; y < img.height - 1; ++y) {
for (size_t x = 1; x < img.width - 1; ++x) {
assert(edges.at(x, y).r < 5); // Should be near zero
}
}
}
void test_edge_detection_gradient() {
// Vertical gradient should produce horizontal edges
Image img;
img.width = 10;
img.height = 10;
img.pixels.reserve(100);
for (size_t y = 0; y < 10; ++y) {
uint8_t value = static_cast<uint8_t>(y * 25);
for (size_t x = 0; x < 10; ++x) {
img.pixels.push_back({value, value, value});
}
}
Image edges = apply_edge_detection(std::execution::seq, img);
// Interior should have non-zero values (edges detected)
bool has_edges = false;
for (size_t y = 2; y < 8; ++y) {
for (size_t x = 2; x < 8; ++x) {
if (edges.at(x, y).r > 10) {
has_edges = true;
}
}
}
assert(has_edges);
}
Parallel Correctness Tests
// test_parallel.cpp
void test_parallel_matches_sequential() {
// For all filters, parallel version must produce identical results
Image img = create_random_test_image(100, 100);
// Test brightness
{
Image seq_copy = img;
Image par_copy = img;
adjust_brightness(std::execution::seq, seq_copy, 50);
adjust_brightness(std::execution::par, par_copy, 50);
assert(seq_copy.pixels == par_copy.pixels);
}
// Test blur
{
Image seq_result = apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);
Image par_result = apply_convolution(std::execution::par, img, BOX_BLUR_3x3);
assert(seq_result.pixels == par_result.pixels);
}
// Add tests for all filters...
}
void test_deterministic_results() {
// Running parallel version multiple times should give same result
Image img = create_random_test_image(100, 100);
std::vector<Image> results;
for (int i = 0; i < 10; ++i) {
Image copy = img;
adjust_brightness(std::execution::par, copy, 50);
results.push_back(copy);
}
for (size_t i = 1; i < results.size(); ++i) {
assert(results[i].pixels == results[0].pixels);
}
}
Integration Tests
void test_full_pipeline() {
// Apply multiple filters in sequence
Image img = Image::load_ppm("test_images/photo.ppm");
adjust_brightness(std::execution::par, img, 20);
Image blurred = apply_convolution(std::execution::par, img, GAUSSIAN_5x5);
Image edges = apply_edge_detection(std::execution::par, blurred);
edges.save_ppm("test_output/pipeline_result.ppm");
// Verify output file exists and is valid
Image loaded = Image::load_ppm("test_output/pipeline_result.ppm");
assert(!loaded.empty());
}
void test_large_image_performance() {
// Verify parallel is faster for large images
Image img = create_test_image(2048, 2048);
auto seq_time = measure_ms([&img]() {
adjust_brightness(std::execution::seq, img, 20);
});
auto par_time = measure_ms([&img]() {
adjust_brightness(std::execution::par, img, 20);
});
double speedup = seq_time / par_time;
std::cout << "Large image speedup: " << speedup << "x\n";
assert(speedup > 1.5); // Should see significant speedup
}
Common Pitfalls
Pitfall 1: Iterator Invalidation
Symptom: Crash or undefined behavior during parallel execution
Cause: Resizing container during parallel operation
// WRONG: May reallocate during parallel loop
std::for_each(std::execution::par, indices.begin(), indices.end(),
[&output](size_t i) {
output.push_back(compute(i)); // BAD: push_back may reallocate!
});
Fix: Pre-allocate output container
// CORRECT: Pre-allocate, then assign
output.resize(indices.size());
std::for_each(std::execution::par, indices.begin(), indices.end(),
[&output](size_t i) {
output[i] = compute(i); // GOOD: No reallocation
});
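The same fix can be expressed with `std::transform`, which writes exactly one output slot per input element; `compute` below is a stand-in for the real per-element work, and an execution policy could be added as the first argument without changing the structure:

```cpp
#include <algorithm>
#include <vector>

int compute(int i) { return i * i; }   // hypothetical per-element work

std::vector<int> transform_all(const std::vector<int>& indices) {
    std::vector<int> output(indices.size());        // pre-allocated up front
    std::transform(indices.begin(), indices.end(),  // policy could go first
                   output.begin(),                  // one slot per element
                   [](int i) { return compute(i); });
    return output;
}
```

Because `std::transform` encodes "one output per input" in its signature, it is harder to accidentally reintroduce the `push_back` race than with a raw `for_each` over indices.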
Pitfall 2: Data Races in Neighborhood Operations
Symptom: Image has artifacts, varies between runs
Cause: Reading and writing same buffer
// WRONG: Blur reads neighbors while other threads modify them
std::for_each(std::execution::par, ...,
[&img](size_t idx) {
img.pixels[idx] = average_of_neighbors(img, idx); // Race!
});
Fix: Double-buffering
// CORRECT: Read from input, write to output
std::for_each(std::execution::par, ...,
[&input, &output](size_t idx) {
output.pixels[idx] = average_of_neighbors(input, idx); // Safe
});
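The pattern scales down to a one-dimensional sketch: a 3-point moving average that reads only from the input buffer and writes only to a separate output, so every element is independent and could be computed in parallel (illustrative code, not the project's blur):

```cpp
#include <cstddef>
#include <vector>

// Double-buffered neighborhood pass in 1-D. Borders are copied through
// unchanged; interior elements read three inputs and write one output.
std::vector<int> moving_average3(const std::vector<int>& in) {
    std::vector<int> out(in);                      // borders keep old values
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
    return out;                                    // caller swaps buffers
}
```

A uniform input should pass through unchanged, which doubles as a quick correctness test for the 2-D blur as well.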
Pitfall 3: Using par_unseq with Locks
Symptom: Deadlock or crash
Cause: par_unseq may interleave operations on same thread
std::mutex mtx;
std::for_each(std::execution::par_unseq, ...,
[&mtx](auto& item) {
std::lock_guard lock(mtx); // DANGER: May deadlock!
// par_unseq permits interleaving iterations on a single thread, so a
// thread may start a second iteration (and try to acquire the mutex
// again) while still holding it from the first: self-deadlock.
});
Fix: Use par or redesign to avoid locks
// Option 1: Use par instead of par_unseq
std::for_each(std::execution::par, ...);
// Option 2: Use thread-local storage
thread_local std::vector<int> local_results;
Pitfall 4: Measuring Debug Builds
Symptom: No speedup or slowdown from parallelism
Cause: Debug builds disable optimizations
# WRONG: Debug build (no optimization)
g++ -std=c++17 -g main.cpp -ltbb
# CORRECT: Release build with optimizations
g++ -std=c++17 -O3 -march=native main.cpp -ltbb
Pitfall 5: Not Warming Up Cache
Symptom: First run is much slower, inconsistent benchmarks
Cause: Cold cache, lazy initialization
Fix: Warm up before measuring
template<typename Func>
double measure_ms(Func&& f, int iterations = 5) {
// Warmup run - don't count this
f();
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i) {
f();
}
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration<double, std::milli>(end - start).count()
/ iterations;
}
Pitfall 6: Ignoring Memory Bandwidth
Symptom: Low efficiency even with perfect parallelism
Cause: Memory bandwidth is the bottleneck, not CPU
// Point operations may be memory-bound:
// each pixel is 3 bytes, read once, written once.
// A 12-megapixel image (e.g. 4000x3000): 12M pixels * 3 bytes * 2 (read+write) = 72 MB
// Typical memory bandwidth: ~50 GB/s
// Time to move the data: 72 MB / 50 GB/s ~= 1.4 ms
// If the compute finishes faster than that, memory is the bottleneck
Mitigation:
- Fuse multiple operations to amortize memory access
- Process data in cache-sized chunks
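As a sketch of the fusion idea, brightness and invert can be combined into a single pass so each byte is read and written once instead of twice (assumes a non-negative delta; the names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Fused brightness + invert: one traversal, half the memory traffic of
// running the two point operations back to back.
void brighten_then_invert_fused(std::vector<std::uint8_t>& bytes, int delta) {
    for (auto& b : bytes) {
        int v = b + delta;                        // brightness in int space
        if (v > 255) v = 255;                     // clamp to valid range
        b = static_cast<std::uint8_t>(255 - v);   // invert in the same pass
    }
}
```

For memory-bound point operations this kind of fusion often buys more than adding cores does, since it attacks the bandwidth bottleneck directly.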
Extensions & Challenges
Extension 1: Streaming Large Images
Handle images too large to fit in memory:
class StreamingImageProcessor {
// Process image in tiles
void process_tiled(const std::string& input_path,
const std::string& output_path,
size_t tile_size = 1024) {
// Read and process one tile at a time
// Tiles can be processed in parallel
// Handle tile boundaries for neighborhood operations
}
};
Extension 2: GPU Acceleration with SYCL
Port to GPU using SYCL (oneAPI):
#include <sycl/sycl.hpp>
void brightness_gpu(sycl::queue& q, Image& img, int delta) {
sycl::buffer<Pixel> buf(img.pixels.data(), img.pixels.size());
q.submit([&](sycl::handler& h) {
auto acc = buf.get_access<sycl::access::mode::read_write>(h);
h.parallel_for(sycl::range<1>(img.pixels.size()), [=](sycl::id<1> i) {
Pixel& p = acc[i];
p.r = clamp(p.r + delta);
p.g = clamp(p.g + delta);
p.b = clamp(p.b + delta);
});
});
}
Extension 3: Vectorized Operations with std::simd
Use explicit SIMD via std::experimental::simd from the Parallelism TS v2 (the standardized std::simd arrives in C++26):
#include <experimental/simd>
namespace stdx = std::experimental;
void brightness_simd(Image& img, int delta) {
using simd_t = stdx::native_simd<uint8_t>;
constexpr size_t simd_size = simd_t::size();
// Process in SIMD-width chunks
uint8_t* data = reinterpret_cast<uint8_t*>(img.pixels.data());
size_t total_bytes = img.pixels.size() * 3;
size_t i = 0;
for (; i + simd_size <= total_bytes; i += simd_size) {
simd_t v;
v.copy_from(data + i, stdx::element_aligned);
// uint8_t addition wraps modulo 256, so min() alone cannot clamp;
// saturate manually: lanes that wrapped satisfy sum < v.
simd_t sum = v + simd_t(static_cast<uint8_t>(delta)); // assumes 0 <= delta <= 255
stdx::where(sum < v, sum) = simd_t(255);
sum.copy_to(data + i, stdx::element_aligned);
}
// Handle remainder
for (; i < total_bytes; ++i) {
data[i] = clamp(data[i] + delta);
}
}
Extension 4: Real-Time Video Processing
Apply filters to webcam feed:
// Using OpenCV for video capture
void realtime_processing() {
cv::VideoCapture cap(0);
while (true) {
cv::Mat frame;
cap >> frame;
// Convert to Image, process, convert back
Image img = opencv_to_image(frame);
Image edges = apply_edge_detection(std::execution::par, img);
frame = image_to_opencv(edges);
cv::imshow("Edges", frame);
if (cv::waitKey(1) == 'q') break;
}
}
Extension 5: Automatic Parallelism Selection
Decide at runtime whether to use parallelism:
template<typename Func>
auto auto_parallel(const Image& img, Func&& filter) {
// Heuristic: use parallel for images > 100K pixels
constexpr size_t PARALLEL_THRESHOLD = 100000;
if (img.size() > PARALLEL_THRESHOLD) {
return filter(std::execution::par);
} else {
return filter(std::execution::seq);
}
}
// Usage:
auto result = auto_parallel(img, [&](auto policy) {
return apply_convolution(policy, img, BOX_BLUR_3x3);
});
Resources
Primary References
| Resource | Description |
|---|---|
| “C++ Concurrency in Action, 2nd Ed” Ch. 10 | Comprehensive coverage of parallel algorithms |
| cppreference.com - Execution policies | Official documentation for std::execution |
| Intel TBB Documentation | Implementation details of parallel algorithms |
Secondary References
| Resource | Description |
|---|---|
| “Digital Image Processing” by Gonzalez & Woods | Image processing algorithms |
| “Computer Graphics from Scratch” by Gambetta | Fundamental graphics concepts |
| “What Every Programmer Should Know About Memory” | Understanding cache effects |
Online Resources
| Resource | URL |
|---|---|
| C++17 Parallel Algorithms Paper | P0024R2 |
| Intel TBB Tutorial | oneapi.io/tbb |
| Netpbm Format Specification | netpbm.sourceforge.net |
Tools
| Tool | Purpose |
|---|---|
| Intel VTune | Performance profiling |
| perf | Linux performance analysis |
| Valgrind/Helgrind | Thread error detection |
| ThreadSanitizer (TSan) | Runtime race detection |
Self-Assessment Checklist
Conceptual Understanding
- I can explain the difference between `seq`, `par`, and `par_unseq` execution policies
- I understand why `par_unseq` prohibits locks and memory allocation
- I can apply Amdahl’s Law to calculate theoretical speedup limits
- I know when parallel overhead exceeds parallel benefit
- I understand false sharing and how to avoid it
- I can identify embarrassingly parallel operations
Implementation Skills
- My PPM loader/saver handles files correctly
- Point operations (brightness, contrast, invert) work correctly
- Neighborhood operations use double-buffering
- Parallel versions produce identical results to sequential
- My code compiles with `-O3` optimization
Performance Analysis
- I benchmarked sequential vs. parallel for multiple image sizes
- I found the crossover point where parallelism becomes beneficial
- I calculated parallel efficiency (speedup / cores)
- I understand why my efficiency is < 100%
- I can explain performance differences between filters
Testing
- I have unit tests for all filters
- I verified parallel correctness (same results as sequential)
- I tested edge cases (tiny images, uniform images, gradients)
- My tests run with ThreadSanitizer without errors
Submission/Completion Criteria
Your project is complete when:
Functional Requirements
- Core Features
- Loads and saves PPM images correctly
- Implements at least 4 point operations (brightness, contrast, invert, grayscale)
- Implements at least 3 neighborhood operations (blur, edge, sharpen)
- All filters work with both `seq` and `par` execution policies
- Benchmarking
- Produces formatted benchmark output showing speedup
- Tests multiple image sizes (at least 3: small, medium, large)
- Identifies crossover point where parallelism helps
- Reports parallel efficiency (speedup / cores)
- Code Quality
- Clean separation between image I/O, filters, and benchmarking
- No data races (verified with ThreadSanitizer)
- Compiles without warnings with `-Wall -Wextra`
- Includes unit tests for correctness
Performance Requirements
- Achieves > 50% parallel efficiency on 4+ core system for large images (1000x1000+)
- Correctly identifies when parallelism is counterproductive
- Neighborhood operations show higher efficiency than point operations (due to more work/pixel)
Documentation
- README explains how to build and run
- Comments explain algorithm choices (double-buffering, row parallelization)
- Benchmark results documented with system specs
Stretch Goals (Optional)
- Implement Gaussian blur with 5x5 kernel
- Add histogram equalization (global operation)
- Support additional image formats (PNG via stb_image)
- Add an `--auto-parallel` flag that chooses based on image size
- Implement separable convolution for efficiency
Moving Forward
After completing this project:
- Next Project: P08: Parallel Sort Benchmark Suite - Apply parallel algorithms to sorting
- Related Projects:
- P12: SIMD Math Library - Explicit vectorization
- P13: Audio DSP with SIMD - Real-time signal processing
- P14: Real-Time Game Physics - Combining threading and SIMD
- Deepen Understanding:
- Profile with Intel VTune to understand memory bandwidth limits
- Experiment with different chunk sizes for histogram computation
- Compare TBB’s `parallel_for` with `std::for_each(par, ...)`
Key Insight to Carry Forward: C++17 parallel algorithms make parallelism syntactically trivial, but achieving good performance requires understanding the cost model. The difference between 2x and 6x speedup often comes down to:
- Minimizing memory bandwidth pressure
- Choosing appropriate granularity
- Avoiding synchronization in hot paths
- Measuring, not guessing
When you reach for std::execution::par, ask: “Is the work per element enough to justify the overhead?” This project taught you to answer that question empirically.
This project bridges high-level parallel algorithms with practical performance engineering. The skills transfer directly to production systems where the difference between “parallel but slow” and “efficiently parallel” can mean the difference between meeting latency requirements or failing them.