P07: Parallel Image Processor

Build an image processing pipeline that uses C++17 parallel algorithms (std::execution::par) to apply filters, transformations, and effects across multiple cores with minimal custom threading code.


Quick Reference

Attribute        Value
─────────        ─────
Difficulty       Intermediate
Time Estimate    1-2 weeks
Language         C++ (C++17 or later)
Prerequisites    Projects 1-2 completed, STL algorithms, basic image concepts
Primary Book     “C++ Concurrency in Action, 2nd Ed” by Anthony Williams, Ch. 10
Knowledge Area   Parallel Algorithms / STL Execution Policies

Learning Objectives

After completing this project, you will be able to:

  1. Use C++17 parallel algorithms - Apply std::execution::par and par_unseq to STL algorithms
  2. Understand execution policies - Know when to use seq, par, and par_unseq and their implications
  3. Recognize embarrassingly parallel problems - Identify operations that benefit from parallelization
  4. Apply Amdahl’s Law - Calculate theoretical maximum speedup and understand its limitations
  5. Measure parallel overhead - Determine when parallelism helps vs. hurts performance
  6. Handle neighborhood operations - Implement double-buffering for parallel convolution
  7. Process images efficiently - Design cache-friendly data structures for image processing
  8. Benchmark parallel code - Create fair, reproducible performance comparisons

The Core Question You’re Answering

“When does adding std::execution::par to an algorithm actually make it faster?”

The promise of C++17 parallel algorithms is seductive: add a single parameter and your code runs on multiple cores. But the reality is more nuanced:

  • For large, computationally intensive operations, parallelism can provide near-linear speedup
  • For small data sets or simple operations, the overhead of thread management exceeds the benefit
  • For operations with data dependencies, parallelism may be impossible or require careful design

Before coding, internalize this: Parallelism is not free. Every parallel invocation incurs overhead for thread management, synchronization, and cache coherence. Your job is to understand when the work per element justifies that overhead.


Theoretical Foundation

Core Concepts

Execution Policies Explained

C++17 introduced execution policies to control how STL algorithms execute:

Execution Policy Hierarchy:

std::execution::seq        Standard sequential execution
         │                 - Same as traditional algorithms
         │                 - No parallelism, no vectorization
         │                 - Fully deterministic order
         │
std::execution::par        Parallel execution
         │                 - Operations may execute on multiple threads
         │                 - Order is unspecified
         │                 - Must be safe for concurrent execution
         │                 - No vectorization guarantees
         │
std::execution::par_unseq  Parallel + vectorized execution
                           - Operations may interleave on single thread (SIMD)
                           - AND execute on multiple threads
                           - Most restrictive requirements:
                             * No mutex locks
                             * No memory allocation in callbacks
                             * No thread-local access

Critical Insight: par_unseq allows operations from different iterations to be interleaved within a single thread via SIMD instructions. This means your callback must not acquire locks (the same thread could try to acquire one twice and deadlock) or allocate memory (the heap allocator may take internal locks).

Embarrassingly Parallel Problems

Some problems decompose naturally into independent subproblems:

Embarrassingly Parallel Operations (ideal for std::execution::par):

Point Operations (each pixel independent):
┌────────────────────────────────────────────────────────┐
│  Input Pixel     Operation       Output Pixel         │
│  ───────────     ─────────       ────────────         │
│  pixel[0]   ───► brightness ───► result[0]  (thread 1)│
│  pixel[1]   ───► brightness ───► result[1]  (thread 2)│
│  pixel[2]   ───► brightness ───► result[2]  (thread 3)│
│  ...                                                  │
│  pixel[N]   ───► brightness ───► result[N]  (thread N)│
│                                                       │
│  No data dependencies between iterations!             │
└────────────────────────────────────────────────────────┘

Neighborhood Operations (require double-buffering):
┌────────────────────────────────────────────────────────┐
│  READ from input buffer    WRITE to output buffer     │
│  ─────────────────────     ─────────────────────      │
│  ┌─────────────────┐       ┌─────────────────┐        │
│  │ neighbors of [0]│ ───►  │  result[0]      │        │
│  │ neighbors of [1]│ ───►  │  result[1]      │        │
│  │ neighbors of [2]│ ───►  │  result[2]      │        │
│  └─────────────────┘       └─────────────────┘        │
│                                                       │
│  Reading and writing to SAME buffer = race condition! │
│  Must use separate input and output buffers.          │
└────────────────────────────────────────────────────────┘

Why This Matters: Real-World Image Processing

Image processing is a cornerstone of modern computing:

  • Social Media: Every photo uploaded goes through filters, resizing, format conversion
  • Medical Imaging: MRI, CT scans require enhancement and analysis
  • Computer Vision: Autonomous vehicles process thousands of frames per second
  • Photography: RAW processing, batch editing for professionals
  • Streaming: Real-time filters for video calls

A 4K image (3840 x 2160) has over 8 million pixels. Even a modest per-image speedup compounds into substantial time savings when you process thousands of images.

Amdahl’s Law and Overhead

Amdahl’s Law Formula

                    1
Speedup = ─────────────────────────
          (1 - P) + P/N

Where:
  P = Fraction of program that can be parallelized (0 to 1)
  N = Number of processors

Example calculations:

If 90% of work is parallelizable (P = 0.9):
  N = 2 cores:  Speedup = 1 / (0.1 + 0.9/2) = 1 / 0.55 = 1.82x
  N = 4 cores:  Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.08x
  N = 8 cores:  Speedup = 1 / (0.1 + 0.9/8) = 1 / 0.2125 = 4.71x
  N = 16 cores: Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.40x
  N = infinity: Speedup = 1 / 0.1 = 10x (theoretical maximum!)

If 99% is parallelizable (P = 0.99):
  N = 8 cores:  Speedup = 1 / (0.01 + 0.99/8) = 1 / 0.13375 = 7.48x
  N = infinity: Speedup = 1 / 0.01 = 100x

Key Insight: Even with infinite processors, speedup is limited by the sequential portion.

Amdahl's Law Visualization:

Speedup
  ^
10│                         .................... P=99%
  │                   ......
  │             ......
 8│        .....
  │      ..
  │    ..
 6│   .                       ..................P=90%
  │  .                  ......
  │ .              .....
 4│.          .....
  │.       ...
  │.    ...                      .................P=75%
 2│.  ..                   ......
  │...               ......
  │.           ......
 1└──────────────────────────────────────────────────►
   1    2    4    8    16   32   64  128  256  Cores

The Overhead Reality

Parallelism has costs that Amdahl’s Law doesn’t capture:

Parallel Overhead Components:

1. Thread Management
   ┌─────────────────────────────────────────────┐
   │ • Thread creation/destruction               │
   │ • Task queue management                     │
   │ • Work distribution                         │
   │ • Synchronization at completion             │
   │                                             │
   │ Typical overhead: 1-100 microseconds        │
   └─────────────────────────────────────────────┘

2. Cache Coherence
   ┌─────────────────────────────────────────────┐
   │ Core 0 Cache    Core 1 Cache    Core 2 Cache│
   │   [data]          [data]          [data]   │
   │      │               │               │      │
   │      └───────────────┼───────────────┘      │
   │                      │                      │
   │              Memory Bus Traffic             │
   │                                             │
   │ When cores write to nearby memory,          │
   │ cache lines must be invalidated/synchronized│
   └─────────────────────────────────────────────┘

3. False Sharing
   ┌─────────────────────────────────────────────┐
   │ Cache Line (64 bytes):                      │
   │ ┌────────────────────────────────────────┐  │
   │ │ Thread 0 │ Thread 1 │ Thread 2 │ ...   │  │
   │ │  writes  │  writes  │  writes  │       │  │
   │ └────────────────────────────────────────┘  │
   │                                             │
   │ Even if threads write different elements,   │
   │ if they share a cache line, performance     │
   │ degrades due to cache line bouncing.        │
   └─────────────────────────────────────────────┘

When Parallel Hurts Performance:

Work per element vs. Overhead:

              Low work/element        High work/element
              (brightness adjust)     (blur, convolution)
              ─────────────────       ─────────────────
Small image   Overhead >> Work        Overhead > Work
(64x64)       SLOWER with par!        Maybe break-even

Large image   Overhead < Work         Overhead << Work
(4096x4096)   Modest speedup          Near-linear speedup

Common Misconceptions

Misconception 1: “More threads = more speed”

  • Reality: Speedup plateaus and can even decrease due to overhead
  • Beyond your core count, threads compete for CPU time

Misconception 2: “par_unseq is always faster than par”

  • Reality: par_unseq has stricter requirements (no locks, no allocation)
  • If your operation doesn’t benefit from SIMD, par may be equivalent

Misconception 3: “Parallel algorithms handle synchronization for me”

  • Reality: The algorithm parallelizes, but YOUR callback must be thread-safe
  • Shared state, output dependencies, and side effects are your responsibility

Misconception 4: “I can parallelize any loop with std::transform”

  • Reality: Some patterns don’t fit transform (dependencies between iterations)
  • Neighborhood operations need double-buffering or row-based parallelization

Project Specification

What You’ll Build

A command-line image processing tool that:

  1. Loads images in PPM format (simple text-based format, easy to parse)
  2. Applies various filters using parallel algorithms
  3. Benchmarks sequential vs. parallel performance
  4. Reports speedup and warns when parallelism hurts

Real World Outcome

$ ./parallel_image --input photo.ppm --output processed.ppm

Loading photo.ppm (4096x3072, 12.6 megapixels)...
Loaded in 0.8s

Applying filters with std::execution::par:

1. Brightness adjustment (+20%)
   Sequential: 245ms
   Parallel:   42ms
   Speedup:    5.8x

2. Gaussian blur (5x5 kernel)
   Sequential: 1,840ms
   Parallel:   285ms
   Speedup:    6.5x

3. Edge detection (Sobel)
   Sequential: 892ms
   Parallel:   138ms
   Speedup:    6.5x

4. Color quantization (16 colors)
   Sequential: 3,200ms
   Parallel:   520ms
   Speedup:    6.2x

Total processing time:
   Sequential: 6,177ms
   Parallel:   985ms
   Overall speedup: 6.3x on 8 cores

Saved processed.ppm

$ ./parallel_image --benchmark --small
Loading tiny.ppm (64x64, 4096 pixels)...
Warning: For small images, parallel overhead exceeds benefits
   Sequential: 0.8ms
   Parallel:   2.1ms (SLOWER due to thread overhead!)

Benchmark Output Format

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PARALLEL IMAGE PROCESSOR                             │
│                         Benchmark Results                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│ Image: landscape.ppm (4096 x 3072 pixels = 12,582,912 pixels)               │
│ System: 8 cores, 16 threads                                                 │
│ Compiler: g++ 12.2 with -O3 -march=native                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│ Filter           │ Sequential │ Parallel │ Speedup │ Efficiency            │
├──────────────────┼────────────┼──────────┼─────────┼───────────────────────┤
│ Brightness       │    245 ms  │    42 ms │   5.8x  │ 73% (5.8/8 cores)     │
│ Contrast         │    251 ms  │    44 ms │   5.7x  │ 71%                   │
│ Invert           │    198 ms  │    35 ms │   5.7x  │ 71%                   │
│ Grayscale        │    312 ms  │    52 ms │   6.0x  │ 75%                   │
│ Box Blur 3x3     │    892 ms  │   138 ms │   6.5x  │ 81%                   │
│ Gaussian 5x5     │  1,840 ms  │   285 ms │   6.5x  │ 81%                   │
│ Sobel Edge       │    892 ms  │   138 ms │   6.5x  │ 81%                   │
│ Sharpen          │    724 ms  │   112 ms │   6.5x  │ 81%                   │
│ Color Quantize   │  3,200 ms  │   520 ms │   6.2x  │ 78%                   │
├──────────────────┼────────────┼──────────┼─────────┼───────────────────────┤
│ TOTAL            │  8,554 ms  │ 1,366 ms │   6.3x  │ 79% average           │
└─────────────────────────────────────────────────────────────────────────────┘

Observations:
- Neighborhood operations (blur, edge) show best scaling due to higher work/pixel
- Point operations (brightness, invert) have lower efficiency due to memory bandwidth
- Theoretical maximum speedup on 8 cores: 8.0x
- Achieved: 6.3x (79% parallel efficiency)

Solution Architecture

System Design

┌─────────────────────────────────────────────────────────────────────────────┐
│                       PARALLEL IMAGE PROCESSOR                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐       │
│  │   CLI Parser    │────>│   Image Loader  │────>│  Image Buffer   │       │
│  │  (arguments)    │     │  (PPM format)   │     │  (vector<Pixel>)│       │
│  └─────────────────┘     └─────────────────┘     └────────┬────────┘       │
│                                                           │                 │
│                                                           ▼                 │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        FILTER PIPELINE                               │   │
│  │                                                                      │   │
│  │  ┌───────────────────────────────────────────────────────────────┐  │   │
│  │  │ Point Operations (embarrassingly parallel)                     │  │   │
│  │  │                                                                │  │   │
│  │  │ std::transform(std::execution::par,                           │  │   │
│  │  │               pixels.begin(), pixels.end(),                    │  │   │
│  │  │               output.begin(),                                  │  │   │
│  │  │               [](Pixel p) { return adjust_brightness(p); });  │  │   │
│  │  │                                                                │  │   │
│  │  │ • Brightness: p.r += delta (clamped)                          │  │   │
│  │  │ • Contrast: (p.r - 128) * factor + 128                        │  │   │
│  │  │ • Invert: 255 - p.r                                           │  │   │
│  │  │ • Grayscale: 0.299*r + 0.587*g + 0.114*b                      │  │   │
│  │  └───────────────────────────────────────────────────────────────┘  │   │
│  │                                                                      │   │
│  │  ┌───────────────────────────────────────────────────────────────┐  │   │
│  │  │ Neighborhood Operations (need double-buffering)               │  │   │
│  │  │                                                                │  │   │
│  │  │ // Create row iterators for parallel row processing           │  │   │
│  │  │ std::for_each(std::execution::par,                            │  │   │
│  │  │              row_indices.begin(), row_indices.end(),          │  │   │
│  │  │              [&](size_t row) {                                 │  │   │
│  │  │                  process_row(input, output, row, kernel);     │  │   │
│  │  │              });                                               │  │   │
│  │  │                                                                │  │   │
│  │  │ • Blur: Average of 3x3 or 5x5 neighborhood                    │  │   │
│  │  │ • Edge: Sobel gradient magnitude                              │  │   │
│  │  │ • Sharpen: Center - weighted neighbors                        │  │   │
│  │  └───────────────────────────────────────────────────────────────┘  │   │
│  │                                                                      │   │
│  │  ┌───────────────────────────────────────────────────────────────┐  │   │
│  │  │ Global Operations (need synchronization or reduction)         │  │   │
│  │  │                                                                │  │   │
│  │  │ // First pass: parallel histogram computation                 │  │   │
│  │  │ std::for_each(std::execution::par,                            │  │   │
│  │  │              chunks.begin(), chunks.end(),                    │  │   │
│  │  │              compute_local_histogram);                        │  │   │
│  │  │                                                                │  │   │
│  │  │ // Merge histograms (sequential or parallel reduce)           │  │   │
│  │  │ auto global_hist = merge_histograms(local_hists);             │  │   │
│  │  │                                                                │  │   │
│  │  │ // Second pass: parallel application                          │  │   │
│  │  │ std::transform(std::execution::par, ..., apply_lut);          │  │   │
│  │  └───────────────────────────────────────────────────────────────┘  │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                          BENCHMARKING                                │   │
│  │                                                                      │   │
│  │  for each filter:                                                   │   │
│  │    time_seq = measure(filter, std::execution::seq)                  │   │
│  │    time_par = measure(filter, std::execution::par)                  │   │
│  │    speedup = time_seq / time_par                                    │   │
│  │    efficiency = speedup / num_cores                                 │   │
│  │                                                                      │   │
│  │  Report results with analysis                                       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────┐     ┌─────────────────┐                               │
│  │  Image Saver    │────>│   Output File   │                               │
│  │  (PPM format)   │     │   (.ppm)        │                               │
│  └─────────────────┘     └─────────────────┘                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Image Buffer and Parallel Transform

Image Memory Layout (Row-Major, Contiguous):

Memory Address:  0    1    2    3    4    5    6    7    8    ...
                 ┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
                 │ P0 │ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │ P8 │ ...
                 └────┴────┴────┴────┴────┴────┴────┴────┴────┘
                 Row 0                   │ Row 1              │ Row 2
                 ◄───────────────────────┤───────────────────►│

Each Pixel P = { uint8_t r, g, b }  (3 bytes)

Access pattern:
  Pixel at (x, y) = pixels[y * width + x]


Point Operation Parallelization:

Thread 0         Thread 1         Thread 2         Thread 3
    │                │                │                │
    ▼                ▼                ▼                ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│ pixels     │  │ pixels     │  │ pixels     │  │ pixels     │
│ [0..N/4]   │  │ [N/4..N/2] │  │ [N/2..3N/4]│  │ [3N/4..N]  │
└────────────┘  └────────────┘  └────────────┘  └────────────┘
    │                │                │                │
    ▼                ▼                ▼                ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│ output     │  │ output     │  │ output     │  │ output     │
│ [0..N/4]   │  │ [N/4..N/2] │  │ [N/2..3N/4]│  │ [3N/4..N]  │
└────────────┘  └────────────┘  └────────────┘  └────────────┘

No overlap = No data race = Perfect parallelism


Neighborhood Operation Parallelization (Row-Based):

                      INPUT BUFFER (read-only)
                 ┌────────────────────────────────┐
Row 0            │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 1   Thread 0 │ ████████████████████████████ ◄─┼─── reads rows 0,1,2
Row 2            │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 3   Thread 1 │ ████████████████████████████ ◄─┼─── reads rows 2,3,4
Row 4            │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
Row 5   Thread 2 │ ████████████████████████████ ◄─┼─── reads rows 4,5,6
Row 6            │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
...              │ ...                            │
                 └────────────────────────────────┘
                             │
                             ▼ Each thread writes to
                      OUTPUT BUFFER (write, no overlap)
                 ┌────────────────────────────────┐
Row 0            │                                │
Row 1   Thread 0 │ ████████████████████████████ ◄─┼─── writes row 1 only
Row 2            │                                │
Row 3   Thread 1 │ ████████████████████████████ ◄─┼─── writes row 3 only
Row 4            │                                │
Row 5   Thread 2 │ ████████████████████████████ ◄─┼─── writes row 5 only
Row 6            │                                │
...              │ ...                            │
                 └────────────────────────────────┘

Reading neighbors is fine (input unchanged).
Writing to different rows prevents races.

Key Components

// Core data structures
#include <array>
#include <cstdint>
#include <vector>

struct Pixel {
    uint8_t r, g, b;
};

struct Image {
    std::vector<Pixel> pixels;  // Contiguous memory
    size_t width;
    size_t height;

    Pixel& at(size_t x, size_t y) {
        return pixels[y * width + x];
    }

    const Pixel& at(size_t x, size_t y) const {
        return pixels[y * width + x];
    }
};

// Kernel for convolution operations
template<size_t N>
struct Kernel {
    std::array<std::array<float, N>, N> weights;

    float sum() const {
        float s = 0;
        for (auto& row : weights)
            for (float w : row)
                s += w;
        return s;
    }

    void normalize() {
        float s = sum();
        if (s != 0) {
            for (auto& row : weights)
                for (float& w : row)
                    w /= s;
        }
    }
};

// Common kernels
constexpr Kernel<3> BOX_BLUR_3x3 = {{
    {{1.0f/9, 1.0f/9, 1.0f/9}},
    {{1.0f/9, 1.0f/9, 1.0f/9}},
    {{1.0f/9, 1.0f/9, 1.0f/9}}
}};

constexpr Kernel<3> SOBEL_X = {{
    {{-1, 0, 1}},
    {{-2, 0, 2}},
    {{-1, 0, 1}}
}};

constexpr Kernel<3> SOBEL_Y = {{
    {{-1, -2, -1}},
    {{ 0,  0,  0}},
    {{ 1,  2,  1}}
}};

constexpr Kernel<3> SHARPEN = {{
    {{ 0, -1,  0}},
    {{-1,  5, -1}},
    {{ 0, -1,  0}}
}};

Project Layout

Module Structure:

parallel_image/
├── src/
│   ├── main.cpp              # CLI entry point
│   ├── image.hpp             # Image and Pixel structures
│   ├── image.cpp             # Image I/O (PPM format)
│   ├── filters.hpp           # Filter function declarations
│   ├── filters_point.cpp     # Point operations (brightness, contrast, etc.)
│   ├── filters_conv.cpp      # Convolution operations (blur, edge, etc.)
│   ├── filters_global.cpp    # Global operations (histogram, quantize)
│   ├── kernels.hpp           # Predefined convolution kernels
│   ├── benchmark.hpp         # Timing and reporting utilities
│   └── benchmark.cpp
├── tests/
│   ├── test_filters.cpp      # Unit tests for filters
│   ├── test_parallel.cpp     # Tests verifying parallel correctness
│   └── test_images/
│       ├── gradient.ppm      # Simple gradient for testing
│       ├── checkerboard.ppm  # Pattern for blur testing
│       └── edges.ppm         # Known edges for edge detection
├── CMakeLists.txt
└── README.md

Implementation Guide

Phase 1: PPM Image Loading and Saving (1-2 hours)

PPM (Portable Pixel Map) is one of the simplest image formats: the P3 variant is plain ASCII text listing pixel values, trivially parsed with stream extraction.

PPM Format Specification:

P3                    # Magic number (P3 = ASCII RGB)
4 3                   # Width Height
255                   # Maximum color value
255   0   0           # Row 0: Red, Green, Blue, White
  0 255   0
  0   0 255
255 255 255
  0   0   0           # Row 1: Black, Cyan, Magenta, Yellow
  0 255 255
255   0 255
255 255   0
128 128 128           # Row 2: Gray x4
128 128 128
128 128 128
128 128 128

Code Structure (implement yourself):

// image.hpp
#include <cstdint>
#include <vector>
#include <string>
#include <fstream>

struct Pixel {
    uint8_t r, g, b;

    // Useful for algorithms
    bool operator==(const Pixel& other) const {
        return r == other.r && g == other.g && b == other.b;
    }
};

struct Image {
    std::vector<Pixel> pixels;
    size_t width = 0;
    size_t height = 0;

    size_t size() const { return pixels.size(); }
    bool empty() const { return pixels.empty(); }

    Pixel& at(size_t x, size_t y);
    const Pixel& at(size_t x, size_t y) const;

    // Load from PPM file
    static Image load_ppm(const std::string& filename);

    // Save to PPM file
    void save_ppm(const std::string& filename) const;
};

Verification:

// Create a test image, save it, load it, verify it matches
Image img;
img.width = 2;
img.height = 2;
img.pixels = {{255,0,0}, {0,255,0}, {0,0,255}, {255,255,255}};
img.save_ppm("test.ppm");

Image loaded = Image::load_ppm("test.ppm");
assert(loaded.width == img.width);
assert(loaded.height == img.height);
assert(loaded.pixels == img.pixels);

Phase 2: Point Operations with Sequential Execution (2-3 hours)

Start with sequential implementations to ensure correctness.

Point Operations to Implement:

  1. Brightness: Add constant to all channels
  2. Contrast: Scale around midpoint (128)
  3. Invert: Subtract each channel from 255
  4. Grayscale: Convert RGB to single luminance value

// filters.hpp
#include <algorithm>   // std::clamp
#include <cstdint>
#include <execution>
#include "image.hpp"

// Clamp helper - essential for image processing
inline uint8_t clamp(int value) {
    return static_cast<uint8_t>(std::clamp(value, 0, 255));
}

// Brightness adjustment
void adjust_brightness(Image& img, int delta);

// Contrast adjustment
void adjust_contrast(Image& img, float factor);

// Invert colors
void invert(Image& img);

// Convert to grayscale
void grayscale(Image& img);

Sequential Implementation Pattern:

void adjust_brightness(Image& img, int delta) {
    std::transform(
        img.pixels.begin(),
        img.pixels.end(),
        img.pixels.begin(),
        [delta](Pixel p) {
            return Pixel{
                clamp(p.r + delta),
                clamp(p.g + delta),
                clamp(p.b + delta)
            };
        }
    );
}

Verification:

// Test brightness on known values
Image img;
img.pixels = {{100, 100, 100}};
adjust_brightness(img, 50);
assert(img.pixels[0].r == 150);

// Test clamping
img.pixels = {{250, 250, 250}};
adjust_brightness(img, 50);
assert(img.pixels[0].r == 255);  // Clamped, not 300

// Test invert
img.pixels = {{0, 128, 255}};
invert(img);
assert(img.pixels[0].r == 255);
assert(img.pixels[0].g == 127);
assert(img.pixels[0].b == 0);

Phase 3: Add Parallel Execution Policies (2-3 hours)

Now add execution policies and create benchmark infrastructure.

Parallel Version:

template<typename ExecutionPolicy>
void adjust_brightness(ExecutionPolicy&& policy, Image& img, int delta) {
    std::transform(
        std::forward<ExecutionPolicy>(policy),
        img.pixels.begin(),
        img.pixels.end(),
        img.pixels.begin(),
        [delta](Pixel p) {
            return Pixel{
                clamp(p.r + delta),
                clamp(p.g + delta),
                clamp(p.b + delta)
            };
        }
    );
}

// Usage:
adjust_brightness(std::execution::seq, img, 50);  // Sequential
adjust_brightness(std::execution::par, img, 50);  // Parallel

Benchmark Infrastructure:

#include <chrono>
#include <functional>
#include <string>

struct BenchmarkResult {
    std::string filter_name;
    double sequential_ms;
    double parallel_ms;
    double speedup;
    double efficiency;
};

template<typename Func>
double measure_ms(Func&& f, int iterations = 5) {
    // Warmup
    f();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration<double, std::milli>(end - start);
    return duration.count() / iterations;
}

BenchmarkResult benchmark_filter(
    const std::string& name,
    std::function<void()> seq_version,
    std::function<void()> par_version,
    int num_cores
) {
    double seq_ms = measure_ms(seq_version);
    double par_ms = measure_ms(par_version);

    return BenchmarkResult{
        name,
        seq_ms,
        par_ms,
        seq_ms / par_ms,
        (seq_ms / par_ms) / num_cores
    };
}

Verification:

// Parallel should produce same results as sequential
Image img1 = load_test_image();
Image img2 = img1;  // Copy

adjust_brightness(std::execution::seq, img1, 50);
adjust_brightness(std::execution::par, img2, 50);

assert(img1.pixels == img2.pixels);  // Results must match!

Phase 4: Neighborhood Operations with Double-Buffering (3-4 hours)

Neighborhood operations read from surrounding pixels, so you cannot write in-place.

Double-Buffering Pattern:

Image apply_convolution(const Image& input, const Kernel<3>& kernel) {
    Image output;
    output.width = input.width;
    output.height = input.height;
    output.pixels.resize(input.width * input.height);

    // Process interior pixels only (skip 1-pixel border)
    // You'll handle borders separately or use padding

    for (size_t y = 1; y < input.height - 1; ++y) {
        for (size_t x = 1; x < input.width - 1; ++x) {
            float r = 0, g = 0, b = 0;

            // Apply kernel
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    const Pixel& p = input.at(x + kx, y + ky);
                    float w = kernel.weights[ky + 1][kx + 1];
                    r += p.r * w;
                    g += p.g * w;
                    b += p.b * w;
                }
            }

            output.at(x, y) = Pixel{
                clamp(static_cast<int>(r)),
                clamp(static_cast<int>(g)),
                clamp(static_cast<int>(b))
            };
        }
    }

    return output;
}

Row-Based Parallel Convolution:

template<typename ExecutionPolicy>
Image apply_convolution(ExecutionPolicy&& policy,
                        const Image& input,
                        const Kernel<3>& kernel) {
    Image output;
    output.width = input.width;
    output.height = input.height;
    output.pixels.resize(input.width * input.height);

    // Create vector of row indices to parallelize
    std::vector<size_t> row_indices;
    for (size_t y = 1; y < input.height - 1; ++y) {
        row_indices.push_back(y);
    }

    // Process rows in parallel
    std::for_each(
        std::forward<ExecutionPolicy>(policy),
        row_indices.begin(),
        row_indices.end(),
        [&input, &output, &kernel](size_t y) {
            // Each thread processes one complete row; safe because every
            // thread writes to a different row of output
            for (size_t x = 1; x < input.width - 1; ++x) {
                float r = 0, g = 0, b = 0;
                for (int ky = -1; ky <= 1; ++ky) {
                    for (int kx = -1; kx <= 1; ++kx) {
                        const Pixel& p = input.at(x + kx, y + ky);
                        float w = kernel.weights[ky + 1][kx + 1];
                        r += p.r * w;
                        g += p.g * w;
                        b += p.b * w;
                    }
                }
                output.at(x, y) = Pixel{
                    clamp(static_cast<int>(r)),
                    clamp(static_cast<int>(g)),
                    clamp(static_cast<int>(b))
                };
            }
        }
    );

    // Handle border pixels (copy or pad)
    // ...

    return output;
}
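The border handling left as a comment above has several options (see the design questions later). One common strategy is reflecting out-of-range coordinates back into the image; a minimal sketch, where the helper name `reflect_index` is illustrative rather than from the text:

```cpp
#include <cstddef>

// Sketch of one border strategy ("reflect edges"): mirror an
// out-of-range coordinate back into the image, so the convolution can
// run over every pixel, borders included. The name reflect_index is
// an assumption, not from the text.
inline std::size_t reflect_index(long i, long size) {
    if (i < 0)      i = -i;                 // -1 -> 1
    if (i >= size)  i = 2 * size - 2 - i;   // size -> size - 2
    return static_cast<std::size_t>(i);
}
```

With this, `input.at(reflect_index(x + kx, width), reflect_index(y + ky, height))` lets the kernel loop cover the 1-pixel border without a separate pass.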

Filters to Implement:

  1. Box Blur (3x3): Simple averaging
  2. Gaussian Blur (5x5): Weighted averaging with Gaussian distribution
  3. Sobel Edge Detection: Gradient magnitude from horizontal and vertical Sobel
  4. Sharpen: Emphasize center relative to neighbors
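The guide uses `Kernel<3>` and names like `BOX_BLUR_3x3` without defining them. One possible shape, consistent with the `kernel.weights[ky + 1][kx + 1]` access above (the concrete type and constant names are assumptions):

```cpp
// Possible Kernel definition and 3x3 kernels for the filters above.
// The layout matches the weights[ky + 1][kx + 1] indexing used earlier.
template <int N>
struct Kernel {
    float weights[N][N];
};

constexpr Kernel<3> BOX_BLUR_3x3 = {{
    {1.f/9, 1.f/9, 1.f/9},
    {1.f/9, 1.f/9, 1.f/9},
    {1.f/9, 1.f/9, 1.f/9},
}};  // weights sum to 1: flat regions stay flat

constexpr Kernel<3> SHARPEN_3x3 = {{
    { 0, -1,  0},
    {-1,  5, -1},
    { 0, -1,  0},
}};  // also sums to 1, but amplifies the center relative to neighbors

constexpr Kernel<3> SOBEL_X_3x3 = {{
    {-1, 0, 1},
    {-2, 0, 2},
    {-1, 0, 1},
}};  // horizontal gradient; weights sum to 0
```

The 5x5 Gaussian follows the same pattern with `Kernel<5>`, and Sobel edge detection combines the X kernel with its transposed Y counterpart.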

Verification:

// Blur uniform image = same image
Image uniform;
uniform.pixels = std::vector<Pixel>(100 * 100, {128, 128, 128});
uniform.width = 100;
uniform.height = 100;

Image blurred = apply_convolution(std::execution::seq, uniform, BOX_BLUR_3x3);

// Interior pixels should still be 128 (blur of uniform = uniform)
assert(blurred.at(50, 50).r == 128);

// Edge detection on uniform = near zero
Image edges = apply_edge_detection(std::execution::seq, uniform);
assert(edges.at(50, 50).r < 5);  // Should be nearly zero

Phase 5: Global Operations (2-3 hours)

Some operations need information from the entire image first.

Histogram Equalization Example:

// Two-pass algorithm:
// Pass 1: Compute histogram (can be parallelized with local histograms)
// Pass 2: Apply lookup table (embarrassingly parallel)

struct Histogram {
    std::array<size_t, 256> r{};
    std::array<size_t, 256> g{};
    std::array<size_t, 256> b{};

    Histogram& operator+=(const Histogram& other) {
        for (int i = 0; i < 256; ++i) {
            r[i] += other.r[i];
            g[i] += other.g[i];
            b[i] += other.b[i];
        }
        return *this;
    }
};

Histogram compute_histogram(const Image& img) {
    Histogram h;
    for (const Pixel& p : img.pixels) {
        h.r[p.r]++;
        h.g[p.g]++;
        h.b[p.b]++;
    }
    return h;
}

// Parallel version using chunked thread-local histograms, merged at the end
template<typename ExecutionPolicy>
Histogram compute_histogram_parallel(ExecutionPolicy&& policy, const Image& img) {
    // Chunk the image and compute local histograms (assumes a non-empty image;
    // hardware_concurrency() may return 0, so guard against that too)
    size_t num_chunks = std::max<size_t>(1, std::thread::hardware_concurrency());
    num_chunks = std::min(num_chunks, img.pixels.size());  // no empty chunks
    size_t chunk_size = img.pixels.size() / num_chunks;

    std::vector<size_t> chunk_starts;
    for (size_t i = 0; i < num_chunks; ++i) {
        chunk_starts.push_back(i * chunk_size);
    }

    std::vector<Histogram> local_histograms(num_chunks);

    std::for_each(
        std::forward<ExecutionPolicy>(policy),
        chunk_starts.begin(),
        chunk_starts.end(),
        [&img, &local_histograms, chunk_size, num_chunks](size_t start) {
            size_t chunk_idx = start / chunk_size;
            size_t end = (chunk_idx == num_chunks - 1)
                       ? img.pixels.size()
                       : start + chunk_size;

            Histogram& h = local_histograms[chunk_idx];
            for (size_t i = start; i < end; ++i) {
                const Pixel& p = img.pixels[i];
                h.r[p.r]++;
                h.g[p.g]++;
                h.b[p.b]++;
            }
        }
    );

    // Merge histograms (can also be parallelized but usually fast enough)
    Histogram result;
    for (const Histogram& h : local_histograms) {
        result += h;
    }

    return result;
}
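Pass 2 of histogram equalization (the lookup table) is not shown above. A sketch of the textbook construction for a single channel, with the usual `cdf_min` offset and a guard for single-valued images:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of Pass 2's lookup table for one channel: the classic
// (cdf - cdf_min) * 255 / (total - cdf_min) mapping. Applying it is
// then a plain std::transform over the pixels (embarrassingly parallel).
std::array<uint8_t, 256> build_equalization_lut(
        const std::array<std::size_t, 256>& hist, std::size_t total_pixels) {
    std::array<uint8_t, 256> lut{};
    std::size_t cdf_min = 0;
    for (std::size_t c : hist) {
        if (c != 0) { cdf_min = c; break; }   // count in the first non-empty bin
    }
    std::size_t denom = total_pixels - cdf_min;
    std::size_t cdf = 0;
    for (int i = 0; i < 256; ++i) {
        cdf += hist[i];
        if (denom == 0) {
            lut[i] = static_cast<uint8_t>(i);  // single-valued image: identity
        } else if (cdf <= cdf_min) {
            lut[i] = 0;                        // bins before the first occupied one
        } else {
            lut[i] = static_cast<uint8_t>((cdf - cdf_min) * 255 / denom);
        }
    }
    return lut;
}
```

The helper name and exact rounding are assumptions; the formula itself is the standard histogram-equalization mapping.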

Color Quantization:

// Reduce image to N colors using simple median-cut or k-means
// This is more complex - implement a basic version first

void quantize_colors(Image& img, int num_colors) {
    // Simple approach: reduce precision per channel
    // More advanced: k-means clustering on color space

    // std::cbrt can land just below an exact root (cbrt(64) may evaluate
    // to 3.999...), so round to nearest: 8 colors -> 2 levels/channel,
    // 64 colors -> 4 levels/channel
    int levels_per_channel = std::max(1, static_cast<int>(std::lround(std::cbrt(num_colors))));
    int step = 256 / levels_per_channel;

    std::transform(
        std::execution::par,
        img.pixels.begin(),
        img.pixels.end(),
        img.pixels.begin(),
        [step](Pixel p) {
            return Pixel{
                static_cast<uint8_t>((p.r / step) * step + step / 2),
                static_cast<uint8_t>((p.g / step) * step + step / 2),
                static_cast<uint8_t>((p.b / step) * step + step / 2)
            };
        }
    );
}
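The per-channel mapping above is worth sanity-checking by hand. Isolated as a tiny helper (the function name is illustrative):

```cpp
#include <cstdint>

// The per-channel mapping from quantize_colors, isolated for inspection:
// each value snaps to the midpoint of its bucket. With step = 128
// (2 levels per channel), every channel becomes 64 or 192, so at most
// 2^3 = 8 distinct colors survive.
inline uint8_t quantize_channel(uint8_t c, int step) {
    return static_cast<uint8_t>((c / step) * step + step / 2);
}
```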

Phase 6: Comprehensive Benchmarking (2-3 hours)

Create a benchmark suite that tests various image sizes and operations.

void run_full_benchmark(const std::string& image_path) {
    Image img = Image::load_ppm(image_path);
    int num_cores = std::thread::hardware_concurrency();

    std::cout << "Image: " << image_path << "\n";
    std::cout << "Size: " << img.width << " x " << img.height
              << " = " << img.size() << " pixels\n";
    std::cout << "Cores: " << num_cores << "\n\n";

    std::vector<BenchmarkResult> results;

    // Point operations
    {
        Image copy = img;
        auto seq = [&copy]() {
            adjust_brightness(std::execution::seq, copy, 20);
        };
        auto par = [&copy]() {
            adjust_brightness(std::execution::par, copy, 20);
        };
        results.push_back(benchmark_filter("Brightness", seq, par, num_cores));
    }

    // Add more filters...

    // Neighborhood operations
    {
        auto seq = [&img]() {
            return apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);
        };
        auto par = [&img]() {
            return apply_convolution(std::execution::par, img, BOX_BLUR_3x3);
        };
        results.push_back(benchmark_filter("Box Blur 3x3", seq, par, num_cores));
    }

    // Print results table
    print_benchmark_table(results);

    // Analyze and warn about overhead
    for (const auto& r : results) {
        if (r.speedup < 1.0) {
            std::cout << "WARNING: " << r.filter_name
                      << " is SLOWER with parallelism ("
                      << r.speedup << "x)\n";
        }
    }
}
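`print_benchmark_table` is called above but never defined. One possible implementation, assuming the `BenchmarkResult` fields from Phase 3 (the formatting choices are illustrative):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Assumed shape from Phase 3: name, times, speedup, efficiency.
struct BenchmarkResult {
    std::string filter_name;
    double seq_ms;
    double par_ms;
    double speedup;     // seq_ms / par_ms
    double efficiency;  // speedup / num_cores
};

// One row, formatted; split out so it is easy to test.
std::string format_row(const BenchmarkResult& r) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "%-16s %10.2f %10.2f %7.2fx %9.0f%%",
                  r.filter_name.c_str(), r.seq_ms, r.par_ms,
                  r.speedup, r.efficiency * 100.0);
    return std::string(buf);
}

void print_benchmark_table(const std::vector<BenchmarkResult>& results) {
    std::printf("%-16s %10s %10s %8s %10s\n",
                "Filter", "Seq (ms)", "Par (ms)", "Speedup", "Efficiency");
    for (const auto& r : results) {
        std::printf("%s\n", format_row(r).c_str());
    }
}
```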

Test Different Image Sizes:

void benchmark_scaling() {
    std::vector<std::pair<int, int>> sizes = {
        {64, 64},       // 4K pixels
        {256, 256},     // 65K pixels
        {1024, 1024},   // 1M pixels
        {2048, 2048},   // 4M pixels
        {4096, 4096}    // 16M pixels
    };

    std::cout << "Finding parallel crossover point...\n\n";
    std::cout << "Size\t\tSeq (ms)\tPar (ms)\tSpeedup\n";
    std::cout << "----\t\t--------\t--------\t-------\n";

    for (auto [w, h] : sizes) {
        Image img = create_test_image(w, h);

        auto seq = [&img]() {
            adjust_brightness(std::execution::seq, img, 20);
        };
        auto par = [&img]() {
            adjust_brightness(std::execution::par, img, 20);
        };

        double seq_ms = measure_ms(seq);
        double par_ms = measure_ms(par);

        std::cout << w << "x" << h << "\t\t"
                  << std::fixed << std::setprecision(2)
                  << seq_ms << "\t\t"
                  << par_ms << "\t\t"
                  << seq_ms / par_ms << "x\n";
    }
}

Phase 7: CLI and Integration (1-2 hours)

Create a polished command-line interface.

#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cout << "Usage: parallel_image [options]\n\n"
                  << "Options:\n"
                  << "  --input <file.ppm>    Input image\n"
                  << "  --output <file.ppm>   Output image\n"
                  << "  --filter <name>       Filter to apply\n"
                  << "  --benchmark           Run benchmarks\n"
                  << "  --small               Use small test images\n"
                  << "\n"
                  << "Filters:\n"
                  << "  brightness <delta>    Adjust brightness\n"
                  << "  contrast <factor>     Adjust contrast\n"
                  << "  invert                Invert colors\n"
                  << "  grayscale             Convert to grayscale\n"
                  << "  blur                  Box blur 3x3\n"
                  << "  gaussian              Gaussian blur 5x5\n"
                  << "  edge                  Sobel edge detection\n"
                  << "  sharpen               Sharpen\n"
                  << "  quantize <colors>     Reduce to N colors\n";
        return 0;
    }

    // Parse arguments and process...
}
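The parsing left as a comment above can start as simply as collecting `--flag [value]` pairs. A hedged sketch; the helper and its behavior are assumptions, not a full CLI parser:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Minimal sketch of the parsing left as "..." above: gather --key value
// pairs into a map. Value-less flags (e.g. --benchmark) map to "".
// Real code would validate filter names and their arguments.
std::map<std::string, std::string> parse_flags(const std::vector<std::string>& args) {
    std::map<std::string, std::string> flags;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i].rfind("--", 0) == 0) {          // token starts with "--"
            std::string key = args[i].substr(2);
            if (i + 1 < args.size() && args[i + 1].rfind("--", 0) != 0) {
                flags[key] = args[++i];             // consume the value token
            } else {
                flags[key] = "";                    // bare flag
            }
        }
    }
    return flags;
}
```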

Questions to Guide Your Design

Algorithm Design

  1. Why use std::transform instead of a raw parallel loop?
    • Think about: iterator requirements, algorithm semantics, portability
  2. When should you use par_unseq vs just par?
    • Think about: SIMD requirements, lock-free guarantees, allocator use
  3. How do you handle image borders in convolution?
    • Options: skip borders, pad with zeros, reflect edges, wrap around

Performance Considerations

  1. What is the minimum image size where parallelism helps?
    • This depends on your system - measure it!
  2. Why might point operations show lower parallel efficiency than convolutions?
    • Think about: work per pixel, memory bandwidth, CPU intensity
  3. How does cache line size affect parallel performance?
    • Consider: false sharing, memory access patterns
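For question 3, a concrete picture helps: if per-thread counters sit in adjacent array slots they share a cache line, and every increment invalidates the other cores' copies. A sketch of the padded layout that avoids this, where 64 bytes is an assumed, common cache-line size:

```cpp
#include <cstddef>

// Prone to false sharing: eight threads incrementing counts[0..7]
// ping-pong the same cache line between cores.
struct PackedCounters {
    std::size_t counts[8];
};

// Avoids it: alignas + padding gives each counter its own cache line.
// 64 is an assumption; std::hardware_destructive_interference_size
// (C++17, <new>) reports the real value where implemented.
struct alignas(64) PaddedCounter {
    std::size_t count = 0;
    char pad[64 - sizeof(std::size_t)];
};
```

This is also why the chunked-histogram design in Phase 5 works well: each `Histogram` is kilobytes in size, so the per-thread structures never share a line.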

Design Decisions

  1. Why use a contiguous vector<Pixel> instead of vector<vector<Pixel>>?
    • Think about: cache locality, memory allocation, iterator compatibility
  2. Should the Pixel struct have padding?
    • Consider: sizeof(Pixel), alignment, SIMD implications
  3. For histogram computation, is it better to use thread-local histograms or atomic operations?
    • Trade-offs: contention vs. memory usage vs. final merge cost

Thinking Exercise

Before implementing, work through this scenario by hand:

Scenario: You have a 1000x1000 image (1 million pixels) and 4 cores. Each pixel takes 100 nanoseconds to process for brightness adjustment.

Questions:

  1. Sequential time: 1,000,000 pixels * 100 ns = ?
    • Answer: 100 milliseconds
  2. Theoretical parallel time (perfect scaling): 100 ms / 4 = ?
    • Answer: 25 milliseconds
  3. With 1ms thread startup overhead and assuming perfect work distribution:
    • Answer: 25ms + 1ms = 26ms, speedup = 100/26 = 3.85x (96% efficiency)
  4. If there’s 5% sequential overhead (loading, saving, etc.):
    • Sequential: 5ms + 100ms = 105ms
    • Parallel: 5ms + 25ms = 30ms
    • Speedup = 105/30 = 3.5x (Amdahl with serial fraction 5/105 ≈ 4.8% gives 1/(0.048 + 0.952/4) ≈ 3.5x - it matches!)
  5. For a 10x10 image (100 pixels):
    • Sequential: 100 * 100ns = 10 microseconds = 0.01ms
    • Parallel overhead (1ms) » work (0.01ms)
    • Parallel will be ~100x slower!

Key Insight: The crossover point where parallelism helps depends on:

  • Work per element
  • Thread overhead
  • Number of cores
  • Memory bandwidth

For your implementation, find this crossover point empirically!
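The exercise numbers can be double-checked in code; Amdahl's Law as a one-liner, matching the formula used in question 4:

```cpp
// Amdahl's Law: maximum speedup given the serial fraction of the work.
// With serial_fraction = 0.05 and 4 cores this reproduces the ~3.48x
// ceiling from the thinking exercise.
double amdahl_speedup(double serial_fraction, int cores) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
```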


Testing Strategy

Unit Tests

// test_filters.cpp
#include <cassert>
#include "filters.hpp"

void test_brightness_clamping() {
    Image img;
    img.pixels = {{250, 10, 128}};
    img.width = 1;
    img.height = 1;

    adjust_brightness(std::execution::seq, img, 50);

    assert(img.pixels[0].r == 255);  // Clamped from 300
    assert(img.pixels[0].g == 60);   // 10 + 50
    assert(img.pixels[0].b == 178);  // 128 + 50
}

void test_invert_roundtrip() {
    Image img;
    img.pixels = {{0, 128, 255}, {100, 150, 200}};
    img.width = 2;
    img.height = 1;

    Image original = img;

    invert(img);
    invert(img);

    assert(img.pixels == original.pixels);  // Double invert = identity
}

void test_blur_uniform() {
    // Blurring a uniform image should produce the same uniform image
    Image img;
    img.width = 10;
    img.height = 10;
    img.pixels = std::vector<Pixel>(100, {100, 100, 100});

    Image blurred = apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);

    // Check interior pixels (not borders)
    for (size_t y = 1; y < img.height - 1; ++y) {
        for (size_t x = 1; x < img.width - 1; ++x) {
            assert(blurred.at(x, y).r == 100);
        }
    }
}

void test_edge_detection_uniform() {
    // Edge detection on uniform image should produce near-zero values
    Image img;
    img.width = 10;
    img.height = 10;
    img.pixels = std::vector<Pixel>(100, {128, 128, 128});

    Image edges = apply_edge_detection(std::execution::seq, img);

    for (size_t y = 1; y < img.height - 1; ++y) {
        for (size_t x = 1; x < img.width - 1; ++x) {
            assert(edges.at(x, y).r < 5);  // Should be near zero
        }
    }
}

void test_edge_detection_gradient() {
    // Vertical gradient should produce horizontal edges
    Image img;
    img.width = 10;
    img.height = 10;
    img.pixels.reserve(100);

    for (size_t y = 0; y < 10; ++y) {
        uint8_t value = static_cast<uint8_t>(y * 25);
        for (size_t x = 0; x < 10; ++x) {
            img.pixels.push_back({value, value, value});
        }
    }

    Image edges = apply_edge_detection(std::execution::seq, img);

    // Interior should have non-zero values (edges detected)
    bool has_edges = false;
    for (size_t y = 2; y < 8; ++y) {
        for (size_t x = 2; x < 8; ++x) {
            if (edges.at(x, y).r > 10) {
                has_edges = true;
            }
        }
    }
    assert(has_edges);
}

Parallel Correctness Tests

// test_parallel.cpp

void test_parallel_matches_sequential() {
    // For all filters, parallel version must produce identical results

    Image img = create_random_test_image(100, 100);

    // Test brightness
    {
        Image seq_copy = img;
        Image par_copy = img;

        adjust_brightness(std::execution::seq, seq_copy, 50);
        adjust_brightness(std::execution::par, par_copy, 50);

        assert(seq_copy.pixels == par_copy.pixels);
    }

    // Test blur
    {
        Image seq_result = apply_convolution(std::execution::seq, img, BOX_BLUR_3x3);
        Image par_result = apply_convolution(std::execution::par, img, BOX_BLUR_3x3);

        assert(seq_result.pixels == par_result.pixels);
    }

    // Add tests for all filters...
}

void test_deterministic_results() {
    // Running parallel version multiple times should give same result

    Image img = create_random_test_image(100, 100);

    std::vector<Image> results;
    for (int i = 0; i < 10; ++i) {
        Image copy = img;
        adjust_brightness(std::execution::par, copy, 50);
        results.push_back(copy);
    }

    for (size_t i = 1; i < results.size(); ++i) {
        assert(results[i].pixels == results[0].pixels);
    }
}
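`create_random_test_image` is assumed in the tests above. For the determinism test to be meaningful, it should itself be deterministic, i.e. seeded. One possible sketch; the `Pixel`/`Image` shapes follow the text and the default seed is an arbitrary choice:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Pixel {
    uint8_t r, g, b;
    bool operator==(const Pixel& o) const { return r == o.r && g == o.g && b == o.b; }
};

struct Image {
    std::size_t width = 0, height = 0;
    std::vector<Pixel> pixels;
};

// Deterministic "random" image: a fixed seed means the same (w, h)
// always yields the same pixels, keeping these tests reproducible.
Image create_random_test_image(std::size_t w, std::size_t h, uint32_t seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    Image img;
    img.width = w;
    img.height = h;
    img.pixels.reserve(w * h);
    for (std::size_t i = 0; i < w * h; ++i) {
        img.pixels.push_back({static_cast<uint8_t>(dist(rng)),
                              static_cast<uint8_t>(dist(rng)),
                              static_cast<uint8_t>(dist(rng))});
    }
    return img;
}
```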

Integration Tests

void test_full_pipeline() {
    // Apply multiple filters in sequence
    Image img = Image::load_ppm("test_images/photo.ppm");

    adjust_brightness(std::execution::par, img, 20);
    Image blurred = apply_convolution(std::execution::par, img, GAUSSIAN_5x5);
    Image edges = apply_edge_detection(std::execution::par, blurred);

    edges.save_ppm("test_output/pipeline_result.ppm");

    // Verify output file exists and is valid
    Image loaded = Image::load_ppm("test_output/pipeline_result.ppm");
    assert(!loaded.empty());
}

void test_large_image_performance() {
    // Verify parallel is faster for large images
    Image img = create_test_image(2048, 2048);

    auto seq_time = measure_ms([&img]() {
        adjust_brightness(std::execution::seq, img, 20);
    });

    auto par_time = measure_ms([&img]() {
        adjust_brightness(std::execution::par, img, 20);
    });

    double speedup = seq_time / par_time;

    std::cout << "Large image speedup: " << speedup << "x\n";
    assert(speedup > 1.5);  // Should see significant speedup
}

Common Pitfalls

Pitfall 1: Iterator Invalidation

Symptom: Crash or undefined behavior during parallel execution

Cause: Resizing container during parallel operation

// WRONG: May reallocate during parallel loop
std::for_each(std::execution::par, indices.begin(), indices.end(),
    [&output](size_t i) {
        output.push_back(compute(i));  // BAD: push_back may reallocate!
    });

Fix: Pre-allocate output container

// CORRECT: Pre-allocate, then assign
output.resize(indices.size());
std::for_each(std::execution::par, indices.begin(), indices.end(),
    [&output](size_t i) {
        output[i] = compute(i);  // GOOD: No reallocation
    });

Pitfall 2: Data Races in Neighborhood Operations

Symptom: Image has artifacts, varies between runs

Cause: Reading and writing same buffer

// WRONG: Blur reads neighbors while other threads modify them
std::for_each(std::execution::par, ...,
    [&img](size_t idx) {
        img.pixels[idx] = average_of_neighbors(img, idx);  // Race!
    });

Fix: Double-buffering

// CORRECT: Read from input, write to output
std::for_each(std::execution::par, ...,
    [&input, &output](size_t idx) {
        output.pixels[idx] = average_of_neighbors(input, idx);  // Safe
    });

Pitfall 3: Using par_unseq with Locks

Symptom: Deadlock or crash

Cause: par_unseq may interleave operations on same thread

std::mutex mtx;
std::for_each(std::execution::par_unseq, ...,
    [&mtx](auto& item) {
        std::lock_guard lock(mtx);  // DANGER: May deadlock!
        // par_unseq permits interleaving two element operations on the
        // SAME thread (e.g. for vectorization); if the second tries to
        // take the lock the first already holds, the thread deadlocks
        // against itself
    });

Fix: Use par or redesign to avoid locks

// Option 1: Use par instead of par_unseq
std::for_each(std::execution::par, ...);

// Option 2: Use thread-local storage
thread_local std::vector<int> local_results;

Pitfall 4: Measuring Debug Builds

Symptom: No speedup or slowdown from parallelism

Cause: Debug builds disable optimizations

# WRONG: Debug build
g++ -std=c++17 -g main.cpp -ltbb

# CORRECT: Release build with optimizations
g++ -std=c++17 -O3 -march=native main.cpp -ltbb

Pitfall 5: Not Warming Up Cache

Symptom: First run is much slower, inconsistent benchmarks

Cause: Cold cache, lazy initialization

Fix: Warm up before measuring

template<typename Func>
double measure_ms(Func&& f, int iterations = 5) {
    // Warmup run - don't count this
    f();

    // steady_clock is monotonic - the right choice for measuring intervals
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count()
           / iterations;
}

Pitfall 6: Ignoring Memory Bandwidth

Symptom: Low efficiency even with perfect parallelism

Cause: Memory bandwidth is the bottleneck, not CPU

// Point operations may be memory-bound
// Each pixel is 3 bytes, read once, written once
// A 12-megapixel image (e.g. 4000x3000) = 12M pixels * 3 bytes * 2 (read+write) = 72 MB
// Memory bandwidth: ~50 GB/s (typical desktop)
// Time to move data: 72 MB / 50 GB/s = 1.4 ms
// If compute is faster than this, memory is the bottleneck

Mitigation:

  • Fuse multiple operations to amortize memory access
  • Process data in cache-sized chunks
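"Fuse multiple operations" means one traversal instead of several: for example, brightness-then-invert in a single transform halves the bytes moved versus two passes. A sketch, with the `Pixel` layout assumed from earlier phases; `std::execution::par` can be added as the first argument as elsewhere in this guide:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Pixel { uint8_t r, g, b; };  // layout assumed from earlier phases

// Fused pass: brightness adjustment and inversion in one traversal, so
// each pixel is read and written once instead of twice. Assumes
// delta >= 0; a lower clamp would be needed for negative deltas.
void brighten_then_invert(std::vector<Pixel>& pixels, int delta) {
    std::transform(pixels.begin(), pixels.end(), pixels.begin(),
        [delta](Pixel p) {
            auto adj = [delta](uint8_t c) {
                int v = std::min(static_cast<int>(c) + delta, 255);  // brightness, clamped
                return static_cast<uint8_t>(255 - v);                // invert
            };
            return Pixel{adj(p.r), adj(p.g), adj(p.b)};
        });
}
```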

Extensions & Challenges

Extension 1: Streaming Large Images

Handle images too large to fit in memory:

class StreamingImageProcessor {
    // Process image in tiles
    void process_tiled(const std::string& input_path,
                       const std::string& output_path,
                       size_t tile_size = 1024) {
        // Read and process one tile at a time
        // Tiles can be processed in parallel
        // Handle tile boundaries for neighborhood operations
    }
};

Extension 2: GPU Acceleration with SYCL

Port to GPU using SYCL (oneAPI):

#include <sycl/sycl.hpp>

void brightness_gpu(sycl::queue& q, Image& img, int delta) {
    sycl::buffer<Pixel> buf(img.pixels.data(), sycl::range<1>(img.pixels.size()));

    q.submit([&](sycl::handler& h) {
        auto acc = buf.get_access<sycl::access::mode::read_write>(h);

        h.parallel_for(sycl::range<1>(img.pixels.size()), [=](sycl::id<1> i) {
            Pixel& p = acc[i];
            p.r = clamp(p.r + delta);
            p.g = clamp(p.g + delta);
            p.b = clamp(p.b + delta);
        });
    });
    // buf's destructor waits for the kernel and copies results back to img.pixels
}

Extension 3: Vectorized Operations with std::simd

Use explicit SIMD with std::experimental::simd from the Parallelism TS v2 (adopted as std::simd in C++26):

#include <experimental/simd>

namespace stdx = std::experimental;

void brightness_simd(Image& img, int delta) {
    using simd_t = stdx::native_simd<uint8_t>;
    constexpr size_t simd_size = simd_t::size();

    // Process in SIMD-width chunks
    uint8_t* data = reinterpret_cast<uint8_t*>(img.pixels.data());
    size_t total_bytes = img.pixels.size() * 3;

    size_t i = 0;
    for (; i + simd_size <= total_bytes; i += simd_size) {
        simd_t v;
        v.copy_from(data + i, stdx::element_aligned);

        // uint8_t addition wraps around, so min-after-add would NOT clamp;
        // pin inputs to 255 - delta first (assumes delta in [0, 255])
        v = stdx::min(v, simd_t(static_cast<uint8_t>(255 - delta)))
          + simd_t(static_cast<uint8_t>(delta));

        v.copy_to(data + i, stdx::element_aligned);
    }

    // Handle remainder
    for (; i < total_bytes; ++i) {
        data[i] = clamp(data[i] + delta);
    }
}

Extension 4: Real-Time Video Processing

Apply filters to webcam feed:

// Using OpenCV for video capture
void realtime_processing() {
    cv::VideoCapture cap(0);

    while (true) {
        cv::Mat frame;
        cap >> frame;

        // Convert to Image, process, convert back
        Image img = opencv_to_image(frame);
        Image edges = apply_edge_detection(std::execution::par, img);
        frame = image_to_opencv(edges);

        cv::imshow("Edges", frame);
        if (cv::waitKey(1) == 'q') break;
    }
}

Extension 5: Automatic Parallelism Selection

Decide at runtime whether to use parallelism:

template<typename Func>
auto auto_parallel(const Image& img, Func&& filter) {
    // Heuristic: use parallel for images > 100K pixels
    constexpr size_t PARALLEL_THRESHOLD = 100000;

    if (img.size() > PARALLEL_THRESHOLD) {
        return filter(std::execution::par);
    } else {
        return filter(std::execution::seq);
    }
}

// Usage:
auto result = auto_parallel(img, [&](auto policy) {
    return apply_convolution(policy, img, BOX_BLUR_3x3);
});

Resources

Primary References

Resource Description
“C++ Concurrency in Action, 2nd Ed” Ch. 10 Comprehensive coverage of parallel algorithms
cppreference.com - Execution policies Reference documentation for execution policies
Intel TBB Documentation Implementation details of parallel algorithms

Secondary References

Resource Description
“Digital Image Processing” by Gonzalez & Woods Image processing algorithms
“Computer Graphics from Scratch” by Gambetta Fundamental graphics concepts
“What Every Programmer Should Know About Memory” Understanding cache effects

Online Resources

Resource URL
C++17 Parallel Algorithms Paper P0024R2
Intel TBB Tutorial oneapi.io/tbb
Netpbm Format Specification netpbm.sourceforge.net

Tools

Tool Purpose
Intel VTune Performance profiling
perf Linux performance analysis
Valgrind/Helgrind Thread error detection
ThreadSanitizer (TSan) Runtime race detection

Self-Assessment Checklist

Conceptual Understanding

  • I can explain the difference between seq, par, and par_unseq execution policies
  • I understand why par_unseq prohibits locks and memory allocation
  • I can apply Amdahl’s Law to calculate theoretical speedup limits
  • I know when parallel overhead exceeds parallel benefit
  • I understand false sharing and how to avoid it
  • I can identify embarrassingly parallel operations

Implementation Skills

  • My PPM loader/saver handles files correctly
  • Point operations (brightness, contrast, invert) work correctly
  • Neighborhood operations use double-buffering
  • Parallel versions produce identical results to sequential
  • My code compiles with -O3 optimization

Performance Analysis

  • I benchmarked sequential vs. parallel for multiple image sizes
  • I found the crossover point where parallelism becomes beneficial
  • I calculated parallel efficiency (speedup / cores)
  • I understand why my efficiency is < 100%
  • I can explain performance differences between filters

Testing

  • I have unit tests for all filters
  • I verified parallel correctness (same results as sequential)
  • I tested edge cases (tiny images, uniform images, gradients)
  • My tests run with ThreadSanitizer without errors

Submission/Completion Criteria

Your project is complete when:

Functional Requirements

  1. Core Features
    • Loads and saves PPM images correctly
    • Implements at least 4 point operations (brightness, contrast, invert, grayscale)
    • Implements at least 3 neighborhood operations (blur, edge, sharpen)
    • All filters work with both seq and par execution policies
  2. Benchmarking
    • Produces formatted benchmark output showing speedup
    • Tests multiple image sizes (at least 3: small, medium, large)
    • Identifies crossover point where parallelism helps
    • Reports parallel efficiency (speedup / cores)
  3. Code Quality
    • Clean separation between image I/O, filters, and benchmarking
    • No data races (verified with ThreadSanitizer)
    • Compiles without warnings on -Wall -Wextra
    • Includes unit tests for correctness

Performance Requirements

  • Achieves > 50% parallel efficiency on 4+ core system for large images (1000x1000+)
  • Correctly identifies when parallelism is counterproductive
  • Neighborhood operations show higher efficiency than point operations (due to more work/pixel)

Documentation

  • README explains how to build and run
  • Comments explain algorithm choices (double-buffering, row parallelization)
  • Benchmark results documented with system specs

Stretch Goals (Optional)

  • Implement Gaussian blur with 5x5 kernel
  • Add histogram equalization (global operation)
  • Support additional image formats (PNG via stb_image)
  • Add --auto-parallel flag that chooses based on image size
  • Implement separable convolution for efficiency

Moving Forward

After completing this project:

  1. Next Project: P08: Parallel Sort Benchmark Suite - Apply parallel algorithms to sorting

  2. Related Projects:
  3. Deepen Understanding:
    • Profile with Intel VTune to understand memory bandwidth limits
    • Experiment with different chunk sizes for histogram computation
    • Compare TBB’s parallel_for with std::for_each(par, ...)

Key Insight to Carry Forward: C++17 parallel algorithms make parallelism syntactically trivial, but achieving good performance requires understanding the cost model. The difference between 2x and 6x speedup often comes down to:

  • Minimizing memory bandwidth pressure
  • Choosing appropriate granularity
  • Avoiding synchronization in hot paths
  • Measuring, not guessing

When you reach for std::execution::par, ask: “Is the work per element enough to justify the overhead?” This project taught you to answer that question empirically.


This project bridges high-level parallel algorithms with practical performance engineering. The skills transfer directly to production systems where the difference between “parallel but slow” and “efficiently parallel” can mean the difference between meeting latency requirements or failing them.