Project 9: System Call and I/O Latency Profiler

Project Overview

Attribute              Details
─────────────────────  ─────────────────────────────────────────────────────
Difficulty             Intermediate
Time Estimate          1-2 weeks
Primary Language       C
Alternative Languages  Go, Rust, Python
Knowledge Area         I/O and Syscalls
Tools Required         perf, strace, bpftrace
Primary Reference      "The Linux Programming Interface" by Michael Kerrisk

Learning Objectives

By completing this project, you will be able to:

  1. Explain syscall mechanics including user-kernel transitions and overhead
  2. Trace and time individual syscalls with microsecond precision
  3. Identify I/O-bound workloads and distinguish them from CPU-bound ones
  4. Measure I/O latency distributions for different operations
  5. Correlate syscall patterns with application behavior
  6. Apply I/O optimization techniques (batching, async I/O)

Deep Theoretical Foundation

The Syscall Boundary

User programs cannot directly access hardware. They must request services from the kernel via system calls:

User Space                        Kernel Space
┌────────────────────┐           ┌────────────────────┐
│ Application        │           │ Kernel             │
│                    │  syscall  │                    │
│ read(fd, buf, n) ──┼───────────►  sys_read()        │
│                    │           │    ↓               │
│                    │           │  VFS layer         │
│                    │           │    ↓               │
│                    │           │  Block device      │
│                    │           │    ↓               │
│                    │  return   │  Hardware I/O      │
│ ← result ──────────┼───────────┤                    │
└────────────────────┘           └────────────────────┘

Syscall Boundary Diagram
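
To make the cost of this round trip concrete, the sketch below times a single read() from user space with clock_gettime(). It assumes a small readable file named testfile in the current directory (an illustrative name, not part of the spec); on a cached file the measured time is mostly mode-switch and kernel-processing overhead.

/* Minimal sketch: time one user -> kernel -> user round trip.
 * Assumes a readable file named "testfile" exists (illustrative). */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = read(fd, buf, sizeof buf);   /* the syscall being measured */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
            + (t1.tv_nsec - t0.tv_nsec);
    printf("read() returned %zd bytes in %ld ns\n", n, ns);
    close(fd);
    return 0;
}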

Syscall Overhead Components

1. Mode Switch (1-2 µs)

  • Save user registers
  • Switch to kernel stack
  • Security checks (Spectre mitigations add ~100 ns)

2. Kernel Processing (varies)

  • Path lookup for file operations
  • Permission checks
  • Buffer management

3. Hardware I/O (huge variance)

  • NVMe SSD: 10-200 µs
  • SATA SSD: 50-500 µs
  • HDD: 5,000-15,000 µs
  • Network: 100-100,000 µs

4. Return to User (1-2 µs)

  • Copy results to user buffers
  • Restore user context

Categories of Syscalls

Fast Syscalls (< 10 µs)

getpid()       // Trivial kernel work (older glibc versions cached it)
gettimeofday() // Reads shared memory (vDSO), often no kernel entry at all

Medium Syscalls (10-100 µs)

open()        // Path lookup, permission check
stat()        // File metadata (may be cached)
mmap()        // Virtual memory manipulation

Slow Syscalls (100+ µs)

read()        // If data not in cache → disk I/O
write()       // If synchronous → wait for disk
fsync()       // Wait for all data to reach disk
connect()     // Network handshake

Potentially Blocking Syscalls (unbounded)

read()        // On slow device or network
accept()      // Waiting for connection
select()      // Waiting for events
sleep()       // Explicit waiting
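
These categories are easy to sanity-check with a micro-benchmark. The sketch below is a rough measurement, not a rigorous benchmark: calling through syscall(SYS_getpid) forces a genuine kernel entry on every iteration (plain getpid() may be optimized by libc), so the per-call average approximates the pure mode-switch cost of a fast syscall.

/* Minimal sketch: average cost of a "fast" syscall on Linux/glibc. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    enum { N = 1000000 };
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);              /* user -> kernel -> user */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("getpid via syscall(): %.0f ns/call\n", ns / N);
    return 0;
}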

I/O Latency Stack

Application Request
        │
        ▼
┌────────────────────┐
│ System Call        │  1-5 µs overhead
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ VFS Layer          │  Path resolution, caching
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ File System        │  ext4, XFS, btrfs logic
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ Block Layer        │  I/O scheduling, merging
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ Device Driver      │  Hardware-specific
└─────────┬──────────┘
          ▼
┌────────────────────┐
│ Hardware           │  Actual I/O
└────────────────────┘

Each layer adds latency and variability

Buffering and Caching Effects

Page Cache: Linux caches recently accessed file data in RAM:

First read:   Application → Syscall → Disk → Cache → Return
Second read:  Application → Syscall → Cache → Return (100x faster)

Write Buffering: Writes go to the page cache, not immediately to disk:

write():   Copy to page cache → Return (fast, < 10 µs)
           Background: Page cache → Disk (delayed)

fsync():   Wait for all dirty pages → Disk (slow, 1-100 ms)

This means write() latency is deceptively low until fsync().
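
The gap is easy to demonstrate. A minimal sketch, writing one 4 KiB block to a scratch file (wb_test is an illustrative name): the write() typically returns in microseconds, while the fsync() takes milliseconds on most storage.

/* Minimal sketch: write() hits the page cache; fsync() waits for disk.
 * Creates a scratch file "wb_test" in the current directory. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static long elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    char data[4096];
    memset(data, 'x', sizeof data);
    int fd = open("wb_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (write(fd, data, sizeof data) < 0) perror("write");
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* page-cache copy done */
    if (fsync(fd) < 0) perror("fsync");
    clock_gettime(CLOCK_MONOTONIC, &t2);   /* data durable on disk */

    printf("write: %ld ns   fsync: %ld ns\n",
           elapsed_ns(t0, t1), elapsed_ns(t1, t2));
    close(fd);
    return 0;
}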


Complete Project Specification

What You're Building

A syscall profiling toolkit called syscall_prof that:

  1. Traces syscalls for a target process with per-call timing
  2. Categorizes syscalls by type and latency
  3. Identifies slow operations and their sources
  4. Generates latency distributions per syscall type
  5. Recommends I/O optimizations based on patterns

Functional Requirements

syscall_prof trace --pid <pid> --duration <sec> --output <file.csv>
syscall_prof trace --command "<cmd>" --output <file.csv>
syscall_prof analyze --input <file.csv> --top <n>
syscall_prof histogram --input <file.csv> --syscall <name>
syscall_prof report --input <file.csv> --output <report.md>

Example Output

Syscall Profiling Report
═══════════════════════════════════════════════════
Process: database_server (PID 12345)
Duration: 60 seconds
Total syscalls: 1,247,891

Syscall Summary:
─────────────────────────────────────────────────────
Syscall        Count      Total Time   Avg      p99
read           423,891    12.4 s       29 µs    2.1 ms
write          312,456    3.2 s        10 µs    890 µs
fsync          2,341      48.7 s       20.8 ms  127 ms
epoll_wait     89,234     142.1 s      1.6 ms   15 ms
open           15,678     0.8 s        51 µs    320 µs
close          15,678     0.1 s        6 µs     42 µs

I/O Pattern Analysis:
─────────────────────────────────────────────────────
Category           Time      % of Total
Disk I/O (fsync)   48.7 s    76.2%      ← BOTTLENECK
Network wait       142.1 s   N/A (async)
Disk read          8.2 s     12.8%
Disk write         3.2 s     5.0%
Other              3.8 s     6.0%

Critical Findings:
─────────────────────────────────────────────────────
1. fsync() dominates latency (76% of I/O time)
   - Called 2,341 times (39 per second)
   - Each fsync waits 20.8 ms on average
   - p99: 127 ms → causes tail latency spikes

   Recommendation:
   • Batch writes before fsync (group commit)
   • Consider O_DIRECT + io_uring for control
   • Review whether all fsyncs are necessary

2. read() shows bimodal distribution
   - 95% complete in < 100 µs (cached)
   - 5% take 1-10 ms (disk hit)

   Recommendation:
   • Working set may exceed RAM
   • Consider memory-mapped I/O for hot data
   • Pre-fetch predictable access patterns

Latency Distribution (read):
─────────────────────────────────────────────────────
    < 10 µs   ██████████████████████████████████  34.2%
   10-50 µs   ████████████████████████            24.1%
  50-100 µs   ██████████████████████              22.3%
 100-500 µs   ████████████                        11.8%
  500 µs-1ms  ███                                  3.2%
    1-10 ms   ████                                 4.2%
     > 10 ms  ▌                                    0.2%

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                        CLI Interface                         │
│             Target selection, tracing parameters             │
└──────────────────────────────┬───────────────────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │ strace      │     │ perf trace  │     │ bpftrace    │
    │ Backend     │     │ Backend     │     │ Backend     │
    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
           │                   │                   │
           └───────────────────┼───────────────────┘
                               │
                               ▼
            ┌───────────────────────────────┐
            │     Syscall Event Parser      │
            │   Normalize across backends   │
            └───────────────┬───────────────┘
                            │
           ┌────────────────┼────────────────┐
           │                │                │
           ▼                ▼                ▼
    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │ Statistics  │  │ Histogram   │  │ Pattern     │
    │ Calculator  │  │ Builder     │  │ Detector    │
    └─────────────┘  └─────────────┘  └─────────────┘

Key Data Structures

// Syscall event record
typedef struct {
    uint64_t timestamp_ns;
    uint32_t pid;
    uint32_t tid;
    int syscall_nr;           // Syscall number
    const char *syscall_name;
    uint64_t duration_ns;
    int64_t return_value;     // Raw syscall return; negative means -errno
    uint64_t args[6];         // Syscall arguments
} syscall_event_t;

// Per-syscall statistics
typedef struct {
    const char *name;
    uint64_t count;
    uint64_t total_time_ns;
    uint64_t min_time_ns;
    uint64_t max_time_ns;
    histogram_t latency_hist;
} syscall_stats_t;

// I/O pattern classification
typedef enum {
    IO_PATTERN_SEQUENTIAL,
    IO_PATTERN_RANDOM,
    IO_PATTERN_MIXED
} io_pattern_t;

typedef struct {
    int fd;
    const char *path;         // If available
    io_pattern_t pattern;
    uint64_t total_bytes;
    uint64_t total_ops;
    uint64_t total_time_ns;
} fd_stats_t;
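
The histogram_t referenced above is left unspecified; one workable sketch (an assumption, mirroring the power-of-two buckets that bpftrace's hist() uses) is:

#include <stdint.h>

#define HIST_BUCKETS 64

typedef struct {
    /* buckets[i] counts latencies v (ns) with floor(log2(v)) == i */
    uint64_t buckets[HIST_BUCKETS];
} histogram_t;

static void hist_add(histogram_t *h, uint64_t latency_ns) {
    int b = 0;
    while (latency_ns > 1 && b < HIST_BUCKETS - 1) {
        latency_ns >>= 1;                 /* find the highest set bit */
        b++;
    }
    h->buckets[b]++;
}

Log-scale buckets keep memory constant while still resolving the orders-of-magnitude spread between cached reads and disk I/O.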

Tracing Backend Implementations

strace Backend (Portable)

# Capture syscall timing
strace -c -p $PID -o stats.txt &
sleep 60
kill %1

# Or with timestamps
strace -tt -T -o trace.txt -p $PID

Parse output:

// Line format: "14:23:45.123456 read(3, "...", 4096) = 4096 <0.000034>"
//                                                          ↑ duration
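
A sketch of a parser for this format in C. It assumes -T was used (so every line ends with a <seconds> trailer) and ignores complications such as "unfinished"/"resumed" lines from multithreaded targets, which a real parser must stitch together; the function name is illustrative.

/* Hedged sketch: extract syscall name and duration from one
 * "strace -tt -T" line. Returns 0 on success, -1 on parse failure. */
#include <stdio.h>
#include <string.h>

int parse_strace_line(const char *line, char *name, size_t name_sz,
                      double *duration_s) {
    /* Syscall name starts after "HH:MM:SS.uuuuuu " and ends at '('. */
    const char *p = strchr(line, ' ');
    if (!p) return -1;
    p++;
    const char *paren = strchr(p, '(');
    if (!paren || (size_t)(paren - p) >= name_sz) return -1;
    memcpy(name, p, (size_t)(paren - p));
    name[paren - p] = '\0';

    /* Duration is the trailing "<seconds>" field. */
    const char *lt = strrchr(line, '<');
    if (!lt || sscanf(lt, "<%lf>", duration_s) != 1) return -1;
    return 0;
}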

perf trace Backend (Lower Overhead)

# Trace for 60 seconds; note that --duration is a latency filter
# (show only syscalls slower than N ms), not a run-time limit
timeout 60 perf trace -T -p $PID -o trace.txt

# Show only syscalls slower than 1 ms
perf trace -p $PID --duration 1.0

bpftrace Backend (Minimal Overhead)

bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @latency = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
'

Phased Implementation Guide

Phase 1: Basic Tracing (Days 1-3)

Goal: Capture syscalls with timing using strace.

Steps:

  1. Create strace wrapper with timing options
  2. Parse strace output format
  3. Extract syscall name, duration, return value
  4. Store events in structured format
  5. Calculate basic statistics

Validation: Capture matches strace -c summary.

Phase 2: Multi-Backend Support (Days 4-6)

Goal: Add perf trace and bpftrace backends.

Steps:

  1. Implement perf trace output parser
  2. Create bpftrace script for syscall tracing
  3. Normalize event format across backends
  4. Add backend selection in CLI

Validation: Same results from different backends.

Phase 3: Histogram and Analysis (Days 7-9)

Goal: Build latency distributions per syscall.

Steps:

  1. Implement histogram with configurable buckets
  2. Calculate percentiles (p50, p90, p95, p99)
  3. Detect bimodal distributions
  4. Generate visualization output

Validation: Histogram matches expected distribution.
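
For the percentile step, a minimal exact approach is to sort the raw samples and index by rank. This sketch (nearest-rank method; names are illustrative) is fine for offline analysis of a trace file, while the histogram path gives approximate percentiles without storing every sample.

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);             /* avoids overflow of x - y */
}

/* Nearest-rank percentile; sorts samples in place. Requires n > 0. */
uint64_t percentile_ns(uint64_t *samples, size_t n, double p) {
    qsort(samples, n, sizeof *samples, cmp_u64);
    size_t idx = (size_t)((p / 100.0) * (double)(n - 1));
    return samples[idx];
}

/* Usage: uint64_t p99 = percentile_ns(latencies, count, 99.0); */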

Phase 4: I/O Pattern Detection (Days 10-11)

Goal: Classify I/O access patterns.

Steps:

  1. Track per-FD operations
  2. Detect sequential vs random access
  3. Calculate read/write ratio
  4. Identify hot files/descriptors

Validation: Sequential access correctly identified.
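
One simple detector is sketched below: an operation counts as sequential when it starts exactly where the previous operation on that descriptor ended. The 90%/10% cutoffs are illustrative assumptions, and the sketch reuses the io_pattern_t enum from the data structures above. Note that only pread/pwrite carry an explicit offset; for plain read()/write() you must track the implicit file position yourself.

#include <stdint.h>

typedef struct {
    uint64_t next_expected_off;   /* previous offset + previous length */
    uint64_t sequential_ops;
    uint64_t total_ops;
} fd_pattern_t;

/* Call for every read/write observed on this descriptor. */
void pattern_record(fd_pattern_t *p, uint64_t offset, uint64_t len) {
    if (p->total_ops > 0 && offset == p->next_expected_off)
        p->sequential_ops++;
    p->next_expected_off = offset + len;
    p->total_ops++;
}

io_pattern_t pattern_classify(const fd_pattern_t *p) {
    double f = p->total_ops ? (double)p->sequential_ops / p->total_ops : 0.0;
    if (f > 0.9) return IO_PATTERN_SEQUENTIAL;   /* illustrative cutoff */
    if (f < 0.1) return IO_PATTERN_RANDOM;       /* illustrative cutoff */
    return IO_PATTERN_MIXED;
}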

Phase 5: Recommendations Engine (Days 12-14)

Goal: Generate optimization suggestions.

Steps:

  1. Define pattern-to-recommendation rules
  2. Detect common anti-patterns
  3. Generate actionable suggestions
  4. Estimate improvement potential

Validation: Recommendations sensible for test cases.
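
A rules engine can be as simple as a table of predicate/message pairs evaluated over the aggregated statistics. A sketch, assuming the syscall_stats_t from earlier; the 50% threshold and the recommendation text are illustrative, not part of the spec.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    bool (*applies)(const syscall_stats_t *stats, size_t n);
    const char *recommendation;
} rule_t;

/* Example predicate: fsync accounts for over half of total syscall time. */
static bool fsync_dominates(const syscall_stats_t *stats, size_t n) {
    uint64_t total = 0, fsync_time = 0;
    for (size_t i = 0; i < n; i++) {
        total += stats[i].total_time_ns;
        if (strcmp(stats[i].name, "fsync") == 0)
            fsync_time += stats[i].total_time_ns;
    }
    return total > 0 && fsync_time * 2 > total;
}

static const rule_t rules[] = {
    { fsync_dominates,
      "Batch writes and issue one fsync per group (group commit)." },
};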


Testing Strategy

Synthetic Workloads

1. Pure CPU (No I/O)

// Should show minimal syscall activity
while (running) {
    compute();
}

Expected: Few syscalls, low overhead.

2. Synchronous File I/O

// Heavy fsync usage
for (int i = 0; i < 1000; i++) {
    write(fd, data, 4096);
    fsync(fd);
}

Expected: fsync dominates latency.

3. Buffered I/O

// No fsync, uses page cache
for (int i = 0; i < 1000; i++) {
    write(fd, data, 4096);
}
// No fsync

Expected: Fast writes, potential cache pressure.

4. Random vs Sequential Read

// Sequential
for (int i = 0; i < 1000; i++) {
    pread(fd, buf, 4096, i * 4096);
}

// Random
for (int i = 0; i < 1000; i++) {
    off_t offset = (rand() % file_size) & ~4095;
    pread(fd, buf, 4096, offset);
}

Expected: Random 10-100x slower if file > RAM.

Validation

  1. Compare with strace -c: Summary should match
  2. Timing sanity: Total syscall time ≤ wall clock × thread count
  3. Count accuracy: Every enter has matching exit

Common Pitfalls and Debugging

Pitfall 1: Observer Effect

Symptom: Application runs slower with profiling.

Cause: strace adds 10-100x overhead per syscall.

Solution: Use lower-overhead methods:

# strace: High overhead (ptrace)
# perf trace: Medium overhead (kernel tracing)
# bpftrace: Low overhead (eBPF, in-kernel)

# For production, use sampling instead of tracing
perf record -e syscalls:sys_enter_* -c 100 ./app

Pitfall 2: Missing Context

Symptom: Donโ€™t know which file a read/write targets.

Cause: strace only shows fd number, not path.

Solution: Track open() calls to build an fd → path mapping:

// When we see: open("/data/file.txt", O_RDONLY) = 3
// Record: fd 3 → "/data/file.txt"

// When we see: read(3, ...) = 4096
// Look up: fd 3 → "/data/file.txt"
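
A minimal table for this is sketched below (the fixed MAX_FD bound and the function names are illustrative): update it on every open/openat return value and clear entries on close, since fd numbers are reused.

#include <stdlib.h>
#include <string.h>

#define MAX_FD 4096                /* illustrative fixed bound */

static char *fd_path[MAX_FD];

void fd_table_open(int fd, const char *path) {
    if (fd < 0 || fd >= MAX_FD) return;
    free(fd_path[fd]);             /* fd numbers get reused */
    fd_path[fd] = strdup(path);
}

const char *fd_table_lookup(int fd) {
    return (fd >= 0 && fd < MAX_FD && fd_path[fd]) ? fd_path[fd] : "?";
}

void fd_table_close(int fd) {
    if (fd >= 0 && fd < MAX_FD) {
        free(fd_path[fd]);
        fd_path[fd] = NULL;
    }
}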

Pitfall 3: Async I/O Confusion

Symptom: I/O operations complete instantly.

Cause: With async I/O (io_uring, aio), submission returns immediately and the real work completes later.

Solution: Trace completion events too:

# For io_uring
bpftrace -e '
tracepoint:io_uring:io_uring_submit_sqe { ... }
tracepoint:io_uring:io_uring_complete { ... }
'

Pitfall 4: Page Cache Masking

Symptom: All reads show < 1 ms, even for huge files.

Cause: File is fully cached in RAM.

Solution: Understand this IS realistic production behavior:

# Drop caches to measure cold performance
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Or trace cache hit/miss
perf record -e cache-misses ./app

Extensions and Challenges

Extension 1: io_uring Profiling

Add support for io_uring async I/O:

  • Trace submission and completion separately
  • Calculate true I/O latency
  • Compare with synchronous I/O

Extension 2: Network I/O Focus

Specialize for network syscalls:

  • Track connection lifecycle
  • Measure time-to-first-byte
  • Identify connection establishment overhead

Extension 3: Per-Thread Attribution

Break down by thread:

  • Which threads do most I/O?
  • Thread-specific patterns
  • Correlation with application logic

Challenge: Production Tracing

Design production-safe tracing:

  • < 1% overhead target
  • Sampling-based approach
  • Aggregate statistics only
  • Automatic anomaly detection

Real-World Connections

Industry Practices

  1. eBPF Observability: Modern production tracing uses eBPF
  2. Cloud Profiling: GCP/AWS offer syscall-level visibility
  3. Database I/O Patterns: PostgreSQL, MySQL tune around fsync
  4. Storage Systems: Understanding I/O is core to storage engineering

Common Optimizations

  • Batching: Group multiple writes, single fsync
  • Async I/O: Donโ€™t block on slow operations
  • Memory Mapping: Avoid syscall overhead for hot data
  • O_DIRECT: Bypass page cache for predictable latency
  • Buffered I/O: Leverage page cache for read-heavy workloads

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can explain syscall overhead sources
  • You traced syscalls and identified slow operations
  • Latency histogram shows expected distribution
  • You detected sequential vs random I/O patterns
  • Recommendations are actionable for test workloads
  • You understand page cache effects on measurements
  • You know when to use different tracing backends

Resources

Essential Reading

  • "The Linux Programming Interface" by Kerrisk, Chapter 3
  • "Systems Performance" by Gregg, Chapter 7
  • "Operating Systems: Three Easy Pieces" by Arpaci-Dusseau, Chapter 36

Reference

  • Linux syscall table: ausyscall --dump
  • strace man page
  • perf trace documentation
  • bpftrace tutorial

Tools

  • strace: Classic syscall tracer
  • perf trace: Lower overhead tracing
  • bpftrace: eBPF-based tracing
  • sysdig: Container-aware tracing