Project 2: Memory-Mapped File Reader

Build a file reader that memory-maps files for efficient random access, enabling operations on gigabyte-sized files without loading them into RAM.

Quick Reference

Attribute     | Value
Difficulty    | Intermediate
Time Estimate | 1-2 weeks
Language      | C++
Prerequisites | Project 1, understanding of pointers, basic file I/O
Key Topics    | Virtual memory, mmap/MapViewOfFile, pointer arithmetic, RAII, cross-platform abstraction

1. Learning Objectives

After completing this project, you will:

  • Understand how virtual memory maps files directly into your address space
  • Master the mmap (POSIX) and MapViewOfFile (Windows) system calls
  • Learn safe pointer arithmetic for traversing memory regions
  • Implement cross-platform code using conditional compilation
  • Apply the RAII pattern to ensure proper resource cleanup
  • Build high-performance file processing tools

2. Theoretical Foundation

2.1 Core Concepts

Virtual Memory and Memory Mapping

Every process has a virtual address space: a vast range of addresses that don’t correspond directly to physical RAM. The CPU’s memory management unit (MMU) translates virtual addresses to physical addresses, using page tables that the operating system maintains.

Memory mapping exploits this: instead of reading a file into a buffer, you ask the OS to map the file’s contents into your virtual address space. The file appears as a contiguous block of memory starting at some pointer.

Traditional File Reading:
  File on disk --read()--> Kernel buffer --copy--> User buffer --access--> Your code

Memory Mapping:
  File on disk --mmap()--> Virtual address space --access--> Your code
                          (Pages loaded on demand)
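
The two paths in the diagram can be compared in a small POSIX sketch. The function names byteViaRead and byteViaMmap are hypothetical, chosen only for this illustration; both fetch one byte at a given offset, the first through a kernel-to-user copy and the second by indexing the mapping like an array.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Traditional path: the kernel copies the requested byte into our buffer.
// Returns the byte value (0-255), or -1 on any error.
int byteViaRead(const char* path, off_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char b = 0;
    ssize_t n = ::pread(fd, &b, 1, offset);
    ::close(fd);
    return n == 1 ? b : -1;
}

// Mapped path: the file appears as memory and is indexed like an array.
// Returns the byte value (0-255), or -1 on any error or out-of-range offset.
int byteViaMmap(const char* path, off_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat sb;
    if (::fstat(fd, &sb) < 0 || offset < 0 || offset >= sb.st_size) {
        ::close(fd);
        return -1;
    }
    const std::size_t size = static_cast<std::size_t>(sb.st_size);
    void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;
    int value = static_cast<const unsigned char*>(p)[offset];
    ::munmap(p, size);
    return value;
}
```

For a single byte the read path is likely cheaper, since setting up and tearing down a mapping has its own cost; the mapped path pays off when the same mapping serves many accesses.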

Page Faults and Demand Paging

When you access a memory-mapped region, the page might not be in RAM. The CPU triggers a page fault, the OS loads the page from disk, and your code continues—transparently. Only accessed pages are loaded.

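This also works when the file is larger than physical memory: with a 16GB file on a machine with 8GB RAM, you can still “access” the entire file. The OS decides which pages stay in RAM and evicts the rest; for a read-only file mapping, an evicted page is simply read from the file again the next time it is touched.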

Benefits of Memory Mapping

Benefit          | Explanation
No explicit I/O  | Read/write through pointers, not syscalls
Zero-copy access | No buffer copying between kernel and user space
Demand paging    | Only touched pages loaded into RAM
Shared mapping   | Multiple processes can share the same mapping
Simplified code  | Treat file as array; OS handles caching

Platform APIs

POSIX (Linux, macOS, BSD):

void* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void* addr, size_t length);

Windows:

HANDLE CreateFileMapping(HANDLE hFile, ...);
void* MapViewOfFile(HANDLE hMap, DWORD access, ...);
BOOL UnmapViewOfFile(void* lpBaseAddress);

2.2 Why This Matters

Memory mapping is used in:

  • Databases: LMDB is built on mmap, SQLite offers optional memory-mapped I/O, and PostgreSQL uses mmap for some operations
  • Video editors: Large media files mapped rather than loaded
  • Search tools: ripgrep can memory-map a file while searching it (its --mmap option)
  • Program loaders: the OS maps executables and shared libraries into every process, browsers included
  • Game engines: Stream assets from disk via memory mapping

Understanding mmap reveals how operating systems virtualize memory and how high-performance software accesses data.

2.3 Historical Context

Memory mapping predates modern operating systems. Multics (1960s) introduced the concept. UNIX adopted it in the 1980s with mmap. Windows followed with file mapping objects.

The key insight: files and memory are both just bytes. Why treat them differently? Memory mapping unifies them—a file is just a region of memory that persists on disk.

2.4 Common Misconceptions

Misconception 1: “mmap loads the entire file into RAM” False. Pages are loaded on demand. On a 64-bit system you can map a 100GB file on a machine with 4GB RAM, because a mapping consumes address space rather than physical memory.

Misconception 2: “mmap is always faster than read()” False. For sequential, single-pass reading, read() with appropriate buffering can be faster. mmap shines for random access and repeated access.

Misconception 3: “Memory-mapped files are automatically persistent” Partially true. Changes to MAP_SHARED mappings are written back. MAP_PRIVATE mappings are copy-on-write—changes stay in memory only.

Misconception 4: “Pointer arithmetic on mmap is dangerous” True if you go out of bounds. But with proper size checking, it’s safe and efficient.
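
The MAP_SHARED versus MAP_PRIVATE distinction from Misconception 3 can be observed directly. The sketch below is POSIX-only and uses a hypothetical helper name, firstByteAfterMappedWrite; it writes one byte through a mapping and then checks what the file itself contains.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Overwrites the first byte of the file through a mapping created with the
// given flags (MAP_PRIVATE or MAP_SHARED), then re-reads that byte from the
// file with pread(). Returns the byte found in the file, or -1 on error.
// Assumes the file exists and holds at least one byte.
int firstByteAfterMappedWrite(const char* path, int mapFlags, char newValue) {
    int fd = ::open(path, O_RDWR);
    if (fd < 0) return -1;
    void* p = ::mmap(nullptr, 1, PROT_READ | PROT_WRITE, mapFlags, fd, 0);
    if (p == MAP_FAILED) {
        ::close(fd);
        return -1;
    }
    static_cast<char*>(p)[0] = newValue;               // write through the mapping
    if (mapFlags & MAP_SHARED) ::msync(p, 1, MS_SYNC); // flush shared changes to the file
    ::munmap(p, 1);
    unsigned char b = 0;
    ssize_t n = ::pread(fd, &b, 1, 0);                 // what does the file hold now?
    ::close(fd);
    return n == 1 ? b : -1;
}
```

With MAP_PRIVATE the write lands in a private copy of the page and the file keeps its old byte; with MAP_SHARED the file changes. For a read-only reader, either flag combined with PROT_READ leaves the file untouched.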


3. Project Specification

3.1 What You Will Build

A cross-platform memory-mapped file reader that:

  1. Maps files into memory using platform-native APIs
  2. Supports operations on very large files (multi-gigabyte)
  3. Provides fast line counting, word counting, and searching
  4. Displays arbitrary byte ranges
  5. Measures performance to demonstrate mmap efficiency
  6. Cleans up resources automatically via RAII

3.2 Functional Requirements

ID  | Requirement
F1  | Open and memory-map any file up to 64GB
F2  | Report file size immediately upon mapping
F3  | Count lines in the file
F4  | Count words in the file
F5  | Count characters/bytes in the file
F6  | Search for a pattern and report all occurrences
F7  | Display arbitrary byte range (e.g., bytes 1000-2000)
F8  | Report timing for operations
F9  | Handle missing/inaccessible files gracefully
F10 | Work on both Linux/macOS (POSIX) and Windows

3.3 Non-Functional Requirements

ID | Requirement
N1 | Line counting: process 1GB in under 2 seconds
N2 | Searching: process 1GB in under 3 seconds
N3 | Memory usage: private (heap) memory should not exceed 100MB for any file size; resident pages of the mapped file are reclaimable cache and may grow as the file is scanned
N4 | Startup time: map a file in under 10ms
N5 | No file size limits except OS/filesystem limits

3.4 Example Usage / Output

$ ./mmap_reader largefile.log

File: largefile.log (4.2 GB)
Mapped successfully.

Commands:
  count - Count lines/words
  search <pattern> - Find pattern
  view <start> <end> - Show byte range
  quit - Exit

> count
Lines: 45,234,891
Words: 312,456,789
Chars: 4,512,345,678
(Completed in 1.2 seconds - file never fully loaded to RAM)

> search "ERROR"
Found 1,234 occurrences
First 5:
  Line 45: [2024-01-15 10:23:45] ERROR: Connection timeout
  Line 892: [2024-01-15 10:25:12] ERROR: Database unreachable
  Line 1203: [2024-01-15 10:27:33] ERROR: Authentication failed
  Line 2891: [2024-01-15 10:30:01] ERROR: Disk full
  Line 5123: [2024-01-15 10:35:55] ERROR: Network unreachable
(Completed in 0.8 seconds)

> view 0 1000
[Shows first 1000 bytes of file]

> view 1000000 1000100
[Shows bytes 1000000-1000100]

> quit
Goodbye!

3.5 Real World Outcome

After completing this project, you will:

  • Understand OS virtual memory deeply
  • Know when to use mmap vs traditional I/O
  • Have a powerful log analysis tool for large files
  • Be prepared for systems programming interviews about memory

4. Solution Architecture

4.1 High-Level Design

+------------------+
|   Command Line   |
+------------------+
         |
         v
+------------------+
| MemoryMappedFile |  <-- Cross-platform abstraction
+------------------+
    |         |
    v         v
+-------+  +----------+
| POSIX |  | Windows  |
| mmap  |  | MapView  |
+-------+  +----------+
    |         |
    v         v
+------------------+
|    OS Kernel     |
+------------------+
         |
         v
+------------------+
|    File System   |
+------------------+
         |
         v
+------------------+
|      Disk        |
+------------------+

4.2 Key Components

MemoryMappedFile Class

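#include <cstddef>
#include <string>
#include <string_view>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

class MemoryMappedFile {
public:
    MemoryMappedFile() = default;
    ~MemoryMappedFile() { close(); }  // RAII: unmap and release handles

    // Non-copyable: two owners would unmap the same region twice
    MemoryMappedFile(const MemoryMappedFile&) = delete;
    MemoryMappedFile& operator=(const MemoryMappedFile&) = delete;

    bool open(const std::string& path);
    void close();  // must be safe to call when nothing is mapped

    const char* data() const;
    size_t size() const;

    const char* begin() const;
    const char* end() const;

    // Query operations
    size_t countLines() const;
    size_t countWords() const;
    std::vector<size_t> search(const std::string& pattern) const;
    std::string_view view(size_t start, size_t end) const;

private:
#ifdef _WIN32
    HANDLE hFile_ = INVALID_HANDLE_VALUE;
    HANDLE hMapping_ = nullptr;
#else
    int fd_ = -1;
#endif
    void* data_ = nullptr;
    size_t size_ = 0;
};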
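
Section 5.5 asks what the destructor should do and how copy and move should behave. One possible answer is sketched below as a hypothetical POSIX-only class, MappedRegion. It is not the project's MemoryMappedFile, only an illustration of the ownership rules: unmap in the destructor, forbid copying, and transfer ownership on move.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII owner of a read-only file mapping (illustration only).
class MappedRegion {
public:
    MappedRegion() = default;

    // Maps the whole file; on any failure (or an empty file) stays invalid.
    explicit MappedRegion(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat sb;
        if (::fstat(fd, &sb) == 0 && sb.st_size > 0) {
            const std::size_t n = static_cast<std::size_t>(sb.st_size);
            void* p = ::mmap(nullptr, n, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = p;
                size_ = n;
            }
        }
        ::close(fd);  // the mapping outlives the descriptor
    }

    ~MappedRegion() { reset(); }

    // Copying would make two objects unmap the same region.
    MappedRegion(const MappedRegion&) = delete;
    MappedRegion& operator=(const MappedRegion&) = delete;

    // Moving transfers ownership and leaves the source empty.
    MappedRegion(MappedRegion&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    MappedRegion& operator=(MappedRegion&& other) noexcept {
        if (this != &other) {
            reset();
            data_ = std::exchange(other.data_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    bool valid() const { return data_ != nullptr; }
    const char* data() const { return static_cast<const char*>(data_); }
    std::size_t size() const { return size_; }

private:
    void reset() {
        if (data_) ::munmap(data_, size_);
        data_ = nullptr;
        size_ = 0;
    }

    void* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Because cleanup lives in the destructor, the region is unmapped on every exit path, including early returns and exceptions thrown by code that uses it.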

REPL (Command Loop)

  • Parses commands: count, search, view, quit
  • Measures timing for operations
  • Formats output
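
A minimal sketch of the parsing and timing duties listed above, under some assumptions: the names Command, CommandKind, parseCommand, and elapsedSeconds are hypothetical, and a search pattern wrapped in double quotes (as in the sample session) has one pair of quotes stripped.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

enum class CommandKind { Count, Search, View, Quit, Invalid };

struct Command {
    CommandKind kind = CommandKind::Invalid;
    std::string pattern;    // used by Search
    std::size_t start = 0;  // used by View
    std::size_t end = 0;    // used by View
};

// Parses one input line such as: count | search "ERROR" | view 0 1000 | quit
Command parseCommand(const std::string& line) {
    std::istringstream in(line);
    std::string word;
    Command cmd;
    if (!(in >> word)) return cmd;  // blank line
    if (word == "count") {
        cmd.kind = CommandKind::Count;
    } else if (word == "quit") {
        cmd.kind = CommandKind::Quit;
    } else if (word == "search") {
        std::getline(in >> std::ws, cmd.pattern);  // rest of the line
        if (cmd.pattern.size() >= 2 && cmd.pattern.front() == '"' &&
            cmd.pattern.back() == '"') {
            cmd.pattern = cmd.pattern.substr(1, cmd.pattern.size() - 2);
        }
        if (!cmd.pattern.empty()) cmd.kind = CommandKind::Search;
    } else if (word == "view") {
        if ((in >> cmd.start >> cmd.end) && cmd.start <= cmd.end) {
            cmd.kind = CommandKind::View;
        }
    }
    return cmd;
}

// Runs the callable and returns the elapsed wall-clock time in seconds.
template <typename F>
double elapsedSeconds(F&& operation) {
    const auto t0 = std::chrono::steady_clock::now();
    operation();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

std::chrono::steady_clock is used here because it never jumps backwards, which makes it a reasonable default for measuring intervals.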

4.3 Data Structures

Structure           | Purpose
void* data_         | Pointer to mapped memory region (read through const char*)
size_t size_        | Size of the mapped file in bytes
std::vector<size_t> | Store byte offsets of search matches

4.4 Algorithm Overview

Line Counting

count = 0
for each byte in [begin, end):
    if byte == '\n':
        count++
return count

This is cache-friendly sequential access—the OS prefetches pages as you iterate.
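
In C++ the loop above reduces to a single standard-library call. A minimal sketch, written as a free function over a byte range (the free-function form is an assumption made so the example stands alone):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Counts '\n' bytes in [begin, end). Like wc -l, a final line without a
// trailing newline is not counted.
std::size_t countLines(const char* begin, const char* end) {
    return static_cast<std::size_t>(std::count(begin, end, '\n'));
}
```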

Word Counting

count = 0
in_word = false
for each byte in [begin, end):
    if isspace(byte):
        in_word = false
    else if not in_word:
        in_word = true
        count++
return count
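
A C++ rendering of the same state machine, again as a stand-alone free function (an assumption for illustration). The one detail the pseudocode hides is the cast: std::isspace must receive a value representable as unsigned char, so bytes above 127 in a binary file need converting first.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Counts maximal runs of non-whitespace bytes in [begin, end).
std::size_t countWords(const char* begin, const char* end) {
    std::size_t count = 0;
    bool inWord = false;
    for (const char* p = begin; p != end; ++p) {
        // Cast first: passing a negative char to std::isspace is undefined.
        const bool space = std::isspace(static_cast<unsigned char>(*p)) != 0;
        if (space) {
            inWord = false;
        } else if (!inWord) {
            inWord = true;
            ++count;
        }
    }
    return count;
}
```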

Pattern Search

pattern = "ERROR"
results = []
ptr = begin
while ptr < end:
    found = std::search(ptr, end, pattern.begin(), pattern.end())
    if found == end:
        break
    results.push_back(found - begin)  // Byte offset
    ptr = found + 1
return results

For better performance, use Boyer-Moore (std::search with std::boyer_moore_searcher in C++17).
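
The search returns byte offsets, while the sample session in section 3.4 reports line numbers. One way to bridge the two is sketched below with a hypothetical helper, lineNumbersFor. It assumes the offsets are in ascending order (which the search loop produces) and counts newlines incrementally, so the data is scanned once in total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Converts ascending byte offsets into 1-based line numbers in one pass.
// Offsets at or beyond the end of the range stop the conversion.
std::vector<std::size_t> lineNumbersFor(const char* begin, const char* end,
                                        const std::vector<std::size_t>& sortedOffsets) {
    std::vector<std::size_t> lines;
    lines.reserve(sortedOffsets.size());
    const std::size_t size = static_cast<std::size_t>(end - begin);
    const char* cursor = begin;  // everything before cursor is already counted
    std::size_t line = 1;
    for (std::size_t offset : sortedOffsets) {
        if (offset >= size) break;
        const char* target = begin + offset;
        assert(target >= cursor && "offsets must be ascending");
        line += static_cast<std::size_t>(std::count(cursor, target, '\n'));
        cursor = target;
        lines.push_back(line);
    }
    return lines;
}
```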


5. Implementation Guide

5.1 Development Environment Setup

# Linux/macOS
mkdir mmap_reader && cd mmap_reader
touch main.cpp memory_mapped_file.hpp memory_mapped_file.cpp
g++ -std=c++17 -O2 -Wall -o mmap_reader main.cpp memory_mapped_file.cpp

# Windows (Visual Studio Developer Command Prompt)
cl /std:c++17 /O2 /EHsc main.cpp memory_mapped_file.cpp /Fe:mmap_reader.exe

# Create test file: 100MB of zero bytes (exercises mapping and view, but contains no lines or words)
dd if=/dev/zero of=testfile.bin bs=1048576 count=100

5.2 Project Structure

mmap_reader/
├── CMakeLists.txt
├── include/
│   └── memory_mapped_file.hpp
├── src/
│   ├── main.cpp
│   └── memory_mapped_file.cpp
└── tests/
    ├── test_mmap.cpp
    └── test_large_file.cpp

5.3 The Core Question You’re Answering

“How can I process files larger than available RAM efficiently, without manual chunking or buffering?”

This decomposes into:

  1. How does virtual memory allow accessing more data than fits in RAM?
  2. How do I use OS APIs to map files into my address space?
  3. How do I safely traverse a memory region with pointers?
  4. How do I ensure resources (file handles, mappings) are properly cleaned up?

5.4 Concepts You Must Understand First

Question                                       | Book Reference
What is virtual memory?                        | CSAPP Chapter 9
What is a page fault?                          | CSAPP Chapter 9
How do file descriptors work on POSIX?         | TLPI Chapter 4
What is the difference between stack and heap? | CSAPP Chapter 9
What is RAII in C++?                           | Effective C++ Item 13

5.5 Questions to Guide Your Design

Memory Mapping Design

  • What happens if the file is too large for a 32-bit address space?
  • What protection flags should you use (read-only vs read-write)?
  • Should you use MAP_PRIVATE or MAP_SHARED?
  • What if the file changes while you’re mapping it?

Cross-Platform Abstraction

  • How will you hide platform differences behind a common interface?
  • What’s the cleanest way to use #ifdef for platform-specific code?
  • Should platform code be in separate files or the same file?

Resource Management

  • How will you ensure the mapping is always unmapped, even if exceptions occur?
  • What order should cleanup happen (unmap before close)?
  • What should the destructor do? What about copy/move?

Error Handling

  • What errors can mmap/MapViewOfFile return?
  • How will you report errors to the caller?
  • Should you throw exceptions or return error codes?

5.6 Thinking Exercise

Trace what happens when you run this code:

MemoryMappedFile mmf;
mmf.open("4GB_file.log");

// Count newlines
size_t lines = 0;
for (const char* p = mmf.begin(); p != mmf.end(); ++p) {
    if (*p == '\n') lines++;
}

Questions to answer:

  1. After open(), how much RAM is used? (Answer: almost none—just page tables)
  2. When does the first page fault occur? (Answer: first dereference of *p)
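  3. How many page faults total for a 4GB file with 4KB pages? (Answer: at most ~1 million if pages not cached, since 4 GiB / 4 KiB = 1,048,576 pages; usually fewer, because kernels typically read ahead and map several pages per fault)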
  4. What if you run the same count again immediately? (Answer: few/no page faults—pages cached)
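
The arithmetic behind question 3 can be written as a small helper. The names pagesSpanned and systemPageSize are hypothetical; the 4096-byte fallback is an assumption for systems where sysconf reports an error.

```cpp
#include <cassert>
#include <cstdint>
#include <unistd.h>

// Number of pages needed to cover fileSize bytes, rounding up.
std::uint64_t pagesSpanned(std::uint64_t fileSize, std::uint64_t pageSize) {
    return (fileSize + pageSize - 1) / pageSize;
}

// Page size reported by the OS (POSIX). Commonly 4096, but not guaranteed.
std::uint64_t systemPageSize() {
    const long n = ::sysconf(_SC_PAGESIZE);
    return n > 0 ? static_cast<std::uint64_t>(n) : 4096;
}
```

For a 4 GiB file and 4 KiB pages, pagesSpanned gives 1,048,576, which is the upper bound on page faults for one full pass.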

5.7 Hints in Layers

Hint 1: Starting Point (Conceptual) Start with POSIX-only implementation. Get mmap working on Linux/macOS first. Add Windows support later with #ifdef.

Hint 2: Next Level (More Specific)

POSIX implementation skeleton:

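// Requires <fcntl.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>
bool MemoryMappedFile::open(const std::string& path) {
    close();  // release any previous mapping first

    fd_ = ::open(path.c_str(), O_RDONLY);
    if (fd_ < 0) return false;

    struct stat sb;
    if (fstat(fd_, &sb) < 0) {
        ::close(fd_);
        fd_ = -1;
        return false;
    }
    size_ = static_cast<size_t>(sb.st_size);

    // mmap rejects a zero-length mapping (EINVAL), so treat an empty file
    // as a valid but empty range with data_ == nullptr.
    if (size_ == 0) return true;

    void* p = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
    if (p == MAP_FAILED) {  // MAP_FAILED is (void*)-1, not nullptr
        ::close(fd_);
        fd_ = -1;
        size_ = 0;
        return false;
    }
    data_ = p;  // only store the pointer once the mapping succeeded

    return true;
}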

Hint 3: Technical Details (Approach)

Windows implementation:

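bool MemoryMappedFile::open(const std::string& path) {
    close();  // release any previous mapping first

    hFile_ = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                         nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (hFile_ == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER fileSize;
    if (!GetFileSizeEx(hFile_, &fileSize)) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        return false;
    }
    size_ = static_cast<size_t>(fileSize.QuadPart);

    // CreateFileMapping fails for a zero-length file, so treat an empty
    // file as a valid but empty range with data_ == nullptr.
    if (size_ == 0) return true;

    hMapping_ = CreateFileMappingA(hFile_, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!hMapping_) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        size_ = 0;
        return false;
    }

    data_ = MapViewOfFile(hMapping_, FILE_MAP_READ, 0, 0, 0);
    if (!data_) {
        CloseHandle(hMapping_);
        CloseHandle(hFile_);
        hMapping_ = nullptr;
        hFile_ = INVALID_HANDLE_VALUE;
        size_ = 0;
        return false;
    }

    return true;
}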

Hint 4: Implementation Details

For efficient searching, use the standard library:

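// Requires <algorithm>, <string>, <vector>
std::vector<size_t> MemoryMappedFile::search(const std::string& pattern) const {
    std::vector<size_t> results;
    if (pattern.empty()) return results;  // an empty pattern would match at every offset

    const char* p = begin();
    const char* e = end();

    while (p < e) {
        p = std::search(p, e, pattern.begin(), pattern.end());
        if (p == e) break;
        results.push_back(static_cast<size_t>(p - begin()));
        ++p;  // advance one byte so overlapping matches are also found
    }
    return results;
}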

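For C++17, use std::boyer_moore_searcher (declared in <functional>). It can skip ahead through the text and approaches O(n/m) character comparisons in the best case: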

std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
while (p < e) {
    auto it = std::search(p, e, searcher);
    // ...
}
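
The fragment above can be completed along these lines. This is a sketch written as a free function, findAll (a hypothetical name), over std::string_view so that it compiles on its own; a member version would pass begin() and end() of the mapping instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Returns the byte offset of every occurrence of needle in haystack,
// including overlapping ones. An empty needle yields no matches.
std::vector<std::size_t> findAll(std::string_view haystack, std::string_view needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty() || needle.size() > haystack.size()) return offsets;

    // The searcher preprocesses the needle once and is reused for every call.
    const std::boyer_moore_searcher searcher(needle.begin(), needle.end());

    auto it = haystack.begin();
    while (true) {
        it = std::search(it, haystack.end(), searcher);
        if (it == haystack.end()) break;
        offsets.push_back(static_cast<std::size_t>(it - haystack.begin()));
        ++it;  // step one byte so overlapping matches are reported too
    }
    return offsets;
}
```

Building the searcher outside the loop matters: its constructor does the pattern preprocessing, and repeating that for every match would cancel much of the benefit.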

5.8 The Interview Questions They’ll Ask

  1. “What is memory mapping and when would you use it?”
    • Maps file directly into virtual address space
    • Use for: random access, large files, shared memory between processes
    • Avoid for: small files, sequential-only access
  2. “What happens when you access a memory-mapped page that isn’t in RAM?”
    • Page fault: CPU traps to OS
    • OS reads page from disk into physical memory
    • OS updates page tables
    • CPU retries the instruction
  3. “How does mmap compare to read() for performance?”
    • mmap: no copy from kernel buffer, good for random access
    • read: explicit control, can be faster for sequential
    • For large files with random access, mmap wins
  4. “What are the pitfalls of memory mapping?”
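    • 32-bit systems: limited address space (4GB in total, of which a process can typically use only 2-3GB)
    • File changes: if another process truncates the file, touching pages past the new end raises SIGBUS
    • Error handling: I/O errors surface as a signal (SIGBUS) at the faulting access, not as a return code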
    • Thread safety: need synchronization for MAP_SHARED writes
  5. “How would you handle a 1TB file on a machine with 16GB RAM?”
    • 64-bit required (enough virtual address space)
    • mmap the whole file (OS pages in/out as needed)
    • Or: map in chunks (windows), sliding window approach
    • The OS LRU cache handles which pages stay in RAM
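
The chunked approach from question 5 hinges on one rule: the file offset passed to mmap must be a multiple of the page size. A POSIX sketch with a hypothetical helper, readWindow, which maps only the pages covering the requested range and copies the bytes out:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Copies `length` bytes starting at file offset `offset` by mapping only the
// pages that cover that range. Returns an empty string on error or bad range.
std::string readWindow(const char* path, std::size_t offset, std::size_t length) {
    std::string out;
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return out;

    struct stat sb;
    if (::fstat(fd, &sb) < 0 || length == 0 ||
        offset >= static_cast<std::size_t>(sb.st_size) ||
        length > static_cast<std::size_t>(sb.st_size) - offset) {
        ::close(fd);
        return out;
    }

    const std::size_t page = static_cast<std::size_t>(::sysconf(_SC_PAGESIZE));
    const std::size_t aligned = offset - (offset % page);  // round down to a page boundary
    const std::size_t slack = offset - aligned;            // bytes mapped before the range

    void* p = ::mmap(nullptr, slack + length, PROT_READ, MAP_PRIVATE, fd,
                     static_cast<off_t>(aligned));
    ::close(fd);
    if (p == MAP_FAILED) return out;

    out.assign(static_cast<const char*>(p) + slack, length);
    ::munmap(p, slack + length);
    return out;
}
```

A real sliding window would keep the mapping alive and remap only when the requested range leaves it; copying into a string here just keeps the example short.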

5.9 Books That Will Help

Topic                | Book                                         | Chapter
Memory Mapping       | The Linux Programming Interface              | Chapter 49
Virtual Memory       | Computer Systems: A Programmer’s Perspective | Chapter 9
File I/O             | C++ Primer Plus                              | Chapter 17
Platform Abstraction | C++ Primer Plus                              | Chapter 9
System Calls         | Advanced Programming in the UNIX Environment | Chapters 4, 14

5.10 Implementation Phases

Phase 1: Basic POSIX Mapping (Day 1-2)

  • Implement open() with mmap on Linux/macOS
  • Implement close() with munmap
  • Test: map a small file, print first 100 bytes

Phase 2: Iteration and Counting (Day 3-4)

  • Implement begin(), end()
  • Implement countLines()
  • Implement countWords()
  • Test: count lines in a log file, compare with wc -l

Phase 3: Searching (Day 5)

  • Implement search() with std::search
  • Return byte offsets of matches
  • Test: search for pattern in log file

Phase 4: Windows Support (Day 6-7)

  • Add #ifdef blocks for Windows
  • Implement with CreateFileMapping/MapViewOfFile
  • Test: verify same behavior on Windows

Phase 5: REPL and Polish (Day 8-10)

  • Add interactive command loop
  • Add timing for operations
  • Add view command for byte ranges
  • Handle edge cases (empty files, binary files)
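
For the view command, the main edge cases are ranges that start or end past the file. A minimal sketch, assuming a half-open range [start, end) and a hypothetical free function viewRange that clamps instead of failing:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the bytes in [start, end) of a mapped region, clamped to its size.
// An empty view is returned when the range lies outside the data.
std::string_view viewRange(const char* data, std::size_t size,
                           std::size_t start, std::size_t end) {
    end = std::min(end, size);  // never read past the mapping
    if (data == nullptr || start >= end) return {};
    return std::string_view(data + start, end - start);
}
```

Returning std::string_view avoids copying, but the view is only valid while the file stays mapped.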

5.11 Key Implementation Decisions

Decision         | Options                           | Recommendation
Mapping flags    | MAP_SHARED vs MAP_PRIVATE         | MAP_PRIVATE (no file modification)
Protection       | PROT_READ vs PROT_READ|PROT_WRITE | PROT_READ (read-only)
32-bit support   | Map whole file vs chunks          | Chunks if supporting 32-bit
Search algorithm | std::search vs custom             | std::search (or boyer_moore)
Error handling   | Exceptions vs error codes         | Error codes (system programming style)

6. Testing Strategy

6.1 Unit Tests

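// Requires <cassert>, <cstdio>, <fstream>
void testSmallFile() {
    // Create temp file. Binary mode keeps '\n' as a single byte on Windows,
    // so the size assertion below holds on every platform.
    std::ofstream f("test.txt", std::ios::binary);
    f << "Line 1\nLine 2\nLine 3\n";
    f.close();

    MemoryMappedFile mmf;
    assert(mmf.open("test.txt"));
    assert(mmf.size() == 21);
    assert(mmf.countLines() == 3);
    assert(mmf.countWords() == 6);

    auto matches = mmf.search("Line");
    assert(matches.size() == 3);

    mmf.close();  // Windows cannot delete a file that is still mapped
    std::remove("test.txt");
}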

void testLargeFile() {
    // Create 1GB file
    // ... (use dd or generate programmatically)

    MemoryMappedFile mmf;
    assert(mmf.open("large.bin"));

    auto start = std::chrono::high_resolution_clock::now();
    size_t lines = mmf.countLines();
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Counted " << lines << " lines in " << duration.count() << "ms\n";
    assert(duration.count() < 2000);  // Under 2 seconds
}

6.2 Edge Cases to Test

Input                 | Expected
Empty file (0 bytes)  | size=0, lines=0, words=0
File with no newlines | lines=0 (or 1 if counting partial)
Binary file           | Should work (may have nulls)
Very long lines       | Should work
Non-existent file     | open() returns false
Permission denied     | open() returns false
File in use           | Should still work (read sharing)

7. Common Pitfalls & Debugging

Problem          | Symptom                                 | Root Cause            | Fix
Segfault         | Crash on access                         | Pointer out of bounds | Check p < end()
SIGBUS           | Crash during access                     | File truncated        | Catch signal or check size
Mapping leak     | Address space (VSZ) grows on every open | munmap not called     | Use RAII destructor
Wrong size       | Incorrect count                         | 32-bit overflow       | Use size_t, not int
Slow performance | Too slow                                | Debug build           | Compile with -O2
Windows fails    | CreateFileMapping fails                 | Wrong flags           | Check GENERIC_READ, OPEN_EXISTING

Debugging Tips

  1. Check return values: mmap returns MAP_FAILED (not nullptr)
  2. Print errno: Use strerror(errno) for POSIX errors
  3. Monitor RSS: ps aux | grep mmap_reader to check memory
  4. Test small first: Use tiny files before large ones
  5. Use strace/dtrace: Trace system calls to see what’s happening
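
Tips 1 and 2 combine into a simple pattern: compare against MAP_FAILED and capture errno right away, before another call can overwrite it. A POSIX sketch with a hypothetical diagnostic helper, tryMap, that reports which call failed:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Attempts a read-only mapping of the whole file. Returns "" on success,
// or a message of the form "<call>: <strerror text>" naming the failing call.
std::string tryMap(const char* path) {
    auto fail = [](const char* call) {
        return std::string(call) + ": " + std::strerror(errno);
    };

    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return fail("open");

    struct stat sb;
    if (::fstat(fd, &sb) < 0) {
        std::string message = fail("fstat");  // read errno before close()
        ::close(fd);
        return message;
    }
    if (sb.st_size == 0) {  // nothing to map; not an error
        ::close(fd);
        return "";
    }

    const std::size_t size = static_cast<std::size_t>(sb.st_size);
    void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {                   // MAP_FAILED is (void*)-1, not nullptr
        std::string message = fail("mmap");  // read errno before close()
        ::close(fd);
        return message;
    }

    ::munmap(p, size);
    ::close(fd);
    return "";
}
```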

8. Extensions & Challenges

Extension                           | Difficulty | Concepts Learned
Write support: modify mapped memory | Medium     | MAP_SHARED, msync
Parallel line counting (threads)    | Medium     | Cache line alignment, reduction
madvise hints for sequential access | Easy       | OS optimization hints
Memory-mapped database index        | Hard       | B-tree on mapped memory
Large file on 32-bit system         | Hard       | Sliding window mapping
Compare with read() performance     | Easy       | Benchmarking
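
As a starting point for the madvise extension, the sketch below maps a file, tells the kernel the access pattern will be sequential, and counts newlines. The function name countLinesSequentialHint is hypothetical. The hint is advisory, so the code ignores its return value, and whether it helps should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps the file read-only, advises sequential access, and counts '\n' bytes.
// Returns the count, or -1 on error.
long long countLinesSequentialHint(const char* path) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat sb;
    if (::fstat(fd, &sb) < 0) {
        ::close(fd);
        return -1;
    }
    const std::size_t size = static_cast<std::size_t>(sb.st_size);
    if (size == 0) {  // zero-length mappings are rejected by mmap
        ::close(fd);
        return 0;
    }

    void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);
    if (p == MAP_FAILED) return -1;

    // Advisory only: the kernel may ignore it, so failure is not fatal.
    (void)::madvise(p, size, MADV_SEQUENTIAL);

    const char* bytes = static_cast<const char*>(p);
    long long lines = 0;
    for (std::size_t i = 0; i < size; ++i) {
        lines += (bytes[i] == '\n');
    }

    ::munmap(p, size);
    return lines;
}
```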

9. Real-World Connections

How Production Systems Use Memory Mapping

System       | Usage
SQLite       | Optional memory-mapped I/O for the database file (enabled with PRAGMA mmap_size)
PostgreSQL   | Uses mmap for some file operations
MongoDB      | Memory-mapped data files in the legacy MMAPv1 engine (removed in MongoDB 4.2)
Linux kernel | Maps executables and shared libraries
Chrome       | Maps downloaded files for scanning
Git          | Maps pack files for object access

When NOT to Use mmap

  • Very small files (overhead not worth it)
  • Purely sequential reads (read() with large buffer can be faster)
  • Files that change during access (SIGBUS danger)
  • When you need fine-grained error handling

10. Resources

Primary References

  • Kerrisk, M. “The Linux Programming Interface” - Chapter 49 (Memory Mappings)
  • Bryant & O’Hallaron “Computer Systems: A Programmer’s Perspective” - Chapter 9
  • Stevens & Rago “Advanced Programming in the UNIX Environment” - Chapter 14

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • Can map files of any size (tested with 1GB+)
  • Line count matches wc -l
  • Word count matches wc -w
  • Search finds all occurrences correctly
  • view command shows correct byte range
  • Works on Linux or macOS (POSIX)
  • Works on Windows (if targeting both)
  • Resources properly cleaned up (no leaks)
  • Handles missing files gracefully
  • Performance meets non-functional requirements

12. Submission / Completion Criteria

This project is complete when:

  1. All functional requirements (F1-F10) are implemented
  2. All example operations from section 3.4 work correctly
  3. 1GB file is processed in under 3 seconds for any operation
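  4. Private (heap) memory usage stays under 100MB regardless of file size; resident pages of the mapped file are reclaimable cache and do not count against this limit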
  5. Cross-platform support works on at least 2 platforms
  6. RAII ensures no resource leaks
  7. You can explain when mmap is better than read()
  8. You understand what happens during a page fault

Deliverables:

  • Source code with platform-specific sections clearly marked
  • README with build instructions for each platform
  • Performance benchmarks comparing small vs large files
  • Brief writeup explaining your design decisions

This project bridges user-space programming and OS internals. Memory mapping is how operating systems efficiently share data between processes and files. Understanding it prepares you for databases, browsers, and any performance-critical file processing.