Project 2: Memory-Mapped File Reader

Build a file reader that memory-maps files for efficient random access, enabling operations on gigabyte-sized files without loading them into RAM.

Quick Reference

Attribute	Value
Difficulty	Intermediate
Time Estimate	1-2 weeks
Language	C++
Prerequisites	Project 1, understanding of pointers, basic file I/O
Key Topics	Virtual memory, mmap/MapViewOfFile, pointer arithmetic, RAII, cross-platform abstraction

1. Learning Objectives

After completing this project, you will:

Understand how virtual memory maps files directly into your address space
Master the mmap (POSIX) system call and the MapViewOfFile (Windows) API
Learn safe pointer arithmetic for traversing memory regions
Implement cross-platform code using conditional compilation
Apply the RAII pattern to ensure proper resource cleanup
Build high-performance file processing tools

2. Theoretical Foundation

2.1 Core Concepts

Virtual Memory and Memory Mapping

Every process has a virtual address space: a vast range of addresses that don’t correspond directly to physical RAM. The CPU’s memory management unit (MMU) translates virtual addresses to physical addresses using page tables that the operating system maintains.

Memory mapping exploits this: instead of reading a file into a buffer, you ask the OS to map the file’s contents into your virtual address space. The file appears as a contiguous block of memory starting at some pointer.

Traditional File Reading:
  File on disk --read()--> Kernel buffer --copy--> User buffer --access--> Your code

Memory Mapping:
  File on disk --mmap()--> Virtual address space --access--> Your code
                          (Pages loaded on demand)
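
The two paths above can be compared on a single byte. This is a minimal POSIX-only sketch; the helper names byte_via_read and byte_via_mmap are hypothetical and not part of the project's required interface.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: fetch one byte with an explicit read into a user buffer.
// Returns -1 on any error (including an offset at or past end of file).
int byte_via_read(const char* path, off_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char buf = 0;
    ssize_t n = ::pread(fd, &buf, 1, offset);  // kernel copies the byte into buf
    ::close(fd);
    return n == 1 ? buf : -1;
}

// Hypothetical helper: fetch the same byte through a read-only mapping.
int byte_via_mmap(const char* path, off_t offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat sb;
    if (::fstat(fd, &sb) < 0 || offset < 0 || offset >= sb.st_size) {
        ::close(fd);
        return -1;
    }
    const std::size_t length = static_cast<std::size_t>(sb.st_size);
    void* base = ::mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return -1;
    int value = static_cast<const unsigned char*>(base)[offset];  // plain memory access
    ::munmap(base, length);
    return value;
}
```

Both helpers return the same value for a valid offset; the difference is that the mapped version reads through a pointer instead of copying into a caller-supplied buffer.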

Page Faults and Demand Paging

When you access a memory-mapped region, the page might not be in RAM. The CPU triggers a page fault, the OS loads the page from disk, and your code continues—transparently. Only accessed pages are loaded.

For a 16GB file on a machine with 8GB RAM, you can still “access” the entire file. The OS decides which pages stay in RAM, evicts the rest, and reloads them from disk on demand.

Benefits of Memory Mapping

Benefit	Explanation
No explicit I/O	Read/write through pointers, not syscalls
Zero-copy access	No buffer copying between kernel and user space
Demand paging	Only touched pages loaded into RAM
Shared mapping	Multiple processes can share the same mapping
Simplified code	Treat file as array; OS handles caching

Platform APIs

POSIX (Linux, macOS, BSD):

void* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void* addr, size_t length);

Windows:

HANDLE CreateFileMapping(HANDLE hFile, ...);
void* MapViewOfFile(HANDLE hMap, DWORD access, ...);
BOOL UnmapViewOfFile(void* lpBaseAddress);

2.2 Why This Matters

Memory mapping is used in:

Databases: LMDB is built on mmap, and SQLite offers optional memory-mapped I/O
Video editors: Large media files mapped rather than loaded
Search tools: ripgrep can memory-map the files it searches
Program loaders: Executables and shared libraries are mapped into each process rather than copied
Game engines: Stream assets from disk via memory mapping

Understanding mmap reveals how operating systems virtualize memory and how high-performance software accesses data.

2.3 Historical Context

Memory mapping predates modern operating systems. Multics (1960s) introduced the concept. UNIX adopted it in the 1980s with mmap. Windows followed with file mapping objects.

The key insight: files and memory are both just bytes. Why treat them differently? Memory mapping unifies them—a file is just a region of memory that persists on disk.

2.4 Common Misconceptions

Misconception 1: “mmap loads the entire file into RAM” False. Pages are loaded on demand. On a 64-bit system you can map a 100GB file on a machine with 4GB RAM.

Misconception 2: “mmap is always faster than read()” False. For sequential, single-pass reading, read() with appropriate buffering can be faster. mmap shines for random access and repeated access.

Misconception 3: “Memory-mapped files are automatically persistent” Partially true. Changes to MAP_SHARED mappings are written back to the file eventually (call msync to force it). MAP_PRIVATE mappings are copy-on-write: changes stay in memory and never reach the file.

Misconception 4: “Pointer arithmetic on mmap is dangerous” True if you go out of bounds. But with proper size checking, it’s safe and efficient.
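
The MAP_SHARED versus MAP_PRIVATE behavior described under Misconception 3 can be observed directly. The sketch below is POSIX-only and assumes a small writable scratch file; poke_first_byte and first_byte are hypothetical helper names introduced for illustration.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: overwrite the first byte of `path` through a writable
// mapping created with `flags` (MAP_PRIVATE or MAP_SHARED). Returns false on error.
bool poke_first_byte(const char* path, int flags, char value) {
    int fd = ::open(path, O_RDWR);
    if (fd < 0) return false;
    void* base = ::mmap(nullptr, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    ::close(fd);
    if (base == MAP_FAILED) return false;
    static_cast<char*>(base)[0] = value;
    // For MAP_SHARED, msync asks the kernel to flush the change to the file now.
    bool ok = true;
    if ((flags & MAP_SHARED) != 0) ok = ::msync(base, 1, MS_SYNC) == 0;
    ::munmap(base, 1);
    return ok;
}

// Hypothetical helper: read the first byte back with ordinary file I/O.
int first_byte(const char* path) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char b = 0;
    ssize_t n = ::pread(fd, &b, 1, 0);
    ::close(fd);
    return n == 1 ? b : -1;
}
```

With a file that starts with 'h', poking 'J' through MAP_PRIVATE should leave the file unchanged, while the same write through MAP_SHARED should be visible to first_byte afterwards.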

3. Project Specification

3.1 What You Will Build

A cross-platform memory-mapped file reader that:

Maps files into memory using platform-native APIs
Supports operations on very large files (multi-gigabyte)
Provides fast line counting, word counting, and searching
Displays arbitrary byte ranges
Measures performance to demonstrate mmap efficiency
Cleans up resources automatically via RAII

3.2 Functional Requirements

ID	Requirement
F1	Open and memory-map any file up to 64GB
F2	Report file size immediately upon mapping
F3	Count lines in the file
F4	Count words in the file
F5	Count characters/bytes in the file
F6	Search for a pattern and report all occurrences
F7	Display arbitrary byte range (e.g., bytes 1000-2000)
F8	Report timing for operations
F9	Handle missing/inaccessible files gracefully
F10	Work on both Linux/macOS (POSIX) and Windows

3.3 Non-Functional Requirements

ID	Requirement
N1	Line counting: process 1GB in under 2 seconds
N2	Searching: process 1GB in under 3 seconds
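N3	Memory usage: heap (anonymous) memory should not exceed 100MB for any file size; resident file-backed pages are reclaimable by the OS and may exceed this while scanning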
N4	Startup time: map a file in under 10ms
N5	No file size limits except OS/filesystem limits

3.4 Example Usage / Output

$ ./mmap_reader largefile.log

File: largefile.log (4.2 GB)
Mapped successfully.

Commands:
  count - Count lines/words
  search <pattern> - Find pattern
  view <start> <end> - Show byte range
  quit - Exit

> count
Lines: 45,234,891
Words: 312,456,789
Chars: 4,512,345,678
(Completed in 1.2 seconds - file never fully loaded to RAM)

> search "ERROR"
Found 1,234 occurrences
First 5:
  Line 45: [2024-01-15 10:23:45] ERROR: Connection timeout
  Line 892: [2024-01-15 10:25:12] ERROR: Database unreachable
  Line 1203: [2024-01-15 10:27:33] ERROR: Authentication failed
  Line 2891: [2024-01-15 10:30:01] ERROR: Disk full
  Line 5123: [2024-01-15 10:35:55] ERROR: Network unreachable
(Completed in 0.8 seconds)

> view 0 1000
[Shows first 1000 bytes of file]

> view 1000000 1000100
[Shows bytes 1000000-1000100]

> quit
Goodbye!

3.5 Real World Outcome

After completing this project, you will:

Understand OS virtual memory deeply
Know when to use mmap vs traditional I/O
Have a powerful log analysis tool for large files
Be prepared for systems programming interviews about memory

4. Solution Architecture

4.1 High-Level Design

+------------------+
|   Command Line   |
+------------------+
         |
         v
+------------------+
| MemoryMappedFile |  <-- Cross-platform abstraction
+------------------+
    |         |
    v         v
+-------+  +----------+
| POSIX |  | Windows  |
| mmap  |  | MapView  |
+-------+  +----------+
    |         |
    v         v
+------------------+
|    OS Kernel     |
+------------------+
         |
         v
+------------------+
|    File System   |
+------------------+
         |
         v
+------------------+
|      Disk        |
+------------------+

4.2 Key Components

MemoryMappedFile Class

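#include <cstddef>
#include <string>
#include <string_view>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

class MemoryMappedFile {
public:
    MemoryMappedFile() = default;
    ~MemoryMappedFile() { close(); }  // RAII: unmap and release handles

    // Non-copyable: two objects must never unmap the same region.
    MemoryMappedFile(const MemoryMappedFile&) = delete;
    MemoryMappedFile& operator=(const MemoryMappedFile&) = delete;

    bool open(const std::string& path);
    void close();  // must be safe to call when nothing is mapped

    const char* data() const;
    size_t size() const;

    const char* begin() const;
    const char* end() const;

    // Query operations
    size_t countLines() const;
    size_t countWords() const;
    std::vector<size_t> search(const std::string& pattern) const;
    std::string_view view(size_t start, size_t end) const;

private:
#ifdef _WIN32
    HANDLE hFile_ = INVALID_HANDLE_VALUE;
    HANDLE hMapping_ = nullptr;
#else
    int fd_ = -1;
#endif
    void* data_ = nullptr;
    size_t size_ = 0;
};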
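
One way the view member could guard against out-of-range requests is to clamp both arguments to the mapped size. This is a sketch written as a free function (checked_view is a hypothetical name), and it assumes the range is half-open, [start, end), which the specification does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical free-function version of view(): returns bytes [start, end)
// of a mapped region, clamped so the result never extends past `size`.
std::string_view checked_view(const char* data, std::size_t size,
                              std::size_t start, std::size_t end) {
    start = std::min(start, size);
    end = std::min(end, size);
    if (data == nullptr || end <= start) return {};  // empty or inverted range
    return std::string_view(data + start, end - start);
}
```

Because std::string_view does not copy, the returned view is only valid while the file stays mapped.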

REPL (Command Loop)

Parses commands: count, search, view, quit
Measures timing for operations
Formats output
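
Timing in the command loop can be kept in one small helper. The sketch below is one possible shape; time_ms is a hypothetical name, and std::chrono::steady_clock is chosen because it is monotonic.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs `operation` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& operation) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(operation)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A command handler could then write something like `double ms = time_ms([&] { lines = mmf.countLines(); });` and print the result after the counts.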

4.3 Data Structures

Structure	Purpose
`void* data_`	Pointer to mapped memory region (cast to `const char*` for byte access)
`size_t size_`	Size of the mapped file
`std::vector<size_t>`	Store byte offsets of search matches

4.4 Algorithm Overview

Line Counting

count = 0
for each byte in [begin, end):
    if byte == '\n':
        count++
return count

This is cache-friendly sequential access—the OS prefetches pages as you iterate.
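
In C++ the loop above reduces to a single standard-library call. This is a sketch over a plain pointer range; count_lines is a hypothetical free function standing in for the countLines member.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical free function: counts '\n' bytes in [begin, end).
// A final line without a trailing newline is not counted, matching wc -l.
std::size_t count_lines(const char* begin, const char* end) {
    return static_cast<std::size_t>(std::count(begin, end, '\n'));
}
```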

Word Counting

count = 0
in_word = false
for each byte in [begin, end):
    if isspace(byte):
        in_word = false
    else if not in_word:
        in_word = true
        count++
return count
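
A direct C++ translation of the word-counting state machine might look like the sketch below. count_words is a hypothetical free function; the one detail worth noting is the cast to unsigned char, since mapped files can contain bytes above 127.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical free function: counts whitespace-separated words in [begin, end).
std::size_t count_words(const char* begin, const char* end) {
    std::size_t count = 0;
    bool in_word = false;
    for (const char* p = begin; p != end; ++p) {
        // Cast first: passing a negative char to std::isspace is undefined behavior.
        if (std::isspace(static_cast<unsigned char>(*p))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;  // first byte of a new word
            ++count;
        }
    }
    return count;
}
```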

Pattern Search

pattern = "ERROR"
results = []
ptr = begin
while ptr < end:
    found = std::search(ptr, end, pattern.begin(), pattern.end())
    if found == end:
        break
    results.push_back(found - begin)  // Byte offset
    ptr = found + 1
return results

For better performance, use Boyer-Moore (std::search with std::boyer_moore_searcher in C++17).
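
The sample session in section 3.4 prints line numbers, while search returns byte offsets. One way to bridge the two is a single pass that counts newlines up to each match. The sketch below assumes the offsets arrive in ascending order (as the search loop produces them); line_numbers is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts ascending byte offsets into 1-based line
// numbers, scanning each byte of the data at most once.
std::vector<std::size_t> line_numbers(const char* data, std::size_t size,
                                      const std::vector<std::size_t>& offsets) {
    std::vector<std::size_t> lines;
    lines.reserve(offsets.size());
    std::size_t line = 1;
    std::size_t scanned = 0;  // newlines in [0, scanned) are already counted
    for (std::size_t offset : offsets) {
        const std::size_t stop = std::min(offset, size);
        if (stop > scanned) {
            line += static_cast<std::size_t>(
                std::count(data + scanned, data + stop, '\n'));
            scanned = stop;
        }
        lines.push_back(line);
    }
    return lines;
}
```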

5. Implementation Guide

5.1 Development Environment Setup

# Linux/macOS
mkdir mmap_reader && cd mmap_reader
touch main.cpp memory_mapped_file.hpp memory_mapped_file.cpp
g++ -std=c++17 -O2 -Wall -o mmap_reader main.cpp memory_mapped_file.cpp

# Windows (Visual Studio Developer Command Prompt)
cl /std:c++17 /O2 /EHsc main.cpp memory_mapped_file.cpp /Fe:mmap_reader.exe

# Create test file
dd if=/dev/zero of=testfile.bin bs=1048576 count=100  # 100MB file (a byte count works with both GNU and BSD dd)

5.2 Project Structure

mmap_reader/
├── CMakeLists.txt
├── include/
│   └── memory_mapped_file.hpp
├── src/
│   ├── main.cpp
│   └── memory_mapped_file.cpp
└── tests/
    ├── test_mmap.cpp
    └── test_large_file.cpp

5.3 The Core Question You’re Answering

“How can I process files larger than available RAM efficiently, without manual chunking or buffering?”

This decomposes into:

How does virtual memory allow accessing more data than fits in RAM?
How do I use OS APIs to map files into my address space?
How do I safely traverse a memory region with pointers?
How do I ensure resources (file handles, mappings) are properly cleaned up?

5.4 Concepts You Must Understand First

Question	Book Reference
What is virtual memory?	CSAPP Chapter 9
What is a page fault?	CSAPP Chapter 9
How do file descriptors work on POSIX?	TLPI Chapter 4
What is the difference between stack and heap?	CSAPP Chapter 9
What is RAII in C++?	Effective C++ Item 13

5.5 Questions to Guide Your Design

Memory Mapping Design

What happens if the file is too large for a 32-bit address space?
What protection flags should you use (read-only vs read-write)?
Should you use MAP_PRIVATE or MAP_SHARED?
What if the file changes while you’re mapping it?

Cross-Platform Abstraction

How will you hide platform differences behind a common interface?
What’s the cleanest way to use #ifdef for platform-specific code?
Should platform code be in separate files or the same file?

Resource Management

How will you ensure the mapping is always unmapped, even if exceptions occur?
What order should cleanup happen (unmap before close)?
What should the destructor do? What about copy/move?
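
As one possible answer to the questions above, here is a minimal POSIX-only RAII sketch. MappedRegion is a hypothetical class, deliberately smaller than the MemoryMappedFile interface in section 4.2; it closes the descriptor right after mapping, deletes copying, and supports moving.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utility>

// Hypothetical POSIX-only RAII wrapper around a read-only mapping.
class MappedRegion {
public:
    explicit MappedRegion(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat sb;
        if (::fstat(fd, &sb) == 0 && sb.st_size > 0) {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(sb.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = p;
                size_ = static_cast<std::size_t>(sb.st_size);
            }
        }
        ::close(fd);  // the mapping outlives the descriptor
    }

    ~MappedRegion() { reset(); }

    // Copying would lead to a double munmap, so it is disabled.
    MappedRegion(const MappedRegion&) = delete;
    MappedRegion& operator=(const MappedRegion&) = delete;

    // Moving transfers ownership and leaves the source empty.
    MappedRegion(MappedRegion&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    MappedRegion& operator=(MappedRegion&& other) noexcept {
        if (this != &other) {
            reset();
            data_ = std::exchange(other.data_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    const char* data() const { return static_cast<const char*>(data_); }
    std::size_t size() const { return size_; }
    bool valid() const { return data_ != nullptr; }

private:
    void reset() {
        if (data_ != nullptr) ::munmap(data_, size_);
        data_ = nullptr;
        size_ = 0;
    }

    void* data_ = nullptr;
    std::size_t size_ = 0;
};
```

In this sketch an empty or missing file simply yields an invalid object; a fuller design would report why the mapping failed.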

Error Handling

What errors can mmap/MapViewOfFile return?
How will you report errors to the caller?
Should you throw exceptions or return error codes?

5.6 Thinking Exercise

Trace what happens when you run this code:

MemoryMappedFile mmf;
mmf.open("4GB_file.log");

// Count newlines
size_t lines = 0;
for (const char* p = mmf.begin(); p != mmf.end(); ++p) {
    if (*p == '\n') lines++;
}

Questions to answer:

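After open(), how much RAM is used? (Answer: almost none. The kernel only records the mapping; no file pages are resident yet)
When does the first page fault occur? (Answer: first dereference of *p)
How many page faults total for a 4GB file with 4KB pages? (Answer: up to ~1 million if pages are not cached, since 4GB / 4KB = 1,048,576 pages; kernel readahead usually brings the actual count well below that)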
What if you run the same count again immediately? (Answer: few/no page faults—pages cached)
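
The page count in the third question is a rounding-up division. The sketch below works it out at compile time; pages_for is a hypothetical helper, and "4GB" is taken to mean 4 GiB with 4 KiB pages.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of pages needed to cover `file_size` bytes.
constexpr std::uint64_t pages_for(std::uint64_t file_size, std::uint64_t page_size) {
    return (file_size + page_size - 1) / page_size;  // round up
}

// 4 GiB / 4 KiB = 2^32 / 2^12 = 2^20 pages.
static_assert(pages_for(4ull << 30, 4096) == 1048576, "4 GiB spans 2^20 pages");
```

This is an upper bound on page faults for one pass, since a single fault can bring in several pages.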

5.7 Hints in Layers

Hint 1: Starting Point (Conceptual) Start with POSIX-only implementation. Get mmap working on Linux/macOS first. Add Windows support later with #ifdef.

Hint 2: Next Level (More Specific)

POSIX implementation skeleton:

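// Requires <fcntl.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>
bool MemoryMappedFile::open(const std::string& path) {
    fd_ = ::open(path.c_str(), O_RDONLY);
    if (fd_ < 0) return false;

    struct stat sb;
    if (fstat(fd_, &sb) < 0) {
        ::close(fd_);
        fd_ = -1;
        return false;
    }
    size_ = static_cast<size_t>(sb.st_size);
    if (size_ == 0) return true;  // mmap rejects length 0; leave data_ null

    data_ = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
    if (data_ == MAP_FAILED) {
        data_ = nullptr;  // never leave MAP_FAILED stored in data_
        size_ = 0;
        ::close(fd_);
        fd_ = -1;
        return false;
    }

    return true;
}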

Hint 3: Technical Details (Approach)

Windows implementation:

// Requires <windows.h>
bool MemoryMappedFile::open(const std::string& path) {
    hFile_ = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                         nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (hFile_ == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER fileSize;
    if (!GetFileSizeEx(hFile_, &fileSize)) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        return false;
    }
    size_ = static_cast<size_t>(fileSize.QuadPart);
    if (size_ == 0) return true;  // CreateFileMapping fails on empty files

    hMapping_ = CreateFileMappingA(hFile_, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!hMapping_) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        size_ = 0;
        return false;
    }

    data_ = MapViewOfFile(hMapping_, FILE_MAP_READ, 0, 0, 0);
    if (!data_) {
        CloseHandle(hMapping_);
        CloseHandle(hFile_);
        hMapping_ = nullptr;
        hFile_ = INVALID_HANDLE_VALUE;
        size_ = 0;
        return false;
    }

    return true;
}

Hint 4: Implementation Details

For efficient searching, use the standard library:

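// Requires <algorithm>, <string>, <vector>
std::vector<size_t> MemoryMappedFile::search(const std::string& pattern) const {
    std::vector<size_t> results;
    if (pattern.empty()) return results;  // otherwise every offset would "match"

    const char* p = begin();
    const char* e = end();

    while (p < e) {
        p = std::search(p, e, pattern.begin(), pattern.end());
        if (p == e) break;
        results.push_back(static_cast<size_t>(p - begin()));  // byte offset
        ++p;  // advance one byte so overlapping matches are found
    }
    return results;
}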

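With C++17, std::boyer_moore_searcher (from <functional>) can skip ahead on mismatches, approaching O(n/m) comparisons in the best case:

std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
while (p < e) {
    auto it = std::search(p, e, searcher);
    if (it == e) break;
    results.push_back(static_cast<size_t>(it - begin()));
    p = it + 1;
}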

5.8 The Interview Questions They’ll Ask

“What is memory mapping and when would you use it?”
- Maps file directly into virtual address space
- Use for: random access, large files, shared memory between processes
- Avoid for: small files, sequential-only access
“What happens when you access a memory-mapped page that isn’t in RAM?”
- Page fault: CPU traps to OS
- OS reads page from disk into physical memory
- OS updates page tables
- CPU retries the instruction
“How does mmap compare to read() for performance?”
- mmap: no copy from kernel buffer, good for random access
- read: explicit control, can be faster for sequential
- For large files with random access, mmap wins
“What are the pitfalls of memory mapping?”
- 32-bit systems: limited address space (4GB total, typically 2-3GB usable by one process)
- File changes: if file truncated, access causes SIGBUS
- Error handling: page faults can fail (SIGBUS)
- Thread safety: need synchronization for MAP_SHARED writes
“How would you handle a 1TB file on a machine with 16GB RAM?”
- 64-bit required (enough virtual address space)
- mmap the whole file (OS pages in/out as needed)
- Or: map in chunks (windows), sliding window approach
- The OS LRU cache handles which pages stay in RAM
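
The sliding-window approach mentioned in the last answer can be sketched as follows. This is POSIX-only and illustrative: count_lines_windowed is a hypothetical helper, and the window is expressed in whole pages because the offset passed to mmap must be a multiple of the page size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: counts '\n' bytes by mapping one window at a time.
// `window_pages` is the window length in pages; returns -1 on error.
std::int64_t count_lines_windowed(const char* path, std::size_t window_pages) {
    if (window_pages == 0) return -1;
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat sb;
    if (::fstat(fd, &sb) < 0) {
        ::close(fd);
        return -1;
    }

    const off_t page = static_cast<off_t>(::sysconf(_SC_PAGESIZE));
    const off_t window = page * static_cast<off_t>(window_pages);  // page-aligned
    std::int64_t lines = 0;
    for (off_t offset = 0; offset < sb.st_size; offset += window) {
        const std::size_t len =
            static_cast<std::size_t>(std::min<off_t>(window, sb.st_size - offset));
        void* base = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
        if (base == MAP_FAILED) {
            ::close(fd);
            return -1;
        }
        const char* p = static_cast<const char*>(base);
        lines += std::count(p, p + len, '\n');
        ::munmap(base, len);  // release this window before mapping the next
    }
    ::close(fd);
    return lines;
}
```

Counting single bytes is easy to split this way. A multi-byte pattern search would also need to handle matches that straddle a window boundary, for example by overlapping consecutive windows.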

5.9 Books That Will Help

Topic	Book	Chapter
Memory Mapping	The Linux Programming Interface	Chapter 49
Virtual Memory	Computer Systems: A Programmer’s Perspective	Chapter 9
File I/O	C++ Primer Plus	Chapter 17
Platform Abstraction	C++ Primer Plus	Chapter 9
System Calls	Advanced Programming in the UNIX Environment	Chapters 4, 14

5.10 Implementation Phases

Phase 1: Basic POSIX Mapping (Day 1-2)

Implement open() with mmap on Linux/macOS
Implement close() with munmap
Test: map a small file, print first 100 bytes

Phase 2: Iteration and Counting (Day 3-4)

Implement begin(), end()
Implement countLines()
Implement countWords()
Test: count lines in a log file, compare with wc -l

Phase 3: Searching (Day 5)

Implement search() with std::search
Return byte offsets of matches
Test: search for pattern in log file

Phase 4: Windows Support (Day 6-7)

Add #ifdef blocks for Windows
Implement with CreateFileMapping/MapViewOfFile
Test: verify same behavior on Windows

Phase 5: REPL and Polish (Day 8-10)

Add interactive command loop
Add timing for operations
Add view command for byte ranges
Handle edge cases (empty files, binary files)

5.11 Key Implementation Decisions

Decision	Options	Recommendation
Mapping flags	MAP_SHARED vs MAP_PRIVATE	MAP_PRIVATE (no file modification)
Protection	PROT_READ vs PROT_READ\|PROT_WRITE	PROT_READ (read-only)
32-bit support	Map whole file vs chunks	Chunks if supporting 32-bit
Search algorithm	std::search vs custom	std::search (or boyer_moore)
Error handling	Exceptions vs error codes	Error codes (system programming style)

6. Testing Strategy

6.1 Unit Tests

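// Requires <cassert>, <cstdio>, <fstream>
void testSmallFile() {
    // Create temp file; binary mode keeps "\n" as one byte on Windows
    std::ofstream f("test.txt", std::ios::binary);
    f << "Line 1\nLine 2\nLine 3\n";
    f.close();

    MemoryMappedFile mmf;
    bool opened = mmf.open("test.txt");  // keep side effects out of assert()
    assert(opened);
    assert(mmf.size() == 21);
    assert(mmf.countLines() == 3);
    assert(mmf.countWords() == 6);

    auto matches = mmf.search("Line");
    assert(matches.size() == 3);

    mmf.close();  // Windows cannot delete a file that is still mapped
    std::remove("test.txt");
}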

// Requires <cassert>, <chrono>, <iostream>
void testLargeFile() {
    // Create 1GB file
    // ... (use dd or generate programmatically)

    MemoryMappedFile mmf;
    bool opened = mmf.open("large.bin");  // keep side effects out of assert()
    assert(opened);

    // steady_clock is monotonic, which makes it the safer choice for intervals
    auto start = std::chrono::steady_clock::now();
    size_t lines = mmf.countLines();
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Counted " << lines << " lines in " << duration.count() << "ms\n";
    assert(duration.count() < 2000);  // Under 2 seconds
}

6.2 Edge Cases to Test

Input	Expected
Empty file (0 bytes)	size=0, lines=0, words=0
File with no newlines	lines=0 (or 1 if counting partial)
Binary file	Should work (may have nulls)
Very long lines	Should work
Non-existent file	open() returns false
Permission denied	open() returns false
File in use	Should still work (read sharing)

7. Common Pitfalls & Debugging

Problem	Symptom	Root Cause	Fix
Segfault	Crash on access	Pointer out of bounds	Check `p < end()`
SIGBUS	Crash during access	File truncated	Catch signal or check size
Memory leak	RSS grows	munmap not called	Use RAII destructor
Wrong size	Incorrect count	32-bit overflow	Use size_t, not int
Slow performance	Too slow	Debug build	Compile with -O2
Windows fails	CreateFileMapping fails	Wrong flags	Check GENERIC_READ, OPEN_EXISTING

Debugging Tips

Check return values: mmap returns MAP_FAILED (not nullptr); MapViewOfFile returns NULL on failure
Print errno: Use strerror(errno) for POSIX errors
Monitor RSS: ps aux | grep mmap_reader to check memory
Test small first: Use tiny files before large ones
Use strace/dtrace: Trace system calls to see what’s happening
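
The errno tip can be wrapped in a small helper so the REPL prints a useful message for missing or unreadable files. This is a POSIX-only sketch; open_or_explain is a hypothetical name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: opens `path` read-only. On failure it returns -1 and
// fills `error` with a message built from errno.
int open_or_explain(const char* path, std::string& error) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) {
        const int saved = errno;  // capture before another call can overwrite it
        error = std::string(path) + ": " + std::strerror(saved);
    }
    return fd;
}
```

The exact wording of the message comes from the C library and can vary with the locale, so tests should check for failure rather than for specific text.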

8. Extensions & Challenges

Extension	Difficulty	Concepts Learned
Write support: modify mapped memory	Medium	MAP_SHARED, msync
Parallel line counting (threads)	Medium	Cache line alignment, reduction
madvise hints for sequential access	Easy	OS optimization hints
Memory-mapped database index	Hard	B-tree on mapped memory
Large file on 32-bit system	Hard	Sliding window mapping
Compare with read() performance	Easy	Benchmarking
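
For the madvise extension, a starting point might be the portable posix_madvise call shown below. The sketch only issues the hint; advise_sequential is a hypothetical helper, and whether the hint changes performance depends on the OS and workload.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper: tells the kernel the region will be read front to back.
// The hint is advisory; a failure here does not invalidate the mapping.
bool advise_sequential(void* base, std::size_t length) {
    return ::posix_madvise(base, length, POSIX_MADV_SEQUENTIAL) == 0;
}
```

It would be called once after a successful mmap, before the first pass over the data.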

9. Real-World Connections

How Production Systems Use Memory Mapping

System	Usage
SQLite	Optional memory-mapped I/O for the database file (PRAGMA mmap_size)
PostgreSQL	Uses mmap for some file operations
MongoDB	Memory-mapped data files in the legacy MMAPv1 engine (removed in MongoDB 4.2)
Linux kernel	Maps executables and shared libraries
Chrome	Maps downloaded files for scanning
Git	Maps pack files for object access

When NOT to Use mmap

Very small files (overhead not worth it)
Purely sequential reads (read() with large buffer can be faster)
Files that change during access (SIGBUS danger)
When you need fine-grained error handling

10. Resources

Primary References

Kerrisk, M. “The Linux Programming Interface” - Chapter 49 (Memory Mappings)
Bryant & O’Hallaron “Computer Systems: A Programmer’s Perspective” - Chapter 9
Stevens & Rago “Advanced Programming in the UNIX Environment” - Chapter 14

Online Resources

11. Self-Assessment Checklist

Before considering this project complete, verify:

12. Submission / Completion Criteria

This project is complete when:

All functional requirements (F1-F10) are implemented
All example operations from section 3.4 work correctly
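1GB file meets the timing targets in N1 and N2 (under 2 seconds for line counting, under 3 seconds for searching)
Heap memory usage stays under 100MB regardless of file size (mapped file pages are reclaimable by the OS and do not count toward this)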
Cross-platform support works on at least 2 platforms
RAII ensures no resource leaks
You can explain when mmap is better than read()
You understand what happens during a page fault

Deliverables:

Source code with platform-specific sections clearly marked
README with build instructions for each platform
Performance benchmarks comparing small vs large files
Brief writeup explaining your design decisions

This project bridges user-space programming and OS internals. Memory mapping is one of the main ways operating systems load programs and share file data between processes. Understanding it prepares you for databases, browsers, and any performance-critical file processing.