Project 2: Memory-Mapped File Reader
Build a file reader that memory-maps files for efficient random access, enabling operations on gigabyte-sized files without loading them into RAM.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C++ |
| Prerequisites | Project 1, understanding of pointers, basic file I/O |
| Key Topics | Virtual memory, mmap/MapViewOfFile, pointer arithmetic, RAII, cross-platform abstraction |
1. Learning Objectives
After completing this project, you will:
- Understand how virtual memory maps files directly into your address space
- Master the mmap (POSIX) and MapViewOfFile (Windows) system calls
- Learn safe pointer arithmetic for traversing memory regions
- Implement cross-platform code using conditional compilation
- Apply the RAII pattern to ensure proper resource cleanup
- Build high-performance file processing tools
2. Theoretical Foundation
2.1 Core Concepts
Virtual Memory and Memory Mapping
Every process has a virtual address space: a vast range of addresses that don’t correspond directly to physical RAM. The CPU’s memory management unit (MMU) translates virtual addresses to physical addresses using page tables that the operating system maintains.
Memory mapping exploits this: instead of reading a file into a buffer, you ask the OS to map the file’s contents into your virtual address space. The file appears as a contiguous block of memory starting at some pointer.
Traditional File Reading:
File on disk --read()--> Kernel buffer --copy--> User buffer --access--> Your code
Memory Mapping:
File on disk --mmap()--> Virtual address space --access--> Your code
(Pages loaded on demand)
Page Faults and Demand Paging
When you access a memory-mapped region, the page might not be in RAM. The CPU triggers a page fault, the OS loads the page from disk, and your code continues—transparently. Only accessed pages are loaded.
Even a file larger than physical RAM, say a 16GB file on a machine with 8GB RAM, can still be “accessed” in its entirety. The OS manages which pages are in RAM and which are on disk.
Benefits of Memory Mapping
| Benefit | Explanation |
|---|---|
| No explicit I/O | Read/write through pointers, not syscalls |
| Zero-copy access | No buffer copying between kernel and user space |
| Demand paging | Only touched pages loaded into RAM |
| Shared mapping | Multiple processes can share the same mapping |
| Simplified code | Treat file as array; OS handles caching |
Platform APIs
POSIX (Linux, macOS, BSD):
void* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void* addr, size_t length);
Windows:
HANDLE CreateFileMapping(HANDLE hFile, ...);
void* MapViewOfFile(HANDLE hMap, DWORD access, ...);
BOOL UnmapViewOfFile(void* lpBaseAddress);
2.2 Why This Matters
Memory mapping is used in:
- Databases: SQLite (optional memory-mapped I/O) and PostgreSQL (for some file operations) use mmap
- Video editors: Large media files mapped rather than loaded
- Text editors and hex viewers: some map large files so they can search without reading everything up front
- Browsers: Mmap executables and shared libraries
- Game engines: Stream assets from disk via memory mapping
Understanding mmap reveals how operating systems virtualize memory and how high-performance software accesses data.
2.3 Historical Context
Memory mapping predates modern operating systems. Multics (1960s) introduced the concept. UNIX adopted it in the 1980s with mmap. Windows followed with file mapping objects.
The key insight: files and memory are both just bytes. Why treat them differently? Memory mapping unifies them—a file is just a region of memory that persists on disk.
2.4 Common Misconceptions
Misconception 1: “mmap loads the entire file into RAM” False. Pages are loaded on demand. On a 64-bit system you can map a 100GB file on a machine with 4GB RAM.
Misconception 2: “mmap is always faster than read()” False. For sequential, single-pass reading, read() with appropriate buffering can be faster. mmap shines for random access and repeated access.
Misconception 3: “Memory-mapped files are automatically persistent” Partially true. Changes to MAP_SHARED mappings are written back. MAP_PRIVATE mappings are copy-on-write—changes stay in memory only.
Misconception 4: “Pointer arithmetic on mmap is dangerous” True if you go out of bounds. But with proper size checking, it’s safe and efficient.
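The size checking mentioned in Misconception 4 can be as small as one comparison before any pointer is formed. A minimal sketch, assuming a hypothetical free function named checked_view (not part of the project's class) that refuses any range falling outside the mapping:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the half-open byte range [start, end) of a
// mapped region, or std::nullopt if the range does not fit inside it.
std::optional<std::string_view> checked_view(const char* base, std::size_t size,
                                             std::size_t start, std::size_t end) {
    if (base == nullptr || start > end || end > size) {
        return std::nullopt;  // Reject instead of reading past the mapping.
    }
    return std::string_view(base + start, end - start);
}
```

Returning std::optional forces the caller to handle the out-of-range case explicitly; clamping the range to the file size would be an equally reasonable design.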
3. Project Specification
3.1 What You Will Build
A cross-platform memory-mapped file reader that:
- Maps files into memory using platform-native APIs
- Supports operations on very large files (multi-gigabyte)
- Provides fast line counting, word counting, and searching
- Displays arbitrary byte ranges
- Measures performance to demonstrate mmap efficiency
- Cleans up resources automatically via RAII
3.2 Functional Requirements
| ID | Requirement |
|---|---|
| F1 | Open and memory-map any file up to 64GB |
| F2 | Report file size immediately upon mapping |
| F3 | Count lines in the file |
| F4 | Count words in the file |
| F5 | Count characters/bytes in the file |
| F6 | Search for a pattern and report all occurrences |
| F7 | Display arbitrary byte range (e.g., bytes 1000-2000) |
| F8 | Report timing for operations |
| F9 | Handle missing/inaccessible files gracefully |
| F10 | Work on both Linux/macOS (POSIX) and Windows |
3.3 Non-Functional Requirements
| ID | Requirement |
|---|---|
| N1 | Line counting: process 1GB in under 2 seconds |
| N2 | Searching: process 1GB in under 3 seconds |
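| N3 | Memory usage: heap and other private memory should not exceed 100MB for any file size (resident file-backed pages can push RSS higher, but the OS can reclaim them) |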
| N4 | Startup time: map a file in under 10ms |
| N5 | No file size limits except OS/filesystem limits |
3.4 Example Usage / Output
$ ./mmap_reader largefile.log
File: largefile.log (4.2 GB)
Mapped successfully.
Commands:
count - Count lines/words
search <pattern> - Find pattern
view <start> <end> - Show byte range
quit - Exit
> count
Lines: 45,234,891
Words: 312,456,789
Chars: 4,512,345,678
(Completed in 1.2 seconds - file never fully loaded to RAM)
> search "ERROR"
Found 1,234 occurrences
First 5:
Line 45: [2024-01-15 10:23:45] ERROR: Connection timeout
Line 892: [2024-01-15 10:25:12] ERROR: Database unreachable
Line 1203: [2024-01-15 10:27:33] ERROR: Authentication failed
Line 2891: [2024-01-15 10:30:01] ERROR: Disk full
Line 5123: [2024-01-15 10:35:55] ERROR: Network unreachable
(Completed in 0.8 seconds)
> view 0 1000
[Shows first 1000 bytes of file]
> view 1000000 1000100
[Shows bytes 1000000-1000100]
> quit
Goodbye!
3.5 Real World Outcome
After completing this project, you will:
- Understand OS virtual memory deeply
- Know when to use mmap vs traditional I/O
- Have a powerful log analysis tool for large files
- Be prepared for systems programming interviews about memory
4. Solution Architecture
4.1 High-Level Design
+------------------+
| Command Line |
+------------------+
|
v
+------------------+
| MemoryMappedFile | <-- Cross-platform abstraction
+------------------+
| |
v v
+-------+ +----------+
| POSIX | | Windows |
| mmap | | MapView |
+-------+ +----------+
| |
v v
+------------------+
| OS Kernel |
+------------------+
|
v
+------------------+
| File System |
+------------------+
|
v
+------------------+
| Disk |
+------------------+
4.2 Key Components
MemoryMappedFile Class
class MemoryMappedFile {
public:
bool open(const std::string& path);
void close();
const char* data() const;
size_t size() const;
const char* begin() const;
const char* end() const;
// Query operations
size_t countLines() const;
size_t countWords() const;
std::vector<size_t> search(const std::string& pattern) const;
std::string_view view(size_t start, size_t end) const;
private:
#ifdef _WIN32
HANDLE hFile_ = INVALID_HANDLE_VALUE;
HANDLE hMapping_ = nullptr;
#else
int fd_ = -1;
#endif
void* data_ = nullptr;
size_t size_ = 0;
};
REPL (Command Loop)
- Parses commands: count, search, view, quit
- Measures timing for operations
- Formats output
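The timing and formatting duties of the command loop could be covered by two small helpers. This is only a sketch: seconds_taken and with_commas are hypothetical names introduced here, and the comma grouping simply mirrors the sample output in section 3.4.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: runs `operation` once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double seconds_taken(F&& operation) {
    const auto start = std::chrono::steady_clock::now();
    operation();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: formats 45234891 as "45,234,891".
std::string with_commas(std::uint64_t value) {
    std::string digits = std::to_string(value);
    // Walk from the right, inserting a comma before each group of three.
    for (std::size_t pos = digits.size(); pos > 3; pos -= 3) {
        digits.insert(pos - 3, ",");
    }
    return digits;
}
```

steady_clock is used rather than system_clock because it never jumps backwards when the wall clock is adjusted.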
4.3 Data Structures
| Structure | Purpose |
|---|---|
| char* data_ | Pointer to mapped memory region |
| size_t size_ | Size of the mapped file |
| std::vector<size_t> | Store byte offsets of search matches |
4.4 Algorithm Overview
Line Counting
count = 0
for each byte in [begin, end):
if byte == '\n':
count++
return count
This is cache-friendly sequential access—the OS prefetches pages as you iterate.
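In C++ the line-counting loop collapses to a single standard-library call. A minimal sketch, assuming a hypothetical free function count_lines that takes the two ends of the mapped region:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: counts '\n' bytes in [begin, end).
// A final line without a trailing newline is not counted, which is
// the same convention `wc -l` follows.
std::size_t count_lines(const char* begin, const char* end) {
    return static_cast<std::size_t>(std::count(begin, end, '\n'));
}
```

Compilers can often vectorize std::count over a contiguous char range, so it is a reasonable baseline before trying memchr or hand-written SIMD.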
Word Counting
count = 0
in_word = false
for each byte in [begin, end):
if isspace(byte):
in_word = false
else if not in_word:
in_word = true
count++
return count
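The word-counting state machine translates to C++ almost line for line. A minimal sketch, assuming a hypothetical free function count_words; the one detail the pseudocode hides is that std::isspace needs an unsigned char argument:

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper: counts maximal runs of non-whitespace bytes in [begin, end).
std::size_t count_words(const char* begin, const char* end) {
    std::size_t count = 0;
    bool in_word = false;
    for (const char* p = begin; p != end; ++p) {
        // Cast first: passing a negative char value to isspace is undefined,
        // and bytes >= 0x80 are negative where char is signed.
        if (std::isspace(static_cast<unsigned char>(*p))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;  // First byte of a new word.
            ++count;
        }
    }
    return count;
}
```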
Pattern Search
pattern = "ERROR"
results = []
ptr = begin
while ptr < end:
found = std::search(ptr, end, pattern.begin(), pattern.end())
if found == end:
break
results.push_back(found - begin) // Byte offset
ptr = found + 1
return results
For better performance, use Boyer-Moore (std::search with std::boyer_moore_searcher in C++17).
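The Boyer-Moore variant could look like the following. This is a sketch under stated assumptions: find_all is a hypothetical free function, it reports overlapping matches just as the pseudocode does, and it treats an empty pattern as "no matches" rather than matching everywhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Hypothetical helper: byte offsets of every occurrence of `pattern`
// in [begin, end), including overlapping ones.
std::vector<std::size_t> find_all(const char* begin, const char* end,
                                  std::string_view pattern) {
    std::vector<std::size_t> offsets;
    if (pattern.empty()) return offsets;  // Assumed policy: nothing to find.

    // The searcher preprocesses the pattern once and is reused for every call.
    const std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
    const char* p = begin;
    while (p < end) {
        const char* found = std::search(p, end, searcher);
        if (found == end) break;
        offsets.push_back(static_cast<std::size_t>(found - begin));
        p = found + 1;  // Step one byte so overlapping matches are seen.
    }
    return offsets;
}
```

Building the searcher outside the loop matters: its constructor does the pattern preprocessing, which would otherwise be repeated for every match.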
5. Implementation Guide
5.1 Development Environment Setup
# Linux/macOS
mkdir mmap_reader && cd mmap_reader
touch main.cpp memory_mapped_file.hpp memory_mapped_file.cpp
g++ -std=c++17 -O2 -Wall -o mmap_reader main.cpp memory_mapped_file.cpp
# Windows (Visual Studio Developer Command Prompt)
cl /std:c++17 /O2 /EHsc main.cpp memory_mapped_file.cpp /Fe:mmap_reader.exe
# Create test file (POSIX shell; older macOS dd expects bs=1m)
dd if=/dev/zero of=testfile.bin bs=1M count=100 # 100MB file of zero bytes (contains no newlines)
5.2 Project Structure
mmap_reader/
├── CMakeLists.txt
├── include/
│ └── memory_mapped_file.hpp
├── src/
│ ├── main.cpp
│ └── memory_mapped_file.cpp
└── tests/
├── test_mmap.cpp
└── test_large_file.cpp
5.3 The Core Question You’re Answering
“How can I process files larger than available RAM efficiently, without manual chunking or buffering?”
This decomposes into:
- How does virtual memory allow accessing more data than fits in RAM?
- How do I use OS APIs to map files into my address space?
- How do I safely traverse a memory region with pointers?
- How do I ensure resources (file handles, mappings) are properly cleaned up?
5.4 Concepts You Must Understand First
| Question | Book Reference |
|---|---|
| What is virtual memory? | CSAPP Chapter 9 |
| What is a page fault? | CSAPP Chapter 9 |
| How do file descriptors work on POSIX? | TLPI Chapter 4 |
| What is the difference between stack and heap? | CSAPP Chapter 9 |
| What is RAII in C++? | Effective C++ Item 13 |
5.5 Questions to Guide Your Design
Memory Mapping Design
- What happens if the file is too large for a 32-bit address space?
- What protection flags should you use (read-only vs read-write)?
- Should you use MAP_PRIVATE or MAP_SHARED?
- What if the file changes while you’re mapping it?
Cross-Platform Abstraction
- How will you hide platform differences behind a common interface?
- What’s the cleanest way to use #ifdef for platform-specific code?
- Should platform code be in separate files or the same file?
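One way to approach these questions is to confine each #ifdef to the body of a small function with a platform-neutral signature. The sketch below illustrates the pattern with a hypothetical system_page_size function; the 4096 fallback is an assumption for the unlikely case that sysconf fails.

```cpp
#include <cstddef>

#ifdef _WIN32
#  include <windows.h>
#else
#  include <unistd.h>
#endif

// Hypothetical helper: callers see one signature; the platform split
// stays inside the function body.
std::size_t system_page_size() {
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return static_cast<std::size_t>(info.dwPageSize);
#else
    const long size = ::sysconf(_SC_PAGESIZE);
    // Assumed fallback if the query fails; 4 KiB is common but not universal.
    return size > 0 ? static_cast<std::size_t>(size) : 4096;
#endif
}
```

The same shape scales up to open() and close(): one declaration in the header, with the platform-specific bodies either behind #ifdef in one file or split into separate per-platform source files.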
Resource Management
- How will you ensure the mapping is always unmapped, even if exceptions occur?
- What order should cleanup happen (unmap before close)?
- What should the destructor do? What about copy/move?
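A possible answer to the destructor and copy/move questions is sketched below. ScopedResource is a hypothetical stand-in: its release callback plays the role of munmap()/close() or UnmapViewOfFile()/CloseHandle(), so the ownership rules can be seen without any platform code.

```cpp
#include <functional>
#include <utility>

// Hypothetical owner of exactly one resource.
class ScopedResource {
public:
    explicit ScopedResource(std::function<void()> release)
        : release_(std::move(release)) {}

    // The destructor runs on every exit path, including stack unwinding.
    ~ScopedResource() { reset(); }

    // Copying would lead to a double release, so it is disabled.
    ScopedResource(const ScopedResource&) = delete;
    ScopedResource& operator=(const ScopedResource&) = delete;

    // Moving transfers ownership and leaves the source empty.
    ScopedResource(ScopedResource&& other)
        : release_(std::exchange(other.release_, nullptr)) {}
    ScopedResource& operator=(ScopedResource&& other) {
        if (this != &other) {
            reset();  // Release what we currently own first.
            release_ = std::exchange(other.release_, nullptr);
        }
        return *this;
    }

    // Releases the resource now; calling it again is a no-op.
    void reset() {
        if (release_) {
            std::exchange(release_, nullptr)();
        }
    }

private:
    std::function<void()> release_;
};
```

In the real class the callback would be replaced by direct calls, with the view unmapped before the handles are closed.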
Error Handling
- What errors can mmap/MapViewOfFile return?
- How will you report errors to the caller?
- Should you throw exceptions or return error codes?
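If error codes are chosen, std::error_code can carry the reason for a failure without exceptions. A minimal sketch for the POSIX side, with hypothetical helper names; on Windows the analogous value would presumably be built from GetLastError() with std::system_category().

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical helper: call immediately after a failed POSIX call,
// before any other library call can overwrite errno.
std::error_code last_posix_error() {
    return std::error_code(errno, std::generic_category());
}

// Hypothetical helper: builds a message such as "open: <system text>".
std::string describe(const std::string& what, const std::error_code& ec) {
    return what + ": " + ec.message();
}
```

A function like open() could then take a std::error_code& out-parameter, or return the code directly, instead of a bare bool.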
5.6 Thinking Exercise
Trace what happens when you run this code:
MemoryMappedFile mmf;
mmf.open("4GB_file.log");
// Count newlines
size_t lines = 0;
for (const char* p = mmf.begin(); p != mmf.end(); ++p) {
if (*p == '\n') lines++;
}
Questions to answer:
- After open(), how much RAM is used? (Answer: almost none, just page tables)
- When does the first page fault occur? (Answer: first dereference of *p)
- How many page faults total for a 4GB file with 4KB pages? (Answer: up to ~1 million, i.e. 4GB / 4KB, if no pages are cached; kernel read-ahead usually lowers the count)
- What if you run the same count again immediately? (Answer: few/no page faults, because the pages are already cached)
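The "~1 million" figure is plain ceiling division of file size by page size. A small sketch that checks the arithmetic at compile time (pages_spanned is a hypothetical name):

```cpp
#include <cstdint>

// Hypothetical helper: number of pages needed to cover `file_size` bytes.
constexpr std::uint64_t pages_spanned(std::uint64_t file_size,
                                      std::uint64_t page_size) {
    return (file_size + page_size - 1) / page_size;  // Round up.
}

// 4 GiB / 4 KiB = 2^32 / 2^12 = 2^20 pages.
static_assert(pages_spanned(4ULL << 30, 4096) == 1048576,
              "a 4 GiB file spans 1,048,576 pages of 4 KiB");
```

This is an upper bound on the number of faults for one pass, not a prediction: the kernel may bring in several pages per fault.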
5.7 Hints in Layers
Hint 1: Starting Point (Conceptual)
Start with POSIX-only implementation. Get mmap working on Linux/macOS first. Add Windows support later with #ifdef.
Hint 2: Next Level (More Specific)
POSIX implementation skeleton:
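bool MemoryMappedFile::open(const std::string& path) {
    // Requires <fcntl.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>
    fd_ = ::open(path.c_str(), O_RDONLY);
    if (fd_ < 0) return false;
    struct stat sb;
    if (fstat(fd_, &sb) < 0) {
        ::close(fd_);
        fd_ = -1;
        return false;
    }
    size_ = static_cast<size_t>(sb.st_size);
    if (size_ == 0) return true;  // mmap rejects zero-length mappings; leave data_ null
    data_ = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
    if (data_ == MAP_FAILED) {
        data_ = nullptr;  // so close() and data() never see MAP_FAILED
        size_ = 0;
        ::close(fd_);
        fd_ = -1;
        return false;
    }
    return true;
}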
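To see the POSIX calls together with their cleanup, here is a compact, POSIX-only sketch. PosixMapping is a hypothetical class, not the project's MemoryMappedFile: it closes the descriptor right after mapping (the mapping stays valid on POSIX) and unmaps in the destructor.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical read-only mapping with RAII cleanup (POSIX only).
class PosixMapping {
public:
    explicit PosixMapping(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) return;
        struct stat sb {};
        if (::fstat(fd, &sb) == 0) {
            const auto size = static_cast<std::size_t>(sb.st_size);
            if (size == 0) {
                ok_ = true;  // Empty file: nothing to map, but not an error.
            } else {
                void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
                if (p != MAP_FAILED) {
                    data_ = p;
                    size_ = size;
                    ok_ = true;
                }
            }
        }
        ::close(fd);  // The mapping, if any, outlives the descriptor.
    }

    ~PosixMapping() {
        if (data_ != nullptr) ::munmap(data_, size_);
    }

    PosixMapping(const PosixMapping&) = delete;
    PosixMapping& operator=(const PosixMapping&) = delete;

    bool ok() const { return ok_; }

    std::string_view bytes() const {
        return data_ != nullptr
                   ? std::string_view(static_cast<const char*>(data_), size_)
                   : std::string_view();
    }

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
    bool ok_ = false;
};
```

Doing the work in the constructor is one design choice; keeping a separate open()/close() pair, as the class interface in section 4.2 does, works equally well as long as the destructor calls close().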
Hint 3: Technical Details (Approach)
Windows implementation:
bool MemoryMappedFile::open(const std::string& path) {
    // Requires <windows.h>
    hFile_ = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                         nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (hFile_ == INVALID_HANDLE_VALUE) return false;
    LARGE_INTEGER fileSize;
    if (!GetFileSizeEx(hFile_, &fileSize)) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        return false;
    }
    size_ = static_cast<size_t>(fileSize.QuadPart);
    if (size_ == 0) return true;  // CreateFileMapping fails on empty files; leave data_ null
    hMapping_ = CreateFileMappingA(hFile_, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!hMapping_) {
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        return false;
    }
    data_ = MapViewOfFile(hMapping_, FILE_MAP_READ, 0, 0, 0);
    if (!data_) {
        CloseHandle(hMapping_);
        hMapping_ = nullptr;
        CloseHandle(hFile_);
        hFile_ = INVALID_HANDLE_VALUE;
        return false;
    }
    return true;
}
Hint 4: Implementation Details
For efficient searching, use the standard library:
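std::vector<size_t> MemoryMappedFile::search(const std::string& pattern) const {
    std::vector<size_t> results;
    if (pattern.empty()) return results;  // std::search would match an empty pattern at every offset
    const char* p = begin();
    const char* e = end();
    while (p < e) {
        p = std::search(p, e, pattern.begin(), pattern.end());
        if (p == e) break;
        results.push_back(static_cast<size_t>(p - begin()));
        ++p;  // advance one byte so overlapping matches are found
    }
    return results;
}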
For C++17, use std::boyer_moore_searcher (from <functional>), which can skip ahead in the text and approaches O(n/m) comparisons in the best case:
std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
while (p < e) {
auto it = std::search(p, e, searcher);
// ...
}
5.8 The Interview Questions They’ll Ask
- “What is memory mapping and when would you use it?”
- Maps file directly into virtual address space
- Use for: random access, large files, shared memory between processes
- Avoid for: small files, sequential-only access
- “What happens when you access a memory-mapped page that isn’t in RAM?”
- Page fault: CPU traps to OS
- OS reads page from disk into physical memory
- OS updates page tables
- CPU retries the instruction
- “How does mmap compare to read() for performance?”
- mmap: no copy from kernel buffer, good for random access
- read: explicit control, can be faster for sequential
- For large files with random access, mmap wins
- “What are the pitfalls of memory mapping?”
- 32-bit systems: limited address space (4GB)
- File changes: if file truncated, access causes SIGBUS
- Error handling: page faults can fail (SIGBUS)
- Thread safety: need synchronization for MAP_SHARED writes
- “How would you handle a 1TB file on a machine with 16GB RAM?”
- 64-bit required (enough virtual address space)
- mmap the whole file (OS pages in/out as needed)
- Or: map in chunks (windows), sliding window approach
- The OS LRU cache handles which pages stay in RAM
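The sliding-window approach has one wrinkle: mmap requires the file offset to be a multiple of the page size, and MapViewOfFile requires a multiple of the system allocation granularity. A sketch of the alignment arithmetic, with hypothetical names and the assumption that the granularity is a power of two:

```cpp
#include <cstdint>

// Hypothetical description of one mapping window.
struct Window {
    std::uint64_t map_offset;  // Aligned offset to hand to the mapping call.
    std::uint64_t map_length;  // Bytes to map, starting at map_offset.
    std::uint64_t skip;        // Bytes between map_offset and the wanted offset.
};

// Hypothetical helper. Assumes `granularity` is a power of two and that
// [offset, offset + length) lies inside the file.
constexpr Window aligned_window(std::uint64_t offset, std::uint64_t length,
                                std::uint64_t granularity) {
    const std::uint64_t map_offset = offset & ~(granularity - 1);  // Round down.
    const std::uint64_t skip = offset - map_offset;
    return Window{map_offset, length + skip, skip};
}
```

After mapping map_length bytes at map_offset, the requested data begins skip bytes into the returned pointer.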
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory Mapping | The Linux Programming Interface | Chapter 49 |
| Virtual Memory | Computer Systems: A Programmer’s Perspective | Chapter 9 |
| File I/O | C++ Primer Plus | Chapter 17 |
| Platform Abstraction | C++ Primer Plus | Chapter 9 |
| System Calls | Advanced Programming in the UNIX Environment | Chapters 4, 14 |
5.10 Implementation Phases
Phase 1: Basic POSIX Mapping (Day 1-2)
- Implement open() with mmap on Linux/macOS
- Implement close() with munmap
- Test: map a small file, print first 100 bytes
Phase 2: Iteration and Counting (Day 3-4)
- Implement begin(), end()
- Implement countLines()
- Implement countWords()
- Test: count lines in a log file, compare with wc -l
Phase 3: Searching (Day 5)
- Implement search() with std::search
- Return byte offsets of matches
- Test: search for pattern in log file
Phase 4: Windows Support (Day 6-7)
- Add #ifdef blocks for Windows
- Implement with CreateFileMapping/MapViewOfFile
- Test: verify same behavior on Windows
Phase 5: REPL and Polish (Day 8-10)
- Add interactive command loop
- Add timing for operations
- Add view command for byte ranges
- Handle edge cases (empty files, binary files)
5.11 Key Implementation Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Mapping flags | MAP_SHARED vs MAP_PRIVATE | MAP_PRIVATE (no file modification) |
| Protection | PROT_READ vs PROT_READ|PROT_WRITE | PROT_READ (read-only) |
| 32-bit support | Map whole file vs chunks | Chunks if supporting 32-bit |
| Search algorithm | std::search vs custom | std::search (or boyer_moore) |
| Error handling | Exceptions vs error codes | Error codes (system programming style) |
6. Testing Strategy
6.1 Unit Tests
void testSmallFile() {
// Create temp file
std::ofstream f("test.txt", std::ios::binary);  // binary mode keeps "\n" as one byte on Windows
f << "Line 1\nLine 2\nLine 3\n";
f.close();
MemoryMappedFile mmf;
assert(mmf.open("test.txt"));
assert(mmf.size() == 21);
assert(mmf.countLines() == 3);
assert(mmf.countWords() == 6);
auto matches = mmf.search("Line");
assert(matches.size() == 3);
std::remove("test.txt");
}
void testLargeFile() {
// Create 1GB file
// ... (use dd or generate programmatically)
MemoryMappedFile mmf;
assert(mmf.open("large.bin"));
auto start = std::chrono::high_resolution_clock::now();
size_t lines = mmf.countLines();
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Counted " << lines << " lines in " << duration.count() << "ms\n";
assert(duration.count() < 2000); // Under 2 seconds
}
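For the "generate programmatically" option, a file with a known line count makes the large-file test checkable rather than just timed. A minimal sketch; write_test_file and the synthetic line text are hypothetical:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes `line_count` identical log-style lines and
// returns the number of bytes written, or 0 on failure.
std::size_t write_test_file(const std::string& path, std::size_t line_count) {
    // Binary mode so that '\n' stays a single byte on every platform.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return 0;
    const std::string line =
        "[2024-01-15 10:23:45] INFO: synthetic line for testing\n";
    for (std::size_t i = 0; i < line_count; ++i) {
        out.write(line.data(), static_cast<std::streamsize>(line.size()));
    }
    out.flush();
    return out ? line.size() * line_count : 0;
}
```

Because the line count is chosen by the caller, countLines() can be asserted against it exactly; about 20 million such lines gives roughly 1GB.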
6.2 Edge Cases to Test
| Input | Expected |
|---|---|
| Empty file (0 bytes) | size=0, lines=0, words=0 |
| File with no newlines | lines=0 (or 1 if counting partial) |
| Binary file | Should work (may have nulls) |
| Very long lines | Should work |
| Non-existent file | open() returns false |
| Permission denied | open() returns false |
| File in use | Should still work (read sharing) |
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Segfault | Crash on access | Pointer out of bounds | Check p < end() |
| SIGBUS | Crash during access | File truncated | Catch signal or check size |
| Memory leak | RSS grows | munmap not called | Use RAII destructor |
| Wrong size | Incorrect count | 32-bit overflow | Use size_t, not int |
| Slow performance | Too slow | Debug build | Compile with -O2 |
| Windows fails | CreateFileMapping fails | Wrong flags | Check GENERIC_READ, OPEN_EXISTING |
Debugging Tips
- Check return values: mmap returns MAP_FAILED (not nullptr)
- Print errno: Use strerror(errno) for POSIX errors
- Monitor RSS: ps aux | grep mmap_reader to check memory
- Test small first: Use tiny files before large ones
- Use strace/dtrace: Trace system calls to see what’s happening
8. Extensions & Challenges
| Extension | Difficulty | Concepts Learned |
|---|---|---|
| Write support: modify mapped memory | Medium | MAP_SHARED, msync |
| Parallel line counting (threads) | Medium | Cache line alignment, reduction |
| madvise hints for sequential access | Easy | OS optimization hints |
| Memory-mapped database index | Hard | B-tree on mapped memory |
| Large file on 32-bit system | Hard | Sliding window mapping |
| Compare with read() performance | Easy | Benchmarking |
9. Real-World Connections
How Production Systems Use Memory Mapping
| System | Usage |
|---|---|
| SQLite | Maps database file for efficient random access |
| PostgreSQL | Uses mmap for some file operations |
| MongoDB | Memory-mapped data files in the legacy MMAPv1 engine (removed in MongoDB 4.2) |
| Linux kernel | Maps executables and shared libraries |
| Chrome | Maps downloaded files for scanning |
| Git | Maps pack files for object access |
When NOT to Use mmap
- Very small files (overhead not worth it)
- Purely sequential reads (read() with large buffer can be faster)
- Files that change during access (SIGBUS danger)
- When you need fine-grained error handling
10. Resources
Primary References
- Kerrisk, M. “The Linux Programming Interface” - Chapter 49 (Memory Mappings)
- Bryant & O’Hallaron “Computer Systems: A Programmer’s Perspective” - Chapter 9
- Stevens & Rago “Advanced Programming in the UNIX Environment” - Chapter 14
11. Self-Assessment Checklist
Before considering this project complete, verify:
- Can map files of any size (tested with 1GB+)
- Line count matches wc -l
- Word count matches wc -w
- Search finds all occurrences correctly
- view command shows correct byte range
- Works on Linux or macOS (POSIX)
- Works on Windows (if targeting both)
- Resources properly cleaned up (no leaks)
- Handles missing files gracefully
- Performance meets non-functional requirements
12. Submission / Completion Criteria
This project is complete when:
- All functional requirements (F1-F10) are implemented
- All example operations from section 3.4 work correctly
- 1GB file is processed in under 3 seconds for any operation
- Memory usage stays under 100MB regardless of file size
- Cross-platform support works on at least 2 platforms
- RAII ensures no resource leaks
- You can explain when mmap is better than read()
- You understand what happens during a page fault
Deliverables:
- Source code with platform-specific sections clearly marked
- README with build instructions for each platform
- Performance benchmarks comparing small vs large files
- Brief writeup explaining your design decisions
This project bridges user-space programming and OS internals. Memory mapping is how operating systems efficiently share data between processes and files. Understanding it prepares you for databases, browsers, and any performance-critical file processing.