SOFTWARE DEFINED STORAGE DEEP DIVE
Learn Software Defined Storage (SDS): From Zero to Storage Architect
Goal: Deeply understand the internal machinery of Software Defined Storage—specifically how systems like Ceph and GlusterFS eliminate metadata bottlenecks, manage massive data placement without central tables (CRUSH/Elastic Hashing), and achieve autonomous self-healing. By building these components from first principles, you will master the art of architecting resilient, exabyte-scale storage clusters.
Why SDS Matters
In the traditional world, storage was a “black box”—proprietary hardware from vendors like EMC or NetApp. If you needed more space, you bought another expensive box. Software Defined Storage (SDS) shattered this model by moving the “intelligence” from specialized hardware to open-source software running on commodity servers.
Real-World Impact:
- Exabyte Scale: Platforms like CERN use Ceph to store hundreds of petabytes of physics data.
- Cloud Foundation: OpenStack, Kubernetes, and Proxmox rely on SDS for persistent volume management.
- Cost: Moving from SAN/NAS hardware to commodity SDS can reduce storage costs by 70-80%.
Understanding SDS means understanding how to manage “Storage as a Network Service.” It is the bridge between raw disks and the distributed applications that drive the modern web.
Core Concept Analysis
1. The Death of the Central Metadata Server
Traditional distributed filesystems (like HDFS) use a “NameNode”—a central server that knows where every block lives. This is a massive bottleneck. SDS systems like Ceph and Gluster use Algorithms instead of Lookup Tables.
Traditional (HDFS):
Client -> "Where is File A?" -> Metadata Server
Metadata Server -> "It's on Server 5" -> Client
Client -> Server 5 (Fetch)
SDS (Ceph/Gluster):
Client -> Hash(FileName) + Topology Map -> Local Computation
Local Computation -> "It's on Server 5"
Client -> Server 5 (Fetch)
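The whole trick fits in a few lines. Below is a minimal Go sketch of the idea (illustrative names, plain modulo hashing as a stand-in for CRUSH or consistent hashing): every client that holds the same cluster map computes the same answer, with no metadata server in the loop.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ClusterMap is the only shared state a client needs to locate data.
type ClusterMap struct {
	Epoch int      // bumped whenever membership changes
	Nodes []string // e.g. storage servers
}

// Locate deterministically maps an object name to a node: no lookup table,
// no metadata round-trip, just a hash plus the map.
func Locate(m ClusterMap, object string) string {
	h := fnv.New64a()
	h.Write([]byte(object))
	return m.Nodes[h.Sum64()%uint64(len(m.Nodes))]
}

func main() {
	m := ClusterMap{Epoch: 1, Nodes: []string{"server1", "server2", "server3", "server4", "server5"}}
	// Every client holding Epoch 1 of the map computes the same answer.
	fmt.Println("File A lives on", Locate(m, "File A"))
}
```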
2. CRUSH Maps & Failure Domains
CRUSH (Controlled Replication Under Scalable Hashing) is the brain of Ceph. It treats the storage cluster as a hierarchy (Data Center > Room > Rack > Host > Disk).
CRUSH Hierarchy:
[Root]
/ \
[Rack 1] [Rack 2]
/ \ / \
[H1] [H2][H3] [H4]
| | | |
[D1] [D2][D3] [D4]
Why it matters: CRUSH rules ensure that if you want 3 replicas, it places them in different racks. If Rack 1 loses power, the data is still safe in Rack 2.
3. Placement Groups (PGs)
Managing 10 billion objects individually is impossible. Ceph groups objects into PGs.
[Object A] \
[Object B] -- [PG 1.1] -> [OSD 5, OSD 12, OSD 42]
[Object C] /
[Object D] \
[Object E] -- [PG 1.2] -> [OSD 7, OSD 2, OSD 19]
[Object F] /
Analogy: If you have 1,000,000 letters to deliver, you don’t track every letter. You track 10,000 mailbags. If a mail truck (OSD) breaks down, you just re-route the bags.
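A minimal Go sketch of this indirection, with illustrative PG counts and OSD names: objects hash into a fixed number of PGs, and only the small PG-to-OSD table has to be maintained (here it is filled with placeholder assignments; a real system would compute it with CRUSH).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPGs = 128 // fixed and small -- this is the whole point

// pgOf hashes an object name into one of the placement groups.
func pgOf(object string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(object))
	return h.Sum64() % numPGs
}

func main() {
	// The only table the cluster maintains: one entry per PG, not per object.
	actingSet := make(map[uint64][]string, numPGs)
	for pg := uint64(0); pg < numPGs; pg++ {
		// Placeholder assignment; a real system computes this with CRUSH.
		actingSet[pg] = []string{
			fmt.Sprintf("OSD_%d", pg%10),
			fmt.Sprintf("OSD_%d", (pg+3)%10),
			fmt.Sprintf("OSD_%d", (pg+7)%10),
		}
	}
	pg := pgOf("holiday_photo.jpg")
	fmt.Printf("holiday_photo.jpg -> PG %d -> %v\n", pg, actingSet[pg])
}
```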
4. Self-Healing (The Peering Dance)
When a disk fails, the remaining disks in the affected PGs “gossip” (Peer) to determine what data is missing and where to re-replicate it.
1. OSD 5 Fails.
2. PGs 1.1, 1.5, and 2.3 are now "Degraded" (only 2/3 replicas).
3. Surviving OSDs (12 & 42) notify the cluster.
4. CRUSH calculates new home: OSD 99.
5. OSD 12 copies data to OSD 99.
6. Cluster is "Clean" again.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Algorithmic Placement | Using Hashing (EHA) or CRUSH instead of central databases to locate data. |
| Failure Domains | Designing the system to survive the loss of a disk, host, rack, or entire DC. |
| Placement Groups | Sharding objects into manageable logical units for rebalancing and recovery. |
| Consistent Hashing | Minimizing data movement when nodes join or leave the cluster. |
| Replication/Erasure Coding | The math of durability vs. storage efficiency. |
| Peering & Recovery | How nodes autonomously agree on state and repair themselves without human help. |
Deep Dive Reading by Concept
Storage Theory & Basics
| Concept | Book & Chapter |
|---|---|
| Distributed Systems Fundamentals | Distributed Systems: Concepts and Design by Coulouris — Ch. 1: “Characterization of Distributed Systems” |
| Storage Hierarchy | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 6: “The Memory Hierarchy” |
SDS Specifics (Ceph & Gluster)
| Concept | Book & Chapter |
|---|---|
| CRUSH Algorithm | Mastering Ceph by Nick Fisk — Ch. 5: “CRUSH Maps and Data Placement” |
| Placement Groups | Learning Ceph by Karan Singh — Ch. 2: “Ceph Internal Architecture” |
| Gluster Translators | GlusterFS Architecture Guide (Official Docs) — “Architecture and Translators” |
| Self-Healing Logic | Ceph: Distributed Storage for Linux and Cloud by Anthony D’Atri — Ch. 6: “OSD Recovery and Rebalancing” |
Essential Reading Order
- The Math (Week 1):
- Read the original CRUSH Paper (Sage Weil, 2006). It’s the foundation of modern SDS.
- The Layout (Week 1):
- Mastering Ceph Ch. 5. Understand how hierarchy affects durability.
- The Recovery (Week 2):
- Learning Ceph Ch. 2. Focus on how monitors (ceph-mon) maintain the cluster map.
Project 1: The Hash-Based Object Router
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Consistent Hashing / Data Distribution
- Software or Tool: Static Node List
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that takes a filename and a list of storage nodes, and consistently returns the same node ID for that file without using a database.
Why it teaches SDS: This is the “Eureka” moment of SDS. You discover that you don’t need a central index to find files. You realize that Location = F(FileName, ClusterMap).
Core challenges you’ll face:
- Handling Node Changes -> Mapping files when a node is added (Consistent Hashing).
- Uniform Distribution -> Ensuring one node doesn’t get 90% of the data.
- Virtual Nodes -> Using replicas of nodes to smooth out hash distribution.
Key Concepts
- Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6 - Martin Kleppmann
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic knowledge of Hash functions (MD5/SHA1).
Real World Outcome
You’ll have a CLI tool that can simulate adding 10,000 files to a cluster and show how many files move when a node is added. Success is seeing that adding a node only affects ~1/N of the files, rather than triggering a total cluster reshuffle.
Example Output:
$ ./router route --file "holiday_photo.jpg" --nodes "nodeA,nodeB,nodeC"
File: holiday_photo.jpg -> Mapped to: nodeB
$ ./router simulate --files 10000 --nodes 3
Node Distribution:
nodeA: 3341 files
nodeB: 3320 files
nodeC: 3339 files
$ ./router simulate --files 10000 --nodes 3 --add-node nodeD
Total files moved: 2501 (Efficiency: 25.01%)
# Note: In a non-consistent hash, almost 100% of files would move!
The Core Question You’re Answering
“If the phone book (metadata server) disappears, how can I still find my friend’s number just by knowing their name?”
Before you write any code, sit with this question. Most developers are addicted to central databases. SDS is about the independence of data from a central coordinator. Can you calculate location using only math?
Concepts You Must Understand First
Stop and research these before coding:
- Hash Functions & Uniformity
- Why do we use SHA-256 or MD5 instead of just `len(filename) % nodes`?
- What happens if your hash function isn’t uniform?
- Book Reference: “Algorithms” by Sedgewick — Ch. 3 (Hash Tables)
- The “Modulo N” Problem
- If you have 3 nodes and add 1, why does `hash % 3` vs `hash % 4` cause 75% of data to move?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6.
Questions to Guide Your Design
Before implementing, think through these:
- The Ring
- How will you represent the “ring” of hash values (0 to 2^64-1)?
- How do you find the “closest” node on the ring to a file’s hash?
- Virtual Nodes
- If one node is a $10,000 beefy server and another is a $500 old PC, how do you give the big server more “shares” of the ring?
Thinking Exercise
The Reshuffle Trace
Imagine 3 nodes (A, B, C) and 4 files (1, 2, 3, 4).
- Assign them using `hash(file) % 3`.
- Add Node D.
- Re-assign using `hash(file) % 4`.
Questions while tracing:
- How many files stayed on their original node?
- Now try it with a “Ring” where nodes occupy specific points. If you add Node D between A and B, which files actually move?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is consistent hashing and why is it used in distributed storage?”
- “How do virtual nodes help with heterogeneity in a cluster?”
- “Compare consistent hashing to the ‘Elastic Hashing’ used in GlusterFS.”
- “What is the computational complexity of finding a node in a ring of size N?”
Hints in Layers
Hint 1: The Circle
Think of the hash space as a circle. Map your nodes to random points on that circle.
Hint 2: Successor
To find a file’s home, hash the file name to get a point on the circle. Walk clockwise until you hit the first node. That’s the owner.
Hint 3: Implementation
Use a sorted array of the nodes’ hash values. For any file hash, use binary search (sort.Search in Go) to find the first node value that is greater than or equal to the file hash.
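Putting the three hints together, here is a minimal Go sketch of the ring (virtual nodes omitted for brevity, FNV used as a stand-in hash):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring: node hashes sorted around a circle.
type Ring struct {
	points []uint64          // sorted node hashes
	owner  map[uint64]string // node hash -> node name
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func NewRing(nodes []string) *Ring {
	r := &Ring{owner: map[uint64]string{}}
	for _, n := range nodes {
		p := hash64(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Route finds the first node clockwise from the file's hash.
func (r *Ring) Route(file string) string {
	h := hash64(file)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"nodeA", "nodeB", "nodeC"})
	fmt.Println("holiday_photo.jpg ->", ring.Route("holiday_photo.jpg"))
}
```

Adding a node only inserts one more point into the sorted slice, which is why only the files between that point and its predecessor change owners.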
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Hashing | “Algorithms” by Sedgewick | Ch. 3 |
| Partitioning | “Designing Data-Intensive Applications” | Ch. 6 |
Project 2: Failure-Domain Aware CRUSH Engine
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Placement Algorithms
- Software or Tool: Hierarchical Maps
- Main Book: “Mastering Ceph” by Nick Fisk
What you’ll build: A simplified CRUSH algorithm implementation that places data replicas across different “Buckets” (Racks, Hosts).
Why it teaches SDS: It forces you to think about physical reality. You’ll learn that a distributed system is only as good as its failure domain logic. You’ll move from “put this on a random server” to “put this on a server that doesn’t share a power strip with the first one.”
Core challenges you’ll face:
- Tree Traversal -> Navigating the cluster hierarchy to find leaves (disks).
- Determinism -> Ensuring the same input always yields the same replica set.
- Avoidance Logic -> If a host is already used for Replica 1, the algorithm must skip that host for Replica 2.
Key Concepts
- CRUSH Algorithm: “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data” (Paper) - Sage Weil
Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Recursion, Tree Data Structures.
Real World Outcome
A simulation where you define a complex data center and verify that data is never stored twice in the same rack. You will prove that your system is “Rack-Aware.”
Example Output:
$ ./crush map --cluster dc.json --replicas 3 --object "db_backup_01"
Placement for 'db_backup_01':
Replica 1: DC_North -> Rack_01 -> Host_A -> Disk_0
Replica 2: DC_North -> Rack_02 -> Host_C -> Disk_1
Replica 3: DC_South -> Rack_05 -> Host_X -> Disk_4
# SUCCESS: All replicas are in different Racks!
The Core Question You’re Answering
“If I have 10,000 disks and 1,000 racks, how can I guarantee my data survives a rack-level power failure without keeping a massive spreadsheet of where every file is?”
This project is about Deterministic Hierarchy. You aren’t picking a random disk; you are picking a random path through a tree that obeys strict rules.
Concepts You Must Understand First
Stop and research these before coding:
- Failure Domains
- What is a “Correlation of Failure”?
- Why is a host a failure domain? Why is a rack a failure domain?
- Book Reference: “Mastering Ceph” Ch. 5.
- Recursive Descent
- How do you “take” a branch and continue the search for a leaf?
- Book Reference: “The Algorithm Design Manual” by Skiena.
Questions to Guide Your Design
Before implementing, think through these:
- Weights
- If one rack has 10 disks and another has 20, how do you ensure the 20-disk rack gets twice as many PGs? (Hint: the straw/straw2 bucket algorithm.)
- The “Take” Step
- How do you implement the “take root -> select 3 racks -> select 1 host per rack -> select 1 disk per host” logic?
Thinking Exercise
The Rack-Failure Test
Sketch a cluster with 2 Racks. Each rack has 2 Hosts. Each host has 1 Disk. You want 3 replicas.
- If you use simple hashing, what is the probability that 2 replicas end up in the same rack?
- If you use CRUSH rules, how do you prevent the 3rd replica from being placed at all if there are only 2 racks? (Safety over availability).
The Interview Questions They’ll Ask
Prepare to answer these:
- “Explain the difference between a CRUSH Map and a standard Hash Map.”
- “How does CRUSH handle the addition of a new rack to an existing cluster?”
- “What is a ‘straw2’ bucket and why is it the default in modern Ceph?”
Hints in Layers
Hint 1: The Hierarchy
Build a JSON tree: Root -> [Racks] -> [Hosts] -> [Disks]. Each level should have a “weight.”
Hint 2: Selection
For each replica, start at the root. Use a hash of (object_name, replica_number, bucket_id) to pick a child bucket.
Hint 3: Retries
If the chosen child leads to a leaf you’ve already picked for this object, increment a “retry” counter in the hash and pick again. This is “collision resolution” in CRUSH.
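A compact Go sketch of that select-with-retry descent, using FNV hashing and a hard-coded hierarchy as stand-ins for straw2 and a real CRUSH map; if no unused rack can be found, it simply places fewer replicas (safety over availability):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bucket is one node in the hierarchy: root, rack, host, or (leaf) disk.
type Bucket struct {
	Name     string
	Children []*Bucket // empty => leaf
}

// pick deterministically chooses a child from (object, replica, attempt).
func pick(children []*Bucket, object string, replica, attempt int) *Bucket {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d/%d", object, replica, attempt)
	return children[h.Sum64()%uint64(len(children))]
}

// place returns one "rack/host/disk" path per replica, never reusing a rack.
func place(root *Bucket, object string, replicas int) []string {
	usedRacks := map[string]bool{}
	var out []string
	for r := 0; r < replicas; r++ {
		for attempt := 0; attempt < 50; attempt++ {
			rack := pick(root.Children, object, r, attempt)
			if usedRacks[rack.Name] {
				continue // collision: bump the attempt counter and re-hash
			}
			host := pick(rack.Children, object, r, attempt)
			disk := pick(host.Children, object, r, attempt)
			usedRacks[rack.Name] = true
			out = append(out, rack.Name+"/"+host.Name+"/"+disk.Name)
			break
		}
		// If every attempt hit a used rack, we place fewer replicas rather
		// than double up inside one failure domain.
	}
	return out
}

func main() {
	disk := func(n string) *Bucket { return &Bucket{Name: n} }
	root := &Bucket{Name: "root", Children: []*Bucket{
		{Name: "rack1", Children: []*Bucket{
			{Name: "hostA", Children: []*Bucket{disk("disk0"), disk("disk1")}},
			{Name: "hostB", Children: []*Bucket{disk("disk0")}},
		}},
		{Name: "rack2", Children: []*Bucket{
			{Name: "hostC", Children: []*Bucket{disk("disk0"), disk("disk1")}},
		}},
		{Name: "rack3", Children: []*Bucket{
			{Name: "hostD", Children: []*Bucket{disk("disk0")}},
		}},
	}}
	fmt.Println(place(root, "db_backup_01", 3)) // three replicas, three different racks
}
```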
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| CRUSH | “Mastering Ceph” | Ch. 5 |
| Tree Algorithms | “The Algorithm Design Manual” | Ch. 3 |
Project 3: The Placement Group (PG) Manager
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Sharding & Logical Grouping
- Software or Tool: RADOS-like architecture
- Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A system that shards millions of objects into a fixed number of “Placement Groups” (PGs) and maps those PGs to specific storage nodes.
Why it teaches SDS: You’ll understand why tracking every object is a scaling nightmare and how “logical indirection” (Object -> PG -> Node) allows the cluster to rebalance 100GB of data by just changing one line in a map.
Core challenges you’ll face:
- PG Splitting/Merging -> What happens when you add 1,000 more disks?
- State Management -> Each PG has a state (Clean, Degraded, Peering).
- Mapping Logic -> `PG_ID = hash(object_name) % num_pgs`.
Key Concepts
- Logical Sharding: “Learning Ceph” Ch. 2.
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Hash functions, modular arithmetic.
Real World Outcome
A dashboard-like CLI showing the health of your logical cluster. You’ll be able to “down” an OSD and see exactly which PGs are affected and need repair.
Example Output:
$ ./pg-manager status
Pool: 'default'
Total Objects: 1,500,000
Total PGs: 128
PG Distribution:
PG [0..31]: 12 disks (Healthy)
PG [32..63]: 12 disks (Healthy)
PG [64..95]: 11 disks (Degraded - OSD_5 is down)
PG [96..127]: 12 disks (Healthy)
$ ./pg-manager inspect-pg 65
Objects in PG 65:
- user_avatar_99.png
- cat_video_final.mp4
- ...
Current Acting Set: [OSD_2, OSD_8, OSD_21]
The Core Question You’re Answering
“How do I move 10TB of data without moving 10TB of metadata?”
By grouping objects into PGs, you only have to track the mapping of ~10,000 PGs to OSDs, rather than 1,000,000,000 objects. This is the difference between a system that scales and one that dies.
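A minimal sketch of the payoff, with illustrative PG IDs and OSD names: when an OSD dies, you scan a table of a few thousand acting sets instead of a billion object records.

```go
package main

import "fmt"

// PG pairs a placement group ID with the OSDs currently serving it.
type PG struct {
	ID        int
	ActingSet []string
}

// degradedBy returns the IDs of PGs that lost a replica when osd went down.
func degradedBy(pgs []PG, osd string) []int {
	var hit []int
	for _, pg := range pgs {
		for _, member := range pg.ActingSet {
			if member == osd {
				hit = append(hit, pg.ID)
				break
			}
		}
	}
	return hit
}

func main() {
	pgs := []PG{
		{ID: 64, ActingSet: []string{"OSD_5", "OSD_8", "OSD_21"}},
		{ID: 65, ActingSet: []string{"OSD_2", "OSD_8", "OSD_21"}},
		{ID: 66, ActingSet: []string{"OSD_5", "OSD_9", "OSD_30"}},
	}
	fmt.Println("PGs degraded by losing OSD_5:", degradedBy(pgs, "OSD_5"))
}
```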
Concepts You Must Understand First
Stop and research these before coding:
- Sharding vs. Replication
- How does a PG differ from a simple shard? (Hint: a PG is replicated).
- Book Reference: “Learning Ceph” Ch. 2.
- Modular Arithmetic in Distributed Systems
- `pg = hash(obj) % num_pgs`. Why must `num_pgs` be a power of 2?
Thinking Exercise
The Migration Paradox
If you add 1 OSD to a 10-OSD cluster:
- How many PGs should move to the new OSD?
- If you didn’t have PGs and just mapped objects to OSDs, what would be the impact on your metadata database during this rebalance?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Ceph use Placement Groups instead of direct object-to-OSD mapping?”
- “What happens if you have too few PGs in a cluster? What about too many?”
- “Explain the relationship between PGs, Pools, and OSDs.”
Project 4: The SDS Monitor (OSDMap Keeper)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cluster Membership & Consensus
- Software or Tool: HTTP/JSON Map
- Main Book: “Ceph: Distributed Storage for Linux and Cloud” by Anthony D’Atri
What you’ll build: A central service (Monitor) that tracks which storage nodes (OSDs) are “Up” or “Down” and broadcasts a versioned “Cluster Map” to all clients.
Why it teaches SDS: In SDS, the “Truth” is the Map. You’ll learn how to handle the “Flapping OSD” problem (a disk that turns on and off quickly) and how versioning maps prevents clients from writing to dead nodes.
Core challenges you’ll face:
- Health Checks (Heartbeats) -> Detecting a dead node in < 2 seconds.
- Map Versioning (Epochs) -> Ensuring all clients are using the same version of the truth.
- Quorum (Conceptual) -> Why one monitor isn’t enough (Paxos/Raft).
Key Concepts
- Cluster Maps: “Ceph: Distributed Storage for Linux and Cloud” Ch. 3.
Difficulty: Intermediate Time estimate: 1 week Prerequisites: HTTP API basics, JSON serialization.
Real World Outcome
A system where you can kill a storage process and see the “Client” automatically get a new map update and stop sending data to that node. This is “Client-Side Intelligence.”
Example Output:
$ ./monitor start
Monitor started on port 6789 (Epoch: 1)
# In another terminal, an OSD joins:
$ ./osd start --id 5 --addr 192.168.1.50
OSD 5 reported UP. Monitor Epoch now: 2.
$ ./monitor get-map
{
"epoch": 2,
"osds": [
{"id": 5, "state": "UP", "addr": "192.168.1.50"}
]
}
The Core Question You’re Answering
“Who is allowed to declare a node dead, and how do we ensure everyone agrees on who is alive?”
This project answers the “Consensus” problem. If Client A thinks OSD 1 is alive but Client B thinks it’s dead, the system will diverge. The Monitor is the “Court of Record.”
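A single-monitor sketch of the epoch logic, skipping the network layer and using illustrative timeouts: every membership change bumps the epoch, and an OSD that misses its heartbeat window flips to DOWN.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type OSDState struct {
	Addr     string
	Up       bool
	LastSeen time.Time
}

// Monitor keeps the versioned cluster map. Real deployments need a quorum
// of monitors agreeing via Paxos/Raft; this is the single-node core idea.
type Monitor struct {
	mu    sync.Mutex
	Epoch int
	OSDs  map[int]*OSDState
}

// Heartbeat records an OSD check-in and bumps the epoch if its state changed.
func (m *Monitor) Heartbeat(id int, addr string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	osd, ok := m.OSDs[id]
	if !ok || !osd.Up {
		m.Epoch++ // membership changed: publish a new map version
	}
	m.OSDs[id] = &OSDState{Addr: addr, Up: true, LastSeen: time.Now()}
}

// MarkStale flips OSDs to DOWN if they missed heartbeats for too long.
func (m *Monitor) MarkStale(timeout time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, osd := range m.OSDs {
		if osd.Up && time.Since(osd.LastSeen) > timeout {
			osd.Up = false
			m.Epoch++
			fmt.Printf("OSD %d marked DOWN (epoch %d)\n", id, m.Epoch)
		}
	}
}

func main() {
	mon := &Monitor{OSDs: map[int]*OSDState{}}
	mon.Heartbeat(5, "192.168.1.50:6801")
	time.Sleep(30 * time.Millisecond)
	mon.MarkStale(20 * time.Millisecond) // OSD 5 missed its heartbeat window
	fmt.Println("current epoch:", mon.Epoch)
}
```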
Project 5: The “Object Store” Daemon (OSD)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Local Storage & Network I/O
- Software or Tool: Filesystem (XFS/Ext4)
- Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A daemon that receives binary objects over the network and stores them on the local disk, ensuring data integrity with checksums.
Why it teaches SDS: You’ll understand the “Data Plane.” While the Monitor handles the “Control Plane,” the OSD is where the bytes actually touch the spinning rust (or silicon). You’ll learn about atomic writes and data corruption.
Core challenges you’ll face:
- Object IDs to File Paths -> Mapping a UUID to a folder structure that doesn’t hit filesystem limits.
- Checksumming -> Generating and verifying MD5/CRC32 for every read/write.
- Concurrency -> Handling 100 simultaneous client writes safely.
Key Concepts
- RADOS Object Store: “Learning Ceph” Ch. 2.
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: File I/O, Network programming.
Real World Outcome
A set of folders on your laptop that act as a “distributed” node. You can curl a file into the OSD, and find it stored as a specific object ID in a subfolder.
Example Output:
$ ./osd-daemon --id 1 --port 6801 --storage-path ./data/osd1
OSD 1 Listening...
# From a client:
$ ./client put --id "my-photo" --file ./vacation.jpg
SUCCESS: Object my-photo stored on OSD 1.
$ ls -R ./data/osd1
./data/osd1/objects/my/my-photo.bin
./data/osd1/objects/my/my-photo.meta (Contains checksum)
The Core Question You’re Answering
“How do I turn a ‘file’ into an ‘object’ and ensure it hasn’t changed since I wrote it?”
You’ll discover that at the bottom layer, everything is just a flat key-value store of bytes. You are building the “S” in “SDS.”
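A minimal Go sketch of the local store, mirroring the .bin/.meta layout from the example output (the paths and the prefix-sharding scheme are illustrative, and atomic write-then-rename is left out for brevity):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// put writes the object bytes plus a checksum sidecar file.
func put(root, id string, data []byte) error {
	dir := filepath.Join(root, "objects", id[:2]) // shard directories by ID prefix
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	if err := os.WriteFile(filepath.Join(dir, id+".bin"), data, 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, id+".meta"), []byte(hex.EncodeToString(sum[:])), 0o644)
}

// get reads the object back and refuses to return silently corrupted bytes.
func get(root, id string) ([]byte, error) {
	dir := filepath.Join(root, "objects", id[:2])
	data, err := os.ReadFile(filepath.Join(dir, id+".bin"))
	if err != nil {
		return nil, err
	}
	want, err := os.ReadFile(filepath.Join(dir, id+".meta"))
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != string(want) {
		return nil, fmt.Errorf("checksum mismatch: object %s is corrupt", id)
	}
	return data, nil
}

func main() {
	if err := put("./data/osd1", "my-photo", []byte("jpeg bytes...")); err != nil {
		panic(err)
	}
	data, err := get("./data/osd1", "my-photo")
	fmt.Println(string(data), err)
}
```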
Project 6: Synchronous Multi-Copy Replication
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Consistency
- Software or Tool: Multiple OSDs
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A replication engine where the “Primary” OSD receives a write, sends it to two “Replica” OSDs, and only returns “Success” to the user after all three have acknowledged.
Why it teaches SDS: This is the core of “Strong Consistency.” You’ll learn about the write-latency penalty and the “Split Brain” risk if nodes disagree on who is the primary.
Core challenges you’ll face:
- Primary/Replica Handshakes -> Managing a lightweight two-phase commit (2PC) between the primary and its replicas.
- Timeout Logic -> What if a replica is slow? Does the write fail or do you mark the OSD as down?
- Ordering -> Ensuring Write A happens before Write B on all 3 nodes.
Key Concepts
- Replication: “Designing Data-Intensive Applications” Ch. 5.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Networking, Concurrency, State Machines.
Real World Outcome
A “fail-proof” write. You write to the Primary, pull the plug on a Replica, and see the system report a successful write (on 2 nodes) but also a “Degraded” warning. This proves your cluster map is communicating node state correctly.
Example Output:
$ ./client put --object "secret_plan" --replicas 3
1. Sending to Primary (OSD 1)...
2. Primary -> Replicating to OSD 2... OK
3. Primary -> Replicating to OSD 3... FAILED (Timeout)
4. Primary -> Notifying Monitor: OSD 3 is lagging.
5. SUCCESS (Degraded): 2/3 replicas stored.
The Core Question You’re Answering
“How can I guarantee that data is written to multiple machines simultaneously without making the user wait forever?”
You’ll grapple with the Latency vs. Durability trade-off. If you have 10 replicas, a write is very safe but very slow. If you have 1, it’s fast but dangerous.
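A sketch of the fan-out-and-count logic, with replicas modeled as in-process functions rather than network peers; the timeout handling and the “degraded” outcome mirror the example output above.

```go
package main

import (
	"fmt"
	"time"
)

// Replica stands in for a network peer; it stores one object.
type Replica func(object string, data []byte) error

// replicatedWrite sends the object to every replica and counts acks.
// It returns the number of stored copies and whether the write is degraded.
func replicatedWrite(object string, data []byte, replicas []Replica, timeout time.Duration) (int, bool) {
	acks := 1 // the primary's own local write (assumed already durable)
	for i, rep := range replicas {
		done := make(chan error, 1)
		go func(r Replica) { done <- r(object, data) }(rep)
		select {
		case err := <-done:
			if err == nil {
				acks++
			} else {
				fmt.Printf("replica %d failed: %v\n", i+2, err)
			}
		case <-time.After(timeout):
			fmt.Printf("replica %d timed out -> report it to the monitor\n", i+2)
		}
	}
	return acks, acks < len(replicas)+1
}

func main() {
	ok := func(string, []byte) error { return nil }
	slow := func(string, []byte) error { time.Sleep(time.Second); return nil }
	acks, degraded := replicatedWrite("secret_plan", []byte("..."), []Replica{ok, slow}, 100*time.Millisecond)
	fmt.Printf("SUCCESS: %d/3 copies stored (degraded=%v)\n", acks, degraded)
}
```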
Project 7: The Self-Healing Engine (Rebalancing)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Automation & Data Recovery
- Software or Tool: PG State Machine
- Main Book: “Ceph: Distributed Storage for Linux and Cloud” by Anthony D’Atri
What you’ll build: A “Manager” module that detects when a PG is missing a replica and automatically instructs the remaining nodes to copy data to a new spare node.
Why it teaches SDS: This is the “Magic” of Ceph. You’ll understand how a system can be “autonomic.” You’ll learn the difference between “Replication” (keeping copies) and “Recovery” (making new copies when old ones die).
Core challenges you’ll face:
- Delta Calculation -> Only copying the objects that changed since the node went down.
- Priority Queuing -> Ensuring user traffic (Reads/Writes) isn’t choked by recovery traffic.
- Backfilling -> Managing the bulk transfer of data to a brand-new node.
Key Concepts
- Recovery & Backfilling: “Mastering Ceph” Ch. 6.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Project 3 (PG Manager), Project 6 (Replication).
Real World Outcome
A “Resilient Cluster.” You kill an OSD, wait 30 seconds, and see the PG status go from Degraded to Recovering to Active+Clean as the data moves to a spare OSD. This is the moment you stop being a sysadmin and start being an architect.
Example Output:
# Initial State
$ ./manager status
Health: HEALTH_OK
# Simulate Failure
$ kill -9 <OSD_3_PID>
Health: HEALTH_WARN (1 PG degraded)
# 10 seconds later:
Health: HEALTH_RECOVERING (PG 1.5: 45% backfilled)
# 30 seconds later:
Health: HEALTH_OK (PG 1.5 moved to OSD_7)
The Core Question You’re Answering
“If a disk fails at 3 AM, can the software fix itself so I can keep sleeping?”
This project is about the Autonomy of distributed systems. A system that requires a human to fix every broken disk will never scale to 10,000 nodes.
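A toy sketch of the manager’s healing pass, with the actual data copy elided: it finds PGs that reference the dead OSD, swaps in a spare, and walks the state from Degraded through Recovering to Active+Clean. All names are illustrative.

```go
package main

import "fmt"

// PG holds a placement group's acting set and its health state.
type PG struct {
	ID     string
	Acting []string
	State  string // "Active+Clean", "Degraded", "Recovering"
}

// heal swaps the dead OSD for a spare in every affected PG and walks the
// state machine; the actual object copy (backfill) is elided.
func heal(pgs []*PG, dead, spare string) {
	for _, pg := range pgs {
		for i, osd := range pg.Acting {
			if osd != dead {
				continue
			}
			pg.State = "Degraded"
			fmt.Printf("PG %s degraded (lost %s)\n", pg.ID, dead)
			pg.State = "Recovering"
			pg.Acting[i] = spare // surviving replicas backfill the spare
			pg.State = "Active+Clean"
			fmt.Printf("PG %s recovered: %v\n", pg.ID, pg.Acting)
		}
	}
}

func main() {
	pgs := []*PG{
		{ID: "1.5", Acting: []string{"OSD_3", "OSD_9", "OSD_11"}, State: "Active+Clean"},
		{ID: "2.1", Acting: []string{"OSD_1", "OSD_2", "OSD_4"}, State: "Active+Clean"},
	}
	heal(pgs, "OSD_3", "OSD_7") // kill OSD_3, backfill onto spare OSD_7
}
```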
Project 8: Erasure Coding (EC) Translator
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Information Theory & Math
- Software or Tool: Reed-Solomon Libraries
- Main Book: “The Secret Life of Programs” by Jonathan Steinhart
What you’ll build: A module that takes an object, splits it into $K$ data chunks and $M$ parity chunks, and distributes them such that you can lose any $M$ nodes and still recover the file.
Why it teaches SDS: Replication is expensive (3x overhead). Erasure Coding is how big clouds (S3, Azure) store exabytes efficiently. You’ll learn the math behind “RAID for the network.”
Core challenges you’ll face:
- Matrix Math -> Implementing (or using) Reed-Solomon encoding.
- Chunk Alignment -> Handling files of different sizes.
- Reconstruction -> The logic to rebuild a missing chunk from the remaining ones.
Key Concepts
- Erasure Coding: “Designing Data-Intensive Applications” Ch. 5 (section on EC vs Replication).
Difficulty: Master
Time estimate: 2-3 weeks
Prerequisites: Bitwise operations, Linear Algebra basics.
Real World Outcome
A tool that can delete 2 out of 5 chunks of a file and still perfectly recreate the original file. You will grasp how $k = 3$ data chunks plus $m = 2$ parity chunks survive the loss of any 2 chunks while consuming only about 1.67x the original space (5/3), instead of the 3x cost of replication.
Example Output:
$ ./ec-tool encode --file my_doc.pdf --k 3 --m 2
Created 5 chunks: chunk.0, chunk.1, chunk.2 (Data) + chunk.p1, chunk.p2 (Parity)
$ rm chunk.1 chunk.p1
$ ./ec-tool decode --output recovered.pdf
Status: Success. (2 chunks missing, recovered using Reed-Solomon)
The Core Question You’re Answering
“How can I store 100GB of data safely using only 130GB of disk space, instead of the 300GB required by 3x replication?”
This is the Efficiency problem. Large scale storage is a business of margins. Erasure Coding makes petabyte-scale storage economically viable.
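A deliberately simplified sketch to make the idea concrete: single-parity XOR coding (k data chunks + 1 parity), which survives the loss of any one chunk. Real erasure coding uses Reed-Solomon to tolerate two or more losses, but the reconstruction idea (the missing chunk is a combination of the survivors) is the same.

```go
package main

import "fmt"

// encode splits data into k equal (zero-padded) chunks plus one XOR parity chunk.
func encode(data []byte, k int) [][]byte {
	size := (len(data) + k - 1) / k
	padded := make([]byte, size*k)
	copy(padded, data)
	chunks := make([][]byte, k+1)
	for i := 0; i < k; i++ {
		chunks[i] = padded[i*size : (i+1)*size]
	}
	parity := make([]byte, size)
	for _, c := range chunks[:k] {
		for j := range parity {
			parity[j] ^= c[j]
		}
	}
	chunks[k] = parity
	return chunks
}

// reconstruct rebuilds the single missing chunk by XOR-ing all survivors.
func reconstruct(chunks [][]byte, lost int) []byte {
	size := len(chunks[(lost+1)%len(chunks)])
	out := make([]byte, size)
	for i, c := range chunks {
		if i == lost {
			continue
		}
		for j := range out {
			out[j] ^= c[j]
		}
	}
	return out
}

func main() {
	chunks := encode([]byte("hello, erasure coded world"), 3)
	original := append([]byte(nil), chunks[1]...)
	chunks[1] = nil // simulate losing one chunk (one dead OSD)
	rebuilt := reconstruct(chunks, 1)
	fmt.Println("recovered chunk matches original:", string(rebuilt) == string(original))
}
```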
Project 9: Deep Scrubbing & Inconsistency Resolver
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Rust
-
Coolness Level: Level 4: Hardcore Tech Flex
-
Business Potential: 3. The “Service & Support” Model
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Data Integrity
-
Software or Tool: Background Workers
-
Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A background worker that periodically compares the bit-for-bit content of all replicas of a PG and identifies “silent data corruption” (Bit rot).
Why it teaches SDS: You’ll learn that disks lie. Even if a disk says a read was successful, the bytes might be wrong. You’ll learn how to resolve “Conflicts” when 2 replicas say “A” and 1 says “B”.
Core challenges you’ll face:
- Resource Throttling -> Running a deep scan without slowing down the user.
- Majority Voting -> How to decide which replica is “right.”
- Merkle Trees (Optional) -> Comparing large directories efficiently.
Key Concepts
- Data Scrubbing: “Learning Ceph” Ch. 5.
Difficulty: Advanced
Time estimate: 1 week
Prerequisites: Project 6 (Replication), Hashing.
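A minimal sketch of the majority-vote step, with replica contents supplied inline for illustration: hash every copy, let the most common digest win, and flag the dissenters for repair.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// scrub returns the winning digest and the indexes of replicas that disagree.
func scrub(replicas [][]byte) (string, []int) {
	votes := map[string]int{}
	digests := make([]string, len(replicas))
	for i, data := range replicas {
		sum := sha256.Sum256(data)
		digests[i] = hex.EncodeToString(sum[:])
		votes[digests[i]]++
	}
	winner, best := "", 0
	for d, n := range votes {
		if n > best {
			winner, best = d, n
		}
	}
	var bad []int
	for i, d := range digests {
		if d != winner {
			bad = append(bad, i)
		}
	}
	return winner, bad
}

func main() {
	replicas := [][]byte{
		[]byte("the quick brown fox"),
		[]byte("the quick brown fox"),
		[]byte("the quick brown fo?"), // one bit-rotted copy
	}
	winner, bad := scrub(replicas)
	fmt.Printf("authoritative digest %s..., repair replicas %v\n", winner[:8], bad)
}
```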
Project 10: The SDS API Gateway (S3/Swift)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Python (FastAPI), Rust (Axum)
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 5. The “Industry Disruptor”
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Web Protocols & Auth
-
Software or Tool: HTTP Server / S3 API
-
Main Book: “Design and Build Great Web APIs” by Mike Amundsen
What you’ll build: A gateway that speaks the S3 protocol (PUT/GET buckets and objects) and translates those calls into your internal SDS commands.
Why it teaches SDS: This is the “Public Interface.” You’ll learn how to map an unstructured HTTP request to a structured distributed storage backend. You’ll handle large file uploads (Multipart) and authentication.
Core challenges you’ll face:
- Streaming Uploads -> Piping data from the HTTP request directly to OSDs without buffering it all in RAM.
- S3 Protocol Compatibility -> Correctly implementing XML/JSON responses that S3 clients expect.
- Bucket Metadata -> Where do you store the list of files in a bucket? (Hint: Use another object).
Key Concepts
- Object Gateways: “Learning Ceph” Ch. 4.
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: HTTP server basics, Project 5 (OSD).
Real World Outcome
You can use a standard S3 tool (like aws cli or rclone) to upload a file to your local “mini-S3” cluster.
Example Output:
$ export AWS_ACCESS_KEY_ID=test
$ export AWS_SECRET_ACCESS_KEY=test
$ aws s3 --endpoint-url http://localhost:8080 cp photo.jpg s3://mybucket/
upload: ./photo.jpg to s3://mybucket/photo.jpg
# Under the hood: Gateway -> Hash -> PG -> OSD 1,2,3
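A bare-bones sketch of the gateway’s PUT path, assuming a hypothetical storeObject backend in place of the real PG/OSD pipeline; auth, XML responses, and multipart uploads are omitted. The point is that the request body is streamed, never buffered whole in RAM.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// storeObject is a placeholder for: hash key -> PG -> primary OSD -> replicate.
func storeObject(bucket, key string, body io.Reader) (int64, error) {
	return io.Copy(io.Discard, body) // count bytes; a real backend writes to OSDs
}

func handlePut(w http.ResponseWriter, r *http.Request) {
	parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)
	if r.Method != http.MethodPut || len(parts) != 2 {
		http.Error(w, "expected PUT /<bucket>/<key>", http.StatusBadRequest)
		return
	}
	n, err := storeObject(parts[0], parts[1], r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Printf("stored %s/%s (%d bytes)\n", parts[0], parts[1], n)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handlePut)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

You can exercise it with `curl -X PUT --data-binary @photo.jpg http://localhost:8080/mybucket/photo.jpg`; real S3 clients additionally expect signature validation and XML list responses.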
Project 11: RBD Snapshot Engine (Copy-on-Write)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: C, Rust
-
Coolness Level: Level 4: Hardcore Tech Flex
-
Business Potential: 4. The “Open Core” Infrastructure
-
Difficulty: Level 4: Expert
-
Knowledge Area: Block Storage & Versioning
-
Software or Tool: Sparse Files
-
Main Book: “Operating Systems: Three Easy Pieces” by Remzi Arpaci-Dusseau
What you’ll build: A system that lets you take a “Snapshot” of a large block device instantaneously, where only the changes made after the snapshot take up extra space.
Why it teaches SDS: This teaches you about “Immutability.” You’ll learn that you never overwrite an object; you create a new version and update a pointer. This is how databases and VMs perform instant backups.
Core challenges you’ll face:
- Copy-on-Write (CoW) -> Only duplicating a block when it’s modified.
- Pointer Tracking -> Keeping a tree of which blocks belong to which snapshot.
- Thin Provisioning -> Selling 1TB of storage when you only have 10GB of physical space.
Key Concepts
- Snapshots: “Operating Systems: Three Easy Pieces” Ch. 41: “Locality and The Fast File System”.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Project 5 (OSD), understanding of File Pointers.
Real World Outcome
You create a 10GB “Virtual Disk,” fill it with 100MB of data, take a snapshot, and see that the total storage used is still only ~100MB.
Example Output:
$ ./rbd-tool create disk1 --size 10G
$ ./rbd-tool write disk1 --offset 0 --data "Hello"
$ ./rbd-tool snapshot disk1 --name snap1
$ ./rbd-tool write disk1 --offset 0 --data "World"
$ ./rbd-tool read disk1 --snapshot snap1 --offset 0
"Hello"
$ ./rbd-tool read disk1 --offset 0
"World"
Project 12: Tiered Storage (Flash vs Disk)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Python (for policy), Rust
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 5. The “Industry Disruptor”
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Resource Management & QoS
-
Software or Tool: Multi-Pool Logic
-
Main Book: “Mastering Ceph” by Nick Fisk
What you’ll build: A policy engine that moves “Hot” (frequently accessed) objects to an SSD pool and “Cold” (rarely used) objects to an HDD pool.
Why it teaches SDS: You’ll learn about “Lifecycle Management.” It forces you to think about the cost-performance ratio. You’ll learn how to migrate data across pools without the user noticing.
Core challenges you’ll face:
- Heat Mapping -> Tracking access frequency for every object.
- Migration Orchestration -> Moving bytes from Node A to Node B while reads are happening.
- Promotion/Demotion Rules -> Deciding exactly when a file is “cold enough” to move.
Key Concepts
- Cache Tiering: “Mastering Ceph” Ch. 8.
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Project 10 (Gateway), basic statistics/counters.
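A toy sketch of a promotion/demotion pass, with made-up thresholds and tier names: count accesses per object during a window and move objects whose heat crosses a line.

```go
package main

import "fmt"

type Tier string

const (
	SSD Tier = "ssd-pool"
	HDD Tier = "hdd-pool"
)

// ObjectStats tracks one object's current tier and recent access count.
type ObjectStats struct {
	Tier     Tier
	Accesses int // accesses in the current window (reset periodically)
}

// rebalanceTiers promotes hot HDD objects and demotes cold SSD objects.
func rebalanceTiers(objects map[string]*ObjectStats, promoteAt, demoteAt int) {
	for name, o := range objects {
		switch {
		case o.Tier == HDD && o.Accesses >= promoteAt:
			o.Tier = SSD
			fmt.Println("promote", name, "-> ssd-pool")
		case o.Tier == SSD && o.Accesses <= demoteAt:
			o.Tier = HDD
			fmt.Println("demote ", name, "-> hdd-pool")
		}
	}
}

func main() {
	objects := map[string]*ObjectStats{
		"hot_index.db":   {Tier: HDD, Accesses: 120},
		"old_backup.tgz": {Tier: SSD, Accesses: 1},
	}
	rebalanceTiers(objects, 100, 5)
}
```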
Project 13: The Distributed Quota Manager
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Rust, Python
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 3. The “Service & Support” Model
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Distributed Counters & Enforcement
-
Software or Tool: Redis (as counter) or Custom State
-
Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A system that prevents a specific user from uploading more than their allocated 10GB, even if they are uploading to 5 different nodes simultaneously.
Why it teaches SDS: This is about “Global State.” You’ll learn the difference between “Local enforcement” (easy) and “Global enforcement” (hard). You’ll deal with race conditions where a user tries to fill up the last 10MB of their quota from two different gateways.
Core challenges you’ll face:
- Consensus on Usage -> How do nodes agree on how much a user has used?
- Soft vs Hard Quotas -> Giving a user a warning vs. cutting off their access.
- Latency -> Ensuring that checking a quota doesn’t double the time of an upload.
Key Concepts
- Distributed Counters: “Designing Data-Intensive Applications” Ch. 9.
Difficulty: Advanced
Time estimate: 1 week
Prerequisites: Project 10 (Gateway).
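A sketch of the reserve-before-upload pattern that closes the race, using a single in-process counter as a stand-in for the shared counter service (e.g. Redis or the monitors): a gateway must reserve space before accepting bytes, and gives the reservation back if the upload fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Quota tracks one user's limit and everything reserved against it.
type Quota struct {
	mu       sync.Mutex
	limit    int64
	reserved int64 // bytes reserved by in-flight uploads plus committed usage
}

// Reserve atomically claims space before the upload starts.
func (q *Quota) Reserve(n int64) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.reserved+n > q.limit {
		return errors.New("quota exceeded")
	}
	q.reserved += n
	return nil
}

// Release gives the reservation back if the upload fails or is aborted.
func (q *Quota) Release(n int64) {
	q.mu.Lock()
	q.reserved -= n
	q.mu.Unlock()
}

func main() {
	q := &Quota{limit: 10 << 30}                // 10 GiB
	fmt.Println("upload 1:", q.Reserve(9<<30))  // accepted
	fmt.Println("upload 2:", q.Reserve(2<<30))  // rejected: only ~1 GiB left
}
```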
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Object Router | Level 1 | Weekend | High (Foundational) | 3 |
| 2. CRUSH Engine | Level 3 | 1-2 weeks | Very High (Architecture) | 5 |
| 3. PG Manager | Level 3 | 2 weeks | High (Logical Ops) | 4 |
| 4. SDS Monitor | Level 2 | 1 week | Medium (Control Plane) | 3 |
| 5. OSD Daemon | Level 3 | 2 weeks | High (Data Plane) | 4 |
| 6. Replication | Level 4 | 2 weeks | High (Consistency) | 4 |
| 7. Self-Healing | Level 4 | 2 weeks | Very High (Automation) | 5 |
| 8. Erasure Coding | Level 5 | 3 weeks | Extreme (Math/Durability) | 5 |
| 9. Scrubber | Level 3 | 1 week | High (Integrity) | 3 |
| 10. API Gateway | Level 3 | 2 weeks | Medium (Interfacing) | 4 |
| 11. Snapshots | Level 4 | 2 weeks | High (Block Storage) | 5 |
| 12. Tiered Storage | Level 3 | 2 weeks | Medium (Efficiency) | 4 |
| 13. Quota Manager | Level 3 | 1 week | Medium (Coordination) | 3 |
Recommendation
If you are a student/hobbyist: Start with Project 1 (Router) and Project 2 (CRUSH). They give you the most “bang for your buck” in terms of understanding why SDS is different from a normal database.
If you are a software engineer: Focus on Project 6 (Replication) and Project 10 (Gateway). These deal with the hardest part of distributed systems: network reliability and protocol translation.
If you want a job at Red Hat/Suse/Canonical: You must complete Project 8 (Erasure Coding) and Project 11 (Snapshots). These are the advanced features that distinguish junior storage admins from senior storage architects.
Final Overall Project: The “Petabyte-Scale” Storage OS
What you’ll build: Combine all previous components into a single distributed storage cluster.
- Launch 3 Monitors (Project 4) in a quorum.
- Launch 10 OSDs (Project 5) across different simulated racks.
- Implement a Gateway (Project 10) that uses the CRUSH Map (Project 2) to write replicated data (Project 6).
- Add a Scrubber (Project 9) that runs every 60 seconds.
Success Criteria:
- You can upload a file via the S3 Gateway.
- You can kill 2 OSDs simultaneously and the data remains accessible.
- You can add a new OSD and see the cluster “rebalance” the data automatically.
- You can verify that all data is checksummed and healed if corrupted.
Summary
This learning path covers Software Defined Storage through 13 hands-on projects.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Hash-Based Object Router | Go | Level 1 | Weekend |
| 2 | Failure-Domain Aware CRUSH Engine | Go | Level 3 | 1-2 weeks |
| 3 | Placement Group (PG) Manager | Go | Level 3 | 2 weeks |
| 4 | SDS Monitor (OSDMap Keeper) | Go | Level 2 | 1 week |
| 5 | Object Store Daemon (OSD) | Go | Level 3 | 2 weeks |
| 6 | Sync Multi-Copy Replication | Go | Level 4 | 2 weeks |
| 7 | Self-Healing Engine | Go | Level 4 | 2 weeks |
| 8 | Erasure Coding Translator | Go | Level 5 | 3 weeks |
| 9 | Deep Scrubber | Go | Level 3 | 1 week |
| 10 | SDS API Gateway (S3/Swift) | Go | Level 3 | 2 weeks |
| 11 | RBD Snapshot Engine | Go | Level 4 | 2 weeks |
| 12 | Tiered Storage Policy | Go | Level 3 | 2 weeks |
| 13 | Distributed Quota Manager | Go | Level 3 | 1 week |
Recommended Learning Path
For beginners: Start with projects #1, #2, #4.
For intermediate: Focus on #3, #5, #6, #10.
For advanced: Master #7, #8, #9, #11.
Expected Outcomes
After completing these projects, you will:
- Understand how algorithms replace metadata servers for massive scale.
- Master the CRUSH algorithm for topology-aware data placement.
- Build autonomous systems that heal themselves from hardware failure.
- Implement storage efficiencies like Erasure Coding and Snapshots.
- Be capable of architecting storage solutions for cloud-native environments.
You’ll have built a working distributed storage system from the ground up, proving you understand SDS from the first bit to the last parity chunk.