SOFTWARE DEFINED STORAGE DEEP DIVE
Learn Software Defined Storage (SDS): From Zero to Storage Architect
Goal: Deeply understand the internal machinery of Software Defined Storage—specifically how systems like Ceph and GlusterFS eliminate metadata bottlenecks, manage massive data placement without central tables (CRUSH/Elastic Hashing), and achieve autonomous self-healing. By building these components from first principles, you will master the art of architecting resilient, exabyte-scale storage clusters.
Why SDS Matters
In the traditional world, storage was a “black box”—proprietary hardware from vendors like EMC or NetApp. If you needed more space, you bought another expensive box. Software Defined Storage (SDS) shattered this model by moving the “intelligence” from specialized hardware to open-source software running on commodity servers.
Real-World Impact:
- Exabyte Scale: Platforms like CERN use Ceph to store hundreds of petabytes of physics data.
- Cloud Foundation: OpenStack, Kubernetes, and Proxmox rely on SDS for persistent volume management.
- Cost: Moving from SAN/NAS hardware to commodity SDS can reduce storage costs by 70-80%.
Understanding SDS means understanding how to manage “Storage as a Network Service.” It is the bridge between raw disks and the distributed applications that drive the modern web.
Core Concept Analysis
1. The Death of the Central Metadata Server
Traditional distributed filesystems (like HDFS) use a “NameNode”—a central server that knows where every block lives. This is a massive bottleneck. SDS systems like Ceph and Gluster use Algorithms instead of Lookup Tables.
Traditional (HDFS):
Client -> "Where is File A?" -> Metadata Server
Metadata Server -> "It's on Server 5" -> Client
Client -> Server 5 (Fetch)
SDS (Ceph/Gluster):
Client -> Hash(FileName) + Topology Map -> Local Computation
Local Computation -> "It's on Server 5"
Client -> Server 5 (Fetch)
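The whole trick fits in a few lines. Below is a minimal Go sketch of the idea (illustrative names, plain modulo hashing as a stand-in for CRUSH or consistent hashing): every client that holds the same cluster map computes the same answer, with no metadata server in the loop.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ClusterMap is the only shared state a client needs to locate data.
type ClusterMap struct {
	Epoch int      // bumped whenever membership changes
	Nodes []string // e.g. storage servers
}

// Locate deterministically maps an object name to a node: no lookup table,
// no metadata round-trip, just a hash plus the map.
func Locate(m ClusterMap, object string) string {
	h := fnv.New64a()
	h.Write([]byte(object))
	return m.Nodes[h.Sum64()%uint64(len(m.Nodes))]
}

func main() {
	m := ClusterMap{Epoch: 1, Nodes: []string{"server1", "server2", "server3", "server4", "server5"}}
	// Every client holding Epoch 1 of the map computes the same answer.
	fmt.Println("File A lives on", Locate(m, "File A"))
}
```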
2. CRUSH Maps & Failure Domains
CRUSH (Controlled Replication Under Scalable Hashing) is the brain of Ceph. It treats the storage cluster as a hierarchy (Data Center > Room > Rack > Host > Disk).
CRUSH Hierarchy:
[Root]
/ \
[Rack 1] [Rack 2]
/ \ / \
[H1] [H2][H3] [H4]
| | | |
[D1] [D2][D3] [D4]
Why it matters: CRUSH rules ensure that if you want 3 replicas, it places them in different racks. If Rack 1 loses power, the data is still safe in Rack 2.
3. Placement Groups (PGs)
Managing 10 billion objects individually is impossible. Ceph groups objects into PGs.
[Object A] \
[Object B] -- [PG 1.1] -> [OSD 5, OSD 12, OSD 42]
[Object C] /
[Object D] \
[Object E] -- [PG 1.2] -> [OSD 7, OSD 2, OSD 19]
[Object F] /
Analogy: If you have 1,000,000 letters to deliver, you don’t track every letter. You track 10,000 mailbags. If a mail truck (OSD) breaks down, you just re-route the bags.
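A minimal Go sketch of this indirection, with illustrative PG counts and OSD names: objects hash into a fixed number of PGs, and only the small PG-to-OSD table has to be maintained (here it is filled with placeholder assignments; a real system would compute it with CRUSH).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPGs = 128 // fixed and small -- this is the whole point

// pgOf hashes an object name into one of the placement groups.
func pgOf(object string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(object))
	return h.Sum64() % numPGs
}

func main() {
	// The only table the cluster maintains: one entry per PG, not per object.
	actingSet := make(map[uint64][]string, numPGs)
	for pg := uint64(0); pg < numPGs; pg++ {
		// Placeholder assignment; a real system computes this with CRUSH.
		actingSet[pg] = []string{
			fmt.Sprintf("OSD_%d", pg%10),
			fmt.Sprintf("OSD_%d", (pg+3)%10),
			fmt.Sprintf("OSD_%d", (pg+7)%10),
		}
	}
	pg := pgOf("holiday_photo.jpg")
	fmt.Printf("holiday_photo.jpg -> PG %d -> %v\n", pg, actingSet[pg])
}
```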
4. Self-Healing (The Peering Dance)
When a disk fails, the remaining disks in the affected PGs “gossip” (Peer) to determine what data is missing and where to re-replicate it.
1. OSD 5 Fails.
2. PGs 1.1, 1.5, and 2.3 are now "Degraded" (only 2/3 replicas).
3. Surviving OSDs (12 & 42) notify the cluster.
4. CRUSH calculates new home: OSD 99.
5. OSD 12 copies data to OSD 99.
6. Cluster is "Clean" again.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Algorithmic Placement | Using Hashing (EHA) or CRUSH instead of central databases to locate data. |
| Failure Domains | Designing the system to survive the loss of a disk, host, rack, or entire DC. |
| Placement Groups | Sharding objects into manageable logical units for rebalancing and recovery. |
| Consistent Hashing | Minimizing data movement when nodes join or leave the cluster. |
| Replication/Erasure Coding | The math of durability vs. storage efficiency. |
| Peering & Recovery | How nodes autonomously agree on state and repair themselves without human help. |
Deep Dive Reading by Concept
Storage Theory & Basics
| Concept | Book & Chapter |
|---|---|
| Distributed Systems Fundamentals | Distributed Systems: Concepts and Design by Coulouris — Ch. 1: “Characterization of Distributed Systems” |
| Storage Hierarchy | Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron — Ch. 6: “The Memory Hierarchy” |
SDS Specifics (Ceph & Gluster)
| Concept | Book & Chapter |
|---|---|
| CRUSH Algorithm | Mastering Ceph by Nick Fisk — Ch. 5: “CRUSH Maps and Data Placement” |
| Placement Groups | Learning Ceph by Karan Singh — Ch. 2: “Ceph Internal Architecture” |
| Gluster Translators | GlusterFS Architecture Guide (Official Docs) — “Architecture and Translators” |
| Self-Healing Logic | Ceph: Distributed Storage for Linux and Cloud by Anthony D’Atri — Ch. 6: “OSD Recovery and Rebalancing” |
Essential Reading Order
- The Math (Week 1):
- Read the original CRUSH Paper (Sage Weil, 2006). It’s the foundation of modern SDS.
- The Layout (Week 1):
- Mastering Ceph Ch. 5. Understand how hierarchy affects durability.
- The Recovery (Week 2):
- Learning Ceph Ch. 2. Focus on how monitors (ceph-mon) maintain the cluster map.
Project 1: The Hash-Based Object Router
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Consistent Hashing / Data Distribution
- Software or Tool: Static Node List
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A tool that takes a filename and a list of storage nodes, and consistently returns the same node ID for that file without using a database.
Why it teaches SDS: This is the “Eureka” moment of SDS. You discover that you don’t need a central index to find files. You realize that Location = F(FileName, ClusterMap).
Core challenges you’ll face:
- Handling Node Changes -> Mapping files when a node is added (Consistent Hashing).
- Uniform Distribution -> Ensuring one node doesn’t get 90% of the data.
- Virtual Nodes -> Using replicas of nodes to smooth out hash distribution.
Key Concepts
- Consistent Hashing: “Designing Data-Intensive Applications” Ch. 6 - Martin Kleppmann
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic knowledge of Hash functions (MD5/SHA1).
Real World Outcome
You’ll have a CLI tool that can simulate adding 10,000 files to a cluster and show how many files move when a node is added. Success is seeing that adding a node only affects ~1/N of the files, rather than triggering a total cluster reshuffle.
Example Output:
$ ./router route --file "holiday_photo.jpg" --nodes "nodeA,nodeB,nodeC"
File: holiday_photo.jpg -> Mapped to: nodeB
$ ./router simulate --files 10000 --nodes 3
Node Distribution:
nodeA: 3341 files
nodeB: 3320 files
nodeC: 3339 files
$ ./router simulate --files 10000 --nodes 3 --add-node nodeD
Total files moved: 2501 (Efficiency: 25.01%)
# Note: In a non-consistent hash, almost 100% of files would move!
The Core Question You’re Answering
“If the phone book (metadata server) disappears, how can I still find my friend’s number just by knowing their name?”
Before you write any code, sit with this question. Most developers are addicted to central databases. SDS is about the independence of data from a central coordinator. Can you calculate location using only math?
Concepts You Must Understand First
Stop and research these before coding:
- Hash Functions & Uniformity
- Why do we use SHA-256 or MD5 instead of just `len(filename) % nodes`?
- What happens if your hash function isn’t uniform?
- Book Reference: “Algorithms” by Sedgewick — Ch. 3 (Hash Tables)
- The “Modulo N” Problem
- If you have 3 nodes and add 1, why does `hash % 3` vs `hash % 4` cause 75% of data to move?
- Book Reference: “Designing Data-Intensive Applications” Ch. 6.
Questions to Guide Your Design
Before implementing, think through these:
- The Ring
- How will you represent the “ring” of hash values (0 to 2^64-1)?
- How do you find the “closest” node on the ring to a file’s hash?
- Virtual Nodes
- If one node is a $10,000 beefy server and another is a $500 old PC, how do you give the big server more “shares” of the ring?
Thinking Exercise
The Reshuffle Trace
Imagine 3 nodes (A, B, C) and 4 files (1, 2, 3, 4).
- Assign them using `hash(file) % 3`.
- Add Node D.
- Re-assign using `hash(file) % 4`.
Questions while tracing:
- How many files stayed on their original node?
- Now try it with a “Ring” where nodes occupy specific points. If you add Node D between A and B, which files actually move?
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is consistent hashing and why is it used in distributed storage?”
- “How do virtual nodes help with heterogeneity in a cluster?”
- “Compare consistent hashing to the ‘Elastic Hashing’ used in GlusterFS.”
- “What is the computational complexity of finding a node in a ring of size N?”
Hints in Layers
Hint 1: The Circle
Think of the hash space as a circle. Map your nodes to random points on that circle.
Hint 2: Successor
To find a file’s home, hash the file name to get a point on the circle. Walk clockwise until you hit the first node. That’s the owner.
Hint 3: Implementation
Use a sorted array of the nodes’ hash values. For any file hash, use binary search (sort.Search in Go) to find the first node value that is greater than or equal to the file hash.
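Putting the three hints together, here is a minimal Go sketch of the ring (virtual nodes omitted for brevity, FNV used as a stand-in hash):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring: node hashes sorted around a circle.
type Ring struct {
	points []uint64          // sorted node hashes
	owner  map[uint64]string // node hash -> node name
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func NewRing(nodes []string) *Ring {
	r := &Ring{owner: map[uint64]string{}}
	for _, n := range nodes {
		p := hash64(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Route finds the first node clockwise from the file's hash.
func (r *Ring) Route(file string) string {
	h := hash64(file)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"nodeA", "nodeB", "nodeC"})
	fmt.Println("holiday_photo.jpg ->", ring.Route("holiday_photo.jpg"))
}
```

Adding a node only inserts one more point into the sorted slice, which is why only the files between that point and its predecessor change owners.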
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Hashing | “Algorithms” by Sedgewick | Ch. 3 |
| Partitioning | “Designing Data-Intensive Applications” | Ch. 6 |
Project 2: Failure-Domain Aware CRUSH Engine
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Placement Algorithms
- Software or Tool: Hierarchical Maps
- Main Book: “Mastering Ceph” by Nick Fisk
What you’ll build: A simplified CRUSH algorithm implementation that places data replicas across different “Buckets” (Racks, Hosts).
Why it teaches SDS: It forces you to think about physical reality. You’ll learn that a distributed system is only as good as its failure domain logic. You’ll move from “put this on a random server” to “put this on a server that doesn’t share a power strip with the first one.”
Core challenges you’ll face:
- Tree Traversal -> Navigating the cluster hierarchy to find leaves (disks).
- Determinism -> Ensuring the same input always yields the same replica set.
- Avoidance Logic -> If a host is already used for Replica 1, the algorithm must skip that host for Replica 2.
Key Concepts
- CRUSH Algorithm: “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data” (Paper) - Sage Weil
Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Recursion, Tree Data Structures.
Real World Outcome
A simulation where you define a complex data center and verify that data is never stored twice in the same rack. You will prove that your system is “Rack-Aware.”
Example Output:
$ ./crush map --cluster dc.json --replicas 3 --object "db_backup_01"
Placement for 'db_backup_01':
Replica 1: DC_North -> Rack_01 -> Host_A -> Disk_0
Replica 2: DC_North -> Rack_02 -> Host_C -> Disk_1
Replica 3: DC_South -> Rack_05 -> Host_X -> Disk_4
# SUCCESS: All replicas are in different Racks!
The Core Question You’re Answering
“If I have 10,000 disks and 1,000 racks, how can I guarantee my data survives a rack-level power failure without keeping a massive spreadsheet of where every file is?”
This project is about Deterministic Hierarchy. You aren’t picking a random disk; you are picking a random path through a tree that obeys strict rules.
Concepts You Must Understand First
Stop and research these before coding:
- Failure Domains
- What is a “Correlation of Failure”?
- Why is a host a failure domain? Why is a rack a failure domain?
- Book Reference: “Mastering Ceph” Ch. 5.
- Recursive Descent
- How do you “take” a branch and continue the search for a leaf?
- Book Reference: “The Algorithm Design Manual” by Skiena.
Questions to Guide Your Design
Before implementing, think through these:
- Weights
- If one rack has 10 disks and another has 20, how do you ensure the 20-disk rack gets twice as many PGs? (Hint: the straw/straw2 bucket algorithm.)
- The “Take” Step
- How do you implement the “take root -> select 3 racks -> select 1 host per rack -> select 1 disk per host” logic?
Thinking Exercise
The Rack-Failure Test
Sketch a cluster with 2 Racks. Each rack has 2 Hosts. Each host has 1 Disk. You want 3 replicas.
- If you use simple hashing, what is the probability that 2 replicas end up in the same rack?
- If you use CRUSH rules, how do you prevent the 3rd replica from being placed at all if there are only 2 racks? (Safety over availability).
The Interview Questions They’ll Ask
Prepare to answer these:
- “Explain the difference between a CRUSH Map and a standard Hash Map.”
- “How does CRUSH handle the addition of a new rack to an existing cluster?”
- “What is a ‘straw2’ bucket and why is it the default in modern Ceph?”
Hints in Layers
Hint 1: The Hierarchy
Build a JSON tree: Root -> [Racks] -> [Hosts] -> [Disks]. Each level should have a “weight.”
Hint 2: Selection
For each replica, start at the root. Use a hash of (object_name, replica_number, bucket_id) to pick a child bucket.
Hint 3: Retries
If the chosen child leads to a leaf you’ve already picked for this object, increment a “retry” counter in the hash and pick again. This is “collision resolution” in CRUSH.
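A compact Go sketch of that select-with-retry descent, using FNV hashing and a hard-coded hierarchy as stand-ins for straw2 and a real CRUSH map; if no unused rack can be found, it simply places fewer replicas (safety over availability):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bucket is one node in the hierarchy: root, rack, host, or (leaf) disk.
type Bucket struct {
	Name     string
	Children []*Bucket // empty => leaf
}

// pick deterministically chooses a child from (object, replica, attempt).
func pick(children []*Bucket, object string, replica, attempt int) *Bucket {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d/%d", object, replica, attempt)
	return children[h.Sum64()%uint64(len(children))]
}

// place returns one "rack/host/disk" path per replica, never reusing a rack.
func place(root *Bucket, object string, replicas int) []string {
	usedRacks := map[string]bool{}
	var out []string
	for r := 0; r < replicas; r++ {
		for attempt := 0; attempt < 50; attempt++ {
			rack := pick(root.Children, object, r, attempt)
			if usedRacks[rack.Name] {
				continue // collision: bump the attempt counter and re-hash
			}
			host := pick(rack.Children, object, r, attempt)
			disk := pick(host.Children, object, r, attempt)
			usedRacks[rack.Name] = true
			out = append(out, rack.Name+"/"+host.Name+"/"+disk.Name)
			break
		}
		// If every attempt hit a used rack, we place fewer replicas rather
		// than double up inside one failure domain.
	}
	return out
}

func main() {
	disk := func(n string) *Bucket { return &Bucket{Name: n} }
	root := &Bucket{Name: "root", Children: []*Bucket{
		{Name: "rack1", Children: []*Bucket{
			{Name: "hostA", Children: []*Bucket{disk("disk0"), disk("disk1")}},
			{Name: "hostB", Children: []*Bucket{disk("disk0")}},
		}},
		{Name: "rack2", Children: []*Bucket{
			{Name: "hostC", Children: []*Bucket{disk("disk0"), disk("disk1")}},
		}},
		{Name: "rack3", Children: []*Bucket{
			{Name: "hostD", Children: []*Bucket{disk("disk0")}},
		}},
	}}
	fmt.Println(place(root, "db_backup_01", 3)) // three replicas, three different racks
}
```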
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| CRUSH | “Mastering Ceph” | Ch. 5 |
| Tree Algorithms | “The Algorithm Design Manual” | Ch. 3 |
Project 3: The Placement Group (PG) Manager
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Sharding & Logical Grouping
- Software or Tool: RADOS-like architecture
- Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A system that shards millions of objects into a fixed number of “Placement Groups” (PGs) and maps those PGs to specific storage nodes.
Why it teaches SDS: You’ll understand why tracking every object is a scaling nightmare and how “logical indirection” (Object -> PG -> Node) allows the cluster to rebalance 100GB of data by just changing one line in a map.
Core challenges you’ll face:
- PG Splitting/Merging -> What happens when you add 1,000 more disks?
- State Management -> Each PG has a state (Clean, Degraded, Peering).
- Mapping Logic -> `PG_ID = hash(object_name) % num_pgs`.
Key Concepts
- Logical Sharding: “Learning Ceph” Ch. 2.
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Hash functions, modular arithmetic.
Real World Outcome
A dashboard-like CLI showing the health of your logical cluster. You’ll be able to “down” an OSD and see exactly which PGs are affected and need repair.
Example Output:
$ ./pg-manager status
Pool: 'default'
Total Objects: 1,500,000
Total PGs: 128
PG Distribution:
PG [0..31]: 12 disks (Healthy)
PG [32..63]: 12 disks (Healthy)
PG [64..95]: 11 disks (Degraded - OSD_5 is down)
PG [96..127]: 12 disks (Healthy)
$ ./pg-manager inspect-pg 65
Objects in PG 65:
- user_avatar_99.png
- cat_video_final.mp4
- ...
Current Acting Set: [OSD_2, OSD_8, OSD_21]
The Core Question You’re Answering
“How do I move 10TB of data without moving 10TB of metadata?”
By grouping objects into PGs, you only have to track the mapping of ~10,000 PGs to OSDs, rather than 1,000,000,000 objects. This is the difference between a system that scales and one that dies.
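A minimal sketch of the payoff, with illustrative PG IDs and OSD names: when an OSD dies, you scan a table of a few thousand acting sets instead of a billion object records.

```go
package main

import "fmt"

// PG pairs a placement group ID with the OSDs currently serving it.
type PG struct {
	ID        int
	ActingSet []string
}

// degradedBy returns the IDs of PGs that lost a replica when osd went down.
func degradedBy(pgs []PG, osd string) []int {
	var hit []int
	for _, pg := range pgs {
		for _, member := range pg.ActingSet {
			if member == osd {
				hit = append(hit, pg.ID)
				break
			}
		}
	}
	return hit
}

func main() {
	pgs := []PG{
		{ID: 64, ActingSet: []string{"OSD_5", "OSD_8", "OSD_21"}},
		{ID: 65, ActingSet: []string{"OSD_2", "OSD_8", "OSD_21"}},
		{ID: 66, ActingSet: []string{"OSD_5", "OSD_9", "OSD_30"}},
	}
	fmt.Println("PGs degraded by losing OSD_5:", degradedBy(pgs, "OSD_5"))
}
```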
Concepts You Must Understand First
Stop and research these before coding:
- Sharding vs. Replication
- How does a PG differ from a simple shard? (Hint: a PG is replicated).
- Book Reference: “Learning Ceph” Ch. 2.
- Modular Arithmetic in Distributed Systems
- `pg = hash(obj) % num_pgs`. Why must `num_pgs` be a power of 2?
Thinking Exercise
The Migration Paradox
If you add 1 OSD to a 10-OSD cluster:
- How many PGs should move to the new OSD?
- If you didn’t have PGs and just mapped objects to OSDs, what would be the impact on your metadata database during this rebalance?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Why does Ceph use Placement Groups instead of direct object-to-OSD mapping?”
- “What happens if you have too few PGs in a cluster? What about too many?”
- “Explain the relationship between PGs, Pools, and OSDs.”
Project 4: The SDS Monitor (OSDMap Keeper)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cluster Membership & Consensus
- Software or Tool: HTTP/JSON Map
- Main Book: “Ceph: Distributed Storage for Linux and Cloud” by Anthony D’Atri
What you’ll build: A central service (Monitor) that tracks which storage nodes (OSDs) are “Up” or “Down” and broadcasts a versioned “Cluster Map” to all clients.
Why it teaches SDS: In SDS, the “Truth” is the Map. You’ll learn how to handle the “Flapping OSD” problem (a disk that turns on and off quickly) and how versioning maps prevents clients from writing to dead nodes.
Core challenges you’ll face:
- Health Checks (Heartbeats) -> Detecting a dead node in < 2 seconds.
- Map Versioning (Epochs) -> Ensuring all clients are using the same version of the truth.
- Quorum (Conceptual) -> Why one monitor isn’t enough (Paxos/Raft).
Key Concepts
- Cluster Maps: “Ceph: Distributed Storage for Linux and Cloud” Ch. 3.
Difficulty: Intermediate Time estimate: 1 week Prerequisites: HTTP API basics, JSON serialization.
Real World Outcome
A system where you can kill a storage process and see the “Client” automatically get a new map update and stop sending data to that node. This is “Client-Side Intelligence.”
Example Output:
$ ./monitor start
Monitor started on port 6789 (Epoch: 1)
# In another terminal, an OSD joins:
$ ./osd start --id 5 --addr 192.168.1.50
OSD 5 reported UP. Monitor Epoch now: 2.
$ ./monitor get-map
{
"epoch": 2,
"osds": [
{"id": 5, "state": "UP", "addr": "192.168.1.50"}
]
}
The Core Question You’re Answering
“Who is allowed to declare a node dead, and how do we ensure everyone agrees on who is alive?”
This project answers the “Consensus” problem. If Client A thinks OSD 1 is alive but Client B thinks it’s dead, the system will diverge. The Monitor is the “Court of Record.”
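A single-monitor sketch of the epoch logic, skipping the network layer and using illustrative timeouts: every membership change bumps the epoch, and an OSD that misses its heartbeat window flips to DOWN.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type OSDState struct {
	Addr     string
	Up       bool
	LastSeen time.Time
}

// Monitor keeps the versioned cluster map. Real deployments need a quorum
// of monitors agreeing via Paxos/Raft; this is the single-node core idea.
type Monitor struct {
	mu    sync.Mutex
	Epoch int
	OSDs  map[int]*OSDState
}

// Heartbeat records an OSD check-in and bumps the epoch if its state changed.
func (m *Monitor) Heartbeat(id int, addr string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	osd, ok := m.OSDs[id]
	if !ok || !osd.Up {
		m.Epoch++ // membership changed: publish a new map version
	}
	m.OSDs[id] = &OSDState{Addr: addr, Up: true, LastSeen: time.Now()}
}

// MarkStale flips OSDs to DOWN if they missed heartbeats for too long.
func (m *Monitor) MarkStale(timeout time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, osd := range m.OSDs {
		if osd.Up && time.Since(osd.LastSeen) > timeout {
			osd.Up = false
			m.Epoch++
			fmt.Printf("OSD %d marked DOWN (epoch %d)\n", id, m.Epoch)
		}
	}
}

func main() {
	mon := &Monitor{OSDs: map[int]*OSDState{}}
	mon.Heartbeat(5, "192.168.1.50:6801")
	time.Sleep(30 * time.Millisecond)
	mon.MarkStale(20 * time.Millisecond) // OSD 5 missed its heartbeat window
	fmt.Println("current epoch:", mon.Epoch)
}
```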
Project 5: The “Object Store” Daemon (OSD)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Local Storage & Network I/O
- Software or Tool: Filesystem (XFS/Ext4)
- Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A daemon that receives binary objects over the network and stores them on the local disk, ensuring data integrity with checksums.
Why it teaches SDS: You’ll understand the “Data Plane.” While the Monitor handles the “Control Plane,” the OSD is where the bytes actually touch the spinning rust (or silicon). You’ll learn about atomic writes and data corruption.
Core challenges you’ll face:
- Object IDs to File Paths -> Mapping a UUID to a folder structure that doesn’t hit filesystem limits.
- Checksumming -> Generating and verifying MD5/CRC32 for every read/write.
- Concurrency -> Handling 100 simultaneous client writes safely.
Key Concepts
- RADOS Object Store: “Learning Ceph” Ch. 2.
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: File I/O, Network programming.
Real World Outcome
A set of folders on your laptop that act as a “distributed” node. You can curl a file into the OSD, and find it stored as a specific object ID in a subfolder.
Example Output:
$ ./osd-daemon --id 1 --port 6801 --storage-path ./data/osd1
OSD 1 Listening...
# From a client:
$ ./client put --id "my-photo" --file ./vacation.jpg
SUCCESS: Object my-photo stored on OSD 1.
$ ls -R ./data/osd1
./data/osd1/objects/my/my-photo.bin
./data/osd1/objects/my/my-photo.meta (Contains checksum)
The Core Question You’re Answering
“How do I turn a ‘file’ into an ‘object’ and ensure it hasn’t changed since I wrote it?”
You’ll discover that at the bottom layer, everything is just a flat key-value store of bytes. You are building the “S” in “SDS.”
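A minimal Go sketch of the local store, mirroring the .bin/.meta layout from the example output (the paths and the prefix-sharding scheme are illustrative, and atomic write-then-rename is left out for brevity):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// put writes the object bytes plus a checksum sidecar file.
func put(root, id string, data []byte) error {
	dir := filepath.Join(root, "objects", id[:2]) // shard directories by ID prefix
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	if err := os.WriteFile(filepath.Join(dir, id+".bin"), data, 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, id+".meta"), []byte(hex.EncodeToString(sum[:])), 0o644)
}

// get reads the object back and refuses to return silently corrupted bytes.
func get(root, id string) ([]byte, error) {
	dir := filepath.Join(root, "objects", id[:2])
	data, err := os.ReadFile(filepath.Join(dir, id+".bin"))
	if err != nil {
		return nil, err
	}
	want, err := os.ReadFile(filepath.Join(dir, id+".meta"))
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != string(want) {
		return nil, fmt.Errorf("checksum mismatch: object %s is corrupt", id)
	}
	return data, nil
}

func main() {
	if err := put("./data/osd1", "my-photo", []byte("jpeg bytes...")); err != nil {
		panic(err)
	}
	data, err := get("./data/osd1", "my-photo")
	fmt.Println(string(data), err)
}
```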
Project 6: Synchronous Multi-Copy Replication
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Consistency
- Software or Tool: Multiple OSDs
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A replication engine where the “Primary” OSD receives a write, sends it to two “Replica” OSDs, and only returns “Success” to the user after all three have acknowledged.
Why it teaches SDS: This is the core of “Strong Consistency.” You’ll learn about the write-latency penalty and the “Split Brain” risk if nodes disagree on who is the primary.
Core challenges you’ll face:
- Primary/Replica Handshakes -> Managing a lightweight two-phase commit (2PC) between the primary and its replicas.
- Timeout Logic -> What if a replica is slow? Does the write fail or do you mark the OSD as down?
- Ordering -> Ensuring Write A happens before Write B on all 3 nodes.
Key Concepts
- Replication: “Designing Data-Intensive Applications” Ch. 5.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Networking, Concurrency, State Machines.
Real World Outcome
A “fail-proof” write. You write to the Primary, pull the plug on a Replica, and see the system report a successful write (on 2 nodes) but also a “Degraded” warning. This proves your cluster map is communicating node state correctly.
Example Output:
$ ./client put --object "secret_plan" --replicas 3
1. Sending to Primary (OSD 1)...
2. Primary -> Replicating to OSD 2... OK
3. Primary -> Replicating to OSD 3... FAILED (Timeout)
4. Primary -> Notifying Monitor: OSD 3 is lagging.
5. SUCCESS (Degraded): 2/3 replicas stored.
The Core Question You’re Answering
“How can I guarantee that data is written to multiple machines simultaneously without making the user wait forever?”
You’ll grapple with the Latency vs. Durability trade-off. If you have 10 replicas, a write is very safe but very slow. If you have 1, it’s fast but dangerous.
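A sketch of the fan-out-and-count logic, with replicas modeled as in-process functions rather than network peers; the timeout handling and the “degraded” outcome mirror the example output above.

```go
package main

import (
	"fmt"
	"time"
)

// Replica stands in for a network peer; it stores one object.
type Replica func(object string, data []byte) error

// replicatedWrite sends the object to every replica and counts acks.
// It returns the number of stored copies and whether the write is degraded.
func replicatedWrite(object string, data []byte, replicas []Replica, timeout time.Duration) (int, bool) {
	acks := 1 // the primary's own local write (assumed already durable)
	for i, rep := range replicas {
		done := make(chan error, 1)
		go func(r Replica) { done <- r(object, data) }(rep)
		select {
		case err := <-done:
			if err == nil {
				acks++
			} else {
				fmt.Printf("replica %d failed: %v\n", i+2, err)
			}
		case <-time.After(timeout):
			fmt.Printf("replica %d timed out -> report it to the monitor\n", i+2)
		}
	}
	return acks, acks < len(replicas)+1
}

func main() {
	ok := func(string, []byte) error { return nil }
	slow := func(string, []byte) error { time.Sleep(time.Second); return nil }
	acks, degraded := replicatedWrite("secret_plan", []byte("..."), []Replica{ok, slow}, 100*time.Millisecond)
	fmt.Printf("SUCCESS: %d/3 copies stored (degraded=%v)\n", acks, degraded)
}
```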
Project 7: The Self-Healing Engine (Rebalancing)
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Automation & Data Recovery
- Software or Tool: PG State Machine
- Main Book: “Ceph: Distributed Storage for Linux and Cloud” by Anthony D’Atri
What you’ll build: A “Manager” module that detects when a PG is missing a replica and automatically instructs the remaining nodes to copy data to a new spare node.
Why it teaches SDS: This is the “Magic” of Ceph. You’ll understand how a system can be “autonomic.” You’ll learn the difference between “Replication” (keeping copies) and “Recovery” (making new copies when old ones die).
Core challenges you’ll face:
- Delta Calculation -> Only copying the objects that changed since the node went down.
- Priority Queuing -> Ensuring user traffic (Reads/Writes) isn’t choked by recovery traffic.
- Backfilling -> Managing the bulk transfer of data to a brand-new node.
Key Concepts
- Recovery & Backfilling: “Mastering Ceph” Ch. 6.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Project 3 (PG Manager), Project 6 (Replication).
Real World Outcome
A “Resilient Cluster.” You kill an OSD, wait 30 seconds, and see the PG status go from Degraded to Recovering to Active+Clean as the data moves to a spare OSD. This is the moment you stop being a sysadmin and start being an architect.
Example Output:
# Initial State
$ ./manager status
Health: HEALTH_OK
# Simulate Failure
$ kill -9 <OSD_3_PID>
Health: HEALTH_WARN (1 PG degraded)
# 10 seconds later:
Health: HEALTH_RECOVERING (PG 1.5: 45% backfilled)
# 30 seconds later:
Health: HEALTH_OK (PG 1.5 moved to OSD_7)
The Core Question You’re Answering
“If a disk fails at 3 AM, can the software fix itself so I can keep sleeping?”
This project is about the Autonomy of distributed systems. A system that requires a human to fix every broken disk will never scale to 10,000 nodes.
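A toy sketch of the manager’s healing pass, with the actual data copy elided: it finds PGs that reference the dead OSD, swaps in a spare, and walks the state from Degraded through Recovering to Active+Clean. All names are illustrative.

```go
package main

import "fmt"

// PG holds a placement group's acting set and its health state.
type PG struct {
	ID     string
	Acting []string
	State  string // "Active+Clean", "Degraded", "Recovering"
}

// heal swaps the dead OSD for a spare in every affected PG and walks the
// state machine; the actual object copy (backfill) is elided.
func heal(pgs []*PG, dead, spare string) {
	for _, pg := range pgs {
		for i, osd := range pg.Acting {
			if osd != dead {
				continue
			}
			pg.State = "Degraded"
			fmt.Printf("PG %s degraded (lost %s)\n", pg.ID, dead)
			pg.State = "Recovering"
			pg.Acting[i] = spare // surviving replicas backfill the spare
			pg.State = "Active+Clean"
			fmt.Printf("PG %s recovered: %v\n", pg.ID, pg.Acting)
		}
	}
}

func main() {
	pgs := []*PG{
		{ID: "1.5", Acting: []string{"OSD_3", "OSD_9", "OSD_11"}, State: "Active+Clean"},
		{ID: "2.1", Acting: []string{"OSD_1", "OSD_2", "OSD_4"}, State: "Active+Clean"},
	}
	heal(pgs, "OSD_3", "OSD_7") // kill OSD_3, backfill onto spare OSD_7
}
```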
Project 8: Erasure Coding (EC) Translator
- File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
- Main Programming Language: Go
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Information Theory & Math
- Software or Tool: Reed-Solomon Libraries
- Main Book: “The Secret Life of Programs” by Jonathan Steinhart
What you’ll build: A module that takes an object, splits it into $K$ data chunks and $M$ parity chunks, and distributes them such that you can lose any $M$ nodes and still recover the file.
Why it teaches SDS: Replication is expensive (3x overhead). Erasure Coding is how big clouds (S3, Azure) store exabytes efficiently. You’ll learn the math behind “RAID for the network.”
Core challenges you’ll face:
- Matrix Math -> Implementing (or using) Reed-Solomon encoding.
- Chunk Alignment -> Handling files of different sizes.
- Reconstruction -> The logic to rebuild a missing chunk from the remaining ones.
Key Concepts
- Erasure Coding: “Designing Data-Intensive Applications” Ch. 5 (section on EC vs Replication).
Difficulty: Master
Time estimate: 2-3 weeks
Prerequisites: Bitwise operations, Linear Algebra basics.
Real World Outcome
A tool that can delete 2 out of 5 chunks of a file and still perfectly recreate the original file. You will grasp how $k = 3$ data chunks plus $m = 2$ parity chunks survive the loss of any 2 chunks while consuming only about 1.67x the original space (5/3), instead of the 3x cost of replication.
Example Output:
$ ./ec-tool encode --file my_doc.pdf --k 3 --m 2
Created 5 chunks: chunk.0, chunk.1, chunk.2 (Data) + chunk.p1, chunk.p2 (Parity)
$ rm chunk.1 chunk.p1
$ ./ec-tool decode --output recovered.pdf
Status: Success. (2 chunks missing, recovered using Reed-Solomon)
The Core Question You’re Answering
“How can I store 100GB of data safely using only 130GB of disk space, instead of the 300GB required by 3x replication?”
This is the Efficiency problem. Large scale storage is a business of margins. Erasure Coding makes petabyte-scale storage economically viable.
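A deliberately simplified sketch to make the idea concrete: single-parity XOR coding (k data chunks + 1 parity), which survives the loss of any one chunk. Real erasure coding uses Reed-Solomon to tolerate two or more losses, but the reconstruction idea (the missing chunk is a combination of the survivors) is the same.

```go
package main

import "fmt"

// encode splits data into k equal (zero-padded) chunks plus one XOR parity chunk.
func encode(data []byte, k int) [][]byte {
	size := (len(data) + k - 1) / k
	padded := make([]byte, size*k)
	copy(padded, data)
	chunks := make([][]byte, k+1)
	for i := 0; i < k; i++ {
		chunks[i] = padded[i*size : (i+1)*size]
	}
	parity := make([]byte, size)
	for _, c := range chunks[:k] {
		for j := range parity {
			parity[j] ^= c[j]
		}
	}
	chunks[k] = parity
	return chunks
}

// reconstruct rebuilds the single missing chunk by XOR-ing all survivors.
func reconstruct(chunks [][]byte, lost int) []byte {
	size := len(chunks[(lost+1)%len(chunks)])
	out := make([]byte, size)
	for i, c := range chunks {
		if i == lost {
			continue
		}
		for j := range out {
			out[j] ^= c[j]
		}
	}
	return out
}

func main() {
	chunks := encode([]byte("hello, erasure coded world"), 3)
	original := append([]byte(nil), chunks[1]...)
	chunks[1] = nil // simulate losing one chunk (one dead OSD)
	rebuilt := reconstruct(chunks, 1)
	fmt.Println("recovered chunk matches original:", string(rebuilt) == string(original))
}
```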
Project 9: Deep Scrubbing & Inconsistency Resolver
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Rust
-
Coolness Level: Level 4: Hardcore Tech Flex
-
Business Potential: 3. The “Service & Support” Model
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Data Integrity
-
Software or Tool: Background Workers
-
Main Book: “Learning Ceph” by Karan Singh
What you’ll build: A background worker that periodically compares the bit-for-bit content of all replicas of a PG and identifies “silent data corruption” (Bit rot).
Why it teaches SDS: You’ll learn that disks lie. Even if a disk says a read was successful, the bytes might be wrong. You’ll learn how to resolve “Conflicts” when 2 replicas say “A” and 1 says “B”.
Core challenges you’ll face:
- Resource Throttling -> Running a deep scan without slowing down the user.
- Majority Voting -> How to decide which replica is “right.”
- Merkle Trees (Optional) -> Comparing large directories efficiently.
Key Concepts
- Data Scrubbing: “Learning Ceph” Ch. 5.
Difficulty: Advanced
Time estimate: 1 week
Prerequisites: Project 6 (Replication), Hashing.
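A minimal sketch of the majority-vote step, with replica contents supplied inline for illustration: hash every copy, let the most common digest win, and flag the dissenters for repair.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// scrub returns the winning digest and the indexes of replicas that disagree.
func scrub(replicas [][]byte) (string, []int) {
	votes := map[string]int{}
	digests := make([]string, len(replicas))
	for i, data := range replicas {
		sum := sha256.Sum256(data)
		digests[i] = hex.EncodeToString(sum[:])
		votes[digests[i]]++
	}
	winner, best := "", 0
	for d, n := range votes {
		if n > best {
			winner, best = d, n
		}
	}
	var bad []int
	for i, d := range digests {
		if d != winner {
			bad = append(bad, i)
		}
	}
	return winner, bad
}

func main() {
	replicas := [][]byte{
		[]byte("the quick brown fox"),
		[]byte("the quick brown fox"),
		[]byte("the quick brown fo?"), // one bit-rotted copy
	}
	winner, bad := scrub(replicas)
	fmt.Printf("authoritative digest %s..., repair replicas %v\n", winner[:8], bad)
}
```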
Project 10: The SDS API Gateway (S3/Swift)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Python (FastAPI), Rust (Axum)
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 5. The “Industry Disruptor”
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Web Protocols & Auth
-
Software or Tool: HTTP Server / S3 API
-
Main Book: “Design and Build Great Web APIs” by Mike Amundsen
What you’ll build: A gateway that speaks the S3 protocol (PUT/GET buckets and objects) and translates those calls into your internal SDS commands.
Why it teaches SDS: This is the “Public Interface.” You’ll learn how to map an unstructured HTTP request to a structured distributed storage backend. You’ll handle large file uploads (Multipart) and authentication.
Core challenges you’ll face:
- Streaming Uploads -> Piping data from the HTTP request directly to OSDs without buffering it all in RAM.
- S3 Protocol Compatibility -> Correctly implementing XML/JSON responses that S3 clients expect.
- Bucket Metadata -> Where do you store the list of files in a bucket? (Hint: Use another object).
Key Concepts
- Object Gateways: “Learning Ceph” Ch. 4.
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: HTTP server basics, Project 5 (OSD).
Real World Outcome
You can use a standard S3 tool (like aws cli or rclone) to upload a file to your local “mini-S3” cluster.
Example Output:
$ export AWS_ACCESS_KEY_ID=test
$ export AWS_SECRET_ACCESS_KEY=test
$ aws s3 --endpoint-url http://localhost:8080 cp photo.jpg s3://mybucket/
upload: ./photo.jpg to s3://mybucket/photo.jpg
# Under the hood: Gateway -> Hash -> PG -> OSD 1,2,3
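A bare-bones sketch of the gateway’s PUT path, assuming a hypothetical storeObject backend in place of the real PG/OSD pipeline; auth, XML responses, and multipart uploads are omitted. The point is that the request body is streamed, never buffered whole in RAM.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// storeObject is a placeholder for: hash key -> PG -> primary OSD -> replicate.
func storeObject(bucket, key string, body io.Reader) (int64, error) {
	return io.Copy(io.Discard, body) // count bytes; a real backend writes to OSDs
}

func handlePut(w http.ResponseWriter, r *http.Request) {
	parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)
	if r.Method != http.MethodPut || len(parts) != 2 {
		http.Error(w, "expected PUT /<bucket>/<key>", http.StatusBadRequest)
		return
	}
	n, err := storeObject(parts[0], parts[1], r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Printf("stored %s/%s (%d bytes)\n", parts[0], parts[1], n)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handlePut)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

You can exercise it with `curl -X PUT --data-binary @photo.jpg http://localhost:8080/mybucket/photo.jpg`; real S3 clients additionally expect signature validation and XML list responses.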
Project 11: RBD Snapshot Engine (Copy-on-Write)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: C, Rust
-
Coolness Level: Level 4: Hardcore Tech Flex
-
Business Potential: 4. The “Open Core” Infrastructure
-
Difficulty: Level 4: Expert
-
Knowledge Area: Block Storage & Versioning
-
Software or Tool: Sparse Files
-
Main Book: “Operating Systems: Three Easy Pieces” by Remzi Arpaci-Dusseau
What you’ll build: A system that lets you take a “Snapshot” of a large block device instantaneously, where only the changes made after the snapshot take up extra space.
Why it teaches SDS: This teaches you about “Immutability.” You’ll learn that you never overwrite an object; you create a new version and update a pointer. This is how databases and VMs perform instant backups.
Core challenges you’ll face:
- Copy-on-Write (CoW) -> Only duplicating a block when it’s modified.
- Pointer Tracking -> Keeping a tree of which blocks belong to which snapshot.
- Thin Provisioning -> Selling 1TB of storage when you only have 10GB of physical space.
Key Concepts
- Snapshots: “Operating Systems: Three Easy Pieces” Ch. 41: “Locality and The Fast File System”.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Project 5 (OSD), understanding of File Pointers.
Real World Outcome
You create a 10GB “Virtual Disk,” fill it with 100MB of data, take a snapshot, and see that the total storage used is still only ~100MB.
Example Output:
$ ./rbd-tool create disk1 --size 10G
$ ./rbd-tool write disk1 --offset 0 --data "Hello"
$ ./rbd-tool snapshot disk1 --name snap1
$ ./rbd-tool write disk1 --offset 0 --data "World"
$ ./rbd-tool read disk1 --snapshot snap1 --offset 0
"Hello"
$ ./rbd-tool read disk1 --offset 0
"World"
Project 12: Tiered Storage (Flash vs Disk)
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Python (for policy), Rust
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 5. The “Industry Disruptor”
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Resource Management & QoS
-
Software or Tool: Multi-Pool Logic
-
Main Book: “Mastering Ceph” by Nick Fisk
What you’ll build: A policy engine that moves “Hot” (frequently accessed) objects to an SSD pool and “Cold” (rarely used) objects to an HDD pool.
Why it teaches SDS: You’ll learn about “Lifecycle Management.” It forces you to think about the cost-performance ratio. You’ll learn how to migrate data across pools without the user noticing.
Core challenges you’ll face:
- Heat Mapping -> Tracking access frequency for every object.
- Migration Orchestration -> Moving bytes from Node A to Node B while reads are happening.
- Promotion/Demotion Rules -> Deciding exactly when a file is “cold enough” to move.
Key Concepts
- Cache Tiering: “Mastering Ceph” Ch. 8.
Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Project 10 (Gateway), basic statistics/counters.
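A toy sketch of a promotion/demotion pass, with made-up thresholds and tier names: count accesses per object during a window and move objects whose heat crosses a line.

```go
package main

import "fmt"

type Tier string

const (
	SSD Tier = "ssd-pool"
	HDD Tier = "hdd-pool"
)

// ObjectStats tracks one object's current tier and recent access count.
type ObjectStats struct {
	Tier     Tier
	Accesses int // accesses in the current window (reset periodically)
}

// rebalanceTiers promotes hot HDD objects and demotes cold SSD objects.
func rebalanceTiers(objects map[string]*ObjectStats, promoteAt, demoteAt int) {
	for name, o := range objects {
		switch {
		case o.Tier == HDD && o.Accesses >= promoteAt:
			o.Tier = SSD
			fmt.Println("promote", name, "-> ssd-pool")
		case o.Tier == SSD && o.Accesses <= demoteAt:
			o.Tier = HDD
			fmt.Println("demote ", name, "-> hdd-pool")
		}
	}
}

func main() {
	objects := map[string]*ObjectStats{
		"hot_index.db":   {Tier: HDD, Accesses: 120},
		"old_backup.tgz": {Tier: SSD, Accesses: 1},
	}
	rebalanceTiers(objects, 100, 5)
}
```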
Project 13: The Distributed Quota Manager
-
File: SOFTWARE_DEFINED_STORAGE_DEEP_DIVE.md
-
Main Programming Language: Go
-
Alternative Programming Languages: Rust, Python
-
Coolness Level: Level 3: Genuinely Clever
-
Business Potential: 3. The “Service & Support” Model
-
Difficulty: Level 3: Advanced
-
Knowledge Area: Distributed Counters & Enforcement
-
Software or Tool: Redis (as counter) or Custom State
-
Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A system that prevents a specific user from uploading more than their allocated 10GB, even if they are uploading to 5 different nodes simultaneously.
Why it teaches SDS: This is about “Global State.” You’ll learn the difference between “Local enforcement” (easy) and “Global enforcement” (hard). You’ll deal with race conditions where a user tries to fill up the last 10MB of their quota from two different gateways.
Core challenges you’ll face:
- Consensus on Usage -> How do nodes agree on how much a user has used?
- Soft vs Hard Quotas -> Giving a user a warning vs. cutting off their access.
- Latency -> Ensuring that checking a quota doesn’t double the time of an upload.
Key Concepts
- Distributed Counters: “Designing Data-Intensive Applications” Ch. 9.
Difficulty: Advanced
Time estimate: 1 week
Prerequisites: Project 10 (Gateway).
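A sketch of the reserve-before-upload pattern that closes the race, using a single in-process counter as a stand-in for the shared counter service (e.g. Redis or the monitors): a gateway must reserve space before accepting bytes, and gives the reservation back if the upload fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Quota tracks one user's limit and everything reserved against it.
type Quota struct {
	mu       sync.Mutex
	limit    int64
	reserved int64 // bytes reserved by in-flight uploads plus committed usage
}

// Reserve atomically claims space before the upload starts.
func (q *Quota) Reserve(n int64) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.reserved+n > q.limit {
		return errors.New("quota exceeded")
	}
	q.reserved += n
	return nil
}

// Release gives the reservation back if the upload fails or is aborted.
func (q *Quota) Release(n int64) {
	q.mu.Lock()
	q.reserved -= n
	q.mu.Unlock()
}

func main() {
	q := &Quota{limit: 10 << 30}                // 10 GiB
	fmt.Println("upload 1:", q.Reserve(9<<30))  // accepted
	fmt.Println("upload 2:", q.Reserve(2<<30))  // rejected: only ~1 GiB left
}
```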
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Object Router | Level 1 | Weekend | High (Foundational) | 3 |
| 2. CRUSH Engine | Level 3 | 1-2 weeks | Very High (Architecture) | 5 |
| 3. PG Manager | Level 3 | 2 weeks | High (Logical Ops) | 4 |
| 4. SDS Monitor | Level 2 | 1 week | Medium (Control Plane) | 3 |
| 5. OSD Daemon | Level 3 | 2 weeks | High (Data Plane) | 4 |
| 6. Replication | Level 4 | 2 weeks | High (Consistency) | 4 |
| 7. Self-Healing | Level 4 | 2 weeks | Very High (Automation) | 5 |
| 8. Erasure Coding | Level 5 | 3 weeks | Extreme (Math/Durability) | 5 |
| 9. Scrubber | Level 3 | 1 week | High (Integrity) | 3 |
| 10. API Gateway | Level 3 | 2 weeks | Medium (Interfacing) | 4 |
| 11. Snapshots | Level 4 | 2 weeks | High (Block Storage) | 5 |
| 12. Tiered Storage | Level 3 | 2 weeks | Medium (Efficiency) | 4 |
| 13. Quota Manager | Level 3 | 1 week | Medium (Coordination) | 3 |
Recommendation
If you are a student/hobbyist: Start with Project 1 (Router) and Project 2 (CRUSH). They give you the most “bang for your buck” in terms of understanding why SDS is different from a normal database.
If you are a software engineer: Focus on Project 6 (Replication) and Project 10 (Gateway). These deal with the hardest part of distributed systems: network reliability and protocol translation.
If you want a job at Red Hat/Suse/Canonical: You must complete Project 8 (Erasure Coding) and Project 11 (Snapshots). These are the advanced features that distinguish junior storage admins from senior storage architects.
Final Overall Project: The “Petabyte-Scale” Storage OS
What you’ll build: Combine all previous components into a single distributed storage cluster.
- Launch 3 Monitors (Project 4) in a quorum.
- Launch 10 OSDs (Project 5) across different simulated racks.
- Implement a Gateway (Project 10) that uses the CRUSH Map (Project 2) to write replicated data (Project 6).
- Add a Scrubber (Project 9) that runs every 60 seconds.
Success Criteria:
- You can upload a file via the S3 Gateway.
- You can kill 2 OSDs simultaneously and the data remains accessible.
- You can add a new OSD and see the cluster “rebalance” the data automatically.
- You can verify that all data is checksummed and healed if corrupted.
Summary
This learning path covers Software Defined Storage through 13 hands-on projects.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Hash-Based Object Router | Go | Level 1 | Weekend |
| 2 | Failure-Domain Aware CRUSH Engine | Go | Level 3 | 1-2 weeks |
| 3 | Placement Group (PG) Manager | Go | Level 3 | 2 weeks |
| 4 | SDS Monitor (OSDMap Keeper) | Go | Level 2 | 1 week |
| 5 | Object Store Daemon (OSD) | Go | Level 3 | 2 weeks |
| 6 | Sync Multi-Copy Replication | Go | Level 4 | 2 weeks |
| 7 | Self-Healing Engine | Go | Level 4 | 2 weeks |
| 8 | Erasure Coding Translator | Go | Level 5 | 3 weeks |
| 9 | Deep Scrubber | Go | Level 3 | 1 week |
| 10 | SDS API Gateway (S3/Swift) | Go | Level 3 | 2 weeks |
| 11 | RBD Snapshot Engine | Go | Level 4 | 2 weeks |
| 12 | Tiered Storage Policy | Go | Level 3 | 2 weeks |
| 13 | Distributed Quota Manager | Go | Level 3 | 1 week |
Recommended Learning Path
For beginners: Start with projects #1, #2, #4.
For intermediate: Focus on #3, #5, #6, #10.
For advanced: Master #7, #8, #9, #11.
Expected Outcomes
After completing these projects, you will:
- Understand how algorithms replace metadata servers for massive scale.
- Master the CRUSH algorithm for topology-aware data placement.
- Build autonomous systems that heal themselves from hardware failure.
- Implement storage efficiencies like Erasure Coding and Snapshots.
- Be capable of architecting storage solutions for cloud-native environments.
You’ll have built a working distributed storage system from the ground up, proving you understand SDS from the first bit to the last parity chunk.