P15: Decentralized Storage Client (IPFS-like)
Project Overview
| Attribute | Value |
|---|---|
| Main Language | Go |
| Alternative Languages | Rust, TypeScript |
| Difficulty | Expert |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | The "Open Core" Infrastructure |
| Knowledge Area | Distributed Systems / Storage |
| Software or Tool | Content-Addressed Storage |
| Main Book | "Designing Data-Intensive Applications" by Martin Kleppmann |
| Time Estimate | 3-4 weeks |
| Prerequisites | Projects 3 & 4, P2P networking experience |
Learning Objectives
By completing this project, you will:
- Master content-addressed storage: understand how cryptographic hashes replace location-based addressing and enable data integrity verification without trusting the source
- Implement the Kademlia DHT algorithm, including the XOR distance metric, k-buckets, and iterative routing that finds content in O(log N) hops
- Build Merkle DAG structures: split large files into chunks, link them with cryptographic hashes, and enable efficient partial downloads and deduplication
- Understand peer-to-peer networking fundamentals, including peer discovery, NAT traversal, connection management, and multiaddr formatting
- Design and implement block exchange protocols that handle concurrent downloads, peer prioritization, and verification of received data
- Explore content persistence and garbage collection: pinning strategies, provider record management, and data availability guarantees
- Connect theory to Web3 infrastructure: understand why IPFS is the data layer for NFT metadata, dApp frontends, and decentralized applications
- Apply distributed systems principles, including eventual consistency, Byzantine fault tolerance concepts, and censorship resistance
Deep Theoretical Foundation
The Problem: Location vs. Content
The internet we use today is fundamentally location-based. When you request https://example.com/file.pdf, you're saying "give me whatever is at this location." This has several problems:
- Single point of failure: If the server goes down, the file is unavailable
- Trust required: You trust that the server gives you the correct file
- Duplication inefficiency: The same file hosted on 100 servers has 100 different URLs
- Censorship vulnerability: Block the server, block the content
Content-addressed storage flips this model: "Give me the file with this hash, from anyone who has it."
Location-Based (HTTP):              Content-Based (IPFS):

      Request                             Request
         |                                   |
         v                                   v
  +---------------+                 +-------------------+
  |   Server A    |                 |     ANY Peer      |
  |  example.com  |                 |    that has it    |
  +-------+-------+                 +---------+---------+
          |                                   |
          v                                   v
       file.pdf                            file.pdf
     (trust it?)                      (verify hash = CID)
Why This Matters for Web3
Blockchains are not designed for large file storage:
- A typical Ethereum transaction carries only a small amount of data (on the order of a kilobyte)
- Storing 1 MB on-chain can cost hundreds of dollars in gas
- Blockchain state must be replicated by every node
So where do NFT images live? Where do dApp frontends live? Where does the actual data of Web3 live?
IPFS and similar systems provide the data layer. When an NFT points to ipfs://QmXyz..., that's a content-addressed reference that:
- Works regardless of which server hosts the file
- Can be verified by hashing the received data
- Enables censorship resistance (content persists if ANY node has it)
The Architecture of Content-Addressed Storage
+------------------------------------------------------------------------------+
|                        IPFS-LIKE SYSTEM ARCHITECTURE                         |
+------------------------------------------------------------------------------+
|                                                                              |
|  APPLICATION LAYER                                                           |
|    CLI (ipfs add)  |  API (/api/v0)  |  HTTP Gateway  |  Name Resolution     |
|                                                          (IPNS / DNSLink)    |
|------------------------------------------------------------------------------|
|  DATA LAYER                                                                  |
|    MERKLE DAG              BLOCK STORE               PINNING                 |
|    Root -> Chunks          CID -> Bytes              CID -> PinStatus        |
|                            Local storage             Garbage collection,     |
|                            (LevelDB, FS)             reference counting      |
|------------------------------------------------------------------------------|
|  NETWORK LAYER                                                               |
|    KADEMLIA DHT            BITSWAP (Exchange)        PEER MANAGER            |
|    k-buckets               Want / Have /             Connection pooling      |
|    XOR distance            Block messages            NAT traversal           |
|    Provider records                                  Bootstrap nodes         |
|------------------------------------------------------------------------------|
|  TRANSPORT LAYER                                                             |
|    TCP  |  QUIC  |  WebRTC  |  WebSocket                                     |
|    Multiaddr: /ip4/192.168.1.1/tcp/4001/p2p/12D3KooW...                      |
|                                                                              |
+------------------------------------------------------------------------------+
The Core Question You're Answering
"How can we store and retrieve data in a decentralized network where content is addressed by its cryptographic hash, not by server location?"
This fundamentally changes data storage:
- Location independence: "what" the data is (hash) matters, not "where" it lives (server)
- Verifiability: You can prove data integrity by hashing it
- Deduplication: Identical content always has the same CID
- Censorship resistance: No single party controls the data
Understanding content-addressed storage means understanding why IPFS is the data layer for Web3: blockchains store small state, IPFS stores large files.
Concepts You Must Understand First
Stop and research these before coding:
1. Content Addressing vs. Location Addressing
Traditional HTTP addressing identifies content by WHERE it lives:
https://example.com/files/document.pdf
        ^           ^
        |           |
      Server     Path on server
Content addressing identifies content by WHAT it is:
QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
^
+-- Hash of the content itself
Key insight: The same content always produces the same hash. If two people add the same file, they get the same CID. This enables automatic deduplication and verification.
What happens when content changes? New content = new hash = new CID. There's no "updating" a CID; you create a new one. This is immutability by design.
Questions to test your understanding:
- Why does content addressing enable deduplication?
- What happens if I change one byte in a file? (Entire CID changes)
- How does this relate to Git commits? (Git uses content-addressing too!)
- Why is immutability important for NFT metadata?
Book Reference: "Designing Data-Intensive Applications" Ch. 6 - Kleppmann
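To see the key insight in code, here is a minimal Go sketch (standard library only) showing that identical content always yields the same digest, while a one-byte change yields a completely different one:

package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	a := []byte("Hello, IPFS!\n")
	b := []byte("Hello, IPFS!\n") // same bytes, added "by someone else"
	c := []byte("Hello, IPFS?\n") // one byte changed

	// Identical content -> identical digest (the basis of deduplication).
	fmt.Printf("a: %x\n", sha256.Sum256(a))
	fmt.Printf("b: %x\n", sha256.Sum256(b))

	// A single-byte change -> completely different digest (new CID).
	fmt.Printf("c: %x\n", sha256.Sum256(c))
}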
2. Cryptographic Hashing and CIDs
A Content Identifier (CID) is more than just a hash. It's a self-describing structure:
CID Structure:

+--------------------------------------------------------------------------+
|                         Content Identifier (CID)                         |
+-----------+---------------+----------------------------------------------+
|  Version  |     Codec     |                  Multihash                   |
|  (v0/v1)  |   (dag-pb,    |   Hash Func   |  Length  |  Hash Bytes       |
|           |   raw, etc.)  |               |          |                   |
|   0x01    |     0x70      |     0x12      |   0x20   |  sha256(content)  |
+-----------+---------------+---------------+----------+-------------------+
Example breakdown of: bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3okuez3...
- Version: 1 (CIDv1)
- Codec: dag-pb (0x70) - protobuf-encoded DAG
- Multihash function: sha2-256 (0x12)
- Hash length: 32 bytes (0x20)
- Hash bytes: [actual 32-byte SHA-256 hash]
Why "multihash"? Hash algorithm agility. If SHA-256 is broken, the network can migrate to SHA-3 without changing the CID format. The hash function is embedded in the CID itself.
CIDv0 vs CIDv1:
- CIDv0: Legacy format, starts with Qm..., always uses dag-pb + SHA-256
- CIDv1: Modern format, self-describing, supports multiple codecs and hash functions
Questions to test your understanding:
- What is the advantage of embedding the hash function in the identifier?
- Why might you choose the raw codec vs the dag-pb codec?
- How do you upgrade hash algorithms without breaking old content?
Project Reference: You should have built Project 1 (SHA-256 Visualizer) and Project 3 (Merkle Tree Library)
Web Reference: IPFS CID Specification
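If you want to poke at CIDs before writing your own encoder, a short sketch using the go-cid and go-multihash libraries (the same ones referenced in the hints later in this project) builds a CIDv1 and reads its fields back out:

package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Build a CIDv1 for some raw bytes, then inspect its components.
	sum, err := mh.Sum([]byte("Hello, IPFS!\n"), mh.SHA2_256, -1) // -1 = default digest length
	if err != nil {
		panic(err)
	}
	c := cid.NewCidV1(cid.Raw, sum)

	fmt.Println("CID:      ", c.String()) // base32 by default for v1
	p := c.Prefix()                       // Prefix bundles version, codec, and hash parameters
	fmt.Println("Version:  ", p.Version)  // 1
	fmt.Printf("Codec:     0x%x\n", p.Codec)  // 0x55 = raw
	fmt.Printf("Hash func: 0x%x\n", p.MhType) // 0x12 = sha2-256
	fmt.Println("Hash len: ", p.MhLength)     // 32
}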
3. Merkle DAGs (Directed Acyclic Graphs)
For large files, we don't hash the entire file as one blob. We split it into chunks and link them with a Merkle DAG:
Merkle DAG for a 2.4 MB File (10 chunks):

                 +--------------------------------------+
                 |              Root Node               |
                 |  CID: QmYwAPJzv5CZsnA625s3Xf2nem...  |
                 |  Type: UnixFS File                   |
                 |  Size: 2,516,582 bytes               |
                 +------------------+-------------------+
                                    |
      +---------+---------+---------+--------+---------+---------+
      |         |         |                  |         |         |
      v         v         v                  v         v         v
 +---------+---------+---------+    ...   +---------+---------+---------+
 | Chunk 0 | Chunk 1 | Chunk 2 |          | Chunk 7 | Chunk 8 | Chunk 9 |
 | 256 KB  | 256 KB  | 256 KB  |          | 256 KB  | 256 KB  | 128 KB  |
 | QmHash0 | QmHash1 | QmHash2 |          | QmHash7 | QmHash8 | QmHash9 |
 +---------+---------+---------+          +---------+---------+---------+
Root node contains:
{
"type": "file",
"links": [
{"name": "chunk-0", "cid": "QmHash0...", "size": 262144},
{"name": "chunk-1", "cid": "QmHash1...", "size": 262144},
...
{"name": "chunk-9", "cid": "QmHash9...", "size": 131072}
]
}
CID(file) = hash(root_node_serialized)
Why DAGs instead of simple Merkle trees?
- DAGs allow sharing: if two files have identical chunks, they reference the same CID
- DAGs support directories: a directory node links to file nodes
- DAGs enable incremental updates: change one chunk, only that chunkโs ancestors change
The relationship to Git: Git commits form a DAG. Each commit points to parent commits and a tree. The tree points to blobs (file contents). Sound familiar?
Questions to test your understanding:
- If you change one byte in chunk 5, which CIDs change? (Chunk 5, root node)
- How does chunking enable parallel downloads?
- Why is 256 KB a common chunk size? (Balance between overhead and parallelism)
- How would you represent a directory with 10,000 files?
Book Reference: "Mastering Bitcoin" Ch. 9 (Merkle Trees) - Antonopoulos
Project Reference: Project 3 (Merkle Tree Library)
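As a sketch of how a client walks such a structure, the fragment below collects every CID reachable from a root by depth-first traversal. DAGNode, Link, getBlock, and decodeNode are hypothetical helpers in the spirit of the hints later in this project; real IPFS uses the go-merkledag packages instead:

// collectBlocks walks a Merkle DAG depth-first and returns every CID
// reachable from root, including root itself. getBlock fetches serialized
// node bytes (blockstore or network); decodeNode parses them (a leaf chunk
// decodes to a node with no links in this sketch).
func collectBlocks(root cid.Cid) ([]cid.Cid, error) {
	var all []cid.Cid
	seen := map[cid.Cid]bool{}

	var walk func(c cid.Cid) error
	walk = func(c cid.Cid) error {
		if seen[c] {
			return nil // shared chunks are visited only once (deduplication)
		}
		seen[c] = true
		all = append(all, c)

		raw, err := getBlock(c)
		if err != nil {
			return err
		}
		node, err := decodeNode(raw)
		if err != nil {
			return err
		}
		for _, link := range node.Links {
			if err := walk(link.Cid); err != nil {
				return err
			}
		}
		return nil
	}

	if err := walk(root); err != nil {
		return nil, err
	}
	return all, nil
}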
4. Kademlia DHT (Distributed Hash Table)
The DHT is how we find who has the content. Kademlia is the specific DHT algorithm used by IPFS.
Core concepts:
- Every peer has a 256-bit Peer ID (hash of their public key)
- Every CID maps to a 256-bit key
- Distance is measured by XOR: distance(a, b) = a XOR b
XOR Distance Metric Visualization:

Node A ID: 1010 0000 0000 0000 ...
Node B ID: 1010 0001 0000 0000 ...
           ------------------------
XOR:       0000 0001 0000 0000 ...
                   ^
                   +-- First differing bit is deep into the ID
                       = small XOR distance (A and B are close)

Node A ID: 1010 0000 0000 0000 ...
Node C ID: 0000 0000 0000 0000 ...
           ------------------------
XOR:       1010 0000 0000 0000 ...
           ^
           +-- The very first bit differs
               = huge XOR distance (A and C are far apart)
Key insight: XOR distance has these properties:
- d(a,a) = 0 (identity)
- d(a,b) = d(b,a) (symmetric)
- d(a,c) <= d(a,b) + d(b,c) (triangle inequality)
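A minimal Go sketch of the metric itself, reusing the simplified 8-bit IDs from Exercise 3 later in this project (real peer IDs are 256 bits, i.e. 32-byte slices):

package main

import (
	"bytes"
	"fmt"
)

// xorDistance returns the XOR of two equal-length IDs. Interpreted as a
// big-endian integer, this byte slice is the Kademlia distance.
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

func main() {
	a := []byte{0b11110000}
	b := []byte{0b11110001}
	k := []byte{0b11110010}

	da := xorDistance(a, k)
	db := xorDistance(b, k)
	fmt.Printf("d(A,K) = %d, d(B,K) = %d\n", da[0], db[0]) // 2 and 3

	// Comparing distances = comparing the byte slices big-endian.
	fmt.Println("A closer than B:", bytes.Compare(da, db) < 0) // true
}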
K-Buckets:

K-Bucket Structure (k=3 for simplicity):

+----------------------------------------------------------------------+
| Local Node ID: 1010 0000 0000 0000 0000 0000 ... (256 bits)          |
+----------------------------------------------------------------------+
|                                                                      |
| Bucket 255: Peers whose ID differs first in bit 255 (very close)     |
|   Peer1: 1010 0000 0000 0000 ... 0001   | Last seen: 2s ago          |
|   Peer2: 1010 0000 0000 0000 ... 0010   | Last seen: 5s ago          |
|   [can hold up to k=3 peers]                                         |
|                                                                      |
| Bucket 254: Peers whose ID differs first in bit 254                  |
|   Peer3: 1010 0000 0000 0000 ... 0100   | Last seen: 1s ago          |
|                                                                      |
| ...                                                                  |
|                                                                      |
| Bucket 0: Peers whose ID differs in bit 0 (very far)                 |
|   Peer99:  0...  (starts with 0, we start with 1)                    |
|   Peer100: 0...                                                      |
|   Peer101: 0...                                                      |
+----------------------------------------------------------------------+
Rule: Each bucket stores up to k peers at that distance.
Closer buckets have fewer potential peers but we know them better.
Farther buckets have many potential peers but we only need a sample.
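To place a peer into a bucket you only need the position of the first differing bit; here is a sketch under the numbering convention used in the diagram above (bucket 0 = differs in the leading bit, bucket 255 = differs only in the last bit):

import "math/bits"

// bucketIndex returns the k-bucket a peer belongs in, counting the first
// differing bit from the most significant end. Returns -1 for our own ID
// (a node does not store itself in its routing table).
func bucketIndex(localID, peerID []byte) int {
	for i := range localID {
		x := localID[i] ^ peerID[i]
		if x != 0 {
			// Position of the first differing bit, counted from the MSB.
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return -1
}

With 32-byte (256-bit) IDs this yields indices 0 through 255, one per bucket.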
Iterative Lookup Process:

Finding content CID: QmYwAPJzv5CZsnA625s3Xf2nem...
Key (256-bit hash of CID): 0x3a7f2b1c...

Step 1: Local node picks the k closest peers from its routing table

   Local                                   Target Key
     |                                         |
     v                                         v
   +---+       Query        +---+            +---+
   | A |------------------->| B |            | ? |
   +---+                    +-+-+            +---+
                              |
                              | "Here are peers closer to the key..."
                              v
Step 2: Query the returned peers; they return even closer peers

   +---+                    +---+    Query    +---+
   | A |                    | B |------------>| C |
   +---+                    +---+             +-+-+
                                                |
                                                | "I have providers!"
                                                v
Step 3: Eventually reach the nodes storing provider records

   Provider Record: "Peer X at /ip4/1.2.3.4/tcp/4001 has QmYwAP..."

Total hops: O(log N) where N = network size
For 1 million nodes: ~20 hops
Questions to test your understanding:
- Why XOR instead of numerical difference?
- What's the purpose of keeping k peers per bucket instead of just 1?
- How does Kademlia handle nodes going offline?
- What prevents an attacker from controlling which nodes store a provider record?
Book Reference: "Designing Data-Intensive Applications" Ch. 6 - Kleppmann
Web Reference: Kademlia DHT Paper
5. Peer-to-Peer Networking
Peer Discovery:
- Bootstrap nodes: Well-known nodes that help new nodes join
- mDNS: Local network discovery
- DHT: Find new peers through routing table refresh
Connection Management:
Multiaddr Format:

/ip4/192.168.1.1/tcp/4001/p2p/12D3KooWRq4tRk8jF3d9VqMn5WVzQv9V3GC9p7uXh3nE8X1YZ2AB
 ^       ^        ^    ^        ^
 |       |        |    |        +-- Peer ID (hash of public key)
 |       |        |    +----------- Port number
 |       |        +---------------- Protocol (TCP)
 |       +------------------------- IP address
 +--------------------------------- IP version

Other multiaddr examples:
/ip6/::1/tcp/4001/p2p/12D3KooW... (IPv6)
/dns4/node.example.com/tcp/443/wss/p2p/... (WebSocket over HTTPS)
/ip4/192.168.1.1/udp/4001/quic/p2p/... (QUIC transport)
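If you use the go-multiaddr library (one option; you could also parse the strings yourself), dissecting a multiaddr looks roughly like this:

package main

import (
	"fmt"

	ma "github.com/multiformats/go-multiaddr"
)

func main() {
	addr, err := ma.NewMultiaddr("/ip4/192.168.1.1/tcp/4001")
	if err != nil {
		panic(err)
	}

	// Pull individual components out by protocol code.
	ip, _ := addr.ValueForProtocol(ma.P_IP4)   // "192.168.1.1"
	port, _ := addr.ValueForProtocol(ma.P_TCP) // "4001"
	fmt.Println("ip:", ip, "port:", port)

	// Or walk every protocol segment in order.
	for _, p := range addr.Protocols() {
		fmt.Println("segment:", p.Name)
	}
}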
NAT Traversal:
- Relay servers: Third party relays connections
- Hole punching: Direct connection through coordinated timing
- AutoNAT: Detect if you're behind NAT
Book Reference: "Computer Networks, Fifth Edition" Ch. 2 - Tanenbaum
6. BitSwap Protocol
BitSwap is IPFS's block exchange protocol: how peers request and send blocks to each other.
BitSwap Message Flow:

  Node A (wants blocks)                       Node B (has blocks)
        |                                           |
        |  WANT_HAVE [CID1, CID2, CID3]             |
        |------------------------------------------>|
        |                                           |
        |  HAVE [CID1, CID3]                        |
        |<------------------------------------------|
        |                                           |
        |  WANT_BLOCK [CID1]                        |
        |------------------------------------------>|
        |                                           |
        |  BLOCK [CID1, data...]                    |
        |<------------------------------------------|
        |                                           |
        |  (verify: hash(data) == CID1)             |
        |                                           |

Message Types:
- WANT_HAVE: "Do you have these blocks?"
- HAVE: "Yes, I have these blocks"
- WANT_BLOCK: "Please send me this block"
- BLOCK: "Here's the block data"
- CANCEL: "Never mind, I got it elsewhere"
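The real BitSwap wire format is protobuf-based; for your own simplified exchange protocol, plain Go types along these lines are enough (names are illustrative, not the IPFS wire format):

// MessageType enumerates the simplified exchange messages described above.
type MessageType int

const (
	MsgWantHave  MessageType = iota // "Do you have these blocks?"
	MsgHave                         // "Yes, I have these blocks"
	MsgWantBlock                    // "Please send me this block"
	MsgBlock                        // "Here's the block data"
	MsgCancel                       // "Never mind, I got it elsewhere"
)

// Message is one exchange message; Blocks is only set for MsgBlock.
type Message struct {
	Type   MessageType
	CIDs   []cid.Cid // for WANT_HAVE / HAVE / WANT_BLOCK / CANCEL
	Blocks [][]byte  // raw block bytes, for BLOCK
}

// verifyBlock checks that received bytes really are the block we asked for,
// using a computeCID helper like the one in Hint 1 later in this project
// (this sketch assumes raw-codec CIDs).
func verifyBlock(want cid.Cid, data []byte) error {
	got, err := computeCID(data)
	if err != nil {
		return err
	}
	if !got.Equals(want) {
		return fmt.Errorf("block verification failed: got %s, want %s", got, want)
	}
	return nil
}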
Incentive Considerations:
- BitSwap tracks per-peer "ledgers" (how much data has been exchanged in each direction)
- Preferential treatment for peers who reciprocate
- Similar to BitTorrent's tit-for-tat, but more flexible
Web Reference: BitSwap Specification
7. Data Persistence and Garbage Collection
Content in IPFS is ephemeral by default: unless a block is explicitly pinned, it is eligible for garbage collection.
Pinning Hierarchy:

+----------------------------------------------------------------------+
|                         PINNING STRATEGIES                           |
+----------------------------------------------------------------------+
|                                                                      |
| Direct Pin: Pin only the root CID                                    |
|   Pinned: Root  -->  Not pinned: Chunk1, Chunk2, Chunk3              |
|   (WARNING: Children can be garbage collected!)                      |
|                                                                      |
| Recursive Pin: Pin the root and all descendants (DEFAULT)            |
|   Pinned: Root  -->  Pinned: Chunk1, Chunk2, Chunk3                  |
|   (All blocks preserved)                                             |
|                                                                      |
| Indirect Pin: Pinned because an ancestor is recursively pinned       |
|   If RootA pins Chunk1, and RootB also uses Chunk1:                  |
|   Chunk1 is safe even if RootB is unpinned                           |
|                                                                      |
+----------------------------------------------------------------------+
Garbage Collection Process:
1. Mark all pinned CIDs and their descendants
2. Mark all blocks in active transfer
3. Delete unmarked blocks
4. Update blockstore metrics
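A sketch of that mark-and-sweep pass, assuming a hypothetical Blockstore with AllKeys/Delete, a list of recursively pinned roots, and a DAG-traversal helper like collectBlocks from earlier in this section:

// garbageCollect deletes every block that is not reachable from a pin.
// A real implementation must also protect blocks in active transfers.
func garbageCollect(bs Blockstore, pinned []cid.Cid) error {
	// Mark: everything reachable from a pinned root survives.
	marked := map[cid.Cid]bool{}
	for _, root := range pinned {
		reachable, err := collectBlocks(root)
		if err != nil {
			return err
		}
		for _, c := range reachable {
			marked[c] = true
		}
	}

	// Sweep: delete everything else.
	keys, err := bs.AllKeys()
	if err != nil {
		return err
	}
	for _, c := range keys {
		if !marked[c] {
			if err := bs.Delete(c); err != nil {
				return err
			}
		}
	}
	return nil
}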
Provider Record Lifecycle:
- When you add content, you announce to DHT
- Provider records expire after 24 hours
- You must re-announce periodically to stay discoverable
- If all providers stop announcing, content becomes unfindable
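Because records expire, a node must re-announce on a timer. A minimal sketch (dht.Provide and the 12-hour interval are assumptions; the interval just needs to comfortably beat the 24-hour TTL):

import (
	"context"
	"log"
	"time"
)

// reprovideLoop periodically re-announces every CID we can serve so our
// provider records in the DHT never expire. provided() returns the CIDs
// we currently store; ctx cancellation stops the loop.
func reprovideLoop(ctx context.Context, dht *DHT, provided func() []cid.Cid) {
	ticker := time.NewTicker(12 * time.Hour) // re-announce well before the 24h TTL
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, c := range provided() {
				if err := dht.Provide(ctx, c); err != nil {
					log.Printf("reprovide %s failed: %v", c, err)
				}
			}
		}
	}
}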
Questions to Guide Your Design
Before implementing, think through these:
1. Content Addressing
- How do you compute a CID from file bytes?
- What hash function to use? (SHA-256 for compatibility)
- How do you encode the CID? (base58, base32, or base64?)
- How do you handle CID v0 vs v1?
2. Chunking Strategy
- What chunk size? (IPFS uses 256 KB default)
- Fixed-size chunks or content-defined chunking?
- How do you rebuild the file from chunks?
- How do you handle small files (< chunk size)?
3. Merkle DAG Construction
- How do you represent the DAG in memory and on disk?
- What format for DAG nodes? (protobuf, JSON, CBOR?)
- How do you link chunks to the root node?
- How do you traverse the DAG to fetch all blocks?
4. DHT Implementation
- How many k-buckets? (IPFS uses 256 for 256-bit hashes)
- Whatโs the k-value (peers per bucket)? (IPFS uses k=20)
- How do you bootstrap (initial peer discovery)?
- How often do you refresh routing tables?
- How do you handle peers going offline?
5. Provider Records
- When a node has content, how does it announce to the DHT?
- How many DHT nodes should store the provider record? (IPFS uses ~20)
- How long are provider records valid? (IPFS uses 24 hours, with refresh)
- How do you clean up stale records?
6. Block Exchange
- How do you request blocks from peers (BitSwap vs simple request/response)?
- Do you implement incentives to prevent freeloading?
- How do you handle peers that have incomplete data?
- Parallel downloads from multiple peers or sequential?
Thinking Exercises
Exercise 1: Trace Content Retrieval
Before coding, trace this scenario step by step:
1. Alice adds file.pdf to IPFS on Node A
2. CID computed: QmAbc123...
3. Node A announces to DHT: "I have QmAbc123"
4. Bob (on Node B) tries to get QmAbc123
5. Node B queries DHT: "Who has QmAbc123?"
6. DHT returns: "Node A has it"
7. Node B connects to Node A
8. Node B requests blocks from Node A
9. Node B verifies each block hash
10. Node B reconstructs file.pdf
Questions while tracing:
- At step 3, which DHT nodes store the provider record? (20 closest to QmAbc123's hash)
- At step 5, how many hops in the DHT? (O(log N) where N = network size)
- At step 6, what if Node A is offline? (Query returns other providers, if any)
- At step 9, what if a block's hash doesn't match? (Reject, try different peer)
- What if Charlie on Node C also fetches the file? (Now 3 providers: A, B, C)
Exercise 2: Design CID Computation
Given a file, compute its CID:
File: "Hello, IPFS!\n" (13 bytes)
Steps:
1. Hash the content: SHA-256("Hello, IPFS!\n") = d, a 32-byte digest
2. Create multihash: <hash-function-code><hash-length><hash-bytes>
- SHA-256 code: 0x12
- Length: 32 (0x20)
- Result: 0x1220 followed by d
3. Create CIDv1: <version><codec><multihash>
- Version: 0x01
- Codec: raw (0x55)
- Result: 0x01551220 followed by d
4. Encode with a multibase prefix: CIDv1 is usually base32-encoded (prefix 'b'), which for raw + SHA-256 yields a CID starting with bafk...
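To verify the byte layout end to end, here is a small standard-library-only sketch that assembles the CID bytes by hand and applies the base32 multibase step manually (a learning aid, not how you'd do it with the go-cid library):

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

func main() {
	data := []byte("Hello, IPFS!\n")
	digest := sha256.Sum256(data)

	// multihash = <hash-func-code><digest-length><digest>
	multihash := append([]byte{0x12, 0x20}, digest[:]...)

	// CIDv1 = <version><codec><multihash>; 0x55 = raw codec
	cidBytes := append([]byte{0x01, 0x55}, multihash...)

	// Multibase base32 (lowercase, unpadded), prefixed with 'b'.
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	cidStr := "b" + strings.ToLower(enc.EncodeToString(cidBytes))

	fmt.Println(cidStr) // a CIDv1 for raw data, starting with "bafk..."
}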
Exercise 3: XOR Distance Calculation
Calculate XOR distances:
Node A: 0b11110000 (simplified to 8 bits)
Node B: 0b11110001
Node C: 0b00001111
Key K: 0b11110010
Distance(A, K) = 0b11110000 XOR 0b11110010 = 0b00000010 = 2
Distance(B, K) = 0b11110001 XOR 0b11110010 = 0b00000011 = 3
Distance(C, K) = 0b00001111 XOR 0b11110010 = 0b11111101 = 253
Closest to K: Node A (distance 2)
Question: Which k-bucket would Node A put Node B in? (Bucket for bit position where they first differ)
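A quick Go check of the arithmetic above:

package main

import "fmt"

func main() {
	a, b, c, k := byte(0b11110000), byte(0b11110001), byte(0b00001111), byte(0b11110010)
	fmt.Println(a^k, b^k, c^k) // 2 3 253
}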
Real World Outcome
When your IPFS-like decentralized storage client is complete, you'll have built a peer-to-peer content distribution system that demonstrates how Web3 handles data storage without centralized servers. Here's what you'll see:
# Terminal 1 - Start first node (Bootstrap node)
$ ./ipfs-lite daemon --port 4001
[Node A] Peer ID: 12D3KooWRq4tRk8jF3d9VqMn5WVzQv9V3GC9p7uXh3nE8X1YZ2AB
[Node A] Listening on /ip4/127.0.0.1/tcp/4001
[Node A] Listening on /ip6/::1/tcp/4001
[Node A] DHT server mode enabled
[Node A] Local storage: ~/.ipfs-lite/blocks/
[Node A] Ready to accept connections
# Terminal 2 - Start second node and connect to bootstrap
$ ./ipfs-lite daemon --port 4002 --bootstrap /ip4/127.0.0.1/tcp/4001/p2p/12D3Koo...
[Node B] Peer ID: 12D3KooWPq8mTUv2oX4cD5nN9sVzQy8R4GC0p7yWh3nA8Z1YZ3CD
[Node B] Connecting to bootstrap node...
[Node B] Connected to 12D3KooWRq4tRk8j... ✓
[Node B] DHT synchronized with 1 peers
[Node B] Ready
# Terminal 3 - Add a file (Node A)
$ ./ipfs-lite add ~/Documents/whitepaper.pdf --node http://localhost:4001
Adding file: whitepaper.pdf (2.4 MB)
[*] Computing content hash...
[*] File size exceeds chunk size, splitting into chunks...
Chunking strategy:
Chunk size: 256 KB
Total chunks: 10
DAG structure: Merkle tree with 10 leaf nodes
Chunk 0: QmHash0... (262144 bytes) ✓
Chunk 1: QmHash1... (262144 bytes) ✓
Chunk 2: QmHash2... (262144 bytes) ✓
...
Chunk 9: QmHash9... (131072 bytes) ✓
Building Merkle DAG...
Root node links to 10 chunks
Computing CID (Content Identifier)...
CIDv1: base58-btc
Codec: dag-pb
Multihash: sha2-256
+---------------------------------------------+
| File added successfully! |
| CID: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79o |
+---------------------------------------------+
File structure:
+---------------------------------------------+
| Root: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz7 |
| +-- Chunk-0: QmHash0... (256 KB) |
| +-- Chunk-1: QmHash1... (256 KB) |
| +-- Chunk-2: QmHash2... (256 KB) |
| +-- ... |
| +-- Chunk-9: QmHash9... (128 KB) |
+---------------------------------------------+
[Node A] Announcing to DHT...
[DHT] Publishing provider record to 20 closest peers
[DHT] Provider record stored on peers: 12D3KooWPq8m..., 12D3KooWXy9z...
[Node A] File is now available on the network!
# Terminal 4 - Retrieve file from different node (Node B)
$ ./ipfs-lite get QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG --node http://localhost:4002
Retrieving: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
[*] Querying DHT for providers...
[DHT] Finding peers closest to QmYwAP...
[DHT] XOR distance calculation:
Peer 12D3KooWRq4tRk8j... distance: 0x3f2a1b...
Peer 12D3KooWXy9z... distance: 0x7c4d2e...
[DHT] Querying 20 closest peers...
[DHT] Found 2 providers:
✓ 12D3KooWRq4tRk8j... (Node A) - RTT: 2ms
✓ 12D3KooWZa5x... (Node C) - RTT: 45ms
[*] Fetching Merkle DAG root...
[Node A] Sending root node (1.2 KB)
[*] Parsing DAG structure...
Total size: 2.4 MB
Chunks needed: 10
[*] Downloading chunks in parallel (max 5 concurrent)...
[#####-----] Chunk 0/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 1/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 2/10 - downloading from 12D3KooWZa5x... (256 KB)
[#####-----] Chunk 3/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 4/10 - downloading from 12D3KooWZa5x... (256 KB)
...
[##########] All chunks downloaded (2.4 MB)
[*] Verifying chunk hashes...
Chunk 0: Hash matches QmHash0... ✓
Chunk 1: Hash matches QmHash1... ✓
...
All chunks verified ✓
[*] Reconstructing file...
+---------------------------------------------+
| File retrieved: whitepaper.pdf (2.4 MB) |
| Downloaded in 0.8 seconds (3.0 MB/s) |
| SHA-256 matches original ✓                  |
+---------------------------------------------+
# Terminal 5 - Pin the file (keep it permanently)
$ ./ipfs-lite pin add QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
[Node B] Pinning CID: QmYwAP...
[Node B] Pinned content will not be garbage collected
[Node B] Announcing as provider to DHT...
[DHT] Now 3 peers provide this content
Pinned: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
# Check network statistics
$ ./ipfs-lite stats
+============================================================+
| IPFS-Lite Network Statistics |
+============================================================+
| Local Peer ID: 12D3KooWPq8mTUv2oX4cD5nN9sVzQy8R4GC0 |
| |
| DHT Information: |
| Connected peers: 47 |
| Routing table size: 256 k-buckets |
| K-value: 20 |
| Closest peers: 20 |
| |
| Storage: |
| Blocks stored: 423 |
| Total size: 156 MB |
| Pinned blocks: 89 |
| Pinned size: 34 MB |
| |
| Bandwidth (last hour): |
| Downloaded: 245 MB |
| Uploaded: 89 MB |
| Peers served: 12 |
| |
| Provider Records: |
| Content we provide: 89 CIDs |
| DHT announcements: 2,140 |
+============================================================+
# Test content addressing (same content = same CID)
$ ./ipfs-lite add file1.txt
CID: QmXa5b...
$ ./ipfs-lite add file2.txt # Same content as file1.txt
CID: QmXa5b... # Same CID! Content-addressed
# Test DHT lookup manually
$ ./ipfs-lite dht findprovs QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
Searching DHT for providers of: QmYwAP...
DHT Lookup Process:
1. Hash CID to 256-bit key: 0x3a7f2b1c...
2. Find k-closest peers to key using XOR metric
3. Iteratively query closer and closer peers (Kademlia algorithm)
Iteration 1: Querying 3 bootstrap peers
Found 15 peers closer to key
Iteration 2: Querying 15 peers
Found 8 peers even closer
Iteration 3: Querying 8 peers
Found 3 provider records ✓
Providers:
✓ 12D3KooWRq4tRk8j... (Node A) - /ip4/127.0.0.1/tcp/4001
✓ 12D3KooWPq8mTUv2... (Node B) - /ip4/127.0.0.1/tcp/4002
✓ 12D3KooWZa5x... (Node C) - /ip4/192.168.1.50/tcp/4001
Lookup completed in 3 hops (< 0.1s)
# Simulate node going offline (test resilience)
$ ./ipfs-lite daemon --port 4003 --bootstrap /ip4/127.0.0.1/tcp/4001
[Node D] Connected to network
[Node D] Fetching QmYwAP...
# Even if Node A goes offline...
[Node A] Shutting down...
# Node D can still fetch from Node B or C
[Node D] Node A unreachable, trying alternative provider
[Node D] Downloading from Node B... ✓
[Node D] File retrieved successfully
Content remains available as long as ONE peer has it!
# View DAG structure visually
$ ./ipfs-lite dag inspect QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
Merkle DAG Structure:
================================================================================
Root: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
Type: UnixFS File
Size: 2,516,582 bytes (2.4 MB)
Links (10):
+-- [0] QmHash0... -> Chunk (262144 bytes)
+-- [1] QmHash1... -> Chunk (262144 bytes)
+-- [2] QmHash2... -> Chunk (262144 bytes)
+-- [3] QmHash3... -> Chunk (262144 bytes)
+-- [4] QmHash4... -> Chunk (262144 bytes)
+-- [5] QmHash5... -> Chunk (262144 bytes)
+-- [6] QmHash6... -> Chunk (262144 bytes)
+-- [7] QmHash7... -> Chunk (262144 bytes)
+-- [8] QmHash8... -> Chunk (262144 bytes)
+-- [9] QmHash9... -> Chunk (131072 bytes)
CID Format:
Version: 1
Codec: dag-pb (0x70)
Multihash: sha2-256
Base Encoding: base58-btc
Verification:
✓ All chunk hashes valid
✓ DAG structure consistent
✓ Total size matches sum of chunks
================================================================================
You've built a censorship-resistant, content-addressed storage network where data is identified by its hash (not location), distributed across peers, and verifiable without trusting any single party. This is how NFT metadata, dApp frontends, and Web3 data live beyond centralized servers.
Implementation Phases
Phase 1: Content Addressing Foundation
Goal: Compute CIDs from file bytes.
Tasks:
- Implement SHA-256 hashing
- Create multihash encoding (function code + length + hash)
- Build CIDv1 structure (version + codec + multihash)
- Implement base58 encoding for human-readable CIDs
Validation:
data := []byte("Hello, IPFS!")
cid := ComputeCID(data)
fmt.Println(cid.String()) // Should match real IPFS CID
Hints if stuck:
- Start with the go-cid and go-multihash libraries to understand the format
- Then implement from scratch for learning
- Verify against real IPFS: echo "Hello, IPFS!" | ipfs add --only-hash
Phase 2: File Chunking
Goal: Split large files into fixed-size chunks.
Tasks:
- Implement 256 KB chunk splitting
- Handle files smaller than chunk size
- Handle files with non-aligned sizes
- Compute CID for each chunk
Validation:
chunks := ChunkFile(largeFileBytes, 256*1024)
for i, chunk := range chunks {
fmt.Printf("Chunk %d: %s (%d bytes)\n", i, ComputeCID(chunk), len(chunk))
}
Hints if stuck:
- Last chunk will usually be smaller
- Consider edge case: empty file
- Consider edge case: file exactly chunk size
Phase 3: Merkle DAG Construction
Goal: Link chunks into a root node.
Tasks:
- Define DAGNode and Link structures
- Build links from chunks
- Serialize root node (protobuf or CBOR)
- Compute root CID
Validation:
root := BuildDAG(chunks)
fmt.Println("Root CID:", root.CID)
fmt.Println("Links:", len(root.Links))
Hints if stuck:
- Study the UnixFS protobuf format
- Root node contains links, not raw data
- CID of root = hash of serialized root node
Phase 4: Local Block Store
Goal: Persist blocks to disk.
Tasks:
- Design key-value store interface
- Implement file-system-based storage (CID -> bytes)
- Add block retrieval
- Implement simple garbage collection
Validation:
$ ls ~/.ipfs-lite/blocks/
QmHash0... QmHash1... QmHash2... root...
Hints if stuck:
- Shard directories by first 2 chars of CID (prevents too many files in one dir)
- Consider using LevelDB or BadgerDB for production
- GC: mark pinned, sweep unpinned
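A sketch of the sharded on-disk layout (the path scheme is an assumption; real IPFS's flatfs datastore shards differently, but the idea is the same):

import (
	"os"
	"path/filepath"
)

// blockPath maps a CID to a file path, sharding by the first two
// characters of its string form so no single directory grows too large.
func blockPath(root string, c cid.Cid) string {
	s := c.String()
	return filepath.Join(root, s[:2], s)
}

// putBlock writes a block to disk under its CID.
func putBlock(root string, c cid.Cid, data []byte) error {
	p := blockPath(root, c)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	return os.WriteFile(p, data, 0o644)
}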
Phase 5: DHT Routing Table
Goal: Implement k-bucket structure.
Tasks:
- Generate peer ID from keypair
- Implement XOR distance calculation
- Build k-bucket array (256 buckets, k=20)
- Add peer insertion/lookup logic
Validation:
rt := NewRoutingTable(myPeerID, 20) // k = 20
rt.AddPeer(peer1)
rt.AddPeer(peer2)
closest := rt.FindClosest(targetKey, 3)
Hints if stuck:
- Distance is calculated by XORing two 256-bit values
- Find which bucket by finding first differing bit position
- Evict stale peers when bucket is full
Phase 6: DHT Operations
Goal: Implement FindNode and FindProviders.
Tasks:
- Implement iterative FindNode (find peers closest to key)
- Implement provider record storage
- Implement FindProviders query
- Add provider announcement
Validation:
providers := dht.FindProviders(cid)
for _, p := range providers {
fmt.Println("Provider:", p.ID, p.Addrs)
}
Hints if stuck:
- Alpha=3: query 3 peers at a time
- Terminate when no closer peers found
- Provider records have TTL (24 hours)
Phase 7: P2P Networking
Goal: Establish peer connections.
Tasks:
- Set up TCP listener
- Implement connection handling
- Add bootstrap node connection
- Handle multiaddr parsing
Validation:
$ ./ipfs-lite daemon
Listening on /ip4/0.0.0.0/tcp/4001/p2p/12D3KooW...
Connected peers: 5
Hints if stuck:
- Use libp2p for production quality
- For learning, raw TCP + length-prefixed messages works
- Handle connection failures gracefully
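For the raw-TCP route, a common framing is a 4-byte big-endian length followed by the payload; a sketch:

import (
	"encoding/binary"
	"fmt"
	"io"
)

// writeMsg frames a message as <uint32 big-endian length><payload>.
func writeMsg(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readMsg reads one length-prefixed message, rejecting absurd sizes.
func readMsg(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > 4*1024*1024 { // sanity cap: a block should never exceed a few MB
		return nil, fmt.Errorf("message too large: %d bytes", n)
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}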
Phase 8: Block Exchange
Goal: Request and serve blocks between peers.
Tasks:
- Implement WANT/HAVE/BLOCK messages
- Add block request handling
- Implement parallel downloads
- Add block verification on receipt
Validation:
block := peer.RequestBlock(cid)
if ComputeCID(block) != cid {
return errors.New("block verification failed")
}
Hints if stuck:
- Verify every received block by hashing
- Download multiple blocks in parallel
- Retry with different peer on failure
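A sketch of bounded parallel fetching with verification (requestBlock is a hypothetical "ask some peer for this block" call; computeCID is the helper from Hint 1 below):

// fetchChunks downloads every chunk CID concurrently (at most maxParallel
// at a time), verifies each block against its CID, and returns the blocks
// in their original order.
func fetchChunks(cids []cid.Cid, maxParallel int) ([][]byte, error) {
	results := make([][]byte, len(cids))
	errs := make(chan error, len(cids))
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup

	for i, c := range cids {
		wg.Add(1)
		go func(i int, c cid.Cid) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a download slot
			defer func() { <-sem }() // release it

			data, err := requestBlock(c)
			if err != nil {
				errs <- err
				return
			}
			got, err := computeCID(data)
			if err != nil || !got.Equals(c) {
				errs <- fmt.Errorf("verification failed for %s", c)
				return
			}
			results[i] = data
		}(i, c)
	}
	wg.Wait()
	close(errs)
	if err := <-errs; err != nil { // report the first error, if any
		return nil, err
	}
	return results, nil
}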
Phase 9: Full Integration
Goal: Complete add/get workflow.
Tasks:
- Wire up all components
- Implement CLI commands
- Add pinning support
- Test with multiple nodes
Validation:
# Node 1
$ ./ipfs-lite add file.pdf
CID: Qm...
# Node 2 (different machine)
$ ./ipfs-lite get Qm...
Downloaded: file.pdf
The Interview Questions Theyโll Ask
Prepare to answer these:
1. โExplain how IPFS is different from HTTP.โ
Expected answer: HTTP is location-based: URLs point to specific servers. IPFS is content-based: CIDs identify data by its hash, not where it lives. This enables verification (you can check the hash), deduplication (same content = same CID), and decentralization (fetch from anyone who has it).
2. โWhat happens if I change one byte in a file on IPFS?โ
Expected answer: The entire file gets a new CID because the hash changes. The old CID still points to the old content (immutability). There's no "updating" a CID; you create a new one. This is similar to Git commits.
3. โHow does the DHT find content in a large network?โ
Expected answer: Kademlia uses XOR distance between peer IDs and content keys. It iteratively queries peers closer and closer to the target key. Each hop cuts the search space roughly in half, achieving O(log N) lookup complexity. For a million-node network, this means about 20 hops.
4. โWhat if no peer has the content Iโm looking for?โ
Expected answer: The DHT query returns empty; no providers are found. The content is effectively lost unless someone who has it comes back online or re-adds it. This is why pinning and redundancy are important.
5. โHow does IPFS prevent Sybil attacks on the DHT?โ
Expected answer: Peer IDs are hashes of public keys. While creating many IDs is cheap, they're randomly distributed in the keyspace. An attacker can't control which k-buckets they land in or which content they're "close" to. Additionally, provider records are stored on the k closest peers, requiring the attacker to control many IDs near each target.
6. โWhatโs the difference between IPFS and BitTorrent?โ
Expected answer:
- BitTorrent uses per-file swarms with .torrent files. IPFS has a global DHT where any CID can be looked up.
- BitTorrent is mainly for large file sharing. IPFS handles any data size and supports directories.
- IPFS integrates content addressing at the protocol level. BitTorrent uses info hashes, but content is still discovered via trackers or its separate DHT.
- IPFS supports Merkle DAGs for incremental updates. BitTorrent treats files as opaque chunks.
7. โWhy do NFTs use IPFS for metadata and images?โ
Expected answer: On-chain storage is extremely expensive (hundreds of dollars per KB). IPFS provides:
- Decentralized storage that can't easily be censored
- Content addressing that guarantees the metadata/image matches what the NFT references
- Permanence as long as any node pins the content
- Verification without trusting the hosting server
8. โWhatโs pinning and why is it needed?โ
Expected answer: Pinning tells your node to keep content forever and not garbage collect it. Without pinning, unpopular content that no one is actively requesting will eventually be removed to free space. Pinning services (like Pinata, Infura) provide redundant storage so content survives even if the original uploader goes offline.
Hints in Layers
Hint 1: Start with Simple Content Addressing
Compute CID from bytes:
import (
"crypto/sha256"
"github.com/multiformats/go-multihash"
"github.com/ipfs/go-cid"
)
func computeCID(data []byte) (cid.Cid, error) {
// 1. Hash the data
hash := sha256.Sum256(data)
// 2. Create multihash
mh, err := multihash.Encode(hash[:], multihash.SHA2_256)
if err != nil {
return cid.Cid{}, err
}
// 3. Create CID v1
c := cid.NewCidV1(cid.Raw, mh)
return c, nil
}
Hint 2: Chunking Large Files
const ChunkSize = 256 * 1024 // 256 KB
func chunkFile(data []byte) [][]byte {
var chunks [][]byte
for i := 0; i < len(data); i += ChunkSize {
end := i + ChunkSize
if end > len(data) {
end = len(data)
}
chunks = append(chunks, data[i:end])
}
return chunks
}
Hint 3: Build Merkle DAG
type DAGNode struct {
Data []byte
Links []Link
}
type Link struct {
Name string
Cid cid.Cid
Size uint64
}
func buildDAG(chunks [][]byte) (cid.Cid, error) {
// Create leaf nodes for each chunk
var links []Link
for i, chunk := range chunks {
chunkCID, _ := computeCID(chunk)
links = append(links, Link{
Name: fmt.Sprintf("chunk-%d", i),
Cid: chunkCID,
Size: uint64(len(chunk)),
})
// Store chunk in local blockstore
storeBlock(chunkCID, chunk)
}
// Create root node with links
rootNode := DAGNode{Links: links}
rootBytes := encodeNode(rootNode) // Protobuf or CBOR
rootCID, _ := computeCID(rootBytes)
storeBlock(rootCID, rootBytes)
return rootCID, nil
}
Hint 4: Simple DHT (Kademlia basics)
type RoutingTable struct {
localID PeerID
buckets [256]*KBucket // One bucket per bit in ID
}
type KBucket struct {
peers []PeerInfo
k int // Usually 20
}
func (rt *RoutingTable) FindClosest(key []byte, count int) []PeerInfo {
// Compute XOR distance between key and each peer ID
// Return 'count' closest peers
}
func (rt *RoutingTable) FindProviders(cid cid.Cid) ([]PeerInfo, error) {
key := cid.Hash()
// Iterative lookup
visited := make(map[string]bool)
toQuery := rt.FindClosest(key, 3) // Start with 3 closest
for len(toQuery) > 0 {
peer := toQuery[0]
toQuery = toQuery[1:]
if visited[peer.ID] {
continue
}
visited[peer.ID] = true
// Ask peer for providers
providers := peer.FindProviders(cid)
if len(providers) > 0 {
return providers, nil
}
// Ask peer for closer peers
closerPeers := peer.FindCloserPeers(key)
toQuery = append(toQuery, closerPeers...)
}
return nil, errors.New("no providers found")
}
Hint 5: Use Existing IPFS Libraries
Don't build from scratch (unless you want the deepest learning):
- go-ipfs: Full IPFS implementation (study this!)
- go-cid: CID handling
- go-multihash: Multihash encoding
- libp2p: P2P networking layer
- go-datastore: Key-value storage
Hint 6: Test with Real IPFS
Verify your CIDs match real IPFS:
# Add file to your implementation
$ ./ipfs-lite add test.txt
CID: QmXa5b...
# Add same file to real IPFS
$ ipfs add test.txt
QmXa5b... # Should be SAME CID!
# Fetch from real IPFS gateway
$ curl https://ipfs.io/ipfs/QmXa5b...
# Should return your file
Testing Strategy
Unit Tests
func TestComputeCID_Empty(t *testing.T) {
cid := ComputeCID([]byte{})
// Empty data should still produce valid CID
assert.NotEmpty(t, cid.String())
}
func TestComputeCID_Deterministic(t *testing.T) {
data := []byte("Hello, IPFS!")
cid1 := ComputeCID(data)
cid2 := ComputeCID(data)
assert.Equal(t, cid1, cid2)
}
func TestChunking_SmallFile(t *testing.T) {
data := []byte("small")
chunks := ChunkFile(data)
assert.Equal(t, 1, len(chunks))
}
func TestChunking_ExactMultiple(t *testing.T) {
data := make([]byte, 256*1024*3) // Exactly 3 chunks
chunks := ChunkFile(data)
assert.Equal(t, 3, len(chunks))
}
func TestDAG_Reconstruction(t *testing.T) {
original := randomBytes(1024 * 1024) // 1 MB
// Add to system
rootCID := Add(original)
// Retrieve
retrieved := Get(rootCID)
assert.Equal(t, original, retrieved)
}
DHT Tests
func TestXORDistance(t *testing.T) {
a := []byte{0xFF, 0x00}
b := []byte{0xF0, 0x00}
dist := XORDistance(a, b)
expected := []byte{0x0F, 0x00}
assert.Equal(t, expected, dist)
}
func TestKBucket_Insertion(t *testing.T) {
rt := NewRoutingTable(myID, 20)
// Insert 25 peers
for i := 0; i < 25; i++ {
rt.AddPeer(randomPeer())
}
// Bucket should have at most k=20 peers
bucket := rt.GetBucket(0)
assert.LessOrEqual(t, len(bucket.peers), 20)
}
func TestFindClosest(t *testing.T) {
rt := NewRoutingTable(myID, 20)
// Add known peers
rt.AddPeer(peer1)
rt.AddPeer(peer2)
rt.AddPeer(peer3)
closest := rt.FindClosest(targetKey, 2)
// Verify returned peers are actually closest
dist0 := XORDistance(closest[0].ID, targetKey)
dist1 := XORDistance(closest[1].ID, targetKey)
assert.True(t, bytes.Compare(dist0, dist1) <= 0)
}
Integration Tests
func TestEndToEnd_TwoNodes(t *testing.T) {
// Start two nodes
node1 := StartNode(4001, "")           // port, no bootstrap
node2 := StartNode(4002, node1.Addr()) // bootstrap via node1
// Add file on node1
data := []byte("test data for e2e")
cid := node1.Add(data)
// Retrieve on node2
retrieved := node2.Get(cid)
assert.Equal(t, data, retrieved)
// Verify node2 is now a provider
providers := node1.DHT.FindProviders(cid)
assert.Contains(t, providers, node2.ID)
}
func TestResilience_NodeOffline(t *testing.T) {
node1 := StartNode(4001, "")
node2 := StartNode(4002, node1.Addr())
node3 := StartNode(4003, node1.Addr())
// Add on node1
cid := node1.Add(testData)
// Fetch on node2 (becomes provider)
node2.Get(cid)
// Stop node1
node1.Stop()
// Node3 should still be able to get from node2
retrieved := node3.Get(cid)
assert.Equal(t, testData, retrieved)
}
Learning Milestones
Track your progress:
- Content addressing works - You compute CIDs that match real IPFS
- Large files chunk correctly - Files split and reassemble properly
- Merkle DAG builds correctly - Root CID represents entire file structure
- DHT finds providers - Iterative lookup locates content
- Multi-peer download works - Parallel fetching from multiple sources
- Pinning preserves content - GC doesnโt remove pinned data
- Node resilience proven - Content survives node failures
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Content addressing | "Designing Data-Intensive Applications" | Ch. 6: Partitioning & Ch. 11: Stream Processing |
| Merkle DAGs | "Mastering Bitcoin" | Ch. 9: The Blockchain |
| Distributed hash tables | "Designing Data-Intensive Applications" | Ch. 6: Partitioning (DHT section) |
| P2P networking | "Computer Networks, Fifth Edition" | Ch. 2: Application Layer (P2P section) |
| Cryptographic hashing | "Serious Cryptography, 2nd Edition" | Ch. 6: Hash Functions |
| Distributed systems | "Designing Data-Intensive Applications" | Ch. 5: Replication & Ch. 6: Partitioning |
| Consensus and fault tolerance | "Designing Data-Intensive Applications" | Ch. 8: Trouble with Distributed Systems |
Additional Resources
Official Documentation
- IPFS Documentation - Comprehensive guide to IPFS concepts
- IPFS Specifications - Protocol specifications
- Content Addressing - CID deep dive
- libp2p Documentation - P2P networking stack
Academic Papers
- Kademlia: A Peer-to-peer Information System Based on the XOR Metric - Original DHT paper
- IPFS - Content Addressed, Versioned, P2P File System - IPFS whitepaper
Implementation References
- go-ipfs source code - Reference implementation
- BitSwap Specification - Block exchange protocol
- libp2p specs - Networking layer specs
Community Resources
- How IPFS DHT Works - Visual DHT explanation
- ProtoSchool - Interactive IPFS tutorials
- IPFS Camp - Workshop materials
What Youโll Have Built
By completing this project, you will have created:
- A content-addressed storage system that identifies data by cryptographic hash
- A chunking and DAG engine that handles files of any size efficiently
- A Kademlia DHT implementation with k-buckets and iterative routing
- A peer-to-peer network with connection management and block exchange
- A complete CLI tool for adding, getting, and pinning content
More importantly, you'll deeply understand:
- Why Web3 needs a separate data layer from blockchains
- How content addressing enables trustless verification
- How DHTs scale to millions of nodes
- Why IPFS is the backbone of NFT metadata, dApp frontends, and decentralized storage
This knowledge directly applies to:
- Building dApps that store data off-chain
- Understanding how Filecoin incentivizes IPFS storage
- Designing censorship-resistant systems
- Working with any content-addressed system (Git, Nix, etc.)