P15: Decentralized Storage Client (IPFS-like)


Project Overview

Attribute Value
Main Language Go
Alternative Languages Rust, TypeScript
Difficulty Expert
Coolness Level Level 4: Hardcore Tech Flex
Business Potential The "Open Core" Infrastructure
Knowledge Area Distributed Systems / Storage
Software or Tool Content-Addressed Storage
Main Book "Designing Data-Intensive Applications" by Martin Kleppmann
Time Estimate 3-4 weeks
Prerequisites Projects 3 & 4, P2P networking experience

Learning Objectives

By completing this project, you will:

  1. Master content-addressed storage, understanding how cryptographic hashes replace location-based addressing and enable data integrity verification without trusting the source
  2. Implement the Kademlia DHT algorithm, including the XOR distance metric, k-buckets, and iterative routing that finds content in O(log N) hops
  3. Build Merkle DAG structures, splitting large files into chunks, linking them with cryptographic hashes, and enabling efficient partial downloads and deduplication
  4. Understand peer-to-peer networking fundamentals, including peer discovery, NAT traversal, connection management, and multiaddr formatting
  5. Design and implement block exchange protocols, handling concurrent downloads, peer prioritization, and verification of received data
  6. Explore content persistence and garbage collection, understanding pinning strategies, provider record management, and data availability guarantees
  7. Connect theory to Web3 infrastructure, understanding why IPFS is the data layer for NFT metadata, dApp frontends, and decentralized applications
  8. Apply distributed systems principles, including eventual consistency, Byzantine fault tolerance concepts, and censorship resistance

Deep Theoretical Foundation

The Problem: Location vs. Content

The internet we use today is fundamentally location-based. When you request https://example.com/file.pdf, you're saying "give me whatever is at this location." This has several problems:

  1. Single point of failure: If the server goes down, the file is unavailable
  2. Trust required: You trust that the server gives you the correct file
  3. Duplication inefficiency: The same file hosted on 100 servers has 100 different URLs
  4. Censorship vulnerability: Block the server, block the content

Content-addressed storage flips this model: "Give me the file with this hash, from anyone who has it."

Location-Based (HTTP):                Content-Based (IPFS):

    Request                               Request
       |                                     |
       v                                     v
+-------------+                     +-----------------+
|   Server A  |                     |    ANY Peer     |
| example.com |                     |   that has it   |
+------+------+                     +--------+--------+
       |                                     |
       v                                     v
    file.pdf                             file.pdf
   (trust it?)                      (verify hash = CID)
                                             ✓

Why This Matters for Web3

Blockchains are not designed for large file storage:

  • Ethereum stores ~1 KB of state per transaction
  • Storing 1 MB on-chain costs hundreds of dollars in gas
  • Blockchain state must be replicated by every node

So where do NFT images live? Where do dApp frontends live? Where does the actual data of Web3 live?

IPFS and similar systems provide the data layer. When an NFT points to ipfs://QmXyz..., that's a content-addressed reference that:

  • Works regardless of which server hosts the file
  • Can be verified by hashing the received data
  • Enables censorship resistance (content persists if ANY node has it)

The Architecture of Content-Addressed Storage

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        IPFS-LIKE SYSTEM ARCHITECTURE                        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                            โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚                          APPLICATION LAYER                          โ”‚  โ”‚
โ”‚   โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  CLI    โ”‚  โ”‚   API   โ”‚  โ”‚ Gateway โ”‚  โ”‚  Name Resolution   โ”‚    โ”‚  โ”‚
โ”‚   โ”‚   โ”‚ ipfs addโ”‚  โ”‚ /api/v0 โ”‚  โ”‚  HTTP   โ”‚  โ”‚     (IPNS/DNSLink) โ”‚    โ”‚  โ”‚
โ”‚   โ”‚   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚            โ”‚            โ”‚           โ”‚                  โ”‚                    โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚                           DATA LAYER                                 โ”‚  โ”‚
โ”‚   โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚   MERKLE DAG    โ”‚  โ”‚    BLOCK STORE  โ”‚  โ”‚    PINNING          โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚                 โ”‚  โ”‚                 โ”‚  โ”‚                     โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚  โ”‚  CID โ†’ Bytes    โ”‚  โ”‚  CID โ†’ PinStatus    โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  โ”‚   Root    โ”‚  โ”‚  โ”‚                 โ”‚  โ”‚                     โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  โ”œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ค  โ”‚  โ”‚  Local storage  โ”‚  โ”‚  Garbage collection โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  โ”‚Chunkโ”‚Chunkโ”‚  โ”‚  โ”‚  (LevelDB, FS)  โ”‚  โ”‚  reference counting โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚  โ”‚                 โ”‚  โ”‚                     โ”‚  โ”‚  โ”‚
โ”‚   โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                    โ”‚                                         โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚                         NETWORK LAYER                                  โ”‚  โ”‚
โ”‚   โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  KADEMLIA DHT   โ”‚  โ”‚    BITSWAP      โ”‚  โ”‚    PEER MANAGER     โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚                 โ”‚  โ”‚   (Exchange)    โ”‚  โ”‚                     โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  k-buckets      โ”‚  โ”‚                 โ”‚  โ”‚  Connection pooling โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  XOR distance   โ”‚  โ”‚  Want/Have/     โ”‚  โ”‚  NAT traversal      โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  Provider       โ”‚  โ”‚  Block msgs     โ”‚  โ”‚  Bootstrap nodes    โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ”‚  records        โ”‚  โ”‚                 โ”‚  โ”‚                     โ”‚   โ”‚  โ”‚
โ”‚   โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                    โ”‚                                         โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚                        TRANSPORT LAYER                                 โ”‚  โ”‚
โ”‚   โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚  โ”‚
โ”‚   โ”‚   โ”‚   TCP    โ”‚  โ”‚   QUIC   โ”‚  โ”‚  WebRTC  โ”‚  โ”‚   WebSocket        โ”‚    โ”‚  โ”‚
โ”‚   โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚  โ”‚
โ”‚   โ”‚                                                                        โ”‚  โ”‚
โ”‚   โ”‚   Multiaddr: /ip4/192.168.1.1/tcp/4001/p2p/12D3KooW...                โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                                                            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The Core Question You're Answering

"How can we store and retrieve data in a decentralized network where content is addressed by its cryptographic hash, not by server location?"

This fundamentally changes data storage:

  • Location independence: "What" the data is (hash) matters, not "where" it lives (server)
  • Verifiability: You can prove data integrity by hashing it
  • Deduplication: Identical content always has the same CID
  • Censorship resistance: No single party controls the data

Understanding content-addressed storage means understanding why IPFS is the data layer for Web3: blockchains store small state, IPFS stores large files.


Concepts You Must Understand First

Stop and research these before coding:

1. Content Addressing vs. Location Addressing

Traditional HTTP addressing identifies content by WHERE it lives:

https://example.com/files/document.pdf
        ^               ^
        |               |
     Server          Path on server

Content addressing identifies content by WHAT it is:

QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
^
'-- Hash of the content itself

Key insight: The same content always produces the same hash. If two people add the same file, they get the same CID. This enables automatic deduplication and verification.

What happens when content changes? New content = new hash = new CID. There's no "updating" a CID; you create a new one. This is immutability by design.

Questions to test your understanding:

  • Why does content addressing enable deduplication?
  • What happens if I change one byte in a file? (Entire CID changes)
  • How does this relate to Git commits? (Git uses content-addressing too!)
  • Why is immutability important for NFT metadata?

Book Reference: "Designing Data-Intensive Applications" Ch. 6 - Kleppmann

2. Cryptographic Hashing and CIDs

A Content Identifier (CID) is more than just a hash. It's a self-describing structure:

CID Structure:

  CID = <version> <codec> <multihash>
        Version:    0x01  (CIDv1)
        Codec:      0x70  (dag-pb; raw = 0x55, etc.)
        Multihash:  <hash-func code> <hash length> <hash bytes>
                     e.g. 0x12 (sha2-256), 0x20 (32 bytes), sha256(content)

Example breakdown of: bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3okuez3...
  - Version: 1 (CIDv1)
  - Codec: dag-pb (0x70) - protobuf-encoded DAG
  - Multihash function: sha2-256 (0x12)
  - Hash length: 32 bytes (0x20)
  - Hash bytes: [actual 32-byte SHA-256 hash]

Why "multihash"? Hash algorithm agility. If SHA-256 is broken, the network can migrate to SHA-3 without changing the CID format. The hash function is embedded in the CID itself.
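
To make the byte layout concrete, here is a minimal sketch in Go using only the standard library; the helper name buildRawCIDv1 is invented for illustration, and real tooling (go-cid) would also apply a multibase prefix when rendering the CID as text.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// buildRawCIDv1 assembles <version><codec><multihash>, where the multihash
// itself is <hash-func code><digest length><digest>. Codec 0x55 means "raw";
// 0x12/0x20 mean sha2-256 with a 32-byte digest. These codes are varints,
// but for values this small the varint is a single byte.
func buildRawCIDv1(data []byte) []byte {
    digest := sha256.Sum256(data)

    cid := []byte{0x01, 0x55}       // version 1, raw codec
    cid = append(cid, 0x12, 0x20)   // sha2-256, 32-byte digest
    cid = append(cid, digest[:]...) // the digest itself
    return cid
}

func main() {
    raw := buildRawCIDv1([]byte("Hello, IPFS!\n"))
    fmt.Println(hex.EncodeToString(raw)) // begins with 01551220...
}

Running it prints a hex string starting with 01551220, which is exactly the version, codec, hash-function, and length fields described above, followed by the digest.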

CIDv0 vs CIDv1:

  • CIDv0: Legacy format, starts with Qm..., always uses dag-pb + SHA-256
  • CIDv1: Modern format, self-describing, supports multiple codecs and hash functions

Questions to test your understanding:

  • What is the advantage of embedding the hash function in the identifier?
  • Why might you choose raw codec vs dag-pb codec?
  • How do you upgrade hash algorithms without breaking old content?

Project Reference: You should have built Project 1 (SHA-256 Visualizer) and Project 3 (Merkle Tree Library)
Web Reference: IPFS CID Specification

3. Merkle DAGs (Directed Acyclic Graphs)

For large files, we don't hash the entire file as one blob. We split it into chunks and link them with a Merkle DAG:

Merkle DAG for a 2.4 MB File (10 chunks):

                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚            Root Node                โ”‚
                    โ”‚  CID: QmYwAPJzv5CZsnA625s3Xf2nem... โ”‚
                    โ”‚  Type: UnixFS File                  โ”‚
                    โ”‚  Size: 2,516,582 bytes              โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                      โ”‚
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚          โ”‚          โ”‚       โ”‚       โ”‚          โ”‚          โ”‚
        โ–ผ          โ–ผ          โ–ผ       โ–ผ       โ–ผ          โ–ผ          โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” ... โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ Chunk 0 โ”‚โ”‚ Chunk 1 โ”‚โ”‚ Chunk 2 โ”‚     โ”‚ Chunk 7 โ”‚โ”‚ Chunk 8 โ”‚โ”‚ Chunk 9 โ”‚
   โ”‚ 256 KB  โ”‚โ”‚ 256 KB  โ”‚โ”‚ 256 KB  โ”‚     โ”‚ 256 KB  โ”‚โ”‚ 256 KB  โ”‚โ”‚ 128 KB  โ”‚
   โ”‚ QmHash0 โ”‚โ”‚ QmHash1 โ”‚โ”‚ QmHash2 โ”‚     โ”‚ QmHash7 โ”‚โ”‚ QmHash8 โ”‚โ”‚ QmHash9 โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Root node contains:
{
  "type": "file",
  "links": [
    {"name": "chunk-0", "cid": "QmHash0...", "size": 262144},
    {"name": "chunk-1", "cid": "QmHash1...", "size": 262144},
    ...
    {"name": "chunk-9", "cid": "QmHash9...", "size": 131072}
  ]
}

CID(file) = hash(root_node_serialized)

Why DAGs instead of simple Merkle trees?

  • DAGs allow sharing: if two files contain identical chunks, both reference the same chunk CID
  • DAGs support directories: a directory node links to file nodes
  • DAGs enable incremental updates: change one chunk, and only that chunk and its ancestors get new CIDs

The relationship to Git: Git commits form a DAG. Each commit points to parent commits and a tree. The tree points to blobs (file contents). Sound familiar?

Questions to test your understanding:

  • If you change one byte in chunk 5, which CIDs change? (Chunk 5, root node)
  • How does chunking enable parallel downloads?
  • Why is 256 KB a common chunk size? (Balance between overhead and parallelism)
  • How would you represent a directory with 10,000 files?

Book Reference: "Mastering Bitcoin" Ch. 9 (Merkle Trees) - Antonopoulos
Project Reference: Project 3 (Merkle Tree Library)

4. Kademlia DHT (Distributed Hash Table)

The DHT is how we find who has the content. Kademlia is the specific DHT algorithm used by IPFS.

Core concepts:

  • Every peer has a 256-bit Peer ID (hash of their public key)
  • Every CID maps to a 256-bit key
  • Distance is measured by XOR: distance(a, b) = a XOR b
XOR Distance Metric Visualization:

Node A ID: 1010 0000 0000 0000 ...
Node B ID: 1010 0001 0000 0000 ...
           -------------------------
XOR:       0000 0001 0000 0000 ...
                   ^
                   '-- 1 bit different = distance 256

Node A ID: 1010 0000 0000 0000 ...
Node C ID: 0000 0000 0000 0000 ...
           -------------------------
XOR:       1010 0000 0000 0000 ...
           ^
           '-- Many bits different = much larger distance

Key insight: XOR distance has these properties:
- d(a,a) = 0 (identity)
- d(a,b) = d(b,a) (symmetric)
- d(a,c) <= d(a,b) + d(b,c) (triangle inequality)
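
A minimal Go sketch of the metric, assuming 256-bit IDs held as byte slices; BucketIndex follows the convention of the diagram below (bucket 0 = first, most significant bit differs, i.e. farthest) and is illustrative rather than code from any particular library.

package dht

import "math/bits"

// XORDistance returns the byte-wise XOR of two equal-length IDs.
// Interpreted as a big-endian integer, this is the Kademlia distance.
func XORDistance(a, b []byte) []byte {
    d := make([]byte, len(a))
    for i := range a {
        d[i] = a[i] ^ b[i]
    }
    return d
}

// BucketIndex returns the position of the first differing bit, counted
// from the most significant bit: 0 means the very first bit differs
// (farthest bucket), 255 means only the last bit differs (closest).
func BucketIndex(a, b []byte) int {
    for i := range a {
        if x := a[i] ^ b[i]; x != 0 {
            return i*8 + bits.LeadingZeros8(x)
        }
    }
    return len(a)*8 - 1 // identical IDs; a node normally never inserts itself
}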

K-Buckets:

K-Bucket Structure (k=3 for simplicity):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Local Node ID: 1010 0000 0000 0000 0000 0000 ... (256 bits)       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                    โ”‚
โ”‚  Bucket 255: Peers with ID differing in bit 255 (very close)       โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚ Peer1: 1010 0000 0000 0000 ... 0001 โ”‚ Last seen: 2s ago       โ”‚ โ”‚
โ”‚  โ”‚ Peer2: 1010 0000 0000 0000 ... 0010 โ”‚ Last seen: 5s ago       โ”‚ โ”‚
โ”‚  โ”‚ [can hold up to k=3 peers]                                    โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                                                    โ”‚
โ”‚  Bucket 254: Peers with ID differing in bit 254                    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚ Peer3: 1010 0000 0000 0000 ... 0100 โ”‚ Last seen: 1s ago       โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                                                    โ”‚
โ”‚  ...                                                               โ”‚
โ”‚                                                                    โ”‚
โ”‚  Bucket 0: Peers with ID differing in bit 0 (very far)             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚ Peer99: 0... (starts with 0, we start with 1)                 โ”‚ โ”‚
โ”‚  โ”‚ Peer100: 0...                                                 โ”‚ โ”‚
โ”‚  โ”‚ Peer101: 0...                                                 โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                                                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Rule: Each bucket stores up to k peers at that distance.
      Closer buckets have fewer potential peers but we know them better.
      Farther buckets have many potential peers but we only need a sample.
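
One possible shape for a bounded bucket is sketched below; it simply drops the least recently seen peer when the bucket is full, whereas real Kademlia first pings the oldest peer and only evicts it if it fails to respond. All type names here are illustrative.

package dht

type PeerInfo struct {
    ID    [32]byte
    Addrs []string
}

// KBucket keeps at most k peers, ordered from least to most recently seen.
type KBucket struct {
    peers []PeerInfo
    k     int
}

// Touch records activity from a peer: move it to the tail if already known,
// append it if there is room, otherwise evict the least recently seen.
// (Simplification: real Kademlia pings the oldest peer before evicting.)
func (b *KBucket) Touch(p PeerInfo) {
    for i, existing := range b.peers {
        if existing.ID == p.ID {
            b.peers = append(append(b.peers[:i], b.peers[i+1:]...), p)
            return
        }
    }
    if len(b.peers) < b.k {
        b.peers = append(b.peers, p)
        return
    }
    b.peers = append(b.peers[1:], p) // drop the oldest entry
}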

Iterative Lookup Process:

Finding content CID: QmYwAPJzv5CZsnA625s3Xf2nem...
Key (256-bit hash of CID): 0x3a7f2b1c...

Step 1: Local node picks k closest peers from its routing table

        Local                                    Target Key
          โ”‚                                          โ”‚
          โ–ผ                                          โ–ผ
        โ”Œโ”€โ”€โ”€โ”     Query     โ”Œโ”€โ”€โ”€โ”                 โ”Œโ”€โ”€โ”€โ”
        โ”‚ A โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ B โ”‚                 โ”‚ ? โ”‚
        โ””โ”€โ”€โ”€โ”˜               โ””โ”€โ”ฌโ”€โ”˜                 โ””โ”€โ”€โ”€โ”˜
                              โ”‚
                              โ”‚ "Here are peers closer to key..."
                              โ–ผ
Step 2: Query returned peers, they return even closer peers

        โ”Œโ”€โ”€โ”€โ”               โ”Œโ”€โ”€โ”€โ”     Query     โ”Œโ”€โ”€โ”€โ”
        โ”‚ A โ”‚               โ”‚ B โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ C โ”‚
        โ””โ”€โ”€โ”€โ”˜               โ””โ”€โ”€โ”€โ”˜               โ””โ”€โ”ฌโ”€โ”˜
                                                  โ”‚
                                                  โ”‚ "I have providers!"
                                                  โ–ผ
Step 3: Eventually find nodes storing provider records

        Provider Record: "Peer X at /ip4/1.2.3.4/tcp/4001 has QmYwAP..."

Total hops: O(log N) where N = network size
For 1 million nodes: ~20 hops

Questions to test your understanding:

  • Why XOR instead of numerical difference?
  • What's the purpose of keeping k peers per bucket instead of just 1?
  • How does Kademlia handle nodes going offline?
  • What prevents an attacker from controlling which nodes store a provider record?

Book Reference: "Designing Data-Intensive Applications" Ch. 6 - Kleppmann
Web Reference: Kademlia DHT Paper

5. Peer-to-Peer Networking

Peer Discovery:

  • Bootstrap nodes: Well-known nodes that help new nodes join
  • mDNS: Local network discovery
  • DHT: Find new peers through routing table refresh

Connection Management:

Multiaddr Format:
/ip4/192.168.1.1/tcp/4001/p2p/12D3KooWRq4tRk8jF3d9VqMn5WVzQv9V3GC9p7uXh3nE8X1YZ2AB
  |      |        |   |        |
  |      |        |   |        '-- Peer ID (hash of public key)
  |      |        |   '-- Port number
  |      |        '-- Protocol (TCP)
  |      '-- IP address
  '-- IP version

Other multiaddr examples:
/ip6/::1/tcp/4001/p2p/12D3KooW...           (IPv6)
/dns4/node.example.com/tcp/443/wss/p2p/...  (WebSocket over HTTPS)
/ip4/192.168.1.1/udp/4001/quic/p2p/...      (QUIC transport)
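
Since a multiaddr is just alternating /protocol/value segments, a first parser needs nothing beyond the standard library. The sketch below assumes every protocol carries a value, which holds for the /ip4/.../tcp/.../p2p/... form but not for valueless protocols such as /quic or /wss; the go-multiaddr package handles those cases properly.

package p2p

import (
    "fmt"
    "strings"
)

// Component is one /protocol/value pair from a multiaddr.
type Component struct {
    Protocol string
    Value    string
}

// ParseMultiaddr splits "/ip4/127.0.0.1/tcp/4001/p2p/12D3..." into
// protocol/value pairs. It does not validate protocol names, and it
// assumes every protocol has a value (see note above).
func ParseMultiaddr(addr string) ([]Component, error) {
    parts := strings.Split(strings.TrimPrefix(addr, "/"), "/")
    if len(parts)%2 != 0 {
        return nil, fmt.Errorf("multiaddr %q has a dangling protocol", addr)
    }
    var comps []Component
    for i := 0; i < len(parts); i += 2 {
        comps = append(comps, Component{Protocol: parts[i], Value: parts[i+1]})
    }
    return comps, nil
}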

NAT Traversal:

  • Relay servers: A third party relays the connection
  • Hole punching: Direct connection through coordinated timing
  • AutoNAT: Detect whether you're behind NAT

Book Reference: "Computer Networks, Fifth Edition" Ch. 7 (The Application Layer) - Tanenbaum

6. BitSwap Protocol

BitSwap is IPFS's block exchange protocol: how peers request and send blocks to each other.

BitSwap Message Flow:

    Node A (wants blocks)              Node B (has blocks)
         |                                    |
         |  WANT_HAVE [CID1, CID2, CID3]      |
         |----------------------------------->|
         |                                    |
         |  HAVE [CID1, CID3]                 |
         |<-----------------------------------|
         |                                    |
         |  WANT_BLOCK [CID1]                 |
         |----------------------------------->|
         |                                    |
         |  BLOCK [CID1, data...]             |
         |<-----------------------------------|
         |                                    |
         |  (verify: hash(data) == CID1)      |
         |                                    |

Message Types:
- WANT_HAVE: "Do you have these blocks?"
- HAVE: "Yes, I have these blocks"
- WANT_BLOCK: "Please send me this block"
- BLOCK: "Here's the block data"
- CANCEL: "Never mind, I got it elsewhere"
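
For your own implementation you can model these messages as plain Go structs and pick any serialization (JSON, CBOR, protobuf). The sketch below is only an approximation of the idea; the real BitSwap wire format is protobuf-based and carries more detail.

package exchange

// MessageType mirrors the five message kinds described above.
type MessageType int

const (
    WantHave MessageType = iota
    Have
    WantBlock
    Block
    Cancel
)

// Entry is a single CID a peer is asking or answering about.
type Entry struct {
    CID      string // string form of the CID
    Priority int    // higher = send sooner
}

// Message is one BitSwap-like envelope exchanged between two peers.
type Message struct {
    Type    MessageType
    Entries []Entry  // used by WANT_HAVE / HAVE / WANT_BLOCK / CANCEL
    Blocks  [][]byte // raw block bytes, used by BLOCK
}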

Incentive Considerations:

  • BitSwap tracks peer "ledgers" (how much data has been exchanged with each peer)
  • Preferential treatment for peers who reciprocate
  • Similar to BitTorrent's tit-for-tat but more flexible

Web Reference: BitSwap Specification

7. Data Persistence and Garbage Collection

Content in IPFS is ephemeral by default: without explicit pinning, unreferenced blocks can be garbage collected to reclaim space.

Pinning Hierarchy:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                       PINNING STRATEGIES                           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                    โ”‚
โ”‚  Direct Pin: Pin only the root CID                                 โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  Pinned: Root โ”€โ”€โ–บ Not pinned: Chunk1, Chunk2, Chunk3        โ”‚   โ”‚
โ”‚  โ”‚  (WARNING: Children can be garbage collected!)              โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                    โ”‚
โ”‚  Recursive Pin: Pin root and all descendants (DEFAULT)             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  Pinned: Root โ”€โ”€โ–บ Pinned: Chunk1, Chunk2, Chunk3            โ”‚   โ”‚
โ”‚  โ”‚  (All blocks preserved)                                     โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                    โ”‚
โ”‚  Indirect Pin: Pinned because an ancestor is recursively pinned   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚  If RootA pins Chunk1, and RootB also uses Chunk1:          โ”‚   โ”‚
โ”‚  โ”‚  Chunk1 is safe even if RootB is unpinned                   โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Garbage Collection Process:
1. Mark all pinned CIDs and their descendants
2. Mark all blocks in active transfer
3. Delete unmarked blocks
4. Update blockstore metrics
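
A sketch of that mark-and-sweep loop in Go, written against two assumed interfaces (Blockstore and Pinner) that stand in for whatever your storage and pinning layers actually expose:

package gc

// Blockstore and Pinner are assumed interfaces for this sketch.
type Blockstore interface {
    AllCIDs() ([]string, error)
    Delete(cid string) error
}

type Pinner interface {
    // PinnedClosure returns every pinned CID plus all of its descendants.
    PinnedClosure() (map[string]bool, error)
}

// CollectGarbage deletes every block that is neither pinned (directly,
// recursively, or indirectly) nor part of an in-flight transfer.
func CollectGarbage(bs Blockstore, p Pinner, inFlight map[string]bool) error {
    marked, err := p.PinnedClosure() // step 1: mark pinned CIDs and descendants
    if err != nil {
        return err
    }
    for cid := range inFlight { // step 2: mark blocks in active transfer
        marked[cid] = true
    }
    all, err := bs.AllCIDs()
    if err != nil {
        return err
    }
    for _, cid := range all { // step 3: sweep everything unmarked
        if !marked[cid] {
            if err := bs.Delete(cid); err != nil {
                return err
            }
        }
    }
    return nil // step 4: the caller can recompute blockstore metrics here
}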

Provider Record Lifecycle:

  • When you add content, you announce to DHT
  • Provider records expire after 24 hours
  • You must re-announce periodically to stay discoverable
  • If all providers stop announcing, content becomes unfindable
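
Because records expire, a provider has to re-announce on a timer. A minimal sketch, assuming your DHT exposes a Provide method; a caller might run it with an interval of, say, 12 hours, comfortably inside the 24-hour expiry.

package provider

import (
    "context"
    "time"
)

// DHT is whatever your implementation exposes for announcements.
type DHT interface {
    Provide(ctx context.Context, cid string) error
}

// Reprovide periodically re-announces every pinned CID so the
// provider records never age out of the DHT.
func Reprovide(ctx context.Context, dht DHT, pinned func() []string, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, cid := range pinned() {
                _ = dht.Provide(ctx, cid) // best effort; log failures in practice
            }
        }
    }
}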

Questions to Guide Your Design

Before implementing, think through these:

1. Content Addressing

  • How do you compute a CID from file bytes?
  • What hash function to use? (SHA-256 for compatibility)
  • How do you encode the CID? (base58, base32, or base64?)
  • How do you handle CID v0 vs v1?

2. Chunking Strategy

  • What chunk size? (IPFS uses 256 KB default)
  • Fixed-size chunks or content-defined chunking?
  • How do you rebuild the file from chunks?
  • How do you handle small files (< chunk size)?

3. Merkle DAG Construction

  • How do you represent the DAG in memory and on disk?
  • What format for DAG nodes? (protobuf, JSON, CBOR?)
  • How do you link chunks to the root node?
  • How do you traverse the DAG to fetch all blocks?

4. DHT Implementation

  • How many k-buckets? (IPFS uses 256 for 256-bit hashes)
  • What's the k-value (peers per bucket)? (IPFS uses k=20)
  • How do you bootstrap (initial peer discovery)?
  • How often do you refresh routing tables?
  • How do you handle peers going offline?

5. Provider Records

  • When a node has content, how does it announce to the DHT?
  • How many DHT nodes should store the provider record? (IPFS uses ~20)
  • How long are provider records valid? (IPFS uses 24 hours, with refresh)
  • How do you clean up stale records?

6. Block Exchange

  • How do you request blocks from peers (BitSwap vs simple request/response)?
  • Do you implement incentives to prevent freeloading?
  • How do you handle peers that have incomplete data?
  • Parallel downloads from multiple peers or sequential?

Thinking Exercises

Exercise 1: Trace Content Retrieval

Before coding, trace this scenario step by step:

1. Alice adds file.pdf to IPFS on Node A
2. CID computed: QmAbc123...
3. Node A announces to DHT: "I have QmAbc123"
4. Bob (on Node B) tries to get QmAbc123
5. Node B queries DHT: "Who has QmAbc123?"
6. DHT returns: "Node A has it"
7. Node B connects to Node A
8. Node B requests blocks from Node A
9. Node B verifies each block hash
10. Node B reconstructs file.pdf

Questions while tracing:

  • At step 3, which DHT nodes store the provider record? (20 closest to QmAbc123's hash)
  • At step 5, how many hops in the DHT? (O(log N) where N = network size)
  • At step 6, what if Node A is offline? (Query returns other providers, if any)
  • At step 9, what if a block's hash doesn't match? (Reject, try different peer)
  • What if Charlie on Node C also fetches the file? (Now 3 providers: A, B, C)

Exercise 2: Design CID Computation

Given a file, compute its CID:

File: "Hello, IPFS!\n" (13 bytes)

Steps:
1. Hash the content: SHA-256("Hello, IPFS!\n") = 0x2c26b46b...
2. Create multihash: <hash-function-code><hash-length><hash-bytes>
   - SHA-256 code: 0x12
   - Length: 32 (0x20)
   - Result: 0x12202c26b46b...
3. Create CIDv1: <version><codec><multihash>
   - Version: 0x01
   - Codec: raw (0x55)
   - Result: 0x015512202c26b46b...
4. Encode in base58: QmZ...

Exercise 3: XOR Distance Calculation

Calculate XOR distances:

Node A: 0b11110000 (simplified to 8 bits)
Node B: 0b11110001
Node C: 0b00001111
Key K:  0b11110010

Distance(A, K) = 0b11110000 XOR 0b11110010 = 0b00000010 = 2
Distance(B, K) = 0b11110001 XOR 0b11110010 = 0b00000011 = 3
Distance(C, K) = 0b00001111 XOR 0b11110010 = 0b11111101 = 253

Closest to K: Node A (distance 2)

Question: Which k-bucket would Node A put Node B in? (Bucket for bit position where they first differ)


Real World Outcome

When your IPFS-like decentralized storage client is complete, you'll have built a peer-to-peer content distribution system that demonstrates how Web3 handles data storage without centralized servers. Here's what you'll see:

# Terminal 1 - Start first node (Bootstrap node)
$ ./ipfs-lite daemon --port 4001
[Node A] Peer ID: 12D3KooWRq4tRk8jF3d9VqMn5WVzQv9V3GC9p7uXh3nE8X1YZ2AB
[Node A] Listening on /ip4/127.0.0.1/tcp/4001
[Node A] Listening on /ip6/::1/tcp/4001
[Node A] DHT server mode enabled
[Node A] Local storage: ~/.ipfs-lite/blocks/
[Node A] Ready to accept connections

# Terminal 2 - Start second node and connect to bootstrap
$ ./ipfs-lite daemon --port 4002 --bootstrap /ip4/127.0.0.1/tcp/4001/p2p/12D3Koo...
[Node B] Peer ID: 12D3KooWPq8mTUv2oX4cD5nN9sVzQy8R4GC0p7yWh3nA8Z1YZ3CD
[Node B] Connecting to bootstrap node...
[Node B] Connected to 12D3KooWRq4tRk8j... โœ“
[Node B] DHT synchronized with 1 peer
[Node B] Ready

# Terminal 3 - Add a file (Node A)
$ ./ipfs-lite add ~/Documents/whitepaper.pdf --node http://localhost:4001

Adding file: whitepaper.pdf (2.4 MB)
[*] Computing content hash...
[*] File size exceeds chunk size, splitting into chunks...

Chunking strategy:
  Chunk size: 256 KB
  Total chunks: 10
  DAG structure: Merkle tree with 10 leaf nodes

Chunk 0: QmHash0... (262144 bytes) โœ“
Chunk 1: QmHash1... (262144 bytes) โœ“
Chunk 2: QmHash2... (262144 bytes) โœ“
...
Chunk 9: QmHash9... (131072 bytes) โœ“

Building Merkle DAG...
  Root node links to 10 chunks

Computing CID (Content Identifier)...
  CIDv1: base58-btc
  Codec: dag-pb
  Multihash: sha2-256

+---------------------------------------------+
| File added successfully!                     |
| CID: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79o |
+---------------------------------------------+

File structure:
+---------------------------------------------+
| Root: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz7 |
|  +-- Chunk-0: QmHash0... (256 KB)           |
|  +-- Chunk-1: QmHash1... (256 KB)           |
|  +-- Chunk-2: QmHash2... (256 KB)           |
|  +-- ...                                    |
|  +-- Chunk-9: QmHash9... (128 KB)           |
+---------------------------------------------+

[Node A] Announcing to DHT...
[DHT] Publishing provider record to 20 closest peers
[DHT] Provider record stored on peers: 12D3KooWPq8m..., 12D3KooWXy9z...
[Node A] File is now available on the network!

# Terminal 4 - Retrieve file from different node (Node B)
$ ./ipfs-lite get QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG --node http://localhost:4002

Retrieving: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG

[*] Querying DHT for providers...
[DHT] Finding peers closest to QmYwAP...
[DHT] XOR distance calculation:
  Peer 12D3KooWRq4tRk8j... distance: 0x3f2a1b...
  Peer 12D3KooWXy9z...     distance: 0x7c4d2e...
[DHT] Querying 20 closest peers...
[DHT] Found 2 providers:
  โœ“ 12D3KooWRq4tRk8j... (Node A) - RTT: 2ms
  โœ“ 12D3KooWZa5x...     (Node C) - RTT: 45ms

[*] Fetching Merkle DAG root...
[Node A] Sending root node (1.2 KB)

[*] Parsing DAG structure...
Total size: 2.4 MB
Chunks needed: 10

[*] Downloading chunks in parallel (max 5 concurrent)...
[#####-----] Chunk 0/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 1/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 2/10 - downloading from 12D3KooWZa5x... (256 KB)
[#####-----] Chunk 3/10 - downloading from 12D3KooWRq4t... (256 KB)
[#####-----] Chunk 4/10 - downloading from 12D3KooWZa5x... (256 KB)
...
[##########] All chunks downloaded (2.4 MB)

[*] Verifying chunk hashes...
  Chunk 0: Hash matches QmHash0... โœ“
  Chunk 1: Hash matches QmHash1... โœ“
  ...
  All chunks verified โœ“

[*] Reconstructing file...
+---------------------------------------------+
| File retrieved: whitepaper.pdf (2.4 MB)     |
| Downloaded in 0.8 seconds (3.0 MB/s)        |
| SHA-256 matches original โœ“                  |
+---------------------------------------------+

# Terminal 5 - Pin the file (keep it permanently)
$ ./ipfs-lite pin add QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG

[Node B] Pinning CID: QmYwAP...
[Node B] Pinned content will not be garbage collected
[Node B] Announcing as provider to DHT...
[DHT] Now 3 peers provide this content

Pinned: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG

# Check network statistics
$ ./ipfs-lite stats

+============================================================+
|              IPFS-Lite Network Statistics                  |
+============================================================+
| Local Peer ID: 12D3KooWPq8mTUv2oX4cD5nN9sVzQy8R4GC0        |
|                                                            |
| DHT Information:                                           |
|   Connected peers:           47                            |
|   Routing table size:        256 k-buckets                 |
|   K-value:                   20                            |
|   Closest peers:             20                            |
|                                                            |
| Storage:                                                   |
|   Blocks stored:             423                           |
|   Total size:                156 MB                        |
|   Pinned blocks:             89                            |
|   Pinned size:               34 MB                         |
|                                                            |
| Bandwidth (last hour):                                     |
|   Downloaded:                245 MB                        |
|   Uploaded:                  89 MB                         |
|   Peers served:              12                            |
|                                                            |
| Provider Records:                                          |
|   Content we provide:        89 CIDs                       |
|   DHT announcements:         2,140                         |
+============================================================+

# Test content addressing (same content = same CID)
$ ./ipfs-lite add file1.txt
CID: QmXa5b...

$ ./ipfs-lite add file2.txt  # Same content as file1.txt
CID: QmXa5b...  # Same CID! Content-addressed

# Test DHT lookup manually
$ ./ipfs-lite dht findprovs QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG

Searching DHT for providers of: QmYwAP...

DHT Lookup Process:
1. Hash CID to 256-bit key: 0x3a7f2b1c...
2. Find k-closest peers to key using XOR metric
3. Iteratively query closer and closer peers (Kademlia algorithm)

Iteration 1: Querying 3 bootstrap peers
  Found 15 peers closer to key
Iteration 2: Querying 15 peers
  Found 8 peers even closer
Iteration 3: Querying 8 peers
  Found 3 provider records โœ“

Providers:
  โœ“ 12D3KooWRq4tRk8j... (Node A) - /ip4/127.0.0.1/tcp/4001
  โœ“ 12D3KooWPq8mTUv2... (Node B) - /ip4/127.0.0.1/tcp/4002
  โœ“ 12D3KooWZa5x...     (Node C) - /ip4/192.168.1.50/tcp/4001

Lookup completed in 3 hops (< 0.1s)

# Simulate node going offline (test resilience)
$ ./ipfs-lite daemon --port 4003 --bootstrap /ip4/127.0.0.1/tcp/4001
[Node D] Connected to network
[Node D] Fetching QmYwAP...

# Even if Node A goes offline...
[Node A] Shutting down...

# Node D can still fetch from Node B or C
[Node D] Node A unreachable, trying alternative provider
[Node D] Downloading from Node B... โœ“
[Node D] File retrieved successfully

Content remains available as long as ONE peer has it!

# View DAG structure visually
$ ./ipfs-lite dag inspect QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG

Merkle DAG Structure:
================================================================================
Root: QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG
Type: UnixFS File
Size: 2,516,582 bytes (2.4 MB)

Links (10):
  +-- [0] QmHash0... -> Chunk (262144 bytes)
  +-- [1] QmHash1... -> Chunk (262144 bytes)
  +-- [2] QmHash2... -> Chunk (262144 bytes)
  +-- [3] QmHash3... -> Chunk (262144 bytes)
  +-- [4] QmHash4... -> Chunk (262144 bytes)
  +-- [5] QmHash5... -> Chunk (262144 bytes)
  +-- [6] QmHash6... -> Chunk (262144 bytes)
  +-- [7] QmHash7... -> Chunk (262144 bytes)
  +-- [8] QmHash8... -> Chunk (262144 bytes)
  +-- [9] QmHash9... -> Chunk (131072 bytes)

CID Format:
  Version: 1
  Codec: dag-pb (0x70)
  Multihash: sha2-256
  Base Encoding: base58-btc

Verification:
  โœ“ All chunk hashes valid
  โœ“ DAG structure consistent
  โœ“ Total size matches sum of chunks
================================================================================

You've built a censorship-resistant, content-addressed storage network where data is identified by its hash (not location), distributed across peers, and verifiable without trusting any single party. This is how NFT metadata, dApp frontends, and Web3 data live beyond centralized servers.


Implementation Phases

Phase 1: Content Addressing Foundation

Goal: Compute CIDs from file bytes.

Tasks:

  1. Implement SHA-256 hashing
  2. Create multihash encoding (function code + length + hash)
  3. Build CIDv1 structure (version + codec + multihash)
  4. Implement base58 encoding for human-readable CIDs

Validation:

data := []byte("Hello, IPFS!")
cid := ComputeCID(data)
fmt.Println(cid.String())  // Should match real IPFS CID

Hints if stuck:

  • Start with the go-cid and go-multihash libraries to understand the format
  • Then implement from scratch for learning
  • Verify against real IPFS: echo "Hello, IPFS!" | ipfs add --only-hash

Phase 2: File Chunking

Goal: Split large files into fixed-size chunks.

Tasks:

  1. Implement 256 KB chunk splitting
  2. Handle files smaller than chunk size
  3. Handle files with non-aligned sizes
  4. Compute CID for each chunk

Validation:

chunks := ChunkFile(largeFileBytes, 256*1024)
for i, chunk := range chunks {
    fmt.Printf("Chunk %d: %s (%d bytes)\n", i, ComputeCID(chunk), len(chunk))
}

Hints if stuck:

  • Last chunk will usually be smaller
  • Consider edge case: empty file
  • Consider edge case: file exactly chunk size

Phase 3: Merkle DAG Construction

Goal: Link chunks into a root node.

Tasks:

  1. Define DAGNode and Link structures
  2. Build links from chunks
  3. Serialize root node (protobuf or CBOR)
  4. Compute root CID

Validation:

root := BuildDAG(chunks)
fmt.Println("Root CID:", root.CID)
fmt.Println("Links:", len(root.Links))

Hints if stuck:

  • Study the UnixFS protobuf format
  • Root node contains links, not raw data
  • CID of root = hash of serialized root node

Phase 4: Local Block Store

Goal: Persist blocks to disk.

Tasks:

  1. Design key-value store interface
  2. Implement file-system-based storage (CID -> bytes)
  3. Add block retrieval
  4. Implement simple garbage collection

Validation:

$ ls ~/.ipfs-lite/blocks/
QmHash0... QmHash1... QmHash2... root...

Hints if stuck:

  • Shard directories by first 2 chars of CID (prevents too many files in one dir)
  • Consider using LevelDB or BadgerDB for production
  • GC: mark pinned, sweep unpinned
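
A sketch of the sharded filesystem layout described in the hints: each block is written to a file named after its CID, inside a directory named after the CID's first two characters. The layout and names are illustrative, not the on-disk format real IPFS uses.

package blockstore

import (
    "os"
    "path/filepath"
)

// FSBlockstore stores each block as a file named after its CID,
// sharded into subdirectories by the first two characters.
type FSBlockstore struct {
    Root string // e.g. ~/.ipfs-lite/blocks
}

func (s *FSBlockstore) pathFor(cid string) string {
    return filepath.Join(s.Root, cid[:2], cid)
}

// Put writes the block, creating the shard directory on demand.
func (s *FSBlockstore) Put(cid string, data []byte) error {
    p := s.pathFor(cid)
    if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
        return err
    }
    return os.WriteFile(p, data, 0o644)
}

// Get reads a block back; callers should re-hash it and compare to cid.
func (s *FSBlockstore) Get(cid string) ([]byte, error) {
    return os.ReadFile(s.pathFor(cid))
}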

Phase 5: DHT Routing Table

Goal: Implement k-bucket structure.

Tasks:

  1. Generate peer ID from keypair
  2. Implement XOR distance calculation
  3. Build k-bucket array (256 buckets, k=20)
  4. Add peer insertion/lookup logic

Validation:

rt := NewRoutingTable(myPeerID, 20) // k = 20
rt.AddPeer(peer1)
rt.AddPeer(peer2)
closest := rt.FindClosest(targetKey, 3)

Hints if stuck:

  • Distance is calculated by XORing two 256-bit values
  • Find which bucket by finding first differing bit position
  • Evict stale peers when bucket is full

Phase 6: DHT Operations

Goal: Implement FindNode and FindProviders.

Tasks:

  1. Implement iterative FindNode (find peers closest to key)
  2. Implement provider record storage
  3. Implement FindProviders query
  4. Add provider announcement

Validation:

providers := dht.FindProviders(cid)
for _, p := range providers {
    fmt.Println("Provider:", p.ID, p.Addrs)
}

Hints if stuck:

  • Alpha=3: query 3 peers at a time
  • Terminate when no closer peers found
  • Provider records have TTL (24 hours)
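
One way to hold provider records with a TTL is sketched below; the 24-hour constant follows the hint above, and everything else (names, lazy expiry on read) is an illustrative choice.

package dht

import (
    "sync"
    "time"
)

const providerTTL = 24 * time.Hour

type providerEntry struct {
    peerID  string
    expires time.Time
}

// ProviderStore maps a CID to the peers that announced it, dropping
// records lazily once their TTL has passed.
type ProviderStore struct {
    mu      sync.Mutex
    records map[string][]providerEntry // CID -> providers
}

func NewProviderStore() *ProviderStore {
    return &ProviderStore{records: make(map[string][]providerEntry)}
}

func (ps *ProviderStore) Add(cid, peerID string) {
    ps.mu.Lock()
    defer ps.mu.Unlock()
    ps.records[cid] = append(ps.records[cid], providerEntry{
        peerID:  peerID,
        expires: time.Now().Add(providerTTL),
    })
}

// Providers returns the peer IDs whose records are still valid and
// prunes the expired ones as a side effect.
func (ps *ProviderStore) Providers(cid string) []string {
    ps.mu.Lock()
    defer ps.mu.Unlock()
    var alive []providerEntry
    var peers []string
    for _, e := range ps.records[cid] {
        if time.Now().Before(e.expires) {
            alive = append(alive, e)
            peers = append(peers, e.peerID)
        }
    }
    ps.records[cid] = alive
    return peers
}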

Phase 7: P2P Networking

Goal: Establish peer connections.

Tasks:

  1. Set up TCP listener
  2. Implement connection handling
  3. Add bootstrap node connection
  4. Handle multiaddr parsing

Validation:

$ ./ipfs-lite daemon
Listening on /ip4/0.0.0.0/tcp/4001/p2p/12D3KooW...
Connected peers: 5

Hints if stuck:

  • Use libp2p for production quality
  • For learning, raw TCP + length-prefixed messages works
  • Handle connection failures gracefully
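
If you take the raw-TCP route, the usual framing is a fixed 4-byte big-endian length prefix before every message. A minimal sketch, with an arbitrary 2 MB frame cap so a misbehaving peer cannot force huge allocations (no libp2p, no encryption, learning only):

package p2p

import (
    "encoding/binary"
    "fmt"
    "io"
)

const maxFrame = 1 << 21 // 2 MB, comfortably above a 256 KB block

// WriteFrame sends one length-prefixed message.
func WriteFrame(w io.Writer, payload []byte) error {
    var hdr [4]byte
    binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
    if _, err := w.Write(hdr[:]); err != nil {
        return err
    }
    _, err := w.Write(payload)
    return err
}

// ReadFrame reads one length-prefixed message, rejecting absurd sizes.
func ReadFrame(r io.Reader) ([]byte, error) {
    var hdr [4]byte
    if _, err := io.ReadFull(r, hdr[:]); err != nil {
        return nil, err
    }
    n := binary.BigEndian.Uint32(hdr[:])
    if n > maxFrame {
        return nil, fmt.Errorf("frame of %d bytes exceeds limit", n)
    }
    buf := make([]byte, n)
    if _, err := io.ReadFull(r, buf); err != nil {
        return nil, err
    }
    return buf, nil
}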

Phase 8: Block Exchange

Goal: Request and serve blocks between peers.

Tasks:

  1. Implement WANT/HAVE/BLOCK messages
  2. Add block request handling
  3. Implement parallel downloads
  4. Add block verification on receipt

Validation:

block := peer.RequestBlock(cid)
if ComputeCID(block) != cid {
    return errors.New("block verification failed")
}

Hints if stuck:

  • Verify every received block by hashing
  • Download multiple blocks in parallel
  • Retry with different peer on failure
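
A sketch of bounded-parallelism downloading with per-block verification; the fetch and verify callbacks stand in for your peer layer and CID check, and a real client would retry failed blocks against a different provider instead of just recording the first error.

package exchange

import (
    "fmt"
    "sync"
)

// DownloadAll fetches every CID with at most `parallel` requests in
// flight, verifying each block before accepting it.
func DownloadAll(
    cids []string,
    fetch func(cid string) ([]byte, error),
    verify func(cid string, data []byte) bool,
    parallel int,
) (map[string][]byte, error) {
    var (
        mu       sync.Mutex
        wg       sync.WaitGroup
        firstErr error
    )
    blocks := make(map[string][]byte, len(cids))
    sem := make(chan struct{}, parallel) // bounds concurrency

    for _, cid := range cids {
        wg.Add(1)
        go func(cid string) {
            defer wg.Done()
            sem <- struct{}{}
            defer func() { <-sem }()

            data, err := fetch(cid)
            if err == nil && !verify(cid, data) {
                err = fmt.Errorf("block %s failed verification", cid)
            }

            mu.Lock()
            defer mu.Unlock()
            if err != nil {
                if firstErr == nil {
                    firstErr = err // a real client would retry another provider
                }
                return
            }
            blocks[cid] = data
        }(cid)
    }
    wg.Wait()
    return blocks, firstErr
}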

Phase 9: Full Integration

Goal: Complete add/get workflow.

Tasks:

  1. Wire up all components
  2. Implement CLI commands
  3. Add pinning support
  4. Test with multiple nodes

Validation:

# Node 1
$ ./ipfs-lite add file.pdf
CID: Qm...

# Node 2 (different machine)
$ ./ipfs-lite get Qm...
Downloaded: file.pdf

The Interview Questions They'll Ask

Prepare to answer these:

1. "Explain how IPFS is different from HTTP."

Expected answer: HTTP is location-based: URLs point to specific servers. IPFS is content-based: CIDs identify data by its hash, not where it lives. This enables verification (you can check the hash), deduplication (same content = same CID), and decentralization (fetch from anyone who has it).

2. "What happens if I change one byte in a file on IPFS?"

Expected answer: The entire file gets a new CID because the hash changes. The old CID still points to the old content (immutability). There's no "updating" a CID; you create a new one. This is similar to Git commits.

3. "How does the DHT find content in a large network?"

Expected answer: Kademlia uses XOR distance between peer IDs and content keys. It iteratively queries peers closer and closer to the target key. Each hop cuts the search space roughly in half, achieving O(log N) lookup complexity. For a million-node network, this means about 20 hops.

4. "What if no peer has the content I'm looking for?"

Expected answer: The DHT query returns empty: no providers found. The content is effectively lost unless someone who has it comes back online or re-adds it. This is why pinning and redundancy are important.

5. "How does IPFS prevent Sybil attacks on the DHT?"

Expected answer: Peer IDs are hashes of public keys. While creating many IDs is cheap, they're randomly distributed in the keyspace. An attacker can't control which k-buckets they land in or which content they're "close" to. Additionally, provider records are stored on the k closest peers, requiring the attacker to control many IDs near each target.

6. "What's the difference between IPFS and BitTorrent?"

Expected answer:

  • BitTorrent uses per-file swarms with .torrent files. IPFS has a global DHT where any CID can be looked up.
  • BitTorrent is mainly for large file sharing. IPFS handles any data size and supports directories.
  • IPFS integrates content addressing at the protocol level. BitTorrent uses info hashes, but peers are still discovered via trackers or a per-torrent DHT.
  • IPFS supports Merkle DAGs for incremental updates. BitTorrent treats files as opaque chunks.

7. "Why do NFTs use IPFS for metadata and images?"

Expected answer: On-chain storage is extremely expensive (hundreds of dollars per KB). IPFS provides:

  • Decentralized storage that can't be censored
  • Content addressing that guarantees the metadata/image matches what the NFT references
  • Permanence as long as any node pins the content
  • Verification without trusting the hosting server

8. "What's pinning and why is it needed?"

Expected answer: Pinning tells your node to keep content permanently and never garbage collect it. Without pinning, unpopular content that no one is actively requesting will eventually be removed to free space. Pinning services (like Pinata or Infura) provide redundant storage so content survives even if the original uploader goes offline.


Hints in Layers

Hint 1: Start with Simple Content Addressing

Compute CID from bytes:

import (
    "crypto/sha256"
    "github.com/multiformats/go-multihash"
    "github.com/ipfs/go-cid"
)

func computeCID(data []byte) (cid.Cid, error) {
    // 1. Hash the data
    hash := sha256.Sum256(data)

    // 2. Create multihash
    mh, err := multihash.Encode(hash[:], multihash.SHA2_256)
    if err != nil {
        return cid.Cid{}, err
    }

    // 3. Create CID v1
    c := cid.NewCidV1(cid.Raw, mh)
    return c, nil
}

Hint 2: Chunking Large Files

const ChunkSize = 256 * 1024  // 256 KB

func chunkFile(data []byte) [][]byte {
    var chunks [][]byte
    for i := 0; i < len(data); i += ChunkSize {
        end := i + ChunkSize
        if end > len(data) {
            end = len(data)
        }
        chunks = append(chunks, data[i:end])
    }
    return chunks
}

Hint 3: Build Merkle DAG

type DAGNode struct {
    Data  []byte
    Links []Link
}

type Link struct {
    Name string
    Cid  cid.Cid
    Size uint64
}

func buildDAG(chunks [][]byte) (cid.Cid, error) {
    // Create leaf nodes for each chunk
    var links []Link
    for i, chunk := range chunks {
        chunkCID, _ := computeCID(chunk)
        links = append(links, Link{
            Name: fmt.Sprintf("chunk-%d", i),
            Cid:  chunkCID,
            Size: uint64(len(chunk)),
        })
        // Store chunk in local blockstore
        storeBlock(chunkCID, chunk)
    }

    // Create root node with links
    rootNode := DAGNode{Links: links}
    rootBytes := encodeNode(rootNode)  // Protobuf or CBOR
    rootCID, _ := computeCID(rootBytes)

    storeBlock(rootCID, rootBytes)
    return rootCID, nil
}

Hint 4: Simple DHT (Kademlia basics)

type RoutingTable struct {
    localID  PeerID
    buckets  [256]*KBucket  // One bucket per bit in ID
}

type KBucket struct {
    peers []PeerInfo
    k     int  // Usually 20
}

func (rt *RoutingTable) FindClosest(key []byte, count int) []PeerInfo {
    // Compute XOR distance between key and each peer ID
    // Return 'count' closest peers
}

func (rt *RoutingTable) FindProviders(cid cid.Cid) ([]PeerInfo, error) {
    key := cid.Hash()

    // Iterative lookup
    visited := make(map[string]bool)
    toQuery := rt.FindClosest(key, 3)  // Start with 3 closest

    for len(toQuery) > 0 {
        peer := toQuery[0]
        toQuery = toQuery[1:]

        if visited[peer.ID] {
            continue
        }
        visited[peer.ID] = true

        // Ask peer for providers
        providers := peer.FindProviders(cid)
        if len(providers) > 0 {
            return providers, nil
        }

        // Ask peer for closer peers
        closerPeers := peer.FindCloserPeers(key)
        toQuery = append(toQuery, closerPeers...)
    }

    return nil, errors.New("no providers found")
}

Hint 5: Use Existing IPFS Libraries

Don't build from scratch (unless you want the deepest learning):

  • go-ipfs (now Kubo): Full IPFS implementation (study this!)
  • go-cid: CID handling
  • go-multihash: Multihash encoding
  • libp2p: P2P networking layer
  • go-datastore: Key-value storage

Hint 6: Test with Real IPFS

Verify your CIDs match real IPFS:

# Add file to your implementation
$ ./ipfs-lite add test.txt
CID: QmXa5b...

# Add same file to real IPFS
$ ipfs add test.txt
QmXa5b...  # Should be SAME CID!

# Fetch from real IPFS gateway
$ curl https://ipfs.io/ipfs/QmXa5b...
# Should return your file

Testing Strategy

Unit Tests

func TestComputeCID_Empty(t *testing.T) {
    cid := ComputeCID([]byte{})
    // Empty data should still produce valid CID
    assert.NotEmpty(t, cid.String())
}

func TestComputeCID_Deterministic(t *testing.T) {
    data := []byte("Hello, IPFS!")
    cid1 := ComputeCID(data)
    cid2 := ComputeCID(data)
    assert.Equal(t, cid1, cid2)
}

func TestChunking_SmallFile(t *testing.T) {
    data := []byte("small")
    chunks := ChunkFile(data)
    assert.Equal(t, 1, len(chunks))
}

func TestChunking_ExactMultiple(t *testing.T) {
    data := make([]byte, 256*1024*3)  // Exactly 3 chunks
    chunks := ChunkFile(data)
    assert.Equal(t, 3, len(chunks))
}

func TestDAG_Reconstruction(t *testing.T) {
    original := randomBytes(1024 * 1024)  // 1 MB

    // Add to system
    rootCID := Add(original)

    // Retrieve
    retrieved := Get(rootCID)

    assert.Equal(t, original, retrieved)
}

DHT Tests

func TestXORDistance(t *testing.T) {
    a := []byte{0xFF, 0x00}
    b := []byte{0xF0, 0x00}

    dist := XORDistance(a, b)
    expected := []byte{0x0F, 0x00}

    assert.Equal(t, expected, dist)
}

func TestKBucket_Insertion(t *testing.T) {
    rt := NewRoutingTable(myID, 20)

    // Insert 25 peers
    for i := 0; i < 25; i++ {
        rt.AddPeer(randomPeer())
    }

    // Bucket should have at most k=20 peers
    bucket := rt.GetBucket(0)
    assert.LessOrEqual(t, len(bucket.peers), 20)
}

func TestFindClosest(t *testing.T) {
    rt := NewRoutingTable(myID, 20)

    // Add known peers
    rt.AddPeer(peer1)
    rt.AddPeer(peer2)
    rt.AddPeer(peer3)

    closest := rt.FindClosest(targetKey, 2)

    // Verify returned peers are actually sorted by distance
    // (XORDistance returns []byte, so compare lexicographically; import "bytes")
    dist0 := XORDistance(closest[0].ID, targetKey)
    dist1 := XORDistance(closest[1].ID, targetKey)
    assert.True(t, bytes.Compare(dist0, dist1) <= 0)
}

Integration Tests

func TestEndToEnd_TwoNodes(t *testing.T) {
    // Start two nodes (NodeConfig stands in for your node options struct)
    node1 := StartNode(NodeConfig{Port: 4001})
    node2 := StartNode(NodeConfig{Port: 4002, Bootstrap: node1.Addr()})

    // Add file on node1
    data := []byte("test data for e2e")
    cid := node1.Add(data)

    // Retrieve on node2
    retrieved := node2.Get(cid)

    assert.Equal(t, data, retrieved)

    // Verify node2 is now a provider
    providers := node1.DHT.FindProviders(cid)
    assert.Contains(t, providers, node2.ID)
}

func TestResilience_NodeOffline(t *testing.T) {
    node1 := StartNode(NodeConfig{Port: 4001})
    node2 := StartNode(NodeConfig{Port: 4002, Bootstrap: node1.Addr()})
    node3 := StartNode(NodeConfig{Port: 4003, Bootstrap: node1.Addr()})

    // Add on node1
    cid := node1.Add(testData)

    // Fetch on node2 (becomes provider)
    node2.Get(cid)

    // Stop node1
    node1.Stop()

    // Node3 should still be able to get from node2
    retrieved := node3.Get(cid)
    assert.Equal(t, testData, retrieved)
}

Learning Milestones

Track your progress:

  1. Content addressing works - You compute CIDs that match real IPFS
  2. Large files chunk correctly - Files split and reassemble properly
  3. Merkle DAG builds correctly - Root CID represents entire file structure
  4. DHT finds providers - Iterative lookup locates content
  5. Multi-peer download works - Parallel fetching from multiple sources
  6. Pinning preserves content - GC doesn't remove pinned data
  7. Node resilience proven - Content survives node failures

Books That Will Help

Topic Book Chapter
Content addressing "Designing Data-Intensive Applications" Ch. 6: Partitioning & Ch. 11: Stream Processing
Merkle DAGs "Mastering Bitcoin" Ch. 9: The Blockchain
Distributed hash tables "Designing Data-Intensive Applications" Ch. 6: Partitioning (DHT section)
P2P networking "Computer Networks, Fifth Edition" Ch. 7: The Application Layer (P2P section)
Cryptographic hashing "Serious Cryptography, 2nd Edition" Ch. 6: Hash Functions
Distributed systems "Designing Data-Intensive Applications" Ch. 5: Replication & Ch. 6: Partitioning
Consensus and fault tolerance "Designing Data-Intensive Applications" Ch. 8: Trouble with Distributed Systems

Additional Resources

Official Documentation

Academic Papers

Implementation References

Community Resources


What Youโ€™ll Have Built

By completing this project, you will have created:

  1. A content-addressed storage system that identifies data by cryptographic hash
  2. A chunking and DAG engine that handles files of any size efficiently
  3. A Kademlia DHT implementation with k-buckets and iterative routing
  4. A peer-to-peer network with connection management and block exchange
  5. A complete CLI tool for adding, getting, and pinning content

More importantly, you'll deeply understand:

  • Why Web3 needs a separate data layer from blockchains
  • How content addressing enables trustless verification
  • How DHTs scale to millions of nodes
  • Why IPFS is the backbone of NFT metadata, dApp frontends, and decentralized storage

This knowledge directly applies to:

  • Building dApps that store data off-chain
  • Understanding how Filecoin incentivizes IPFS storage
  • Designing censorship-resistant systems
  • Working with any content-addressed system (Git, Nix, etc.)