LEARN SRE BY DOING
Learn Site Reliability Engineering (SRE): Engineering Reliability
Goal: Master the discipline of treating operations as a software problem. You will move from “fixing servers” to “building systems that fix themselves.”
What is an SRE?
Site Reliability Engineering (SRE) is a discipline created by Google. If “DevOps” is the philosophy, “SRE” is the concrete implementation of that philosophy using software engineering tools.
An SRE doesn’t just “keep the site up.” An SRE:
- Quantifies Reliability: Defines exactly how much “downtime” is acceptable (Error Budgets).
- Eliminates Toil: Automates manual, repetitive operational work.
- Practices Observability: Infers the internal state of the system from its outputs (Logs, Metrics, Traces).
- Embraces Failure: Builds systems that expect things to break and recover automatically.
After completing these projects, you will be able to:
- Define and measure SLIs and SLOs.
- Build monitoring stacks from scratch.
- Write code that automates infrastructure repair.
- Conduct “Chaos Engineering” experiments.
- Perform blameless post-mortems.
Core Concept Analysis
The SRE Hierarchy of Needs
To be an SRE, you must master the stack from bottom to top:
- Monitoring: Is it working?
- Incident Response: How do we fix it when it breaks?
- Post-Mortem: How do we prevent it from happening again?
- Testing/Release: How do we change it safely?
- Capacity Planning: Will it work next month?
- Development: Building internal tools to do the above.
Project List
Projects are ordered from “Observability Basics” to “Automation” to “Resilience Engineering.”
Project 1: The “Golden Signals” Monitor
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Go (standard for SRE tools)
- Alternative Programming Languages: Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Observability / Metrics
- Software or Tool: Prometheus, Grafana, Node Exporter
- Main Book: “Site Reliability Engineering” (The Google Book) - Chapter on Monitoring
What you’ll build: A full monitoring stack that tracks the “Four Golden Signals” (Latency, Traffic, Errors, and Saturation) of a dummy web application. You will set up Prometheus to scrape metrics and build a Grafana dashboard that visualizes them.
Why it teaches SRE: You can’t fix what you can’t measure. SREs rely on time-series data to know if a system is healthy. This project teaches you the difference between “The server is up” (binary) and “The server is slow” (latency).
Core challenges you’ll face:
- Instrumentation: Modifying code to expose metrics (/metrics endpoint).
- Querying: Learning PromQL (Prometheus Query Language) to calculate rates and percentiles.
- Saturation: Defining what “100% full” actually means for a CPU or Memory.
Key Concepts:
- The 4 Golden Signals: Latency, Traffic, Errors, Saturation.
- Pull vs Push Monitoring: How Prometheus scrapes targets.
- Percentiles (p95, p99): Why averages lie.
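To see why averages lie, here is a minimal sketch with made-up latency samples. The numbers and the helper name `percentile` are illustrative assumptions, and the nearest-rank method used here is only one way to compute a percentile (Prometheus estimates quantiles from histogram buckets instead).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 119.0, a value no user actually saw
p50 = percentile(latencies_ms, 50)               # 20
p95 = percentile(latencies_ms, 95)               # 20
p99 = percentile(latencies_ms, 99)               # 2000
```

The mean suggests a mildly slow service, while p99 shows that 1 in 100 requests takes two full seconds. That is why latency dashboards usually plot high percentiles rather than the average.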
Difficulty: Beginner/Intermediate Time estimate: Weekend Pre-requirements: Basic Linux and Docker knowledge.
Real world outcome: You create a dashboard. You stress-test your app. You see the “Latency” graph spike and the “Saturation” gauge turn red before the app actually crashes.
Implementation Hints:
Run a simple HTTP server in Go or Python. Use the official Prometheus client library for your language (prometheus_client for Python, client_golang for Go) to add counters (for requests) and histograms (for duration). Run Prometheus in Docker. Configure prometheus.yml to scrape your app.
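Before reaching for a client library, it can help to see what a scrape actually returns. The sketch below hand-rolls a tiny counter and renders it in the Prometheus text exposition format. The `Counter` class here is a teaching toy written for this example, not the API of prometheus_client, and it skips details such as label escaping.

```python
import threading

class Counter:
    """A tiny thread-safe counter keyed by label values (illustrative only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, *label_values, amount=1):
        if len(label_values) != len(self.label_names):
            raise ValueError("wrong number of label values")
        with self._lock:
            self._values[label_values] = self._values.get(label_values, 0) + amount

    def render(self):
        """Render the counter as text that a /metrics endpoint could return."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for label_values, value in sorted(self._values.items()):
                labels = ",".join(
                    f'{k}="{v}"' for k, v in zip(self.label_names, label_values)
                )
                lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"

# Hypothetical metric name and labels for the dummy web application.
http_requests_total = Counter("http_requests_total", "Total HTTP requests.", ["method", "code"])
http_requests_total.inc("GET", "200")
http_requests_total.inc("GET", "200")
http_requests_total.inc("GET", "500")
metrics_body = http_requests_total.render()
```

Serving `metrics_body` from a /metrics route is all Prometheus needs to start scraping. In the real project, the official client library does this rendering for you and adds histograms.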
Project 2: The SLO Calculator & Error Budget Tracker
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Python
- Alternative Programming Languages: Excel/Google Sheets (Yes, really, for prototyping)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Service Level Objectives (SLO)
- Software or Tool: Python Scripts, Prometheus API
- Main Book: “The Site Reliability Workbook” - Chapter on SLOs
What you’ll build: A tool that queries your Project 1 metrics and calculates the burn rate of your Error Budget. It will output: “At this rate of errors, we will exhaust our error budget and violate our SLO in 4 days.”
Why it teaches SRE: This is the core mathematical concept of SRE. 100% uptime is impossible and expensive. SREs negotiate a target (e.g., 99.9%) and “spend” the remaining 0.1% (the error budget) on risky deployments.
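The arithmetic behind “spending the remaining 0.1%” is short enough to sketch. The function name `error_budget_minutes` and the 28-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of total downtime an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

# A 28-day window has 40,320 minutes.
budget_three_nines = error_budget_minutes(99.9, 28)   # about 40.3 minutes
budget_four_nines = error_budget_minutes(99.99, 28)   # about 4 minutes
```

Each extra “nine” cuts the budget by a factor of ten, which is why the target is a negotiation rather than a default.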
Core challenges you’ll face:
- Defining the SLI: Turning “User is happy” into a metric (e.g., “percentage of requests served in under 200ms”).
- Time Windows: Calculating availability over a rolling 28-day window.
- Alerting Logic: You don’t alert on every error; you alert on the burn rate.
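Burn rate is the observed error ratio divided by the error ratio the SLO allows. A minimal sketch of that calculation follows; the function names, the two-window check, and any threshold you pass in are assumptions for illustration, not a prescribed policy.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the budget is being consumed."""
    budget_ratio = 1.0 - slo_target
    if budget_ratio <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget_ratio

def hours_until_exhausted(rate, window_days, budget_remaining=1.0):
    """Hours until the remaining budget fraction is gone if the burn rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days * 24 / rate

def should_page(rate_long_window, rate_short_window, threshold):
    """Page only if both a long and a short window exceed the threshold,
    so a brief blip that has already ended does not wake anyone up."""
    return rate_long_window >= threshold and rate_short_window >= threshold
```

For example, 1% of requests failing against a 99.9% SLO is a burn rate of about 10, which would use up a full 28-day budget in roughly 67 hours.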
Key Concepts:
- SLI (Indicator): The metric.
- SLO (Objective): The target value.
- SLA (Agreement): The legal contract/penalty.
- Error Budget: The allowed unreliability.
Difficulty: Intermediate Time estimate: 1 week Pre-requirements: Project 1.
Real world outcome: Your script runs. It sends a Slack notification: “Error Budget Burn Rate is 10x. At this rate the whole monthly budget will be gone in about 3 days. Freeze all deployments.”
Project 3: The “Toil Killer” (Auto-Remediation Bot)
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Go
- Alternative Programming Languages: Bash, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Automation / Scripting
- Software or Tool: Kubernetes API or Docker API
- Main Book: “Automating System Administration with Python”
What you’ll build: A daemon that watches for specific alerts (e.g., “Disk Full”, “Process Stuck”, “High Memory”). When triggered, it automatically executes a “Runbook” script to fix the issue (e.g., clear /tmp, restart the service) without human intervention.
Why it teaches SRE: SREs hate “Toil”—manual, repetitive work that scales linearly with service growth. If you wake up at 3 AM to restart a server, you failed. This project teaches you to code yourself out of a job.
Core challenges you’ll face:
- Safety: Ensuring your bot doesn’t restart all servers at once or restart the same service in an endless loop (crash loops).
- Idempotency: Running the fix script twice shouldn’t break things.
- Logging: The bot must leave a paper trail (“I restarted Service X because of Alert Y”).
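One way to approach the safety and logging challenges is a guard that caps how often the bot may act on the same target. The sketch below is an assumption-laden outline: the `Remediator` class, its parameters, and the alert names are all hypothetical, and the `fix` callable stands in for your runbook script.

```python
import logging
import time

log = logging.getLogger("toil_killer")

class Remediator:
    """Runs a fix for an alert, but refuses to act too often on the same target."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock
        self._history = {}  # target -> timestamps of recent fixes

    def handle(self, alert_name, target, fix):
        now = self.clock()
        recent = [t for t in self._history.get(target, []) if now - t < self.window_seconds]
        if len(recent) >= self.max_actions:
            # Too many fixes already: stop and hand over to a human.
            log.warning("Skipping fix on %s for alert %s: %d fixes in window",
                        target, alert_name, len(recent))
            self._history[target] = recent
            return False
        # Paper trail: what was done and why.
        log.info("Running fix on %s because of alert %s", target, alert_name)
        fix(target)
        recent.append(now)
        self._history[target] = recent
        return True
```

With `Remediator(max_actions=2, window_seconds=600)`, a third restart of the same container within ten minutes is refused and logged instead of feeding a crash loop.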
Key Concepts:
- Toil: Manual, tactical, devoid of enduring value, scales linearly.
- Self-Healing Systems: Automatic recovery.
- Control Loops: Observe -> Orient -> Decide -> Act.
Difficulty: Intermediate Time estimate: 1 week Pre-requirements: Basic scripting.
Real world outcome:
You simulate a memory leak in your app. Instead of crashing, your logs show: [Bot] Detected High Memory. Restarting container... [Success]. You slept through the whole thing.
Project 4: Infrastructure as Code (IaC) Pipeline
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: HCL (Terraform)
- Alternative Programming Languages: Ansible
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Infrastructure Management
- Software or Tool: Terraform, GitHub Actions
- Main Book: “Terraform: Up & Running”
What you’ll build: A Git repository where you define a complete infrastructure (Load Balancer, 2 Web Servers, Database) in code. A CI/CD pipeline shows the planned changes when you open a Pull Request and applies them automatically when it is merged.
Why it teaches SRE: SREs treat infrastructure as software. You don’t click buttons in the AWS console. You verify, version control, and review infrastructure changes just like application code. This prevents “Configuration Drift.”
Core challenges you’ll face:
- State Management: Handling the Terraform state file securely.
- Dependency Graph: Ensuring the database exists before the web server tries to connect.
- Drift: What happens if someone manually changes a setting in the console?
Key Concepts:
- Immutable Infrastructure: Don’t update servers; replace them.
- Declarative vs Imperative: Saying “I want 3 servers” vs “Create 3 servers.”
- GitOps: Using Git as the single source of truth.
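The declarative idea can be sketched as a diff between desired and actual state. This is a conceptual toy, not how Terraform is implemented (its planner also resolves a dependency graph); the function name `plan` and the resource dictionaries are assumptions for illustration.

```python
def plan(desired, actual):
    """Compare desired state to actual state and return the actions needed.

    Both arguments map a resource name to its configuration dict.
    """
    actions = []
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(("create", name))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(("update", name))  # drift or an intended change
    return actions

# Hypothetical state: someone resized web-1 by hand, and web-3 does not exist yet.
desired = {"web-1": {"size": "small"}, "web-2": {"size": "small"}, "web-3": {"size": "small"}}
actual = {"web-1": {"size": "large"}, "web-2": {"size": "small"}}
actions = plan(desired, actual)  # [("create", "web-3"), ("update", "web-1")]
```

Once the actions are applied, running `plan` again returns an empty list. You describe the end state, and the tool works out the steps, including undoing manual drift.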
Difficulty: Intermediate Time estimate: 1 week Pre-requirements: AWS/Azure/GCP Free Tier.
Real world outcome:
You change count = 2 to count = 5 in a text file. You commit. Minutes later, 3 new servers appear in your cloud console automatically.
Project 5: The Chaos Monkey Clone
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Chaos Engineering / Resilience
- Software or Tool: Docker / Kubernetes
- Main Book: “Chaos Engineering” by Casey Rosenthal
What you’ll build: A script that connects to your environment and randomly kills processes, deletes network routes, or pauses containers during business hours.
Why it teaches SRE: “Hope is not a strategy.” Systems typically fail in complex, unpredictable ways. SREs proactively inject failure to verify that the redundant systems (load balancers, retries) actually work.
Core challenges you’ll face:
- Blast Radius: Limiting the chaos so you don’t take down all of production.
- Observability: Verifying that the system recovered.
- Randomness: Generating unpredictable failure modes.
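Limiting the blast radius can start with how victims are chosen. The sketch below picks random instances but never more than a set fraction of any service, and never the last instance. The function name, the service map, and the 25% limit are hypothetical.

```python
import random

def pick_victims(instances_by_service, max_fraction, rng):
    """Pick instances to kill, at most max_fraction of each service,
    always leaving at least one instance per service running."""
    victims = []
    for service, instances in sorted(instances_by_service.items()):
        limit = min(int(len(instances) * max_fraction), len(instances) - 1)
        if limit > 0:
            victims.extend(rng.sample(sorted(instances), limit))
    return victims

# Hypothetical fleet: four web servers and a single database.
fleet = {
    "web": ["web-1", "web-2", "web-3", "web-4"],
    "db": ["db-1"],
}
rng = random.Random(42)  # seeded so an experiment can be replayed
victims = pick_victims(fleet, max_fraction=0.25, rng=rng)
```

Here exactly one web server is selected and the lone database is never touched. Passing in a seeded `random.Random` keeps the failure unpredictable to the system but reproducible for your post-mortem.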
Key Concepts:
- Resilience: The ability to recover from failure.
- Fallbacks: Degraded modes of operation.
- Game Days: Scheduled times to break things and practice response.
Difficulty: Advanced Time estimate: Weekend Pre-requirements: A running application with redundancy (Project 4).
Real world outcome: You run the Chaos Monkey. It kills one of your web servers. On your dashboard (Project 1), you watch the load balancer drop that node and route traffic to the healthy ones with zero user errors.
Project 6: Log Aggregator and Analyzer
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Configuration (Fluentd/Logstash) + Querying (Lucene)
- Alternative Programming Languages: Go (to write a custom shipper)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Logging / Debugging
- Software or Tool: ELK Stack (Elasticsearch, Logstash, Kibana) or Loki
- Main Book: “Logging and Log Management”
What you’ll build: A centralized logging system. You will configure multiple applications to ship their logs (structured JSON) to a central collector, which indexes them. You will build dashboards to search for error patterns across the entire fleet.
Why it teaches SRE: When a distributed system fails, you can’t SSH into 100 servers to check text files. You need centralized, searchable logs to trace a request ID across multiple services.
Core challenges you’ll face:
- Structured Logging: Parsing messy text logs vs JSON.
- Volume: Logs can eat disk space fast. You need rotation/retention policies.
- Correlation: Linking logs from App A to App B using a Trace ID.
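Structured logging with a shared Trace ID can be sketched with the standard `logging` module alone. The field names (`service`, `trace_id`) and the logger name are assumptions; pick whatever schema your collector expects.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": record.created,  # seconds since the epoch
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed via `extra` become attributes on the log record.
logger.info("payment failed", extra={"service": "checkout", "trace_id": "abc123"})
```

If App A and App B both log the same `trace_id`, a single search in your dashboard returns the whole story of one request, with no regex parsing of free text.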
Key Concepts:
- Structured vs Unstructured Data: JSON vs Text.
- Sampling: Do you really need to log every success?
- Retention: Hot vs Cold storage.
Difficulty: Intermediate Time estimate: 1-2 weeks Pre-requirements: Docker knowledge.
Real world outcome:
You search error_code: 500 in your dashboard and see exactly which microservice caused the error and on which specific host, correlated with the user’s ID.
Project 7: The “Canary” Deployment Controller
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Go
- Alternative Programming Languages: Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Release Engineering
- Software or Tool: Kubernetes / Istio or Nginx
- Main Book: “Continuous Delivery” by Jez Humble
What you’ll build: A deployment controller that rolls out a new version of an app to only 5% of users. It watches the error rate (SLIs). If errors stay low, it increases traffic to 20%, then 50%, then 100%. If errors spike, it automatically rolls back.
Why it teaches SRE: Changing things is the #1 cause of outages. SREs minimize this risk using progressive delivery. You don’t just “push to prod”; you carefully expose the new version to a small blast radius.
Core challenges you’ll face:
- Traffic Splitting: Configuring a load balancer to send 5% of traffic to version B.
- Automated Decision Making: Writing logic that compares error rates of V1 vs V2.
- Rollback Speed: Reverting traffic instantly if things go wrong.
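The automated decision can be reduced to one function that looks at both versions’ error counts and returns the next traffic weight. This is a sketch under stated assumptions: the thresholds (`max_ratio`, `min_requests`, the 0.1% floor) are illustrative defaults, not recommended values.

```python
STEPS = [5, 20, 50, 100]  # canary traffic percentages for the rollout

def next_weight(current, canary_errors, canary_total, base_errors, base_total,
                max_ratio=2.0, min_requests=100):
    """Return the next canary traffic percentage: promote, hold, or 0 to roll back."""
    if canary_total < min_requests:
        return current  # not enough data yet: hold
    canary_rate = canary_errors / canary_total
    base_rate = base_errors / base_total if base_total else 0.0
    # A small absolute floor keeps a baseline of exactly zero errors from
    # turning a single canary error into a rollback.
    allowed = max(base_rate * max_ratio, 0.001)
    if canary_rate > allowed:
        return 0  # roll back
    later = [step for step in STEPS if step > current]
    return later[0] if later else current
```

Calling this on a timer and writing the result into the load balancer config gives you the core control loop. Waiting for `min_requests` matters because at 5% traffic the canary sees few requests, and early error rates are noisy.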
Key Concepts:
- Canary Releases: Like a canary in a coal mine, a small share of traffic meets the new version first and warns you of danger.
- Blue/Green Deployment: Zero downtime switching.
- Progressive Delivery: Gradual rollouts.
Difficulty: Advanced Time estimate: 2 weeks Pre-requirements: Project 1 (Metrics) and Project 4 (IaC).
Real world outcome: You deploy a “broken” version of your app. The controller sends 5% of traffic to it. It detects a spike in 500 errors. It immediately cuts canary traffic to 0% and alerts you. The other 95% of users never noticed.
Project 8: Distributed Tracing Implementation
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Python/Go (App instrumentation)
- Alternative Programming Languages: Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Observability / APM
- Software or Tool: Jaeger or OpenTelemetry
- Main Book: “Distributed Systems Observability”
What you’ll build: Three microservices (Frontend -> API -> DB). You will instrument them with OpenTelemetry so that they pass “Trace Context” headers to each other. You will visualize the “Waterfall” of a request as it hops between services to find latency bottlenecks.
Why it teaches SRE: In microservices, “Why is it slow?” is a hard question. Tracing shows you exactly where the time went (e.g., 10ms in Frontend, 500ms in DB).
Core challenges you’ll face:
- Context Propagation: Ensuring the Trace ID is passed in HTTP headers.
- Span Management: Defining start/stop times for functions.
- Overhead: Tracing adds a tiny bit of latency; you need to manage sampling rates.
Key Concepts:
- Spans & Traces: Units of work.
- Context Propagation: Passing IDs across network boundaries.
- RED Method: Rate, Errors, Duration.
Difficulty: Advanced Time estimate: 1 week Pre-requirements: Basic microservices knowledge.
Real world outcome: You see a request taking 2 seconds. The Jaeger UI shows a giant bar for the “SQL Query” span. You identified the bottleneck immediately.
Project 9: Capacity Planner & Load Tester
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Python (Locust) or JavaScript (k6 test scripts; the k6 tool itself is written in Go)
- Alternative Programming Languages: Lua (Wrk)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Capacity Planning / Scalability
- Software or Tool: K6 / Locust
- Main Book: “The Art of Capacity Planning”
What you’ll build: A load testing suite that hammers your application with simulated traffic. You will correlate traffic volume with resource usage (CPU/RAM) to generate a formula: “For every 1000 users, we need X CPU cores.”
Why it teaches SRE: SREs must predict future needs. You need to know the breaking point of your system before Black Friday arrives. This project teaches you the non-linear relationship between load and latency.
Core challenges you’ll face:
- Realistic Traffic: Simulating user behavior (login, browse, buy), not just hitting the homepage.
- Resource Saturation: Identifying the bottleneck (is it DB connections? CPU? Network bandwidth?).
- Cost Analysis: Calculating the cost per 1000 users.
Key Concepts:
- Stress Testing: Finding the breaking point.
- Soak Testing: Finding leaks over time.
- Little’s Law: Relationship between concurrency, throughput, and latency.
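Little’s Law says that average concurrency equals throughput multiplied by average latency. A minimal sketch, with hypothetical function names and numbers:

```python
def in_flight_requests(throughput_rps, latency_seconds):
    """Little's Law: average concurrency = arrival rate * average time in system."""
    return throughput_rps * latency_seconds

def max_throughput(concurrency_limit, latency_seconds):
    """The same law rearranged: the most requests per second a fixed pool can sustain."""
    return concurrency_limit / latency_seconds

# Hypothetical service: 500 requests/second, each holding a DB connection for 200 ms.
needed_connections = in_flight_requests(500, 0.2)  # about 100 connections in use
ceiling_rps = max_throughput(100, 0.2)             # about 500 requests/second
```

This explains a common load-test result: with a pool of 100 connections and 200 ms queries, throughput tops out near 500 RPS regardless of CPU headroom. Raising the limit or shortening the query moves the ceiling.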
Difficulty: Intermediate Time estimate: Weekend Pre-requirements: Project 1 (Metrics).
Real world outcome: You determine that your app crashes at 500 requests/second due to database connection limits. You fix the config, and now it scales to 2000 RPS.
Project 10: The “On-Call” Simulator (Incident Response Bot)
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Python
- Alternative Programming Languages: Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Incident Management
- Software or Tool: PagerDuty API (Simulated) / Slack Bot
- Main Book: “PagerDuty Incident Response Documentation”
What you’ll build: A system that generates random alerts based on your metrics. It pages you (via Telegram/Slack). You must acknowledge the alert within 5 minutes, or it escalates. It also provides a CLI to open an “Incident” channel and record a timeline.
Why it teaches SRE: Tech skills aren’t enough. You need to manage the chaos of an outage. This teaches you the workflow: Detect -> Acknowledge -> Investigate -> Mitigate -> Resolve -> Post-Mortem.
Core challenges you’ll face:
- Alert Fatigue: Tuning alerts so you aren’t woken up for non-issues.
- Escalation Policies: If you don’t answer, who gets called next?
- ChatOps: Managing an incident entirely from Slack/Discord.
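An escalation policy is, at its core, a function of elapsed time and whether anyone has acknowledged. The sketch below assumes a simple ordered chain and a fixed timeout per level; the names and the role labels are illustrative.

```python
def who_to_page(minutes_since_alert, acknowledged, chain, ack_timeout_minutes=5):
    """Return the responder who should currently be paged, or None if acknowledged.

    chain is an ordered list with the primary on-call first. Each level gets
    ack_timeout_minutes before the alert escalates to the next one, and the
    last level keeps being paged until someone acknowledges.
    """
    if acknowledged:
        return None
    if not chain:
        raise ValueError("escalation chain is empty")
    level = int(minutes_since_alert // ack_timeout_minutes)
    return chain[min(level, len(chain) - 1)]

# Hypothetical rotation.
chain = ["primary", "secondary", "engineering-manager"]
```

With the 5-minute timeout, `who_to_page(3, False, chain)` stays with the primary, `who_to_page(5, False, chain)` moves to the secondary, and anything after 10 minutes lands on the last level. Recording the moment `acknowledged` flips to true gives you the data for MTTA.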
Key Concepts:
- MTTA (Mean Time to Acknowledge).
- MTTR (Mean Time to Resolve).
- Blameless Post-Mortems.
Difficulty: Advanced Time estimate: 1-2 weeks Pre-requirements: Projects 1 and 2.
Real world outcome:
The bot simulates an outage. You get a ping. You type /incident start in Slack. The bot creates a channel, invites the team, and starts a timer. You fix it, and the bot generates a draft Post-Mortem document.
Project 11: Certificate Rotator (Security Reliability)
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Bash
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Security / Automation
- Software or Tool: Let’s Encrypt / Vault
- Main Book: “Building Secure and Reliable Systems” (Google)
What you’ll build: A service that automatically monitors SSL/TLS certificate expiration. Before a certificate expires, it generates a new private key, requests a new cert from a CA (like Let’s Encrypt), and gracefully reloads the web servers with the new cert without dropping connections.
Why it teaches SRE: Expired certificates are a classic cause of embarrassing downtime. SREs automate security maintenance to prevent this.
Core challenges you’ll face:
- Zero Downtime: Reloading Nginx/Apache configs without killing active user sessions.
- Secret Management: Storing the private key securely.
- Validation: Proving to the CA that you own the domain (DNS-01 or HTTP-01 challenge).
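The monitoring half of the rotator starts with one question: how many days are left? The sketch below uses the standard library’s `ssl.cert_time_to_seconds` to parse the `notAfter` string that `ssl.SSLSocket.getpeercert()` reports. The function names and the 30-day renewal margin are assumptions for illustration.

```python
import ssl

def days_until_expiry(not_after, now_epoch):
    """Days left on a certificate, given its notAfter string,
    for example 'Jan  5 09:34:43 2018 GMT'."""
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / 86400

def needs_renewal(not_after, now_epoch, renew_before_days=30):
    """True once the certificate is inside the renewal margin (or already expired)."""
    return days_until_expiry(not_after, now_epoch) <= renew_before_days
```

Passing the current time in as `now_epoch` (for instance from `time.time()`) instead of reading the clock inside the function makes the check easy to test with fixed dates. Renewing well before expiry leaves room for a failed attempt to be retried.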
Key Concepts:
- PKI (Public Key Infrastructure).
- Short-lived Credentials: Why shorter validity is safer.
- Automated Renewal.
Difficulty: Advanced Time estimate: 1 week Pre-requirements: Basic Crypto/Web knowledge.
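Real world outcome: You set your cert validity to 24 hours (for example with Vault or another private CA, since a public CA such as Let’s Encrypt decides the lifetime of the certificates it issues). Every day at midnight, the logs show the cert updating automatically. You never have to worry about an “Expired Certificate” warning screen again.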
Project 12: Distributed Rate Limiter
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Go (Redis)
- Alternative Programming Languages: Lua (Nginx)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Traffic Management / Protection
- Software or Tool: Redis (Token Bucket Algorithm)
- Main Book: “System Design Interview” (Alex Xu) - Rate Limiter Chapter
What you’ll build: A sidecar service that intercepts all traffic to your API. It enforces limits (e.g., “User X can only make 10 requests per minute”). It shares this state across multiple load-balanced servers using Redis.
Why it teaches SRE: Protecting your service from abuse (intentional or accidental) is vital for reliability. One noisy neighbor shouldn’t take down the platform for everyone else.
Core challenges you’ll face:
- Race Conditions: Two requests coming in at the exact same millisecond.
- Distributed State: Synchronizing counters across multiple servers.
- Latency: The check must be super fast (<2ms).
Key Concepts:
- Token Bucket / Leaky Bucket Algorithms.
- Noisy Neighbor Problem.
- Load Shedding: Dropping traffic to save the system.
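The token bucket itself fits in a few lines. The sketch below is a single-process, in-memory version meant to show the algorithm only; the class and parameter names are assumptions. In the distributed version, the same read-refill-decrement step has to run atomically in Redis (a server-side Lua script is one common way to do that), otherwise two servers can spend the same token.

```python
import time

class TokenBucket:
    """In-memory token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start full so bursts are allowed
        self.updated = clock()

    def allow(self, cost=1):
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with HTTP 429
```

A limit of “10 requests per minute” maps to `TokenBucket(capacity=10, refill_per_second=10 / 60)`, with one bucket per user. Refilling lazily on each call avoids a background timer, which is also what makes the logic easy to port into a single atomic Redis operation.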
Difficulty: Advanced Time estimate: 1 week Pre-requirements: Redis knowledge.
Real world outcome:
You write a script to spam your API. After 10 requests, the API starts returning 429 Too Many Requests. Your CPU usage stays flat despite the spam.
Project 13: Capstone - The “Anti-Fragile” Platform
- File: LEARN_SRE_BY_DOING.md
- Main Programming Language: Multiple (Go, Python, HCL)
- Alternative Programming Languages: Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: All SRE Domains
- Software or Tool: Kubernetes, Prometheus, Jaeger, Terraform, ArgoCD
- Main Book: All of the above.
What you’ll build: You will combine all previous projects into a single platform.
- Infrastructure: Provisioned by Terraform.
- Deployment: GitOps pipeline using ArgoCD.
- Observability: Full Golden Signals dashboard + Tracing.
- Reliability: Horizontal Pod Autoscaler + Rate Limiting.
- Chaos: A background job that kills pods randomly.
- Alerting: Connected to a Slack channel.
Why it teaches SRE: This is the job. You are architecting a living, breathing system that resists failure.
Core challenges you’ll face:
- Complexity Management: Keeping all these moving parts working together.
- Cost: Running a K8s cluster (use Minikube/Kind to save money).
- Tuning: Adjusting alerts and scaling thresholds so they don’t fight each other.
Key Concepts:
- System Integration.
- Operational Excellence.
- Anti-Fragility.
Difficulty: Master Time estimate: 1 month+ Pre-requirements: All previous projects.
Real world outcome: You push code. It deploys. You cut the internet to one node. The system self-heals. You spam the API. It rate-limits you. You introduce a bug. The canary rollback saves you. You are an SRE.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Golden Signals Monitor | Beginner | Weekend | ⭐⭐⭐ | ⭐⭐ |
| SLO Calculator | Intermediate | 1 Week | ⭐⭐⭐⭐⭐ | ⭐ |
| Toil Killer Bot | Intermediate | 1 Week | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| IaC Pipeline | Intermediate | 1 Week | ⭐⭐⭐⭐ | ⭐⭐ |
| Chaos Monkey | Advanced | Weekend | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Log Aggregator | Intermediate | 1-2 Weeks | ⭐⭐ | ⭐⭐ |
| Canary Controller | Advanced | 2 Weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Distributed Tracing | Advanced | 1 Week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Capacity Planner | Intermediate | Weekend | ⭐⭐⭐ | ⭐⭐ |
| On-Call Simulator | Advanced | 1-2 Weeks | ⭐⭐⭐ | ⭐⭐⭐ |
| Cert Rotator | Advanced | 1 Week | ⭐⭐ | ⭐⭐⭐ |
| Rate Limiter | Advanced | 1 Week | ⭐⭐⭐ | ⭐⭐⭐ |
| Anti-Fragile Platform | Master | 1 Month+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommendation
Where to Start?
Start with Project 1 (Golden Signals Monitor). Observability is the foundation of SRE. If you cannot see it, you cannot manage it.
The Most Important Project
Project 2 (SLO Calculator) is what distinguishes an SRE from a Sysadmin. Understanding the math of reliability (Error Budgets) is the “Business Logic” of the SRE role.
For the Programmer
Project 3 (Toil Killer) and Project 12 (Rate Limiter) are pure coding challenges that solve operational problems.
For the Sysadmin
Project 4 (IaC) and Project 5 (Chaos Monkey) will help you transition from manual operations to automated reliability engineering.
Summary: All Projects
| # | Project Name | Main Language |
|---|---|---|
| 1 | The “Golden Signals” Monitor | Go |
| 2 | The SLO Calculator & Error Budget Tracker | Python |
| 3 | The “Toil Killer” (Auto-Remediation Bot) | Go |
| 4 | Infrastructure as Code (IaC) Pipeline | HCL (Terraform) |
| 5 | The Chaos Monkey Clone | Python |
| 6 | Log Aggregator and Analyzer | Config (ELK) |
| 7 | The “Canary” Deployment Controller | Go |
| 8 | Distributed Tracing Implementation | Python/Go |
| 9 | Capacity Planner & Load Tester | Python |
| 10 | The “On-Call” Simulator | Python |
| 11 | Certificate Rotator | Go |
| 12 | Distributed Rate Limiter | Go |
| 13 | The “Anti-Fragile” Platform (Capstone) | Multiple |
Essential Resources
Books
- “Site Reliability Engineering” (Google) - The foundational text.
- “The Site Reliability Workbook” (Google) - Practical implementation guide.
- “Seeking SRE” by David N. Blank-Edelman - Diverse perspectives on the role.
- “Infrastructure as Code” by Kief Morris - Essential for modern ops.
- “Prometheus: Up & Running” by Brian Brazil - The guide to metrics.
Tools to Master
- Prometheus & Grafana: The industry standard for monitoring.
- Terraform: The industry standard for infrastructure.
- Kubernetes: The operating system of the cloud.
- Go (Golang): The language of cloud-native tools.
- Docker: The unit of deployment.
Concepts to Grok
- Idempotency: Doing it twice yields the same result.
- Ephemerality: Servers are cattle, not pets.
- Blast Radius: Limiting the impact of failure.
- MTTR/MTBF: Metrics of failure and recovery.
Reliability is not a feature; it is the outcome of engineering.