Project 7: Zero-Downtime Rolling Updates
Orchestrate staged application rollout with host draining, health gates, and controlled failure domains.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 20-32 hours |
| Main Programming Language | YAML orchestration |
| Coolness | Level 4 |
| Business Potential | 3. Service & Support |
| Prerequisites | P03, P05, P06 |
| Key Topics | serial batches, delegation, health checks, recovery |
1. Learning Objectives
- Design canary-to-full rollout sequencing.
- Use delegated tasks for load balancer drain/rejoin.
- Gate host progression on objective health checks.
- Recover safely from mid-rollout failures.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Failure-Domain-Aware Orchestration
Fundamentals
Rolling updates reduce blast radius by limiting concurrent changes: the `serial` keyword controls batch size, and health gates decide whether the rollout progresses.
Deep Dive into the concept
The dangerous assumption is that sequential execution alone ensures safety. Real safety requires drain-before-change, post-change readiness checks, and deterministic rollback/hold behavior when gates fail.
Mental model diagram
batch select -> drain -> deploy -> verify -> rejoin -> next batch
Where you’ll apply it
P07 and the capstone.
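As a minimal sketch of batch control (the `web` group and the batch sizes are illustrative, not prescribed), `serial` and `max_fail_percentage` together express both the batch size and the stop condition:

```yaml
# rolling_update.yml -- batch control only; drain/deploy tasks are sketched elsewhere.
# The "web" group and serial values are illustrative assumptions.
- name: Rolling update with canary-first batches
  hosts: web
  serial:
    - 1        # canary: one host first
    - 25%      # then a quarter of the fleet per batch
  max_fail_percentage: 0   # any failed host in a batch aborts the rollout
  tasks:
    - name: Deploy release
      ansible.builtin.debug:
        msg: "deploy step placeholder for {{ inventory_hostname }}"
```

With `max_fail_percentage: 0`, a single gate failure stops progression to the next batch, which is the deterministic hold behavior described above.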
3. Project Specification
3.1 What You Will Build
A rolling deployment playbook that:
- drains each target host from the load balancer
- deploys artifact/config
- validates readiness
- rejoins host
- repeats by batch
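The loop above can be sketched as a single play; the LB host `lb01`, the `web_pool` backend, the artifact path, and the `/ready` endpoint are all placeholder assumptions to adapt, not a prescribed implementation:

```yaml
# Per-batch sequence: drain -> deploy -> verify -> rejoin.
# "lb01", "web_pool", and the health URL are illustrative.
- hosts: web
  serial: 1
  tasks:
    - name: Drain host from LB
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: web_pool
        drain: true
      delegate_to: lb01

    - name: Deploy release
      ansible.builtin.unarchive:
        src: files/release.tar.gz
        dest: /opt/app

    - name: Restart application
      ansible.builtin.service:
        name: app
        state: restarted

    - name: Health gate
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/ready"
        status_code: 200
      register: ready
      retries: 10
      delay: 6
      until: ready.status == 200

    - name: Rejoin host to LB
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: web_pool
      delegate_to: lb01
```

Both LB tasks run on `lb01` via `delegate_to` while the deploy and gate run on the target host, which is the delegation split named in the solution architecture.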
3.2 Functional Requirements
- Uses explicit batch strategy.
- Uses delegated load balancer operations.
- Blocks progression on failed readiness check.
- Produces clear per-host rollout evidence.
3.4 Example Output
$ ansible-playbook -i inventory_cloud.yml rolling_update.yml
TASK [Drain host from LB] .... changed
TASK [Deploy release] ........ changed
TASK [Health gate] ........... ok
TASK [Rejoin host to LB] ..... changed
3.7 Real World Outcome
3.7.1 How to Run
ansible-playbook -i inventory_cloud.yml rolling_update.yml
3.7.2 Golden Path Demo
Continuous client checks show no request failure spike.
3.7.3 Failure Demo
One host fails readiness; rollout pauses with host isolated from pool and clear failure output.
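One hedged way to get this "pause with the host isolated" behavior is a `block`/`rescue` that deliberately skips the rejoin step and fails loudly (module choices and names here are illustrative):

```yaml
# Failure handling sketch: on a failed gate, leave the host drained
# (never rejoin it) and stop the play with a clear message.
- name: Deploy and verify, holding the host out of the pool on failure
  block:
    - name: Deploy release
      ansible.builtin.unarchive:
        src: files/release.tar.gz
        dest: /opt/app
    - name: Health gate
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/ready"
        status_code: 200
      register: ready
      retries: 10
      delay: 6
      until: ready.status == 200
  rescue:
    - name: Report isolation and abort rollout
      ansible.builtin.fail:
        msg: >-
          {{ inventory_hostname }} failed readiness and remains drained
          from the pool; rollout halted for operator review.
```

Because the rejoin task sits after the block, a rescue path ends the play with the failed host still drained, matching the failure demo above.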
4. Solution Architecture
LB control (delegated) <-> web host update (targeted) <-> health validation
5. Implementation Guide
5.3 The Core Question You’re Answering
“How do I deliver changes across a live fleet without violating availability goals?”
5.4 Concepts You Must Understand First
- `serial` batch semantics.
- Delegated task context and concurrency.
- Readiness vs liveness checks.
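A sketch of serializing delegated LB writes — the concurrency concern above. `throttle` is a real task keyword; the module and host names are illustrative assumptions:

```yaml
# Even with serial > 1, delegated tasks from several hosts can race
# on the LB; throttle: 1 forces one delegated write at a time.
- name: Drain host from LB
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
    backend: web_pool
  delegate_to: lb01
  throttle: 1
```

Without throttling, two hosts in the same batch can interleave their pool edits on `lb01` and leave the pool in an inconsistent state — the "parallel LB writes" pitfall listed in section 7.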
5.5 Questions to Guide Your Design
- What is your safe initial canary size?
- Which gate failure should stop rollout immediately?
5.6 Thinking Exercise
For a 10-node service, model capacity impact with batch size 2 if one host fails permanently during rollout.
5.7 Interview Questions
- Why is health gating mandatory in rolling updates?
- How do delegated race conditions happen?
- What is your rollback trigger policy?
5.8 Hints in Layers
- Hint 1: Start with batch size 1 canary.
- Hint 2: Add explicit timeout policy.
- Hint 3: Record LB state transitions.
- Hint 4: Run client-side continuous probe during deployment.
6. Testing Strategy
- Clean golden-path rollout.
- Induced readiness failure on one host.
- Rerun from failed midpoint and verify safe recovery.
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| weak health gate | bad host rejoins | stronger readiness checks |
| parallel LB writes | inconsistent pool state | throttle delegated operations |
| oversized batch | visible outage | reduce serial and add canary |
8. Extensions & Challenges
- Add staged percentages (1, 20%, 100%).
- Add automated rollback handler on failed gate.
- Add SLO-aware pause/continue decision step.
9. Real-World Connections
This is the same control pattern used in high-availability update pipelines in mature platform teams.
10. Resources
- Ansible strategies docs
- Delegation docs
- SRE release engineering chapters
11. Self-Assessment Checklist
- I can explain batch progression and stop conditions.
- I can prove host drain/rejoin sequencing.
- I can recover safely from mid-rollout failure.
12. Submission / Completion Criteria
- Minimum: successful staged rollout.
- Full: includes controlled failure and recovery evidence.
- Excellence: includes automated rollback and SLO gating.