Project 7: Zero-Downtime Rolling Updates

Orchestrate staged application rollout with host draining, health gates, and controlled failure domains.

Quick Reference

Attribute | Value
Difficulty | Level 3
Time Estimate | 20-32 hours
Main Programming Language | YAML orchestration
Coolness | Level 4
Business Potential | 3. Service & Support
Prerequisites | P03, P05, P06
Key Topics | serial batches, delegation, health checks, recovery

1. Learning Objectives

  1. Design canary-to-full rollout sequencing.
  2. Use delegated tasks for load balancer drain/rejoin.
  3. Gate host progression on objective health checks.
  4. Recover safely from mid-rollout failures.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Failure-Domain-Aware Orchestration

Fundamentals
Rolling updates reduce blast radius by limiting how many hosts change at once: serial controls batch size, and health gates decide whether the rollout progresses to the next batch.

Deep Dive into the concept
The dangerous assumption is that sequential execution alone ensures safety. Real safety requires draining before the change, readiness verification after it, and deterministic rollback-or-hold behavior when a gate fails.

Mental model diagram

batch select -> drain -> deploy -> verify -> rejoin -> next batch
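
The pipeline above can be sketched as a single Ansible play. This is a minimal sketch, not a complete solution: the inventory groups (webservers, lb), the lb-ctl command, the port, and the /ready endpoint are all assumptions to replace with your environment's equivalents.

```yaml
# Minimal rolling-update skeleton. Group names (webservers, lb),
# the lb-ctl CLI, and the /ready endpoint are hypothetical.
- name: Rolling update with drain and health gate
  hosts: webservers
  serial: 1                  # batch size: one host at a time (canary-style)
  max_fail_percentage: 0     # any failure in a batch stops the rollout
  pre_tasks:
    - name: Drain host from LB
      ansible.builtin.command: lb-ctl drain {{ inventory_hostname }}
      delegate_to: "{{ groups['lb'][0] }}"
  tasks:
    - name: Deploy release
      ansible.builtin.copy:
        src: "releases/{{ release_version }}/"
        dest: /opt/app/
  post_tasks:
    - name: Health gate - wait for readiness
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/ready"
        status_code: 200
      register: ready
      retries: 10
      delay: 5
      until: ready.status == 200
    - name: Rejoin host to LB
      ansible.builtin.command: lb-ctl enable {{ inventory_hostname }}
      delegate_to: "{{ groups['lb'][0] }}"
```

Note the ordering guarantee serial gives you: pre_tasks, tasks, and post_tasks all complete for one batch before the next batch starts.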

Where you’ll apply it
P07 and the capstone.


3. Project Specification

3.1 What You Will Build

A rolling deployment playbook that:

  • drains one host from the load balancer
  • deploys artifact/config
  • validates readiness
  • rejoins host
  • repeats by batch

3.2 Functional Requirements

  1. Uses explicit batch strategy.
  2. Uses delegated load balancer operations.
  3. Blocks progression on failed readiness check.
  4. Produces clear per-host rollout evidence.
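
Requirement 4 (per-host evidence) can be met by appending one line per host to a controller-side log. A sketch; the log path and line format are illustrative choices, and ansible_date_time requires fact gathering to be enabled:

```yaml
# Append one evidence line per host on the controller.
# Path and line format are hypothetical.
- name: Record rollout evidence for this host
  ansible.builtin.lineinfile:
    path: "{{ playbook_dir }}/rollout_evidence.log"
    line: "{{ ansible_date_time.iso8601 }} {{ inventory_hostname }} release={{ release_version }} status=verified"
    create: true
  delegate_to: localhost
```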

3.4 Example Output

$ ansible-playbook -i inventory_cloud.yml rolling_update.yml
TASK [Drain host from LB] .... changed
TASK [Deploy release] ........ changed
TASK [Health gate] ........... ok
TASK [Rejoin host to LB] ..... changed

3.7 Real World Outcome

3.7.1 How to Run

ansible-playbook -i inventory_cloud.yml rolling_update.yml

3.7.2 Golden Path Demo

Continuous client checks show no spike in request failures during the rollout.

3.7.3 Failure Demo

One host fails readiness; rollout pauses with host isolated from pool and clear failure output.


4. Solution Architecture

LB control (delegated) <-> web host update (targeted) <-> health validation

5. Implementation Guide

5.3 The Core Question You’re Answering

“How do I deliver changes across a live fleet without violating availability goals?”

5.4 Concepts You Must Understand First

  1. serial batch semantics.
  2. Delegated task context and concurrency.
  3. Readiness vs liveness checks.
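
The readiness/liveness distinction is the one most worth internalizing: liveness says the process is up, readiness says the host can take real traffic, and the rollout must gate on the latter. A sketch with the uri module; the /live and /ready endpoint paths are assumptions about the application:

```yaml
# Liveness: the process answers at all. Do NOT gate rejoin on this.
- name: Liveness check
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:8080/live"
    status_code: 200

# Readiness: the host can serve real traffic (dependencies warm,
# caches primed). This is the condition to gate progression on.
- name: Readiness check with retries
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:8080/ready"
    status_code: 200
  register: ready
  retries: 12
  delay: 5
  until: ready.status == 200
```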

5.5 Questions to Guide Your Design

  1. What is your safe initial canary size?
  2. Which gate failure should stop rollout immediately?

5.6 Thinking Exercise

For a 10-node service, model capacity impact with batch size 2 if one host fails permanently during rollout.
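
One way to work the numbers, assuming equal per-host capacity and a (hypothetical) requirement to keep at least 70% of capacity:

  • steady state: 10/10 hosts in the pool = 100% capacity
  • one batch of 2 drained for deployment: 8/10 serving = 80%
  • one host of that batch fails permanently: still 8/10 = 80% while the rollout is paused, since the failed host was already drained; after the healthy host rejoins, 9/10 = 90% until the failed host is replaced
  • conclusion: batch size 2 never drops the pool below 80%, so a 70% floor holds, but a second concurrent failure would break it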

5.7 Interview Questions

  1. Why is health gating mandatory in rolling updates?
  2. How do race conditions arise in delegated tasks?
  3. What is your rollback trigger policy?

5.8 Hints in Layers

  • Hint 1: Start with batch size 1 canary.
  • Hint 2: Add explicit timeout policy.
  • Hint 3: Record LB state transitions.
  • Hint 4: Run client-side continuous probe during deployment.
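
Hint 4 can be realized as a separate play run from a second terminal while the rollout executes. The vip_address variable, the /health endpoint, and the 10-minute window are assumptions:

```yaml
# Client-side continuous probe, run alongside the rollout.
# vip_address and /health are hypothetical; size the loop to
# cover your full rollout window.
- name: Continuous availability probe
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Probe the VIP once per second, logging failures
      ansible.builtin.shell: |
        for i in $(seq 1 600); do
          curl -fsS -o /dev/null "http://{{ vip_address }}/health" \
            || echo "FAIL $(date -u +%FT%TZ)" >> probe_failures.log
          sleep 1
        done
```

A clean golden-path run leaves probe_failures.log empty or absent.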

6. Testing Strategy

  1. Clean golden-path rollout.
  2. Induced readiness failure on one host.
  3. Rerun from failed midpoint and verify safe recovery.

7. Common Pitfalls & Debugging

Pitfall | Symptom | Solution
weak health gate | bad host rejoins the pool | add stronger readiness checks
parallel LB writes | inconsistent pool state | throttle delegated operations
oversized batch | visible outage | reduce serial and add a canary stage
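
The parallel-LB-writes pitfall has a direct mitigation in Ansible: throttle serializes a task across hosts even when the batch size is larger than one. The lb-ctl command is a placeholder:

```yaml
# throttle: 1 forces this delegated task to run for one host at a
# time, preventing concurrent writes to the LB's pool state.
- name: Drain host from LB (serialized)
  ansible.builtin.command: lb-ctl drain {{ inventory_hostname }}
  delegate_to: "{{ groups['lb'][0] }}"
  throttle: 1
```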

8. Extensions & Challenges

  • Add staged rollout percentages (1 host, then 20%, then 100%).
  • Add automated rollback handler on failed gate.
  • Add SLO-aware pause/continue decision step.
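
The first two extensions map onto native play features: serial accepts a list of stage sizes, and a block/rescue pair gives a rollback hook. A sketch; the release paths, port, and previous_version variable are assumptions:

```yaml
- name: Staged rollout with rollback on failed gate
  hosts: webservers
  serial:
    - 1        # canary host
    - "20%"    # small stage
    - "100%"   # everything remaining
  max_fail_percentage: 0
  tasks:
    - block:
        - name: Deploy release
          ansible.builtin.copy:
            src: "releases/{{ release_version }}/"
            dest: /opt/app/
        - name: Health gate
          ansible.builtin.uri:
            url: "http://{{ inventory_hostname }}:8080/ready"
            status_code: 200
          register: ready
          retries: 10
          delay: 5
          until: ready.status == 200
      rescue:
        - name: Roll back to the previous release
          ansible.builtin.copy:
            src: "releases/{{ previous_version }}/"
            dest: /opt/app/
        - name: Fail the host so the rollout stops here
          ansible.builtin.fail:
            msg: "Health gate failed on {{ inventory_hostname }}; rolled back."
```

Failing explicitly after the rescue matters: with max_fail_percentage: 0, it is what converts "rolled back quietly" into "rollout halted for investigation".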

9. Real-World Connections

This is the same control pattern used in high-availability update pipelines in mature platform teams.


10. Resources

  • Ansible strategies docs
  • Delegation docs
  • SRE release engineering chapters

11. Self-Assessment Checklist

  • I can explain batch progression and stop conditions.
  • I can prove host drain/rejoin sequencing.
  • I can recover safely from mid-rollout failure.

12. Submission / Completion Criteria

  • Minimum: successful staged rollout.
  • Full: includes controlled failure and recovery evidence.
  • Excellence: includes automated rollback and SLO gating.