Project 7: Zero-Downtime Rolling Updates
Orchestrate staged application rollout with host draining, health gates, and controlled failure domains.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 20-32 hours |
| Main Programming Language | YAML orchestration |
| Coolness | Level 4 |
| Business Potential | 3. Service & Support |
| Prerequisites | P03, P05, P06 |
| Key Topics | serial batches, delegation, health checks, recovery |
1. Learning Objectives
- Design canary-to-full rollout sequencing.
- Use delegated tasks for load balancer drain/rejoin.
- Gate host progression on objective health checks.
- Recover safely from mid-rollout failures.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Failure-Domain-Aware Orchestration
Fundamentals
Rolling updates reduce blast radius by limiting concurrent changes: the `serial` keyword controls batch size, and health gates decide whether the rollout progresses.
Deep Dive into the concept
The dangerous assumption is that sequential execution alone ensures safety. Real safety requires drain-before-change, post-change readiness checks, and deterministic rollback/hold behavior when gates fail.
Mental model diagram
batch select -> drain -> deploy -> verify -> rejoin -> next batch
Where you’ll apply it
P07 and the capstone.
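As a minimal sketch of batch control (the `web` group and the batch sizes are illustrative, not prescribed), `serial` and `max_fail_percentage` together express both the batch size and the stop condition:

```yaml
# rolling_update.yml -- batch control only; drain/deploy tasks are sketched elsewhere.
# The "web" group and serial values are illustrative assumptions.
- name: Rolling update with canary-first batches
  hosts: web
  serial:
    - 1        # canary: one host first
    - 25%      # then a quarter of the fleet per batch
  max_fail_percentage: 0   # any failed host in a batch aborts the rollout
  tasks:
    - name: Deploy release
      ansible.builtin.debug:
        msg: "deploy step placeholder for {{ inventory_hostname }}"
```

With `max_fail_percentage: 0`, a single gate failure stops progression to the next batch, which is the deterministic hold behavior described above.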
3. Project Specification
3.1 What You Will Build
A rolling deployment playbook that:
- drains each target host from the load balancer
- deploys artifact/config
- validates readiness
- rejoins host
- repeats by batch
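The loop above can be sketched as a single play; the LB host `lb01`, the `web_pool` backend, the artifact path, and the `/ready` endpoint are all placeholder assumptions to adapt, not a prescribed implementation:

```yaml
# Per-batch sequence: drain -> deploy -> verify -> rejoin.
# "lb01", "web_pool", and the health URL are illustrative.
- hosts: web
  serial: 1
  tasks:
    - name: Drain host from LB
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: web_pool
        drain: true
      delegate_to: lb01

    - name: Deploy release
      ansible.builtin.unarchive:
        src: files/release.tar.gz
        dest: /opt/app

    - name: Restart application
      ansible.builtin.service:
        name: app
        state: restarted

    - name: Health gate
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/ready"
        status_code: 200
      register: ready
      retries: 10
      delay: 6
      until: ready.status == 200

    - name: Rejoin host to LB
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: web_pool
      delegate_to: lb01
```

Both LB tasks run on `lb01` via `delegate_to` while the deploy and gate run on the target host, which is the delegation split named in the solution architecture.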
3.2 Functional Requirements
- Uses explicit batch strategy.
- Uses delegated load balancer operations.
- Blocks progression on failed readiness check.
- Produces clear per-host rollout evidence.
3.4 Example Output
$ ansible-playbook -i inventory_cloud.yml rolling_update.yml
TASK [Drain host from LB] .... changed
TASK [Deploy release] ........ changed
TASK [Health gate] ........... ok
TASK [Rejoin host to LB] ..... changed
3.7 Real World Outcome
3.7.1 How to Run
ansible-playbook -i inventory_cloud.yml rolling_update.yml
3.7.2 Golden Path Demo
Continuous client checks show no request failure spike.
3.7.3 Failure Demo
One host fails readiness; rollout pauses with host isolated from pool and clear failure output.
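One hedged way to get this "pause with the host isolated" behavior is a `block`/`rescue` that deliberately skips the rejoin step and fails loudly (module choices and names here are illustrative):

```yaml
# Failure handling sketch: on a failed gate, leave the host drained
# (never rejoin it) and stop the play with a clear message.
- name: Deploy and verify, holding the host out of the pool on failure
  block:
    - name: Deploy release
      ansible.builtin.unarchive:
        src: files/release.tar.gz
        dest: /opt/app
    - name: Health gate
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/ready"
        status_code: 200
      register: ready
      retries: 10
      delay: 6
      until: ready.status == 200
  rescue:
    - name: Report isolation and abort rollout
      ansible.builtin.fail:
        msg: >-
          {{ inventory_hostname }} failed readiness and remains drained
          from the pool; rollout halted for operator review.
```

Because the rejoin task sits after the block, a rescue path ends the play with the failed host still drained, matching the failure demo above.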
4. Solution Architecture
LB control (delegated) <-> web host update (targeted) <-> health validation
5. Implementation Guide
5.3 The Core Question You’re Answering
“How do I deliver changes across a live fleet without violating availability goals?”
5.4 Concepts You Must Understand First
- `serial` batch semantics.
- Delegated task context and concurrency.
- Readiness vs liveness checks.
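A sketch of serializing delegated LB writes — the concurrency concern above. `throttle` is a real task keyword; the module and host names are illustrative assumptions:

```yaml
# Even with serial > 1, delegated tasks from several hosts can race
# on the LB; throttle: 1 forces one delegated write at a time.
- name: Drain host from LB
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
    backend: web_pool
  delegate_to: lb01
  throttle: 1
```

Without throttling, two hosts in the same batch can interleave their pool edits on `lb01` and leave the pool in an inconsistent state — the "parallel LB writes" pitfall listed in section 7.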
5.5 Questions to Guide Your Design
- What is your safe initial canary size?
- Which gate failure should stop rollout immediately?
5.6 Thinking Exercise
For a 10-node service, model capacity impact with batch size 2 if one host fails permanently during rollout.
5.7 Interview Questions
- Why is health gating mandatory in rolling updates?
- How do delegated race conditions happen?
- What is your rollback trigger policy?
5.8 Hints in Layers
- Hint 1: Start with batch size 1 canary.
- Hint 2: Add explicit timeout policy.
- Hint 3: Record LB state transitions.
- Hint 4: Run client-side continuous probe during deployment.
6. Testing Strategy
- Clean golden-path rollout.
- Induced readiness failure on one host.
- Rerun from failed midpoint and verify safe recovery.
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| weak health gate | bad host rejoins | stronger readiness checks |
| parallel LB writes | inconsistent pool state | throttle delegated operations |
| oversized batch | visible outage | reduce serial and add canary |
8. Extensions & Challenges
- Add staged percentages (1, 20%, 100%).
- Add automated rollback handler on failed gate.
- Add SLO-aware pause/continue decision step.
9. Real-World Connections
This is the same control pattern used in high-availability update pipelines in mature platform teams.
10. Resources
- Ansible strategies docs
- Delegation docs
- SRE release engineering chapters
11. Self-Assessment Checklist
- I can explain batch progression and stop conditions.
- I can prove host drain/rejoin sequencing.
- I can recover safely from mid-rollout failure.
12. Submission / Completion Criteria
- Minimum: successful staged rollout.
- Full: includes controlled failure and recovery evidence.
- Excellence: includes automated rollback and SLO gating.