OPERATING MODEL DESIGN MASTERY

Operating Model Design: From Zero to Organizational Architect

Goal: Deeply understand how to design, implement, and evolve high-performance operating models. You will learn to define team interfaces, ownership boundaries, and service expectations that minimize coordination overhead and maximize autonomy, effectively applying Conway’s Law to align organizational structure with technical architecture.


Why Operating Model Design Matters

In the early days of software, we focused on individual productivity. Today, the bottleneck isn’t how fast a single developer types; it’s how efficiently 50, 500, or 5,000 people coordinate.

Most organizations suffer from “Coordination Tax”—the massive overhead of meetings, handoffs, and misaligned priorities that slows down delivery. Operating Model Design is the engineering of the organization itself.

  • Historical Context: From Taylorism (scientific management) to Agile/Scrum, and now to Team Topologies and Platform Engineering.
  • Real-World Impact: The DORA research reported in “Accelerate” links high delivery performance to loosely coupled architectures and to teams that can build, test, and deploy without waiting on other teams.
  • The “Why”: Without a designed operating model, your organization defaults to a “messy middle” where everyone is responsible for everything, and therefore, no one is accountable for the outcome.

Core Concept Analysis

1. Conway’s Law & The Inverse Maneuver

Melvin Conway stated in 1967: “Organizations which design systems… are constrained to produce designs which are copies of the communication structures of these organizations.”

ORGANIZATION STRUCTURE          SOFTWARE ARCHITECTURE
┌─────────┐   ┌─────────┐      ┌─────────┐   ┌─────────┐
│ Team A  │<─>│ Team B  │  ==> │ Module A│<─>│ Module B│
└─────────┘   └─────────┘      └─────────┘   └─────────┘
      ^             ^                ^             ^
      └──────┬──────┘                └──────┬──────┘
             │                              │
       Communication                  Dependencies

The Inverse Conway Maneuver: If you want a microservices architecture, you must first create independent, small teams that communicate via APIs, not shared databases or frequent sync meetings.
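One rough way to check Conway alignment is to compare module dependencies against the communication paths that actually exist between the owning teams. The sketch below is illustrative only: the module names, team names, and the `ownership` mapping are hypothetical, and real data would come from your own repositories and org tooling.

```python
# Hypothetical data: which team owns which module, which modules depend on
# each other, and which pairs of teams have an established communication path.
ownership = {"checkout": "Team A", "payments": "Team B", "ledger": "Team C"}
module_deps = [("checkout", "payments"), ("payments", "ledger")]
team_links = {frozenset({"Team A", "Team B"})}


def unsupported_dependencies(ownership, module_deps, team_links):
    """Return module dependencies that cross a team boundary without a
    matching communication path between the two owning teams."""
    gaps = []
    for src, dst in module_deps:
        owners = frozenset({ownership[src], ownership[dst]})
        # A dependency inside a single team needs no cross-team channel.
        if len(owners) == 2 and owners not in team_links:
            gaps.append((src, dst))
    return gaps


gaps = unsupported_dependencies(ownership, module_deps, team_links)
print(gaps)
```

Each reported pair is a place where the architecture assumes coordination the organization does not currently support, so either the dependency or the team structure probably needs to change.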

2. Team Topologies

Following the work of Matthew Skelton and Manuel Pais, we categorize teams by their mission and interaction patterns.

STREAM-ALIGNED TEAM (Primary)
┌─────────────────────────────────┐
│ Full-stack, Long-lived, Outcome │
└─────────────────────────────────┘
        ↑           ↑
        │           └──────────────────────────┐
        │                                      │
PLATFORM TEAM                          ENABLING TEAM
┌─────────────────────────┐            ┌─────────────────────────┐
│ Internal "As-a-Service" │            │ Bridge Knowledge Gaps   │
└─────────────────────────┘            └─────────────────────────┘

3. Interaction Modes

Teams shouldn’t just “talk”; they should interact with intent:

  • Collaboration: Working together for a defined period to discover a new boundary.
  • X-as-a-Service: One team provides a service with a clean interface; the other consumes it.
  • Facilitation: One team helps another clear an obstacle or learn a skill.

4. Service Expectations & Ownership

Ownership isn’t just “who writes the code.” It’s who answers the 2 AM page (On-call), who defines the roadmap (Product), and who ensures the service stays within its SLOs (Reliability).

SERVICE INTERFACE
┌───────────────────────────────────────────┐
│ [API/Endpoint] [Documentation] [SLIs/SLOs]│
├───────────────────────────────────────────┤
│ OWNERSHIP: Team Delta                     │
│ ESCALATION: #ops-delta (Slack)            │
│ DEPENDENCIES: Auth-Service, DB-Cluster-1  │
└───────────────────────────────────────────┘

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
| --- | --- |
| Cognitive Load | Teams have a finite capacity. If they own too much “toil” or complex logic, they stop delivering value. |
| Team Boundaries | Boundaries should be “fracture planes”: natural places where the system can be split with minimal communication. |
| Interaction Modes | Communication is expensive. Define how teams talk to reduce noise and increase signal. |
| Escalation & SLOs | Trust is built on predictable responses. Define what happens when things go wrong before they go wrong. |

Deep Dive Reading by Concept

Organizational Theory & Team Design

| Concept | Book & Chapter |
| --- | --- |
| Team Topologies | “Team Topologies” by Matthew Skelton and Manuel Pais, Ch. 5: “The Four Fundamental Team Topologies” |
| Cognitive Load | “Team Topologies” by Matthew Skelton and Manuel Pais, Ch. 3: “Team-First Thinking” |
| Domain Boundaries | “Domain-Driven Design” by Eric Evans, Ch. 14: “Maintaining Model Integrity” |

Operational Excellence

| Concept | Book & Chapter |
| --- | --- |
| Service Level Objectives | “Site Reliability Engineering (Google)”, Ch. 4: “Service Level Objectives” |
| Incident Management | “The Site Reliability Workbook”, Ch. 9: “Incident Response” |
| Platform Engineering | “Accelerate” by Nicole Forsgren, Ch. 5: “Architecture” |

Essential Reading Order

  1. The Foundation (Week 1):
    • Team Topologies Ch. 1-5 (The “Why” and the “Four Team Types”)
    • Accelerate Part 1 (The metrics of high performance)
  2. The Practicality (Week 2):
    • SRE Book (Google) Ch. 4 (SLOs)
    • Dynamic Reteaming by Heidi Helfand (How teams change over time)

Project List

Projects are ordered from fundamental understanding to advanced organizational architecture.


Project 1: The Team Interaction Audit (Mapping Friction)

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown / Graphviz (DOT)
  • Alternative Programming Languages: Python, Mermaid.js
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Organizational Analysis
  • Software or Tool: Obsidian, Miro, or Mermaid.js
  • Main Book: “Team Topologies” by Skelton & Pais

What you’ll build: A visual map of how teams currently interact, identifying “high-friction” zones where communication is constant and “low-friction” zones where teams work autonomously.

Why it teaches Operating Model Design: You cannot design a better model until you see the existing one. This project forces you to distinguish between “necessary communication” (discovery) and “unnecessary communication” (handoffs/wait times).

Core challenges you’ll face:

  • Identifying “Shadow” interactions → maps to informal networks vs. formal org charts
  • Categorizing interaction types → maps to Collaboration vs. X-as-a-Service
  • Quantifying wait times → maps to identifying bottlenecks in the flow

Key Concepts:

  • Interaction Modes: “Team Topologies” Ch. 7
  • Value Stream Mapping: “Learning to See” by Mike Rother

Real World Outcome: A directed graph showing teams (nodes) and their communication frequency/type (edges).

Example Output (Mermaid):

graph TD
    TeamA[Checkout Team] -- "High Friction (Daily Sync)" --> TeamB[Payment Team]
    TeamC[Platform Team] -- "X-as-a-Service (API)" --> TeamA
    TeamD[SRE Team] -- "Facilitation (Mentoring)" --> TeamB
    TeamB -- "Wait Time: 3 Days" --> TeamE[Security Review]

Real World Outcome

You will produce a “Friction Heatmap” of your organization or a hypothetical one. Success looks like a diagram where “Hot” lines (thick/red) indicate areas where teams are stuck in meetings or waiting for each other, and “Cool” lines (thin/green) indicate clean, service-based interfaces.

Example Output:

[Interaction Log]
- Checkout Team <-> DB Admin: 15 Slack messages today about schema locks. (HIGH FRICTION)
- Checkout Team <-> Platform: 0 messages (used self-service API). (LOW FRICTION)
- SRE Team <-> Payments: 2h meeting on "How to use PagerDuty". (FACILITATION)

The Core Question You’re Answering

“Where does the ‘Coordination Tax’ live in our organization, and why is it so high?”

Before you write any code, sit with this question. Most coordination overhead is invisible until you map it. It feels like “work,” but it’s actually “waste” generated by poorly defined boundaries.


Concepts You Must Understand First

Stop and research these before coding:

  1. The Three Interaction Modes
    • What is the difference between Collaboration and Facilitation?
    • When is X-as-a-Service appropriate?
    • Book Reference: “Team Topologies” Ch. 7 - Skelton & Pais
  2. Wait Time vs. Processing Time
    • Why does a 1-hour task take 3 days to complete in most companies?
    • Book Reference: “The Phoenix Project” - Gene Kim
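The gap between wait time and processing time is often expressed as flow efficiency: hands-on time divided by total elapsed time. A minimal sketch, assuming the “3 days” in the question above means 72 calendar hours (the function name is illustrative):

```python
def flow_efficiency(processing_hours, elapsed_hours):
    """Fraction of the elapsed time in which someone was actually working."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return processing_hours / elapsed_hours


# A 1-hour task that takes 3 calendar days (72 hours) end to end.
efficiency = flow_efficiency(1, 3 * 24)
print(f"{efficiency:.1%}")
```

Under that assumption the task is being worked on for roughly 1.4% of its lifetime; the rest is queueing and handoffs, which is what the audit is meant to surface.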

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Collection
    • How will you track “interaction”? (Slack data, calendar invites, or surveys?)
    • How do you distinguish a social chat from a technical dependency?
  2. Visualization
    • How can you represent “Wait Time” on a graph?
    • What happens if Team A says they interact with Team B, but Team B says they don’t?
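The last question, where two teams disagree about whether they interact, can be checked mechanically once survey answers are collected. The following is a sketch under the assumption that each team simply lists the teams it believes it interacts with; the team names and the `reports` structure are hypothetical.

```python
# Hypothetical survey result: team -> set of teams it says it interacts with.
reports = {
    "Checkout": {"Payments", "Platform"},
    "Payments": {"Checkout"},
    "Platform": set(),
}


def one_sided_interactions(reports):
    """Return (reporter, other) pairs where the reporter names the other
    team but the other team does not name the reporter back."""
    mismatches = []
    for team, named in reports.items():
        for other in sorted(named):
            if team not in reports.get(other, set()):
                mismatches.append((team, other))
    return mismatches


mismatches = one_sided_interactions(reports)
print(mismatches)
```

A one-sided report is not necessarily an error. It may indicate a healthy X-as-a-Service relationship that the provider barely notices, or a hidden dependency the provider does not know it has, so each mismatch is worth a follow-up question.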

Thinking Exercise

The “Silent Meeting” Test

Imagine a week where no team is allowed to have a meeting with another team.

Questions while analyzing:

  • Which teams would stop functioning immediately?
  • Which teams would continue delivering without issue?
  • The teams that stop are your “High Coupling” points. Are those boundaries correct?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you identify if two teams are too tightly coupled?”
  2. “What are the signs that a team boundary is in the wrong place?”
  3. “Explain Conway’s Law and how it impacts system design.”
  4. “How do you measure the cognitive load of a team?”
  5. “When is ‘Collaboration’ actually a bad thing?”

Hints in Layers

Hint 1: Start with the Org Chart
List every team in the department. Don’t look at names; look at what they actually do.

Hint 2: Track the Handoffs
Pick one feature and trace its path from “Idea” to “Production.” Every time it moves from one person/team to another, that’s an interaction.

Hint 3: Categorize by Intent
Use the Team Topologies labels: Is Team A helping Team B (Facilitation), building with Team B (Collaboration), or providing a tool for Team B (X-as-a-Service)?

Hint 4: Use DOT or Mermaid
Don’t get stuck in UI tools. Write the interactions as text (A -> B [label="High Friction"]) and let a tool render it.
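As a starting point for the text-first approach, the sketch below turns a list of interactions into Graphviz DOT, drawing high-friction edges thick and red and low-friction edges thin and green. The tuple layout and the `STYLE` table are assumptions made for illustration; only standard DOT edge attributes (`label`, `color`, `penwidth`) are used.

```python
# Hypothetical interaction records: (from_team, to_team, mode, friction).
interactions = [
    ("Checkout", "Payments", "Collaboration", "high"),
    ("Platform", "Checkout", "X-as-a-Service", "low"),
]

# Edge styling per friction level (illustrative choices).
STYLE = {
    "high": 'color="red", penwidth=3',
    "low": 'color="green", penwidth=1',
}


def to_dot(interactions):
    """Render interactions as a DOT digraph, one edge per interaction."""
    lines = ["digraph friction {"]
    for src, dst, mode, friction in interactions:
        lines.append(f'  "{src}" -> "{dst}" [label="{mode}", {STYLE[friction]}];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(interactions)
print(dot)
```

Saving the output to a file and running it through Graphviz (for example `dot -Tsvg`) should produce a first version of the heatmap.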


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Interaction Modes | “Team Topologies” by Skelton & Pais | Ch. 7 |
| Value Stream | “Learning to See” by Mike Rother | All |

Project 2: The Team “Service Interface” (The Team README)

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown
  • Alternative Programming Languages: HTML, Static Site Generator
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Documentation / Service Design
  • Software or Tool: GitHub/GitLab Pages, Backstage
  • Main Book: “Team Topologies” (Ch. 3: Team-First Thinking, which introduces the Team API)

What you’ll build: A standardized “Team API” document that defines how other teams interact with your team. It’s not just code; it’s the human and process interface.

Why it teaches Operating Model Design: It forces a team to think of themselves as a service provider. If a team can’t explain “how to use us” without a meeting, their interface is broken.

Core challenges you’ll face:

  • Defining service boundaries → maps to What is “in scope” vs “out of scope”
  • Setting communication protocols → maps to Synchronous (Slack/Zoom) vs Asynchronous (Tickets/Docs)
  • Establishing SLAs → maps to Response time expectations

Key Concepts:

  • Team API: “Team Topologies” Ch. 3
  • Self-Service Principles: “Platform Engineering Guide”

Real World Outcome

A TEAM_INTERFACE.md file that lives in every repository owned by the team. When a new developer from another team wants to use your service, they read this file and have everything they need to start without talking to you.

Example Content:

# Team: Ghostbusters (Infrastructure)
## Our Mission
To provide reliable, self-service compute for app teams.

## Interaction Protocol
- **General Questions**: #ask-ghostbusters (Slack)
- **Feature Requests**: Open a Ticket [Link]
- **Urgent (Prod Down)**: PagerDuty [Link]

## Service SLOs
- Ticket Response: 2 Business Days
- System Availability: 99.9%

The Core Question You’re Answering

“Can a stranger use your team’s services without ever speaking to you?”

Before you write any code, sit with this question. If the answer is “No,” you have a high-coordination operating model. High-performing organizations strive for “X-as-a-Service” where the interface is documented and self-contained.


Concepts You Must Understand First

Stop and research these before coding:

  1. Information Hiding
    • What internal team details should stay hidden from the outside?
    • Book Reference: “A Philosophy of Software Design” - John Ousterhout
  2. The “Team API”
    • What are the components of a team’s public interface?
    • Book Reference: “Team Topologies” Ch. 3

Questions to Guide Your Design

Before implementing, think through these:

  1. Accessibility
    • Where should this document live so it’s easily found?
    • How do you keep it from going out of date?
  2. Scope
    • If someone asks for something “Out of Scope,” how does the document handle that?
    • Does the document include “On-call” details?

Thinking Exercise

The “New Hire” Experiment

Imagine a new hire starts in a different department. They are told they need to integrate with your system.

Questions while analyzing:

  • What are the first 5 questions they will ask?
  • Are all 5 of those questions answered in your README?
  • If not, your interface has a “leak.”

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What makes a ‘good’ team interface?”
  2. “How do you handle requests that fall outside your team’s defined scope?”
  3. “Why is documentation considered part of the ‘Team API’?”
  4. “How do you balance ‘self-service’ with the need for security/compliance?”
  5. “What do you do if another team ignores your interface and DMs your devs?”

Hints in Layers

Hint 1: Start with the ‘Catalog’
List every service, tool, or process your team owns.

Hint 2: Define the ‘Front Door’
Decide on exactly ONE way for people to ask for new things. Is it a Jira form? A specific Slack channel? Close all other doors.

Hint 3: Add ‘Service Expectations’
Don’t just say “we help.” Say “we respond to tickets in 48 hours and we only support Python 3.9+.”

Hint 4: Make it a Template
Don’t just write one. Write a template so every team in the company can have the same structure.
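Once a template exists, a small check can keep every team's file complete. This sketch assumes the required sections are the three headings from the example content above (“Our Mission”, “Interaction Protocol”, “Service SLOs”) and that they appear as second-level Markdown headings; both are assumptions you would adapt to your own template.

```python
import re

# Section names taken from the example TEAM_INTERFACE.md above.
REQUIRED_SECTIONS = ("Our Mission", "Interaction Protocol", "Service SLOs")


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return required section names that have no '## <name>' heading."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", markdown_text, re.MULTILINE)
    }
    return [name for name in required if name not in headings]


sample = """# Team: Ghostbusters (Infrastructure)
## Our Mission
To provide reliable, self-service compute for app teams.

## Interaction Protocol
- **General Questions**: #ask-ghostbusters (Slack)
"""

print(missing_sections(sample))
```

Running a check like this in CI is one possible answer to the earlier question of how to keep the document from going out of date, at least structurally.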


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Team APIs | “Team Topologies” | Ch. 3 |
| Interface Design | “A Philosophy of Software Design” | Ch. 4 |

Project 3: Ownership Boundary Mapper (RACI 2.0)

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: YAML / JSON
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Governance & Accountability
  • Software or Tool: CODEOWNERS, Terraform
  • Main Book: “Modern Software Engineering” by David Farley

What you’ll build: A schema and validation tool that maps every technical asset (repo, bucket, microservice) to exactly one owning team and an escalation path.

Why it teaches Operating Model Design: It moves ownership from “tribal knowledge” to “machine-readable truth.” It enforces the rule: “If it exists, someone owns it.”

Core challenges you’ll face:

  • Handling “Orphan” resources → maps to cleaning up technical debt
  • Managing shared resources → maps to the “tragedy of the commons” in software
  • Automation → maps to syncing the map with cloud provider tags

Key Concepts:

  • Accountability vs Responsibility: RACI Matrix
  • Domain Ownership: “Domain-Driven Design” (Bounded Contexts)

Real World Outcome

A CLI tool (e.g., owner-check) that scans your cloud configuration or directory structure and flags any resource that doesn’t have a valid owner tag matching your team registry.

Example Output:

$ ./owner-check --dir ./services
[OK] service-auth -> Team: Identity
[OK] service-payment -> Team: Fintech
[ERROR] orphan-bucket-123 -> NO OWNER FOUND!
[ERROR] legacy-cron-job -> Owner 'Team: Rocket' no longer exists!

The Core Question You’re Answering

“If this service breaks at 3 AM, whose phone rings, and do they know it’s their problem?”

Before you write any code, sit with this question. Ambiguous ownership is the leading cause of “Incident Ping-Pong,” where teams keep passing a ticket back and forth because no one is sure they own the fix.


Concepts You Must Understand First

Stop and research these before coding:

  1. Bounded Contexts
    • How do you draw lines around code so it can be owned by one team?
    • Book Reference: “Domain-Driven Design” - Eric Evans
  2. The RACI Matrix
    • Responsible, Accountable, Consulted, Informed. Which one is the most important for an operating model? (Hint: Accountable).
    • Book Reference: Standard Project Management literature.

Questions to Guide Your Design

Before implementing, think through these:

  1. Granularity
    • Do you own at the “Repo” level, the “Microservice” level, or the “S3 Bucket” level?
    • What happens when one repo contains code for multiple services?
  2. The “Registry”
    • Where do the “Teams” live? A YAML file? An LDAP group? A database?

Thinking Exercise

The “Burning Building” Trace

Take a random microservice in your system. Imagine it starts returning 500 errors.

Questions while analyzing:

  • Who is the first person to get an alert?
  • How do they know which team the alert belongs to?
  • If they look at the source code, is there a clear “Contact Us” or “Owned By” header?
  • If the owner isn’t listed, how many people do they have to ask before finding the owner?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you handle shared infrastructure that multiple teams use?”
  2. “What are the dangers of ‘Shared Ownership’?”
  3. “How do you transition ownership of a legacy system to a new team?”
  4. “Should the person who writes the code always be the one who owns it in production?”
  5. “What metrics can you use to prove that ownership boundaries are clear?”

Hints in Layers

Hint 1: Use CODEOWNERS
Start by looking at GitHub’s CODEOWNERS file format. It’s a simple way to map file paths to teams.

Hint 2: Define the ‘Team’ Schema
Create a teams.yaml that lists every team, their ID, and their primary Slack/On-call link.

Hint 3: Create the ‘Assets’ Schema
Create an assets.yaml that maps asset IDs to Team IDs.

Hint 4: Write the Validator
Write a script that ensures every Team ID referenced in assets.yaml exists in teams.yaml. Then, write a script that checks your real infrastructure (or a mock of it) against this file.
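The core of the validator described in the hints can be sketched in a few lines. To stay dependency-free, the example uses plain dictionaries standing in for the parsed contents of teams.yaml and assets.yaml; the field names and team IDs are hypothetical.

```python
# Stand-ins for parsed teams.yaml and assets.yaml (hypothetical contents).
teams = {
    "identity": {"slack": "#team-identity"},
    "fintech": {"slack": "#team-fintech"},
}
assets = {
    "service-auth": "identity",
    "service-payment": "fintech",
    "orphan-bucket-123": None,      # no owner recorded
    "legacy-cron-job": "rocket",    # owner no longer in the registry
}


def check_owners(assets, teams):
    """Return (asset, status, detail) for every asset, sorted by asset ID."""
    results = []
    for asset, owner in sorted(assets.items()):
        if not owner:
            results.append((asset, "ERROR", "no owner found"))
        elif owner not in teams:
            results.append((asset, "ERROR", f"owner '{owner}' not in team registry"))
        else:
            results.append((asset, "OK", owner))
    return results


results = check_owners(assets, teams)
for asset, status, detail in results:
    print(f"[{status}] {asset} -> {detail}")
```

The two error branches correspond to the two failure modes in the example output: an asset with no owner at all, and an asset whose recorded owner has since been dissolved or renamed.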


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Bounded Contexts | “Domain-Driven Design” | Ch. 14 |
| Ownership | “Modern Software Engineering” | Ch. 12 |

Project 4: The Escalation Logic Tree (Incident Design)

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Python (Logic) / Mermaid
  • Alternative Programming Languages: JavaScript, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Reliability Engineering
  • Software or Tool: PagerDuty API, OpsGenie
  • Main Book: “The Site Reliability Workbook” (Ch. 9)

What you’ll build: A programmable escalation engine that determines who gets paged based on the service boundary, time of day, and type of failure.

Why it teaches Operating Model Design: Escalation is where team interfaces are tested under pressure. Designing the logic forces you to define exactly where one team’s responsibility ends and another’s begins.

Core challenges you’ll face:

  • Determining “Secondary” responders → maps to Enabling teams vs Stream teams during incidents
  • Handling cross-team outages → maps to Incident Commander roles
  • Reducing Alert Fatigue → maps to Cognitive Load management

Key Concepts:

  • Incident Response Lifecycle: “SRE Book” Ch. 14
  • Hierarchical vs. Networked Escalation

Real World Outcome

A decision-tree simulator where you input an incident event (e.g., {service: "payments", error: "DB_TIMEOUT", severity: "P1"}) and it outputs the exact notification path and the “Reasoning” behind it.

Example Output:

$ ./escalate --service payments --error 500 --severity P1
[DECISION] 
1. Lookup Owner: Team-Fintech (Match)
2. Severity Check: P1 (Notify Primary On-call)
3. Dependency Check: Payments depends on Auth. (Notify Auth-On-Call as "Informed")
4. NOTIFYING: @fintech-oncall via PagerDuty.
5. POSTING: #incidents channel.

The Core Question You’re Answering

“In a crisis, does the system know how to find the right human without a human having to look it up?”

Before you write any code, sit with this question. Manual escalation during a P0 incident is a failure of the operating model. The model should have pre-defined paths for every foreseeable failure.


Concepts You Must Understand First

Stop and research these before coding:

  1. Mean Time to Acknowledge (MTTA)
    • How does escalation logic impact this metric?
    • Book Reference: “Accelerate” - Nicole Forsgren
  2. The “Incident Commander” Role
    • When does a team-level issue become an org-level incident?
    • Book Reference: “The Site Reliability Workbook” Ch. 9

Questions to Guide Your Design

Before implementing, think through these:

  1. Dependencies
    • If Service A depends on Service B, and Service A is failing, should Service B’s owner be paged?
    • How do you avoid “Cascade Paging” (paging everyone)?
  2. Time and Context
    • Does the escalation change if it’s 2 PM on a Tuesday vs. 2 AM on a Sunday?
    • What if the primary responder doesn’t answer in 15 minutes?
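The unanswered-page question is usually handled with a time-based chain. A minimal sketch follows; the 15-minute step comes from the question above, while the 30-minute tier and the role names are assumptions added for illustration.

```python
# (minutes without acknowledgement, who gets paged at that point).
ESCALATION_CHAIN = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),  # hypothetical third tier
]


def who_to_page(minutes_unacknowledged, chain=ESCALATION_CHAIN):
    """Return everyone who should have been paged by this point in time."""
    return [who for threshold, who in chain if minutes_unacknowledged >= threshold]


print(who_to_page(0))
print(who_to_page(20))
```

Keeping the chain as data rather than code makes it easy to vary per service or per time of day, which is one way to approach the 2 PM versus 2 AM question.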

Thinking Exercise

The “Blame Game” Simulation

Trace a failure in a shared component (like a Load Balancer).

Questions while analyzing:

  • Who is responsible for the Load Balancer?
  • If the app team pages the Load Balancer team, and the Load Balancer team says “it’s your app,” who breaks the tie?
  • Does your escalation logic include a “Final Arbiter” (like a CTO or Architect)?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you design an on-call rotation that doesn’t burn people out?”
  2. “What is the difference between an alert and an incident?”
  3. “Explain the ‘Secondary’ escalation layer.”
  4. “How do you handle ‘Silent Failures’ where no one gets paged?”
  5. “What are the common pitfalls of automated escalation?”

Hints in Layers

Hint 1: Map the ‘Metadata’
Every service needs an owner_team_id and a criticality_score.

Hint 2: Define ‘Rules’
Write rules like: “If criticality > 3 AND time is Outside Business Hours AND severity is P1 -> Notify PagerDuty.”

Hint 3: Handle Dependencies
Add a depends_on list to your service metadata. If Service A fails, look up Service B’s metadata to see who to “Inform.”

Hint 4: Test with Edge Cases
What happens if a service has no owner? What happens if PagerDuty is down? Build “Default” or “Safety” rules for these cases.
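Put together, the four hints suggest a small rules function. The sketch below applies the example rule from Hint 2 literally and adds a fallback owner for unowned or unknown services. The service metadata, the 09:00 to 17:00 weekday definition of business hours, and the `FALLBACK_TEAM` name are all assumptions; no real paging API is called.

```python
from datetime import datetime

# Hypothetical service metadata (Hint 1 and Hint 3).
services = {
    "payments": {"owner_team_id": "fintech", "criticality_score": 5, "depends_on": ["auth"]},
    "auth": {"owner_team_id": "identity", "criticality_score": 5, "depends_on": []},
    "reports": {"owner_team_id": None, "criticality_score": 2, "depends_on": []},
}
FALLBACK_TEAM = "incident-commander"  # safety rule for missing owners (Hint 4)


def is_business_hours(when):
    """Assumed definition: Monday to Friday, 09:00 to 17:00 local time."""
    return when.weekday() < 5 and 9 <= when.hour < 17


def escalate(service, severity, when, services=services):
    """Decide who to notify, through which channel, and who to inform."""
    meta = services.get(service, {})
    owner = meta.get("owner_team_id") or FALLBACK_TEAM
    # Hint 2's rule: page only for critical services, P1, outside hours.
    page = (
        meta.get("criticality_score", 0) > 3
        and severity == "P1"
        and not is_business_hours(when)
    )
    # Owners of direct dependencies are informed, not paged.
    inform = [
        services[dep]["owner_team_id"]
        for dep in meta.get("depends_on", [])
        if dep in services and services[dep]["owner_team_id"]
    ]
    return {"notify": owner, "channel": "page" if page else "chat", "inform": inform}


decision = escalate("payments", "P1", datetime(2024, 1, 7, 2, 0))  # a Sunday, 2 AM
print(decision)
```

Informing rather than paging dependency owners is one way to limit cascade paging: the dependency's team gets context without being woken up unless their own service alerts.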


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Incident Management | “SRE Book” (Google) | Ch. 14 |
| Response Strategies | “The Site Reliability Workbook” | Ch. 9 |

Project 5: Platform-as-a-Product Blueprint

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown / Product Roadmap (JSON)
  • Alternative Programming Languages: HTML, Figma (for UI mocks)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Platform Engineering / Product Management
  • Software or Tool: Backstage.io, Productboard
  • Main Book: “Team Topologies” (Ch. 5: Platform Teams)

What you’ll build: A product strategy for an internal developer platform (IDP). This includes a value proposition, a “Feature Roadmap” for developers, and success metrics (User Adoption, Time to Hello World).

Why it teaches Operating Model Design: Platform teams often fail because they build what they think developers need, rather than treating them as customers. This project teaches you to design the “internal market” of your organization.

Core challenges you’ll face:

  • Identifying “Thinnest Viable Platform” (TVP) → maps to reducing cognitive load without over-engineering
  • Defining internal success metrics → maps to Developer Experience (DevEx)
  • Creating an “Onboarding” flow → maps to Self-Service interaction

Key Concepts:

  • Thinnest Viable Platform: “Team Topologies” Ch. 5
  • Developer Experience (DX): “Accelerate” (Quality of life metrics)

Real World Outcome

A “Platform Vision” document and a mock “Internal Service Portal” (like Backstage) that shows exactly how a developer would provision a new database or service without ever talking to an infrastructure engineer.

Example Roadmap:

{
  "q1": "Automated AWS Account Provisioning (Self-Service)",
  "q2": "Centralized Log Aggregation (Opt-in)",
  "q3": "Standardized CI/CD Templates for Go/Python",
  "q4": "Internal API Catalog (Read-Only)"
}

The Core Question You’re Answering

“If you had to charge your developers money to use your platform, would they pay for it or find an alternative?”

Before you write any code, sit with this question. Internal platforms succeed only when they are easier to use than the alternatives (like raw AWS or “doing it manually”). This project shifts your mindset from “Authority” to “Service.”


Concepts You Must Understand First

Stop and research these before coding:

  1. The TVP (Thinnest Viable Platform)
    • Why is a “Golden Path” better than a “Golden Cage”?
    • Book Reference: “Team Topologies” Ch. 5 - Skelton & Pais
  2. User Research for Devs
    • How do you interview developers to find their biggest pain points?
    • Book Reference: “Continuous Discovery Habits” - Teresa Torres

Questions to Guide Your Design

Before implementing, think through these:

  1. Mandatory vs. Optional
    • Should developers be forced to use the platform, or should it be so good they want to use it?
    • How do you handle “Edge Cases” that the platform doesn’t support yet?
  2. Feedback Loops
    • How will you know if a new platform feature is actually reducing cognitive load?

Thinking Exercise

The “Credit Card” Test

Imagine every team is given a budget and a corporate credit card. They can use your internal platform, or they can go straight to AWS/GCP and manage it themselves.

Questions while analyzing:

  • What is your platform’s “Unique Selling Point”?
  • If your platform is “Free” but takes 2 weeks to provision a DB, while AWS takes 2 minutes but costs $50/mo, which one will the team choose?
  • How do you lower the “Cost of Entry” for your platform?
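The “free but slow” trade-off becomes clearer once waiting is given a price. The sketch below compares first-year cost for the two options in the exercise. The cost-of-delay figure of $500 per day is purely an assumed number for illustration, not a benchmark; substitute your own estimate of what a blocked team costs.

```python
def first_year_cost(monthly_fee, wait_days, cost_of_delay_per_day):
    """Subscription cost for 12 months plus the cost of waiting to start."""
    return monthly_fee * 12 + wait_days * cost_of_delay_per_day


COST_OF_DELAY = 500  # ASSUMED dollars per day that a team is blocked

# Internal platform: free, but 2 weeks to provision a database.
platform_cost = first_year_cost(0, 14, COST_OF_DELAY)
# Public cloud: $50/month, provisioned in 2 minutes.
cloud_cost = first_year_cost(50, 2 / (24 * 60), COST_OF_DELAY)

print(platform_cost, round(cloud_cost, 2))
```

Under this assumption the “free” platform is the more expensive choice by a wide margin, which is why provisioning time is usually a better platform metric than price.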

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a ‘Thinnest Viable Platform’ and why is it important?”
  2. “How do you measure the success of a Platform Team?”
  3. “How do you handle ‘Feature Requests’ from app teams without becoming a bottleneck?”
  4. “Why should a Platform Team have a Product Manager?”
  5. “What is the ‘Golden Path’ and how does it differ from a standard?”

Hints in Layers

Hint 1: Start with the ‘Toil’
List the top 3 things developers complain about (e.g., “It takes too long to get a staging environment”).

Hint 2: Define the ‘Product’
Don’t just build a script. Define a “Service”: “Staging-as-a-Service.” What are the inputs? What are the outputs?

Hint 3: Map the Onboarding
Write down the steps a new developer takes to go from “Repo Created” to “Code in Production.” Your platform should automate at least 50% of those steps.

Hint 4: Use Backstage as a North Star
Look at backstage.io. You don’t have to install it, but look at their “Software Templates” feature. That is the outcome you want.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Platform Teams | “Team Topologies” | Ch. 5 |
| Metrics | “Accelerate” | Ch. 5 |

Project 6: Cognitive Load Survey & Heatmap

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Python (Data Analysis) / Google Forms (Source)
  • Alternative Programming Languages: R, JavaScript (D3.js)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Human Factors / Data Science
  • Software or Tool: Pandas, Jupyter Notebook, SurveyMonkey
  • Main Book: “Team Topologies” (Ch. 3: Team-First Thinking, which covers cognitive load)

What you’ll build: A data-driven survey instrument and visualization tool that measures the “Cognitive Load” of various teams. You’ll identify which teams are “Drowning” (too much to own) vs. “Thriving” (manageable load).

Why it teaches Operating Model Design: Operating models exist to manage cognitive load. If a team’s load is too high, the model has failed. This project teaches you to use “Subjective Data” as a “Hard Metric” for re-org decisions.

Core challenges you’ll face:

  • Defining the ‘Load’ metrics → maps to Business vs. Technical vs. Internal load
  • Anonymizing responses → maps to Psychological safety in data collection
  • Visualizing ‘Burnout Risk’ → maps to predictive organizational health

Key Concepts:

  • Three Types of Cognitive Load: Intrinsic, Extraneous, Germane
  • Psychological Safety: “The Fearless Organization” - Amy Edmondson

Real World Outcome

A “Cognitive Load Heatmap” where you can see:

  1. Red Teams: High extraneous load (wasted effort on tools/process).
  2. Green Teams: High germane load (focusing on business value).

Example Output:

Team: Checkout
- Intrinsic Load (Domain Knowledge): 7/10
- Extraneous Load (Tooling/Process): 9/10 (CRITICAL - Platform intervention needed)
- Germane Load (Value Add): 2/10
- VERDICT: Team is drowning in "Toil".
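A verdict like the one above can be derived from the three scores with a simple rule. The thresholds and the intermediate “AMBER” label in this sketch are assumptions chosen to reproduce the Red/Green split described earlier; they are not taken from any published scale.

```python
def classify(loads, critical_at=8):
    """Classify a team from its 0-10 load scores.

    Intrinsic load is reported but not used for the verdict in this sketch;
    the rule looks only at extraneous versus germane load.
    """
    if loads["extraneous"] >= critical_at:
        return "RED"      # drowning in tooling/process overhead
    if loads["germane"] > loads["extraneous"]:
        return "GREEN"    # most effort goes to value-adding work
    return "AMBER"        # hypothetical middle category


checkout = {"intrinsic": 7, "extraneous": 9, "germane": 2}
print(classify(checkout))
```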

The Core Question You’re Answering

“Is this team slow because they are ‘bad,’ or because we’ve given them an impossible amount of things to remember?”

Before you write any code, sit with this question. Cognitive load is the “Silent Killer” of software teams. Most managers try to fix “Speed” by adding people, but every new person adds communication paths, which often raises the load on everyone else (Brooks’ Law). Designing the model means removing load.


Concepts You Must Understand First

Stop and research these before coding:

  1. Intrinsic vs. Extraneous vs. Germane Load
    • Which one do we want to maximize, and which one do we want to minimize?
    • Book Reference: “Team Topologies” Ch. 3
  2. Brooks’ Law
    • “Adding manpower to a late software project makes it later.”
    • Book Reference: “The Mythical Man-Month” - Fred Brooks

Questions to Guide Your Design

Before implementing, think through these:

  1. Survey Design
    • How do you ask “How much do you have to think?” without being vague?
    • Should you ask about “Time spent in meetings” or “Difficulty of the codebase”?
  2. Actionability
    • Once you find a “Red Team,” what is the operating model change you would recommend? (Change boundary? Add a platform tool? Add an enabling team?)

Thinking Exercise

The “Context Switch” Counter

Pick a single day. Every time you have to stop what you’re doing to answer a question, attend a meeting, or fix a broken tool, mark a tally.

Questions while analyzing:

  • How many tallies do you have by noon?
  • How much of that was “Value Add” (Germane) vs. “Frustration” (Extraneous)?
  • If everyone on your team has 10+ tallies, your team’s operating model is broken.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you measure cognitive load in a software team?”
  2. “What is the difference between Extraneous and Germane cognitive load?”
  3. “How does team size affect cognitive load?”
  4. “What are the signs that a team is suffering from too much cognitive load?”
  5. “How can an ‘Enabling Team’ help reduce cognitive load?”

Hints in Layers

Hint 1: Use the ‘Four Question’ Method. Ask teams to rate 1-5:

  1. “How easy is it to deploy?”
  2. “How much of the domain do you understand?”
  3. “How much time is spent on ‘Toil’?”
  4. “How often do you get interrupted?”

Hint 2: Aggregate by Team. Don’t look at individuals. Operating Model Design is about Teams. Average the scores per team.

Hint 3: Visualize the ‘Gap’. Create a chart showing “Domain Complexity” vs. “Tooling Complexity.” The teams in the top-right corner are your biggest risk.

Hint 4: Map to Team Topologies. If a Stream-aligned team has high Tooling Complexity, they need a Platform Team. If they have high Domain Complexity, they might need to split the boundary.
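Hints 2 through 4 can be sketched in a few lines of Python. This is only an illustration: the survey rows, the two score columns, and the 3.5 cut-off are assumptions, and every score is assumed to be normalized so that 5 always means "more load".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey rows: (team, domain_complexity, tooling_complexity), rated 1-5.
RESPONSES = [
    ("Checkout", 4, 5),
    ("Checkout", 3, 5),
    ("Search", 2, 2),
    ("Search", 3, 1),
    ("Payments", 5, 2),
]

HIGH = 3.5  # assumed cut-off for "high"; tune it to your own data


def team_averages(responses):
    """Average the scores per team (Hint 2: never report individuals)."""
    by_team = defaultdict(list)
    for team, domain, tooling in responses:
        by_team[team].append((domain, tooling))
    return {
        team: (mean(d for d, _ in rows), mean(t for _, t in rows))
        for team, rows in by_team.items()
    }


def recommend(domain, tooling):
    """Map a team's quadrant to the interventions named in Hint 4."""
    if domain >= HIGH and tooling >= HIGH:
        return "top-right: biggest risk (platform support and a boundary review)"
    if tooling >= HIGH:
        return "high tooling complexity: needs a Platform Team"
    if domain >= HIGH:
        return "high domain complexity: consider splitting the boundary"
    return "load looks manageable"


for team, (domain, tooling) in sorted(team_averages(RESPONSES).items()):
    print(f"{team}: domain={domain:.1f} tooling={tooling:.1f} -> {recommend(domain, tooling)}")
```

The same per-team averages can feed the scatter chart from Hint 3; only the plotting step is left out here.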


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Cognitive Load | “Team Topologies” | Ch. 2 |
| Psychological Safety | “The Fearless Organization” | All |

Project 7: Service Level Expectation (SLE) Agreement

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown / Prometheus (for monitoring)
  • Alternative Programming Languages: Python, Terraform
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Reliability Engineering / Service Design
  • Software or Tool: Grafana, Datadog, Google Sheets
  • Main Book: “Site Reliability Engineering” (Ch. 4: SLOs)

What you’ll build: A set of “Service Level Expectations” (SLEs) between two teams (e.g., App Team and Platform Team). Unlike an SLA (a formal contract), an SLE is an internal operating agreement on how the two teams will behave toward each other.

Why it teaches Operating Model Design: Interfaces are useless if they aren’t predictable. This project teaches you to define “Quality of Service” (QoS) as a team boundary. It turns “We’ll try to help” into “We provide 99.9% uptime and 4-hour ticket response.”

Core challenges you’ll face:

  • Choosing the right SLIs (Indicators) → maps to measuring what matters to the customer
  • Defining an “Error Budget” → maps to balancing speed vs. reliability
  • Negotiating the agreement → maps to stakeholder management in operating models

Key Concepts:

  • SLI vs. SLO vs. SLA: “SRE Book” Ch. 4
  • Error Budgets: “SRE Book” Ch. 3

Real World Outcome

A live dashboard showing the “Health” of the team interface. If the Platform Team’s ticket response time slips past the agreed target, the dashboard turns yellow, and the team prioritizes the backlog.

Example Metric:

  • Indicator: Time from “Ticket Created” to “Ticket Resolved” for ‘Access Requests’.
  • Target: 90% of requests resolved in < 8 business hours.
  • Current: 82% (VIOLATION).
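As a rough sketch of how the metric above could be computed, assume the ticketing system can export each request's resolution time in business hours. The `Ticket` class and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    business_hours_to_resolve: float  # assumed to be precomputed by the ticket export


def sle_compliance(tickets, threshold_hours=8.0):
    """Return the share of tickets resolved in under `threshold_hours`."""
    if not tickets:
        return 1.0  # no requests means nothing was missed
    met = sum(1 for t in tickets if t.business_hours_to_resolve < threshold_hours)
    return met / len(tickets)


def sle_status(compliance, target=0.90):
    """Compare the measured indicator with the agreed target."""
    return "OK" if compliance >= target else "VIOLATION"


# Illustrative data only: 9 of 11 access requests were resolved in time.
hours = [1.5, 2.0, 3.0, 4.5, 5.0, 6.0, 7.0, 7.5, 7.9, 9.0, 20.0]
tickets = [Ticket(f"REQ-{i}", h) for i, h in enumerate(hours, start=1)]

compliance = sle_compliance(tickets)
print(f"Current: {compliance:.0%} ({sle_status(compliance)})")
```

With this sample data the script reports 82%, which is below the 90% target, so the status is a violation.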

The Core Question You’re Answering

“What is the ‘Contract’ between our teams, and how do we know if we’re breaking it?”

Before you write any code, sit with this question. Most team frustrations come from mismatched expectations. Team A expects an answer in 10 minutes; Team B thinks 2 days is fine. An SLE makes the operating model explicit and measurable.


Concepts You Must Understand First

Stop and research these before coding:

  1. Service Level Indicators (SLIs)
    • What are the “Four Golden Signals” of monitoring?
    • Book Reference: “SRE Book” Ch. 6
  2. Error Budgets
    • How do you use “Failure” to decide when to stop shipping new features?
    • Book Reference: “SRE Book” Ch. 3
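To make the error-budget idea concrete, here is a small worked example. It assumes an availability SLO measured over a 30-day window; the freeze policy is a hypothetical rule, not a prescribed one.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_policy(remaining):
    """Hypothetical policy: pause feature releases once the budget is spent."""
    return "ship features" if remaining > 0 else "freeze features, fix reliability"


budget = error_budget_minutes(0.999)        # about 43.2 minutes per 30 days
remaining = budget_remaining(0.999, 30.0)   # 30 minutes of downtime so far
print(f"budget={budget:.1f} min, remaining={remaining:.0%}, policy={release_policy(remaining)}")
```

A 99.9% target leaves roughly 43 minutes of downtime per month. Once that is spent, the agreement (not a manager's mood) decides that reliability work comes first.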

Questions to Guide Your Design

Before implementing, think through these:

  1. User Focus
    • If you are the Platform Team, who is your “User”?
    • What does that user actually care about? (Uptime? Latency? Ease of use?)
  2. Consequences
    • What happens if the SLE is missed? Does a manager get paged? Does the team change their priorities?

Thinking Exercise

The “No-Phone” Week

Imagine your team is forbidden from using Slack or Zoom for one week. You can only communicate via your documented SLEs and “Team API” (Tickets/Docs).

Questions while analyzing:

  • Does your SLE define what happens when a ticket is “Urgent”?
  • Does it define where to find documentation?
  • If the other team “fails” their SLE, do you have a path to escalate without a Zoom call?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is the difference between an SLO and an SLA?”
  2. “How do you handle a situation where a team is consistently missing their SLOs?”
  3. “Explain the concept of an Error Budget.”
  4. “What are the ‘Four Golden Signals’?”
  5. “How do you choose the right SLI for a non-technical team (like HR)?”

Hints in Layers

Hint 1: Start with the ‘Pain’. Ask: “What is the one thing we always argue about with Team X?” That is your first SLI.

Hint 2: Define ‘Availability’. If you own a service, what does “Up” mean? Is it 200 OK responses? Is it < 500ms latency? Be precise.

Hint 3: Set ‘Aspirational’ Targets. Don’t aim for 100%. Aim for what is sufficient. If a developer can wait 4 hours for a PR review, 95% in < 4 hours is your SLO.

Hint 4: Automate the Dashboard. The agreement must be visible to both teams at all times.


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| SLOs | “SRE Book” (Google) | Ch. 4 |
| Monitoring | “SRE Book” (Google) | Ch. 6 |

Project 8: The Dependency Spaghetti Visualizer

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Python (NetworkX) / Graphviz
  • Alternative Programming Languages: JavaScript (Cytoscape.js), Neo4j
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Graph Theory / Systems Architecture
  • Software or Tool: Prometheus (Dependency data), Zipkin/Jaeger (Tracing)
  • Main Book: “Team Topologies” (Ch. 4: Static Team Patterns)

What you’ll build: A tool that extracts dependency data and visualizes the “Team-to-Team Dependencies.” You’ll specifically look for “Circular Dependencies” and “Bottleneck Teams.”

Why it teaches Operating Model Design: Dependencies are the “Enemy of Flow.” If Team A can’t ship without Team B, your operating model is a “Distributed Monolith.” This project teaches you to find and break these technical-organizational knots.

Core challenges you’ll face:

  • Data Extraction → maps to where does the ‘Truth’ about dependencies live?
  • Graph Clustering → maps to identifying ‘Fracture Planes’ for new team boundaries
  • Filtering Noise → maps to distinguishing between a library import and a service dependency

Key Concepts:

  • Tight vs. Loose Coupling: “Building Evolutionary Architectures”
  • Fracture Planes: “Team Topologies” Ch. 8

Real World Outcome

A 3D or 2D interactive graph of your organization. When you click a team, it highlights everyone they “Block” and everyone who “Blocks” them.

Example Output:

[Dependency Report]
- CRITICAL PATH: Team Checkout -> Team Payments -> Team Auth -> Team DB-Admins.
- CIRCULAR DEP: Team Promo <-> Team Cart (Must be merged or split).
- FAN-IN: Team DevOps is depended on by 42 teams. (HIGH RISK).
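A report like the one above could be produced from a plain adjacency map. The sketch below uses only the standard library. The team names echo the example, but the edge data is made up for illustration.

```python
from collections import deque

# Hypothetical team-level edges: "A": ["B"] means Team A cannot ship without Team B.
DEPENDS_ON = {
    "Checkout": ["Payments"],
    "Payments": ["Auth"],
    "Auth": ["DB-Admins"],
    "Promo": ["Cart"],
    "Cart": ["Promo"],
    "DB-Admins": [],
}


def blocked_by(team, graph):
    """Every team that `team` waits on, directly or transitively (breadth-first search)."""
    seen, queue = set(), deque(graph.get(team, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph.get(current, []))
    return seen


def circular_pairs(graph):
    """Pairs of teams that transitively depend on each other."""
    reach = {team: blocked_by(team, graph) for team in graph}
    return sorted(
        (a, b) for a in graph for b in reach[a]
        if a < b and a in reach.get(b, set())
    )


def fan_in(graph):
    """How many teams depend directly on each team."""
    counts = {team: 0 for team in graph}
    for deps in graph.values():
        for dep in deps:
            counts[dep] = counts.get(dep, 0) + 1
    return counts


print("Checkout is blocked by:", sorted(blocked_by("Checkout", DEPENDS_ON)))
print("Circular:", circular_pairs(DEPENDS_ON))
print("Fan-in:", fan_in(DEPENDS_ON))
```

`blocked_by` is also the query behind the "click a team" interaction: run it on the graph for "who blocks me" and on the reversed graph for "whom do I block".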

The Core Question You’re Answering

“Who is stopping us from moving faster, and is it a technical problem or an organizational one?”

Before you write any code, sit with this question. Most “Technical Debt” is actually “Organizational Debt”—we built a messy system because we have a messy team structure. Visualizing the spaghetti is the first step to untangling the architecture.


Concepts You Must Understand First

Stop and research these before coding:

  1. Conway’s Law (The Inverse)
    • If the code is spaghetti, the team structure is likely spaghetti.
    • Book Reference: “Team Topologies” Ch. 2
  2. Fracture Planes
    • What are the natural ways to split a large system?
    • Book Reference: “Team Topologies” Ch. 8

Questions to Guide Your Design

Before implementing, think through these:

  1. Types of Dependencies
    • Is it a “Design” dependency (I need their approval)?
    • Is it a “Runtime” dependency (My service calls their API)?
  2. Visual Encoding
    • How do you show the “Strength” of a dependency?
    • How do you show “Direction”?

Thinking Exercise

The “Feature Trace”

Pick a new feature that was recently launched.

Questions while analyzing:

  • How many teams had to touch the code?
  • How many teams had to “Approve” the PR?
  • If the answer is > 3, your dependency graph is too dense.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you identify a ‘bottleneck team’?”
  2. “What is a ‘Circular Dependency’ in an organizational context?”
  3. “How can you use Conway’s Law to fix architecture?”
  4. “Explain the concept of ‘Fracture Planes’.”
  5. “When should you merge two teams into one?”

Hints in Layers

Hint 1: Use ‘distributed tracing’ data. Map services back to their owners.

Hint 2: Simplify the Nodes. Map the 10 Teams, not every individual service. If Team A’s services talk to Team B’s services 10,000 times a day, that’s one thick line.

Hint 3: Look for ‘Crossing Boundaries’. Are there teams that only talk to each other? They should probably be one team.

Hint 4: Use ‘NetworkX’ (Python). You can calculate “Centrality” and “Clusters” with a few lines of code.
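Hints 1, 2, and 4 can be sketched without any dependencies. The ownership map and call counts below are invented, and the centrality function is a hand-written stand-in for what a graph library such as NetworkX would compute for you.

```python
from collections import Counter

# Hypothetical inputs: service ownership and call counts taken from tracing data.
OWNER = {
    "cart-api": "Cart",
    "promo-engine": "Promo",
    "checkout-web": "Checkout",
    "payments-core": "Payments",
    "ledger": "Payments",
}
CALLS = [  # (caller service, callee service, calls per day)
    ("checkout-web", "payments-core", 8000),
    ("checkout-web", "ledger", 2000),
    ("checkout-web", "cart-api", 500),
    ("cart-api", "promo-engine", 300),
    ("payments-core", "ledger", 9000),  # same team, so not a team dependency
]


def team_edges(calls, owner):
    """Collapse service-to-service calls into weighted team-to-team edges."""
    edges = Counter()
    for caller, callee, count in calls:
        src, dst = owner[caller], owner[callee]
        if src != dst:  # ignore traffic inside a single team
            edges[(src, dst)] += count
    return edges


def in_degree_centrality(edges):
    """Share of the other teams that depend on each team (1.0 = everyone does)."""
    teams = {t for edge in edges for t in edge}
    dependents = Counter(dst for _, dst in edges)
    return {t: dependents[t] / (len(teams) - 1) for t in sorted(teams)}


edges = team_edges(CALLS, OWNER)
for (src, dst), weight in edges.most_common():
    print(f"{src} -> {dst}: {weight} calls/day")
print(in_degree_centrality(edges))
```

The edge weight is the "thickness" of the line in the visualization. A team whose in-degree centrality approaches 1.0 is a candidate bottleneck.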


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Fracture Planes | “Team Topologies” | Ch. 8 |
| Evolutionary Architecture | “Building Evolutionary Architectures” | Ch. 3 |

Project 9: The Operational Readiness Review (ORR) System

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown / Checklist-as-Code (YAML)
  • Alternative Programming Languages: Python, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Governance / Reliability
  • Software or Tool: GitHub Actions, Jira
  • Main Book: “The Site Reliability Workbook” (Ch. 11)

What you’ll build: A system that automates the “handover” or “promotion” of a service from “Experimental” to “Production-Ready.” It’s a set of automated checks (e.g., “Does it have a README?”, “Is there an on-call rotation?”, “Is logging enabled?”) that must pass before a service is officially supported by the platform.

Why it teaches Operating Model Design: It defines the “Quality Bar” for team boundaries. It prevents “Boundary Bleed,” where one team’s poor operational habits become another team’s 3 AM problem.

Core challenges you’ll face:

  • Balancing “Strictness” vs. “Speed” → maps to reducing friction while maintaining quality
  • Automating the checks → maps to Self-Service governance
  • Integrating into CI/CD → maps to shifting operations left

Key Concepts:

  • Operational Readiness Review (ORR)
  • Service Maturity Models

Real World Outcome

A “Maturity Badge” on every repository. A service cannot be deployed to Production unless it reaches “Level 2 Maturity” (as defined by your ORR system).

Example ORR Checklist:

checks:
  - id: "on-call-defined"
    type: "manual"
    description: "PagerDuty service exists and has an active rotation."
  - id: "logging-standard"
    type: "automated"
    query: "grep 'structured_logging' config.yaml"
  - id: "slo-published"
    type: "automated"
    query: "check_url service.com/.well-known/slo"

The Core Question You’re Answering

“How do we ensure that ‘Freedom’ for teams doesn’t lead to ‘Chaos’ for the organization?”

Before you write any code, sit with this question. Modern operating models give teams autonomy, but autonomy without standards is dangerous. The ORR system is the “Policy” that makes autonomy safe.


Concepts You Must Understand First

Stop and research these before coding:

  1. Shift-Left Operations
    • Why should operational checks happen during development, not after?
    • Book Reference: “The Phoenix Project”
  2. Service Maturity Models
    • What are the levels of maturity for a microservice?
    • Book Reference: “The Site Reliability Workbook” Ch. 11

Questions to Guide Your Design

Before implementing, think through these:

  1. Automation
    • Which checks can be done by a script (e.g., checking for a file) vs. a human (e.g., reviewing a design)?
    • How do you handle “Exceptions”?
  2. Incentives
    • Why would a developer want to pass the ORR? (e.g., Does it unlock better support? Does it allow them to deploy to Prod?)

Thinking Exercise

The “Day 0” Disaster

Imagine you launch a service and it crashes within 5 minutes.

Questions while analyzing:

  • What information would you need to fix it? (Logs? Metrics? Owner name?)
  • If that information isn’t available, whose fault is it?
  • Could an automated check have caught the missing information before launch?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is an Operational Readiness Review?”
  2. “How do you automate governance in a fast-moving organization?”
  3. “Should a centralized team perform ORRs, or should they be self-service?”
  4. “What are the top 3 items you would put on a production-readiness checklist?”
  5. “How do you handle ‘Legacy Services’ that don’t meet new standards?”

Hints in Layers

Hint 1: Start with a simple Markdown file. List 10 things every service must have.

Hint 2: Use GitHub ‘Branch Protection’. Require that a specific check (like orr-status) passes before merging to main.

Hint 3: Build the ‘ORR Checker’. Write a script that looks for specific files (e.g., OWNERS, SLO.md, k8s/probes.yaml).

Hint 4: Reward the ‘Gold’ Status. Create a dashboard that shows which teams have the most “Gold Level” services. Gamify operational excellence.
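A minimal version of the checker from Hint 3 might look like the sketch below. The file list comes from the hint, while the three-step maturity scale is an assumption made for illustration.

```python
from pathlib import Path

# Files taken from Hint 3; the maturity scale below is an assumption.
REQUIRED_FILES = ["OWNERS", "SLO.md", "k8s/probes.yaml"]


def run_orr(repo_root):
    """Return {relative_path: True/False} for each required file."""
    root = Path(repo_root)
    return {name: (root / name).is_file() for name in REQUIRED_FILES}


def maturity_level(results):
    """Hypothetical scale: 2 when every check passes, 1 when some pass, else 0."""
    passed = sum(results.values())
    if passed == len(results):
        return 2
    return 1 if passed else 0


def report(repo_root):
    """Human-readable summary, suitable for a CI log."""
    results = run_orr(repo_root)
    lines = [f"[{'PASS' if ok else 'FAIL'}] {name}" for name, ok in results.items()]
    lines.append(f"Maturity: Level {maturity_level(results)}")
    return "\n".join(lines)


def exit_code(repo_root):
    """0 when the repository reaches Level 2, otherwise 1 (fails the CI job)."""
    return 0 if maturity_level(run_orr(repo_root)) == 2 else 1
```

In a CI job you would print `report(".")` and exit with `exit_code(".")`, so the required status check from Hint 2 goes red when a file is missing.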


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Readiness | “The Site Reliability Workbook” | Ch. 11 |
| Governance | “Modern Software Engineering” | Ch. 11 |

Project 10: Incident Response “Battle Cards”

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Markdown / HTML
  • Alternative Programming Languages: JavaScript (for interactive cards)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Crisis Management / Process Design
  • Software or Tool: Notion, Confluence, GitHub Wikis
  • Main Book: “The Site Reliability Workbook” (Ch. 9: Incident Response)

What you’ll build: A set of “Battle Cards”—highly condensed, 1-page guides for specific incident scenarios (e.g., “Database Down,” “DDoS Attack,” “API Latency”). These cards define exactly who is the lead, who to notify, and the first 3 diagnostic steps.

Why it teaches Operating Model Design: It codifies the “Escalation Path” and “Team Interaction” during a crisis. It reduces the “Cognitive Load” when people are under extreme stress.

Core challenges you’ll face:

  • Condensing complex info → maps to identifying the ‘Critical Path’ of an interface
  • Defining roles → maps to Incident Commander vs. Communications Lead
  • Keeping cards updated → maps to process maintenance

Key Concepts:

  • The OODA Loop (Observe, Orient, Decide, Act)
  • Incident Command System (ICS)

Real World Outcome

A physical or digital “Deck” of cards. When an incident starts, the team pulls the relevant card and follows the protocol. No one has to ask “Who should I call?”

Example Card:

# Scenario: Checkout Service 5xx Errors
**Primary Lead**: Team Checkout On-call
**Comms Lead**: SRE Lead
**First Steps**:
1. Check RDS Latency in Grafana [Link]
2. Verify Redis Connection Pool [Link]
3. If Latency > 2s, Scale Checkout Pods to 10x.
**Notify**: #incidents-public, Product-Owner-Checkout

The Core Question You’re Answering

“Can your team handle a P0 incident without the ‘Senior Dev’ being awake?”

Before you write any code, sit with this question. If your operating model depends on a single person’s “Intuition,” it is not scalable. Battle Cards turn intuition into a repeatable process.


Concepts You Must Understand First

Stop and research these before coding:

  1. The Incident Command System (ICS)
    • What are the standard roles in an incident?
    • Book Reference: “The Site Reliability Workbook” Ch. 9
  2. The OODA Loop
    • How do you speed up the cycle of making decisions during a crisis?
    • Reference: Military strategy literature.

Questions to Guide Your Design

Before implementing, think through these:

  1. Simplicity
    • Can a tired person read this at 3 AM?
    • Are the links easy to click?
  2. Interaction
    • When does the card tell you to stop working and call another team?
    • How is that “Handover” defined?

Thinking Exercise

The “Blank Screen” Drill

Imagine you are looking at a dashboard where every chart is red. You have 3 minutes to decide who to page.

Questions while analyzing:

  • Does your Battle Card help you narrow down the source?
  • Does it tell you who to notify before you start the fix?
  • If the fix takes 2 hours, does the card have a schedule for updates?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you organize an incident response team?”
  2. “What is the role of a ‘Communications Lead’?”
  3. “How do you conduct a blameless post-mortem?”
  4. “Why is it important to have pre-defined ‘Battle Cards’?”
  5. “How do you measure the effectiveness of your incident response?”

Hints in Layers

Hint 1: Pick the ‘Top 5’ incidents. Don’t write cards for everything. Write them for the 5 things that happen most often.

Hint 2: Follow the ‘3-Step Rule’. Every card should have exactly 3 “Immediate Actions.” Don’t overwhelm people.

Hint 3: Explicitly define ‘Notification’. The card must say: “At minute 10, post to #incidents. At minute 30, page the CTO.”

Hint 4: Use a ‘Template’. Ensure every card has the same layout so people know where to look for the “Links” or “Contacts.”
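One way to enforce the template and the 3-Step Rule is to generate cards from structured data instead of writing them by hand. The `BattleCard` class below is a hypothetical sketch. Its layout mirrors the example card in this project.

```python
from dataclasses import dataclass, field


@dataclass
class BattleCard:
    scenario: str
    primary_lead: str
    comms_lead: str
    first_steps: list
    notify: list = field(default_factory=list)

    def validate(self):
        """Enforce the '3-Step Rule' from Hint 2."""
        if len(self.first_steps) != 3:
            raise ValueError(
                f"{self.scenario}: expected exactly 3 first steps, got {len(self.first_steps)}"
            )

    def render(self):
        """Render every card with the same Markdown layout (Hint 4)."""
        self.validate()
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(self.first_steps, start=1))
        return (
            f"# Scenario: {self.scenario}\n"
            f"**Primary Lead**: {self.primary_lead}\n"
            f"**Comms Lead**: {self.comms_lead}\n"
            f"**First Steps**:\n{steps}\n"
            f"**Notify**: {', '.join(self.notify)}"
        )


card = BattleCard(
    scenario="Checkout Service 5xx Errors",
    primary_lead="Team Checkout On-call",
    comms_lead="SRE Lead",
    first_steps=[
        "Check RDS Latency in Grafana [Link]",
        "Verify Redis Connection Pool [Link]",
        "If Latency > 2s, Scale Checkout Pods to 10x.",
    ],
    notify=["#incidents-public", "Product-Owner-Checkout"],
)
print(card.render())
```

Because `render` refuses cards that break the rule, a card with five "immediate" actions fails at review time rather than at 3 AM.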


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Incident Response | “The Site Reliability Workbook” | Ch. 9 |
| Post-mortems | “SRE Book” (Google) | Ch. 15 |

Project 11: The Internal Service Catalog (Metadata Design)

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Python / SQL
  • Alternative Programming Languages: Go, TypeScript
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Architecture / Metadata
  • Software or Tool: Backstage.io, Postgres, GraphQL
  • Main Book: “Team Topologies” (Ch. 6: Team APIs)

What you’ll build: A centralized “Source of Truth” for every service in the company. This is a database (and API) that stores service name, owner, repository, SLO link, documentation link, and dependency list.

Why it teaches Operating Model Design: This is the “Phone Book” of the operating model. It turns vague organizational structures into a queryable graph. It allows you to ask: “Which services are owned by Team X?” or “Which services don’t have an SLO?”

Core challenges you’ll face:

  • Designing the Schema → maps to what is the minimum metadata required to define a service?
  • Keeping data in sync → maps to GitOps vs. Manual entry
  • Searchability → maps to helping developers discover existing services

Key Concepts:

  • Service Mesh vs. Service Registry
  • Catalog Metadata (C4 Model)

Real World Outcome

A searchable UI (or CLI) where any developer can type catalog search checkout and get back the repository, the owning team’s Slack channel, and the current on-call responder.

Example GraphQL Schema:

type Service {
  id: ID!
  name: String!
  owner: Team!
  repoUrl: String
  onCall: String
  dependencies: [Service]
  slo: SLO
}
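The same shape can be prototyped in a relational store before any GraphQL layer exists. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and example.com URLs are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    owner_id INTEGER NOT NULL REFERENCES team(id),
    repo_url TEXT,
    slo_url TEXT            -- NULL means no SLO has been published
);
CREATE TABLE dependency (
    service_id INTEGER NOT NULL REFERENCES service(id),
    depends_on_id INTEGER NOT NULL REFERENCES service(id),
    PRIMARY KEY (service_id, depends_on_id)
);
""")

conn.executemany("INSERT INTO team (id, name) VALUES (?, ?)",
                 [(1, "Checkout"), (2, "Payments")])
conn.executemany(
    "INSERT INTO service (id, name, owner_id, repo_url, slo_url) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "checkout-web", 1, "https://git.example.com/checkout-web",
         "https://docs.example.com/slo/checkout-web"),
        (2, "cart-api", 1, "https://git.example.com/cart-api", None),
        (3, "payments-core", 2, "https://git.example.com/payments-core", None),
    ],
)


def services_owned_by(team_name):
    """Answer: which services are owned by Team X?"""
    rows = conn.execute(
        "SELECT s.name FROM service s JOIN team t ON t.id = s.owner_id "
        "WHERE t.name = ? ORDER BY s.name",
        (team_name,),
    )
    return [name for (name,) in rows]


def services_without_slo():
    """Answer: which services don't have an SLO?"""
    rows = conn.execute("SELECT name FROM service WHERE slo_url IS NULL ORDER BY name")
    return [name for (name,) in rows]


print(services_owned_by("Checkout"))
print(services_without_slo())
```

Storing the owner as a foreign key rather than a copied string is a deliberate choice: renaming a team then touches one row in `team` instead of every service it owns.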

The Core Question You’re Answering

“How do we prevent people from building the same thing twice because they couldn’t find the existing version?”

Before you write any code, sit with this question. In large organizations, “Discovery” is a major source of waste. If a team needs an “Email Notification Service,” they should be able to find the existing one in seconds. The Catalog is the interface for discovery.


Concepts You Must Understand First

Stop and research these before coding:

  1. The C4 Model (Context, Containers, Components, Code)
    • How do you visualize software at different levels of abstraction?
    • Reference: c4model.com
  2. GitOps for Metadata
    • Why is it better to store service metadata in a YAML file in the repo than in a separate UI?
    • Reference: Standard GitOps literature.

Questions to Guide Your Design

Before implementing, think through these:

  1. Updates
    • If a team renames themselves, how do you update 100 services in the catalog?
    • Should the catalog “pull” data from repos, or should repos “push” data to the catalog?
  2. Consumers
    • Who uses the catalog API? (Developers? Security auditors? The CEO? The automated deployment pipeline?)

Thinking Exercise

The “New Microservice” Flow

Trace the steps of creating a new service from scratch.

Questions while analyzing:

  • At what point does the rest of the company find out this service exists?
  • How do they know what it does?
  • How do they know if it’s “Production Ready”?
  • If they have to search through 500 Slack channels, your discovery model is broken.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why is a Service Catalog essential for microservices?”
  2. “How do you ensure the metadata in the catalog stays accurate?”
  3. “What are the core fields every service should have in its metadata?”
  4. “How does a Service Catalog help with incident response?”
  5. “What is the difference between a Service Catalog and a CMDB (Configuration Management Database)?”

Hints in Layers

Hint 1: Use ‘Catalog-as-Code’. Every service repo should have a catalog-info.yaml file.

Hint 2: Build a ‘Scraper’. Write a script that clones all repos, reads the catalog-info.yaml, and inserts it into a database.

Hint 3: Expose via CLI. Developers live in the terminal. Give them a command like service info <name>.

Hint 4: Connect to On-call. Integrate with the PagerDuty API so the catalog shows the real-time on-call person, not just a static name.


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Service Discovery | “SRE Book” (Google) | Ch. 10 |
| Metadata | “Modern Software Engineering” | Ch. 4 |

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Interaction Audit | Level 1 | Weekend | High (Foundational) | ★★★☆☆ |
| 2. Team Service Interface | Level 1 | Weekend | Medium | ★★☆☆☆ |
| 3. Ownership Mapper | Level 2 | 1 Week | High (Practical) | ★★★★☆ |
| 4. Escalation Tree | Level 2 | 1 Week | High (Logic) | ★★★★☆ |
| 5. Platform Blueprint | Level 3 | 2 Weeks | Very High (Architecture) | ★★★★☆ |
| 6. Cognitive Load Survey | Level 2 | 1 Week | Medium (Human Factors) | ★★★☆☆ |
| 7. SLE Agreement | Level 2 | 1 Week | High (Reliability) | ★★☆☆☆ |
| 8. Dependency Visualizer | Level 3 | 2 Weeks | Expert (Systems) | ★★★★★ |
| 9. ORR System | Level 2 | 2 Weeks | High (Governance) | ★★★☆☆ |
| 10. Battle Cards | Level 1 | Weekend | Medium (Process) | ★★★★☆ |
| 11. Service Catalog | Level 3 | 1 Month | Expert (Metadata) | ★★★★☆ |
| 12. Coordination Calc | Level 3 | 2 Weeks | Expert (Economics) | ★★★★★ |

Recommendation

Where to start?

  • If you are a Team Lead: Start with Project 2 (Team Interface). It provides immediate value to your team and reduces the number of “random questions” you get on Slack.
  • If you are an Architect: Start with Project 8 (Dependency Visualizer). You need to see the “Spaghetti” before you can suggest a better structure.
  • If you are an Engineering Manager: Start with Project 6 (Cognitive Load Survey). This gives you the data to justify changing team sizes or hiring platform engineers.

Final Overall Project: The “Organizational Digital Twin”

What you’ll build: A comprehensive simulation platform that integrates the data from the Dependency Visualizer (8), the Service Catalog (11), and the Coordination Calculator (12) to create a “Digital Twin” of your organization.

The Goal: You should be able to “What-if” a re-org.

  • “What if we move the Payment Gateway from the Fintech team to the Platform team?”
  • “What if we split the Checkout team into ‘Cart’ and ‘Billing’?”

The simulation should output:

  1. Expected change in Lead Time.
  2. Expected change in Cognitive Load for affected teams.
  3. Expected Cost of Coordination delta.

Why this is the Master Project: This requires you to understand the organizational structure as a complex, interconnected system. It combines technical data (dependencies), human data (ownership), and economic data (cost) into a single model. This is the ultimate tool for an “Organizational Architect.”


Summary

This learning path covers Operating Model Design through 12 hands-on projects. Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Team Interaction Audit | Markdown/DOT | Level 1 | Weekend |
| 2 | Team Service Interface | Markdown | Level 1 | Weekend |
| 3 | Ownership Mapper | YAML/Python | Level 2 | 1 Week |
| 4 | Escalation Logic Tree | Python | Level 2 | 1 Week |
| 5 | Platform Blueprint | JSON/Mocks | Level 3 | 2 Weeks |
| 6 | Cognitive Load Survey | Python/R | Level 2 | 1 Week |
| 7 | SLE Agreement | Prometheus | Level 2 | 1 Week |
| 8 | Dependency Visualizer | Python | Level 3 | 2 Weeks |
| 9 | ORR System | YAML/CI-CD | Level 2 | 2 Weeks |
| 10 | Battle Cards | Markdown | Level 1 | Weekend |
| 11 | Service Catalog | GraphQL/SQL | Level 3 | 1 Month |
| 12 | Coordination Calculator | Python | Level 3 | 2 Weeks |

  • For beginners: Start with projects #1, #2, and #10. Focus on documentation and visibility.
  • For intermediate: Jump to projects #3, #4, #6, and #7. Focus on defining boundaries and measuring load.
  • For advanced: Focus on projects #5, #8, #11, and #12. Focus on systems architecture and organizational economics.

Expected Outcomes

After completing these projects, you will:

  • Master Team Topologies: Know exactly when to use each team type and interaction mode.
  • Quantify Waste: Be able to prove the financial cost of poor organization.
  • Design for Autonomy: Create boundaries that allow teams to ship without constant meetings.
  • Implement Governance-as-Code: Automate the “Quality Bar” for your organization.
  • Manage Cognitive Load: Systematically reduce the mental burden on your engineering teams.

You’ll have built 12 working tools and frameworks that demonstrate deep understanding of how to engineer an organization from first principles.


Project 12: The “Cost of Coordination” Calculator

  • File: OPERATING_MODEL_DESIGN_MASTERY.md
  • Main Programming Language: Python / Excel
  • Alternative Programming Languages: R, JavaScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Economic Engineering / Game Theory
  • Software or Tool: Jupyter Notebook, Monte Carlo Simulation
  • Main Book: “Principles of Product Development Flow” by Donald Reinertsen

What you’ll build: A mathematical model that calculates the “Cost of Coordination” for a given organizational structure. It accounts for meeting hours, handoff wait times, and “Context Switching Cost.”

Why it teaches Operating Model Design: It turns the “Feeling” of being slow into a “Financial Number.” It allows you to prove that a re-org to smaller, more autonomous teams will save $X million in lost productivity.

Core challenges you’ll face:

  • Quantifying “Wait Time” → maps to Little’s Law and Queueing Theory
  • Estimating “Context Switching Cost” → maps to Cognitive Load economics
  • Sensitivity Analysis → maps to which factor is the biggest bottleneck?

Key Concepts:

  • Little’s Law: L = λW (WIP = Throughput × Lead Time, so Lead Time = WIP / Throughput)
  • The Economic Cost of Delay (CoD)
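Little’s Law is simple enough to check by hand, but wrapping it in two helpers makes it reusable in a larger model. The function names are illustrative. The relationship holds for long-run averages of a stable system.

```python
def lead_time(wip, throughput):
    """Little's Law rearranged: W = L / lambda (average lead time)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


def wip_limit(target_lead_time, throughput):
    """Largest average WIP that still meets a target lead time: L = lambda * W."""
    return throughput * target_lead_time


# 10 items waiting, 2 items finished per day -> 5 days of average wait.
print(lead_time(wip=10, throughput=2))
# To get the wait down to 3 days at the same throughput, cap WIP at 6 items.
print(wip_limit(target_lead_time=3, throughput=2))
```

The second helper shows the practical lever: when throughput is fixed, the only way to shorten lead time is to limit work in progress.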

Real World Outcome

A simulation report that says: “By reducing cross-team dependencies from 5 to 2, we reduce our Average Lead Time by 40% and save $500k/year in salary hours spent in sync meetings.”

Example Formula: Total Cost = (Number of Teams * Coordination Complexity^2) + (Average Wait Time * Cost of Delay)
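The example formula translates directly into a function. The formula's first term has no monetary unit, so this sketch adds a `unit_cost` parameter to convert it into currency. That parameter and all the input numbers are assumptions for illustration only.

```python
def total_coordination_cost(num_teams, coordination_complexity,
                            avg_wait_weeks, cost_of_delay_per_week,
                            unit_cost=1.0):
    """Total Cost = (teams * complexity^2) + (average wait time * cost of delay).

    `unit_cost` (an assumption, not part of the original formula) converts the
    first term into money so both terms can be added.
    """
    coordination = num_teams * coordination_complexity ** 2 * unit_cost
    delay = avg_wait_weeks * cost_of_delay_per_week
    return coordination + delay


# Hypothetical before/after: 5 cross-team dependencies reduced to 2,
# with the average wait dropping from 4 weeks to 2.
before = total_coordination_cost(5, 5, 4, 10_000, unit_cost=1_000)
after = total_coordination_cost(5, 2, 2, 10_000, unit_cost=1_000)
print(f"before={before:,.0f} after={after:,.0f} saved={before - after:,.0f}")
```

Because complexity is squared, cutting dependencies shrinks the first term much faster than cutting the number of teams would.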


The Core Question You’re Answering

“What is the price we pay for our organizational complexity?”

Before you write any code, sit with this question. Most managers think “Communication is good.” Economically, communication is a cost. We want “High-bandwidth communication” inside a team, but “Low-bandwidth communication” between teams. This project quantifies that cost.


Concepts You Must Understand First

Stop and research these before coding:

  1. Queueing Theory
    • How do queues (backlogs) grow as utilization approaches 100%?
    • Book Reference: “Principles of Product Development Flow” - Reinertsen
  2. Cost of Delay
    • How much money do we lose for every week a feature is stuck in a “Handoff”?
    • Book Reference: “The Art of Business Value” - Mark Schwartz
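The utilization effect can be illustrated with the simplest textbook queue. The sketch assumes an M/M/1 model (one server, random arrivals, exponentially distributed work), which real teams only approximate. In that model the average time in the system is the service time divided by (1 - utilization).

```python
def mm1_lead_time(service_time_days, utilization):
    """Average time in an M/M/1 system: service time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue grows without bound")
    return service_time_days / (1.0 - utilization)


for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    days = mm1_lead_time(1.0, rho)
    print(f"utilization {rho:.0%}: a 1-day task takes about {days:.0f} days end to end")
```

Going from 50% to 90% utilization multiplies the lead time by five in this model, which is the quantitative answer to "just work harder."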

Questions to Guide Your Design

Before implementing, think through these:

  1. Variables
    • How many hours a week does a dev spend in “Inter-team meetings”?
    • How long does a PR sit waiting for a review from another team?
  2. The “Non-Linear” Effect
    • Does coordination cost grow linearly with team size, or faster? (Hint: The number of communication paths is n(n-1)/2, so it grows roughly as n^2, which is quadratic rather than exponential.)
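The non-linear effect comes from counting pairs: every person (or team) can, in principle, need to talk to every other one. A quick calculation shows how fast that grows.

```python
def communication_paths(n):
    """Number of distinct pairs among n people or teams: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for size in (5, 10, 50):
    print(f"{size} members -> {communication_paths(size)} possible communication paths")
```

Doubling a group from 5 to 10 raises the possible paths from 10 to 45, which is why splitting one large team into two smaller ones with a narrow interface can reduce total coordination.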

Thinking Exercise

The “CEO Pitch”

Imagine you have 2 minutes to explain to the CEO why the current team structure is costing the company money.

Questions while analyzing:

  • Can you explain “Wait Time” without using technical jargon?
  • Can you show a chart that shows “Speed” vs “Number of Handoffs”?
  • If the CEO says “Just work harder,” do you have the data to show that “Working Harder” doesn’t fix a queue problem?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain Little’s Law and its relevance to software delivery.”
  2. “How do you calculate the ‘Cost of Delay’?”
  3. “What is the ‘Economic’ reason for having small, autonomous teams?”
  4. “Why does 100% resource utilization actually slow down a system?”
  5. “How do you balance the cost of ‘Duplication’ (building two things) vs the cost of ‘Coordination’ (sharing one thing)?”

Hints in Layers

Hint 1: Start with ‘Meeting Hours’. Look at the calendars of 5 developers from different teams. Calculate the average % of time spent in meetings with other teams.

Hint 2: Use Little’s Law. If a team has 10 items in their “Waiting for Review” column, and they finish 2 items a day, the “Wait Time” is 5 days.

Hint 3: Model the ‘Handoff’. Use a Python script to simulate a feature passing through 5 teams. Randomize the “Processing Time” and the “Wait Time” at each step.

Hint 4: Calculate the ‘Delta’. Run the simulation again with 2 teams. Compare the total time. Convert the difference to hours and multiply it by the average hourly rate of the developers.
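Hints 3 and 4 can be sketched as a small Monte Carlo simulation using only the standard library. The exponential distributions, the mean wait and work times, the hourly rate, and the feature volume are all assumptions. Replace them with numbers measured in your own organization.

```python
import random


def simulate_feature(num_teams, rng, mean_work_hours=8.0, mean_wait_hours=24.0):
    """One feature passing through `num_teams` handoffs; returns total elapsed hours."""
    total = 0.0
    for _ in range(num_teams):
        total += rng.expovariate(1.0 / mean_wait_hours)  # time sitting in the team's queue
        total += rng.expovariate(1.0 / mean_work_hours)  # time actually being worked on
    return total


def average_lead_time(num_teams, runs=10_000, seed=42):
    """Average elapsed hours over many simulated features (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(simulate_feature(num_teams, rng) for _ in range(runs)) / runs


def delay_cost_delta(teams_before, teams_after, hourly_rate, features_per_year):
    """Hours saved per feature * hourly rate * yearly volume.

    Simplification: elapsed hours are priced at the hourly rate, which stands in
    for a properly estimated Cost of Delay.
    """
    saved_hours = average_lead_time(teams_before) - average_lead_time(teams_after)
    return saved_hours * hourly_rate * features_per_year


five = average_lead_time(5)
two = average_lead_time(2)
print(f"5 teams: {five:.0f} h, 2 teams: {two:.0f} h")
print(f"estimated yearly delta: {delay_cost_delta(5, 2, hourly_rate=100, features_per_year=50):,.0f}")
```

With these assumed means, each handoff adds about 32 hours on average, so the five-team path lands near 160 hours and the two-team path near 64. The exact figures vary slightly with the random seed.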


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Flow | “Principles of Product Development Flow” | Ch. 2 |
| Economics | “The Art of Business Value” | All |