Project 11: The Internal Service Catalog (Metadata Design)

Build a centralized “Source of Truth” for every service in the company with ownership, dependencies, documentation, and SLO links.

Quick Reference

Attribute               Value
─────────────────────────────────────────────────────────────────
Difficulty              Advanced
Time Estimate           1 Month (40-60 hours)
Primary Language        Python / SQL
Alternative Languages   Go, TypeScript
Prerequisites           API design, database basics
Key Topics              Service Discovery, Metadata Management, Developer Portal

1. Learning Objectives

By completing this project, you will:

  1. Design a metadata schema for service information
  2. Build a searchable catalog with API access
  3. Integrate with the existing sources of truth (repos, PagerDuty, Grafana)
  4. Enable service discovery for developers
  5. Create a “phone book” for the entire organization

2. Theoretical Foundation

2.1 Core Concepts

The Discovery Problem

WITHOUT CATALOG                     WITH CATALOG
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ "Where's the user service?" │    │ catalog search user         │
│ → Ask in Slack              │    │ → user-service (Identity)   │
│ → "I think Team X owns it"  │    │   Owner: @identity-team     │
│ → DM 3 people               │    │   Docs: wiki.internal/user  │
│ → Find outdated wiki page   │    │   On-call: PagerDuty link   │
│ → 45 minutes later...       │    │   2 seconds.                │
└─────────────────────────────┘    └─────────────────────────────┘

Catalog-Info as Code

# catalog-info.yaml (lives in every repo)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
    grafana/dashboard-url: https://grafana.internal/d/checkout
spec:
  type: service
  lifecycle: production
  owner: team-checkout
  dependsOn:
    - component:payment-gateway
    - component:inventory-service
  providesApis:
    - checkout-api
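
The dependsOn entries above use a kind:name reference format. A minimal sketch of splitting such references, assuming a simplified format without namespaces (EntityRef and parse_entity_ref are hypothetical names for illustration, not part of Backstage):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRef:
    kind: str
    name: str


def parse_entity_ref(ref: str, default_kind: str = "component") -> EntityRef:
    """Split 'component:payment-gateway' into kind and name.

    A bare name such as 'checkout-api' falls back to default_kind.
    """
    kind, sep, name = ref.partition(":")
    if not sep:
        # No colon present: the whole string is the name.
        kind, name = default_kind, ref
    if not kind or not name:
        raise ValueError(f"malformed entity reference: {ref!r}")
    return EntityRef(kind=kind.lower(), name=name)


depends_on = ["component:payment-gateway", "component:inventory-service"]
refs = [parse_entity_ref(r) for r in depends_on]
print([r.name for r in refs])
```

Normalizing references at ingestion time means the database stores plain service names, so lookups do not need to know about the prefix.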

The C4 Model for Visualization

┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 1: SYSTEM CONTEXT                                        │
│                                                                 │
│   ┌─────────┐         ┌─────────────────┐         ┌─────────┐  │
│   │ Customer│◄───────►│  E-Commerce     │◄───────►│ Payment │  │
│   │         │         │  Platform       │         │ Provider│  │
│   └─────────┘         └─────────────────┘         └─────────┘  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 2: CONTAINER (Zoom into Platform)                        │
│                                                                 │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐          │
│   │  Checkout   │──►│   Payment   │──►│  Inventory  │          │
│   │   Service   │   │   Gateway   │   │   Service   │          │
│   └─────────────┘   └─────────────┘   └─────────────┘          │
│          │                                    │                 │
│          └────────────────────────────────────┘                 │
│                          │                                      │
│                    ┌─────────────┐                             │
│                    │  Database   │                             │
│                    └─────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

2.2 Why This Matters

In large organizations:

  • Teams don’t know what other teams are building
  • Duplicate services get created
  • Incidents take longer (can’t find owner)
  • Onboarding is slow (where is everything?)

The Catalog solves:

  • “Who owns this service?”
  • “What does this service do?”
  • “What are its dependencies?”
  • “Where are the docs?”
  • “Who do I page if it’s down?”

2.3 Historical Context

  • CMDBs (1990s): Configuration Management Databases (often stale)
  • Service Registries (2010s): Consul, Eureka for runtime discovery
  • Developer Portals (2020s): Backstage, Cortex, OpsLevel

2.4 Common Misconceptions

Misconception                    Reality
─────────────────────────────────────────────────────────────────────────────────────
“We have a wiki”                 Wikis go stale. Catalog-as-code stays current.
“Service mesh is enough”         Mesh is runtime. Catalog is human-readable metadata.
“It’s too much overhead”         The overhead of NOT having it is higher.
“Only big companies need this”   At 20+ services, discovery becomes a problem.

3. Project Specification

3.1 What You Will Build

  1. Schema Definition: What metadata every service must have
  2. Data Ingestion: Pull from repos, APIs, and manual input
  3. API: Query services programmatically
  4. UI: Search and browse interface
  5. Integrations: Link to PagerDuty, Grafana, GitHub

3.2 Functional Requirements

  1. Core Entity: Service
    • Name, description, owner team
    • Repository URL
    • Documentation URL
    • SLO dashboard link
    • On-call schedule link
    • Dependencies (what it depends on)
    • Dependents (what depends on it)
  2. Search
    • Full-text search across all fields
    • Filter by team, lifecycle, type
    • Results include key metadata
  3. API
    • GET /services - List all
    • GET /services/{id} - Get details
    • GET /services?owner={team} - Filter by team
    • GET /teams/{id}/services - All team’s services
  4. UI
    • Homepage with search
    • Service detail page
    • Team overview page
    • Dependency graph visualization
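
Only dependencies need to be declared by service owners; dependents can be derived by inverting that mapping. A minimal in-memory sketch, assuming dependencies are available as a dict of service name to list of targets (build_dependents and the sample data are illustrative):

```python
from collections import defaultdict


def build_dependents(dependencies: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert 'A depends on B' into 'B is depended on by A'."""
    dependents = defaultdict(list)
    for source, targets in dependencies.items():
        for target in targets:
            dependents[target].append(source)
    # Sort for stable output and return a plain dict.
    return {name: sorted(sources) for name, sources in dependents.items()}


dependencies = {
    "checkout-service": ["payment-gateway", "inventory-service", "user-service"],
    "mobile-app": ["checkout-service"],
    "web-frontend": ["checkout-service", "user-service"],
}
dependents = build_dependents(dependencies)
print(dependents["checkout-service"])
```

Deriving dependents instead of storing them avoids two lists that can drift out of sync.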

3.3 Non-Functional Requirements

  • Support 500+ services
  • Search returns in < 500ms
  • Sync from sources at least hourly
  • Data must be version-controlled (YAML in repos)

3.4 Example Usage / Output

CLI Query:

$ catalog search checkout

╔══════════════════════════════════════════════════════════════════╗
║                    SERVICE CATALOG SEARCH                        ║
║                    Query: "checkout"                             ║
╠══════════════════════════════════════════════════════════════════╣

NAME                 OWNER           LIFECYCLE    TYPE
─────────────────────────────────────────────────────────────────
checkout-service     team-checkout   production   service
checkout-frontend    team-checkout   production   website
checkout-worker      team-checkout   production   worker

══════════════════════════════════════════════════════════════════

$ catalog info checkout-service

╔══════════════════════════════════════════════════════════════════╗
║                  checkout-service                                ║
╠══════════════════════════════════════════════════════════════════╣

Description:    Handles customer checkout flow including cart
                validation, price calculation, and order creation.

Owner:          team-checkout (#team-checkout)
Lifecycle:      production
Type:           service

Repository:     github.com/acme/checkout-service
Documentation:  wiki.internal/checkout-service
SLO Dashboard:  grafana.internal/d/checkout-slo

On-Call:        @checkout-oncall (PagerDuty)
                https://pagerduty.com/services/PXXXXXX

─────────────────────────────────────────────────────────────────
DEPENDENCIES (3)
─────────────────────────────────────────────────────────────────
→ payment-gateway      (team-payments)       production
→ inventory-service    (team-inventory)      production
→ user-service         (team-identity)       production

─────────────────────────────────────────────────────────────────
DEPENDENTS (2)
─────────────────────────────────────────────────────────────────
← mobile-app           (team-mobile)         production
← web-frontend         (team-web)            production

─────────────────────────────────────────────────────────────────
APIS PROVIDED
─────────────────────────────────────────────────────────────────
• checkout-api (OpenAPI spec: /docs/openapi.yaml)

╚══════════════════════════════════════════════════════════════════╝

GraphQL API:

query GetService {
  service(name: "checkout-service") {
    name
    description
    owner {
      name
      slackChannel
    }
    dependencies {
      name
      owner { name }
    }
    onCall {
      currentPrimary
      pagerDutyUrl
    }
    slo {
      availability
      latencyP99
    }
  }
}

Response:

{
  "data": {
    "service": {
      "name": "checkout-service",
      "description": "Handles customer checkout flow",
      "owner": {
        "name": "team-checkout",
        "slackChannel": "#team-checkout"
      },
      "dependencies": [
        {
          "name": "payment-gateway",
          "owner": { "name": "team-payments" }
        }
      ],
      "onCall": {
        "currentPrimary": "@jane-doe",
        "pagerDutyUrl": "https://pagerduty.com/services/PXXXXXX"
      },
      "slo": {
        "availability": 99.9,
        "latencyP99": 200
      }
    }
  }
}

3.5 Real World Outcome

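A working catalog is meant to produce outcomes like these (the figures are illustrative targets, not measurements):

  • Faster developer onboarding (for example, from 2 weeks to 2 days)
  • Lower incident MTTA, because the owner can be found in seconds
  • Fewer duplicate services, because existing ones are discoverable
  • Easier audits and compliance, because ownership is explicit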

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                     SERVICE CATALOG                             │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ DATA INGESTION│     │   CORE API    │     │   WEB UI      │
│               │     │               │     │               │
│ - GitHub scan │     │ - GraphQL     │     │ - Search      │
│ - Manual YAML │     │ - REST        │     │ - Browse      │
│ - PagerDuty   │     │ - Webhooks    │     │ - Dependency  │
│ - Grafana     │     │               │     │   graph       │
└───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                    ┌─────────────────────┐
                    │     DATABASE        │
                    │   (PostgreSQL)      │
                    └─────────────────────┘

4.2 Key Components

  1. Ingestion Pipeline: Scrapes repos for catalog-info.yaml
  2. Core Database: PostgreSQL with service/team/dependency tables
  3. GraphQL API: Flexible queries for any client
  4. REST API: Simple queries for CLI/scripts
  5. Web UI: React/Vue search and browse interface
  6. Integrations: PagerDuty, Grafana, GitHub enrichment

4.3 Data Structures

-- PostgreSQL Schema

CREATE TABLE teams (
  id VARCHAR(50) PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  slack_channel VARCHAR(50),
  pagerduty_schedule VARCHAR(50),
  github_team VARCHAR(100),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE services (
  id VARCHAR(100) PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  description TEXT,
  owner_team_id VARCHAR(50) REFERENCES teams(id),
  lifecycle VARCHAR(20) NOT NULL, -- experimental, development, production, deprecated
  type VARCHAR(20) NOT NULL,      -- service, website, library, worker
  repo_url VARCHAR(255),
  docs_url VARCHAR(255),
  slo_dashboard_url VARCHAR(255),
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE dependencies (
  id SERIAL PRIMARY KEY,
  source_service_id VARCHAR(100) REFERENCES services(id),
  target_service_id VARCHAR(100) REFERENCES services(id),
  dependency_type VARCHAR(20), -- runtime, build, data
  UNIQUE(source_service_id, target_service_id)
);

CREATE TABLE service_metadata (
  service_id VARCHAR(100) REFERENCES services(id),
  key VARCHAR(50) NOT NULL,
  value TEXT,
  PRIMARY KEY (service_id, key)
);

# catalog-info.yaml (in each repo)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
spec:
  owner: team-checkout
  lifecycle: production
  type: service
  links:
    - type: docs
      url: https://wiki.internal/checkout
    - type: dashboard
      url: https://grafana.internal/d/checkout
  dependencies:
    - service:payment-gateway
    - service:inventory-service
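
A schema is only useful if it is enforced. The sketch below validates an already-parsed catalog-info.yaml document (a plain dict). The allowed lifecycle and type values come from the SQL comments above; which fields are required is an assumption, and validate_catalog_info is a hypothetical helper name:

```python
VALID_LIFECYCLES = {"experimental", "development", "production", "deprecated"}
VALID_TYPES = {"service", "website", "library", "worker"}


def validate_catalog_info(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    metadata = doc.get("metadata") or {}
    spec = doc.get("spec") or {}

    if not metadata.get("name"):
        errors.append("metadata.name is required")
    if not spec.get("owner"):
        errors.append("spec.owner is required")
    if spec.get("lifecycle") not in VALID_LIFECYCLES:
        errors.append(f"spec.lifecycle must be one of {sorted(VALID_LIFECYCLES)}")
    if spec.get("type") not in VALID_TYPES:
        errors.append(f"spec.type must be one of {sorted(VALID_TYPES)}")
    for dep in spec.get("dependencies") or []:
        if not isinstance(dep, str) or not dep.startswith("service:"):
            errors.append(f"dependency {dep!r} must look like 'service:<name>'")
    return errors


doc = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "checkout-service"},
    "spec": {
        "owner": "team-checkout",
        "lifecycle": "production",
        "type": "service",
        "dependencies": ["service:payment-gateway"],
    },
}
print(validate_catalog_info(doc))
```

Returning every problem at once, rather than raising on the first, gives a repo owner one complete list to fix.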

4.4 Algorithm Overview

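# Ingestion pipeline
# Sketch: `github`, `db`, `Service`, and the enrich_* functions are this
# project's own modules (see 5.2), not library APIs.

import yaml  # PyYAML


async def sync_catalog():
    # 1. Scan all repos for catalog-info.yaml
    repos = await github.list_org_repos("acme")
    ingested = []

    for repo in repos:
        try:
            content = await github.get_file(repo, "catalog-info.yaml")
        except FileNotFoundError:
            # No catalog-info.yaml in this repo
            continue

        doc = yaml.safe_load(content)
        metadata, spec = doc["metadata"], doc["spec"]

        # 2. Upsert service record (assumes the owning team row already
        #    exists, because owner_team_id references teams.id)
        service = Service(
            id=metadata["name"],
            name=metadata["name"],
            description=metadata.get("description"),
            owner_team_id=spec["owner"],
            lifecycle=spec["lifecycle"],
            type=spec["type"],
            repo_url=repo.url,
        )
        await db.upsert_service(service)
        ingested.append((service, metadata, spec))

    # Second pass: every service row now exists, so the foreign keys on
    # the dependencies table can be satisfied regardless of scan order.
    for service, metadata, spec in ingested:
        # 3. Sync dependencies ("service:payment-gateway" -> "payment-gateway")
        await db.clear_dependencies(service.id)
        for dep in spec.get("dependencies", []):
            await db.add_dependency(service.id, dep.removeprefix("service:"))

        # 4. Sync metadata (annotations)
        for key, value in metadata.get("annotations", {}).items():
            await db.set_metadata(service.id, key, value)

    # 5. Enrich with external data
    await enrich_pagerduty_data()
    await enrich_grafana_dashboards()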
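
The upsert in step 2 is what makes the sync safe to re-run. A minimal sketch of one way to write it, using the standard-library sqlite3 module as a stand-in for PostgreSQL (both accept this ON CONFLICT ... DO UPDATE form; SQLite needs version 3.24 or later). The table is a trimmed copy of the services schema, and upsert_service here is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE services (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        owner_team_id TEXT,
        lifecycle TEXT NOT NULL,
        type TEXT NOT NULL,
        repo_url TEXT
    )
""")

UPSERT_SQL = """
    INSERT INTO services (id, name, owner_team_id, lifecycle, type, repo_url)
    VALUES (:id, :name, :owner_team_id, :lifecycle, :type, :repo_url)
    ON CONFLICT (id) DO UPDATE SET
        name = excluded.name,
        owner_team_id = excluded.owner_team_id,
        lifecycle = excluded.lifecycle,
        type = excluded.type,
        repo_url = excluded.repo_url
"""


def upsert_service(conn: sqlite3.Connection, service: dict) -> None:
    """Insert the service, or overwrite its fields if the id already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT_SQL, service)


record = {
    "id": "checkout-service",
    "name": "checkout-service",
    "owner_team_id": "team-checkout",
    "lifecycle": "development",
    "type": "service",
    "repo_url": "https://github.com/acme/checkout-service",
}
upsert_service(conn, record)
upsert_service(conn, {**record, "lifecycle": "production"})  # second sync run

rows = conn.execute("SELECT id, lifecycle FROM services").fetchall()
print(rows)
```

Running the sync twice leaves one row per service, holding the latest values from the repo.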

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir service-catalog && cd service-catalog
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install fastapi uvicorn sqlalchemy asyncpg pyyaml httpx strawberry-graphql

# Database
docker run -d --name catalog-db -p 5432:5432 \
  -e POSTGRES_DB=catalog \
  -e POSTGRES_PASSWORD=secret \
  postgres:15

5.2 Project Structure

service-catalog/
├── src/
│   ├── __init__.py
│   ├── models.py           # SQLAlchemy models
│   ├── database.py         # DB connection
│   ├── api/
│   │   ├── rest.py         # FastAPI REST endpoints
│   │   └── graphql.py      # Strawberry GraphQL
│   ├── ingestion/
│   │   ├── github.py       # Scan repos
│   │   ├── pagerduty.py    # Enrich on-call
│   │   └── grafana.py      # Enrich dashboards
│   └── cli.py              # Command-line tool
├── web/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   │   ├── Search.tsx
│   │   │   └── ServiceDetail.tsx
│   │   └── api.ts
│   └── package.json
├── migrations/
│   └── 001_initial.sql
├── docker-compose.yaml
└── README.md

5.3 The Core Question You’re Answering

“How do we prevent people from building the same thing twice because they couldn’t find the existing version?”

In large organizations, discovery is a major source of waste. The Catalog is the interface for discovery.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. The C4 Model
    • How do you visualize software at different levels?
    • Reference: c4model.com
  2. GitOps for Metadata
    • Why store metadata in repo instead of separate UI?
    • Reference: Standard GitOps literature
  3. Service Mesh vs. Service Catalog
    • What’s the difference? (Runtime vs. design-time)
    • Reference: Backstage documentation

5.5 Questions to Guide Your Design

Before implementing, think through these:

Data Freshness

  • How often does catalog sync from sources?
  • What happens if a repo is deleted?
  • How do you handle renames?
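
One way to reason about deleted repos and renames is to compare the set of service IDs already in the catalog with the set found by the latest scan. A minimal sketch (plan_sync is a hypothetical helper; what to do with orphaned entries is a policy decision left open here):

```python
def plan_sync(catalog_ids: set[str], scanned_ids: set[str]) -> dict[str, set[str]]:
    """Classify service IDs by comparing the catalog with the latest scan."""
    return {
        "add": scanned_ids - catalog_ids,       # new catalog-info.yaml files
        "keep": scanned_ids & catalog_ids,      # still present, just update
        "orphaned": catalog_ids - scanned_ids,  # repo deleted, or service renamed
    }


catalog_ids = {"checkout-service", "legacy-cart", "user-service"}
scanned_ids = {"checkout-service", "user-service", "cart-service"}
plan = plan_sync(catalog_ids, scanned_ids)
print(sorted(plan["orphaned"]), sorted(plan["add"]))
```

A rename shows up as one orphaned ID plus one added ID, so flagging orphans for review is usually safer than deleting them automatically.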

Consistency

  • If catalog-info.yaml is wrong, who fixes it?
  • Is there validation before accepting?
  • How do you enforce required fields?

Consumers

  • Who uses the API? (Developers, incident responders, auditors)
  • What are their different needs?

5.6 Thinking Exercise

The “New Microservice” Flow

Trace the steps of creating a new service from scratch.

Questions:

  1. At what point does the catalog learn about the service?
  2. How does someone else discover it exists?
  3. How do they know if it’s production-ready?

Write down:

  • The moment a new service should appear in catalog
  • The minimum required metadata
  • How you’d enforce that metadata exists

5.7 Hints in Layers

Hint 1: Use Catalog-as-Code
Every repo should have a catalog-info.yaml. The source of truth is Git, not a database.

Hint 2: Build a Scraper
Write a script that clones all repos, reads catalog-info.yaml, and inserts the result into the database.

Hint 3: Expose via CLI First
Developers live in the terminal. A command such as “catalog info service-name” is more useful than a fancy UI initially.

Hint 4: Connect to On-Call
Integrate with the PagerDuty API. Show the real-time on-call person, not a static name.
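
For Hint 3, the command structure can be sketched with the standard-library argparse module. The in-memory SERVICES dict stands in for a call to the catalog API, and run is a hypothetical entry point that a real CLI would invoke with sys.argv[1:]:

```python
import argparse

# Stand-in for data that would come from the catalog API.
SERVICES = {
    "checkout-service": {"owner": "team-checkout", "lifecycle": "production", "type": "service"},
    "checkout-worker": {"owner": "team-checkout", "lifecycle": "production", "type": "worker"},
    "user-service": {"owner": "team-identity", "lifecycle": "production", "type": "service"},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="catalog")
    sub = parser.add_subparsers(dest="command", required=True)
    search = sub.add_parser("search", help="find services by name")
    search.add_argument("query")
    info = sub.add_parser("info", help="show details for one service")
    info.add_argument("name")
    return parser


def run(argv: list[str]) -> str:
    """Parse arguments and return the text the CLI would print."""
    args = build_parser().parse_args(argv)
    if args.command == "search":
        hits = [name for name in sorted(SERVICES) if args.query.lower() in name]
        return "\n".join(hits) or "no matches"
    service = SERVICES.get(args.name)
    if service is None:
        return f"unknown service: {args.name}"
    lines = []
    for key, value in service.items():
        label = key.capitalize() + ":"
        lines.append(f"{label:<12}{value}")
    return "\n".join(lines)


print(run(["search", "checkout"]))
```

Returning a string instead of printing inside run keeps the command logic easy to unit test.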

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Why is a Service Catalog essential for microservices?”
    • Discovery at scale, ownership clarity, dependency visibility
  2. “How do you ensure metadata stays accurate?”
    • GitOps (code review), validation, regular audits, integration tests
  3. “What are the core fields every service should have?”
    • Name, owner, lifecycle, repo, docs, on-call
  4. “How does a Service Catalog help with incident response?”
    • Find owner in seconds, see dependencies, get runbook link
  5. “What’s the difference between Service Catalog and CMDB?”
    • Catalog is for developers; CMDB is for IT operations. Catalog is code-centric.

5.9 Books That Will Help

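Topic               Book                                        Chapter
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Service Discovery   “Site Reliability Engineering” (Google)     Ch. 10: Practical Alerting from Time-Series Data
Metadata Design     “Modern Software Engineering”               Ch. 4
API Design          “Designing APIs with Swagger and OpenAPI”   All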

5.10 Implementation Phases

Phase 1: Schema & Data Model (5-8 hours)

  1. Define catalog-info.yaml schema
  2. Create PostgreSQL schema
  3. Build SQLAlchemy models

Phase 2: Ingestion Pipeline (8-10 hours)

  1. GitHub repo scanner
  2. YAML parser and validator
  3. Database upsert logic

Phase 3: REST API (5-8 hours)

  1. FastAPI endpoints
  2. Search with full-text
  3. Filter by team/lifecycle

Phase 4: GraphQL API (5-8 hours)

  1. Strawberry schema
  2. Resolvers for relationships
  3. Nested queries (service → owner → on-call)

Phase 5: Web UI (10-15 hours)

  1. React/Vue setup
  2. Search component
  3. Service detail page
  4. Dependency graph visualization
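
A low-effort starting point for the dependency graph is to emit Graphviz DOT text from the dependency edges and render it with any DOT-capable tool. A minimal sketch (to_dot and the sample edges are illustrative; names are assumed not to contain double quotes):

```python
def to_dot(edges: list[tuple[str, str]]) -> str:
    """Render (source, target) dependency edges as a Graphviz digraph."""
    lines = ["digraph catalog {", "  rankdir=LR;"]
    for source, target in sorted(set(edges)):  # dedupe and keep output stable
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("checkout-service", "payment-gateway"),
    ("checkout-service", "inventory-service"),
    ("web-frontend", "checkout-service"),
]
dot = to_dot(edges)
print(dot)
```

The same edge list can later feed an interactive graph component in the web UI.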

Phase 6: Integrations (8-10 hours)

  1. PagerDuty enrichment
  2. Grafana dashboard links
  3. Real-time on-call display

5.11 Key Implementation Decisions

Decision       Option A          Option B         Recommendation
────────────────────────────────────────────────────────────────────────────────────────────
Storage        PostgreSQL        Neo4j            PostgreSQL (simpler)
API            REST only         GraphQL + REST   GraphQL + REST
UI framework   React             Backstage        React if learning, Backstage for production
Sync           Push (webhooks)   Pull (cron)      Pull initially, add webhooks later
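
One reason a relational store can be enough is that transitive dependency questions can be answered with a recursive query. The sketch below uses the standard-library sqlite3 module as a stand-in for PostgreSQL (both support WITH RECURSIVE); the sample edges, including payment-gateway depending on user-service, are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependencies (source_service_id TEXT, target_service_id TEXT)"
)
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?)",
    [
        ("web-frontend", "checkout-service"),
        ("checkout-service", "payment-gateway"),
        ("checkout-service", "inventory-service"),
        ("payment-gateway", "user-service"),
    ],
)

# UNION (not UNION ALL) drops duplicates, which also stops the recursion
# from looping forever if the graph contains a cycle.
TRANSITIVE_SQL = """
WITH RECURSIVE reachable(service_id) AS (
    SELECT target_service_id FROM dependencies WHERE source_service_id = ?
    UNION
    SELECT d.target_service_id
    FROM dependencies d
    JOIN reachable r ON d.source_service_id = r.service_id
)
SELECT service_id FROM reachable ORDER BY service_id
"""


def transitive_dependencies(conn: sqlite3.Connection, service_id: str) -> list[str]:
    """Return every service reachable by following dependency edges."""
    return [row[0] for row in conn.execute(TRANSITIVE_SQL, (service_id,))]


print(transitive_dependencies(conn, "web-frontend"))
```

If queries like this become slow or awkward at larger scale, that is a signal to revisit the graph database option.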

6. Testing Strategy

Unit Tests

from textwrap import dedent


def test_parse_catalog_info():
    # dedent() strips the Python indentation so the YAML starts at column 0
    document = dedent("""
        apiVersion: v1
        kind: Service
        metadata:
          name: test-service
        spec:
          owner: team-test
    """)
    result = parse_catalog_info(document)
    assert result.metadata.name == "test-service"
    assert result.spec.owner == "team-test"


def test_service_search():
    # Setup test data; only the NOT NULL columns from the schema are set
    for name in ("checkout-service", "checkout-worker"):
        db.insert_service(
            Service(id=name, name=name, lifecycle="production", type="service")
        )

    results = api.search("checkout")
    assert len(results) == 2

Integration Tests

  • Scan a real GitHub organization
  • Verify all expected services are ingested
  • Test API returns correct relationships

E2E Tests

  • Search in UI, verify results
  • Click through to service detail
  • Verify dependency graph renders

7. Common Pitfalls & Debugging

Problem            Symptom               Root Cause             Fix
───────────────────────────────────────────────────────────────────────────────────────────
Missing services   Fewer than expected   No catalog-info.yaml   Require file in PR checks
Stale data         Wrong owner shown     Sync not running       Check cron job, add monitoring
Slow search        > 1s response         No index               Add PostgreSQL full-text index
Broken links       404s to dashboards    URLs changed           Add link validation to sync
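
Link validation during sync can be sketched as a function that takes the HTTP check as a parameter, so it can be tested without network access. find_broken_links and fake_fetch are hypothetical names; in a real sync, fetch_status would wrap an HTTP client call that returns the response status code:

```python
from typing import Callable
from urllib.parse import urlparse


def find_broken_links(links: dict[str, str],
                      fetch_status: Callable[[str], int]) -> dict[str, str]:
    """Return {label: reason} for every link that fails validation."""
    broken = {}
    for label, url in links.items():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            broken[label] = "not an absolute http(s) URL"
            continue  # no point requesting a malformed URL
        status = fetch_status(url)
        if status >= 400:
            broken[label] = f"HTTP {status}"
    return broken


def fake_fetch(url: str) -> int:
    """Test stub: pretend one dashboard has been moved."""
    return 404 if url.endswith("/old-dashboard") else 200


links = {
    "docs": "https://wiki.internal/checkout",
    "dashboard": "https://grafana.internal/d/old-dashboard",
    "repo": "github.com/acme/checkout-service",
}
print(find_broken_links(links, fake_fetch))
```

Surfacing the result on the service detail page, or notifying the owning team, turns silent link rot into a visible task.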

8. Extensions & Challenges

Extension 1: API Documentation

Embed OpenAPI specs, show request/response examples.

Extension 2: Cost Attribution

Show cloud costs per service (integrate with billing).

Extension 3: Tech Radar

Tag services with technology, show tech stack trends.

Extension 4: Deprecation Workflow

Mark services as deprecated, notify dependents.


9. Real-World Connections

Open Source:

  • Backstage - Spotify’s open-source developer portal
  • Cortex - Service ownership platform
  • OpsLevel - Service catalog SaaS

How Big Tech Does This:

  • Spotify: Built Backstage
  • Netflix: Internal service registry with ownership
  • Google: Extensive internal service directory

10. Resources

  • Backstage: backstage.io

  • C4 Model: c4model.com


11. Self-Assessment Checklist

Before considering this project complete, verify:

  • catalog-info.yaml schema is defined
  • Database has 20+ services ingested
  • Search works and returns in < 500ms
  • API has both REST and GraphQL endpoints
  • UI allows search and browse
  • Dependency graph is visible
  • At least one integration (PagerDuty or Grafana)
  • Someone outside your team has used the catalog

12. Submission / Completion Criteria

This project is complete when you have:

  1. Schema specification for catalog-info.yaml
  2. Ingestion pipeline scanning repos
  3. PostgreSQL database with services/teams/dependencies
  4. REST API with list, detail, search, and filter endpoints
  5. GraphQL API with nested queries
  6. Web UI with search and service detail
  7. CLI tool for quick lookups
  8. Documentation for adding new services
