Project 11: The Internal Service Catalog (Metadata Design)

Build a centralized “Source of Truth” for every service in the company with ownership, dependencies, documentation, and SLO links.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	1 Month (40-60 hours)
Primary Language	Python / SQL
Alternative Languages	Go, TypeScript
Prerequisites	API design, database basics
Key Topics	Service Discovery, Metadata Management, Developer Portal

1. Learning Objectives

By completing this project, you will:

Design a metadata schema for service information
Build a searchable catalog with API access
Integrate with existing sources of truth (repos, PagerDuty, Grafana)
Enable service discovery for developers
Create a “phone book” for the entire organization

2. Theoretical Foundation

2.1 Core Concepts

The Discovery Problem

WITHOUT CATALOG                     WITH CATALOG
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ "Where's the user service?" │    │ catalog search user         │
│ → Ask in Slack              │    │ → user-service (Identity)   │
│ → "I think Team X owns it"  │    │   Owner: @identity-team     │
│ → DM 3 people               │    │   Docs: wiki.internal/user  │
│ → Find outdated wiki page   │    │   On-call: PagerDuty link   │
│ → 45 minutes later...       │    │   2 seconds.                │
└─────────────────────────────┘    └─────────────────────────────┘

Catalog-Info as Code

# catalog-info.yaml (lives in every repo; Backstage entity format shown)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
    grafana/dashboard-url: https://grafana.internal/d/checkout
spec:
  type: service
  lifecycle: production
  owner: team-checkout
  dependsOn:
    - component:payment-gateway
    - component:inventory-service
  providesApis:
    - checkout-api
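
The dependsOn entries above are Backstage-style entity references of the form kind:name. Before storing them, an ingester has to split each reference into its parts. The following is a minimal sketch, assuming the general shape [kind:][namespace/]name with the namespace defaulting to "default"; EntityRef and parse_entity_ref are hypothetical names, not part of Backstage.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(ref: str, default_kind: str = "component") -> EntityRef:
    """Split '[kind:][namespace/]name' into its parts (hypothetical helper)."""
    kind, sep, rest = ref.partition(":")
    if not sep:
        # No "kind:" prefix, so the whole string is "[namespace/]name".
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:
        # No "namespace/" segment, so fall back to the assumed default.
        namespace, name = "default", rest
    if not name:
        raise ValueError(f"invalid entity reference: {ref!r}")
    return EntityRef(kind.lower(), namespace, name)


print(parse_entity_ref("component:payment-gateway"))
print(parse_entity_ref("inventory-service"))
```

Storing only the name part as the dependency target keeps the foreign key to the services table simple; the kind can be kept separately if the catalog later tracks APIs or libraries.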

The C4 Model for Visualization

┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 1: SYSTEM CONTEXT                                        │
│                                                                 │
│   ┌─────────┐         ┌─────────────────┐         ┌─────────┐  │
│   │ Customer│◄───────►│  E-Commerce     │◄───────►│ Payment │  │
│   │         │         │  Platform       │         │ Provider│  │
│   └─────────┘         └─────────────────┘         └─────────┘  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 2: CONTAINER (Zoom into Platform)                        │
│                                                                 │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐          │
│   │  Checkout   │──►│   Payment   │──►│  Inventory  │          │
│   │   Service   │   │   Gateway   │   │   Service   │          │
│   └─────────────┘   └─────────────┘   └─────────────┘          │
│          │                                    │                 │
│          └────────────────────────────────────┘                 │
│                          │                                      │
│                    ┌─────────────┐                             │
│                    │  Database   │                             │
│                    └─────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

2.2 Why This Matters

In large organizations:

Teams don’t know what other teams are building
Duplicate services get created
Incidents take longer (can’t find owner)
Onboarding is slow (where is everything?)

The Catalog solves:

“Who owns this service?”
“What does this service do?”
“What are its dependencies?”
“Where are the docs?”
“Who do I page if it’s down?”

2.3 Historical Context

CMDBs (1990s): Configuration Management Databases (often stale)
Service Registries (2010s): Consul, Eureka for runtime discovery
Developer Portals (2020s): Backstage, Cortex, OpsLevel

2.4 Common Misconceptions

Misconception	Reality
“We have a wiki”	Wikis go stale. Catalog-as-code stays current.
“Service mesh is enough”	Mesh is runtime. Catalog is human-readable metadata.
“It’s too much overhead”	The overhead of NOT having it is higher.
“Only big companies need this”	At 20+ services, discovery becomes a problem.

3. Project Specification

3.1 What You Will Build

Schema Definition: What metadata every service must have
Data Ingestion: Pull from repos, APIs, and manual input
API: Query services programmatically
UI: Search and browse interface
Integrations: Link to PagerDuty, Grafana, GitHub

3.2 Functional Requirements

Core Entity: Service
- Name, description, owner team
- Repository URL
- Documentation URL
- SLO dashboard link
- On-call schedule link
- Dependencies (what it depends on)
- Dependents (what depends on it)
Search
- Full-text search across all fields
- Filter by team, lifecycle, type
- Results include key metadata
API
- GET /services - List all
- GET /services/{id} - Get details
- GET /services?owner={team} - Filter by team
- GET /teams/{id}/services - All services owned by a team
UI
- Homepage with search
- Service detail page
- Team overview page
- Dependency graph visualization
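
The search requirements above (a text match across fields, plus filters by team, lifecycle, and type) can be prototyped in memory before any database work. This is a sketch only: the Service dataclass and search function are illustrative names, the payment-gateway description is made up for the example, and a substring match stands in for real full-text search.

```python
from dataclasses import asdict, dataclass


@dataclass
class Service:
    name: str
    description: str
    owner: str
    lifecycle: str
    type: str


def search(services, query, owner=None, lifecycle=None, service_type=None):
    """Case-insensitive substring match over all fields, then exact-match filters."""
    needle = query.lower()
    results = []
    for svc in services:
        if owner and svc.owner != owner:
            continue
        if lifecycle and svc.lifecycle != lifecycle:
            continue
        if service_type and svc.type != service_type:
            continue
        haystack = " ".join(asdict(svc).values()).lower()
        if needle in haystack:
            results.append(svc)
    return results


CATALOG = [
    Service("checkout-service", "Handles customer checkout flow", "team-checkout", "production", "service"),
    Service("checkout-frontend", "Checkout pages", "team-checkout", "production", "website"),
    Service("checkout-worker", "Background checkout jobs", "team-checkout", "production", "worker"),
    Service("payment-gateway", "Processes card payments", "team-payments", "production", "service"),
]

for svc in search(CATALOG, "checkout", service_type="worker"):
    print(svc.name, svc.owner, svc.lifecycle, svc.type)
```

The same function signature can later be backed by a database query, so the CLI and API code written against it does not need to change.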

3.3 Non-Functional Requirements

Support 500+ services
Search returns in < 500ms
Sync from sources at least hourly
Data must be version-controlled (YAML in repos)

3.4 Example Usage / Output

CLI Query:

$ catalog search checkout

╔══════════════════════════════════════════════════════════════════╗
║                    SERVICE CATALOG SEARCH                        ║
║                    Query: "checkout"                             ║
╠══════════════════════════════════════════════════════════════════╣

NAME                 OWNER           LIFECYCLE    TYPE
─────────────────────────────────────────────────────────────────
checkout-service     team-checkout   production   service
checkout-frontend    team-checkout   production   website
checkout-worker      team-checkout   production   worker

╚══════════════════════════════════════════════════════════════════╝

$ catalog info checkout-service

╔══════════════════════════════════════════════════════════════════╗
║                  checkout-service                                ║
╠══════════════════════════════════════════════════════════════════╣

Description:    Handles customer checkout flow including cart
                validation, price calculation, and order creation.

Owner:          team-checkout (#team-checkout)
Lifecycle:      production
Type:           service

Repository:     github.com/acme/checkout-service
Documentation:  wiki.internal/checkout-service
SLO Dashboard:  grafana.internal/d/checkout-slo

On-Call:        @checkout-oncall (PagerDuty)
                https://pagerduty.com/services/PXXXXXX

─────────────────────────────────────────────────────────────────
DEPENDENCIES (3)
─────────────────────────────────────────────────────────────────
→ payment-gateway      (team-payments)       production
→ inventory-service    (team-inventory)      production
→ user-service         (team-identity)       production

─────────────────────────────────────────────────────────────────
DEPENDENTS (2)
─────────────────────────────────────────────────────────────────
← mobile-app           (team-mobile)         production
← web-frontend         (team-web)            production

─────────────────────────────────────────────────────────────────
APIS PROVIDED
─────────────────────────────────────────────────────────────────
• checkout-api (OpenAPI spec: /docs/openapi.yaml)

╚══════════════════════════════════════════════════════════════════╝

GraphQL API:

query GetService {
  service(name: "checkout-service") {
    name
    description
    owner {
      name
      slackChannel
    }
    dependencies {
      name
      owner { name }
    }
    onCall {
      currentPrimary
      pagerDutyUrl
    }
    slo {
      availability
      latencyP99
    }
  }
}

Response:

{
  "data": {
    "service": {
      "name": "checkout-service",
      "description": "Handles customer checkout flow",
      "owner": {
        "name": "team-checkout",
        "slackChannel": "#team-checkout"
      },
      "dependencies": [
        {
          "name": "payment-gateway",
          "owner": { "name": "team-payments" }
        }
      ],
      "onCall": {
        "currentPrimary": "@jane-doe",
        "pagerDutyUrl": "https://pagerduty.com/services/PXXXXXX"
      },
      "slo": {
        "availability": 99.9,
        "latencyP99": 200
      }
    }
  }
}
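
A client reaches a GraphQL endpoint like the one above by sending a JSON body containing the query and its variables in an HTTP POST. The sketch below builds such a request with the standard library and unwraps the response envelope. The endpoint URL is hypothetical, and passing the service name as a $name variable assumes the schema accepts a string argument there.

```python
import json
import urllib.request

GET_SERVICE = """
query GetService($name: String!) {
  service(name: $name) {
    name
    owner { name slackChannel }
    dependencies { name }
  }
}
"""


def build_graphql_request(url: str, query: str, variables: dict) -> urllib.request.Request:
    """Package a query and its variables as a JSON POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unwrap(response_body: bytes) -> dict:
    """Return the data payload, or raise if the server reported errors."""
    payload = json.loads(response_body)
    if payload.get("errors"):
        raise RuntimeError(payload["errors"])
    return payload["data"]


# Hypothetical internal URL; nothing is sent until urlopen(req) is called.
req = build_graphql_request(
    "https://catalog.internal/graphql", GET_SERVICE, {"name": "checkout-service"}
)
print(req.get_method(), req.full_url)
```

Checking the errors key matters because GraphQL servers commonly return HTTP 200 even when a resolver fails.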

3.5 Real World Outcome

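After building the catalog, expect outcomes along these lines (the figures are illustrative targets, not measurements):

Developer onboarding shortened (for example, from 2 weeks to 2 days)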
Incident MTTA reduced (find owner in seconds)
Duplicate service creation prevented
Audit/compliance easier (clear ownership)

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                     SERVICE CATALOG                             │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ DATA INGESTION│     │   CORE API    │     │   WEB UI      │
│               │     │               │     │               │
│ - GitHub scan │     │ - GraphQL     │     │ - Search      │
│ - Manual YAML │     │ - REST        │     │ - Browse      │
│ - PagerDuty   │     │ - Webhooks    │     │ - Dependency  │
│ - Grafana     │     │               │     │   graph       │
└───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                    ┌─────────────────────┐
                    │     DATABASE        │
                    │   (PostgreSQL)      │
                    └─────────────────────┘

4.2 Key Components

Ingestion Pipeline: Scrapes repos for catalog-info.yaml
Core Database: PostgreSQL with service/team/dependency tables
GraphQL API: Flexible queries for any client
REST API: Simple queries for CLI/scripts
Web UI: React/Vue search and browse interface
Integrations: PagerDuty, Grafana, GitHub enrichment

4.3 Data Structures

-- PostgreSQL Schema

CREATE TABLE teams (
  id VARCHAR(50) PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  slack_channel VARCHAR(50),
  pagerduty_schedule VARCHAR(50),
  github_team VARCHAR(100),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE services (
  id VARCHAR(100) PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  description TEXT,
  owner_team_id VARCHAR(50) REFERENCES teams(id),
  lifecycle VARCHAR(20) NOT NULL, -- experimental, development, production, deprecated
  type VARCHAR(20) NOT NULL,      -- service, website, library, worker
  repo_url VARCHAR(255),
  docs_url VARCHAR(255),
  slo_dashboard_url VARCHAR(255),
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE dependencies (
  id SERIAL PRIMARY KEY,
  source_service_id VARCHAR(100) NOT NULL REFERENCES services(id) ON DELETE CASCADE,
  target_service_id VARCHAR(100) NOT NULL REFERENCES services(id) ON DELETE CASCADE,
  dependency_type VARCHAR(20), -- runtime, build, data
  UNIQUE(source_service_id, target_service_id),
  CHECK (source_service_id <> target_service_id) -- a service cannot depend on itself
);

CREATE TABLE service_metadata (
  service_id VARCHAR(100) REFERENCES services(id) ON DELETE CASCADE,
  key VARCHAR(50) NOT NULL,
  value TEXT,
  PRIMARY KEY (service_id, key)
);
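
Both the DEPENDENCIES and DEPENDENTS sections of the service detail view come from the same dependencies table, read in opposite directions. The sketch below shows the two joins using SQLite from the standard library as a stand-in so that it runs without a server; the tables are cut down to the columns the queries need, and the same joins should carry over to the PostgreSQL schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services (id TEXT PRIMARY KEY, owner_team_id TEXT);
CREATE TABLE dependencies (
  source_service_id TEXT REFERENCES services(id),
  target_service_id TEXT REFERENCES services(id),
  UNIQUE (source_service_id, target_service_id)
);
INSERT INTO services VALUES
  ('checkout-service', 'team-checkout'),
  ('payment-gateway', 'team-payments'),
  ('inventory-service', 'team-inventory'),
  ('mobile-app', 'team-mobile');
INSERT INTO dependencies VALUES
  ('checkout-service', 'payment-gateway'),
  ('checkout-service', 'inventory-service'),
  ('mobile-app', 'checkout-service');
""")


def dependencies_of(service_id):
    """Services that service_id depends on (follow source -> target)."""
    return conn.execute(
        "SELECT s.id, s.owner_team_id FROM dependencies d "
        "JOIN services s ON s.id = d.target_service_id "
        "WHERE d.source_service_id = ? ORDER BY s.id",
        (service_id,),
    ).fetchall()


def dependents_of(service_id):
    """Services that depend on service_id (follow target -> source)."""
    return conn.execute(
        "SELECT s.id, s.owner_team_id FROM dependencies d "
        "JOIN services s ON s.id = d.source_service_id "
        "WHERE d.target_service_id = ? ORDER BY s.id",
        (service_id,),
    ).fetchall()


print(dependencies_of("checkout-service"))
print(dependents_of("checkout-service"))
```

Because dependents are derived by query, only the "depends on" direction ever needs to be declared in a repo's catalog-info.yaml.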

# catalog-info.yaml (in each repo; simplified schema used by this project)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
spec:
  owner: team-checkout
  lifecycle: production
  type: service
  links:
    - type: docs
      url: https://wiki.internal/checkout
    - type: dashboard
      url: https://grafana.internal/d/checkout
  dependencies:
    - service:payment-gateway
    - service:inventory-service
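
A file in this format is only useful if required fields are actually present, so the ingester should validate each document before accepting it. The sketch below checks an already-parsed document (the dict a YAML loader would return). Which fields count as required is an assumption based on the NOT NULL columns in the schema; the allowed lifecycle and type values come from the schema comments, and validate_catalog_info is a hypothetical name.

```python
REQUIRED_SPEC_FIELDS = ("owner", "lifecycle", "type")
LIFECYCLES = {"experimental", "development", "production", "deprecated"}
TYPES = {"service", "website", "library", "worker"}


def validate_catalog_info(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    metadata = doc.get("metadata") or {}
    spec = doc.get("spec") or {}

    if not metadata.get("name"):
        errors.append("metadata.name is required")
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            errors.append(f"spec.{field} is required")

    lifecycle = spec.get("lifecycle")
    if lifecycle and lifecycle not in LIFECYCLES:
        errors.append(f"spec.lifecycle must be one of {sorted(LIFECYCLES)}")
    service_type = spec.get("type")
    if service_type and service_type not in TYPES:
        errors.append(f"spec.type must be one of {sorted(TYPES)}")

    for index, dep in enumerate(spec.get("dependencies") or []):
        if not isinstance(dep, str) or not dep.startswith("service:"):
            errors.append(f"spec.dependencies[{index}] must look like 'service:<name>'")
    return errors


doc = {
    "metadata": {"name": "checkout-service"},
    "spec": {"owner": "team-checkout", "lifecycle": "prod", "type": "service"},
}
print(validate_catalog_info(doc))
```

Returning every problem at once, rather than failing on the first, suits a PR check: the author can fix the whole file in one pass.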

4.4 Algorithm Overview

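# Ingestion pipeline (sketch). github, db, and Service are this project's own
# modules (src/ingestion/github.py, src/database.py, src/models.py).

import yaml


def strip_kind(ref):
    # "service:payment-gateway" -> "payment-gateway"
    return ref.split(":", 1)[-1]


async def sync_catalog():
    # 1. Scan all repos for catalog-info.yaml
    repos = await github.list_org_repos("acme")
    parsed = []

    for repo in repos:
        try:
            content = await github.get_file(repo, "catalog-info.yaml")
        except FileNotFoundError:
            # No catalog-info.yaml in this repo
            continue

        doc = yaml.safe_load(content)
        metadata, spec = doc["metadata"], doc["spec"]

        # 2. Upsert service record. The owning team goes first because
        #    services.owner_team_id references teams(id).
        #    (db.upsert_team is a hypothetical helper.)
        await db.upsert_team(spec["owner"])
        service = Service(
            id=metadata["name"],
            name=metadata["name"],
            description=metadata.get("description"),
            owner_team_id=spec["owner"],
            lifecycle=spec["lifecycle"],
            type=spec["type"],
            repo_url=repo.url,
        )
        await db.upsert_service(service)
        parsed.append((service.id, metadata, spec))

    # Second pass: every service row now exists, so dependency foreign keys
    # resolve regardless of the order in which repos were scanned.
    for service_id, metadata, spec in parsed:
        # 3. Sync dependencies
        await db.clear_dependencies(service_id)
        for dep in spec.get("dependencies", []):
            await db.add_dependency(service_id, strip_kind(dep))

        # 4. Sync metadata (annotations)
        for key, value in metadata.get("annotations", {}).items():
            await db.set_metadata(service_id, key, value)

    # 5. Enrich with external data
    await enrich_pagerduty_data()
    await enrich_grafana_dashboards()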

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir service-catalog && cd service-catalog
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install fastapi uvicorn sqlalchemy asyncpg pyyaml httpx strawberry-graphql

# Database
docker run -d --name catalog-db -p 5432:5432 \
  -e POSTGRES_DB=catalog \
  -e POSTGRES_PASSWORD=secret \
  postgres:15

5.2 Project Structure

service-catalog/
├── src/
│   ├── __init__.py
│   ├── models.py           # SQLAlchemy models
│   ├── database.py         # DB connection
│   ├── api/
│   │   ├── rest.py         # FastAPI REST endpoints
│   │   └── graphql.py      # Strawberry GraphQL
│   ├── ingestion/
│   │   ├── github.py       # Scan repos
│   │   ├── pagerduty.py    # Enrich on-call
│   │   └── grafana.py      # Enrich dashboards
│   └── cli.py              # Command-line tool
├── web/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   │   ├── Search.tsx
│   │   │   └── ServiceDetail.tsx
│   │   └── api.ts
│   └── package.json
├── migrations/
│   └── 001_initial.sql
├── docker-compose.yaml
└── README.md

5.3 The Core Question You’re Answering

“How do we prevent people from building the same thing twice because they couldn’t find the existing version?”

In large organizations, discovery is a major source of waste. The Catalog is the interface for discovery.

5.4 Concepts You Must Understand First

Stop and research these before coding:

The C4 Model
- How do you visualize software at different levels?
- Reference: c4model.com
GitOps for Metadata
- Why store metadata in the repo instead of a separate UI?
- Reference: Standard GitOps literature
Service Mesh vs. Service Catalog
- What’s the difference? (Runtime vs. design-time)
- Reference: Backstage documentation

5.5 Questions to Guide Your Design

Before implementing, think through these:

Data Freshness

How often does the catalog sync from its sources?
What happens if a repo is deleted?
How do you handle renames?
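
One possible answer to the deleted-repo question is a grace period: a service that disappears from a scan is not removed immediately, only after it has been missing for several consecutive syncs. The sketch below is an assumption about how that could work, not a prescribed design; reconcile and max_misses are hypothetical names. A rename shows up here as one missing service plus one new one, so it still needs separate handling.

```python
def reconcile(
    scanned_ids: set[str],
    stored_misses: dict[str, int],
    max_misses: int = 3,
) -> tuple[dict[str, int], list[str]]:
    """Decide which stored services to keep and which to delete.

    stored_misses maps each stored service id to the number of consecutive
    syncs in which its catalog-info.yaml was not found.
    """
    # Everything seen in this scan is current, so its miss counter resets.
    next_misses = {service_id: 0 for service_id in scanned_ids}
    to_delete = []
    for service_id, misses in stored_misses.items():
        if service_id in scanned_ids:
            continue
        if misses + 1 >= max_misses:
            to_delete.append(service_id)
        else:
            next_misses[service_id] = misses + 1
    return next_misses, sorted(to_delete)


misses, deleted = reconcile(
    scanned_ids={"checkout-service", "payment-gateway"},
    stored_misses={"checkout-service": 0, "legacy-cart": 2, "inventory-service": 0},
)
print(misses)
print(deleted)
```

The grace period protects against a transient GitHub API failure wiping out services that still exist.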

Consistency

If catalog-info.yaml is wrong, who fixes it?
Is there validation before accepting?
How do you enforce required fields?

Consumers

Who uses the API? (Developers, incident responders, auditors)
What are their different needs?

5.6 Thinking Exercise

The “New Microservice” Flow

Trace the steps of creating a new service from scratch.

Questions:

At what point does the catalog learn about the service?
How does someone else discover it exists?
How do they know if it’s production-ready?

Write down:

The moment a new service should appear in the catalog
The minimum required metadata
How you’d enforce that metadata exists

5.7 Hints in Layers

Hint 1: Use Catalog-as-Code

Every repo should have a catalog-info.yaml. The source of truth is Git, not a database.

Hint 2: Build a Scraper

Write a script that clones all repos, reads each catalog-info.yaml, and inserts the result into the database.

Hint 3: Expose via CLI First

Developers live in the terminal. At first, running catalog info checkout-service is more useful than a fancy UI.

Hint 4: Connect to On-Call

Integrate with the PagerDuty API. Show who is on call right now, not a static name.
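
For the CLI-first hint, the two commands shown in the example output (catalog search and catalog info) map naturally onto subcommands. This sketch uses argparse from the standard library and stops at computing the REST path to call. The q query parameter and the to_request_path helper are assumptions; only the owner filter and the /services/{id} route appear in the API requirements.

```python
import argparse
from urllib.parse import urlencode


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="catalog", description="Query the service catalog")
    sub = parser.add_subparsers(dest="command", required=True)

    search = sub.add_parser("search", help="search services by text")
    search.add_argument("query")
    search.add_argument("--owner", help="filter by owning team")
    search.add_argument(
        "--lifecycle",
        choices=["experimental", "development", "production", "deprecated"],
    )

    info = sub.add_parser("info", help="show details for one service")
    info.add_argument("name")
    return parser


def to_request_path(args: argparse.Namespace) -> str:
    """Map parsed arguments onto a REST path (hypothetical helper)."""
    if args.command == "info":
        return f"/services/{args.name}"
    params = {"q": args.query, "owner": args.owner, "lifecycle": args.lifecycle}
    # Drop filters the user did not supply.
    return "/services?" + urlencode({k: v for k, v in params.items() if v})


args = build_parser().parse_args(["search", "checkout", "--owner", "team-checkout"])
print(to_request_path(args))
```

Keeping argument parsing separate from the HTTP call makes the CLI easy to unit test without a running API.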

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

“Why is a Service Catalog essential for microservices?”
- Discovery at scale, ownership clarity, dependency visibility
“How do you ensure metadata stays accurate?”
- GitOps (code review), validation, regular audits, integration tests
“What are the core fields every service should have?”
- Name, owner, lifecycle, repo, docs, on-call
“How does a Service Catalog help with incident response?”
- Find owner in seconds, see dependencies, get runbook link
“What’s the difference between Service Catalog and CMDB?”
- Catalog is for developers; CMDB is for IT operations. Catalog is code-centric.

5.9 Books That Will Help

Topic	Book	Chapter
Service Discovery	“Site Reliability Engineering” (Google)	Ch. 10: Practical Alerting from Time-Series Data
Metadata Design	“Modern Software Engineering”	Ch. 4
API Design	“Designing APIs with Swagger and OpenAPI”	All

5.10 Implementation Phases

Phase 1: Schema & Data Model (5-8 hours)

Define catalog-info.yaml schema
Create PostgreSQL schema
Build SQLAlchemy models

Phase 2: Ingestion Pipeline (8-10 hours)

GitHub repo scanner
YAML parser and validator
Database upsert logic

Phase 3: REST API (5-8 hours)

FastAPI endpoints
Full-text search
Filter by team/lifecycle

Phase 4: GraphQL API (5-8 hours)

Strawberry schema
Resolvers for relationships
Nested queries (service → owner → on-call)

Phase 5: Web UI (10-15 hours)

React/Vue setup
Search component
Service detail page
Dependency graph visualization

Phase 6: Integrations (8-10 hours)

PagerDuty enrichment
Grafana dashboard links
Real-time on-call display

5.11 Key Implementation Decisions

Decision	Option A	Option B	Recommendation
Storage	PostgreSQL	Neo4j	PostgreSQL (simpler)
API	REST only	GraphQL + REST	Both
UI framework	React	Backstage	React if learning, Backstage for production
Sync	Push (webhooks)	Pull (cron)	Pull initially, add webhooks later
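
One reason a relational store can be enough for the storage decision above is that transitive graph questions ("what breaks if payment-gateway goes down?") can be answered with a recursive common table expression. The sketch below runs against SQLite so that it needs no server; PostgreSQL also supports WITH RECURSIVE, though placeholder syntax differs by driver. The table is reduced to the two columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dependencies (source_service_id TEXT, target_service_id TEXT);
INSERT INTO dependencies VALUES
  ('web-frontend', 'checkout-service'),
  ('mobile-app', 'checkout-service'),
  ('checkout-service', 'payment-gateway'),
  ('checkout-service', 'inventory-service');
""")

TRANSITIVE_DEPENDENTS = """
WITH RECURSIVE impacted(id) AS (
  -- Direct dependents of the starting service
  SELECT source_service_id FROM dependencies WHERE target_service_id = ?
  UNION
  -- Plus anything that depends on an already-impacted service.
  -- UNION (not UNION ALL) drops repeats, so cycles cannot loop forever.
  SELECT d.source_service_id
  FROM dependencies d JOIN impacted i ON d.target_service_id = i.id
)
SELECT id FROM impacted ORDER BY id
"""


def transitive_dependents(service_id):
    """Every service that directly or indirectly depends on service_id."""
    return [row[0] for row in conn.execute(TRANSITIVE_DEPENDENTS, (service_id,))]


print(transitive_dependents("payment-gateway"))
```

If queries like this become the dominant workload, or the graph grows deep, that is the point at which a graph database is worth revisiting.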

6. Testing Strategy

Unit Tests

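import textwrap

# parse_catalog_info, Service, db, and api come from the project under test.


def make_service(name):
    # Fills the columns the schema marks NOT NULL; the values are test fixtures.
    return Service(id=name, name=name, lifecycle="production", type="service")


def test_parse_catalog_info():
    doc = textwrap.dedent("""
        apiVersion: v1
        kind: Service
        metadata:
          name: test-service
        spec:
          owner: team-test
    """)
    result = parse_catalog_info(doc)
    assert result.metadata.name == "test-service"
    assert result.spec.owner == "team-test"


def test_service_search():
    # Setup test data
    db.insert_service(make_service("checkout-service"))
    db.insert_service(make_service("checkout-worker"))

    results = api.search("checkout")
    assert len(results) == 2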

Integration Tests

Scan a real GitHub organization
Verify all expected services are ingested
Test API returns correct relationships

E2E Tests

Search in UI, verify results
Click through to service detail
Verify dependency graph renders

7. Common Pitfalls & Debugging

Problem	Symptom	Root Cause	Fix
Missing services	Fewer than expected	No catalog-info.yaml	Require file in PR checks
Stale data	Wrong owner shown	Sync not running	Check cron job, add monitoring
Slow search	> 1s response	No index	Add PostgreSQL full-text index
Broken links	404s to dashboards	URLs changed	Add link validation to sync
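
The link-validation fix in the last row can start with a cheap offline check before any HTTP requests are made: reject URLs that have no scheme or host. The sketch below uses only the standard library, and find_malformed_links is a hypothetical name. A fuller sync step would follow this with a real request to each surviving URL, which is not shown here.

```python
from urllib.parse import urlparse


def find_malformed_links(links: dict[str, str]) -> dict[str, str]:
    """Return {label: problem} for links that cannot possibly resolve."""
    problems = {}
    for label, url in links.items():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems[label] = "missing or unsupported scheme"
        elif not parsed.netloc:
            problems[label] = "missing host"
    return problems


links = {
    "docs": "https://wiki.internal/checkout",
    "dashboard": "https://grafana.internal/d/checkout",
    "repo": "github.com/acme/checkout-service",
}
print(find_malformed_links(links))
```

Note that a bare host path such as github.com/acme/checkout-service is flagged, so stored URLs should always include the scheme.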

8. Extensions & Challenges

Extension 1: API Documentation

Embed OpenAPI specs, show request/response examples.

Extension 2: Cost Attribution

Show cloud costs per service (integrate with billing).

Extension 3: Tech Radar

Tag services with technology, show tech stack trends.

Extension 4: Deprecation Workflow

Mark services as deprecated, notify dependents.

9. Real-World Connections

Existing Tools:

Backstage - Spotify’s open-source developer portal
Cortex - Service ownership platform (commercial)
OpsLevel - Service catalog SaaS (commercial)

How Big Tech Does This:

Spotify: Built Backstage
Netflix: Internal service registry with ownership
Google: Extensive internal service directory

10. Resources

Backstage

backstage.io

C4 Model

c4model.com

Related Projects

P03: Ownership Mapper - Define owners
P08: Dependency Visualizer - Dependency graphs

11. Self-Assessment Checklist

Before considering this project complete, verify:

catalog-info.yaml schema is defined
Database has 20+ services ingested
Search works and returns in < 500ms
API has both REST and GraphQL endpoints
UI allows search and browse
Dependency graph is visible
At least one integration (PagerDuty or Grafana)
Someone outside your team has used the catalog

12. Submission / Completion Criteria

This project is complete when you have:

Schema specification for catalog-info.yaml
Ingestion pipeline scanning repos
PostgreSQL database with services/teams/dependencies
REST API with list, detail, search, and filter endpoints
GraphQL API with nested queries
Web UI with search and service detail
CLI tool for quick lookups
Documentation for adding new services

Previous Project: P10: Incident Response Battle Cards
Next Project: P12: Cost of Coordination Calculator