Project 11: The Internal Service Catalog (Metadata Design)
Build a centralized “Source of Truth” for every service in the company with ownership, dependencies, documentation, and SLO links.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 Month (40-60 hours) |
| Primary Language | Python / SQL |
| Alternative Languages | Go, TypeScript |
| Prerequisites | API design, database basics |
| Key Topics | Service Discovery, Metadata Management, Developer Portal |
1. Learning Objectives
By completing this project, you will:
- Design a metadata schema for service information
- Build a searchable catalog with API access
- Integrate with source-of-truth systems (repos, PagerDuty, Grafana)
- Enable service discovery for developers
- Create a “phone book” for the entire organization
2. Theoretical Foundation
2.1 Core Concepts
The Discovery Problem
WITHOUT CATALOG WITH CATALOG
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ "Where's the user service?" │ │ catalog search user │
│ → Ask in Slack │ │ → user-service (Identity) │
│ → "I think Team X owns it" │ │ Owner: @identity-team │
│ → DM 3 people │ │ Docs: wiki.internal/user │
│ → Find outdated wiki page │ │ On-call: PagerDuty link │
│ → 45 minutes later... │ │ 2 seconds. │
└─────────────────────────────┘ └─────────────────────────────┘
Catalog-Info as Code
# catalog-info.yaml (lives in every repo)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
    grafana/dashboard-url: https://grafana.internal/d/checkout
spec:
  type: service
  lifecycle: production
  owner: team-checkout
  dependsOn:
    - component:payment-gateway
    - component:inventory-service
  providesApis:
    - checkout-api
The C4 Model for Visualization
┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 1: SYSTEM CONTEXT │
│ │
│ ┌─────────┐ ┌─────────────────┐ ┌─────────┐ │
│ │ Customer│◄───────►│ E-Commerce │◄───────►│ Payment │ │
│ │ │ │ Platform │ │ Provider│ │
│ └─────────┘ └─────────────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ LEVEL 2: CONTAINER (Zoom into Platform) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Checkout │──►│ Payment │──►│ Inventory │ │
│ │ Service │ │ Gateway │ │ Service │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ┌─────────────┐ │
│ │ Database │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
2.2 Why This Matters
In large organizations:
- Teams don’t know what other teams are building
- Duplicate services get created
- Incidents take longer (can’t find owner)
- Onboarding is slow (where is everything?)
The Catalog solves:
- “Who owns this service?”
- “What does this service do?”
- “What are its dependencies?”
- “Where are the docs?”
- “Who do I page if it’s down?”
2.3 Historical Context
- CMDBs (1990s): Configuration Management Databases (often stale)
- Service Registries (2010s): Consul, Eureka for runtime discovery
- Developer Portals (2020s): Backstage, Cortex, OpsLevel
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “We have a wiki” | Wikis go stale. Catalog-as-code stays current. |
| “Service mesh is enough” | Mesh is runtime. Catalog is human-readable metadata. |
| “It’s too much overhead” | The overhead of NOT having it is higher. |
| “Only big companies need this” | At 20+ services, discovery becomes a problem. |
3. Project Specification
3.1 What You Will Build
- Schema Definition: What metadata every service must have
- Data Ingestion: Pull from repos, APIs, and manual input
- API: Query services programmatically
- UI: Search and browse interface
- Integrations: Link to PagerDuty, Grafana, GitHub
3.2 Functional Requirements
- Core Entity: Service
- Name, description, owner team
- Repository URL
- Documentation URL
- SLO dashboard link
- On-call schedule link
- Dependencies (what it depends on)
- Dependents (what depends on it)
- Search
- Full-text search across all fields
- Filter by team, lifecycle, type
- Results include key metadata
- API
- GET /services - List all
- GET /services/{id} - Get details
- GET /services?owner={team} - Filter by team
- GET /teams/{id}/services - All of a team’s services
- UI
- Homepage with search
- Service detail page
- Team overview page
- Dependency graph visualization
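The search requirements above (text match across fields plus filters on team, lifecycle, and type) can be prototyped in memory before any database exists. The following is a minimal sketch, not the final implementation: the `Service` dataclass, the `search` function, its `service_type` parameter, and the sample descriptions are all hypothetical names and data chosen for illustration, and a plain substring match stands in for real full-text search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    description: str
    owner: str
    lifecycle: str
    type: str


def search(services, query="", *, owner=None, lifecycle=None, service_type=None):
    """Case-insensitive substring search with optional exact-match filters."""
    needle = query.strip().lower()
    matches = []
    for svc in services:
        # Search across the text fields a user is likely to remember.
        text = " ".join((svc.name, svc.description, svc.owner)).lower()
        if needle and needle not in text:
            continue
        if owner is not None and svc.owner != owner:
            continue
        if lifecycle is not None and svc.lifecycle != lifecycle:
            continue
        if service_type is not None and svc.type != service_type:
            continue
        matches.append(svc)
    return sorted(matches, key=lambda svc: svc.name)


# Sample data (illustrative only).
SERVICES = [
    Service("checkout-service", "Handles customer checkout flow",
            "team-checkout", "production", "service"),
    Service("checkout-frontend", "Checkout web pages",
            "team-checkout", "production", "website"),
    Service("checkout-worker", "Background checkout jobs",
            "team-checkout", "production", "worker"),
    Service("payment-gateway", "Processes card payments",
            "team-payments", "production", "service"),
]
```

Calling `search(SERVICES, "checkout", service_type="worker")` narrows the three checkout matches down to `checkout-worker`. Once the data moves to PostgreSQL, the same interface can be kept while the body is swapped for an indexed query.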
3.3 Non-Functional Requirements
- Support 500+ services
- Search returns in < 500ms
- Sync from sources at least hourly
- Data must be version-controlled (YAML in repos)
3.4 Example Usage / Output
CLI Query:
$ catalog search checkout
╔══════════════════════════════════════════════════════════════════╗
║ SERVICE CATALOG SEARCH ║
║ Query: "checkout" ║
╠══════════════════════════════════════════════════════════════════╣
NAME OWNER LIFECYCLE TYPE
─────────────────────────────────────────────────────────────────
checkout-service team-checkout production service
checkout-frontend team-checkout production website
checkout-worker team-checkout production worker
══════════════════════════════════════════════════════════════════
$ catalog info checkout-service
╔══════════════════════════════════════════════════════════════════╗
║ checkout-service ║
╠══════════════════════════════════════════════════════════════════╣
Description: Handles customer checkout flow including cart
validation, price calculation, and order creation.
Owner: team-checkout (#team-checkout)
Lifecycle: production
Type: service
Repository: github.com/acme/checkout-service
Documentation: wiki.internal/checkout-service
SLO Dashboard: grafana.internal/d/checkout-slo
On-Call: @checkout-oncall (PagerDuty)
https://pagerduty.com/services/PXXXXXX
─────────────────────────────────────────────────────────────────
DEPENDENCIES (3)
─────────────────────────────────────────────────────────────────
→ payment-gateway (team-payments) production
→ inventory-service (team-inventory) production
→ user-service (team-identity) production
─────────────────────────────────────────────────────────────────
DEPENDENTS (2)
─────────────────────────────────────────────────────────────────
← mobile-app (team-mobile) production
← web-frontend (team-web) production
─────────────────────────────────────────────────────────────────
APIS PROVIDED
─────────────────────────────────────────────────────────────────
• checkout-api (OpenAPI spec: /docs/openapi.yaml)
╚══════════════════════════════════════════════════════════════════╝
GraphQL API:
query GetService {
  service(name: "checkout-service") {
    name
    description
    owner {
      name
      slackChannel
    }
    dependencies {
      name
      owner { name }
    }
    onCall {
      currentPrimary
      pagerDutyUrl
    }
    slo {
      availability
      latencyP99
    }
  }
}
Response:
{
  "data": {
    "service": {
      "name": "checkout-service",
      "description": "Handles customer checkout flow",
      "owner": {
        "name": "team-checkout",
        "slackChannel": "#team-checkout"
      },
      "dependencies": [
        {
          "name": "payment-gateway",
          "owner": { "name": "team-payments" }
        }
      ],
      "onCall": {
        "currentPrimary": "@jane-doe",
        "pagerDutyUrl": "https://pagerduty.com/services/PXXXXXX"
      },
      "slo": {
        "availability": 99.9,
        "latencyP99": 200
      }
    }
  }
}
3.5 Real World Outcome
Outcomes to aim for after building the catalog (the figures are illustrative targets, not measurements):
- Developer onboarding reduced from 2 weeks to 2 days
- Incident MTTA reduced (find owner in seconds)
- Duplicate service creation prevented
- Audit/compliance easier (clear ownership)
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ SERVICE CATALOG │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ DATA INGESTION│ │ CORE API │ │ WEB UI │
│ │ │ │ │ │
│ - GitHub scan │ │ - GraphQL │ │ - Search │
│ - Manual YAML │ │ - REST │ │ - Browse │
│ - PagerDuty │ │ - Webhooks │ │ - Dependency │
│ - Grafana │ │ │ │ graph │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
┌─────────────────────┐
│ DATABASE │
│ (PostgreSQL) │
└─────────────────────┘
4.2 Key Components
- Ingestion Pipeline: Scrapes repos for catalog-info.yaml
- Core Database: PostgreSQL with service/team/dependency tables
- GraphQL API: Flexible queries for any client
- REST API: Simple queries for CLI/scripts
- Web UI: React/Vue search and browse interface
- Integrations: PagerDuty, Grafana, GitHub enrichment
4.3 Data Structures
-- PostgreSQL Schema
CREATE TABLE teams (
    id VARCHAR(50) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    slack_channel VARCHAR(50),
    pagerduty_schedule VARCHAR(50),
    github_team VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE services (
    id VARCHAR(100) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    owner_team_id VARCHAR(50) REFERENCES teams(id),
    lifecycle VARCHAR(20) NOT NULL,  -- experimental, development, production, deprecated
    type VARCHAR(20) NOT NULL,       -- service, website, library, worker
    repo_url VARCHAR(255),
    docs_url VARCHAR(255),
    slo_dashboard_url VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE dependencies (
    id SERIAL PRIMARY KEY,
    source_service_id VARCHAR(100) REFERENCES services(id),
    target_service_id VARCHAR(100) REFERENCES services(id),
    dependency_type VARCHAR(20),     -- runtime, build, data
    UNIQUE(source_service_id, target_service_id)
);

CREATE TABLE service_metadata (
    service_id VARCHAR(100) REFERENCES services(id),
    key VARCHAR(50) NOT NULL,
    value TEXT,
    PRIMARY KEY (service_id, key)
);
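The dependencies table stores each edge once (source depends on target), so "dependencies" and "dependents" are the same rows read in opposite directions. The sketch below illustrates that idea with the standard-library `sqlite3` module as a stand-in for PostgreSQL. The trimmed-down tables and the helper names `dependencies_of` and `dependents_of` are assumptions made for illustration, not part of the schema above.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; only the columns needed
# for the two lookups are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services (id TEXT PRIMARY KEY, owner_team_id TEXT);
CREATE TABLE dependencies (
    source_service_id TEXT REFERENCES services(id),
    target_service_id TEXT REFERENCES services(id),
    UNIQUE (source_service_id, target_service_id)
);
""")
conn.executemany("INSERT INTO services VALUES (?, ?)", [
    ("checkout-service", "team-checkout"),
    ("payment-gateway", "team-payments"),
    ("inventory-service", "team-inventory"),
    ("user-service", "team-identity"),
    ("mobile-app", "team-mobile"),
    ("web-frontend", "team-web"),
])
conn.executemany("INSERT INTO dependencies VALUES (?, ?)", [
    ("checkout-service", "payment-gateway"),
    ("checkout-service", "inventory-service"),
    ("checkout-service", "user-service"),
    ("mobile-app", "checkout-service"),
    ("web-frontend", "checkout-service"),
])


def dependencies_of(conn, service_id):
    """Services that service_id depends on (outgoing edges)."""
    rows = conn.execute(
        "SELECT target_service_id FROM dependencies "
        "WHERE source_service_id = ? ORDER BY target_service_id",
        (service_id,),
    )
    return [row[0] for row in rows]


def dependents_of(conn, service_id):
    """Services that depend on service_id (incoming edges)."""
    rows = conn.execute(
        "SELECT source_service_id FROM dependencies "
        "WHERE target_service_id = ? ORDER BY source_service_id",
        (service_id,),
    )
    return [row[0] for row in rows]
```

Because dependents are derived at query time, no repo ever has to declare who depends on it; only outgoing dependencies are written down.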
# catalog-info.yaml (in each repo)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  description: Handles customer checkout flow
  annotations:
    pagerduty.com/service-id: PXXXXXX
spec:
  owner: team-checkout
  lifecycle: production
  type: service
  links:
    - type: docs
      url: https://wiki.internal/checkout
    - type: dashboard
      url: https://grafana.internal/d/checkout
  dependencies:
    - service:payment-gateway
    - service:inventory-service
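A catalog file is only useful if required fields are present and enumerated values are spelled correctly, so validation is worth sketching early. The example below assumes the YAML has already been parsed into a plain dict (for instance with PyYAML). The function name `validate_catalog_info` and the choice of which fields count as required are assumptions; the allowed lifecycle and type values are taken from the comments in the SQL schema.

```python
# Required (section, field) pairs. This particular set is an assumption.
REQUIRED_FIELDS = [
    ("metadata", "name"),
    ("spec", "owner"),
    ("spec", "lifecycle"),
    ("spec", "type"),
]
LIFECYCLES = {"experimental", "development", "production", "deprecated"}
TYPES = {"service", "website", "library", "worker"}


def validate_catalog_info(doc):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for section, field in REQUIRED_FIELDS:
        value = (doc.get(section) or {}).get(field)
        if not value:
            errors.append(f"missing {section}.{field}")

    spec = doc.get("spec") or {}
    lifecycle = spec.get("lifecycle")
    if lifecycle and lifecycle not in LIFECYCLES:
        errors.append(f"unknown lifecycle: {lifecycle}")
    service_type = spec.get("type")
    if service_type and service_type not in TYPES:
        errors.append(f"unknown type: {service_type}")
    return errors


valid_doc = {
    "metadata": {"name": "checkout-service"},
    "spec": {"owner": "team-checkout", "lifecycle": "production", "type": "service"},
}
invalid_doc = {
    "metadata": {"name": "checkout-service"},
    "spec": {"lifecycle": "prod", "type": "service"},
}
```

Returning every error at once, rather than raising on the first, suits a PR check: the author sees the full list of fixes in a single run.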
4.4 Algorithm Overview
# Ingestion pipeline (sketch: github, db, parse_yaml, and Service are project helpers)
async def sync_catalog():
    # 1. Scan all repos for catalog-info.yaml
    repos = await github.list_org_repos("acme")
    for repo in repos:
        try:
            content = await github.get_file(repo, "catalog-info.yaml")
        except FileNotFoundError:
            # No catalog-info.yaml in this repo
            continue
        spec = parse_yaml(content)

        # 2. Upsert service record
        service = Service(
            id=spec.metadata.name,
            name=spec.metadata.name,
            description=spec.metadata.description,
            owner_team_id=spec.spec.owner,
            lifecycle=spec.spec.lifecycle,
            type=spec.spec.type,
            repo_url=repo.url,
        )
        await db.upsert_service(service)

        # 3. Sync dependencies (refs look like "service:payment-gateway";
        #    strip the prefix so the value matches services.id)
        await db.clear_dependencies(service.id)
        for dep in spec.spec.dependencies or []:
            await db.add_dependency(service.id, dep.split(":", 1)[-1])

        # 4. Sync metadata (annotations are optional)
        for key, value in (spec.metadata.annotations or {}).items():
            await db.set_metadata(service.id, key, value)

    # 5. Enrich with external data, once, after all repos are scanned
    await enrich_pagerduty_data()
    await enrich_grafana_dashboards()
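To experiment with the ingestion logic without GitHub or a database, the same steps can be run against in-memory data. The following is a simplified, synchronous sketch under stated assumptions: `repos` maps a repo URL to an already-parsed catalog-info dict (or `None` when the file is absent), and `parse_ref` and the returned structures are hypothetical helpers chosen for illustration.

```python
def parse_ref(ref, default_kind="service"):
    """Split "service:payment-gateway" into ("service", "payment-gateway")."""
    kind, sep, name = ref.partition(":")
    return (kind, name) if sep else (default_kind, ref)


def sync_catalog(repos):
    """Build service records and dependency edges from parsed catalog files."""
    services = {}
    dependencies = set()
    for repo_url, doc in repos.items():
        if doc is None:
            # No catalog-info.yaml in this repo.
            continue
        name = doc["metadata"]["name"]
        spec = doc.get("spec") or {}
        # Keyed by name, so re-running the sync overwrites (upserts) the record.
        services[name] = {
            "owner": spec.get("owner"),
            "lifecycle": spec.get("lifecycle"),
            "type": spec.get("type"),
            "repo_url": repo_url,
        }
        for ref in spec.get("dependencies") or []:
            _kind, target = parse_ref(ref)
            dependencies.add((name, target))
    return services, dependencies


# Sample input (illustrative only).
REPOS = {
    "github.com/acme/checkout-service": {
        "metadata": {"name": "checkout-service"},
        "spec": {
            "owner": "team-checkout",
            "lifecycle": "production",
            "type": "service",
            "dependencies": ["service:payment-gateway", "inventory-service"],
        },
    },
    "github.com/acme/payment-gateway": {
        "metadata": {"name": "payment-gateway"},
        "spec": {"owner": "team-payments", "lifecycle": "production", "type": "service"},
    },
    "github.com/acme/scratchpad": None,
}
```

Note that `inventory-service` appears as a dependency target even though no scanned repo declares it. A real pipeline has to decide whether to reject such edges, defer them, or create a placeholder record.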
5. Implementation Guide
5.1 Development Environment Setup
# Create project
mkdir service-catalog && cd service-catalog
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install fastapi uvicorn sqlalchemy asyncpg pyyaml httpx strawberry-graphql
# Database
docker run -d --name catalog-db -p 5432:5432 \
-e POSTGRES_DB=catalog \
-e POSTGRES_PASSWORD=secret \
postgres:15
5.2 Project Structure
service-catalog/
├── src/
│ ├── __init__.py
│ ├── models.py # SQLAlchemy models
│ ├── database.py # DB connection
│ ├── api/
│ │ ├── rest.py # FastAPI REST endpoints
│ │ └── graphql.py # Strawberry GraphQL
│ ├── ingestion/
│ │ ├── github.py # Scan repos
│ │ ├── pagerduty.py # Enrich on-call
│ │ └── grafana.py # Enrich dashboards
│ └── cli.py # Command-line tool
├── web/
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ ├── Search.tsx
│ │ │ └── ServiceDetail.tsx
│ │ └── api.ts
│ └── package.json
├── migrations/
│ └── 001_initial.sql
├── docker-compose.yaml
└── README.md
5.3 The Core Question You’re Answering
“How do we prevent people from building the same thing twice because they couldn’t find the existing version?”
In large organizations, discovery is a major source of waste. The Catalog is the interface for discovery.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- The C4 Model
- How do you visualize software at different levels?
- Reference: c4model.com
- GitOps for Metadata
- Why store metadata in repo instead of separate UI?
- Reference: Standard GitOps literature
- Service Mesh vs. Service Catalog
- What’s the difference? (Runtime vs. design-time)
- Reference: Backstage documentation
5.5 Questions to Guide Your Design
Before implementing, think through these:
Data Freshness
- How often does catalog sync from sources?
- What happens if a repo is deleted?
- How do you handle renames?
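One way to reason about deletions and renames is to compare the set of service IDs already in the catalog with the set found by the latest scan. The sketch below shows that set arithmetic; the function name `plan_sync` and the shape of its result are assumptions made for illustration.

```python
def plan_sync(existing_ids, scanned_ids):
    """Classify service IDs by comparing the catalog with the latest scan."""
    existing, scanned = set(existing_ids), set(scanned_ids)
    return {
        "create": sorted(scanned - existing),   # new catalog-info.yaml files
        "update": sorted(scanned & existing),   # still present, refresh fields
        "remove": sorted(existing - scanned),   # repo or file disappeared
    }


plan = plan_sync(
    existing_ids=["checkout-service", "legacy-cart", "payment-gateway"],
    scanned_ids=["checkout-service", "payment-gateway", "cart-service"],
)
```

With name-based IDs, a rename from `legacy-cart` to `cart-service` shows up as one removal plus one creation, which drops history and dependency edges. That is an argument for either a stable ID separate from the display name or a soft-delete step before anything is actually removed.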
Consistency
- If catalog-info.yaml is wrong, who fixes it?
- Is there validation before accepting?
- How do you enforce required fields?
Consumers
- Who uses the API? (Developers, incident responders, auditors)
- What are their different needs?
5.6 Thinking Exercise
The “New Microservice” Flow
Trace the steps of creating a new service from scratch.
Questions:
- At what point does the catalog learn about the service?
- How does someone else discover it exists?
- How do they know if it’s production-ready?
Write down:
- The moment a new service should appear in catalog
- The minimum required metadata
- How you’d enforce that metadata exists
5.7 Hints in Layers
Hint 1: Use Catalog-as-Code
Every repo should have a catalog-info.yaml. The source of truth is Git, not a database.
Hint 2: Build a Scraper
Write a script that clones all repos, reads each catalog-info.yaml, and inserts the result into the database.
Hint 3: Expose via CLI First
Developers live in the terminal. catalog info service-name is more useful than a fancy UI initially.
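A first CLI can be small. The sketch below uses the standard-library `argparse` module with two subcommands, `search` and `info`, mirroring the example usage earlier in this document. The in-memory `CATALOG` dict, the `run` function, and the output format are assumptions made for illustration; a real tool would call the REST API instead.

```python
import argparse

# Stand-in for data that would come from the catalog API (illustrative only).
CATALOG = {
    "checkout-service": {"owner": "team-checkout", "lifecycle": "production", "type": "service"},
    "checkout-worker": {"owner": "team-checkout", "lifecycle": "production", "type": "worker"},
    "payment-gateway": {"owner": "team-payments", "lifecycle": "production", "type": "service"},
}


def build_parser():
    parser = argparse.ArgumentParser(prog="catalog")
    sub = parser.add_subparsers(dest="command", required=True)
    search = sub.add_parser("search", help="find services by name")
    search.add_argument("query")
    info = sub.add_parser("info", help="show details for one service")
    info.add_argument("name")
    return parser


def run(argv, catalog=CATALOG):
    """Parse argv and return the output lines (returned, not printed, for testability)."""
    args = build_parser().parse_args(argv)
    if args.command == "search":
        return [
            f"{name}  {meta['owner']}  {meta['lifecycle']}  {meta['type']}"
            for name, meta in sorted(catalog.items())
            if args.query.lower() in name.lower()
        ]
    meta = catalog.get(args.name)
    if meta is None:
        return [f"service not found: {args.name}"]
    return [args.name] + [f"{key}: {value}" for key, value in meta.items()]
```

An entry point can then do `print("\n".join(run(sys.argv[1:])))`. Keeping `run` free of printing makes it straightforward to unit test.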
Hint 4: Connect to On-Call
Integrate with the PagerDuty API. Show the real-time on-call person, not a static name.
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “Why is a Service Catalog essential for microservices?”
- Discovery at scale, ownership clarity, dependency visibility
- “How do you ensure metadata stays accurate?”
- GitOps (code review), validation, regular audits, integration tests
- “What are the core fields every service should have?”
- Name, owner, lifecycle, repo, docs, on-call
- “How does a Service Catalog help with incident response?”
- Find owner in seconds, see dependencies, get runbook link
- “What’s the difference between Service Catalog and CMDB?”
- Catalog is for developers; CMDB is for IT operations. Catalog is code-centric.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Service Discovery | “SRE Book” (Google) | Ch. 10: Practical Alerting |
| Metadata Design | “Modern Software Engineering” | Ch. 4 |
| API Design | “Designing APIs with Swagger” | All |
5.10 Implementation Phases
Phase 1: Schema & Data Model (5-8 hours)
- Define catalog-info.yaml schema
- Create PostgreSQL schema
- Build SQLAlchemy models
Phase 2: Ingestion Pipeline (8-10 hours)
- GitHub repo scanner
- YAML parser and validator
- Database upsert logic
Phase 3: REST API (5-8 hours)
- FastAPI endpoints
- Search with full-text
- Filter by team/lifecycle
Phase 4: GraphQL API (5-8 hours)
- Strawberry schema
- Resolvers for relationships
- Nested queries (service → owner → on-call)
Phase 5: Web UI (10-15 hours)
- React/Vue setup
- Search component
- Service detail page
- Dependency graph visualization
Phase 6: Integrations (8-10 hours)
- PagerDuty enrichment
- Grafana dashboard links
- Real-time on-call display
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Storage | PostgreSQL | Neo4j | PostgreSQL (simpler) |
| API | REST only | GraphQL + REST | Both |
| UI framework | React | Backstage | React if learning, Backstage for production |
| Sync | Push (webhooks) | Pull (cron) | Pull initially, add webhooks later |
6. Testing Strategy
Unit Tests
def test_parse_catalog_info():
    raw = """
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  owner: team-test
"""
    result = parse_catalog_info(raw)
    assert result.metadata.name == "test-service"
    assert result.spec.owner == "team-test"


def test_service_search():
    # Setup test data (other Service fields omitted for brevity)
    db.insert_service(Service(name="checkout-service"))
    db.insert_service(Service(name="checkout-worker"))
    results = api.search("checkout")
    assert len(results) == 2
Integration Tests
- Scan a real GitHub organization
- Verify all expected services are ingested
- Test API returns correct relationships
E2E Tests
- Search in UI, verify results
- Click through to service detail
- Verify dependency graph renders
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Missing services | Fewer than expected | No catalog-info.yaml | Require file in PR checks |
| Stale data | Wrong owner shown | Sync not running | Check cron job, add monitoring |
| Slow search | > 1s response | No index | Add PostgreSQL full-text index |
| Broken links | 404s to dashboards | URLs changed | Add link validation to sync |
8. Extensions & Challenges
Extension 1: API Documentation
Embed OpenAPI specs, show request/response examples.
Extension 2: Cost Attribution
Show cloud costs per service (integrate with billing).
Extension 3: Tech Radar
Tag services with technology, show tech stack trends.
Extension 4: Deprecation Workflow
Mark services as deprecated, notify dependents.
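Notifying dependents means walking the dependency graph backwards from the deprecated service. The sketch below does a breadth-first traversal over reversed edges and maps the affected services to their owning teams. It assumes that indirect (transitive) dependents should be notified too, which is a design choice rather than something this document specifies; `transitive_dependents` and `teams_to_notify` are hypothetical names.

```python
from collections import deque


def transitive_dependents(edges, service):
    """Return every service that directly or indirectly depends on `service`.

    `edges` is an iterable of (source, target) pairs meaning source depends on target.
    """
    reverse = {}
    for source, target in edges:
        reverse.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            # The seen set also protects against dependency cycles.
            if dependent != service and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def teams_to_notify(edges, owners, service):
    """Owning teams of all affected services, de-duplicated and sorted."""
    affected = transitive_dependents(edges, service)
    return sorted({owners[name] for name in affected if name in owners})


EDGES = [
    ("checkout-service", "payment-gateway"),
    ("mobile-app", "checkout-service"),
    ("web-frontend", "checkout-service"),
]
OWNERS = {
    "checkout-service": "team-checkout",
    "mobile-app": "team-mobile",
    "web-frontend": "team-web",
    "payment-gateway": "team-payments",
}
```

Deprecating `payment-gateway` in this sample reaches `checkout-service` directly and the two frontends indirectly, so three teams would be notified.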
9. Real-World Connections
Open-Source and Commercial Tools:
- Backstage - Spotify’s open-source developer portal
- Cortex - Service ownership platform
- OpsLevel - Service catalog SaaS
How Big Tech Does This:
- Spotify: Built Backstage
- Netflix: Internal service registry with ownership
- Google: Extensive internal service directory
10. Resources
- Backstage: backstage.io
- C4 Model: c4model.com
Related Projects
- P03: Ownership Mapper - Define owners
- P08: Dependency Visualizer - Dependency graphs
11. Self-Assessment Checklist
Before considering this project complete, verify:
- catalog-info.yaml schema is defined
- Database has 20+ services ingested
- Search works and returns in < 500ms
- API has both REST and GraphQL endpoints
- UI allows search and browse
- Dependency graph is visible
- At least one integration (PagerDuty or Grafana)
- Someone outside your team has used the catalog
12. Submission / Completion Criteria
This project is complete when you have:
- Schema specification for catalog-info.yaml
- Ingestion pipeline scanning repos
- PostgreSQL database with services/teams/dependencies
- REST API with search and CRUD
- GraphQL API with nested queries
- Web UI with search and service detail
- CLI tool for quick lookups
- Documentation for adding new services