Project 9: Prompt Caching Optimizer

Before/after benchmark showing cache-hit gains and cost deltas.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 5-10 days (capstone: 3-5 weeks) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 3: Cost Slayer |
| Business Potential | 4. Platform ROI |
| Knowledge Area | Performance Optimization |
| Software or Tool | Prefix partitioner + cache monitor |
| Main Book | Designing Data-Intensive Applications |
| Concept Clusters | Context Engineering and Caching; Instruction Hierarchy and Injection Defense |

1. Learning Objectives

By completing this project, you will:

  1. Partition prompt templates into stable prefix segments and dynamic suffix segments to maximize KV-cache reuse across requests.
  2. Build a benchmark harness that replays recorded traffic traces through before/after prompt layouts and measures cache hit ratios, token costs, and latency deltas.
  3. Implement provider-specific caching strategies (OpenAI automatic prefix matching, Anthropic explicit cache_control blocks, Google context caching with TTL).
  4. Construct a cost model that calculates savings from cached vs uncached input token pricing and validates that cost wins do not come at the expense of output quality.
  5. Design cache invalidation policies based on TTL, content hashing, and version tagging to prevent stale context and cross-tenant data leakage.
  6. Produce a diff report artifact comparing before/after prompt designs with per-request-class breakdowns of cache hit rate, cost, and latency.

2. All Theory Needed (Per-Concept Breakdown)

Prompt Prefix Partitioning

Fundamentals LLM providers cache the key-value (KV) attention states computed during prompt processing. When two requests share the same beginning tokens, the provider can reuse the cached KV states for those tokens instead of recomputing them. This is prompt prefix caching. The implication for prompt engineers is profound: if you structure your prompts so that the stable, reusable content comes first and the per-request dynamic content comes last, you can dramatically reduce input token costs and latency. Prompt prefix partitioning is the discipline of splitting every prompt template into a static prefix (system instructions, few-shot examples, policy rules) and a dynamic suffix (the user query, session context, retrieved documents). The boundary between prefix and suffix is the single most important architectural decision in this project because it determines cache effectiveness across your entire request population.

Deep Dive into the concept The KV cache works at the token level. When a request arrives, the provider checks whether it has cached KV states for a prefix of the incoming token sequence. The longer the matching prefix, the more computation is saved. A cache hit on 1,000 tokens out of a 2,000-token prompt means roughly half the input processing cost is eliminated (at reduced cached-token pricing) and latency drops proportionally.

The critical insight is that cache matching is exact and sequential. If even a single token differs between two requests at position N, everything from position N onward must be recomputed. This means that seemingly minor differences in the prefix can destroy cache effectiveness. Common prefix-breaking culprits include: timestamps embedded in system prompts (“Today is February 12, 2026”), per-user personalization tokens in the system section, randomized few-shot example ordering, whitespace or formatting differences between deployments, and dynamic retrieval context placed before the user query.

Provider-specific mechanisms differ in important ways. OpenAI provides automatic prefix caching: if two requests to the same model share the same initial token sequence (at least 1,024 tokens), the cached portion is billed at a reduced rate. There is no explicit API to control this; you optimize by structuring prompts so the shared prefix is long and stable. Anthropic provides explicit caching via cache_control breakpoints in the messages array. You insert a cache_control: {"type": "ephemeral"} marker to tell the API where to cache up to. Cached prefixes have a 5-minute TTL and a small write cost on first creation. Google provides context caching as a separate API: you create a named cached content object with an explicit TTL, then reference it by ID in subsequent requests.

Token boundary alignment matters for OpenAI’s automatic caching. The cache operates in fixed-size chunks (currently 128 tokens). If your stable prefix ends at token 1,050 and the chunk boundary is at token 1,024, only the first 1,024 tokens are cached. The remaining 26 stable tokens are recomputed every time. Understanding chunk boundaries lets you pad or trim your prefix to align with cache boundaries for maximum hit efficiency.
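The alignment arithmetic can be sketched in a few lines (the 128-token chunk size and 1,024-token minimum reflect OpenAI's documented behavior at the time of writing; verify against current documentation):

```python
CHUNK_SIZE = 128   # OpenAI cache chunk granularity
MIN_PREFIX = 1024  # minimum prefix length eligible for caching

def cacheable_tokens(stable_prefix_len: int) -> int:
    """How many of the stable prefix tokens can actually be cached."""
    if stable_prefix_len < MIN_PREFIX:
        return 0  # below the minimum, nothing is cached
    return (stable_prefix_len // CHUNK_SIZE) * CHUNK_SIZE

def wasted_tokens(stable_prefix_len: int) -> int:
    """Stable tokens recomputed on every request due to misalignment."""
    return stable_prefix_len - cacheable_tokens(stable_prefix_len)
```

For the example above, `cacheable_tokens(1050)` is 1,024 and `wasted_tokens(1050)` is 26: those 26 stable tokens are paid for at full price on every request until you pad or trim the prefix.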

Partitioning strategy follows a layered architecture. The outermost layer (first in the token sequence) should be the content that changes least frequently: the system identity prompt. Next comes the safety and policy preamble. Then few-shot examples (if they are static across sessions). Then retrieval-augmented context (which changes per query but might be shared across users asking about the same topic). Finally, the user query itself. The general rule is: sort prompt segments by decreasing stability, most stable first.
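The "most stable first" rule reduces to a sort over annotated segments. A minimal sketch, where the segment kinds and stability ranks are illustrative assumptions, not a fixed taxonomy:

```python
# Lower rank = more stable = earlier in the token sequence.
STABILITY_RANK = {
    "system_identity": 0,    # changes only on redeploys
    "policy_preamble": 1,    # changes with policy revisions
    "few_shot_examples": 2,  # static per task type
    "retrieved_context": 3,  # changes per query, shared per topic
    "user_query": 4,         # unique per request
}

def order_segments(segments: list[dict]) -> list[dict]:
    """Sort prompt segments so the most stable content forms the prefix."""
    return sorted(segments, key=lambda s: STABILITY_RANK[s["kind"]])
```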

For multi-turn conversations, the partitioning challenge intensifies. The conversation history grows with each turn, and earlier turns become part of the prefix. If you reorder or summarize earlier messages, you break the cache. One strategy is to keep conversation history in chronological order and rely on the cache matching progressively longer prefixes as the conversation continues.

How this fits into the project Prompt prefix partitioning is the primary design concept for Project 9. The benchmark harness you build will compare before/after prompt layouts by replaying traffic traces through each layout and measuring how much of the prefix is shared across requests. Your goal is to redesign prompt templates to maximize the shared prefix length.

Definitions & key terms

  • KV cache (Key-Value cache): The cached attention key and value tensors from transformer layers, computed during prompt processing. Reusing these avoids redundant computation.
  • Static prefix: The portion of a prompt template that is identical across all or most requests (system instructions, few-shot examples, policy rules).
  • Dynamic suffix: The portion that changes per request (user query, session context, retrieved documents).
  • Cache boundary marker: An explicit annotation (Anthropic’s cache_control) or an implicit alignment point (OpenAI’s chunk boundaries) where the cache splits prefix from suffix.
  • Cache hit: When an incoming request’s prefix matches a cached prefix, allowing KV state reuse.
  • Cache miss: When no matching prefix exists, requiring full computation.
  • TTL (Time to Live): The duration a cached prefix remains valid before expiration (e.g., 5 minutes for Anthropic, configurable for Google).
  • Prefix stability: A measure of how consistent a prefix is across the request population. Higher stability means higher cache hit rates.

Mental model diagram (ASCII)

BEFORE (poor cache performance):
+----------------------------------------------------------------+
| System prompt | Timestamp | User name | Few-shot | User query   |
|   (stable)    | (dynamic) | (dynamic) | (stable) |  (dynamic)   |
+----------------------------------------------------------------+
  ^-- cache breaks here due to dynamic timestamp at position 2

AFTER (optimized for caching):
+----------------------------------------------------------------+
| System prompt | Few-shot examples | Policy rules | User query   |
|   (stable)    |    (stable)       |   (stable)   |  (dynamic)   |
+----------------------------------------------------------------+
  ^--- entire stable prefix is cached ------^   ^-- only this recomputed

Request population view:
  Request A: [========== CACHED PREFIX ==========][dynamic A]
  Request B: [========== CACHED PREFIX ==========][dynamic B]
  Request C: [========== CACHED PREFIX ==========][dynamic C]
                                                   ^
                                              only this part
                                              costs full price

Provider-specific caching:

  OpenAI (automatic):
  [tokens 0-1023: cached chunk 1][tokens 1024-2047: cached chunk 2][dynamic...]
   (128-token aligned chunks, automatic matching)

  Anthropic (explicit):
  [system message {cache_control: ephemeral}][user message]
   (5-min TTL, write cost on first creation)

  Google (named cache):
  [CachedContent id=abc123, ttl=3600s] -> reference in request
   (separate API call to create, explicit TTL)

How it works (step-by-step, with invariants and failure modes)

  1. Audit the existing prompt template and categorize each segment as static or dynamic. Invariant: every segment is classified; no segment is left ambiguous. Failure mode: a segment classified as static actually contains per-request data (e.g., a timestamp formatted into the system prompt), destroying cache hits.
  2. Reorder segments so all static segments precede all dynamic segments. Invariant: the reordering does not change the semantic meaning of the prompt. Failure mode: moving few-shot examples before the system prompt confuses the model about its role.
  3. Remove or relocate any dynamic content from the prefix. Invariant: the prefix is identical across all requests in a given request class. Failure mode: per-user personalization in the system prompt makes every user’s prefix unique, resulting in zero cache hits.
  4. Align the prefix length to provider cache boundaries (128-token chunks for OpenAI). Invariant: the prefix length is a multiple of the chunk size or as close as possible. Failure mode: the prefix ends 10 tokens past a chunk boundary, wasting a full chunk of cache potential.
  5. Insert explicit cache control markers for providers that support them (Anthropic’s cache_control). Invariant: the marker is placed at the exact boundary between static and dynamic content. Failure mode: the marker is placed too early, leaving stable content uncached; or too late, including dynamic content in the cached region.
  6. Benchmark the new layout against the old using replay traffic. Invariant: the same traffic trace is used for both layouts. Failure mode: using different trace samples for before/after introduces confounding variables.

Minimal concrete example

BEFORE prompt layout (low cache performance):
  messages:
    - role: system
      content: |
        You are a customer support agent for Acme Corp.
        Today is {{current_date}}.              <-- dynamic! breaks cache
        User: {{user_name}}                     <-- dynamic! breaks cache
        Respond in {{user_language}}.            <-- dynamic! breaks cache
    - role: user
      content: "{{user_query}}"

AFTER prompt layout (optimized for caching):
  messages:
    - role: system
      content: |
        You are a customer support agent for Acme Corp.
        Follow the Acme support guidelines v2.3.
        Always be polite, accurate, and concise.
        If unsure, say "Let me check on that."
      cache_control: {"type": "ephemeral"}      <-- Anthropic: cache up to here
    - role: user
      content: |
        Context: User {{user_name}}, language: {{user_language}}, date: {{current_date}}
        Question: {{user_query}}

Result: The system message (stable across all requests) is cached.
Dynamic fields are pushed to the user message (suffix).
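The same AFTER layout, built as a request payload in code. The cache_control content-block shape follows Anthropic's documented format, but treat the exact field placement as an assumption to check against the current API reference:

```python
STATIC_SYSTEM = (
    "You are a customer support agent for Acme Corp.\n"
    "Follow the Acme support guidelines v2.3.\n"
    "Always be polite, accurate, and concise.\n"
    'If unsure, say "Let me check on that."'
)

def build_request(user_name: str, user_language: str,
                  current_date: str, user_query: str) -> dict:
    """Static system block is cache-marked; all dynamic fields go in the user turn."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM,  # byte-identical across requests -> cacheable
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": (
                    f"Context: User {user_name}, language: {user_language}, "
                    f"date: {current_date}\nQuestion: {user_query}"
                ),
            }
        ],
    }
```

Two requests built this way share a byte-identical system block, so the cached prefix matches even though every user turn differs.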

Common misconceptions

  • “Caching is a provider-side optimization I cannot influence.” You directly control cache effectiveness through prompt structure. Putting dynamic content early in the prompt destroys caching; moving it to the end enables it.
  • “A longer prompt always costs more.” With effective prefix caching, a longer prompt with a large stable prefix can cost less per request than a shorter prompt with no cacheable prefix, because cached tokens are billed at a fraction of the full input price.
  • “Cache hits are guaranteed for identical prefixes.” Cache hits depend on provider implementation details: TTL expiration (Anthropic’s 5-minute TTL), cache eviction under load, chunk alignment (OpenAI), and whether the same model version is being served. You must measure actual hit rates, not assume them.
  • “I can cache everything, including user-specific context.” Caching user-specific data in a shared prefix creates cross-tenant data leakage risks. Only content that is safe to share across all users in a request class should be in the cached prefix.

Check-your-understanding questions

  1. Why does placing a timestamp in the system prompt destroy cache effectiveness for all requests?
  2. What is the difference between OpenAI’s automatic prefix caching and Anthropic’s explicit cache_control mechanism?
  3. How does token chunk alignment affect cache hit rates on OpenAI?

Check-your-understanding answers

  1. The timestamp changes every second (or minute/day), making the system prompt unique for each time period. Since cache matching is exact and sequential, a different timestamp at position N means all tokens from N onward are cache misses, even if the rest of the prompt is identical.
  2. OpenAI automatically detects shared prefixes across requests (at 128-token chunk granularity, minimum 1,024 tokens) and caches them without any API changes. Anthropic requires you to explicitly insert cache_control: {"type": "ephemeral"} breakpoints in the messages array to tell the API where to create cache boundaries. Anthropic’s approach gives you precise control but requires deliberate prompt restructuring.
  3. OpenAI caches in 128-token chunks. If your stable prefix is 1,050 tokens, only the first 1,024 tokens (8 chunks) are cached. The remaining 26 tokens are recomputed with every request. To maximize cache efficiency, you should pad or trim your prefix to align with the 128-token boundary.

Real-world applications

  • High-volume customer support chatbots where thousands of requests per minute share the same system prompt and policy instructions.
  • RAG (Retrieval-Augmented Generation) pipelines where the system prompt and few-shot examples are stable but the retrieved context varies per query.
  • Multi-tenant SaaS platforms where each tenant shares the same base system prompt but has different user queries.

Where you’ll apply it

  • Phase 1 of this project: audit existing prompt templates, classify segments as static vs dynamic, and restructure for maximum prefix stability.
  • The partitioned prompt layout is then benchmarked using the harness from Concept 2 and validated for safety using the boundaries from Concept 3.

References

  • Anthropic prompt caching documentation (cache_control, TTL behavior, pricing)
  • OpenAI prompt caching documentation (automatic prefix matching, chunk alignment)
  • Google Vertex AI context caching documentation (CachedContent API, TTL configuration)
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapters on caching strategies and consistency

Key insights Cache effectiveness is determined by prompt structure, not prompt content: the same words rearranged can mean the difference between 0% and 80% cache hit rates.

Summary Prompt prefix partitioning splits every prompt template into a stable prefix (system instructions, few-shot examples, policies) and a dynamic suffix (user query, session context). Cache matching is exact and sequential, so any dynamic content in the prefix destroys cache hits for all subsequent tokens. Provider mechanisms differ: OpenAI caches automatically in 128-token chunks, Anthropic requires explicit cache_control markers with a 5-minute TTL, and Google uses a separate API with configurable TTL. The optimization strategy is to sort prompt segments by decreasing stability, align to cache boundaries, and push all per-request content to the end.

Homework/Exercises to practice the concept

  • Take a prompt template with 5 segments (system identity, timestamp, few-shot examples, user personalization, user query). Rewrite it so the stable segments form a contiguous prefix. Calculate the cache-eligible token percentage before and after.
  • For a 1,800-token prompt where 1,200 tokens are stable, calculate how many 128-token chunks are fully cached on OpenAI. How many stable tokens are wasted (not aligned to a chunk boundary)? What would you do to recover them?
  • Write a prompt layout for Anthropic that uses cache_control breakpoints to cache the system message and few-shot examples separately from the user message. Explain why you placed each breakpoint where you did.

Solutions to the homework/exercises

  • The rewritten prompt should place system identity first, then few-shot examples (both stable), then the user query, timestamp, and personalization (all dynamic) at the end. Before: perhaps 30% of tokens are in cache-eligible prefix position (system identity only, because timestamp immediately follows). After: 60-70% of tokens are in cache-eligible prefix position.
  • 1,200 stable tokens / 128 tokens per chunk = 9 full chunks (1,152 tokens cached). 48 tokens are wasted (stable but past the 9th chunk boundary). To recover: pad the stable prefix with additional useful context (e.g., more few-shot examples or expanded policy rules) to reach 1,280 tokens (10 full chunks), or trim the prefix to exactly 1,152 tokens.
  • The Anthropic layout should place cache_control: {"type": "ephemeral"} after the system message content. The few-shot examples should be in a separate message with another cache_control breakpoint. The user message comes last without a cache marker. Rationale: the system message is shared across all requests (highest stability); few-shot examples are shared within a task type (medium stability); the user message is unique per request (no caching benefit).

Cache Hit Telemetry and Cost Modeling

Fundamentals Measuring cache effectiveness requires a telemetry pipeline that captures cache hit/miss status for every request and feeds it into a cost model. Without measurement, prompt restructuring is guesswork. You might believe your new layout is better because it looks cleaner, but without telemetry you cannot prove it saves money or reduces latency. Cache hit telemetry gives you the data; cost modeling translates that data into dollars and cents. Together, they form the evidence layer that justifies (or rejects) every proposed prompt layout change. This project’s benchmark harness exists to produce exactly this evidence: before/after comparisons backed by numbers, not intuitions.

Deep Dive into the concept The telemetry pipeline starts at the API response level. LLM providers include caching metadata in their responses. Anthropic returns a cache_creation_input_tokens field (tokens written to cache on first request) and a cache_read_input_tokens field (tokens read from cache on subsequent requests). OpenAI includes cached_tokens in the usage object. Google returns cache hit status in its context caching API responses. Your benchmark harness must capture these fields for every request and store them alongside the request metadata (request class, prompt layout version, timestamp, model name).

The cache hit ratio is the primary effectiveness metric. It is calculated as cache_hit_tokens / total_input_tokens across a population of requests. But a single global ratio is misleading. Consider a system with two request classes: customer support (80% of traffic, 2,000-token prompts) and code review (20% of traffic, 5,000-token prompts). If the new prompt layout achieves 90% cache hits for support but only 10% for code review, the request-weighted average still looks healthy (0.8 × 90% + 0.2 × 10% = 74%), and even the token-weighted global ratio lands near 59%, yet the code review class is barely benefiting. You must segment by request class.
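A quick calculation makes the segmentation point concrete; the per-class numbers below mirror the support/code-review mix described above:

```python
def hit_ratio(requests: list[dict]) -> float:
    """Token-weighted cache hit ratio: sum(cached_tokens) / sum(total_input_tokens)."""
    cached = sum(r["cached_tokens"] for r in requests)
    total = sum(r["total_tokens"] for r in requests)
    return cached / total

# Per 100 requests: 80 support calls (2,000 tokens, 90% cached)
# and 20 code-review calls (5,000 tokens, 10% cached).
support = [{"total_tokens": 2_000, "cached_tokens": 1_800}] * 80
code_review = [{"total_tokens": 5_000, "cached_tokens": 500}] * 20

per_class = {"support": hit_ratio(support), "code_review": hit_ratio(code_review)}
global_ratio = hit_ratio(support + code_review)  # masks the weak code-review class
```

Reporting only `global_ratio` would hide that code review barely improved; the diff report must carry `per_class` numbers.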

Cost modeling takes cache hit ratios and translates them into financial impact. The formula is straightforward but the inputs matter:

For each request class:

  • uncached_cost = total_input_tokens * uncached_price_per_token
  • cached_cost = (cached_tokens * cached_price_per_token) + (uncached_tokens * uncached_price_per_token)
  • savings = uncached_cost - cached_cost
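The formula above as a parameterized sketch (prices are per million input tokens; the function names are illustrative, not a prescribed API):

```python
def request_cost(cached_tokens: int, uncached_tokens: int,
                 cached_price: float, uncached_price: float) -> float:
    """Dollar cost of one request's input, given per-1M-token prices."""
    return (cached_tokens * cached_price
            + uncached_tokens * uncached_price) / 1_000_000

def savings(total_tokens: int, cached_tokens: int,
            cached_price: float, uncached_price: float) -> float:
    """Savings versus sending the whole prompt uncached."""
    uncached_cost = total_tokens * uncached_price / 1_000_000
    cached_cost = request_cost(cached_tokens, total_tokens - cached_tokens,
                               cached_price, uncached_price)
    return uncached_cost - cached_cost
```

With 1,500 of 2,000 tokens cached at $0.30/1M versus $3.00/1M uncached, `request_cost` returns $0.00195 and `savings` returns $0.00405 per request.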

Provider pricing for cached tokens varies. Anthropic charges roughly 10% of the standard input token price for cache reads (but charges a 25% premium for cache writes on creation). OpenAI charges 50% of the standard input price for cached tokens. Google charges based on the cached content storage time plus a reduced per-token rate. These pricing differences mean the same cache hit ratio produces different savings on different providers. Your cost model must be parameterized by provider pricing.

A/B testing prefix designs requires controlled comparison. The benchmark harness replays the same traffic trace through two prompt layouts (before and after). For each request in the trace, it records: (a) total input tokens, (b) cached tokens, (c) uncached tokens, (d) output tokens, (e) latency, and (f) a quality score (if an evaluation function is available). The diff report compares these metrics between layouts, segmented by request class.

Latency modeling complements cost modeling. Cache hits reduce time-to-first-token (TTFT) because the provider skips KV computation for cached tokens. The latency improvement is roughly proportional to the fraction of tokens cached, though actual numbers depend on model size, hardware, and provider infrastructure. Your harness should measure TTFT and total latency for each request and include these in the diff report.

The telemetry pipeline has this shape: for each replayed request, extract cache status from the response, compute cost using the provider pricing table, record latency measurements, and write a structured log entry. After all requests are replayed, aggregate the log entries by request class and layout version, compute summary statistics (mean, median, p95 for cost and latency; overall cache hit ratio), and produce the diff report.
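The aggregation step might look like this sketch, with log entries as plain dicts (the field names are assumptions matching the schema discussed here):

```python
from collections import defaultdict
from statistics import median, quantiles

def aggregate(entries: list[dict]) -> dict:
    """Group telemetry entries by (request_class, layout) and summarize."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for e in entries:
        groups[(e["request_class"], e["layout"])].append(e)

    summary = {}
    for key, rows in groups.items():
        cached = sum(r["cached_tokens"] for r in rows)
        total = sum(r["total_tokens"] for r in rows)
        latencies = sorted(r["latency_ms"] for r in rows)
        summary[key] = {
            "hit_ratio": cached / total,
            "total_cost": sum(r["cost"] for r in rows),
            "median_latency_ms": median(latencies),
            # p95 needs at least two samples for statistics.quantiles
            "p95_latency_ms": (quantiles(latencies, n=20)[-1]
                               if len(latencies) > 1 else latencies[0]),
        }
    return summary
```

Sorting by a stable key before aggregating (here, grouping is order-insensitive by construction) keeps the report deterministic, as step 5 below requires.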

How this fits into the project This concept drives Phase 2 of Project 9. The benchmark harness is the core artifact. Its output (the diff report) is what makes this project actionable: without it, you have a prompt restructuring with no evidence of improvement.

Definitions & key terms

  • Cache hit ratio: The fraction of input tokens served from cache across a request population. Calculated as sum(cached_tokens) / sum(total_input_tokens).
  • Cache write cost: The one-time cost of creating a cache entry (e.g., Anthropic’s 25% premium on cache creation tokens). This cost is amortized across subsequent cache hits.
  • Cache read cost: The per-token cost of reading from cache, typically a fraction of the standard input token price (e.g., 10% for Anthropic, 50% for OpenAI).
  • Request class: A category of requests that share similar prompt structure (e.g., “customer support”, “code review”, “data extraction”). Segmenting by request class prevents misleading global averages.
  • Diff report: The primary artifact of the benchmark harness, comparing before/after layouts on cache hit ratio, cost, latency, and optionally quality.
  • Time-to-first-token (TTFT): The latency from request submission to receiving the first output token. Cache hits reduce TTFT by skipping prefix computation.

Mental model diagram (ASCII)

Traffic Trace (JSONL)
  [req1, req2, req3, ...]
        |
        v
+--------------------+
| Trace Replayer     |
| Replays each req   |
| through BEFORE and |
| AFTER layouts      |
+--------------------+
        |
        v
+--------------------+     +---------------------+
| API Response with  |     | Provider Pricing    |
| cache metadata:    |     | Table:              |
| - cached_tokens    |---->| - uncached: $X/1M   |
| - uncached_tokens  |     | - cached:   $Y/1M   |
| - latency          |     | - cache_write: $Z/1M|
+--------------------+     +---------------------+
        |                           |
        v                           v
+---------------------------------------+
|        Cost Calculator                |
|  per_request_cost = cached * Y        |
|                   + uncached * X      |
|                   + (write ? Z : 0)   |
+---------------------------------------+
        |
        v
+---------------------------------------+
|     Aggregator (by request class)     |
|  class: "support"                     |
|    before: hit_ratio=21%, cost=$4.20  |
|    after:  hit_ratio=69%, cost=$2.10  |
|  class: "code_review"                 |
|    before: hit_ratio=5%,  cost=$8.50  |
|    after:  hit_ratio=45%, cost=$5.80  |
+---------------------------------------+
        |
        v
+--------------------+
|   Diff Report      |
|  (Markdown + JSON) |
+--------------------+

How it works (step-by-step, with invariants and failure modes)

  1. Load the traffic trace (JSONL file where each line is a recorded request with class label, prompt parts, and expected output). Invariant: every trace entry has a request class label. Failure mode: missing class labels cause all requests to be aggregated into a single bucket, hiding per-class differences.
  2. For each trace entry, construct the prompt using the “before” layout and the “after” layout. Invariant: both layouts use the same content, just structured differently. Failure mode: the “after” layout accidentally drops content (e.g., a few-shot example), which improves cache hits but degrades output quality.
  3. Send each constructed prompt to the LLM API (or simulate the call if running in dry-run mode). Capture cache metadata from the response. Invariant: cache metadata fields are present in the response. Failure mode: the provider does not return cache metadata for this model or region; the harness must detect this and report it rather than silently recording zero cache hits.
  4. For each request, compute cost using the provider pricing table. Invariant: the pricing table matches the actual provider pricing at the time of the benchmark. Failure mode: stale pricing data produces incorrect cost estimates; include a pricing data timestamp in the report.
  5. Aggregate metrics by request class and layout version. Compute cache hit ratio, total cost, average cost per request, median latency, and p95 latency. Invariant: aggregation is deterministic (same trace produces same report). Failure mode: non-deterministic request ordering causes slightly different aggregations; sort by trace entry ID before aggregating.
  6. Generate the diff report comparing before vs after for each metric and each request class. Invariant: the report clearly labels which metrics improved and which regressed. Failure mode: a layout change improves cost but increases latency; the report must surface this tradeoff, not hide it.

Minimal concrete example

Provider pricing table (per 1M tokens):
  provider: anthropic
  model: claude-sonnet-4-20250514
  input_uncached:     $3.00
  input_cached_read:  $0.30    (10% of uncached)
  input_cached_write: $3.75    (25% premium on first creation)
  output:             $15.00

Cost calculation for a single request:
  total_input_tokens:    2,000
  cached_tokens:         1,500  (cache hit on prefix)
  uncached_tokens:         500  (dynamic suffix)

  cost = (1,500 * $0.30/1M) + (500 * $3.00/1M)
       = $0.00045 + $0.00150
       = $0.00195

  without caching:
  cost = 2,000 * $3.00/1M = $0.00600

  savings per request: $0.00600 - $0.00195 = $0.00405 (67.5% reduction)

Diff report summary table:
  | Class     | Metric      | Before  | After   | Delta   |
  |-----------|-------------|---------|---------|---------|
  | support   | hit_ratio   | 21.4%   | 68.9%   | +47.5pp |
  | support   | cost/1k req | $6.00   | $1.95   | -67.5%  |
  | support   | p95 latency | 910ms   | 640ms   | -29.7%  |
  | code_rev  | hit_ratio   | 5.1%    | 42.3%   | +37.2pp |
  | code_rev  | cost/1k req | $15.00  | $9.20   | -38.7%  |
  | code_rev  | p95 latency | 1,400ms | 980ms   | -30.0%  |

Common misconceptions

  • “Cache hit ratio is the only metric that matters.” A high cache hit ratio is meaningless if output quality dropped because you removed important context from the prompt to make it more cacheable. Always measure quality alongside cost.
  • “Global averages are sufficient for cost modeling.” If 80% of your traffic is low-cost support queries and 20% is expensive code review queries, a global average hides the fact that code review is barely benefiting from caching. Segment by request class.
  • “Cache write costs are negligible.” On Anthropic, the first request that creates a cache entry pays a 25% premium on those tokens. If your cache TTL is short (5 minutes) and traffic is bursty, you may be paying the write cost frequently. Model the amortization: cache write cost / number of cache reads before expiration.
  • “Latency improvement is always proportional to cache hit ratio.” Latency depends on many factors beyond cache hits: network round-trip time, output token count, model load, and provider infrastructure. Cache hits reduce input processing time but do not affect output generation time. Measure actual latency; do not extrapolate from cache hit ratios alone.

Check-your-understanding questions

  1. Why must the diff report segment metrics by request class rather than reporting only global averages?
  2. How do you amortize cache write costs to determine whether caching is net-positive for a given request class?
  3. What happens to your cost model if the provider changes its cached token pricing?

Check-your-understanding answers

  1. Different request classes have different prompt structures, token counts, and cache hit potentials. A global average can show improvement even when a major request class is regressing, because the other classes dominate the average. Per-class reporting surfaces these hidden regressions.
  2. Amortization: divide the cache write cost by the expected number of cache reads before the cache entry expires. If a cache entry costs $0.00375 to create (1,000 tokens at $3.75/1M) and you expect 50 cache reads before TTL expiration, the amortized write cost is $0.000075 per request. Compare this to the per-request savings from cached reads to determine net benefit.
  3. The cost model must be parameterized by a pricing table that can be updated independently of the benchmark logic. If the provider changes pricing, you update the pricing table and re-run the cost calculation on existing telemetry data without re-running the actual API calls.
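The amortization arithmetic from answer 2 as a one-function sketch (the figures in the usage note are the answer's illustrative numbers):

```python
def amortized_write_cost(prefix_tokens: int, write_price_per_m: float,
                         reads_before_expiry: int) -> float:
    """Spread the one-time cache write cost across expected cache reads."""
    write_cost = prefix_tokens * write_price_per_m / 1_000_000
    return write_cost / reads_before_expiry
```

For a 1,000-token prefix at $3.75/1M write price and 50 reads before TTL expiration, the amortized write cost is $0.000075 per request; caching is net-positive only if per-request read savings exceed this.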

Real-world applications

  • Large-scale chatbot platforms (customer support, sales assistants) where millions of requests per day share common system prompts.
  • Enterprise AI platforms that run multiple LLM providers and need to compare caching economics across providers to inform routing decisions.
  • FinOps teams tracking AI spend who need per-team and per-application cost attribution with cache efficiency breakdowns.

Where you’ll apply it

  • Phase 2 of this project: build the benchmark harness, implement the telemetry capture, cost calculator, and diff report generator. The diff report is the primary deliverable of the project.

References

  • Anthropic API documentation: usage object fields (cache_creation_input_tokens, cache_read_input_tokens)
  • OpenAI API documentation: usage object cached_tokens field
  • “Trustworthy Online Controlled Experiments” by Kohavi, Tang, Xu - Chapters on A/B testing methodology and metric design
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapters on measuring system performance

Key insights The benchmark harness is not a testing tool; it is the decision-making artifact that proves whether a prompt restructuring saves money or just looks tidier.

Summary Cache hit telemetry captures per-request cache metadata (cached tokens, uncached tokens, latency) from LLM API responses and feeds it into a cost model parameterized by provider pricing. The cost model computes savings from cached vs uncached input tokens. The benchmark harness replays the same traffic trace through before/after prompt layouts and produces a diff report segmented by request class. Key metrics are cache hit ratio, cost per request, and p95 latency. Always measure quality alongside cost to ensure that cache optimization does not degrade output. Global averages are misleading; always segment by request class.

Homework/Exercises to practice the concept

  • Given a traffic trace of 1,000 requests split across 3 request classes (600 support, 300 extraction, 100 code review), design the schema for a telemetry log entry that captures all fields needed for cost modeling and diff reporting.
  • Calculate the monthly cost savings for a system that processes 500,000 requests/day with an average of 2,000 input tokens per request, if cache hit ratio improves from 20% to 65% on Anthropic (uncached: $3.00/1M, cached read: $0.30/1M). Include cache write cost amortization assuming 5-minute TTL and 100 requests per cache entry before expiration.
  • Design a diff report format (in pseudocode or markdown table) that a product manager could use to approve or reject a prompt layout change. Include at least 5 metrics per request class.

Solutions to the homework/exercises

  • The telemetry log schema should include: trace_id, request_class, layout_version (before/after), model, provider, total_input_tokens, cached_tokens, uncached_tokens, output_tokens, cache_write_tokens (non-zero only on first creation), ttft_ms, total_latency_ms, timestamp, and optionally quality_score if an eval function is available.
  • Monthly calculation: 500,000 requests/day * 30 days = 15M requests/month. Total input tokens: 15M * 2,000 = 30B tokens. Before (20% cached): cost = (6B * $0.30/1M) + (24B * $3.00/1M) = $1,800 + $72,000 = $73,800. After (65% cached): cost = (19.5B * $0.30/1M) + (10.5B * $3.00/1M) = $5,850 + $31,500 = $37,350. Cache write cost: assuming each cache entry serves 100 reads before expiration, total write events = 15M / 100 = 150,000. Average cacheable prefix = 1,300 tokens (65% of 2,000). Write cost = 150,000 * 1,300 * $3.75/1M = $731.25. Net monthly savings: $73,800 - $37,350 - $731.25 = approximately $35,719/month.
  • The diff report should include per request class: cache hit ratio (before/after/delta), cost per 1,000 requests (before/after/delta %), p50 latency (before/after/delta), p95 latency (before/after/delta), quality score (before/after/delta), total monthly projected cost, and a recommendation field (PROMOTE if all metrics improve or hold steady, REVIEW if cost improves but quality drops, REJECT if quality drops below threshold).
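The arithmetic in the second exercise can be checked with a small cost-model function. A minimal sketch, assuming a flat pricing table; the rates and amortization figures come from the exercise itself, not from any live price list:

```python
def monthly_input_cost(requests_per_month: int, tokens_per_request: int,
                       hit_ratio: float,
                       uncached_per_m: float = 3.00,
                       cached_per_m: float = 0.30) -> float:
    """Input-token cost for one month at a given cache hit ratio."""
    total = requests_per_month * tokens_per_request
    cached = total * hit_ratio
    uncached = total - cached
    return cached / 1e6 * cached_per_m + uncached / 1e6 * uncached_per_m

requests = 500_000 * 30                                # 15M requests/month
before = monthly_input_cost(requests, 2_000, 0.20)     # ≈ $73,800
after = monthly_input_cost(requests, 2_000, 0.65)      # ≈ $37,350

# Cache write amortization: each entry serves ~100 reads before the
# 5-minute TTL expires; 65% of the 2,000-token prompt is cacheable.
writes = requests / 100
write_cost = writes * (2_000 * 0.65) / 1e6 * 3.75      # ≈ $731.25

savings = before - after - write_cost                  # ≈ $35,719/month
```

Keeping the pricing parameters as arguments (rather than constants) is what lets you re-run the calculation on existing telemetry when a provider changes its rates.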

Safe Context Boundaries Under Caching

Fundamentals Caching introduces safety risks that do not exist in uncached prompt systems. When you cache a prompt prefix, you are storing processed context that will be reused across multiple requests, potentially from different users, sessions, or tenants. If the cached segment contains user-specific data, you have a cross-tenant data leakage problem. If the cached segment contains policy instructions that later change, you have a stale policy problem. If the cached segment includes retrieval-augmented context that becomes outdated, you have a context pollution problem. This concept is about drawing safe boundaries around what can be cached, how long it stays cached, and how you verify that caching has not broken correctness or security.

Deep Dive into the concept The primary risk category is cross-tenant leakage. In a multi-tenant system (e.g., a SaaS platform where each customer has their own data), the cached prefix must never contain tenant-specific information. If Tenant A’s company policies are cached in the prefix and Tenant B’s request hits that cache, Tenant B receives responses influenced by Tenant A’s policies. This is a data breach. The defense is strict: the cached prefix must contain only content that is safe to share across all tenants. Any tenant-specific content must be in the dynamic suffix, which is never cached.

The second risk category is stale policy. Suppose your system prompt includes a refund policy: “Refunds are available within 30 days.” The company changes the policy to 14 days. If the old system prompt is cached with a long TTL, requests served from cache will still use the 30-day policy until the cache expires. The defense is cache invalidation tied to policy version. When the policy version changes, the cache must be invalidated immediately, not at TTL expiration. This requires either content-hash-based caching (where the cache key includes a hash of the content, so any content change automatically creates a new cache entry) or explicit cache invalidation APIs (available on Google’s context caching).

The third risk category is context pollution. In RAG systems, retrieved documents become part of the prompt context. If retrieved documents are placed in the cached prefix (because they are common across many queries), outdated documents remain in the cache after the knowledge base is updated. The defense is to keep retrieval context in the dynamic suffix or to use short TTLs for cached segments that include retrieval content.

Cache invalidation strategies fall into three categories:

  1. TTL-based invalidation: the cache entry expires after a fixed duration (e.g., Anthropic’s 5-minute TTL). Simple but blunt. Works well for content that changes infrequently. Does not protect against urgent policy changes within the TTL window.

  2. Version-based invalidation: the cache key includes a version identifier (e.g., system_prompt_v2.3). When the version changes, the old cache entry is never hit because the key is different. This requires a mechanism to propagate version changes to all services that construct prompts.

  3. Content-hash invalidation: the cache key includes a hash of the cached content. Any content change, no matter how small, produces a different hash and a new cache entry. This is the most robust strategy but requires computing hashes at prompt construction time.
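The three strategies map directly onto how you build cache entries and keys. A minimal sketch; the function names are illustrative, not a provider API:

```python
import hashlib
import time

def ttl_entry(content: str, ttl_s: int = 300) -> dict:
    """TTL-based: the entry carries an expiry timestamp; staleness can
    last up to ttl_s seconds after a content change."""
    return {"content": content, "expires_at": time.time() + ttl_s}

def version_key(name: str, version: str) -> str:
    """Version-based: bumping the version makes the old key unreachable."""
    return f"{name}_v{version}"

def content_hash_key(prefix: str, content: str) -> str:
    """Content-hash: any edit, however small, yields a new key, so a
    stale entry is never matched again."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest}"
```

Note that the version-based variant trades hashing cost for an operational burden: someone must remember to bump the version, whereas the content hash cannot be forgotten.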

Testing cache correctness requires a specific test pattern. The test creates a cache entry with content version A, verifies that subsequent requests hit the cache, then changes the content to version B, and verifies that the next request does NOT return results influenced by version A. This catches stale-cache bugs that are invisible in normal testing because they only manifest when the cache is populated and the content changes.

Security boundaries between cached segments deserve explicit design. If your system supports multiple caching levels (e.g., system prompt cached separately from few-shot examples), each cached segment has its own security boundary. The system prompt cache is safe to share across all users within a tenant. The few-shot example cache might be safe to share across users within the same task type. The conversation history cache should never be shared across users. These boundaries must be enforced at the cache key level: include tenant ID, task type, or user ID in the cache key as appropriate.

How this fits into the project This concept drives the guardrail checker component of Project 9. The benchmark harness must validate that the proposed prompt layout does not introduce safety risks from caching. Every diff report should include a safety assessment: “Does the cached prefix contain any per-user, per-tenant, or frequently-changing content?”

Definitions & key terms

  • Cross-tenant leakage: When cached context from one tenant influences responses for another tenant. A data breach.
  • Stale policy: When cached instructions reflect outdated policies because the cache has not been invalidated after a policy change.
  • Context pollution: When cached retrieval context becomes outdated after the knowledge base is updated, causing responses based on stale information.
  • Cache invalidation: The process of removing or replacing cache entries when the underlying content changes.
  • Content hash: A deterministic hash of the cached content, used as part of the cache key to ensure that content changes automatically create new cache entries.
  • TTL (Time to Live): The maximum duration a cache entry is valid. After TTL expiration, the entry is evicted.
  • Cache key: The identifier used to look up a cache entry. Must include all factors that affect whether the cached content is appropriate for a given request.
  • Security boundary: A logical separation between cached segments that prevents data from one scope (tenant, user, task) from leaking to another scope.

Mental model diagram (ASCII)

SAFE caching boundary:
+--------------------------------------------------+
|  CACHED PREFIX (shared across all users/tenants)  |
|  ┌──────────────────────────────────────────────┐ |
|  │ System identity (stable)                     │ |
|  │ Global policy rules (versioned)              │ |
|  │ Few-shot examples (stable)                   │ |
|  └──────────────────────────────────────────────┘ |
+--------------------------------------------------+
                    |
                    | BOUNDARY (cache_control marker)
                    |
+--------------------------------------------------+
|  DYNAMIC SUFFIX (never cached, per-request)       |
|  ┌──────────────────────────────────────────────┐ |
|  │ Tenant-specific context                      │ |
|  │ User identity and preferences                │ |
|  │ Retrieved documents (RAG)                    │ |
|  │ Conversation history                         │ |
|  │ User query                                   │ |
|  └──────────────────────────────────────────────┘ |
+--------------------------------------------------+

UNSAFE caching (what NOT to do):
+--------------------------------------------------+
|  CACHED PREFIX (DANGER: contains per-tenant data) |
|  ┌──────────────────────────────────────────────┐ |
|  │ System identity                              │ |
|  │ Tenant A's refund policy  <-- LEAKS TO B!    │ |
|  │ User preferences          <-- LEAKS TO ALL!  │ |
|  └──────────────────────────────────────────────┘ |
+--------------------------------------------------+

Cache invalidation strategies:
  TTL-based:
    cache_entry(key, content, ttl=300s)
    After 300s -> entry evicted -> next request recomputes

  Version-based:
    cache_key = f"system_prompt_v{version}"
    v2.3 -> cache hit
    Policy changes -> v2.4 -> cache miss -> new entry

  Content-hash:
    cache_key = f"prefix_{sha256(content)}"
    Any content change -> different hash -> cache miss

How it works (step-by-step, with invariants and failure modes)

  1. Before creating a cache entry, classify every segment of the cached prefix as: global (safe for all), tenant-scoped (safe within a tenant), or user-scoped (not cacheable in shared cache). Invariant: no user-scoped content appears in a global or tenant-scoped cache. Failure mode: a developer accidentally includes {{user_name}} in the system prompt which is cached globally; all users see the first user’s name.
  2. Assign a cache key that encodes the appropriate scope. For global caches, the key includes model name and content hash. For tenant-scoped caches, the key includes model name, tenant ID, and content hash. Invariant: the key granularity matches the content scope. Failure mode: a tenant-scoped cache uses a global key (missing tenant ID); Tenant B hits Tenant A’s cache.
  3. Set TTL based on content volatility. Global system prompts: long TTL (hours). Policy-dependent content: short TTL (minutes) or version-based invalidation. Retrieval content: very short TTL or no caching. Invariant: TTL is never longer than the maximum acceptable staleness for the content type. Failure mode: policy content cached for 1 hour when policy changes are announced to take effect immediately.
  4. Implement a cache invalidation trigger for policy changes. When a policy version is updated, broadcast an invalidation event that forces new cache entries to be created. Invariant: after an invalidation event, no request is served from the old cache. Failure mode: the invalidation event is lost (network issue); stale policy continues to be served until TTL expires.
  5. Test cache correctness with a sequence: create cache with version A -> verify cache hit -> update to version B -> verify next request does NOT reflect version A content. Invariant: the test fails if stale content is served. Failure mode: the test only checks that the response is “well-formed” but not that it reflects the correct policy version.
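The step-5 test pattern can be sketched against a content-hash cache. This uses a hypothetical in-memory dict as a stand-in for the provider's KV cache; a real test would run against the API and assert on response text:

```python
import hashlib

cache: dict = {}  # key -> cached prefix (stand-in for processed KV state)

def key_for(content: str) -> str:
    return "prefix_" + hashlib.sha256(content.encode("utf-8")).hexdigest()

def serve(content: str) -> tuple:
    """Return (served content, cache_hit). With content-hash keys, a
    changed prefix can never be served from the old entry."""
    k = key_for(content)
    hit = k in cache
    if not hit:
        cache[k] = content  # simulate the cache write on first use
    return cache[k], hit

# Version A: first request writes the entry, second request hits it.
_, hit1 = serve("Refunds within 30 days.")
_, hit2 = serve("Refunds within 30 days.")
# Version B: content changed -> MUST miss, and MUST NOT reflect version A.
body, hit3 = serve("Refunds within 14 days.")
assert (hit1, hit2, hit3) == (False, True, False)
assert "14 days" in body and "30 days" not in body
```

The final assertion is the part most tests forget: checking that the response is well-formed is not the same as checking it reflects the current policy version.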

Minimal concrete example

Cache invalidation policy config:

  cache_policy:
    segments:
      - name: system_identity
        scope: global
        invalidation: content_hash
        ttl: 3600  # 1 hour, but content hash ensures freshness
      - name: policy_rules
        scope: tenant
        invalidation: version
        version_source: policy_registry
        ttl: 300   # 5 minutes, with version-triggered invalidation
      - name: few_shot_examples
        scope: global
        invalidation: content_hash
        ttl: 86400  # 24 hours (examples change rarely)
      - name: retrieval_context
        scope: user
        invalidation: none  # never cached
        ttl: 0
      - name: user_query
        scope: user
        invalidation: none  # never cached
        ttl: 0

Cache key construction:
  global segment:  "g:{model}:{sha256(content)}"
  tenant segment:  "t:{model}:{tenant_id}:{version}"
  user segment:    not cached (no key needed)
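The key formats above can be implemented directly. A sketch, assuming tenant_id and version arrive from the request context:

```python
import hashlib
from typing import Optional

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cache_key(scope: str, model: str, content: str,
              tenant_id: Optional[str] = None,
              version: Optional[str] = None) -> Optional[str]:
    """Build a cache key whose granularity matches the content scope.
    Returns None for user-scoped segments, which are never cached."""
    if scope == "global":
        return f"g:{model}:{_sha256(content)}"
    if scope == "tenant":
        if tenant_id is None or version is None:
            raise ValueError("tenant scope requires tenant_id and version")
        return f"t:{model}:{tenant_id}:{version}"
    if scope == "user":
        return None
    raise ValueError(f"unknown scope: {scope}")
```

Making a missing tenant_id raise an error, rather than silently falling back to a global key, closes off exactly the failure mode described in step 2 above.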

Safety test sequence:
  1. POST /chat with tenant=A, policy_version=2.3
     -> response references "30-day refund policy"
     -> cache entry created: "t:sonnet:tenantA:2.3"
  2. Update policy to version 2.4 ("14-day refund policy")
     -> invalidation event broadcast
  3. POST /chat with tenant=A, policy_version=2.4
     -> response MUST reference "14-day refund policy"
     -> if "30-day" appears, STALE CACHE BUG detected
  4. POST /chat with tenant=B, policy_version=2.4
     -> response MUST NOT reference tenant A's specific data
     -> if tenant A data appears, CROSS-TENANT LEAK detected

Common misconceptions

  • “Caching only affects cost; it has no security implications.” Caching creates shared state. Shared state across security boundaries (tenants, users) creates leakage vectors. Every caching optimization must be reviewed for security, not just cost.
  • “Short TTLs solve all staleness problems.” A 5-minute TTL means up to 5 minutes of stale data after a change. For critical policy changes (e.g., regulatory compliance updates, safety policy revisions), even 5 minutes of staleness may be unacceptable. Version-based or content-hash invalidation provides immediate freshness.
  • “If the cache hit ratio is high, the caching design is good.” A high hit ratio achieved by caching content that should not be cached (tenant-specific data, user preferences) is a security vulnerability, not an achievement. Cache hit ratio must be evaluated alongside a safety audit.
  • “Testing with one user is sufficient for cache validation.” Cache safety bugs only manifest when multiple users/tenants interact with the same cache. Tests must simulate multi-tenant scenarios where different tenants hit the same cache entries.

Check-your-understanding questions

  1. Why is content-hash-based cache invalidation more robust than TTL-based invalidation for policy content?
  2. What information must be included in the cache key for a tenant-scoped cache entry? What happens if you omit the tenant ID?
  3. Describe a test scenario that would detect a cross-tenant data leakage bug caused by caching.

Check-your-understanding answers

  1. TTL-based invalidation allows stale content to be served for up to the TTL duration after a change. Content-hash invalidation produces a new cache key whenever the content changes, so the old entry is never matched. There is no staleness window. The tradeoff is that you must compute the hash at prompt construction time, which adds minimal latency but guarantees freshness.
  2. The cache key must include: model name (different models should not share cache), tenant ID (isolation between tenants), and either a content hash or a version identifier (freshness guarantee). If you omit the tenant ID, a cache entry created by Tenant A can be served to Tenant B, leaking Tenant A’s context into Tenant B’s responses.
  3. Test scenario: (1) Send a request as Tenant A with a tenant-specific system prompt containing “Acme Corp internal pricing: Widget=$50.” (2) Verify the response references Acme’s pricing. (3) Send a request as Tenant B with a different tenant-specific system prompt containing “Beta Inc pricing: Widget=$75.” (4) Verify that Tenant B’s response references Beta’s pricing ($75) and does NOT reference Acme’s pricing ($50). If Tenant B’s response contains “$50” or “Acme Corp,” the cache is leaking across tenants.
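The answer to question 3 can be expressed as a small isolation test. A sketch using an in-memory dict as a stand-in for the provider cache, showing both the correct tenant-keyed design and the buggy global-keyed one:

```python
cache: dict = {}

def serve(tenant_id: str, model: str, prefix: str) -> str:
    """Tenant-scoped lookup: the key includes tenant_id, so one tenant's
    cached prefix can never be returned to another."""
    return cache.setdefault(f"t:{model}:{tenant_id}", prefix)

a = serve("acme", "sonnet", "Acme Corp internal pricing: Widget=$50")
b = serve("beta", "sonnet", "Beta Inc pricing: Widget=$75")
assert "$75" in b and "Acme" not in b  # isolation holds

# Demonstrate the bug: drop tenant_id from the key and B hits A's entry.
buggy: dict = {}
def serve_buggy(model: str, prefix: str) -> str:
    return buggy.setdefault(f"g:{model}", prefix)

serve_buggy("sonnet", "Acme Corp internal pricing: Widget=$50")
leaked = serve_buggy("sonnet", "Beta Inc pricing: Widget=$75")
assert "Acme" in leaked  # CROSS-TENANT LEAK detected
```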

Real-world applications

  • Multi-tenant SaaS AI platforms (Intercom, Zendesk AI, Salesforce Einstein) where customer data must be strictly isolated.
  • Healthcare AI systems where cached prompts must not leak patient information between sessions.
  • Financial services AI assistants where regulatory policies change frequently and stale cached policies could cause compliance violations.

Where you’ll apply it

  • Phase 3 of this project: implement the guardrail checker that validates cached prefix safety, write cache invalidation tests, and add the safety assessment section to the diff report.
  • This concept also connects to Project 13 (Tool Permission Firewall) and Project 16 (Human-in-the-Loop Escalation Queue) where safety boundaries are enforced at runtime.

References

  • Anthropic prompt caching documentation: TTL behavior and cache_control semantics
  • Google Vertex AI context caching: TTL configuration and cache invalidation API
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapter 5 on replication and cache consistency
  • OWASP Top 10 for LLM Applications: data leakage and multi-tenancy risks

Key insights Every cached token is a shared secret; if you would not put it on a shared whiteboard visible to all users, it should not be in the cached prefix.

Summary Safe context boundaries under caching address three risk categories: cross-tenant data leakage (cached content from one tenant served to another), stale policy (cached instructions reflecting outdated policies), and context pollution (cached retrieval content that has become outdated). Defenses include strict scope classification for cached content (global vs tenant vs user), cache key design that encodes the appropriate scope, and invalidation strategies (TTL for low-risk content, version-based or content-hash for policy content, no caching for user-specific content). Testing cache safety requires multi-tenant simulation that verifies isolation and freshness after content changes.

Homework/Exercises to practice the concept

  • Design a cache key schema for a multi-tenant AI platform with three content scopes (global, tenant, user). Show the key format for each scope and explain what happens if a field is omitted.
  • Write a 4-step test sequence that detects stale policy bugs. Include the expected response content at each step and the pass/fail criteria.
  • Given a prompt template with 6 segments, classify each as cacheable (global), cacheable (tenant-scoped), or not cacheable. Justify each classification and describe the invalidation strategy for each cacheable segment.

Solutions to the homework/exercises

  • Cache key schema: Global: "g:{model}:{sha256(content)}" – omitting the hash means content changes produce cache hits on stale entries. Tenant: "t:{model}:{tenant_id}:{sha256(content)}" – omitting tenant_id causes cross-tenant leakage. User: not cached (no key). For each scope, explain that the key must include all factors that affect whether the cached content is appropriate. If the model field is omitted, different models could share cache entries leading to incoherent behavior.
  • Stale policy test: (1) Set policy version 1.0 (“30-day returns”), send request, verify response mentions 30-day returns. (2) Update policy to version 1.1 (“14-day returns”), send invalidation event. (3) Send request, verify response mentions 14-day returns. Pass: step 3 response says “14-day.” Fail: step 3 response says “30-day.” (4) Bonus: send request 10 seconds after step 2 without invalidation event, verify the TTL-only path also catches staleness. This step may pass or fail depending on TTL length, which demonstrates why version-based invalidation is superior.
  • Segment classification example: System identity (“You are a support agent”) -> global, content-hash, long TTL. Company refund policy (“Returns within 14 days”) -> tenant-scoped, version-based, short TTL. Few-shot examples (3 static Q&A pairs) -> global, content-hash, long TTL. Retrieved FAQ articles -> not cacheable (changes with knowledge base updates). User preferences (“Prefers formal tone”) -> not cacheable (user-scoped). User query -> not cacheable (unique per request). Justification: anything that varies by user or changes frequently belongs outside the cache.

3. Project Specification

3.1 What You Will Build

A benchmark harness that redesigns prompt prefixes for cache-hit improvements and cost reduction.

3.2 Functional Requirements

  1. Replay realistic traffic traces through before/after prompt layouts.
  2. Compute cache hit ratio and token cost metrics.
  3. Enforce deterministic cacheable prefix boundaries.
  4. Export promotion recommendation with regression checks.

3.3 Non-Functional Requirements

  • Performance: Benchmark harness processes 2k traces under 8 minutes.
  • Reliability: Deterministic replay mode produces stable comparative results.
  • Security/Policy: No user-sensitive fields appear in shared cache prefix.

3.4 Example Usage / Output

$ uv run p09-cache bench --trace fixtures/chat_trace.jsonl --before prompts/v1 --after prompts/v2 --out out/p09
[INFO] Requests replayed: 2,000
[PASS] Prefix cache hit-rate: 21.4% -> 68.9%
[PASS] Avg input token cost: -37.2%
[PASS] p95 latency: 910ms -> 640ms
[INFO] Benchmark diff report: out/p09/diff_report.md

3.5 Data Formats / Schemas / Protocols

  • Trace JSONL with request class, prompt parts, and token counters.
  • Profile config defining cacheable boundary markers.
  • Diff report Markdown + JSON summary artifacts.
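One trace record might look like the following sketch; the field names are illustrative, chosen to cover the request class, prompt parts, and token counters the spec calls for, and a real schema is yours to design:

```python
import json

# Hypothetical trace record for one replayed request.
record = {
    "trace_id": "tr-000001",
    "request_class": "support",
    "prompt_parts": {
        "prefix": "You are a support agent. Policy: ...",
        "suffix": "User: where is my order?",
    },
    "total_input_tokens": 1843,
    "cached_tokens": 1210,
    "output_tokens": 212,
}

line = json.dumps(record)   # one compact JSON object per JSONL line
parsed = json.loads(line)
assert parsed == record
```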

3.6 Edge Cases

  • Dynamic fields accidentally placed in cacheable prefix.
  • Different locales cause hidden prefix divergence.
  • Hit-rate improves but quality regresses due to truncated context.
  • Benchmark trace is not representative of production traffic.

3.7 Real World Outcome

This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.

3.7.1 How to Run (Copy/Paste)

$ uv run p09-cache bench --trace fixtures/chat_trace.jsonl --before prompts/v1 --after prompts/v2 --out out/p09
  • Working directory: project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS
  • Required inputs: project fixtures under fixtures/
  • Output directory: out/p09

3.7.2 Golden Path Demo (Deterministic)

Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.

3.7.3 If CLI: exact terminal transcript

$ uv run p09-cache bench --trace fixtures/chat_trace.jsonl --before prompts/v1 --after prompts/v2 --out out/p09
[INFO] Requests replayed: 2,000
[PASS] Prefix cache hit-rate: 21.4% -> 68.9%
[PASS] Avg input token cost: -37.2%
[PASS] p95 latency: 910ms -> 640ms
[INFO] Benchmark diff report: out/p09/diff_report.md
$ echo $?
0

Failure demo:

$ uv run p09-cache bench --trace fixtures/chat_trace.jsonl --before prompts/v1 --after prompts/v2_bad --out out/p09
[ERROR] After-profile contains non-deterministic timestamp segment in cacheable prefix
[HINT] Move dynamic fields below cache boundary marker
$ echo $?
2

4. Solution Architecture

4.1 High-Level Design

User Input / Trigger
        |
        v
+-------------------------+
| Trace Replayer |
+-------------------------+
        |
        v
+-------------------------+
| Cache Analyzer |
+-------------------------+
        |
        v
+-------------------------+
| Guardrail Checker |
+-------------------------+
        |
        v
Artifacts / API / UI / Logs

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Trace Replayer | Feeds recorded requests into before/after layouts. | Keep trace ordering deterministic. |
| Cache Analyzer | Computes hit/miss and cost deltas. | Report by request class, not only global average. |
| Guardrail Checker | Blocks profiles that harm quality or privacy. | Require quality parity before accepting cost gains. |

4.3 Data Structures (No Full Code)

P09_Request:
- trace_id
- input payload/context
- policy profile

P09_Decision:
- status (ALLOW | DENY | RETRY | ESCALATE | PROMOTE | ROLLBACK)
- reason_code
- artifact pointers
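A minimal Python sketch of these structures, with names taken from the outline above; the concrete field types are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    RETRY = "RETRY"
    ESCALATE = "ESCALATE"
    PROMOTE = "PROMOTE"
    ROLLBACK = "ROLLBACK"

@dataclass
class P09Request:
    trace_id: str
    payload: dict          # input payload/context
    policy_profile: str

@dataclass
class P09Decision:
    status: Status
    reason_code: str
    artifacts: list = field(default_factory=list)  # artifact pointers
```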

4.4 Algorithm Overview

Key algorithm: Policy-aware decision pipeline

  1. Normalize input and attach deterministic trace metadata.
  2. Run contract/schema validation and project-specific core checks.
  3. Apply policy gates and decide: success, retry, deny, escalate, or rollback.
  4. Persist artifacts and publish operational metrics.
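The four steps above can be sketched as a single pipeline function; the specific checks and reason codes here are illustrative stand-ins for the project's real logic:

```python
def decide(request: dict) -> dict:
    """Policy-aware decision pipeline: normalize, validate, gate, persist."""
    # 1. Normalize input and attach deterministic trace metadata.
    req = {**request, "trace_id": request.get("trace_id", "tr-0000")}

    # 2. Contract/schema validation before anything else (fail fast).
    if "prompt_parts" not in req:
        return {"status": "DENY", "reason_code": "SCHEMA_MISSING_PROMPT_PARTS"}

    # 3. Policy gate: no dynamic user fields above the cache boundary.
    prefix = req["prompt_parts"].get("prefix", "")
    if "{{user_" in prefix:
        return {"status": "DENY", "reason_code": "USER_FIELD_IN_CACHED_PREFIX"}

    # 4. Persist artifacts and publish metrics (elided in this sketch).
    return {"status": "ALLOW", "reason_code": "OK", "trace_id": req["trace_id"]}
```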

Complexity Analysis (conceptual):

  • Time: O(n) over fixture/request items in a batch run.
  • Space: O(n) for traces and report artifacts.

5. Implementation Guide

5.1 Development Environment Setup

# 1) Install dependencies
# 2) Prepare fixtures under fixtures/
# 3) Run the project command(s) listed in section 3.7

5.2 Project Structure

p09/
├── src/
├── fixtures/
├── policies/
├── out/
└── README.md

5.3 The Core Question You’re Answering

“How much cost and latency can I save by redesigning prompt prefixes for cache hits?”

This question matters because it forces the project to produce objective evidence instead of relying on subjective prompt impressions.

5.4 Concepts You Must Understand First

  1. Prompt prefix normalization
    • Why does this concept matter for P09?
    • Book Reference: Provider prompt-caching docs
  2. Cache key strategy
    • Why does this concept matter for P09?
    • Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann - caching chapters
  3. Benchmark design for latency/cost
    • Why does this concept matter for P09?
    • Book Reference: “Site Reliability Engineering” by Google - measurement discipline

5.5 Questions to Guide Your Design

  1. Boundary and contracts
    • What is the smallest safe contract surface for prompt caching optimizer?
    • Which failure reasons must be explicit and machine-readable?
  2. Runtime policy
    • What is allowed automatically, what needs retry, and what must escalate?
    • Which policy checks must happen before any side effect?
  3. Evidence and observability
    • What traces/metrics are required for fast incident triage?
    • What specific thresholds trigger rollback or human review?

5.6 Thinking Exercise

Pre-Mortem for Prompt Caching Optimizer

Before implementing, write down 10 ways this project can fail in production. Classify each failure into: contract, policy, security, or operations.

Questions to answer:

  • Which failures can be prevented before runtime?
  • Which failures require runtime detection and escalation?

5.7 The Interview Questions They’ll Ask

  1. “What prompt parts should be cacheable versus dynamic?”
  2. “How do you avoid quality regressions while optimizing for cache?”
  3. “Why can global hit-rate be misleading?”
  4. “How would you benchmark caching changes safely?”
  5. “Which privacy risks exist in shared prompt prefixes?”

5.8 Hints in Layers

Hint 1: Mark the boundary explicitly Use a boundary marker between stable and dynamic segments.

Hint 2: Canonicalize aggressively Normalize whitespace and ordering before hashing prefixes.
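A minimal sketch of this hint, assuming whitespace collapsing is sufficient canonicalization for your templates (real prompts may also need key ordering and locale normalization):

```python
import hashlib
import re

def canonicalize(prefix: str) -> str:
    """Collapse whitespace runs and strip edges so cosmetic differences
    do not break prefix matching."""
    return re.sub(r"\s+", " ", prefix).strip()

def prefix_hash(prefix: str) -> str:
    return hashlib.sha256(canonicalize(prefix).encode("utf-8")).hexdigest()

# Cosmetically different templates now hash identically.
assert prefix_hash("You are  a support\nagent.") == prefix_hash("You are a support agent.")
```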

Hint 3: Benchmark by class Track support, extraction, and coding classes separately.

Hint 4: Guard quality Cost wins are invalid if task pass-rate drops.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Caching fundamentals | “Designing Data-Intensive Applications” by Martin Kleppmann | Caching-related sections |
| Performance measurement | “Site Reliability Engineering” by Google | Measurement chapters |
| Experiment rigor | “Trustworthy Online Controlled Experiments” by Kohavi et al. | Experiment design |

5.10 Implementation Phases

Phase 1: Foundation

  • Define contracts, policy profiles, and deterministic fixtures.
  • Build the core execution path and baseline artifact output.
  • Checkpoint: One golden-path scenario runs end-to-end with trace id and artifact.

Phase 2: Core Functionality

  • Add project-specific evaluation/routing/verification logic.
  • Add error paths with unified reason codes.
  • Checkpoint: Golden-path and one failure-path both behave deterministically.

Phase 3: Operational Hardening

  • Add metrics, trend reporting, and release/rollback or escalation gates.
  • Document runbook and incident/debug flow.
  • Checkpoint: Team member can reproduce output from clean checkout.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Validation order | Late checks vs early checks | Early checks | Fail-fast saves cost and reduces unsafe execution |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes | Enables automation and faster debugging |
| Rollout/escalation | Manual-only vs policy-driven | Policy-driven with manual override | Balances speed and safety |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate deterministic building blocks | schema checks, policy gates, parser behaviors |
| Integration Tests | Verify end-to-end project path | golden-path command/API flow |
| Edge Case Tests | Ensure robust failure handling | malformed fixture, blocked policy action |

6.2 Critical Test Cases

  1. Golden path succeeds and emits expected artifact shape.
  2. High-risk/invalid path returns deterministic error with reason code.
  3. Replay with same seed/config yields same decision summary.

6.3 Test Data

fixtures/golden_case.*
fixtures/failure_case.*
fixtures/edge_cases/*

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Hit-rate improved but answers got worse” | Important context moved out of prompt or truncated. | Run quality regression suite alongside cache benchmark. |
| “Expected cache hits not observed” | Prefix still contains subtle per-request variation. | Normalize whitespace/order and remove dynamic tokens. |
| “Savings overestimated” | Benchmark doesn’t include true traffic mix. | Weight results by production class distribution. |

7.2 Debugging Strategies

  • Re-run deterministic fixtures with fixed seed and compare trace ids.
  • Diff latest artifacts against last known-good baseline.
  • Isolate whether failure is contract, policy, or runtime dependency related.

7.3 Performance Traps

  • Unbounded retries inflate latency and cost.
  • Overly broad logging can slow hot paths.
  • Missing cache/canonicalization can create avoidable compute churn.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add one new fixture category and expected outcome labels.
  • Add one new reason code with deterministic validation.

8.2 Intermediate Extensions

  • Add dashboard-ready trend exports.
  • Add automated regression diff against previous run artifacts.

8.3 Advanced Extensions

  • Integrate with rollout gates or human approval workflows.
  • Add chaos-style fault injection and recovery assertions.

9. Real-World Connections

9.1 Industry Applications

  • PromptOps platform teams operating AI features under compliance constraints.
  • Internal AI governance tooling for release safety and incident response.
  • LangChain/LangSmith style eval and tracing workflows.
  • OpenTelemetry-based observability stacks for decision traces.

9.2 Interview Relevance

  • Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
  • Shows practical production-thinking: contracts, policies, monitoring, and operational controls.

10. Resources

10.1 Essential Reading

  • OpenAI/Anthropic/Google provider docs for structured outputs, tool calling, and prompt controls.
  • OWASP LLM Top 10 and NIST AI RMF guidance for safety and governance.

10.2 Video Resources

  • Talks on LLM eval systems, PromptOps, and AI safety operations.

10.3 Tools & Documentation

  • JSON schema validators, policy engines, and tracing infrastructure docs.
  • Previous projects: build specialized primitives.
  • Next projects: integrate these primitives into broader operational systems.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the core risk boundaries and policy gates for this project.
  • I can explain the artifact format and why each field exists.
  • I can justify the release/escalation criteria.

11.2 Implementation

  • Golden-path and failure-path flows both work.
  • Deterministic artifacts are produced and reproducible.
  • Observability fields are present for debugging and audits.

11.3 Growth

  • I can describe one tradeoff I made and why.
  • I can explain this project design in an interview setting.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Golden path works with deterministic output artifact.
  • At least one failure-path scenario returns unified error shape/reason code.
  • Core metrics are emitted and documented.

Full Completion:

  • Includes automated tests, trend reporting, and reproducible runbook.
  • Includes operational thresholds for promote/rollback or escalate/approve.

Excellence (Above & Beyond):

  • Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.
  • Demonstrates incident drill replay and fast root-cause workflow.