Project 8: Long-Context Evaluation Harness (Lost in the Middle)

Build a deterministic evaluation harness that measures how memory placement affects model performance.

Quick Reference

Attribute | Value
Difficulty | Level 3
Time Estimate | 1-2 weeks
Main Programming Language | Python
Alternative Programming Languages | TypeScript, Go
Coolness Level | Level 3
Business Potential | Level 2
Prerequisites | Prompt templating, basic evaluation
Key Topics | Positional bias, benchmark design, deterministic evaluation

1. Learning Objectives

By completing this project, you will:

  1. Create a dataset with facts placed at different prompt positions.
  2. Measure accuracy differences across positions.
  3. Produce a report that quantifies the lost-in-the-middle effect.
  4. Recommend prompt placement strategies based on data.

2. All Theory Needed (Per-Concept Breakdown)

Long-Context Evaluation and Positional Bias

Fundamentals Long-context evaluation measures how model performance changes as relevant information moves within the prompt. Many models exhibit positional bias: they rely more on the beginning and end of a long context, often under-using information placed in the middle. This is the lost-in-the-middle effect. A good evaluation harness isolates prompt position as the only variable so you can measure this effect precisely.

Deep Dive into the concept Positional bias arises from attention patterns and training data distribution. In long contexts, models tend to focus on tokens near the start or end due to primacy and recency effects in attention. This means memory placement is not neutral: placing retrieved memories in the middle can reduce the chance that they are used. The Lost in the Middle study demonstrates this by shifting relevant facts within a long context and measuring the drop in accuracy when facts are placed in the middle. The follow-up Found in the Middle study explores methods to mitigate this effect, showing that better placement strategies and structured prompts can recover some of the lost performance.

To evaluate positional bias, you must design deterministic prompts. Each test case should contain the same facts and question, but with facts placed at different positions (start, middle, end). All other factors (prompt length, temperature, model version) must be held constant. The output is scored against a known answer. This yields accuracy curves by position. If accuracy drops sharply in the middle, you have a measurable lost-in-the-middle gap.
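
As a concrete illustration, the sketch below builds three prompts that are identical except for where the fact sits among filler passages. The `distractors` list and the template wording are assumptions for illustration, not part of the project specification.

```python
# Minimal sketch of position-controlled prompt generation.
# Assumption: `distractors` is a fixed list of irrelevant passages; only the
# fact's insertion point varies between the three prompts.

def build_prompt(fact: str, question: str, distractors: list[str], position: str) -> str:
    docs = list(distractors)
    if position == "start":
        docs.insert(0, fact)
    elif position == "middle":
        docs.insert(len(docs) // 2, fact)
    elif position == "end":
        docs.append(fact)
    else:
        raise ValueError(f"unknown position: {position}")
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Same fact, same question, same length; only placement differs.
prompts = {
    pos: build_prompt(
        fact="API base URL is https://api.example.com",
        question="What is the API base URL?",
        distractors=[f"Filler passage {i}." for i in range(10)],
        position=pos,
    )
    for pos in ("start", "middle", "end")
}
```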

Evaluation must also track latency and cost. Long-context prompts increase cost, and adding memory at multiple positions may not be feasible. Therefore, the harness should output not just accuracy but also prompt length and runtime. This helps you choose memory placement strategies that balance accuracy and cost. For example, you might decide to place only the top-2 memories near the end (recency bias) and rely on summaries near the start.
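
One way to keep cost and latency visible is to record them next to correctness for every model call. The field names and the `call_model` callable below are assumptions, not a required interface.

```python
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    position: str      # "start", "middle", or "end"
    correct: bool      # scored against the expected answer
    prompt_chars: int  # crude length proxy; swap in a tokenizer count if available
    latency_s: float   # wall-clock time of the model call

def timed_call(call_model, prompt: str, expected: str, position: str) -> RunRecord:
    # `call_model` is an assumed callable: prompt in, answer text out.
    t0 = time.perf_counter()
    answer = call_model(prompt)
    return RunRecord(
        position=position,
        # Exact match as the baseline scorer; see §5.8 for looser alternatives.
        correct=answer.strip().lower() == expected.strip().lower(),
        prompt_chars=len(prompt),
        latency_s=time.perf_counter() - t0,
    )
```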

This project turns positional bias into a measurable signal. The result will guide how you place memory in Project 10’s paging system and how you design retrieval budgets in Project 4.

From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
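
The drift metrics above can be computed from plain counters over retrieval logs. The log entry shape in this sketch is hypothetical; it only illustrates the kind of bookkeeping involved.

```python
# Illustrative drift metrics over a batch of retrieval logs.
# Assumed (hypothetical) log entry shape:
#   {"retrieved_ids": [...], "used_ids": [...], "fallback": bool}

def drift_metrics(logs: list[dict]) -> dict:
    retrieved = sum(len(e["retrieved_ids"]) for e in logs)
    used = sum(len(e["used_ids"]) for e in logs)
    fallbacks = sum(1 for e in logs if e["fallback"])
    return {
        "unused_retrieval_ratio": 1 - used / retrieved if retrieved else 0.0,
        "fallback_rate": fallbacks / len(logs) if logs else 0.0,
    }
```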

Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.

How this fits into the projects This concept is central to Project 8 and influences Projects 4 and 10.

Definitions & key terms

  • Positional bias: Model preference for tokens at the start/end.
  • Lost-in-the-middle: Accuracy drop when facts are mid-context.
  • Evaluation harness: Automated system to measure performance.

Mental model diagram (ASCII)

Start        Middle          End
[FACT] ------------------- [FACT]
  ^            x             ^
high use     low use      high use

How It Works (Step-by-Step)

  1. Prepare a set of facts and questions.
  2. Create prompts with facts at different positions.
  3. Run the model with fixed settings.
  4. Score answers and compute accuracy by position.
  5. Report the lost-in-the-middle gap.

Minimal Concrete Example

positions: [start, middle, end]
question: "What is the API base URL?"
expected: "https://api.example.com"

Common Misconceptions

  • “Long-context models use all tokens equally.” (False.)
  • “More context always improves performance.” (False.)

Check-Your-Understanding Questions

  1. Why must prompts be deterministic in evaluation?
  2. What is the lost-in-the-middle gap?
  3. How can you mitigate positional bias?

Check-Your-Understanding Answers

  1. To isolate position as the only variable.
  2. The accuracy difference between middle and edge positions.
  3. Place critical memory near anchors or restructure prompts.

Real-World Applications

  • Memory placement strategies in RAG systems.
  • QA evaluation for long documents.

Where You’ll Apply It

  • In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
  • Also used in: Project 4, Project 10.

References

  • Lost in the Middle - https://arxiv.org/abs/2307.03172
  • Found in the Middle - https://arxiv.org/abs/2406.16008

Key Insights Prompt placement is a controllable variable that can be measured and optimized.

Summary A deterministic harness reveals positional bias and guides memory placement strategies.

Homework/Exercises to Practice the Concept

  1. Create 5 prompts with facts at different positions.
  2. Predict which position yields highest accuracy and test it.

Solutions to the Homework/Exercises

  1. Use fixed length and identical content with position changes.
  2. Expect start/end to outperform middle.

3. Project Specification

3.1 What You Will Build

An evaluation harness that:

  • Generates position-controlled prompts
  • Runs model evaluations
  • Scores outputs deterministically
  • Produces a positional bias report

3.2 Functional Requirements

  1. Dataset Builder: Create prompts with controlled positions.
  2. Runner: Execute model calls with fixed settings.
  3. Scorer: Score outputs vs expected answers.
  4. Reporter: Output accuracy by position.

3.3 Non-Functional Requirements

  • Performance: 100 prompts evaluated in under 5 minutes.
  • Reliability: Same run yields same results.
  • Usability: Clear accuracy plots or tables.

3.4 Example Usage / Output

$ lceval run --facts facts.json --positions start,middle,end
[RUN] position=start accuracy=0.78
[RUN] position=middle accuracy=0.52
[RUN] position=end accuracy=0.81

3.5 Data Formats / Schemas / Protocols

{
  "question": "What is the API base URL?",
  "fact": "API base URL is https://api.example.com",
  "positions": ["start", "middle", "end"]
}
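
A minimal loader sketch, assuming `facts.json` holds a list of records shaped like the example above; the fail-fast behavior mirrors the failure demo in §3.7.3.

```python
import json
import sys

ALLOWED_POSITIONS = {"start", "middle", "end"}

def load_cases(path: str) -> list[dict]:
    """Load fact/question records and fail fast on contract violations."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    if not cases:
        print("[ERROR] no facts provided", file=sys.stderr)
        raise SystemExit(2)
    for case in cases:
        missing = {"question", "fact", "positions"} - case.keys()
        if missing:
            raise ValueError(f"case missing fields: {missing}")
        unknown = set(case["positions"]) - ALLOWED_POSITIONS
        if unknown:
            raise ValueError(f"unknown positions: {unknown}")
    return cases
```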

3.6 Edge Cases

  • Very short prompts (no middle position)
  • Facts repeated multiple times
  • Ambiguous scoring

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ lceval run --facts facts.json --positions start,middle,end
$ lceval report

3.7.2 Golden Path Demo (Deterministic)

$ lceval report
Start accuracy: 0.78
Middle accuracy: 0.52
End accuracy: 0.81
Lost-in-middle gap: 0.29
exit_code=0

3.7.3 Failure Demo (Deterministic)

$ lceval run --facts empty.json --positions start
[ERROR] no facts provided
exit_code=2

4. Solution Architecture

4.1 High-Level Design

Facts -> Prompt Generator -> Runner -> Scorer -> Report
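
The stages above can be stitched together in a few lines. This sketch assumes the `build_prompt` and `timed_call` helpers sketched in section 2 and an `expected` field on each case; aggregation into the final report is sketched under §4.4.

```python
def run_pipeline(cases: list[dict], call_model, distractors: list[str]) -> list:
    # Facts -> Prompt Generator -> Runner (scoring happens inside timed_call).
    records = []
    for case in cases:
        for pos in case["positions"]:
            prompt = build_prompt(case["fact"], case["question"], distractors, pos)
            records.append(timed_call(call_model, prompt, case["expected"], pos))
    return records
```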

4.2 Key Components

Component | Responsibility | Key Decisions
Generator | Position facts | Prompt template
Runner | Execute model calls | Fixed settings
Scorer | Score outputs | Exact vs fuzzy
Report | Summarize metrics | Table vs plot

4.3 Data Structures (No Full Code)

EvalCase:
  question: string
  fact: string
  position: enum
  expected: string
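
The structure above maps directly onto a frozen dataclass; this is a sketch, with the enum values taken from §3.5.

```python
from dataclasses import dataclass
from enum import Enum

class Position(str, Enum):
    START = "start"
    MIDDLE = "middle"
    END = "end"

@dataclass(frozen=True)
class EvalCase:
    question: str
    fact: str
    position: Position
    expected: str
```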

4.4 Algorithm Overview

  1. Generate prompts for each position.
  2. Run model with fixed settings.
  3. Score outputs.
  4. Aggregate results (see the aggregation sketch below).
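
A minimal aggregation sketch, assuming per-call records with `position` and `correct` attributes (such as the RunRecord sketched in section 2). With the demo numbers from §3.7.2 the gap works out to 0.81 - 0.52 = 0.29.

```python
from collections import defaultdict

def accuracy_by_position(records) -> dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.position] += 1
        hits[r.position] += int(r.correct)
    return {pos: hits[pos] / totals[pos] for pos in totals}

def lost_in_middle_gap(acc: dict[str, float]) -> float:
    # Gap = best edge accuracy minus middle accuracy, e.g. 0.81 - 0.52 = 0.29.
    return max(acc["start"], acc["end"]) - acc["middle"]
```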

5. Implementation Guide

5.1 Development Environment Setup

- Prepare fixed model settings
- Store facts in JSON

5.2 Project Structure

project-root/
├── src/
│   ├── generate/
│   ├── run/
│   ├── score/
│   └── report/

5.3 The Core Question You’re Answering

“Does memory placement change model performance?”

5.4 Concepts You Must Understand First

  1. Positional bias
  2. Deterministic evaluation

5.5 Questions to Guide Your Design

  1. How do you define the middle position?
  2. What scoring method is robust?

5.6 Thinking Exercise

Predict which placement yields highest accuracy for your dataset.

5.7 The Interview Questions They’ll Ask

  1. “What is the lost-in-the-middle effect?”
  2. “How do you measure positional bias?”
  3. “How do you ensure determinism?”
  4. “What metrics beyond accuracy matter?”
  5. “How do you use results to change prompt design?”

5.8 Hints in Layers

Hint 1: Fix temperature to 0.
Hint 2: Use exact match scoring first.
Hint 3: Add partial credit later (a scorer sketch follows below).
Hint 4: Report gaps by position.
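
Following Hints 2 and 3, a scorer might start with exact match and relax it later; the normalization and the 0.5 partial-credit value are assumptions.

```python
def exact_match(answer: str, expected: str) -> bool:
    # Hint 2: strict, deterministic baseline.
    return answer.strip().lower() == expected.strip().lower()

def partial_credit(answer: str, expected: str) -> float:
    # Hint 3: simple containment-based relaxation; replace with token-level
    # overlap if this proves too brittle for your dataset.
    if exact_match(answer, expected):
        return 1.0
    return 0.5 if expected.strip().lower() in answer.lower() else 0.0
```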

5.9 Books That Will Help

Topic | Book | Chapter
Evaluation | “AI Engineering” | Ch. 3-4

5.10 Implementation Phases

Phase 1: Foundation

  • Build dataset and prompt generator

Phase 2: Core

  • Run model and score results

Phase 3: Polish

  • Add reports and plots

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Scoring | Exact / Fuzzy | Exact first | Deterministic baseline
Positioning | Fixed / Relative | Relative | Scales with prompt length
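
One way to implement the Relative recommendation above is to place the fact at a fixed fraction of the distractor list; the 0.5 midpoint is an assumption and also answers design question 1 in §5.5.

```python
def insertion_index(num_distractors: int, position: str) -> int:
    # Relative placement scales with prompt length instead of a fixed offset.
    fractions = {"start": 0.0, "middle": 0.5, "end": 1.0}
    return round(fractions[position] * num_distractors)
```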

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit | Prompt generator | Position correctness
Integration | Full run | Prompt -> score
Edge | Empty dataset | Error handling

6.2 Critical Test Cases

  1. Positioning produces distinct prompts.
  2. Scores are deterministic.
  3. Empty dataset returns error.

6.3 Test Data

question: "What is the API base URL?"
expected: "https://api.example.com"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Non-determinism | Flaky results | Fix model settings
Wrong middle placement | No gap detected | Adjust positioning
Loose scoring | Inflated accuracy | Use exact match

7.2 Debugging Strategies

  • Print prompt lengths and positions.
  • Compare outputs across runs.

7.3 Performance Traps

  • Excessive prompt size causing slow runs.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add more fact categories

8.2 Intermediate Extensions

  • Add partial credit scoring

8.3 Advanced Extensions

  • Add visualization dashboards

9. Real-World Connections

9.1 Industry Applications

  • Prompt placement in production RAG systems
  • Open-source long-context eval suites

9.2 Interview Relevance

  • Evaluation methodology is critical in AI system interviews.

10. Resources

10.1 Essential Reading

  • Lost in the Middle paper
  • Found in the Middle paper

10.2 Video Resources

  • Talks on long-context evaluation

10.3 Tools & Documentation

  • Prompt evaluation toolkits

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain positional bias.

11.2 Implementation

  • Evaluation harness is deterministic.

11.3 Growth

  • I can recommend prompt placement strategies.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Evaluation harness produces position-based report

Full Completion:

  • Added plots and partial scoring

Excellence (Going Above & Beyond):

  • Adaptive placement recommendations