Project 8: Long-Context Evaluation Harness (Lost in the Middle)
Build a deterministic evaluation harness that measures how memory placement affects model performance.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 3 |
| Business Potential | Level 2 |
| Prerequisites | Prompt templating, basic evaluation |
| Key Topics | Positional bias, benchmark design, deterministic evaluation |
1. Learning Objectives
By completing this project, you will:
- Create a dataset with facts placed at different prompt positions.
- Measure accuracy differences across positions.
- Produce a report that quantifies the lost-in-the-middle effect.
- Recommend prompt placement strategies based on data.
2. All Theory Needed (Per-Concept Breakdown)
Long-Context Evaluation and Positional Bias
Fundamentals: Long-context evaluation measures how model performance changes as relevant information moves within the prompt. Many models exhibit positional bias: they rely more on the beginning and end of a long context, often under-using information placed in the middle. This is the lost-in-the-middle effect. A good evaluation harness isolates prompt position as the only variable so you can measure this effect precisely.
Deep dive into the concept: Positional bias arises from attention patterns and training data distribution. In long contexts, models tend to focus on tokens near the start or end due to primacy and recency effects in attention. This means memory placement is not neutral: placing retrieved memories in the middle can reduce the chance that they are used. The Lost in the Middle study demonstrates this by shifting relevant facts within a long context and measuring the drop in accuracy when facts are placed in the middle. The follow-up Found in the Middle study explores methods to mitigate this effect, showing that better placement strategies and structured prompts can recover some of the lost performance.
To evaluate positional bias, you must design deterministic prompts. Each test case should contain the same facts and question, but with facts placed at different positions (start, middle, end). All other factors (prompt length, temperature, model version) must be held constant. The output is scored against a known answer. This yields accuracy curves by position. If accuracy drops sharply in the middle, you have a measurable lost-in-the-middle gap.
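A minimal sketch of position-controlled prompt construction, assuming a hypothetical build_prompt helper and a list of plain-text distractor documents (filler_docs); the template wording is an illustration, not a requirement.

# Sketch: build the same prompt with the key fact at start, middle, or end.
# Only the fact's position changes; everything else stays identical.
def build_prompt(fact: str, question: str, filler_docs: list[str], position: str) -> str:
    docs = list(filler_docs)                 # copy so the caller's list is not mutated
    if position == "start":
        docs.insert(0, fact)
    elif position == "middle":
        docs.insert(len(docs) // 2, fact)
    elif position == "end":
        docs.append(fact)
    else:
        raise ValueError(f"unknown position: {position}")
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"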
Evaluation must also track latency and cost. Long-context prompts are expensive, and duplicating memories at multiple positions to work around bias may not be affordable. The harness should therefore report prompt length and runtime alongside accuracy, so you can choose memory placement strategies that balance accuracy and cost. For example, you might place only the top-2 memories near the end (to exploit recency) and rely on summaries near the start.
This project turns positional bias into a measurable signal. The result will guide how you place memory in Project 10’s paging system and how you design retrieval budgets in Project 4.
From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.
How this fits into the projects: This concept is central to Project 8 and informs Project 4 and Project 10.
Definitions & key terms
- Positional bias: Model preference for tokens at the start/end.
- Lost-in-the-middle: Accuracy drop when facts are mid-context.
- Evaluation harness: Automated system to measure performance.
Mental model diagram (ASCII)
  Start               Middle                End
 [FACT] ---------     [FACT]     --------- [FACT]
    ^                    x                    ^
 high use             low use             high use
How It Works (Step-by-Step)
- Prepare a set of facts and questions.
- Create prompts with facts at different positions.
- Run the model with fixed settings.
- Score answers and compute accuracy by position.
- Report the lost-in-the-middle gap.
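A minimal sketch of the run step (steps 2-3 above), assuming the hypothetical build_prompt helper from the theory section and a call_model function that wraps whichever model client you use; the fixed settings and recorded fields are illustrative assumptions.

import time

# Sketch: run every case with pinned settings so only position varies.
FIXED_SETTINGS = {"temperature": 0, "max_tokens": 64}

def run_cases(cases: list[dict], call_model) -> list[dict]:
    results = []
    for case in cases:
        prompt = build_prompt(case["fact"], case["question"],
                              case["filler_docs"], case["position"])
        start = time.perf_counter()
        answer = call_model(prompt, **FIXED_SETTINGS)    # placeholder model client
        results.append({
            "position": case["position"],
            "answer": answer,
            "expected": case["expected"],
            "prompt_chars": len(prompt),                 # cost proxy
            "runtime_s": time.perf_counter() - start,    # latency
        })
    return results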
Minimal Concrete Example
positions: [start, middle, end]
question: "What is the API base URL?"
expected: "https://api.example.com"
Common Misconceptions
- “Long-context models use all tokens equally.” (False.)
- “More context always improves performance.” (False.)
Check-Your-Understanding Questions
- Why must prompts be deterministic in evaluation?
- What is the lost-in-the-middle gap?
- How can you mitigate positional bias?
Check-Your-Understanding Answers
- To isolate position as the only variable.
- The accuracy drop when the fact sits in the middle compared with the best edge position (start or end).
- Place critical memory near anchors or restructure prompts.
Real-World Applications
- Memory placement strategies in RAG systems.
- QA evaluation for long documents.
Where You’ll Apply It
- In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
- Also used in: Project 4, Project 10.
References
- Lost in the Middle - https://arxiv.org/abs/2307.03172
- Found in the Middle - https://arxiv.org/abs/2406.16008
Key Insights: Prompt placement is a controllable variable that can be measured and optimized.
Summary: A deterministic harness reveals positional bias and guides memory placement strategies.
Homework/Exercises to Practice the Concept
- Create 5 prompts with facts at different positions.
- Predict which position yields highest accuracy and test it.
Solutions to the Homework/Exercises
- Use fixed length and identical content with position changes.
- Expect start/end to outperform middle.
3. Project Specification
3.1 What You Will Build
An evaluation harness that:
- Generates position-controlled prompts
- Runs model evaluations
- Scores outputs deterministically
- Produces a positional bias report
3.2 Functional Requirements
- Dataset Builder: Create prompts with controlled positions.
- Runner: Execute model calls with fixed settings.
- Scorer: Score outputs vs expected answers.
- Reporter: Output accuracy by position.
3.3 Non-Functional Requirements
- Performance: 100 prompts evaluated in under 5 minutes.
- Reliability: Same run yields same results.
- Usability: Clear accuracy plots or tables.
3.4 Example Usage / Output
$ lceval run --facts facts.json --positions start,middle,end
[RUN] position=start accuracy=0.78
[RUN] position=middle accuracy=0.52
[RUN] position=end accuracy=0.81
3.5 Data Formats / Schemas / Protocols
{
"question": "What is the API base URL?",
"fact": "API base URL is https://api.example.com",
"positions": ["start", "middle", "end"]
}
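A minimal loader sketch, assuming facts.json holds a list of records in the format above (including the expected answer used by the EvalCase structure in §4.3) and that you supply the distractor documents separately; load_cases is a hypothetical helper name.

import json

# Sketch: expand each record into one evaluation case per requested position.
def load_cases(path: str, filler_docs: list[str]) -> list[dict]:
    with open(path) as f:
        records = json.load(f)
    cases = []
    for rec in records:
        for position in rec["positions"]:
            cases.append({
                "question": rec["question"],
                "fact": rec["fact"],
                "expected": rec["expected"],
                "position": position,
                "filler_docs": filler_docs,
            })
    return cases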
3.6 Edge Cases
- Very short prompts (no middle position)
- Facts repeated multiple times
- Ambiguous scoring
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ lceval run --facts facts.json --positions start,middle,end
$ lceval report
3.7.2 Golden Path Demo (Deterministic)
$ lceval report
Start accuracy: 0.78
Middle accuracy: 0.52
End accuracy: 0.81
Lost-in-middle gap: 0.29
exit_code=0
3.7.3 Failure Demo (Deterministic)
$ lceval run --facts empty.json --positions start
[ERROR] no facts provided
exit_code=2
4. Solution Architecture
4.1 High-Level Design
Facts -> Prompt Generator -> Runner -> Scorer -> Report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Generator | Position facts | Prompt template |
| Runner | Execute model calls | Fixed settings |
| Scorer | Score outputs | Exact vs fuzzy |
| Report | Summarize metrics | Table vs plot |
4.3 Data Structures (No Full Code)
EvalCase:
question: string
fact: string
position: enum
expected: string
4.4 Algorithm Overview
- Generate prompts for each position.
- Run model with fixed settings.
- Score outputs.
- Aggregate results.
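A minimal sketch of the scoring and aggregation steps, assuming the per-case result records produced by the runner and a score_fn that returns 0 or 1; the gap is reported as the best edge accuracy minus the middle accuracy, which matches the demo output in §3.7.2.

from collections import defaultdict

# Sketch: compute accuracy by position and the lost-in-the-middle gap.
def aggregate(results, score_fn):
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["position"]] += 1
        correct[r["position"]] += score_fn(r["answer"], r["expected"])
    accuracy = {pos: correct[pos] / total[pos] for pos in total}
    edges = [accuracy[p] for p in ("start", "end") if p in accuracy]
    gap = max(edges) - accuracy["middle"] if edges and "middle" in accuracy else None
    return accuracy, gap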
5. Implementation Guide
5.1 Development Environment Setup
- Prepare fixed model settings
- Store facts in JSON
5.2 Project Structure
project-root/
├── src/
│ ├── generate/
│ ├── run/
│ ├── score/
│ └── report/
5.3 The Core Question You’re Answering
“Does memory placement change model performance?”
5.4 Concepts You Must Understand First
- Positional bias
- Deterministic evaluation
5.5 Questions to Guide Your Design
- How do you define the middle position?
- What scoring method is robust?
5.6 Thinking Exercise
Predict which placement yields highest accuracy for your dataset.
5.7 The Interview Questions They’ll Ask
- “What is the lost-in-the-middle effect?”
- “How do you measure positional bias?”
- “How do you ensure determinism?”
- “What metrics beyond accuracy matter?”
- “How do you use results to change prompt design?”
5.8 Hints in Layers
- Hint 1: Fix temperature to 0.
- Hint 2: Use exact-match scoring first (see the sketch below).
- Hint 3: Add partial credit later.
- Hint 4: Report gaps by position.
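A minimal exact-match scorer in the spirit of Hints 1-2; the normalization rules are assumptions you can tighten or relax when you add partial credit.

# Sketch: strict scoring after light normalization; relax later if needed.
def normalize(s: str) -> str:
    return " ".join(s.strip().lower().split())   # collapse case and whitespace

def exact_match(answer: str, expected: str) -> int:
    return 1 if normalize(answer) == normalize(expected) else 0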
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Evaluation | “AI Engineering” | Ch. 3-4 |
5.10 Implementation Phases
Phase 1: Foundation
- Build dataset and prompt generator
Phase 2: Core
- Run model and score results
Phase 3: Polish
- Add reports and plots
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Scoring | Exact / Fuzzy | Exact first | Deterministic baseline |
| Positioning | Fixed / Relative | Relative | Scales with prompt length |
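A minimal sketch of the recommended relative positioning: the insertion index is computed as a fraction of the distractor list, so "middle" stays in the middle as prompt length grows. The fraction values are assumptions.

# Sketch: relative positioning scales with prompt length.
POSITION_FRACTIONS = {"start": 0.0, "middle": 0.5, "end": 1.0}

def insert_at_fraction(fact: str, filler_docs: list[str], position: str) -> list[str]:
    docs = list(filler_docs)
    idx = round(POSITION_FRACTIONS[position] * len(docs))
    docs.insert(idx, fact)
    return docs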
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Prompt generator | Position correctness |
| Integration | Full run | Prompt -> score |
| Edge | Empty dataset | Error handling |
6.2 Critical Test Cases
- Positioning produces distinct prompts.
- Scores are deterministic.
- Empty dataset returns error.
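A minimal pytest sketch for the critical cases above, reusing the hypothetical build_prompt and exact_match helpers from earlier sketches; the empty-dataset case is easiest to check at the CLI level (exit code 2, as in §3.7.3).

# Sketch: deterministic unit tests for the generator and scorer.
def test_positions_produce_distinct_prompts():
    filler = [f"Distractor document {i}." for i in range(10)]
    prompts = {pos: build_prompt("API base URL is https://api.example.com",
                                 "What is the API base URL?", filler, pos)
               for pos in ("start", "middle", "end")}
    assert len(set(prompts.values())) == 3   # each position yields a distinct prompt

def test_scoring_is_deterministic():
    for _ in range(3):                       # same inputs, same score, every run
        assert exact_match("HTTPS://api.example.com ", "https://api.example.com") == 1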
6.3 Test Data
question: "What is the API base URL?"
expected: "https://api.example.com"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Non-determinism | Flaky results | Fix model settings |
| Wrong middle placement | No gap detected | Adjust positioning |
| Loose scoring | Inflated accuracy | Use exact match |
7.2 Debugging Strategies
- Print prompt lengths and positions.
- Compare outputs across runs.
7.3 Performance Traps
- Excessive prompt size causing slow runs.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add more fact categories
8.2 Intermediate Extensions
- Add partial credit scoring
8.3 Advanced Extensions
- Add visualization dashboards
9. Real-World Connections
9.1 Industry Applications
- Prompt placement in production RAG systems
9.2 Related Open Source Projects
- Open-source long-context eval suites
9.3 Interview Relevance
- Evaluation methodology is critical in AI system interviews.
10. Resources
10.1 Essential Reading
- Lost in the Middle paper
- Found in the Middle paper
10.2 Video Resources
- Talks on long-context evaluation
10.3 Tools & Documentation
- Prompt evaluation toolkits
10.4 Related Projects in This Series
- Project 4 (retrieval budgets)
- Project 10 (memory paging)
11. Self-Assessment Checklist
11.1 Understanding
- I can explain positional bias.
11.2 Implementation
- Evaluation harness is deterministic.
11.3 Growth
- I can recommend prompt placement strategies.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Evaluation harness produces position-based report
Full Completion:
- Added plots and partial scoring
Excellence (Going Above & Beyond):
- Adaptive placement recommendations