Project 8: Long-Context Evaluation Harness (Lost in the Middle)

Build a deterministic evaluation harness that measures how memory placement affects model performance.

Quick Reference

Attribute | Value
Difficulty | Level 3
Time Estimate | 1-2 weeks
Main Programming Language | Python
Alternative Programming Languages | TypeScript, Go
Coolness Level | Level 3
Business Potential | Level 2
Prerequisites | Prompt templating, basic evaluation
Key Topics | Positional bias, benchmark design, deterministic evaluation

1. Learning Objectives

By completing this project, you will:

  1. Create a dataset with facts placed at different prompt positions.
  2. Measure accuracy differences across positions.
  3. Produce a report that quantifies the lost-in-the-middle effect.
  4. Recommend prompt placement strategies based on data.

2. All Theory Needed (Per-Concept Breakdown)

Long-Context Evaluation and Positional Bias

Fundamentals Long-context evaluation measures how model performance changes as relevant information moves within the prompt. Many models exhibit positional bias: they rely more on the beginning and end of a long context, often under-using information placed in the middle. This is the lost-in-the-middle effect. A good evaluation harness isolates prompt position as the only variable so you can measure this effect precisely.

Deep Dive into the concept Positional bias arises from attention patterns and training data distribution. In long contexts, models tend to focus on tokens near the start or end due to primacy and recency effects in attention. This means memory placement is not neutral: placing retrieved memories in the middle can reduce the chance that they are used. The Lost in the Middle study demonstrates this by shifting relevant facts within a long context and measuring the drop in accuracy when facts are placed in the middle. The follow-up Found in the Middle study explores methods to mitigate this effect, showing that better placement strategies and structured prompts can recover some of the lost performance.

To evaluate positional bias, you must design deterministic prompts. Each test case should contain the same facts and question, but with facts placed at different positions (start, middle, end). All other factors (prompt length, temperature, model version) must be held constant. The output is scored against a known answer. This yields accuracy curves by position. If accuracy drops sharply in the middle, you have a measurable lost-in-the-middle gap.
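
As a concrete illustration, the sketch below builds three prompts that are identical except for where the fact sits among filler passages. The `distractors` list and the template wording are assumptions for illustration, not part of the project specification.

```python
# Minimal sketch of position-controlled prompt generation.
# Assumption: `distractors` is a fixed list of irrelevant passages; only the
# fact's insertion point varies between the three prompts.

def build_prompt(fact: str, question: str, distractors: list[str], position: str) -> str:
    docs = list(distractors)
    if position == "start":
        docs.insert(0, fact)
    elif position == "middle":
        docs.insert(len(docs) // 2, fact)
    elif position == "end":
        docs.append(fact)
    else:
        raise ValueError(f"unknown position: {position}")
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Same fact, same question, same length; only placement differs.
prompts = {
    pos: build_prompt(
        fact="API base URL is https://api.example.com",
        question="What is the API base URL?",
        distractors=[f"Filler passage {i}." for i in range(10)],
        position=pos,
    )
    for pos in ("start", "middle", "end")
}
```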

Evaluation must also track latency and cost. Long-context prompts increase cost, and adding memory at multiple positions may not be feasible. Therefore, the harness should output not just accuracy but also prompt length and runtime. This helps you choose memory placement strategies that balance accuracy and cost. For example, you might decide to place only the top-2 memories near the end (recency bias) and rely on summaries near the start.
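
One way to keep cost and latency visible is to record them next to correctness for every model call. The field names and the `call_model` callable below are assumptions, not a required interface.

```python
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    position: str      # "start", "middle", or "end"
    correct: bool      # scored against the expected answer
    prompt_chars: int  # crude length proxy; swap in a tokenizer count if available
    latency_s: float   # wall-clock time of the model call

def timed_call(call_model, prompt: str, expected: str, position: str) -> RunRecord:
    # `call_model` is an assumed callable: prompt in, answer text out.
    t0 = time.perf_counter()
    answer = call_model(prompt)
    return RunRecord(
        position=position,
        # Exact match as the baseline scorer; see §5.8 for looser alternatives.
        correct=answer.strip().lower() == expected.strip().lower(),
        prompt_chars=len(prompt),
        latency_s=time.perf_counter() - t0,
    )
```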

This project turns positional bias into a measurable signal. The result will guide how you place memory in Project 10’s paging system and how you design retrieval budgets in Project 4.

From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
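
The drift metrics above can be computed from plain counters over retrieval logs. The log entry shape in this sketch is hypothetical; it only illustrates the kind of bookkeeping involved.

```python
# Illustrative drift metrics over a batch of retrieval logs.
# Assumed (hypothetical) log entry shape:
#   {"retrieved_ids": [...], "used_ids": [...], "fallback": bool}

def drift_metrics(logs: list[dict]) -> dict:
    retrieved = sum(len(e["retrieved_ids"]) for e in logs)
    used = sum(len(e["used_ids"]) for e in logs)
    fallbacks = sum(1 for e in logs if e["fallback"])
    return {
        "unused_retrieval_ratio": 1 - used / retrieved if retrieved else 0.0,
        "fallback_rate": fallbacks / len(logs) if logs else 0.0,
    }
```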

Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.

How this fits into the projects This concept is central to Project 8 and influences Projects 4 and 10.

Definitions & key terms

  • Positional bias: Model preference for tokens at the start/end.
  • Lost-in-the-middle: Accuracy drop when facts are mid-context.
  • Evaluation harness: Automated system to measure performance.

Mental model diagram (ASCII)

Start        Middle          End
[FACT] ------------------- [FACT]
  ^            x             ^
high use     low use      high use

How It Works (Step-by-Step)

  1. Prepare a set of facts and questions.
  2. Create prompts with facts at different positions.
  3. Run the model with fixed settings.
  4. Score answers and compute accuracy by position.
  5. Report the lost-in-the-middle gap.

Minimal Concrete Example

positions: [start, middle, end]
question: "What is the API base URL?"
expected: "https://api.example.com"

Common Misconceptions

  • “Long-context models use all tokens equally.” (False.)
  • “More context always improves performance.” (False.)

Check-Your-Understanding Questions

  1. Why must prompts be deterministic in evaluation?
  2. What is the lost-in-the-middle gap?
  3. How can you mitigate positional bias?

Check-Your-Understanding Answers

  1. To isolate position as the only variable.
  2. The accuracy difference between middle and edge positions.
  3. Place critical memory near anchors or restructure prompts.

Real-World Applications

  • Memory placement strategies in RAG systems.
  • QA evaluation for long documents.

Where You’ll Apply It

  • In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
  • Also used in: Project 4, Project 10.

References

  • Lost in the Middle - https://arxiv.org/abs/2307.03172
  • Found in the Middle - https://arxiv.org/abs/2406.16008

Key Insights Prompt placement is a controllable variable that can be measured and optimized.

Summary A deterministic harness reveals positional bias and guides memory placement strategies.

Homework/Exercises to Practice the Concept

  1. Create 5 prompts with facts at different positions.
  2. Predict which position yields highest accuracy and test it.

Solutions to the Homework/Exercises

  1. Use fixed length and identical content with position changes.
  2. Expect start/end to outperform middle.

3. Project Specification

3.1 What You Will Build

An evaluation harness that:

  • Generates position-controlled prompts
  • Runs model evaluations
  • Scores outputs deterministically
  • Produces a positional bias report

3.2 Functional Requirements

  1. Dataset Builder: Create prompts with controlled positions.
  2. Runner: Execute model calls with fixed settings.
  3. Scorer: Score outputs vs expected answers.
  4. Reporter: Output accuracy by position.

3.3 Non-Functional Requirements

  • Performance: 100 prompts evaluated in under 5 minutes.
  • Reliability: Same run yields same results.
  • Usability: Clear accuracy plots or tables.

3.4 Example Usage / Output

$ lceval run --facts facts.json --positions start,middle,end
[RUN] position=start accuracy=0.78
[RUN] position=middle accuracy=0.52
[RUN] position=end accuracy=0.81

3.5 Data Formats / Schemas / Protocols

{
  "question": "What is the API base URL?",
  "fact": "API base URL is https://api.example.com",
  "positions": ["start", "middle", "end"]
}
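
A minimal loader sketch, assuming `facts.json` holds a list of records shaped like the example above; the fail-fast behavior mirrors the failure demo in §3.7.3.

```python
import json
import sys

ALLOWED_POSITIONS = {"start", "middle", "end"}

def load_cases(path: str) -> list[dict]:
    """Load fact/question records and fail fast on contract violations."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    if not cases:
        print("[ERROR] no facts provided", file=sys.stderr)
        raise SystemExit(2)
    for case in cases:
        missing = {"question", "fact", "positions"} - case.keys()
        if missing:
            raise ValueError(f"case missing fields: {missing}")
        unknown = set(case["positions"]) - ALLOWED_POSITIONS
        if unknown:
            raise ValueError(f"unknown positions: {unknown}")
    return cases
```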

3.6 Edge Cases

  • Very short prompts (no middle position)
  • Facts repeated multiple times
  • Ambiguous scoring

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ lceval run --facts facts.json --positions start,middle,end
$ lceval report

3.7.2 Golden Path Demo (Deterministic)

$ lceval report
Start accuracy: 0.78
Middle accuracy: 0.52
End accuracy: 0.81
Lost-in-middle gap: 0.29
exit_code=0

3.7.3 Failure Demo (Deterministic)

$ lceval run --facts empty.json --positions start
[ERROR] no facts provided
exit_code=2

4. Solution Architecture

4.1 High-Level Design

Facts -> Prompt Generator -> Runner -> Scorer -> Report
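
The stages above can be stitched together in a few lines. This sketch assumes the `build_prompt` and `timed_call` helpers sketched in section 2 and an `expected` field on each case; aggregation into the final report is sketched under §4.4.

```python
def run_pipeline(cases: list[dict], call_model, distractors: list[str]) -> list:
    # Facts -> Prompt Generator -> Runner (scoring happens inside timed_call).
    records = []
    for case in cases:
        for pos in case["positions"]:
            prompt = build_prompt(case["fact"], case["question"], distractors, pos)
            records.append(timed_call(call_model, prompt, case["expected"], pos))
    return records
```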

4.2 Key Components

Component | Responsibility | Key Decisions
Generator | Position facts | Prompt template
Runner | Execute model calls | Fixed settings
Scorer | Score outputs | Exact vs fuzzy
Report | Summarize metrics | Table vs plot

4.3 Data Structures (No Full Code)

EvalCase:
  question: string
  fact: string
  position: enum
  expected: string
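
The structure above maps directly onto a frozen dataclass; this is a sketch, with the enum values taken from §3.5.

```python
from dataclasses import dataclass
from enum import Enum

class Position(str, Enum):
    START = "start"
    MIDDLE = "middle"
    END = "end"

@dataclass(frozen=True)
class EvalCase:
    question: str
    fact: str
    position: Position
    expected: str
```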

4.4 Algorithm Overview

  1. Generate prompts for each position.
  2. Run model with fixed settings.
  3. Score outputs.
  4. Aggregate results (see the aggregation sketch below).
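
A minimal aggregation sketch, assuming per-call records with `position` and `correct` attributes (such as the RunRecord sketched in section 2). With the demo numbers from §3.7.2 the gap works out to 0.81 - 0.52 = 0.29.

```python
from collections import defaultdict

def accuracy_by_position(records) -> dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.position] += 1
        hits[r.position] += int(r.correct)
    return {pos: hits[pos] / totals[pos] for pos in totals}

def lost_in_middle_gap(acc: dict[str, float]) -> float:
    # Gap = best edge accuracy minus middle accuracy, e.g. 0.81 - 0.52 = 0.29.
    return max(acc["start"], acc["end"]) - acc["middle"]
```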

5. Implementation Guide

5.1 Development Environment Setup

- Prepare fixed model settings
- Store facts in JSON

5.2 Project Structure

project-root/
├── src/
│   ├── generate/
│   ├── run/
│   ├── score/
│   └── report/

5.3 The Core Question You’re Answering

“Does memory placement change model performance?”

5.4 Concepts You Must Understand First

  1. Positional bias
  2. Deterministic evaluation

5.5 Questions to Guide Your Design

  1. How do you define the middle position?
  2. What scoring method is robust?

5.6 Thinking Exercise

Predict which placement yields highest accuracy for your dataset.

5.7 The Interview Questions They’ll Ask

  1. “What is the lost-in-the-middle effect?”
  2. “How do you measure positional bias?”
  3. “How do you ensure determinism?”
  4. “What metrics beyond accuracy matter?”
  5. “How do you use results to change prompt design?”

5.8 Hints in Layers

Hint 1: Fix temperature to 0.
Hint 2: Use exact match scoring first.
Hint 3: Add partial credit later (a scorer sketch follows below).
Hint 4: Report gaps by position.
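
Following Hints 2 and 3, a scorer might start with exact match and relax it later; the normalization and the 0.5 partial-credit value are assumptions.

```python
def exact_match(answer: str, expected: str) -> bool:
    # Hint 2: strict, deterministic baseline.
    return answer.strip().lower() == expected.strip().lower()

def partial_credit(answer: str, expected: str) -> float:
    # Hint 3: simple containment-based relaxation; replace with token-level
    # overlap if this proves too brittle for your dataset.
    if exact_match(answer, expected):
        return 1.0
    return 0.5 if expected.strip().lower() in answer.lower() else 0.0
```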

5.9 Books That Will Help

Topic | Book | Chapter
Evaluation | “AI Engineering” | Ch. 3-4

5.10 Implementation Phases

Phase 1: Foundation

  • Build dataset and prompt generator

Phase 2: Core

  • Run model and score results

Phase 3: Polish

  • Add reports and plots

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Scoring | Exact / Fuzzy | Exact first | Deterministic baseline
Positioning | Fixed / Relative | Relative | Scales with prompt length
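
One way to implement the Relative recommendation above is to place the fact at a fixed fraction of the distractor list; the 0.5 midpoint is an assumption and also answers design question 1 in §5.5.

```python
def insertion_index(num_distractors: int, position: str) -> int:
    # Relative placement scales with prompt length instead of a fixed offset.
    fractions = {"start": 0.0, "middle": 0.5, "end": 1.0}
    return round(fractions[position] * num_distractors)
```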

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit | Prompt generator | Position correctness
Integration | Full run | Prompt -> score
Edge | Empty dataset | Error handling

6.2 Critical Test Cases

  1. Positioning produces distinct prompts.
  2. Scores are deterministic.
  3. Empty dataset returns error.

6.3 Test Data

question: "What is the API base URL?"
expected: "https://api.example.com"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Non-determinism | Flaky results | Fix model settings
Wrong middle placement | No gap detected | Adjust positioning
Loose scoring | Inflated accuracy | Use exact match

7.2 Debugging Strategies

  • Print prompt lengths and positions.
  • Compare outputs across runs.

7.3 Performance Traps

  • Excessive prompt size causing slow runs.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add more fact categories

8.2 Intermediate Extensions

  • Add partial credit scoring

8.3 Advanced Extensions

  • Add visualization dashboards

9. Real-World Connections

9.1 Industry Applications

  • Prompt placement in production RAG systems
  • Open-source long-context eval suites

9.2 Interview Relevance

  • Evaluation methodology is critical in AI system interviews.

10. Resources

10.1 Essential Reading

  • Lost in the Middle paper
  • Found in the Middle paper

10.2 Video Resources

  • Talks on long-context evaluation

10.3 Tools & Documentation

  • Prompt evaluation toolkits

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain positional bias.

11.2 Implementation

  • Evaluation harness is deterministic.

11.3 Growth

  • I can recommend prompt placement strategies.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Evaluation harness produces position-based report

Full Completion:

  • Added plots and partial scoring

Excellence (Going Above & Beyond):

  • Adaptive placement recommendations