Project 5: The Web Researcher Agent (Search & Synthesis)

Build an agent that iteratively searches the web, visits multiple sources, extracts evidence, and produces a cited comparison report.

Quick Reference

Attribute       Value
Difficulty      Level 3: Advanced
Time Estimate   20–30 hours
Language        Python (Alternatives: Go, TypeScript)
Prerequisites   HTTP, basic scraping, prompt discipline, working with rate limits
Key Topics      iterative search, browsing/extraction, citation discipline, termination conditions, source ranking

1. Learning Objectives

By completing this project, you will:

  1. Implement an agent loop that chooses what to search next based on results.
  2. Build a robust extraction pipeline (HTML → cleaned text → key claims).
  3. Enforce citation discipline (every claim traces to a source).
  4. Design stopping criteria (“enough evidence”) to avoid infinite browsing.
  5. Detect and filter low-quality sources (SEO spam, duplicates).

2. Theoretical Foundation

2.1 Core Concepts

  • Iterative search: One query rarely suffices. Agents refine queries as they learn (query expansion, diversification).
  • Evidence vs narrative: A research agent must separate “claims” from “citations” and keep the mapping.
  • Context compression: Web pages are large; you must summarize/extract before feeding the LLM.
  • Termination conditions: Without explicit stop logic, agents either stop too early or loop forever.
  • Truthfulness constraints: Agents hallucinate most when they synthesize without evidence; strong source handling reduces this.
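
The context-compression idea can be sketched as a greedy paragraph budget. This is a minimal illustration (the function name and default budget are arbitrary choices); real pipelines typically combine readability extraction with LLM summarization.

```python
def compress(text: str, budget: int = 2000) -> str:
    """Greedy context compression: keep whole paragraphs until a
    character budget is exhausted, so LLM prompts stay bounded."""
    kept, used = [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if used + len(para) > budget:
            break  # stop before overflowing the budget
        kept.append(para)
        used += len(para) + 2  # account for the paragraph separator
    return "\n\n".join(kept)
```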

2.2 Why This Matters

Web research is a canonical “agent task”: multi-step, ambiguous, tool-heavy, and truth-sensitive. These patterns carry directly into personal assistants that plan purchases, compare services, or gather travel info.

2.3 Common Misconceptions

  • “The model can just browse once and summarize.” High quality requires multiple, diverse sources and cross-checking.
  • “Citations fix hallucinations.” Only if you build citations from retrieved sources and forbid uncited claims.
  • “Scraping is trivial.” Dynamic sites and paywalls require fallback strategies and careful selection.

3. Project Specification

3.1 What You Will Build

A CLI tool that takes a question like:

  • “Best 3 mechanical keyboards for programmers under $100”
  • “NVIDIA stock forecast for 2025”
  • “Compare three privacy-focused note apps”

and produces:

  • A short summary
  • A comparison table (features, price, pros/cons)
  • A citations list with URLs per claim/table row

3.2 Functional Requirements

  1. Search tool: perform web search via an API (or your own crawler).
  2. Fetch tool: download HTML and extract readable text.
  3. Extraction: pull structured facts (price, features, dates) and direct quotes when needed.
  4. Synthesis: produce a report with citations.
  5. Stop condition: halt after enough coverage/diversity or confidence threshold.
  6. Cache: avoid repeatedly fetching the same URLs.

3.3 Non-Functional Requirements

  • Safety/ethics: respect robots and terms; avoid aggressive scraping.
  • Robustness: handle timeouts and partial failures; degrade gracefully.
  • Reproducibility: store fetched pages and intermediate extractions for replay.
  • Truthfulness: refuse to answer if evidence is weak or contradictory.

3.4 Example Usage / Output

python researcher.py "Best 3 mechanical keyboards under $100 for programmers"

Output shape:

Summary: …
Table: …
Sources:
 [1] https://…
 [2] https://…

4. Solution Architecture

4.1 High-Level Design

┌─────────────┐   plan/query   ┌─────────────────┐
│   CLI/UI    │───────────────▶│ Planner (Agent) │
└─────────────┘                └────────┬────────┘
                                        │ chooses tools
                                        ▼
 ┌─────────────┐  results  ┌──────────────┐  pages  ┌───────────────┐
 │ Search Tool │──────────▶│ URL Selector │────────▶│ Fetch+Extract │
 └─────────────┘           └──────────────┘         └───────┬───────┘
                                                            │
                                                            ▼
                                                    ┌────────────────┐
                                                    │ Evidence Store │
                                                    └───────┬────────┘
                                                            ▼
                                                    ┌────────────────┐
                                                    │  Synthesizer   │
                                                    │ (cited report) │
                                                    └────────────────┘

4.2 Key Components

Component       Responsibility                 Key Decisions
Planner         decide next search/fetch       bounded loop; diversity goals
Search          get candidate URLs             provider API; rate limits
Fetch/Extract   page text + key facts          readability extraction, caching
Evidence store  track claims + sources         schema that preserves provenance
Synthesizer     generate report from evidence  forbid uncited claims

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float  # your heuristic score
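
One way to keep the claim→source mapping system-owned is a store that only reports claims as citable when stored evidence backs them. The method names here are illustrative, not a prescribed API:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float

class EvidenceStore:
    """Groups evidence by claim so provenance is never lost."""

    def __init__(self) -> None:
        self._by_claim: dict[str, list[Evidence]] = defaultdict(list)

    def add(self, ev: Evidence) -> None:
        self._by_claim[ev.claim].append(ev)

    def sources_for(self, claim: str) -> list[str]:
        """Deduplicated source URLs backing a claim."""
        return sorted({ev.source_url for ev in self._by_claim[claim]})

    def is_citable(self, claim: str, min_sources: int = 1) -> bool:
        """Only claims with enough distinct sources may enter the report."""
        return len(self.sources_for(claim)) >= min_sources
```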

4.4 Algorithm Overview

Key Algorithm: iterative research

  1. Generate initial search queries (diverse: broad + specific).
  2. Search and rank URLs; dedupe domains and near-duplicates.
  3. Fetch top URLs; extract text and candidate facts.
  4. Store evidence as (claim, url, snippet).
  5. Decide whether coverage is sufficient; otherwise refine queries and repeat.
  6. Synthesize report strictly from evidence store.
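
The six steps above can be sketched as a bounded loop. Here `search`, `fetch_extract`, and `coverage` are injected placeholders you would implement, and the query-refinement line is a deliberate stub:

```python
def research(question, search, fetch_extract, coverage,
             max_steps=5, target=0.8):
    """Bounded iterative research loop.

    search(query)      -> list of candidate URLs
    fetch_extract(url) -> list of (claim, url, snippet) tuples
    coverage(evidence) -> score in [0, 1]; >= target means "enough"
    """
    evidence, seen = [], set()
    query = question
    for _ in range(max_steps):            # hard cap prevents infinite loops
        for url in search(query):
            if url in seen:               # never fetch the same URL twice
                continue
            seen.add(url)
            evidence.extend(fetch_extract(url))
        if coverage(evidence) >= target:  # coverage-based early stop
            break
        query = f"{question} details"     # stub: refine the query via the LLM
    return evidence
```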

Complexity Analysis:

  • Time: O(searches + fetched_pages), dominated by network I/O
  • Space: O(pages_cached + evidence_items)

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pydantic rich

5.2 Project Structure

web-researcher/
├── src/
│   ├── cli.py
│   ├── planner.py
│   ├── search.py
│   ├── fetch.py
│   ├── extract.py
│   ├── evidence.py
│   └── synthesize.py
└── data/
    └── cache/

5.3 Implementation Phases

Phase 1: Search + fetch + cache (6–8h)

Goals:

  • Get URLs and fetch pages reliably.

Tasks:

  1. Implement one search backend.
  2. Add caching keyed by URL.
  3. Extract readable text (drop nav/ads).

Checkpoint: You can fetch 5–10 pages and print clean text lengths + titles.
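
Task 2's cache can be a content-addressed file per URL; the directory and naming scheme below are illustrative:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("data/cache")

def cache_path(url: str) -> Path:
    """Stable file name derived from a hash of the URL."""
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def cached_fetch(url: str, fetch) -> str:
    """Return cached HTML if present; otherwise fetch once and store it."""
    path = cache_path(url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```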

Phase 2: Evidence extraction + synthesis (6–10h)

Goals:

  • Extract comparable facts and synthesize with citations.

Tasks:

  1. Create evidence schema; store per URL.
  2. Add LLM extraction prompt that returns structured facts.
  3. Generate a comparison table from evidence.

Checkpoint: Report includes sources and avoids uncited claims.
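
Task 2's extraction prompt can demand JSON and then be validated strictly. The field names and prompt wording are assumptions, and `parse_facts` drops malformed items rather than guessing:

```python
import json

FIELDS = ("name", "price_usd", "key_features")

def extraction_prompt(page_text: str) -> str:
    """Build a prompt that forces structured, text-grounded output."""
    return (
        "Extract product facts from the text below. Reply ONLY with a "
        "JSON list of objects with keys name, price_usd, key_features. "
        "Use null for anything not stated in the text.\n\n" + page_text
    )

def parse_facts(reply: str) -> list[dict]:
    """Validate the model reply; discard anything malformed."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [it for it in items
            if isinstance(it, dict) and all(k in it for k in FIELDS)]
```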

Phase 3: Iterative planning + stop logic (6–12h)

Goals:

  • Make the agent decide “what next” and “when to stop”.

Tasks:

  1. Add planner loop with max steps.
  2. Add diversity heuristics (domains, publication dates).
  3. Add contradiction detection and “insufficient evidence” refusal.

Checkpoint: Agent improves results vs one-shot search and stops reliably.
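
The domain-diversity heuristic from task 2 can be sketched as a per-domain cap over the ranked URL list (naive netloc handling; a production version would use a public-suffix list):

```python
from urllib.parse import urlparse

def diversify(urls: list[str], per_domain: int = 1) -> list[str]:
    """Keep ranking order but cap how many URLs one domain contributes."""
    counts: dict[str, int] = {}
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        if counts.get(domain, 0) < per_domain:
            kept.append(url)
            counts[domain] = counts.get(domain, 0) + 1
    return kept
```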

5.4 Key Implementation Decisions

Decision        Options                       Recommendation        Rationale
Fetch strategy  requests vs headless browser  requests first        simpler; add browser later
Evidence        freeform notes vs schema      schema                ensures provenance/citations
Stop condition  fixed steps vs coverage       coverage + max steps  avoids looping and under-research

6. Testing Strategy

6.1 Test Categories

Category  Purpose              Examples
Unit      extraction/cleanup   remove boilerplate, dedupe URLs
Replay    deterministic tests  run against saved HTML fixtures
Safety    citations            fail the build if the report has uncited claims

6.2 Critical Test Cases

  1. Duplicate source: same URL appears twice → only fetched once.
  2. Dynamic site: fetch fails → agent chooses alternate sources.
  3. Citation coverage: every table row has at least one citation.
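
Test 3 is easiest to run against the report's data model rather than its rendered text. Here table rows are assumed to be dicts carrying a `citations` list (the shape is illustrative):

```python
def uncited_rows(rows: list[dict]) -> list[int]:
    """Indices of table rows lacking citations; an empty list means pass."""
    return [i for i, row in enumerate(rows) if not row.get("citations")]
```

In a safety test, assert `uncited_rows(table) == []` so any report with an unbacked row fails the build.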

7. Common Pitfalls & Debugging

Pitfall             Symptom                       Solution
SEO spam            low-quality sources dominate  domain filters, diversity rules, exclude lists
Context overflow    prompts too long              compress per page, limit evidence items
Hallucinated facts  "sounds right" but wrong      only allow facts with stored evidence
Infinite loop       agent keeps searching         coverage-based termination + max steps
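
Near-duplicate sources (a common SEO-spam symptom) can be flagged with Jaccard similarity over word shingles; the shingle size and threshold below are arbitrary starting points, not tuned values:

```python
def shingles(text: str, k: int = 5) -> set:
    """k-word shingles of lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the two texts' shingle sets overlap beyond the threshold."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold
```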

Debugging strategies:

  • Persist every step output (queries, URLs, extracted evidence).
  • Add a “replay from cache” mode for deterministic iteration.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add markdown report output.
  • Add date filtering (“only sources after 2024”).

8.2 Intermediate Extensions

  • Add headless browsing for JS-heavy sites.
  • Add reranking of sources by credibility.

8.3 Advanced Extensions

  • Add claim verification against multiple sources.
  • Add long-running research jobs with resume capability.

9. Real-World Connections

9.1 Industry Applications

  • Competitive analysis and market research assistants.
  • Procurement assistants that compare vendors.
  • Research copilots for writing reports with citations.

9.2 Interview Relevance

  • Tool-using agents, termination logic, and citation-based truthfulness.

10. Resources

10.1 Essential Reading

  • Building AI Agents (Packt) — ReAct loops and tool orchestration (Ch. 2)
  • The LLM Engineering Handbook (Paul Iusztin) — RAG and eval patterns (Ch. 5, 8)

10.2 Tools & Documentation

  • Search API docs (Tavily/Serper/etc.)
  • Readability extraction patterns (boilerplate removal)
  • Previous: Project 4 (calendar action) — safe tool use patterns
  • Next: Project 6 (multi-tool assistant) — generalized routing and orchestration

11. Self-Assessment Checklist

  • I can explain why citation mapping must be system-owned, not model-owned.
  • I can show a termination strategy and why it avoids looping.
  • I can reproduce a report using cached artifacts.
  • I can reduce hallucinations by tightening evidence requirements.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Search + fetch + extraction pipeline
  • Report with citations and a comparison table
  • Caching to avoid repeated fetches

Full Completion:

  • Iterative planner with stop logic and source diversity
  • Replay mode with saved HTML/evidence artifacts

Excellence (Going Above & Beyond):

  • Multi-source claim verification and contradiction handling

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.