Project 5: The Web Researcher Agent (Search & Synthesis)

Build an agent that iteratively searches the web, visits multiple sources, extracts evidence, and produces a cited comparison report.

Quick Reference

Attribute       Value
Difficulty      Level 3: Advanced
Time Estimate   20–30 hours
Language        Python (Alternatives: Go, TypeScript)
Prerequisites   HTTP, basic scraping, prompt discipline, working with rate limits
Key Topics      iterative search, browsing/extraction, citation discipline, termination conditions, source ranking

1. Learning Objectives

By completing this project, you will:

  1. Implement an agent loop that chooses what to search next based on results.
  2. Build a robust extraction pipeline (HTML → cleaned text → key claims).
  3. Enforce citation discipline (every claim traces to a source).
  4. Design stopping criteria (“enough evidence”) to avoid infinite browsing.
  5. Detect and filter low-quality sources (SEO spam, duplicates).

2. Theoretical Foundation

2.1 Core Concepts

  • Iterative search: One query rarely suffices. Agents refine queries as they learn (query expansion, diversification).
  • Evidence vs narrative: A research agent must separate “claims” from “citations” and keep the mapping.
  • Context compression: Web pages are large; you must summarize/extract before feeding the LLM.
  • Termination conditions: Without explicit stop logic, agents either stop too early or loop forever.
  • Truthfulness constraints: Agents hallucinate most when they synthesize without evidence; strong source handling reduces this.
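
The context-compression idea can be sketched as a greedy paragraph budget. This is a minimal illustration (the function name and default budget are arbitrary choices); real pipelines typically combine readability extraction with LLM summarization.

```python
def compress(text: str, budget: int = 2000) -> str:
    """Greedy context compression: keep whole paragraphs until a
    character budget is exhausted, so LLM prompts stay bounded."""
    kept, used = [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if used + len(para) > budget:
            break  # stop before overflowing the budget
        kept.append(para)
        used += len(para) + 2  # account for the paragraph separator
    return "\n\n".join(kept)
```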

2.2 Why This Matters

Web research is a canonical “agent task”: multi-step, ambiguous, tool-heavy, and truth-sensitive. These patterns carry directly into personal assistants that plan purchases, compare services, or gather travel info.

2.3 Common Misconceptions

  • “The model can just browse once and summarize.” High quality requires multiple, diverse sources and cross-checking.
  • “Citations fix hallucinations.” Only if you build citations from retrieved sources and forbid uncited claims.
  • “Scraping is trivial.” Dynamic sites and paywalls require fallback strategies and careful selection.

3. Project Specification

3.1 What You Will Build

A CLI tool that takes a question like:

  • “Best 3 mechanical keyboards for programmers under $100”
  • “NVIDIA stock forecast for 2025”
  • “Compare three privacy-focused note apps”

and produces:

  • A short summary
  • A comparison table (features, price, pros/cons)
  • A citations list with URLs per claim/table row

3.2 Functional Requirements

  1. Search tool: perform web search via an API (or your own crawler).
  2. Fetch tool: download HTML and extract readable text.
  3. Extraction: pull structured facts (price, features, dates) and direct quotes when needed.
  4. Synthesis: produce a report with citations.
  5. Stop condition: halt after enough coverage/diversity or confidence threshold.
  6. Cache: avoid repeatedly fetching the same URLs.

3.3 Non-Functional Requirements

  • Safety/ethics: respect robots and terms; avoid aggressive scraping.
  • Robustness: handle timeouts and partial failures; degrade gracefully.
  • Reproducibility: store fetched pages and intermediate extractions for replay.
  • Truthfulness: refuse to answer if evidence is weak or contradictory.

3.4 Example Usage / Output

python researcher.py "Best 3 mechanical keyboards under $100 for programmers"

Output shape:

Summary: …
Table: …
Sources:
 [1] https://…
 [2] https://…

4. Solution Architecture

4.1 High-Level Design

┌─────────────┐   plan/query   ┌─────────────────┐
│   CLI/UI    │───────────────▶│ Planner (Agent) │
└─────────────┘                └────────┬────────┘
                                        │ chooses tools
                                        ▼
 ┌─────────────┐  results  ┌──────────────┐  pages  ┌───────────────┐
 │ Search Tool │──────────▶│ URL Selector │────────▶│ Fetch+Extract │
 └─────────────┘           └──────────────┘         └───────┬───────┘
                                                            │
                                                            ▼
                                                    ┌────────────────┐
                                                    │ Evidence Store │
                                                    └───────┬────────┘
                                                            ▼
                                                    ┌────────────────┐
                                                    │  Synthesizer   │
                                                    │ (cited report) │
                                                    └────────────────┘

4.2 Key Components

Component       Responsibility                 Key Decisions
Planner         decide next search/fetch       bounded loop; diversity goals
Search          get candidate URLs             provider API; rate limits
Fetch/Extract   page text + key facts          readability extraction, caching
Evidence store  track claims + sources         schema that preserves provenance
Synthesizer     generate report from evidence  forbid uncited claims

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float  # your heuristic score
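
One way to keep the claim→source mapping system-owned is a store that only reports claims as citable when stored evidence backs them. The method names here are illustrative, not a prescribed API:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float

class EvidenceStore:
    """Groups evidence by claim so provenance is never lost."""

    def __init__(self) -> None:
        self._by_claim: dict[str, list[Evidence]] = defaultdict(list)

    def add(self, ev: Evidence) -> None:
        self._by_claim[ev.claim].append(ev)

    def sources_for(self, claim: str) -> list[str]:
        """Deduplicated source URLs backing a claim."""
        return sorted({ev.source_url for ev in self._by_claim[claim]})

    def is_citable(self, claim: str, min_sources: int = 1) -> bool:
        """Only claims with enough distinct sources may enter the report."""
        return len(self.sources_for(claim)) >= min_sources
```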

4.4 Algorithm Overview

Key Algorithm: iterative research

  1. Generate initial search queries (diverse: broad + specific).
  2. Search and rank URLs; dedupe domains and near-duplicates.
  3. Fetch top URLs; extract text and candidate facts.
  4. Store evidence as (claim, url, snippet).
  5. Decide whether coverage is sufficient; otherwise refine queries and repeat.
  6. Synthesize report strictly from evidence store.
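
The six steps above can be sketched as a bounded loop. Here `search`, `fetch_extract`, and `coverage` are injected placeholders you would implement, and the query-refinement line is a deliberate stub:

```python
def research(question, search, fetch_extract, coverage,
             max_steps=5, target=0.8):
    """Bounded iterative research loop.

    search(query)      -> list of candidate URLs
    fetch_extract(url) -> list of (claim, url, snippet) tuples
    coverage(evidence) -> score in [0, 1]; >= target means "enough"
    """
    evidence, seen = [], set()
    query = question
    for _ in range(max_steps):            # hard cap prevents infinite loops
        for url in search(query):
            if url in seen:               # never fetch the same URL twice
                continue
            seen.add(url)
            evidence.extend(fetch_extract(url))
        if coverage(evidence) >= target:  # coverage-based early stop
            break
        query = f"{question} details"     # stub: refine the query via the LLM
    return evidence
```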

Complexity Analysis:

  • Time: O(searches + fetched_pages), dominated by network I/O
  • Space: O(pages_cached + evidence_items)

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pydantic rich

5.2 Project Structure

web-researcher/
├── src/
│   ├── cli.py
│   ├── planner.py
│   ├── search.py
│   ├── fetch.py
│   ├── extract.py
│   ├── evidence.py
│   └── synthesize.py
└── data/
    └── cache/

5.3 Implementation Phases

Phase 1: Search + fetch + cache (6–8h)

Goals:

  • Get URLs and fetch pages reliably.

Tasks:

  1. Implement one search backend.
  2. Add caching keyed by URL.
  3. Extract readable text (drop nav/ads).

Checkpoint: You can fetch 5–10 pages and print clean text lengths + titles.
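
Task 2's cache can be a content-addressed file per URL; the directory and naming scheme below are illustrative:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("data/cache")

def cache_path(url: str) -> Path:
    """Stable file name derived from a hash of the URL."""
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def cached_fetch(url: str, fetch) -> str:
    """Return cached HTML if present; otherwise fetch once and store it."""
    path = cache_path(url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```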

Phase 2: Evidence extraction + synthesis (6–10h)

Goals:

  • Extract comparable facts and synthesize with citations.

Tasks:

  1. Create evidence schema; store per URL.
  2. Add LLM extraction prompt that returns structured facts.
  3. Generate a comparison table from evidence.

Checkpoint: Report includes sources and avoids uncited claims.
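
Task 2's extraction prompt can demand JSON and then be validated strictly. The field names and prompt wording are assumptions, and `parse_facts` drops malformed items rather than guessing:

```python
import json

FIELDS = ("name", "price_usd", "key_features")

def extraction_prompt(page_text: str) -> str:
    """Build a prompt that forces structured, text-grounded output."""
    return (
        "Extract product facts from the text below. Reply ONLY with a "
        "JSON list of objects with keys name, price_usd, key_features. "
        "Use null for anything not stated in the text.\n\n" + page_text
    )

def parse_facts(reply: str) -> list[dict]:
    """Validate the model reply; discard anything malformed."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [it for it in items
            if isinstance(it, dict) and all(k in it for k in FIELDS)]
```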

Phase 3: Iterative planning + stop logic (6–12h)

Goals:

  • Make the agent decide “what next” and “when to stop”.

Tasks:

  1. Add planner loop with max steps.
  2. Add diversity heuristics (domains, publication dates).
  3. Add contradiction detection and “insufficient evidence” refusal.

Checkpoint: Agent improves results vs one-shot search and stops reliably.
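
The domain-diversity heuristic from task 2 can be sketched as a per-domain cap over the ranked URL list (naive netloc handling; a production version would use a public-suffix list):

```python
from urllib.parse import urlparse

def diversify(urls: list[str], per_domain: int = 1) -> list[str]:
    """Keep ranking order but cap how many URLs one domain contributes."""
    counts: dict[str, int] = {}
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        if counts.get(domain, 0) < per_domain:
            kept.append(url)
            counts[domain] = counts.get(domain, 0) + 1
    return kept
```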

5.4 Key Implementation Decisions

Decision        Options                       Recommendation        Rationale
Fetch strategy  requests vs headless browser  requests first        simpler; add browser later
Evidence        freeform notes vs schema      schema                ensures provenance/citations
Stop condition  fixed steps vs coverage       coverage + max steps  avoids looping and under-research

6. Testing Strategy

6.1 Test Categories

Category  Purpose              Examples
Unit      extraction/cleanup   remove boilerplate, dedupe URLs
Replay    deterministic tests  run against saved HTML fixtures
Safety    citations            fail the build if the report has uncited claims

6.2 Critical Test Cases

  1. Duplicate source: same URL appears twice → only fetched once.
  2. Dynamic site: fetch fails → agent chooses alternate sources.
  3. Citation coverage: every table row has at least one citation.
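
Test 3 is easiest to run against the report's data model rather than its rendered text. Here table rows are assumed to be dicts carrying a `citations` list (the shape is illustrative):

```python
def uncited_rows(rows: list[dict]) -> list[int]:
    """Indices of table rows lacking citations; an empty list means pass."""
    return [i for i, row in enumerate(rows) if not row.get("citations")]
```

In a safety test, assert `uncited_rows(table) == []` so any report with an unbacked row fails the build.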

7. Common Pitfalls & Debugging

Pitfall             Symptom                       Solution
SEO spam            low-quality sources dominate  domain filters, diversity rules, exclude lists
Context overflow    prompts too long              compress per page, limit evidence items
Hallucinated facts  "sounds right" but wrong      only allow facts with stored evidence
Infinite loop       agent keeps searching         coverage-based termination + max steps
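
Near-duplicate sources (a common SEO-spam symptom) can be flagged with Jaccard similarity over word shingles; the shingle size and threshold below are arbitrary starting points, not tuned values:

```python
def shingles(text: str, k: int = 5) -> set:
    """k-word shingles of lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the two texts' shingle sets overlap beyond the threshold."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold
```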

Debugging strategies:

  • Persist every step output (queries, URLs, extracted evidence).
  • Add a “replay from cache” mode for deterministic iteration.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add markdown report output.
  • Add date filtering (“only sources after 2024”).

8.2 Intermediate Extensions

  • Add headless browsing for JS-heavy sites.
  • Add reranking of sources by credibility.

8.3 Advanced Extensions

  • Add claim verification against multiple sources.
  • Add long-running research jobs with resume capability.

9. Real-World Connections

9.1 Industry Applications

  • Competitive analysis and market research assistants.
  • Procurement assistants that compare vendors.
  • Research copilots for writing reports with citations.

9.2 Interview Relevance

  • Tool-using agents, termination logic, and citation-based truthfulness.

10. Resources

10.1 Essential Reading

  • Building AI Agents (Packt) — ReAct loops and tool orchestration (Ch. 2)
  • The LLM Engineering Handbook (Paul Iusztin) — RAG and eval patterns (Ch. 5, 8)

10.2 Tools & Documentation

  • Search API docs (Tavily/Serper/etc.)
  • Readability extraction patterns (boilerplate removal)
  • Previous: Project 4 (calendar action) — safe tool use patterns
  • Next: Project 6 (multi-tool assistant) — generalized routing and orchestration

11. Self-Assessment Checklist

  • I can explain why citation mapping must be system-owned, not model-owned.
  • I can show a termination strategy and why it avoids looping.
  • I can reproduce a report using cached artifacts.
  • I can reduce hallucinations by tightening evidence requirements.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Search + fetch + extraction pipeline
  • Report with citations and a comparison table
  • Caching to avoid repeated fetches

Full Completion:

  • Iterative planner with stop logic and source diversity
  • Replay mode with saved HTML/evidence artifacts

Excellence (Going Above & Beyond):

  • Multi-source claim verification and contradiction handling

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.