Project 5: The Web Researcher Agent (Search & Synthesis)

Build an agent that iteratively searches the web, visits multiple sources, extracts evidence, and produces a cited comparison report.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 3: Advanced |
| Time Estimate | 20–30 hours |
| Language | Python (Alternatives: Go, TypeScript) |
| Prerequisites | HTTP, basic scraping, prompt discipline, working with rate limits |
| Key Topics | iterative search, browsing/extraction, citation discipline, termination conditions, source ranking |

1. Learning Objectives

By completing this project, you will:

  1. Implement an agent loop that chooses what to search next based on results.
  2. Build a robust extraction pipeline (HTML → cleaned text → key claims).
  3. Enforce citation discipline (every claim traces to a source).
  4. Design stopping criteria ("enough evidence") to avoid infinite browsing.
  5. Detect and filter low-quality sources (SEO spam, duplicates).

2. Theoretical Foundation

2.1 Core Concepts

  • Iterative search: One query rarely suffices. Agents refine queries as they learn (query expansion, diversification).
  • Evidence vs narrative: A research agent must separate "claims" from "citations" and keep the mapping.
  • Context compression: Web pages are large; you must summarize/extract before feeding the LLM.
  • Termination conditions: Without explicit stop logic, agents either stop too early or loop forever.
  • Truthfulness constraints: Agents hallucinate most when they synthesize without evidence; strong source handling reduces this.

2.2 Why This Matters

Web research is a canonical โ€œagent taskโ€: multi-step, ambiguous, tool-heavy, and truth-sensitive. These patterns carry directly into personal assistants that plan purchases, compare services, or gather travel info.

2.3 Common Misconceptions

  • "The model can just browse once and summarize." High quality requires multiple, diverse sources and cross-checking.
  • "Citations fix hallucinations." Only if you build citations from retrieved sources and forbid uncited claims.
  • "Scraping is trivial." Dynamic sites and paywalls require fallback strategies and careful selection.

3. Project Specification

3.1 What You Will Build

A CLI tool that takes a question like:

  • "Best 3 mechanical keyboards for programmers under $100"
  • "NVIDIA stock forecast for 2025"
  • "Compare three privacy-focused note apps"

and produces:

  • A short summary
  • A comparison table (features, price, pros/cons)
  • A citations list with URLs per claim/table row

3.2 Functional Requirements

  1. Search tool: perform web search via an API (or your own crawler).
  2. Fetch tool: download HTML and extract readable text (both tools are sketched after this list).
  3. Extraction: pull structured facts (price, features, dates) and direct quotes when needed.
  4. Synthesis: produce a report with citations.
  5. Stop condition: halt once coverage/diversity is sufficient or a confidence threshold is met.
  6. Cache: avoid repeatedly fetching the same URLs.
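
A minimal sketch of the search and fetch tools from requirements 1-2, assuming the requests and beautifulsoup4 packages from Section 5.1. The endpoint URL and response shape are placeholders; adapt them to whichever provider (Tavily, Serper, etc.) you choose.

from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass(frozen=True)
class SearchResult:
    title: str
    url: str
    snippet: str


def search(query: str, api_key: str) -> list[SearchResult]:
    """Query a search API and normalize results (endpoint is hypothetical)."""
    resp = requests.post(
        "https://api.example-search.com/v1/search",  # placeholder endpoint
        json={"q": query, "num_results": 10},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=15,
    )
    resp.raise_for_status()
    return [
        SearchResult(r["title"], r["url"], r.get("snippet", ""))
        for r in resp.json()["results"]  # response shape varies by provider
    ]


def fetch_text(url: str) -> str:
    """Download a page and return readable text with obvious boilerplate removed."""
    resp = requests.get(url, timeout=15, headers={"User-Agent": "web-researcher/0.1"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()  # crude nav/ads removal; a readability library does better
    return " ".join(soup.get_text(" ").split())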

3.3 Non-Functional Requirements

  • Safety/ethics: respect robots and terms; avoid aggressive scraping.
  • Robustness: handle timeouts and partial failures; degrade gracefully.
  • Reproducibility: store fetched pages and intermediate extractions for replay.
  • Truthfulness: refuse to answer if evidence is weak or contradictory.

3.4 Example Usage / Output

python researcher.py "Best 3 mechanical keyboards under $100 for programmers"

Output shape:

Summary: …
Table: …
Sources:
 [1] https://…
 [2] https://…

4. Solution Architecture

4.1 High-Level Design

┌────────────────┐   plan/query   ┌─────────────────┐
│ CLI/UI         │───────────────▶│ Planner (Agent) │
└────────────────┘                └────────┬────────┘
                                           │ chooses tools
                                           ▼
  ┌─────────────────┐  results  ┌────────────────┐  pages  ┌────────────────┐
  │ Search Tool     │──────────▶│ URL Selector   │────────▶│ Fetch+Extract  │
  └─────────────────┘           └────────────────┘         └───────┬────────┘
                                                                   │
                                                                   ▼
                                                          ┌─────────────────┐
                                                          │ Evidence Store  │
                                                          └────────┬────────┘
                                                                   ▼
                                                          ┌─────────────────┐
                                                          │ Synthesizer     │
                                                          │ (cited report)  │
                                                          └─────────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Planner | decide next search/fetch | bounded loop; diversity goals |
| Search | get candidate URLs | provider API; rate limits |
| Fetch/Extract | page text + key facts | readability extraction, caching |
| Evidence store | track claims + sources | schema that preserves provenance |
| Synthesizer | generate report from evidence | forbid uncited claims |

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float  # your heuristic score
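
One way to make provenance non-optional is to reject unsourced evidence at write time. A minimal companion store, as a sketch:

class EvidenceStore:
    """Owns the claim -> source mapping; uncited claims never get in."""

    def __init__(self) -> None:
        self.items: list[Evidence] = []

    def add(self, ev: Evidence) -> None:
        if not ev.source_url or not ev.snippet:
            raise ValueError(f"uncited claim rejected: {ev.claim!r}")
        self.items.append(ev)

    def for_url(self, url: str) -> list[Evidence]:
        return [e for e in self.items if e.source_url == url]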

4.4 Algorithm Overview

Key Algorithm: iterative research

  1. Generate initial search queries (diverse: broad + specific).
  2. Search and rank URLs; dedupe domains and near-duplicates.
  3. Fetch top URLs; extract text and candidate facts.
  4. Store evidence as (claim, url, snippet).
  5. Decide whether coverage is sufficient; otherwise refine queries and repeat.
  6. Synthesize report strictly from evidence store.

Complexity Analysis:

  • Time: O(searches + fetched_pages); network-bound, so latency dominates
  • Space: O(pages_cached + evidence_items)
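
Putting the six steps together, the loop might look like the sketch below. initial_queries and refine_queries are assumed stubs; search and fetch_text are sketched in Section 3.2, EvidenceStore in Section 4.3, and extract_evidence and coverage_sufficient in Phases 2 and 3.

API_KEY = "..."  # your search provider key


def research(question: str, max_steps: int = 8) -> list[Evidence]:
    store = EvidenceStore()
    queries = initial_queries(question)    # step 1: broad + specific (stub)
    seen: set[str] = set()
    for _ in range(max_steps):             # hard cap prevents infinite browsing
        urls: list[str] = []
        for q in queries:
            urls.extend(r.url for r in search(q, API_KEY))
        # step 2: dedupe, skip already-visited URLs, take a few per round
        urls = [u for u in dict.fromkeys(urls) if u not in seen][:5]
        for url in urls:                   # steps 3-4: fetch, extract, store
            seen.add(url)
            for ev in extract_evidence(question, url, fetch_text(url)):
                store.add(ev)
        if coverage_sufficient(question, store):   # step 5: enough evidence?
            break
        queries = refine_queries(question, store)  # step 5: refine and repeat
    return store.items                     # step 6: synthesizer reads only this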

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pydantic rich

5.2 Project Structure

web-researcher/
├── src/
│   ├── cli.py
│   ├── planner.py
│   ├── search.py
│   ├── fetch.py
│   ├── extract.py
│   ├── evidence.py
│   └── synthesize.py
└── data/
    └── cache/

5.3 Implementation Phases

Phase 1: Search + fetch + cache (6–8h)

Goals:

  • Get URLs and fetch pages reliably.

Tasks:

  1. Implement one search backend.
  2. Add caching keyed by URL.
  3. Extract readable text (drop nav/ads).

Checkpoint: You can fetch 5–10 pages and print clean text lengths + titles.
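
For task 2, a file cache keyed by a hash of the URL is enough. A sketch, assuming the data/cache directory from Section 5.2 and complementing the fetch_text sketch from Section 3.2:

import hashlib
import pathlib

import requests

CACHE_DIR = pathlib.Path("data/cache")


def cached_fetch(url: str) -> str:
    """Fetch a page at most once; later calls replay the saved copy."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")  # replay: no network hit
    html = requests.get(url, timeout=15).text
    path.write_text(html, encoding="utf-8")
    return html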

Phase 2: Evidence extraction + synthesis (6–10h)

Goals:

  • Extract comparable facts and synthesize with citations.

Tasks:

  1. Create evidence schema; store per URL.
  2. Add LLM extraction prompt that returns structured facts.
  3. Generate a comparison table from evidence.

Checkpoint: Report includes sources and avoids uncited claims.
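
One sketch of task 2's extraction prompt: force JSON output, then verify each snippet actually occurs in the page and drop anything the model invented. call_llm stands in for whatever LLM client you use, and the 0.5 confidence is an arbitrary starting heuristic.

import json

EXTRACT_PROMPT = """Given the page text below, list facts relevant to:
{question}

Return a JSON list of objects with keys "claim" (one sentence) and
"snippet" (a verbatim quote from the page that supports the claim).
Return [] if the page contains nothing relevant. Do not invent facts.

PAGE TEXT:
{text}
"""


def extract_evidence(question: str, url: str, text: str) -> list[Evidence]:
    # truncation is the crudest form of context compression (Section 2.1)
    raw = call_llm(EXTRACT_PROMPT.format(question=question, text=text[:8000]))
    facts = json.loads(raw)  # assumes the model complied; add retries in practice
    # keep only facts whose snippet really occurs in the page
    return [
        Evidence(f["claim"], url, f["snippet"], confidence=0.5)
        for f in facts
        if f.get("snippet") and f["snippet"] in text
    ]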

Phase 3: Iterative planning + stop logic (6–12h)

Goals:

  • Make the agent decide "what next" and "when to stop".

Tasks:

  1. Add planner loop with max steps.
  2. Add diversity heuristics (domains, publication dates).
  3. Add contradiction detection and "insufficient evidence" refusal.

Checkpoint: Agent improves results vs one-shot search and stops reliably.
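
A coverage heuristic can be as small as counting claims and distinct domains; the thresholds below are illustrative, not tuned.

from urllib.parse import urlparse


def coverage_sufficient(question: str, store: EvidenceStore,
                        min_claims: int = 12, min_domains: int = 4) -> bool:
    """Stop once there are enough claims from enough distinct domains."""
    domains = {urlparse(e.source_url).netloc for e in store.items}
    return len(store.items) >= min_claims and len(domains) >= min_domains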

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Fetch strategy | requests vs headless browser | requests first | simpler; add browser later |
| Evidence | freeform notes vs schema | schema | ensures provenance/citations |
| Stop condition | fixed steps vs coverage | coverage + max steps | avoids looping and under-research |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | extraction/cleanup | remove boilerplate, dedupe URLs |
| Replay | deterministic tests | run against saved HTML fixtures |
| Safety | citations | fail build if report has uncited claims |

6.2 Critical Test Cases

  1. Duplicate source: same URL appears twice → only fetched once (a pytest sketch follows this list).
  2. Dynamic site: fetch fails → agent chooses alternate sources.
  3. Citation coverage: every table row has at least one citation.
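
Test case 1 can be pinned down with pytest's monkeypatch fixture; this sketch assumes cached_fetch from Phase 1 lives in the src/fetch.py module of Section 5.2 (adjust the import to your layout).

import requests

from src import fetch


def test_duplicate_url_fetched_once(monkeypatch, tmp_path):
    monkeypatch.setattr(fetch, "CACHE_DIR", tmp_path)  # isolate the cache
    calls = []

    class FakeResponse:
        text = "<html>fixture</html>"

    monkeypatch.setattr(requests, "get",
                        lambda url, **kw: calls.append(url) or FakeResponse())
    fetch.cached_fetch("https://example.com/a")
    fetch.cached_fetch("https://example.com/a")
    assert calls == ["https://example.com/a"]  # second call hit the cache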

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| SEO spam | low-quality sources dominate | domain filters, diversity rules, exclude lists |
| Context overflow | prompts too long | compress per page, limit evidence items |
| Hallucinated facts | "sounds right" but wrong | only allow facts with stored evidence |
| Infinite loop | agent keeps searching | coverage-based termination + max steps |

Debugging strategies:

  • Persist every step output (queries, URLs, extracted evidence).
  • Add a "replay from cache" mode for deterministic iteration; a minimal trace logger is sketched below.
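
A one-function trace logger covers both points: each agent action appends one JSON line that a replay mode can read back. The data/trace.jsonl path is an assumed convention.

import json
import pathlib
import time

TRACE = pathlib.Path("data/trace.jsonl")


def log_step(kind: str, payload: dict) -> None:
    """Append one JSON line per agent action: searches, fetches, extractions."""
    record = {"ts": time.time(), "kind": kind, **payload}
    with TRACE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# e.g. log_step("search", {"query": query, "urls": [r.url for r in results]})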

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add markdown report output.
  • Add date filtering ("only sources after 2024").

8.2 Intermediate Extensions

  • Add headless browsing for JS-heavy sites.
  • Add reranking of sources by credibility.

8.3 Advanced Extensions

  • Add claim verification against multiple sources.
  • Add long-running research jobs with resume capability.

9. Real-World Connections

9.1 Industry Applications

  • Competitive analysis and market research assistants.
  • Procurement assistants that compare vendors.
  • Research copilots for writing reports with citations.

9.2 Interview Relevance

  • Tool-using agents, termination logic, and citation-based truthfulness.

10. Resources

10.1 Essential Reading

  • Building AI Agents (Packt) — ReAct loops and tool orchestration (Ch. 2)
  • LLM Engineer's Handbook (Paul Iusztin) — RAG and eval patterns (Ch. 5, 8)

10.2 Tools & Documentation

  • Search API docs (Tavily/Serper/etc.)
  • Readability extraction patterns (boilerplate removal)
  • Previous: Project 4 (calendar action) — safe tool use patterns
  • Next: Project 6 (multi-tool assistant) — generalized routing and orchestration

11. Self-Assessment Checklist

  • I can explain why citation mapping must be system-owned, not model-owned.
  • I can show a termination strategy and why it avoids looping.
  • I can reproduce a report using cached artifacts.
  • I can reduce hallucinations by tightening evidence requirements.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Search + fetch + extraction pipeline
  • Report with citations and a comparison table
  • Caching to avoid repeated fetches

Full Completion:

  • Iterative planner with stop logic and source diversity
  • Replay mode with saved HTML/evidence artifacts

Excellence (Going Above & Beyond):

  • Multi-source claim verification and contradiction handling

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.