Project 5: The Web Researcher Agent (Search & Synthesis)
Build an agent that iteratively searches the web, visits multiple sources, extracts evidence, and produces a cited comparison report.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 20–30 hours |
| Language | Python (Alternatives: Go, TypeScript) |
| Prerequisites | HTTP, basic scraping, prompt discipline, working with rate limits |
| Key Topics | iterative search, browsing/extraction, citation discipline, termination conditions, source ranking |
1. Learning Objectives
By completing this project, you will:
- Implement an agent loop that chooses what to search next based on results.
- Build a robust extraction pipeline (HTML → cleaned text → key claims).
- Enforce citation discipline (every claim traces to a source).
- Design stopping criteria (“enough evidence”) to avoid infinite browsing.
- Detect and filter low-quality sources (SEO spam, duplicates).
2. Theoretical Foundation
2.1 Core Concepts
- Iterative search: One query rarely suffices. Agents refine queries as they learn (query expansion, diversification).
- Evidence vs narrative: A research agent must separate “claims” from “citations” and keep the mapping between them.
- Context compression: Web pages are large; you must summarize/extract before feeding the LLM.
- Termination conditions: Without explicit stop logic, agents either stop too early or loop forever.
- Truthfulness constraints: Agents hallucinate most when they synthesize without evidence; strong source handling reduces this.
2.2 Why This Matters
Web research is a canonical “agent task”: multi-step, ambiguous, tool-heavy, and truth-sensitive. These patterns carry directly into personal assistants that plan purchases, compare services, or gather travel info.
2.3 Common Misconceptions
- “The model can just browse once and summarize.” High quality requires multiple, diverse sources and cross-checking.
- “Citations fix hallucinations.” Only if you build citations from retrieved sources and forbid uncited claims.
- “Scraping is trivial.” Dynamic sites and paywalls require fallback strategies and careful source selection.
3. Project Specification
3.1 What You Will Build
A CLI tool that takes a question like:
- “Best 3 mechanical keyboards for programmers under $100”
- “NVIDIA stock forecast for 2025”
- “Compare three privacy-focused note apps”
and produces:
- A short summary
- A comparison table (features, price, pros/cons)
- A citations list with URLs per claim/table row
3.2 Functional Requirements
- Search tool: perform web search via an API (or your own crawler).
- Fetch tool: download HTML and extract readable text.
- Extraction: pull structured facts (price, features, dates) and direct quotes when needed.
- Synthesis: produce a report with citations.
- Stop condition: halt once coverage/diversity targets are met or a confidence threshold is reached.
- Cache: avoid repeatedly fetching the same URLs.
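The cache requirement above can be sketched as a fetch function keyed by a hash of the URL. This is a minimal illustration, not the project's required design: the `cached_fetch` name, the `data/cache` path, and the SHA-256 key scheme are all assumptions, and `requests` (from the setup section) is imported lazily so cache hits need no network stack.

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path("data/cache")

def cached_fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL, reusing an on-disk cache keyed by a hash of the URL."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():  # cache hit: no network call at all
        return path.read_text(encoding="utf-8")
    import requests  # third-party; imported lazily so cache hits work offline
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "web-researcher/0.1"})
    resp.raise_for_status()
    path.write_text(resp.text, encoding="utf-8")
    return resp.text
```

Hashing the URL sidesteps filesystem-unsafe characters, and keeping the cache on disk gives you the replay artifacts that the reproducibility requirement asks for.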
3.3 Non-Functional Requirements
- Safety/ethics: respect robots.txt and site terms; avoid aggressive scraping.
- Robustness: handle timeouts and partial failures; degrade gracefully.
- Reproducibility: store fetched pages and intermediate extractions for replay.
- Truthfulness: refuse to answer if evidence is weak or contradictory.
3.4 Example Usage / Output
python researcher.py "Best 3 mechanical keyboards under $100 for programmers"
Output shape:
Summary: …
Table: …
Sources:
[1] https://…
[2] https://…
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐   plan/query   ┌─────────────────┐
│    CLI/UI    │───────────────▶│ Planner (Agent) │
└──────────────┘                └────────┬────────┘
                                         │ chooses tools
                                         ▼
┌─────────────┐  results  ┌──────────────┐  pages  ┌───────────────┐
│ Search Tool │──────────▶│ URL Selector │────────▶│ Fetch+Extract │
└─────────────┘           └──────────────┘         └───────┬───────┘
                                                           │
                                                           ▼
                                                  ┌────────────────┐
                                                  │ Evidence Store │
                                                  └───────┬────────┘
                                                          ▼
                                                  ┌────────────────┐
                                                  │  Synthesizer   │
                                                  │ (cited report) │
                                                  └────────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Planner | decide next search/fetch | bounded loop; diversity goals |
| Search | get candidate URLs | provider API; rate limits |
| Fetch/Extract | page text + key facts | readability extraction, caching |
| Evidence store | track claims + sources | schema that preserves provenance |
| Synthesizer | generate report from evidence | forbid uncited claims |
4.3 Data Structures
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float  # your heuristic score
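Building on the Evidence dataclass, a store that preserves provenance might look like the sketch below. The `EvidenceStore` class and its method names are illustrative, not part of the spec; the point is that claim-to-source lookups are owned by the system, so the synthesizer can be forbidden from emitting anything the store cannot back.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    claim: str
    source_url: str
    snippet: str
    confidence: float  # your heuristic score

class EvidenceStore:
    """Hypothetical store keeping the claim -> source mapping system-owned."""

    def __init__(self) -> None:
        self._items: list[Evidence] = []

    def add(self, ev: Evidence) -> None:
        self._items.append(ev)

    def sources_for(self, claim: str) -> list[str]:
        """All URLs backing a claim; an empty list means it is uncitable."""
        return sorted({e.source_url for e in self._items if e.claim == claim})

    def uncited(self, claims: list[str]) -> list[str]:
        """Claims in a draft report that have no stored evidence."""
        return [c for c in claims if not self.sources_for(c)]
```

A synthesis step can then call `uncited()` on the draft's claims and fail (or refuse) if the list is non-empty.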
4.4 Algorithm Overview
Key Algorithm: iterative research
- Generate initial search queries (diverse: broad + specific).
- Search and rank URLs; dedupe domains and near-duplicates.
- Fetch top URLs; extract text and candidate facts.
- Store evidence as (claim, url, snippet).
- Decide whether coverage is sufficient; otherwise refine queries and repeat.
- Synthesize report strictly from evidence store.
Complexity Analysis:
- Time: O(searches + fetched_pages) network-bound
- Space: O(pages_cached + evidence_items)
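The iterative loop above can be sketched as follows. The tool functions (`search`, `fetch_and_extract`, `refine_queries`) are injected placeholders, not real APIs, and the coverage signal here (count of distinct source URLs) is deliberately crude; a real implementation would fold in domain diversity and confidence.

```python
from typing import Callable

def research(
    question: str,
    search: Callable[[str], list[str]],
    fetch_and_extract: Callable[[str], list[dict]],
    refine_queries: Callable[[str, list[dict]], list[str]],
    max_steps: int = 5,
    target_sources: int = 4,
) -> list[dict]:
    """Skeleton of the iterative research loop; tools are injected for testability."""
    evidence: list[dict] = []
    seen: set[str] = set()
    queries = [question]  # start broad; refinement adds specific variants
    for _ in range(max_steps):  # hard step cap prevents infinite browsing
        for q in queries:
            for url in search(q):
                if url in seen:
                    continue  # dedupe: never fetch the same URL twice
                seen.add(url)
                evidence.extend(fetch_and_extract(url))
        # crude coverage signal: distinct source URLs gathered so far
        if len({e["url"] for e in evidence}) >= target_sources:
            break
        queries = refine_queries(question, evidence)
    return evidence
```

Injecting the tools as callables keeps the loop deterministic under test: you can run it against stubs or cached fixtures without touching the network.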
5. Implementation Guide
5.1 Development Environment Setup
python -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 pydantic rich
5.2 Project Structure
web-researcher/
โโโ src/
โ โโโ cli.py
โ โโโ planner.py
โ โโโ search.py
โ โโโ fetch.py
โ โโโ extract.py
โ โโโ evidence.py
โ โโโ synthesize.py
โโโ data/
โโโ cache/
5.3 Implementation Phases
Phase 1: Search + fetch + cache (6–8h)
Goals:
- Get URLs and fetch pages reliably.
Tasks:
- Implement one search backend.
- Add caching keyed by URL.
- Extract readable text (drop nav/ads).
Checkpoint: You can fetch 5–10 pages and print clean text lengths + titles.
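For the "extract readable text" task, production code would typically lean on BeautifulSoup (installed in the setup section) or a readability library, but the core idea can be shown with only the standard library: walk the HTML and drop everything inside boilerplate containers. The tag list and class below are a simplified sketch, not a complete extractor.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude readable-text extractor: skips script/style/nav/etc. containers."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0   # >0 while inside a boilerplate container
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def readable_text(html: str) -> str:
    """Return the visible text of a page with nav/ads containers dropped."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Real pages bury content in `div` soup, so heuristics like text density per node (what readability libraries do) are the natural next step.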
Phase 2: Evidence extraction + synthesis (6–10h)
Goals:
- Extract comparable facts and synthesize with citations.
Tasks:
- Create evidence schema; store per URL.
- Add LLM extraction prompt that returns structured facts.
- Generate a comparison table from evidence.
Checkpoint: Report includes sources and avoids uncited claims.
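One way to satisfy the Phase 2 checkpoint is to render the comparison table strictly from stored evidence, so every row carries citation indices into a shared source list. The row field names (`name`, `price`, `notes`, `urls`) are illustrative, not a prescribed schema.

```python
def render_table(rows: list[dict], sources: list[str]) -> str:
    """Render a markdown comparison table; each row cites its backing sources."""
    lines = ["| Product | Price | Notes | Sources |", "|---|---|---|---|"]
    for row in rows:
        # map each row's URLs to 1-based indices into the shared source list
        refs = ", ".join(f"[{sources.index(u) + 1}]" for u in row["urls"])
        lines.append(f"| {row['name']} | {row['price']} | {row['notes']} | {refs} |")
    return "\n".join(lines)
```

Because the renderer only sees evidence rows, a claim with no `urls` simply cannot appear with a citation, which makes the "no uncited claims" rule enforceable at the output layer.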
Phase 3: Iterative planning + stop logic (6–12h)
Goals:
- Make the agent decide “what next” and “when to stop”.
Tasks:
- Add planner loop with max steps.
- Add diversity heuristics (domains, publication dates).
- Add contradiction detection and an “insufficient evidence” refusal.
Checkpoint: Agent improves results vs one-shot search and stops reliably.
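The recommended stop condition (coverage plus a max-step cap) can be a single predicate. This is a sketch under assumed thresholds: `min_domains` stands in for whatever diversity heuristic you adopt, and counting distinct hostnames is the simplest possible coverage measure.

```python
from urllib.parse import urlparse

def should_stop(step: int, evidence_urls: list[str],
                max_steps: int = 8, min_domains: int = 3) -> bool:
    """Stop when the step budget is spent or enough distinct domains are covered."""
    domains = {urlparse(u).netloc for u in evidence_urls}
    return step >= max_steps or len(domains) >= min_domains
```

Combining both conditions is what avoids the failure modes named above: the domain target prevents under-research, and the step cap guarantees termination even when coverage never improves.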
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Fetch strategy | requests vs headless browser | requests first | simpler; add browser later |
| Evidence | freeform notes vs schema | schema | ensures provenance/citations |
| Stop condition | fixed steps vs coverage | coverage + max steps | avoids looping and under-research |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | extraction/cleanup | remove boilerplate, dedupe URLs |
| Replay | deterministic tests | run against saved HTML fixtures |
| Safety | citations | fail build if report has uncited claims |
6.2 Critical Test Cases
- Duplicate source: same URL appears twice → only fetched once.
- Dynamic site: fetch fails → agent chooses alternate sources.
- Citation coverage: every table row has at least one citation.
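The duplicate-source case is easy to pin down as a unit test. The normalization below (treating `http`/`https` and a trailing slash as equivalent) is one plausible choice, not the only correct one; the `dedupe_urls` helper is hypothetical.

```python
def dedupe_urls(urls: list[str]) -> list[str]:
    """Order-preserving dedupe; treats http/https and trailing slash as equal."""
    seen: set[str] = set()
    out: list[str] = []
    for u in urls:
        key = u.rstrip("/").replace("https://", "http://", 1)
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out

def test_duplicate_source_fetched_once():
    urls = ["https://example.com/review", "https://example.com/review/",
            "http://example.com/review", "https://other.com/post"]
    assert dedupe_urls(urls) == ["https://example.com/review",
                                 "https://other.com/post"]
```

The dynamic-site and citation-coverage cases fit the same pattern: replay against saved fixtures and assert on the agent's observable choices rather than on live network behavior.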
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| SEO spam | low-quality sources dominate | domain filters, diversity rules, exclude lists |
| Context overflow | prompts too long | compress per page, limit evidence items |
| Hallucinated facts | “sounds right” but wrong | only allow facts with stored evidence |
| Infinite loop | agent keeps searching | coverage-based termination + max steps |
Debugging strategies:
- Persist every step output (queries, URLs, extracted evidence).
- Add a “replay from cache” mode for deterministic iteration.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add markdown report output.
- Add date filtering (โonly sources after 2024โ).
8.2 Intermediate Extensions
- Add headless browsing for JS-heavy sites.
- Add reranking of sources by credibility.
8.3 Advanced Extensions
- Add claim verification against multiple sources.
- Add long-running research jobs with resume capability.
9. Real-World Connections
9.1 Industry Applications
- Competitive analysis and market research assistants.
- Procurement assistants that compare vendors.
- Research copilots for writing reports with citations.
9.2 Interview Relevance
- Tool-using agents, termination logic, and citation-based truthfulness.
10. Resources
10.1 Essential Reading
- Building AI Agents (Packt) – ReAct loops and tool orchestration (Ch. 2)
- The LLM Engineering Handbook (Paul Iusztin) – RAG and eval patterns (Ch. 5, 8)
10.2 Tools & Documentation
- Search API docs (Tavily/Serper/etc.)
- Readability extraction patterns (boilerplate removal)
10.3 Related Projects in This Series
- Previous: Project 4 (calendar action) – safe tool use patterns
- Next: Project 6 (multi-tool assistant) – generalized routing and orchestration
11. Self-Assessment Checklist
- I can explain why citation mapping must be system-owned, not model-owned.
- I can show a termination strategy and why it avoids looping.
- I can reproduce a report using cached artifacts.
- I can reduce hallucinations by tightening evidence requirements.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Search + fetch + extraction pipeline
- Report with citations and a comparison table
- Caching to avoid repeated fetches
Full Completion:
- Iterative planner with stop logic and source diversity
- Replay mode with saved HTML/evidence artifacts
Excellence (Going Above & Beyond):
- Multi-source claim verification and contradiction handling
This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.