Project 8: Community Detection & Summaries

Build a system that detects communities of related entities in your knowledge graph and generates hierarchical summaries, enabling efficient retrieval at multiple abstraction levels.

Quick Reference

Attribute Value
Difficulty Level 4: Expert
Time Estimate 2-3 weeks (30-40 hours)
Language Python (Alternatives: TypeScript with Neo4j driver)
Prerequisites Projects 1-7, graph algorithms, clustering
Key Topics Community detection, Leiden/Louvain algorithms, hierarchical clustering, graph summarization, multi-level abstraction

1. Learning Objectives

By completing this project, you will:

  1. Understand community detection algorithms (Leiden, Louvain).
  2. Apply graph clustering to knowledge graphs.
  3. Build hierarchical community structures.
  4. Generate LLM summaries at each community level.
  5. Use community summaries for efficient retrieval.

2. Theoretical Foundation

2.1 Core Concepts

  • Community Detection: Finding groups of densely connected nodes in a graph. Communities often correspond to topics, themes, or domains.

  • Leiden Algorithm: Improved version of Louvain that guarantees well-connected communities. O(n log n) complexity.

  • Modularity: Metric measuring the strength of community division. Higher modularity = better clustering.

  • Hierarchical Communities: Communities nested within larger communities, forming a tree structure.

  • Community Summaries: LLM-generated descriptions of what a community represents.

2.2 Why This Matters

Knowledge graphs grow too large for simple traversal:

  • 10,000 entities → direct retrieval too slow
  • Community detection creates “topic clusters”
  • Summaries enable two-stage retrieval: find community → search within

This is exactly how Graphiti and Microsoft GraphRAG work.

2.3 Common Misconceptions

  • “K-means works for graphs.” K-means needs vectors; use graph-native algorithms.
  • “Communities are disjoint.” Entities can belong to multiple communities (overlapping).
  • “One level is enough.” Hierarchical summaries enable multi-resolution queries.

2.4 ASCII Diagram: Hierarchical Communities

FULL KNOWLEDGE GRAPH
════════════════════════════════════════════════════════════════

  Alice ──WORKS_AT──► Acme ◄──WORKS_AT── Bob
    │                  │                   │
 KNOWS              OWNS              MANAGES
    │                  │                   │
    ▼                  ▼                   ▼
  Carol          API_Project ◄───────── DevTeam
    │                  │                   │
 WORKS_AT           USES                USES
    │                  │                   │
    ▼                  ▼                   ▼
  TechCo          Python ◄──────────── FastAPI
                       │
                    RUNS_ON
                       │
                       ▼
                     Linux


COMMUNITY DETECTION (Leiden)
════════════════════════════════════════════════════════════════

Level 0 (Leaf Communities):
┌─────────────────────┐  ┌─────────────────────┐  ┌────────────────┐
│ Community A         │  │ Community B         │  │ Community C    │
│ (Acme Employees)    │  │ (Tech Stack)        │  │ (TechCo)       │
│                     │  │                     │  │                │
│ • Alice             │  │ • Python            │  │ • Carol        │
│ • Bob               │  │ • FastAPI           │  │ • TechCo       │
│ • Acme              │  │ • Linux             │  │                │
│ • API_Project       │  │                     │  │                │
│ • DevTeam           │  │                     │  │                │
└─────────────────────┘  └─────────────────────┘  └────────────────┘

Level 1 (Merged Communities):
┌──────────────────────────────────────────────┐  ┌────────────────┐
│ Community AB (Acme Tech Ecosystem)           │  │ Community C    │
│                                              │  │ (TechCo)       │
│ Summary: "Acme Corp engineering team         │  │                │
│ working on API project using Python/FastAPI  │  │ Summary: ...   │
│ stack. Key people: Alice (engineer),         │  │                │
│ Bob (manager)"                               │  │                │
└──────────────────────────────────────────────┘  └────────────────┘

Level 2 (Root):
┌────────────────────────────────────────────────────────────────┐
│ Community ABC (Full Graph)                                      │
│                                                                 │
│ Summary: "Knowledge graph covering two companies (Acme, TechCo) │
│ and their employees, projects, and technology stacks."          │
└────────────────────────────────────────────────────────────────┘


RETRIEVAL USING COMMUNITIES
════════════════════════════════════════════════════════════════

Query: "Who works on Python projects?"

Step 1: Search community summaries
  → Match: Community B (Tech Stack) mentions Python
  → Match: Community AB summary mentions Python/FastAPI

Step 2: Search within matched communities
  → Find: Alice, Bob work on API_Project which USES Python

Result: Alice, Bob (with full context from community)

3. Project Specification

3.1 What You Will Build

A Python system that:

  • Detects communities in a Neo4j knowledge graph
  • Builds hierarchical community structure
  • Generates LLM summaries at each level
  • Provides community-aware retrieval

3.2 Functional Requirements

  1. Detect communities: detector.find_communities(graph) → List[Community]
  2. Build hierarchy: detector.build_hierarchy(communities) → CommunityTree
  3. Generate summaries: summarizer.summarize_community(community) → str
  4. Store communities: store.save_communities(tree) → None
  5. Retrieve by community: retriever.find_relevant_communities(query) → List[Community]
  6. Search within community: retriever.search_community(community, query) → List[Entity]

3.3 Example Usage / Output

from community_detector import CommunityDetector, CommunitySummarizer

detector = CommunityDetector(neo4j_driver)
summarizer = CommunitySummarizer(llm_client)

# Detect communities
communities = detector.find_communities(resolution=1.0)
print(f"Found {len(communities)} communities at level 0")

# Build hierarchy
tree = detector.build_hierarchy(communities, levels=3)
print(f"Hierarchy depth: {tree.depth}")

# Generate summaries
for community in tree.all_communities():
    summary = summarizer.summarize(community)
    print(f"Community {community.id}: {summary[:100]}...")

# Example output:
# Community c_0: "Acme Corp engineering team. Includes Alice (senior engineer),
#                Bob (team lead), and DevTeam working on API_Project..."
# Community c_1: "Technology stack cluster including Python, FastAPI, and Linux.
#                Used by Acme engineering projects..."

# Query using communities
retriever = CommunityRetriever(tree, embedder)
query = "What Python frameworks are used?"

relevant = retriever.find_relevant_communities(query, top_k=2)
for comm in relevant:
    print(f"Community: {comm.summary[:50]}...")
    entities = retriever.search_community(comm, query)
    print(f"  Entities: {[e.name for e in entities]}")

# Community: "Technology stack cluster including Python, FastAPI..."
#   Entities: ['FastAPI', 'Python']

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Neo4j     │────▶│  Community   │────▶│   Hierarchy  │
│    Graph     │     │  Detector    │     │   Builder    │
└──────────────┘     └──────────────┘     └──────────────┘
                                                │
                                                ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Community  │◀────│  Summarizer  │◀────│   Community  │
│    Store     │     │    (LLM)     │     │    Tree      │
└──────────────┘     └──────────────┘     └──────────────┘
       │
       ▼
┌──────────────┐
│  Retriever   │
└──────────────┘

4.2 Key Components

Component Responsibility Technology
CommunityDetector Run Leiden algorithm Neo4j GDS or python-igraph
HierarchyBuilder Create multi-level tree Custom algorithm
Summarizer Generate community descriptions LLM
CommunityStore Persist community structure Neo4j + SQLite
Retriever Two-stage community retrieval Embedding similarity

4.3 Data Models

from pydantic import BaseModel
from typing import Optional

class Community(BaseModel):
    id: str
    level: int
    entities: list[str]  # Entity IDs
    edges_internal: int  # Edges within community
    edges_external: int  # Edges to other communities
    modularity: float
    summary: str | None = None
    embedding: list[float] | None = None

class CommunityTree(BaseModel):
    root: Community
    children: dict[str, list[Community]]  # parent_id -> children
    depth: int

    def all_communities(self) -> list[Community]:
        """Traverse all communities in tree."""
        ...

    def get_level(self, level: int) -> list[Community]:
        """Get all communities at a specific level."""
        ...

5. Implementation Guide

5.1 Development Environment Setup

mkdir community-detection && cd community-detection
python -m venv .venv && source .venv/bin/activate
pip install neo4j igraph leidenalg sentence-transformers openai pydantic

5.2 Project Structure

community-detection/
├── src/
│   ├── detector.py        # Community detection
│   ├── hierarchy.py       # Hierarchy building
│   ├── summarizer.py      # LLM summarization
│   ├── store.py           # Persistence
│   ├── retriever.py       # Community-aware search
│   └── models.py          # Data models
├── tests/
│   ├── test_detection.py
│   └── test_retrieval.py
└── README.md

5.3 Implementation Phases

Phase 1: Community Detection (10-12h)

Goals:

  • Extract graph from Neo4j
  • Run Leiden algorithm
  • Tune resolution parameter

Tasks:

  1. Export Neo4j graph to igraph format
  2. Run Leiden with default parameters
  3. Experiment with resolution values
  4. Visualize communities

Checkpoint: Communities detected from graph.

Phase 2: Hierarchy and Summaries (10-12h)

Goals:

  • Build multi-level hierarchy
  • Generate summaries at each level

Tasks:

  1. Implement hierarchical Leiden (multiple resolutions)
  2. Build tree structure from levels
  3. Design summary prompt
  4. Generate and store summaries

Checkpoint: Hierarchical tree with summaries.

Phase 3: Retrieval (8-10h)

Goals:

  • Search communities by embedding
  • Search within communities

Tasks:

  1. Generate embeddings for summaries
  2. Implement community search
  3. Implement entity search within community
  4. Build two-stage retrieval pipeline

Checkpoint: Query returns relevant entities via communities.


6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Test algorithm components Leiden on small graph
Integration Test full pipeline Graph → communities → summaries
Quality Test retrieval accuracy Query matches correct community

6.2 Critical Test Cases

  1. Community quality: Modularity above threshold
  2. Hierarchy validity: All entities present at every level
  3. Summary accuracy: Summary mentions key entities
  4. Retrieval precision: Query finds relevant community

7. Common Pitfalls & Debugging

Pitfall Symptom Solution
Wrong resolution Too few/many communities Tune resolution parameter
Disconnected nodes Orphan communities of size 1 Filter or merge small communities
Summary hallucination Summary mentions non-entities Constrain prompt to actual entities
Hierarchy inconsistency Missing entities at levels Verify all entities present

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add community visualization
  • Implement incremental community update

8.2 Intermediate Extensions

  • Add overlapping community detection
  • Implement temporal communities (evolution)

8.3 Advanced Extensions

  • Add community-aware reasoning
  • Implement distributed community detection

9. Real-World Connections

9.1 Industry Applications

  • Microsoft GraphRAG: Exact community summary architecture
  • Graphiti: Community detection for entity clusters
  • Social Network Analysis: Friend groups, interest clusters

9.2 Interview Relevance

  • Explain modularity and community detection
  • Discuss two-stage retrieval benefits
  • Describe hierarchical summarization tradeoffs

10. Resources

10.1 Essential Reading

  • “From Local to Global” (GraphRAG paper) — Microsoft’s approach
  • Leiden Algorithm paper — Traag et al. 2019
  • igraph documentation — Python graph algorithms
  • Previous: Project 7 (Semantic Memory Synthesizer)
  • Next: Project 9 (Graphiti Framework Integration)

11. Self-Assessment Checklist

  • I understand modularity and what it measures
  • I can explain Leiden vs Louvain differences
  • I know how to tune resolution for my graph
  • I understand two-stage retrieval benefits

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Community detection working
  • Single-level summaries generated
  • Basic community search

Full Completion:

  • Hierarchical communities
  • Multi-level summaries
  • Two-stage retrieval pipeline

Excellence:

  • Incremental community updates
  • Temporal community tracking
  • Community-aware reasoning