Project 8: Community Detection & Summaries
Build a system that detects communities of related entities in your knowledge graph and generates hierarchical summaries, enabling efficient retrieval at multiple abstraction levels.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks (30-40 hours) |
| Language | Python (Alternatives: TypeScript with Neo4j driver) |
| Prerequisites | Projects 1-7, graph algorithms, clustering |
| Key Topics | Community detection, Leiden/Louvain algorithms, hierarchical clustering, graph summarization, multi-level abstraction |
1. Learning Objectives
By completing this project, you will:
- Understand community detection algorithms (Leiden, Louvain).
- Apply graph clustering to knowledge graphs.
- Build hierarchical community structures.
- Generate LLM summaries at each community level.
- Use community summaries for efficient retrieval.
2. Theoretical Foundation
2.1 Core Concepts
-
Community Detection: Finding groups of densely connected nodes in a graph. Communities often correspond to topics, themes, or domains.
-
Leiden Algorithm: Improved version of Louvain that guarantees well-connected communities. O(n log n) complexity.
-
Modularity: Metric measuring the strength of community division. Higher modularity = better clustering.
-
Hierarchical Communities: Communities nested within larger communities, forming a tree structure.
-
Community Summaries: LLM-generated descriptions of what a community represents.
2.2 Why This Matters
Knowledge graphs grow too large for simple traversal:
- 10,000 entities → direct retrieval too slow
- Community detection creates “topic clusters”
- Summaries enable two-stage retrieval: find community → search within
This is exactly how Graphiti and Microsoft GraphRAG work.
2.3 Common Misconceptions
- “K-means works for graphs.” K-means needs vectors; use graph-native algorithms.
- “Communities are disjoint.” Entities can belong to multiple communities (overlapping).
- “One level is enough.” Hierarchical summaries enable multi-resolution queries.
2.4 ASCII Diagram: Hierarchical Communities
FULL KNOWLEDGE GRAPH
════════════════════════════════════════════════════════════════
Alice ──WORKS_AT──► Acme ◄──WORKS_AT── Bob
│ │ │
KNOWS OWNS MANAGES
│ │ │
▼ ▼ ▼
Carol API_Project ◄───────── DevTeam
│ │ │
WORKS_AT USES USES
│ │ │
▼ ▼ ▼
TechCo Python ◄──────────── FastAPI
│
RUNS_ON
│
▼
Linux
COMMUNITY DETECTION (Leiden)
════════════════════════════════════════════════════════════════
Level 0 (Leaf Communities):
┌─────────────────────┐ ┌─────────────────────┐ ┌────────────────┐
│ Community A │ │ Community B │ │ Community C │
│ (Acme Employees) │ │ (Tech Stack) │ │ (TechCo) │
│ │ │ │ │ │
│ • Alice │ │ • Python │ │ • Carol │
│ • Bob │ │ • FastAPI │ │ • TechCo │
│ • Acme │ │ • Linux │ │ │
│ • API_Project │ │ │ │ │
│ • DevTeam │ │ │ │ │
└─────────────────────┘ └─────────────────────┘ └────────────────┘
Level 1 (Merged Communities):
┌──────────────────────────────────────────────┐ ┌────────────────┐
│ Community AB (Acme Tech Ecosystem) │ │ Community C │
│ │ │ (TechCo) │
│ Summary: "Acme Corp engineering team │ │ │
│ working on API project using Python/FastAPI │ │ Summary: ... │
│ stack. Key people: Alice (engineer), │ │ │
│ Bob (manager)" │ │ │
└──────────────────────────────────────────────┘ └────────────────┘
Level 2 (Root):
┌────────────────────────────────────────────────────────────────┐
│ Community ABC (Full Graph) │
│ │
│ Summary: "Knowledge graph covering two companies (Acme, TechCo) │
│ and their employees, projects, and technology stacks." │
└────────────────────────────────────────────────────────────────┘
RETRIEVAL USING COMMUNITIES
════════════════════════════════════════════════════════════════
Query: "Who works on Python projects?"
Step 1: Search community summaries
→ Match: Community B (Tech Stack) mentions Python
→ Match: Community AB summary mentions Python/FastAPI
Step 2: Search within matched communities
→ Find: Alice, Bob work on API_Project which USES Python
Result: Alice, Bob (with full context from community)
3. Project Specification
3.1 What You Will Build
A Python system that:
- Detects communities in a Neo4j knowledge graph
- Builds hierarchical community structure
- Generates LLM summaries at each level
- Provides community-aware retrieval
3.2 Functional Requirements
- Detect communities:
detector.find_communities(graph)→ List[Community] - Build hierarchy:
detector.build_hierarchy(communities)→ CommunityTree - Generate summaries:
summarizer.summarize_community(community)→ str - Store communities:
store.save_communities(tree)→ None - Retrieve by community:
retriever.find_relevant_communities(query)→ List[Community] - Search within community:
retriever.search_community(community, query)→ List[Entity]
3.3 Example Usage / Output
from community_detector import CommunityDetector, CommunitySummarizer
detector = CommunityDetector(neo4j_driver)
summarizer = CommunitySummarizer(llm_client)
# Detect communities
communities = detector.find_communities(resolution=1.0)
print(f"Found {len(communities)} communities at level 0")
# Build hierarchy
tree = detector.build_hierarchy(communities, levels=3)
print(f"Hierarchy depth: {tree.depth}")
# Generate summaries
for community in tree.all_communities():
summary = summarizer.summarize(community)
print(f"Community {community.id}: {summary[:100]}...")
# Example output:
# Community c_0: "Acme Corp engineering team. Includes Alice (senior engineer),
# Bob (team lead), and DevTeam working on API_Project..."
# Community c_1: "Technology stack cluster including Python, FastAPI, and Linux.
# Used by Acme engineering projects..."
# Query using communities
retriever = CommunityRetriever(tree, embedder)
query = "What Python frameworks are used?"
relevant = retriever.find_relevant_communities(query, top_k=2)
for comm in relevant:
print(f"Community: {comm.summary[:50]}...")
entities = retriever.search_community(comm, query)
print(f" Entities: {[e.name for e in entities]}")
# Community: "Technology stack cluster including Python, FastAPI..."
# Entities: ['FastAPI', 'Python']
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Neo4j │────▶│ Community │────▶│ Hierarchy │
│ Graph │ │ Detector │ │ Builder │
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Community │◀────│ Summarizer │◀────│ Community │
│ Store │ │ (LLM) │ │ Tree │
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ Retriever │
└──────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| CommunityDetector | Run Leiden algorithm | Neo4j GDS or python-igraph |
| HierarchyBuilder | Create multi-level tree | Custom algorithm |
| Summarizer | Generate community descriptions | LLM |
| CommunityStore | Persist community structure | Neo4j + SQLite |
| Retriever | Two-stage community retrieval | Embedding similarity |
4.3 Data Models
from pydantic import BaseModel
from typing import Optional
class Community(BaseModel):
id: str
level: int
entities: list[str] # Entity IDs
edges_internal: int # Edges within community
edges_external: int # Edges to other communities
modularity: float
summary: str | None = None
embedding: list[float] | None = None
class CommunityTree(BaseModel):
root: Community
children: dict[str, list[Community]] # parent_id -> children
depth: int
def all_communities(self) -> list[Community]:
"""Traverse all communities in tree."""
...
def get_level(self, level: int) -> list[Community]:
"""Get all communities at a specific level."""
...
5. Implementation Guide
5.1 Development Environment Setup
mkdir community-detection && cd community-detection
python -m venv .venv && source .venv/bin/activate
pip install neo4j igraph leidenalg sentence-transformers openai pydantic
5.2 Project Structure
community-detection/
├── src/
│ ├── detector.py # Community detection
│ ├── hierarchy.py # Hierarchy building
│ ├── summarizer.py # LLM summarization
│ ├── store.py # Persistence
│ ├── retriever.py # Community-aware search
│ └── models.py # Data models
├── tests/
│ ├── test_detection.py
│ └── test_retrieval.py
└── README.md
5.3 Implementation Phases
Phase 1: Community Detection (10-12h)
Goals:
- Extract graph from Neo4j
- Run Leiden algorithm
- Tune resolution parameter
Tasks:
- Export Neo4j graph to igraph format
- Run Leiden with default parameters
- Experiment with resolution values
- Visualize communities
Checkpoint: Communities detected from graph.
Phase 2: Hierarchy and Summaries (10-12h)
Goals:
- Build multi-level hierarchy
- Generate summaries at each level
Tasks:
- Implement hierarchical Leiden (multiple resolutions)
- Build tree structure from levels
- Design summary prompt
- Generate and store summaries
Checkpoint: Hierarchical tree with summaries.
Phase 3: Retrieval (8-10h)
Goals:
- Search communities by embedding
- Search within communities
Tasks:
- Generate embeddings for summaries
- Implement community search
- Implement entity search within community
- Build two-stage retrieval pipeline
Checkpoint: Query returns relevant entities via communities.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test algorithm components | Leiden on small graph |
| Integration | Test full pipeline | Graph → communities → summaries |
| Quality | Test retrieval accuracy | Query matches correct community |
6.2 Critical Test Cases
- Community quality: Modularity above threshold
- Hierarchy validity: All entities present at every level
- Summary accuracy: Summary mentions key entities
- Retrieval precision: Query finds relevant community
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong resolution | Too few/many communities | Tune resolution parameter |
| Disconnected nodes | Orphan communities of size 1 | Filter or merge small communities |
| Summary hallucination | Summary mentions non-entities | Constrain prompt to actual entities |
| Hierarchy inconsistency | Missing entities at levels | Verify all entities present |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add community visualization
- Implement incremental community update
8.2 Intermediate Extensions
- Add overlapping community detection
- Implement temporal communities (evolution)
8.3 Advanced Extensions
- Add community-aware reasoning
- Implement distributed community detection
9. Real-World Connections
9.1 Industry Applications
- Microsoft GraphRAG: Exact community summary architecture
- Graphiti: Community detection for entity clusters
- Social Network Analysis: Friend groups, interest clusters
9.2 Interview Relevance
- Explain modularity and community detection
- Discuss two-stage retrieval benefits
- Describe hierarchical summarization tradeoffs
10. Resources
10.1 Essential Reading
- “From Local to Global” (GraphRAG paper) — Microsoft’s approach
- Leiden Algorithm paper — Traag et al. 2019
- igraph documentation — Python graph algorithms
10.2 Related Projects
- Previous: Project 7 (Semantic Memory Synthesizer)
- Next: Project 9 (Graphiti Framework Integration)
11. Self-Assessment Checklist
- I understand modularity and what it measures
- I can explain Leiden vs Louvain differences
- I know how to tune resolution for my graph
- I understand two-stage retrieval benefits
12. Submission / Completion Criteria
Minimum Viable Completion:
- Community detection working
- Single-level summaries generated
- Basic community search
Full Completion:
- Hierarchical communities
- Multi-level summaries
- Two-stage retrieval pipeline
Excellence:
- Incremental community updates
- Temporal community tracking
- Community-aware reasoning