Project 31: “The Legacy Code Archaeologist” — Code Understanding
| Attribute | Value |
|---|---|
| File | KIRO_CLI_LEARNING_PROJECTS.md |
| Main Programming Language | C / Java (Legacy) |
| Coolness Level | Level 3: Genuinely Clever |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Code Understanding |
What you’ll build: Map a legacy repo and generate a call-graph summary.
Why it teaches Exploration: The AI becomes a codebase cartographer.
Success criteria:
`DIAGRAM.md` matches real entry points and call flow.
Real World Outcome
You will have a Kiro CLI workflow that automatically reverse-engineers legacy codebases by generating architectural diagrams, call graphs, and dependency maps from undocumented code. When you run it, you’ll see:
Analyzing a Legacy Codebase:
$ kiro "Analyze this legacy Java codebase and create a DIAGRAM.md with the architecture"
🔍 Legacy Code Archaeologist - Codebase Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1: Discovery
├─ Scanning directory structure...
│ └─ Found 247 Java files across 38 packages
├─ Identifying entry points...
│ ├─ Main.java (static void main)
│ ├─ ServletInitializer.java (web entry)
│ └─ ScheduledTasks.java (cron jobs)
└─ Extracting package structure...
✓ com.legacy.app (core business logic)
✓ com.legacy.db (database layer)
✓ com.legacy.web (servlets & controllers)
✓ com.legacy.util (utilities)
Phase 2: Call Graph Construction
├─ Parsing method calls...
│ └─ Analyzed 1,843 methods across 247 files
├─ Building dependency graph...
│ └─ Detected 4,521 method-to-method calls
└─ Identifying hot paths...
✓ UserService.authenticate() called from 23 places
✓ DatabaseConnection.execute() called from 156 places
✓ LegacyValidator.validate() called from 89 places
Phase 3: Architecture Extraction
├─ Detecting design patterns...
│ ├─ Singleton: DatabaseConnection, ConfigManager
│ ├─ Factory: UserFactory, OrderFactory
│ └─ Observer: EventBus, NotificationManager
├─ Identifying layers...
│ └─ 3-tier architecture (Web → Service → DAO)
└─ Extracting data flow...
✓ HTTP Request → Servlet → Service → DAO → Database
✍️ Generating DIAGRAM.md...
✓ Architecture diagram created
✓ Call graph generated
✓ Dependency map visualized
✓ Hot paths documented
File created: DIAGRAM.md (523 lines)
Generated DIAGRAM.md Content:
# Legacy App Architecture - Reverse Engineered
## System Overview
This is a 3-tier Java web application built circa 2008 (Servlet 2.5 era).
No documentation exists - this diagram was generated via static analysis.
## Entry Points
1. **Web Requests** → `com.legacy.web.MainServlet`
2. **Scheduled Jobs** → `com.legacy.jobs.ScheduledTasks`
3. **CLI Admin Tool** → `com.legacy.admin.Main`
## Architecture Diagram
┌─────────────────────────────────────────────────┐
│                   WEB LAYER                     │
│  ┌─────────────────────────────────────────┐    │
│  │ MainServlet, UserServlet, OrderServlet  │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (HTTP requests)
                  ↓
┌─────────────────────────────────────────────────┐
│                 SERVICE LAYER                   │
│  ┌─────────────────────────────────────────┐    │
│  │ UserService, OrderService, AuthService  │    │
│  │ PaymentService, NotificationService     │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (business logic)
                  ↓
┌─────────────────────────────────────────────────┐
│                   DAO LAYER                     │
│  ┌─────────────────────────────────────────┐    │
│  │ UserDAO, OrderDAO, PaymentDAO           │    │
│  │ (JDBC-based, no ORM)                    │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (SQL queries)
                  ↓
┌─────────────────────────────────────────────────┐
│                   DATABASE                      │
│  MySQL 5.6 (inferred from JDBC driver)          │
│  Schema: users, orders, payments, audit_log     │
└─────────────────────────────────────────────────┘
## Critical Path: User Authentication
HTTP POST /login
  │
  ↓
MainServlet.doPost()
  │
  ├─ validate(request)
  │   └─ LegacyValidator.validate()  [89 call sites!]
  │
  ├─ UserService.authenticate(username, password)
  │   │
  │   ├─ UserDAO.findByUsername()
  │   │   └─ DatabaseConnection.execute(sql)
  │   │       └─ "SELECT * FROM users WHERE username = ?"
  │   │
  │   └─ PasswordHasher.verify(inputPwd, storedHash)
  │       └─ WARNING: Uses MD5 (insecure!)
  │
  ├─ SessionManager.createSession(user)
  │
  └─ response.redirect("/dashboard")
## Dependency Hot Spots
(Methods called most frequently - refactoring risks)
1. **DatabaseConnection.execute()** - 156 call sites
- Every database query goes through this
- Refactoring would break the entire app
2. **LegacyValidator.validate()** - 89 call sites
- Input validation scattered everywhere
- No clear validation strategy
3. **UserService.authenticate()** - 23 call sites
- Called from multiple servlets
- Session management tightly coupled
## Design Pattern Usage
| Pattern | Implementation | Notes |
|---------|----------------|-------|
| Singleton | DatabaseConnection, ConfigManager | Thread-unsafe (bug risk) |
| Factory | UserFactory, OrderFactory | Creates domain objects |
| Observer | EventBus, NotificationManager | Event-driven async tasks |
| DAO | *DAO classes | Raw JDBC, no ORM |
## Technical Debt Detected
1. ❌ **MD5 Password Hashing** (CRITICAL)
- File: `PasswordHasher.java:42`
- Risk: Passwords crackable in seconds
2. ⚠️ **Thread-Unsafe Singleton** (HIGH)
- File: `DatabaseConnection.java`
- Risk: Race conditions under load
3. ⚠️ **No Connection Pooling** (MEDIUM)
- Each request creates new DB connection
- Scalability bottleneck
4. ℹ️ **Servlet 2.5** (LOW)
- Ancient API (2005)
- Recommend upgrade to Servlet 4.0+
You’re seeing exactly what code archaeology enables - transforming undocumented legacy systems into understandable, maintainable architectures!
The Core Question You’re Answering
“How do you understand a 10-year-old codebase with zero documentation and the original developers long gone?”
Before you write any code, sit with this question. Every developer faces this:
- You inherit a legacy project (acquired company, staff turnover, archaeological dig)
- No README, no diagrams, no comments worth reading
- “Just read the code” - but there are 500 files and 100K lines
Manual approach (weeks of work):
Week 1: Read random files, get overwhelmed
Week 2: Find the entry point (main() or servlet)
Week 3: Trace execution paths with a debugger
Week 4: Draw diagrams on a whiteboard
Week 5: Finally understand 20% of the system
AI-assisted approach (hours):
Step 1: Feed entire codebase to Kiro
Step 2: "Map the architecture and generate call graphs"
Step 3: Review generated DIAGRAM.md
Step 4: Ask follow-up questions ("Why does UserService call PaymentDAO directly?")
Result: 80% understanding in 4 hours
This is code archaeology - using static analysis, pattern recognition, and LLM reasoning to reverse-engineer systems.
Concepts You Must Understand First
Stop and research these before coding:
- Static Code Analysis
- What is an Abstract Syntax Tree (AST)? (parse tree of code structure)
- How do you extract method calls from source code? (AST traversal)
- What tools exist? (Understand, Sourcetrail, javaparser, tree-sitter)
- Book Reference: “Compilers: Principles, Techniques, and Tools” Ch. 4 (Syntax Analysis) - Aho, Lam, Sethi, Ullman
- Call Graph Construction
- What is a call graph? (directed graph: nodes = methods, edges = calls)
- Static vs dynamic call graphs (compile-time vs runtime)
- How do you handle polymorphism? (method dispatch is ambiguous)
- Paper Reference: “Practical Algorithms for Call Graph Construction” - Grove & Chambers
- Dependency Analysis
- What is coupling? (how tightly connected are modules)
- What is cohesion? (how focused is a module’s purpose)
- How do you detect circular dependencies? (cycle detection in directed graphs)
- Book Reference: “Clean Architecture” Ch. 13-14 (Component Cohesion/Coupling) - Robert C. Martin
- Design Pattern Recognition
- Common patterns: Singleton, Factory, Observer, Strategy, DAO
- How do you detect patterns in code? (structural matching, AST patterns)
- What are anti-patterns? (God Object, Spaghetti Code, Shotgun Surgery)
- Book Reference: “Design Patterns” - Gamma, Helm, Johnson, Vlissides (Gang of Four)
- Legacy Code Characteristics
- What defines “legacy”? (no tests, no docs, fear of change)
- How do you prioritize what to understand first? (entry points, hot paths)
- What is the strangler fig pattern? (gradually replace legacy system)
- Book Reference: “Working Effectively with Legacy Code” Ch. 1-2 - Michael Feathers
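The circular-dependency detection mentioned above (cycle detection in a directed graph) can be sketched with a depth-first search using three-color marking. A minimal sketch; the package names and the `{node: [dependencies]}` dict shape are hypothetical, chosen only for illustration.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current DFS path / done
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:          # back edge -> cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = dfs(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# Hypothetical package-level dependency graph with one cycle (app <-> db)
deps = {
    "com.legacy.web":  ["com.legacy.app"],
    "com.legacy.app":  ["com.legacy.db", "com.legacy.util"],
    "com.legacy.db":   ["com.legacy.app"],
    "com.legacy.util": [],
}
print(find_cycle(deps))
```

The same traversal works at any granularity: packages, classes, or methods, as long as you can build the dependency dict first.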
Questions to Guide Your Design
Before implementing, think through these:
- Entry Point Detection
- How do you find main()? (search for `public static void main`)
- What about web apps? (Servlet annotations, web.xml)
- What about background jobs? (@Scheduled, cron config)
- Call Graph Scope
- Full codebase or just application code? (exclude libraries?)
- How deep to trace calls? (1 level? All transitive dependencies?)
- How do you handle reflection? (runtime method invocation)
- Visualization Format
- ASCII art in markdown (simple, readable in GitHub)
- Graphviz DOT (generates PNG diagrams)
- Mermaid.js (renders in markdown viewers)
- Prioritization
- What’s most important to document first? (entry points, critical paths)
- How do you identify “hot spots”? (most-called methods)
- What about dead code? (unreachable methods)
- Technical Debt Detection
- Security issues (MD5, SQL injection, XSS)
- Performance problems (N+1 queries, missing indexes)
- Maintainability issues (God classes, long methods)
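Of the visualization formats listed above, Mermaid.js is the easiest to generate programmatically. A minimal sketch of turning a `{caller: [callees]}` dict into Mermaid `graph TD` text; wrap the returned string in a mermaid-fenced code block when embedding it in DIAGRAM.md. The call graph below is the hypothetical authentication flow from this project, not real analyzer output.

```python
def to_mermaid(call_graph):
    """Render a {caller: [callees]} dict as Mermaid 'graph TD' text."""
    lines = ["graph TD"]
    ids = {}  # Mermaid node ids must be plain identifiers, so alias each name

    def node_id(name):
        if name not in ids:
            ids[name] = f"n{len(ids)}"
        return ids[name]

    for caller, callees in call_graph.items():
        for callee in callees:
            lines.append(
                f'    {node_id(caller)}["{caller}"] --> {node_id(callee)}["{callee}"]'
            )
    return "\n".join(lines)

call_graph = {
    "MainServlet.doPost": ["UserService.authenticate", "SessionManager.createSession"],
    "UserService.authenticate": ["UserDAO.findByUsername", "PasswordHasher.verify"],
}
print(to_mermaid(call_graph))
```

GitHub renders the result inline, which makes Mermaid a good default over ASCII art for anything beyond a handful of nodes.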
Thinking Exercise
Trace Architecture Extraction
Before coding, manually analyze this legacy Java snippet:
Given:
// MainServlet.java
public class MainServlet extends HttpServlet {
protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
String username = req.getParameter("username");
String password = req.getParameter("password");
User user = UserService.getInstance().authenticate(username, password);
if (user != null) {
SessionManager.createSession(req, user);
resp.sendRedirect("/dashboard");
}
}
}
// UserService.java
public class UserService {
private static UserService instance;
public static UserService getInstance() {
if (instance == null) {
instance = new UserService();
}
return instance;
}
public User authenticate(String username, String password) {
User user = UserDAO.findByUsername(username);
if (user != null && PasswordHasher.verify(password, user.getPasswordHash())) {
return user;
}
return null;
}
}
Trace the analysis:
- Entry Point Identification
- `MainServlet.doPost()` is the entry point (HTTP POST /login)
- Question: How do you know this handles /login? (need web.xml or @WebServlet annotation)
- Call Graph
MainServlet.doPost()
├─ UserService.getInstance()
├─ UserService.authenticate()
│  ├─ UserDAO.findByUsername()
│  └─ PasswordHasher.verify()
└─ SessionManager.createSession()

- Question: What methods does UserDAO.findByUsername() call? (need to analyze UserDAO source)
- Pattern Detection
- Singleton: `UserService.getInstance()` (lazy initialization)
- DAO: `UserDAO` (data access layer)
- Question: Is this thread-safe? (NO! Unsynchronized lazy initialization - two threads can each create an instance)
- Technical Debt
- Singleton is thread-unsafe (race condition)
- No input validation (SQL injection risk)
- Direct string comparison (timing attack on password)
- Question: What’s the priority order for fixing? (security > concurrency > style)
Questions while tracing:
- How do you handle method overloading? (multiple findByUsername() signatures)
- What if UserDAO uses reflection? (can’t see calls statically)
- How deep should the call graph go? (stop at library boundaries?)
The Interview Questions They’ll Ask
Prepare to answer these:
1. “Walk me through how you would reverse-engineer a 100K-line Java codebase with no documentation. What’s your systematic approach?”
2. “You detect a Singleton pattern in legacy code that’s accessed from 50 places. How would you refactor it safely without breaking the app?”
3. “Your call graph tool reports 10,000 method calls. How would you prioritize which ones to document first?”
4. “Explain the difference between static and dynamic call graphs. When would you need a dynamic call graph despite the extra complexity?”
5. “You find a method called 500 times across the codebase. How would you determine if this is a design problem or just a legitimate utility method?”
6. “How would you detect and visualize circular dependencies in a legacy codebase? What tools and algorithms would you use?”
Hints in Layers
Hint 1: Start with Entry Point Detection Use grep to find main methods and servlets:
# Find Java main methods
rg "public static void main" --type java
# Find servlets
rg "extends HttpServlet" --type java
rg "@WebServlet" --type java
# Find Spring controllers
rg "@Controller|@RestController" --type java
Hint 2: Parse Source Code into AST Use a parsing library (don’t write a parser from scratch):
# For Java: use javalang or tree-sitter
import javalang
tree = javalang.parse.parse(java_source_code)
for path, node in tree.filter(javalang.tree.MethodInvocation):
print(f"Method call: {node.member}")
# Extract: UserService.getInstance().authenticate()
Hint 3: Build Call Graph Create a directed graph of method calls:
call_graph = {} # {caller: [callees]}
for file in java_files:
tree = parse(file)
for method in tree.methods:
caller = f"{method.class_name}.{method.name}"
callees = extract_method_calls(method)
call_graph[caller] = callees
# Example output:
# {
# 'MainServlet.doPost': ['UserService.authenticate', 'SessionManager.createSession'],
# 'UserService.authenticate': ['UserDAO.findByUsername', 'PasswordHasher.verify']
# }
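Since `parse()` and `extract_method_calls()` above are placeholders, here is a runnable, standard-library-only sketch of the same idea. It approximates call extraction with regexes on a single class (assumed names and patterns, good enough for a first pass); a real implementation should use an AST library such as javalang or tree-sitter.

```python
import re

# Matches a method header like "public User authenticate(String u, String p) {"
METHOD_DEF = re.compile(
    r'(?:public|protected|private)\s[\w<>\[\],\s]*?(\w+)\s*\([^)]*\)\s*\{'
)
# Matches a static-looking call site like "UserDAO.findByUsername("
CALL = re.compile(r'\b([A-Z]\w*\.\w+)\s*\(')

def build_call_graph(class_name, source):
    """Return {caller: [callees]} for one Java class, via regex approximation."""
    graph = {}
    defs = list(METHOD_DEF.finditer(source))
    for i, m in enumerate(defs):
        # Crude body boundary: everything until the next method header
        body_end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
        body = source[m.end():body_end]
        graph[f"{class_name}.{m.group(1)}"] = CALL.findall(body)
    return graph

java = """
public class UserService {
    public User authenticate(String username, String password) {
        User user = UserDAO.findByUsername(username);
        if (user != null && PasswordHasher.verify(password, user.getPasswordHash())) {
            return user;
        }
        return null;
    }
}
"""
print(build_call_graph("UserService", java))
```

The regex approach misses instance calls (`user.getPasswordHash()`), overloads, and nested braces, which is exactly why Hint 2 recommends a real parser.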
Hint 4: Detect Hot Spots Count incoming edges to find most-called methods:
call_counts = {}
for caller, callees in call_graph.items():
for callee in callees:
call_counts[callee] = call_counts.get(callee, 0) + 1
# Sort by frequency
hot_spots = sorted(call_counts.items(), key=lambda x: x[1], reverse=True)
# [('DatabaseConnection.execute', 156), ('LegacyValidator.validate', 89), ...]
Hint 5: Identify Design Patterns Pattern matching on AST structure:
# Detect Singleton (lazy initialization) - note that tree.filter() yields
# (path, node) tuples in javalang
for path, method in tree.filter(javalang.tree.MethodDeclaration):
    if method.name == 'getInstance' and 'static' in method.modifiers:
        # Check the body for: if (instance == null) instance = new ...
        # (the enclosing class name comes from the ClassDeclaration in `path`)
        print("Singleton candidate found")
Hint 6: Generate ASCII Diagram Format call paths as a tree:
def print_call_tree(method, graph, depth=0, max_depth=3):
    if depth > max_depth:
        return
    indent = "  " * depth
    print(f"{indent}├─ {method}")
    for callee in graph.get(method, []):
        print_call_tree(callee, graph, depth + 1, max_depth)
# Output:
# ├─ MainServlet.doPost
#   ├─ UserService.authenticate
#     ├─ UserDAO.findByUsername
#     ├─ PasswordHasher.verify
Hint 7: Use LLM for Pattern Explanation Once you have the call graph, ask Kiro to explain it:
import json  # needed for json.dumps below

prompt = f"""
Based on this call graph:
{json.dumps(call_graph, indent=2)}
1. What architectural pattern is this? (MVC, layered, etc.)
2. Identify the entry points
3. Spot any design issues or anti-patterns
4. Generate a markdown diagram
"""
explanation = llm(prompt)  # llm() is a placeholder for your Kiro/LLM call
Hint 8: Validate Generated Diagram Cross-reference with actual code execution:
- Run the app with a debugger
- Set breakpoints at entry points
- Trace actual call stack
- Compare with static analysis diagram
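The comparison in the last step can be automated by diffing edge sets. A sketch; the `"caller -> callee"` trace format is hypothetical, so adapt the parsing to whatever your profiler or instrumentation actually emits.

```python
# Edges the static analyzer claims exist
static_edges = {
    ("MainServlet.doPost", "UserService.authenticate"),
    ("UserService.authenticate", "UserDAO.findByUsername"),
}

# Hypothetical runtime trace: one "caller -> callee" pair per line
trace_log = """\
MainServlet.doPost -> UserService.authenticate
UserService.authenticate -> UserDAO.findByUsername
UserService.authenticate -> AuditLogger.log
"""

dynamic_edges = {
    tuple(line.split(" -> "))
    for line in trace_log.strip().splitlines()
}

# Edges seen at runtime but missed statically (reflection, lambdas, ...)
print("missed by static analysis:", dynamic_edges - static_edges)
# Edges claimed statically but never observed (possibly dead or rare paths)
print("never observed at runtime:", static_edges - dynamic_edges)
```

An empty "missed" set is the signal that DIAGRAM.md can be trusted; a non-empty one points directly at the reflective or dynamic call sites to investigate.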
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Static Code Analysis | “Compilers: Principles, Techniques, and Tools” by Aho et al. | Ch. 4 (Syntax Analysis) |
| Call Graph Algorithms | “Engineering a Compiler” by Cooper & Torczon | Ch. 9 (Data-Flow Analysis) |
| Design Patterns | “Design Patterns” by Gang of Four | All chapters |
| Legacy Code Understanding | “Working Effectively with Legacy Code” by Michael Feathers | Ch. 1-2, 16 |
| Software Architecture | “Clean Architecture” by Robert C. Martin | Ch. 13-14 (Components) |
Common Pitfalls & Debugging
Problem 1: “Call graph includes too many library methods (java.util.*, etc.)”
- Why: No filtering - you’re graphing the entire JDK.
- Fix: Filter out standard library packages. Only graph application code.
- Quick test: Call graph should have <1000 nodes for a typical app.
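One way to implement that filter is an allow-list of application package prefixes (the `com.legacy.` prefix below is a hypothetical example):

```python
# Only methods under these prefixes count as application code
APP_PREFIXES = ("com.legacy.",)

def is_app_method(qualified_name):
    return qualified_name.startswith(APP_PREFIXES)

def filter_graph(call_graph):
    """Drop callers and callees that live outside the application packages."""
    return {
        caller: [c for c in callees if is_app_method(c)]
        for caller, callees in call_graph.items()
        if is_app_method(caller)
    }

g = {
    "com.legacy.app.UserService.authenticate": [
        "com.legacy.db.UserDAO.findByUsername",
        "java.util.Optional.ofNullable",   # JDK call, filtered out
    ],
    "java.util.ArrayList.add": [],          # JDK caller, filtered out
}
print(filter_graph(g))
```

An allow-list is safer than a deny-list of JDK packages, because third-party libraries (Spring, Apache Commons, vendor SDKs) are excluded by default.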
Problem 2: “Missing method calls - graph is incomplete”
- Why: Reflection, lambda expressions, or method references not detected.
- Fix: Combine static analysis with dynamic profiling (run app with instrumentation).
- Quick test: Cross-check against actual execution trace from debugger.
Problem 3: “Singleton detection produces false positives”
- Why: Any method named getInstance() triggers detection.
- Fix: Check for static field + lazy initialization pattern, not just method name.
- Quick test: Manual code review of detected Singletons.
Problem 4: “Generated diagram is unreadable - 1000+ lines of ASCII”
- Why: Showing entire call graph instead of high-level architecture.
- Fix: Create multiple diagrams: overview + detailed sub-graphs for each layer.
- Quick test: Overview diagram should fit on one screen (<50 lines).
Problem 5: “Analysis takes 10 minutes for 500 files”
- Why: Parsing each file from scratch on every run.
- Fix: Cache parsed ASTs, only re-parse changed files.
- Quick test: Second run should be <5 seconds (cache hit).
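A minimal sketch of that cache, keyed on file modification time; `parse()` here is a stand-in for whatever parser you use (javalang, tree-sitter, etc.):

```python
import os

_cache = {}  # path -> (mtime, parsed_tree)

def parse(path):
    # Stand-in for real parsing; returns the raw source for illustration
    with open(path) as f:
        return f.read()

def parse_cached(path):
    """Re-parse only if the file changed since it was last cached."""
    mtime = os.path.getmtime(path)
    hit = _cache.get(path)
    if hit and hit[0] == mtime:
        return hit[1]              # cache hit: skip re-parsing
    tree = parse(path)
    _cache[path] = (mtime, tree)
    return tree
```

For persistence across runs, pickle `_cache` to disk at exit and reload it at startup; mtime comparison still invalidates stale entries correctly.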
Definition of Done
- Entry points identified: Main methods, servlets, scheduled tasks are found
- Call graph built: Method-to-method calls extracted from source code
- Hot spots detected: Most-called methods identified (top 10)
- Design patterns recognized: Singleton, Factory, DAO, etc. detected
- Architecture diagram generated: High-level 3-tier or N-tier diagram created
- Critical paths documented: Authentication, payment, etc. workflows traced
- Technical debt flagged: Security issues (MD5, SQL injection) highlighted
- Markdown format: DIAGRAM.md is readable in GitHub
- Validation: Diagram matches actual execution (verified with debugger)
- Performance: Analysis completes in <2 minutes for 500-file codebase