Project 31: “The Legacy Code Archaeologist” — Code Understanding
| Attribute | Value |
|---|---|
| File | KIRO_CLI_LEARNING_PROJECTS.md |
| Main Programming Language | C / Java (Legacy) |
| Coolness Level | Level 3: Genuinely Clever |
| Difficulty | Level 3: Advanced |
| Knowledge Area | Code Understanding |
What you’ll build: Map a legacy repo and generate a call-graph summary.
Why it teaches Exploration: The AI becomes a codebase cartographer.
Success criteria:
`DIAGRAM.md` matches real entry points and call flow.
Real World Outcome
You will have a Kiro CLI workflow that automatically reverse-engineers legacy codebases by generating architectural diagrams, call graphs, and dependency maps from undocumented code. When you run it, you’ll see:
Analyzing a Legacy Codebase:
$ kiro "Analyze this legacy Java codebase and create a DIAGRAM.md with the architecture"
🔍 Legacy Code Archaeologist - Codebase Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1: Discovery
├─ Scanning directory structure...
│ └─ Found 247 Java files across 38 packages
├─ Identifying entry points...
│ ├─ Main.java (static void main)
│ ├─ ServletInitializer.java (web entry)
│ └─ ScheduledTasks.java (cron jobs)
└─ Extracting package structure...
✓ com.legacy.app (core business logic)
✓ com.legacy.db (database layer)
✓ com.legacy.web (servlets & controllers)
✓ com.legacy.util (utilities)
Phase 2: Call Graph Construction
├─ Parsing method calls...
│ └─ Analyzed 1,843 methods across 247 files
├─ Building dependency graph...
│ └─ Detected 4,521 method-to-method calls
└─ Identifying hot paths...
✓ UserService.authenticate() called from 23 places
✓ DatabaseConnection.execute() called from 156 places
✓ LegacyValidator.validate() called from 89 places
Phase 3: Architecture Extraction
├─ Detecting design patterns...
│ ├─ Singleton: DatabaseConnection, ConfigManager
│ ├─ Factory: UserFactory, OrderFactory
│ └─ Observer: EventBus, NotificationManager
├─ Identifying layers...
│ └─ 3-tier architecture (Web → Service → DAO)
└─ Extracting data flow...
✓ HTTP Request → Servlet → Service → DAO → Database
✍️ Generating DIAGRAM.md...
✓ Architecture diagram created
✓ Call graph generated
✓ Dependency map visualized
✓ Hot paths documented
File created: DIAGRAM.md (523 lines)
Generated DIAGRAM.md Content:
# Legacy App Architecture - Reverse Engineered
## System Overview
This is a 3-tier Java web application built circa 2008 (Servlet 2.5 era).
No documentation exists - this diagram was generated via static analysis.
## Entry Points
1. **Web Requests** → `com.legacy.web.MainServlet`
2. **Scheduled Jobs** → `com.legacy.jobs.ScheduledTasks`
3. **CLI Admin Tool** → `com.legacy.admin.Main`
## Architecture Diagram
┌─────────────────────────────────────────────────┐
│                   WEB LAYER                     │
│  ┌─────────────────────────────────────────┐    │
│  │ MainServlet, UserServlet, OrderServlet  │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (HTTP requests)
                  ↓
┌─────────────────────────────────────────────────┐
│                 SERVICE LAYER                   │
│  ┌─────────────────────────────────────────┐    │
│  │ UserService, OrderService, AuthService  │    │
│  │ PaymentService, NotificationService     │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (business logic)
                  ↓
┌─────────────────────────────────────────────────┐
│                   DAO LAYER                     │
│  ┌─────────────────────────────────────────┐    │
│  │ UserDAO, OrderDAO, PaymentDAO           │    │
│  │ (JDBC-based, no ORM)                    │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
                  │ (SQL queries)
                  ↓
┌─────────────────────────────────────────────────┐
│                   DATABASE                      │
│  MySQL 5.6 (inferred from JDBC driver)          │
│  Schema: users, orders, payments, audit_log     │
└─────────────────────────────────────────────────┘
## Critical Path: User Authentication
HTTP POST /login
  │
  ↓
MainServlet.doPost()
  │
  ├─ validate(request)
  │   └─ LegacyValidator.validate()  [89 call sites!]
  │
  ├─ UserService.authenticate(username, password)
  │   │
  │   ├─ UserDAO.findByUsername()
  │   │   └─ DatabaseConnection.execute(sql)
  │   │       └─ "SELECT * FROM users WHERE username = ?"
  │   │
  │   └─ PasswordHasher.verify(inputPwd, storedHash)
  │       └─ WARNING: Uses MD5 (insecure!)
  │
  ├─ SessionManager.createSession(user)
  │
  └─ response.redirect("/dashboard")
## Dependency Hot Spots
(Methods called most frequently - refactoring risks)
1. **DatabaseConnection.execute()** - 156 call sites
- Every database query goes through this
- Refactoring would break the entire app
2. **LegacyValidator.validate()** - 89 call sites
- Input validation scattered everywhere
- No clear validation strategy
3. **UserService.authenticate()** - 23 call sites
- Called from multiple servlets
- Session management tightly coupled
## Design Pattern Usage
| Pattern | Implementation | Notes |
|---------|----------------|-------|
| Singleton | DatabaseConnection, ConfigManager | Thread-unsafe (bug risk) |
| Factory | UserFactory, OrderFactory | Creates domain objects |
| Observer | EventBus, NotificationManager | Event-driven async tasks |
| DAO | *DAO classes | Raw JDBC, no ORM |
## Technical Debt Detected
1. ❌ **MD5 Password Hashing** (CRITICAL)
- File: `PasswordHasher.java:42`
- Risk: Passwords crackable in seconds
2. ⚠️ **Thread-Unsafe Singleton** (HIGH)
- File: `DatabaseConnection.java`
- Risk: Race conditions under load
3. ⚠️ **No Connection Pooling** (MEDIUM)
- Each request creates new DB connection
- Scalability bottleneck
4. ℹ️ **Servlet 2.5** (LOW)
- Ancient API (2005)
- Recommend upgrade to Servlet 4.0+
You’re seeing exactly what code archaeology enables - transforming undocumented legacy systems into understandable, maintainable architectures!
The Core Question You’re Answering
“How do you understand a 10-year-old codebase with zero documentation and the original developers long gone?”
Before you write any code, sit with this question. Every developer faces this:
- You inherit a legacy project (acquired company, staff turnover, archaeological dig)
- No README, no diagrams, no comments worth reading
- “Just read the code” - but there are 500 files and 100K lines
Manual approach (weeks of work):
Week 1: Read random files, get overwhelmed
Week 2: Find the entry point (main() or servlet)
Week 3: Trace execution paths with a debugger
Week 4: Draw diagrams on a whiteboard
Week 5: Finally understand 20% of the system
AI-assisted approach (hours):
Step 1: Feed entire codebase to Kiro
Step 2: "Map the architecture and generate call graphs"
Step 3: Review generated DIAGRAM.md
Step 4: Ask follow-up questions ("Why does UserService call PaymentDAO directly?")
Result: 80% understanding in 4 hours
This is code archaeology - using static analysis, pattern recognition, and LLM reasoning to reverse-engineer systems.
Concepts You Must Understand First
Stop and research these before coding:
- Static Code Analysis
- What is an Abstract Syntax Tree (AST)? (parse tree of code structure)
- How do you extract method calls from source code? (AST traversal)
- What tools exist? (Understand, Sourcetrail, javaparser, tree-sitter)
- Book Reference: “Compilers: Principles, Techniques, and Tools” Ch. 4 (Syntax Analysis) - Aho, Lam, Sethi, Ullman
- Call Graph Construction
- What is a call graph? (directed graph: nodes = methods, edges = calls)
- Static vs dynamic call graphs (compile-time vs runtime)
- How do you handle polymorphism? (method dispatch is ambiguous)
- Paper Reference: “Practical Algorithms for Call Graph Construction” - Grove & Chambers
- Dependency Analysis
- What is coupling? (how tightly connected are modules)
- What is cohesion? (how focused is a module’s purpose)
- How do you detect circular dependencies? (cycle detection in directed graphs)
- Book Reference: “Clean Architecture” Ch. 13-14 (Component Cohesion/Coupling) - Robert C. Martin
- Design Pattern Recognition
- Common patterns: Singleton, Factory, Observer, Strategy, DAO
- How do you detect patterns in code? (structural matching, AST patterns)
- What are anti-patterns? (God Object, Spaghetti Code, Shotgun Surgery)
- Book Reference: “Design Patterns” - Gamma, Helm, Johnson, Vlissides (Gang of Four)
- Legacy Code Characteristics
- What defines “legacy”? (no tests, no docs, fear of change)
- How do you prioritize what to understand first? (entry points, hot paths)
- What is the strangler fig pattern? (gradually replace legacy system)
- Book Reference: “Working Effectively with Legacy Code” Ch. 1-2 - Michael Feathers
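The circular-dependency detection mentioned above (cycle detection in a directed graph) can be sketched with a depth-first search using three-color marking. A minimal sketch; the package names and the `{node: [dependencies]}` dict shape are hypothetical, chosen only for illustration.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current DFS path / done
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:          # back edge -> cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = dfs(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# Hypothetical package-level dependency graph with one cycle (app <-> db)
deps = {
    "com.legacy.web":  ["com.legacy.app"],
    "com.legacy.app":  ["com.legacy.db", "com.legacy.util"],
    "com.legacy.db":   ["com.legacy.app"],
    "com.legacy.util": [],
}
print(find_cycle(deps))
```

The same traversal works at any granularity: packages, classes, or methods, as long as you can build the dependency dict first.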
Questions to Guide Your Design
Before implementing, think through these:
- Entry Point Detection
- How do you find main()? (search for `public static void main`)
- What about web apps? (Servlet annotations, web.xml)
- What about background jobs? (@Scheduled, cron config)
- Call Graph Scope
- Full codebase or just application code? (exclude libraries?)
- How deep to trace calls? (1 level? All transitive dependencies?)
- How do you handle reflection? (runtime method invocation)
- Visualization Format
- ASCII art in markdown (simple, readable in GitHub)
- Graphviz DOT (generates PNG diagrams)
- Mermaid.js (renders in markdown viewers)
- Prioritization
- What’s most important to document first? (entry points, critical paths)
- How do you identify “hot spots”? (most-called methods)
- What about dead code? (unreachable methods)
- Technical Debt Detection
- Security issues (MD5, SQL injection, XSS)
- Performance problems (N+1 queries, missing indexes)
- Maintainability issues (God classes, long methods)
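Of the visualization formats listed above, Mermaid.js is the easiest to generate programmatically. A minimal sketch of turning a `{caller: [callees]}` dict into Mermaid `graph TD` text; wrap the returned string in a mermaid-fenced code block when embedding it in DIAGRAM.md. The call graph below is the hypothetical authentication flow from this project, not real analyzer output.

```python
def to_mermaid(call_graph):
    """Render a {caller: [callees]} dict as Mermaid 'graph TD' text."""
    lines = ["graph TD"]
    ids = {}  # Mermaid node ids must be plain identifiers, so alias each name

    def node_id(name):
        if name not in ids:
            ids[name] = f"n{len(ids)}"
        return ids[name]

    for caller, callees in call_graph.items():
        for callee in callees:
            lines.append(
                f'    {node_id(caller)}["{caller}"] --> {node_id(callee)}["{callee}"]'
            )
    return "\n".join(lines)

call_graph = {
    "MainServlet.doPost": ["UserService.authenticate", "SessionManager.createSession"],
    "UserService.authenticate": ["UserDAO.findByUsername", "PasswordHasher.verify"],
}
print(to_mermaid(call_graph))
```

GitHub renders the result inline, which makes Mermaid a good default over ASCII art for anything beyond a handful of nodes.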
Thinking Exercise
Trace Architecture Extraction
Before coding, manually analyze this legacy Java snippet:
Given:
// MainServlet.java
public class MainServlet extends HttpServlet {
protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
String username = req.getParameter("username");
String password = req.getParameter("password");
User user = UserService.getInstance().authenticate(username, password);
if (user != null) {
SessionManager.createSession(req, user);
resp.sendRedirect("/dashboard");
}
}
}
// UserService.java
public class UserService {
private static UserService instance;
public static UserService getInstance() {
if (instance == null) {
instance = new UserService();
}
return instance;
}
public User authenticate(String username, String password) {
User user = UserDAO.findByUsername(username);
if (user != null && PasswordHasher.verify(password, user.getPasswordHash())) {
return user;
}
return null;
}
}
Trace the analysis:
- Entry Point Identification
- `MainServlet.doPost()` is the entry point (HTTP POST /login)
- Question: How do you know this handles /login? (need web.xml or @WebServlet annotation)
- Call Graph
MainServlet.doPost()
├─ UserService.getInstance()
├─ UserService.authenticate()
│  ├─ UserDAO.findByUsername()
│  └─ PasswordHasher.verify()
└─ SessionManager.createSession()

- Question: What methods does UserDAO.findByUsername() call? (need to analyze UserDAO source)
- Pattern Detection
- Singleton: `UserService.getInstance()` (lazy initialization)
- DAO: `UserDAO` (data access layer)
- Question: Is this thread-safe? (NO! Unsynchronized lazy initialization - two threads can each create an instance)
- Technical Debt
- Singleton is thread-unsafe (race condition)
- No input validation (SQL injection risk)
- Direct string comparison (timing attack on password)
- Question: What’s the priority order for fixing? (security > concurrency > style)
Questions while tracing:
- How do you handle method overloading? (multiple findByUsername() signatures)
- What if UserDAO uses reflection? (can’t see calls statically)
- How deep should the call graph go? (stop at library boundaries?)
The Interview Questions They’ll Ask
Prepare to answer these:
1. “Walk me through how you would reverse-engineer a 100K-line Java codebase with no documentation. What’s your systematic approach?”
2. “You detect a Singleton pattern in legacy code that’s accessed from 50 places. How would you refactor it safely without breaking the app?”
3. “Your call graph tool reports 10,000 method calls. How would you prioritize which ones to document first?”
4. “Explain the difference between static and dynamic call graphs. When would you need a dynamic call graph despite the extra complexity?”
5. “You find a method called 500 times across the codebase. How would you determine if this is a design problem or just a legitimate utility method?”
6. “How would you detect and visualize circular dependencies in a legacy codebase? What tools and algorithms would you use?”
Hints in Layers
Hint 1: Start with Entry Point Detection Use grep to find main methods and servlets:
# Find Java main methods
rg "public static void main" --type java
# Find servlets
rg "extends HttpServlet" --type java
rg "@WebServlet" --type java
# Find Spring controllers
rg "@Controller|@RestController" --type java
Hint 2: Parse Source Code into AST Use a parsing library (don’t write a parser from scratch):
# For Java: use javalang or tree-sitter
import javalang
tree = javalang.parse.parse(java_source_code)
for path, node in tree.filter(javalang.tree.MethodInvocation):
print(f"Method call: {node.member}")
# Extract: UserService.getInstance().authenticate()
Hint 3: Build Call Graph Create a directed graph of method calls:
call_graph = {} # {caller: [callees]}
for file in java_files:
tree = parse(file)
for method in tree.methods:
caller = f"{method.class_name}.{method.name}"
callees = extract_method_calls(method)
call_graph[caller] = callees
# Example output:
# {
# 'MainServlet.doPost': ['UserService.authenticate', 'SessionManager.createSession'],
# 'UserService.authenticate': ['UserDAO.findByUsername', 'PasswordHasher.verify']
# }
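Since `parse()` and `extract_method_calls()` above are placeholders, here is a runnable, standard-library-only sketch of the same idea. It approximates call extraction with regexes on a single class (assumed names and patterns, good enough for a first pass); a real implementation should use an AST library such as javalang or tree-sitter.

```python
import re

# Matches a method header like "public User authenticate(String u, String p) {"
METHOD_DEF = re.compile(
    r'(?:public|protected|private)\s[\w<>\[\],\s]*?(\w+)\s*\([^)]*\)\s*\{'
)
# Matches a static-looking call site like "UserDAO.findByUsername("
CALL = re.compile(r'\b([A-Z]\w*\.\w+)\s*\(')

def build_call_graph(class_name, source):
    """Return {caller: [callees]} for one Java class, via regex approximation."""
    graph = {}
    defs = list(METHOD_DEF.finditer(source))
    for i, m in enumerate(defs):
        # Crude body boundary: everything until the next method header
        body_end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
        body = source[m.end():body_end]
        graph[f"{class_name}.{m.group(1)}"] = CALL.findall(body)
    return graph

java = """
public class UserService {
    public User authenticate(String username, String password) {
        User user = UserDAO.findByUsername(username);
        if (user != null && PasswordHasher.verify(password, user.getPasswordHash())) {
            return user;
        }
        return null;
    }
}
"""
print(build_call_graph("UserService", java))
```

The regex approach misses instance calls (`user.getPasswordHash()`), overloads, and nested braces, which is exactly why Hint 2 recommends a real parser.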
Hint 4: Detect Hot Spots Count incoming edges to find most-called methods:
call_counts = {}
for caller, callees in call_graph.items():
for callee in callees:
call_counts[callee] = call_counts.get(callee, 0) + 1
# Sort by frequency
hot_spots = sorted(call_counts.items(), key=lambda x: x[1], reverse=True)
# [('DatabaseConnection.execute', 156), ('LegacyValidator.validate', 89), ...]
Hint 5: Identify Design Patterns Pattern matching on AST structure:
# Detect Singleton (lazy initialization) - note that tree.filter() yields
# (path, node) tuples in javalang
for path, method in tree.filter(javalang.tree.MethodDeclaration):
    if method.name == 'getInstance' and 'static' in method.modifiers:
        # Check the body for: if (instance == null) instance = new ...
        # (the enclosing class name comes from the ClassDeclaration in `path`)
        print("Singleton candidate found")
Hint 6: Generate ASCII Diagram Format call paths as a tree:
def print_call_tree(method, graph, depth=0, max_depth=3):
    if depth > max_depth:
        return
    indent = "  " * depth
    print(f"{indent}├─ {method}")
    for callee in graph.get(method, []):
        print_call_tree(callee, graph, depth + 1, max_depth)
# Output:
# ├─ MainServlet.doPost
#   ├─ UserService.authenticate
#     ├─ UserDAO.findByUsername
#     ├─ PasswordHasher.verify
Hint 7: Use LLM for Pattern Explanation Once you have the call graph, ask Kiro to explain it:
import json  # needed for json.dumps below

prompt = f"""
Based on this call graph:
{json.dumps(call_graph, indent=2)}
1. What architectural pattern is this? (MVC, layered, etc.)
2. Identify the entry points
3. Spot any design issues or anti-patterns
4. Generate a markdown diagram
"""
explanation = llm(prompt)  # llm() is a placeholder for your Kiro/LLM call
Hint 8: Validate Generated Diagram Cross-reference with actual code execution:
- Run the app with a debugger
- Set breakpoints at entry points
- Trace actual call stack
- Compare with static analysis diagram
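The comparison in the last step can be automated by diffing edge sets. A sketch; the `"caller -> callee"` trace format is hypothetical, so adapt the parsing to whatever your profiler or instrumentation actually emits.

```python
# Edges the static analyzer claims exist
static_edges = {
    ("MainServlet.doPost", "UserService.authenticate"),
    ("UserService.authenticate", "UserDAO.findByUsername"),
}

# Hypothetical runtime trace: one "caller -> callee" pair per line
trace_log = """\
MainServlet.doPost -> UserService.authenticate
UserService.authenticate -> UserDAO.findByUsername
UserService.authenticate -> AuditLogger.log
"""

dynamic_edges = {
    tuple(line.split(" -> "))
    for line in trace_log.strip().splitlines()
}

# Edges seen at runtime but missed statically (reflection, lambdas, ...)
print("missed by static analysis:", dynamic_edges - static_edges)
# Edges claimed statically but never observed (possibly dead or rare paths)
print("never observed at runtime:", static_edges - dynamic_edges)
```

An empty "missed" set is the signal that DIAGRAM.md can be trusted; a non-empty one points directly at the reflective or dynamic call sites to investigate.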
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Static Code Analysis | “Compilers: Principles, Techniques, and Tools” by Aho et al. | Ch. 4 (Syntax Analysis) |
| Call Graph Algorithms | “Engineering a Compiler” by Cooper & Torczon | Ch. 9 (Data-Flow Analysis) |
| Design Patterns | “Design Patterns” by Gang of Four | All chapters |
| Legacy Code Understanding | “Working Effectively with Legacy Code” by Michael Feathers | Ch. 1-2, 16 |
| Software Architecture | “Clean Architecture” by Robert C. Martin | Ch. 13-14 (Components) |
Common Pitfalls & Debugging
Problem 1: “Call graph includes too many library methods (java.util.*, etc.)”
- Why: No filtering - you’re graphing the entire JDK.
- Fix: Filter out standard library packages. Only graph application code.
- Quick test: Call graph should have <1000 nodes for a typical app.
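One way to implement that filter is an allow-list of application package prefixes (the `com.legacy.` prefix below is a hypothetical example):

```python
# Only methods under these prefixes count as application code
APP_PREFIXES = ("com.legacy.",)

def is_app_method(qualified_name):
    return qualified_name.startswith(APP_PREFIXES)

def filter_graph(call_graph):
    """Drop callers and callees that live outside the application packages."""
    return {
        caller: [c for c in callees if is_app_method(c)]
        for caller, callees in call_graph.items()
        if is_app_method(caller)
    }

g = {
    "com.legacy.app.UserService.authenticate": [
        "com.legacy.db.UserDAO.findByUsername",
        "java.util.Optional.ofNullable",   # JDK call, filtered out
    ],
    "java.util.ArrayList.add": [],          # JDK caller, filtered out
}
print(filter_graph(g))
```

An allow-list is safer than a deny-list of JDK packages, because third-party libraries (Spring, Apache Commons, vendor SDKs) are excluded by default.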
Problem 2: “Missing method calls - graph is incomplete”
- Why: Reflection, lambda expressions, or method references not detected.
- Fix: Combine static analysis with dynamic profiling (run app with instrumentation).
- Quick test: Cross-check against actual execution trace from debugger.
Problem 3: “Singleton detection produces false positives”
- Why: Any method named getInstance() triggers detection.
- Fix: Check for static field + lazy initialization pattern, not just method name.
- Quick test: Manual code review of detected Singletons.
Problem 4: “Generated diagram is unreadable - 1000+ lines of ASCII”
- Why: Showing entire call graph instead of high-level architecture.
- Fix: Create multiple diagrams: overview + detailed sub-graphs for each layer.
- Quick test: Overview diagram should fit on one screen (<50 lines).
Problem 5: “Analysis takes 10 minutes for 500 files”
- Why: Parsing each file from scratch on every run.
- Fix: Cache parsed ASTs, only re-parse changed files.
- Quick test: Second run should be <5 seconds (cache hit).
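A minimal sketch of that cache, keyed on file modification time; `parse()` here is a stand-in for whatever parser you use (javalang, tree-sitter, etc.):

```python
import os

_cache = {}  # path -> (mtime, parsed_tree)

def parse(path):
    # Stand-in for real parsing; returns the raw source for illustration
    with open(path) as f:
        return f.read()

def parse_cached(path):
    """Re-parse only if the file changed since it was last cached."""
    mtime = os.path.getmtime(path)
    hit = _cache.get(path)
    if hit and hit[0] == mtime:
        return hit[1]              # cache hit: skip re-parsing
    tree = parse(path)
    _cache[path] = (mtime, tree)
    return tree
```

For persistence across runs, pickle `_cache` to disk at exit and reload it at startup; mtime comparison still invalidates stale entries correctly.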
Definition of Done
- Entry points identified: Main methods, servlets, scheduled tasks are found
- Call graph built: Method-to-method calls extracted from source code
- Hot spots detected: Most-called methods identified (top 10)
- Design patterns recognized: Singleton, Factory, DAO, etc. detected
- Architecture diagram generated: High-level 3-tier or N-tier diagram created
- Critical paths documented: Authentication, payment, etc. workflows traced
- Technical debt flagged: Security issues (MD5, SQL injection) highlighted
- Markdown format: DIAGRAM.md is readable in GitHub
- Validation: Diagram matches actual execution (verified with debugger)
- Performance: Analysis completes in <2 minutes for 500-file codebase