Project 17: Incident Response Decision Tree
Build a decision tree for rebuild vs remediation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python |
| Alternative Programming Languages | PowerShell |
| Coolness Level | Level 3 |
| Business Potential | Level 2 |
| Prerequisites | OS internals basics, CLI usage, logging familiarity |
| Key Topics | IR decisioning |
1. Learning Objectives
By completing this project, you will:
- Build a repeatable incident response decisioning workflow (rebuild vs. remediation).
- Generate reports with deterministic outputs.
- Translate findings into actionable recommendations.
2. All Theory Needed (Per-Concept Breakdown)
Incident Response Decisioning for Boot and Kernel Compromise
Fundamentals
Incident response for rootkits is different from ordinary malware response because the trust boundary itself may be compromised. If the kernel or boot chain is untrusted, in-host remediation is unreliable. Decisioning focuses on evidence thresholds: when to contain, when to collect, and when to rebuild from trusted media. A good decision tree reduces ambiguity by defining measurable triggers and approvals. The goal is to balance operational continuity with integrity and safety.
Deep Dive into the concept
Rootkit response begins with evidence. You must collect volatile data early: memory images, process lists, and network state. If you wait, the evidence may be lost or altered. The decision tree should define what constitutes “high confidence” of kernel compromise: for example, mismatched boot hashes, unsigned drivers loaded, or memory forensics indicating hidden kernel objects. These triggers should be measurable so responders are not forced to improvise.
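As a hedged illustration, the measurable triggers described above can be encoded as data rather than prose, so every responder evaluates the same evidence the same way. The signal names, weights, and cutoff below are assumptions for the sketch, not a standard:

```python
# Hypothetical rule table: signal names, weights, and the rebuild cutoff are
# assumptions for this sketch, not a standard; tune them to your environment.

HIGH_CONFIDENCE_TRIGGERS = {
    "boot_hash_mismatch": 3,      # measured boot hash differs from the baseline
    "unsigned_driver_loaded": 2,  # kernel driver without a valid signature
    "hidden_kernel_object": 3,    # memory forensics found unlinked kernel objects
}

REBUILD_THRESHOLD = 3  # assumed cutoff for "high confidence" of kernel compromise


def evaluate_triggers(observed: dict) -> str:
    """Return 'rebuild' when the weighted evidence crosses the threshold."""
    score = sum(
        weight
        for signal, weight in HIGH_CONFIDENCE_TRIGGERS.items()
        if observed.get(signal, False)
    )
    return "rebuild" if score >= REBUILD_THRESHOLD else "contain + investigate"


print(evaluate_triggers({"boot_hash_mismatch": True}))  # -> rebuild
```

Keeping the triggers in data also makes the thresholds reviewable and versionable alongside the playbook.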
Containment decisions depend on risk. A suspected bootkit on a domain controller is higher risk than on a non-critical workstation. The decision tree should include asset criticality, data sensitivity, and business impact. It should also define who approves destructive actions like reimaging. This reduces delays when time matters.
Rebuild vs remediation is the central choice. For many boot or kernel compromises, rebuild from trusted media is the safest path. Live remediation may be possible in some cases, but it should be the exception. The decision tree should include evidence capture requirements before rebuild, because rebuilding destroys evidence. You must also define post-rebuild validation: verifying Secure Boot, restoring baselines, and confirming that suspicious indicators are resolved.
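One way to make the post-rebuild validation described above auditable is a simple checklist object that records each check and its supporting evidence. The check names, host value, and evidence strings below are hypothetical placeholders; the actual verification commands are platform-specific:

```python
# Hypothetical post-rebuild validation checklist; check names, host value, and
# evidence strings are placeholders, and the actual verification commands
# (firmware queries, baseline comparison) are platform-specific.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ValidationCheck:
    name: str
    passed: bool
    evidence: str = ""


@dataclass
class PostRebuildValidation:
    host: str
    checks: list = field(default_factory=list)

    def record(self, name: str, passed: bool, evidence: str = "") -> None:
        self.checks.append(ValidationCheck(name, passed, evidence))

    def summary(self) -> dict:
        return {
            "host": self.host,
            "validated_at": datetime.now(timezone.utc).isoformat(),
            "all_passed": all(c.passed for c in self.checks),
            "checks": [vars(c) for c in self.checks],
        }


validation = PostRebuildValidation(host="ws-042")
validation.record("secure_boot_enabled", True, "firmware reports Secure Boot on")
validation.record("baseline_restored", True, "boot measurements match golden image")
validation.record("indicators_resolved", True, "prior IOCs no longer present")
print(validation.summary()["all_passed"])  # -> True
```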
Finally, communication is a required output. Rootkit incidents require clear reporting to stakeholders. A decision tree should include escalation paths and reporting artifacts: what was observed, what was done, and what remains unknown. This is how you maintain trust in the response process.
How this fits into the project
You will apply this in Section 3.7 (Real World Outcome), Section 5.10 (Implementation Phases), and Section 11 (Self-Assessment). Also used in: P13-bootkit-response-playbook, P17-incident-response-decision-tree.
Definitions & key terms
- Containment: Actions that limit spread or impact of a compromise.
- Rebuild: Reimage or reinstall from trusted media to restore integrity.
- Evidence threshold: Measured criteria that justify a response decision.
- Post-rebuild validation: Checks that confirm restored integrity after remediation.
Mental model diagram
[Detection Signal] -> [Evidence Capture] -> [Decision Threshold]
                              |                       |
                              v                       v
                          [Contain] <-----------> [Rebuild or Remediate]
How it works (step-by-step)
- Capture volatile evidence and preserve it externally.
- Evaluate signals against defined thresholds.
- Decide containment actions based on asset criticality.
- Rebuild from trusted media when kernel integrity is in doubt.
- Validate post-rebuild state and update baselines.
Minimal concrete example
# illustrative decision logic; signal values come from your collectors
if boot_hash_mismatch and unsigned_driver_loaded:
    action = 'rebuild'
elif suspicious_process and no_kernel_evidence:
    action = 'contain + investigate'
else:
    action = 'monitor'
Common misconceptions
- “You can always clean a rootkit in-place.” Kernel compromise undermines trust in local tools.
- “Evidence can be collected later.” Volatile data disappears quickly.
- “Rebuild is overkill.” It is often the only high-confidence remediation.
Check-your-understanding questions
- What signals justify a rebuild?
- Why must evidence be captured before remediation?
- Who should approve destructive actions?
Check-your-understanding answers
- Boot hash mismatches, hidden kernel objects, or unsigned drivers are strong triggers.
- Remediation can destroy volatile evidence needed for attribution or learning.
- Asset owners and security leadership should approve to balance risk and impact.
Real-world applications
- Enterprise incident response playbooks for bootkits.
- High-assurance environments where integrity is critical.
Where you’ll apply it
You will apply this in Section 3.7 (Real World Outcome), Section 5.10 (Implementation Phases), and Section 11 (Self-Assessment). Also used in: P13-bootkit-response-playbook, P17-incident-response-decision-tree.
References
- NIST SP 800-61 (Computer Security Incident Handling Guide)
- SANS Incident Response resources
Key insights
Rootkit response is about trust: when trust is broken, rebuild is the safest path.
Summary
Define thresholds, capture evidence early, and prioritize integrity over convenience.
Homework/Exercises to practice the concept
- Draft a decision tree for boot integrity violations.
- List the evidence you must collect before reimaging.
Solutions to the homework/exercises
- Your decision tree should include at least three measurable triggers.
- Evidence should include memory image, disk image, and boot configuration.
3. Project Specification
3.1 What You Will Build
A tool or document that delivers a decision tree for choosing between rebuild and live remediation after a suspected boot or kernel compromise.
3.2 Functional Requirements
- Collect required system artifacts for the task.
- Normalize data and produce a report output.
- Provide a deterministic golden-path demo.
- Include explicit failure handling and exit codes.
3.3 Non-Functional Requirements
- Performance: Complete within a typical maintenance window.
- Reliability: Outputs must be deterministic and versioned.
- Usability: Clear CLI output and documentation.
3.4 Example Usage / Output
$ ./P17-incident-response-decision-tree.py --report
[ok] report generated
3.5 Data Formats / Schemas / Protocols
Report JSON schema with fields: timestamp, host, findings, severity, remediation.
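For illustration, a minimal report instance that conforms to this schema might look like the following sketch; all field values are made up, and the top-level severity is assumed to be the highest finding severity:

```python
# Illustrative report instance that matches the schema above; all values are
# made up, and the top-level severity mirrors the highest finding severity.
import json

report = {
    "timestamp": "2024-01-01T00:00:00Z",
    "host": "ws-042",
    "findings": [
        {
            "id": "F-001",
            "description": "Boot hash does not match the recorded baseline",
            "severity": "high",
            "evidence": "measured boot hash differs from the golden image",
            "remediation": "rebuild from trusted media",
        }
    ],
    "severity": "high",
    "remediation": "rebuild",
}

print(json.dumps(report, indent=2, sort_keys=True))
```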
3.6 Edge Cases
- Missing permissions or insufficient privileges.
- Tooling not installed (e.g., missing sysctl or OS query tools).
- Empty data sets (no drivers/modules found).
3.7 Real World Outcome
A deterministic report output stored in a case directory with hashes.
3.7.1 How to Run (Copy/Paste)
./P17-incident-response-decision-tree.py --out reports/P17-incident-response-decision-tree.json
3.7.2 Golden Path Demo (Deterministic)
- Report file exists and includes findings with severity.
3.7.3 Failure Demo
$ ./P17-incident-response-decision-tree.py --out /readonly/report.json
[error] cannot write report file
exit code: 2
Exit Codes:
- 0: success
- 2: output error
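A minimal sketch of the golden-path and failure demos above, assuming exit code 0 for success and 2 for output errors; the function and argument names are illustrative:

```python
# Sketch of the success and failure paths, assuming exit code 0 for success
# and 2 for output errors; function and argument names are illustrative.
import json
import sys
from pathlib import Path


def write_report(report: dict, out_path: str) -> int:
    try:
        path = Path(out_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(report, indent=2, sort_keys=True))
    except OSError as err:
        print(f"[error] cannot write report file: {err}", file=sys.stderr)
        return 2
    print("[ok] report generated")
    return 0


if __name__ == "__main__":
    sys.exit(write_report({"findings": []}, "reports/P17-incident-response-decision-tree.json"))
```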
4. Solution Architecture
4.1 High-Level Design
[Collector] -> [Analyzer] -> [Report]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Collector | Collects raw artifacts | Prefer OS-native tools |
| Analyzer | Normalizes and scores findings | Deterministic rules |
| Reporter | Outputs report | JSON + Markdown |
4.3 Data Structures (No Full Code)
finding = { id, description, severity, evidence, remediation }
4.4 Algorithm Overview
Key Algorithm: Normalize and Score
- Collect artifacts.
- Normalize fields.
- Apply scoring rules.
- Output report.
Complexity Analysis:
- Time: O(n) for n artifacts.
- Space: O(n) for report.
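A minimal sketch of this normalize-and-score pass, assuming artifact dictionaries from the collector; the field names and severity rules are placeholders chosen for illustration:

```python
# Minimal sketch of the normalize-and-score pass; artifact field names and the
# severity rules are assumptions chosen for illustration, not fixed detections.

SEVERITY_RULES = [
    # (predicate, severity) evaluated in order; first match wins
    (lambda a: bool(a.get("boot_hash_mismatch")), "high"),
    (lambda a: a.get("signed") is False, "medium"),
]


def normalize(raw: dict) -> dict:
    """Lower-case keys and strip string values so scoring sees uniform fields."""
    return {
        str(key).lower(): value.strip() if isinstance(value, str) else value
        for key, value in raw.items()
    }


def score(artifact: dict) -> str:
    for predicate, severity in SEVERITY_RULES:
        if predicate(artifact):
            return severity
    return "info"


def analyze(raw_artifacts: list) -> list:
    findings = []
    for raw in raw_artifacts:
        artifact = normalize(raw)
        findings.append({"artifact": artifact, "severity": score(artifact)})
    # Deterministic ordering: highest severity first, then a stable tiebreaker.
    rank = {"high": 0, "medium": 1, "info": 2}
    return sorted(findings, key=lambda f: (rank[f["severity"]], repr(f["artifact"])))
```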
5. Implementation Guide
5.1 Development Environment Setup
python3 -m venv .venv && source .venv/bin/activate
# install OS-specific tools as needed
5.2 Project Structure
project/
|-- src/
| `-- main.py
|-- reports/
`-- README.md
5.3 The Core Question You’re Answering
“When do you rebuild versus attempt live remediation?”
This project turns theory into a repeatable, auditable workflow.
5.4 Concepts You Must Understand First
- Relevant OS security controls
- Detection workflows
- Evidence handling
5.5 Questions to Guide Your Design
- What data sources are trusted for this task?
- How will you normalize differences across OS versions?
- What is a high-confidence signal vs noise?
5.6 Thinking Exercise
Sketch a pipeline from data collection to report output.
5.7 The Interview Questions They’ll Ask
- What is the main trust boundary in this project?
- How do you validate findings?
- What would you automate in production?
5.8 Hints in Layers
Hint 1: Start with a small, deterministic dataset.
Hint 2: Normalize output fields early.
Hint 3: Add a failure path with clear exit codes.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Rootkit defense | Practical Malware Analysis | Rootkit chapters |
| OS internals | Operating Systems: Three Easy Pieces | Processes and files |
5.10 Implementation Phases
Phase 1: Data Collection (3-4 days)
Goals: Collect raw artifacts reliably.
Tasks:
- Identify OS-native tools.
- Capture sample data.
Checkpoint: Raw dataset stored.
Phase 2: Analysis & Reporting (4-5 days)
Goals: Normalize and score findings.
Tasks:
- Build analyzer.
- Generate report.
Checkpoint: Deterministic report generated.
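One way to meet the deterministic-report checkpoint, and the hashed case directory described in Section 3.7, is to serialize with sorted keys and write a sidecar SHA-256 digest. The paths and file names below are assumptions:

```python
# Sketch of a deterministic report with a sidecar hash, assuming the case
# directory layout implied by Section 3.7; paths and file names are illustrative.
import hashlib
import json
from pathlib import Path


def write_deterministic_report(findings: list, case_dir: str) -> Path:
    out_dir = Path(case_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # sort_keys plus a pre-sorted findings list gives byte-identical output
    # for identical input, so re-runs can be diffed and hashed reliably.
    payload = json.dumps({"findings": findings}, indent=2, sort_keys=True)
    report_path = out_dir / "report.json"
    report_path.write_text(payload)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    (out_dir / "report.json.sha256").write_text(f"{digest}  report.json\n")
    return report_path
```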
Phase 3: Validation (2-3 days)
Goals: Validate rules and handle edge cases.
Tasks:
- Add failure tests.
- Document runbook.
Checkpoint: Failure cases documented.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Report format | JSON, CSV | JSON | Structured and diffable |
| Scoring | Simple, Weighted | Weighted | Prioritize high risk findings |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Parser logic | Sample data parsing |
| Integration Tests | End-to-end run | Generate report |
| Edge Case Tests | Missing permissions | Error path |
6.2 Critical Test Cases
- Report generated with deterministic ordering.
- Exit code indicates failure on invalid output path.
- At least one high-risk finding is flagged in test data.
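A hedged pytest sketch for these cases; `analyze` and the `src/main.py` entry point refer to the hypothetical sketches earlier in this guide and should be adjusted to your actual layout:

```python
# Hedged pytest sketch; `analyze` and the src/main.py CLI are the hypothetical
# pieces sketched earlier in this guide; adjust imports and paths to your layout.
import subprocess
import sys

from src.main import analyze  # assumed module path


def test_report_ordering_is_deterministic():
    artifacts = [
        {"name": "driver-b", "signed": False},
        {"name": "boot-a", "boot_hash_mismatch": True},
    ]
    assert analyze(artifacts) == analyze(list(reversed(artifacts)))


def test_invalid_output_path_returns_exit_code_2():
    result = subprocess.run(
        [sys.executable, "src/main.py", "--out", "/readonly/report.json"],
        capture_output=True,
    )
    assert result.returncode == 2


def test_high_risk_finding_is_flagged():
    findings = analyze([{"boot_hash_mismatch": True}])
    assert any(f["severity"] == "high" for f in findings)
```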
6.3 Test Data
Provide a small fixture file with one known suspicious artifact.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Noisy results | Too many alerts | Add normalization and thresholds |
| Missing permissions | Script fails | Detect and warn early |
7.2 Debugging Strategies
- Log raw inputs before normalization.
- Add verbose mode to show rule evaluation.
7.3 Performance Traps
Scanning large datasets without filtering can be slow; restrict scope to critical paths.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a Markdown summary report.
8.2 Intermediate Extensions
- Add a JSON schema validator for output.
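A possible starting point for this extension, using the third-party `jsonschema` package; the schema mirrors Section 3.5 and is deliberately minimal:

```python
# Possible starting point for the validator extension, using the third-party
# jsonschema package; the schema mirrors Section 3.5 and is deliberately minimal.
import json

import jsonschema

REPORT_SCHEMA = {
    "type": "object",
    "required": ["timestamp", "host", "findings", "severity", "remediation"],
    "properties": {
        "timestamp": {"type": "string"},
        "host": {"type": "string"},
        "findings": {"type": "array"},
        "severity": {"type": "string"},
        "remediation": {"type": "string"},
    },
}


def validate_report(path: str) -> None:
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    jsonschema.validate(instance=report, schema=REPORT_SCHEMA)  # raises on failure
```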
8.3 Advanced Extensions
- Integrate with a SIEM or ticketing system.
9. Real-World Connections
9.1 Industry Applications
- Security operations audits and detection validation.
9.2 Related Open Source Projects
- osquery - endpoint inventory
9.3 Interview Relevance
- Discussing detection workflows and auditability.
10. Resources
10.1 Essential Reading
- Practical Malware Analysis - rootkit detection chapters
10.2 Video Resources
- Conference talks on rootkit detection
10.3 Tools & Documentation
- OS-native logging and audit tools
10.4 Related Projects in This Series
- Previous: P16-persistence-atlas
- Next: P18-mitre-coverage-mapping
11. Self-Assessment Checklist
11.1 Understanding
- I can describe the trust boundary for this task.
11.2 Implementation
- Report generation is deterministic.
11.3 Growth
- I can explain how to operationalize this check.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Report created and contains at least one finding.
Full Completion:
- Findings are categorized with remediation guidance.
Excellence (Going Above & Beyond):
- Integrated into a broader toolkit or pipeline.