Project 3: Integrity Baseline Builder
Build a tool that hashes boot and kernel components to create a trusted baseline.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python |
| Alternative Programming Languages | Bash, PowerShell |
| Coolness Level | Level 3 |
| Business Potential | Level 3 |
| Prerequisites | Hashing basics, filesystem permissions, boot component knowledge |
| Key Topics | Integrity baselines, hashing, drift management |
1. Learning Objectives
By completing this project, you will:
- Design a baseline schema for boot and kernel artifacts.
- Implement a hashing pipeline with offline storage.
- Build a diff workflow for baseline vs live state.
2. All Theory Needed (Per-Concept Breakdown)
Integrity Baselines, Hashing, and Drift Management
Fundamentals An integrity baseline is a recorded snapshot of trusted system components and their cryptographic hashes. In rootkit defense, baselines allow you to detect subtle changes to bootloaders, kernels, drivers, and critical configuration files. A hash is a fingerprint; when it changes, the underlying content has changed. Baselines must be stored out-of-band so a compromised system cannot rewrite them. Drift management is the discipline of updating baselines after legitimate changes and documenting why a change is expected. Without drift management, baselines either generate noise or become obsolete.
Deep Dive into the concept A baseline is only meaningful if you trust the state you captured. That means you collect it from a known-good system state and you store it where it cannot be modified by an attacker. For boot and kernel integrity, the baseline should include paths, hashes, sizes, signer metadata, and version identifiers. Hash choice matters: SHA-256 or better is standard for integrity. You also need to record the hash tool version and operating system build because differences in tooling or build can change file layouts or metadata.
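Kernel images and driver directories can be large, so one common approach is to hash each file in chunks rather than reading it whole. The sketch below assumes SHA-256 from Python's standard `hashlib`; the function name `sha256_file` and the 1 MiB chunk size are illustrative choices, not part of the project specification.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest does not depend on the chunk size, so the same file produces the same hash on every run as long as its contents are unchanged.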
Drift is inevitable because software updates change files. The mistake is treating drift as noise. Instead, drift should be managed as a controlled change process. When a patch is applied, you update the baseline with evidence: the patch ID, the time, and the list of changed components. Some teams sign baseline files or store them in a WORM location to prevent tampering. Another approach is to store baselines in a central secure system and compare the local system to that secure copy.
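As a rough illustration of tamper-evident baselines, the sketch below seals a baseline with an HMAC-SHA256 tag from the standard library. This is a stand-in, not a public-key signature: it assumes a secret key that is kept off the monitored host, and the names `canonical_bytes`, `seal_baseline`, and `verify_baseline` are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(baseline: dict) -> bytes:
    """Serialize with sorted keys so equal content always yields equal bytes."""
    return json.dumps(baseline, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_baseline(baseline: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical baseline bytes."""
    return hmac.new(key, canonical_bytes(baseline), hashlib.sha256).hexdigest()


def verify_baseline(baseline: dict, key: bytes, tag: str) -> bool:
    """Compare tags in constant time; False means the baseline or tag was altered."""
    return hmac.compare_digest(seal_baseline(baseline, key), tag)
```

Anyone holding the key can forge a tag, so a real deployment may prefer an asymmetric signature where only the verification key lives near the host.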
The baseline schema should be explicit. For each component, include: file path, hash, signer, file size, timestamp, and expected owner/permissions. This allows you to detect not just content changes but also permission changes that might signal tampering. If you run baselines across multiple OSes, create a common schema but allow OS-specific fields (e.g., Authenticode signer on Windows, module signing key on Linux, SIP state on macOS).
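A per-component record along these lines could be sketched as a dataclass filled from `os.stat`. The field names are assumptions chosen to mirror the list above; the `signer` field is left empty here because extracting it is OS-specific, and the whole-file read is a simplification.

```python
import hashlib
import os
import stat
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ComponentRecord:
    path: str
    sha256: str
    size: int
    mode: str                     # permission bits in octal, e.g. "0o644"
    uid: int                      # numeric owner
    gid: int                      # numeric group
    collected_at: str             # UTC timestamp in ISO 8601 form
    signer: Optional[str] = None  # OS-specific; not collected in this sketch


def collect_record(path: str) -> ComponentRecord:
    """Hash one file and capture the metadata needed to spot permission drift."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()  # whole-file read for brevity
    info = os.stat(path)
    return ComponentRecord(
        path=path,
        sha256=digest,
        size=info.st_size,
        mode=oct(stat.S_IMODE(info.st_mode)),
        uid=info.st_uid,
        gid=info.st_gid,
        collected_at=datetime.now(timezone.utc).isoformat(timespec="seconds"),
    )
```

Calling `asdict(collect_record(path))` yields a plain dictionary that can be written into the baseline JSON.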
Operationally, baselines are only useful if there is a diff workflow. A diff tool should categorize changes: expected (known patch), suspicious (unsigned driver appears), and unknown (hash mismatch without explanation). The diff output should be actionable: it should highlight what changed, where, and how to reproduce the check. If a system is compromised, baselines enable you to identify the earliest point of drift and decide whether to rebuild.
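The three categories can be expressed as a small rule function. This is a minimal sketch: the shape of the `change` dictionary (`path`, `kind`, `signer`) and the idea of a set of paths covered by approved patches are assumptions for illustration.

```python
def categorize(change: dict, approved_paths: set) -> str:
    """Label one drift entry as 'expected', 'suspicious', or 'unknown'."""
    if change["path"] in approved_paths:
        return "expected"    # covered by a documented, approved patch
    if change["kind"] == "added" and not change.get("signer"):
        return "suspicious"  # a new component with no signer metadata
    return "unknown"         # drift with no explanation on record
```

Checking the approved set first means a documented patch is never escalated; a stricter policy could check for unsigned additions first instead.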
Finally, baselines are not just for detection; they are for accountability. A well-maintained baseline program forces teams to document changes, which makes stealthy tampering harder to hide. In rootkit defense, the baseline is the anchor that turns “unknown” into a measurable delta.
How this fits into the projects You will apply this in Section 3.2 (Functional Requirements), Section 3.5 (Data Formats), and Section 6.2 (Critical Test Cases). Also used in: P03-integrity-baseline-builder, P08-boot-integrity-monitor, P20-rootkit-defense-toolkit.
Definitions & key terms
- Baseline: A trusted snapshot of component hashes and metadata used for comparison.
- Drift: Any change between current state and baseline, whether legitimate or malicious.
- Hash: A cryptographic fingerprint that changes when file contents change.
- WORM storage: Write-once, read-many storage to prevent tampering with baselines.
Mental model diagram
[Known-Good System]
       | (hash + metadata)
       v
[Baseline JSON] --(stored offline)--> [Comparison Engine]
       ^                                |
       | (hash live system)             v
[Current System] -------------> [Diff Report]
How it works (step-by-step)
- Collect hashes and metadata from a known-good system state.
- Store the baseline offline or in a secured repository.
- Collect the same fields from the live system.
- Compare current state to baseline and categorize drift.
- Update baseline only after verified change approval.
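The last step, updating only after approval, might look like the sketch below. It assumes the baseline keeps a list of components with `path` and `sha256` keys, and it adds a hypothetical `change_log` field to hold the patch ID, time, and changed paths; none of these names come from the project specification.

```python
from datetime import datetime, timezone


def apply_approved_update(baseline: dict, changes: dict, patch_id: str) -> dict:
    """Return a new baseline with updated hashes plus an audit entry.

    `changes` maps a component path to its new SHA-256 hex digest.
    """
    if not patch_id:
        raise ValueError("refusing to update baseline without a patch ID")
    new_components = []
    for component in baseline["components"]:
        if component["path"] in changes:
            # Copy the record so the original baseline is left untouched.
            component = {**component, "sha256": changes[component["path"]]}
        new_components.append(component)
    log_entry = {
        "patch_id": patch_id,
        "applied_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "changed_paths": sorted(changes),
    }
    updated = {**baseline, "components": new_components}
    updated["change_log"] = list(baseline.get("change_log", [])) + [log_entry]
    return updated
```

Returning a new dictionary instead of mutating the old one keeps the previous baseline available for comparison until the update has been reviewed.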
Minimal concrete example
{
"component": "kernel",
"path": "/boot/vmlinuz-6.6.8",
"sha256": "9f...",
"signer": "Build Key",
"version": "6.6.8",
"collected_at": "2026-01-01T10:00:00Z"
}
Common misconceptions
- “Baselines are set-and-forget.” Without updates, they become noisy and ignored.
- “Hashes alone are enough.” Metadata like signer and permissions provides additional context.
- “Storing baselines on the same host is safe.” A compromised host can tamper with them.
Check-your-understanding questions
- Why should baseline data be stored offline?
- What fields beyond hashes are useful in a baseline?
- How do you reduce false positives from software updates?
Check-your-understanding answers
- Offline storage prevents a compromised system from rewriting the baseline.
- Signer, file size, version, and permissions help identify suspicious changes.
- Track patch IDs and update baselines only after verified changes.
Real-world applications
- System integrity monitoring in regulated environments.
- Golden image compliance validation for server fleets.
Where you’ll apply it You will apply this in Section 3.2 (Functional Requirements), Section 3.5 (Data Formats), and Section 6.2 (Critical Test Cases). Also used in: P03-integrity-baseline-builder, P08-boot-integrity-monitor, P20-rootkit-defense-toolkit.
References
- NIST SP 800-94 (Guide to Intrusion Detection and Prevention Systems)
- CIS Benchmarks - integrity monitoring guidance
Key insights A baseline turns invisible changes into measurable drift with accountability.
Summary Hash baselines detect tampering, but only if you store and update them correctly.
Homework/Exercises to practice the concept
- Design a baseline schema for boot files and drivers.
- Write a diff rule that labels changes as expected or suspicious.
Solutions to the homework/exercises
- Your schema should include path, hash, signer, size, timestamp, and OS build.
- A diff rule should check for missing patch IDs or unsigned new components.
3. Project Specification
3.1 What You Will Build
A baseline builder that collects hashes, signer metadata, and version information for bootloaders, kernels, and drivers/modules. It outputs a signed JSON baseline and provides a diff mode that compares future system states against that baseline.
3.2 Functional Requirements
- Collect hashes and metadata for bootloader, kernel, and drivers/modules.
- Output a signed baseline JSON file stored outside the host.
- Provide a diff mode that flags additions, deletions, and changes.
- Record OS build and collection timestamp.
3.3 Non-Functional Requirements
- Performance: Baseline build completes within 10 minutes.
- Reliability: Hashes are deterministic and stable across runs.
- Usability: Baseline and diff outputs are human-readable.
3.4 Example Usage / Output
$ ./baseline_builder.py --os linux --out baselines/linux_2026-01-01.json
[baseline] boot files: 12
[baseline] kernel: vmlinuz-6.6.8
[baseline] modules: 189
3.5 Data Formats / Schemas / Protocols
Baseline JSON:
{ "os_build": "6.6.8", "collected_at": "2026-01-01T10:00:00Z", "components": [ {"path": "/boot/vmlinuz-6.6.8", "sha256": "9f…"} ] }
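A loader can reject malformed baselines before any comparison runs. The following sketch checks only the fields shown in the format above; the function name `parse_baseline` and the choice to raise `ValueError` are assumptions.

```python
import json

REQUIRED_TOP_LEVEL = ("os_build", "collected_at", "components")
REQUIRED_COMPONENT = ("path", "sha256")


def parse_baseline(text: str) -> dict:
    """Parse baseline JSON and raise ValueError if a required field is missing."""
    data = json.loads(text)  # invalid JSON raises JSONDecodeError, a ValueError subclass
    for key in REQUIRED_TOP_LEVEL:
        if key not in data:
            raise ValueError(f"baseline missing field: {key}")
    for index, component in enumerate(data["components"]):
        for key in REQUIRED_COMPONENT:
            if key not in component:
                raise ValueError(f"component {index} missing field: {key}")
    return data
```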
3.6 Edge Cases
- Baseline stored on the same disk as the system.
- Kernel update changes hashes without baseline update.
- Unsigned modules present in an otherwise signed system.
3.7 Real World Outcome
A trusted baseline file stored offline and a diff report that highlights drift.
3.7.1 How to Run (Copy/Paste)
./baseline_builder.py --os linux --out /secure/baselines/linux_2026-01-01.json
./baseline_builder.py --diff /secure/baselines/linux_2026-01-01.json --live
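The flags used in these commands could be wired up with `argparse` roughly as follows. Only `--os`, `--out`, `--diff`, and `--live` appear in this guide; the accepted `--os` values other than `linux` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the build and diff modes shown in the commands above."""
    parser = argparse.ArgumentParser(prog="baseline_builder.py")
    parser.add_argument("--os", choices=["linux", "windows", "macos"],
                        help="which OS-specific collector to use (values assumed)")
    parser.add_argument("--out", metavar="PATH",
                        help="where to write the new baseline JSON")
    parser.add_argument("--diff", metavar="BASELINE",
                        help="baseline file to compare against")
    parser.add_argument("--live", action="store_true",
                        help="hash the running system for the comparison")
    return parser
```

Calling `build_parser().parse_args()` in the script's entry point then yields `args.out`, `args.diff`, and `args.live` for dispatching to build or diff mode.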
3.7.2 Golden Path Demo (Deterministic)
- Baseline JSON is produced with hashes and signer metadata.
- Diff run reports zero mismatches on a clean system.
3.7.3 Failure Demo
$ ./baseline_builder.py --diff /secure/baselines/linux_2026-01-01.json --live
[diff] mismatch: /boot/vmlinuz-6.6.8
exit code: 4
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 4 | Integrity mismatch detected |
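Mapping the diff result to these exit codes can be kept in one small function. This is a sketch: the constant names and the `report_and_exit_code` helper are hypothetical, while the codes and the `[diff] mismatch:` line format follow the demo above.

```python
EXIT_OK = 0
EXIT_MISMATCH = 4


def report_and_exit_code(mismatched_paths: list) -> int:
    """Print one line per mismatch and return the matching process exit code."""
    for path in mismatched_paths:
        print(f"[diff] mismatch: {path}")
    return EXIT_MISMATCH if mismatched_paths else EXIT_OK
```

A script would typically finish with `sys.exit(report_and_exit_code(mismatches))` so that schedulers and monitoring jobs can react to a nonzero status.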
4. Solution Architecture
4.1 High-Level Design
[Collector] -> [Hash Engine] -> [Baseline JSON] -> [Signer]
                                       |
                                       v
                                [Diff Engine] -> [Report]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Collector | Enumerates boot/kernel components | OS-specific adapters |
| Hash Engine | Computes SHA-256 hashes | Streamed hashing for large files |
| Diff Engine | Compares baseline to live | Categorize drift |
4.3 Data Structures (No Full Code)
component = { path, sha256, signer, version, size }
4.4 Algorithm Overview
Key Algorithm: Baseline Diff
- Load baseline file.
- Collect current state and hash components.
- Compare by path and hash.
- Emit diff report with categories.
Complexity Analysis:
- Time: O(n) for n components.
- Space: O(n) for baseline map.
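The comparison step can be sketched with two dictionaries keyed by path, which gives the linear-time lookup described above. The function name `diff_components` and the result keys are assumptions; the final sort is only there for stable report output and adds a logarithmic factor on the changed entries.

```python
def diff_components(baseline: list, live: list) -> dict:
    """Compare two component lists by path and hash.

    Each component is a dict with at least 'path' and 'sha256'.
    """
    base_map = {c["path"]: c["sha256"] for c in baseline}
    live_map = {c["path"]: c["sha256"] for c in live}
    shared = base_map.keys() & live_map.keys()
    return {
        "added": sorted(live_map.keys() - base_map.keys()),
        "removed": sorted(base_map.keys() - live_map.keys()),
        "changed": sorted(p for p in shared if base_map[p] != live_map[p]),
    }
```

An empty result in all three lists corresponds to the clean-system case where the diff run reports zero mismatches.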
5. Implementation Guide
5.1 Development Environment Setup
python3 -m venv .venv && source .venv/bin/activate
pip install cryptography
5.2 Project Structure
baseline-builder/
|-- src/
|   |-- baseline_builder.py
|   `-- diff.py
|-- baselines/
`-- reports/
5.3 The Core Question You’re Answering
“What does a trusted boot and kernel baseline look like on this system?”
A baseline is your reference truth; without it, you cannot detect stealthy drift.
5.4 Concepts You Must Understand First
- Integrity baselines and drift management
- Hashing and signature verification
- Boot component inventory
5.5 Questions to Guide Your Design
- Which components are included in the baseline?
- Where will the baseline be stored and who can access it?
- How will you handle updates and version changes?
5.6 Thinking Exercise
Draft a baseline schema that includes path, hash, signer, and version.
5.7 The Interview Questions They’ll Ask
- Why should baselines be stored offline?
- How do you reduce false positives from updates?
- What metadata should accompany hashes?
5.8 Hints in Layers
Hint 1: Start with bootloader, kernel image, and drivers.
Hint 2: Store the baseline in a signed JSON file.
Hint 3: Include OS build metadata in every baseline.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Integrity | Foundations of Information Security | Integrity and assurance |
| Operating systems | Operating Systems: Three Easy Pieces | Boot and storage |
5.10 Implementation Phases
Phase 1: Schema & Collector (3-4 days)
Goals:
- Define baseline schema.
- Implement component discovery.
Tasks:
- Draft JSON schema.
- Implement OS-specific collectors.
Checkpoint: Component list generated.
Phase 2: Hashing & Signing (3-4 days)
Goals:
- Hash and sign baseline.
- Store baseline offline.
Tasks:
- Implement hashing.
- Add signature support.
Checkpoint: Signed baseline stored.
Phase 3: Diff & Reporting (2-3 days)
Goals:
- Implement diff and report.
- Handle drift updates.
Tasks:
- Compare baseline to live.
- Generate diff report.
Checkpoint: Diff report highlights changes.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Hash algorithm | SHA-256, SHA-1 | SHA-256 | SHA-1 has known practical collision attacks; SHA-256 does not |
| Output format | JSON, YAML | JSON | Standard tooling support |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Hashing and schema | Hash of known files |
| Integration Tests | Baseline creation | End-to-end run |
| Edge Case Tests | Update drift | Hash mismatch after update |
6.2 Critical Test Cases
- Baseline file created and signed.
- Diff reports mismatch for altered file.
- Baseline stored off-host and read-only.
6.3 Test Data
Use a known file and modify it to verify diff detection.
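A self-contained check of this idea might write a file in a temporary directory, hash it, alter one byte, and hash it again. The helper name `modification_is_detected` and the sample contents are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path


def modification_is_detected() -> bool:
    """Return True if changing a known file changes its SHA-256 digest."""
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "sample.bin"
        target.write_bytes(b"known-good contents")
        before = hashlib.sha256(target.read_bytes()).hexdigest()
        target.write_bytes(b"known-good contents!")  # simulate tampering
        after = hashlib.sha256(target.read_bytes()).hexdigest()
    return before != after
```

Working in a temporary directory keeps the test away from real boot files, so it can run without elevated privileges.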
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Baseline on host | Baseline changes without notice | Store offline |
| Missing signer data | Unable to validate provenance | Capture signer metadata |
7.2 Debugging Strategies
- Log every file hashed and compare to manual hashes.
- Validate JSON schema with a lint step.
7.3 Performance Traps
Hashing large driver directories can be slow; parallelize where safe.
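One way to parallelize is a thread pool that hashes several files at once. The sketch below assumes the workload is dominated by file reads and by `hashlib` updates on large chunks, where CPython can release the global interpreter lock; `hash_many` and the worker count are hypothetical choices to tune per system.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_many(paths, max_workers: int = 4) -> dict:
    """Hash files concurrently and return a {path: digest} mapping."""
    paths = list(paths)  # materialize so the input can be iterated twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(sha256_file, paths)))
```

Each worker only reads files and never writes shared state, so the result matches a sequential run.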
8. Extensions & Challenges
8.1 Beginner Extensions
- Add CSV export for baseline summaries.
8.2 Intermediate Extensions
- Integrate GPG signing for baseline files.
8.3 Advanced Extensions
- Add remote baseline attestation via API.
9. Real-World Connections
9.1 Industry Applications
- Integrity monitoring for regulated environments.
9.2 Related Open Source Projects
- osquery - baseline queries
9.3 Interview Relevance
- Discussing drift management and baseline policy.
10. Resources
10.1 Essential Reading
- NIST SP 800-94 intrusion detection guidance
10.2 Video Resources
- Talks on integrity monitoring and baselining
10.3 Tools & Documentation
- sha256sum documentation
10.4 Related Projects in This Series
- Previous: P02-boot-chain-map
- Next: P04-windows-driver-signing-audit
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why offline baselines are necessary.
- I can interpret drift categories.
11.2 Implementation
- Baseline and diff commands run successfully.
11.3 Growth
- I can describe how to handle baseline updates after patching.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Baseline JSON created and stored offline.
- Diff report runs and detects changes.
Full Completion:
- Baseline signed and change approvals documented.
Excellence (Going Above & Beyond):
- Baseline integrated into automated monitoring pipeline.