Project 14: Git Secret Scanner — Find and Remove Leaked Credentials
A tool that scans Git history (not just current files) for accidentally committed secrets—API keys, passwords, tokens—and can optionally help remove them from history.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | Rust |
| Alternative Programming Languages | Go, Python |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Regex, understanding of security basics, Project 1 completed |
| Key Topics | Common Secret Patterns, Git History Traversal, History Rewriting |
1. Learning Objectives
By completing this project, you will:
- Implement a working tool that scans Git history (not just current files) for accidentally committed secrets (API keys, passwords, tokens) and can optionally help remove them from history.
- Explain the core Git workflow tradeoff this project is designed to surface.
- Design deterministic checks so results can be verified and reproduced.
- Document operational failure modes and safe recovery actions.
2. All Theory Needed (Per-Concept Breakdown)
Common Secret Patterns
Fundamentals
This concept matters in this project because your implementation will fail or become non-deterministic without a precise model of Common Secret Patterns. You should define what the concept controls, what invariants must hold, and which actions are safe versus destructive. Treat this concept as a production concern, not a tutorial checkbox.
Deep Dive into the concept
When applying Common Secret Patterns in this project, reason in three passes: data shape, state transitions, and enforcement. First, identify which artifacts are authoritative (commit objects, refs, metadata, policy config, CI status, or scan findings). Second, map how those artifacts change when your tool runs. Third, define failure behavior explicitly. In Git tooling, silent partial success is dangerous: you need either complete success with evidence or an explicit failure state with remediation guidance. Also account for scale behavior. A workflow that works on a toy repo may fail on large history depth, concurrent updates, or mixed branch policies. Include trace logs for every irreversible action, and separate simulation mode from write mode. For interview readiness, be able to explain how this concept protects delivery speed while reducing operational risk.
How this fits into the project
In this project, Common Secret Patterns is directly used in design decisions, implementation constraints, and verification criteria.
Definitions & key terms
- Common Secret Patterns invariant: A condition that must remain true before and after every operation.
- Safety boundary: The point where actions become destructive unless guarded.
- Verification signal: Evidence proving the action behaved as expected.
Mental model diagram
Input state -> Validate invariant -> Apply change -> Verify output -> Record evidence
How it works
- Capture current state and constraints.
- Evaluate whether Common Secret Patterns preconditions are satisfied.
- Execute the minimal safe transition.
- Verify postconditions and publish an auditable result.
Failure modes: stale state, partial writes, race conditions, ambiguous output contracts.
Minimal concrete example
Plan -> dry-run -> execute -> verify -> rollback/forward-fix decision
Common misconceptions
- Assuming local success implies team-safe behavior.
- Treating policy violations as warnings instead of merge blockers.
- Skipping deterministic verification because the output appears correct.
Check-your-understanding questions
- Which invariant is most likely to break first under concurrency?
- What output proves your tool handled an edge case correctly?
- Where should enforcement happen: local hook, CI, or protected branch gate?
Check-your-understanding answers
- The invariant tied to mutable refs or policy-dependent merge eligibility.
- A deterministic transcript showing both success and controlled failure behavior.
- Layered enforcement: fast local checks plus non-bypassable server-side gates.
Real-world applications
- Change-management tooling for fast-moving teams.
- Incident-safe release workflows with traceable rollback paths.
- Compliance-ready source-control automation.
Where you’ll apply it
This project and its immediate adjacent projects in this sprint.
References
- https://git-scm.com/docs
- https://dora.dev/capabilities/trunk-based-development/
Key insights
Common Secret Patterns is only valuable when its invariants are encoded into tooling and checks.
Summary
Mastering Common Secret Patterns here gives you transferable patterns for larger workflow systems.
Homework/Exercises to practice the concept
- Write one failing scenario and expected detection output.
- Define one invariant and one explicit violation test.
Solutions to the homework/exercises
- Use a stale branch or invalid metadata case and assert deterministic error reporting.
- Invariant: protected branch must not accept unchecked changes; violation test: bypass attempt should fail fast.
Git History Traversal
Fundamentals
This concept matters in this project because your implementation will fail or become non-deterministic without a precise model of Git History Traversal. You should define what the concept controls, what invariants must hold, and which actions are safe versus destructive. Treat this concept as a production concern, not a tutorial checkbox.
Deep Dive into the concept
When applying Git History Traversal in this project, reason in three passes: data shape, state transitions, and enforcement. First, identify which artifacts are authoritative (commit objects, refs, metadata, policy config, CI status, or scan findings). Second, map how those artifacts change when your tool runs. Third, define failure behavior explicitly. In Git tooling, silent partial success is dangerous: you need either complete success with evidence or an explicit failure state with remediation guidance. Also account for scale behavior. A workflow that works on a toy repo may fail on large history depth, concurrent updates, or mixed branch policies. Include trace logs for every irreversible action, and separate simulation mode from write mode. For interview readiness, be able to explain how this concept protects delivery speed while reducing operational risk.
How this fits into the project
In this project, Git History Traversal is directly used in design decisions, implementation constraints, and verification criteria.
Definitions & key terms
- Git History Traversal invariant: A condition that must remain true before and after every operation.
- Safety boundary: The point where actions become destructive unless guarded.
- Verification signal: Evidence proving the action behaved as expected.
Mental model diagram
Input state -> Validate invariant -> Apply change -> Verify output -> Record evidence
How it works
- Capture current state and constraints.
- Evaluate whether Git History Traversal preconditions are satisfied.
- Execute the minimal safe transition.
- Verify postconditions and publish an auditable result.
Failure modes: stale state, partial writes, race conditions, ambiguous output contracts.
Minimal concrete example
Plan -> dry-run -> execute -> verify -> rollback/forward-fix decision
Common misconceptions
- Assuming local success implies team-safe behavior.
- Treating policy violations as warnings instead of merge blockers.
- Skipping deterministic verification because the output appears correct.
Check-your-understanding questions
- Which invariant is most likely to break first under concurrency?
- What output proves your tool handled an edge case correctly?
- Where should enforcement happen: local hook, CI, or protected branch gate?
Check-your-understanding answers
- The invariant tied to mutable refs or policy-dependent merge eligibility.
- A deterministic transcript showing both success and controlled failure behavior.
- Layered enforcement: fast local checks plus non-bypassable server-side gates.
Real-world applications
- Change-management tooling for fast-moving teams.
- Incident-safe release workflows with traceable rollback paths.
- Compliance-ready source-control automation.
Where you’ll apply it
This project and its immediate adjacent projects in this sprint.
References
- https://git-scm.com/docs
- https://dora.dev/capabilities/trunk-based-development/
Key insights
Git History Traversal is only valuable when its invariants are encoded into tooling and checks.
Summary
Mastering Git History Traversal here gives you transferable patterns for larger workflow systems.
Homework/Exercises to practice the concept
- Write one failing scenario and expected detection output.
- Define one invariant and one explicit violation test.
Solutions to the homework/exercises
- Use a stale branch or invalid metadata case and assert deterministic error reporting.
- Invariant: protected branch must not accept unchecked changes; violation test: bypass attempt should fail fast.
History Rewriting
Fundamentals
This concept matters in this project because your implementation will fail or become non-deterministic without a precise model of History Rewriting. You should define what the concept controls, what invariants must hold, and which actions are safe versus destructive. Treat this concept as a production concern, not a tutorial checkbox.
Deep Dive into the concept
When applying History Rewriting in this project, reason in three passes: data shape, state transitions, and enforcement. First, identify which artifacts are authoritative (commit objects, refs, metadata, policy config, CI status, or scan findings). Second, map how those artifacts change when your tool runs. Third, define failure behavior explicitly. In Git tooling, silent partial success is dangerous: you need either complete success with evidence or an explicit failure state with remediation guidance. Also account for scale behavior. A workflow that works on a toy repo may fail on large history depth, concurrent updates, or mixed branch policies. Include trace logs for every irreversible action, and separate simulation mode from write mode. For interview readiness, be able to explain how this concept protects delivery speed while reducing operational risk.
How this fits into the project
In this project, History Rewriting is directly used in design decisions, implementation constraints, and verification criteria.
Definitions & key terms
- History Rewriting invariant: A condition that must remain true before and after every operation.
- Safety boundary: The point where actions become destructive unless guarded.
- Verification signal: Evidence proving the action behaved as expected.
Mental model diagram
Input state -> Validate invariant -> Apply change -> Verify output -> Record evidence
How it works
- Capture current state and constraints.
- Evaluate whether History Rewriting preconditions are satisfied.
- Execute the minimal safe transition.
- Verify postconditions and publish an auditable result.
Failure modes: stale state, partial writes, race conditions, ambiguous output contracts.
Minimal concrete example
Plan -> dry-run -> execute -> verify -> rollback/forward-fix decision
Common misconceptions
- Assuming local success implies team-safe behavior.
- Treating policy violations as warnings instead of merge blockers.
- Skipping deterministic verification because the output appears correct.
Check-your-understanding questions
- Which invariant is most likely to break first under concurrency?
- What output proves your tool handled an edge case correctly?
- Where should enforcement happen: local hook, CI, or protected branch gate?
Check-your-understanding answers
- The invariant tied to mutable refs or policy-dependent merge eligibility.
- A deterministic transcript showing both success and controlled failure behavior.
- Layered enforcement: fast local checks plus non-bypassable server-side gates.
Real-world applications
- Change-management tooling for fast-moving teams.
- Incident-safe release workflows with traceable rollback paths.
- Compliance-ready source-control automation.
Where you’ll apply it
This project and its immediate adjacent projects in this sprint.
References
- https://git-scm.com/docs
- https://dora.dev/capabilities/trunk-based-development/
Key insights
History Rewriting is only valuable when its invariants are encoded into tooling and checks.
Summary
Mastering History Rewriting here gives you transferable patterns for larger workflow systems.
Homework/Exercises to practice the concept
- Write one failing scenario and expected detection output.
- Define one invariant and one explicit violation test.
Solutions to the homework/exercises
- Use a stale branch or invalid metadata case and assert deterministic error reporting.
- Invariant: protected branch must not accept unchecked changes; violation test: bypass attempt should fail fast.
3. Project Specification
3.1 What You Will Build
A tool that scans Git history (not just current files) for accidentally committed secrets—API keys, passwords, tokens—and can optionally help remove them from history.
3.2 Functional Requirements
- Scope control: Deliver a deterministic and testable implementation.
- Correctness: Preserve Git invariants and policy constraints.
3.3 Non-Functional Requirements
- Performance: Deterministic execution with documented runtime behavior on representative history sizes.
- Reliability: Repeated runs on the same input produce identical outputs.
- Usability: Clear CLI or report output for both success and failure cases.
3.4 Example Usage / Output
You’ll have a security tool for Git repositories:
Example Output:
$ secret-scan /path/to/repo
=== Git Secret Scanner ===
Scanning repository: my-project
Mode: Full history scan
Scanning 1,247 commits...
[████████████████████████████████████████] 100%
⚠️ SECRETS FOUND: 7
HIGH SEVERITY:
─────────────────────────────────────────
1. AWS Access Key
Commit: abc1234 (2023-06-15)
Author: alice@example.com
File: config/prod.env (line 12)
Pattern: AKIAIOSFODNN7EXAMPLE
Status: ❌ Still in current HEAD
2. GitHub Personal Access Token
Commit: def5678 (2023-08-22)
Author: bob@example.com
File: scripts/deploy.sh (line 45)
Pattern: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Status: ✓ Deleted in later commit (but still in history!)
MEDIUM SEVERITY:
─────────────────────────────────────────
3. Private RSA Key
Commit: ghi9012 (2023-09-01)
File: .ssh/id_rsa
Status: ✓ Deleted in later commit (adding it to .gitignore does not remove it from history!)
4-7. ... (additional findings)
=== RECOMMENDATIONS ===
IMMEDIATE ACTIONS:
1. Rotate the AWS key AKIAIOSFODNN7EXAMPLE immediately
2. Revoke the GitHub token ghp_xxx...
HISTORY CLEANUP:
To remove secrets from history, run:
$ secret-scan clean --commits abc1234,def5678
⚠️ This will rewrite history. Coordinate with your team!
All clones will need to re-clone or rebase.
$ secret-scan clean --commits abc1234
Rewriting history to remove secrets...
Using BFG Repo-Cleaner strategy...
Before: abc1234 → config/prod.env contains AKIAIOSFODNN7...
After: xyz7890 → config/prod.env contains ***REDACTED***
⚠️ Force push required:
git push --force-with-lease origin main
Secret successfully removed from history!
3.5 Data Formats / Schemas / Protocols
Describe input repository assumptions, output report shape, and any policy/config schema consumed by the tool.
3.6 Edge Cases
- Empty repository or shallow clone state.
- Detached HEAD or rewritten history during execution.
- Invalid metadata/policy configuration.
3.7 Real World Outcome
You’ll have a security tool for Git repositories. A full scan report and a cleanup run look like the transcript in section 3.4 (Example Usage / Output).
4. Solution Architecture
4.1 High-Level Design
Inputs -> Validation -> Core Engine -> Output Formatter -> Verification Report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Input loader | Discover commits/refs/config inputs | Deterministic ordering and clear failure messages |
| Core engine | Compute project-specific logic | Separate read-only simulation from mutating actions |
| Reporter | Produce user-facing output and evidence | Include machine-readable and human-readable forms |
4.3 Data Structures (No Full Code)
ProjectState { refs, commits, policy, findings, metrics }
Result { status, evidence, warnings, next_actions }
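One possible shape for these records, as a minimal Python sketch rather than a prescribed design. The field names on `Finding`, the `redact` helper, and the `to_json` method are assumptions made for illustration; only `status`, `evidence`, `warnings`, and `next_actions` come from the outline above. Findings are sorted before serialization so that two runs over the same input produce byte-identical reports, and the raw secret never enters the report.

```python
import json
from dataclasses import asdict, dataclass, field


def redact(secret: str, keep: int = 4) -> str:
    """Keep a short prefix for identification and mask the rest."""
    return secret[:keep] + "*" * max(len(secret) - keep, 0)


@dataclass(frozen=True, order=True)
class Finding:
    # Field order doubles as sort order: severity first, then location.
    severity_rank: int  # 0 = high, 1 = medium, ... (hypothetical scale)
    commit: str
    path: str
    line: int
    kind: str
    redacted: str


@dataclass
class Result:
    status: str
    evidence: list[Finding] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted findings and sorted keys for reproducible output."""
        doc = asdict(self)
        doc["evidence"] = [asdict(f) for f in sorted(self.evidence)]
        return json.dumps(doc, sort_keys=True, indent=2)
```

Because `Finding` is frozen and ordered, the report does not depend on the order in which the scanner happened to discover each secret.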
4.4 Algorithm Overview
- Collect state from repository and configuration.
- Evaluate invariants and policy preconditions.
- Execute core transformation or analysis logic.
- Verify postconditions and emit deterministic report.
Complexity Analysis:
- Time: O(history + affected scope)
- Space: O(active graph window + report size)
5. Implementation Guide
5.1 Development Environment Setup
Use the environment defined in the main guide. Pin tool versions and fixture data to keep outputs reproducible.
5.2 Project Structure
project-root/
├── fixtures/
├── src/
├── tests/
├── docs/
└── README.md
5.3 The Core Question You’re Answering
“How do secrets persist in Git history, and how do you find and remove them?”
Before you write any code, sit with this question. When you delete a file that contains secrets, Git still has the old version in its history. Even after a commit is no longer reachable from any branch, its objects stay in the object database until reflog entries expire and garbage collection prunes them, and every clone or fork made in the meantime keeps its own copy.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Common Secret Patterns
  - What regex patterns match AWS keys, GitHub tokens, etc.?
  - How do you balance sensitivity (catch all secrets) vs. specificity (minimize false positives)?
  - What’s entropy analysis and how does it help?
  - Resource: truffleHog source code, GitGuardian patterns
- Git History Traversal
  - How do you iterate through all commits and their files?
  - How do you access file content at each commit without checkout?
  - How do you handle large repositories efficiently?
  - Book Reference: “Pro Git” by Chacon, Ch. 10
- History Rewriting
  - What’s the difference between filter-branch, filter-repo, and BFG?
  - What are the consequences of rewriting pushed history?
  - How do you coordinate history rewrites with a team?
  - Book Reference: “Pro Git” by Chacon, Ch. 7.6
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Detection Strategy
  - Should you scan every file in every commit or be smarter?
  - How do you handle binary files?
  - How do you prioritize findings by severity?
- Pattern Database
  - How do you organize patterns for different secret types?
  - How do you let users add custom patterns?
  - How do you handle false positives (UUIDs that look like tokens)?
- Remediation
  - Should you just report, or also offer to clean up?
  - How do you handle secrets that are in current HEAD vs. only in history?
  - What warnings do you give about history rewriting?
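A few of the design questions above can be made concrete with small helpers. The sketch below is one possible answer, not the required one: it skips content that looks binary (a NUL byte near the start is a common heuristic), suppresses candidates that are UUIDs or appear in a user-supplied allowlist, and orders findings by severity and then location. All names, the dictionary shape of a finding, and the severity labels are assumptions for illustration.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}  # hypothetical labels


def looks_binary(data: bytes, probe: int = 8000) -> bool:
    """Heuristic: a NUL byte in the first `probe` bytes suggests binary content."""
    return b"\0" in data[:probe]


def is_probable_false_positive(candidate: str,
                               allowlist: frozenset[str] = frozenset()) -> bool:
    """Suppress allowlisted values and strings that are exactly a UUID."""
    return candidate in allowlist or UUID_RE.fullmatch(candidate) is not None


def sort_findings(findings: list[dict]) -> list[dict]:
    """Order by severity, then commit, path, and line for a stable report."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f["severity"]], f["commit"], f["path"], f["line"]),
    )
```

An allowlist is a natural home for documented placeholder values such as `AKIAIOSFODNN7EXAMPLE`, which would otherwise be reported in every fixture and tutorial file.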
5.6 Thinking Exercise
Create and Detect a Secret
Commit a secret and trace what happens:
git init secret-test && cd secret-test
echo "AWS_KEY=AKIAIOSFODNN7EXAMPLE" > config.env
git add . && git commit -m "Add config"
# "Delete" the secret
rm config.env
git add . && git commit -m "Remove config"
# Is it really gone?
git log --all --oneline
git show HEAD~1:config.env
git log -p --all -S "AKIAIOSFODNN7"
Questions while tracing:
- Can you still see the secret?
- What git commands reveal it?
- What would git gc do?
- How would you truly remove it?
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “How would you detect secrets in a Git repository’s history?”
- “What’s the difference between deleting a file and removing it from Git history?”
- “How would you handle a situation where a secret was pushed to a public repository?”
- “What’s entropy analysis and how does it help detect secrets?”
- “What are the risks of using git filter-branch on a shared repository?”
5.8 Hints in Layers
Hint 1: Starting Point
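Use git rev-list --all to list every commit reachable from any ref (git log --all --pretty=format:"%H" lists the same commits), git ls-tree -r <commit> to list the files in a commit, and git show <commit>:<path> to read a file’s contents without checking it out.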
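Scanning every path in every commit rereads the same content many times, because an unchanged file keeps the same blob ID across commits. The sketch below shows one way to visit each distinct blob only once, using the plumbing commands `git rev-list --all --reverse` and `git ls-tree -r -z`. The function names and the injectable `Runner` are assumptions for illustration, and the approach still lists the tree of every commit, so it is a starting point rather than an optimized traversal.

```python
import subprocess
from typing import Callable, Iterator

# A runner takes git arguments and returns raw stdout (injectable for tests).
Runner = Callable[[list[str]], bytes]


def git(repo: str) -> Runner:
    """Return a runner that executes git inside `repo`."""
    def run(args: list[str]) -> bytes:
        return subprocess.run(
            ["git", "-C", repo, *args], check=True, capture_output=True
        ).stdout
    return run


def parse_ls_tree(raw: bytes) -> list[tuple[str, str]]:
    """Parse `git ls-tree -r -z` output into (blob_id, path) pairs.

    Each NUL-terminated record is `<mode> <type> <object>\\t<path>`.
    Non-blob entries (for example submodule commits) are skipped.
    """
    entries = []
    for record in raw.split(b"\0"):
        if not record:
            continue
        meta, path = record.split(b"\t", 1)
        _mode, obj_type, oid = meta.decode().split(" ")
        if obj_type == "blob":
            entries.append((oid, path.decode("utf-8", "replace")))
    return entries


def unique_blobs(run: Runner) -> Iterator[tuple[str, str, str]]:
    """Yield (commit, path, blob_id) the first time each blob ID is seen."""
    seen: set[str] = set()
    commits = run(["rev-list", "--all", "--reverse"]).decode().split()
    for commit in commits:
        for oid, path in parse_ls_tree(run(["ls-tree", "-r", "-z", commit])):
            if oid not in seen:
                seen.add(oid)
                yield commit, path, oid
```

With a real repository, `unique_blobs(git("/path/to/repo"))` drives the scan, and `run(["cat-file", "blob", oid])` returns the bytes of each blob. Walking oldest-first means the commit attached to a finding is the first one in that ordering to contain the blob.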
Hint 2: Pattern Matching
Create a pattern database with regexes like AKIA[0-9A-Z]{16} for AWS keys, ghp_[A-Za-z0-9]{36} for GitHub tokens.
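A minimal sketch of such a pattern database in Python, using the two regexes quoted in this hint. The dictionary keys, the `Match` record, and `scan_text` are illustrative names, not part of any existing tool.

```python
import re
from typing import Iterator, NamedTuple

# Hypothetical pattern table; the regexes are the ones quoted in Hint 2.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}


class Match(NamedTuple):
    kind: str
    line_no: int
    value: str


def scan_text(text: str) -> Iterator[Match]:
    """Yield every pattern hit together with its 1-based line number."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for m in pattern.finditer(line):
                yield Match(kind, line_no, m.group(0))


sample = "AWS_KEY=AKIAIOSFODNN7EXAMPLE\nTOKEN=ghp_" + "a" * 36 + "\n"
hits = list(scan_text(sample))
```

Keeping patterns in a table keyed by name leaves room for user-defined entries loaded from a config file. Note that these regexes are unanchored, so they also match inside longer strings; adding boundaries trades some sensitivity for fewer false positives.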
Hint 3: Entropy Analysis
Calculate Shannon entropy of strings. High entropy (> 4.5 bits/char) suggests randomness (keys, tokens). Low entropy is probably normal code.
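For a string with character frequencies p_i, the Shannon entropy is H = -Σ p_i · log2(p_i) bits per character. The sketch below computes it and applies the 4.5 threshold from this hint to candidate tokens. The token regex, the 20-character minimum, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of base64-like characters, at least 20 long (assumed cutoff).
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Entropy in bits per character, from the string's own character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def high_entropy_tokens(line: str, threshold: float = 4.5) -> list[str]:
    """Return candidate tokens whose entropy exceeds `threshold`."""
    return [t for t in TOKEN_RE.findall(line) if shannon_entropy(t) > threshold]
```

One consequence worth knowing: a string of length n can never exceed log2(n) bits per character, so a 20-character AWS access key ID tops out near 4.32 and cannot cross a 4.5 threshold. Entropy checks therefore complement the regex patterns for long tokens rather than replace them.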
Hint 4: History Cleaning
Use git filter-repo (successor to filter-branch) or BFG Repo-Cleaner. They’re faster and safer than raw filter-branch.
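As an illustration of the filter-repo route, the sketch below prepares a rewrite without running it. It renders a rules file for `git filter-repo --replace-text`, where each line has the form `literal:<text>==><replacement>`, and builds the command as an argument list. The function names are hypothetical, and the `***REDACTED***` placeholder is taken from the example output in section 3.4. git filter-repo is a separate install, not part of core Git.

```python
def render_rules(secrets: list[str], placeholder: str = "***REDACTED***") -> str:
    """Render one `literal:<secret>==><placeholder>` rule per line.

    Secrets are de-duplicated and sorted so the file is reproducible.
    """
    lines = [f"literal:{s}==>{placeholder}" for s in sorted(set(secrets))]
    return "\n".join(lines) + "\n"


def filter_repo_command(repo: str, rules_path: str) -> list[str]:
    """Build (but do not run) the rewrite command for review in a dry run."""
    return ["git", "-C", repo, "filter-repo", "--replace-text", rules_path]
```

Printing the rendered rules and the command gives a dry-run transcript that can be reviewed before anything is rewritten. By default filter-repo refuses to run on a repository that does not look like a fresh clone, which is a useful safety boundary to keep. The rules file itself contains the secrets in plain text, so it should live outside the repository and be deleted afterwards.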
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| History rewriting | “Pro Git” by Chacon | Ch. 7.6 |
| Pattern matching | “Mastering Regular Expressions” by Friedl | Ch. 4-5 |
| Security practices | “The DevOps Handbook” by Kim et al. | Ch. 20 |
5.10 Implementation Phases
Phase 1: Foundation (1-2 sessions)
- Define fixtures, expected outputs, and invariant checks.
- Build read-only analysis path.
Phase 2: Core Functionality (2-4 sessions)
- Implement project-specific core logic and deterministic reporting.
- Add policy and edge-case handling.
Phase 3: Polish and Edge Cases (1-2 sessions)
- Add failure demos, performance notes, and usability improvements.
- Finalize docs and validation transcripts.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Execution mode | direct write vs dry-run+write | dry-run+write | Safer and easier to debug |
| Output contract | free text vs structured+text | structured+text | Better automation and readability |
| Enforcement location | local only vs local+CI | local+CI | Prevents bypass in shared branches |
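The dry-run+write recommendation can be built into the command line so that rewriting history always requires an explicit opt-in. The sketch below uses `argparse` and mirrors the `secret-scan clean --commits` usage from section 3.4. The `scan` subcommand name and the `--execute` flag are assumptions introduced for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="secret-scan")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan", help="read-only history scan")
    scan.add_argument("repo")

    clean = sub.add_parser("clean", help="plan or perform a history rewrite")
    clean.add_argument(
        "--commits",
        required=True,
        type=lambda s: [c for c in s.split(",") if c],
        help="comma-separated commit IDs",
    )
    # Hypothetical flag: without it, `clean` only reports what it would do.
    clean.add_argument("--execute", action="store_true")
    return parser


def describe(args: argparse.Namespace) -> str:
    """Summarize the requested action, making the mode explicit."""
    if args.command == "clean":
        mode, verb = ("WRITE", "rewriting") if args.execute else ("DRY-RUN", "would rewrite")
        return f"[{mode}] {verb} {len(args.commits)} commit(s): {', '.join(args.commits)}"
    return f"[READ-ONLY] scanning {args.repo}"
```

Making dry-run the default means that a mistyped or copy-pasted command cannot rewrite history by accident.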
6. Testing Strategy
6.1 Test Categories
- Unit tests for parsing and policy logic.
- Integration tests on fixture repositories.
- Edge-case tests for stale refs, malformed metadata, and large histories.
6.2 Critical Test Cases
- Deterministic golden-path scenario.
- Policy violation hard-fail scenario.
- Recovery path after partial or conflicting state.
6.3 Test Data
Use fixed repository fixtures with known commit graphs and expected outputs stored under version control.
7. Common Pitfalls & Debugging
Problem 1: “Output looks correct but history or metadata is inconsistent”
- Why: Validation happens after mutation, not before.
- Fix: Add a preflight invariant check and a post-write verification step.
- Quick test: Run the same command twice on the same fixture and verify identical results.
Problem 2: “Tool works on small repo but times out on larger history”
- Why: Full traversal is performed where selective traversal is possible.
- Fix: Cache intermediate graph lookups and scope analysis to affected commits/paths.
- Quick test: Compare runtime on small and large fixtures with a clear budget target.
Problem 3: “Policy check can be bypassed by local-only behavior”
- Why: Enforcement is advisory, not server-authoritative.
- Fix: Mirror critical checks in CI and protected branch rules.
- Quick test: Attempt merge with failing policy in CI and confirm hard block.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add richer error messages with remediation hints.
- Add fixture generation helpers for repeatable demos.
8.2 Intermediate Extensions
- Add performance instrumentation and budget assertions.
- Add policy configuration profiles by repository type.
8.3 Advanced Extensions
- Add distributed execution support for large repositories.
- Add signed evidence exports for compliance workflows.
9. Real-World Connections
9.1 Industry Applications
- Internal developer portals.
- Enterprise repository governance systems.
- Release safety and incident diagnostics tooling.
9.2 Related Open Source Projects
- Git core: https://git-scm.com/
- GitHub CLI: https://github.com/cli/cli
- pre-commit framework: https://pre-commit.com/
9.3 Interview Relevance
This project prepares you for architecture and debugging interviews that focus on merge policy, CI gates, and workflow reliability tradeoffs.
10. Resources
10.1 Essential Reading
- Pro Git (Internals and Workflows chapters)
- Software Engineering at Google (Version control and build chapters)
- Accelerate (delivery performance practices)
10.2 Video Resources
- Git internals talks from Git Merge conference archives.
- DORA and delivery metrics conference sessions.
10.3 Tools and Documentation
- https://git-scm.com/docs
- https://docs.github.com/
- https://dora.dev/
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the primary invariant this project enforces.
- I can explain one failure mode and one safe recovery path.
11.2 Implementation
- Functional requirements are met on deterministic fixtures.
- Critical edge cases are tested and documented.
11.3 Growth
- I can describe tradeoffs in an interview setting.
- I documented what I would change in a production version.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Deterministic golden-path output exists.
- One failure scenario is handled with clear output.
- Core workflow objective is demonstrably met.
Full Completion:
- Minimum criteria plus policy validation, structured reporting, and edge-case coverage.
Excellence:
- Full completion plus measurable performance budget and production-hardening notes.