Project 4: Code Auditor

Build a security-oriented code scanner that finds risky patterns and produces a ranked audit report.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 4-8 hours
Main Programming Language Bash
Alternative Programming Languages Python
Coolness Level Level 3 - “security reviewer”
Business Potential Very High (security audits)
Prerequisites find basics, regex, file types
Key Topics recursive search, pruning, pattern rules, severity

1. Learning Objectives

By completing this project, you will:

  1. Build a safe file selection pipeline that excludes vendor and binary files.
  2. Design a pattern ruleset for risky functions and secret detection.
  3. Rank findings by severity and produce a structured report.
  4. Reduce false positives by anchoring patterns to language context.
  5. Produce a deterministic audit report for repeated scans.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Safe Recursive File Selection (find + prune + include rules)

Fundamentals

Recursive scanning is powerful but dangerous. A naive grep -r will traverse everything, including node_modules, .git, build artifacts, and binary blobs. A code auditor must carefully select which files to scan. This is best done by using find to build a file list, pruning known vendor directories, and filtering by extensions. Safe selection improves performance and reduces noise. It also avoids scanning secrets in generated files that are not part of the source of truth. The file list is your trust boundary. If the list is wrong, the audit is wrong, no matter how good your regexes are. This is why auditors treat selection rules as part of the security policy. A good audit tool makes this policy explicit and easy to review. It should be obvious which directories are excluded and why.

Deep Dive into the concept

File selection is a security decision. If you scan too broadly, you waste time and produce false positives. If you scan too narrowly, you miss real issues. A reliable approach begins with explicit include rules (extensions like .py, .js, .php, .go) and explicit exclude rules (directories like .git, node_modules, vendor, dist, build, coverage). The best practice is to implement pruning at the traversal stage so you never descend into excluded directories.

find provides the mechanisms. You can build a prune expression and then filter by -type f and -name patterns. For example:

find . \( -path './node_modules' -o -path './.git' -o -path './dist' \) -prune -o \
  -type f \( -name '*.py' -o -name '*.js' -o -name '*.php' \) -print0

This pipeline outputs a safe file list, null-delimited to preserve filenames. You can then feed it to xargs -0 grep ... or use -exec ... + for batch execution. The null-delimited approach matters because repositories can contain spaces, newlines, or strange characters in filenames. A security audit must not silently skip such files.
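
A minimal sketch of that hand-off, assuming GNU find and xargs and using eval( as a stand-in rule:

# Sketch: batch the null-delimited list into grep; '|| true' keeps the overall
# exit status zero when nothing matches (grep exits 1 on no match).
find . \( -path './node_modules' -o -path './.git' \) -prune -o \
  -type f -name '*.py' -print0 |
  xargs -0 grep -n -H -E 'eval[[:space:]]*\(' || true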

Binary files are another danger. Grep can match binary data and output garbage. Use grep -I to ignore binary files or filter by file extension only. If you must handle unknown types, you can use file --mime to detect text files, but that is slower. For this project, extension-based filtering plus -I is a good balance.
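
A hedged variant of the scan that also skips binary content, reading a null-delimited files.list produced by find -print0 (grep -I is a GNU/BSD extension, not POSIX):

# Sketch: -I treats binary files as if they contained no matches, so stray
# blobs with source-like extensions do not pollute the report.
xargs -0 grep -I -n -H -E 'eval[[:space:]]*\(' < files.list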

Performance matters too. A codebase might include hundreds of thousands of files. Pruning prevents worst-case scans. -xdev can prevent traversal into mounted volumes, which is important in monorepo environments or developer machines with mounted network drives. In other words, file selection is not just an optimization; it is a correctness and safety requirement.

Finally, determinism: your report should list files in a stable order. If you build the file list with find and pass it into grep, you should sort the list to ensure stable output. This lets you compare audits over time and reduces noise in CI systems.
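
One way to pin the order, assuming GNU sort's null-delimiter support:

# Sketch: sort the null-delimited list so repeated audits see files in the same
# order; LC_ALL=C pins the collation rules.
find . \( -path './node_modules' -o -path './.git' \) -prune -o \
  -type f -name '*.py' -print0 | LC_ALL=C sort -z > files.list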

There is also the question of symlinks and mount points. Following symlinks (-L) can cause the scan to leave the repo boundary, which is both a security risk and a source of noise. For example, a repo might include a symlink to a developer’s home directory. A safe auditor should use the default -P (do not follow symlinks) and optionally provide a flag to include symlink targets when explicitly requested. As noted above, -xdev keeps the traversal on the starting filesystem, which also matters on build servers and in containers.
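
A sketch with both defaults made explicit:

# Sketch: -P (the default) refuses to follow symlinks; -xdev stays on the
# starting filesystem so mounted volumes are never traversed.
find -P . -xdev \( -path './node_modules' -o -path './.git' \) -prune -o \
  -type f -print0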

Generated and minified files are another subtlety. Some files have source-like extensions but contain minified or generated output with high entropy and low signal. A scanner can reduce noise by excluding minified .js files (like *.min.js) or by ignoring files above a size threshold. These rules are not universal, but they can dramatically improve result quality in real codebases. The key is to make them configurable and to document the defaults.
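
For example, with illustrative defaults (the 1 MiB cut-off is an assumption, not a universal rule):

# Sketch: skip minified bundles and anything at or above 1 MiB, which is usually
# generated output rather than hand-written source; specifying -size in bytes
# avoids find's unit-rounding surprises.
find . -type f -name '*.js' ! -name '*.min.js' -size -1048576c -print0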

Finally, consider how to handle file lists across platforms. On macOS, find behaves slightly differently, and some flags may not exist. If your auditor is intended to be portable, you should document the expected environment or detect and use GNU tools when available. A small compatibility section in the README prevents confusion and makes your tool more reliable for other users.
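
A small compatibility shim along these lines (assuming Homebrew's findutils, which installs GNU find as gfind):

# Sketch: prefer GNU find when it is available under the g-prefix (macOS),
# otherwise fall back to the system find.
FIND=find
if command -v gfind >/dev/null 2>&1; then
  FIND=gfind
fi
"$FIND" . -type f -name '*.py' -print0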

How this fits into the project

The auditor uses a safe file list as its input. Everything else (pattern matching, severity, reporting) depends on this list being correct and stable.

Definitions & key terms

  • prune: skip a directory subtree during traversal.
  • null-delimited: use NUL as separator to avoid filename parsing bugs.
  • include rule: extension or path pattern to include.
  • exclude rule: directory or path pattern to skip.

Mental model diagram (ASCII)

repo/
  src/        <- include
  node_modules/ <- prune
  dist/       <- prune

find -> prune -> include -> file list -> grep

How it works (step-by-step)

  1. Build a prune expression from excluded directories.
  2. Traverse the tree with find.
  3. Filter by file type and extension.
  4. Output a null-delimited file list.
  5. Feed into grep for scanning.
  6. Invariant: every scanned file is explicitly included by the selection rules.
  7. Failure modes: missing prune rules, symlink loops, or binary noise.

Minimal concrete example

find . \( -path './node_modules' -o -path './.git' \) -prune -o \
  -type f \( -name '*.py' -o -name '*.js' -o -name '*.php' \) -print0

Common misconceptions

  • “grep -r is fine” -> False; it scans vendor and binaries.
  • “Spaces in filenames are rare” -> False in real-world repos.
  • “Prune is optional” -> False for large repositories.

Check-your-understanding questions

  1. Why is -print0 safer than -print?
  2. How does -prune change traversal behavior?
  3. Why filter by extension before scanning?
  4. Why is sorting the file list useful?

Check-your-understanding answers

  1. It prevents parsing errors caused by spaces or newlines.
  2. It skips the entire subtree, preventing descent.
  3. It limits scanning to source files and reduces noise.
  4. It makes audit output deterministic across runs.

Real-world applications

  • CI security scans of monorepos.
  • Pre-commit checks for secrets.
  • Vendor directory exclusion in enterprise code audits.

Where you will apply it

  • In this project: see §3.2 (file selection requirements) and §5.8 (Hint 1).

References

  • The Linux Command Line (Shotts), Chapter 17
  • man find
  • man xargs

Key insights

Safe audits start with a safe file list.

Summary

File selection is the foundation of a reliable code audit. Prune early, filter explicitly, and handle filenames safely.

Homework/Exercises to practice the concept

  1. Build a file list that includes .py files but excludes venv and .git.
  2. Create a filename with a space and verify -print0 works.
  3. Compare runtime with and without pruning on a large directory.

Solutions to the homework/exercises

  1. find . \( -path './venv' -o -path './.git' \) -prune -o -type f -name '*.py' -print0
  2. touch 'weird name.py' and ensure it appears in the list.
  3. Use time to compare the two runs.

2.2 Pattern Rules, Severity, and False Positive Control

Fundamentals

A code auditor is only as good as its ruleset. Rules describe risky patterns such as hardcoded secrets, use of insecure functions, or dangerous command execution. Each rule should include a severity level and a short explanation. Without severity, all findings are noise. False positives are inevitable, so your patterns must be precise and your report must make it easy to review. Anchoring patterns to language context, such as eval( in JavaScript or md5( in Python, improves accuracy. Think of rules as hypotheses. Each rule asserts that a pattern is likely to indicate risk. The auditor’s job is to collect evidence for those hypotheses and present it in a way humans can verify. This is why rule metadata (severity, hint, category) is not optional: it is what turns a raw match into an actionable finding.

Deep Dive into the concept

Pattern design is a balancing act between recall and precision. If you match password = in any file, you will find real secrets and many false positives. If you require a specific syntax like API_KEY\s*=\s*"[A-Za-z0-9]{20,}", you reduce false positives but may miss variants. A practical ruleset includes both high-confidence patterns (strict regex with strong signals) and medium-confidence patterns (broader regex that requires human review). The report should distinguish between these.

Severity is about impact and likelihood. For example, eval( in PHP is often high severity because it indicates code execution. Hardcoded AWS keys are critical because they allow immediate access. Use a simple scoring system (Critical, High, Medium, Low) and explain what it means. This helps reviewers prioritize work.

Language context matters. The string eval( might appear in a comment or documentation. To reduce false positives, you can ignore comment-only lines or restrict patterns to executable code areas. For example, you can exclude lines that start with # in Python or // in JavaScript before matching. This is not perfect, but it improves signal. Another technique is to detect assignment context for secrets (e.g., API_KEY=) rather than random mentions in text.
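
A sketch of that filter which keeps original line numbers by filtering after the match (Python-style # comments; app.py is a placeholder, and filenames without colons are assumed):

# Sketch: grep first so line numbers refer to the original file, then drop
# matches whose content starts with a '#' comment marker.
grep -n -H -E 'eval[[:space:]]*\(' app.py |
  grep -v -E '^[^:]+:[0-9]+:[[:space:]]*#'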

Binary files and minified assets can create noise. A rule that looks for secret in a minified JS bundle will generate false positives. The file selection stage should avoid these artifacts, and the pattern stage should be aware of file types. Some auditors maintain per-language rules to avoid cross-language mismatches.

A robust report includes the filename, line number, matched snippet, and rule name. It may also include a remediation hint. This turns a raw grep match into a usable audit item. The report should be deterministic and sorted by severity, then by file path, so that diffs are meaningful in CI.

Finally, remember that regex is not AST parsing. Your rules are heuristic, not definitive. Your goal is to reduce risk by identifying likely issues quickly, not to prove correctness. That is why severity and review workflows are part of the design.

It is helpful to classify rules by intent: secrets (credentials, tokens), dangerous functions (eval, exec), weak crypto (md5, sha1), and insecure configuration (debug flags, permissive CORS). Each class has different false positive risks. For example, secrets detection benefits from high precision because false positives are expensive to review, while dangerous-function detection may tolerate more noise because the patterns are fewer and the risk is high. Structuring your ruleset this way makes it easier to expand and maintain.

Another advanced technique is to include contextual constraints in rules. Instead of matching eval( anywhere, you can require that it appears in a non-comment line by pre-filtering with grep -v for comment prefixes. For languages with multiple comment styles, this gets tricky, but even partial filtering improves precision. Similarly, a secrets rule can require = or : near the token to indicate assignment rather than documentation. These constraints make the ruleset more accurate without needing a full parser.
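
For instance, a hedged secrets rule that demands assignment context rather than any mention of the token (the token name and file are placeholders):

# Sketch: only flag API_KEY when it is actually being assigned a value, not
# when it merely appears in prose or documentation.
grep -n -H -E 'API_KEY[[:space:]]*[=:][[:space:]]*[^[:space:]]+' config.js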

Severity assignment should be explicit and consistent. You can map severity labels to numeric ranks for sorting. For example, CRITICAL=4, HIGH=3, MEDIUM=2, LOW=1. This makes deterministic sorting easy and avoids ambiguous ordering. The report can include a summary count by severity, which provides a quick risk snapshot.
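
A sketch of that ordering step, assuming the CSV layout from §3.5 (SEVERITY,FILE,LINE,RULE,SNIPPET) and the rank mapping above:

# Sketch: prepend a numeric rank, sort by rank (descending) and then by the
# rest of the line (which begins with SEVERITY,FILE), then strip the rank.
{
  head -n 1 audit.csv
  tail -n +2 audit.csv |
    awk -F, 'BEGIN { r["CRITICAL"]=4; r["HIGH"]=3; r["MEDIUM"]=2; r["LOW"]=1 }
             { print r[$1] "\t" $0 }' |
    sort -t $'\t' -k1,1nr -k2,2 |
    cut -f2-
} > audit_sorted.csv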

Finally, remember that a ruleset must evolve. As your codebase changes, patterns that were high-signal may become normal. Your tool should support rule suppression or baselines (for example, a list of known false positives). This is not required for the first version, but the report format should be designed to allow it in the future.

How this fits into the project

The code auditor is driven by a ruleset file and produces a report sorted by severity. Each match includes a rule name and a hint.

Definitions & key terms

  • ruleset: collection of regex patterns with metadata.
  • severity: priority label indicating risk level.
  • false positive: match that is not actually risky.
  • contextual match: pattern that includes language-specific syntax.

Mental model diagram (ASCII)

file list -> ruleset -> matches -> severity sort -> audit report

How it works (step-by-step)

  1. Load rules with names and severities.
  2. For each file, apply rules and collect matches.
  3. Attach severity and hint metadata.
  4. Sort by severity and path.
  5. Render a report with counts.
  6. Invariant: every finding maps to exactly one rule and severity.
  7. Failure modes: overly broad rules, comment-only matches, or missing rule metadata.

Minimal concrete example

# Example rule: high-risk eval in PHP
grep -r -n -H -E 'eval\s*\(' --include='*.php' src/

Common misconceptions

  • “Regex can fully understand code” -> False; it is heuristic.
  • “All findings are equal” -> False; severity matters.
  • “More rules always means better audits” -> False; quality matters more.

Check-your-understanding questions

  1. Why should rules include severity?
  2. How can you reduce false positives for eval(?
  3. Why should reports include line numbers and snippets?
  4. What is the trade-off between strict and broad patterns?

Check-your-understanding answers

  1. It helps reviewers prioritize the highest risk items.
  2. Restrict by language files and exclude comments or docs.
  3. It makes findings actionable and easier to verify.
  4. Strict patterns reduce false positives but may miss variants.

Real-world applications

  • Static security audits in CI pipelines.
  • Pre-commit secret scanning.
  • Compliance checks before releases.

Where you will apply it

  • In this project: see §3.7 (golden output) and §5.11 (design decisions).
  • Also used in: P03-data-miner.md.

References

  • OWASP Secure Coding Practices
  • man grep
  • Black Hat Bash (Aleks/Farhi), Chapter 10

Key insights

A good ruleset is actionable. Severity and context turn matches into decisions.

Summary

The value of a code auditor is not in how many matches it finds, but in how clearly it highlights risk.

Homework/Exercises to practice the concept

  1. Define five rules with severity levels for a small codebase.
  2. Compare results of a strict regex vs a broad regex for secrets.
  3. Add a rule that ignores comments and test it.

Solutions to the homework/exercises

  1. Create a rules file with columns: name, severity, regex, hint.
  2. Run both patterns and compare false positives.
  3. Pre-filter lines with grep -v '^\s*#' before applying the rule.

3. Project Specification

3.1 What You Will Build

A CLI tool that scans selected source files for risky patterns (secrets, insecure functions), assigns severity, and produces a ranked audit report with file, line, and rule context.

3.2 Functional Requirements

  1. File selection: include specific extensions and prune vendor directories.
  2. Rule engine: apply multiple regex rules with severity labels.
  3. Report output: include filename, line number, rule name, snippet.
  4. Deterministic ordering: sort by severity then path.
  5. Noise control: ignore binary files and common generated directories.

3.3 Non-Functional Requirements

  • Performance: scan 50k files in under 2 minutes.
  • Reliability: handle unreadable files gracefully.
  • Usability: rules can be edited without changing code.

3.4 Example Usage / Output

$ ./code_auditor.sh ./repo --rules rules.txt

3.5 Data Formats / Schemas / Protocols

Rules file format:

severity|rule_name|regex|hint
HIGH|PHP_EVAL|eval\s*\(|Avoid eval; use safe parsing
CRITICAL|AWS_KEY|AKIA[0-9A-Z]{16}|Rotate leaked keys immediately
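
A minimal parsing sketch for this format (skipping blank lines and #-comments is an assumed convenience, not part of the format above):

# Sketch: read severity|rule_name|regex|hint into variables, one rule per line;
# any extra '|' characters simply end up in the hint field.
while IFS='|' read -r severity name regex hint; do
  [ -z "$severity" ] && continue               # skip blank lines
  case "$severity" in '#'*) continue ;; esac   # skip comment lines
  printf 'loaded %s (%s): %s\n' "$name" "$severity" "$regex"
done < rules.txt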

Report format:

SEVERITY,FILE,LINE,RULE,SNIPPET
CRITICAL,config.js,12,AWS_KEY,API_KEY="AKIA..."
HIGH,utils.php,44,PHP_EVAL,eval($input)

3.6 Edge Cases

  • Files with non-UTF8 encoding.
  • Secrets split across multiple lines.
  • Patterns inside comments.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./code_auditor.sh ./fixtures/repo --rules ./rules.txt

3.7.2 Golden Path Demo (Deterministic)

Use the fixture repo and a frozen scan time of 2026-01-01T12:00:00 in the report header.

3.7.3 If CLI: exact terminal transcript

$ ./code_auditor.sh ./fixtures/repo --rules rules.txt
[2026-01-01T12:00:00] TARGET=./fixtures/repo
[2026-01-01T12:00:00] RULES=rules.txt
[2026-01-01T12:00:00] REPORT=audit_2026-01-01.csv
[2026-01-01T12:00:00] FINDINGS=3
[2026-01-01T12:00:00] DONE

$ head -3 audit_2026-01-01.csv
SEVERITY,FILE,LINE,RULE,SNIPPET
CRITICAL,config.js,12,AWS_KEY,API_KEY="AKIA..."
HIGH,utils.php,44,PHP_EVAL,eval($input)

Failure demo (rules file missing):

$ ./code_auditor.sh ./fixtures/repo --rules missing.txt
[2026-01-01T12:00:00] ERROR: rules file not found
EXIT_CODE=2

Exit codes:

  • 0: success with findings
  • 1: success with no findings
  • 2: invalid args or rule load error

4. Solution Architecture

4.1 High-Level Design

repo -> find file list -> apply rules -> collect matches -> severity sort -> report

4.2 Key Components

Component Responsibility Key Decisions
Selector build safe file list prune and include rules
Rule loader parse rules file pipe-delimited format
Matcher apply regex rules grep with -n -H
Reporter output CSV deterministic ordering

4.3 Data Structures (No Full Code)

rule: {severity, name, regex, hint}
finding: {severity, file, line, rule, snippet}

4.4 Algorithm Overview

Key Algorithm: Rule Scan

  1. Build file list with find and pruning.
  2. For each rule, scan files and collect matches.
  3. Merge findings and sort by severity and path.
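
An illustrative shape of that loop, not the full tool (error handling, CSV escaping of snippets, and the final sort are omitted; TARGET, files.list, and findings.csv are placeholder names):

# Sketch: one grep pass per rule over a frozen file list; each match becomes a
# SEVERITY,FILE,LINE,RULE,SNIPPET line.
find "$TARGET" \( -path '*/node_modules' -o -path '*/.git' \) -prune -o \
  -type f -print0 | LC_ALL=C sort -z > files.list

while IFS='|' read -r severity name regex hint; do
  xargs -0 grep -n -H -E "$regex" < files.list |
    awk -F: -v sev="$severity" -v rule="$name" '{
      snippet = $0; sub(/^[^:]+:[0-9]+:/, "", snippet)
      print sev "," $1 "," $2 "," rule "," snippet
    }'
done < rules.txt > findings.csv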

Complexity Analysis:

  • Time: O(r * n) where r is rules and n is files.
  • Space: O(k) where k is findings count.

5. Implementation Guide

5.1 Development Environment Setup

# No extra dependencies required

5.2 Project Structure

project-root/
├── code_auditor.sh
├── rules.txt
├── fixtures/
│   └── repo/
└── README.md

5.3 The Core Question You’re Answering

“How do I scan a large codebase safely and produce a useful audit report?”

5.4 Concepts You Must Understand First

  1. Safe recursive file selection
  2. Rule design and severity

5.5 Questions to Guide Your Design

  1. Which directories must be excluded by default?
  2. How will you rank findings?
  3. How will you reduce false positives?

5.6 Thinking Exercise

Design a ruleset with three levels of severity and run it on a tiny sample repo.

5.7 The Interview Questions They’ll Ask

  1. “How do you exclude large directories safely?”
  2. “Why is eval risky?”
  3. “How do you prioritize audit findings?”

5.8 Hints in Layers

Hint 1: Build file list

find . \( -path './node_modules' -o -path './.git' \) -prune -o -type f -print0 > files.list

Hint 2: Safe grep

xargs -0 grep -n -H -E 'eval\(' < files.list

Hint 3: Rule sorting

sort -t, -k1,1 -k2,2 audit.csv
# Caveat: alphabetical order ranks LOW before MEDIUM and sorts the header row
# in with the data; map severities to numeric ranks (see §2.2) for a true ranking.

5.9 Books That Will Help

Topic Book Chapter
Find basics The Linux Command Line (Shotts) Ch. 17
Regex The Linux Command Line (Shotts) Ch. 19
Security Black Hat Bash (Aleks/Farhi) Ch. 10

5.10 Implementation Phases

Phase 1: Foundation (1-2 hours)

Goals:

  • Build safe file list
  • Implement rule loading

Tasks:

  1. Parse rules file into arrays.
  2. Generate file list with prune and include.

Checkpoint: file list includes only source files.

Phase 2: Core Functionality (2-3 hours)

Goals:

  • Apply rules and collect findings

Tasks:

  1. Run grep per rule and collect results.
  2. Attach severity and rule metadata.

Checkpoint: report contains expected findings.

Phase 3: Polish & Edge Cases (1-2 hours)

Goals:

  • Deterministic output and no false crashes

Tasks:

  1. Sort findings by severity and path.
  2. Handle empty results gracefully.

Checkpoint: exit code 1 for no findings.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Rule format JSON vs text pipe-delimited text simple to edit
File list grep -r vs find find + prune safe and fast
Severity order numeric vs labels labels with map human-friendly

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests rule parsing parse fields correctly
Integration Tests end-to-end scan fixture repo
Edge Case Tests empty results no findings case

6.2 Critical Test Cases

  1. No findings: exit code 1 and a report containing only the header row.
  2. Binary file: should be ignored.
  3. Vendor dir: must be pruned.

6.3 Test Data

fixtures/repo/
  src/app.js
  src/utils.php
  config.js

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Overly broad rules too many matches tighten patterns
Missing prune huge scan time prune vendor dirs
Missing severity report unreadable add severity column

7.2 Debugging Strategies

  • Print the file list before scanning.
  • Test each rule on a single file.
  • Use grep -n to verify line numbers.

7.3 Performance Traps

Scanning large build artifacts can dominate runtime. Always prune and restrict file types.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a --exclude flag for extra directories.
  • Add rule categories (secrets, insecure crypto, exec).

8.2 Intermediate Extensions

  • Add confidence scoring based on regex strictness.
  • Add baseline suppression for known safe matches.

8.3 Advanced Extensions

  • Generate SARIF output for code scanning platforms.
  • Integrate with CI to block merges on critical findings.

9. Real-World Connections

9.1 Industry Applications

  • Security auditing before release.
  • Secret scanning in CI/CD pipelines.
  • Compliance checks for regulated industries.

9.2 Related Tools

  • gitleaks: secret scanning engine.
  • semgrep: pattern-based code scanning.

9.3 Interview Relevance

  • Safe recursive scanning.
  • Security rule design.
  • False positive management.

10. Resources

10.1 Essential Reading

  • The Linux Command Line (Shotts), Chapters 17 and 19
  • Black Hat Bash (Aleks/Farhi), Chapter 10

10.2 Video Resources

  • “Building a Simple Secret Scanner” (YouTube)
  • “Regex for Security” (conference talk)

10.3 Tools & Documentation

  • man find
  • man grep

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why pruning is required.
  • I can describe a severity system.
  • I understand false positives vs false negatives.

11.2 Implementation

  • Report includes severity, file, line, and rule.
  • Vendor directories are excluded.
  • Output is deterministic.

11.3 Growth

  • I can improve a rule after reviewing results.
  • I documented audit assumptions.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Scan selected files and produce a report with severity.
  • Exclude vendor and build directories.

Full Completion:

  • Deterministic ordering and stable results.
  • Ruleset is documented and editable.

Excellence (Going Above & Beyond):

  • SARIF output and CI integration.
  • Baseline suppression for known safe findings.