Project 3: Codebase Refactoring Toolkit

Build a safe, reversible CLI that scans a codebase and applies large-scale text refactors with reporting.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 2-3 weeks
Main Programming Language Shell + sed + find + grep
Alternative Programming Languages Python, Go
Coolness Level Level 4: Developer Productivity
Business Potential Level 3: Internal Tooling
Prerequisites Comfortable with regex, find, sed, shell scripting
Key Topics safe file traversal, precise regex, backups, reporting

1. Learning Objectives

By completing this project, you will:

  1. Enumerate and filter large codebases safely without touching vendor directories.
  2. Write precise regex patterns that avoid false replacements.
  3. Perform in-place edits in a cross-platform way (GNU/BSD sed).
  4. Build backup and rollback mechanisms to make refactors reversible.
  5. Generate a report that summarizes changes deterministically.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Safe File Discovery and Traversal with find

Fundamentals

find walks a directory tree and evaluates predicates against each entry. It is powerful but dangerous if used naively: it can traverse vendor directories, delete files accidentally, or mis-handle filenames with spaces. Safe traversal means scoping the search, pruning unwanted directories, and using null-delimited output to avoid shell word-splitting problems. In a refactoring toolkit, the file discovery step determines which files are eligible for changes. A single mistake can modify thousands of files unintentionally. Understanding find predicates like -path, -prune, -type, and -name is essential for building a controlled, predictable refactor. A safe discovery step also records exactly which files were considered, so the refactor can be audited and reproduced later.

Deep Dive into the concept

find is a depth-first traversal tool that visits each node of the filesystem tree. It evaluates predicates in order, and the logical structure of your predicates determines which files are visited, skipped, or matched. A common mistake is to assume that a -path test alone prevents descent; without -prune, find still descends into excluded directories and merely filters what it prints. For example, find . -path './node_modules' -prune -o -type f -name '*.js' -print explicitly stops traversal of node_modules, saving time and preventing accidental edits. Note that the prune pattern names the directory itself, not './node_modules/*'; the latter would still let find enter the directory before pruning its contents. The -prune -o pattern is a foundational idiom for safe traversal.

Another critical concept is safe filename handling. Shell pipelines split on spaces and newlines by default, so find ... -print piped to xargs can break when filenames contain spaces or newlines. The safe approach is to use null-delimited output: find ... -print0 with xargs -0 or while IFS= read -r -d '' file; do ...; done. This ensures each filename is treated as an atomic unit. In a refactor tool, safe filename handling is not optional; codebases frequently contain spaces, and a refactor that fails on such files is incomplete.
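
The null-delimited loop described above can be sketched as follows; the scratch directory and file names are illustrative, not part of the toolkit:

```shell
#!/usr/bin/env bash
# Create a scratch tree containing a filename with a space.
dir=$(mktemp -d)
touch "$dir/plain.py" "$dir/has space.py"

# find -print0 emits NUL-separated names; read -r -d '' consumes one
# name at a time, so spaces and newlines inside names cannot split them.
list=$(mktemp)
find "$dir" -type f -name '*.py' -print0 > "$list"

count=0
while IFS= read -r -d '' file; do
    count=$((count + 1))
done < "$list"

echo "files found: $count"
rm -rf "$dir"
```

Piping find directly into the while loop also works, but in most shells the loop then runs in a subshell, so counters set inside it are lost; reading from a file (or bash process substitution) avoids that.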

find also supports -exec and -execdir, which can run commands directly on matches. These options are safer than xargs in some cases because they pass filenames directly as arguments without shell expansion. -exec ... {} + groups multiple files per invocation, which is efficient. However, -exec does not provide a built-in dry-run summary. For a refactor tool, you often want a two-phase approach: first gather candidates into a list, then run transformations in a controlled loop. This lets you implement dry-run and reporting.

Time and resource usage are part of safe traversal. find can be expensive on huge trees, so you should prune early and avoid expensive tests on every file. For example, if you only want .py files, use -name '*.py' early in the predicate chain. If you need to exclude hidden directories, use -path '*/.*' -prune. For very large repos, consider a .refactorignore file, similar to .gitignore, and interpret it in your script to define excluded paths. For this project, a basic set of exclusions (.git, node_modules, vendor, dist, build) is sufficient.

Another subtlety is following symlinks. By default, find does not follow symlinks unless you pass -L. Following symlinks can lead to cycles or changes outside the intended tree. For safety, do not follow symlinks unless explicitly requested. If a repo has symlinked directories, treat them as files or log a warning.

Finally, deterministic ordering matters for testing. find does not guarantee ordering; it depends on filesystem traversal order. If you want deterministic reports, you should sort the list of files before processing. This does not change the edits but ensures that the report and dry-run output are stable between runs. Sorting does introduce a memory cost, but the number of files is usually manageable compared to content size.

A practical safety enhancement is to emit the final candidate list to a log file before any modifications. This gives reviewers a concrete artifact of scope, and it allows you to rerun the exact same refactor even if the filesystem changes later.
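
A minimal sketch of deterministic candidate collection with an audit log; the exclusion list and log path are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Build a tiny tree with one excluded directory.
root=$(mktemp -d)
mkdir -p "$root/src" "$root/node_modules"
touch "$root/src/b.py" "$root/src/a.py" "$root/node_modules/x.py"

# Prune the excluded directory, sort for stable ordering, and write
# the final candidate list to a log before any modification happens.
log="$root/candidates.txt"
find "$root" -path "$root/node_modules" -prune -o -type f -name '*.py' -print \
  | sort > "$log"

cat "$log"
```

The sorted log is the artifact reviewers inspect, and rerunning the same command against an unchanged tree yields byte-identical output.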

How this fits into the project

Safe traversal defines the refactoring scope. It is the first guardrail against unintended changes and is essential for both dry-run and apply modes.

Definitions & key terms

  • Predicate: A condition find evaluates on each file.
  • Prune: Skip descending into a directory.
  • Null-delimited output: Using \0 to separate filenames safely.
  • Symlink: A reference that can point outside the target tree.
  • Deterministic order: Stable output order across runs.

Mental model diagram (ASCII)

root/
  src/   -> scanned
  vendor/ -> pruned
  node_modules/ -> pruned
  tests/ -> scanned

find -> apply predicates -> safe file list -> refactor

How it works (step-by-step, with invariants and failure modes)

  1. Start at root directory.
  2. Apply pruning rules (skip vendor/build/.git).
  3. Match file types and extensions.
  4. Emit null-delimited list of files.
  5. Sort the list for deterministic processing.

Invariant: Only files under allowed paths are processed. Failure modes: missing prune rules; unsafe filename handling; symlink traversal.

Minimal concrete example

find . \( -path './.git' -o -path './node_modules' -o -path './vendor' \) -prune -o -type f -name '*.py' -print0

Common misconceptions

  • “find only visits matched files” -> It traverses all unless pruned.
  • “Filenames don’t contain spaces” -> They do; use -print0 and -0.
  • “Ordering is stable” -> It depends on filesystem state.

Check-your-understanding questions

  1. Why is -prune necessary when excluding directories?
  2. What problem does -print0 solve?
  3. Why might you avoid -L (follow symlinks) in a refactor tool?
  4. How can you make processing order deterministic?

Check-your-understanding answers

  1. Without -prune, find still descends into excluded directories.
  2. It prevents word-splitting on spaces/newlines in filenames.
  3. Following symlinks can lead to edits outside the repo or cycles.
  4. Sort the file list before processing.

Real-world applications

  • Repo-wide API migrations.
  • License header updates.
  • Code formatting across a monorepo.

Where you’ll apply it

  • See §3.2 for file selection requirements.
  • See §5.10 Phase 1 for traversal and candidate collection.
  • Also used in: P04 System Inventory & Audit.

References

  • find(1) manual page
  • “The Linux Command Line” by William Shotts, Ch. 17

Key insights

Safe traversal is the first safety net; prune early and treat filenames as data, not tokens.

Summary

find is powerful but must be constrained. Pruning, null-delimited output, and deterministic ordering are essential for safe refactoring.

Homework/Exercises to practice the concept

  1. Write a find command that includes only .js files and excludes dist/.
  2. Create a file with spaces in its name and test your pipeline.
  3. Sort the file list and verify deterministic ordering across runs.

Solutions to the homework/exercises

  1. Use -path './dist' -prune -o -name '*.js' -print.
  2. Use -print0 and xargs -0 to avoid splitting.
  3. Pipe the list through sort and compare outputs.

2.2 Regex Precision and sed Addressing

Fundamentals

Refactoring is not just search-and-replace. You must match only the intended code constructs and avoid strings, comments, and unrelated identifiers. Regex precision is the core skill: you must define boundaries (word boundaries, anchors) and avoid overbroad matches. sed provides addressing, which lets you limit replacements to certain lines or ranges. For example, you can restrict changes to files of a certain type, or only within a function block. sed also supports in-place editing with -i, but GNU and BSD sed differ in syntax. Understanding these details prevents accidental changes and makes the refactor portable. Regex cannot fully parse code, so you must be explicit about the limits of text-based refactoring and constrain patterns accordingly.

Deep Dive into the concept

Regex precision begins with understanding identifier boundaries. Suppose you want to rename old_func to new_func. A naive s/old_func/new_func/g will also modify old_function or myold_func. The correct approach is to add word boundaries: \bold_func\b in tools that support them, or use explicit boundary patterns like (^|[^A-Za-z0-9_])old_func([^A-Za-z0-9_]|$) for POSIX portability. In code refactors, you often need language-specific knowledge: identifiers are different in Python, JavaScript, and C. This project encourages you to define patterns per language and document limitations (for example, you might not fully parse nested strings).

sed addressing lets you scope substitutions to specific regions. For example, you can match only lines that begin with import in Python or only lines between BEGIN and END markers. This reduces risk, especially when refactoring configuration files or documentation. For multi-line constructs, sed is limited because it is line-oriented; you may need to use a range with /start/,/end/ and apply a substitution inside. This is powerful but must be used carefully: if your end pattern appears unexpectedly, your range may be shorter or longer than intended. Testing on small samples is critical.
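
A line-oriented sketch of range addressing; the BEGIN/END markers are illustrative:

```shell
#!/usr/bin/env bash
# Only the lines between the BEGIN and END markers are eligible for the
# substitution; identical text outside the range is left untouched.
input='keep old
BEGIN
old value
END
keep old'

out=$(printf '%s\n' "$input" | sed '/BEGIN/,/END/ s/old/new/')
printf '%s\n' "$out"
```

If the END marker were missing, the range would extend to end of input, which is exactly the "range longer than intended" failure mode described above.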

Portability is a major issue with sed. GNU sed supports -r for extended regex and -i for in-place edits without requiring a backup suffix. BSD sed (macOS) requires -i '' for in-place edits and uses -E for extended regex. A cross-platform tool must either detect the platform and choose the correct flags or avoid in-place editing by writing to temp files. For safety and portability, a common pattern is to write to a temp file and move it into place. This is slower but deterministic and avoids sed dialect problems.
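
The temp-file-and-move pattern can be sketched like this (file contents are illustrative):

```shell
#!/usr/bin/env bash
# Portable in-place editing without sed -i: GNU sed takes -i with an
# optional suffix, BSD sed requires -i ''. Writing to a temp file and
# moving it over the original sidesteps the dialect entirely, and mv on
# the same filesystem replaces the file atomically.
f=$(mktemp)
printf 'old_func()\n' > "$f"

tmp=$(mktemp)
sed -E 's/old_func\(/new_func(/g' "$f" > "$tmp" && mv "$tmp" "$f"

cat "$f"
```

One caveat: mktemp usually creates the temp file in /tmp, which may be a different filesystem from the target; creating the temp file next to the original (e.g., `mktemp "$f.XXXXXX"`) keeps the final mv atomic.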

Regex precision is also about testing. You should create a suite of example lines that should match and should not match, then run your pattern against them. This can be automated: store the examples in fixtures and run grep -n to verify. This makes refactor logic testable. For complex changes, consider using a two-phase pipeline: first grep to identify candidate lines, then a second sed or awk to perform a targeted replacement. This reduces the chance of making changes where you don’t expect them.
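
A file-free sketch of such a fixture check; the sample lines and variable names are assumptions:

```shell
#!/usr/bin/env bash
# The pattern under test: old_func( preceded by start-of-line or a
# non-identifier character (POSIX-portable boundary).
pattern='(^|[^A-Za-z0-9_])old_func\('

should_match='old_func(x)
foo = old_func(1)'
should_not='old_function(x)
myold_func(1)'

# grep -Ec counts matching lines; "|| true" keeps the zero-match case
# from aborting the script when grep exits nonzero.
hits_good=$(printf '%s\n' "$should_match" | grep -Ec "$pattern")
hits_bad=$(printf '%s\n' "$should_not" | grep -Ec "$pattern" || true)

echo "should-match hits: $hits_good (want 2)"
echo "should-not hits: $hits_bad (want 0)"
```

Keeping the fixtures next to the refactor script turns every pattern change into a testable event rather than a leap of faith.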

Another key concept is idempotency. A refactor should be safe to run multiple times without further changes. If your regex matches already-updated code, it may apply the transformation again, causing duplication or corruption. For example, replacing foo() with bar() should not change bar() on subsequent runs. You can enforce idempotency by anchoring patterns or by guarding replacements with negative conditions. In sed, you can use address patterns to skip lines that already contain the new value.
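
The boundary-anchored pattern shown earlier is already idempotent; a sketch that demonstrates this by applying it twice:

```shell
#!/usr/bin/env bash
# Applying the substitution a second time must change nothing: new_func(
# no longer matches, and myold_func( never matched in the first place.
line='old_func(); myold_func(); new_func();'
sub='s/(^|[^A-Za-z0-9_])old_func\(/\1new_func(/g'

pass1=$(printf '%s\n' "$line" | sed -E "$sub")
pass2=$(printf '%s\n' "$pass1" | sed -E "$sub")

printf '%s\n' "$pass1"
[ "$pass1" = "$pass2" ] && echo "idempotent: second pass changed nothing"
```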

For more complex replacements, consider doing a two-pass approach: first generate a report of exact match locations and review it, then apply changes. This mirrors how large teams operate and prevents "regex surprise" when patterns are broader than expected. Even in a learning project, building this review step teaches disciplined refactoring.

How this fits into the project

Regex precision and sed addressing determine whether your refactor is safe. This concept guides both dry-run analysis and actual edits.

Definitions & key terms

  • Word boundary: A transition between word characters and non-word characters.
  • Address range: sed syntax that limits commands to certain lines.
  • In-place edit: Modifying files directly rather than via temporary files.
  • Idempotent change: A change that can be applied multiple times without further effect.
  • Dialect: Differences between GNU and BSD tool behavior.

Mental model diagram (ASCII)

line -> [regex match?] -> [addressed?] -> apply substitution -> output line

How it works (step-by-step, with invariants and failure modes)

  1. Identify candidate lines with a filter regex.
  2. Apply a substitution regex with boundaries.
  3. Limit the substitution using sed addresses.
  4. Write output to temp file and replace original.

Invariant: Only lines that match both the filter and the address are modified. Failure modes: regex too broad, sed dialect mismatch, non-idempotent replacement.

Minimal concrete example

# Replace function call only when it appears as a standalone identifier
sed -E 's/(^|[^A-Za-z0-9_])old_func\(/\1new_func(/g' file.py

Common misconceptions

  • “Regex can parse code perfectly” -> It cannot; document limitations.
  • “sed -i works everywhere” -> BSD and GNU sed differ.
  • “Find/replace is safe” -> Only with precise boundaries and tests.

Check-your-understanding questions

  1. Why can s/old/new/g be dangerous in code refactors?
  2. How do sed address ranges reduce risk?
  3. What is a safe cross-platform alternative to sed -i?
  4. What makes a refactor idempotent?

Check-your-understanding answers

  1. It can match substrings in unrelated identifiers or comments.
  2. They restrict substitutions to known regions or patterns.
  3. Write to a temporary file and move it into place.
  4. Re-running it results in no further changes.

Real-world applications

  • API migrations across monorepos.
  • Large-scale import renames.
  • Configuration refactors in infrastructure repos.

Where you’ll apply it

  • See §3.2 for safe substitution requirements.
  • See §5.11 for decision table on portability and idempotency.
  • Also used in: P01 Log Analyzer for precise regex parsing.

References

  • “Mastering Regular Expressions” by Jeffrey Friedl
  • sed(1) manual page
  • “Effective Shell” by Dave Kerr

Key insights

A refactor is a controlled transformation: precision and idempotency are non-negotiable.

Summary

Precise regex and sed addressing protect your codebase from accidental changes. Build patterns that are explicit, test them, and make the refactor idempotent.

Homework/Exercises to practice the concept

  1. Create a sample file with old_func, old_function, and myold_func. Write a regex that only replaces old_func.
  2. Use a sed address range to modify only lines between two markers.
  3. Build a test script that runs the refactor twice and checks for no changes.

Solutions to the homework/exercises

  1. Use a word boundary pattern or explicit non-word boundaries.
  2. Use /START/,/END/ addresses in sed.
  3. Run git diff after the second run; it should be empty.

2.3 Backup, Rollback, and Change Reporting

Fundamentals

Any automated refactor must be reversible. Backups and rollback mechanisms protect against mistakes and enable safe experimentation. A backup can be as simple as copying files to a timestamped directory. A rollback should restore the original files quickly and reliably. Reporting is equally important: a summary of changed files, counts of replacements, and a dry-run diff builds trust and makes code review easier. These mechanisms turn a risky script into a professional tool. Atomic writes and clear rollback steps make it possible to recover even if the process is interrupted mid-run. A clear audit trail also simplifies code review.

Deep Dive into the concept

Backups are a safety contract with the user. A refactor tool should create a snapshot of the files it will modify, ideally in a dedicated directory like .refactor_backups/2026-01-01-120000. The backup process should preserve directory structure so that files can be restored with a simple copy. This also enables selective rollback if only some files need to be reverted. Backups should happen before any modifications are made, and the tool should confirm the backup exists before applying changes.

Rollback is the inverse of backup. A reliable rollback function reads the backup directory and restores each file to its original location. This should be deterministic and idempotent: running rollback multiple times should result in the same state. It should also check for conflicts, such as files created after the refactor. For a learning project, it is sufficient to restore only files that were modified, but you should document this limitation. You can store a list of modified files in the backup directory to drive rollback.
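
A minimal rollback sketch under the snapshot layout described above; the paths are illustrative, and plain mkdir + cp is used instead of GNU-only cp --parents:

```shell
#!/usr/bin/env bash
# Set up a tree, snapshot one file, break it, then restore it.
work=$(mktemp -d)
mkdir -p "$work/src"
printf 'original\n' > "$work/src/a.py"

# Backup preserving the relative path under the snapshot directory.
backup="$work/.refactor_backups/snap1"
mkdir -p "$backup/src"
cp "$work/src/a.py" "$backup/src/a.py"

# Simulate a bad edit, then roll back every file found in the snapshot.
printf 'broken\n' > "$work/src/a.py"
(cd "$backup" && find . -type f -print0) |
  while IFS= read -r -d '' rel; do
      cp "$backup/$rel" "$work/$rel"
  done

cat "$work/src/a.py"
```

Because restore is a plain copy from the snapshot, running it twice is a no-op, which satisfies the idempotency requirement above.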

Reporting is how you gain confidence in the refactor. A dry-run mode should show the number of matches and the list of files that would change. A useful enhancement is a diff preview: show the first few lines of a unified diff for each file or aggregate a diff summary. This can be done with diff -u on the original and transformed file. For deterministic reporting, sort file lists and use stable formatting.
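
A dry-run diff preview can be sketched as follows (file contents are illustrative):

```shell
#!/usr/bin/env bash
# Transform into a separate file, then show the head of a unified diff.
# diff exits 1 when files differ, but the pipeline status is head's.
orig=$(mktemp); new=$(mktemp)
printf 'old_func()\nuntouched\n' > "$orig"
sed -E 's/old_func\(/new_func(/g' "$orig" > "$new"

preview=$(diff -u "$orig" "$new" | head -n 8)
printf '%s\n' "$preview"
```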

Another important concept is atomicity. If the refactor is interrupted, you do not want to leave files half-modified. The safest method is to write changes to a temporary file and then atomically move it into place. This way, each file is either fully updated or untouched. Combine this with backups and you have a robust system.

Version control is a natural partner here. The tool should encourage users to commit or at least check git status before applying changes. A refactor report can include a reminder and even a summary of modified files to stage. While you should not run git commands automatically in this project, mention this in the documentation.

Finally, deterministic output is essential for tests. Your tool should output the same report for the same input. This means stable ordering, consistent timestamps (use fixed timestamps in golden tests), and predictable counts. Determinism makes it possible to build automated tests that compare output to a known good snapshot.

A more advanced safety technique is to generate a manifest file that records the checksum of each file before modification, the number of replacements applied, and the resulting checksum after modification. This enables quick verification that the replacement actually changed the expected bytes and did not introduce unintended edits. In a rollback scenario, the manifest can be used to ensure that restored files match their original checksums, which is especially valuable if the refactor spans thousands of files or runs over a long time. Even if you do not implement checksum verification, describing it clarifies the difference between "backup exists" and "backup is verified," which is a key distinction in production change management.
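
A sketch of the manifest idea using POSIX cksum; the manifest line format here is an assumption, not a prescribed schema:

```shell
#!/usr/bin/env bash
# Record a checksum before and after the edit; a manifest of these pairs
# lets rollback verify that restored files match their originals.
f=$(mktemp)
printf 'old_func()\n' > "$f"

before=$(cksum < "$f" | awk '{print $1}')
sed -E 's/old_func\(/new_func(/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
after=$(cksum < "$f" | awk '{print $1}')

printf '%s before=%s after=%s\n' "$f" "$before" "$after"
[ "$before" != "$after" ] && echo "file changed as expected"
```

cksum is chosen only because it is in POSIX; a production tool would likely prefer sha256sum (GNU) or shasum where available.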

How this fits into the project

Backup and reporting are the safety layer that makes your refactor tool trustworthy. They transform a risky script into a tool suitable for real teams.

Definitions & key terms

  • Backup snapshot: Copy of files before modification.
  • Rollback: Restoring original files from a snapshot.
  • Dry-run: Analysis without modifying files.
  • Atomic replace: Write to temp file and move into place.
  • Deterministic report: Stable output across runs.

Mental model diagram (ASCII)

scan -> backup -> transform -> report
         ^           |
         |           v
      rollback <-----+ (if needed)

How it works (step-by-step, with invariants and failure modes)

  1. Collect candidate files.
  2. Copy each candidate to a timestamped backup directory.
  3. Apply transformation into a temp file.
  4. Replace original file atomically.
  5. Record changes and produce a report.

Invariant: Every modified file has a backup copy. Failure modes: backup directory missing, partial writes, inconsistent reports.

Minimal concrete example

# Create a backup snapshot
backup_dir=".refactor_backups/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$backup_dir"
cp --parents file.py "$backup_dir"   # GNU cp; on macOS/BSD use rsync -R instead

Common misconceptions

  • “Backups are optional” -> They are essential for safe refactors.
  • “Dry-run is enough” -> You still need rollback if apply mode fails.
  • “Reports are cosmetic” -> They are critical for trust and review.

Check-your-understanding questions

  1. Why should backups be created before any modifications?
  2. What makes an atomic file replacement safer?
  3. How do you ensure deterministic reporting?
  4. Why is rollback different from undoing a git commit?

Check-your-understanding answers

  1. It guarantees original state is preserved even if the tool crashes.
  2. It avoids partially written files.
  3. Sort file lists and use consistent formatting and timestamps.
  4. Rollback restores raw files even outside version control contexts.

Real-world applications

  • Safe API migrations with reversible steps.
  • Bulk linting or formatting across large repos.
  • Security patching where rollback is mandatory.

Where you’ll apply it

  • See §3.2 for backup and dry-run requirements.
  • See §3.7 for the deterministic demo transcript.
  • Also used in: P06 Personal DevOps Toolkit as a reusable subcommand.

References

  • “The Practice of System and Network Administration” (change management)
  • “Effective Shell” by Dave Kerr (script reliability)

Key insights

Trust in a refactor tool comes from reversibility and clear reporting, not just correct regex.

Summary

Backups, rollback, and deterministic reports make automated refactors safe. These mechanisms turn a risky script into a professional tool.

Homework/Exercises to practice the concept

  1. Implement a backup directory with timestamped snapshots.
  2. Write a rollback script that restores from the latest snapshot.
  3. Generate a report that lists modified files in sorted order.

Solutions to the homework/exercises

  1. Use mkdir -p and cp --parents on GNU systems, or rsync -R (which preserves relative paths) elsewhere.
  2. Restore files by copying from the snapshot back to original paths.
  3. Sort file lists with sort before printing.

3. Project Specification

3.1 What You Will Build

A CLI tool called refactor that:

  • Scans a codebase for a target pattern.
  • Reports how many matches are found and where.
  • Applies safe replacements with backups.
  • Supports dry-run, apply, and rollback modes.
  • Produces a summary report and optional diff preview.

Included:

  • Cross-platform operation (GNU/BSD sed handling).
  • A .refactorignore-style exclusion list.
  • Deterministic reporting.

Excluded:

  • Full AST parsing or language-aware refactors.
  • Binary file changes.

3.2 Functional Requirements

  1. File selection: Include/exclude directories with prune rules.
  2. Pattern matching: Regex-based search for a pattern.
  3. Dry-run mode: Report counts without modifying files.
  4. Apply mode: Modify files in place with backups.
  5. Rollback mode: Restore files from last backup.
  6. Cross-platform sed: Works on macOS and Linux.
  7. Reporting: Summary of files changed and replacement counts.
  8. Exit codes: Indicate success or failure.

3.3 Non-Functional Requirements

  • Safety: Never modify files outside the target root.
  • Reliability: Backups must always be created before changes.
  • Usability: Clear help output and understandable report.

3.4 Example Usage / Output

$ ./refactor.sh --root ./src --from "old_func\(" --to "new_func(" --dry-run
Scanning: 1,402 files
Matches found: 86
Files to change: 12
Dry run complete. No files modified.

3.5 Data Formats / Schemas / Protocols

Report format (example):

files_scanned: 1402
files_changed: 12
matches_total: 86
backup_dir: .refactor_backups/2026-01-01-120000

3.6 Edge Cases

  • Filenames with spaces or newlines.
  • Binary files that should be skipped.
  • Patterns that appear in comments or strings.
  • Large repos where sorting file lists is expensive.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./refactor.sh --root ./src --from "old_func\(" --to "new_func(" --dry-run
./refactor.sh --root ./src --from "old_func\(" --to "new_func(" --apply

3.7.2 Golden Path Demo (Deterministic)

$ ./refactor.sh --root ./sample --from "old_func\(" --to "new_func(" --apply
Scanning: 4 files
Matches found: 3
Files changed: 2
Backup created: .refactor_backups/2026-01-01-120000

3.7.3 Failure Demo (Deterministic)

$ ./refactor.sh --root ./missing --from "old_func\(" --to "new_func(" --apply
ERROR: root directory not found: ./missing
exit code: 2

3.7.4 If CLI: exact terminal transcript

$ ./refactor.sh --root ./sample --from "old_func\(" --to "new_func(" --dry-run
Scanning: 4 files
Matches found: 3
Files to change: 2
$ echo $?
0

Exit codes:

  • 0: Success.
  • 1: Pattern not found or no changes.
  • 2: Invalid arguments or missing root.

4. Solution Architecture

4.1 High-Level Design

File discovery -> Candidate list -> Backup -> Transform -> Report
        |                          |
        v                          v
   exclude rules                rollback snapshot

4.2 Key Components

Component Responsibility Key Decisions
Finder Collect candidate files prune rules + .refactorignore
Matcher Detect occurrences grep -n with precise regex
Transformer Apply replacement temp file + atomic move
Backup Snapshot originals timestamped directory
Reporter Summarize changes deterministic order

4.3 Data Structures (No Full Code)

files[]  -> ordered list of candidate files
changes[file] = match_count
backup_dir/ + original file paths

4.4 Algorithm Overview

Key Algorithm: Safe Refactor Workflow

  1. Build candidate file list with find and prune rules.
  2. For each file, count matches with grep.
  3. If in dry-run, report and exit.
  4. Create backup snapshot.
  5. Apply sed replacement into temp file and move.
  6. Report summary.

Complexity Analysis:

  • Time: O(n * f), where n is the number of files and f is the average file size.
  • Space: O(n) for file list and backups.

5. Implementation Guide

5.1 Development Environment Setup

# No special dependencies required

5.2 Project Structure

refactor/
├── refactor.sh
├── lib/
│   ├── find_files.sh
│   ├── transform.sh
│   └── report.sh
├── sample/
│   └── src/
└── tests/
    └── golden-output.txt

5.3 The Core Question You’re Answering

“How can I refactor a large codebase safely, with reversibility and confidence?”

5.4 Concepts You Must Understand First

  1. find pruning and null-delimited output
  2. Regex boundaries for identifiers
  3. Cross-platform sed differences

5.5 Questions to Guide Your Design

  1. What directories should always be excluded?
  2. What is the risk if your pattern matches inside comments?
  3. How will you guarantee rollback?

5.6 Thinking Exercise

Imagine the refactor changes 500 files and you realize the regex was wrong. How quickly can you recover? What files do you need to restore?

5.7 The Interview Questions They’ll Ask

  1. How do you avoid false positives in large-scale refactors?
  2. Why is backup required even if you use git?
  3. How do GNU and BSD sed differ in in-place editing?

5.8 Hints in Layers

Hint 1: Build file list safely

find . -path './.git' -prune -o -type f -name '*.py' -print0

Hint 2: Match candidates

grep -n "old_func" file.py

Hint 3: Replace with sed

sed -E 's/(^|[^A-Za-z0-9_])old_func\(/\1new_func(/g' file.py

Hint 4: Add backups Copy files to .refactor_backups/ before editing.

5.9 Books That Will Help

Topic Book Chapter
Shell safety “Effective Shell” Error handling chapters
Regex “Mastering Regular Expressions” Ch. 3-4
find “The Linux Command Line” Ch. 17

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

Goals: Safe file discovery and dry-run reporting.

Tasks:

  1. Implement prune rules and file list generation.
  2. Implement dry-run match counting.

Checkpoint: Dry-run report lists correct files and counts.

Phase 2: Core Functionality (5-7 days)

Goals: Apply replacements safely with backups.

Tasks:

  1. Implement backup snapshot creation.
  2. Apply replacements via temp files and atomic move.

Checkpoint: Apply mode modifies files and backup exists.

Phase 3: Polish & Rollback (3-4 days)

Goals: Rollback mode and diff reporting.

Tasks:

  1. Implement rollback from latest snapshot.
  2. Add optional diff preview for changed files.

Checkpoint: Rollback restores files exactly and diff output is stable.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
In-place edits sed -i vs temp file temp file cross-platform safety
File ordering unsorted vs sorted sorted deterministic reports
Backups per-file .bak vs snapshot dir snapshot dir easy rollback

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Regex correctness known sample lines
Integration Tests End-to-end refactor sample repo fixtures
Edge Case Tests Weird filenames spaces, unicode, newlines

6.2 Critical Test Cases

  1. Pattern matches only in intended identifiers.
  2. Dry-run output matches golden report.
  3. Rollback restores original files byte-for-byte.

6.3 Test Data

# sample.py
old_func()
old_function()
"old_func()"  # inside string

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Overbroad regex unrelated code changed add boundaries and tests
Unsafe filenames skipped or broken files use -print0 and -0
No backups irreversible changes always snapshot first

7.2 Debugging Strategies

  • Use grep -n to inspect matches before replacing.
  • Run dry-run and compare with expectations.
  • Use diff -u to inspect changes.

7.3 Performance Traps

Avoid running sed on every file if grep shows no matches; filter first.
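
A sketch of that filter-first pattern; the file names are illustrative:

```shell
#!/usr/bin/env bash
# Two files: one with a match, one without.
dir=$(mktemp -d)
printf 'old_func()\n' > "$dir/hit.py"
printf 'nothing here\n' > "$dir/miss.py"

for f in "$dir"/*.py; do
    # grep -q is a cheap pre-check; sed (and the temp-file dance)
    # only runs on files that actually contain the pattern.
    grep -q 'old_func(' "$f" || continue
    sed -E 's/old_func\(/new_func(/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    echo "edited: $f"
done
```

This also keeps the report honest: only files that genuinely changed get rewritten, so timestamps on untouched files stay intact.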


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --include and --exclude patterns.
  • Add --count-only mode for fast scanning.

8.2 Intermediate Extensions

  • Add language-specific rules (e.g., skip strings for Python).
  • Implement .refactorignore parsing.

8.3 Advanced Extensions

  • Add a preview TUI with diff browsing.
  • Integrate with git to create a branch automatically.

9. Real-World Connections

9.1 Industry Applications

  • Large-scale API migration projects.
  • Automated dependency upgrades.
  • comby: structural code search and replace.
  • codemod tools in JS ecosystems.

9.2 Interview Relevance

  • Safe automation and tooling.
  • Handling risk in large code changes.

10. Resources

10.1 Essential Reading

  • “Effective Shell” by Dave Kerr
  • “Mastering Regular Expressions” by Jeffrey Friedl

10.2 Video Resources

  • “Safe refactoring with CLI tools” (talk)

10.3 Tools & Documentation

  • find(1) and sed(1) manual pages
  • grep(1) manual page

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why -prune is essential.
  • I can explain how my regex avoids false positives.
  • I can explain how rollback restores files.

11.2 Implementation

  • Dry-run output is correct and deterministic.
  • Backups are created before any modifications.
  • Rollback restores files exactly.

11.3 Growth

  • I can present this tool as an internal productivity tool.
  • I can discuss limitations of regex-based refactors.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Safe file traversal and dry-run report.
  • Apply mode with backups.
  • Deterministic summary output.

Full Completion:

  • Rollback mode and diff previews.
  • Cross-platform behavior confirmed.
  • Golden tests pass.

Excellence (Going Above & Beyond):

  • .refactorignore support.
  • Optional git integration.
  • Performance benchmarks on large repos.