Project 7: git-insight (Composability and Parsing)
Build a CLI that analyzes git activity using safe command execution and robust parsing.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 (Intermediate) |
| Time Estimate | Weekend |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 3: Genuinely clever |
| Business Potential | 2: Internal tooling |
| Prerequisites | CLI basics, process execution, text parsing |
| Key Topics | safe exec, parsing, performance, caching |
1. Learning Objectives
By completing this project, you will:
- Execute external commands safely without shell injection.
- Parse structured git output deterministically.
- Design output modes for humans and scripts.
- Cache expensive git queries for performance.
- Build a reliable CLI around another tool.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Safe Command Execution
Fundamentals
When a CLI invokes another tool, the biggest risk is shell injection. Passing a string to a shell (sh -c) allows user input to be interpreted as shell syntax, which can execute unintended commands. Safe command execution means invoking the command directly with an argument array. This avoids shell parsing entirely. It also makes behavior consistent across platforms.
Deep Dive into the concept
Safe execution starts with argument separation. Instead of building a command string like git log --pretty=format:"%H|%an", you should pass git, log, and --pretty=format:%H%x00%an as separate arguments. This ensures that any user-provided values (like a repo path or branch name) are treated as data, not syntax. It also avoids quoting problems. The shell has its own rules for quotes, globbing, and escape characters. If you bypass the shell, those rules do not apply, and you get deterministic behavior.
You also need to control the working directory. git commands are sensitive to the repo root. Use --git-dir (and --work-tree) or set the working directory of the child process to the repo path. This is safer than trying to change directories inside a shell command, and it keeps path handling unambiguous even when paths contain spaces or shell metacharacters.
Another aspect is error handling. External commands can fail for many reasons: repo not found, not a git repo, permission issues, or invalid args. You must capture stderr and exit codes and surface a clean error message to the user. Do not dump raw stderr unless --debug is enabled. When you do show raw stderr, redact or sanitize any paths or secrets if necessary.
Finally, performance matters. Running git log repeatedly can be expensive on large repos. Use caching where possible or combine queries to minimize calls. For example, get commit lists once and compute multiple metrics from the same output. This is safer and faster than calling git for each metric separately.
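Putting these pieces together, a minimal Go sketch of the runner might look like the following. The `runGit` helper is an illustrative name, not a prescribed API, and the error handling is simplified; a real tool would map failures onto the exit codes defined later in the spec.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runGit executes git with an argv array -- no shell is involved, so
// user-supplied values (repo paths, branch names) are always data, not syntax.
func runGit(repoPath string, args ...string) (string, error) {
	cmd := exec.Command("git", args...) // argv array; never sh -c
	cmd.Dir = repoPath                  // repo context via working directory

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// Surface a clean message; show raw stderr only under --debug.
			return "", fmt.Errorf("git exited with code %d", exitErr.ExitCode())
		}
		return "", fmt.Errorf("failed to run git: %w", err) // e.g. git not installed
	}
	return stdout.String(), nil
}

func main() {
	out, err := runGit(".", "log", "-1", "--pretty=format:%H%x00%an")
	if err != nil {
		fmt.Fprintln(os.Stderr, "git-insight:", err)
		return // a real CLI would os.Exit with a documented code here
	}
	fmt.Println(out)
}
```

Because `exec.Command` receives each argument separately, a repo path containing spaces or shell metacharacters needs no quoting at all.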
How this fits into the project
This concept defines how git-insight executes git safely and how it handles errors and performance concerns.
Definitions & key terms
- Shell injection: When user input is interpreted as shell commands.
- argv array: List of arguments passed directly to a process.
- Working directory: Directory where the command runs.
- Exit code propagation: Returning child process exit codes.
Mental model diagram (ASCII)
User input -> argv array -> exec git -> parse output
How it works (step-by-step)
- Validate user input (repo path, flags).
- Build argv array with no shell interpolation.
- Execute git with working directory set.
- Capture stdout/stderr and exit code.
- Handle errors with a clean message.
Minimal concrete example
$ git-insight churn --repo ./myrepo
# internally executes: git log --numstat --pretty=format:%H%x00%an with argv array
Common misconceptions
- “Using sh -c is fine.” -> It exposes injection and quoting bugs.
- “Exit codes can be ignored.” -> They are critical for automation.
- “git output is always stable.” -> Only explicit format flags produce stable output; default human output can change between versions.
Check-your-understanding questions
- Why should you avoid shell commands when executing git?
- How do you set the repo context safely?
- Why is exit code propagation important?
Check-your-understanding answers
- It prevents injection and quoting issues.
- Set the working directory or use `--git-dir` and `--work-tree`.
- Scripts depend on accurate exit codes.
Real-world applications
- Many internal tools wrap git for reporting and dashboards.
- CI systems use git safely for diff and change detection.
Where you will apply it
- See §3.2 Functional Requirements and §4.1 High-Level Design.
- Also used in: Project 5: env-vault and Project 8: api-forge.
References
- The Linux Programming Interface, process execution chapters
- Git documentation on porcelain vs plumbing
Key insights
Safe execution is about never letting user input become shell syntax.
Summary
Direct exec with argv arrays keeps your CLI safe and predictable.
Homework/Exercises to practice the concept
- Write a wrapper that runs `git status` without using the shell.
- Add a `--repo` flag that sets the working directory.
- Test with a repo path containing spaces.
Solutions to the homework/exercises
- Use `exec.Command("git", "status")`.
- Set `cmd.Dir` or use `--git-dir`.
- Spaces work when you avoid the shell.
2.2 Parsing Structured Output
Fundamentals
Parsing git output is reliable only if you use stable formats. Git distinguishes between porcelain commands (designed for humans) and plumbing commands (designed for scripts); default human-oriented output can change between versions, so scripts should never parse it. For a CLI, request explicit machine-readable formats such as `--pretty=format:` or `--numstat` with explicit separators. This produces deterministic output that can be parsed safely.
Deep Dive into the concept
The main challenge in parsing is delimiter collisions. If you use | as a delimiter, commit messages or author names might include | and break your parser. The safe approach is to use NUL (\x00) as a separator. Git allows you to include \x00 in --pretty=format strings. This means you can generate output like hash\0author\0date\0 and then split on NUL. This is robust because NUL rarely appears in git metadata.
Another parsing issue is line endings. Git outputs lines separated by newlines, but some fields may contain newlines if not sanitized. For example, commit messages can be multi-line. To avoid this, you should use %s for the subject line only, not %B for the full body. You can also use --no-merges or --all flags to control the scope.
For file churn, git log --numstat outputs lines with added and removed counts per file. This output includes blank lines between commits. You must handle these separators. The recommended approach is to parse the output as a stream: read commit headers, then accumulate file stats until the next header. This is more robust than splitting everything into memory.
If you output JSON, make sure to escape strings properly and preserve unicode. JSON output should be stable and schema-defined. This is important for downstream tools like jq. It is also useful for caching and test fixtures. Deterministic ordering is key: sort authors by commit count, sort files by churn, and use stable tie-breaking rules.
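The parsing strategy above can be sketched in Go, assuming the runner produced one commit per line with NUL between fields (the shape that `--pretty=format:%H%x00%an%x00%ad` yields). `parseLog` is an illustrative helper name, not a fixed API.

```go
package main

import (
	"fmt"
	"strings"
)

// CommitRecord mirrors the fields requested via --pretty=format:%H%x00%an%x00%ad.
type CommitRecord struct {
	Hash   string
	Author string
	Date   string
}

// parseLog splits raw git log output into records: one commit per line,
// fields separated by NUL, so author names containing '|' or quotes
// cannot break the parser.
func parseLog(raw string) []CommitRecord {
	var records []CommitRecord
	for _, line := range strings.Split(strings.TrimRight(raw, "\n"), "\n") {
		fields := strings.Split(line, "\x00")
		if len(fields) != 3 {
			continue // skip malformed lines rather than corrupt the data set
		}
		records = append(records, CommitRecord{Hash: fields[0], Author: fields[1], Date: fields[2]})
	}
	return records
}

func main() {
	// An author name containing '|' parses cleanly because NUL is the delimiter.
	raw := "abc123\x00alice|smith\x002026-01-01\ndef456\x00bob\x002026-01-02\n"
	for _, r := range parseLog(raw) {
		fmt.Printf("%s %s %s\n", r.Hash, r.Author, r.Date)
	}
}
```

Note that this sketch works because `%H`, `%an`, and `%ad` cannot contain newlines; a format that included `%B` (the full commit body) would need NUL-terminated records instead of newline-terminated ones.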
How this fits into the project
This concept defines the parsing strategy for git-insight and ensures that it can handle real repositories without breaking.
Definitions & key terms
- Porcelain output: Human-friendly output with stable flags.
- Plumbing output: Low-level commands for scripts.
- Delimiter collision: When data contains your separator.
- NUL separator: `\x00`, a safe delimiter because it rarely appears in text.
Mental model diagram (ASCII)
raw git output -> split on NUL -> records -> aggregate -> report
How it works (step-by-step)
- Run git with a stable `--pretty=format` string.
- Read stdout as bytes and split on NUL.
- Assemble records into structs.
- Aggregate counts (commits, churn).
- Render output in table or JSON.
Minimal concrete example
$ git log --pretty=format:"%H%x00%an%x00%ad" --date=short
Common misconceptions
- “Human output is good enough.” -> It changes and is hard to parse.
- “Newlines are safe delimiters.” -> Commit messages can include newlines.
- “Sorting can be arbitrary.” -> Scripts need stable ordering.
Check-your-understanding questions
- Why use NUL separators?
- Why avoid parsing full commit messages?
- How do you ensure deterministic ordering in output?
Check-your-understanding answers
- To avoid delimiter collisions in normal text.
- Full messages can include newlines and break parsing.
- Sort by primary metric and tie-break by name.
Real-world applications
- Git tooling and CI scripts parse porcelain output.
- `git log --pretty` is widely used for automation.
Where you will apply it
- See §3.5 Data Formats and §4.4 Algorithm Overview.
- Also used in: Project 1: minigrep-plus and Project 8: api-forge.
References
- Git documentation on `--pretty=format` (`git log` manual)
Key insights
Deterministic parsing requires stable formats and safe delimiters.
Summary
Use NUL-separated porcelain output, avoid multi-line fields, and enforce stable ordering.
Homework/Exercises to practice the concept
- Generate NUL-separated output from git log.
- Write a parser that counts commits per author.
- Add deterministic sorting to the output.
Solutions to the homework/exercises
- Use `--pretty=format:"%H%x00%an"`.
- Split on NUL and count authors.
- Sort by count descending, then name ascending.
2.3 Performance and Caching
Fundamentals
Git queries can be expensive on large repos. Running multiple git commands for each metric can make your CLI slow. Performance strategies include caching, combining queries, and limiting scope. Caching means storing computed results for a short period or until the repo changes. This improves responsiveness for repeated queries.
Deep Dive into the concept
To cache effectively, you need a cache key that reflects the repository state. The simplest key is the HEAD commit hash plus the command parameters. If HEAD changes, the cache is invalid. If parameters change (date range, author filter), the cache is different. The cache can be stored in a file under XDG cache directories. For example, ~/.cache/git-insight/ can store JSON files keyed by a hash of inputs.
Combining queries reduces overhead. Instead of running separate commands for commits, authors, and churn, run a single git log that provides all required fields. Then compute multiple metrics from that output. This is more efficient and also ensures consistent results because all metrics are derived from the same snapshot. Be careful with memory: if the repo is huge, streaming parse is better than loading everything into memory.
Scope limiting is also important. Users rarely need “all history” for quick insights. Provide flags like --since 30d or --max-commits 1000 and default to a reasonable range. This prevents the tool from being slow by default. It also reduces the risk of timeouts in CI.
Finally, caching must be transparent. You should provide a --no-cache flag and a cache clear subcommand. You should also expose cache hits in --debug output so users can understand performance behavior. Determinism is important: cached outputs should be identical to live outputs.
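A sketch of that cache-key scheme in Go, assuming the HEAD hash has already been resolved (e.g. via `git rev-parse HEAD`). The names `cacheKey` and `cachePath` are illustrative; note that Go's `os.UserCacheDir` honors `$XDG_CACHE_HOME` on Linux, falling back to `~/.cache`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cacheKey derives a stable identifier from the repo's HEAD hash plus the
// command and its flags: if HEAD moves or the parameters change, the key
// changes and the stale entry is simply never read again.
func cacheKey(head, command string, args []string) string {
	sum := sha256.Sum256([]byte(head + "\x00" + command + "\x00" + strings.Join(args, "\x00")))
	return hex.EncodeToString(sum[:])
}

// cachePath places entries under the user cache directory,
// e.g. ~/.cache/git-insight/<key>.json on Linux (XDG spec).
func cachePath(key string) (string, error) {
	base, err := os.UserCacheDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(base, "git-insight", key+".json"), nil
}

func main() {
	k1 := cacheKey("abc123", "summary", []string{"--since=30d"})
	k2 := cacheKey("def456", "summary", []string{"--since=30d"}) // new HEAD -> new key
	fmt.Println(k1 != k2)                                        // true
	if p, err := cachePath(k1); err == nil {
		fmt.Println(p)
	}
}
```

Joining the key inputs with NUL (rather than plain concatenation) avoids ambiguous keys where different argument lists happen to concatenate to the same string.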
How this fits into the project
This concept defines performance strategies, cache design, and default query limits.
Definitions & key terms
- Cache key: Identifier derived from repo state and parameters.
- Cache invalidation: Determining when cached data is stale.
- Scope limiting: Restricting data to a time window or count.
Mental model diagram (ASCII)
inputs + HEAD -> cache key -> cache hit? -> use cache | run git
How it works (step-by-step)
- Compute cache key from repo HEAD and args.
- If cache entry exists, load it.
- If not, run git and compute metrics.
- Store results in cache.
- Return output.
Minimal concrete example
cache key = SHA256(HEAD + "summary" + "--since=30d")
Common misconceptions
- “Caching is premature optimization.” -> For git, it is a UX feature.
- “Always use full history.” -> It is slow and unnecessary for quick insights.
- “Cache can be global.” -> It must be scoped per repo.
Check-your-understanding questions
- Why is HEAD a good cache invalidation key?
- How does scope limiting improve performance?
- Why must cache be deterministic?
Check-your-understanding answers
- HEAD changes when repo state changes.
- It reduces data processed and improves speed.
- Scripts rely on consistent output.
Real-world applications
- `git status` uses internal caching for speed.
- CI systems cache git metadata between runs.
Where you will apply it
- See §3.2 Functional Requirements and §5.10 Implementation Phases.
- Also used in: Project 10: distro-flow and Project 8: api-forge.
References
- Git performance tuning guides
- XDG Cache Directory spec
Key insights
Caching is a UX tool: faster feedback encourages use.
Summary
Combine queries, limit scope, and cache results keyed by repo state for predictable performance.
Homework/Exercises to practice the concept
- Implement a cache that stores results per repo.
- Add a `--no-cache` flag.
- Compare runtimes with and without the cache.
Solutions to the homework/exercises
- Use XDG cache directory and a hashed key.
- Bypass cache when flag is set.
- Measure with `time`.
3. Project Specification
3.1 What You Will Build
A CLI named git-insight that reports repository activity such as commit counts, top authors, and file churn. It executes git commands safely, parses output deterministically, and provides table and JSON output modes. It caches results based on repo state.
3.2 Functional Requirements
- Commands: `summary`, `churn`, `authors`.
- Safe execution: no shell usage, argv arrays only.
- Parsing: NUL-separated output parsing.
- Output modes: table and JSON.
- Caching: cache results per repo and args.
- Scope flags: `--since`, `--until`, `--max-commits`.
3.3 Non-Functional Requirements
- Performance: default summary in under 1s on medium repos.
- Reliability: stable output schema.
- Usability: clear errors for invalid repos.
3.4 Example Usage / Output
$ git-insight summary --since 30d
Repo: my-app
Commits (30d): 142
Top authors:
alice 48
bob 31
3.5 Data Formats / Schemas / Protocols
{"repo":"my-app","commits":142,"top_authors":[{"name":"alice","count":48}]}
3.6 Edge Cases
- Repo path invalid -> exit code 2.
- Git not installed -> exit code 4.
- Empty repo -> output zero counts.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
# Build
go build -o git-insight ./cmd/git-insight
# Run in repo
./git-insight summary --since 30d
3.7.2 Golden Path Demo (Deterministic)
$ GIT_INSIGHT_NOW=2026-01-01T00:00:00Z ./git-insight summary --since 30d --json
{"repo":"my-app","commits":142,"top_authors":[{"name":"alice","count":48}]}
$ echo $?
0
3.7.3 Failure Demo (Deterministic)
$ ./git-insight summary --repo /not/a/repo
git-insight: not a git repository
$ echo $?
2
3.7.4 Exit Codes
- 0: Success.
- 2: Invalid repo or not a git repo.
- 4: Git executable not found.
4. Solution Architecture
4.1 High-Level Design
+------------------+
| CLI Parser |
+------------------+
|
v
+------------------+ +------------------+
| Git Runner | --> | Output Parser |
+------------------+ +------------------+
| |
v v
+------------------+ +------------------+
| Aggregator | --> | Formatter |
+------------------+ +------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Git runner | execute git safely | argv-only exec |
| Parser | parse output | NUL-separated formats |
| Aggregator | compute metrics | single-pass streaming |
| Formatter | table/JSON output | stable schemas |
4.3 Data Structures (No Full Code)
type CommitRecord struct {
Hash string
Author string
Date string
}
4.4 Algorithm Overview
Key Algorithm: Single-Pass Aggregation
- Execute git log with NUL separators.
- Stream parse records.
- Update counters and maps.
- Sort and render output.
Complexity Analysis:
- Time: O(N) records.
- Space: O(A) authors + O(F) files.
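The algorithm above can be sketched as one pass over parsed records followed by a deterministic sort; `countAuthors` and `AuthorCount` are illustrative names.

```go
package main

import (
	"fmt"
	"sort"
)

// AuthorCount pairs an author with their commit count.
type AuthorCount struct {
	Name  string
	Count int
}

// countAuthors performs the single-pass aggregation: one map update per
// record (O(N) time, O(A) space), then a deterministic sort -- count
// descending, name ascending as the tie-break -- so repeated runs and
// cached runs produce identical output.
func countAuthors(authors []string) []AuthorCount {
	counts := map[string]int{}
	for _, a := range authors {
		counts[a]++
	}
	out := make([]AuthorCount, 0, len(counts))
	for name, c := range counts {
		out = append(out, AuthorCount{name, c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Name < out[j].Name // stable tie-break
	})
	return out
}

func main() {
	top := countAuthors([]string{"bob", "alice", "alice", "carol", "bob", "alice"})
	for _, a := range top {
		fmt.Printf("%-8s %d\n", a.Name, a.Count)
	}
}
```

Without the explicit sort, Go's map iteration order is randomized, which is exactly the kind of nondeterminism that breaks scripted consumers.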
5. Implementation Guide
5.1 Development Environment Setup
mkdir git-insight && cd git-insight
5.2 Project Structure
cmd/git-insight/
main.go
internal/
git/
parse/
aggregate/
output/
5.3 The Core Question You’re Answering
“How do I integrate with external tools safely and predictably?”
5.4 Concepts You Must Understand First
- Safe command execution.
- Structured output parsing.
- Caching and performance limits.
5.5 Questions to Guide Your Design
- Which git formats are stable and safe to parse?
- How will you handle large repos?
- What should be cached and how?
5.6 Thinking Exercise
Design a parser for git log --pretty=format:"%H%x00%an%x00%ad".
5.7 The Interview Questions They’ll Ask
- “How do you avoid shell injection when calling git?”
- “Why is NUL a safer delimiter than `|`?”
- “How do you ensure output stability?”
5.8 Hints in Layers
Hint 1: Use argv arrays. Never pass user input to a shell.
Hint 2: Use NUL separators. Avoid delimiter collisions.
Hint 3: Cache results. Use the HEAD hash as the cache key.
Hint 4: Provide JSON output. Ensure deterministic ordering.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process exec | The Linux Programming Interface | process chapters |
| Text processing | The Linux Command Line | Ch. 6 |
5.10 Implementation Phases
Phase 1: Safe Exec (1 day)
Goals: run git commands safely.
Phase 2: Parsing and Metrics (2 days)
Goals: parse output and aggregate data.
Phase 3: Caching and Output (1 day)
Goals: caching and JSON output.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Delimiter | `\|` vs NUL | NUL | safe parsing. |
| Cache key | HEAD vs timestamp | HEAD | accurate invalidation. |
| Output | table vs JSON | both | human + script use. |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | parsing | NUL-separated records |
| Integration Tests | git calls | summary in sample repo |
| Edge Case Tests | empty repo | zero commits |
6.2 Critical Test Cases
- Commit with special chars does not break parser.
- Cache hit returns same output as fresh run.
- Invalid repo returns exit code 2.
6.3 Test Data
fixtures/repo/ with known history
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Using shell strings | injection risk | argv-only exec |
| Delimiter collisions | corrupted records | use NUL separators |
| No caching | slow runs | add cache layer |
7.2 Debugging Strategies
- Print raw git output in `--debug` mode.
- Use fixture repos for deterministic tests.
- Validate cache keys in logs.
7.3 Performance Traps
- Running multiple git commands per metric wastes time.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add an `--author` filter.
- Add `--files` to list changed files.
8.2 Intermediate Extensions
- Add churn heatmap by file path.
- Add timeline output.
8.3 Advanced Extensions
- Add git blame aggregation.
- Add plugin output formats.
9. Real-World Connections
9.1 Industry Applications
- Code review analytics and team metrics dashboards.
9.2 Related Open Source Projects
- `gitstats`, `gitinspector`.
9.3 Interview Relevance
- Safe command execution and parsing are common topics.
10. Resources
10.1 Essential Reading
- Git documentation on porcelain output
- The Linux Programming Interface (process execution)
10.2 Video Resources
- “Parsing Git Output” talks
10.3 Tools and Documentation
- `git log --pretty` documentation
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why argv arrays are safer than shell strings.
- I can explain NUL-separated parsing.
- I can explain cache invalidation via HEAD.
11.2 Implementation
- Git commands execute safely.
- Output is deterministic and parseable.
- Cache works and can be cleared.
11.3 Growth
- I can add new metrics without changing the parser.
- I can demo JSON and table outputs.
- I can explain performance trade-offs.
12. Submission / Completion Criteria
Minimum Viable Completion:
- `summary` command works with safe exec.
- Parsing is robust and deterministic.
Full Completion:
- Caching and JSON output implemented.
Excellence (Going Above and Beyond):
- Advanced metrics and visualizations.