Project 7: git-insight (Composability and Parsing)
Build a CLI that analyzes git activity using safe command execution and robust parsing.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 (Intermediate) |
| Time Estimate | Weekend |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust |
| Coolness Level | Level 3: Genuinely clever |
| Business Potential | 2: Internal tooling |
| Prerequisites | CLI basics, process execution, text parsing |
| Key Topics | safe exec, parsing, performance, caching |
1. Learning Objectives
By completing this project, you will:
- Execute external commands safely without shell injection.
- Parse structured git output deterministically.
- Design output modes for humans and scripts.
- Cache expensive git queries for performance.
- Build a reliable CLI around another tool.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Safe Command Execution
Fundamentals
When a CLI invokes another tool, the biggest risk is shell injection. Passing a string to a shell (sh -c) allows user input to be interpreted as shell syntax, which can execute unintended commands. Safe command execution means invoking the command directly with an argument array. This avoids shell parsing entirely. It also makes behavior consistent across platforms.
Deep Dive into the concept
Safe execution starts with argument separation. Instead of building a command string like git log --pretty=format:"%H|%an", you should pass git, log, and --pretty=format:%H%x00%an as separate arguments. This ensures that any user-provided values (like a repo path or branch name) are treated as data, not syntax. It also avoids quoting problems. The shell has its own rules for quotes, globbing, and escape characters. If you bypass the shell, those rules do not apply, and you get deterministic behavior.
You also need to control the working directory. git commands are sensitive to the repo root. Use --git-dir (and --work-tree) or set the working directory of the child process to the repo path. This is safer than trying to change directories inside a shell command, and it keeps path handling unambiguous even when paths contain spaces or shell metacharacters.
Another aspect is error handling. External commands can fail for many reasons: repo not found, not a git repo, permission issues, or invalid args. You must capture stderr and exit codes and surface a clean error message to the user. Do not dump raw stderr unless --debug is enabled. When you do show raw stderr, redact or sanitize any paths or secrets if necessary.
Finally, performance matters. Running git log repeatedly can be expensive on large repos. Use caching where possible or combine queries to minimize calls. For example, get commit lists once and compute multiple metrics from the same output. This is safer and faster than calling git for each metric separately.
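Putting these pieces together, a minimal Go sketch of the runner might look like the following. The `runGit` helper is an illustrative name, not a prescribed API, and the error handling is simplified; a real tool would map failures onto the exit codes defined later in the spec.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runGit executes git with an argv array -- no shell is involved, so
// user-supplied values (repo paths, branch names) are always data, not syntax.
func runGit(repoPath string, args ...string) (string, error) {
	cmd := exec.Command("git", args...) // argv array; never sh -c
	cmd.Dir = repoPath                  // repo context via working directory

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// Surface a clean message; show raw stderr only under --debug.
			return "", fmt.Errorf("git exited with code %d", exitErr.ExitCode())
		}
		return "", fmt.Errorf("failed to run git: %w", err) // e.g. git not installed
	}
	return stdout.String(), nil
}

func main() {
	out, err := runGit(".", "log", "-1", "--pretty=format:%H%x00%an")
	if err != nil {
		fmt.Fprintln(os.Stderr, "git-insight:", err)
		return // a real CLI would os.Exit with a documented code here
	}
	fmt.Println(out)
}
```

Because `exec.Command` receives each argument separately, a repo path containing spaces or shell metacharacters needs no quoting at all.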
How this fits into the project
This concept defines how git-insight executes git safely and how it handles errors and performance concerns.
Definitions & key terms
- Shell injection: When user input is interpreted as shell commands.
- argv array: List of arguments passed directly to a process.
- Working directory: Directory where the command runs.
- Exit code propagation: Returning child process exit codes.
Mental model diagram (ASCII)
User input -> argv array -> exec git -> parse output
How it works (step-by-step)
- Validate user input (repo path, flags).
- Build argv array with no shell interpolation.
- Execute git with working directory set.
- Capture stdout/stderr and exit code.
- Handle errors with a clean message.
Minimal concrete example
$ git-insight churn --repo ./myrepo
# internally executes: git log --numstat --pretty=format:%H%x00%an with argv array
Common misconceptions
- “Using sh -c is fine.” -> It exposes injection and quoting bugs.
- “Exit codes can be ignored.” -> They are critical for automation.
- “git output is always stable.” -> Only explicit format flags produce stable output; default human output can change between versions.
Check-your-understanding questions
- Why should you avoid shell commands when executing git?
- How do you set the repo context safely?
- Why is exit code propagation important?
Check-your-understanding answers
- It prevents injection and quoting issues.
- Set the working directory or use `--git-dir` and `--work-tree`.
- Scripts depend on accurate exit codes.
Real-world applications
- Many internal tools wrap git for reporting and dashboards.
- CI systems use git safely for diff and change detection.
Where you will apply it
- See §3.2 Functional Requirements and §4.1 High-Level Design.
- Also used in: Project 5: env-vault and Project 8: api-forge.
References
- The Linux Programming Interface, process execution chapters
- Git documentation on porcelain vs plumbing
Key insights
Safe execution is about never letting user input become shell syntax.
Summary
Direct exec with argv arrays keeps your CLI safe and predictable.
Homework/Exercises to practice the concept
- Write a wrapper that runs `git status` without using the shell.
- Add a `--repo` flag that sets the working directory.
- Test with a repo path containing spaces.
Solutions to the homework/exercises
- Use `exec.Command("git", "status")`.
- Set `cmd.Dir` or use `--git-dir`.
- Spaces work when you avoid the shell.
2.2 Parsing Structured Output
Fundamentals
Parsing git output is reliable only if you use stable formats. Git distinguishes between porcelain commands (designed for humans) and plumbing commands (designed for scripts); default human-oriented output can change between versions, so scripts should never parse it. For a CLI, request explicit machine-readable formats such as `--pretty=format:` or `--numstat` with explicit separators. This produces deterministic output that can be parsed safely.
Deep Dive into the concept
The main challenge in parsing is delimiter collisions. If you use | as a delimiter, commit messages or author names might include | and break your parser. The safe approach is to use NUL (\x00) as a separator. Git allows you to include \x00 in --pretty=format strings. This means you can generate output like hash\0author\0date\0 and then split on NUL. This is robust because NUL rarely appears in git metadata.
Another parsing issue is line endings. Git outputs lines separated by newlines, but some fields may contain newlines if not sanitized. For example, commit messages can be multi-line. To avoid this, you should use %s for the subject line only, not %B for the full body. You can also use --no-merges or --all flags to control the scope.
For file churn, git log --numstat outputs lines with added and removed counts per file. This output includes blank lines between commits. You must handle these separators. The recommended approach is to parse the output as a stream: read commit headers, then accumulate file stats until the next header. This is more robust than splitting everything into memory.
If you output JSON, make sure to escape strings properly and preserve unicode. JSON output should be stable and schema-defined. This is important for downstream tools like jq. It is also useful for caching and test fixtures. Deterministic ordering is key: sort authors by commit count, sort files by churn, and use stable tie-breaking rules.
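The parsing strategy above can be sketched in Go, assuming the runner produced one commit per line with NUL between fields (the shape that `--pretty=format:%H%x00%an%x00%ad` yields). `parseLog` is an illustrative helper name, not a fixed API.

```go
package main

import (
	"fmt"
	"strings"
)

// CommitRecord mirrors the fields requested via --pretty=format:%H%x00%an%x00%ad.
type CommitRecord struct {
	Hash   string
	Author string
	Date   string
}

// parseLog splits raw git log output into records: one commit per line,
// fields separated by NUL, so author names containing '|' or quotes
// cannot break the parser.
func parseLog(raw string) []CommitRecord {
	var records []CommitRecord
	for _, line := range strings.Split(strings.TrimRight(raw, "\n"), "\n") {
		fields := strings.Split(line, "\x00")
		if len(fields) != 3 {
			continue // skip malformed lines rather than corrupt the data set
		}
		records = append(records, CommitRecord{Hash: fields[0], Author: fields[1], Date: fields[2]})
	}
	return records
}

func main() {
	// An author name containing '|' parses cleanly because NUL is the delimiter.
	raw := "abc123\x00alice|smith\x002026-01-01\ndef456\x00bob\x002026-01-02\n"
	for _, r := range parseLog(raw) {
		fmt.Printf("%s %s %s\n", r.Hash, r.Author, r.Date)
	}
}
```

Note that this sketch works because `%H`, `%an`, and `%ad` cannot contain newlines; a format that included `%B` (the full commit body) would need NUL-terminated records instead of newline-terminated ones.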
How this fits into the project
This concept defines the parsing strategy for git-insight and ensures that it can handle real repositories without breaking.
Definitions & key terms
- Porcelain output: Human-friendly output with stable flags.
- Plumbing output: Low-level commands for scripts.
- Delimiter collision: When data contains your separator.
- NUL separator: `\x00`, a safe delimiter because it rarely appears in text.
Mental model diagram (ASCII)
raw git output -> split on NUL -> records -> aggregate -> report
How it works (step-by-step)
- Run git with a stable `--pretty=format` string.
- Read stdout as bytes and split on NUL.
- Assemble records into structs.
- Aggregate counts (commits, churn).
- Render output in table or JSON.
Minimal concrete example
$ git log --pretty=format:"%H%x00%an%x00%ad" --date=short
Common misconceptions
- “Human output is good enough.” -> It changes and is hard to parse.
- “Newlines are safe delimiters.” -> Commit messages can include newlines.
- “Sorting can be arbitrary.” -> Scripts need stable ordering.
Check-your-understanding questions
- Why use NUL separators?
- Why avoid parsing full commit messages?
- How do you ensure deterministic ordering in output?
Check-your-understanding answers
- To avoid delimiter collisions in normal text.
- Full messages can include newlines and break parsing.
- Sort by primary metric and tie-break by name.
Real-world applications
- Git tooling and CI scripts parse porcelain output.
- `git log --pretty` is widely used for automation.
Where you will apply it
- See §3.5 Data Formats and §4.4 Algorithm Overview.
- Also used in: Project 1: minigrep-plus and Project 8: api-forge.
References
- Git documentation on `--pretty=format` (`git log` manual)
Key insights
Deterministic parsing requires stable formats and safe delimiters.
Summary
Use NUL-separated porcelain output, avoid multi-line fields, and enforce stable ordering.
Homework/Exercises to practice the concept
- Generate NUL-separated output from git log.
- Write a parser that counts commits per author.
- Add deterministic sorting to the output.
Solutions to the homework/exercises
- Use `--pretty=format:"%H%x00%an"`.
- Split on NUL and count authors.
- Sort by count descending, then name ascending.
2.3 Performance and Caching
Fundamentals
Git queries can be expensive on large repos. Running multiple git commands for each metric can make your CLI slow. Performance strategies include caching, combining queries, and limiting scope. Caching means storing computed results for a short period or until the repo changes. This improves responsiveness for repeated queries.
Deep Dive into the concept
To cache effectively, you need a cache key that reflects the repository state. The simplest key is the HEAD commit hash plus the command parameters. If HEAD changes, the cache is invalid. If parameters change (date range, author filter), the cache is different. The cache can be stored in a file under XDG cache directories. For example, ~/.cache/git-insight/ can store JSON files keyed by a hash of inputs.
Combining queries reduces overhead. Instead of running separate commands for commits, authors, and churn, run a single git log that provides all required fields. Then compute multiple metrics from that output. This is more efficient and also ensures consistent results because all metrics are derived from the same snapshot. Be careful with memory: if the repo is huge, streaming parse is better than loading everything into memory.
Scope limiting is also important. Users rarely need “all history” for quick insights. Provide flags like --since 30d or --max-commits 1000 and default to a reasonable range. This prevents the tool from being slow by default. It also reduces the risk of timeouts in CI.
Finally, caching must be transparent. You should provide a --no-cache flag and a cache clear subcommand. You should also expose cache hits in --debug output so users can understand performance behavior. Determinism is important: cached outputs should be identical to live outputs.
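A sketch of that cache-key scheme in Go, assuming the HEAD hash has already been resolved (e.g. via `git rev-parse HEAD`). The names `cacheKey` and `cachePath` are illustrative; note that Go's `os.UserCacheDir` honors `$XDG_CACHE_HOME` on Linux, falling back to `~/.cache`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cacheKey derives a stable identifier from the repo's HEAD hash plus the
// command and its flags: if HEAD moves or the parameters change, the key
// changes and the stale entry is simply never read again.
func cacheKey(head, command string, args []string) string {
	sum := sha256.Sum256([]byte(head + "\x00" + command + "\x00" + strings.Join(args, "\x00")))
	return hex.EncodeToString(sum[:])
}

// cachePath places entries under the user cache directory,
// e.g. ~/.cache/git-insight/<key>.json on Linux (XDG spec).
func cachePath(key string) (string, error) {
	base, err := os.UserCacheDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(base, "git-insight", key+".json"), nil
}

func main() {
	k1 := cacheKey("abc123", "summary", []string{"--since=30d"})
	k2 := cacheKey("def456", "summary", []string{"--since=30d"}) // new HEAD -> new key
	fmt.Println(k1 != k2)                                        // true
	if p, err := cachePath(k1); err == nil {
		fmt.Println(p)
	}
}
```

Joining the key inputs with NUL (rather than plain concatenation) avoids ambiguous keys where different argument lists happen to concatenate to the same string.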
How this fits into the project
This concept defines performance strategies, cache design, and default query limits.
Definitions & key terms
- Cache key: Identifier derived from repo state and parameters.
- Cache invalidation: Determining when cached data is stale.
- Scope limiting: Restricting data to a time window or count.
Mental model diagram (ASCII)
inputs + HEAD -> cache key -> cache hit? -> use cache | run git
How it works (step-by-step)
- Compute cache key from repo HEAD and args.
- If cache entry exists, load it.
- If not, run git and compute metrics.
- Store results in cache.
- Return output.
Minimal concrete example
cache key = SHA256(HEAD + "summary" + "--since=30d")
Common misconceptions
- “Caching is premature optimization.” -> For git, it is a UX feature.
- “Always use full history.” -> It is slow and unnecessary for quick insights.
- “Cache can be global.” -> It must be scoped per repo.
Check-your-understanding questions
- Why is HEAD a good cache invalidation key?
- How does scope limiting improve performance?
- Why must cache be deterministic?
Check-your-understanding answers
- HEAD changes when repo state changes.
- It reduces data processed and improves speed.
- Scripts rely on consistent output.
Real-world applications
- `git status` uses internal caching for speed.
- CI systems cache git metadata between runs.
Where you will apply it
- See §3.2 Functional Requirements and §5.10 Implementation Phases.
- Also used in: Project 10: distro-flow and Project 8: api-forge.
References
- Git performance tuning guides
- XDG Cache Directory spec
Key insights
Caching is a UX tool: faster feedback encourages use.
Summary
Combine queries, limit scope, and cache results keyed by repo state for predictable performance.
Homework/Exercises to practice the concept
- Implement a cache that stores results per repo.
- Add a `--no-cache` flag.
- Compare runtimes with and without the cache.
Solutions to the homework/exercises
- Use XDG cache directory and a hashed key.
- Bypass cache when flag is set.
- Measure with `time`.
3. Project Specification
3.1 What You Will Build
A CLI named git-insight that reports repository activity such as commit counts, top authors, and file churn. It executes git commands safely, parses output deterministically, and provides table and JSON output modes. It caches results based on repo state.
3.2 Functional Requirements
- Commands: `summary`, `churn`, `authors`.
- Safe execution: no shell usage, argv arrays only.
- Parsing: NUL-separated output parsing.
- Output modes: table and JSON.
- Caching: cache results per repo and args.
- Scope flags: `--since`, `--until`, `--max-commits`.
3.3 Non-Functional Requirements
- Performance: default summary in under 1s on medium repos.
- Reliability: stable output schema.
- Usability: clear errors for invalid repos.
3.4 Example Usage / Output
$ git-insight summary --since 30d
Repo: my-app
Commits (30d): 142
Top authors:
alice 48
bob 31
3.5 Data Formats / Schemas / Protocols
{"repo":"my-app","commits":142,"top_authors":[{"name":"alice","count":48}]}
3.6 Edge Cases
- Repo path invalid -> exit code 2.
- Git not installed -> exit code 4.
- Empty repo -> output zero counts.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
# Build
go build -o git-insight ./cmd/git-insight
# Run in repo
./git-insight summary --since 30d
3.7.2 Golden Path Demo (Deterministic)
$ GIT_INSIGHT_NOW=2026-01-01T00:00:00Z ./git-insight summary --since 30d --json
{"repo":"my-app","commits":142,"top_authors":[{"name":"alice","count":48}]}
$ echo $?
0
3.7.3 Failure Demo (Deterministic)
$ ./git-insight summary --repo /not/a/repo
git-insight: not a git repository
$ echo $?
2
3.7.4 Exit Codes
- 0: Success.
- 2: Invalid repo or not a git repo.
- 4: Git executable not found.
4. Solution Architecture
4.1 High-Level Design
+------------------+
| CLI Parser |
+------------------+
|
v
+------------------+ +------------------+
| Git Runner | --> | Output Parser |
+------------------+ +------------------+
| |
v v
+------------------+ +------------------+
| Aggregator | --> | Formatter |
+------------------+ +------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Git runner | execute git safely | argv-only exec |
| Parser | parse output | NUL-separated formats |
| Aggregator | compute metrics | single-pass streaming |
| Formatter | table/JSON output | stable schemas |
4.3 Data Structures (No Full Code)
type CommitRecord struct {
Hash string
Author string
Date string
}
4.4 Algorithm Overview
Key Algorithm: Single-Pass Aggregation
- Execute git log with NUL separators.
- Stream parse records.
- Update counters and maps.
- Sort and render output.
Complexity Analysis:
- Time: O(N) records.
- Space: O(A) authors + O(F) files.
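The algorithm above can be sketched as one pass over parsed records followed by a deterministic sort; `countAuthors` and `AuthorCount` are illustrative names.

```go
package main

import (
	"fmt"
	"sort"
)

// AuthorCount pairs an author with their commit count.
type AuthorCount struct {
	Name  string
	Count int
}

// countAuthors performs the single-pass aggregation: one map update per
// record (O(N) time, O(A) space), then a deterministic sort -- count
// descending, name ascending as the tie-break -- so repeated runs and
// cached runs produce identical output.
func countAuthors(authors []string) []AuthorCount {
	counts := map[string]int{}
	for _, a := range authors {
		counts[a]++
	}
	out := make([]AuthorCount, 0, len(counts))
	for name, c := range counts {
		out = append(out, AuthorCount{name, c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Name < out[j].Name // stable tie-break
	})
	return out
}

func main() {
	top := countAuthors([]string{"bob", "alice", "alice", "carol", "bob", "alice"})
	for _, a := range top {
		fmt.Printf("%-8s %d\n", a.Name, a.Count)
	}
}
```

Without the explicit sort, Go's map iteration order is randomized, which is exactly the kind of nondeterminism that breaks scripted consumers.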
5. Implementation Guide
5.1 Development Environment Setup
mkdir git-insight && cd git-insight
5.2 Project Structure
cmd/git-insight/
main.go
internal/
git/
parse/
aggregate/
output/
5.3 The Core Question You’re Answering
“How do I integrate with external tools safely and predictably?”
5.4 Concepts You Must Understand First
- Safe command execution.
- Structured output parsing.
- Caching and performance limits.
5.5 Questions to Guide Your Design
- Which git formats are stable and safe to parse?
- How will you handle large repos?
- What should be cached and how?
5.6 Thinking Exercise
Design a parser for git log --pretty=format:"%H%x00%an%x00%ad".
5.7 The Interview Questions They’ll Ask
- “How do you avoid shell injection when calling git?”
- “Why is NUL a safer delimiter than `|`?”
- “How do you ensure output stability?”
5.8 Hints in Layers
Hint 1: Use argv arrays. Never pass user input to a shell.
Hint 2: Use NUL separators. Avoid delimiter collisions.
Hint 3: Cache results. Use the HEAD hash as the cache key.
Hint 4: Provide JSON output. Ensure deterministic ordering.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Process exec | The Linux Programming Interface | process chapters |
| Text processing | The Linux Command Line | Ch. 6 |
5.10 Implementation Phases
Phase 1: Safe Exec (1 day)
Goals: run git commands safely.
Phase 2: Parsing and Metrics (2 days)
Goals: parse output and aggregate data.
Phase 3: Caching and Output (1 day)
Goals: caching and JSON output.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Delimiter | `\|` vs NUL | NUL | safe parsing. |
| Cache key | HEAD vs timestamp | HEAD | accurate invalidation. |
| Output | table vs JSON | both | human + script use. |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | parsing | NUL-separated records |
| Integration Tests | git calls | summary in sample repo |
| Edge Case Tests | empty repo | zero commits |
6.2 Critical Test Cases
- Commit with special chars does not break parser.
- Cache hit returns same output as fresh run.
- Invalid repo returns exit code 2.
6.3 Test Data
fixtures/repo/ with known history
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Using shell strings | injection risk | argv-only exec |
| Delimiter collisions | corrupted records | use NUL separators |
| No caching | slow runs | add cache layer |
7.2 Debugging Strategies
- Print raw git output in `--debug` mode.
- Use fixture repos for deterministic tests.
- Validate cache keys in logs.
7.3 Performance Traps
- Running multiple git commands per metric wastes time.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add an `--author` filter.
- Add `--files` to list changed files.
8.2 Intermediate Extensions
- Add churn heatmap by file path.
- Add timeline output.
8.3 Advanced Extensions
- Add git blame aggregation.
- Add plugin output formats.
9. Real-World Connections
9.1 Industry Applications
- Code review analytics and team metrics dashboards.
9.2 Related Open Source Projects
- `gitstats`, `gitinspector`.
9.3 Interview Relevance
- Safe command execution and parsing are common topics.
10. Resources
10.1 Essential Reading
- Git documentation on porcelain output
- The Linux Programming Interface (process execution)
10.2 Video Resources
- “Parsing Git Output” talks
10.3 Tools and Documentation
- `git log --pretty` documentation
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why argv arrays are safer than shell strings.
- I can explain NUL-separated parsing.
- I can explain cache invalidation via HEAD.
11.2 Implementation
- Git commands execute safely.
- Output is deterministic and parseable.
- Cache works and can be cleared.
11.3 Growth
- I can add new metrics without changing the parser.
- I can demo JSON and table outputs.
- I can explain performance trade-offs.
12. Submission / Completion Criteria
Minimum Viable Completion:
- `summary` command works with safe exec.
- Parsing is robust and deterministic.
Full Completion:
- Caching and JSON output implemented.
Excellence (Going Above and Beyond):
- Advanced metrics and visualizations.