Project 4: The Multi-Line Address Parser
Build a sed tool that parses multi-line blocks using the hold space to extract structured sections.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Advanced |
| Time Estimate | 15-25 hours |
| Main Programming Language | sed (script file) |
| Alternative Programming Languages | awk, Python, Perl |
| Coolness Level | Level 5: Wizardry |
| Business Potential | 2: Internal Automation |
| Prerequisites | Prior sed projects, strong regex, control flow basics |
| Key Topics | hold space, N, H, G, ranges, branching |
1. Learning Objectives
By completing this project, you will:
- Use the hold space to accumulate and process multi-line blocks.
- Apply address ranges to select sections of a file.
- Build a block parser that extracts or rewrites multi-line records.
- Debug and reason about sed state across multiple cycles.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Pattern Space, Hold Space, and Multi-Line Buffers
Fundamentals
sed is line-oriented by default, but the hold space allows you to carry state across lines. The pattern space holds the current line; the hold space is a persistent buffer. Commands like h, H, g, G, and x move data between them. The N command appends the next line to the pattern space, creating a multi-line buffer. This is the foundation for parsing blocks of text that span multiple lines, such as paragraphs, records, or sections.
The key mental shift is that multi-line buffers turn sed into a block processor. You must track what is in the pattern space at each step, especially when embedded newlines are present.
Deep Dive into the Concept
Multi-line editing is the point where sed stops behaving like a simple line filter and starts behaving like a small virtual machine. The pattern space is still the active buffer, but it can now contain embedded newlines if you use N. Once you have multi-line pattern spaces, anchors like ^ and $ apply to the beginning and end of the whole pattern space, not each line. This changes how you write regex: you may need to match \n explicitly or use patterns like \n[^\n]* to target specific lines.
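A minimal sketch of the anchor behavior described above, using a two-line input invented for illustration. The variable names `whole` and `second` are placeholders, not part of the project.

```shell
# After N the pattern space is "alpha\nbeta"; ^ and $ anchor to that whole buffer.
whole=$(printf 'alpha\nbeta\n' | sed 'N; s/^/[/; s/$/]/')
printf '%s\n' "$whole"

# To reach the second line, match the embedded newline explicitly.
second=$(printf 'alpha\nbeta\n' | sed 'N; s/.*\n//')
printf '%s\n' "$second"
```

The first command brackets the two-line buffer as a whole rather than each line; the second removes everything up to and including the embedded newline, leaving only the second line.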
The hold space provides persistent state. h overwrites the hold space with the pattern space, H appends to it with a newline, g replaces the pattern space with the hold space, G appends hold space to pattern space, and x swaps them. These commands let you accumulate blocks over multiple cycles. For example, to capture a paragraph separated by blank lines, you can append each nonblank line to the hold space with H, and when you hit a blank line, you swap the hold space into the pattern space and process the full block.
One subtlety is that H always inserts a leading newline before appending. This can produce an empty first line if the hold space is empty. Many scripts handle this by special-casing the first line or trimming the leading newline with s/^\n//. Another subtlety is that the hold space persists across cycles, so if you do not explicitly clear it (h with an empty pattern space or x; s/.*//; x), you may accidentally mix blocks. The safest pattern is: clear hold space at block start, append lines, then consume and reset.
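The leading newline is easy to observe directly. The sketch below accumulates two sample lines with H and prints the hold space at the last line, first untouched and then with the s/^\n// trim; the variable names are illustrative only.

```shell
# H on an empty hold space stores "\na", so the final buffer starts with a newline.
raw=$(printf 'a\nb\n' | sed -n 'H; ${x; p;}')
printf '%s\n' "$raw"

# Same accumulation, trimming the leading newline before printing.
trimmed=$(printf 'a\nb\n' | sed -n 'H; ${x; s/^\n//; p;}')
printf '%s\n' "$trimmed"
```

The first output begins with an empty line; the second does not.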
Multi-line processing has memory implications. Because the hold space can grow, scripts that accumulate large blocks can hit implementation limits. POSIX requires at least 8192 bytes for pattern and hold space; GNU sed allows more, but you should still design scripts that do not grow without bound. This project uses bounded blocks (paragraphs or sections) to keep memory predictable.
Another important detail is how d interacts with multi-line buffers. If you call d while the pattern space holds multiple lines, you discard the entire buffer and move to the next cycle, which can skip lines you intended to keep. This is why multi-line scripts often avoid d except at clearly defined boundaries. When you do need to discard, consider whether you should flush or reset the hold space first to avoid leaking state into the next block.
Multi-line scripts also benefit from explicit end-of-file handling. If your file ends without a blank line, your accumulator may never flush. A common pattern is to add a rule for the $ address that triggers the same output logic as a blank line. This keeps behavior deterministic and avoids losing the last record.
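One way to get a flush at end of file is to let the last line fall through to the same code path as a blank line. The following is a sketch under that assumption, not the project solution: it joins each block with " | " purely to make the block boundaries visible, and the sample input deliberately has consecutive blank lines and no trailing blank line.

```shell
blocks=$(printf 'a\nb\n\n\nc\nd\n' | sed -n '
# Non-blank line: append to the hold space; end the cycle unless this is the last line.
/./{
  H
  $!d
}
# Reached on a blank line or on the last line: fetch the accumulated block.
x
s/^\n//
s/\n/ | /g
# Print only non-empty blocks, so consecutive blank lines emit nothing.
/./p
# Empty the buffer and swap it back so the hold space starts clean.
s/.*//
x
')
printf '%s\n' "$blocks"
```

Because the flush and the reset share one code path, the final block is handled the same way as every other block.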
A related multi-line command is D, which deletes up to the first newline in the pattern space and restarts the cycle without reading a new line. This can be useful when you want to slide through a multi-line buffer one line at a time. It is advanced, but knowing it exists helps you reason about multi-line control flow.
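A widely used idiom illustrates D: keep a two-line window in the pattern space, print the first line with P only when it differs from the second, then let D drop the first line and restart the cycle. The sketch below uses it to collapse consecutive duplicate lines; the input is invented for illustration.

```shell
# $!N  : append the next line unless we are already on the last one
# /../!P : print the first line of the window unless both lines are identical
# D    : delete through the first newline and restart without reading new input
uniq_lines=$(printf 'a\na\nb\nb\nb\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$uniq_lines"
```

The point of the example is the control flow: D never lets the cycle end normally while a newline remains, so the window slides forward one line at a time.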
How This Fits on Projects
This project is all about hold space and multi-line buffers. You will accumulate blocks delimited by blank lines and then extract fields from each block. This skill is foundational for Project 5’s file reversal pattern.
Definitions & Key Terms
- Pattern space: Current working buffer.
- Hold space: Persistent buffer across cycles.
- N command: Append next line to pattern space.
- H command: Append pattern space to hold space with newline.
- G command: Append hold space to pattern space.
- x command: Swap hold and pattern spaces.
Mental Model Diagram (ASCII)
Line -> pattern space
| h/H
v
hold space (persists)
^ g/G/x
How It Works (Step-by-Step)
- Read a line into the pattern space.
- If line is part of a block, append to hold space (H).
- On block boundary, bring the hold space into the pattern space (g to copy it, or x to swap).
- Process the full block with regex and substitution.
- Clear hold space before starting the next block.
Minimal Concrete Example
# Join two lines and replace newline with a space
sed 'N; s/\n/ /' file.txt
Common Misconceptions
- Misconception: Hold space resets each line.
- Correction: Hold space persists until explicitly changed.
- Misconception: N acts like n.
- Correction: N appends; n replaces pattern space with next line.
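The N versus n distinction is easiest to see side by side on a four-line input. This is a small illustrative sketch; the variable names are arbitrary.

```shell
# n replaces the pattern space with the next line, so only lines 2 and 4 are printed.
with_n=$(printf '1\n2\n3\n4\n' | sed -n 'n; p')
printf '%s\n' "$with_n"

# N appends the next line, so each pair of lines is available together.
with_N=$(printf '1\n2\n3\n4\n' | sed -n '$!N; s/\n/+/; p')
printf '%s\n' "$with_N"
```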
Check-Your-Understanding Questions
- What is the difference between H and h?
- Why can H create a leading blank line?
- What does x do?
Check-Your-Understanding Answers
- H appends to hold space with a newline, h overwrites it.
- It inserts a newline before appending, even if hold space is empty.
- It swaps the contents of pattern space and hold space.
Real-World Applications
- Parsing paragraph blocks in text files.
- Extracting multi-line records from reports.
- Building advanced text transformations in CI pipelines.
Where You’ll Apply It
- In this project: See §3.1 What You Will Build and §5.10 Implementation Phases.
- Also used in: P05-reversing-a-file-line-by-line.md.
References
- GNU sed manual – hold space commands
- “sed & awk” – advanced sed techniques
Key Insight
Hold space lets you turn line-oriented sed into a block-oriented parser.
Summary
Multi-line buffers are the gateway to advanced sed scripting.
Homework/Exercises to Practice the Concept
- Append each line to the hold space and print the hold space on blank lines.
- Join each pair of lines into a single line.
- Swap pattern and hold space and observe the effect.
Solutions to the Homework/Exercises
- sed -n '/^$/{x; s/^\n//; p; s/.*//; x; d; }; H' file.txt
- sed '$!N; s/\n/ /' file.txt
- sed -n '1{h;d;}; x; p' file.txt (prints each line one cycle late; the last line stays in the hold space)
2.2 Range Addressing, Branching, and Stateful Parsing
Fundamentals
Range addresses like /START/,/END/ let you select a block of lines. These ranges are stateful: once the start matches, every line is selected until the end matches. Branching with labels (:label, b label) lets you control flow and skip commands. Together, ranges and branching let you build a parser that recognizes blocks and treats them as units, even though sed processes line by line.
Range addressing is stateful, which means your script has implicit memory even without variables. Understanding when that state resets is critical for correctness.
When you rely on ranges, you must remember that the range is inclusive. If you need exclusive behavior, you should explicitly delete or skip the boundary lines after they are matched.
Deep Dive into the Concept
Stateful parsing in sed means your script remembers that it is “inside” a block. Range addresses provide this state implicitly: when /START/ matches, the range becomes active; when /END/ matches, it deactivates. This is perfect for extracting sections like:
BEGIN
line1
line2
END
You can use /^BEGIN$/,/^END$/ to select the block. Inside that range, you can accumulate lines into the hold space or perform targeted substitutions. The important detail is that both the start and end lines are included in the range. If you need to exclude them, you must explicitly delete or skip them after the selection.
Branching adds explicit control. A label :skip and a branch b skip let you jump over commands, which is useful when you want to treat different block types differently. For example, if you detect a header line, you might branch to a label that initializes the hold space; when you reach a footer line, you might branch to output logic. This is a form of control flow inside a single sed script.
Combining ranges with hold space yields powerful patterns: detect START, clear hold space, append lines until END, then output the accumulated block. One tricky edge case is nested blocks. Standard sed ranges do not support nesting; the first END closes the range. If your data can contain nested blocks, you need a more complex state machine, which is beyond this project. Documenting this limitation is part of designing a reliable tool.
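The detect, clear, append, output pattern can be sketched as follows, assuming BEGIN and END markers on their own lines and no nesting. Joining the block with commas is only there to show that the lines were collected into one record.

```shell
records=$(printf 'noise\nBEGIN\nline1\nline2\nEND\nnoise\nBEGIN\nx\nEND\n' | sed -n '
/^BEGIN$/,/^END$/{
  /^BEGIN$/{
    # Start of block: empty the pattern space and overwrite the hold space with it.
    s/.*//
    h
    d
  }
  /^END$/{
    # End of block: fetch the accumulated lines and print them as one record.
    g
    s/^\n//
    s/\n/,/g
    p
    d
  }
  # Any other line inside the range is appended to the hold space.
  H
}
')
printf '%s\n' "$records"
```

Lines outside the range are never printed because of -n, and the boundary lines are consumed by the two inner blocks, which gives the exclusive behavior discussed earlier.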
Finally, remember that sed does not have explicit variables. The hold space is your only state store, and branching is your only control flow. That means your script should be small and predictable. When debugging, insert temporary p commands or use -n and explicit printing to understand which branch is taken.
Branching can also be tied to substitution success using the t and T commands (branch on successful or failed substitution). T is a GNU extension; POSIX sed provides only t. This is a powerful way to build state machines in sed. For instance, you can attempt to match a header line; if the substitution succeeds, branch to a label that initializes accumulation. If it fails, fall through to normal line handling. This pattern keeps your logic declarative while still allowing conditional behavior.
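As a small sketch of branching on substitution success, the script below tries to strip a Name: label and branches to a label only when that substitution succeeded. It uses only t, so it does not depend on the GNU-only T; the label name and the name= output prefix are made up for the example.

```shell
names=$(printf 'Name: Alice\nEmail: a@x.com\nName: Bob\n' | sed -n '
# Try to strip the label; t branches only if the substitution succeeded.
s/^Name: //
t name
# Fall through: any other line is ignored by branching to the end of the script.
b
:name
s/^/name=/
p
')
printf '%s\n' "$names"
```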
You can simulate simple state by storing a marker in the hold space. For example, set the hold space to IN_BLOCK when you see a header, and clear it on footer. Then use x and pattern matching to decide whether to append or output. This approach is fragile but can be sufficient for small, well-defined formats.
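A sketch of the marker approach, assuming BEGIN and END as the header and footer lines. It reproduces what a range address would do, but makes the state explicit by keeping IN_BLOCK in the hold space and peeking at it with x.

```shell
inside=$(printf 'skip\nBEGIN\nkeep1\nkeep2\nEND\nskip2\n' | sed -n '
/^BEGIN$/{
  # Header: store the marker in the hold space.
  s/.*/IN_BLOCK/
  h
  d
}
/^END$/{
  # Footer: clear the marker.
  s/.*//
  h
  d
}
# Any other line: swap to inspect the marker.
x
/^IN_BLOCK$/{
  # Marker set: swap back to restore the line, then print it.
  x
  p
  d
}
# Marker not set: swap back and let the line be dropped.
x
')
printf '%s\n' "$inside"
```

The cost of this design is that the hold space is no longer free for accumulation, which is one reason the approach is fragile.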
Another edge case is overlapping ranges. If your data has a START inside an active range before an END, the range stays active; it does not nest. This can lead to surprising behavior, so it is often safer to enforce non-overlapping blocks by validating input format or by using unique markers.
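The non-nesting behavior is easy to confirm with a deliberately overlapping input. In this sketch the first END closes the range, so the line c and the second END fall outside it.

```shell
nested=$(printf 'START\na\nSTART\nb\nEND\nc\nEND\nd\n' | sed -n '/^START$/,/^END$/p')
printf '%s\n' "$nested"
```

The output stops at the first END even though a second START appeared inside the block.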
You can also combine ranges with line-number addresses to enforce bounds, such as only parsing blocks in the first N lines. This is useful for large files when the records you need sit near the top, for example in a report written newest-first.
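A minimal sketch of bounding a regex range with a line-number range, assuming the limit of three lines happens to end exactly on a block boundary.

```shell
# Only lines 1-3 are offered to the inner range, so the second block is never seen.
bounded=$(printf 'BEGIN\na\nEND\nBEGIN\nb\nEND\n' | sed -n '1,3{/^BEGIN$/,/^END$/p;}')
printf '%s\n' "$bounded"
```

If the line limit falls in the middle of a block, that block is cut off at the limit, so choose the bound with the record layout in mind.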
How This Fits on Projects
Range addressing defines the blocks you will parse in this project. Branching provides the control flow to accumulate and then output each block. These are the same tools used in Project 5, but with a different goal.
Definitions & Key Terms
- Range address: A start and end selector that defines a block.
- Label: A named point in a sed script (:label).
- Branch: A jump to a label (b label).
- State: The implicit “inside block” condition maintained by ranges.
Mental Model Diagram (ASCII)
START line -> range active -> accumulate -> END line -> output block
How It Works (Step-by-Step)
- Detect start of block with a regex address.
- Clear hold space and begin accumulation.
- Append each line to hold space while range is active.
- On end line, swap hold space to pattern space and output.
- Reset hold space and continue.
Minimal Concrete Example
# Print only lines between BEGIN and END
sed -n '/^BEGIN$/,/^END$/p' file.txt
Common Misconceptions
- Misconception: Ranges are stateless.
- Correction: Ranges stay active after the start matches.
- Misconception: sed supports nested ranges.
- Correction: Nested ranges are not supported by default.
Check-Your-Understanding Questions
- Are start and end lines included in ranges?
- What does b label do?
- Why is nested block parsing hard in sed?
Check-Your-Understanding Answers
- Yes, both are included.
- It jumps to the label, skipping intermediate commands.
- sed has no stack; ranges close on the first end match.
Real-World Applications
- Extracting sections from config files.
- Parsing reports with headers and footers.
- Processing multi-line records in audit logs.
Where You’ll Apply It
- In this project: See §3.2 Functional Requirements and §6 Testing Strategy.
- Also used in: P02-log-file-cleaner.md.
References
- GNU sed manual – addresses and branching
- “sed & awk” – range patterns
Key Insight
Ranges provide implicit state; branching lets you exploit it.
Summary
Address ranges and branching turn line-by-line processing into block parsing.
Homework/Exercises to Practice the Concept
- Print only the lines between START and END markers.
- Exclude the START and END lines from output.
- Use a branch to skip lines that start with #.
Solutions to the Homework/Exercises
- sed -n '/^START$/,/^END$/p' file.txt
- sed -n '/^START$/,/^END$/{/^START$/d; /^END$/d; p; }' file.txt
- sed -n '/^#/b; p' file.txt
3. Project Specification
3.1 What You Will Build
You will build a sed script named block-parse.sed that extracts multi-line records from a file separated by blank lines. Each record is turned into a single line with fields separated by tabs. Example: an address block with Name, Email, and Phone lines becomes a TSV row. The script operates deterministically and rejects incomplete records with an error instead of emitting partial rows.
3.2 Functional Requirements
- Detect block boundaries: blank lines separate records.
- Accumulate block lines: use hold space to build the record.
- Extract fields: map lines like Name: ... to TSV columns.
- Output TSV: emit a header and one row per complete record.
- Exit codes: return a non-zero exit code for missing input or malformed blocks.
3.3 Non-Functional Requirements
- Performance: Process 50k lines in under 2 seconds.
- Reliability: Do not mix data across blocks.
- Usability: Clear documentation of expected input format.
3.4 Example Usage / Output
$ sed -f block-parse.sed input.txt > output.tsv
3.5 Data Formats / Schemas / Protocols
Input:
Name: Alice Example
Email: alice@example.com
Phone: 555-0101

Name: Bob Example
Email: bob@example.com
Phone: 555-0102
Output:
name email phone
Alice Example alice@example.com 555-0101
Bob Example bob@example.com 555-0102
3.6 Edge Cases
- Missing Email or Phone lines in a block.
- Extra fields in a block.
- Consecutive blank lines.
- File that ends without a trailing blank line.
3.7 Real World Outcome
You will have a deterministic parser that turns multi-line records into a normalized TSV.
3.7.1 How to Run (Copy/Paste)
cat > input.txt <<'EOF'
Name: Alice Example
Email: alice@example.com
Phone: 555-0101

Name: Bob Example
Email: bob@example.com
Phone: 555-0102
EOF
sed -f block-parse.sed input.txt > output.tsv
3.7.2 Golden Path Demo (Deterministic)
Output includes exactly two rows with the expected fields.
3.7.3 CLI Transcript (Success)
$ sed -f block-parse.sed input.txt > output.tsv
$ echo $?
0
$ cat output.tsv
name email phone
Alice Example alice@example.com 555-0101
Bob Example bob@example.com 555-0102
3.7.4 CLI Transcript (Failure: Malformed Block)
$ printf 'Name: Solo\n\n' > bad.txt
$ sed -f block-parse.sed bad.txt
Error: incomplete record at line 1
$ echo $?
2
Exit codes:
- 0: success
- 1: usage error
- 2: malformed record
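POSIX sed's q command takes no exit-code argument, so producing status 2 from the script itself implies GNU sed or a wrapper. The following is a rough sketch assuming GNU sed, relying on two GNU extensions: q with an exit code and w /dev/stderr. The helper name check_blocks is hypothetical, the message omits the line number shown in the transcript, and the pattern treats anything other than exactly the three labelled lines as malformed.

```shell
# Hypothetical helper: validates blocks read from stdin (GNU sed assumed).
check_blocks() {
  sed -n '
# Accumulate non-blank lines; fall through on a blank line or the last line.
/./{
  H
  $!d
}
x
s/^\n//
# A non-empty block that is not exactly Name, Email, Phone is reported and aborts.
/./{
  /^Name: [^\n]*\nEmail: [^\n]*\nPhone: [^\n]*$/!{
    s/.*/Error: incomplete record/
    w /dev/stderr
    q2
  }
}
# Block was fine: reset the hold space.
s/.*//
x
'
}

printf 'Name: Solo\n\n' | check_blocks || echo "exit status: $?"
```

A usage error (status 1) cannot be detected inside the sed script at all, which is one reason the project structure includes a wrapper under bin/.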
4. Solution Architecture
4.1 High-Level Design
input.txt -> sed (accumulate block) -> transform fields -> TSV output
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Range detector | Identify blank-line blocks | Use /^$/ as boundary |
| Hold space builder | Accumulate block text | H and x pattern |
| Field extractor | Convert Name: lines to TSV | Regex substitutions |
| Output writer | Emit header and rows | Print header once at start |
4.3 Data Structures (No Full Code)
block = "Name: ...\nEmail: ...\nPhone: ..."
4.4 Algorithm Overview
Key Algorithm: Block Accumulation
- Clear hold space at start of block.
- Append each line to hold space until blank line.
- On blank line, swap hold space into pattern space.
- Extract fields and output TSV row.
Complexity Analysis:
- Time: O(n)
- Space: O(block size)
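The accumulation steps in 4.4 can be sketched for the happy path as follows. This is not the full project: it assumes every record is complete and lists its fields in Name, Email, Phone order, and it leaves out the header row and all error handling. The tab is spliced in from a shell variable because \t in a replacement is not portable across sed implementations.

```shell
tab=$(printf '\t')   # literal tab character, spliced into the script below

rows=$(printf 'Name: Alice Example\nEmail: alice@example.com\nPhone: 555-0101\n\nName: Bob Example\nEmail: bob@example.com\nPhone: 555-0102\n' |
sed -n '
# Non-blank line: append to the hold space; end the cycle unless this is the last line.
/./{
  H
  $!d
}
# Blank line or end of file: bring the block into the pattern space.
x
s/^\n//
# Strip the labels, turning the newline before each later field into a tab.
s/^Name: //
s/\nEmail: /'"$tab"'/
s/\nPhone: /'"$tab"'/
/./p
# Reset the hold space for the next block.
s/.*//
x
')
printf '%s\n' "$rows"
```

An incomplete record would pass through this sketch with stray newlines or labels still in it, which is exactly the case the project's validation step has to catch.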
5. Implementation Guide
5.1 Development Environment Setup
printf 'Name: Alice\nEmail: alice@example.com\nPhone: 555-0101\n\n' > input.txt
5.2 Project Structure
block-parse/
├── block-parse.sed
├── bin/
│ └── block-parse
├── tests/
│ └── test-block-parse.sh
└── README.md
5.3 The Core Question You’re Answering
“How can I parse multi-line records with a line-oriented tool?”
5.4 Concepts You Must Understand First
- Hold space and multi-line pattern space.
- Range addresses and branching.
5.5 Questions to Guide Your Design
- How will you handle a file that ends without a blank line?
- What makes a record “complete”?
- How will you reset the hold space between blocks?
5.6 Thinking Exercise
Trace the hold space after each line of this input:
Name: A
Email: a@x.com
Phone: 555-0000
5.7 The Interview Questions They’ll Ask
- “How does the hold space differ from the pattern space?”
- “Why does H sometimes add a leading newline?”
- “How do range addresses maintain state?”
5.8 Hints in Layers
Hint 1: Use H to accumulate lines until a blank line.
Hint 2: On blank line, swap hold space to pattern space and transform.
Hint 3: Clear the hold space after output.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Advanced sed | “sed & awk” | Ch. 6 |
| Stream editing patterns | “Classic Shell Scripting” | Ch. 8 |
| Parsing philosophy | “The UNIX Programming Environment” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Foundation (4-6 hours)
Goals:
- Detect block boundaries.
- Accumulate lines in hold space.
Tasks:
- Write a script that prints each block as a single chunk.
- Handle missing trailing blank line.
Checkpoint: Each block prints once, intact.
Phase 2: Core Functionality (5-8 hours)
Goals:
- Extract fields from the accumulated block.
- Emit TSV rows.
Tasks:
- Replace Name: with the field value.
- Reorder fields into TSV order.
Checkpoint: Output matches expected TSV.
Phase 3: Polish & Edge Cases (2-4 hours)
Goals:
- Reject incomplete records.
- Add header output.
Tasks:
- Detect missing fields and return errors.
- Print header once at start.
Checkpoint: Error handling works and header appears once.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Block delimiter | Blank lines vs markers | Blank lines | Simple and common |
| Field extraction | Single regex vs multiple | Multiple regexes | Easier to debug |
| Error handling | Ignore vs error | Error | Deterministic behavior |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Verify field regexes | Name/Email/Phone extraction |
| Integration Tests | End-to-end parsing | Two blocks to TSV |
| Edge Case Tests | Missing fields | Incomplete record detection |
6.2 Critical Test Cases
- Block with all fields produces a correct row.
- Missing email triggers exit code 2.
- File ending without blank line still processes last block.
6.3 Test Data
Name: A
Email: a@x.com
Phone: 555-0000

Name: B
Email: b@x.com
Phone: 555-0001
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Not clearing hold space | Blocks bleed together | Clear hold at block start |
| Misusing N | Skipped lines | Use H for accumulation |
| Missing last block | Output missing final record | Add end-of-file handling |
7.2 Debugging Strategies
- Print intermediate states: Temporarily print the hold space with x; p; x (p alone prints only the pattern space).
- Use small fixtures: Start with two blocks.
- Add markers: Insert visible separators during debugging.
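One way to apply the first strategy is to dump the hold space after every line with the l command, which prints embedded newlines as \n and marks the end of the buffer with $. The sketch below does this for the three-line record from the thinking exercise in 5.6.

```shell
# H appends the line, x brings the hold space forward, l prints it unambiguously,
# and the second x swaps the buffers back so processing continues unchanged.
trace=$(printf 'Name: A\nEmail: a@x.com\nPhone: 555-0000\n' | sed -n 'H; x; l; x')
printf '%s\n' "$trace"
```

Each output line shows the hold space after one input line, including the leading \n that H adds on the first append.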
7.3 Performance Traps
- Avoid building unbounded hold space; keep blocks small and well-delimited.
8. Extensions & Challenges
8.1 Beginner Extensions
- Support Address: field extraction.
- Allow custom delimiter lines.
8.2 Intermediate Extensions
- Add JSON output.
- Support optional fields with empty columns.
8.3 Advanced Extensions
- Parse nested blocks with custom markers.
- Build a DSL for block definitions.
9. Real-World Connections
9.1 Industry Applications
- Compliance reports: extract multi-line audit records.
- Ops runbooks: parse incident summaries.
9.2 Related Open Source Projects
- logstash-filter-mutate: similar field extraction concept.
- csvkit: downstream tool for structured output.
9.3 Interview Relevance
- Stateful parsing: shows you can manage parsing state in streaming tools.
- Text processing depth: demonstrates advanced sed knowledge.
10. Resources
10.1 Essential Reading
- “sed & awk” – hold space and advanced patterns
- “Classic Shell Scripting” – block processing
10.2 Video Resources
- Multi-line sed tutorial (video)
10.3 Tools & Documentation
- GNU sed manual – hold space and branching
- POSIX sed spec – addressing rules
10.4 Related Projects in This Series
- Project 3: Markdown to HTML – ordering and script structure.
- Project 5: Reversing a File – hold space mastery.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how H, G, and x interact.
- I can explain how range addresses maintain state.
- I can predict hold space contents at a given line.
11.2 Implementation
- Records are extracted correctly.
- Incomplete records are rejected with errors.
- Output is deterministic.
11.3 Growth
- I can extend the parser with new fields.
- I can explain the hold space pattern to someone else.
12. Submission / Completion Criteria
Minimum Viable Completion:
- A working block-parse.sed script that extracts records into TSV.
- Deterministic output for sample input.
- Proper handling of missing fields.
Full Completion:
- All minimum criteria plus:
- Tests for malformed blocks.
- A wrapper CLI with clear usage output.
Excellence (Going Above & Beyond):
- Support custom schemas defined in a config file.
- Add a --strict mode that rejects any extra fields.