Project 4: The Multi-Line Address Parser

Build a sed tool that parses multi-line blocks using the hold space to extract structured sections.

Quick Reference

Attribute                          Value
Difficulty                         Level 4: Advanced
Time Estimate                      15-25 hours
Main Programming Language          sed (script file)
Alternative Programming Languages  awk, Python, Perl
Coolness Level                     Level 5: Wizardry
Business Potential                 2: Internal Automation
Prerequisites                      Prior sed projects, strong regex, control flow basics
Key Topics                         hold space, N, H, G, ranges, branching

1. Learning Objectives

By completing this project, you will:

  1. Use the hold space to accumulate and process multi-line blocks.
  2. Apply address ranges to select sections of a file.
  3. Build a block parser that extracts or rewrites multi-line records.
  4. Debug and reason about sed state across multiple cycles.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Pattern Space, Hold Space, and Multi-Line Buffers

Fundamentals

sed is line-oriented by default, but the hold space allows you to carry state across lines. The pattern space holds the current line; the hold space is a persistent buffer. Commands like h, H, g, G, and x move data between them. The N command appends the next line to the pattern space, creating a multi-line buffer. This is the foundation for parsing blocks of text that span multiple lines, such as paragraphs, records, or sections.

The key mental shift is that multi-line buffers turn sed into a block processor. You must track what is in the pattern space at each step, especially when embedded newlines are present.

Deep Dive into the Concept

Multi-line editing is the point where sed stops behaving like a simple line filter and starts behaving like a small virtual machine. The pattern space is still the active buffer, but after N it can contain embedded newlines. In a multi-line pattern space, the anchors ^ and $ match the beginning and end of the whole pattern space, not of each line. This changes how you write regexes: you may need to match \n explicitly or use patterns like \n[^\n]* to target a specific line. Note that treating \n inside a bracket expression as a newline is GNU sed behavior; POSIX bracket expressions read it as a literal backslash and n.
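The anchor behavior is easiest to see on a tiny input. The following is a minimal sketch, assuming a made-up two-line input fed through printf: after N, the brackets land at the ends of the joined buffer rather than around each line.

```shell
# After N the pattern space is "a\nb"; ^ and $ anchor to the whole buffer,
# so "[" goes before the first line and "]" after the second.
out=$(printf 'a\nb\n' | sed 'N; s/^/[/; s/$/]/')
printf '%s\n' "$out"
```

The output is two lines, "[a" and "b]", which confirms that neither anchor matched at the embedded newline.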

The hold space provides persistent state. h overwrites the hold space with the pattern space, H appends to it with a newline, g replaces the pattern space with the hold space, G appends hold space to pattern space, and x swaps them. These commands let you accumulate blocks over multiple cycles. For example, to capture a paragraph separated by blank lines, you can append each nonblank line to the hold space with H, and when you hit a blank line, you swap the hold space into the pattern space and process the full block.
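To make the copy and append semantics concrete, here is a small sketch (the four-line input is invented for illustration) that uses h to save a line and G to append it after the following one, which swaps each pair of lines.

```shell
# h copies line 1 into the hold space.
# n loads line 2 into the pattern space (nothing is printed because of -n).
# G appends the held line after it, giving "2\n1", and p prints the result.
out=$(printf '1\n2\n3\n4\n' | sed -n 'h;n;G;p')
printf '%s\n' "$out"
```

The input here has an even number of lines on purpose; with an odd count, n finds no next line on the final cycle and the last line is not printed.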

One subtlety is that H always inserts a leading newline before appending. This can produce an empty first line if the hold space is empty. Many scripts handle this by special-casing the first line or trimming the leading newline with s/^\n//. Another subtlety is that the hold space persists across cycles, so if you do not explicitly clear it (h with an empty pattern space or x; s/.*//; x), you may accidentally mix blocks. The safest pattern is: clear hold space at block start, append lines, then consume and reset.
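The leading newline is easy to observe directly. This sketch, assuming a two-line sample input, accumulates every line with H and prints the hold space at end of file, first untouched and then with the s/^\n// trim applied.

```shell
# H on an empty hold space yields "\na\nb": note the leading newline.
raw=$(printf 'a\nb\n' | sed -n 'H; ${x; p;}')
# Trimming with s/^\n// leaves only the accumulated lines.
trimmed=$(printf 'a\nb\n' | sed -n 'H; ${x; s/^\n//; p;}')
printf 'raw:\n%s\ntrimmed:\n%s\n' "$raw" "$trimmed"
```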

Multi-line processing has memory implications. Because the hold space can grow, scripts that accumulate large blocks can hit implementation limits. POSIX requires at least 8192 bytes for pattern and hold space; GNU sed allows more, but you should still design scripts that do not grow without bound. This project uses bounded blocks (paragraphs or sections) to keep memory predictable.

Another important detail is how d interacts with multi-line buffers. If you call d while the pattern space holds multiple lines, you discard the entire buffer and move to the next cycle, which can skip lines you intended to keep. This is why multi-line scripts often avoid d except at clearly defined boundaries. When you do need to discard, consider whether you should flush or reset the hold space first to avoid leaking state into the next block.

Multi-line scripts also benefit from explicit end-of-file handling. If your file ends without a blank line, your accumulator may never flush. A common pattern is to add a rule for the $ address that triggers the same output logic as a blank line. This keeps behavior deterministic and avoids losing the last record.
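One way to sketch the end-of-file rule is to route both the blank-line case and the $ address to the same label. The label name flush and the sample input are assumptions for illustration, and joining lines with a space stands in for real field processing.

```shell
# Flush the accumulated block on a blank line OR on the last line ($),
# so a file without a trailing blank line still emits its final record.
out=$(printf 'a\nb\n\nc\nd\n' | sed -n '
  /^$/!H
  /^$/b flush
  $b flush
  b
  :flush
  x
  s/^\n//
  s/\n/ /g
  /./p
  s/.*//
  x
')
printf '%s\n' "$out"
```

The /./p guard keeps consecutive blank lines from printing empty records, and the final s/.*//; x pair leaves the hold space empty for the next block.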

A related multi-line command is D. If the pattern space contains a newline, D deletes everything up to and including the first newline and restarts the cycle without reading a new line; if there is no newline, it behaves like d. This can be useful when you want to slide through a multi-line buffer one line at a time. It is advanced, but knowing it exists helps you reason about multi-line control flow.
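A classic illustration of the sliding-window use of D is removing consecutive duplicate lines. This is a well-known sed idiom shown as a sketch rather than as part of this project's solution; the sample input is invented.

```shell
# $!N appends the next line (except on the last line).
# P prints the first line of the buffer unless both lines are identical.
# D drops the first line and restarts the cycle with what remains.
out=$(printf 'a\na\nb\nb\nb\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$out"
```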

How This Fits on Projects

This project is all about hold space and multi-line buffers. You will accumulate blocks delimited by blank lines and then extract fields from each block. This skill is foundational for Project 5’s file reversal pattern.

Definitions & Key Terms

  • Pattern space: Current working buffer.
  • Hold space: Persistent buffer across cycles.
  • N command: Append next line to pattern space.
  • H command: Append pattern space to hold space with newline.
  • G command: Append hold space to pattern space.
  • x command: Swap hold and pattern spaces.

Mental Model Diagram (ASCII)

Line -> pattern space
        | h/H
        v
      hold space (persists)
        ^ g/G/x

How It Works (Step-by-Step)

  1. Read a line into the pattern space.
  2. If line is part of a block, append to hold space (H).
  3. On block boundary, swap hold space into pattern space (g or x).
  4. Process the full block with regex and substitution.
  5. Clear hold space before starting the next block.

Minimal Concrete Example

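# Join each pair of lines, replacing the newline with a space.
# $!N skips N on the last line, so an odd final line is still printed
# (POSIX sed would otherwise quit without printing it).
sed '$!N; s/\n/ /' file.txt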

Common Misconceptions

  • Misconception: Hold space resets each line.
    • Correction: Hold space persists until explicitly changed.
  • Misconception: N acts like n.
    • Correction: N appends; n replaces pattern space with next line.
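The difference between the two commands shows up immediately on a numbered input. A minimal sketch, assuming four sample lines:

```shell
# n REPLACES the pattern space with the next line (nothing is printed under -n),
# so only every second line reaches p.
with_n=$(printf '1\n2\n3\n4\n' | sed -n 'n; p')
# N APPENDS the next line, so both lines are present when s and p run.
with_N=$(printf '1\n2\n3\n4\n' | sed -n 'N; s/\n/+/; p')
printf 'n gives:\n%s\nN gives:\n%s\n' "$with_n" "$with_N"
```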

Check-Your-Understanding Questions

  1. What is the difference between H and h?
  2. Why can H create a leading blank line?
  3. What does x do?

Check-Your-Understanding Answers

  1. H appends to hold space with a newline, h overwrites it.
  2. It inserts a newline before appending, even if hold space is empty.
  3. It swaps the contents of pattern space and hold space.

Real-World Applications

  • Parsing paragraph blocks in text files.
  • Extracting multi-line records from reports.
  • Building advanced text transformations in CI pipelines.

Where You’ll Apply It

References

  • GNU sed manual – hold space commands
  • “sed & awk” – advanced sed techniques

Key Insight

Hold space lets you turn line-oriented sed into a block-oriented parser.

Summary

Multi-line buffers are the gateway to advanced sed scripting.

Homework/Exercises to Practice the Concept

  1. Append each line to the hold space and print the hold space on blank lines.
  2. Join each pair of lines into a single line.
  3. Swap pattern and hold space and observe the effect.

Solutions to the Homework/Exercises

  1. sed -n '/^$/!{H;d;}; x; s/^\n//; p; s/.*//; x' file.txt
  2. sed '$!N; s/\n/ /' file.txt
  3. sed -n '1{h;d;}; x; p' file.txt (each line prints one cycle late; the last line stays in the hold space)

2.2 Range Addressing, Branching, and Stateful Parsing

Fundamentals

Range addresses like /START/,/END/ let you select a block of lines. These ranges are stateful: once the start matches, every line is selected until the end matches. Branching with labels (:label, b label) lets you control flow and skip commands. Together, ranges and branching let you build a parser that recognizes blocks and treats them as units, even though sed processes line by line.

Range addressing is stateful, which means your script has implicit memory even without variables. Understanding when that state resets is critical for correctness.

When you rely on ranges, you must remember that the range is inclusive. If you need exclusive behavior, you should explicitly delete or skip the boundary lines after they are matched.

Deep Dive into the Concept

Stateful parsing in sed means your script remembers that it is “inside” a block. Range addresses provide this state implicitly: when /START/ matches, the range becomes active; when /END/ matches, it deactivates. This is perfect for extracting sections like:

BEGIN
line1
line2
END

You can use /^BEGIN$/,/^END$/ to select the block. Inside that range, you can accumulate lines into the hold space or perform targeted substitutions. The important detail is that both the start and end lines are included in the range. If you need to exclude them, you must explicitly delete or skip them after the selection.

Branching adds explicit control. A label :skip and a branch b skip let you jump over commands, which is useful when you want to treat different block types differently. For example, if you detect a header line, you might branch to a label that initializes the hold space; when you reach a footer line, you might branch to output logic. This is a form of control flow inside a single sed script.

Combining ranges with hold space yields powerful patterns: detect START, clear hold space, append lines until END, then output the accumulated block. One tricky edge case is nested blocks. Standard sed ranges do not support nesting; the first END closes the range. If your data can contain nested blocks, you need a more complex state machine, which is beyond this project. Documenting this limitation is part of designing a reliable tool.

Finally, remember that sed has no variables. Apart from the implicit state of an active range, the hold space is your only state store, and addresses and branching are your only control flow. That means your script should be small and predictable. When debugging, insert temporary p commands or use -n and explicit printing to understand which branch is taken.

Branching can also be tied to substitution success using the t command (branch if a substitution has succeeded) and, in GNU sed only, the T command (branch if none has). This is a powerful way to build state machines in sed. For instance, you can attempt to match a header line; if the substitution succeeds, branch to a label that initializes accumulation. If it fails, fall through to normal line handling. This pattern keeps your logic declarative while still allowing conditional behavior.
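The header-versus-body idea can be sketched with t alone, which keeps it portable. The "## " header prefix, the label name header, and the output decorations are all assumptions made up for this illustration.

```shell
# Try to strip a "## " prefix. If that substitution succeeds, t jumps to
# :header, which prints the title in brackets. Otherwise the line falls
# through, is indented and printed, and b skips the header logic.
out=$(printf '## Title\nbody\n' | sed -n \
  -e 's/^## //' \
  -e 't header' \
  -e 's/^/  /p' \
  -e 'b' \
  -e ':header' \
  -e 's/.*/[&]/p')
printf '%s\n' "$out"
```

Each label-related command sits in its own -e fragment because POSIX sed reads a label up to the end of the line.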

You can simulate simple state by storing a marker in the hold space. For example, set the hold space to IN_BLOCK when you see a header, and clear it on footer. Then use x and pattern matching to decide whether to append or output. This approach is fragile but can be sufficient for small, well-defined formats.
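A rough sketch of the marker technique follows, using BEGIN and END as hypothetical header and footer lines. It prints only the lines inside the block, and it shares the fragility noted above: the hold space can hold either the flag or data, never both.

```shell
# BEGIN stores the IN_BLOCK marker in the hold space; END clears it.
# For every other line, x brings the marker into the pattern space so it
# can be tested; a second x restores the line before printing or moving on.
out=$(printf 'x\nBEGIN\na\nb\nEND\ny\n' | sed -n \
  -e '/^BEGIN$/{s/.*/IN_BLOCK/;h;d;}' \
  -e '/^END$/{s/.*//;h;d;}' \
  -e 'x;/^IN_BLOCK$/{x;p;d;}' \
  -e 'x')
printf '%s\n' "$out"
```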

Another edge case is overlapping ranges. If your data has a START inside an active range before an END, the range stays active; it does not nest. This can lead to surprising behavior, so it is often safer to enforce non-overlapping blocks by validating input format or by using unique markers.
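The non-nesting behavior can be checked with a deliberately overlapping input (the sample lines are invented for this sketch):

```shell
# The inner START does not open a second range, and the first END closes
# the only one. "c" and the final END therefore fall outside any range.
out=$(printf 'START\na\nSTART\nb\nEND\nc\nEND\n' | sed -n '/^START$/,/^END$/p')
printf '%s\n' "$out"
```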

You can also combine ranges with line-number addresses to enforce bounds, such as only parsing blocks in the first N lines. This is useful for large files where the records you care about sit near the top, for example logs written newest-first.

How This Fits on Projects

Range addressing defines the blocks you will parse in this project. Branching provides the control flow to accumulate and then output each block. These are the same tools used in Project 5, but with a different goal.

Definitions & Key Terms

  • Range address: A start and end selector that defines a block.
  • Label: A named point in a sed script (:label).
  • Branch: A jump to a label (b label).
  • State: The implicit “inside block” condition maintained by ranges.

Mental Model Diagram (ASCII)

START line -> range active -> accumulate -> END line -> output block

How It Works (Step-by-Step)

  1. Detect start of block with a regex address.
  2. Clear hold space and begin accumulation.
  3. Append each line to hold space while range is active.
  4. On end line, swap hold space to pattern space and output.
  5. Reset hold space and continue.
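The steps above can be sketched as a short script. The BEGIN and END markers come from the earlier example; joining the body lines with commas is an arbitrary choice made only so the output is easy to read.

```shell
# Inside the range: BEGIN clears the hold space, body lines are appended
# with H, and END swaps the block in, prints it, and resets the hold space.
out=$(printf 'skip\nBEGIN\nline1\nline2\nEND\nskip\n' | sed -n '
  /^BEGIN$/,/^END$/{
    /^BEGIN$/{ s/.*//; h; d; }
    /^END$/!{ H; d; }
    x
    s/^\n//
    s/\n/,/g
    p
    s/.*//
    h
  }
')
printf '%s\n' "$out"
```

Lines outside the range never match the outer address, so they are silently dropped under -n.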

Minimal Concrete Example

# Print the lines from BEGIN through END (both markers included)
sed -n '/^BEGIN$/,/^END$/p' file.txt

Common Misconceptions

  • Misconception: Ranges are stateless.
    • Correction: Ranges stay active after the start matches.
  • Misconception: sed supports nested ranges.
    • Correction: Ranges do not nest; the first END match closes the range.

Check-Your-Understanding Questions

  1. Are start and end lines included in ranges?
  2. What does b label do?
  3. Why is nested block parsing hard in sed?

Check-Your-Understanding Answers

  1. Yes, both are included.
  2. It jumps to the label, skipping intermediate commands.
  3. sed has no stack; ranges close on the first end match.

Real-World Applications

  • Extracting sections from config files.
  • Parsing reports with headers and footers.
  • Processing multi-line records in audit logs.

Where You’ll Apply It

  • In this project: See §3.2 Functional Requirements and §6 Testing Strategy.
  • Also used in: P02-log-file-cleaner.md.

References

  • GNU sed manual – addresses and branching
  • “sed & awk” – range patterns

Key Insight

Ranges provide implicit state; branching lets you exploit it.

Summary

Address ranges and branching turn line-by-line processing into block parsing.

Homework/Exercises to Practice the Concept

  1. Print only the lines between START and END markers.
  2. Exclude the START and END lines from output.
  3. Use a branch to skip lines that start with #.

Solutions to the Homework/Exercises

  1. sed -n '/^START$/,/^END$/p' file.txt
  2. sed -n '/^START$/,/^END$/ {/^START$/d; /^END$/d; p;}' file.txt
  3. sed -n -e '/^#/b' -e 'p' file.txt

3. Project Specification

3.1 What You Will Build

You will build a sed script named block-parse.sed that extracts multi-line records from a file separated by blank lines. Each record is turned into a single line with fields separated by tabs. Example: an address block with Name, Email, and Phone lines becomes a TSV row. The script operates deterministically and rejects incomplete records with an error.

3.2 Functional Requirements

  1. Detect block boundaries: blank lines separate records.
  2. Accumulate block lines: use hold space to build the record.
  3. Extract fields: map lines like Name: ... to TSV columns.
  4. Output TSV: emit a header and one row per complete record.
  5. Exit codes: handle missing input or malformed blocks.

3.3 Non-Functional Requirements

  • Performance: Process 50k lines in under 2 seconds.
  • Reliability: Do not mix data across blocks.
  • Usability: Clear documentation of expected input format.

3.4 Example Usage / Output

$ sed -f block-parse.sed input.txt > output.tsv

3.5 Data Formats / Schemas / Protocols

Input:

Name: Alice Example
Email: alice@example.com
Phone: 555-0101

Name: Bob Example
Email: bob@example.com
Phone: 555-0102

Output:

name	email	phone
Alice Example	alice@example.com	555-0101
Bob Example	bob@example.com	555-0102

3.6 Edge Cases

  • Missing Email or Phone lines in a block.
  • Extra fields in a block.
  • Consecutive blank lines.
  • File that ends without a trailing blank line.

3.7 Real World Outcome

You will have a deterministic parser that turns multi-line records into a normalized TSV.

3.7.1 How to Run (Copy/Paste)

cat > input.txt <<'EOF'
Name: Alice Example
Email: alice@example.com
Phone: 555-0101

Name: Bob Example
Email: bob@example.com
Phone: 555-0102
EOF

sed -f block-parse.sed input.txt > output.tsv

3.7.2 Golden Path Demo (Deterministic)

Output includes exactly two rows with the expected fields.

3.7.3 CLI Transcript (Success)

$ sed -f block-parse.sed input.txt > output.tsv
$ echo $?
0
$ cat output.tsv
name	email	phone
Alice Example	alice@example.com	555-0101
Bob Example	bob@example.com	555-0102

3.7.4 CLI Transcript (Failure: Malformed Block)

$ printf 'Name: Solo\n\n' > bad.txt
$ sed -f block-parse.sed bad.txt
Error: incomplete record at line 1
$ echo $?
2

Exit codes:

  • 0 success
  • 1 usage error
  • 2 malformed record
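Plain POSIX sed cannot choose its own exit status, so the transcript above implies either a wrapper script or a GNU extension. The following sketch assumes GNU sed, whose q command accepts an exit code. The Email-only check is a simplified stand-in for real validation, and the message goes to stdout here rather than stderr.

```shell
# ASSUMPTION: GNU sed (q with an exit code is a GNU extension).
# Accumulate the block; on a blank line, swap it in and test for an Email
# field. If it is missing, print an error and quit with status 2.
# Complete blocks are simply dropped in this sketch.
status=0
out=$(printf 'Name: Solo\n\n' | sed -n '
  /^$/!{ H; d; }
  x
  /Email: /!{ s/.*/Error: incomplete record/p; q2; }
') || status=$?
printf 'output: %s\nexit code: %s\n' "$out" "$status"
```

GNU sed itself already uses exit status 1 for invalid commands and 2 for input files that cannot be opened, so a wrapper may be needed if callers must tell those cases apart from a malformed record.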

4. Solution Architecture

4.1 High-Level Design

input.txt -> sed (accumulate block) -> transform fields -> TSV output

4.2 Key Components

Component           Responsibility               Key Decisions
Range detector      Identify blank-line blocks   Use /^$/ as boundary
Hold space builder  Accumulate block text        H and x pattern
Field extractor     Convert Name: lines to TSV   Regex substitutions
Output writer       Emit header and rows         Print once at start

4.3 Data Structures (No Full Code)

block = "Name: ...\nEmail: ...\nPhone: ..."

4.4 Algorithm Overview

Key Algorithm: Block Accumulation

  1. Clear hold space at start of block.
  2. Append each line to hold space until blank line.
  3. On blank line, swap hold space into pattern space.
  4. Extract fields and output TSV row.

Complexity Analysis:

  • Time: O(n)
  • Space: O(block size)

5. Implementation Guide

5.1 Development Environment Setup

printf 'Name: Alice\nEmail: alice@example.com\nPhone: 555-0101\n\n' > input.txt

5.2 Project Structure

block-parse/
├── block-parse.sed
├── bin/
│   └── block-parse
├── tests/
│   └── test-block-parse.sh
└── README.md

5.3 The Core Question You’re Answering

“How can I parse multi-line records with a line-oriented tool?”

5.4 Concepts You Must Understand First

  1. Hold space and multi-line pattern space.
  2. Range addresses and branching.

5.5 Questions to Guide Your Design

  1. How will you handle a file that ends without a blank line?
  2. What makes a record “complete”?
  3. How will you reset the hold space between blocks?

5.6 Thinking Exercise

Trace the hold space after each line of this input:

Name: A
Email: a@x.com
Phone: 555-0000

5.7 The Interview Questions They’ll Ask

  1. “How does the hold space differ from the pattern space?”
  2. “Why does H sometimes add a leading newline?”
  3. “How do range addresses maintain state?”

5.8 Hints in Layers

Hint 1: Use H to accumulate lines until a blank line.

Hint 2: On blank line, swap hold space to pattern space and transform.

Hint 3: Clear the hold space after output.

5.9 Books That Will Help

Topic                    Book                                Chapter
Advanced sed             “sed & awk”                         Ch. 6
Stream editing patterns  “Classic Shell Scripting”           Ch. 8
Parsing philosophy       “The UNIX Programming Environment”  Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (4-6 hours)

Goals:

  • Detect block boundaries.
  • Accumulate lines in hold space.

Tasks:

  1. Write a script that prints each block as a single chunk.
  2. Handle missing trailing blank line.

Checkpoint: Each block prints once, intact.

Phase 2: Core Functionality (5-8 hours)

Goals:

  • Extract fields from the accumulated block.
  • Emit TSV rows.

Tasks:

  1. Strip the Name:, Email:, and Phone: labels, leaving only the field values.
  2. Reorder fields into TSV order.

Checkpoint: Output matches expected TSV.

Phase 3: Polish & Edge Cases (2-4 hours)

Goals:

  • Reject incomplete records.
  • Add header output.

Tasks:

  1. Detect missing fields and return errors.
  2. Print header once at start.

Checkpoint: Error handling works and header appears once.

5.11 Key Implementation Decisions

Decision          Options                   Recommendation    Rationale
Block delimiter   Blank lines vs markers    Blank lines       Simple and common
Field extraction  Single regex vs multiple  Multiple regexes  Easier to debug
Error handling    Ignore vs error           Error             Deterministic behavior

6. Testing Strategy

6.1 Test Categories

Category           Purpose               Examples
Unit Tests         Verify field regexes  Name/Email/Phone extraction
Integration Tests  End-to-end parsing    Two blocks to TSV
Edge Case Tests    Missing fields        Incomplete record detection

6.2 Critical Test Cases

  1. Block with all fields produces a correct row.
  2. Missing email triggers exit code 2.
  3. File ending without blank line still processes last block.

6.3 Test Data

Name: A
Email: a@x.com
Phone: 555-0000

Name: B
Email: b@x.com
Phone: 555-0001

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                  Symptom                      Solution
Not clearing hold space  Blocks bleed together        Clear hold at block start
Misusing N               Skipped lines                Use H for accumulation
Missing last block       Output missing final record  Add end-of-file handling

7.2 Debugging Strategies

  • Print intermediate states: Temporarily dump the hold space with x; p; x (p alone prints only the pattern space).
  • Use small fixtures: Start with two blocks.
  • Add markers: Insert visible separators during debugging.
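For the first strategy, the l command is often more revealing than p because it shows embedded newlines as \n and marks the end of the buffer with $. A small sketch, assuming a two-line sample input:

```shell
# After each H, swap the hold space in, list it unambiguously with l,
# then swap back so the script state is unchanged.
out=$(printf 'a\nb\n' | sed -n 'H; x; l; x')
printf '%s\n' "$out"
```

The two output lines are \na$ and \na\nb$, which makes the leading newline added by H visible at a glance.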

7.3 Performance Traps

  • Avoid building unbounded hold space; keep blocks small and well-delimited.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Support Address: field extraction.
  • Allow custom delimiter lines.

8.2 Intermediate Extensions

  • Add JSON output.
  • Support optional fields with empty columns.

8.3 Advanced Extensions

  • Parse nested blocks with custom markers.
  • Build a DSL for block definitions.

9. Real-World Connections

9.1 Industry Applications

  • Compliance reports: extract multi-line audit records.
  • Ops runbooks: parse incident summaries.

9.2 Related Tools

  • logstash-filter-mutate: similar field extraction concept.
  • csvkit: downstream tool for structured output.

9.3 Interview Relevance

  • Stateful parsing: shows you can manage parsing state in streaming tools.
  • Text processing depth: demonstrates advanced sed knowledge.

10. Resources

10.1 Essential Reading

  • “sed & awk” – hold space and advanced patterns
  • “Classic Shell Scripting” – block processing

10.2 Video Resources

  • Multi-line sed tutorial (video)

10.3 Tools & Documentation

  • GNU sed manual – hold space and branching
  • POSIX sed spec – addressing rules

10.4 Related Projects

  • Project 3: Markdown to HTML – ordering and script structure.
  • Project 5: Reversing a File – hold space mastery.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how H, G, and x interact.
  • I can explain how range addresses maintain state.
  • I can predict hold space contents at a given line.

11.2 Implementation

  • Records are extracted correctly.
  • Incomplete records are rejected with errors.
  • Output is deterministic.

11.3 Growth

  • I can extend the parser with new fields.
  • I can explain the hold space pattern to someone else.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • A working block-parse.sed script that extracts records into TSV.
  • Deterministic output for sample input.
  • Proper handling of missing fields.

Full Completion:

  • All minimum criteria plus:
  • Tests for malformed blocks.
  • A wrapper CLI with clear usage output.

Excellence (Going Above & Beyond):

  • Support custom schemas defined in a config file.
  • Add a --strict mode that rejects any extra fields.