Project 3: Basic Markdown to HTML Converter

Build a multi-command sed script that converts a constrained Markdown subset into clean, deterministic HTML.

Quick Reference

Attribute	Value
Difficulty	Level 2: Intermediate
Time Estimate	12-20 hours
Main Programming Language	sed (script file)
Alternative Programming Languages	awk, Python, Perl
Coolness Level	Level 4: Fun and Impressive
Business Potential	2: Internal Automation
Prerequisites	Regex mastery, `sed` command order, basic HTML
Key Topics	script files, command ordering, regex transforms

1. Learning Objectives

By completing this project, you will:

Write multi-command sed scripts with predictable ordering.
Convert Markdown headings, lists, and emphasis into HTML tags.
Build a deterministic transformation pipeline with test fixtures.
Recognize the limitations of line-based parsing and design around them.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Sed Script Files and Command Ordering

Fundamentals

sed scripts let you chain multiple editing commands in a specific order. You can pass them as -e one-liners, but for real transformations a script file (sed -f script.sed) is more readable and maintainable. The order matters because each command sees the output of the previous. When converting Markdown to HTML, you must apply heading rules before paragraph wrapping, strip list markers before surrounding the list in <ul>, and escape HTML-sensitive characters before injecting tags. A script file gives you a single, repeatable transformation pipeline.

A practical discipline is to group commands by layer: block-level transforms first, inline transforms second, and cleanup last. This mirrors how Markdown parsers work and makes your script easier to reason about.
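The layering can be sketched as a small script file. This is only an illustration under assumptions: the file name layers.sed and the specific cleanup rule are hypothetical choices, not part of the project specification.

```shell
# Write a small layered script; the file name layers.sed is illustrative.
cat > layers.sed <<'EOF'
# Layer 1: block-level transforms (most specific heading first)
s/^### (.*)$/<h3>\1<\/h3>/
s/^## (.*)$/<h2>\1<\/h2>/
s/^# (.*)$/<h1>\1<\/h1>/
# Layer 2: inline transforms (bold before italic)
s/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g
s/\*([^*]+)\*/<em>\1<\/em>/g
# Layer 3: cleanup (drop whitespace left before a closing heading tag)
s/[[:space:]]+(<\/h[1-3]>)$/\1/
EOF

printf '## A **bold** move  \n' | sed -E -f layers.sed
# prints: <h2>A <strong>bold</strong> move</h2>
```

Keeping each layer under its own comment makes it easy to see which rule to move when two transforms interfere.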

Deep Dive into the Concept

Markdown conversion is a classic example of why command ordering is critical. Replacing **bold** with <strong>bold</strong> after the line is already wrapped in <p> is harmless, because the result is still valid nested markup. Escaping < and > after tags have been inserted is not: it rewrites the generated tags themselves into entities and corrupts the HTML. sed runs its commands top to bottom on every line, so you must design the transformation as a series of safe, composable steps. A typical approach is:

Normalize line endings and strip trailing whitespace.
Escape HTML-special characters in the input text (& first, then < and >), before any tags exist.
Detect and convert block-level elements like headings and list markers.
Convert inline elements like emphasis and code.

The tricky part is that some steps interact. If you escape < and > after you insert <h1> tags, you will escape your own generated tags. That is why you must escape only the parts of the input that remain as text, not the tags you insert. In a pure sed script, the simplest strategy is to escape first and then place tags using s commands that insert raw < and >. The alternative is to escape late and then selectively unescape the tags you generated, which is more fragile because it also unescapes matching tag text that came from the input. The key is that ordering controls which characters are treated as data versus markup.
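A minimal sketch of the escape-then-insert ordering follows. The function name escape_then_tag is hypothetical, and only a single heading rule is shown.

```shell
# Escape the input text first, then insert tags with raw < and >.
# In a replacement, a bare & means "the whole match", so a literal & is written \&.
escape_then_tag() {
  sed -E \
    -e 's/&/\&amp;/g' \
    -e 's/</\&lt;/g' \
    -e 's/>/\&gt;/g' \
    -e 's/^# (.*)$/<h1>\1<\/h1>/'
}

printf '# 2 < 3 & 4 > 1\n' | escape_then_tag
# prints: <h1>2 &lt; 3 &amp; 4 &gt; 1</h1>
```

The ampersand rule runs before the angle-bracket rules; in the opposite order, the & inside a freshly written &lt; would be escaped a second time.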

Script files also allow labels and branches. While not strictly necessary for a basic converter, labels can help you build more readable logic, such as only wrapping lines in <p> when they are not already converted to a block element. You can implement this by setting a flag in the hold space or by using branches that skip over paragraph conversion when a heading pattern matches. This is where script files shine: you can define a clear top-down flow rather than a tangled one-liner.
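One way to sketch the branching idea uses the b and t commands without labels, which jump to the end of the script. The file name branch.sed is an assumption made for this example.

```shell
cat > branch.sed <<'EOF'
# Blank lines skip every rule and print unchanged.
/^$/b
# Try the heading rule.
s/^# (.*)$/<h1>\1<\/h1>/
# t jumps to the end of the script if a substitution succeeded on this line,
# so converted headings never reach the paragraph rule below.
t
s/^(.*)$/<p>\1<\/p>/
EOF

printf '# Title\n\nplain text\n' | sed -E -f branch.sed
```

With this input the heading and the paragraph are each wrapped exactly once, and the blank line passes through untouched.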

Finally, script files make testing easier. You can keep a sample Markdown input and run sed -f md2html.sed sample.md, then compare output to an expected HTML file. This aligns with deterministic transformations and makes debugging simpler.

For a converter, ordering also affects list handling. If you wrap list items in <li> first, you still need to add <ul> before and </ul> after the list. In sed, you can do this by detecting the first list line and inserting a <ul>, then detecting the transition out of a list and inserting </ul>. This often requires labels and branching, or a state flag stored in the hold space. Even if you keep the list logic simple, understanding this block-level ordering helps you avoid malformed HTML that mixes <p> and <ul> incorrectly.
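The state-flag approach might look like the sketch below. It assumes the hold space is used only for this flag, that list items start with "- ", and that the file name list.sed and the flag value "in" are arbitrary choices.

```shell
cat > list.sed <<'EOF'
# List item: convert it, and open <ul> if the flag is not set yet.
/^- /{
  s/^- (.*)$/<li>\1<\/li>/
  x
  /^in$/!{
    s/.*/in/
    x
    i\
<ul>
    x
  }
  x
  # A list that runs to the last line must still be closed.
  $a\
</ul>
  b
}
# Any other line: if a list is open, clear the flag and close it first.
x
/^in$/{
  s/.*//
  x
  i\
</ul>
  x
}
x
EOF

printf 'intro\n- one\n- two\nafter\n- last\n' | sed -E -f list.sed
```

Each x swaps the pattern space with the hold space, so the flag is inspected and updated without disturbing the current line.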

Another ordering concern is paragraph handling. If you wrap every non-blank line in <p>, you must avoid wrapping headings and list items. A common sed strategy is to tag lines that are already block elements (for example, prefix them with a marker), then only wrap untagged lines, and finally remove the markers. This keeps paragraph logic separate from heading and list logic while preserving readability.
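The marker technique can be sketched as follows. The sentinel @@ is an assumption (any string that cannot occur at the start of an input line would do), as is the file name markers.sed.

```shell
cat > markers.sed <<'EOF'
# Tag lines that became block elements with a temporary marker.
s/^# (.*)$/@@<h1>\1<\/h1>/
s/^- (.*)$/@@<li>\1<\/li>/
# Wrap only lines that are neither marked nor blank.
/^(@@|$)/!s/^(.*)$/<p>\1<\/p>/
# Remove the markers.
s/^@@//
EOF

printf '# Title\n- item\n\ntext\n' | sed -E -f markers.sed
```

Only the last input line ends up inside <p>; the heading and list item keep their own tags and the blank line stays blank.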

How This Fits on Projects

In this project, the entire converter is a script file with multiple transformation steps. You will rely on precise command ordering to avoid malformed HTML. In Project 4, you will use more advanced control flow (branches and hold space), which builds directly on this skill.

Definitions & Key Terms

Script file: A file of sed commands executed with -f.
Command order: The sequence in which transformations are applied.
Block-level element: HTML tags like <h1>, <ul>, <p>.
Inline element: HTML tags like <strong>, <em>, <code>.

Mental Model Diagram (ASCII)

Input line -> [cmd1] -> [cmd2] -> [cmd3] -> output line

How It Works (Step-by-Step)

sed reads a line into the pattern space.
It applies command 1 (e.g., normalize whitespace).
It applies command 2 (e.g., heading conversion).
It applies command 3 (e.g., inline emphasis).
The transformed line prints (or is stored).

Minimal Concrete Example

# Convert Markdown headings to HTML
sed -E 's/^# (.*)$/<h1>\1<\/h1>/' input.md

Common Misconceptions

Misconception: Command order does not matter.
- Correction: Later commands see the output of earlier ones, so order is essential.
Misconception: One-liners are always better.
- Correction: Complex transformations are safer in script files.

Check-Your-Understanding Questions

Why should heading conversion run before paragraph wrapping?
What is the risk of escaping < after you insert tags?
How does -f improve maintainability?

Check-Your-Understanding Answers

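Otherwise the paragraph rule wraps the raw heading line in <p> first; the line then starts with <p>, so the anchored heading pattern no longer matches and the heading is never converted.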
You will escape your own tags and break the HTML.
It makes the transformation readable and versioned as a script.

Real-World Applications

Building static site generators for small docs.
Converting README files into HTML for dashboards.
Creating lightweight documentation pipelines.

Where You’ll Apply It

In this project: See §3.2 Functional Requirements and §5.10 Implementation Phases.
Also used in: P04-multi-line-address-parser.md.

References

GNU sed manual – script files and -f
“sed & awk” – multi-command scripts

Key Insight

Order is the architecture of a sed script; the same commands in the wrong order produce broken output.

Summary

Script files and command ordering let you design transformations as a clear pipeline instead of a fragile one-liner.

Homework/Exercises to Practice the Concept

Write a script that turns # Title into <h1>Title</h1>.
Add a second command to wrap any line not starting with # in <p>.
Create a script file and run it with sed -f.

Solutions to the Homework/Exercises

sed -E 's/^# (.*)$/<h1>\1<\/h1>/' file.md
sed -E '/^#/! s/^(.*)$/<p>\1<\/p>/' file.md
sed -f md2html.sed file.md

2.2 Regex-Based Markdown Transforms and Escaping

Fundamentals

Markdown is mostly syntax sugar over plain text. It uses markers like #, *, and backticks to indicate formatting. A sed converter relies on regex patterns to detect those markers and replace them with HTML tags. Inline elements like **bold** and *italic* can be transformed with capture groups. Block elements like lists require recognizing line prefixes (- or * ). Escaping is equally important: if the input contains < or &, they must be converted to &lt; and &amp; so they do not break the HTML output.

Escaping is not optional because HTML is not just formatting; it is a language with its own syntax. If you do not escape special characters, your output can become invalid or even unsafe.

Deep Dive into the Concept

Regex-based Markdown conversion is not a full parser; it is a set of disciplined transformations. The key is to define a constrained subset of Markdown that is line-oriented and therefore compatible with sed. For example, you can support:

Headings: #, ##, ### at line start
Unordered lists: lines starting with - or *
Bold: **text**
Italic: *text* (with careful avoidance of list markers)
Inline code: `code`

Each of these can be implemented with sed substitutions and capture groups. The challenge is avoiding conflicts. For instance, * is used both for italics and for list markers. If you apply italic conversion before list handling, you may accidentally wrap list markers in <em>. The solution is to anchor list patterns to the start of the line, transform lists first, and only then run inline formatting on the remaining text. That means the order of regex transforms must mirror the precedence rules of Markdown.
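To make the conflict concrete, the sketch below runs the same two rules in both orders. The function names are hypothetical.

```shell
# Lists first (anchored to line start), then italics on what remains.
lists_then_italics() {
  sed -E -e 's/^[*-] (.*)$/<li>\1<\/li>/' -e 's/\*([^*]+)\*/<em>\1<\/em>/g'
}

# Same rules, wrong order: the list marker is consumed as an opening asterisk.
italics_then_lists() {
  sed -E -e 's/\*([^*]+)\*/<em>\1<\/em>/g' -e 's/^[*-] (.*)$/<li>\1<\/li>/'
}

printf '* use *care* here\n' | lists_then_italics
# prints: <li>use <em>care</em> here</li>
printf '* use *care* here\n' | italics_then_lists
# prints: <em> use </em>care* here
```

In the second run the italic rule pairs the list marker with the first emphasis asterisk, and the list rule then finds nothing left to match.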

Escaping is another critical concern. HTML treats <, >, and & as special. If your Markdown contains 2 < 3, you must convert it to 2 &lt; 3 before output. However, you must avoid escaping the tags you insert. One strategy is to insert tags, escape the whole line, and then selectively unescape the tags you introduced by replacing &lt;h1&gt; back with <h1>. Another strategy is to escape the text before you inject any tags, so generated markup is never touched. Both approaches are workable; the second is simpler and is the one recommended in §5.11.

Because sed is line-based, multi-line constructs like fenced code blocks or nested lists are difficult. The project scope should explicitly exclude these or treat them with simplified rules. This is an important engineering decision: a tool is only correct within its defined boundaries. By documenting the supported Markdown subset, you make the converter reliable rather than “almost correct”. This is a key lesson for any text transformation system.

Inline formatting has edge cases: nested emphasis, multiple bold segments on the same line, or asterisks inside code spans. A sed-based converter cannot handle every combination, but you can still make it predictable by declaring rules. For example, run code-span conversion first, then avoid applying italic or bold patterns inside <code> tags. You can do this by temporarily protecting code spans with placeholders, converting emphasis, and then restoring the code spans. This approach keeps your output stable without pretending to fully parse Markdown.
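A simplified variant of the placeholder idea is sketched below. Instead of stashing whole code spans, it hides only the asterisks inside <code> elements, which is easier to express in sed. It assumes the input was already escaped (so < appears only in generated tags) and that the sentinel @AST@ never occurs in the input; the file name codespan.sed is also an assumption.

```shell
cat > codespan.sed <<'EOF'
# 1. Code spans become <code> elements.
s/`([^`]+)`/<code>\1<\/code>/g
# 2. Hide each asterisk inside <code>...</code> behind a placeholder, one per loop pass.
:hide
s/(<code>[^<*]*)\*/\1@AST@/
t hide
# 3. Emphasis rules cannot see the hidden asterisks.
s/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g
s/\*([^*]+)\*/<em>\1<\/em>/g
# 4. Restore the literal asterisks.
s/@AST@/*/g
EOF

printf 'run `ls *.md` for *all* files\n' | sed -E -f codespan.sed
# prints: run <code>ls *.md</code> for <em>all</em> files
```

Without steps 2 and 4, the italic rule would pair the asterisk in *.md with the one before "all" and emit an <em> that straddles the closing </code> tag.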

Escaping also interacts with code spans. Text inside a code span still needs its <, >, and & escaped, but it must not be touched by the emphasis rules. The placeholder approach described above satisfies both: it preserves code literals and keeps emphasis rules from corrupting them.

If you later add support for links ([text](url)), the same ordering lesson applies: convert links before emphasis, otherwise the * or _ inside URLs can trigger italic rules. This shows why a clear precedence list is essential, even in a constrained Markdown subset.

A final guard is to run a cleanup pass that removes empty <p></p> elements created by consecutive blank lines. This is a good example of a post-processing rule that depends on all earlier transformations being complete.
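A cleanup pass of this kind could be as small as the sketch below; the function name drop_empty_paragraphs is hypothetical.

```shell
# Delete lines that consist only of an empty (or whitespace-only) paragraph.
drop_empty_paragraphs() {
  sed -E '/^<p>[[:space:]]*<\/p>$/d'
}

printf '<p>one</p>\n<p></p>\n<p>  </p>\n<p>two</p>\n' | drop_empty_paragraphs
```

Inside md2html.sed the same address and d command would simply be the last rule in the file.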

How This Fits on Projects

This concept drives the actual Markdown-to-HTML conversion. You will build regex rules for headings, lists, bold, italic, and code. In Project 5, you will see another example of using regex with hold space for whole-file transformations.

Definitions & Key Terms

Inline formatting: Formatting that appears inside a line (bold, italic, code).
Block formatting: Formatting that applies to whole lines (headings, lists).
Escaping: Replacing special characters with safe HTML entities.
Subset: A limited set of rules that the converter guarantees to support.

Mental Model Diagram (ASCII)

Markdown line -> escape text -> match syntax -> replace with HTML

How It Works (Step-by-Step)

Escape <, >, and & in the input text to HTML entities.
Detect block-level markers (#, - ) and replace them with tags.
Convert inline markers (**, *, `) using capture groups.
Ensure output is deterministic: the same input always yields the same HTML.

Minimal Concrete Example

# Convert bold markdown to HTML
sed -E 's/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g' input.md

Common Misconceptions

Misconception: Regex can fully parse Markdown.
- Correction: Regex is adequate only for a defined subset.
Misconception: Escaping can be ignored if input is trusted.
- Correction: Unescaped HTML can break output or cause security issues.

Check-Your-Understanding Questions

Why should list conversion happen before italic conversion?
What is the risk of not escaping & in HTML output?
Which Markdown constructs are hard to support with line-based tools?

Check-Your-Understanding Answers

List markers use *, which could be mistaken for italics.
& can start an entity and corrupt the HTML if not escaped.
Multi-line constructs like fenced code blocks and nested lists.

Real-World Applications

Converting internal docs to HTML for dashboards.
Creating fast previews of Markdown files.
Building minimal static site generators.

Where You’ll Apply It

In this project: See §3.1 What You Will Build and §3.5 Data Formats.
Also used in: P02-log-file-cleaner.md.

References

CommonMark spec (for understanding limitations)
“sed & awk” – substitution patterns

Key Insight

A reliable Markdown converter is one that clearly defines its supported subset and applies regex rules in a safe order.

Summary

Regex transformations are powerful but only within a scope you can guarantee.

Homework/Exercises to Practice the Concept

Convert ## Subtitle to <h2>Subtitle</h2>.
Convert *item* to italic without touching list markers.
Escape < and & in a text file.

Solutions to the Homework/Exercises

sed -E 's/^## (.*)$/<h2>\1<\/h2>/' file.md
sed -E '/^\* /! s/\*([^*]+)\*/<em>\1<\/em>/g' file.md
sed -E 's/&/\&amp;/g; s/</\&lt;/g' file.md

3. Project Specification

3.1 What You Will Build

You will build a sed script named md2html.sed and a wrapper CLI md2html that converts a constrained Markdown subset into HTML. The tool supports headings (#, ##, ###), unordered lists (- ), inline bold (**), italic (*), and inline code (`). It will not support nested lists, fenced code blocks, or tables. Output is deterministic and safe for static pages.

3.2 Functional Requirements

Convert headings: #, ##, ### into <h1>, <h2>, <h3>.
Convert unordered lists: - item into <ul><li>item</li></ul> block.
Convert inline formatting: **bold**, *italic*, and `code`.
Escape HTML: Replace <, >, & in text segments with &lt;, &gt;, &amp;.
Deterministic output: Same input produces same output every run.

3.3 Non-Functional Requirements

Performance: Handle a 1 MB Markdown file in under 1 second.
Reliability: No malformed HTML for supported subset.
Usability: Clear error messages for unsupported constructs.

3.4 Example Usage / Output

$ ./md2html README.md > README.html

3.5 Data Formats / Schemas / Protocols

Input (subset):

# Title

- item one
- item two

This is **bold** and *italic* and `code`.

Output:

<h1>Title</h1>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
<p>This is <strong>bold</strong> and <em>italic</em> and <code>code</code>.</p>

3.6 Edge Cases

Lines with * that are not italics.
Mixed bold and italic in one line.
Empty lines between paragraphs.
Lines with < or & that must be escaped.

3.7 Real World Outcome

You will have a deterministic Markdown converter suitable for small docs and internal tooling.

3.7.1 How to Run (Copy/Paste)

cat > sample.md <<'EOF'
# Title

- item one
- item two

This is **bold** and *italic* and `code`.
EOF

sed -f md2html.sed sample.md > sample.html

3.7.2 Golden Path Demo (Deterministic)

Expected output is the same for every run and matches the sample HTML exactly.

3.7.3 CLI Transcript (Success)

$ ./md2html sample.md > sample.html
$ echo $?
0
$ cat sample.html
<h1>Title</h1>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
<p>This is <strong>bold</strong> and <em>italic</em> and <code>code</code>.</p>

3.7.4 CLI Transcript (Failure: Unsupported Construct)

$ printf '```\ncode\n```\n' > code.md
$ ./md2html code.md
Error: fenced code blocks are not supported
$ echo $?
2

Exit codes:

0 success
1 usage error
2 unsupported Markdown construct
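The wrapper behind these transcripts might be sketched as follows. This is not a reference implementation: it assumes md2html.sed sits in the same directory as the wrapper, and that a line starting with three backticks is the only unsupported construct worth detecting.

```shell
cat > md2html <<'EOF'
#!/bin/sh
# Sketch of the wrapper; assumes md2html.sed lives next to this script.
if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
  echo "Usage: md2html FILE.md" >&2
  exit 1
fi
# Reject constructs outside the supported subset before converting.
if grep -q '^```' "$1"; then
  echo "Error: fenced code blocks are not supported" >&2
  exit 2
fi
exec sed -E -f "$(dirname "$0")/md2html.sed" "$1"
EOF
chmod +x md2html
```

Running the check with grep before sed keeps the sed script free of error handling, so it stays a pure transformation.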

4. Solution Architecture

4.1 High-Level Design

Markdown -> sed script -> HTML

4.2 Key Components

Component	Responsibility	Key Decisions
Script file	Convert syntax	Use ordered transformations
Wrapper CLI	Validate input and report errors	Detect unsupported constructs
Test fixtures	Verify deterministic output	Compare against golden HTML

4.3 Data Structures (No Full Code)

line = {raw_text}

4.4 Algorithm Overview

Key Algorithm: Markdown Line Transformation

Normalize whitespace and escape HTML characters.
Convert headings and lists.
Convert inline emphasis and code.
Wrap leftover lines in paragraphs.

Complexity Analysis:

Time: O(n) lines
Space: O(1) streaming

5. Implementation Guide

5.1 Development Environment Setup

printf '# Title\n' > sample.md

5.2 Project Structure

md2html/
├── md2html.sed
├── bin/
│   └── md2html
├── tests/
│   └── test-md2html.sh
└── README.md

5.3 The Core Question You’re Answering

“How can I build a reliable Markdown converter with only a line-based stream editor?”

5.4 Concepts You Must Understand First

Script files and command ordering.
Regex transforms and HTML escaping.

5.5 Questions to Guide Your Design

Which Markdown features will you support and which will you exclude?
How will you prevent list lines from being double-wrapped?
How will you escape HTML without escaping your generated tags?

5.6 Thinking Exercise

Manually convert this line into HTML:

This is **bold** and *italic*.

Write the exact output by hand, then verify that your script produces it character for character.

5.7 The Interview Questions They’ll Ask

“Why is Markdown conversion with sed limited?”
“How do you avoid converting list markers into italics?”
“What is the role of command ordering?”

5.8 Hints in Layers

Hint 1: Convert block elements before inline elements.

Hint 2: Use anchors to target headings and list items.

Hint 3: Add a final pass to wrap remaining lines in <p>.

5.9 Books That Will Help

Topic	Book	Chapter
Text processing	“The Unix Programming Environment”	Ch. 5-7
Regex mastery	“Mastering Regular Expressions”	Ch. 1-4
sed scripting	“sed & awk”	Ch. 4-6

5.10 Implementation Phases

Phase 1: Foundation (4-6 hours)

Goals:

Convert headings and simple paragraphs.
Establish script structure.

Tasks:

Write heading rules for #, ##, ###.
Wrap non-heading lines in <p>.

Checkpoint: Simple file renders correctly.

Phase 2: Core Functionality (4-8 hours)

Goals:

Add lists and inline emphasis.

Tasks:

Detect list items and wrap in <li>.
Convert **bold** and *italic*.

Checkpoint: Sample file matches expected HTML.

Phase 3: Polish & Edge Cases (2-4 hours)

Goals:

Escape HTML and handle unsupported constructs.

Tasks:

Escape <, >, & in text.
Detect fenced code blocks and return error.

Checkpoint: Tests cover supported and unsupported cases.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Escaping strategy	Escape first vs last	Escape first, then insert tags	Avoid raw HTML injection
List handling	Wrap each line vs block	Block-level `<ul>`	Produces valid HTML
Unsupported syntax	Ignore vs error	Error with exit code	Deterministic, explicit behavior

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Verify regex rules	Headings, bold, italic
Integration Tests	Full document conversion	Sample markdown to HTML
Edge Case Tests	Unsupported features	Fenced code blocks

6.2 Critical Test Cases

Heading conversion produces correct tags.
Lists are wrapped in <ul> without duplicates.
HTML escaping occurs for raw < and &.
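A golden-file check for cases like these could be sketched as below. The helper name check_golden and the fixture file names are hypothetical, and the one-rule script stands in for the real md2html.sed.

```shell
# check_golden SCRIPT INPUT EXPECTED
# Runs the sed script on INPUT and compares the result with the golden file.
check_golden() {
  if sed -E -f "$1" "$2" | diff -u "$3" - >/dev/null; then
    echo "PASS $2"
  else
    echo "FAIL $2"
    return 1
  fi
}

# Tiny fixture set for demonstration.
cat > h1.sed <<'EOF'
s/^# (.*)$/<h1>\1<\/h1>/
EOF
printf '# Title\n' > h1.md
printf '<h1>Title</h1>\n' > h1.expected.html

check_golden h1.sed h1.md h1.expected.html
# prints: PASS h1.md
```

Dropping the >/dev/null redirect shows the unified diff on failure, which is usually the fastest way to see which rule misfired.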

6.3 Test Data

# Title

- item

This is **bold** and *italic*.

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Wrong command order	Broken HTML nesting	Reorder script commands
Missing escaping	Raw `<` in output	Add escape step
Inline regex too greedy	Over-matching text	Use negated character classes such as [^*]+ or anchors (sed has no non-greedy quantifiers)

7.2 Debugging Strategies

Test line-by-line: Use sed -n '1p' sample.md | sed -f md2html.sed to run the script on a single input line and isolate its output.
Diff output: Compare with expected HTML.
Reduce scope: Debug one rule at a time.
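These strategies can be combined in a small helper, sketched here under assumptions: the name debug_line and the demo files are hypothetical. It relies on sed's l command, which prints the pattern space in an unambiguous form (tabs as \t, end of line as $).

```shell
# debug_line N FILE SCRIPT
# Shows line N of FILE with invisible characters made visible,
# then shows what SCRIPT does to that single line.
debug_line() {
  sed -n "${1}p" "$2" | sed -n 'l'
  sed -n "${1}p" "$2" | sed -E -f "$3"
}

# Demo input with a trailing space and tab that would be easy to miss.
printf '# Title \t\nbody\n' > debug.md
cat > debug.sed <<'EOF'
s/^# (.*)$/<h1>\1<\/h1>/
EOF

debug_line 1 debug.md debug.sed
# first output line: # Title \t$
```

Seeing the trailing whitespace explains why the converted heading ends with stray characters before </h1>.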

7.3 Performance Traps

Running several sed scripts as separate passes re-reads the whole input once per pass and slows down large files; keep the transformations in a single script file.

8. Extensions & Challenges

8.1 Beginner Extensions

Add support for <blockquote> using > prefix.
Add support for horizontal rules (---).

8.2 Intermediate Extensions

Support inline links [text](url).
Support images ![alt](url).

8.3 Advanced Extensions

Add support for code blocks using hold space.
Build a minimal site generator that wraps HTML in a template.

9. Real-World Connections

9.1 Industry Applications

Documentation pipelines: fast Markdown to HTML for internal docs.
Static sites: simple converters for small docs.

9.2 Related Open Source Projects

Pandoc: full-featured converter (this project is a minimal analog).
CommonMark: specification guiding robust parsers.

9.3 Interview Relevance

Parsing trade-offs: demonstrates you can define scope for parsing tasks.
Text processing: shows mastery of regex transformations.

10. Resources

10.1 Essential Reading

“sed & awk” – command ordering and script files
“Mastering Regular Expressions” – safe pattern design

10.2 Video Resources

Markdown parsing overview (video)

10.3 Tools & Documentation

GNU sed manual – script execution
CommonMark spec (reference)

10.4 Related Projects in This Series

Project 2: Log File Cleaner – uses capture groups and filtering.
Project 4: Multi-Line Address Parser – introduces hold space and branching.

11. Self-Assessment Checklist

11.1 Understanding

I can explain why command order matters in sed.
I can describe the supported Markdown subset.
I can explain how escaping prevents malformed HTML.

11.2 Implementation

The converter produces valid HTML for the supported subset.
Unsupported constructs produce clear errors.
Output is deterministic across runs.

11.3 Growth

I can extend the converter with one new Markdown feature.
I can explain the trade-offs of using sed vs a parser.

12. Submission / Completion Criteria

Minimum Viable Completion:

A working md2html.sed script that converts headings, lists, bold, italic, and code.
Deterministic HTML output for sample inputs.
Clear error handling for unsupported features.

Full Completion:

All minimum criteria plus:
Tests for all supported elements.
A wrapper CLI with helpful usage output.

Excellence (Going Above & Beyond):

Support links and images with safe escaping.
Provide a mini static-site wrapper around generated HTML.