Project 3: Basic Markdown to HTML Converter
Build a multi-command sed script that converts a constrained Markdown subset into clean, deterministic HTML.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 12-20 hours |
| Main Programming Language | sed (script file) |
| Alternative Programming Languages | awk, Python, Perl |
| Coolness Level | Level 4: Fun and Impressive |
| Business Potential | 2: Internal Automation |
| Prerequisites | Regex mastery, sed command order, basic HTML |
| Key Topics | script files, command ordering, regex transforms |
1. Learning Objectives
By completing this project, you will:
- Write multi-command sed scripts with predictable ordering.
- Convert Markdown headings, lists, and emphasis into HTML tags.
- Build a deterministic transformation pipeline with test fixtures.
- Recognize the limitations of line-based parsing and design around them.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Sed Script Files and Command Ordering
Fundamentals
sed scripts let you chain multiple editing commands in a specific order. You can pass them as -e one-liners, but for real transformations a script file (sed -f script.sed) is more readable and maintainable. The order matters because each command sees the output of the previous. When converting Markdown to HTML, you must apply heading rules before paragraph wrapping, strip list markers before surrounding the list in <ul>, and escape HTML-sensitive characters before injecting tags. A script file gives you a single, repeatable transformation pipeline.
A practical discipline is to group commands by layer: block-level transforms first, inline transforms second, and cleanup last. This mirrors how Markdown parsers work and makes your script easier to reason about.
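The layering can be sketched in a small script file. The file name layers.sed and its three rules are illustrative assumptions, not the project's required script; the point is only that the file reads top to bottom in the same order sed applies it.

```shell
# Hypothetical layered script: block rules, then inline rules, then cleanup.
cat > layers.sed <<'EOF'
# Layer 1: block-level transforms
s/^# (.*)$/<h1>\1<\/h1>/
# Layer 2: inline transforms
s/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g
# Layer 3: cleanup (drop trailing whitespace left on plain lines)
s/[[:space:]]+$//
EOF

printf '# Hello **world**\n' | sed -E -f layers.sed
# prints: <h1>Hello <strong>world</strong></h1>
```

The -E flag is needed here because the rules use unescaped ( ) groups and the + quantifier.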
Deep Dive into the Concept
Markdown conversion is a classic example of why command ordering is critical. If you replace **bold** with <strong>bold</strong> after you have already wrapped the line in <p>, you might end up with nested tags that are still valid, but if you instead escape < and > too late, you can corrupt your generated HTML. The sed execution model processes commands top to bottom for every line. That means you must design the transformation as a series of safe, composable steps. A typical approach is:
- Normalize line endings and strip trailing whitespace.
- Escape HTML-special characters (&, <, >) in the raw input text.
- Detect and convert block-level elements like headings and list markers.
- Convert inline elements like emphasis and code.
The tricky part is that some steps interact. If you escape < and > after you insert <h1> tags, you will also escape your own generated tags. That is why you must escape only the text that came from the input, not the tags you insert. In a pure sed script, the simplest strategy is to escape first and then insert tags with s commands, so that every raw < and > added afterwards is known to be markup. The alternative is to escape late and then unescape the tags you generated, but that also restores any identical tag the author typed as literal text. The key is that ordering controls which characters are treated as data versus markup.
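A minimal sketch of the escape-first strategy, assuming a hypothetical file named escape-first.sed and a single heading rule. The input text is entity-encoded before any tag exists, so the tags added afterwards are never touched by the escape rules.

```shell
# Escape-first sketch: encode the raw input, then add tags.
# The & rule runs before the < and > rules so that the ampersands
# those rules introduce are not escaped a second time.
cat > escape-first.sed <<'EOF'
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
s/^# (.*)$/<h1>\1<\/h1>/
EOF

printf '# Tom & Jerry <3\n' | sed -E -f escape-first.sed
# prints: <h1>Tom &amp; Jerry &lt;3</h1>
```

In the replacement side of an s command a bare & means "the whole match", which is why each entity is written with a leading \&.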
Script files also allow labels and branches. While not strictly necessary for a basic converter, labels can help you build more readable logic, such as only wrapping lines in <p> when they are not already converted to a block element. You can implement this by setting a flag in the hold space or by using branches that skip over paragraph conversion when a heading pattern matches. This is where script files shine: you can define a clear top-down flow rather than a tangled one-liner.
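One way the branching idea might look, assuming a hypothetical file branch.sed that knows only level-1 headings and paragraphs. The b command jumps to the label done, so a converted heading never reaches the paragraph rule.

```shell
cat > branch.sed <<'EOF'
# Leave blank lines alone (b without a label jumps to the end of the script).
/^$/b
# Headings: convert, then jump past the paragraph rule.
/^# /{
  s/^# (.*)$/<h1>\1<\/h1>/
  b done
}
# Anything that reaches this point is ordinary text.
s/^(.*)$/<p>\1<\/p>/
:done
EOF

printf '# Title\n\nplain text\n' | sed -E -f branch.sed
# prints:
# <h1>Title</h1>
#
# <p>plain text</p>
```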
Finally, script files make testing easier. You can keep a sample Markdown input and run sed -f md2html.sed sample.md, then compare output to an expected HTML file. This aligns with deterministic transformations and makes debugging simpler.
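A golden-file check along these lines could be as small as the sketch below. The file names (golden-demo.sed, golden-demo.md, golden-demo.expected.html) and the one-rule script are placeholders chosen for illustration.

```shell
# Hypothetical one-rule script plus an input fixture and its expected output.
cat > golden-demo.sed <<'EOF'
s/^# (.*)$/<h1>\1<\/h1>/
EOF
printf '# Title\n' > golden-demo.md
printf '<h1>Title</h1>\n' > golden-demo.expected.html

# diff exits 0 only when the actual output matches the golden file exactly.
if sed -E -f golden-demo.sed golden-demo.md | diff -u golden-demo.expected.html -; then
  echo "PASS"
else
  echo "FAIL"
fi
# prints: PASS
```

When the output drifts, diff -u prints the differing lines, which usually points straight at the rule that misfired.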
For a converter, ordering also affects list handling. If you wrap list items in <li> first, you still need to add <ul> before and </ul> after the list. In sed, you can do this by detecting the first list line and inserting a <ul>, then detecting the transition out of a list and inserting </ul>. This often requires labels and branching, or a state flag stored in the hold space. Even if you keep the list logic simple, understanding this block-level ordering helps you avoid malformed HTML that mixes <p> and <ul> incorrectly.
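The hold-space flag can be sketched as follows. This is one possible arrangement, not the only one: the file name list-wrap.sed and the flag value L are assumptions, nested lists are ignored, and a blank line between items would close and reopen the list.

```shell
cat > list-wrap.sed <<'EOF'
# Hold space is "L" while inside a list and empty otherwise.
/^- /{
  s/^- (.*)$/<li>\1<\/li>/
  # Swap in the flag; if it is not set, this is the first item.
  x
  /^L$/!{
    s/.*/L/
    i\
<ul>
  }
  # Swap the list item back into the pattern space.
  x
  # A list that runs to the end of the file still needs closing.
  $a\
</ul>
  b
}
# Non-list line: if the flag is set, the list has just ended.
x
/^L$/{
  s/.*//
  i\
</ul>
}
x
EOF

printf '# Title\n- item one\n- item two\nafter\n- last\n' | sed -E -f list-wrap.sed
# prints:
# # Title
# <ul>
# <li>item one</li>
# <li>item two</li>
# </ul>
# after
# <ul>
# <li>last</li>
# </ul>
```

The i\ command writes its text immediately, before the current line is printed, while a\ queues its text until after the line. That difference is what puts the opening tag before the first item and the closing tag after the last one.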
Another ordering concern is paragraph handling. If you wrap every non-blank line in <p>, you must avoid wrapping headings and list items. A common sed strategy is to tag lines that are already block elements (for example, prefix them with a marker), then only wrap untagged lines, and finally remove the markers. This keeps paragraph logic separate from heading and list logic while preserving readability.
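The marker technique might look like this. The marker string @@BLOCK@@ and the file name markers.sed are arbitrary choices for the sketch; any token that cannot occur in real input would do.

```shell
cat > markers.sed <<'EOF'
# Tag block-level output with a marker while converting it.
s/^# (.*)$/@@BLOCK@@<h1>\1<\/h1>/
# Wrap non-blank lines that carry no marker.
/^@@BLOCK@@/!s/^(.+)$/<p>\1<\/p>/
# Remove the markers once paragraph wrapping is done.
s/^@@BLOCK@@//
EOF

printf '# Title\nplain text\n\n' | sed -E -f markers.sed
# prints:
# <h1>Title</h1>
# <p>plain text</p>
# (followed by the untouched blank line)
```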
How This Fits on Projects
In this project, the entire converter is a script file with multiple transformation steps. You will rely on precise command ordering to avoid malformed HTML. In Project 4, you will use more advanced control flow (branches and hold space), which builds directly on this skill.
Definitions & Key Terms
- Script file: A file of sed commands executed with -f.
- Command order: The sequence in which transformations are applied.
- Block-level element: HTML tags like <h1>, <ul>, <p>.
- Inline element: HTML tags like <strong>, <em>, <code>.
Mental Model Diagram (ASCII)
Input line -> [cmd1] -> [cmd2] -> [cmd3] -> output line
How It Works (Step-by-Step)
- sed reads a line into the pattern space.
- It applies command 1 (e.g., normalize whitespace).
- It applies command 2 (e.g., heading conversion).
- It applies command 3 (e.g., inline emphasis).
- The transformed line prints (or is stored).
Minimal Concrete Example
# Convert Markdown headings to HTML
sed -E 's/^# (.*)$/<h1>\1<\/h1>/' input.md
Common Misconceptions
- Misconception: Command order does not matter.
- Correction: Later commands see the output of earlier ones, so order is essential.
- Misconception: One-liners are always better.
- Correction: Complex transformations are safer in script files.
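The first misconception is easy to test. The sketch below runs the same two rules in both orders on one input line; the rules themselves are illustrative.

```shell
line='# Title'

# Order A: heading rule first, then wrap anything that is not already an <h1>.
printf '%s\n' "$line" |
  sed -E -e 's/^# (.*)$/<h1>\1<\/h1>/' -e '/^<h1>/!s/^(.*)$/<p>\1<\/p>/'
# prints: <h1>Title</h1>

# Order B: the same two rules swapped.
printf '%s\n' "$line" |
  sed -E -e '/^<h1>/!s/^(.*)$/<p>\1<\/p>/' -e 's/^# (.*)$/<h1>\1<\/h1>/'
# prints: <p># Title</p>
```

In order B the paragraph rule runs while the line is still raw Markdown, and the anchored heading pattern then no longer matches because the line begins with <p>.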
Check-Your-Understanding Questions
- Why should heading conversion run before paragraph wrapping?
- What is the risk of escaping < after you insert tags?
- How does -f improve maintainability?
Check-Your-Understanding Answers
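- Otherwise the heading line is already wrapped in <p>, so the anchored heading pattern no longer matches, or the output ends up with a heading nested inside a paragraph.
- You will escape your own tags and break the HTML.
- It keeps the transformation in one readable file that can be commented and version-controlled.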
Real-World Applications
- Building static site generators for small docs.
- Converting README files into HTML for dashboards.
- Creating lightweight documentation pipelines.
Where You’ll Apply It
- In this project: See §3.2 Functional Requirements and §5.10 Implementation Phases.
- Also used in: P04-multi-line-address-parser.md.
References
- GNU sed manual – script files and -f
- “sed & awk” – multi-command scripts
Key Insight
Order is the architecture of a sed script; the same commands in the wrong order produce broken output.
Summary
Script files and command ordering let you design transformations as a clear pipeline instead of a fragile one-liner.
Homework/Exercises to Practice the Concept
- Write a script that turns # Title into <h1>Title</h1>.
- Add a second command to wrap any line not starting with # in <p>.
- Create a script file and run it with sed -f.
Solutions to the Homework/Exercises
- sed -E 's/^# (.*)$/<h1>\1<\/h1>/' file.md
- sed -E '/^#/! s/^(.*)$/<p>\1<\/p>/' file.md
- sed -E -f md2html.sed file.md (keep -E if the script file uses the unescaped groups shown above)
2.2 Regex-Based Markdown Transforms and Escaping
Fundamentals
Markdown is mostly syntax sugar over plain text. It uses markers like #, *, and backticks to indicate formatting. A sed converter relies on regex patterns to detect those markers and replace them with HTML tags. Inline elements like **bold** and *italic* can be transformed with capture groups. Block elements like lists require recognizing line prefixes (- or * ). Escaping is equally important: if the input contains < or &, each must be converted to &lt; or &amp; so it does not break the HTML output.
Escaping is not optional because HTML is not just formatting; it is a language with its own syntax. If you do not escape special characters, your output can become invalid or even unsafe.
Deep Dive into the Concept
Regex-based Markdown conversion is not a full parser; it is a set of disciplined transformations. The key is to define a constrained subset of Markdown that is line-oriented and therefore compatible with sed. For example, you can support:
- Headings: #, ##, ### at line start
- Unordered lists: lines starting with - or *
- Bold: **text**
- Italic: *text* (with careful avoidance of list markers)
- Inline code: `code`
Each of these can be implemented with sed substitutions and capture groups. The challenge is avoiding conflicts. For instance, * is used both for italics and for list markers. If you apply italic conversion before list handling, you may accidentally wrap list markers in <em>. The solution is to anchor list patterns to the start of the line, transform lists first, and only then run inline formatting on the remaining text. That means the order of regex transforms must mirror the precedence rules of Markdown.
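The conflict between list markers and italics can be reproduced with two rules. This is a sketch with illustrative patterns; the list rule here accepts either * or - as the marker.

```shell
line='* item with *emphasis*'

# List rule first: the leading marker is consumed before the italic rule runs.
printf '%s\n' "$line" |
  sed -E -e 's/^[*-] (.*)$/<li>\1<\/li>/' -e 's/\*([^*]+)\*/<em>\1<\/em>/g'
# prints: <li>item with <em>emphasis</em></li>

# Italic rule first: the list marker is mistaken for an opening asterisk.
printf '%s\n' "$line" |
  sed -E -e 's/\*([^*]+)\*/<em>\1<\/em>/g' -e 's/^[*-] (.*)$/<li>\1<\/li>/'
# prints: <em> item with </em>emphasis*
```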
Escaping is another critical concern. HTML treats <, >, and & as special. If your Markdown contains 2 < 3, you must convert it to 2 &lt; 3 before output. Escape & before < and >, otherwise the ampersands in the entities you just produced are escaped a second time. You must also avoid escaping the tags you insert. One practical strategy is to escape first and only then insert tags. Another is to escape late and selectively unescape the HTML tags you introduced, for example by replacing &lt;h1&gt; back to <h1>. Both approaches are workable; the choice depends on how complex your script is.
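The order of the escape rules themselves also matters. A short sketch, using only plain substitutions, shows what happens when the ampersand rule runs last instead of first:

```shell
# Ampersand first: each special character is encoded exactly once.
printf '2 < 3 & 4 > 1\n' | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# prints: 2 &lt; 3 &amp; 4 &gt; 1

# Ampersand last: the & inside the freshly written entity is encoded again.
printf '2 < 3\n' | sed -e 's/</\&lt;/g' -e 's/&/\&amp;/g'
# prints: 2 &amp;lt; 3
```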
Because sed is line-based, multi-line constructs like fenced code blocks or nested lists are difficult. The project scope should explicitly exclude these or treat them with simplified rules. This is an important engineering decision: a tool is only correct within its defined boundaries. By documenting the supported Markdown subset, you make the converter reliable rather than “almost correct”. This is a key lesson for any text transformation system.
Inline formatting has edge cases: nested emphasis, multiple bold segments on the same line, or asterisks inside code spans. A sed-based converter cannot handle every combination, but you can still make it predictable by declaring rules. For example, convert code spans first, then keep the italic and bold patterns from firing inside <code> tags. A placeholder approach works well: replace code spans (or the special characters inside them) with a sentinel token, apply the other inline conversions, then restore the original text. This preserves code literals, keeps emphasis rules from corrupting them, and keeps your output stable without pretending to fully parse Markdown.
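A variation on the placeholder idea is sketched below. Rather than stashing the whole code span, it masks only the asterisks inside it. The token @@STAR@@ and the file name code-spans.sed are assumptions, and the loop relies on the text containing no raw < (which holds if escaping has already run).

```shell
cat > code-spans.sed <<'EOF'
# 1. Convert code spans to <code> elements first.
s/`([^`]+)`/<code>\1<\/code>/g
# 2. Mask each * that sits inside a <code> element, one per loop pass.
:mask
s/(<code>[^<*]*)\*([^<]*<\/code>)/\1@@STAR@@\2/
t mask
# 3. Emphasis rules now only see asterisks outside code.
s/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g
s/\*([^*]+)\*/<em>\1<\/em>/g
# 4. Restore the masked asterisks.
s/@@STAR@@/*/g
EOF

printf 'Run `ls *.md` for *all* files\n' | sed -E -f code-spans.sed
# prints: Run <code>ls *.md</code> for <em>all</em> files
```

Without step 2, the italic rule would pair the asterisk inside the code span with the one before "all" and emit tags that cross the </code> boundary.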
If you later add support for links ([text](url)), the same ordering lesson applies: convert links before emphasis, otherwise the * or _ inside URLs can trigger italic rules. This shows why a clear precedence list is essential, even in a constrained Markdown subset.
A final guard is to run a cleanup pass that removes empty <p></p> elements created by consecutive blank lines. This is a good example of a post-processing rule that depends on all earlier transformations being complete.
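Such a cleanup pass can be a single delete rule placed at the end of the script. A minimal sketch, shown here as a standalone command on already-converted lines:

```shell
# Delete lines that consist of nothing but an empty paragraph element.
printf '<p>one</p>\n<p></p>\n<p></p>\n<p>two</p>\n' | sed '/^<p><\/p>$/d'
# prints:
# <p>one</p>
# <p>two</p>
```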
How This Fits on Projects
This concept drives the actual Markdown-to-HTML conversion. You will build regex rules for headings, lists, bold, italic, and code. In Project 5, you will see another example of using regex with hold space for whole-file transformations.
Definitions & Key Terms
- Inline formatting: Formatting that appears inside a line (bold, italic, code).
- Block formatting: Formatting that applies to whole lines (headings, lists).
- Escaping: Replacing special characters with safe HTML entities.
- Subset: A limited set of rules that the converter guarantees to support.
Mental Model Diagram (ASCII)
Markdown line -> escape text -> match syntax -> replace with HTML
How It Works (Step-by-Step)
- Escape &, <, and > in the input text as HTML entities.
- Detect block-level markers (#, -) and replace them with tags.
- Convert inline markers (**, *, `) using capture groups.
- Ensure output is deterministic: the same input always yields the same HTML.
Minimal Concrete Example
# Convert bold markdown to HTML
sed -E 's/\*\*([^*]+)\*\*/<strong>\1<\/strong>/g' input.md
Common Misconceptions
- Misconception: Regex can fully parse Markdown.
- Correction: Regex is adequate only for a defined subset.
- Misconception: Escaping can be ignored if input is trusted.
- Correction: Unescaped HTML can break output or cause security issues.
Check-Your-Understanding Questions
- Why should list conversion happen before italic conversion?
- What is the risk of not escaping & in HTML output?
- Which Markdown constructs are hard to support with line-based tools?
Check-Your-Understanding Answers
- List markers use *, which could be mistaken for italics.
- & can start an entity and corrupt the HTML if not escaped.
- Multi-line constructs like fenced code blocks and nested lists.
Real-World Applications
- Converting internal docs to HTML for dashboards.
- Creating fast previews of Markdown files.
- Building minimal static site generators.
Where You’ll Apply It
- In this project: See §3.1 What You Will Build and §3.5 Data Formats.
- Also used in: P02-log-file-cleaner.md.
References
- CommonMark spec (for understanding limitations)
- “sed & awk” – substitution patterns
Key Insight
A reliable Markdown converter is one that clearly defines its supported subset and applies regex rules in a safe order.
Summary
Regex transformations are powerful but only within a scope you can guarantee.
Homework/Exercises to Practice the Concept
- Convert ## Subtitle to <h2>Subtitle</h2>.
- Convert *item* to italic without touching list markers.
- Escape < and & in a text file.
Solutions to the Homework/Exercises
- sed -E 's/^## (.*)$/<h2>\1<\/h2>/' file.md
- sed -E '/^\* /! s/\*([^*]+)\*/<em>\1<\/em>/g' file.md
- sed -E 's/&/\&amp;/g; s/</\&lt;/g' file.md
3. Project Specification
3.1 What You Will Build
You will build a sed script named md2html.sed and a wrapper CLI md2html that converts a constrained Markdown subset into HTML. The tool supports headings (#, ##, ###), unordered lists (- ), inline bold (**), italic (*), and inline code (`). It will not support nested lists, fenced code blocks, or tables. Output is deterministic and safe for static pages.
3.2 Functional Requirements
- Convert headings: #, ##, ### into <h1>, <h2>, <h3>.
- Convert unordered lists: consecutive - item lines into a single <ul><li>item</li></ul> block.
- Convert inline emphasis and code: **bold**, *italic*, and `code`.
- Escape HTML: Replace <, >, & in text segments.
- Deterministic output: Same input produces same output every run.
3.3 Non-Functional Requirements
- Performance: Handle a 1 MB Markdown file in under 1 second.
- Reliability: No malformed HTML for supported subset.
- Usability: Clear error messages for unsupported constructs.
3.4 Example Usage / Output
$ ./md2html README.md > README.html
3.5 Data Formats / Schemas / Protocols
Input (subset):
# Title
- item one
- item two
This is **bold** and *italic* and `code`.
Output:
<h1>Title</h1>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
<p>This is <strong>bold</strong> and <em>italic</em> and <code>code</code>.</p>
3.6 Edge Cases
- Lines with * that are not italics.
- Mixed bold and italic in one line.
- Empty lines between paragraphs.
- Lines with < or & that must be escaped.
3.7 Real World Outcome
You will have a deterministic Markdown converter suitable for small docs and internal tooling.
3.7.1 How to Run (Copy/Paste)
cat > sample.md <<'EOF'
# Title
- item one
- item two
This is **bold** and *italic* and `code`.
EOF
sed -f md2html.sed sample.md > sample.html
3.7.2 Golden Path Demo (Deterministic)
Expected output is the same for every run and matches the sample HTML exactly.
3.7.3 CLI Transcript (Success)
$ ./md2html sample.md > sample.html
$ echo $?
0
$ cat sample.html
<h1>Title</h1>
<ul>
<li>item one</li>
<li>item two</li>
</ul>
<p>This is <strong>bold</strong> and <em>italic</em> and <code>code</code>.</p>
3.7.4 CLI Transcript (Failure: Unsupported Construct)
$ printf '```\ncode\n```\n' > code.md
$ ./md2html code.md
Error: fenced code blocks are not supported
$ echo $?
2
Exit codes:
- 0: success
- 1: usage error
- 2: unsupported Markdown construct
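A wrapper that produces these exit codes might be sketched as follows. The file name md2html-demo, the MD2HTML_SED variable, and the use of -E are assumptions made for the example; only the error message and the exit codes come from the transcript above.

```shell
cat > md2html-demo <<'EOF'
#!/bin/sh
# Hypothetical wrapper: validate the input, then delegate to the sed script.
script="${MD2HTML_SED:-md2html.sed}"   # assumed location of the sed script

if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
  echo "Usage: md2html FILE" >&2
  exit 1
fi

# A line that starts with three backticks opens a fenced code block.
if grep -Eq '^`{3}' "$1"; then
  echo "Error: fenced code blocks are not supported" >&2
  exit 2
fi

exec sed -E -f "$script" "$1"
EOF

# Build a test input; \140 is the octal escape for a backtick.
printf '\140\140\140\ncode\n\140\140\140\n' > code.md
sh ./md2html-demo code.md || echo "exit code: $?"
# prints the error message on stderr, then: exit code: 2
```

Checking for unsupported constructs in the wrapper keeps the sed script itself focused on transformation, which is the split the Key Components table describes.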
4. Solution Architecture
4.1 High-Level Design
Markdown -> sed script -> HTML
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Script file | Convert syntax | Use ordered transformations |
| Wrapper CLI | Validate input and report errors | Detect unsupported constructs |
| Test fixtures | Verify deterministic output | Compare against golden HTML |
4.3 Data Structures (No Full Code)
line = {raw_text}
4.4 Algorithm Overview
Key Algorithm: Markdown Line Transformation
- Normalize whitespace and escape HTML characters.
- Convert headings and lists.
- Convert inline emphasis and code.
- Wrap leftover lines in paragraphs.
Complexity Analysis:
- Time: O(n) lines
- Space: O(1) streaming
5. Implementation Guide
5.1 Development Environment Setup
printf '# Title\n' > sample.md
5.2 Project Structure
md2html/
├── md2html.sed
├── bin/
│ └── md2html
├── tests/
│ └── test-md2html.sh
└── README.md
5.3 The Core Question You’re Answering
“How can I build a reliable Markdown converter with only a line-based stream editor?”
5.4 Concepts You Must Understand First
- Script files and command ordering.
- Regex transforms and HTML escaping.
5.5 Questions to Guide Your Design
- Which Markdown features will you support and which will you exclude?
- How will you prevent list lines from being double-wrapped?
- How will you escape HTML without escaping your generated tags?
5.6 Thinking Exercise
Manually convert this line into HTML:
This is **bold** and *italic*.
Write the exact output and verify that your regex matches it.
5.7 The Interview Questions They’ll Ask
- “Why is Markdown conversion with sed limited?”
- “How do you avoid converting list markers into italics?”
- “What is the role of command ordering?”
5.8 Hints in Layers
Hint 1: Convert block elements before inline elements.
Hint 2: Use anchors to target headings and list items.
Hint 3: Add a final pass to wrap remaining lines in <p>.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Text processing | “The Unix Programming Environment” | Ch. 5-7 |
| Regex mastery | “Mastering Regular Expressions” | Ch. 1-4 |
| sed scripting | “sed & awk” | Ch. 4-6 |
5.10 Implementation Phases
Phase 1: Foundation (4-6 hours)
Goals:
- Convert headings and simple paragraphs.
- Establish script structure.
Tasks:
- Write heading rules for #, ##, ###.
- Wrap non-heading lines in <p>.
Checkpoint: Simple file renders correctly.
Phase 2: Core Functionality (4-8 hours)
Goals:
- Add lists and inline emphasis.
Tasks:
- Detect list items and wrap in <li>.
- Convert **bold** and *italic*.
Checkpoint: Sample file matches expected HTML.
Phase 3: Polish & Edge Cases (2-4 hours)
Goals:
- Escape HTML and handle unsupported constructs.
Tasks:
- Escape <, >, & in text.
- Detect fenced code blocks and return error.
Checkpoint: Tests cover supported and unsupported cases.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Escaping strategy | Escape first vs last | Escape first, then insert tags | Avoid raw HTML injection |
| List handling | Wrap each line vs block | Block-level <ul> | Produces valid HTML |
| Unsupported syntax | Ignore vs error | Error with exit code | Deterministic, explicit behavior |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Verify regex rules | Headings, bold, italic |
| Integration Tests | Full document conversion | Sample markdown to HTML |
| Edge Case Tests | Unsupported features | Fenced code blocks |
6.2 Critical Test Cases
- Heading conversion produces correct tags.
- Lists are wrapped in <ul> without duplicates.
- HTML escaping occurs for raw < and &.
6.3 Test Data
# Title
- item
This is **bold** and *italic*.
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong command order | Broken HTML nesting | Reorder script commands |
| Missing escaping | Raw < in output | Add escape step |
| Inline regex too greedy | Over-matching text | Use negated character classes such as [^*]+ or anchors (sed has no non-greedy quantifiers) |
7.2 Debugging Strategies
- Test line-by-line: Use sed -n '1p' sample.md | sed -f md2html.sed to run a single input line through the script.
- Diff output: Compare with expected HTML.
- Reduce scope: Debug one rule at a time.
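When a rule refuses to match, invisible characters are a common cause. The sketch below isolates one line of a hypothetical fixture (debug-demo.md) and prints it with sed's l command, which shows a tab as \t and marks the end of the line with $.

```shell
# Line 2 ends with a space and a tab that an anchored pattern would trip over.
printf 'first\n# Title \t\nthird\n' > debug-demo.md

# Isolate line 2, then print it in unambiguous form.
sed -n '2p' debug-demo.md | sed -n 'l'
# prints: # Title \t$
```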
7.3 Performance Traps
- Running multiple scripts separately slows down large files; keep transformations in a single script file.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add support for <blockquote> using > prefix.
- Add support for horizontal rules (---).
8.2 Intermediate Extensions
- Support inline links [text](url).
- Support images ![alt](url).
8.3 Advanced Extensions
- Add support for code blocks using hold space.
- Build a minimal site generator that wraps HTML in a template.
9. Real-World Connections
9.1 Industry Applications
- Documentation pipelines: fast Markdown to HTML for internal docs.
- Static sites: simple converters for small docs.
9.2 Related Open Source Projects
- Pandoc: full-featured converter (this project is a minimal analog).
- CommonMark: specification guiding robust parsers.
9.3 Interview Relevance
- Parsing trade-offs: demonstrates you can define scope for parsing tasks.
- Text processing: shows mastery of regex transformations.
10. Resources
10.1 Essential Reading
- “sed & awk” – command ordering and script files
- “Mastering Regular Expressions” – safe pattern design
10.2 Video Resources
- Markdown parsing overview (video)
10.3 Tools & Documentation
- GNU sed manual – script execution
- CommonMark spec (reference)
10.4 Related Projects in This Series
- Project 2: Log File Cleaner – uses capture groups and filtering.
- Project 4: Multi-Line Address Parser – introduces hold space and branching.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why command order matters in sed.
- I can describe the supported Markdown subset.
- I can explain how escaping prevents malformed HTML.
11.2 Implementation
- The converter produces valid HTML for the supported subset.
- Unsupported constructs produce clear errors.
- Output is deterministic across runs.
11.3 Growth
- I can extend the converter with one new Markdown feature.
- I can explain the trade-offs of using sed vs a parser.
12. Submission / Completion Criteria
Minimum Viable Completion:
- A working md2html.sed script that converts headings, lists, bold, italic, and code.
- Deterministic HTML output for sample inputs.
- Clear error handling for unsupported features.
Full Completion:
- All minimum criteria plus:
- Tests for all supported elements.
- A wrapper CLI with helpful usage output.
Excellence (Going Above & Beyond):
- Support links and images with safe escaping.
- Provide a mini static-site wrapper around generated HTML.