Project 7: Parser Combinator Library

Project 7: Parser Combinator Library

Building the Crown Jewel of Functional Programming

  • Main Programming Language: Haskell
  • Alternative Languages: Scala, OCaml, F#
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parsing / Combinators
  • Estimated Time: 2-3 weeks
  • Prerequisites: Functor, Applicative, Monad, Alternative type classes; Basic recursion; Understanding of grammars

Learning Objectives

After completing this project, you will be able to:

  1. Design parser types from first principles - Derive the Parser a = String -> Maybe (a, String) type and understand why each component is necessary
  2. Implement Functor, Applicative, and Monad instances for parsers - Write these instances by hand and explain what each operation means for parsing
  3. Use the Alternative type class for choice - Implement <|> for backtracking and understand when parsers should backtrack
  4. Build complex parsers from simple primitives - Combine satisfy, char, string into parsers for JSON, arithmetic, or programming languages
  5. Handle parse errors gracefully - Implement meaningful error messages with position tracking
  6. Understand the relationship between grammars and combinators - Translate BNF/EBNF directly into combinator code
  7. Apply monadic parsing to real-world formats - Parse JSON, CSV, configuration files, or simple programming languages

Conceptual Foundation

What Is Parsing?

Parsing is the process of transforming unstructured text into structured data. When you read the string "[1, 2, 3]" and produce a Haskell list [1, 2, 3], you are parsing. When a compiler reads "x + y * z" and produces an abstract syntax tree respecting operator precedence, it is parsing.

The fundamental challenge of parsing is ambiguity and structure recovery. Text is linear—a sequence of characters. But the data it represents often has hierarchical structure: nested lists, nested function calls, nested expressions. The parser’s job is to recover this hidden structure.

Traditional Approaches vs. Combinators

Historically, parsing was done with specialized tools:

  1. Hand-written recursive descent parsers - Tedious, error-prone, but understandable
  2. Parser generators (yacc, bison, ANTLR) - Powerful but require learning a separate grammar language
  3. Regular expressions - Limited to regular languages (no nesting)

Parser combinators offer a different approach: parsers as first-class values. Instead of describing your grammar in a separate DSL, you write it directly in your host language. The grammar is the code.

-- Traditional grammar definition (BNF):
-- expr   ::= term (('+' | '-') term)*
-- term   ::= factor (('*' | '/') factor)*
-- factor ::= number | '(' expr ')'

-- Parser combinator version (this is actual Haskell code):
expr :: Parser Expr
expr = chainl1 term addOp

term :: Parser Expr
term = chainl1 factor mulOp

factor :: Parser Expr
factor = number <|> parens expr

The combinator version is not a separate specification—it’s executable code that directly performs the parsing.

The Parser Type: Building Intuition

What should the type of a parser be? Let’s derive it step by step.

Attempt 1: Parser as a function

A parser takes a string and produces something:

type Parser a = String -> a

But this doesn’t handle failure. What if the input isn’t valid?

Attempt 2: Adding failure with Maybe

type Parser a = String -> Maybe a

Better! Now Nothing represents parse failure. But we have a problem: after parsing part of the input, we need to know what’s left to continue parsing.

Attempt 3: Returning remaining input

type Parser a = String -> Maybe (a, String)

Now this is useful! A parser takes a string, potentially parses some prefix, and returns both the parsed value and the unconsumed remainder.

For example, parsing an integer from "123abc":

  • Input: "123abc"
  • Output: Just (123, "abc")

This is the classic parser combinator type. Let’s make it a proper newtype:

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

Understanding the Parser Type Deeply

Let’s examine what each part of String -> Maybe (a, String) contributes:

                    String -> Maybe (a, String)
                    ------    -----  ----------
                      |         |        |
                      |         |        +-- (a, String): Result type
                      |         |            a: The parsed value
                      |         |            String: Remaining input
                      |         |
                      |         +-- Maybe: Handles failure
                      |             Just: Successful parse
                      |             Nothing: Parse failure
                      |
                      +-- String: Input to consume

Why a tuple? Because parsers are incremental. After one parser succeeds, the next parser continues from where it left off:

Input:    "123abc"
          ^^^
Parser 1: integer
Output 1: Just (123, "abc")
                     ^^^^
                     Passed to Parser 2

Input:    "abc"
          ^^^
Parser 2: letters
Output 2: Just ("abc", "")

This is the essence of sequential composition in parsing.

The Primitive Parsers

Every parser combinator library is built from a few fundamental primitives:

1. satisfy - The Foundation

satisfy :: (Char -> Bool) -> Parser Char
satisfy predicate = Parser $ \input ->
    case input of
        (c:cs) | predicate c -> Just (c, cs)
        _                    -> Nothing

satisfy consumes one character if it satisfies a predicate. All other character-based parsers are built on top of it:

char :: Char -> Parser Char
char c = satisfy (== c)

digit :: Parser Char
digit = satisfy isDigit

letter :: Parser Char
letter = satisfy isAlpha

anyChar :: Parser Char
anyChar = satisfy (const True)

2. pure (or return) - Success Without Consuming

instance Applicative Parser where
    pure x = Parser $ \input -> Just (x, input)
    -- ...

pure always succeeds with a given value, consuming no input. This is essential for constructing results:

-- Parse "true" and return the boolean True
trueParser :: Parser Bool
trueParser = string "true" *> pure True

3. empty (or mzero) - Guaranteed Failure

instance Alternative Parser where
    empty = Parser $ \_ -> Nothing
    -- ...

empty always fails. Useful for conditional failure:

-- Parse a digit, but fail if it's zero
nonZeroDigit :: Parser Char
nonZeroDigit = satisfy isDigit >>= \c ->
    if c == '0' then empty else pure c

Functor: Transforming Parse Results

The Functor instance lets you transform what a parser returns:

instance Functor Parser where
    fmap f (Parser p) = Parser $ \input ->
        case p input of
            Nothing        -> Nothing
            Just (a, rest) -> Just (f a, rest)

Visualized:

                 Parser a
                    |
                    | fmap f
                    v
                 Parser b

Input: "123abc"
       ^^^
Parser (digit): Just ('1', "23abc")
                      ^^^
fmap digitToInt:      Just (1, "23abc")
                            ^
                            Transformed!

Common uses:

-- Parse a digit character, convert to Int
digitAsInt :: Parser Int
digitAsInt = fmap digitToInt digit

-- Parse quoted string, drop the quotes
quotedString :: Parser String
quotedString = fmap (init . tail) (char '"' *> many (satisfy (/= '"')) <* char '"')
-- Or better: char '"' *> many (satisfy (/= '"')) <* char '"'

Applicative: Sequencing Independent Parsers

The Applicative instance sequences parsers and combines their results:

instance Applicative Parser where
    pure x = Parser $ \input -> Just (x, input)

    (Parser pf) <*> (Parser pa) = Parser $ \input ->
        case pf input of
            Nothing        -> Nothing
            Just (f, rest) -> case pa rest of
                Nothing         -> Nothing
                Just (a, rest') -> Just (f a, rest')

The key insight: <*> runs the first parser, then runs the second on the remaining input, and combines the results:

Parser (a -> b)  <*>  Parser a  =  Parser b

Input: "ABCdef"
       ^^^
Parser 1: Parse "ABC", return (\x -> "ABC" ++ x)
Output 1: Just ((\x -> "ABC" ++ x), "def")
                                    ^^^
                                    Remaining input

Input for Parser 2: "def"
                    ^^^
Parser 2: Parse "def", return "def"
Output 2: Just ("def", "")

Combined: (\x -> "ABC" ++ x) applied to "def"
Result: Just ("ABCdef", "")

Applicative operators for convenience:

(*>) :: Parser a -> Parser b -> Parser b  -- Run both, keep right
(<*) :: Parser a -> Parser b -> Parser a  -- Run both, keep left
(<$>) :: (a -> b) -> Parser a -> Parser b -- Same as fmap

Parsing a point (3,4):

data Point = Point Int Int

point :: Parser Point
point = Point
    <$> (char '(' *> integer)
    <*> (char ',' *> integer <* char ')')

-- Or with ApplicativeDo:
point = do
    char '('
    x <- integer
    char ','
    y <- integer
    char ')'
    pure (Point x y)

Monad: Context-Sensitive Parsing

Sometimes you need the result of one parser to decide what to parse next. This requires Monad:

instance Monad Parser where
    return = pure

    (Parser pa) >>= f = Parser $ \input ->
        case pa input of
            Nothing        -> Nothing
            Just (a, rest) -> runParser (f a) rest

The crucial difference from Applicative: with >>=, the second parser is computed from the first result.

Example: Parsing length-prefixed data

Format: "3:abc" means "3 characters follow: abc"
        "5:hello" means "5 characters follow: hello"
lengthPrefixed :: Parser String
lengthPrefixed = do
    n <- integer       -- Parse the length
    char ':'
    count n anyChar    -- Parse exactly n characters
                       -- 'count n' depends on the VALUE of n!

This is impossible with just Applicative because count n needs the value n that was parsed.

When to use Monad vs Applicative:

  • Use Applicative when the structure of your parser is fixed (most grammars)
  • Use Monad when the structure depends on parsed values (length prefixes, version-dependent formats)

Applicative is faster because parsers can be analyzed statically. Prefer it when possible.

Alternative: Choice and Backtracking

The Alternative instance provides choice between parsers:

instance Alternative Parser where
    empty = Parser $ \_ -> Nothing

    (Parser p1) <|> (Parser p2) = Parser $ \input ->
        case p1 input of
            Nothing -> p2 input  -- p1 failed, try p2 on ORIGINAL input
            result  -> result    -- p1 succeeded, use its result

Critical concept: Backtracking

When p1 fails, p2 gets the original input, not what p1 consumed. This is called backtracking.

Input: "false"
Parser 1: string "true"
  Attempt: 'f' != 't', fails immediately
  Consumed: nothing

Parser 2: string "false"  -- Gets ORIGINAL input
  Attempt: matches!
  Result: Just ("false", "")

Derived combinators from Alternative:

many :: Parser a -> Parser [a]   -- Zero or more
some :: Parser a -> Parser [a]   -- One or more
optional :: Parser a -> Parser (Maybe a)

-- Implementations:
many p = some p <|> pure []
some p = (:) <$> p <*> many p

The Problem with Naive Backtracking

Backtracking can be expensive. Consider:

keyword :: Parser String
keyword = string "function" <|> string "for" <|> string "func"

For input "function":

  1. Try "function" - matches!

For input "forward":

  1. Try "function" - ‘f’, ‘o’, ‘r’ match, then ‘w’ != ‘c’, fail after consuming 3 chars
  2. Backtrack to start, try "for" - matches!

For input "foreach":

  1. Try "function" - ‘f’, ‘o’, ‘r’ match, fail
  2. Backtrack, try "for" - matches!
  3. But wait… we wanted "foreach", and "for" consumed only 3 chars

This naive approach has problems:

  • Performance: Re-scanning the same prefix multiple times
  • Ambiguity: "for" matches as prefix of "foreach"

Advanced Parsers: The try Combinator

Real parser libraries like Parsec use predictive parsing: by default, once a parser consumes input, it “commits” and won’t backtrack.

-- In Parsec/Megaparsec:
keyword = string "function" <|> string "for"

-- On input "forward":
-- 1. Try "function", consume "f", commit
-- 2. "o" matches, "r" matches, "w" fails
-- 3. Parser FAILS (no backtrack because we consumed input)

The try combinator enables backtracking:

keyword = try (string "function") <|> string "for"

-- Now on input "forward":
-- 1. try (string "function") consumes "for", fails on 'w'
-- 2. try causes BACKTRACK to original position
-- 3. string "for" matches

For this project, implement the simpler always-backtracking version first, then optionally add try for the commit/backtrack distinction.

Error Messages and Position Tracking

A parser that just returns Nothing is frustrating. Real parsers need error messages:

data ParseError = ParseError
    { errorPos     :: SourcePos
    , errorMessage :: String
    , expected     :: [String]
    }

data SourcePos = SourcePos
    { sourceLine   :: Int
    , sourceColumn :: Int
    }

Enhanced parser type:

newtype Parser a = Parser
    { runParser :: String -> SourcePos -> Either ParseError (a, String, SourcePos)
    }

Now parsers track position and accumulate meaningful error messages:

Parse error at line 3, column 15:
    unexpected '}'
    expecting expression

From Grammar to Combinators: The Translation

Any context-free grammar can be translated to combinators mechanically:

BNF Grammar:

<json>   ::= <object> | <array> | <string> | <number> | <bool> | "null"
<object> ::= '{' <members>? '}'
<members> ::= <pair> (',' <pair>)*
<pair>   ::= <string> ':' <json>
<array>  ::= '[' <elements>? ']'
<elements> ::= <json> (',' <json>)*

Direct Translation:

json :: Parser JSON
json = object <|> array <|> string <|> number <|> bool <|> null_

object :: Parser JSON
object = JObject <$> (char '{' *> optional members <* char '}')

members :: Parser [(String, JSON)]
members = sepBy1 pair (char ',')

pair :: Parser (String, JSON)
pair = (,) <$> stringLiteral <*> (char ':' *> json)

array :: Parser JSON
array = JArray <$> (char '[' *> optional elements <* char ']')

elements :: Parser [JSON]
elements = sepBy1 json (char ',')

The pattern:

  • | in BNF becomes <|> in combinators
  • Sequencing in BNF becomes <*> or *>
  • * (zero or more) becomes many
  • + (one or more) becomes some
  • ? (optional) becomes optional

Why Parser Combinators Embody Functional Programming

Parser combinators are the crown jewel of FP because they demonstrate every core principle:

  1. Functions as first-class values: Parsers ARE functions
  2. Composition: Complex parsers from simple ones via <*>, <|>, >>=
  3. Higher-order functions: many, sepBy, chainl1 are all HOFs
  4. Algebraic structures: Functor, Applicative, Monad, Alternative
  5. Referential transparency: Same input always produces same output
  6. Declarative style: Grammar IS the code

Compare to imperative parsing:

// Imperative
int parseInteger(String input, int[] position) {
    int start = position[0];
    while (position[0] < input.length() &&
           Character.isDigit(input.charAt(position[0]))) {
        position[0]++;
    }
    return Integer.parseInt(input.substring(start, position[0]));
}
-- Functional
integer :: Parser Int
integer = read <$> some digit

The functional version is shorter, clearer, and composable.

The Applicative Parsing Intuition

Here’s a mental model that helps: Applicative parsing is like filling in a template.

data Person = Person String Int String  -- Name, Age, City

person :: Parser Person
person = Person
    <$> name       -- Fill in slot 1
    <*> age        -- Fill in slot 2
    <*> city       -- Fill in slot 3

Reading left to right:

  1. Start with constructor Person :: String -> Int -> String -> Person
  2. Apply <$> to get Parser (Int -> String -> Person)
  3. Apply <*> to get Parser (String -> Person)
  4. Apply <*> to get Parser Person

Each <*> “uses up” one argument to the constructor.

Common Combinator Patterns

Separated lists:

sepBy :: Parser a -> Parser sep -> Parser [a]
sepBy p sep = sepBy1 p sep <|> pure []

sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)

Chained expressions (left-associative):

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = do
    x <- p
    rest x
  where
    rest x = (do
        f <- op
        y <- p
        rest (f x y)) <|> pure x

For expression 1 + 2 + 3:

chainl1 number (char '+' *> pure (+))
  1. Parse 1, x = 1
  2. Parse +, parse 2, rest (1 + 2) = rest 3
  3. Parse +, parse 3, rest (3 + 3) = rest 6
  4. No more +, return 6

Between delimiters:

between :: Parser open -> Parser close -> Parser a -> Parser a
between open close p = open *> p <* close

parens :: Parser a -> Parser a
parens = between (char '(') (char ')')

brackets :: Parser a -> Parser a
brackets = between (char '[') (char ']')

Project Specification

Core Requirements

Build a parser combinator library that can parse non-trivial structured text. Your library must include:

1. The Parser Type

  • Newtype wrapper around the parsing function
  • Support for tracking remaining input
  • (Optional) Position tracking for error messages

2. Basic Parsers

  • satisfy :: (Char -> Bool) -> Parser Char
  • char :: Char -> Parser Char
  • string :: String -> Parser String
  • anyChar :: Parser Char
  • eof :: Parser () (succeeds only at end of input)

3. Type Class Instances

  • Functor Parser
  • Applicative Parser
  • Monad Parser
  • Alternative Parser

4. Combinators

  • many :: Parser a -> Parser [a] (zero or more)
  • some :: Parser a -> Parser [a] (one or more)
  • optional :: Parser a -> Parser (Maybe a)
  • sepBy :: Parser a -> Parser sep -> Parser [a]
  • sepBy1 :: Parser a -> Parser sep -> Parser [a]
  • between :: Parser open -> Parser close -> Parser a -> Parser a
  • chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
  • choice :: [Parser a] -> Parser a

5. Lexical Helpers

  • spaces :: Parser String (whitespace)
  • lexeme :: Parser a -> Parser a (consume trailing whitespace)
  • symbol :: String -> Parser String (string + trailing whitespace)
  • integer :: Parser Int

6. A Non-Trivial Parser

Choose ONE of:

  • JSON parser - Parse JSON into a Haskell data type
  • Arithmetic expression parser - Respect operator precedence
  • Simple programming language - Variables, let bindings, functions

Stretch Goals

  • Add position tracking and meaningful error messages
  • Implement try for explicit backtracking control
  • Add <?> :: Parser a -> String -> Parser a for labeling parsers in errors
  • Implement lookAhead :: Parser a -> Parser a (peek without consuming)
  • Implement notFollowedBy :: Parser a -> Parser () (negative lookahead)

Solution Architecture

Module Structure

src/
  Parser/
    Core.hs         -- Parser type, Functor/Applicative/Monad/Alternative
    Primitives.hs   -- satisfy, char, string, anyChar
    Combinators.hs  -- many, some, sepBy, chainl1, etc.
    Lexer.hs        -- spaces, lexeme, symbol, integer
    Error.hs        -- (Optional) Error types and messages

  Examples/
    JSON.hs         -- JSON parser using your library
    Expr.hs         -- Arithmetic expression parser

test/
    CoreSpec.hs     -- Property tests for laws
    JSONSpec.hs     -- JSON parsing tests

The Parser Type (Two Options)

Simple Version:

newtype Parser a = Parser
    { runParser :: String -> Maybe (a, String)
    }

With Errors:

newtype Parser a = Parser
    { runParser :: String -> Either ParseError (a, String)
    }

data ParseError = ParseError
    { errorPosition :: Int
    , errorMessage  :: String
    }

Implementation Strategy

  1. Start with the simple type - Get Functor/Applicative/Monad/Alternative working
  2. Build primitives - satisfy, char, string
  3. Derive combinators - Most are built on the instances
  4. Add lexing - Whitespace handling
  5. Build your target parser - JSON or expressions
  6. Optionally enhance errors - Add position tracking

Implementation Guide

Phase 1: Core Parser Type and Instances

Goal: Implement the Parser newtype and its Functor/Applicative/Monad/Alternative instances.

Step 1: Define the type

module Parser.Core where

newtype Parser a = Parser
    { runParser :: String -> Maybe (a, String)
    }

Step 2: Implement Functor

Think about what fmap means: transform the result without changing parsing behavior.

instance Functor Parser where
    fmap f (Parser p) = Parser $ \input ->
        -- Pattern match on result, apply f to the parsed value
        -- Hint: use fmap on the outer Maybe, then on the tuple

Step 3: Implement Applicative

pure succeeds without consuming. <*> sequences parsers.

instance Applicative Parser where
    pure x = Parser $ \input -> -- Return x, don't consume input

    (Parser pf) <*> (Parser pa) = Parser $ \input ->
        -- Run pf, get function f and remaining input
        -- Run pa on remaining input, get value a
        -- Return f a with final remaining input

Step 4: Implement Monad

>>= lets the second parser depend on the first result.

instance Monad Parser where
    return = pure

    (Parser pa) >>= f = Parser $ \input ->
        -- Run pa, get value a and remaining input
        -- Apply f to a to get new parser
        -- Run new parser on remaining input

Step 5: Implement Alternative

empty always fails. <|> tries second on failure of first.

instance Alternative Parser where
    empty = Parser $ \_ -> Nothing

    (Parser p1) <|> (Parser p2) = Parser $ \input ->
        -- Try p1 on input
        -- If Nothing, try p2 on SAME input
        -- If Just, return that result

Verification: Write tests for the Functor/Applicative/Monad laws.

Phase 2: Primitive Parsers

Goal: Implement the building blocks all other parsers use.

module Parser.Primitives where

import Parser.Core

-- The fundamental primitive
satisfy :: (Char -> Bool) -> Parser Char
satisfy predicate = Parser $ \input ->
    case input of
        []     -> Nothing
        (c:cs) -> if predicate c then Just (c, cs) else Nothing

-- Derived from satisfy
char :: Char -> Parser Char
char c = satisfy (== c)

digit :: Parser Char
digit = satisfy isDigit

letter :: Parser Char
letter = satisfy isAlpha

anyChar :: Parser Char
anyChar = satisfy (const True)

-- End of input
eof :: Parser ()
eof = Parser $ \input ->
    case input of
        [] -> Just ((), "")
        _  -> Nothing

-- Parse a specific string
string :: String -> Parser String
string []     = pure []
string (c:cs) = (:) <$> char c <*> string cs
-- Or: string s = traverse char s

Phase 3: Combinators

Goal: Build the higher-level parsing tools.

module Parser.Combinators where

import Control.Applicative (Alternative(..))
import Parser.Core
import Parser.Primitives

-- many and some come from Alternative, but let's implement them:
manyP :: Parser a -> Parser [a]
manyP p = someP p <|> pure []

someP :: Parser a -> Parser [a]
someP p = (:) <$> p <*> manyP p

-- Separated by
sepBy :: Parser a -> Parser sep -> Parser [a]
sepBy p sep = sepBy1 p sep <|> pure []

sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)

-- Between delimiters
between :: Parser open -> Parser close -> Parser a -> Parser a
between open close p = open *> p <* close

-- Choice from list
choice :: [Parser a] -> Parser a
choice = foldr (<|>) empty

-- Optional with default
option :: a -> Parser a -> Parser a
option x p = p <|> pure x

-- Left-associative chaining (for operators)
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= rest
  where
    rest x = (do
        f <- op
        y <- p
        rest (f x y)) <|> pure x

-- Right-associative chaining
chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainr1 p op = do
    x <- p
    (do
        f <- op
        y <- chainr1 p op
        pure (f x y)) <|> pure x

Phase 4: Lexer Utilities

Goal: Handle whitespace and token-level parsing.

module Parser.Lexer where

import Parser.Core
import Parser.Primitives
import Parser.Combinators
import Data.Char (isSpace, isDigit)

-- Whitespace
spaces :: Parser String
spaces = many (satisfy isSpace)

-- Parser that consumes trailing whitespace
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- Parse a symbol (string + trailing whitespace)
symbol :: String -> Parser String
symbol s = lexeme (string s)

-- Parse an integer
integer :: Parser Int
integer = read <$> some digit

-- Signed integer
signedInteger :: Parser Int
signedInteger = do
    sign <- option '+' (char '-' <|> char '+')
    n <- integer
    pure $ if sign == '-' then -n else n

Phase 5: Build Your Target Parser

Option A: JSON Parser

module Examples.JSON where

data JSON
    = JNull
    | JBool Bool
    | JNumber Double
    | JString String
    | JArray [JSON]
    | JObject [(String, JSON)]
    deriving (Show, Eq)

json :: Parser JSON
json = lexeme $ choice
    [ JNull   <$  symbol "null"
    , JBool   <$> bool
    , JNumber <$> number
    , JString <$> stringLiteral
    , JArray  <$> array
    , JObject <$> object
    ]

bool :: Parser Bool
bool = (True <$ symbol "true") <|> (False <$ symbol "false")

number :: Parser Double
-- Handle integers, decimals, exponents

stringLiteral :: Parser String
-- Handle quotes, escape sequences

array :: Parser [JSON]
array = between (symbol "[") (symbol "]") (sepBy json (symbol ","))

object :: Parser [(String, JSON)]
object = between (symbol "{") (symbol "}") (sepBy pair (symbol ","))

pair :: Parser (String, JSON)
pair = (,) <$> (stringLiteral <* symbol ":") <*> json

Option B: Arithmetic Expressions

module Examples.Expr where

data Expr
    = Lit Int
    | Add Expr Expr
    | Sub Expr Expr
    | Mul Expr Expr
    | Div Expr Expr
    | Neg Expr
    deriving (Show, Eq)

expr :: Parser Expr
expr = term `chainl1` addOp

term :: Parser Expr
term = factor `chainl1` mulOp

factor :: Parser Expr
factor = choice
    [ Lit <$> lexeme integer
    , between (symbol "(") (symbol ")") expr
    , Neg <$> (symbol "-" *> factor)
    ]

addOp :: Parser (Expr -> Expr -> Expr)
addOp = (Add <$ symbol "+") <|> (Sub <$ symbol "-")

mulOp :: Parser (Expr -> Expr -> Expr)
mulOp = (Mul <$ symbol "*") <|> (Div <$ symbol "/")

eval :: Expr -> Int
eval (Lit n)     = n
eval (Add e1 e2) = eval e1 + eval e2
eval (Sub e1 e2) = eval e1 - eval e2
eval (Mul e1 e2) = eval e1 * eval e2
eval (Div e1 e2) = eval e1 `div` eval e2
eval (Neg e)     = - eval e

Phase 6: Error Enhancement (Optional)

Goal: Add position tracking and meaningful errors.

data ParseError = ParseError
    { errorPos      :: SourcePos
    , errorUnexpected :: Maybe Char
    , errorExpected   :: [String]
    }

data SourcePos = SourcePos
    { line   :: Int
    , column :: Int
    } deriving (Show, Eq)

newtype Parser a = Parser
    { runParser :: String -> SourcePos -> Either ParseError (a, String, SourcePos)
    }

-- Update position after consuming character
updatePos :: SourcePos -> Char -> SourcePos
updatePos (SourcePos l c) '\n' = SourcePos (l + 1) 1
updatePos (SourcePos l c) _    = SourcePos l (c + 1)

-- Label a parser for error messages
(<?>) :: Parser a -> String -> Parser a
p <?> label = -- On failure, include label in "expected" list

Testing Strategy

Unit Tests

-- Primitive tests
testSatisfy :: Spec
testSatisfy = describe "satisfy" $ do
    it "succeeds when predicate is true" $
        runParser (satisfy (== 'a')) "abc" `shouldBe` Just ('a', "bc")

    it "fails when predicate is false" $
        runParser (satisfy (== 'a')) "xyz" `shouldBe` Nothing

    it "fails on empty input" $
        runParser (satisfy (const True)) "" `shouldBe` Nothing

testString :: Spec
testString = describe "string" $ do
    it "parses exact match" $
        runParser (string "hello") "hello world" `shouldBe` Just ("hello", " world")

    it "fails on partial match" $
        runParser (string "hello") "help" `shouldBe` Nothing

Property-Based Tests (with QuickCheck)

-- Functor laws
prop_functorIdentity :: String -> Bool
prop_functorIdentity input =
    runParser (fmap id anyChar) input == runParser anyChar input

prop_functorComposition :: Fun Char Char -> Fun Char Char -> String -> Bool
prop_functorComposition (Fun _ f) (Fun _ g) input =
    runParser (fmap (f . g) anyChar) input ==
    runParser (fmap f . fmap g $ anyChar) input

-- Applicative laws
prop_applicativeIdentity :: String -> Bool
prop_applicativeIdentity input =
    runParser (pure id <*> anyChar) input == runParser anyChar input

-- Monad laws
prop_monadLeftIdentity :: Char -> Fun Char Char -> String -> Bool
prop_monadLeftIdentity x (Fun _ f) input =
    runParser (return x >>= \c -> return (f c)) input ==
    runParser (return (f x)) input

-- Alternative laws
prop_alternativeAssociativity :: String -> Property
prop_alternativeAssociativity input =
    runParser ((p1 <|> p2) <|> p3) input ===
    runParser (p1 <|> (p2 <|> p3)) input
  where
    p1 = char 'a'
    p2 = char 'b'
    p3 = char 'c'

Integration Tests (JSON Example)

testJSONParser :: Spec
testJSONParser = describe "JSON parser" $ do
    it "parses null" $
        runParser json "null" `shouldBe` Just (JNull, "")

    it "parses booleans" $ do
        runParser json "true" `shouldBe` Just (JBool True, "")
        runParser json "false" `shouldBe` Just (JBool False, "")

    it "parses integers" $
        runParser json "42" `shouldBe` Just (JNumber 42.0, "")

    it "parses strings" $
        runParser json "\"hello\"" `shouldBe` Just (JString "hello", "")

    it "parses arrays" $
        runParser json "[1, 2, 3]" `shouldBe`
            Just (JArray [JNumber 1, JNumber 2, JNumber 3], "")

    it "parses nested structures" $
        runParser json "{\"a\": [1, 2], \"b\": null}" `shouldBe`
            Just (JObject [("a", JArray [JNumber 1, JNumber 2]), ("b", JNull)], "")

Common Pitfalls

Pitfall 1: Left Recursion

Problem: Parser hangs infinitely.

-- WRONG: Left recursion causes infinite loop
expr :: Parser Expr
expr = (Add <$> expr <*> (char '+' *> term)) <|> term
--              ^^^^
--              Calls itself without consuming input!

Solution: Use chainl1 or rewrite to be right-recursive/iterative.

-- CORRECT: Left-factor or use chainl1
expr = term `chainl1` ((+) <$ char '+')

Pitfall 2: Forgetting Backtracking

Problem: Parser fails when it should try alternatives.

keyword = string "function" <|> string "func"

-- On input "func", this fails!
-- "function" consumes "func", then fails on 't' vs 'i'
-- No backtrack occurs

In the simple always-backtracking model, this works fine. But if you implement commit semantics like Parsec, you need try:

keyword = try (string "function") <|> string "func"

Pitfall 3: Greedy many Causing Failures

Problem: many consumes too much, causing the next parser to fail.

-- WRONG: many anyChar consumes EVERYTHING, including the quote
quotedString = char '"' *> many anyChar <* char '"'

runParser quotedString "\"hello\"" -- Nothing!

Solution: Be more specific about what you’re matching.

-- CORRECT: Stop at quote
quotedString = char '"' *> many (satisfy (/= '"')) <* char '"'

Pitfall 4: Mixing Up *> and <*

Problem: Discarding the wrong result.

-- These are different!
char '(' *> expr <* char ')'  -- Returns expr result
char '(' *> expr *> char ')'  -- Returns ')' (the last parser's result)

Memory aid:

  • *> points right: keep the RIGHT result
  • <* points left: keep the LEFT result

Pitfall 5: Not Handling Whitespace

Problem: Parser works for "123" but fails for " 123" or "123 ".

Solution: Decide on a whitespace strategy:

-- Option 1: Consume leading whitespace once, trailing whitespace after each token
json = spaces *> jsonValue

-- Option 2: Every parser consumes its trailing whitespace
integer = read <$> some digit <* spaces

Extensions and Challenges

Extension 1: Implement try for Explicit Backtracking

Add commit/backtrack semantics. By default, consuming input commits to that branch.

try :: Parser a -> Parser a
-- If p fails after consuming input, backtrack to original position

Extension 2: Parse a Complete Programming Language

Extend your expression parser to a mini-language with:

  • Variable bindings: let x = 5 in x + 1
  • Conditionals: if x > 0 then x else -x
  • Functions: \x -> x + 1
  • Function application: f 5

Extension 3: Add Megaparsec-style Error Handling

Implement:

  • Error message accumulation
  • “expected” labels
  • Position tracking with file name, line, column
  • Pretty-printed error messages

Extension 4: Implement Indentation-Sensitive Parsing

Parse Python-style or Haskell-style indentation:

-- Track current indentation level
type Indent = Int

newtype Parser a = Parser
    { runParser :: String -> Indent -> Maybe (a, String, Indent)
    }

block :: Parser a -> Parser [a]
-- Parse multiple items at same or greater indentation

Extension 5: Make It a Real Library

  • Add documentation with Haddock
  • Publish to Hackage
  • Compare performance with Parsec/Megaparsec
  • Add benchmarks with criterion

Real-World Connections

Industry Use Cases

  1. Compilers and Interpreters: GHC uses parser combinators for Haskell parsing
  2. Configuration Files: Parse YAML, TOML, INI files
  3. Domain-Specific Languages: Internal DSLs for business rules, queries
  4. Data Formats: JSON, XML, CSV, log files
  5. Protocol Parsing: HTTP headers, email (RFC 5322), URLs
Library Language Notes
Parsec Haskell The original, by Daan Leijen
Megaparsec Haskell Modern, better errors, more features
attoparsec Haskell Faster, less features, for protocols
FParsec F# Port of Parsec
nom Rust Zero-copy, binary parsing focus
pyparsing Python Python-style API
Scala Parser Combinators Scala Standard library

The Parsec Paper

Read “Parsec: Direct Style Monadic Parser Combinators For The Real World” by Daan Leijen. It explains:

  • Why commit/backtrack semantics improve error messages
  • How to handle left recursion
  • Performance optimizations

Interview Questions

After completing this project, you should be able to answer:

  1. “Explain the type String -> Maybe (a, String) for a parser. Why each component?”
    • String: Input to consume
    • Maybe: Handles success/failure
    • a: The parsed result
    • String: Remaining unconsumed input
  2. “Why do parser combinators use the Monad type class?”
    • Monad allows the next parser to depend on the previous result
    • Essential for context-sensitive grammars (length-prefixed data, version-dependent formats)
    • >>= threads state (remaining input) through the computation
  3. “What is the difference between Applicative and Monad parsing?”
    • Applicative: Structure is static, parsers are independent
    • Monad: Structure can depend on parsed values
    • Applicative is more analyzable (can determine what input looks like before running)
  4. “How does <|> implement backtracking?”
    • When left parser fails, right parser receives original input
    • In commit semantics, try is needed to enable backtracking after consumption
  5. “What is left recursion and why is it a problem for recursive descent parsers?”
    • Left recursion: A → A Îą | β
    • Parser calls itself without consuming input
    • Causes infinite loop
    • Solution: Left-factor or use iterative combinators like chainl1
  6. “How would you parse operator precedence (e.g., * before +)?”
    • Stratify the grammar by precedence level
    • Higher precedence operators in inner productions
    • Use chainl1 or chainr1 for each level
  7. “What is the relationship between context-free grammars and parser combinators?”
    • Direct correspondence: each production becomes a parser
    • | becomes <|>, sequencing becomes <*> or >>=
    • Combinators can express context-sensitive features too

Self-Assessment Checklist

Check off each item when you’ve mastered it:

Fundamentals

  • I can explain why parsers return (a, String) tuples
  • I understand the difference between Maybe and Either for error handling
  • I can trace through what happens when chaining parsers with >>=

Type Classes

  • I can implement Functor for Parser from scratch
  • I can implement Applicative for Parser from scratch
  • I can implement Monad for Parser from scratch
  • I can implement Alternative for Parser from scratch
  • I can explain when to use <*> vs >>=

Combinators

  • I can implement many and some in terms of Alternative
  • I can implement sepBy and sepBy1
  • I can implement chainl1 for left-associative operators
  • I understand why chainl1 avoids left recursion

Practical Parsing

  • I can translate a BNF grammar to combinator code
  • I can parse JSON or arithmetic expressions
  • I can handle operator precedence correctly
  • I understand whitespace handling strategies

Advanced Topics

  • I can explain commit vs backtrack semantics
  • I understand when try is needed
  • I can add position tracking for error messages
  • I can debug common parser issues (infinite loops, unexpected failures)

Resources

Books

Book Chapter What You’ll Learn
“Programming in Haskell” (Hutton) Chapter 13 The definitive introduction to parsing
“Haskell Programming from First Principles” Chapters 24-25 Parser combinators step by step
“Real World Haskell” Chapter 16 Parsec in practice
“Haskell in Depth” Chapter 8 Production parsing techniques

Papers

  1. “Monadic Parsing in Haskell” by Hutton & Meijer (1998)
    • The foundational paper, extremely readable
    • Derives the parser type step by step
  2. “Parsec: Direct Style Monadic Parser Combinators For The Real World” by Daan Leijen
    • Explains Parsec’s design decisions
    • Error handling and performance
  3. “Monadic Parser Combinators” by Graham Hutton & Erik Meijer
    • Technical report with more detail than the paper

Online Resources

Code References


Conclusion

Parser combinators represent functional programming at its best: small, composable pieces that combine into powerful tools. By building this library, you’ve learned:

  1. How to design types that capture computation - The parser type encapsulates sequential, fallible computation
  2. Why Functor/Applicative/Monad matter - Real use cases, not abstract nonsense
  3. The power of composition - Complex parsers from simple primitives
  4. Practical FP techniques - Error handling, state threading, algebraic design

You now have the skills to parse any structured text format you encounter, and the conceptual tools to build your own DSLs and compilers.

Next Steps:

  • Project 8: Lazy Evaluation Demonstrator - See how Haskell’s evaluation strategy enables infinite data structures
  • Project 9: Monad Transformers - Stack multiple effects (parsing with logging, or parsing with state)