Xmas List Parser
This post is part of the F# Advent Calendar 2019. Many thanks to Sergey Tihon for organizing these.
Last year I wrote an app for Santa to keep track of his list of presents to buy for the nice children of the world.
Sadly, the development team didn’t do proper research into Santa’s requirements; they couldn’t be bothered with a trek to the North Pole and just sat at home watching “The Santa Clause” and then reckoned they knew it all. Luckily no harm came to Christmas 2018.
Good news is, Santa’s been in touch and the additional requirements for this year are:
- I don’t want to retype all the bloomin’ letters.
- I’d like to send presents to naughty children.
The Problem
This year I’m going to walk through how you can solve Santa’s problem using something I’ve recently began playing with - FParsec.
FParsec is parser combinator library for F#.
I’d describe it as: a library that lets you write a parser by combining functions.
This is only my second go at using it, my first was to solve Mike Hadlow’s “Journeys” coding challenge. So this might not be the most idiomatic way to write a parser.
We’ll assume that Santa has bought some off the shelf OCR software and has scanned in some Christmas lists into a text file.
Example
Alice: Nice
- Bike
- Socks * 2
Bobby: Naughty
- Coal
Claire:Nice
-Hat
- Gloves * 2
- Book
Dave : Naughty
- Nothing
As you can see the OCR software hasn’t done too well with the whitespace. We need a parser that is able to parse this into some nice F# records and handle the lack of perfect structure.
Domain
When writing solutions in F# I like to model the domain first:
module Domain =
type Behaviour = Naughty | Nice
type Gift = {
Gift: string
Quantity: int
}
type Child = {
Name: string
Behaviour: Behaviour
Gifts: Gift list
}
First the Behaviour
is modelled as a discriminated union: either Naughty
or Nice
.
A record for the Gift
holds the name of a gift and the quantity.
The Child
record models the name of the child, their behaviour and a list of gifts they are getting.
The overall output of a successfully parsing the text will be a list of Child
records.
Parsing
Initially I thought it would be a clever idea to parse the text directly into the domain model. That didn’t work out so, instead I defined my own AST to parse into, then later map that into the domain model.
type Line =
| Child of string * Domain.Behaviour
| QuantifiedGift of string * int
| SingleGift of string
A Child
line represents a child and their Behaviour
this year. A QuantifiedGift
represents a gift that was specified
with a quantity (e.g. “Bike * 2”) and a SingleGift
represents a gift without a quantity.
Modelling this way avoids putting domain logic into your parser - for example, what is the quantity of a single gift? It might seem trivial, but the less the parser knows about your domain the easier it is to create.
Before we get into the actual parsing of the lines, there’s a helper I added called wsAround
:
open FParsec
let wsAround c =
spaces >>. skipChar c >>. spaces
This is a function that creates a parser based on a single character c
and allows the character c
to be
surrounded by whitespace (spaces
function). The skipChar
function says that I don’t care about
parsing the value of c
, just that c
has to be there. I’ll go into the >>.
later on, but it is one of
FParsec’s custom operators for combining parsers.
So wsAround ':'
lets me parse :
with potential whitespace either side of it.
It can be used as part of parsing any of the following:
a : b
a:b
a: b
And as the examples above show, there are a few places where we don’t care about whitespace either side of a separator:
- Either side of the
:
separating the name and behaviour. - Before/after the
-
that precedes either types of gift. - Either side of the
*
for quantified gifts.
Parsing Children
A child line is defined as “a name and behaviour separated by a :
”.
For example: Dave : Nice
And as stated above, there can be any amount (or none) of whitespace around the :
.
The pName
function defines how to parse a name:
let pName =
let endOfName = wsAround ':'
many1CharsTill anyChar endOfName |>> string
many1CharsTill
is a parser that runs two other parsers. The first argument is the parser it will look
for “many chars” from, the second argument is the parser that tells it when to stop.
Here it parses any character using anyChar
until it reaches the endOfName
parser, which is a function that looks for
:
with whitespace around it.
The result of the parser is then converted into a string
using the |>>
operator.
The pBehaviour
function parses naughty or nice into the discriminated union:
let pBehaviour =
(pstringCI "nice" >>% Domain.Nice)
<|>
(pstringCI "naughty" >>% Domain.Naughty)
This defines 2 parsers, one for each case, and uses the <|>
operator to choose between them.
pstringCI "nice"
is looking to parse the string nice
case-insensitive and then the >>%
operator discards the
parsed string and just returns Domain.Nice
.
These 2 functions are combined to create the pChild
function that can parse the full line of text into a Child
line.
let pChild =
let pName = //...
let pBehaviour = //...
pName .>>. pBehaviour |>> Child
pName
and pBehaviour
are combined with the .>>.
operator to create a tuple of each parsers result, then the result
or that is passed to the Child
line constructor by the |>>
operator.
Parsing Gifts
Both gifts make use of the startOfGiftName
parser function:
let startOfGiftName = wsAround '-'
A single gift is parsed with:
let pSingleGift =
let allTillEOL = manyChars (noneOf "\n")
startOfGiftName >>. allTillEOL |>> SingleGift
The allTillEOL
function was taken from this StackOverflow answer and parses everything up to the end of a line.
This is combined with startOfGiftName
using the >>.
operator, which is similar to the .>>.
operator, but in this case
I only want the result from the right-hand side parser - in this case the allTillEOL
, this is then passed into the SingleGift
union case constructor.
A quantified gift is parsed with:
let pQuantifiedGift =
let endOfQty = wsAround '*'
let pGiftName =
startOfGiftName >>. manyCharsTill anyChar endOfQty
pGiftName .>>. pint32 |>> QuantifiedGift
This uses endOfQty
and pGiftName
combined in a similar way to the pName
in pChild
. Parsing all characters up until the
*
and only keeping the name part.
pGiftName
is combined with pint32
with the .>>.
function to get the result of both parsers in a tuple and is fed into the
QuantifiedGift
union case.
Putting it all together
The top level parser is pLine
which parses each line of the text into one of the cases from the Line
discriminated union.
let pLine =
attempt pQuantifiedGift
<|>
attempt pSingleGift
<|>
pChild
This uses the <|>
that was used for the Behaviour
, but it also requires the attempt
function before the first two parsers.
This is because these parsers consume some of the input stream as they execute. Without the attempt
it would start on
a quantified gift, then realise it is actually a single gift and have no way to go into the next choice.
Using attempt
allows the parser to “rewind” when it has a problem - like a quantified gift missing a *
.
If you want to see how this works, you need to decorate your parser functions with the <!>
operator that is defined here.
This shows the steps the parser takes and allows you to see that it has “gone the wrong way”.
Finally a helper method called parseInput
is used to parse the entire file:
let parseInput input =
run (sepBy pLine newline) input
This calls the run
function passing in a sepBy
parser for each pLine
separated by a newline
. This way each line is processed on it’s own.
That is the end of the parser module.
Mapping to the Domain
The current output of parseInput
is a ParserResult<Line list, unit>
. Assuming success there is now a list of Line
union cases
that need to be mapped into a list of Child
from the domain.
These have separate structures:
- A
Child
record is hierarchical - it contains a list ofGift
s. - The list of
Line
s has structure defined by the order of elements,Gift
s follow theChild
they relate to.
Initially I thought about using a fold
to go through each line, if the line was a child, add a child to the head of
the results, if the line was a gift add it to the head of the list of gifts of the first child in the list, this was the code:
let folder (state: Child list) (line : Line) : Child list =
let addGift nm qty =
let head::tail = state
let newHead = { head with Gifts = {Gift = nm; Quantity = qty; } :: head.Gifts; }
newHead :: tail
match line with
| Child (name, behaviour) -> { Name = name; Behaviour = behaviour; Gifts = []; } :: state
| SingleGift name -> addGift name 1
| QuantifiedGift (name, quantity) -> addGift name quantity
This worked, but because F# lists are implemented as singly linked lists you add to the head of the list instead of the tail. This
had the annoying feature that the Child
items were revered in the list - not so bad, but then the list of gifts in each child was backwards too.
I could have sorted both lists, but it would require recreating the results as the lists are immutable and I wanted to keep to idiomatic F# as
much as I could.
A foldBack
on the other hand works backwards “up” the list, which meant I could get the results in the order I wanted, but there
was a complication. When going forward, the first line was always a child, so I always had a child to add gifts to. Going backwards
there is just gifts until you get to a child, so you have to maintain a list of gifts, until you reach a child line, then you can
create a child assign the gifts, then clear the list.
This is how I implemented it:
module Translation =
open Domain
open Parser
let foldLine line state = //Line -> Child list * Gift list -> Child list * Gift list
let cList, gList = state
let addChild name behaviour =
{ Name = name; Behaviour = behaviour; Gifts = gList; } :: cList
let addGift name quantity =
{ Gift = name; Quantity = quantity; } :: gList
match line with
| Child (name, behaviour) -> addChild name behaviour, []
| SingleGift name -> cList, addGift name 1
| QuantifiedGift (name, quantity) -> cList, addGift name quantity
The state
is a tuple of lists, the first for the Child list
(the result we want) and the second for keeping track of the gifts
that are not yet assigned to children.
First this function deconstructs state
into the child and gift lists - cList
and gList
respectively.
Next I’ve declared some helper functions for adding to either the Child
or Gift
list:
addChild
creates a newChild
with theGifts
set to the accumulated list of Gifts (gList
) and prepends it ontocList
.addGift
creates a newGift
and prepends it ontogList
.
Then the correct function is called based on the type of Line.
- Children return a new
Child list
with a EmptyGift list
. - The gifts return the existing
Child list
, with the current item added to theGift list
.
The overall result is a tuple of all the Child
records correctly populated, and an empty list of Gift
records, as the last item will be the
first row and that will be a Child
.
let mapLinesToDomain lines = //ParserResult<Line list, unit> -> Child list
let initState = [],[]
let mapped =
match lines with
| Success (lines, _, _) -> Seq.foldBack foldLine lines initState
| Failure (err, _, _) -> failwith err
fst mapped
Finally, the output of parseInput
can be piped into mapLinesToDomain
to get the Child list
we need:
let childList =
Parser.parseInput input //Input is just a string from File.ReadAllText
|> Translation.mapLinesToDomain
Summing up
I really like how simple parsers can be once written, but it takes some time to get used to how they work and how you need to separate the parsing and domain logic.
My main pain points were:
- Trying to get the domain model in the parser - adding Gifts to Children, setting default quantity to 1, etc resulted in a lot of extra code. Once I stopped this and just focussed on mapping to the AST it was much simpler. Another benefit was not having to map things into Records, just using tuples and discriminated unions allowed a much cleaner implementation.
- Not knowing about using
attempt
, I just assumed<|>
worked like pattern matching, turns out, it doesn’t.
I made heavy use of the F# REPL and found it helped massively as I worked my way through writing each parser and then combining
them together. For example, I first wrote the Behaviour parser and tested it worked correctly on just “Naughty” and “Nice”.
Then I wrote a parser for the Child’s name and :
and tested it on “Dave : Nice”, but only getting the name.
Then I could write a function to combine the two together and check that the results were correct again. The whole development
process was done this way, just add a bit more code, bit more example, test in the REPL and repeat.
The whole code for this is on GitHub - it is only 115 lines long, including code to print the list of Children back out so I could see the results.