Parsing regular expressions in Haskell (part 1)
If you’ve read my previous posts, you know I’m learning Haskell. But I want to avoid any boring way to learn it – learning anything that is boring is a waste of time. My approach is to read some online resources, do exercises and write real world programs. So far I’ve been mostly reading the Haskell Wikibook, which has some good exercises. I’ve also found some interesting pieces in Real World Haskell, although I haven’t started reading that book yet. Both resources seem to be quite good.
In my previous assignment I’ve written a grep tool in Haskell. I’ve also done some exercises with lists. And I think these steps resulted in a breakthrough for me – I can now write more complex code, which sometimes works out of the box and sometimes I only have to do a few fixes, but I don’t seem to be stuck for good and I don’t have to search for hints to the solution online anymore. The key was understanding that:
- Recursion is an essential programming paradigm in Haskell.
- In conjunction with lists of course, which are the basic tool for writing algorithms.
- Monads are an important tool for achieving more beyond functional programming, and working with monads is different from working with functions.
You can notice that my first programs used more traditional programming, as if they were written in an imperative programming language. But the grep program started looking like a real Haskell program – no loops and even no classic conditionals (well, function overloading is a kind of conditional).
Simple regular expressions
I’ve always wanted to write a parser for regular expressions. Well, this is my opportunity. The nice thing about Haskell is that programs written in this language are quite concise. Once you understand the basics (which is not that easy), Haskell seems much less scary.
I started off with a simplified regular expression. Here is the definition:
- ^ matches the beginning of the string.
- $ matches the end of the string.
- . matches any character
- * indicates that the previous token can be repeated zero or more times (for example a* means that a occurs zero or more times).
- + indicates that the previous token can be repeated one or more times (for example a+ means that a occurs one or more times).
- Any other character in the pattern is treated literally.
My program requires 2 or more arguments on the command line. The first argument is the regular expression (definition above). Further arguments are tested against this regular expression, one by one, and for each of them we print either MATCH: string or just string if the regular expression did not match the string.
$ runhaskell regex.hs "^ab*c+" "abc" "ac" "bac" "abbcc" MATCH: abc MATCH: ac bac MATCH: abbcc
Without further ado, here is the source code:
import System.Environment -- for getArgs main = getArgs >>= mainWithArgs usage = putStrLn $ "Usage: regex REGEX [string...]" mainWithArgs  = usage mainWithArgs [_] = usage mainWithArgs (regex:strings) = mapM_ (runRegexAndPrint regex) strings runRegexAndPrint regex string = if matchRegex regex string then putStrLn $ "MATCH: " ++ string else putStrLn $ " " ++ string matchRegex  _ = True matchRegex ['^'] _ = True matchRegex ['$'] _ = True matchRegex ('^':rcs) scs = tryMatch rcs scs matchRegex rcs scs = tryMatch ('.':'*':rcs) scs tryMatch ['$']  = True tryMatch  _ = True tryMatch _  = False tryMatch (rc:'*':rcs) (sc:scs) = if matchSymbol rc sc then tryMatch (rc:'*':rcs) scs || tryMatch rcs scs || tryMatch rcs (sc:scs) else tryMatch rcs (sc:scs) tryMatch (rc:'+':rcs) (sc:scs) = if matchSymbol rc sc then tryMatch (rc:'*':rcs) scs || tryMatch rcs scs || tryMatch rcs (sc:scs) else False tryMatch (rc:rcs) (sc:scs) = matchSymbol rc sc && tryMatch rcs scs matchSymbol '.' _ = True matchSymbol rc sc = rc == sc
This program again heavily relies on function overloading and list operations (string is a list of characters).
One interesting thing is that + and * handling look very similar. Eventually it would be nice to collapse them into one function, esp. if we were to handle repeat ranges.
The + and * tokens are greedy. Any ideas how to make them non-greedy?
I am not planning to stop here. I’d like to implement things like groups, character ranges, repeat count/ranges, greedy/non-greedy selection, etc. I think a simple approach like above won’t suffice though, these features will require building a regular expression representation (known as compiling the regular expression) and then using that representation on the strings.
Another thing I would expect from a good regular expressions library would be to return found ranges, possibly multiple, so that we could e.g. subscript or highlight them when printing the strings.
One last point – unfortunately WordPress cannot highlight Haskell code, which is a pity. Also pasting pre-formatted source code is tedious.