Archive
Parsing regular expressions in Haskell (part 1)
If you’ve read my previous posts, you know I’m learning Haskell. But I want to avoid any boring way to learn it – learning anything that is boring is a waste of time. My approach is to read some online resources, do exercises and write real world programs. So far I’ve been mostly reading the Haskell Wikibook, which has some good exercises. I’ve also found some interesting pieces in Real World Haskell, although I haven’t started reading that book yet. Both resources seem to be quite good.
In my previous assignment I’ve written a grep tool in Haskell. I’ve also done some exercises with lists. And I think these steps resulted in a breakthrough for me – I can now write more complex code, which sometimes works out of the box and sometimes I only have to do a few fixes, but I don’t seem to be stuck for good and I don’t have to search for hints to the solution online anymore. The key was understanding that:
- Recursion is an essential programming paradigm in Haskell.
- In conjunction with lists of course, which are the basic tool for writing algorithms.
- Monads are an important tool for achieving more beyond functional programming, and working with monads is different from working with functions.
You can notice that my first programs used more traditional programming, as if they were written in an imperative programming language. But the grep program started looking like a real Haskell program – no loops and even no classic conditionals (well, function overloading is a kind of conditional).
Simple regular expressions
I’ve always wanted to write a parser for regular expressions. Well, this is my opportunity. The nice thing about Haskell is that programs written in this language are quite concise. Once you understand the basics (which is not that easy), Haskell seems much less scary.
I started off with a simplified regular expression. Here is the definition:
- ^ matches the beginning of the string.
- $ matches the end of the string.
- . matches any character
- * indicates that the previous token can be repeated zero or more times (for example a* means that a occurs zero or more times).
- + indicates that the previous token can be repeated one or more times (for example a+ means that a occurs one or more times).
- Any other character in the pattern is treated literally.
My program requires 2 or more arguments on the command line. The first argument is the regular expression (definition above). Further arguments are tested against this regular expression, one by one, and for each of them we print either MATCH: string or just string if the regular expression did not match the string.
For example:
$ runhaskell regex.hs "^ab*c+" "abc" "ac" "bac" "abbcc" MATCH: abc MATCH: ac bac MATCH: abbcc
Without further ado, here is the source code:
import System.Environment -- for getArgs main = getArgs >>= mainWithArgs usage = putStrLn $ "Usage: regex REGEX [string...]" mainWithArgs [] = usage mainWithArgs [_] = usage mainWithArgs (regex:strings) = mapM_ (runRegexAndPrint regex) strings runRegexAndPrint regex string = if matchRegex regex string then putStrLn $ "MATCH: " ++ string else putStrLn $ " " ++ string matchRegex [] _ = True matchRegex ['^'] _ = True matchRegex ['$'] _ = True matchRegex ('^':rcs) scs = tryMatch rcs scs matchRegex rcs scs = tryMatch ('.':'*':rcs) scs tryMatch ['$'] [] = True tryMatch [] _ = True tryMatch _ [] = False tryMatch (rc:'*':rcs) (sc:scs) = if matchSymbol rc sc then tryMatch (rc:'*':rcs) scs || tryMatch rcs scs || tryMatch rcs (sc:scs) else tryMatch rcs (sc:scs) tryMatch (rc:'+':rcs) (sc:scs) = if matchSymbol rc sc then tryMatch (rc:'*':rcs) scs || tryMatch rcs scs || tryMatch rcs (sc:scs) else False tryMatch (rc:rcs) (sc:scs) = matchSymbol rc sc && tryMatch rcs scs matchSymbol '.' _ = True matchSymbol rc sc = rc == sc
This program again heavily relies on function overloading and list operations (string is a list of characters).
One interesting thing is that + and * handling look very similar. Eventually it would be nice to collapse them into one function, esp. if we were to handle repeat ranges.
The + and * tokens are greedy. Any ideas how to make them non-greedy?
What’s next?
I am not planning to stop here. I’d like to implement things like groups, character ranges, repeat count/ranges, greedy/non-greedy selection, etc. I think a simple approach like above won’t suffice though, these features will require building a regular expression representation (known as compiling the regular expression) and then using that representation on the strings.
Another thing I would expect from a good regular expressions library would be to return found ranges, possibly multiple, so that we could e.g. subscript or highlight them when printing the strings.
One last point – unfortunately WordPress cannot highlight Haskell code, which is a pity. Also pasting pre-formatted source code is tedious.
grep in Haskell
In-between my “assignments” I’ve been reading more about Haskell and doing various exercises. Here are some notable things worth mentioning about this peculiar programming language:
- Types are used sparsely and needed rarely. Haskell mostly looks like a dynamically typed language. But it is not. Haskell is strongly, statically typed. The compiler figures out the types automagically. You could compare this to C++ templates, where you need or want to specify concrete types rarely. From this perspective, writing in Haskell is like hardcore template meta-programming.
- Lists are a basic structured type in Haskell. Just like std::vector is the most important container type in C++. However lists in Haskell are more pervasive.
- Recursion is probably as much pervasive in Haskell as lists. In fact, all list processing is done by the means of recursion! There is an operator for indexing lists, but underneath it seems to be retrieving elements using recursion.
- Haskell is a lazy language. The code is not resolved until it is actually needed. Therefore code which looks like processing huge lists, e.g. reading huge files into memory, may not be using too much memory, since it is executed lazily. Also, infinite lists are possible in Haskell thanks to this.
- The compiler seems to be doing a reasonable job at optimizing the code. This is hearsay on my end, I haven’t done any comparisons myself.
grep
My next assignment was to write a simple grep-like program. So, input coming from files or stdin is filtered line-by-line with a regular expression.
A lot of reading and searching went in this direction. I have seen lots of similar code. In fact there are many, many ways to implement this in Haskell.
Here is what I came up with:
import System.Environment -- for getArgs import Text.Regex.Posix -- for regular expressions main = getArgs >>= mainWithArgs mainWithArgs [] = usage mainWithArgs ["-h"] = usage mainWithArgs [regex] = processStdin $ grepContents regex mainWithArgs (regex:filenames) = mapM_ (processFile $ grepContents regex) filenames usage = putStrLn $ "Usage: grep regex [file...]" processFile func filename = readFile filename >>= processContents func processStdin func = getContents >>= processContents func processContents func contents = putStr $ func contents grepContents regex = unlines . filter (grepLine regex) . lines grepLine regex line = line =~ regex
There are several things here worth noting.
The code is quite Haskellesque. There are no loops. There are no conditional expressions. Except for a minimal use of Monads, it’s pure functional programming.
Lines 1 and 2 import modules for a few things which we need in this program.
Line 4 contains the actual main computation. The main computation retrieves the list of program arguments and passes them to mainWithArgs computation. The >>= operator is used to chain computations, the output of the preceding computation is fed as an argument to the computation coming after it.
One thing crucial for understanding – and at the same time quite difficult to grasp for new Haskel programmers – is the difference between computations and functions. Functions are pure, which means that when invoked with specific arguments, they always return the same value. Such idealistic approach to programming is not sustainable in real world programs. To solve this, Haskell has monads. I don’t understand the difference between a monad and a computation nor I know whether there is any difference between them yet. But monadic computations do not always return the same result, even when given the same input.
Because Haskell has this distinction between functions and computations, different types and operators are used to denote operations when monads are involved. This is also why mainWithArgs cannot be invoked like this:
main = mainWithArgs ( getArgs ) -- Invalid! These are not functions!
Back to the grep program. The mainWithArgs computation is implemented four times. But each implementation takes different arguments. This is very similar to overloading in C++. But notice that the selection of the overloaded computation will occur in run time! It depends on which arguments are used on the command line. Lines 6, 7: If the user invokes the program with no arguments, or with a single -h argument, the usage computation will be invoked. Line 8: If a single argument is passed, it is treated as a regular expression, and the input is taken from stdin. Line 9: Last but not least, if there are more arguments, the first one is treated as a regular expression and the rest are file names of files to be grepped.
mapM_ in line 9 is similar to map, except it works with computations instead of functions. We could not use map here, because we’re dealing with the IO monad.
In lines 8 and 9 we’re passing a function grepContents regex to be executed on contents. But on line 19 you can notice, that it’s really a concatenation of functions. There are no real arguments passed in it. In Haskell you can chain and pass functions around.
You can also pass incomplete list of arguments to a function and thus create a new function. In line 19 we are doing exactly that with the function grepLine. The grepLine function has two arguments, but we’re passing only one. The result is another function, which can be invoked on a String.
Last but not least, in line 20 we’re using the =~ operator which applies a regular expression to a string. The result may vary, but somehow Haskell figures out that we want the Bool result here.
I originally had type signatures for each function, but I removed them to make the code more concise. Normally type signatures provide additional check to validate the code.
One more thing to notice: the program is written with the most general computations on the top and the details on the bottom. This is not typically seen in statically typed languages, but it helps to read and understand the program.
Let’s be clear. It is mind-boggling. I am still trying to wrap my head around many aspects. Also, there are many new types of expressions in Haskell and many expressions from traditional languages are missing.
My adventure continues…
Basic input and output
Day 2
It’s been an uphill battle. I’m actually surprised I got where I am!
My second program’s purpose is to retrieve two numbers from the command line, assume they are the shorter edges of a right triangle and compute the longest edge.
The “Hello, World!” program evolved only slightly, but it took a long time to arrive at this:
main = do putStrLn "Please enter edge A:" sa <- getLine let a = read sa putStrLn "Please enter edge B:" sb <- getLine let b = read sb let c = sqrt $ a^2 + b^2 putStrLn "" putStrLn $ "A = " ++ show(a) ++ ", B = " ++ show(b) ++ ", C = " ++ show(c)
The first thing to notice here is that in line 3 we are creating a local variable sa and assigning the output of getLine to it. The return value of getLine is an IO monad, this is why we can’t just assign it, we have to use the <- operator. I don’t understand monads yet, so let’s assume they are something different from regular functions. The back story of this is that probably the whole expression has some special meaning.
Another fact about it is that the do expression, which is the entire main function, is also a monad (whatever that means).
The most difficult thing to get to work was to actually treat the value received from getLine as a number, because normally it returns a string. The thing did not want to compile no matter what! After a lot of searching I learned that local variables can be assigned/created with let, which in this case is a special, simplified version of the let statement, which works inside the do statements. The other thing is that read can be used like a function, which wasn’t apparent at first.
In other words, read turns a string into another value (like a number), and show turns any value into a string.
Helper function
The next thing I noticed was that there is some boiler plate code here. In particular we are repeating the following pattern twice: printing a prompt, reading a line from stdin and converting the read string into a usable value.
Again, an uphill battle to extract this into a function. And again it didn’t want to compile. The errors were suggesting something about the types, but heck, I wished I understood what it wanted from me! So I left it and returned the next day to arrive at this:
getData :: String -> IO Double getData prompt = do putStrLn prompt val <- getLine return (read val) main = do a <- getData "Please enter edge A:" b <- getData "Please enter edge B:" let c = sqrt $ a^2 + b^2 putStrLn "" putStrLn $ "A = " ++ show(a) ++ ", B = " ++ show(b) ++ ", C = " ++ show(c)
The first line describes the types of the function defined in line two. It’s optional and can be deleted without consequences, but in this case it enforces that there is one argument of type String and the return type is Double. There is a monad involved, too, but I don’t understand yet how it’s linked with the main function.
The dollar sign in lines 10 and 12 changes the order of evaluation from its position to the end of the line, making the rest of the line form a single argument to the function. I could have used parentheses instead.
The ++ operator in line 12 concatenates strings.
Impressions
I began to wonder if it will continue to be an uphill battle and when it will become easier. I also started feeling a little bit like fighting PERL for a moment, I hope that feeling won’t persist…
Haskell is difficult. I am not sure yet whether it impacts its usability. We will see.
Hello, World!
For some time now I’ve been planning to learn Haskell. Haskell is a modern functional programming language. The fact is that many other modern languages are constantly acquiring new features from Haskell, such as lambdas, coroutines, list comprehensions, etc. If you look at JS 1.8, Python 3 or C++11, they have more and more features which come from Haskell.
Learning new programming languages is a good way for a programmer to improve his skills. Haskell in particular requires a specific approach to programming, different from the languages most programmers use every day.
One thing, which is encouraging to me, is that you can compile Haskell programs into real executables and run them on bare metal, without any VMs, interpreters and other junk.
There is an obstacle, however. Haskell is scary, complex and difficult to learn. Even though its syntax is quite terse and there is very little boilerplate code in programs written in Haskell, it still has a steep learning curve. If you don’t know Haskell, a program written in this language may seem incomprehensible. This reminds me a lot of PERL, which I have a personal aversion to.
Python is trivial, you can learn it overnight by just reading the online tutorial and you can master it within a week. But Haskell? Forget about it. After one evening of reading my head swelled and at the end of an interactive tutorial I forgot half of the things I’ve learnt!
I’ve decided to take a different approach: to learn the language by combining reading a tutorial and simultaneously set some goals and achieve them by writing more and more complex programs which will do concrete things. I generally hate raw material and hands on experience is the best way for me to learn.
The Haskell Wikibook, which a friend recommended to me, is probably a good place to start. I also have upgraded the installation of Haskell Platform (I’ve had some older version installed for some time).
So for a start, after some basic reading to refresh what I’ve previously learnt and after looking here and there, I’ve managed to write a simple “Hello, World!” program:
main = print "Hello, World!"
I saved this in a file called test.hs. From there there are two options, either compile it like this: ghc -o test test.hs or to run it directly like this: runhaskell test.hs. However the output is not satisfactory:
$ runhaskell test.hs "Hello, World!"
In particular I don’t want the double quotes. A quick Google search brought me down to this:
main = putStrLn "Hello, World!"
This achieves exactly what I wanted. Unfortunately putStrLn is more difficult to remember than print. After some practice I will probably remember it though.
I’ve also managed to print two strings, although it wasn’t trivial and involved using a do expression:
main = do putStrLn "Hello, World!" putStrLn "Hello, programmer!"
As you would expect, this is what I get:
$ runhaskell test.hs Hello, World! Hello, programmer!
So here is some recap:
- The program has one function, main. This is the function which is executed to run the program. This is similar to C, Python and many other languages.
- It’s trivial to write the above function. It has no arguments and the return value is implicit (no return value in fact). The function declaration couldn’t be more trivial: functionName = functionBody
- putStrLn (or print) is the function I use to print a string, i.e. to write it to standard output.
- I don’t have to use parentheses when invoking putStrLn. In fact everything until the end of line is treated as arguments for that function. If I wanted to do something more complex, I would have to use parentheses to tell the compiler where the arguments end. Another way is to use the $ operator, which I haven’t fully figured out yet.
- The do expression lets me do several things, otherwise I would be limited to a single expression.
- Similarly to Python, there are no semicolons needed. Haskell uses indentation to figure out what belongs where, just like Python. I could use semicolons if I wanted to put both putStrLn functions on one line though, like in Python.
One thing I’ve found out is that WordPress doesn’t color Haskell syntax, so my Haskell programs will be uncolored, sorry!
If you have ideas what I should do next, please let me know. I’m planning to do some string and/or list manipulation, together with maybe input and output formatting.