grep in Haskell
In-between my “assignments” I’ve been reading more about Haskell and doing various exercises. Here are some notable things worth mentioning about this peculiar programming language:
- Types are used sparsely and needed rarely. Haskell mostly looks like a dynamically typed language. But it is not. Haskell is strongly, statically typed. The compiler figures out the types automagically. You could compare this to C++ templates, where you need or want to specify concrete types rarely. From this perspective, writing in Haskell is like hardcore template meta-programming.
- Lists are a basic structured type in Haskell. Just like std::vector is the most important container type in C++. However lists in Haskell are more pervasive.
- Recursion is probably as much pervasive in Haskell as lists. In fact, all list processing is done by the means of recursion! There is an operator for indexing lists, but underneath it seems to be retrieving elements using recursion.
- Haskell is a lazy language. The code is not resolved until it is actually needed. Therefore code which looks like processing huge lists, e.g. reading huge files into memory, may not be using too much memory, since it is executed lazily. Also, infinite lists are possible in Haskell thanks to this.
- The compiler seems to be doing a reasonable job at optimizing the code. This is hearsay on my end, I haven’t done any comparisons myself.
My next assignment was to write a simple grep-like program. So, input coming from files or stdin is filtered line-by-line with a regular expression.
A lot of reading and searching went in this direction. I have seen lots of similar code. In fact there are many, many ways to implement this in Haskell.
Here is what I came up with:
import System.Environment -- for getArgs import Text.Regex.Posix -- for regular expressions main = getArgs >>= mainWithArgs mainWithArgs  = usage mainWithArgs ["-h"] = usage mainWithArgs [regex] = processStdin $ grepContents regex mainWithArgs (regex:filenames) = mapM_ (processFile $ grepContents regex) filenames usage = putStrLn $ "Usage: grep regex [file...]" processFile func filename = readFile filename >>= processContents func processStdin func = getContents >>= processContents func processContents func contents = putStr $ func contents grepContents regex = unlines . filter (grepLine regex) . lines grepLine regex line = line =~ regex
There are several things here worth noting.
The code is quite Haskellesque. There are no loops. There are no conditional expressions. Except for a minimal use of Monads, it’s pure functional programming.
Lines 1 and 2 import modules for a few things which we need in this program.
Line 4 contains the actual main computation. The main computation retrieves the list of program arguments and passes them to mainWithArgs computation. The >>= operator is used to chain computations, the output of the preceding computation is fed as an argument to the computation coming after it.
One thing crucial for understanding – and at the same time quite difficult to grasp for new Haskel programmers – is the difference between computations and functions. Functions are pure, which means that when invoked with specific arguments, they always return the same value. Such idealistic approach to programming is not sustainable in real world programs. To solve this, Haskell has monads. I don’t understand the difference between a monad and a computation nor I know whether there is any difference between them yet. But monadic computations do not always return the same result, even when given the same input.
Because Haskell has this distinction between functions and computations, different types and operators are used to denote operations when monads are involved. This is also why mainWithArgs cannot be invoked like this:
main = mainWithArgs ( getArgs ) -- Invalid! These are not functions!
Back to the grep program. The mainWithArgs computation is implemented four times. But each implementation takes different arguments. This is very similar to overloading in C++. But notice that the selection of the overloaded computation will occur in run time! It depends on which arguments are used on the command line. Lines 6, 7: If the user invokes the program with no arguments, or with a single -h argument, the usage computation will be invoked. Line 8: If a single argument is passed, it is treated as a regular expression, and the input is taken from stdin. Line 9: Last but not least, if there are more arguments, the first one is treated as a regular expression and the rest are file names of files to be grepped.
mapM_ in line 9 is similar to map, except it works with computations instead of functions. We could not use map here, because we’re dealing with the IO monad.
In lines 8 and 9 we’re passing a function grepContents regex to be executed on contents. But on line 19 you can notice, that it’s really a concatenation of functions. There are no real arguments passed in it. In Haskell you can chain and pass functions around.
You can also pass incomplete list of arguments to a function and thus create a new function. In line 19 we are doing exactly that with the function grepLine. The grepLine function has two arguments, but we’re passing only one. The result is another function, which can be invoked on a String.
Last but not least, in line 20 we’re using the =~ operator which applies a regular expression to a string. The result may vary, but somehow Haskell figures out that we want the Bool result here.
I originally had type signatures for each function, but I removed them to make the code more concise. Normally type signatures provide additional check to validate the code.
One more thing to notice: the program is written with the most general computations on the top and the details on the bottom. This is not typically seen in statically typed languages, but it helps to read and understand the program.
Let’s be clear. It is mind-boggling. I am still trying to wrap my head around many aspects. Also, there are many new types of expressions in Haskell and many expressions from traditional languages are missing.
My adventure continues…