October | 2013 | Chris Dragan's Corner

Parsing regular expressions in Haskell (part 1)

23.10.2013 Leave a comment

If you’ve read my previous posts, you know I’m learning Haskell. But I want to avoid any boring way to learn it – learning anything that is boring is a waste of time. My approach is to read some online resources, do exercises and write real world programs. So far I’ve been mostly reading the Haskell Wikibook, which has some good exercises. I’ve also found some interesting pieces in Real World Haskell, although I haven’t started reading that book yet. Both resources seem to be quite good.

In my previous assignment I’ve written a grep tool in Haskell. I’ve also done some exercises with lists. And I think these steps resulted in a breakthrough for me – I can now write more complex code, which sometimes works out of the box and sometimes I only have to do a few fixes, but I don’t seem to be stuck for good and I don’t have to search for hints to the solution online anymore. The key was understanding that:

Recursion is an essential programming paradigm in Haskell.
In conjunction with lists of course, which are the basic tool for writing algorithms.
Monads are an important tool for achieving more beyond functional programming, and working with monads is different from working with functions.

You can notice that my first programs used more traditional programming, as if they were written in an imperative programming language. But the grep program started looking like a real Haskell program – no loops and even no classic conditionals (well, function overloading is a kind of conditional).

Simple regular expressions

I’ve always wanted to write a parser for regular expressions. Well, this is my opportunity. The nice thing about Haskell is that programs written in this language are quite concise. Once you understand the basics (which is not that easy), Haskell seems much less scary.

I started off with a simplified regular expression. Here is the definition:

^ matches the beginning of the string.
$ matches the end of the string.
. matches any character
* indicates that the previous token can be repeated zero or more times (for example a* means that a occurs zero or more times).
+ indicates that the previous token can be repeated one or more times (for example a+ means that a occurs one or more times).
Any other character in the pattern is treated literally.

My program requires 2 or more arguments on the command line. The first argument is the regular expression (definition above). Further arguments are tested against this regular expression, one by one, and for each of them we print either MATCH: string or just string if the regular expression did not match the string.

For example:

$ runhaskell regex.hs "^ab*c+" "abc" "ac" "bac" "abbcc"
MATCH: abc
MATCH: ac
       bac
MATCH: abbcc

Without further ado, here is the source code:

import System.Environment -- for getArgs

main                                = getArgs >>= mainWithArgs

usage                               = putStrLn $ "Usage: regex REGEX [string...]"

mainWithArgs []                     = usage
mainWithArgs [_]                    = usage
mainWithArgs (regex:strings)        = mapM_ (runRegexAndPrint regex) strings

runRegexAndPrint regex string       = if matchRegex regex string
                                         then putStrLn $ "MATCH: " ++ string
                                         else putStrLn $ "       " ++ string

matchRegex [] _                     = True
matchRegex ['^'] _                  = True
matchRegex ['$'] _                  = True
matchRegex ('^':rcs) scs            = tryMatch rcs scs
matchRegex rcs scs                  = tryMatch ('.':'*':rcs) scs

tryMatch ['$'] []                   = True
tryMatch [] _                       = True
tryMatch _ []                       = False
tryMatch (rc:'*':rcs) (sc:scs)      = if matchSymbol rc sc
                                         then tryMatch (rc:'*':rcs) scs ||
                                              tryMatch rcs scs ||
                                              tryMatch rcs (sc:scs)
                                         else tryMatch rcs (sc:scs)
tryMatch (rc:'+':rcs) (sc:scs)      = if matchSymbol rc sc
                                         then tryMatch (rc:'*':rcs) scs ||
                                              tryMatch rcs scs ||
                                              tryMatch rcs (sc:scs)
                                         else False
tryMatch (rc:rcs) (sc:scs)          = matchSymbol rc sc && tryMatch rcs scs

matchSymbol '.' _                   = True
matchSymbol rc sc                   = rc == sc

This program again heavily relies on function overloading and list operations (string is a list of characters).

One interesting thing is that + and * handling look very similar. Eventually it would be nice to collapse them into one function, esp. if we were to handle repeat ranges.

The + and * tokens are greedy. Any ideas how to make them non-greedy?

What’s next?

I am not planning to stop here. I’d like to implement things like groups, character ranges, repeat count/ranges, greedy/non-greedy selection, etc. I think a simple approach like above won’t suffice though, these features will require building a regular expression representation (known as compiling the regular expression) and then using that representation on the strings.

Another thing I would expect from a good regular expressions library would be to return found ranges, possibly multiple, so that we could e.g. subscript or highlight them when printing the strings.

One last point – unfortunately WordPress cannot highlight Haskell code, which is a pity. Also pasting pre-formatted source code is tedious.

Categories: Haskell

grep in Haskell

15.10.2013 1 comment

In-between my “assignments” I’ve been reading more about Haskell and doing various exercises. Here are some notable things worth mentioning about this peculiar programming language:

Types are used sparsely and needed rarely. Haskell mostly looks like a dynamically typed language. But it is not. Haskell is strongly, statically typed. The compiler figures out the types automagically. You could compare this to C++ templates, where you need or want to specify concrete types rarely. From this perspective, writing in Haskell is like hardcore template meta-programming.
Lists are a basic structured type in Haskell. Just like std::vector is the most important container type in C++. However lists in Haskell are more pervasive.
Recursion is probably as much pervasive in Haskell as lists. In fact, all list processing is done by the means of recursion! There is an operator for indexing lists, but underneath it seems to be retrieving elements using recursion.
Haskell is a lazy language. The code is not resolved until it is actually needed. Therefore code which looks like processing huge lists, e.g. reading huge files into memory, may not be using too much memory, since it is executed lazily. Also, infinite lists are possible in Haskell thanks to this.
The compiler seems to be doing a reasonable job at optimizing the code. This is hearsay on my end, I haven’t done any comparisons myself.

grep

My next assignment was to write a simple grep-like program. So, input coming from files or stdin is filtered line-by-line with a regular expression.

A lot of reading and searching went in this direction. I have seen lots of similar code. In fact there are many, many ways to implement this in Haskell.

Here is what I came up with:

import System.Environment -- for getArgs
import Text.Regex.Posix   -- for regular expressions

main                           = getArgs >>= mainWithArgs

mainWithArgs []                = usage
mainWithArgs ["-h"]            = usage
mainWithArgs [regex]           = processStdin $ grepContents regex
mainWithArgs (regex:filenames) = mapM_ (processFile $ grepContents regex) filenames

usage                          = putStrLn $ "Usage: grep regex [file...]"

processFile func filename      = readFile filename >>= processContents func

processStdin func              = getContents >>= processContents func

processContents func contents  = putStr $ func contents

grepContents regex             = unlines . filter (grepLine regex) . lines

grepLine regex line            = line =~ regex

There are several things here worth noting.

The code is quite Haskellesque. There are no loops. There are no conditional expressions. Except for a minimal use of Monads, it’s pure functional programming.

Lines 1 and 2 import modules for a few things which we need in this program.

Line 4 contains the actual main computation. The main computation retrieves the list of program arguments and passes them to mainWithArgs computation. The >>= operator is used to chain computations, the output of the preceding computation is fed as an argument to the computation coming after it.

One thing crucial for understanding – and at the same time quite difficult to grasp for new Haskel programmers – is the difference between computations and functions. Functions are pure, which means that when invoked with specific arguments, they always return the same value. Such idealistic approach to programming is not sustainable in real world programs. To solve this, Haskell has monads. I don’t understand the difference between a monad and a computation nor I know whether there is any difference between them yet. But monadic computations do not always return the same result, even when given the same input.

Because Haskell has this distinction between functions and computations, different types and operators are used to denote operations when monads are involved. This is also why mainWithArgs cannot be invoked like this:

main = mainWithArgs ( getArgs ) -- Invalid! These are not functions!

Back to the grep program. The mainWithArgs computation is implemented four times. But each implementation takes different arguments. This is very similar to overloading in C++. But notice that the selection of the overloaded computation will occur in run time! It depends on which arguments are used on the command line. Lines 6, 7: If the user invokes the program with no arguments, or with a single -h argument, the usage computation will be invoked. Line 8: If a single argument is passed, it is treated as a regular expression, and the input is taken from stdin. Line 9: Last but not least, if there are more arguments, the first one is treated as a regular expression and the rest are file names of files to be grepped.

mapM_ in line 9 is similar to map, except it works with computations instead of functions. We could not use map here, because we’re dealing with the IO monad.

In lines 8 and 9 we’re passing a function grepContents regex to be executed on contents. But on line 19 you can notice, that it’s really a concatenation of functions. There are no real arguments passed in it. In Haskell you can chain and pass functions around.

You can also pass incomplete list of arguments to a function and thus create a new function. In line 19 we are doing exactly that with the function grepLine. The grepLine function has two arguments, but we’re passing only one. The result is another function, which can be invoked on a String.

Last but not least, in line 20 we’re using the =~ operator which applies a regular expression to a string. The result may vary, but somehow Haskell figures out that we want the Bool result here.

I originally had type signatures for each function, but I removed them to make the code more concise. Normally type signatures provide additional check to validate the code.

One more thing to notice: the program is written with the most general computations on the top and the details on the bottom. This is not typically seen in statically typed languages, but it helps to read and understand the program.

Let’s be clear. It is mind-boggling. I am still trying to wrap my head around many aspects. Also, there are many new types of expressions in Haskell and many expressions from traditional languages are missing.

My adventure continues…

Categories: Haskell

Basic input and output

8.10.2013 Leave a comment

Day 2

It’s been an uphill battle. I’m actually surprised I got where I am!

My second program’s purpose is to retrieve two numbers from the command line, assume they are the shorter edges of a right triangle and compute the longest edge.

The “Hello, World!” program evolved only slightly, but it took a long time to arrive at this:

main = do
    putStrLn "Please enter edge A:"
    sa <- getLine
    let a = read sa

    putStrLn "Please enter edge B:"
    sb <- getLine
    let b = read sb

    let c = sqrt $ a^2 + b^2
    putStrLn ""
    putStrLn $ "A = " ++ show(a) ++ ", B = " ++ show(b) ++ ", C = " ++ show(c)

The first thing to notice here is that in line 3 we are creating a local variable sa and assigning the output of getLine to it. The return value of getLine is an IO monad, this is why we can’t just assign it, we have to use the <- operator. I don’t understand monads yet, so let’s assume they are something different from regular functions. The back story of this is that probably the whole expression has some special meaning.

Another fact about it is that the do expression, which is the entire main function, is also a monad (whatever that means).

The most difficult thing to get to work was to actually treat the value received from getLine as a number, because normally it returns a string. The thing did not want to compile no matter what! After a lot of searching I learned that local variables can be assigned/created with let, which in this case is a special, simplified version of the let statement, which works inside the do statements. The other thing is that read can be used like a function, which wasn’t apparent at first.

In other words, read turns a string into another value (like a number), and show turns any value into a string.

Helper function

The next thing I noticed was that there is some boiler plate code here. In particular we are repeating the following pattern twice: printing a prompt, reading a line from stdin and converting the read string into a usable value.

Again, an uphill battle to extract this into a function. And again it didn’t want to compile. The errors were suggesting something about the types, but heck, I wished I understood what it wanted from me! So I left it and returned the next day to arrive at this:

getData :: String -> IO Double
getData prompt = do
    putStrLn prompt
    val <- getLine
    return (read val)

main = do
    a <- getData "Please enter edge A:"
    b <- getData "Please enter edge B:"
    let c = sqrt $ a^2 + b^2
    putStrLn ""
    putStrLn $ "A = " ++ show(a) ++ ", B = " ++ show(b) ++ ", C = " ++ show(c)

The first line describes the types of the function defined in line two. It’s optional and can be deleted without consequences, but in this case it enforces that there is one argument of type String and the return type is Double. There is a monad involved, too, but I don’t understand yet how it’s linked with the main function.

The dollar sign in lines 10 and 12 changes the order of evaluation from its position to the end of the line, making the rest of the line form a single argument to the function. I could have used parentheses instead.

The ++ operator in line 12 concatenates strings.

Impressions

I began to wonder if it will continue to be an uphill battle and when it will become easier. I also started feeling a little bit like fighting PERL for a moment, I hope that feeling won’t persist…

Haskell is difficult. I am not sure yet whether it impacts its usability. We will see.

Categories: Haskell

Hello, World!

6.10.2013 Leave a comment

For some time now I’ve been planning to learn Haskell. Haskell is a modern functional programming language. The fact is that many other modern languages are constantly acquiring new features from Haskell, such as lambdas, coroutines, list comprehensions, etc. If you look at JS 1.8, Python 3 or C++11, they have more and more features which come from Haskell.

Learning new programming languages is a good way for a programmer to improve his skills. Haskell in particular requires a specific approach to programming, different from the languages most programmers use every day.

One thing, which is encouraging to me, is that you can compile Haskell programs into real executables and run them on bare metal, without any VMs, interpreters and other junk.

There is an obstacle, however. Haskell is scary, complex and difficult to learn. Even though its syntax is quite terse and there is very little boilerplate code in programs written in Haskell, it still has a steep learning curve. If you don’t know Haskell, a program written in this language may seem incomprehensible. This reminds me a lot of PERL, which I have a personal aversion to.

Python is trivial, you can learn it overnight by just reading the online tutorial and you can master it within a week. But Haskell? Forget about it. After one evening of reading my head swelled and at the end of an interactive tutorial I forgot half of the things I’ve learnt!

I’ve decided to take a different approach: to learn the language by combining reading a tutorial and simultaneously set some goals and achieve them by writing more and more complex programs which will do concrete things. I generally hate raw material and hands on experience is the best way for me to learn.

The Haskell Wikibook, which a friend recommended to me, is probably a good place to start. I also have upgraded the installation of Haskell Platform (I’ve had some older version installed for some time).

So for a start, after some basic reading to refresh what I’ve previously learnt and after looking here and there, I’ve managed to write a simple “Hello, World!” program:

main = print "Hello, World!"

I saved this in a file called test.hs. From there there are two options, either compile it like this: ghc -o test test.hs or to run it directly like this: runhaskell test.hs. However the output is not satisfactory:

$ runhaskell test.hs
"Hello, World!"

In particular I don’t want the double quotes. A quick Google search brought me down to this:

main = putStrLn "Hello, World!"

This achieves exactly what I wanted. Unfortunately putStrLn is more difficult to remember than print. After some practice I will probably remember it though.

I’ve also managed to print two strings, although it wasn’t trivial and involved using a do expression:

main = do
    putStrLn "Hello, World!"
    putStrLn "Hello, programmer!"

As you would expect, this is what I get:

$ runhaskell test.hs
Hello, World!
Hello, programmer!

So here is some recap:

The program has one function, main. This is the function which is executed to run the program. This is similar to C, Python and many other languages.
It’s trivial to write the above function. It has no arguments and the return value is implicit (no return value in fact). The function declaration couldn’t be more trivial: functionName = functionBody
putStrLn (or print) is the function I use to print a string, i.e. to write it to standard output.
I don’t have to use parentheses when invoking putStrLn. In fact everything until the end of line is treated as arguments for that function. If I wanted to do something more complex, I would have to use parentheses to tell the compiler where the arguments end. Another way is to use the $ operator, which I haven’t fully figured out yet.
The do expression lets me do several things, otherwise I would be limited to a single expression.
Similarly to Python, there are no semicolons needed. Haskell uses indentation to figure out what belongs where, just like Python. I could use semicolons if I wanted to put both putStrLn functions on one line though, like in Python.

One thing I’ve found out is that WordPress doesn’t color Haskell syntax, so my Haskell programs will be uncolored, sorry!

If you have ideas what I should do next, please let me know. I’m planning to do some string and/or list manipulation, together with maybe input and output formatting.

Categories: Haskell

Boilerplate

6.10.2013 Leave a comment

Programming languages have different levels of verbosity. Some languages have terse syntax, so you need less text to express what you want the computer to do. Others require you to repeatedly type elaborate constructs, often multiple times, to achieve the same.

Usually you don’t have a choice of programming language. You are hired by a company who already has some existing code and you have to work with that code base. Or you are targeting a specific platform and you have no choice, but to use a particular language.

Regardless of the language you use, you still have to make many choices when designing the software you write, and the choices you make will contribute to the size of the source code and may indirectly affect maintainability, extensibility and robustness.

So what makes a program a good program? I have one theory.

Copy&paste

They don’t teach how to write good programs in schools. In most schools they only teach you the mechanics of programing: they show you the tools, but they don’t teach you how to use them effectively.

My programming adventure started in high school with Turbo Pascal. One of my first projects was a simple game. One time I found a bug and I realized, that I have already fixed it once in another function. I noticed that both pieces of code which had the bug were originally copied from another function.

This was one of my first lessons, and as a programmer you never stop learning. The lesson learnt was that copy&paste approach to programming is a bad practice. If you had to modify one piece of a copied code for whatever reason, you likely have to modify all of them – that’s a lot of unnecessary manual labor, which is something programmers hate. If you just wrote a new expression, which looks similar to or exactly like an existing piece of code, you should instead put it in a new function and call it in both places.

Summary: copy&paste == bad programming practice.

Beyond copy&paste

Not too long ago I’ve been reading some articles criticizing C++, the language I use the most. One of the rightful points was that C++ needs you to type the same code at least twice in more than one place. A typical location of duplicate code is class definitions. First you define a class in a header file, so you type the function declarations there, then you type exactly the same function signatures in a source file where you define the functions.

class Vehicle {
public:
    void StartMotor();
    void Accelerate(double acc);
};

void Vehicle::StartMotor() // you had to type this again!
{
    :::
}

void Vehicle::Accelerate(double acc) // and this too!
{
    :::
}

If you later need to modify a function, e.g. change the number of arguments or their types, you have to do it at least twice. It is actually even worse if you have a derived class and you overload virtual functions. To change the interface, you have to change it in 2*C places, where C is the number of classes which declare that function.

Yet worse, it may happen that you change the function’s signature in the derived class, but forget to change it in the base class. In result, you will have a bug in your program. If you use a pointer to the base class to call the function, the function from the derived class will not be called, since it has a different signature. Fortunately modern compilers issue a warning when this happens, but you still have to write the same piece of code twice.

Sounds like copy&paste? Well, that’s how I write new classes in C++, I define class’ functions in a header file, then copy them to a source file, then use a macro in my editor to expand them by removing semicolons and adding braces on new lines.

Boilerplate code

Welcome to boilerplate code. The definition of boilerplate code is exactly that: redundant code, which you have to write to make a well-formed program source code, but which is unnecessary from your perspective as a programmer.

But the definition of boilerplate code extends beyond what you actually have to write to make a well-formed program. Consider a C++ program where you use a well-known libpng library to load a PNG image. In the basic version, you could write a single function like this:

bool LoadPng(const char* filename, std::vector< char>* image)
{
    :::
}

Inside the function you call the PNG library, which has the C interface, to verify whether the file exists, is a PNG, you load the headers, determine dimensions of the image and color format, and finally you load the image data. Without going into the details, here is how a piece of that function could look like:

bool LoadPng(const char* filename, std::vector< char>* image)
{
    :::

    png_structp png_ptr = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    if (!png_ptr)
    {
        printf("Failed to create png structure\n");
        return false;
    }

    png_infop info_ptr = png_create_info_struct(png_ptr);
    if (!info_ptr)
    {
        printf("Failed to create png info structure\n");
        png_destroy_read_struct(&png_ptr, (png_infopp)NULL, (png_infopp)NULL);
        return false;
    }

    :::
}

This is maybe 10% of the code you have to write to load a PNG file. There are only two lines of code above, which actually do something potentially useful or necessary. The rest of the lines are boilerplate code.

Especially please notice that every time an error can potentially occur, you have to handle it. So you have to write error handling code many, many times. If an error occurs later in the function, you will have to delete the allocated data structures before returning from the function, and you have to write exactly the same code many times, once for every function which could fail.

I bet that there are many programs out there, which have a bug in their PNG loading code and don’t handle the error conditions fully correctly. So in some circumstances these programs will leak memory or behave unpredictably.

Now, the situation above can be improved by writing a C++ wrapper class for loading PNG. But I wouldn’t go too far with it, or we will just shift the boilerplate code elsewhere, instead of reducing it.

I can imagine somebody writing a PNG loader class, and declaring one function of the class per function of the libpng library. Such PNG loader class would simply build a C++ interface over the library’s C interface. That approach may be appealing to some, but the problem is that 90% of that class will be… boilerplate code! There will only be one single place in the whole program where this class would be used – the LoadPng() function. So all that code would be written in vain and only be a maintenance chore, plus a potential place for bugs to hide. Moreover, the compiler would generate much more unnecessary code, contributing to the program’s final size.

class PNGLoader {
public:
    PNGLoader();
    ~PNGLoader();
    static bool SigCmp(...);
    void CreateReadStruct(...); // throw on error
    void CreateInfoStruct(...); // throw on error
    :::
    void ReadImage(...); // throw on error
};

A fact of life is that many programmers call the above approach a “design”. They create beautiful class designs and hierarchies, which are only taking space and engineering time, but contribute little to the program.

And if you happen to be a C++ hater, please know that the above problem affects not only object-oriented languages, but all programming languages in general. Programmers often tend to put too much work and thought into the form instead of focusing on the contents.

So it seems to me that the best approach is to preserve a balance between the amount of code and functionality. Sure, you can write a beautiful command line argument parser, but what good is it if your program only handles two arguments anyway? Handle them correctly, but avoid too much boilerplate code which you will never use.

In case of the PNG loader, a good choice is a function like LoadPng(), which inside uses something like Boost.ScopeExit to handle errors and corner cases. Boost.ScopeExit is actually a good way of safely handling many kinds of resources in C++.

Quality

In general, programs in source code form consist of:

Comments and whitespaces, generally harmless if used wisely and not to comment out dead code,
Data structures, describing the internal state of the program,
Algorithms, which are mathematical transformations of the program state, and last but not least:
Boilerplate code, which clutters the programs, makes them harder to understand, hides bugs and generally causes programs to be big and slow.

To write good programs, avoid boilerplate code like the plague. It’s not the only rule for writing good programs, but I think it’s an important one.

Categories: Computing

Advertisement!

4.10.2013 Leave a comment

AdBlock is a wonderful little browser plugin. It does not get in the way. If you have it, you may not even know it’s there.

All it does to you is a favor. By blocking the ads, it removes all the unnecessary bling bling from your view. The result is that the websites that you are browsing contain only what you are interested in.

The functionality of AdBlock should really be part of browsers. Obviously Google would shoot themselves in the foot if they added it in Chrome. I suppose other browsers are trying to be politically correct by not including similar functionality.

There is a group of people who are against using AdBlock, because it strips them from potential income by preventing visitors from clicking on ads on their websites.

But I like AdBlock a lot, you wanna know why?

Let’s take Facebook, which is one of the most popular websites. It started off as a website who helped people get back together. Had a friend in school? Now it’s easy to reconnect! But Facebook accumulated a lot of users who upload a lot of information about themselves. It turned out to be a great source of information for which many companies pay prime money. After cashing on selling information about their users, Facebook also started serving ads to their users. Double win!

But I am not really against Facebook, I only don’t like their clunky web UI. If you are using Facebook, do you check out the things your friends post? So sometimes they post links to videos on YouTube. Unfortunately lots of YouTube videos are censored in many countries. Germany, for instance, is one of the countries leading in Internet censorship (among other countries). People from certain countries may in fact find it ironic!

So here are the two biggest problems the Internet has in this day and age:

People are the product. We, the users of the Internet, anything we produce and any information available about us are being traded.
Censorship is gaining strength, even in “highly developed” countries.

To me, AdBlock is our little means of getting back at them, a way of getting censorship onto our side.

Categories: Computing

Chris Dragan's Corner

Archive

Parsing regular expressions in Haskell (part 1)

grep in Haskell

Basic input and output

Hello, World!

Boilerplate

Advertisement!

Follow via Email

Categories

Site Map

Archives

Meta