Haskell Gotchas: Strings

Posted on September 9, 2013

I’ve been playing with Haskell on and off for some time now. Let’s face it: it’s not an easy language to learn. I would agree with most people that part of the difficulty comes from its unique features (laziness, purity, etc…), and the abstractions that derive from them when facing real-world problems (Monads, Iteratees, etc…), but my point here is that there are other sources of confusion that keep misleading beginners like myself and that have nothing to do with all that.

In fact, I can see a pattern in those gotchas: Haskell often provides several tools for a particular task. It is (too) often the case that the one that’s most at hand, or that seems to be favored in the docs, is just not what a newbie would expect. I think this hurts the perception of first-time users if they do not persevere long enough to seek alternatives.

I am planning to write a series of posts on these gotchas, if only to prevent myself from falling into the same traps the next time I decide to take on Haskell again, and in the hope they will be useful to other learners. Please mind that I have no intention whatsoever of passing judgment (no flames: it is not a Haskell WTF), and that this is not the work of a seasoned Haskeller: if you find any mistakes in these posts, please report them so I can correct them and learn something.

Strings

This is a paradigmatic case, and probably the reason for many newbies walking away from Haskell wondering how come a compiled language can perform so much worse than, say, Python.

To recap, Haskell has a String type. Now, for any non-trivial text processing you can pretty much forget about it. In Haskell, the String type is defined as a regular list of Char, and if you’ve had some previous exposure to functional lists (with their cars and cdrs, heads and tails, …) you’ll know how different a beast they are from the sort of char/wchar array most newcomers would expect for a string implementation.
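To make the "String is just a list" point concrete, here is a minimal sketch using only the Prelude and Data.Char. The names greeting, shout and firstChar are made up for illustration; the point is only that ordinary list functions and list pattern matching work on a String, because it is a linked list of Char values underneath.

```haskell
import Data.Char (toUpper)

-- String is only a type synonym in the Prelude: type String = [Char]
greeting :: String
greeting = "hello"

-- Any list function works on a String; here map visits one cell per character.
shout :: String -> String
shout = map toUpper

-- Pattern matching uses the list constructors [] and (:).
firstChar :: String -> Maybe Char
firstChar []      = Nothing
firstChar (c : _) = Just c

main :: IO ()
main = do
  putStrLn (shout greeting)                      -- prints HELLO
  print (firstChar greeting)                     -- prints Just 'h'
  print (greeting == ['h', 'e', 'l', 'l', 'o'])  -- prints True
```

Every character sits in its own list cell, which is convenient for toy examples and is exactly what makes String costly for large inputs.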

Of course, Haskell ships with a bunch of modules designed to work as String replacements in its standard library. By replacement I mean that the names of the functions are kept consistent so that, if you use qualified imports, it should, in theory, be easy to switch between them. Now, these modules not only change the underlying implementation of a string, but also provide the functions to perform IO according to the type, so they come in strict IO and lazy IO flavors. The gotcha here is that this can dramatically change the semantics of input/output in the importing module, so switching between them is not always that easy in practice.

I have deliberately avoided tackling the subtleties of lazy IO in this post (I may keep that for another gotcha). Take a look at this if you can’t wait. At the moment, my advice for a newcomer would be to start with the strict versions, because they are closer to the behaviour you’d expect in most other languages.

If you have already been doing IO in Haskell with Strings and System.IO, then you have already been doing lazy IO, since it’s the default. When in doubt, you can always try both and see which one (if any) matches your performance expectations.

The Basics

Here’s what most Haskellers would recommend:

If you do not care about Unicode, use Data.ByteString.Char8, which is a packed array of Char8 (bytes). The lazy variant is Data.ByteString.Lazy.Char8. This will be enough if you can assume your input is in (a subset of) latin-1.

import qualified Data.ByteString.Char8 as B

or

import qualified Data.ByteString.Lazy.Char8 as B
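As a rough illustration of the strict Char8 flavor, the sketch below packs a String into a ByteString and runs a couple of operations on it. The helper wordCount is a made-up name. The last line shows why this module is only safe for 8-bit text: each Char is truncated to its low 8 bits when packed.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: count whitespace-separated words in a strict ByteString.
wordCount :: B.ByteString -> Int
wordCount = length . B.words

main :: IO ()
main = do
  let line = B.pack "the quick brown fox"
  print (B.length line)   -- prints 19 (one byte per character)
  print (wordCount line)  -- prints 4
  -- Char8 keeps only the low 8 bits of each Char, so U+03BB (lambda, 955)
  -- silently becomes the byte 0xBB (187).
  print (B.unpack (B.pack "\955") == "\187")  -- prints True
```

Switching the import to Data.ByteString.Lazy.Char8 should keep the function names the same, although some types differ (for instance, the lazy length returns an Int64).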

If you care about Unicode, go use Data.Text.Text:

import qualified Data.Text as T
import qualified Data.Text.IO as TextIO

or

import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TextIO
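Here is a small sketch of what caring about Unicode buys you with the strict Data.Text. The value sample is made up, and it is written with numeric escapes so the example does not depend on the encoding of the source file. Length and case conversion work on characters, while the number of bytes only shows up once you encode explicitly with Data.Text.Encoding.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "naive" with a diaeresis on the i (U+00EF), a space, and a lambda (U+03BB).
sample :: T.Text
sample = T.pack "na\239ve \955"

main :: IO ()
main = do
  print (T.length sample)                   -- prints 7 (characters)
  print (BS.length (TE.encodeUtf8 sample))  -- prints 9 (bytes in UTF-8)
  -- Case conversion understands non-ASCII letters too.
  print (T.toUpper sample == T.pack "NA\207VE \923")  -- prints True
```

The example sticks to print on purpose: writing non-ASCII text to a handle depends on the handle's encoding, which is a separate topic.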

If you need regular expressions with Unicode, though, the thing gets a little more involved.

Back and forth

Even if you use these types, you will still need Prelude.String in your code: there are a lot of libraries which will expect and return Strings. As an example, the FilePath type for file and directory manipulation is just an alias for String. Also, every string literal in your code will be parsed as a String by default (but see below). Converting between the packed types (ByteString, Text) and unpacked Strings is achieved, not surprisingly, by the functions pack and unpack. In fact, using String in your APIs, as long as the strings involved are not too large, is one (the only?) sensible use for Strings.
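A minimal sketch of that back and forth, assuming Data.Text as the packed type. Both logPath and fromLegacy are hypothetical helpers, not part of any library: the first does its work in Text and unpacks at the boundary where a FilePath (a String) is expected, the second packs a String coming from some String-based API.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: build a file name in Text, then unpack it because
-- FilePath is just an alias for String.
logPath :: T.Text -> FilePath
logPath name = T.unpack (T.append (T.strip name) (T.pack ".log"))

-- Hypothetical helper: take a String from a String-based API and pack it
-- so the real processing happens in Text.
fromLegacy :: String -> T.Text
fromLegacy = T.toUpper . T.pack

main :: IO ()
main = do
  putStrLn (logPath (T.pack "  server "))  -- prints server.log
  print (fromLegacy "abc")                 -- prints "ABC"
```

Keeping pack and unpack at the edges of your code like this means each conversion is paid once per value rather than inside every function.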

With GHC you can avoid packing and unpacking string literals by using the OverloadedStrings pragma. That is, instead of writing:

import qualified Data.Text as T
myFuncExpectingDataText . T.pack $ "Hello World!"

you can add the pragma that makes the call to T.pack unnecessary:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
myFuncExpectingDataText "Hello World!"
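For completeness, here is a self-contained version of the snippet above that should compile as-is. The body of myFuncExpectingDataText is a made-up stand-in (it just returns the length), since the original function is not shown.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Hypothetical stand-in for any function that expects a Data.Text value.
myFuncExpectingDataText :: T.Text -> Int
myFuncExpectingDataText = T.length

main :: IO ()
main = do
  -- With the pragma, the literal is typed as Text directly.
  print (myFuncExpectingDataText "Hello World!")           -- prints 12
  -- The explicit call still compiles: here the literal is typed as String.
  print (myFuncExpectingDataText (T.pack "Hello World!"))  -- prints 12
```

One thing to keep in mind is that the pragma makes literals polymorphic, so in places where nothing pins the type down you may need to add a type annotation.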

Conclusions

Here’s my piece of advice:

  • Avoid Prelude.String for text processing, but bear in mind it is sort of standard when defining your APIs.
  • If you care about Unicode, go for Data.Text. If the latin-1 subset is enough for you, stick to Data.ByteString, since regular expressions (and other tasks) are easier there.
  • If you are starting and the performance is enough for your use-case, go for the strict IO. The behaviour is more predictable.

A final note: this is, by far, not the last word regarding Strings in Haskell. For example, there are abstractions that aim to solve the predictability problems of lazy IO while staying performant (for example, Iteratees or Conduits). I just think this is the bare minimum to be able to do text processing in Haskell.