I’ve been playing with Haskell on and off for some time now. Let’s face it: it’s not an easy language to learn. I would agree with most people that part of the difficulty comes from its unique features (laziness, purity, etc…), and the abstractions that derive from them when facing real-world problems (Monads, Iteratees, etc…), but my point here is that there are other sources of confusion that keep misleading beginners like myself and that have nothing to do with all that.
In fact, I can see a pattern in those gotchas: Haskell often provides several tools for a particular task. It is (too) often the case that the one that’s most at hand, or that seems to be favored in the docs, is just not what a newbie would expect. I think this has an impact on the perception of first-time users, if they do not persevere long enough to look for alternatives.
I am planning to write a series of posts on these gotchas, if only to prevent myself from falling into the same traps the next time I decide to take on Haskell again, and in the hope they will be useful to other learners. Please mind that I have no intention whatsoever of passing judgment (no flames: this is not a Haskell WTF), and that this is not the work of a seasoned Haskeller: if you find any mistakes in these posts, please report them so I can correct them and learn something.
This is a paradigmatic case, and probably the reason for many newbies walking away from Haskell wondering how come a compiled language can perform so much worse than, say, Python.
To recap, Haskell has a String type. Now, for any non-trivial text processing you can pretty much forget about it. In Haskell, the String type is defined as a regular list of Char, and if you’ve had some previous exposure to functional lists (with their tails, …) you’ll know how different a beast they are from the sort of wchar array most newcomers would expect for a string implementation.
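To make the "list of Char" point concrete, here is a minimal sketch. The helper names `shout` and `firstChar` are made up for illustration; everything else comes from the Prelude and `Data.Char`.

```haskell
import Data.Char (toUpper)

-- String is a type synonym for [Char], so ordinary list functions apply.
shout :: String -> String
shout = map toUpper

-- Pattern matching on the list constructors works as for any other list.
firstChar :: String -> Maybe Char
firstChar []      = Nothing
firstChar (c : _) = Just c

main :: IO ()
main = do
  putStrLn (shout "hello")
  print (firstChar "hello")
  print (length "hello")  -- walks the whole list: linear time, not constant
```

Every character lives in its own list cell, which is why operations that are cheap on a packed array (length, indexing) require a traversal here.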
Of course, Haskell ships with a bunch of modules designed to work as String replacements in its standard library. By replacement I mean that the names of the functions are kept consistent so that, if you use qualified imports, it should, in theory, be easy to switch between them. Now, these modules not only change the underlying implementation of a string, but also provide the functions to perform IO according to the type, so they come in strict IO and lazy IO flavors: the gotcha here is that this can dramatically change the semantics of input/output in the importing module, so switching between them is not always that easy in practice.
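As a rough sketch of what "switching through qualified imports" can look like, assuming the bytestring package is available (the alias `S` and the helper `wordCount` are made up for this example):

```haskell
import qualified Data.ByteString.Char8 as S
-- Swapping the import above for the lazy variant keeps the code below compiling:
-- import qualified Data.ByteString.Lazy.Char8 as S

-- Hypothetical helper: counts the whitespace-separated words in a packed string.
wordCount :: S.ByteString -> Int
wordCount = length . S.words

main :: IO ()
main = print (wordCount (S.pack "one two  three"))
```

Only the import line changes because both modules export `pack` and `words` under the same names. The behaviour of the IO functions is where the two variants start to differ.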
I have deliberately avoided tackling the subtleties of lazy IO in this post (I may keep that for another gotcha). Take a look at this if you can’t wait. At the moment, my advice for a newcomer would be to start with the strict versions, because they are closer to the behaviour you’d expect in most other languages.
If you have already been reading input in Haskell with System.IO (readFile, getContents and friends), then you have already been doing lazy IO, since that is how those functions behave by default. When in doubt, you can always try both and see which one (if any) matches your performance expectations.
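For trying both flavors side by side, something along these lines could serve as a starting point. It is only a sketch: the file name `example.txt` and the helper names are assumptions, and it uses the strict and lazy ByteString modules from the bytestring package.

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

-- Pure helpers: count newline characters in a strict or a lazy ByteString.
strictLines :: S.ByteString -> Int
strictLines = S.count '\n'

lazyLines :: L.ByteString -> Int
lazyLines = fromIntegral . L.count '\n'

-- Strict IO: the whole file is in memory once S.readFile returns.
countLinesStrict :: FilePath -> IO Int
countLinesStrict path = strictLines <$> S.readFile path

-- Lazy IO: chunks are read on demand, when the result is actually evaluated.
countLinesLazy :: FilePath -> IO Int
countLinesLazy path = lazyLines <$> L.readFile path

main :: IO ()
main = do
  let path = "example.txt"  -- hypothetical file name
  S.writeFile path (S.pack "a\nb\nc\n")
  countLinesStrict path >>= print
  countLinesLazy path >>= print
```

The two functions look almost identical; the difference is when the file is actually read, which is exactly what makes lazy IO harder to reason about.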
Here’s what most Haskellers would recommend:
If you do not care about Unicode, use Data.ByteString.Char8, which is a packed array of Char8 (bytes). The lazy variant is Data.ByteString.Lazy.Char8. This will be enough if you can assume your input is in (a subset of) Latin-1.
If you care about Unicode, go use
If you need regular expressions with Unicode, though, things get a little more involved.
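The Latin-1 caveat can be checked with a small experiment. This sketch assumes the bytestring and text packages are installed; the helper names are made up, and Data.Text is used here only as a Unicode-aware point of comparison.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T

-- Does a String survive packing into a Char8 ByteString and back?
roundTripsBytes :: String -> Bool
roundTripsBytes s = B.unpack (B.pack s) == s

-- Same question for Data.Text.
roundTripsText :: String -> Bool
roundTripsText s = T.unpack (T.pack s) == s

main :: IO ()
main = do
  print (roundTripsBytes "caf\233")  -- U+00E9 fits in one byte
  print (roundTripsBytes "\955")     -- U+03BB is truncated to 8 bits
  print (roundTripsText "\955")      -- Text keeps the full code point
```

`Data.ByteString.Char8.pack` silently truncates every character to its lowest 8 bits, so anything outside Latin-1 is corrupted without any error being raised.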
Even if you use these types, you will still need Prelude.String in your code: there are a lot of libraries which will expect and return Strings. As an example, the FilePath type for file and directory manipulation is just an alias for String. Also, every string literal in your code will be parsed as a String by default (but see below), so converting from packed ByteStrings to unpacked Strings is achieved, not surprisingly, by the unpack function of each module. In fact, using String in your APIs, as long as they’re not too large, is one (the only?) sensible use for it.
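A typical place where the conversion shows up is a file name that arrives packed but has to be handed to an API that takes a FilePath. A minimal sketch, where `toPath`, `readNamed` and the file name `notes.txt` are all hypothetical:

```haskell
import qualified Data.ByteString.Char8 as B

-- FilePath is an alias for String, so a packed name must be unpacked first.
toPath :: B.ByteString -> FilePath
toPath = B.unpack

-- Hypothetical helper: read the file whose name is held in a ByteString.
readNamed :: B.ByteString -> IO B.ByteString
readNamed name = B.readFile (toPath name)

main :: IO ()
main = do
  let name = B.pack "notes.txt"  -- hypothetical file name
  B.writeFile (toPath name) (B.pack "remember the milk\n")
  contents <- readNamed name
  B.putStr contents
```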
For the GHC stack you can avoid packing and unpacking string literals by using the OverloadedStrings pragma. I.e. instead of writing an explicit conversion around every literal, you can add the pragma, which makes that conversion implicit.
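The original listings are not reproduced here, so the following is only a sketch of the idea, assuming GHC and the bytestring package. The names `explicit` and `implicit` are made up.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B

-- Explicit conversion: the literal is packed by hand.
explicit :: B.ByteString
explicit = B.pack "hello"

-- With OverloadedStrings the literal is converted through the IsString
-- instance of ByteString, so no call to pack is needed.
implicit :: B.ByteString
implicit = "hello"

main :: IO ()
main = print (explicit == implicit)
```

The pragma makes string literals polymorphic over any type with an `IsString` instance, so the type signature (or the surrounding context) decides what each literal becomes.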
Here’s my piece of advice:
Avoid Prelude.String for text processing, but bear in mind it is sort of standard when defining your APIs.
Use Data.Text for Unicode; if the Latin-1 subset is enough for you, stick to Data.ByteString, since regular expressions (and other tasks) are easier there.
A final note: this is by no means the last word regarding Strings in Haskell. For example, there are abstractions that aim to solve the predictability problems of lazy IO while staying performant (for example, Iteratees or Conduits). I just think this is the bare minimum to be able to do text processing in Haskell.