This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss what problem this search engine solves, why I chose Haskell for this project, and what were the pros and cons of my language choice.
The Challenge
Proteins are amazing molecules that control biology on the molecular level. If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching The Inner Life of the Cell, an artistic rendition of several diverse processes controlled by proteins, including:
- cellular adhesion (how cells recognize and stick to each other)
- forming a skeleton for the cell
- assembling and disassembling cellular highways
- trafficking cargo along those highways
In addition, proteins also:
- sense the molecular environment
- perform chemistry
- are cellular weapons
- and much, much more!
I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:
Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds. Within these tangled balls we see several patterns emerge, both on small, medium, and large scales.
On a larger scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state. They are often easy to connect together to generate new functionality:
On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.
On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:
Our lab needs to design proteins on this small and compact scale, so they use my search engine to discover and incorporate these small motifs into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.
Overview
The most popular interface to the search engine is through PyMOL
, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a PyMOL
search plugin on the official Suns web site, which also includes an extended tutorial for how to use the search plugin.
The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically within less than a second it will stream up to 100 matches to the fragment of interest, superimposed on your original search query. You can then expand on your original query by incorporating new fragments from the search results:
The PyMOL
plugin is a thin Python-based client that connects to a remote search backend written in Haskell.
Why Haskell?
I initially did not begin this project in Haskell. I actually began it in C. The reason why was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects to try to make up for it by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and the bare-metal performance does not. My initial implentation in C attempted to brute force the solution to no avail, taking hours to perform a single search.
The second problem that I encountered was that when I got the first slow draft working I immediately got additional feature requests and I was not prepared to handle them. C is a very brittle language to program in because of manual memory management and no standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.
At that time I had been dabbling in Haskell in my free time and took interest in the language. Things that intrigued me about the language included:
- expressive and strong types
- an elegant mix of imperative and functional style (using monads)
- high-level idioms and design patterns
Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.
This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:
- You have a rich set of high-performance purely functional data structures in the
containers
,unordered-containers
, andvector
libraries - Garbage collection, type inference, and type classes automate common menial tasks
- Strong static types make it a breeze to refactor code without having to write tests
Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of getting complacent with my first working solution. These changes improved performance and now queries took milliseconds instead of hours. At this point, PyMOL
became the bottleneck and could not load search results fast enough, so there was no point improving performance further.
Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. I want to give a special shoutout to attoparsec
and aeson
, which are fantastic for this purpose.
Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.
Problems with Haskell
The first issue I encountered is that if you program in Haskell you need to be prepared to fill in the missing gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I needed to write myself because I was unsatisfied with existing iteratee-based solutions, and this eventually evolved into my pipes
stream programming library. If you are considering starting a large project in Haskell you need to be prepared to do some library work of your own.
Another issue was numerical code. This is limited almost exclusively to hmatrix
(Haskell bindings to the GNU scientific library) and repa
(a fast and native Haskell library, but without a standard library of numerical algorithms). hmatrix
was problematic because it is GPLv3 licensed, and my university, UCSF, forbids releasing code with a GPLv3 license because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign-function interface (FFI) is nice, but not having to use it at all is even nicer.
Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. With experience I learned how to avoid them, mainly by:
- Defining data types with strict fields
- Ensuring all folds are strict in the accumulator
... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because it compromises what would be an otherwise rock-solid stability story for the language.
The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the errors
library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of Yesod fame) to try to solve this problem.
Another problem is the pre-existing ecosystem of functionality dependent on lazy IO
. I'm not going to name any names, but one library that I depend on uses lazy IO
internally and it is still a source of bugs. Lazy IO
causes these sort of problems because it does not guarantee ordering of effects and also results in synchronous exceptions being thrown asynchronously in pure code (which is a big no-no in the Haskell world).
Conclusions
The code for this project is open source and hosted on Github. It consists of three separate projects:
You can also learn more about the project on the official Suns website.
People interested in learning Haskell may find the code for the search engine useful to see what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.