The Moving Trilogy: Part 1

In which our hero moves between computational resources!

This summer, as I kill time until grad school starts up, I’m on campus, helping one of my professors with a little bioinformatics project he has going on. It’s been pretty interesting so far.

As you may know, proteins are chains of dozens or even hundreds of amino acids that do all sorts of important things for us living creatures. Thanks to genome sequencing and other tools, it’s pretty easy to figure out the amino acid sequence of a given protein. But proteins have an additional level of structure: the chain of amino acids folds upon and around itself in various ways to form a three-dimensional structure. This structure is what allows the protein to do its thing, and it’s what biologists who study this stuff are really interested in. Problem is, it’s not easy to figure out how a protein folds, given its amino acid sequence.

With nuclear magnetic resonance (NMR) spectroscopy, you can get a sort of image of the protein’s structure. NMR spectroscopy tells you roughly where in the three-dimensional structure the bonds between amino acid residues occur. If you can figure out which of these experimental data points (we’ll call them “spin systems,” since that’s what they’re called) map to which residues in the protein, you can get a good idea of the protein’s structure.

Currently, this mapping process has to be done by hand, and it can take several months to find a mapping for a single protein. Our goal is to automate this process by implementing a computer program that, given the amino acid sequence of the protein and the NMR spectroscopy data, can figure out the most probable mappings. (An exact solution is impossible, not just because of noise in the data, but also because some residues might not show up in the NMR data, and some of the spin systems that do show up don’t correspond to any residue in the protein.)

What makes this problem tricky is that you quickly run into combinatorial explosion. A brute-force approach is on the order of O(n^n), which is very, very bad. For example, if a protein has 50 amino acids (and that’s a pretty small protein), a brute-force search would need to check about 8.88 × 10^84 (an 85-digit number) possible mappings. This approach is infeasible, unless you don’t mind the fact that you, your kids, your grandkids, and your great grandkids will all have long since died of old age before you find the answer.
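If you want to double-check that number, here’s a quick back-of-the-envelope calculation (a throwaway sketch, nothing to do with the actual program) using Java’s BigInteger:

```java
import java.math.BigInteger;

// Throwaway sanity check: with roughly 50 residues and 50 candidate spin systems,
// a naive search over every possible assignment is on the order of 50^50.
public class BruteForceCount {
    public static void main(String[] args) {
        BigInteger count = BigInteger.valueOf(50).pow(50);
        System.out.println(count);                       // prints the exact value of 50^50
        System.out.println(count.toString().length());   // 85 digits, i.e. roughly 8.88 x 10^84
    }
}
```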

The solution taken by the software we’re developing relies on a branch-and-bound strategy. In a nutshell, we start with the entire solution space. The program eliminates as many candidate mappings as it can find that are inconsistent (i.e., that are impossible given the data) or that fall outside the scoring criteria (i.e., that are so much worse than solutions we’ve already found that it’s not worth examining them more closely). It then looks for the position in the protein with the fewest remaining spin systems that could be mapped to it. The program then tries each possible guess for that part of the mapping and repeats the process. For example, if residue 10 could be mapped to spin systems A, B, or C, we split the problem into three subproblems: one where 10=A, one where 10=B, and one where 10=C.
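To make that loop concrete, here’s a toy version in Java. Everything in it (the scoring table, the bound, the names) is made up for illustration; the real program’s consistency checks and scoring are far more involved.

```java
import java.util.*;

// Toy branch and bound over residue -> spin-system assignments: prune, pick the
// most constrained residue, and branch on each remaining guess. Purely illustrative.
public class BranchAndBoundSketch {
    static double[][] score;                               // score[r][s]: made-up score for mapping spin system s to residue r
    static double bestScore = Double.NEGATIVE_INFINITY;
    static int[] bestMapping;

    static void solve(int[] mapping, List<Set<Integer>> candidates) {
        double earned = 0, optimistic = 0;
        int branchResidue = -1;
        for (int r = 0; r < mapping.length; r++) {
            if (mapping[r] >= 0) {                         // residue already assigned
                earned += score[r][mapping[r]];
            } else {
                Set<Integer> c = candidates.get(r);
                if (c.isEmpty()) return;                   // inconsistent: no spin system fits, dead end
                double best = Double.NEGATIVE_INFINITY;
                for (int s : c) best = Math.max(best, score[r][s]);
                optimistic += best;                        // most optimistic completion for this residue
                if (branchResidue < 0 || c.size() < candidates.get(branchResidue).size())
                    branchResidue = r;                     // remember the most constrained residue
            }
        }
        if (earned + optimistic <= bestScore) return;      // bound: can't beat the best mapping found so far
        if (branchResidue < 0) {                           // every residue assigned: a complete mapping
            bestScore = earned;
            bestMapping = mapping.clone();
            return;
        }
        for (int s : candidates.get(branchResidue)) {      // branch: one subproblem per guess
            int[] nextMapping = mapping.clone();
            nextMapping[branchResidue] = s;
            List<Set<Integer>> nextCandidates = new ArrayList<>();
            for (Set<Integer> c : candidates) {
                Set<Integer> copy = new HashSet<>(c);
                copy.remove(s);                            // a spin system maps to at most one residue
                nextCandidates.add(copy);
            }
            solve(nextMapping, nextCandidates);
        }
    }

    public static void main(String[] args) {
        score = new double[][] { {0.9, 0.1, 0.2}, {0.3, 0.8, 0.1}, {0.2, 0.4, 0.7} };
        List<Set<Integer>> candidates = new ArrayList<>();
        for (int r = 0; r < 3; r++) candidates.add(new HashSet<>(Arrays.asList(0, 1, 2)));
        solve(new int[] {-1, -1, -1}, candidates);
        System.out.println(Arrays.toString(bestMapping) + "  score = " + bestScore);
    }
}
```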

Here’s where I come in. This algorithm is what’s known as “embarrassingly parallel” — it’s fairly straightforward (at least, from a conceptual point of view) to parallelize it. In other words, you can split the overall problem up into lots of smaller pieces, none of which depend on the others, and give those pieces to different processors or even different computers to work on, all at the same time. My job has been to take the original single-threaded code and parallelize it: both to add multithreading support (so if you have multiple processors, each can work on a piece of the problem simultaneously) and RMI support (so if you have multiple computers, each can work on a piece).
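The multithreaded side looks roughly like the following fork/join-style sketch, where each branch of the search becomes its own task. The Subproblem interface and its methods are placeholders I made up for this post, not the real program’s API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Placeholder for the real solver's data structures: a subproblem knows how to
// prune itself and branch into independent child subproblems.
interface Subproblem {
    List<Subproblem> pruneAndBranch();   // prune, pick a residue, return one child per guess
    void recordIfBest();                 // called when the subproblem is a complete (or dead-end) mapping
}

// Each subproblem becomes a task; the pool's worker threads (one per processor)
// run whatever tasks are ready, so independent branches proceed in parallel.
class SearchTask extends RecursiveAction {
    private final Subproblem subproblem;
    SearchTask(Subproblem subproblem) { this.subproblem = subproblem; }

    @Override
    protected void compute() {
        List<Subproblem> children = subproblem.pruneAndBranch();
        if (children.isEmpty()) {
            subproblem.recordIfBest();
            return;
        }
        List<SearchTask> tasks = new ArrayList<>();
        for (Subproblem child : children) tasks.add(new SearchTask(child));
        invokeAll(tasks);                // fork the children and wait for all of them
    }
}

public class ParallelSearchSketch {
    public static void search(Subproblem root) {
        new ForkJoinPool().invoke(new SearchTask(root));   // defaults to one worker per processor
    }
}
```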

The trick, of course, is to figure out how to do this in an efficient manner. The data structures used by the program are huge — for example, the root solution for an 87-residue protein takes over 300 MB of memory. You simply can’t pass objects that large across the network in anything like real time, so you need to find good ways to trade size for time (i.e., it might be faster to have another computer recompute the data instead of sending the data to it). You want to keep all the computers busy as much as possible, without spending a lot of time moving tasks between computers (which has overhead associated with it). Then there’s the problem of efficiently reassembling all the results from all the subproblems into the final solution. (Note that I’m using “efficiently” to mean “not taking a ludicrous amount of time to complete.”) Not to mention things like handling errors (e.g., what happens if one of the servers you’ve given work to suddenly dies?).
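To give a flavor of the size-versus-time tradeoff, here’s roughly what a remote worker interface could look like if you ship a compact description of the subproblem and let each machine rebuild the big data structures locally. All of these names are hypothetical, invented for the post rather than taken from the actual code:

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Compact, Serializable descriptions: these are kilobytes, not the hundreds of
// megabytes that the fully expanded in-memory solution takes up.
class DataSetDescription implements Serializable {
    String proteinId;              // enough information for a worker to reload the sequence and NMR data itself
}

class PartialAssignment implements Serializable {
    int[] residueToSpinSystem;     // -1 for residues that haven't been assigned yet
}

class SearchResult implements Serializable {
    List<PartialAssignment> bestMappings;
    double bestScore;
}

// The coordinator calls loadDataSet() once per worker, then hands out solve()
// calls; a RemoteException (say, a machine dying mid-task) means the coordinator
// should put that task back on the queue for another worker.
interface SearchWorker extends Remote {
    void loadDataSet(DataSetDescription dataSet) throws RemoteException;
    SearchResult solve(PartialAssignment task, double bestScoreSoFar) throws RemoteException;
}
```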

One of the main annoyances is that I only have a handful of test data sets that I can use. I don’t know nearly enough about the biology and statistics behind the actual algorithm to construct my own test cases. The available test cases are either too trivial to put parallelism to good use or so large that they take hours to run even with multiple machines working together. It’s easy for bugs in the parallelization code to surface only in the larger data sets — I spent most of this morning tracking one of them down. In cases like that, you really need the test results before you know what to do next; you definitely don’t want to be adding new features or tweaking other things when the code isn’t even working right yet. Well, at least the down time gave me a chance to catch up on my inbox backlog.

At this point, I’ve got a lot of this stuff working. I’m still shaking some bugs out of the RMI code, but as far as I can tell they’re reliability problems rather than correctness problems. Before I left campus today I started six computers, each with eight processors, working on a single protein. Running this protein through the old single-threaded implementation on a fairly fast machine took over eight hours. Tomorrow I’ll see how much of a speedup my mini-cluster gets over the old implementation (and, of course, make sure it produces the right result). Sure, the individual processors are much slower than that fast machine, but I have 48 of them working together!
