It only works if it works

The only tangentially relevant intro: it turns out there’s a web page for the project I’m getting paid to work on. I don’t know how much of my code is in the tarball up for download, though; whether or not it is, I know for a fact it isn’t documented. (The license is BSD-ish, so go nuts.)

Anyway, Prof. Vitek’s come to the conclusion that a big cause of the long run times of the program for some inputs is caused by large numbers of missing spin systems. Since missings can go anywhere in the protein, whereas well-defined spin systems can only fit into a few places, having lots of missings vastly increases the size of the search space. And by “vastly,” I mean “from minutes to hours to days.”

One idea we’ve discussed to attack this problem was to find a way to figure out which positions in the protein are likely to be assigned the missings. If you could do that, you wouldn’t have to consider missings for large stretches of the protein, taking a nice big worthless chunk out of the search space.

For a few of the smaller data sets, a strategy that worked pretty well was to do an initial run with maxMissings deliberately set too low. You’d still get results, but they wouldn’t be very good. However, they had enough information to let you figure out which positions were the trouble spots. If you flag those spots, you can try again with more missings but limit them to those spots. When this works, it works quite well; for one data set, it cut execution time down from about 70 minutes to about 5 minutes. Not too shabby.

Unfortunately, it only works well if it works in the first place. For some of the other data sets, setting maxMissings too low initially prevented it from finding any solutions, or even getting through the precomputation phase entirely. Since the strategy I was taking relied on getting some results — any results — from the initial run, there’s nothing you can do in that case. And as I spent couple days figuring out, there doesn’t appear to be any feasible way to look at the scoring information to deduce which positions are likely to be the trouble spots. At best, there’s a very weak correlation between penalized scores and likelihood of needing a missing, but nothing nearly strong enough to justify trying to use as a heuristic. So much for that.

Well, at least in the process I refactored some parts of the code, which will help with implementing some other approaches to cutting down the search space. Hopefully they’ll turn out to work better.

Comments are closed.