Distributed Programming is Tricky

The story of how a seemingly obvious assumption can cause a program’s behavior to spiral out of control.

Remember that distributed program I’m developing at my day job? Here’s the outline of the strategy I’m currently using to allocate work to each computer (a rough code sketch follows the list):

  1. The first server that becomes ready receives a task that represents the entire computation.
  2. Each task, when executed, is either a leaf (no further computation necessary) or is split into one or more subtasks after some amount of computation. These subtasks are immediately queued for local execution.
  3. Whenever a server runs out of tasks to execute, it first sends a message to the client containing the cumulative results of all the tasks it has executed. The client merges all these intermediate results into the final result.
  4. After sending its results, the idle server asks the client for more work. The client polls the busy servers, trying to steal a task sitting on their work queues; if it gets one, it relays it to the idle server.
  5. The client observes that the computation is complete when all servers are idle, waiting for more work.
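
To make the flow concrete, here’s a minimal Java sketch of that client-side loop. It’s only a sketch under my own naming: Task, ServerHandle, stealQueuedTask and the rest are hypothetical stand-ins, not the real program’s interfaces.

    import java.util.ArrayList;
    import java.util.List;

    class WorkStealingClient {
        interface Task { }

        interface ServerHandle {
            boolean isIdle();
            Task stealQueuedTask();   // null if the server's local queue is empty
            void assign(Task task);
        }

        private final List<ServerHandle> servers = new ArrayList<>();

        /** Step 1: hand the root task (the entire computation) to the first ready server. */
        void start(Task rootTask, ServerHandle firstReadyServer) {
            firstReadyServer.assign(rootTask);
        }

        /** Steps 3-5: an idle server has just reported its results; try to steal
         *  work for it, and declare the computation done once everyone is idle. */
        void onServerIdle(ServerHandle idleServer) {
            for (ServerHandle busy : servers) {
                if (busy == idleServer || busy.isIdle()) continue;
                Task stolen = busy.stealQueuedTask();
                if (stolen != null) {
                    idleServer.assign(stolen);   // relay the stolen task to the idle server
                    return;
                }
            }
            if (servers.stream().allMatch(ServerHandle::isIdle)) {
                finish();   // every server is idle and no queue has work left
            }
        }

        private void finish() { /* merging of intermediate results happens elsewhere */ }
    }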

With this strategy alone, the client has no way of knowing if one of the servers crashes or becomes unresponsive for some reason. So, while all of this is going on, the client periodically pings each server. If it doesn’t get a response, it assumes the server failed and requeues the last task sent to it for execution by another server. This way, whenever a server dies, that piece of the problem won’t also be lost.
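
The ping logic amounts to something like the following sketch. Again, pingWithTimeout, requeue and the rest are hypothetical names, but the key assumption sits in the middle of the loop.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class HeartbeatMonitor implements Runnable {
        interface Server { boolean pingWithTimeout(long millis); }
        interface Scheduler { void requeue(Object task, Server presumedDead); }

        private final Map<Server, Object> lastTaskSent = new ConcurrentHashMap<>();
        private final Scheduler scheduler;
        private final long pingTimeoutMillis;

        HeartbeatMonitor(Scheduler scheduler, long pingTimeoutMillis) {
            this.scheduler = scheduler;
            this.pingTimeoutMillis = pingTimeoutMillis;
        }

        /** The client records the last task sent to each server. */
        void taskSent(Server server, Object task) {
            lastTaskSent.put(server, task);
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                for (Map.Entry<Server, Object> entry : lastTaskSent.entrySet()) {
                    // The fragile assumption: "no ping reply within the timeout"
                    // is treated as "the server is dead", never as "the server is slow".
                    if (!entry.getKey().pingWithTimeout(pingTimeoutMillis)) {
                        scheduler.requeue(entry.getValue(), entry.getKey());
                    }
                }
                try {
                    Thread.sleep(pingTimeoutMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

Note that nothing here distinguishes “dead” from “merely slow” — which is exactly the distinction the rest of this story turns on.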

See the problem yet? OK, technically you can’t, since this is a simplified description of what’s going on, but it does capture the meat of the problem. Now watch how this strategy snowballs out of control when things go wrong.

The protein I tried to run was larger than the ones I usually use. The “expected results” for this test case were incomplete, since the old single-threaded version of the program ran out of memory before coming up with the solutions. For some reason, it didn’t occur to me that if the single-threaded version ran out of memory, then surely the multi-threaded version, doing more at once and thus using even more memory, would run into the same problem. But if I had realized that, I might never have found this nice little bug.

Anyway, everything’s running fine. The most memory-intensive parts of the algorithm running on each server are coded to recover from out-of-memory errors automatically. Running out of memory does, however, degrade the program’s performance. How much? Enough that the server fails to respond to the client’s ping before the client times out.
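
The recovery code on the servers is roughly this shape (the names and details are made up; the real logic is more involved), and the retry is exactly what makes a server too slow to answer a ping in time:

    class MemoryIntensiveStep {
        interface Chunk { }

        Object compute(Chunk chunk) {
            while (true) {
                try {
                    return expensiveCalculation(chunk);
                } catch (OutOfMemoryError oom) {
                    // Free what we can and try again. The server survives, but the
                    // freeing, collecting and retrying can stall it for a long time,
                    // long enough to miss the client's ping deadline.
                    releaseCaches();
                    System.gc();
                }
            }
        }

        private Object expensiveCalculation(Chunk chunk) { /* ... */ return null; }
        private void releaseCaches() { /* ... */ }
    }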

The client, not getting a response back, assumes that server 2 (the server that didn’t reply) is no longer reachable. It tries to close the connection (not surprisingly, also without getting a response) and requeues server 2’s task for the next available node. However, server 2 is in fact still running and processing normally, blissfully unaware that the client believes it to be dead!

Here’s where the interesting behavior starts. All the machines I’m using have essentially equivalent hardware, so a set of tasks that causes an OOM on one of them will cause an OOM on whichever machine tries to run it next, and that machine will suffer the same fate as the original server. That is exactly what happened to server 5, which was reassigned server 2’s task. Left unchecked, we’d expect every server in the cluster to fall to this problem in turn, until the client assumes all the servers are dead.

But remember, the servers aren’t aware that the client has written them off. After a while, server 2 contacted the client with the results of its computation and asked for more work. The client wasn’t written to handle a presumed-dead server reporting back, but it seemed to cope well enough: it accepted the results and gave server 2 another job, marking it as busy instead of dead. Then the thread that pings the servers came around to server 2. When the client decided server 2 was dead, it had nulled out its reference to that server, so the pinging thread crashed with a NullPointerException. Now the client has no way of noticing when servers do in fact die! If one does, the client will wait forever for that server to send in its results.
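
Stripped to its essence, and again with hypothetical names, the crash looks something like this: the failure-handling path nulls out the server’s slot, and the ping thread later dereferences it.

    class PingCrashBug {
        static class Server { void ping() { } }

        // one slot per server; a slot is nulled when that server is presumed dead
        private final Server[] servers = new Server[8];

        void onPresumedDead(int index) {
            servers[index] = null;   // "forget" the dead server
        }

        void pingLoop() {
            for (Server s : servers) {
                // Throws NullPointerException as soon as it reaches a nulled slot,
                // killing the very thread that was supposed to detect real failures.
                s.ping();
            }
        }
    }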

But wait, it gets better! Suppose we fixed this problem in the client so that it could accept a resurrected server gracefully. Remember that the task that triggers the OOM condition (from now on, the Task of Doom) is still floating around. If no server in the cluster can execute the Task of Doom without OOMing and being temporarily presumed dead by the client, we’re in even worse shape than before. Since the task never appears to complete, it keeps getting passed from server to server. And since the servers keep coming back to life (from the client’s perspective; technically they never died in the first place), the client never runs out of servers to reassign it to. The program still doesn’t terminate, but now all the nodes are busy, each grinding through the Task of Doom on its own, without any help from its fellow servers, and whenever one finishes, it just gets the Task of Doom reassigned to it again! In this case it would have been particularly nasty, because the Task of Doom, which was first assigned to server 2, was the original problem: the task representing the entire computation.

And all of this is because we assumed that a server that is alive and reachable will always answer a ping before the timeout. (OK, there are several other contributing problems, but that pinging assumption is what triggers the whole mess in the first place.)
