Reading is dangerous (if you’re writing Haskell)

In Haskell, the read function is the usual, simple way to parse a String into a value of some other type:

ghci> :t read
read :: Read a => String -> a

read can parse anything that implements the aptly-named Read class. All the standard numeric types implement Read:

ghci> read "42" :: Int
42

Actually, it turns out the Read instance for integral types like Int is a bit too clever for its own good. Did you know (in GHC, at least) it can handle floating-point-style scientific notation, as long as the mantissa significand is an integer and the exponent is nonnegative?

ghci> read "42e2" :: Int
4200

This is nifty, but if the exponent is large, you can easily eat up all of GHC’s memory and crash the program:

ghci> read "1e1000000000" :: Int
<interactive>: out of memory (requested 45088768 bytes)
$

This is bad if the argument to read comes from an untrusted source. This was the subject of a recent security fix to happstack-server, where the entire server application could be brought down by sending it a request that used something like 1e1000000000 as a request parameter that would be parsed as an integral type. Of course, the vulnerability isn’t specific to happstack-server, but anything that tries to read an integer from untrusted input.

I don’t think Neutronium is currently vulnerable to this (other than being built against a vulnerable version of happstack-server at the moment), but that’s mostly because there isn’t yet anything that’s using read in this manner. One of the next changes I was planning to make was in how room membership is tracked, and part of that would be assigning each member of the room an integral identifier that is sent as a parameter to each room-related Ajax request. (This will support identifying who is who in the room, and solve the problem of having the same room open in multiple tabs without getting things all confused.) So, even if Neutronium isn’t vulnerable to this denial of service attack yet, it would be soon, and without having seen that posting, I don’t know if I would have even thought to worry about a seemingly simple built-in function like read for Int causing problems like that.

I can’t help but wonder if there are static source code analysis tools out there for Haskell that could find security-relevant flaws like this, like there are for more mainstream languages like C, C++, or Java. My gut tells me it’d be easier to write an analyzer for Haskell than for a language like C, but I don’t know if anyone has ever actually sat down to do it yet.

Too loco for NaNo

Having won NaNoWriMo the past three years (and another time several years before those), I think it’s safe to say that I’m capable of writing the first draft of a 50,000-word novel in the span of 30 days. The next time I write a work of fiction of that length, I’d like to release it after having a chance to give it some polish, instead of immediately posting the first thing to fall onto my keyboard.

So this November, I’m going to try to kick things up a notch.

I am going to try to create, from scratch, a fully-functional web-based multiplayer game during the month of November. By the end of November 30, I hope to have something that works and that I wouldn’t mind running on a closed network populated by trusted users. After all, just as a NaNoWriMo novel isn’t going to be in publishable quality at the end of November, I’m not expecting my program to be of high enough quality to be able to be run securely on the real Internet.

And, of course, by “from scratch”, I mean not having written any code for it yet. I’m still going to be using libraries and whatnot. It’s not like I’m going to be writing raw assembly directly to an executable or anything. That’d be crazy. And annoying. But then, it’s not as though you’re first inventing your own language when you do NaNoWriMo anyway.

Since this is something I’ll be doing on my own, it’s much more local than it is national. And I’ll be writing code instead of a novel. So it’s really a LoCoWriMo, which sounds about right.

Obviously, I won’t be posting playable demos every day, but I will be coming up with some sort of status update to post here each day. I haven’t quite figured out what will be interesting or make sense in terms of metrics. Shooting for a 50,000-line program is even more absurd than shooting for a 50,000-word novel. I’ll come up with something, I’m sure.

I wonder how long it’ll take before I start regretting this….

Comments Off

The myth of signed arithmetic wraparound

Pop quiz time! Suppose you’re on a machine where int is 32 bits wide and thus INT_MAX, the maximum possible value for an int, is 2147483647. What is the output of the following C program?

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
 
int main (int argc, char *argv[])
{
        int sum = INT_MAX + 1;
        printf("INT_MAX + 1 == %d\n", sum);
        return EXIT_SUCCESS;
}

If you’re like most C programmers, you probably answered the following:

INT_MAX + 1 == -2147483648

And if pressed to explain how you could add two positive numbers together and wide up with a negative number, you’d explain that it’s because of how two’s complement arithmetic works:

 2147483647 == 0111 1111 1111 1111 1111 1111 1111 1111
+         1 == 0000 0000 0000 0000 0000 0000 0000 0001
-----------    ---------------------------------------
-2147483648 == 1000 0000 0000 0000 0000 0000 0000 0000

It turns out that you’d be wrong. In C, the behavior of signed arithmetic overflow — that is, what happens when the result of an arithmetic operation is outside the range of what’s representable with that data type — is undefined. You might witness the wraparound behavior you’d expect from two’s complement arithmetic, but it’s certainly not guaranteed, since C doesn’t require the use of two’s complement arithmetic.

But, you might protest, that’s an awfully pedantic point to make, since pretty much any system you’ll encounter these days uses two’s complement arithmetic, with alternative representations like one’s complement or sign-and-magnitude being extremely difficult to find outside of legacy hardware. Two’s complement and arithmetic and the wraparound behavior it exhibits may not technically be part of the C standard, but in any practical application it’s a de facto standard because of the use of two’s complement arithmetic by the underlying processor, right?

Actually, no, not at all; real-world compilers will exploit the fact that the C standard does not define the behavior of integer arithmetic overflow when it’s optimizing your program. As a result, you can’t at all rely on wraparound happening. For instance, consider the following program:

#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
 
int process_something (int fd, int size)
{
        // Catch integer overflow
        if (size > size + 1)
                abort();
 
        char *string = malloc(size + 1);
        int count = read(fd, string, size);
        string[count] = 0;
        int len = strlen(string);
        free(string);
        return len;
}
 
int main (int argc, char *argv[])
{
        int fd = open("/etc/issue", O_RDONLY);
        int len = process_something(fd, INT_MAX);
        close(fd);
        return EXIT_SUCCESS;
}

The function process_something wants to allocate size + 1 bytes and is trying to be careful not to accidentally pass a negative value to malloc, which would happen if size == INT_MAX. The code tries to exploit how two’s complement works by checking size > size + 1 to detect integer overflow, and aborting if that’s the case.

This code does not work when compiled with gcc, depending on the optimization level. At -O2 and above, gcc realizes that size > size + 1 should never be true — think about it mathematically — and so eliminates the entire if block. What about if there is integer overflow? The C standard says the behavior of the program is undefined if that happens, so the compiler doesn’t need to worry about it.

When compiled at optimization level -O0 or -O1, the output of the program is:

Aborted

Which is what happens when abort gets called. This is the behavior the programmer expected.

But when compiled at optimization level -O2 or -O3, the code that calls abort gets optimized away, and this happens (or at least, could happen) instead:

Segmentation fault

Why? On x86 Linux, malloc returns NULL when it receives a negative argument, since size + 1 is negative due to two’s complement wraparound. Since malloc returned a NULL pointer, a segmentation fault happens when the following code attempts to dereference it.

This is at least what happens on my computer. But since the program invokes undefined behavior, it could do anything at all and be considered correct, from a technical standpoint. Other possibilities are: successfully reading the input file anyway, failing silently without crashing, sending threatening e-mails to your boss, drinking your coffee when you’re not looking, or making demons fly out of your nose. Undefined behavior is undefined. Anything is possible.

The point is, relying on undefined behavior is a bad idea, even if you think you know what’s actually going to happen. Part of the reason that languages like C provide areas where undefined behavior can occur, as opposed to languages like Java that try to specify what must happen in every circumstance, is that it gives compilers a lot more flexibility in optimizing code. A compiler can assume, for example, that a < a + 1 without having to first prove that a will never be the maximum possible value for its type; if process_something in our previous example were in a separate library, the compiler would have no way of knowing what a could be at runtime, and would have to assume the worst.

For more about the gotchas of undefined behavior in C, read What Every C Programmer Should Know About Undefined Behavior (part 2) (part 3) on the LLVM Project Blog, which inspired this post and showed me why I would have answered the question I opened this post with incorrectly.