Residential Network Debugging: A Case Study

Problem: This morning, our apartment’s Internet connectivity was horrible. Connecting to websites took a long time (and sometimes failed outright with a timeout), and when a connection was established, the transfer rate was at dialup speeds. 3 KB/s is two orders of magnitude below the 300 KB/s and above I’ve achieved over our cable modem. The problem wasn’t limited to websites, either; similar problems plagued e-mail, instant messaging, and everything else using the Internet.

This condition lasted for several hours. Clearly, something was wrong. But what?

[Editor's note: long post ahead, but it's educational and does contain a story about me impersonating Benji.]

Well, if I just came out and told you what the problem was, how would you know what to do if you ever faced the same problem? As anyone with a modicum of computer expertise knows, family and friends will call on you, helpless, as soon as “the Internet is broken.” And once you finally remedy the problem, they’ll wonder, amazed, “how did you know what the problem was?” (They’ll probably ignore whatever answer you give, but they’ll still say it.) Of course, you didn’t “know” what was wrong; you applied your knowledge of how things are supposed to work and the good old scientific method to track down the problem and remedy it. If said family or friend was watching over your shoulder as you worked, they surely thought everything you did was fixing the problem, whereas 90% of it was trying to figure out what the problem was in the first place.

So for those of you wondering how the “magic” actually works, here’s how I tracked down the problem. We’ll have to delve a bit into how the Internet works, of course, but I’ll try to keep things from getting too complicated. (Which means this discussion will be simplified; don’t get pedantic on me.)

First, here’s what our network looks like (ASCII art courtesy of the deplorable state of diagramming tools for Linux):

kryten --+   Paul's                                |
         +-- router ---+       APARTMENT           |     CABLE
holly ---+             |                           |     COMPANY
                       |                           |
             Benji's   |                           |
                room --+                           |
                       +---- apartment --- cable ----- gateway --- (rest of Internet)
             Adam's    |       router      modem
               room ---+                           |
                       |                           |
             Dave's    |                           |
               room ---+                           |

It’s a bit more complicated than your typical residential LAN. Note in particular that not only is there a router (technically a NAT box, but we’ll ignore that detail) that connects each bedroom to the network, but my bedroom has its own router to connect both of my computers to the network.

So kryten’s network connectivity is terrible. The cause could be anywhere between it and the servers it’s trying to connect to. Since the problem persists regardless of what Internet server is being contacted, the problem probably lies in the connection to the Internet, rather than the Internet itself.

The path to the Internet passes through two administrative domains: the apartment and the cable company. Why do we care about the distinction? If it’s a problem within the apartment, I can fix that myself. If it’s at the cable company, all I can do is report the problem to them and wait for them to fix it.

Whenever you’re facing connectivity problems, the first tool you reach for is almost always ping. In a nutshell, ping checks for a connection from your computer to another server and back. It does this by sending an ICMP “echo request” packet to the server; when the server receives it, it replies by sending an ICMP “echo reply” packet back. By sending multiple pings, we can gather two important statistics:

  • The round trip time: how long does it take to get the response back?
  • The amount of packet loss: how many pings don’t get a response?

Pinging a few servers revealed a typical round trip time of over two seconds, with about 60% packet loss.
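If you'd like to see how those two statistics fall out of a batch of pings, here's a minimal sketch in Python. It doesn't send any packets itself; it assumes you've already collected one result per echo request. The function name summarize_pings and the sample numbers are made up for illustration (chosen to resemble what I saw, not copied from real output).

```python
from statistics import mean

def summarize_pings(rtts_ms):
    """Summarize a batch of ping results.

    rtts_ms holds one entry per echo request: the round-trip time in
    milliseconds, or None if no echo reply ever came back.
    """
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    lost = sent - len(replies)
    return {
        "sent": sent,
        "received": len(replies),
        "loss_percent": 100.0 * lost / sent if sent else 0.0,
        "avg_rtt_ms": mean(replies) if replies else None,
    }

# Illustrative samples only: 10 pings, 6 of them unanswered.
samples = [2100.0, None, None, 2300.0, None, None, 1900.0, None, 2500.0, None]
print(summarize_pings(samples))
```

Note that lost pings are left out of the average round trip time; they only count toward packet loss.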

Packet loss, you may ask? If the server is running, why wouldn’t it respond? That’s because the Internet is unreliable. More precisely, IP (the protocol that the Internet is based on) does not guarantee that a packet will actually reach its destination. Anything you send across the Internet is split up into chunks called packets, which are then sent to the destination. Since you usually don’t have a direct connection to the server you want to talk to, packets get passed along from one router to another until they reach their destination. If one of those routers is malfunctioning or overloaded, it may drop packets.

But then why, if we’re experiencing high packet loss, can we still download web pages? That’s thanks to TCP, a protocol that runs on top of IP. It’s TCP’s job to provide the reliability that IP doesn’t. When TCP sees that a packet didn’t reach its destination, it resends it. This is why downloads still work despite packet loss: the packets will be resent until they arrive. Naturally, resending introduces additional delays, since the sender has to wait before trying again.
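To get a feel for what resending costs, here's a rough simulation in Python. It models only the "resend until it arrives" idea; real TCP uses acknowledgements, timers, and windows, none of which appear here. The name send_reliably, the max_tries cutoff, and the fixed random seed are my own assumptions for the sketch.

```python
import random

def send_reliably(num_packets, loss_rate, rng, max_tries=1000):
    """Deliver each packet over a lossy link, resending until it arrives.

    Returns the total number of transmissions needed. loss_rate is the
    probability (0.0 to 1.0) that any single transmission is dropped.
    """
    transmissions = 0
    for _ in range(num_packets):
        for _attempt in range(max_tries):
            transmissions += 1
            if rng.random() >= loss_rate:  # this copy survived the trip
                break
        else:
            # Every attempt was dropped; treat the link as dead.
            raise RuntimeError("gave up: link appears to be down")
    return transmissions

rng = random.Random(42)  # fixed seed so repeated runs behave the same
total = send_reliably(10_000, 0.60, rng)
print(f"{total / 10_000:.2f} transmissions per packet on average")
```

With 60% loss, each packet needs about 1 / (1 - 0.6) = 2.5 transmissions on average, and every resend comes with a wait beforehand, which is a big part of why everything feels so slow.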

OK, so let’s recap. Pinging any server on the Internet shows high round-trip times and lots of packet loss. Since it’s unlikely that every server on the Internet is simultaneously having problems, the cause probably lies between our computer (kryten) and the Internet. But is the problem in the apartment’s LAN, or in the cable company’s network?

If you look at the diagram above, you’ll notice there are several hops between kryten and the Internet: a router in my room, a router downstairs, the cable modem, and the gateway router. The gateway router is where all packets leaving the apartment go, and it’s run by the cable company. It’s the gateway’s job to forward the packets to their ultimate destination.

Since we’re getting packet loss, one of those hops is probably dropping the packets. (It could be a hop after the gateway, but that will still be in the cable company’s network.) How can we figure out which one? We can try pinging each of them in turn to see what kind of connectivity we have to each.

  • Paul’s router: < 1 ms round-trip time, no packet loss. This is what we expect, since kryten is directly connected to it. This router looks like it’s working fine.
  • Apartment’s router: ~ 2 ms round-trip time, no packet loss. Also looks good.
  • Gateway: 2 second round-trip time, high packet loss.
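The hop-by-hop check above boils down to "walk outward until something looks wrong." Here's a small Python sketch of that logic. The thresholds (100 ms and 5% loss) are arbitrary assumptions of mine, the function name first_bad_hop is invented, and the 60% loss figure for the gateway is a stand-in, since all I recorded for that hop was "high packet loss."

```python
def first_bad_hop(measurements, max_rtt_ms=100.0, max_loss_percent=5.0):
    """Return the name of the first hop that looks unhealthy, or None.

    measurements is a list of (name, avg_rtt_ms, loss_percent) tuples,
    ordered from the nearest hop to the farthest.
    """
    for name, avg_rtt_ms, loss_percent in measurements:
        if avg_rtt_ms > max_rtt_ms or loss_percent > max_loss_percent:
            return name
    return None

# Round trip times from the list above; the gateway's loss is a stand-in.
hops = [
    ("Paul's router", 1.0, 0.0),
    ("apartment router", 2.0, 0.0),
    ("gateway", 2000.0, 60.0),
]
print(first_bad_hop(hops))  # gateway
```

Keep in mind what a result like this does and doesn't say: the trouble lies somewhere on the way to that hop, which includes the link leading up to it, not just the device itself.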

Wait, you say, why didn’t we ping the cable modem? Simple: because we can’t.

Huh?

Remember how I said that the Internet runs on IP? That’s not technically true. There’s an entire stack of protocols running on top of each other. It goes something like this (pedantry alert: this is a simplified version of the OSI model where layers 5 through 7 have been mashed together):

  • Layer 5: Application Layer: This is where the application-specific protocol runs (e.g., HTTP for web, POP3, IMAP, and SMTP for e-mail, etc.)
  • Layer 4: Transport Layer: This delivers packets from one program to another program. TCP and UDP are the most common.
  • Layer 3: Network Layer: This delivers packets from one computer to another computer. IP goes here.
  • Layer 2: Data Link Layer: This delivers frames between two devices directly connected to each other. Which protocol goes here depends on what kind of connection you have: Ethernet, Wi-Fi, etc.
  • Layer 1: Physical Layer: The physical network link, along with the protocol used to send individual zeros and ones along it.

Routers operate at Layer 3; they look at the destination IP address and forward the packet accordingly. The cable modem, on the other hand, operates at Layer 2: it simply shovels frames from the apartment’s Ethernet network onto the cable company’s line and vice versa. The modem just serves as a way to move from one physical network to another, regardless of how the packet needs to be routed on the IP network at Layer 3.

Ping operates at Layer 3, since ICMP runs on top of IP. Since the cable modem is only at Layer 2, it doesn’t have an IP address, nor does it understand anything about IP at all. That’s why we can’t ping the modem, though we can ping the routers on either side of it.
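The rule of thumb here can be written down in a few lines of Python. This is just a toy model of the simplified stack described above, not how any real tool decides anything; Device, top_layer, and can_ping are names I made up for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    name: str
    top_layer: int  # highest layer of the simplified stack the device handles

def can_ping(device):
    """Ping rides on IP (Layer 3), so the device must reach at least Layer 3."""
    return device.top_layer >= 3

# The hops between kryten and the Internet, in order.
path = [
    Device("Paul's router", 3),
    Device("apartment router", 3),
    Device("cable modem", 2),
    Device("gateway", 3),
]
pingable = [d.name for d in path if can_ping(d)]
print(pingable)  # ["Paul's router", 'apartment router', 'gateway']
```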

Anyway, as far as our diagnostics can tell, the problem lies with the cable company’s gateway router. Time to call the cable company up and report the problem.

One minor wrinkle here: the cable service is in Benji’s name, so to keep things simple, I just tell them I’m him when I call up. I had the foresight to ask Benji for the last four digits of the social security number he gave them, so I could pass their authentication step. Yay for identity theft borrowing! (The only tricky part was spelling Benji’s last name over the phone, since I’d look really dumb if I messed it up. But no problems there, fortunately.)

Customer service proved surprisingly helpful. Once I told the rep about the problem I was seeing (horrible ping times and packet loss at the gateway), we jumped immediately into testing and resetting the cable modem to make sure it had a good connection to their network. No “try rebooting Windows” nonsense whatsoever.

The cable modem’s connection looked good from their end (obviously, they have access to Layer 2 diagnostic tools on their side of it that I lack), so the rep forwarded me to tech support. He verified the horrible pings along the link out of our apartment and suggested I try unplugging the apartment router from the cable modem. Surprisingly, once I did, the connectivity from the cable company to the modem cleared up! And once I plugged the router back into the cable modem, the problems reappeared.

So, it turned out my original diagnosis was incorrect: it wasn’t the gateway router after all, but the apartment router saturating the link to it through the cable modem for some reason. Now that I knew the problem was on our end after all, I thanked the tech for his help and went back to work on the problem myself.

Note how our initial diagnosis significantly reduced the time talking to tech support, since we were able to narrow down the original problem (“the Internet is slow!”) to something a lot more specific and easily testable (“ping times to the gateway router are horrible!”). And thanks to a support staff who knew what they were doing, we could skip the troubleshooting for people who don’t know what they’re doing (“is your computer plugged in?”) and head straight for the problem.

OK, so the apartment router’s saturating the link through the cable modem. Either the router itself is malfunctioning, generating garbage packets itself, or one of the computers in the apartment is sending out a flood of packets which the router is simply forwarding down the line as it should. Fortunately, we can do a little troubleshooting to figure out which is the problem by disconnecting everything, and then plugging each room back in by itself to see if the problem resurfaces.
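That disconnect-everything-then-reconnect procedure is simple enough to sketch in Python. The link_is_healthy callback is hypothetical: it stands in for a manual check (say, pinging the gateway) with a given set of rooms plugged in. The room labels and the choice of culprit below are arbitrary placeholders, not a claim about what actually happened.

```python
def find_offenders(rooms, link_is_healthy):
    """Reconnect rooms one at a time and report which ones break the link.

    link_is_healthy(connected) should return True if the connection is
    fine with exactly the rooms in `connected` plugged in.
    """
    connected = []
    offenders = []
    for room in rooms:
        connected.append(room)
        if not link_is_healthy(connected):
            offenders.append(room)
            connected.remove(room)  # unplug it again before testing the next
    return offenders

# Pretend, purely for illustration, that "room B" is flooding the network.
def fake_check(connected):
    return "room B" not in connected

print(find_offenders(["my room", "room A", "room B", "room C"], fake_check))
# ['room B']
```

The sketch assumes the fault is stable, meaning a misbehaving room misbehaves every time it's plugged in. A problem that comes and goes will defeat it.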

Of course, we need some connection to test from, and since I was reasonably sure kryten wasn’t flooding the network, I connected my room first. No problems, so neither the router itself nor my room’s private LAN was the cause. As I added the other rooms’ connections one by one, the problem resurfaced only when one particular room was reconnected, and it quickly and mysteriously resolved itself the second time that room was reconnected. Maybe the extended break in connectivity caused whatever was flooding the network to stop? Unfortunately, since the problem didn’t resurface, and I’m not sure which room was connected to which port on the apartment router, I couldn’t track the problem further. But since the problem did disappear, that means the network was once again usable.

Finally, here’s a fun fact to finish this off: even if the miscreant computer was involved in a large upload, that shouldn’t have caused the problem, assuming TCP was being used. One of TCP’s nice features is congestion control; if it notices packet loss, it slows down the rate of transmission automatically to avoid saturating the link. A well-behaved TCP-based application wouldn’t cause that problem, which is why even when you’ve, say, got BitTorrent running, you can still surf the web without a huge reduction in speed; both BitTorrent and the web browser will adjust their transfer rates so that each gets to use the network.
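For the curious, here's a toy Python simulation of the idea behind congestion control, usually described as additive increase, multiplicative decrease (AIMD). It is a heavily simplified sketch, not real TCP: there's no slow start, no windows, and "the combined rate exceeds capacity" stands in for "packet loss noticed." The function name aimd, the parameters, and all the numbers are assumptions chosen for illustration, in arbitrary units.

```python
def aimd(capacity, rounds, start_rates, increase=1.0, decrease=0.5):
    """Simulate flows sharing one link under additive increase, multiplicative decrease.

    Each round every flow adds `increase` to its rate. If the combined
    rate then exceeds `capacity`, every flow multiplies its rate by
    `decrease` to back off.
    """
    rates = list(start_rates)
    for _ in range(rounds):
        rates = [r + increase for r in rates]
        if sum(rates) > capacity:  # link saturated: everyone backs off
            rates = [r * decrease for r in rates]
    return rates

# A big upload hogging the link (90) versus a browser just starting out (1).
upload, browser = aimd(capacity=100.0, rounds=500, start_rates=[90.0, 1.0])
print(f"upload={upload:.1f} browser={browser:.1f}")
```

Every back-off halves the gap between the two flows while the increases leave it unchanged, so after enough rounds they end up with nearly equal shares no matter how lopsided things started.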

7 Responses

  1. Why do you have two layers of NAT routing? That could be causing a problem, especially if there’s any overlap in the IP address ranges, which could be causing some OS to shit itself and send out a whole bunch of FIN packets under certain circumstances. But that’s assuming there’s actually a packet flood going on – how did you come to that particular conclusion, anyway? high RTT + packet loss doesn’t necessarily == packet flooding, that’s just one of MANY problems which can cause that symptom group.

    Also, tcpdump is an invaluable tool for trying to figure out where a packet flood is coming from.

  2. The two layers of NAT are just because both routers are your usual off-the-shelf so-called “Cable/DSL Router”, and I don’t think there’s a way to configure mine to just be an ordinary IP router or an Ethernet switch. But they are definitely using different address ranges.

    True, I don’t know it was a packet flood saturating our outbound link, but it seems like the most likely explanation, especially since rebooting the router itself didn’t fix it. But since the problem went away after I started plugging each room back in individually, I couldn’t track down exactly what the cause was. One of the rooms did briefly cause the problem to resurface when it was plugged back in, but didn’t after I unplugged it and re-plugged it again.

    If the problem does resurface, I’ll probably plug kryten into the apartment router’s outbound port and see what, if anything, is actually coming out.

  3. fun fact to finish this off: even if the miscreant computer was involved in a large upload, that shouldn’t have caused the problem, assuming TCP was being used.

    That was a fun fact!

  4. Wow! I totally clicked over to Paul’s journal and found a romantic and sentimental post:

    Residential Network Debugging: A Case Study

    Haha. Anyway, it was very informative. I learned a lot- thank you! If I may ask a silly question…how does one utilize such a ping-test? I doubt it is included with MS XP, but perhaps some of that open-source goodness has the answer?

    Anyway, I am not surprised to learn that Benji has had yet ANOTHER impersonator…

    Thanks for catching the grievous error in my V-Day post.

  5. Ping is very much included in Windows, though offhand I don’t think it has quite so many options as it does on Linux. Open up a command prompt and type “ping host”, where host is the name or address of the computer to ping.

  6. Wow, I think I learned more in your blog post than in my entire undergrad “Networking & Security” class!

    Sadly, some file sharing applications aren’t well behaved in some respects, and caused very similar symptoms in a house I lived in once. Turns out that one person was leaving eDonkey on all the time, which is definitely not a “well-behaved” app. The main problem was that protocol has individual peers hitting up all other peers with requests for a found file. Share enough popular files and before long, between the 45 connections or so from downloading and all the people trying to get your files, you’ve just started your own Denial of Service attack against yourself, which is why network problems from eDonkey can continue 15-30 minutes after you’ve shut down the app.

    That’s my theory anyway, based on the raw data I collected while diagnosing the problem. I certainly can’t prove it, and so it is just as likely that I experienced Intelligent Internet Traffic. The Internet is honestly just so complex that I don’t think it could have come about naturally.

  7. It wouldn’t surprise me. TCP congestion control can’t do anything for getting hammered by new connections; the ongoing connections will slow themselves down, but that just leaves even more bandwidth to be taken by the hammering.

    And if eDonkey uses UDP instead of TCP, you’re screwed anyway.
