When you work in an environment with thousands of computers, you're bound to come across strange problems that make you go "Hmm..." Parts fail, computers die, switches go bad. But they don't always die in the most obvious ways.
We got a report a month or two ago about a node in one of our 10 GigE clusters that had gone offline for no apparent reason. A student went to look at it and found the node itself seemed fine, but it had no network connectivity. They started down the normal troubleshooting path: reseat the cable, swap in a new cable, replace the NIC. Still nothing. So they decided to exchange cables with the node next to it. This is where things got really weird...
Normally, when you switch cables with the adjacent node after you've already replaced the cable, you're trying to determine whether the node or the switch port is bad. Again, switches go bad; it's nothing new. What you'd normally expect is for the problem to either stay on the original node (the node is bad) or move to the adjacent node (the port is bad). What you don't expect is for the problem to disappear: both nodes now connect without problems. Move the cables back, and the problem reappears on the original node. Hmm...
Diving into low-level networking, you learn that every network interface has a clock source in the PHY layer. This seems pretty logical, as the two ends of a link have to agree on timing somehow, so all interfaces need to run at very nearly the same clock rate. Unfortunately, clock sources aren't perfect, and over time they may drift one way or another; on the wire, this shows up as jitter. Normally, it stays within a certain tolerance and can be corrected for in software. Sometimes, however, things start to drift in different directions: one clock may slow down while the other speeds up, and the combined offset grows past what the link can tolerate.
We haven't yet resolved the problem of the disappearing node. An obvious workaround is to leave the cables swapped. That meets the short-term goal of getting the node back into service, but leaves the original problem of PHY jitter driving connections into the ground.
The long-term solution will probably be replacing the switch. Our switch vendor probably won't like that solution, but we don't like nodes that go poof.
And I get tired of saying "hmm..." all the time.