Saturday, December 10, 2016

Bluehost has extended partial outage caused by "network flapping", but the problem with ISPs is more common than most customers realize



Friday afternoon and evening, WordPress-site hosting provider Bluehost had a partial outage that affected many customers.

The problem seems to have been caused by “network flapping”, where routers point connections to other routers that turn out to be invalid.  Early Facebook messages suggested that this problem can result from a DDoS attack but is more often related to server hardware or software.  There’s a typical explanation here.   My sites remained up but slow, and Jetpack was not available.
 
Bluehost “single-tracked” many of the customers (analogous to single-tracking on the Washington DC Metro during SafeTrack), or “single-threaded” them, to reduce contention and lockouts or loops.  It then tested various network segments with the help of a vendor (was that Cisco?) to locate a “packet switching” problem resulting in infinite or recursive loops.
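To make the idea of a forwarding loop concrete, here is a minimal Python sketch. It is only an illustration, not anything Bluehost or its vendor actually ran: the router names, the `next_hop` table layout, and the `trace_route` function are all hypothetical. Each router has a table saying which neighbor to hand a packet to for a given destination, and the function follows those entries until the packet arrives, runs out of routes, or revisits a router it has already passed through.

```python
def trace_route(next_hop, source, destination, max_hops=16):
    """Follow next-hop entries from source toward destination.

    next_hop maps router -> {destination: neighbor} (hypothetical layout).
    Returns (path, status), where status is one of
    "delivered", "loop", "no-route", or "ttl-exceeded".
    """
    path = [source]
    visited = {source}
    current = source
    while current != destination:
        if len(path) > max_hops:
            # Same purpose as a TTL: give up rather than forward forever.
            return path, "ttl-exceeded"
        hop = next_hop.get(current, {}).get(destination)
        if hop is None:
            return path, "no-route"
        path.append(hop)
        if hop in visited:
            # The packet has come back to a router it already passed through.
            return path, "loop"
        visited.add(hop)
        current = hop
    return path, "delivered"


# Hypothetical tables: A, B, and C each think another one is the way to D.
looping = {"A": {"D": "B"}, "B": {"D": "C"}, "C": {"D": "A"}}
healthy = {"A": {"D": "B"}, "B": {"D": "D"}}

print(trace_route(looping, "A", "D"))  # (['A', 'B', 'C', 'A'], 'loop')
print(trace_route(healthy, "A", "D"))  # (['A', 'B', 'D'], 'delivered')
```

Real routers do not keep a visited set per packet; they rely on the TTL field to drop a packet after a limited number of hops, which is what the `max_hops` check stands in for here.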

The Control Panel was made available to customers shortly after midnight EST.

A DDoS attack on a major service provider would be a serious incident, requiring investigation by the FBI.  But customers are especially edgy now given all the “scandals” about Russian hacking around the elections.

Service providers don’t identify their customers publicly, but it’s obviously possible for any provider to host a politically unpopular potential target for foreign enemies.  The Democrats would consider my mentioning names as hate speech.

The video below from Cisco's Charles Germany explains how router loops and flapping can commonly occur. Considerable care goes into various fail-safe methods to prevent these problems.
 


The “Downdetector” thread as well as Facebook threads for Bluehost had a lot of angry comments from customers, some of whom said they were losing a lot of revenue.  This doesn’t affect my income very much, as the ad revenue and book sales are a small part of a much bigger financial picture.

Connections are themselves “objects” on a router database, so I suppose that database could get corrupted.

A similar problem could be “thread death”, at least in Java, which used to happen at the mid-tier where I worked in 2001 (ING).

I wonder if any of my old buddies are working for Bluehost (or Cisco) now, and saw action last night.

There was a logically similar problem back in 1974 on a customer Univac 1100 benchmark in Minnesota where simulated database transactions in DMS (similar to IDMS) could lock each other out.
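The lockout pattern can be sketched in a few lines of Python. This is only an illustration of the general idea, not a reconstruction of how DMS handled it: the transaction names and the `waits_for` mapping are hypothetical, and the sketch assumes each transaction waits on at most one other. A deadlock shows up as a cycle in the "who is waiting for whom" relationship.

```python
def find_deadlock(waits_for):
    """Return a list of transactions that wait on each other in a cycle.

    waits_for maps a transaction to the one transaction holding the
    record it needs (hypothetical, simplified model). Returns None when
    no cycle exists.
    """
    for start in waits_for:
        seen = [start]
        current = waits_for.get(start)
        while current is not None:
            if current in seen:
                # Everything from the first sighting onward forms the cycle.
                return seen[seen.index(current):]
            seen.append(current)
            current = waits_for.get(current)
    return None


# T1 holds record X and wants Y; T2 holds Y and wants X.
print(find_deadlock({"T1": "T2", "T2": "T1"}))  # ['T1', 'T2']
print(find_deadlock({"T1": "T2"}))              # None
```

Once a cycle is found, the usual remedy is to roll back one of the transactions in it so the others can proceed.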

In the DC area, Comcast-Xfinity had some problems with slow Internet service in late 2015 and again in the spring of 2016, with frequent stalls reaching only certain sites (like those owned by Google).  These could be related to network topology problems.

Update: Dec. 14, 2016

Bluehost has a detailed explanation of what happened, posted on Facebook;  some comments there concern me.
