There was some downtime last night. This was caused by a problem with an upstream provider. Although there was nothing we could do but wait, here is a quick timeline of events:
- At approximately 9:40 PM PST we received a notification that our primary server was under attack and that attack mitigation had automatically been put into effect. Based on information obtained throughout the incident, we believe this was either a false alarm or unrelated to the problem.
- At approximately 10:10 PM PST I called our provider. They explained that they had experienced electrical problems in their BHS datacenter, and that many dedicated servers and almost all VPS systems had to be taken offline. They said they had dispatched a team, and had either just finished the repairs or were nearly done.
- At approximately 10:25 PM PST I called our service provider again. They confirmed that the electrical problem had been resolved and that servers were coming back online; ours was in the queue, but there was no ETA yet.
- At approximately 12:15 AM PST the server came back online, but with networking issues. The machine could detect the failover IP addresses, but not the primary (which was still listed as under attack). There was no network access, and web services were still offline.
- I got off the phone with our provider again at about 12:40 AM PST. They said the networking problem might be caused by ongoing electrical issues: some switches had gone down and needed to be replaced.
- Over the course of the next two hours, the server rebooted a few times. I called our provider again at 2:30 AM PST; they said they were actively working on the problem, with recovery estimated at two and a half hours.
- At approximately 5:15 AM PST, I got off the phone with our service provider. They said they had restarted all VPSs; however, they could not ping approximately 7,000 of them. They estimated another few hours.
- I got the first response from the server at about 6:15 AM PST. It was back online, but with severely degraded performance.
Over the course of the next several hours, performance and response times returned to normal and the mail server caught up on its backlog. Everything was back to normal by 10:00 AM PST.
Here is a quote directly from the service provider (with a rough translation below), giving more information on the incident:
A 10h40pm (4h40 heure française) nous avons constaté un défaut d’alimentation
de 2 salles d’hébergement, C et D, à BHS4, sur 54 baies. Il s’agit de
baies C1-18 et D1-36. Dans ces baies nous avons le PCI/VPS et les SD.
L’origine du problème est un gros cours-circuit dans le bus qui alimente les
baies D1-18. Le court-circuit a mis en défaut l’onduleur T04D. L’équipe
d’électriciens s’est dépêchée sur place pour vérifier l’etat de l’onduleur,
de lignes. L’onduleur a été remis en route. Les baies C1-18 ont été réalimenté
0h30am (6h30). Le remplacement du bus a pris 2H30. A 3h30am (9h30) les baies
D1-36 ont été réalimentée à nouveau.
Nous travaillons sur la mise en route de tous les services qui ne sont pas
D’autres informations vont suivre.
At 10:40 PM (4:40 AM French time) we observed a power failure in 2 hosting rooms, C and D, at BHS4, affecting 54 racks: racks C1-18 and D1-36. These racks house the PCI/VPS and the SD (dedicated servers).
The origin of the problem is a large short circuit in the bus bar that feeds racks D1-18. The short circuit faulted UPS T04D. The team of electricians hurried on site to check the state of the UPS and of the lines. The UPS was restarted. Racks C1-18 were re-powered at 0:30 AM (6:30 AM). Replacing the bus took 2.5 hours. At 3:30 AM (9:30 AM) racks D1-36 were powered again.
We are working on bringing up all the services that are not [...]
Further information will follow.
An important note:
This was a one-in-a-million event. The chances of a catastrophic, cascading failure of this sort are incredibly small. Once the dust from this event has settled, I will work closely with Server Canyons' provider to harden our infrastructure against future events of this kind. If no progress can be made with this provider, I will not hesitate to move to another company.
It is often said that you learn more from failure than from success. That does not excuse the fact that this event happened, but I believe all affected parties have learned a lot from it.