Server Canyon - Your Cloud, Local.

Yearly Archives

10 Articles

Service Interruption Report (8/5/2016)

by Judah Wright 0 Comments

What happened:

At approximately 5:00AM PST ServerCanyon was alerted to severe performance degradation. The cause is still unknown, but for transparency here is a timeline of events:

  • 5:30AM, We rebooted our primary server
  • 5:45AM, After we realized that a severe problem was preventing the server from booting, the failover server kicked in and began downloading client backups
  • 6:45AM, 80% of affected clients had been activated on the backup server
  • 7:30AM, We got on the phone with our domain registrar, as the changes to our name servers' “A” records were about 45 minutes overdue.
  • 7:40AM, Data recovery began on the old server.
  • 8:00AM, All accounts had been brought back online, and changes made since the last offsite backup began to sync.
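
The registrar delay in the 7:30AM entry is the kind of thing a small polling script can surface early. What follows is a minimal Python sketch, not the tooling we used during the incident: the function name `wait_for_a_record` and its parameters are hypothetical, and the resolver is passed in as an argument so it can be swapped out. Keep in mind that `socket.gethostbyname` asks the local resolver, so cached answers can lag behind what the registrar has actually published.

```python
import socket
import time


def wait_for_a_record(hostname, expected_ip, resolve=socket.gethostbyname,
                      attempts=10, delay_seconds=60, sleep=time.sleep):
    """Poll until hostname resolves to expected_ip.

    Returns the number of lookups it took, or None if the record
    never matched within the allowed number of attempts.
    """
    for attempt in range(1, attempts + 1):
        try:
            current_ip = resolve(hostname)
        except OSError:
            # Lookup failed (socket.gaierror is an OSError); treat as "not yet".
            current_ip = None
        if current_ip == expected_ip:
            return attempt
        if attempt < attempts:
            # Wait before the next lookup, but not after the final one.
            sleep(delay_seconds)
    return None
```

Calling it with a name server hostname and the new server's IP address (for example, placeholder values such as `ns1.example.com` and `203.0.113.10`) returns as soon as the local resolver sees the change, or `None` if the change is still overdue after the last attempt.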


What we are doing to prevent this from happening in the future:

  • We increased our in-datacenter backup provision by 50%
  • All accounts will now be backed up daily (5 day retention) and weekly (4 week retention)
  • Over the course of the next several weeks we will be moving clients to a pool of IP addresses that can be routed to new servers
  • We are moving to a more reliable domain registrar
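
To make the new retention schedule concrete, here is a minimal Python sketch of how a pruning job could decide which backups to keep. It is an illustration, not our actual backup tooling: the function name `backups_to_keep` is hypothetical, and it assumes one backup per day, that "5 day retention" means the five most recent dailies, and that the weekly copy is the one taken on Sunday.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_keep=5, weekly_keep=4,
                    weekly_weekday=6):
    """Return the set of backup dates to retain.

    Assumptions (not from the original post): dailies are the most recent
    `daily_keep` backups, and weeklies are the most recent `weekly_keep`
    backups taken on `weekly_weekday` (Monday=0 ... Sunday=6).
    """
    # Newest first, ignoring anything dated in the future.
    past = sorted((d for d in backup_dates if d <= today), reverse=True)
    dailies = past[:daily_keep]
    weeklies = [d for d in past if d.weekday() == weekly_weekday][:weekly_keep]
    return set(dailies) | set(weeklies)


# Example: 60 consecutive daily backups ending on a Sunday.
today = date(2016, 8, 14)
all_backups = [today - timedelta(days=n) for n in range(60)]
kept = backups_to_keep(all_backups, today)
print(sorted(kept))
```

Anything not in the returned set is a candidate for deletion. When the newest backup is itself a weekly, as in the example, it counts toward both lists, so fewer than nine distinct backups are kept.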


If any customer discovers any issues with their website, please contact support immediately!


Apache OCSP Queries

by Judah Wright 0 Comments

An incident was reported where sites using SSL would break and become unresponsive.

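This problem has been encountered before. We do not yet know what triggers it, but we will keep a close eye on it.

The problem originates in OCSP stapling, a mechanism in which the server attaches a certificate status response to the SSL handshake so that clients do not have to query the certificate authority themselves. It is designed to greatly increase the speed of the initial SSL handshake.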

The issue has since been resolved.


Unscheduled Downtime Update

There was some downtime last night. This was caused by a problem with an upstream provider. Although there was nothing we could do but wait, here is a quick timeline of events:

  • At approximately 9:40PM PST we got a notification that our primary server was under attack, and that attack mitigation was automatically put into effect.
    Based on information obtained throughout the incident, we believe this was either a false alarm or unrelated to the problem.
  • At approximately 10:10PM PST I called our provider. They explained they had some electrical problems in their BHS datacenter, and that many dedicated servers and almost all VPS systems had to be taken offline. They said they had dispatched a team, and had either just finished the repairs or were nearly done.
  • At approximately 10:25PM PST I called our service provider again. They confirmed that the electrical problem had been resolved, that servers were coming back online, and that I was in the queue. The ETA for our server coming back online was not yet known.
  • At approximately 12:15AM PST the server came back online. There were networking issues. The machine could detect the failover IP addresses, but the primary (which was listed as still under attack) was not detected. There was no network access. Web services were still offline.
  • I got off the phone with our provider again at about 12:40AM PST. They said the networking problem might be caused by ongoing electrical issues. Some switches had gone down and needed to be replaced.
  • Over the course of the next 2 hours, the server rebooted a few times. I called our provider again at 2:30AM PST, and they said they were actively working on the problem. The recovery time was estimated at two and a half hours.
  • At approximately 5:15AM, I got off the phone with our service provider. They said they had restarted all VPSs, but could not ping approximately 7,000 of them. They estimated another few hours.
  • I got the first response from the server at about 6:15AM. It was back online, but with severely degraded performance.

Over the course of the next several hours, performance and response times returned to normal and the mail server began to get caught up. Everything went back to normal by 10:00 AM PST.

Here is a quote directly from the service provider (and a rough translation below), with more information on the incident:


A 10h40pm (4h40 heure française) nous avons constaté un défaut d’alimentation
de 2 salles d’hébergement, C et D, à BHS4, sur 54 baies. Il s’agit de
baies C1-18 et D1-36. Dans ces baies nous avons le PCI/VPS et les SD.
L’origine du problème est un gros cours-circuit dans le bus qui alimente les
baies D1-18. Le court-circuit a mis en défaut l’onduleur T04D. L’équipe
d’électriciens s’est dépêchée sur place pour vérifier l’etat de l’onduleur,
de lignes. L’onduleur a été remis en route. Les baies C1-18 ont été réalimenté
0h30am (6h30). Le remplacement du bus a pris 2H30. A 3h30am (9h30) les baies
D1-36 ont été réalimentée à nouveau.
Nous travaillons sur la mise en route de tous les services qui ne sont pas
encore UP.
D’autres informations vont suivre.



At 10:40 PM (4:40 French time) we detected a power failure in
2 hosting rooms, C and D, at BHS4, affecting 54 bays. These are
bays C1-18 and D1-36. In these bays we have the PCI/VPS and the SD (dedicated servers).

The origin of the problem is a major short circuit in the bus that feeds bays
D1-18. The short circuit faulted the T04D UPS (inverter). The team of
electricians rushed on site to check the state of the UPS and
the lines. The UPS was restarted. Bays C1-18 were powered back up at
12:30 AM (6:30). Replacing the bus took 2 hours 30 minutes. At 3:30 AM (9:30) bays
D1-36 were powered back up.

We are working on bringing up all services that are not
yet up.

Further information will follow.



An important note:

This is a one-in-a-million event. The chances of a catastrophic and cascading failure of this sort are incredibly small. Once the dust and fog of this event have cleared, I will work closely with Server Canyon's provider to harden our infrastructure against future events of this kind. If no progress can be made with this provider, I will not hesitate to leave them for another company.

It is good to know that more often than not, you will learn much more through failure than success. This does not excuse the fact that this event has happened. However, I believe all affected parties have learned a lot from this event!


We're sorry


New SSL Certificate Installed

by Judah Wright 0 Comments

I finally got around to spending the money to get a valid SSL Certificate.

This is exciting because now customers won't have to see annoying “self-signed” certificate warnings when accessing FTP, Email, or cPanel.

It's another great step on the long road to being the best, and we won't stop until we are!


Status Page Live

by Judah Wright 0 Comments

I thought it would be pretty cool to have a status page to show the network status, so I set one up.

You can find it at

I will embed the system into a widget on this site eventually, but for now this will do.

If you would like more information on the software used for that status page, you can see it on its author's GitHub page HERE.


Scheduled Maintenance Notice

by Judah Wright 0 Comments

On the first of February we will apply some software updates to our core infrastructure.

We do not anticipate any downtime; however, the increased network load may cause file transfers and downloads to be slower than normal. The entire process from beginning to end should take no more than 20 minutes.

If any details change, we will put out another post when we have more information.


Load Spike

by Judah Wright 0 Comments

At approximately 10:30 AM PST we experienced a momentary load spike, resulting in several websites being slow to respond or timing out.

Upon investigation, it appeared to be a bot trying to gain unauthorized access to the primary server. Security measures and automated attack mitigation services picked up around 10:35 AM PST and traffic slowly died down to normal levels by 11:00 AM PST. We will continue to closely monitor our infrastructure for any further problems.

No data was lost and no unauthorized access was granted. Services are running smoothly again.


Billing Errors

by Judah Wright 0 Comments

Earlier today we discovered that a few clients had been invoiced incorrect amounts for hosting.

The problem was discovered after some customers were notified that a new invoice was available. After we resolved the problem, new invoice notifications were sent out.
It is important to note that no clients were actually billed because of the error.

As of 9:00PM tonight (PST), all the green lights are happily blinking again!
