Server Canyon - Reimagining Web Infrastructure

Blog

Upgrades Completed

by judahnator

Following up on the previous post (Notice of Scheduled Upgrades), all upgrades completed successfully, with less downtime than initially expected.

We are now using the latest and most secure versions of all our software, keeping our customers' security a top priority!

Notice of Scheduled Upgrades

by judahnator

Server Canyon will be installing critical software updates, some of which are security updates, early Saturday morning (2/25/2017), starting around 7:00 AM Chicago time.

The duration of the downtime is expected to be between 10 and 15 minutes.

What we are doing to prevent loss of data:
The upgrade will consist of three steps:

  1. Backups of all accounts will be verified before attempting the upgrade.
  2. A live system image will be captured. This way, if anything goes wrong, we can roll back to a running snapshot.
  3. The new software will be installed, the kernel will be updated, and the server will reboot.

Once all that is done, service will continue as normal.
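
For the technically curious, the three upgrade steps can be sketched roughly as follows. This is only an illustration in Python, not our actual tooling: the function `run_upgrade` and the four callables passed into it are hypothetical names made up for this example.

```python
from typing import Callable, List


def run_upgrade(
    verify_backups: Callable[[], bool],
    take_snapshot: Callable[[], str],
    install_and_reboot: Callable[[], None],
    restore_snapshot: Callable[[str], None],
) -> List[str]:
    """Run the three steps in order and roll back if the install fails.

    All four callables are hypothetical placeholders for real tooling.
    Returns a log of what happened.
    """
    log: List[str] = []

    # Step 1: refuse to touch anything unless the backups check out.
    if not verify_backups():
        log.append("abort: backups not verified")
        return log
    log.append("backups verified")

    # Step 2: capture a live image so there is something to roll back to.
    snapshot_id = take_snapshot()
    log.append(f"snapshot {snapshot_id} captured")

    # Step 3: install, update the kernel, reboot; restore the image on failure.
    try:
        install_and_reboot()
        log.append("upgrade installed")
    except Exception as exc:
        restore_snapshot(snapshot_id)
        log.append(f"rolled back to {snapshot_id}: {exc}")
    return log
```

The point of the ordering is that nothing destructive happens until both safety nets (verified backups and a snapshot) exist.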

Look forward to many new features in the coming months that will make not only hosting but also billing and upgrades even smoother!

–EDIT–
Upgrades have been completed. See the blog post here.

Service Interruption Report (8/5/2016)

by judahnator

What happened:

At approximately 5:00 AM PST, Server Canyon was alerted to severe performance degradation. The cause is still unknown, but for transparency, here is a timeline of events:

  • 5:30 AM: We rebooted our primary server.
  • 5:45 AM: After we realized a severe problem was preventing the server from booting, the failover server kicked into gear and began downloading client backups.
  • 6:45 AM: 80% of affected clients had been activated on the backup server.
  • 7:30 AM: We got on the phone with our domain registrar, as the changes to our name server's “A” record were about 45 minutes overdue.
  • 7:40 AM: Data recovery began on the old server.
  • 8:00 AM: All accounts had been brought back online, and changes made since the last offsite backup began to sync.

 

What we are doing to prevent this from happening in the future:

  • We increased our in-datacenter backup provision by 50%
  • All accounts will now be backed up daily (5 day retention) and weekly (4 week retention)
  • Over the course of the next several weeks, we will be moving clients to a pool of IP addresses that can be rerouted to new servers
  • We are moving to a more reliable domain registrar
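
To make the new retention schedule concrete, here is a minimal Python sketch of how "daily (5 day retention) and weekly (4 week retention)" could decide which backups to keep. It is an illustration, not our actual backup software: the function name `backups_to_keep` is hypothetical, and it assumes the weekly backup is the one taken on Sunday, which the schedule above does not specify.

```python
from datetime import date
from typing import Iterable, Set

DAILY_RETENTION_DAYS = 5    # from the policy above
WEEKLY_RETENTION_WEEKS = 4  # from the policy above


def backups_to_keep(
    backup_dates: Iterable[date],
    today: date,
    weekly_weekday: int = 6,  # assumption: weekly backups are Sunday's (Monday is 0)
) -> Set[date]:
    """Return the backup dates that the retention policy keeps."""
    keep: Set[date] = set()
    for day in backup_dates:
        age = (today - day).days
        if age < 0:
            continue  # ignore dates in the future
        if age < DAILY_RETENTION_DAYS:
            keep.add(day)  # recent daily backup
        elif day.weekday() == weekly_weekday and age < WEEKLY_RETENTION_WEEKS * 7:
            keep.add(day)  # older backup kept as a weekly
    return keep
```

Anything not returned by the function would be eligible for deletion, so at most five dailies plus the weeklies from the last four weeks are stored per account at any time.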

 

If you discover any issues with your website, please contact support immediately!

Apache OCSP Queries

by judahnator

An incident was reported where sites using SSL would break and become unresponsive.

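This problem has been encountered before. We do not yet know exactly what triggers it, but we will keep a close eye on it.

The problem originates in OCSP stapling, a mechanism in which the web server attaches a cached certificate-status response to the SSL handshake so that visitors' browsers do not have to query the certificate authority themselves. It is designed to greatly speed up the initial SSL handshake.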

The issue has since been resolved.

Unscheduled Downtime Update

There was some downtime last night. This was caused by a problem with an upstream provider. Although there was nothing we could do but wait, here is a quick timeline of events:

  • At approximately 9:40PM PST we got a notification that our primary server was under attack, and that attack mitigation was automatically put into effect.
    Based on information obtained during the incident, we believe this was either a false alarm or unrelated to the problem.
  • At approximately 10:10PM PST I called our provider. They explained they had some electrical problems in their BHS datacenter, and that many dedicated servers and almost all VPS systems had to be taken offline. They said they had dispatched a team, and had either just finished the repairs or were nearly done.
  • At approximately 10:25PM PST I called our service provider again. They confirmed that the electrical problem had been resolved, that servers were coming back online, and that I was in the queue. The ETA for our server coming back online was still unknown.
  • At approximately 12:15AM PST the server came back online. There were networking issues. The machine could detect the failover IP addresses, but the primary (which was listed as still under attack) was not detected. There was no network access. Web services were still offline.
  • I got off the phone with our provider again at about 12:40AM PST. They said the networking problem might be caused by ongoing electrical issues. Some switches had gone down and needed to be replaced.
  • Over the course of the next two hours, the server rebooted a few times. I called our provider again at 2:30AM PST; they said that they were actively working on the problem. The recovery time was estimated at two and a half hours.
  • At approximately 5:15AM, I got off the phone with our service provider. They said they had restarted all VPSs; however, they could not ping approximately 7,000 of them. They estimated another few hours.
  • I got the first response from the server at about 6:15AM. It was back online, but with severely degraded performance.

Over the course of the next several hours, performance and response times returned to normal and the mail server began to catch up. Everything was back to normal by 10:00 AM PST.

Here is a quote directly from the service provider (and a rough translation below), with more information about the incident:

 

Bonjour,
A 10h40pm (4h40 heure française) nous avons constaté un défaut d’alimentation
de 2 salles d’hébergement, C et D, à BHS4, sur 54 baies. Il s’agit de
baies C1-18 et D1-36. Dans ces baies nous avons le PCI/VPS et les SD.
L’origine du problème est un gros cours-circuit dans le bus qui alimente les
baies D1-18. Le court-circuit a mis en défaut l’onduleur T04D. L’équipe
d’électriciens s’est dépêchée sur place pour vérifier l’etat de l’onduleur,
de lignes. L’onduleur a été remis en route. Les baies C1-18 ont été réalimenté
0h30am (6h30). Le remplacement du bus a pris 2H30. A 3h30am (9h30) les baies
D1-36 ont été réalimentée à nouveau.
Nous travaillons sur la mise en route de tous les services qui ne sont pas
encore UP.
D’autres informations vont suivre.

Octave

 

Hello,
At 10:40 PM (4:40 French time) we observed a power failure in
2 hosting rooms, C and D, at BHS4, affecting 54 racks. These are
racks C1-18 and D1-36. These racks hold the PCI/VPS and the dedicated servers (SD).

The cause of the problem is a major short circuit in the bus that feeds racks
D1-18. The short circuit tripped UPS T04D. The team of
electricians rushed to the site to check the state of the UPS and
the lines. The UPS has been restarted. Racks C1-18 were powered back up
at 12:30 AM (6:30). Replacing the bus took 2 hours 30 minutes. At 3:30 AM (9:30) racks
D1-36 were powered up again.

We are working on bringing up all the services that are not
yet up.

Further information will follow.

Octave

 

An important note:

This is a one-in-a-million event. The chances of a catastrophic, cascading failure of this sort are incredibly small. Once the dust has settled, I will work closely with Server Canyon's provider to harden our infrastructure against future events of this kind. If no progress can be made with this provider, I will not hesitate to leave them for another company.

It is good to remember that, more often than not, you learn much more through failure than through success. This does not excuse the fact that this event happened. However, I believe all affected parties have learned a lot from it!

 

TL;DR:
We're sorry

New SSL Certificate Installed

by judahnator

I finally got around to spending the money to get a valid SSL Certificate.

This is exciting because customers no longer have to see annoying “self-signed” certificate warnings when accessing FTP, Email, or cPanel.

It's another great step on the long road to being the best, and we won't stop until we are!

Expanding Our Network

by judahnator

Earlier today we received another block of IP addresses from our main service provider.

It is just a small step, but any great adventure must be taken one step at a time!

Status Page Live

by judahnator

I thought it would be pretty cool to have a status page to show the network status, so I set one up.

You can find it at http://servercanyon.com/status.

I will embed the system into a widget on this site eventually, but for now this will do.

If you would like more information on the software used for that status page, you can find it on its author's GitHub page.

Scheduled Maintenance Notice

by judahnator

On the first of February we will apply some software updates to our core infrastructure.

We do not anticipate any downtime; however, the increased network load may cause file transfers and downloads to be slower than normal. The entire process, from beginning to end, should take no more than 20 minutes.

If any details change, we will put out another post when we have more information.
