Tech:Incidents/2014-04-Downtimes

From Orain Meta

During April 2014, Orain experienced two major downtimes fairly close together. The first downtime (lasting 16 hours) ran from around 17:00 UTC on April 4th to 09:00 UTC on April 5th. The second downtime (lasting around 80 hours) ran from April 10th to April 14th.

First Downtime

Timeline

  • April 4th - 13:00 - Pingdom reports Orain down
  • April 5th - 08:45 - Issue identified by John; nginx restarted and the i18n cache rebuilt
  • April 5th - 09:00 - Whole farm accessible again

Description

After the first downtime, a notice was published to users at the wiki forum briefly explaining the situation. At the time of writing, we have narrowed the issue down to nginx restarting during the evening (UTC) and running into port conflicts, possibly with Apache, which unexpectedly started at the same time. John identified the issue at around 08:45 UTC. After checking that Apache was stopped, nginx was restarted and the i18n cache was rebuilt, making the whole farm accessible again at around 09:00 UTC.
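A port conflict of the kind suspected here can be checked for before restarting a webserver. The following is only a minimal sketch, not the procedure staff actually used: the helper name `port_in_use` and the choice of port 80 are assumptions for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Port 80 is an assumed value; the report does not say which ports clashed.
    if port_in_use(80):
        print("Port 80 is already taken; check for another webserver before restarting.")
    else:
        print("Port 80 is free.")
```

A check like this only shows that some process is listening; it does not say which one, so it complements rather than replaces inspecting the running services on the host.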

Ways to detect this type of downtime if it occurs again (mainly by monitoring the sites' uptime) were looked into, and it was decided to use Pingdom more actively to deal with site downtime. Apache was checked to ensure it would not start again without cause. The reason behind nginx stopping is still unknown to date.
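To illustrate the general idea of uptime monitoring (independently of how Pingdom itself works), here is a minimal sketch of a probe and an alert rule. The function names, the 10-second timeout and the threshold of three consecutive failures are all assumptions chosen for the example, not settings taken from Orain's setup.

```python
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP errors such as 502.
        return False


def should_alert(history: list[bool], threshold: int = 3) -> bool:
    """Return True when the last `threshold` probes all failed."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

Requiring several consecutive failures before alerting is one common way to avoid paging staff over a single dropped request; the right threshold depends on how often the probe runs.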

Second Downtime

This is the first official report by staff regarding this downtime. A few hours before the server crashed, a key system file was identified as corrupted on prod2. This issue had prevented staff from gaining access to the host. A few hours after the key file was identified, the server died without apparent cause during the night (early hours UTC). For the first 24 hours, all three sysadmins discussed ways to get the webserver back up and the possibility of migration. Around 24 hours into the downtime, Kudu began setting up prod4 as the new webserver and migrated all files over.

The latter 40 hours of the downtime can be attributed to the lack of a valid SSL certificate (Kudu was in the process of setting one up, but third-party issues delayed this) and to persistent 502 errors. The last 8 hours of the downtime were split between John and Addshore resolving php-fpm issues and rebuilding the i18n cache. The farm was briefly accessible for around 1 hour before going down again. The final hour of the downtime was spent by Kudu finalising the setup and disabling the services that were causing issues on the server.

We are currently discussing ways to prevent this type of downtime from occurring again and expect to update this page within the coming week.

Meta

Staff on hand during the downtimes: Addshore, John and Kudu.

Incident report published by: John (talk)

Timestamped: 18:20, 18 April 2014 (EDT)