Tech:Incidents/2014-12-prod3

This incident report briefly covers all of prod3's incidents in December but focuses on the major, final incident on December 22nd.

December Blues

In general, throughout December prod3 had several downtimes ranging from minutes to hours as a result of a lack of memory (ironically, despite having the most memory on the cluster). The MySQL process never threw any errors regarding memory; it just suddenly stopped without any logs, and only careful analysis of kern.log showed any details of the memory issues.

In theory, the crashes were attributed to high, irregular load. Because they occurred so sporadically, implementing any serious fix was not considered worthwhile.
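
The kind of kern.log analysis described above can be sketched as follows. This is a minimal illustration, not the procedure actually used during the incident: it assumes the memory problems surfaced as Linux OOM-killer messages ("invoked oom-killer", "Out of memory", "Killed process ..."), and the function names and the /var/log/kern.log path are hypothetical choices for the example.

```python
import re

# Assumption: the memory issues appear in the kernel log as OOM-killer messages.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory|Killed process \d+")
KILLED_PATTERN = re.compile(r"Killed process \d+ \(([^)]+)\)")


def find_oom_events(lines):
    """Return (line_number, line) pairs that look like OOM-killer activity."""
    events = []
    for number, line in enumerate(lines, start=1):
        if OOM_PATTERN.search(line):
            events.append((number, line.rstrip("\n")))
    return events


def killed_processes(lines):
    """Return the names of processes the kernel reports having killed."""
    names = []
    for line in lines:
        match = KILLED_PATTERN.search(line)
        if match:
            names.append(match.group(1))
    return names


def scan_file(path="/var/log/kern.log"):
    """Scan a kernel log file (hypothetical default path) for OOM events."""
    with open(path, errors="replace") as handle:
        return find_oom_events(handle)
```

Calling scan_file() on a database server and finding mysqld among the killed processes would explain a database that stops without writing anything to its own logs.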

December 22

Timeline

09:01 JohnLewis: prod3 is reported down by nagios
09:05 JohnLewis: prod3 is rebooted and an investigation into the cause of the crash begins.
09:30 JohnLewis: cause is linked to the memory issues, which seem to have resolved. Memory usage ~40%
17:49 JohnLewis: code change is deployed on prod3 and prod5, updating MariaDB
17:58 JohnLewis: prod3 doesn't update (MariaDB 'fails to stop') while prod5 has been updated for 10 minutes
18:01 JohnLewis: manually stop MariaDB via 'mysqladmin shutdown -p'
18:06 JohnLewis: prod3 begins to cycle in nagios between 'OK' and 'CRITICAL' while prod5 is OK
18:07 JohnLewis: MariaDB crashes
18:33 JohnLewis: prod3 goes into a kernel panic with all services affected
18:39 JohnLewis: SSH either won't go through or fails during key authentication
18:42 JohnLewis: 'johnflewis is not in the sudo list' messages occur, fall back to logging in as root manually
19:00 JohnLewis: boot prod3 in a recovery kernel to access MySQL files
19:20 JohnLewis: tar /var/lib/mysql up
19:25 JohnLewis: created prod12 running bare ansible, install MariaDB manually
19:40 JohnLewis: untar prod3 onto prod12, kernel panics more
19:45 JohnLewis: fresh prod12 install, get MariaDB running and deploy changes to prod{8|9|11} before importing prod3 to prod12
20:20 JohnLewis: after a lot of work, prod3 is on prod12 - all works.
20:37 JohnLewis: wikis confirmed up after a reboot of MariaDB on prod12

Summary

prod3 had been the victim of quite a few memory failures recently. An upgrade of MariaDB somehow triggered a serious failure on prod3 that took roughly two and a half hours to resolve.

Lessons

Sometimes a single small memory issue can result in a major incident, whether a few hours or a month later.
Database servers need closer monitoring, particularly of memory usage.
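
A memory check of the kind suggested above could look roughly like this. It is a sketch under stated assumptions rather than the check that was deployed: it parses /proc/meminfo-style text, falls back to MemFree + Buffers + Cached on kernels without MemAvailable, and returns Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names and the 80%/90% thresholds are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_percent(meminfo):
    """Return the percentage of memory in use."""
    total = meminfo["MemTotal"]
    if "MemAvailable" in meminfo:
        available = meminfo["MemAvailable"]
    else:
        # Older kernels lack MemAvailable; approximate it instead.
        available = (meminfo.get("MemFree", 0)
                     + meminfo.get("Buffers", 0)
                     + meminfo.get("Cached", 0))
    return 100.0 * (total - available) / total


def check_memory(meminfo, warn=80.0, crit=90.0):
    """Return (exit_code, message); thresholds are illustrative assumptions."""
    percent = used_percent(meminfo)
    if percent >= crit:
        return 2, f"CRITICAL - memory usage {percent:.1f}%"
    if percent >= warn:
        return 1, f"WARNING - memory usage {percent:.1f}%"
    return 0, f"OK - memory usage {percent:.1f}%"
```

On a live host the input would be the contents of /proc/meminfo, and the returned code would be passed to sys.exit() so the monitoring system can alert before the database is killed.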

Action

Decommissioned prod3 and pooled a replacement, prod12.

Todo

Create an 'emergency failover' plan. This is essential for services but also for passwords: as prod3's past shows, compromises can occur and we need to be prepared.
