Tech:Incidents/2015-04-04-prod7-resize

From Orain Meta
Revision as of 21:28, 4 April 2015 by imported>Addshore (sign)

Resizing prod7 from a 512MB DO instance to a 1GB DO instance (memory), and from a 20GB to a 30GB disk, caused 15 minutes of downtime.

Details

Why the resize needed to happen:

  • static (NFS), which is hosted on prod7, had very little disk space left.
  • redis kept being killed by the OOM killer.

What we were expecting to happen during resizing:

  • Page load times to increase
  • Login / session handling to be degraded
  • Loading of files and static content (JS / CSS) to be impossible

What happened:

  • Site was inaccessible for 15 mins while a snapshot was taken.

Timeline

  • Some testing happened prior to this, causing 2-3 short bursts of around 30 - 60 seconds each during which pages did not load
  • 20:14 - SF - disable redis for MW
  • 20:32 - SF - reverted the above
  • 20:49 - AS - umount /mnt/mediawiki/private/uploads on prod11
  • 20:53 - AS - Switch all mw traffic to hit prod11
  • 20:54 - AS - Disable redis d on prod7
  • 20:55 - AS - Shutdown prod7
  • 20:58 - AS - Start snapshot of prod7 before resizing should happen
  • 20:59 - MW pages stop being loaded, 504s!
  • 21:14 - AS - Noticed the 20:32 revert (redis was still enabled for MW) and manually commented out the redis config on prod11
  • 21:14 - MW pages start loading again with no styles or JS
  • 21:15 - Snapshot of prod7 done - rebooted automatically
  • 21:18 - AS - Shutdown prod7 again for resize
  • 21:21 - AS - Resizing prod7 in DO
  • 21:36 - Resize done! - Booting up
  • 21:42 - AS - Rebooting mw servers prod8 and prod9
  • 21:53 - AS - Switch all MW traffic to prod8 & prod9. All MW pages fully load again
  • 21:53 - AS - Running ansible on prod11
  • 21:58 - AS - Running ansible on prod10 (restoring original LB settings)
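
The 15-minute figure can be cross-checked against the timeline: pages stopped loading at 20:59 and started loading again (without styles or JS) at 21:14, and full page loads returned at 21:53. A minimal sketch of that arithmetic, using only the times from the entries above (the function and variable names are illustrative):

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# 20:59 pages stop loading (504s); 21:14 pages load again without styles or JS.
full_outage = minutes_between("20:59", "21:14")
# 21:14 pages load unstyled; 21:53 all MW pages fully load again.
degraded = minutes_between("21:14", "21:53")

print(f"Full outage: {full_outage} min")
print(f"Degraded (no styles or JS): {degraded} min")
```

So besides the full outage, pages were served without static content for a further period before traffic was switched back to prod8 and prod9.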

Lessons

  • When performing tasks like this, make sure you have a foolproof plan and stick to every stage of it. Double check each stage!!
  • It may be nice to have a CDN in front of static, meaning that if static.orain.org is down pages still get JS, CSS and files (we did once have this)
  • We need to work out how login sessions can be handled without redis (so we can touch redis and people can still log in)
  • NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar?
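
One way to make "double check each stage" concrete is to pair every stage of the plan with a precondition and a verification, and to refuse to continue when either fails. The following is only a rough sketch of that idea, not something Orain ran: the `Stage` and `run_plan` names and the in-memory `state` dictionary are hypothetical stand-ins for real checks against the servers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    action: Callable[[], None]                      # performs the change
    verify: Callable[[], bool]                      # confirms the change took effect
    precondition: Callable[[], bool] = lambda: True  # must hold before acting


def run_plan(stages: List[Stage]) -> List[str]:
    """Run stages in order; stop at the first failed precondition or verification."""
    completed = []
    for stage in stages:
        if not stage.precondition():
            raise RuntimeError(f"precondition failed before stage: {stage.name}")
        stage.action()
        if not stage.verify():
            raise RuntimeError(f"verification failed after stage: {stage.name}")
        completed.append(stage.name)
    return completed


# Hypothetical in-memory state standing in for the real infrastructure.
state = {"mw_uses_redis": True, "prod7_running": True}

stages = [
    Stage(
        name="disable redis for MW",
        action=lambda: state.update(mw_uses_redis=False),
        verify=lambda: state["mw_uses_redis"] is False,
    ),
    Stage(
        name="shut down prod7",
        action=lambda: state.update(prod7_running=False),
        verify=lambda: state["prod7_running"] is False,
        # Do not take prod7 (and its redis) away while MW still depends on it.
        precondition=lambda: state["mw_uses_redis"] is False,
    ),
]

print(run_plan(stages))
```

With a precondition like the one on the second stage, an unnoticed revert of the redis change would stop the plan before prod7 was shut down, rather than surfacing as 504s afterwards.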


·addshore· talk to me! 22:28, 4 April 2015 (BST)