Tech:Incidents/2015-04-04-prod7-resize

An attempt to resize [[Tech:prod7|prod7]] from a 512MB DigitalOcean (DO) instance to a 1GB instance (memory), and from 20GB to 30GB of disk, caused '''15 minutes of downtime''' and more than 1 hour of [[wikipedia:FOUC|FOUC]], cookie, and login issues.
 
== Details ==
 
'''Why the resize needed to happen:'''
* Static (NFS), which is hosted on prod7, had very little space left (~1G).
* Redis kept being killed by the Linux Out-Of-Memory (OOM) Killer.
 
'''What we were expecting to happen during resizing:'''
* Page loading times would increase, because expensive operations that are normally cached would be executed every time
* Login / session issues
* Loading of files / static content / JS / CSS would be impossible
 
'''What happened:'''
 
* Some testing happened prior to this, during which we saw 2-3 short bursts of pages not loading, each lasting around 30-60 seconds
* 20:14 - Southparkfan - [https://github.com/Orain/ansible-playbook/commit/890d63efbc718cf7859ba9503cc6ac7783685ee5 disable redis for MW]
* 20:32 - Southparkfan - [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 reverted the above]
* 20:49 - Addshore - umount /mnt/mediawiki/private/uploads on prod11
* 20:53 - Addshore - Switch all MW traffic to hit prod11
* 20:54 - Addshore - Disable Redis on prod7
* 20:55 - Addshore - Shutdown prod7
* 20:58 - Addshore - Start snapshot of prod7 before resizing should happen
* 20:59 - '''MW pages stop being loaded, 504s!'''
* 21:14 - Addshore - Noticed [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 this revert] and manually commented out redis stuff on prod11
* 21:14 - '''MW pages start loading again with no styles or JS'''
* 21:15 - Snapshot of prod7 done - rebooted automatically
* 21:18 - Addshore - Shutdown prod7 again for resize
* 21:21 - Addshore - Resizing prod7 in DO
* 21:36 - Resize done! - Booting up
* 21:42 - Addshore - Rebooting MW servers prod8 and prod9
* 21:53 - Addshore - Switch all MW traffic to prod8 & prod9 - '''All MW pages fully load again'''
* 21:53 - Addshore - Running ansible on prod11
* 21:58 - Addshore - Running ansible on prod10 (restoring original LB settings)
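
The length of each outage window can be read off the bolded timeline entries. The following is a minimal Python sketch of that arithmetic. The dictionary keys and the <code>minutes_between</code> helper are names made up for illustration, and it only covers the three bolded timestamps (20:59, 21:14, 21:53), not the cookie and login issues.

```python
from datetime import datetime

# Timestamps copied from the bolded timeline entries (same day, same timezone).
# The key names are illustrative labels, not terms from the report.
EVENTS = {
    "pages_stop_loading": "20:59",   # 504s begin
    "pages_load_unstyled": "21:14",  # pages load again, without styles or JS
    "pages_fully_load": "21:53",     # traffic back on prod8 & prod9
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


hard_downtime = minutes_between(EVENTS["pages_stop_loading"], EVENTS["pages_load_unstyled"])
unstyled_window = minutes_between(EVENTS["pages_load_unstyled"], EVENTS["pages_fully_load"])

print(hard_downtime)    # 15
print(unstyled_window)  # 39
```

The first figure matches the 15 minutes of downtime reported in the summary. The second covers only the span between the two bolded "loading again" entries.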
 
== Lessons ==
 
* When performing tasks like this, make sure you have a foolproof plan, stick to every stage of it, and double-check each stage!
* It may be nice to have a CDN in front of static, so that if static.orain.org is down, pages still get JS, CSS and files (we did once have this)
* We need to work out how login sessions can be handled without Redis (so we can touch Redis and people can still log in)
* NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar? Or a server dedicated for
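
One way to make the "stick to every stage and double-check it" lesson concrete is to pair each step of the plan with an explicit verification, and refuse to continue when a verification fails. The following is a minimal Python sketch under stated assumptions: <code>Stage</code>, <code>run_plan</code>, <code>StageFailed</code> and the in-memory <code>state</code> dictionary are all hypothetical names for illustration, and are not part of the Orain ansible playbook or any tooling used during the incident.

```python
from dataclasses import dataclass
from typing import Callable, List


class StageFailed(Exception):
    """Raised when a stage's verification does not pass."""


@dataclass
class Stage:
    name: str
    action: Callable[[], None]  # performs the step
    verify: Callable[[], bool]  # double-checks that the step took effect


def run_plan(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failed verification."""
    completed: List[str] = []
    for stage in stages:
        stage.action()
        if not stage.verify():
            raise StageFailed(
                f"{stage.name!r} did not verify; completed so far: {completed}"
            )
        completed.append(stage.name)
    return completed


# Hypothetical in-memory stand-in for real server state (assumption:
# real checks would query the servers or the deployed config instead).
state = {"mw_uses_redis": True, "prod7_running": True}

plan = [
    Stage(
        "disable redis for MW",
        action=lambda: state.update(mw_uses_redis=False),
        verify=lambda: state["mw_uses_redis"] is False,
    ),
    Stage(
        "shut down prod7",
        action=lambda: state.update(prod7_running=False),
        verify=lambda: state["prod7_running"] is False,
    ),
]

print(run_plan(plan))
```

With a check like the first stage's <code>verify</code>, a reverted "disable redis" change would stop the plan before prod7 is shut down, rather than surfacing later as 504s.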
 
== Meta ==
 
* Staff on hand: Addshore, Southparkfan
* Report published by: Addshore
* Timestamp: '''[[User:Addshore|<span style="color:black">·addshore·</span>]]''' <sup>[[User_talk:Addshore|<span style="color:black;">talk to me!</span>]]</sup> 22:28, 4 April 2015 (BST), [[User:Southparkfan|Southparkfan]] ([[User talk:Southparkfan|talk]]) 22:34, 4 April 2015 (BST)