Tech:Incidents/2015-04-04-prod7-resize
Latest revision as of 19:00, 14 June 2015
An attempt to upgrade prod7 from the 512MB plan to the 1024MB plan caused 15 minutes of downtime and more than 1 hour of FOUC (flash of unstyled content), cookie and login issues.
Details
Why the resize needed to happen:
- Static (NFS), which is hosted on prod7, had very little space left (~1G).
- Redis kept being killed by the Linux Out-Of-Memory Killer.
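Low disk space of this kind can be flagged before it becomes urgent. The following is a minimal sketch of such a check, not something taken from the Orain setup: the function name `low_on_space`, the path and the 5 GiB threshold are all assumptions chosen for illustration.

```python
import shutil

GIB = 1024 ** 3


def low_on_space(path: str, min_free_bytes: int) -> bool:
    """Return True when the filesystem holding `path` has less free
    space than `min_free_bytes`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free < min_free_bytes


if __name__ == "__main__":
    # "." is a placeholder; on the static host this would be the
    # directory exported over NFS (hypothetical, not from the report).
    if low_on_space(".", 5 * GIB):
        print("warning: less than 5 GiB free")
```

Run from cron or a monitoring agent, a check like this would give some warning before the volume drops to the ~1G level described above.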
What we were expecting to happen during resizing:
- Slower page loads, because expensive operations that are normally cached would run on every request
- Login/session issues
- Files, static content, JS and CSS failing to load
What happened:
- The site was inaccessible for 15 minutes while a snapshot was taken.
Timeline
- Some testing happened before this, during which pages failed to load in 2-3 short bursts of around 30-60 seconds each
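- 20:14 - Southparkfan - disable Redis for MW (https://github.com/Orain/ansible-playbook/commit/890d63efbc718cf7859ba9503cc6ac7783685ee5)
- 20:32 - Southparkfan - reverted the above (https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163)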
- 20:49 - Addshore - umount /mnt/mediawiki/private/uploads on prod11
- 20:53 - Addshore - Switch all MW traffic to hit prod11
- 20:54 - Addshore - Disable Redis on prod7
- 20:55 - Addshore - Shutdown prod7
- 20:58 - Addshore - Start snapshot of prod7 ahead of the resize
- 20:59 - MW pages stop being loaded, 504s!
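- 21:14 - Addshore - Noticed the 20:32 revert (https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163) and manually commented out the Redis configuration on prod11
- 21:14 - MW pages start loading again with no styles or JS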
- 21:15 - Snapshot of prod7 done - rebooted automatically
- 21:18 - Addshore - Shutdown prod7 again for resize
- 21:21 - Addshore - Resizing prod7 in DO
- 21:36 - Resize done! - Booting up
- 21:42 - Addshore - Rebooting mw servers prod8 and prod9
- 21:53 - Addshore - Switch all MW traffic to prod8 & prod9. All MW pages fully load again
- 21:53 - Addshore - Running ansible on prod11
- 21:58 - Addshore - Running ansible on prod10 (restoring original LB settings)
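The durations quoted in the summary can be checked against the timeline with a little arithmetic. The sketch below uses only timestamps listed above; which events mark the start and end of each window is an interpretation made here, not something the report states.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on 4 April 2015).
EVENTS = {
    "traffic_to_prod11": "20:53",    # all MW traffic switched to prod11
    "pages_stop_loading": "20:59",   # 504s begin
    "pages_load_unstyled": "21:14",  # pages return without styles or JS
    "pages_fully_load": "21:53",     # traffic back on prod8 and prod9
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Full outage: pages returning 504s.
outage = minutes_between(EVENTS["pages_stop_loading"], EVENTS["pages_load_unstyled"])
# Pages loading, but without styles or JS.
unstyled = minutes_between(EVENTS["pages_load_unstyled"], EVENTS["pages_fully_load"])
# Assumed degraded window: from the switch to prod11 until full recovery.
degraded = minutes_between(EVENTS["traffic_to_prod11"], EVENTS["pages_fully_load"])

print(outage, unstyled, degraded)  # prints: 15 39 60
```

The 15-minute figure matches the downtime in the summary. The roughly one hour of degraded service depends on where the window is taken to start; counting from the switch to prod11 gives 60 minutes.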
Lessons
- When performing tasks like this, make sure you have a foolproof plan, stick to every stage of it and double check each stage!
- It may be nice to have a CDN in front of static, so that if static.orain.org is down pages still get JS, CSS and files (we did once have this)
- We need to work out how login sessions can be handled without Redis (so we can touch Redis and people can still log in)
- NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar? Or a server dedicated for
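One way to think about the session lesson is a store that writes to a second backend and falls back to it whenever the primary is unreachable. The sketch below is purely illustrative and is not how MediaWiki or the Orain configuration handles sessions: every class name is hypothetical, and `UnavailableBackend` merely simulates a cache server that is down for maintenance.

```python
from typing import Optional


class InMemoryBackend:
    """Stand-in for a durable backend (for example a database table)."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class UnavailableBackend:
    """Simulates a cache server that is down: every call fails."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("backend unavailable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("backend unavailable")


class FallbackSessionStore:
    """Prefer the primary backend, but keep working when it is down."""

    def __init__(self, primary, fallback) -> None:
        self.primary = primary
        self.fallback = fallback

    def set(self, key: str, value: str) -> None:
        # Always write to the fallback so sessions survive a primary outage.
        self.fallback.set(key, value)
        try:
            self.primary.set(key, value)
        except ConnectionError:
            pass  # primary is down; the fallback copy is still valid

    def get(self, key: str) -> Optional[str]:
        try:
            value = self.primary.get(key)
            if value is not None:
                return value
        except ConnectionError:
            pass  # primary is down; read from the fallback instead
        return self.fallback.get(key)


# Usage: the primary is down, yet the session can still be stored and read.
store = FallbackSessionStore(UnavailableBackend(), InMemoryBackend())
store.set("session-id", "alice")
print(store.get("session-id"))  # prints: alice
```

The trade-off is an extra write per session update in exchange for logins that keep working while the cache is being touched.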
Meta
- Staff on hand: Addshore, Southparkfan
- Report published by: Addshore
- Timestamp: Addshore 22:28, 4 April 2015 (BST), Southparkfan 22:34, 4 April 2015 (BST)