Tech:Incidents/2015-04-04-prod7-resize: Difference between revisions

Latest revision as of 19:00, 14 June 2015

An attempt to upgrade prod7 from the 512MB plan to the 1024MB plan caused 15 minutes of downtime and more than 1 hour of FOUC, cookie and login issues.

Details

Why the resize needed to happen:

Static (NFS), which is hosted on prod7, had very little space left (~1G).
Redis kept being killed by the Linux Out-Of-Memory Killer.

What we were expecting to happen during resizing:

Page load times would increase, because expensive operations that are normally cached would be executed every time
Login/session issues
Loading of files, static content, JS and CSS would be impossible

What happened:

Site was inaccessible for 15 mins while a snapshot was taken.

Timeline

Some testing happened prior to this, during which pages failed to load in 2-3 short bursts of around 30-60 seconds each
20:14 - Southparkfan - disable redis for MW
20:32 - Southparkfan - reverted the above
20:49 - Addshore - umount /mnt/mediawiki/private/uploads on prod11
20:53 - Addshore - Switch all mw traffic to hit prod11
20:54 - Addshore - Disable redis on prod7
20:55 - Addshore - Shutdown prod7
20:58 - Addshore - Start snapshot of prod7 before resizing should happen
20:59 - MW pages stop being loaded, 504s!
21:14 - Addshore - Noticed the 20:32 revert and manually commented out redis stuff on prod11
21:14 - MW pages start loading again with no styles or JS
21:15 - Snapshot of prod7 done - rebooted automatically
21:18 - Addshore - Shutdown prod7 again for resize
21:21 - Addshore - Resizing prod7 in DO
21:36 - Resize done! - Booting up
21:42 - Addshore - Rebooting mw servers prod8 and prod9
21:53 - Addshore - Switch all MW traffic to prod8 & prod9. All MW pages fully load again
21:53 - Addshore - Running ansible on prod11
21:58 - Addshore - Running ansible on prod10 (restoring original LB settings)
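
The durations quoted in this report can be checked against the timeline entries. The following is a minimal sketch in Python; the helper name minutes_between is made up for illustration, and it assumes all timestamps fall on the same day, as they do above.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# 20:59 (504s start) until 21:14 (pages load again)
outage = minutes_between("20:59", "21:14")

# 21:14 (pages load without styles or JS) until 21:53 (pages fully load)
unstyled = minutes_between("21:14", "21:53")

# 21:21 (resize started) until 21:36 (resize done)
resize = minutes_between("21:21", "21:36")

print(outage, unstyled, resize)  # prints: 15 39 15
```

These figures only cover the intervals named in the comments; login and cookie issues may have started earlier, when redis was first touched.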

Lessons

When performing tasks like this, make sure you have a foolproof plan and stick to every stage of it. Double check each stage!
It may be nice to have a CDN in front of static, so that if static.orain.org is down, pages still get JS, CSS and files (we did once have this)
We need to work out how login sessions can be handled without Redis (so we can touch Redis and people can still log in)
NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar? Or a server dedicated for
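
One possible shape for the "sessions without Redis" lesson is a session store that falls back to a second backend when the primary one is unreachable. The sketch below is a language-neutral illustration in Python, not MediaWiki configuration: every class name here (UnavailableStore, DictStore, FallbackSessionStore) is hypothetical, and the primary store is only assumed to raise ConnectionError when it is down.

```python
class UnavailableStore:
    """Hypothetical stand-in for a Redis-like store that is currently down."""

    def get(self, key):
        raise ConnectionError("primary session store unreachable")

    def set(self, key, value):
        raise ConnectionError("primary session store unreachable")


class DictStore:
    """Hypothetical fallback backend; a real one would be a database or similar."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FallbackSessionStore:
    """Write to both stores; read from the primary, else from the fallback."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def set(self, key, value):
        # Always write the fallback first so the session survives a primary outage.
        self.fallback.set(key, value)
        try:
            self.primary.set(key, value)
        except ConnectionError:
            pass  # Primary is down; the fallback copy is still there.

    def get(self, key):
        try:
            value = self.primary.get(key)
        except ConnectionError:
            return self.fallback.get(key)
        # Primary is up but may have been restarted empty; check the fallback too.
        return value if value is not None else self.fallback.get(key)


store = FallbackSessionStore(UnavailableStore(), DictStore())
store.set("sess:1", {"user": "Example"})
session = store.get("sess:1")
```

An in-process dict would not be shared between several MW servers, so the fallback would need to be something all of them can reach. The point is only that logins keep working while the primary store is being touched.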

@@ Line 1: / Line 1: @@
+An attempt to upgrade [[Tech:prod7|prod7]] from the 512MB to the 1024MB plan, caused 15 minutes of downtime, and more than 1 hour of [[wikipedia:FOUC|FOUC]] issues + cookie issues + login issues etc.
-'''15 mins of downtime''' were caused when '''resizing prod7''' from a 512MB DO instance to a 1GB DO instance (memory) and 20Gb to 30GB hdd.
-== Details==
+== Details ==
 '''Why the resize needed to happen:'''
-* static (nfs) which is hosted on prod7 had a very limited amount of space left.
+* Static (NFS) which is hosted on prod7 had a very limited amount of space left (~1G).
-* redis kept being killed OOM
+* Redis kept being killed by the Linux Out-Of-Memory Killer
 '''What we were expecting to happen during resizing:'''
+* Page loading times increased due to expensive operations normally being cached now being executed all the time
-* page load speed be increased
-* login / session ability be decreased
+* Login/session issues
-* Loading of files / static content js / css be impossible
+* Loading of files/static content/js/css be impossible
 '''What happened:'''
@@ Line 18: / Line 18: @@
 * Some testing happened prior to this seeing short bursts of no loading pages lasting around 30 - 60 seconds x2-3
-* 20:14 - SF - [https://github.com/Orain/ansible-playbook/commit/890d63efbc718cf7859ba9503cc6ac7783685ee5 disable redis for MW]
+* 20:14 - Southparkfan - [https://github.com/Orain/ansible-playbook/commit/890d63efbc718cf7859ba9503cc6ac7783685ee5 disable redis for MW]
-* 20:32 - SF - [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 reverted the above]
+* 20:32 - Southparkfan - [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 reverted the above]
-* 20:49 - AS - umount /mnt/mediawiki/private/uploads on prod11
+* 20:49 - Addshore - umount /mnt/mediawiki/private/uploads on prod11
-* 20:53 - AS - Switch all mw traffic to hit prod11
+* 20:53 - Addshore - Switch all mw traffic to hit prod11
-* 20:54 - AS - Disable redis d on prod7
+* 20:54 - Addshore - Disable redis on prod7
-* 20:55 - AS - Shutdown prod7
+* 20:55 - Addshore - Shutdown prod7
-* 20:58 - AS - Start snapshot of prod7 before resizing should happen
+* 20:58 - Addshore - Start snapshot of prod7 before resizing should happen
 * 20:59 - '''MW pages stop being loaded, 504s!'''
-* 21:14 - AS - Noticed [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 this revert] and manually commented out redis stuff on prod11
+* 21:14 - Addshore - Noticed [https://github.com/Orain/ansible-playbook/commit/54581a97ffdad0200d366d61b216498c31682163 this revert] and manually commented out redis stuff on prod11
 * 21:14 - '''MW Pages start loading again with no styles of JS'''
 * 21:15 - Snapshot of prod7 done - rebooted automatically
-* 21:18 - AS - Shutdown prod7 again for resize
+* 21:18 - Addshore - Shutdown prod7 again for resize
-* 21:21 - AS - Resizing prod7 in DO
+* 21:21 - Addshore - Resizing prod7 in DO
 * 21:36 - Resize done! - Booting up
-* 21:42 - AS - Rebooting mw servers prod8 and prod9
+* 21:42 - Addshore - Rebooting mw servers prod8 and prod9
-* 21:53 - AS - Switch all MW traffic to prod8 & prod9 '''All MW pages fully load again'''
+* 21:53 - Addshore - Switch all MW traffic to prod8 & prod9 '''All MW pages fully load again'''
-* 21:53 - AS - Running ansible on prod11
+* 21:53 - Addshore - Running ansible on prod11
-* 21:58 - AS - Running ansible on prod10 (restoring original LB settings)
+* 21:58 - Addshore - Running ansible on prod10 (restoring original LB settings)
-== lessons. ==
+== Lessons ==
-* When performing tasks like this make sure you have a fool proof plan and stick to every stage of it, double check each stage!!
+* When performing tasks like this make sure you have a fool proof plan and stick to every stage of it, double check each stage!
-* It may be nice to have a CDN infront of static meaning if static.orain.org is down pages still get JS CSS and files (We did once have this)
+* It may be nice to have a CDN in front of static meaning if static.orain.org is down pages still get JS CSS and files (We did once have this)
-* We need to work out how login sessions can be handled without redis (so we can touch redis and people can still log in)
+* We need to work out how login sessions can be handled without Redis (so we can touch Redis and people can still log in)
-* NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar?
+* NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar? Or a server dedicated for
+== Meta ==
+* Staff on hand: Addshore, Southparkfan
+* Report published by: Addshore
+* Timestamp: '''[[User:Addshore|<span style="color:black">·addshore·</span>]]''' <sup>[[User_talk:Addshore|<span style="color:black;">talk to me!</span>]]</sup> 22:28, 4 April 2015 (BST), [[User:Southparkfan|Southparkfan]] ([[User talk:Southparkfan|talk]]) 22:34, 4 April 2015 (BST)
