Tech:Server admin log

== July 19 ==
* '''Addshore''' Fixed problems with Disk space --[[User:Reception123|Reception123]] ([[User talk:Reception123|talk]]) 06:36, 20 July 2015 (BST)
 
 
== July 4 ==
* I fixed everything (see git history) .... '''[[User:Addshore|<span style="color:black">·addshore·</span>]]''' <sup>[[User_talk:Addshore|<span style="color:black;">talk to me!</span>]]</sup> 09:28, 4 July 2015 (BST)
 
== June 30 ==
* ~09:10 Southparkfan: pooled prod9 back in prod with below changes applied. Ansible on both servers disabled. DO NOT run ansible on those servers unless you are 100% sure it won't cause issues.
* 08:54 Southparkfan: disable CSS, OnlineStatus and EmbedVideo on All The Tropes Wiki. Meta (why Meta too?) and All The Tropes are now back online and running without throwing MWExceptions.
 
== June 29 ==
* 14:48 Southparkfan: shutdown & destroy prod11
* 12:21 NDKilla: Not experiencing issues on any wikis that reported issues. extloadtest still shows frequent errors
* 11:20 NDKilla: Rebuild LC on extloadwiki per SPF
* 10:48 NDKilla: Ran all jobs on metawiki and allthetropeswiki
* 10:38 NDKilla: Investigating DB issues (and hoping I didn't cause them)
* 02:02 GethN7 notifies #orain of a lot of DB issues on allthetropeswiki
 
== June 28 ==
* Late afternoon: Manually ran "sudo /root/ans-all --skip-tags=slow" on prod 8, 9, and 11
 
== June 16 ==
* 17:16 Southparkfan: "sudo usermod -u 2020 www-scripts" on prod9 and prod11
 
== June 13 ==
* 11:46 Southparkfan: destroyed prod8 for testing
 
== May 14 ==
* 20:49 Southparkfan: DROP DATABASE spamwiki; on prod12 - massive disk space free up :D
* 20:49 Southparkfan: ran php5 /srv/mediawiki/w/maintenance/Orain/removeDeletedWikis.php --wiki loginwiki on prod9
 
== April 28 ==
* 13:09 Southparkfan: pooled prod11 back
* 13:02 Southparkfan: reboot prod11
* 12:51 Southparkfan: depooled prod11 from haproxy
 
== April 4 ==
* ..... Stuff happened, [[Tech:Incidents/2015-04-04-prod7-resize]]
* 20:10 Addshore: Restart prod7
* 19:43 Addshore: prod9 back up and resized
* 19:41 Addshore: resize prod9 to 512mb instance and restart
* 19:36 Southparkfan: removed prod9 from haproxy config (planned for downgrade/re-install as needed)
* 14:51 Addshore: Login issues, Redis down, Restarted (We should really have a watchdog or something check and restart this)
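The 14:51 entry suggests a watchdog that checks Redis and restarts it. A minimal sketch of that idea follows, assuming <code>redis-cli</code> is on the PATH and the service is named <code>redis-server</code> (the name used in the April 3 entry). The function name <code>redis_watchdog</code> and the two override variables are hypothetical, and nothing in this log confirms that such a script was ever deployed.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: ping Redis and restart it when it does not answer.
# PING_CMD and RESTART_CMD can be overridden so the logic can be tried without Redis.
PING_CMD=${PING_CMD:-"redis-cli ping"}
RESTART_CMD=${RESTART_CMD:-"service redis-server restart"}

redis_watchdog() {
    # A healthy server answers PING with PONG; any failure leaves reply empty.
    reply=$($PING_CMD 2>/dev/null) || reply=""
    if [ "$reply" = "PONG" ]; then
        echo "redis ok"
    else
        echo "redis not answering; running restart command"
        $RESTART_CMD
    fi
}
```

The function is only defined here, not called. One option would be to call it from a cron job every minute or so; since a restart logs everyone out (see April 3), restarting only when the ping fails keeps that side effect to real outages.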
 
== April 3 ==
* 21:24 Addshore: "pear install net_smtp" on prod9
* 20:35 Addshore: added prod9 back to LB, Cheers SPF!
* 20:24 Addshore: added prod9 back to LB -> it broke stuff -> promptly removed
* 20:20 Addshore: restarted redis-server on prod7 (yes everyone got logged out...)
* 20:16 Addshore: removed prod8 from LB for reboot then added back
* 20:10 Addshore: removed prod11 from LB for reboot then added back
* 19:00 Addshore: removed prod9 from LB and rebuilt (SPFCloud to add everything to prod9 and add back to LB)
 
== April 1 ==
* 13:15 Addshore: got reports users were unable to login. Redis was no longer running on prod7, restarted.
 
== March 26 ==
* 19:53 Southparkfan: noticed great things on prod9 :D
* 19:34 addshore: resize complete, powering back prod9
* 19:27 addshore: shutdown prod9 for resize
 
== March 17 ==
* 15:00 Southparkfan: ran update.php on memewiki
 
== March 16 ==
* 13:09 Southparkfan: HHVM died on prod8 for an unknown reason, causing downtime on the farm - restarted it
 
== March 14 ==
* 15:42 Southparkfan: ran update.php on lovelifesiftwwiki
 
== March 10 ==
* 17:23 Southparkfan: restart HHVM on all servers for HHVM admin password reset
 
== March 6 ==
* 23:47 Southparkfan: enable ansible on prod7
* 20:59 Southparkfan: disable ansible on prod7
* 20:48 Southparkfan: restart ssh on prod7
 
== March 5 ==
* 16:24 Southparkfan: kill'd & restarted HHVM on prod9 and prod11 too. Let's see if performance is improved now.
* 16:21 Southparkfan: disable ansible cron on prod9 and prod11
* 16:06 Southparkfan: start HHVM on prod8
* 16:06 Southparkfan: kill HHVM on prod8
* 15:48 Southparkfan: disable ansible cron on prod8 for HHVM testing
 
== March 4 ==
* 17:53 Southparkfan: (prod7) sudo cp -R /tmp/lovelivesiftw.twgg.org/ /var/mediawiki/uploads/ - all should be fixed now
* 17:52 Southparkfan: (prod7) sudo rm -rf lovelivesiftw.twgg.org lovelivesiftw.orain.org/
* 17:43 Southparkfan: possibly messed up the below commands, so restored directories from backup and tried again
* 17:32 Southparkfan: (prod7) sudo rm -rf http:/lovelivesiftw.twgg.org <- "lovelivesiftw.twgg.org" was a directory inside another directory, "http:"
* 17:30 Southparkfan: (prod7) sudo cp -R lovelivesiftw.twgg.org/ /var/mediawiki/uploads/lovelivesiftw.twgg.org
 
== March 2 ==
* 14:04 Southparkfan: after a bunch of Piwik issues complaining about "mysqli extension (could) not (be) loaded/found" and a thousand restarts, fixed php5-fpm issues
* 13:42 Southparkfan: stop php5-fpm on prod6 (kill -9'd all processes)
 
== February 28 ==
* 13:05 Southparkfan: reload nagios on prod6
 
== February 27 ==
* 23:14 Southparkfan: deleted all jobs from allthetropeswiki's job table
* 23:05 Southparkfan: "delete from job where job_cmd = 'cirrusSearchLinksUpdate';" on prod5
* 22:40 Southparkfan: enable cron again
* 22:29 Southparkfan: disable ansible cron on prod6
* 14:23 Southparkfan: restarted redis
 
== February 25 ==
* 19:43 Southparkfan: ran some apt-get commands on prod7 to get some more disk space
* 19:28 Southparkfan: for now, I did it for trialsintaintedspacewiki, spiralwiki and metawiki too on prod9. The servers should be able to survive a few days/weeks now with the broken logrotate.
* 19:21 Southparkfan: cleaned some diskspace (5.5GB) by compressing files of some wikis on prod11 too. Wikis: (incomplete list) trialsintaintedspacewiki, rightwiki, corruptionsofchampionswiki, loginwiki, metawiki, allthetropeswiki
* 19:06 Southparkfan: below has been done for loginwiki and allthetropeswiki too on prod9. Looks all good, moving on to prod11 for now
* 19:00 Southparkfan: compressed corruptionsofchampions.orain.org.log manually (to corruptionsofchampionswiki.gz), and then deleted the .log file
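The manual compress-then-delete step from the 19:00 entry can be sketched as below. This is an illustration only: <code>compress_log</code> is a hypothetical helper, the file names in the demonstration are made up, and the exact commands used on prod9 and prod11 are not recorded in this log.

```shell
#!/bin/sh
# Hypothetical helper: gzip a log file to a chosen name, then remove the original.
compress_log() {
    src=$1
    dest=$2
    # Only delete the original if gzip wrote the archive successfully.
    gzip -c "$src" > "$dest" && rm -f "$src"
}

# Demonstration on a throwaway file (made-up names, not the real logs).
demo_dir=$(mktemp -d)
printf 'line one\nline two\n' > "$demo_dir/example.orain.org.log"
compress_log "$demo_dir/example.orain.org.log" "$demo_dir/examplewiki.gz"
ls "$demo_dir"
```

Chaining with <code>&&</code> means a failed compression (for example, on a full disk) leaves the original log in place.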
 
== February 24 ==
* 20:30 Southparkfan: enable ansible cron again on prod6. php5-fpm will now replace mod_php forever, and this will make Piwik twice as fast!
* 20:07 Southparkfan: restarted php5-fpm and apache2 on prod6. Apache will now serve stuff via php5-fpm!
* 20:04 Southparkfan: disable ansible on prod6
* 11:38 Southparkfan: enable ansible again for now. Will test apache stuff at another moment.
* 11:34 Southparkfan: disable ansible temporarily on prod6 (apache testing)
* 11:29 Southparkfan: install python-mwclient.deb (and python-support) on prod6 for orainLog
 
== February 23 ==
* 23:38 Southparkfan: completed a full security upgrade of all packages on all servers (duration: more than one hour)
* 16:43 Southparkfan: killed all LC rebuild processes on prod9 (but at least prod9 is up again!)
* 16:40 Dusti: reboot prod9
* 12:59 Southparkfan: stop apache2 on prod12
 
== February 22 ==
* 12:48 Southparkfan: below on prod11 too
* 12:46 Southparkfan: cd /var/log/mediawiki/ && sudo rm -f spam.orain.org* on prod9
 
== February 21 ==
* 16:18 Southparkfan: (prod10) sudo service cron restart
* 16:17 Southparkfan: change root user password on prod10 again
* 15:33 Southparkfan: change password of root user on prod10
* 15:12 Southparkfan: sudo service cron restart on prod10
* 15:12 Southparkfan: sudo service cron start on prod10
 
== February 19 ==
* 14:12 Southparkfan: (prod8, prod9, prod11) cd /var/log/mediawiki/ && sudo rm -f spam.orain.org*
 
== February 16 ==
* 06:39 Southparkfan: restart HHVM on prod11
 
== February 15 ==
* 16:45 Southparkfan: ran below again
* 16:38 Southparkfan: (prod9, prod11 - /var/log/mediawiki/) sudo rm -f *1.gz - logrotate is having trouble with these files ending with "1.gz" (for an unknown reason these files are not compressed log files, just empty files)
* 09:28 Southparkfan: re-installed php5-gd package on prod6
* 09:06 Southparkfan: restart HHVM on prod9 (it died for a still unknown reason)
 
== February 14 ==
* 19:09 Southparkfan: (prod8, prod9, prod11) cd /var/log/mediawiki/ && sudo rm -f spam.orain.org*
* 15:09 Southparkfan: remove unnecessary packages on prod9, cleans up another 300MB.
* 14:50 Southparkfan: forced a logrotate run on prod9 again (all logs)
* 14:46 Southparkfan: forced a logrotate run on prod9 (mediawiki logs only)
 
== February 13 ==
* 14:34 Southparkfan: forced a logrotate run on prod11 too per the same reason as below. It seems it partially failed too, but still freed up ~100MB disk space.
* 14:14 Southparkfan: forced a logrotate run on prod9 due to a critical amount of disk space left (<150 MB). It seems it partially failed, but it at least freed up something like 400MB disk space or so. Finding out now how to make the run succeed and compress even more log files.
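The "<150 MB" condition that triggered the forced logrotate runs could be checked with something like the sketch below. The helper names <code>free_mb</code> and <code>below_threshold</code> are hypothetical, and checking <code>/</code> is an assumption; the log does not say which filesystem was full or how the shortage was noticed.

```shell
#!/bin/sh
# Print free space on the filesystem holding $1, in whole megabytes.
free_mb() {
    # df -P gives one line per filesystem; column 4 is available space in KB with -k.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed (exit 0) when the first argument is below the threshold in the second.
below_threshold() {
    [ "$1" -lt "$2" ]
}

THRESHOLD_MB=150   # taken from the "<150 MB" figure in the log entry
free=$(free_mb /)
if below_threshold "$free" "$THRESHOLD_MB"; then
    echo "low disk space: ${free} MB free"
else
    echo "disk space ok: ${free} MB free"
fi
```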
 
== February 11 ==
* 14:22 Addshore: fixed ansible run on prod7 due to conflict of user ids with 'git' user id 2003: Ran the following:
<pre>
usermod -u 2103 git
groupmod -g 2103 git
find / -user 2003 -exec chown -h 2103 {} \;
find / -group 2005 -exec chgrp -h 2103 {} \;
usermod -g 2103 git
</pre>
 
* 14:15 Addshore: remove and re clone private repo on prod12 (fixes ansible run)
 
== February 9 ==
* 12:30 Addshore: reloading haproxy on prod10
 
== February 8 ==
* 13:57 Southparkfan: ran changePassword.php on techwiki for OrainLog
 
== February 7 ==
* 15:15 Southparkfan: deleted "zacharydubois" on prod6 per request
* Dusti upgraded GitHub to the Silver plan, which includes private repos. SPF working on moving prod7 to a private git repo.
 
== February 6 ==
* 17:01 Southparkfan: upgraded packages with security fixes
 
== February 5 ==
Addshore: on prod6! @ 5:00 GMT / 00:30hrs EST
 
Killed all processes for users noreply and jasper and altered users Ids to fix ansible run. Ran the following:
 
<pre>
usermod -u 2101 noreply
groupmod -g 2101 noreply
find / -user 2006 -exec chown -h 2101 {} \;
find / -group 2008 -exec chgrp -h 2101 {} \;
usermod -g 2101 noreply
</pre>
<pre>
usermod -u 2102 jasper
groupmod -g 2102 jasper
find / -user 2007 -exec chown -h 2102 {} \;
find / -group 2009 -exec chgrp -h 2102 {} \;
usermod -g 2102 jasper
</pre>
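
The two blocks above (and the <code>git</code> fix on February 11) repeat the same five commands with different names and IDs. A dry-run sketch that only prints the sequence for one account is shown below; <code>print_remap_cmds</code> is a hypothetical helper and not part of the Orain tooling. Printing rather than executing lets the commands be reviewed first, since the two <code>find /</code> passes touch the whole filesystem.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the ID-remap commands for one account.
# Arguments: account name, old UID, old GID, new ID (used for both UID and GID).
print_remap_cmds() {
    name=$1
    old_uid=$2
    old_gid=$3
    new_id=$4
    printf 'usermod -u %s %s\n' "$new_id" "$name"
    printf 'groupmod -g %s %s\n' "$new_id" "$name"
    printf 'find / -user %s -exec chown -h %s {} \\;\n' "$old_uid" "$new_id"
    printf 'find / -group %s -exec chgrp -h %s {} \\;\n' "$old_gid" "$new_id"
    printf 'usermod -g %s %s\n' "$new_id" "$name"
}

# Values taken from the log entries above.
print_remap_cmds noreply 2006 2008 2101
print_remap_cmds jasper 2007 2009 2102
```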
 
== January 25 ==
* 17:14 Southparkfan: upgraded packages with security fixes (again)
 
== January 23 ==
* 22:30 Tanner: Migrated DNS to CloudFlare for stability.
* 18:49 Southparkfan: installed security updates across the servers
 
== January 22 ==
* 23:39 Addshore - manually add technoratimedia_sv_115e9.txt file to the root mediawiki directory for Dusti, No point in this being in ansible, it can be removed / vanish whenever...
 
== January 21 ==
* 15:00 Addshore - Killed udp.py script on prod6 that was pointed at JDnet
** Manually copied script to /home/addshore/udp.py for testing (Not in ansible....) - seems to work fine and will run in screen
 
== January 20 ==
* Sometime - Addshore: Manually patched rebuildtextindex in a secret place and ran across ALL wikis in a Screen on some prod. Run successful and all indexes rebuilt.
* 14:10 Southparkfan: ran rebuildtextindex.php on metawiki again (with php instead of php5)
* 13:45 Southparkfan: ran rebuildtextindex.php on metawiki
 
== January 17 ==
* 17:47 Southparkfan: changed Southparkfan2's password with changePassword.php (my account of which I forgot the password, and no email was set on the account).
* 00:25 JohnLewis: prod9 has been running at 100% CPU since December 8th. Missing from ganglia. Hard reboot and investigating.
 
== January 9 ==
* 18:42 JohnLewis: update.php on donjonwiki for BF tables
* 18:41 Southparkfan: ran update.php again on donjonwiki to fix dberrors (run conflict with John but k)
* 16:16 Southparkfan: ran update.php on donjonwiki
 
== January 8 ==
* 16:30 Southparkfan: ran importImages.php again on donjonwiki (a few files had bad filenames, and now still a few have....)
* 15:52 Southparkfan: ran importImages.php on donjonwiki
 
== January 4 ==
* 18:24 Southparkfan: ran importDump.php on donjonwiki
 
== December 30 ==
* 15:17 Southparkfan: ran importImages.php on robloxclanswiki
 
== December 29 ==
* 21:13 JohnLewis: delete councilwiki
 
== December 22 ==
* 22:21 JohnLewis: destroy prod3
* 22:19 JohnLewis: push prod3 decom changes and pool prod12 in its place
* 17:32 JohnLewis: deleted 5 wikis from prod3 and cleared respective tables in CA and loginwiki.
* 09:05 JohnLewis: boot prod3 after uninitiated power down. Investigating.
 
== December 19 ==
* 22:20 JohnLewis: passwords have been migrated
* 21:05 JohnLewis: begin password type migration (pbkdf2-legacyB)
* 20:58 JohnLewis: prod8 and prod11 are now running MW1.24. prod9 is still depooled pending finalising the update. Passwords need to be wrapped (will do shortly)
 
== December 5 ==
* 21:10 JohnLewis: MariaDB [(none)]> drop database dalieuwiki;
 
== November 20 ==
* 15:30 Arcane: ran a database/ansible update.
 
== November 14 ==
* 15:43 JohnLewis: MariaDB [(none)]> drop database Powersystemswiki;
 
== October 29 ==
* 20:13 JohnLewis: update hhvm
 
== October 27 ==
* 16:52 JohnLewis: drop database esourcewnywiki; (per technical reasons)
 
== October 15 ==
* 20:02 JohnLewis: shutdown prod4 (planned for reinstall tomorrow)
* 19:50 JohnLewis: remove prod4 from lb and purge DNS on ns1 and ns2
* 17:27 JohnLewis: powercycle prod4 (was not responding to anything)
 
== October 11 ==
* 19:47 JohnLewis: switch ns2.orain.org to prod7 (dns cache)
* 18:46 JohnLewis: deleted wikis in the list [[m:Special:Diff/10079|here]] from prod3.
* 18:43 JohnLewis: php5 /srv/mediawiki/w/maintenance/Orain/removeDeletedWikis.php --wiki loginwiki
* 18:00 JohnLewis: security updates for MediaWiki
* 16:40 JohnLewis: DNS change confirmed to have propagated to myself
* ~14:00 JohnLewis: change orain.org's DNS to ns1.orain.org/ns2.orain.org from pam.ns.cloudflare.com/woz.ns.cloudflare.com
 
== October 5 ==
* 18:10 JohnLewis: php5 /srv/mediawiki/w/maintenance/importDump.php --wiki classwiki /home/johnflewis/backup.xml
 
== October 3 ==
* 18:22 JohnLewis: applied security fixes; email going to sysadmin shortly regarding this.
 
== September 27 ==
* 17:48 JohnLewis: prod6 seems good. Moving onto prod7
 
== September 19 ==
* 19:46 JohnLewis: changed prod8's kernel to 3.13.0-32-generic (from 3.13.0-35-generic)
* 17:36 JohnLewis: prod8 is now our first Ubuntu machine
* 17:32 JohnLewis: begin upgrading prod8 to Ubuntu 14.04 (Trusty)