Tech:Incidents/2014-12-hhvm
Several days of constant loading issues with a mix of causes. They were initially traced to HHVM not fulfilling requests, presumably due to a lack of resources, but further investigation sought and found wider causes.
Lessons
- Matters should be treated as major issues until proven otherwise. John takes responsibility for this, as he dismissed the slow load times as 'just a temporary issue' for several days.
- The HHVM admin server proves essential for dealing with suspected HHVM issues.
- Ganglia is essential to debugging server side performance issues.
- Servers should not be added to solve a problem until all other methods have been tried, tested and failed. If you think a new server would only duplicate the problem rather than help, be persistent until you are proven wrong.
Action
- prod8 and prod9 were reinstalled by John.
- GDNSD was reconfigured to use a realistic load balancing system.
- HHVM was migrated to use the ini syntax as opposed to the deprecated hdf syntax.
- A load balancer was implemented on December 7th 2014 to share load more consistently.
- Nagios checks were added for HHVM health. Currently only the 'queued' metric is monitored, but more can be added when needed.
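As an illustration of the 'queued' health check described above, here is a minimal sketch of a Nagios-style plugin in Python. It is not the check that was deployed. It assumes the HHVM admin server returns a JSON object containing a "queued" field from a health endpoint; the URL passed to check(), the thresholds and the function names are all hypothetical. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_queued(payload, warn=10, crit=50):
    """Map a health payload to a (status code, message) pair.

    Assumption: the payload is a JSON object with a numeric "queued"
    field. The warn/crit thresholds are illustrative only.
    """
    try:
        queued = int(json.loads(payload)["queued"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN - could not read 'queued' from health output"
    if queued >= crit:
        return CRITICAL, f"CRITICAL - {queued} requests queued"
    if queued >= warn:
        return WARNING, f"WARNING - {queued} requests queued"
    return OK, f"OK - {queued} requests queued"


def check(url, timeout=5):
    """Fetch the health output from the given URL and evaluate it.

    The URL is supplied by the caller; the admin server address and
    endpoint path depend on the local HHVM configuration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = response.read().decode("utf-8")
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - admin server unreachable: {exc}"
    return evaluate_queued(payload)
```

A wrapper script would call check() with the admin server's health URL, print the returned message and exit with the returned code so that Nagios can pick up the state. Further metrics could be handled by adding similar evaluation functions.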
Meta
- Operations on hand: John
- Report published by: John
- Dated: 20:49, 5 December 2014 (GMT)