Tech:Incidents/2014-12-hhvm

Several days of persistent slow page loads, caused by a combination of issues. The problem was initially traced to HHVM not fulfilling requests, presumably due to a lack of resources, but further investigation sought and found wider causes.

Lessons

Issues should be investigated as major until proven otherwise. John takes fault for this, as he dismissed the slow load times as 'just a temporary issue' for several days.
The HHVM admin server proves essential for dealing with suspected HHVM issues.
Ganglia is essential for debugging server-side performance issues.
Servers should not be added to solve a problem until all other methods have been tried, tested and failed. If you think a new server will duplicate the problem rather than help, be persistent until you are proven wrong.

Action

prod8 and prod9 were reinstalled by John.
GDNSD was reconfigured to use a realistic load balancing system.
HHVM was migrated to use the ini syntax as opposed to the deprecated hdf syntax.
A load balancer was implemented on December 7th 2014 to distribute load more consistently.
Nagios checks were added for HHVM health. Currently only the 'queued' metric is monitored, but more can be added when needed.
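
As an illustration of the 'queued' check, the following is a minimal sketch of a Nagios-style plugin, not the check that was actually deployed. It assumes the HHVM admin server exposes a health endpoint returning JSON with a "queued" field. The URL, port and thresholds are hypothetical placeholders, and the function names are invented for this example.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical admin server address; adjust to the real host and port.
HEALTH_URL = "http://localhost:9001/check-health"


def fetch_health(url=HEALTH_URL, timeout=5):
    """Fetch the raw health payload as text (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def evaluate_queued(payload, warn=10, crit=50):
    """Map a JSON health payload to a Nagios (exit_code, message) pair.

    The warn and crit thresholds are illustrative, not tuned values.
    """
    try:
        queued = int(json.loads(payload)["queued"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN - could not read 'queued' from health payload"
    if queued >= crit:
        return CRITICAL, f"CRITICAL - {queued} requests queued"
    if queued >= warn:
        return WARNING, f"WARNING - {queued} requests queued"
    return OK, f"OK - {queued} requests queued"
```

A wrapper script would call `evaluate_queued(fetch_health())`, print the message and exit with the returned code. Keeping the evaluation separate from the fetch makes it straightforward to add further metrics later.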
