Tech:Incidents/2014-12-hhvm

From Orain Meta

The site had several days of constant slow page loads caused by a mix of issues. These were initially traced to HHVM not fulfilling requests, presumably due to a lack of resources, but further investigation sought and found wider causes.

Lessons

  • Problems should be investigated as major issues until proven otherwise. John takes responsibility for this, as he dismissed the slow load times as 'just a temporary issue' for several days.
  • The HHVM admin server proves essential for dealing with suspected HHVM issues.
  • Ganglia is essential to debugging server side performance issues.
  • Servers should not be added to solve a problem until all other methods have been tried, tested and failed. If you think a new server will only duplicate the problem rather than help, keep saying so until you are proven wrong.

Action

  • prod8 and prod9 were reinstalled by John.
  • GDNSD was reconfigured to use a realistic load balancing system.
  • HHVM was migrated to use the ini syntax as opposed to the deprecated hdf syntax.
  • A load balancer was implemented on 7 December 2014 to share load more consistently between servers.
  • Nagios checks were added for HHVM health. Currently only the 'queued' metric is monitored, but more can be added when needed.
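
As an illustration of the 'queued' check described above, here is a minimal sketch of a Nagios-style plugin in Python. It is not the check that was deployed. It assumes the HHVM admin server answers a health request with a JSON object containing a "queued" field; the URL, port and warning/critical thresholds are hypothetical and would need to match the local setup.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_queued(payload, warn=5, crit=20):
    """Map the 'queued' value in a health payload to a Nagios status.

    The thresholds are hypothetical defaults, not production values.
    """
    try:
        queued = int(json.loads(payload)["queued"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "HHVM UNKNOWN - no usable 'queued' value in response"
    if queued >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif queued >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # Text after '|' is Nagios performance data.
    return status, f"HHVM {label} - {queued} queued requests | queued={queued}"


def main(url="http://localhost:9001/check-health"):
    # Assumed admin server address; adjust to the local HHVM configuration.
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            payload = response.read().decode("utf-8")
    except OSError as error:
        print(f"HHVM UNKNOWN - cannot reach admin server: {error}")
        return UNKNOWN
    status, message = evaluate_queued(payload)
    print(message)
    return status
```

A plugin script built on this would finish with `raise SystemExit(main())` so that Nagios sees the exit code. Keeping the threshold logic in `evaluate_queued` separate from the HTTP request means further metrics can be added and tested without a running HHVM instance.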

Meta

  • Operations on hand: John
  • Report published by: John
  • Dated: 20:49, 5 December 2014 (GMT)