There was an API outage on November 28th. Switch failures led to cascading problems that took approximately 50 minutes to stabilize.
At 04:19 UTC on the 28th November, a switch began failing. Traffic was routed to an alternate data center. Unfortunately a software bug meant that those servers were attempting to write to a master server in the first data center, and so those servers stalled. As a result from 04:20 UTC to 04:38 UTC the API was not responding.
Between 04:38 UTC and 05:01 UTC we deployed a series of fixes. Availability was between 95% and 50% during that time. The switch came back online at 05:11 UTC and availability returned to 100%.
At 07:20 UTC another switch failed, which caused a very brief outage and availability dropped to 60% during that time. This time however, traffic switched over to an alternate data center and by 07:32 UTC, the API returned to 100% availability.
WordPress users will have seen messages indicating that spam comments were temporarily held in the moderation queue during the outage. The Akismet plugin will re-try those now that the API is back up.
We’ve fixed several software problems already as a result of the failure. We’ve also identified some systems and software improvements that will prevent the same condition from happening in future, and we’re working to get those in place as soon as possible.