CxAlloy TQ Downtime Postmortem

On Wednesday, March 23, CxAlloy TQ experienced a period of extended downtime as the result of a failed database migration. This postmortem explains what went wrong, the steps we took to recover, and the changes we are making to reduce the chances of such an event happening in the future.

Most importantly, we would like to deeply apologize for the disruption the outage caused. We know that the availability of our software is directly tied to your ability to get your work done, and that is a responsibility we take very seriously. Postmortems such as this one are part of our commitment to provide a high level of service: we examine the instances where we fall short and communicate how we are improving our systems and processes to be more resilient and dependable.

Incident Timeline

11:06 PM Eastern – Site update performed and change to database structure begins

On the evening of March 22, starting at 11:06 PM Eastern, we released several updates to the application. These updates included two database “migrations”, or changes to the structure of the database.

The second migration added several new columns to the table that stores checklist lines. Because that table is so large (approximately 125 million rows) we knew from past experience that adding these new columns would likely take several hours, but adding columns does not block the application from operating. No downtime was expected as a result of this change and we have performed similar actions many times in the past.
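
For illustration, an in-place change of this kind amounts to a single ALTER TABLE statement run against the live table; the sketch below uses placeholder column names rather than the actual columns from this release.

-- Placeholder columns for illustration only; the real migration added
-- different columns to the checklist lines table.
ALTER TABLE checklistsectionline
    ADD COLUMN example_flag TINYINT NOT NULL DEFAULT 0,
    ADD COLUMN example_notes VARCHAR(255) NULL;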

4:45 AM Eastern – Database error occurs

At approximately 4:45 AM Eastern time the migration failed which caused the database to automatically shut down and initiate a restart. Our initial analysis indicates that the migration failed due to excessive contention for server resources that built up over the course of the migration.

Our team discovered the issue just before 5 AM Eastern and began communicating it to other team members and working toward a resolution.

Shortly thereafter we closed the site to users and posted our prebuilt maintenance page that directs users to our status page for further information and updates. Later in the day we updated that page with more appropriate messaging for the nature of the incident.

5:34 AM Eastern – Database is restarted and status page incident is created

At 5:34 AM Eastern we updated our status page (status.cxalloy.com) with the incident and the information we had at the time. We would continue to update our status page throughout the day.

The first thing we did was wait for the database to complete its restart cycle. On the rare occasions in the past where a database error has occurred it has been resolved by the automated restart of the database. This step alone is painful, however, as restarting our database takes over an hour due to its size. The restart started at 4:47 AM Eastern time and was not completed until 6:03 AM Eastern.

When the database completed its restart we began verifying that it was fully functional. This is when we discovered that the checklist lines table would not load. The logs showed:

Table `cxalloytq`.`checklistsectionline` is corrupted. Please drop the table and recreate it

6:34 AM Eastern – Backup data is secured

At this point our primary concern was securing your data to ensure no data would be lost. At 6:34 AM Eastern we ran an export of the checklist line data from our backup database and moved that to another machine. We ran a second export at 6:56 AM Eastern after the replica had a chance to synchronize the remaining data captured in the source database’s logs (data that had been created during the time the source database was attempting to run the migration that ultimately failed).
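
As a rough illustration of that synchronization step (the exact commands and export tooling we used are not detailed here), confirming that a replica has caught up before exporting looks something like the following.

-- On the replica: confirm all changes recorded in the source logs have been
-- applied before exporting. Older MySQL versions use SHOW SLAVE STATUS and
-- report Seconds_Behind_Master instead of Seconds_Behind_Source.
SHOW REPLICA STATUS;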

At the same time we investigated whether it was possible to repair the corrupted table rather than recreate it. We determined, based on the logs, the server state, and MySQL’s documentation, that repair was not possible.
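
For context, a sketch of the kind of diagnostic involved is MySQL’s CHECK TABLE statement; it reports on corruption, but repair options in MySQL are limited to a handful of storage engines, which is why an error message like the one above points to dropping and recreating the table.

-- Inspect the table named in the error log.
-- (REPAIR TABLE only supports a few storage engines, so in many cases the
-- only recovery path is to drop the table and restore it from a backup.)
CHECK TABLE cxalloytq.checklistsectionline;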

By this point it was approaching 7 AM and customers were logging on. Our goal was to get the site up and running as soon as possible. In pursuit of that goal, and having secured the data, we began the process of removing the corrupted checklist lines table and replacing it with the data exported from our backup. Our plan was to consider other options, including failing over to our backup database, while the data was being rebuilt.

7:18 AM Eastern – Restore begins and a mistake is made

At 7:18 AM we began the restore of the checklist lines table from our backup. Unfortunately we made a critical error at this point. We failed to pause replication to our backup server before starting the process of replacing the checklist lines table. This meant that the backup database, replicating the actions of the primary database exactly, also removed its checklist lines table. The result of this mistake was that our only option to bring the database back up was to wait for the restore to complete.
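
For context, pausing replication ahead of a destructive operation like this is a small step taken on the replica itself; a minimal sketch (using current MySQL statement names, with the older names noted in the comments) looks like this.

-- On the replica, before running destructive statements on the primary:
STOP REPLICA;   -- STOP SLAVE on older MySQL versions

-- ...drop and restore the table on the primary...

-- Once the primary is healthy again, resume replication so the replica
-- can catch up:
START REPLICA;  -- START SLAVE on older MySQL versions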

We did not discover this failure to pause replication until approximately 8:15 AM. By then, after an hour of observation, we estimated that the data restore was going to take 10 hours or more. That crossed the threshold at which it made sense to switch over to our replica database instead. As we began to prepare for that switch, we discovered our earlier error.

The team discussed whether it would be possible to open the site in a limited state that blocked access to the checklist areas. However, because of the interrelated nature of CxAlloy data we determined this was not feasible. Other areas, such as issues and files, would attempt to pull in checklist line data and fail.

8:39 AM Eastern – Waiting for the restore is determined to be the course of action

At 8:39 AM Eastern we determined that our course of action was fixed – we had to keep the site closed while the restore completed.

At 9:11 AM Eastern we posted an update about the situation to the status page, including that the restore would take 10 hours to complete. From that point on we monitored the restore, posted regular updates to the status page, and continued to communicate with customers through our support channels.

At approximately 9:30 AM we also paused the automated jobs we run in the background, such as sending out activity notifications.

5:47 PM Eastern – Restore completes

At 5:47 PM Eastern the restore finished. We posted an update to the status page at 5:52 PM Eastern that the restore had completed and the site would open after we verified the completeness and stability of the restore.

Over the next 30 minutes we tested the site and iOS app against the restored data, verified that the full data set was present, and tested for general application stability and performance.
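
As an illustration of one of the simpler completeness checks (the full set of checks is not listed here), the restored row count can be compared directly against the exported data set.

-- Compare the restored row count against the export taken earlier in the day.
SELECT COUNT(*) FROM cxalloytq.checklistsectionline;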

6:21 PM Eastern – Application is made available

At 6:21 PM Eastern we reopened the site. At 6:29 PM Eastern we posted an update to the status page that the site was now open and changed the incident status to “Monitoring”.

Over the remainder of the evening we observed the operation of the application for any issues. Finding none through the evening and overnight, we resolved the incident on Thursday morning at 8:44 AM Eastern.

Process and Infrastructure Changes We Are Implementing

We never want to experience anything like this again. We have evaluated the failure points in this incident and will be implementing the following changes.

Process Changes

  • Adding columns to large tables will be done using a different methodology. Instead of adding the columns in-place to the existing table, we will create a new table with the updated structure and copy the data into it prior to any application update. When the application is updated we will then rename the tables so that the table with the new structure comes into use (a rough sketch follows this list). We have used this approach in the past with success, and it provides greater resilience and control compared to modifying a table in-place.
  • Releases that involve migrations on large or critical tables will be done outside of the business week. This incident was more impactful than it might otherwise have been because it happened on a business day during business hours. We have found that releasing application updates during the week usually provides a better experience for our team and users because it allows us to quickly identify and correct any issues. However, for database changes that have the potential – however unexpected – to cause downtime, we will no longer release during the business week.
  • Implement a “buddy system” for emergency actions. A significant factor in the length of this outage was our inability to fail over to our replica due to a mistake we made. Although the team was communicating and working together, each member was taking individual actions in their own domain to address the issue. Going forward, we will require at least two people to sign off before infrastructure actions are taken during incident response. We can never eliminate the possibility of mistakes, but we can reduce their likelihood.
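
A rough sketch of the rename-based approach described in the first item above (the added column and the _new/_old table names are illustrative only):

-- Create a table with the new structure alongside the existing one.
CREATE TABLE checklistsectionline_new LIKE checklistsectionline;
ALTER TABLE checklistsectionline_new
    ADD COLUMN example_flag TINYINT NOT NULL DEFAULT 0;

-- Copy the existing rows across ahead of the release. In practice this is
-- done in batches, and rows written during the copy are reconciled before
-- the swap.
INSERT INTO checklistsectionline_new
SELECT *, 0 FROM checklistsectionline;

-- At release time, swap the tables with a single atomic rename.
RENAME TABLE checklistsectionline     TO checklistsectionline_old,
             checklistsectionline_new TO checklistsectionline;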

Infrastructure Changes

  • A second replica will be created. Although the failure to pause replication in this instance was purely a mistake, having a second replica allows us to make pausing a replica part of our standard process rather than a decision that has to be made during an incident.

In addition, and separate from this incident, our offsite backup program is moving to a schedule more frequent than once every 24 hours. Although our offsite backups were not a factor in this case, there is always the possibility of a catastrophic situation (such as a data center burning down) that would require restoration from an offsite backup. At the size and scale we now operate, we believe it is unacceptable to risk losing more than a few hours of data, even in a truly catastrophic situation.

Our Commitment to You

We are always trying to get things right. In this case we got it wrong. When that happens we look at where the problem came from, identify solutions, and implement them. We’ve done it before and we will do it here, and we will – as always – aim to be open, honest, and transparent with you about our efforts.