Last month, software tools vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of their over 200,000 customers. The outage took down several of their products, including Jira, Confluence, Atlassian Access, Opsgenie, and Statuspage.
While only a few customers were affected for the full two weeks, the outage was significant in terms of the depth of problems uncovered by the company’s engineers and the lengths they had to go to in order to find and fix them.
The outage was the result of a series of unfortunate internal errors by Atlassian’s own staff, and not the result of a cyberattack or malware. In the end, no customer lost more than a few minutes’ worth of data transactions, and the vast majority of customers didn’t see any downtime whatsoever.
What is interesting about the entire Atlassian outage situation is how badly they managed their initial communication of the incident to their customers, and then how they eventually published a lengthy blog post that goes into tremendous detail about the circumstances.
It is rare that a vendor who has been hit with such a massive and public outage takes the effort to thoughtfully piece together what happened and why, and also provide a roadmap that others can learn from as well.
In the post, they describe their existing IT infrastructure in careful detail, point out the deficiencies in their disaster recovery program, explain how they plan to fix its shortcomings to prevent future outages, and lay out timelines, workflows, and ways they intend to improve their processes.
The document is frank, factual, and full of important revelations, and it should be required reading for any engineering or network manager. If your business depends on software, use it as a template to locate and fix similar mistakes you might have made, and as a discussion framework to honestly assess your own disaster recovery playbooks.
Lessons learned from the incident
The trouble began when the company decided to delete a legacy app that was being made redundant by the purchase of a functionally similar piece of software. However, they made the mistake of splitting the job between two teams with separate but related responsibilities. One team requested that the redundant app be deleted, while another was charged with figuring out how to actually do it. That split should have raised red flags immediately.
The two teams didn’t use the same language and parameters, and as a result had immediate communication problems. For example, one team used the app ID to identify the software to be deleted, but the other team thought they were talking about the ID for the entire cloud instance where the apps were located.
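The ID mix-up described above is the kind of error that distinct identifier types can catch before anything is deleted. What follows is only a minimal sketch, not Atlassian’s actual tooling: the names AppId, SiteId, and delete_app are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppId:
    """Identifier for a single app (hypothetical type)."""
    value: str


@dataclass(frozen=True)
class SiteId:
    """Identifier for an entire cloud site (hypothetical type)."""
    value: str


def delete_app(target: AppId) -> str:
    """Accept only an AppId; reject anything else, including a SiteId."""
    if not isinstance(target, AppId):
        raise TypeError(f"expected AppId, got {type(target).__name__}")
    # A real implementation would call the deletion service here.
    return f"app {target.value} marked for deletion"
```

With this shape, passing a SiteId where an AppId is expected fails loudly at the call site instead of silently deleting the wrong thing.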
Lesson 1: Improve internal and external communication
The team that requests network changes and the team that actually implements them should be one and the same. If they are not, you need solid communication tools to ensure that they are in sync, use the same language, and follow precisely defined procedures. Because of the miscommunication, Atlassian engineers didn’t realize the extent of their mistake for several days.
But cross-team communication was only one part of the problem. When Atlassian analyzed its communications among managers and with customers, it found that it had posted details about the outage on its own monitoring systems within a day. However, it couldn’t reach some customers directly, because contact information was lost when the legacy sites were deleted and other information was woefully outdated.
Plus, the deleted data contained information that was necessary for customers to fill out a valid support request ticket. Getting around this problem required a group of developers to build and deploy a new support ticketing process. The company also admits they should have reached out earlier in the outage timeline and not waited until they had a full picture of the scope of the recovery processes.
This would have allowed customers to better plan around the incident, even without specific time frames. “We should have acknowledged our uncertainty in providing a site restoration date sooner and made ourselves available earlier for in-person discussions so that our customers could make plans accordingly. We should have been transparent about what we did know about the outage and what we didn’t know.”
Lesson 2: Protect customer data
Treat your customer data with care: ensure that it is current, accurate, and backed up in multiple, separate places. Make sure it can survive a disaster, and include specific checks for this in any playbook.
This brings up another point about disaster recovery. During the April outage, Atlassian missed its recovery time objectives (obviously, given the weeks taken to restore systems), but managed to meet its recovery point objectives, since it was able to restore data to a point just a few minutes before the actual outage. The company also had no automated way to select a set of customer sites and restore all of their interconnected products from backups to a previous moment in time.
“Our site-level deletions that happened in April did not have runbooks that could be quickly automated for the scale of this event,” they wrote in their analysis. “We had the ability to recover a single site, but we had not built capabilities and processes for recovering a large batch of sites.”
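The difference between the two recovery objectives mentioned above can be made concrete with a little date arithmetic. The sketch below is illustrative only: the function name check_recovery, the timestamps, and the six-hour and one-hour targets are all assumptions, not figures from Atlassian’s report.

```python
from datetime import datetime, timedelta


def check_recovery(failure_at, last_backup_at, restored_at, rto, rpo):
    """Compare one incident against its recovery objectives."""
    downtime = restored_at - failure_at      # how long service was unavailable
    data_loss = failure_at - last_backup_at  # work not captured in any backup
    return {"rto_met": downtime <= rto, "rpo_met": data_loss <= rpo}


# Illustrative values: a two-week restore with a backup five minutes old.
failure = datetime(2024, 1, 1, 9, 0)
result = check_recovery(
    failure_at=failure,
    last_backup_at=failure - timedelta(minutes=5),
    restored_at=failure + timedelta(days=14),
    rto=timedelta(hours=6),  # assumed recovery time objective
    rpo=timedelta(hours=1),  # assumed recovery point objective
)
# rto_met is False, rpo_met is True
```

An incident can therefore lose almost no data and still fail badly on recovery time, which is the pattern the article describes.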
In the blog confessional, they chart their previous large-scale incident management process. You can see that it has a lot of moving parts, and it wasn’t equipped to “handle the depth, expansiveness and duration of the April incident.”
Lesson 3: Test complex disaster recovery scenarios
Check and re-check your disaster recovery programs, playbooks, and procedures to ensure they meet various objectives. Make sure to test scenarios across all sizes of customer infrastructure. This means specifically addressing and anticipating larger-scale incident response and understanding the various complex relationships of customers that use multiple products or depend on an interlocking series and sequence of your applications.
If you are using automation, make sure your APIs are functioning properly and sending appropriate warning signals when they aren’t. This was one of the issues that Atlassian had to debug on the fly while the outage dragged on for days.
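One way to make sure automation sends warning signals is to have batch operations record and report every failure rather than stopping silently or hiding errors. The following is a rough sketch under stated assumptions: restore_batch and the restore_one callback are hypothetical names, and a real restore would involve far more than a single function call.

```python
import logging

logger = logging.getLogger("restore")


def restore_batch(site_ids, restore_one):
    """Restore each site, collecting failures instead of hiding them."""
    restored, failed = [], {}
    for site_id in site_ids:
        try:
            restore_one(site_id)  # hypothetical per-site restore callback
            restored.append(site_id)
        except Exception as exc:
            # Keep going with the remaining sites, but record and report the failure.
            failed[site_id] = str(exc)
            logger.warning("restore failed for %s: %s", site_id, exc)
    return restored, failed
```

Returning the failures alongside the successes lets the caller decide whether to retry, escalate, or halt, and the log line gives operators a signal they can alert on.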
Lesson 4: Protect configuration data
Finally, there is the issue of how data was deleted, which started the entire outage. Atlassian now realizes that immediate, permanent deletion of data, especially of an entire site, shouldn’t be allowed. The company is moving to what it calls a “soft delete,” which doesn’t dispose of data until the deletion has been vetted with defined system rollbacks and has passed through a number of safeguards.
Atlassian is establishing a “universal soft delete” policy across all its systems and creating a series of standards and internal reviews. Treat soft delete as a requirement rather than an option: don’t delete any configuration data until you have tested the deletion throughout your infrastructure.
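The general idea of a soft delete can be sketched in a few lines. This is not Atlassian’s design: the SoftDeleteStore class, its method names, and the 30-day retention window are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


class SoftDeleteStore:
    """In-memory store where deletion hides a record before removing it."""

    def __init__(self, retention=timedelta(days=30)):  # assumed retention window
        self._records = {}
        self._deleted_at = {}
        self._retention = retention

    def put(self, key, value):
        self._records[key] = value
        self._deleted_at.pop(key, None)  # writing a record clears any delete marker

    def get(self, key):
        if key in self._deleted_at:
            raise KeyError(key)  # soft-deleted records look absent to readers
        return self._records[key]

    def soft_delete(self, key, now=None):
        if key not in self._records:
            raise KeyError(key)
        self._deleted_at[key] = now or datetime.now(timezone.utc)

    def restore(self, key):
        if key not in self._deleted_at:
            raise KeyError(key)
        del self._deleted_at[key]  # the data was never removed, so rollback is cheap

    def purge_expired(self, now=None):
        """Permanently remove records whose retention window has passed."""
        now = now or datetime.now(timezone.utc)
        expired = [k for k, t in self._deleted_at.items()
                   if now - t >= self._retention]
        for key in expired:
            del self._records[key]
            del self._deleted_at[key]
        return expired
```

The point of the design is that a mistaken deletion can be rolled back with restore at any time before the purge runs, so the irreversible step is separated from the original request.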