It was 3 a.m. on July 19 when Tyson Morris’s phone rang. Atlanta’s trains and buses were scheduled to start running in two hours, but all systems were down, displaying the “blue screen of death.”
“It’s the call an IT leader hopes never to get,” says Tyson Morris, director of information technology (IT) for Atlanta’s transit system. “I jumped up. My wife thought someone had died and asked me what was going on.”
Mr. Morris rushed to mobilize his team of 130 people. Was it a cyberattack? Sabotage by an employee? For hours, they searched in vain.
A debilitating breakdown
The outage, caused by a botched update from cybersecurity firm CrowdStrike, was exactly the kind of failure that IT teams plan for and hope never happens. About 8.5 million Windows systems crashed, crippling hospitals, airlines, and 911 centers around the world. Insurers expect to pay out more than $1 billion to insured businesses, and Fortune 500 companies are expected to lose $5.4 billion in revenue.
The outage made work difficult, if not impossible, for many people. IT teams worked long hours, some pulling all-nighters, to get systems up and running again over the weekend. The outage also exposed vulnerabilities and offered lessons for handling future incidents.
“I’ve never seen this level of stress,” said Morris, who has worked in the industry for more than 20 years. “Every second counted.”
A manual installation
The outage brought IT staff out of the shadows, said Eric Grenier, a cybersecurity analyst at Gartner, a market research firm. CrowdStrike sent a fix to users, but it had to be manually installed on each system. Grenier can recall only one other massive outage of similar magnitude, a buggy McAfee update in 2010.
Our reports are that hundreds of thousands of devices were brought back online over the weekend; that’s huge. IT was the superhero in this story.
Eric Grenier, cybersecurity analyst at Gartner
On the ground, it was hell. Kyle Haas, a systems engineer for the computer company Mirazon in Louisville, spent Friday driving around the city to get his customers back online. In the car, between customers, he would field emails and phone calls to help others. For nine hours straight, Mr. Haas didn’t stop.
“I skipped my coffee that morning,” he says. He woke up to panicked emails and messages from distraught customers. “It was as many issues as we could handle. Everything needed to be fixed.”
It could have been worse
Mr. Haas’s team of about 40 people spent 12 hours reconnecting all the customers, he says. It was a hectic day, but the problem was purely technical and relatively easy to fix. At least he didn’t have to fight off hackers or recover lost data, as is common during ransomware attacks or other major system failures.
He is particularly proud of helping a water filtration plant that was an hour away from switching to manual operation, which would have prevented it from testing water quality.
For Morris, who had been on the job in Atlanta for only three months, the outage was a shock. Fortunately, the IT department already had a contingency plan, with a list of phone numbers and dedicated communication channels. But the experience was still rough. Morris, who was visiting family in Tennessee, jumped in the car and drove to Atlanta. The team worked around the clock, with some members putting in 18-hour days and sleeping in the office.
By 9 a.m. on Friday, the buses and trains were running again. By Monday morning, all the laptops had been repaired.
Many thanks
“We had a lot of encouragement and thanks. It really motivated the troops,” said Mr. Morris.
On the West Coast, the outage began late Thursday night, giving IT teams there some time to identify the problem. An email from an IT contractor sounded the alarm at 10:30 p.m. Pacific time, followed by a cascade of automated server alerts, according to Jerry Leever, IT director at the Los Angeles accounting and tax firm GHJ.
Mr. Leever was checking his emails while brushing his teeth when he saw the message. His stomach tightened.
I had a moment of anxiety, then I remembered that we are trained for this kind of situation. We don’t really have time to panic, because we have to get everything back online as quickly as possible.
Jerry Leever, IT director at accounting and tax firm GHJ
By 3 a.m., Leever and his team had the servers back up and running. They scheduled an email to go out at 5 a.m., informing their 200 colleagues of the situation and explaining how to fix it. They also set up a conference call for 6 a.m. for staff who needed guidance through each step. By 10:30 a.m., everyone was back online, a feat Leever attributes to their communications plan and timely alerts.
All the IT professionals interviewed by the Washington Post say they learned lessons from the CrowdStrike outage. It highlighted the importance of having an up-to-date business continuity plan that focuses on communication procedures, which become complicated when systems are down. The incident also led some executives to question whether their contingency measures were sufficient to keep operations running during an outage.
Others are considering diversifying their suppliers so that their entire operations don’t depend on a single vendor that might fail. Some organizations are reassessing whether they have enough staff to handle emergencies or whether they should line up outside help in advance. The outage also highlighted the importance of storing key data, such as recovery keys for encrypted systems, in multiple locations in case a server goes down.
Mr. Leever considers the July 19 outage to be the worst incident of his career. Once everything was sorted out, he headed to his favorite restaurant and bar and ordered a burger and an Aperol spritz. “Give your IT a big hug,” he says. “It’s good that people are understanding and caring in times of crisis.”
This article was published in the Washington Post.