Tech executives reassess IT resilience in CrowdStrike outage aftermath
Technology executives reassessed their IT operational resilience after a faulty CrowdStrike security update in July caused a global wave of costly systems outages. Most were not happy with what they found, according to a survey of 1,000 senior cloud architects and engineering executives conducted by Cockroach Labs and Wakefield Research in August and September.
More than 9 in 10 respondents said they were aware of operational weaknesses within their organization that leave IT systems vulnerable to costly service interruptions. Nearly half acknowledged they hadn’t done enough to improve resilience.
Every company surveyed reported revenue losses from outages in the past year.
“IT outages are pervasive,” Spencer Kimball, CEO of Cockroach Labs, told CIO Dive. “But the CrowdStrike issue was just so blatant and so preventable that people realized they have blind spots when it comes to critical vulnerabilities.”
The CrowdStrike event caught executives by surprise. Although it was live for less than two hours, the update brought down millions of Windows-based systems, grinding operations to a near halt at major airlines and interrupting banking functions globally as technology teams scrambled to respond.
CrowdStrike’s broad reach across continents and industries amplified the disruptive impact of the outage. Images of stranded passengers staring at error messages on airport monitors drove home the cost.
“When you make things really large, whatever can go wrong does go wrong 100% of the time,” Kimball said. “You can’t run something at scale and not be prepared to have machines, power systems and networking equipment fail — sometimes it’s a backhoe accidentally cutting into a fiber-optic cable that brings things down.”
Stress tests
IT snafus are endemic and persistent. Companies experienced an average of 86 outages annually, and more than half reported weekly service disruptions, the report found. The average recovery time was 196 minutes, or more than three hours.
“That’s a lot of lost productivity and a lot of stress on the engineers who have the pagers and have to do the postmortems,” Kimball said.
For a geographically dispersed operation, the challenges are manifold.
United Airlines dispatched teams to hundreds of airport locations to reboot more than 26,000 Windows devices in the days following the CrowdStrike outage, which hit in the early morning hours of Friday, July 19. The effort required staff to drive to sites lacking field support over the weekend, CIO Jason Birnbaum told CIO Dive.
To read the complete article, visit Cybersecurity Dive.