CrowdStrike – lessons learned from a Systemic Dependency Risk “near miss”

One year ago tomorrow, on July 19th, 2024, we all woke up to Friday morning headlines about a global IT problem causing Windows-based computers to display the “blue screen of death”. Like many other people, I’m sure, my immediate reaction was that this might be cyber risk’s version of the “big one” that we’ve all long feared: a global malware event that would brick all of our computers and send society back to the Stone Age. While it was initially being called a Windows outage, it soon became clear that the culprit was a faulty software update from CrowdStrike that had somehow crashed Windows.

The breadth was truly global and across almost all industries. Major airlines were grounded, health care systems went into downtime protocol, banks and payment systems¹ were down… it looked bad². But within hours CrowdStrike issued a fix and Microsoft provided restoration instructions, so recovery began over the course of that day for most companies and was mopped up over the subsequent weekend by hard-working IT teams. And just like that it was mostly over (unless you were a passenger trying to fly on Delta Air Lines), and it became a “near-miss” incident for the types of interconnected systems disruption that are worrisome for Systemic Dependency Risk.

This is the first in a series of retrospective looks at historical Systemic Dependency Risk events and near-misses. Because this one is a near-miss, it will focus less on impact details and more on implications and lessons learned. The headlines on the impact are economic losses estimated at $1.7 billion to $5.4 billion, with insured losses of only $300 million to $1.5 billion³.

Before diving into it, it’s worth noting that this incident perhaps doesn’t strictly qualify as a Systemic Dependency Risk event because the failure actually occurred on the victims’ internal machines, rather than on an external service⁴. It’s splitting hairs, but CrowdStrike’s software and services weren’t unavailable; their software update service successfully sent updates, one of which unfortunately caused a Microsoft Windows failure on machines that installed the update. Otherwise, the event bears all the hallmarks of a Systemic Dependency Risk event – particularly the breadth of impact – which can provide some very important lessons learned:

1. Systemic Dependency Risks can come from unexpected places

It is no small irony that the cause of the incident is software that was meant to protect its users from cyber security threats. If you had asked senior executives and/or risk management for a typical large company to list their top 25 highest-risk dependencies, I seriously doubt that any of them would have had CrowdStrike on their list; software as a category probably wouldn’t have any representation on the list for many companies. There are several reasons why CrowdStrike or similar might not be front of mind for dependency risks, but probably the two most important are: (1) dependency awareness tends to be operations-centric, i.e. traditional physical supply chain, key partners and critical service providers, etc., and (2) the list of dependencies can be really, really long in the modern age of complex business models and technology.

To the first point, CrowdStrike snuck into the dependency chain via the IT department rather than revenue-facing business units – or even more “traditional” mid-office functions like HR and Finance – where the senior management team tends to have more focus and familiarity. When it comes to technology dependencies, the IT department is typically positioned as a gatekeeper to prevent these sorts of dependencies, or at least raise awareness and force mitigation and disaster recovery plans. That gatekeeping function may be compromised when IT is the “buyer”, as discussed further below.

It may be tedious, but it’s probably worthwhile to do routine reviews of the end-to-end comprehensive process flow for each business unit and/or distinct product line as well as all support functions in order to surface potential dependencies. And it’s not just the vendor list. Dependencies can arise on the sales and revenue-cycle side as well as the production side of the business, and can come from a broader range of inputs, infrastructure and background conditions than the more narrow traditional supply chain view.

Internal vendor management and IT approval documentation might help to ensure that the process flows are comprehensive and complete, but there’s no substitute for doing the full “I’m Just a Bill” detailed process walk-throughs with the responsible managers.
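As a rough illustration of what such a walk-through might produce, here is a minimal sketch that records each process flow as a simple "relies on" map and then surfaces the indirect dependencies that several business units share. Every name in it (the units, the systems, the `all_dependencies` and `shared_dependencies` helpers) is a hypothetical placeholder rather than real company data or an established tool.

```python
from collections import defaultdict

# Hypothetical process flows: each business unit, system, or device class maps
# to the things it directly relies on. All names are illustrative only.
DEPENDS_ON = {
    "flight operations": ["crew scheduling system", "airport kiosks"],
    "customer billing": ["payments processor", "finance laptops"],
    "crew scheduling system": ["windows servers"],
    "airport kiosks": ["windows endpoints"],
    "finance laptops": ["windows endpoints"],
    "windows servers": ["endpoint security agent"],
    "windows endpoints": ["endpoint security agent"],
}


def all_dependencies(start, graph):
    """Return everything reachable from `start`, whether direct or indirect."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def shared_dependencies(units, graph):
    """Map each dependency used by two or more units to the units relying on it."""
    users = defaultdict(set)
    for unit in units:
        for dep in all_dependencies(unit, graph):
            users[dep].add(unit)
    return {dep: sorted(u) for dep, u in users.items() if len(u) > 1}


shared = shared_dependencies(["flight operations", "customer billing"], DEPENDS_ON)
for dep, units in sorted(shared.items()):
    print(f"{dep}: relied on by {', '.join(units)}")
```

In this toy example the endpoint security agent never appears on either unit's direct list, yet both units depend on it, which is the kind of buried dependency the walk-throughs are meant to surface.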

2. Knock-on effects can be worse than the source disruption

Arguably, the CrowdStrike incident is one big knock-on event – the Windows failure caused by the CrowdStrike software update – as discussed above. For most companies, the downtime on core systems was less than a day, plus some additional IT restoration time over the subsequent weekend.

Knock-on effects into other critical dependencies were mostly brief and not very consequential. Some major ports experienced shut-downs overnight but were operational again by morning. FedEx and UPS experienced delays on deliveries scheduled for the day of the incident. Cancelled flights left an excess supply of jet fuel in California, which created a storage problem.

But not all dangerous cascading consequences are external. Like most other companies, Delta Air Lines was able to restore its Windows-based computers relatively quickly, but the number of changes caused by cancelled and delayed flights left its crew-tracking software “unable to effectively process the unprecedented number of changes triggered by the system shutdown”, resulting in even more cancellations. Delta’s disruption spanned approximately five days while competitors quickly recovered over the weekend. The airline industry was already well aware of the criticality of crew-scheduling systems following the Southwest Airlines meltdown in December 2022, and indeed maintaining high-volume dependability of this system was a key consideration in adopting a hybrid cloud-mainframe architecture that was highly touted in the renewal of its third-party systems support and modernization agreement in 2023. Despite this, Delta’s system failed spectacularly, turning a brief IT disruption into a $500 million loss. Obviously this suggests they probably needed more robust testing of the system – including a scenario with a cold shutdown and an initial load of a full day of cancelled flights – but more importantly, also a better plan for what to do if the system is unavailable.

One key aspect of mitigating Systemic Dependency Risk is planning ahead for what to do in the event of a failure of a key dependency or critical downstream system – an outage “runbook”. This is understandably a standard for industries where operational continuity has life-and-death consequences such as hospitals. Indeed, airline pilots are trained extensively in protocols for in-flight failures including mechanical, electronics and communications systems, so you might think airlines would find this a familiar concept. An outage runbook won’t eliminate the risk – presumably the reason the dependency exists is because it’s operationally more efficient than alternatives – but it can help everyone quickly get on the same page for how best to minimize the impact.

3. The first rule of IT Security should be to follow all of the rules of IT Security

Anyone who has been entangled in a months-long software procurement process at a large company knows that IT departments typically have quite stringent rules about systems access, external connections, information security, testing requirements, disaster recovery documentation, etc. You wouldn’t dare bring the IT approvers a software vendor that wants privileged access to automatically push updates to “kernel driver” files⁵ on your company’s servers and laptops: you’d be laughed out of the room because of the risk of introducing a problem – either malicious or unintentional – to those computers’ operating systems. Yet somehow, that’s what many companies’ IT Security groups signed up for with CrowdStrike.

While cyber security software may have a legitimate need to run at the operating system level in order that it can’t be interfered with by the malware it is trying to detect and prevent, and cyber security software updates may be urgent in the face of continually evolving threats, those updates shouldn’t circumvent safe IT practices. In particular, software updates should be applied first in an isolated “test” environment to ensure that they work as intended with no adverse consequences. It’s also good practice to stagger the roll-out of broad updates so that the update can be halted and corrected if any problems are experienced in the initial waves.
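To make the staggered roll-out idea concrete, here is a minimal sketch of a wave-based update loop that stops as soon as any machine in a wave looks unhealthy. It is an illustration under my own assumptions, not a description of how CrowdStrike or any other vendor actually deploys updates: the `staged_rollout` function, the `apply_update` and `is_healthy` hooks, and the wave sizes are all hypothetical.

```python
def staged_rollout(hosts, apply_update, is_healthy, wave_fractions=(0.01, 0.10, 1.0)):
    """Apply an update in growing waves and halt if a wave shows any failure.

    hosts: list of host identifiers, with isolated test machines listed first.
    apply_update, is_healthy: caller-supplied callables (hypothetical hooks).
    wave_fractions: cumulative share of hosts covered by the end of each wave.
    Returns (updated_hosts, halted).
    """
    updated = []
    for fraction in wave_fractions:
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check the whole wave before touching any more machines.
        if not all(is_healthy(h) for h in wave):
            return updated, True
    return updated, False


# Simulate a faulty update that crashes every machine it reaches.
crashed = set()
hosts = [f"host-{i}" for i in range(200)]
updated, halted = staged_rollout(hosts, crashed.add, lambda h: h not in crashed)
print(f"halted={halted}, machines affected={len(updated)} of {len(hosts)}")
```

With these illustrative wave sizes, the faulty update is stopped after the first wave rather than reaching the whole fleet. The trade-off is speed: each wave needs enough soak time for problems to show up, which is in tension with the urgency of security updates.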

But the broader issue is one of checks and balances. Part of why IT Security has an approval function is because of the gap in relative expertise between the IT department and typical business users; otherwise those business users might unknowingly make risky decisions. Equally important, IT Security approval ensures that awareness of the organization’s IT risks is centralized and that IT risk mitigation policies are applied consistently throughout the organization. While it’s important that IT Security follow all of its own policies and processes for approving IT Security software, from a checks and balances perspective there may also be a need for audit and/or risk management review and reporting to ensure organizational awareness and acceptance of any risks IT Security is taking on.

4. Contracting practices are overdue for review

Like many software and services vendors, CrowdStrike’s standard terms and conditions include a very strict “Limitation of Liability” clause that provides no liability whatsoever for lost revenues or profits, and in any case limits liability to the fees paid for their service⁶. While there may be some uncertainty around the enforceability of such limitations⁷, it’s important to recognize that these terms can leave the company with no recourse for failure of a critical dependency, which will also typically not be covered by insurance. If acceptance of these contractual limitations can effectively mean acceptance of an enterprise-level material risk, the contracting process should expand beyond legal and vendor management to include risk management review and senior management approval where appropriate.

The market practice of strict limitation of liability for software and services vendors is a relic of an era when these types of vendors were smaller (oftentimes so small that their liability was effectively very limited by financial capacity anyway) and less likely to be critical dependencies with the potential to wreak business interruption havoc. In our modern interconnected economy, the software and services companies that have the potential to be Systemic Dependency Risks are often some of our largest and most cash-rich companies⁸.

What if instead of generating risk to their customers, software and services vendors provided stronger warranties and service level agreements without unduly narrow limitations of liability? This would align the risk of outage or disruption with the authority over product design, processes and policies for controlling that risk. Presumably this would shift the problem of insurance cost and capacity to the vendors, but it might be better aligned to existing Errors & Omissions liability products. And if customers adequately recognized their economic cost of risk for potential dependency disruptions, some of the vendors’ insurance costs could presumably be passed on to customers in a higher-price-for-lower-risk value proposition relative to competitors who foist the risk onto their customers.

5. Value of diversity in commercial ecosystems

At the time of the incident, CrowdStrike had more than 24,000 clients, including 60% of the Fortune 500. Like many technology-driven services, there is a tendency towards concentration due to inherent operating leverage (the marginal cost to service an additional customer is low), especially in cases where the business value is amplified by network effects. Absent any cost to discourage it, these dynamics will tend towards unhealthy concentrations.

The CrowdStrike incident provided a particularly good example of the value of ecosystem diversity: while the largest airlines experienced thousands of cancellations, Southwest Airlines (somewhat ironically), Alaska Airlines and JetBlue had almost none. Indeed, SEC filings from JetBlue and Alaska Airlines even noted that the disruption provided a revenue boost from accommodating passengers whose flights on other airlines were cancelled. Initially there were unfounded rumors that Southwest dodged the bullet because they were running on an outdated Windows 3.1 operating system, but it turns out it was because they “primarily utilize a CrowdStrike competitor for endpoint cyber security protection”. JetBlue and Alaska Airlines have also been clients of CrowdStrike competitors.

Nobody chooses their software based on the potential opportunity for profit or schadenfreude if their competitors experience a disruption from using an alternative software product with higher market share⁹. Products might achieve high market share simply because they’re better, outweighing the protection from diversity provided by low market share products (see box below for my anecdotal experience). But from an overall economic perspective with respect to Systemic Dependency Risk, everyone would be better off if no single vendor had a concentrated share. Following the CrowdStrike incident, there were some calls for regulatory intervention to break up “digital monoculture” in an attempt to prevent or limit the severity of similar incidents in the future. Trying to regulate market share has always been politically challenging, and even well-meaning regulations to encourage healthier competition may be easily circumvented.

Some Love for Lotus Notes

In 2000, the ILOVEYOU computer worm became one of the most widespread computer viruses in history to that point, infecting over 50 million computers (thought to be around 10% of those connected to the internet at the time). It contained a Visual Basic script attachment which would send copies of itself to all contacts in the user’s address book, search for certain file types on any connected drives to replace with copies of itself, and set the Internet Explorer homepage to a URL that downloaded a Trojan Horse.

Very few people at the time had any real cyber security awareness. The previous worst incident – the Melissa virus in 1999 – only infected an estimated 1 million computers and didn’t receive widespread attention.

I received the e-mail with subject line “ILOVEYOU” from a former colleague who was known for being pretty funny, so I opened it immediately. Luckily nothing happened because the company I worked for at the time used the already outmoded Lotus Notes suite for e-mail, and the Visual Basic script was only able to run in Microsoft Outlook.

As an e-mail application, Lotus Notes had a lot of drawbacks compared to Microsoft Outlook at the time: clunky user interface, unexpected application crashes, and occasional incompatibility with other applications. But as an inadvertent cyber security defense, it saved me – and all of my coworkers and e-mail contacts – from the ILOVEYOU virus.

Insurance for Systemic Dependency Risk would be the best solution to encourage diversity via the “invisible hand” of pricing, both by placing a transparent cost on the risk companies are assuming, and also by adding a risk premium to the most concentrated risks to account for limited capital capacity and potential correlation to financial markets.

For each incident in this series of case studies, a few key characteristics and statistics will be tracked with an eye towards eventually having the tools to quantify frequency and severity of these events:

Incident Summary
CrowdStrike Global IT Outage – July 19, 2024

Source of disruption                      Software
Root cause                                Human error (programming)
Scope of impact      Geographic           Global
                     Industries           All (but especially airlines and healthcare)
                     Estimated revenue    $30 billion – $60 billion per day¹⁰
Duration                                  < 24 hours for restoration of core systems for most companies, with some additional time and expense for full recovery
Estimated losses     Economic             $1.7 billion – $5.4 billion
                     Insured              $300 million – $1.5 billion
Known losses                              Delta Air Lines: $500 million
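For anyone who wants to check the arithmetic behind the estimated revenue row, here is a small sketch of the calculation described in its footnote. The Fortune 500 figures come from this article; the share of small to medium-sized enterprises affected is not stated anywhere, so `sme_share_affected` is purely my illustrative assumption, bounded between zero and an arbitrary value just below the Fortune 500 share.

```python
FORTUNE_500_ANNUAL_REVENUE = 19.9e12  # USD per year, from the footnote
FORTUNE_500_SHARE_AFFECTED = 0.60     # share of Fortune 500 that were clients
DAYS_PER_YEAR = 365


def daily_revenue_exposed(sme_share_affected):
    """Daily revenue (USD) of companies in the scope of impact.

    sme_share_affected is an ASSUMED share of small to medium-sized
    enterprises affected. Total SME revenue is treated as equal to total
    Fortune 500 revenue, following the footnote's "similar magnitude".
    """
    fortune_500 = FORTUNE_500_ANNUAL_REVENUE * FORTUNE_500_SHARE_AFFECTED
    sme = FORTUNE_500_ANNUAL_REVENUE * sme_share_affected
    return (fortune_500 + sme) / DAYS_PER_YEAR


low = daily_revenue_exposed(0.0)   # Fortune 500 clients only
high = daily_revenue_exposed(0.5)  # hypothetical upper bound for the SME share
print(f"${low / 1e9:.1f} billion to ${high / 1e9:.1f} billion per day")
```

Under these assumptions the result lands close to the rounded range shown in the table. Note that this is revenue exposed to the disruption, not revenue lost, which is why it is so much larger than the loss estimates.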

¹ Consumers reported payments problems even though Visa and Mastercard stated they were unaffected, but disruptions may have occurred with other parties in the payments processing chain.
² The Wikipedia page for the incident does a very good job of cataloging many of the impacts.
³ Low insured losses are potentially due to a number of factors: take-up rates on cyber risk insurance overall, take-up rate and sub-limits for systems failure coverage (i.e. business interruption) in cyber policies, the quick recovery time such that typical waiting period thresholds – the deductible for business interruption coverage – may not have been met, and significant retentions typical in cyber risk insurance programs for large companies.
⁴ And I have no idea how this was adjudicated with respect to cyber insurance policies, where the difference between an (internal) systems failure and an (external) dependent systems failure could determine whether the full limit applies rather than a much lower sub-limit.
⁵ Apologies if my very superficial technology knowledge has resulted in any botched terminology here.
⁶ Some clients may have negotiated changes to the terms and conditions. For example, based on court filings in Delta Air Lines Inc. vs. CrowdStrike Inc., Delta’s contract specified a liability limit of two times fees and an exception for gross negligence or willful misconduct.
⁷ Delta’s lawsuit against CrowdStrike recently survived CrowdStrike’s motion to dismiss, albeit with a significantly narrowed set of claims and some skepticism about Delta’s likelihood of success.
⁸ For example, cloud computing providers like Amazon, Microsoft and Google are three of the five largest companies in the US by market capitalization. CrowdStrike at $117 billion as of today ranks 93rd; as of April 30th, 2025, it had $4.6 billion cash and equivalents on its balance sheet, which is a bit more than its annual revenue, annualized from the most recent quarterly result of $1.1 billion.
⁹ On the other hand, it might not be a bad strategy for mitigating cyber risk: low market share vendors may be less attractive targets.
¹⁰ Daily equivalent of 60% of $19.9 trillion annual revenue for Fortune 500 plus a smaller proportion of small to medium-sized enterprises which have similar magnitude of revenues overall to Fortune 500.
