The Crowdstrike Debacle and Its Lessons

29 July 2024
Author: Peter Schnoor   |   Reading time: 7 minutes

What began on July 19, 2024, in Australia spread with the rising morning sun across the globe, taking millions of Windows computers down with it: the largest IT failure of all time. What were the causes of this global outage, and what lessons must we learn?

An Overview

It is a daily ritual in offices around the world: Arrive, turn on the coffee machine, sit down, and then: boot up the computer. But on Friday, July 19, this routine was abruptly interrupted by something known in the industry as "BSOD," or "Blue Screen of Death." Nothing worked on Windows computers worldwide, in hospitals, airports, government offices, schools, and industry. The computers could not boot up.

What had happened? The clues quickly began to accumulate. Apparently, the symptoms were related to the latest update of a piece of security software: the "Falcon" platform from the Texas-based company Crowdstrike. Companies use this software to detect and eliminate malware and the suspicious activity it causes - essentially an advanced kind of antivirus software.

This software is also available for Mac and Linux PCs, but in this case only Windows systems were affected. It quickly became clear that the damage ran into the billions and that the aftermath of the debacle would occupy companies and the public for years to come. Technically, though, a workaround for the immediate problem of the blue-screening PCs was found relatively quickly: the affected PC had to be started in safe mode, and a specific file from the faulty Falcon update deleted. In practice, however, this was tricky.
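
For illustration, here is a minimal sketch of how an administrator might have scripted the deletion step, assuming the file locations named in Crowdstrike's public guidance (a channel file matching C-00000291*.sys under C:\Windows\System32\drivers\CrowdStrike). In reality, each machine first had to be brought into safe mode or the recovery environment by hand; the sketch only covers what happens after that.

    # remove_faulty_channel_file.py - sketch of the published workaround.
    # Assumes the machine has already been booted into safe mode or the
    # recovery environment; paths follow Crowdstrike's public guidance.
    from pathlib import Path

    CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    FAULTY_PATTERN = "C-00000291*.sys"  # channel file shipped with the bad update

    def remove_faulty_channel_files(directory: Path = CROWDSTRIKE_DIR) -> int:
        """Delete all channel files matching the faulty pattern, return the count."""
        removed = 0
        for channel_file in directory.glob(FAULTY_PATTERN):
            channel_file.unlink()
            removed += 1
        return removed

    if __name__ == "__main__":
        count = remove_faulty_channel_files()
        print(f"Removed {count} faulty channel file(s); reboot normally afterwards.")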

Thus began for many IT administrators what they refer to as "sneaker administration." Because capacity was limited, they had to triage: who do we save first? Who needs it most urgently? And which customers do we put on the back burner? Then they had to walk to each affected computer and carry out the procedure by hand. In complex organizations like a hospital or an airport, with hundreds or thousands of workstations, terminals, and machines, this was a mammoth task. Hours passed, flights were canceled, life-saving surgeries were postponed, production lines were halted, and customers were left waiting. It wasn't until Saturday that most affected businesses could report progress, at least for the most critical computers. Even a week after the incident, however, hundreds of thousands of less critical PCs were still affected by the problem.

What Exactly Caused the Error?

That a simple update to a piece of security software could trigger such a devastating chain reaction is due to specific characteristics of the Windows operating system. Every home user knows that antivirus programs and similar software only work if they have far-reaching permissions within the operating system. To detect and neutralize threats, Norton, Avira, and the like need to be able to intervene deep in the system. This is all the more true for high-end security solutions like the Falcon Suite.

Crowdstrike's "Endpoint Detection and Response" (EDR) solution requires what are known as kernel-mode rights during operation. The kernel is something like the CEO of a computer: it ensures that programs and hardware work together smoothly and efficiently. A computer has several security zones, much like a large public event, and not everyone is allowed into every zone. Software running at the kernel level (in "kernel mode") has every conceivable right on the PC.

Microsoft has granted these rights to third parties since 2009, as a result of antitrust proceedings brought by the EU Commission over anti-competitive practices. That is why, under Windows, some security-critical software runs with kernel rights. If there is a bug in this software, whether due to an attack or a simple coding error, what happened now can occur: the computer detects an error at the level of its core functions and halts.

Monocultures?

Errors happen to everyone at some point. In a way, it is perhaps the small mistakes that make this world human and lovable. Personally, I find that the subtle crackle of a vinyl record enriches the listening experience more than the crystal-clear sound of Spotify, and that the slight grain inherent in the physical medium makes analog photos more vivid than the glossy images from the latest iPhone.

But it becomes critical when a simple error causes more global chaos than any known hacker attack. Then one must ask: is the global IT order to blame? After all, Windows has a global market share of about 70-80% among desktop and laptop computers, and Crowdstrike, with its flagship Falcon Suite, is widely used by the largest companies in the world. Are monocultures to blame for the debacle?

Well, I'm not so sure. Microsoft's Windows clearly leads among operating systems for desktops and laptops, but on mobile devices and in IoT it plays no role (iOS is from Apple, and Android is a Linux system), and the server landscape that keeps the 21st century running is dominated by Linux systems. So one cannot really speak of a true monopoly.

And the error was indeed Crowdstrike's, not Windows'. Yet Crowdstrike, despite its market power, is by no means a monopolist either. Competitors like SentinelOne, Cisco, Broadcom, Mandiant, WithSecure, and even Microsoft itself offer alternatives to Falcon, and these are in use as well. It is therefore reasonable to assume that a similar error could have hit other operating systems and other software.

None of this is meant to downplay the influence of players like Microsoft and Crowdstrike, whose name fits here with tragic irony. When the software that is supposed to keep a system secure itself contains an error, disaster can follow quickly.

What Can Companies Do to Prepare Themselves?

So how can companies protect themselves from debacles like this? How should they set up their IT to be as resilient as possible in such a crisis? And how can we make our companies' IT so antifragile that we might even benefit from such events (just think of the airlines: some flew, some did not)?

There is no simple solution here. One could switch operating systems, and indeed different systems have their individual advantages and disadvantages; our clients know, for example, that we are big fans of Linux. But Crowdstrike has also previously shipped updates that blocked Linux systems. The public outcry was predictably smaller, but for the affected companies the problem was much the same.

Building parallel structures is rarely sensible, especially for security software. Just as home users cannot install Kaspersky and Norton at the same time, running multiple security solutions side by side is also very limited in professional settings, because the programs would interfere with each other. And maintaining multiple computers with different operating systems for the same tasks is not only expensive but also organizationally complex, and it is only safer if the admins can keep all of these systems compatible and up to date - a mammoth task.

Depending on their size and use case, companies can take some fundamental measures. These include:

  • Establishing analog and/or digital redundancies to continue core business operations in emergency mode.
  • Developing and communicating robust emergency procedures.
  • Avoiding centralized solutions like the big cloud providers where sensible and possible. Using solutions with lower market penetration.
  • Where possible, rolling out updates only in stages, i.e., staggered, and testing the impact first (a minimal sketch of such a staggered rollout follows below).
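
To make the last point concrete, here is a minimal sketch of such a staggered ("ring") rollout. The host list, apply_update, and host_is_healthy are hypothetical placeholders for a company's real fleet-management tooling; the idea is simply that each ring only proceeds once the previous one has run without problems.

    # staged_rollout.py - sketch of a staggered ("ring") update rollout.
    # apply_update() and host_is_healthy() are hypothetical placeholders for a
    # company's real fleet-management tooling.
    import time

    RINGS = [0.01, 0.10, 0.50, 1.00]  # cumulative share of the fleet per stage
    SOAK_TIME_SECONDS = 1             # demo value; in practice hours or days

    def apply_update(host: str, update_id: str) -> None:
        """Placeholder: push the update to one host."""
        print(f"updating {host} with {update_id}")

    def host_is_healthy(host: str) -> bool:
        """Placeholder: check boot status, agent heartbeat, crash telemetry, ..."""
        return True

    def rollout(update_id: str, hosts: list[str]) -> None:
        """Update the fleet ring by ring, halting as soon as hosts become unhealthy."""
        updated: set[str] = set()
        for share in RINGS:
            batch = [h for h in hosts[: int(len(hosts) * share)] if h not in updated]
            for host in batch:
                apply_update(host, update_id)
                updated.add(host)
            time.sleep(SOAK_TIME_SECONDS)  # let the update "soak" before widening
            unhealthy = [h for h in updated if not host_is_healthy(h)]
            if unhealthy:
                raise RuntimeError(f"{update_id} halted: {len(unhealthy)} unhealthy hosts")

    if __name__ == "__main__":
        rollout("falcon-content-2024-07-19", [f"host-{i:03d}" for i in range(200)])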

The last point in particular leads directly to the main issue in this case.

What Must Software Developers Do to Avoid Such Errors?

The fault lay less with the system as a whole, and there was little individual users could have done to avoid it. But Crowdstrike, as the main culprit, and Microsoft, as the affected platform, could have and should have done more:

  • Updates should be rolled out in stages, no matter how small they are. It should not take the whole world going up in flames before an error is noticed. It is common practice, for good reason, to update a portion of users first and then check that everything works. This seems to have been skipped here.
  • Before an update is rolled out, it should be tested automatically and manually on all relevant systems to catch coding errors in time (e.g., through fault injection, stability tests, stress tests, fuzzing, or interface checks). Crowdstrike appears to have handled this inadequately - the change was apparently considered too small (a sketch of such a pre-release check follows below).
  • Rollback procedures should also be established and tested so that a faulty update can be reverted. A rollback did not work in this case because the affected computers could no longer boot after the update. But such situations are avoidable.
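
As a small illustration of the second point, here is a sketch of a pre-release sanity check for an update artifact. The file format and field names are purely hypothetical; the principle is that malformed update data should be caught in the vendor's build pipeline, not on the customer's machine.

    # validate_update.py - sketch of a pre-release sanity check for update content.
    # The JSON format and field names are hypothetical; only the principle matters:
    # parse and check the artifact before it ever leaves the vendor.
    import json
    import sys
    from pathlib import Path

    REQUIRED_FIELDS = {"version", "signatures", "checksum"}  # hypothetical schema

    def validate_update_file(path: Path) -> list[str]:
        """Return a list of problems found; an empty list means the file passes."""
        problems: list[str] = []
        raw = path.read_bytes()
        if not raw:
            return ["file is empty"]
        if raw.count(b"\x00") == len(raw):
            return ["file contains only null bytes"]
        try:
            content = json.loads(raw)
        except json.JSONDecodeError as exc:
            return [f"not parsable: {exc}"]
        if not isinstance(content, dict):
            return ["top-level value is not an object"]
        missing = REQUIRED_FIELDS - content.keys()
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        return problems

    if __name__ == "__main__":
        issues = validate_update_file(Path(sys.argv[1]))
        if issues:
            print("REJECTED:", *issues, sep="\n  - ")
            sys.exit(1)
        print("update file passed basic validation")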

Crowdstrike itself promises improvement. But the damage has already been done - above all to its customers.

The Real Solution: Skin in the Game

At present, companies like Crowdstrike have little incentive, beyond the risk to their own reputation, to establish such a rigorous testing regime. Outside of open-source applications, developers rarely have an external party looking over their shoulders to point out potential dangers (this, by the way, is an important reason for the security of open-source software such as GNU/Linux).

Structures should be created in which companies truly have "skin in the game": in which they (and their decision-makers) are directly and personally liable for damages caused by the software they sell. This would not prevent errors, but it could keep them from spreading as globally and devastatingly as they did this month.

Better Safe Than Sorry!

Make your business digital with customized and meaningful software solutions - but don't skimp on security.
We are happy to advise you on finding the ideal balance for you.

Peter Schnoor, Founder of Netjutant
contact@netjutant.com (+49) 8685-30998-22