Businesses worldwide faced significant disruptions on July 19, 2024, due to a faulty update from cybersecurity firm CrowdStrike. The update caused numerous Windows systems to crash, leading to the dreaded “Blue Screen of Death” (BSOD) and triggering widespread operational issues. In response, CrowdStrike’s CEO George Kurtz issued a statement clarifying that the problem was confined to Windows hosts and was neither a security incident nor a cyber attack. He reassured customers that Mac and Linux systems were unaffected.
The root of the problem was a defective content update for CrowdStrike’s Falcon Sensor product, which caused Windows hosts to crash and reboot continuously. CrowdStrike quickly identified the issue and deployed a fix, advising affected customers to refer to its support portal for the latest updates. For systems already impacted, the company provided specific mitigation steps: boot Windows in Safe Mode or the Windows Recovery Environment, navigate to the C:\Windows\System32\drivers\CrowdStrike directory, find and delete the faulty file matching “C-00000291*.sys”, and then restart the computer or server normally.
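The file-removal step above can be sketched in Python. This is only an illustration, not CrowdStrike's tooling: the directory and file pattern come from the published instructions, while the function name and the dry_run parameter are hypothetical. Python may not be available in Safe Mode or the Windows Recovery Environment, so in practice the deletion was usually done by hand.

```python
from pathlib import Path

# Directory and pattern are quoted from the remediation steps in the text.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_faulty_files(directory=DEFAULT_DIR, pattern=PATTERN, dry_run=True):
    """Return files in `directory` matching `pattern`.

    Hypothetical helper: with dry_run=True (the default) nothing is
    deleted, so the matches can be reviewed before removal.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # delete the matching file
    return matches
```

Calling remove_faulty_files() lists what would be deleted; passing dry_run=False performs the deletion, after which the machine would be restarted normally.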
The faulty update also affected Google Cloud Compute Engine, causing Windows virtual machines using CrowdStrike’s csagent.sys to crash and enter an unexpected reboot state. Google advised that Windows VMs currently running should no longer be impacted. Microsoft Azure reported similar issues, noting that multiple virtual machine restart operations might be required for recovery, with some instances needing as many as 15 reboots.
Amazon Web Services (AWS) took steps to mitigate the issue for as many Windows instances, Windows WorkSpaces, and AppStream applications as possible, and recommended that customers still experiencing problems take action to restore connectivity. Security researcher Kevin Beaumont highlighted the severity of the situation, stating that the CrowdStrike driver pushed via auto-update was not a validly formatted driver, causing Windows to crash every time.
The incident had far-reaching consequences, affecting airlines, financial institutions, food and retail chains, hospitals, hotels, news organisations, railway networks, and telecom firms. CrowdStrike’s stock fell 15% in U.S. premarket trading. The Texas-based company, serving over 530 Fortune 1,000 companies, develops endpoint detection and response (EDR) software that has deep access to operating system kernels. This access, while intended to enhance security, also posed significant risks, as seen in this incident.
Omer Grossman, Chief Information Officer at CyberArk, described the event as one of the most significant cyber issues of 2024, with dramatic global business process disruptions. Grossman pointed out that the problem required manual resolution, endpoint by endpoint, by starting them in Safe Mode and removing the buggy driver. The root cause of the malfunction, a software update of CrowdStrike’s EDR product, was of utmost interest to cybersecurity experts.
Jake Moore, global security advisor at Slovakian cybersecurity company ESET, emphasised the need for multiple fail-safes in IT infrastructure. He noted that small errors in system updates could have wide-reaching consequences, as CrowdStrike’s customers experienced, and stressed that diversity in large-scale IT infrastructure helps prevent single points of failure that could lead to global-scale outages.
The incident occurred while Microsoft was dealing with its own separate outage, which affected Microsoft 365 apps and services, including Defender, Intune, OneNote, OneDrive for Business, SharePoint Online, Windows 365, Viva Engage, and Purview. Microsoft attributed the issue to a configuration change in its Azure backend workloads, which caused connectivity failures impacting downstream Microsoft 365 services.
Omkhar Arasaratnam, general manager of OpenSSF, highlighted the fragility of monocultural supply chains: those relying on a single operating system or EDR solution are inherently fragile and susceptible to systemic faults. He emphasised the importance of diverse technology stacks for greater resilience and security, and advocated rolling out system changes gradually so that their impact can be observed in smaller segments rather than all at once.
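A gradual rollout of the kind Arasaratnam describes can be sketched as follows. This is a minimal illustration under stated assumptions, not any vendor's actual deployment process: the function name, the stage fractions, and the failure threshold are all hypothetical, and apply_update and is_healthy stand in for whatever deployment and health-check mechanisms an organisation uses.

```python
def staged_rollout(hosts, apply_update, is_healthy,
                   stage_fractions=(0.01, 0.10, 0.50, 1.0),
                   max_failure_rate=0.02):
    """Apply an update in growing stages, halting if a stage looks unhealthy.

    Hypothetical sketch: each stage extends the rollout to a larger
    cumulative fraction of hosts and is checked before the next begins.
    """
    hosts = list(hosts)
    updated = 0
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = min(len(hosts), max(updated + 1, round(len(hosts) * fraction)))
        batch = hosts[updated:target]
        if not batch:
            continue
        for host in batch:
            apply_update(host)
        updated = target
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            # Stop here so the remaining hosts never receive the update.
            return {"halted": True, "updated": updated, "failed_stage": fraction}
    return {"halted": False, "updated": updated, "failed_stage": None}
```

With these illustrative defaults, an update that crashes every host would be halted after reaching about 1% of the fleet instead of all of it.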
In the wake of the CrowdStrike incident, the U.S. Cybersecurity and Infrastructure Security Agency (CISA) warned of malicious actors exploiting the disruption for phishing and other malicious activities. They observed the setup of scam domains and phishing pages, impersonating CrowdStrike staff and offering fake remediation and recovery scripts in exchange for cryptocurrency payments.
CrowdStrike apologised for the havoc caused by the update, acknowledging the gravity and impact of the situation. They warned customers about adversaries exploiting the event and shared additional technical details about the boot loop issue following the configuration update. The company committed to performing a root cause analysis to determine how the logic flaw occurred and emphasised the ongoing nature of sensor configuration updates as part of the Falcon platform’s protection mechanisms.
In the aftermath, Microsoft collaborated closely with CrowdStrike and industry partners to provide technical guidance and support, helping customers bring their systems back online safely. The incident underscored the critical need for robust cybersecurity measures, diversified IT infrastructure, and gradual implementation of system changes to prevent widespread outages and enhance global business resilience.