In a startling revelation, Microsoft AI researchers inadvertently exposed tens of terabytes of sensitive internal data while publishing a link on GitHub to an Azure storage bucket of open-source training data. The issue was discovered by cloud security startup Wiz during its investigation into the accidental exposure of cloud-hosted data.

The Perils of Shared Access

The GitHub repository offered open-source code and AI models for image recognition and directed users to download the models from an Azure Storage URL. However, Wiz found that the URL was misconfigured to grant permissions to the entire storage account, rather than just the intended open-source data.

This resulted in the accidental exposure of 38 terabytes of sensitive data, including personal backups of two Microsoft employees’ computers, secret keys, passwords to Microsoft services, and over 30,000 internal Microsoft Teams messages from numerous Microsoft employees.

Full Control, Full Risk

Wiz pointed out that the URL granted “full control” rather than “read-only” permissions. This meant that anyone who knew where to look could potentially delete, replace, or even inject malicious content into the data. The root cause was an overly permissive Shared Access Signature (SAS) token in the URL; SAS tokens are Azure’s mechanism for creating shareable links to storage data.
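The difference between a read-only link and a full-control one is visible in the SAS token itself: its "sp" query parameter lists the signed permission flags (r for read, a for add, c for create, w for write, d for delete, l for list). As a minimal sketch, the check below parses a SAS URL and flags any token granting more than read/list access; the URLs and function names are illustrative, not part of any official tooling.

```python
from urllib.parse import urlparse, parse_qs

# Flags in the "sp" parameter that allow modifying data:
# a=add, c=create, w=write, d=delete.
WRITE_FLAGS = set("acwd")

def sas_permissions(url: str) -> set:
    """Return the set of permission flags granted by a SAS URL."""
    query = parse_qs(urlparse(url).query)
    # "sp" holds the signed permissions; an empty result means
    # no SAS permission parameter was found in the URL.
    return set(query.get("sp", [""])[0])

def is_overly_permissive(url: str) -> bool:
    """Flag any SAS URL that grants more than read/list access."""
    return bool(sas_permissions(url) & WRITE_FLAGS)

# Hypothetical examples: a read/list token vs. a full-control token.
read_only = "https://acct.blob.core.windows.net/models?sp=rl&sig=x"
full_ctrl = "https://acct.blob.core.windows.net/?sp=racwdl&sig=x"
```

A link like the one in the exposed repository would carry flags such as "racwdl", which is what turns a download URL into a write-and-delete capability for anyone holding it.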

Responding to the Crisis


Upon discovering the vulnerability, Wiz promptly shared its findings with Microsoft on June 22. Microsoft acted swiftly, revoking the SAS token just two days later and concluding its organizational impact investigation by August 16. Microsoft’s Security Response Center clarified that “no customer data was exposed, and no other internal services were at risk due to this issue.”

A Bigger Issue in AI Development?

Wiz Co-founder and CTO Ami Luttwak remarked, “As data scientists and engineers race to bring new AI solutions to production, the massive amounts of data they handle require additional security checks and safeguards.” This incident underscores the complexities and risks involved in AI development, particularly when dealing with enormous datasets.

Lessons and Precautions

Microsoft has since expanded GitHub’s secret scanning service to monitor all public open-source code changes for plaintext exposure of credentials and other sensitive information. This move aims to prevent similar incidents involving overly permissive SAS tokens.
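The core idea behind such scanning is pattern matching over committed text. As a rough sketch only, the snippet below looks for SAS-token-like strings by spotting the "sv=" (service version) and "sig=" (signature) query parameters together; real scanners, including GitHub's, use far more precise and validated rules than this illustrative regex.

```python
import re

# Simplified pattern: a query string carrying both a service-version
# parameter ("sv=") and a signature parameter ("sig="), which together
# are characteristic of an Azure SAS token.
SAS_PATTERN = re.compile(r"\bsv=[\d-]+&[^\s\"']*sig=[A-Za-z0-9%+/=]+")

def find_sas_tokens(text: str) -> list:
    """Return SAS-token-like substrings found in a blob of text,
    e.g. the contents of a committed source file."""
    return SAS_PATTERN.findall(text)
```

Run against each pushed file, a matcher like this can surface a leaked shareable link before it sits in a public repository for years.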

A History of Data Leaks

This incident is not an isolated case; Microsoft has had its share of data leaks over the years. From exposed customer emails to security breaches purportedly orchestrated by hacker groups, Microsoft’s track record reveals the complexities and vulnerabilities inherent in safeguarding data.

Final Thoughts

While Microsoft was quick to address this most recent data exposure, the incident raises significant concerns about data security, especially as companies rush to innovate in fields like AI. These vulnerabilities not only jeopardize the company’s internal data but also pose risks to employee privacy and potentially even national security. This episode serves as a cautionary tale for tech giants and small startups alike, emphasizing the critical importance of rigorous security checks in an era of rapid technological advancement.