API Flaws Put AI Models at Risk of Data Poisoning
Hugging Face Fixes Flaw; Meta, Other Tech Giants Revoke Vulnerable Tokens
Security researchers could access and modify an artificial intelligence code generation model developed by Facebook after scanning for API access tokens on AI developer platform Hugging Face and code repository GitHub.
Lasso Security said on Monday that it had found hundreds of publicly exposed API authentication tokens for accounts held by tech giants including Meta, Microsoft, Google and VMware on the two platforms.
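Developers who want to check their own code for this kind of leak can approximate such a scan with a simple pattern match. The sketch below is not Lasso's tooling. It assumes that Hugging Face user access tokens start with the prefix "hf_" followed by a long run of letters and digits, and the function name and the 30-character minimum are illustrative choices. It prints only a redacted prefix so that a scan does not leak the token a second time.

```python
import re

# Assumption: tokens look like "hf_" followed by 30 or more alphanumeric characters.
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{30,}\b")


def find_candidate_tokens(text):
    """Return (line_number, redacted_token) pairs for token-like strings in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            token = match.group(0)
            # Keep only the first six characters so the report itself is safe to share.
            findings.append((line_number, token[:6] + "..."))
    return findings


if __name__ == "__main__":
    # Hypothetical file contents with a fake token made of repeated characters.
    sample = 'model = "demo"\nHF_TOKEN = "hf_' + "a" * 34 + '"\n'
    for line_number, redacted in find_candidate_tokens(sample):
        print(f"line {line_number}: possible token {redacted}")
```

A match is only a candidate. Any token found this way should be treated as compromised and revoked, whether or not it turns out to be live.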
Among the repositories to which the exposed tokens granted read and write access were those of Meta's Llama2 model and the Bloom and Pythia language models. The vast majority of the 723 accounts with exposed tokens that Lasso Security found had write permissions, raising concerns of supply chain vulnerabilities such as training data poisoning or model and dataset theft. Researchers said they had gained "full control over the repositories of several prominent companies." Tampering with training data to introduce vulnerabilities or biases is among the top 10 threats to large language models recognized by OWASP.
Hugging Face has fixed the vulnerability, and companies including Meta, Google, Microsoft and VMware have revoked the vulnerable tokens and removed public access to the code, the firm said.
Hugging Face is a $4.5 billion company funded by Google and Amazon that offers a platform for developers to build, deploy and train machine learning models. It provides infrastructure for them to demo, run and deploy AI in live applications. They can also store and manage their code.
Exposed tokens with write permissions not only put the compromised companies' projects at risk but could also affect millions of users of their models without those users' knowledge, Bar Lanyado, security researcher at Lasso, told Information Security Media Group.
A Google API token, he said, had contributor privilege, which means that in addition to stealing Google's data, an attacker could create a new malicious model and spread it under Google's name.
Lanyado said that the team expected to find a large number of exposed tokens when it started the scanning project but was "extremely overwhelmed" by the number found, as well as the simplicity with which they were able to gain access. "We were able to access nearly all of the top technology companies' tokens and gain full control over some of them. Even Meta, Microsoft, Google, VMware, which have high levels of security, still didn't know about the existence of these tokens publicly."
Vulnerabilities in the infrastructure elements that build AI tools are not new. Nvidia recently detailed security flaws in a library that provides the framework to implement LLM plug-ins, researchers with Google found a relatively easy way to extract training data from ChatGPT, and Protect AI found nearly a dozen critical vulnerabilities in the technical infrastructure that companies use to build AI models (see: Popular AI Tools Contain Critical, Sometimes Unpatched, Bugs).
The "gravity of the situation cannot be overstated," according to Lanyado.
Lasso researchers also gained access to 14 datasets that had been downloaded by hundreds of thousands of users each month. Hackers could exploit this access to poison training data. Just one of Hugging Face's assets - albeit a key one - is its open-source Transformers library, which holds more than 500,000 AI models and 250,000 datasets, including those of the Meta-Llama, Bloom and Pythia models - each of which has been downloaded millions of times.
The researchers also could have stolen an additional 10,000 private AI models that were linked to more than 2,500 datasets, the firm said.