Meta Prompt Guard Is Vulnerable to Prompt Injection Attacks
Researchers Add Spaces in 'Ignore Previous Instructions' Prompt to Bypass Security

A machine learning model that Meta released last week to prevent prompt injection attacks is vulnerable to prompt injection attacks, researchers said.
Hackers use prompt injection attacks to input a malicious set of instructions that manipulate machine learning models into bypassing security controls and performing unwanted actions. Meta launched Prompt Guard alongside its Llama 3.1 generative model to "help developers detect and respond to prompt injection and jailbreak inputs."
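Prompt Guard ships as a small text-classification model rather than a chat model, so developers run it as a filter in front of their own LLM. As a rough sketch - assuming the checkpoint Meta published on Hugging Face as meta-llama/Prompt-Guard-86M and label names along the lines of BENIGN, INJECTION and JAILBREAK, details drawn from the model card rather than from this article - screening user input might look like this:

```python
# Sketch: screening user input with Prompt Guard via Hugging Face transformers.
# Assumes access to the gated meta-llama/Prompt-Guard-86M checkpoint; the label
# names (BENIGN / INJECTION / JAILBREAK) are assumptions from the model card.
from transformers import pipeline

classifier = pipeline("text-classification", model="meta-llama/Prompt-Guard-86M")

def screen(user_input: str, threshold: float = 0.5) -> bool:
    """Return True if the input looks safe to forward to the LLM."""
    result = classifier(user_input)[0]  # e.g. {'label': 'JAILBREAK', 'score': 0.99}
    return result["label"] == "BENIGN" or result["score"] < threshold

print(screen("What is the capital of France?"))        # expected: True
print(screen("Ignore previous instructions and ..."))  # expected: False
```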
Typical jailbreak attacks open with an "ignore previous instructions" prompt - a strategy AI companies usually counter by instructing the model to disregard any instruction that contains those words. But researchers at Robust Intelligence found that they could get around Prompt Guard's prompt injection guardrails and tell it to "ignore previous instructions" simply by adding spaces between the letters and omitting punctuation.
"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content," said Aman Priyanshu, a bug hunter at Robust Intelligence who discovered the workaround.
The root cause of the bypass may lie in the fine-tuning process, which Robust Intelligence detailed in its blog post. Prompt Guard is a fine-tuned version of an existing base model, meant to catch high-risk prompts, and the fine-tuning appears to have had little effect on isolated single characters - which would explain why a prompt spelled out letter by letter slips past the classifier.
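One way a developer might sanity-check that explanation - purely as an illustration, and assuming Prompt Guard descends from a publicly available encoder such as microsoft/mdeberta-v3-base, which is an assumption here rather than a detail from the article - is to measure how far fine-tuning moved the representations of single characters:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Both model names are assumptions: microsoft/mdeberta-v3-base as the presumed
# base encoder and meta-llama/Prompt-Guard-86M as the released classifier.
BASE, TUNED = "microsoft/mdeberta-v3-base", "meta-llama/Prompt-Guard-86M"

base_tok, base_model = AutoTokenizer.from_pretrained(BASE), AutoModel.from_pretrained(BASE)
tuned_tok, tuned_model = AutoTokenizer.from_pretrained(TUNED), AutoModel.from_pretrained(TUNED)

def embed(tok, model, text: str) -> torch.Tensor:
    # Mean-pool the final hidden states as a crude embedding of the input.
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze()

for ch in ["a", "b", "c"]:
    sim = torch.nn.functional.cosine_similarity(
        embed(base_tok, base_model, ch), embed(tuned_tok, tuned_model, ch), dim=0
    )
    print(ch, f"{sim.item():.3f}")  # values near 1.0 suggest fine-tuning barely
                                    # changed how these characters are represented
```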
There is as yet no definitive solution to jailbreaking and prompt injection attacks, and Robust Intelligence's latest experiment is not the first of its kind. Carnegie Mellon University researchers last year discovered an automated technique for generating adversarial prompts that compromise safety mechanisms. A separate set of researchers from Robust Intelligence and Yale University found a way to automatically jailbreak OpenAI, Meta and Google LLMs, with no obvious fix.
Robust Intelligence said it has suggested countermeasures for the exploit to Meta, which said it is "actively working on a fix."