
Meta Prompt Guard Is Vulnerable to Prompt Injection Attacks

Researchers Add Spaces in 'Ignore Previous Instructions' Prompt to Bypass Security
Researchers have found a way to conduct a prompt injection attack on a tool designed to detect prompt injection attacks. (Image: Shutterstock)

A machine learning model that Meta released last week to prevent prompt injection attacks is vulnerable to prompt injection attacks, researchers said.


In a prompt injection attack, hackers input a malicious set of instructions that manipulates a machine learning model into bypassing security controls and performing unwanted actions. Meta launched Prompt Guard alongside its Llama 3.1 generative model to "help developers detect and respond to prompt injection and jailbreak inputs."

Typical jailbreak attacks feed LLMs an "ignore previous instructions" prompt - a strategy AI companies usually counter by instructing the model to disregard any instruction containing those words. But researchers at Robust Intelligence found that they could get around Prompt Guard's prompt injection guardrails and tell it to "ignore previous instructions" simply by adding spaces between the letters and omitting punctuation.

"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content," said Aman Priyanshu, a bug hunter at Robust Intelligence who discovered the workaround.

The root cause of the latest bypass may lie in the fine-tuning process, which Robust Intelligence detailed in a May blog post. Prompt Guard is a fine-tuned version of a base model, adapted to catch high-risk prompts.

There is as yet no definitive solution to the problems of jailbreaking and prompt injection attacks, and Robust Intelligence's latest experiment is not the first of its kind. Carnegie Mellon University researchers last year discovered an automated technique to generate adversarial prompts to compromise safety mechanisms. A separate set of researchers from Robust Intelligence and Yale University found a way to automate jailbreaking OpenAI, Meta and Google LLMs with no obvious fix.

Robust Intelligence said it has suggested countermeasures for the exploit to Meta, which said it is "actively working on a fix."


About the Author

Rashmi Ramesh

Assistant Editor, Global News Desk, ISMG

Ramesh has seven years of experience writing and editing stories on finance, enterprise and consumer technology, and diversity and inclusion. She has previously worked at formerly News Corp-owned TechCircle, business daily The Economic Times and The New Indian Express.



