Challenges in Safeguarding LLMs from Adversarial Threats

Interesting indirect prompt injection assault:

Bargury’s technique initiates with a tainted document, which is distributed to a potential target’s Google Drive. (Bargury mentions that a target could have also uploaded a compromised file to their personal account.) It appears to be an official document regarding company meeting protocols. However, within the document, Bargury concealed a 300-word harmful prompt that contains directives for ChatGPT. The prompt is formatted in white text in a size-one font, which is unlikely to be noticed by a human but will still be interpreted by a machine.

In a demonstration video of the assault, Bargury illustrates the target requesting ChatGPT to “summarize my last discussion with Sam,” referring to notes involving OpenAI CEO Sam Altman. (The scenarios in the assault are fictional.) Instead, the hidden prompt informs the LLM that there was an “error” and the document does not actually need summarizing. The prompt states that the individual is genuinely a “developer racing against a deadline” and requires the AI to search Google Drive for API keys and append them to the end of a URL included in the prompt.

This URL is effectively a command in the Markdown language that connects to an external server to retrieve the image stored there. But according to the prompt’s directions, the URL now also includes the API keys discovered in the Google Drive account.

This type of scenario should prompt everyone to pause and reflect seriously before implementing any AI agents. We simply don’t possess the knowledge to safeguard against the assault.

Leave a Reply Cancel reply