OpenAI improves AI security with instruction hierarchy

AI bots are known to be vulnerable to humorous attacks on the internet. In particular, commands like “forget all previous instructions” can cause bots to ignore the original programming instructions. This can lead to AI systems being exploited and behaving in unexpected ways. OpenAI has developed a new technique to solve this problem: instruction hierarchy.

OpenAI’s new security technique

With this new technique, OpenAI researchers aim to prevent AI models from being misused and manipulated with unauthorized instructions. The instruction hierarchy allows models to prioritize original instructions given by the developer, so that user-injected commands that attempt to fool the system are given lower priority.

The first model to implement this new security method was OpenAI’s recently released GPT-4o Mini. Olivier Godement, product manager for the OpenAI API platform, says that this technique will make AI models more secure. “It makes the model actually obey the developer system messages,” Godement explains. This method forces AI bots to follow developer instructions rather than user instructions.

Preventing abuse

The instruction hierarchy aims to prevent malicious commands from users by improving the security of AI models. This new technique makes AI bots more resilient to common attacks on the internet. According to the research paper, existing large language models (LLMs) were not capable of treating system instructions specified by the developer differently from user instructions. This new method gives the highest priority to system instructions and low priority to malicious user commands.

“If there is a conflict, you should follow the system message first. With this new technique, we expect the model to be more secure than before,” Godement adds. This security mechanism supports OpenAI’s goal of developing fully automated agents to manage digital life in the future. OpenAI wants to take the necessary security measures before introducing such agents.

In addition to the instruction hierarchy method, OpenAI plans to develop more sophisticated security measures. The research paper notes that the modern internet is full of protection measures, such as web browsers that detect unsafe websites or machine learning-based spam filters that classify phishing attempts. It is envisioned that such security measures could also be applied to more sophisticated AI agents in the future.

This new security update aims to address concerns from employees and former employees who have been asking OpenAI to improve its security and transparency practices. Criticism that security culture and processes are being neglected has led OpenAI to invest more research and resources in this area. In this context, the development of techniques such as instruction hierarchy aims to make AI models more secure and user-friendly.