OpenAI recently launched its “Strawberry” AI model family, which includes o1-preview and o1-mini. These models are designed to demonstrate enhanced reasoning abilities. Unlike previous models such as GPT-4o, the o1 models work through a problem step by step before generating an answer. Users can see a summarized chain of thought in the ChatGPT interface, but the raw reasoning is concealed and replaced by a filtered interpretation generated by another AI model.
Hackers and red-teamers have been trying to reveal the raw chain of thought using tactics such as jailbreaking and prompt injection. Despite some early reports of success, no breakthrough has been confirmed. OpenAI has been closely monitoring these activities and has responded sternly, sending warnings and threatening bans to users who attempt to probe the o1 model’s internal processes.
Users report that even mentioning terms like “reasoning trace” in a conversation with the o1 model can trigger warning emails from OpenAI. The emails state that specific user requests have been flagged for attempting to bypass safety measures and that continued violations could lead to loss of access to the models.
Marco Figueroa, who manages Mozilla’s GenAI bug bounty programs, reported receiving a warning email from OpenAI after probing the model as part of red-teaming safety research. He expressed concern that such restrictions hinder constructive examination and testing of the models.
According to a post on OpenAI’s blog titled “Learning to Reason With LLMs,” the company believes that hidden chains of thought provide a valuable opportunity to monitor and understand the AI model’s reasoning process. OpenAI argues that for the model’s internal thought processes to remain unfiltered and useful, they must not be trained to conform to policy compliance or user preferences. However, the company also acknowledges that making these unaligned chains of thought visible to users may not serve its commercial interests.