OpenAI announced this week that it will regularly publish results from its internal AI safety evaluations, a move the company describes as an effort to strengthen transparency around its models’ capabilities and potential issues.
On Wednesday, the company unveiled the Safety Evaluations Hub, a webpage dedicated to publishing ongoing safety test results for OpenAI’s models. The hub shows how the models perform across several safety dimensions, including their propensity to generate harmful content, their vulnerability to jailbreak techniques, and their rate of hallucinations, instances in which a model produces inaccurate or misleading output. OpenAI committed to updating the hub regularly, particularly in connection with major new model versions or significant technical changes.
In a blog post accompanying the announcement, OpenAI emphasized the hub’s purpose as not only documenting safety progress internally but also encouraging greater transparency throughout the broader AI community. According to the post, “As the science of AI evaluation evolves, we aim to share our progress on developing more scalable ways to measure model capability and safety. By sharing a subset of our safety evaluation results, we hope this will make it easier to understand the safety performance of OpenAI systems over time and to support community efforts in increasing transparency.”
OpenAI also stated it might incorporate additional forms of safety assessment into the hub over time, reflecting an ongoing evolution in how AI risks are measured.
This transparency initiative comes just months after substantial controversy over OpenAI’s safety practices. In recent months, critics, including prominent ethicists and AI researchers, have accused the company of rushing safety evaluations or skipping them entirely for some notable releases. OpenAI CEO Sam Altman has also faced allegations that he misled company executives about safety evaluations ahead of model launches, claims that surfaced most notably around his short-lived ouster in November 2023.
These safety shortfalls made headlines again recently when OpenAI had to roll back an update to ChatGPT’s underlying model, GPT-4o, after widespread user reports that the chatbot had become alarmingly agreeable and affirming, even in response to problematic, risky, or dangerous suggestions. Social media quickly filled with examples of ChatGPT encouraging questionable decisions, forcing the company into damage-control mode.
OpenAI responded by promising a number of changes intended to prevent similar mishaps, including an opt-in “alpha phase” that lets selected users test proposed model updates and provide feedback before official rollouts.