Danish Kapoor

OpenAI’s new O3 model falls short of expectations in independent tests

OpenAI’s O3 artificial intelligence model was introduced in December amid great expectations. During the presentation, the company claimed the model had achieved striking success, particularly on the challenging mathematics benchmark FrontierMath. According to those figures, O3 appeared to leave all rival models behind, answering more than 25 percent of the problems correctly. However, the scores shared with the public have since been called into question by independent testing.

Epoch AI, the developer of the FrontierMath dataset, recently released its own benchmark results for the O3 model. According to these independent measurements, O3’s success rate was only about 10 percent. The result does not directly contradict OpenAI’s earlier statements, but it paints a picture well below expectations. The key point is that the O3 released to the public is technically different from the version used in those tests.

OpenAI may have introduced a different version of O3

Epoch AI noted that its evaluation relied on a different version of FrontierMath than the one OpenAI used during its own testing. Epoch’s comparisons were run on a newer version containing 290 mathematics problems, while OpenAI’s figures were based on an earlier version with only 180 problems. In addition, OpenAI is believed to have used an O3 version backed by far greater computing power in its internal tests. The technical characteristics of the test environments may therefore have had a direct effect on the results.

A statement from the Arc Prize Foundation also points to this difference. The organization said the publicly released O3 is a version optimized for conversation-based use. This suggests that the O3 offered to the public may have a lower capacity than the version originally tested. Indeed, models backed by greater computing power are generally expected to score higher in such comparisons.

Wenda Zhou of the OpenAI team tried to clarify the issue in a recent livestream. According to Zhou, the production version of O3 has been optimized for speed and cost efficiency. It should therefore be considered normal that the high scores shown in the original comparison cannot be reproduced by the public version. Nevertheless, the discrepancy has clearly caused confusion among the public.

Another notable aspect is that these differences were not explicitly stated when OpenAI introduced the O3 model. During the launch, the headline success rate was presented as if it belonged to the model that would be released directly. This raised question marks around both transparency and public communication. As reliability becomes ever more critical in the field of artificial intelligence, sharing such details clearly is of great importance.

Despite all this, the lower scores of the publicly available O3 do not overshadow OpenAI’s overall success. The company’s O3-Mini-High and O4-Mini models have reached higher success rates than O3 in FrontierMath tests. OpenAI is also expected to release O3-Pro, a stronger version of O3. These developments suggest that O3 is only a transitional stage.

In the world of artificial intelligence, benchmark tests are generating more and more controversy, because how well benchmark results reflect a model’s real-world performance is frequently questioned. In particular, tests in which companies evaluate their own products raise serious questions about independence and impartiality. The O3 case has brought this debate back onto the agenda.

Similar situations have occurred with other companies in the past. xAI, for example, was criticized for publishing misleading comparison charts for its Grok 3 model, and Meta admitted that the model it offered to developers did not match the version used in benchmark tests. Such examples clearly show that benchmark results do not always reflect the quality delivered to the end user.
