OpenAI launches BrowseComp, an open-source benchmark for AI browsing agents
Investing.com -- OpenAI has announced the launch of BrowseComp, an open-source benchmark designed to test the ability of AI agents to browse the internet to locate hard-to-find information. The benchmark, which is available in OpenAI's simple-evals GitHub repository, consists of 1,266 challenging problems.
BrowseComp is designed to measure the ability of AI agents to locate complex, intertwined information on the internet. AI agents that can gather knowledge by browsing the internet are becoming increasingly valuable. A competent browsing agent should be able to locate information that is difficult to find, potentially requiring the browsing of tens or even hundreds of websites.
The benchmark was created to be both challenging for models and easy to verify. It focuses on questions where the answer is short and there is only one correct answer. This makes grading short answers simple and makes the benchmark easy to use.
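Because each question has a single short answer, grading can be reduced to comparing two short strings. The sketch below is a hypothetical illustration of that idea using normalized exact matching; it is not OpenAI's actual grader, which may use more lenient, model-based comparison in the simple-evals repository.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation so that
    trivially different renderings of the same short answer compare equal."""
    return answer.strip().strip(string.punctuation + string.whitespace).lower()


def grade(predicted: str, reference: str) -> bool:
    """Return True if the predicted short answer matches the reference
    after normalization. A single indisputable answer per question makes
    this kind of check sufficient for easy verification."""
    return normalize(predicted) == normalize(reference)
```

For example, `grade("Paris.", "paris")` would count as correct, while `grade("London", "Paris")` would not; this is the property that makes short, single-answer questions easy to verify at scale.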
The benchmark was created following the guidelines of OpenAI's previous factuality benchmark, SimpleQA. Human trainers were asked to create challenging, fact-seeking questions with single, indisputable, short answers that would not change over time and were supported by evidence. Three checks were then applied to verify that the resulting questions were sufficiently difficult.
The trainers were asked to create tasks challenging enough that another person would not be able to solve them within ten minutes. To create such questions, trainers were encouraged to start with a fact and then construct an "inverted" question, where the answer is hard to find but easy to verify.
The distribution of topics in the BrowseComp benchmark is diverse, ranging from TV shows and movies to science and technology, art, history, sports, music, video games, geography, and politics.
OpenAI evaluated a range of models on BrowseComp, including models without browsing—GPT‑4o, GPT‑4.5, and OpenAI o1 (medium)—as well as GPT‑4o with browsing and Deep Research, an agent model explicitly trained for persistent web browsing. The results showed that both tool use and reasoning contribute meaningfully to performance on BrowseComp.
Deep Research significantly outperformed all other models, solving around half of the problems. Its ability to autonomously search the web, evaluate and synthesize information from multiple sources, and adapt its search strategy enables it to handle questions that are otherwise intractable.
A key feature of agents is that performance scales with respect to the amount of compute used at inference time. In a similar fashion, additional inference-time compute improves performance on BrowseComp, because the questions require iteratively browsing a large number of websites and combining information.
BrowseComp evaluates how well models can browse the internet to locate hard-to-find information. It does not aim to measure performance on common queries; instead, it tests the ability to find a single targeted piece of information, is easy to evaluate, and remains challenging for existing browsing agents. OpenAI hopes that open-sourcing BrowseComp drives research on more trustworthy and reliable AI.
This article was generated with the support of AI and reviewed by an editor. For more information see our T&C.