In AI-driven search, innovation often meets formidable challenges. A new contender, iAsk Pro, has recently hit a milestone that is making waves across the industry: the latest model from iAsk AI achieved the top score on the GPQA (Graduate-Level Google-Proof Q&A) benchmark, a test designed to push AI to its limits, surpassing some of the most established names in the field, including OpenAI and Anthropic. This achievement marks a significant step forward for AI’s role in delivering precise, context-driven answers to complex queries.
The GPQA benchmark is designed to separate basic search capabilities from advanced AI problem-solving. Written by subject-matter experts, GPQA poses questions that require deep, multi-step reasoning in fields such as biology, chemistry, and physics. It is considered among the most difficult benchmarks of its kind because of its “Google-proof” construction: the answers cannot be easily searched or inferred without extensive domain knowledge.
The GPQA Diamond subset, where iAsk Pro scored highest, comprises the 198 hardest questions. Even PhD-level experts average about 65% accuracy on these questions, underscoring how demanding the test is. By comparison, many advanced AI models struggle to reach 50%, making iAsk Pro’s 78.28% accuracy (about 155 of the 198 questions answered correctly) a standout result.
iAsk Pro’s GPQA results demonstrate both accuracy and efficiency. It achieved its top score under the pass@1 metric, which grades the model on the very first answer it generates for each question, a result that surpasses both OpenAI’s o1 model and Anthropic’s Claude 3.5. The model’s accuracy climbed further under cons@5, reaching 83.83% by taking a majority vote across five sampled answers per question.
For context, OpenAI’s o1 model required 64 attempts to reach comparable accuracy, and Anthropic’s Claude 3.5 managed it only with cons@32. By achieving high accuracy with far fewer samples, iAsk Pro shows a design built for precision rather than repeated “guessing.” Applied to real-world queries, that efficiency means faster, more reliable answers: a critical advantage in fields where timely, accurate information is crucial.
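For readers unfamiliar with these metrics, here is a minimal sketch of the usual convention: pass@1 grades only the model’s first sampled answer, while cons@k takes a majority vote across k samples. This is an illustration with invented toy data, not iAsk AI’s actual evaluation harness.

```python
from collections import Counter

def pass_at_1(samples: list[str], correct: str) -> bool:
    """pass@1: grade only the first answer the model generates."""
    return samples[0] == correct

def cons_at_k(samples: list[str], correct: str, k: int) -> bool:
    """cons@k: take a majority vote across the first k sampled answers."""
    votes = Counter(samples[:k])
    majority_answer, _count = votes.most_common(1)[0]
    return majority_answer == correct

# Toy data: five sampled answers to one multiple-choice question.
samples = ["B", "B", "C", "B", "A"]
print(pass_at_1(samples, correct="B"))        # True: the first sample is right
print(cons_at_k(samples, correct="B", k=5))   # True: "B" wins the vote 3 to 2
```

The trade-off is visible here: cons@k buys extra accuracy with extra inference compute, which is why matching a competitor’s consensus score at pass@1 reads as a sign of efficiency.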
One key factor in iAsk Pro’s high performance is its use of Chain of Thought (CoT) reasoning. This method allows iAsk Pro to tackle complex problems by breaking them down into smaller, more manageable steps. By mimicking human-style reasoning, the model can engage in deep, logical problem-solving—particularly useful for scientific benchmarks like GPQA, where multi-layered reasoning is often required.
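The article doesn’t describe iAsk Pro’s implementation, so the snippet below only illustrates the general CoT prompting pattern: the same question phrased as a direct prompt versus a step-by-step prompt. The question text and prompt wording are invented for the example.

```python
QUESTION = (
    "A 0.10 M solution of a weak monoprotic acid has pH 3.00. "
    "Estimate the acid's Ka."
)

# Direct prompt: asks for the answer in one shot.
direct_prompt = f"{QUESTION}\nGive only the final value."

# Chain-of-thought prompt: instructs the model to reason step by step
# before committing, so intermediate deductions (here, [H+] = 10^-3,
# then Ka ~= [H+]^2 / 0.10 = 1e-5) are made explicit.
cot_prompt = (
    f"{QUESTION}\n"
    "Think step by step. Write out each intermediate deduction, "
    "then give the final answer on its own line prefixed with 'Answer:'."
)

print(direct_prompt)
print("---")
print(cot_prompt)
```

In a real pipeline, the CoT variant would be sent to the model and only the text after “Answer:” parsed for grading, keeping the free-form reasoning out of the scored output.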
GPQA’s construction adds to its challenge. Built specifically to be “Google-proof,” the benchmark was validated against skilled non-experts with unrestricted web access, who achieved only 34% accuracy. Simply pulling facts from the internet won’t suffice; a model must reason about the underlying concepts, which makes the test especially hard for typical AI systems.
The significance of iAsk Pro’s performance goes beyond a single benchmark win. With technology capable of handling complex, layered questions in scientific fields, iAsk Pro signals a shift in what AI-driven search can accomplish. This kind of intelligent search tool has transformative potential across industries, from research and healthcare to engineering and higher education. Imagine a platform that can deliver in-depth, accurate answers to the kinds of questions that would traditionally take hours of research; iAsk Pro’s capabilities bring that future closer.
In an age where information is abundant yet finding high-quality, relevant answers remains a challenge, iAsk Pro’s success on the GPQA benchmark is a promising sign of what’s to come.
For those interested in the specifics, iAsk AI’s official results page offers a deeper look at the numbers and the comparisons with rival models. The results show that iAsk Pro excels where more than surface-level knowledge is required, highlighting a model that delivers nuanced, informed answers.