AIApril 6, 2025Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models
AIApril 2, 2025Open AI Releases PaperBench: A Challenging Benchmark for Assessing AI Agents’ Abilities to Replicate Cutting-Edge Machine Learning Research
AIMarch 27, 2025This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models
AIJanuary 23, 2025Plurai Introduces IntellAgent: An Open-Source Multi-Agent Framework to Evaluate Complex Conversational AI System