AI | June 19, 2025 | HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities
AI | May 31, 2025 | Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration
AI | April 24, 2025 | AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents
AI | April 17, 2025 | Model Performance Begins with Data: Researchers from Ai2 Release DataDecide—A Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints
AI | March 5, 2025 | Researchers from FutureHouse and ScienceMachine Introduce BixBench: A Benchmark Designed to Evaluate AI Agents on Real-World Bioinformatics Tasks