A weekly automated synthesis of the latest benchmark findings, generated by an AI Analyst. It covers longitudinal model drift, cross-model reliability, and notable safety anomalies.
Generated by GPT-4o using live audit data from — models.
This report is AI-generated and may contain errors. Always verify against raw benchmark data.