| Date | Presenter | Paper |
| --- | --- | --- |
| January 20, 2026 | Habeeb Abdulfatah | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities |
| January 13, 2026 | Justin Dollman | RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents |
| December 16, 2025 | Matt Broerman | UK AISI Align Evaluation Case-Study |
| December 9, 2025 | Kanishk Gandhi (author) | Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs |
| December 2, 2025 | Hyunwoo Kim (author) | Hypothesis-Driven Theory-of-Mind Research for Large Language Models |
| November 18, 2025 | Lorenzo Pacchiardi (author) | PredictaBoard: Benchmarking LLM Score Predictability |
| November 11, 2025 | Mark Keavney | Agentic Reinforcement Learning for Search is Unsafe |
| November 4, 2025 | Preeti Ravindra | AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
| October 28, 2025 | Paolo Bova | TextQuests: How Good are LLMs at Text-Based Video Games? |
| October 21, 2025 | Chris Ackerman (author) | Evidence for Limited Metacognition in LLMs |
| October 14, 2025 | Wyatt Boyer | Building and Evaluating Alignment Auditing Agents |
| October 7, 2025 | Miguel Guirao | Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition |
| September 30, 2025 | Tegan Green | Secret Collusion among AI Agents: Multi-Agent Deception via Steganography |
| September 23, 2025 | Achu Menon | When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors |
| September 16, 2025 | Sydney Von Arx (author) | CoT May Be Highly Informative Despite “Unfaithfulness” |
| September 9, 2025 | Iván Arcuschin (author) | Chain-of-Thought Reasoning In The Wild Is Not Always Faithful |
| September 2, 2025 | Linda Liu | AI Sandbagging: Language Models can Strategically Underperform on Evaluations |
| August 26, 2025 | Ashly Jiju | Reasoning Models Don't Always Say What They Think |
| August 19, 2025 | Miguel Guirao | Language Models Don't Always Say What They Think |
| August 12, 2025 | Ceyda Guzelsevdi | Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks |
| August 5, 2025 | Tegan Green | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? |
| July 29, 2025 | Justin Dollman | Measuring Faithfulness in Chain-of-Thought Reasoning |
| July 22, 2025 | Matt Broerman | Audit Cards: Contextualizing AI Evaluations |
| July 15, 2025 | Morgan Sinclaire | An Example Safety Case for Safeguards Against Misuse |
| July 8, 2025 | Aditya Thomas | Alignment faking in large language models |
| July 1, 2025 | Morgan Sinclaire | Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation |
| June 24, 2025 | Paolo Bova | General Scales Unlock AI Evaluation with Explanatory and Predictive Power |
| June 17, 2025 | Justin Dollman | Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs |
| June 10, 2025 | Justin Dollman | Ctrl-Z: Controlling AI Agents via Resampling |
| May 27, 2025 | Morgan Sinclaire | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs |
| May 13, 2025 | - | 100+ concrete projects and open problems in evals |
| April 29, 2025 | Matt Broerman | Sabotage Evaluations for Frontier Models |
| April 1, 2025 | Matt Broerman | Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations |