| Date | Presenter | Paper / Topic | Materials |
|------|-----------|---------------|-----------|
| March 31, 2026 | Hongye Cao (author) | SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks | Recording |
| March 24, 2026 | Peiyang Song (author) | Large Language Model Reasoning Failures | Recording, slides |
| March 17, 2026 | Mark Keavney | Training large language models on narrow tasks can lead to broad misalignment | Recording, slides |
| March 10, 2026 | Justin Dollman | Toward a Science of AI Agent Reliability | Recording |
| March 3, 2026 | Matt Broerman | Continuation of Measuring AI Ability to Complete Long Tasks | Recording |
| February 24, 2026 | Nathan Witkin (author) | Measuring AI Ability to Complete Long Tasks; Against the METR graph | Recording, slides |
| February 17, 2026 | Mark Keavney | RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts | Recording, slides |
| February 10, 2026 | Justin Dollman | HCAST: Human-Calibrated Autonomy Software Tasks | Recording |
| February 3, 2026 | Yulun Jiang (author) | Meta-RL Induces Exploration in Language Agents | Recording, slides |
| January 27, 2026 | James Sykes | Scaling Up Active Testing to Large Language Models | Recording, slides |
| January 20, 2026 | Habeeb Abdulfatah | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Recording, slides |
| January 13, 2026 | Justin Dollman | RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents | |
| December 16, 2025 | Matt Broerman | UK AISI Align Evaluation Case-Study | |
| December 9, 2025 | Kanishk Gandhi (author) | Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs | Recording |
| December 2, 2025 | Hyunwoo Kim (author) | Hypothesis-Driven Theory-of-Mind Research for Large Language Models | |
| November 18, 2025 | Lorenzo Pacchiardi (author) | PredictaBoard: Benchmarking LLM Score Predictability | Recording |
| November 11, 2025 | Mark Keavney | Agentic Reinforcement Learning for Search is Unsafe | |
| November 4, 2025 | Preeti Ravindra | AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents | |
| October 28, 2025 | Paolo Bova | TextQuests: How Good are LLMs at Text-Based Video Games? | |
| October 21, 2025 | Chris Ackerman (author) | Evidence for Limited Metacognition in LLMs | Recording |
| October 14, 2025 | Wyatt Boyer | Building and Evaluating Alignment Auditing Agents | |
| October 7, 2025 | Miguel Guirao | Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition | |
| September 30, 2025 | Tegan Green | Secret Collusion among AI Agents: Multi-Agent Deception via Steganography | |
| September 23, 2025 | Achu Menon | When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors | |
| September 16, 2025 | Sydney Von Arx (author) | CoT May Be Highly Informative Despite “Unfaithfulness” | Recording |
| September 9, 2025 | Iván Arcuschin (author) | Chain-of-Thought Reasoning In The Wild Is Not Always Faithful | Recording |
| September 2, 2025 | Linda Liu | AI Sandbagging: Language Models can Strategically Underperform on Evaluations | Recording |
| August 26, 2025 | Ashly Jiju | Reasoning Models Don't Always Say What They Think | |
| August 19, 2025 | Miguel Guirao | Language Models Don't Always Say What They Think | |
| August 12, 2025 | Ceyda Guzelsevdi | Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks | |
| August 5, 2025 | Tegan Green | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? | |
| July 29, 2025 | Justin Dollman | Measuring Faithfulness in Chain-of-Thought Reasoning | |
| July 22, 2025 | Matt Broerman | Audit Cards: Contextualizing AI Evaluations | |
| July 15, 2025 | Morgan Sinclaire | An Example Safety Case for Safeguards Against Misuse | |
| July 8, 2025 | Aditya Thomas | Alignment faking in large language models | |
| July 1, 2025 | Morgan Sinclaire | Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation | |
| June 24, 2025 | Paolo Bova | General Scales Unlock AI Evaluation with Explanatory and Predictive Power | |
| June 17, 2025 | Justin Dollman | Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs | |
| June 10, 2025 | Justin Dollman | Ctrl-Z: Controlling AI Agents via Resampling | |
| May 27, 2025 | Morgan Sinclaire | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs | |
| May 13, 2025 | - | 100+ concrete projects and open problems in evals | |
| April 29, 2025 | Matt Broerman | Sabotage Evaluations for Frontier Models | |
| April 1, 2025 | Matt Broerman | Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations | |