We aim to stay on top of cutting-edge research in AI Safety Evaluations and to build a thoughtful community of critical thinkers eager to apply their skills to AI Safety. Sign up to attend using the links in the Event link column of the Schedule table below. In addition to attending, you can suggest a paper for us to cover or volunteer to present one.

<aside>

If you’re new here and wondering what this is all about, check out our guide “How to Eval,” where we explain what an eval is, how to get the most from the reading group, and more!

</aside>

Schedule

| Date | Presenter | Paper | Event link |
| --- | --- | --- | --- |
| April 7, 2026 | Morgan Sinclaire | AI Control: Improving Safety Despite Intentional Subversion | Event |
| April 14, 2026 | Morgan Sinclaire (author) | When can we trust untrusted monitoring? A safety case sketch across collusion strategies | Event |

FAQ

Attendance and Comms

Discussion Norms

Papers and Presenting

Presentation Archive

| Date | Presenter | Paper | Recording/slides |
| --- | --- | --- | --- |
| March 31, 2026 | Hongye Cao (author) | SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks | Recording |
| March 24, 2026 | Peiyang Song (author) | Large Language Model Reasoning Failures | Recording, slides |
| March 17, 2026 | Mark Keavney | Training large language models on narrow tasks can lead to broad misalignment | Recording, slides |
| March 10, 2026 | Justin Dollman | Toward a Science of AI Agent Reliability | Recording |
| March 3, 2026 | Matt Broerman | Continuation of Measuring AI Ability to Complete Long Tasks | Recording |
| February 24, 2026 | Nathan Witkin (author) | Measuring AI Ability to Complete Long Tasks, Against the METR graph | Recording, slides |
| February 17, 2026 | Mark Keavney | RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts | Recording, slides |
| February 10, 2026 | Justin Dollman | HCAST: Human-Calibrated Autonomy Software Tasks | Recording |
| February 3, 2026 | Yulun Jiang (author) | Meta-RL Induces Exploration in Language Agents | Recording, slides |
| January 27, 2026 | James Sykes | Scaling Up Active Testing to Large Language Models | Recording, slides |
| January 20, 2026 | Habeeb Abdulfatah | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Recording, slides |
| January 13, 2026 | Justin Dollman | RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents | |
| December 16, 2025 | Matt Broerman | UK AISI Align Evaluation Case-Study | |
| December 9, 2025 | Kanishk Gandhi (author) | Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs | Recording |
| December 2, 2025 | Hyunwoo Kim (author) | Hypothesis-Driven Theory-of-Mind Research for Large Language Models | |
| November 18, 2025 | Lorenzo Pacchiardi (author) | PredictaBoard: Benchmarking LLM Score Predictability | Recording |
| November 11, 2025 | Mark Keavney | Agentic Reinforcement Learning for Search is Unsafe | |
| November 4, 2025 | Preeti Ravindra | AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents | |
| October 28, 2025 | Paolo Bova | TextQuests: How Good are LLMs at Text-Based Video Games? | |
| October 21, 2025 | Chris Ackerman (author) | Evidence for Limited Metacognition in LLMs | Recording |
| October 14, 2025 | Wyatt Boyer | Building and Evaluating Alignment Auditing Agents | |
| October 7, 2025 | Miguel Guirao | Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition | |
| September 30, 2025 | Tegan Green | Secret Collusion among AI Agents: Multi-Agent Deception via Steganography | |
| September 23, 2025 | Achu Menon | When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors | |
| September 16, 2025 | Sydney Von Arx (author) | CoT May Be Highly Informative Despite “Unfaithfulness” | Recording |
| September 9, 2025 | Iván Arcuschin (author) | Chain-of-Thought Reasoning In The Wild Is Not Always Faithful | Recording |
| September 2, 2025 | Linda Liu | AI Sandbagging: Language Models can Strategically Underperform on Evaluations | Recording |
| August 26, 2025 | Ashly Jiju | Reasoning Models Don't Always Say What They Think | |
| August 19, 2025 | Miguel Guirao | Language Models Don't Always Say What They Think | |
| August 12, 2025 | Ceyda Guzelsevdi | Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks | |
| August 5, 2025 | Tegan Green | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? | |
| July 29, 2025 | Justin Dollman | Measuring Faithfulness in Chain-of-Thought Reasoning | |
| July 22, 2025 | Matt Broerman | Audit Cards: Contextualizing AI Evaluations | |
| July 15, 2025 | Morgan Sinclaire | An Example Safety Case for Safeguards Against Misuse | |
| July 8, 2025 | Aditya Thomas | Alignment faking in large language models | |
| July 1, 2025 | Morgan Sinclaire | Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation | |
| June 24, 2025 | Paolo Bova | General Scales Unlock AI Evaluation with Explanatory and Predictive Power | |
| June 17, 2025 | Justin Dollman | Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs | |
| June 10, 2025 | Justin Dollman | Ctrl-Z: Controlling AI Agents via Resampling | |
| May 27, 2025 | Morgan Sinclaire | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs | |
| May 13, 2025 | - | 100+ concrete projects and open problems in evals | |
| April 29, 2025 | Matt Broerman | Sabotage Evaluations for Frontier Models | |
| April 1, 2025 | Matt Broerman | Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations | |