We aim to stay on top of cutting-edge research in AI Safety Evaluations and to develop a thoughtful community of critical thinkers eager to apply their skills to AI Safety. Sign up to attend using the links in the Event Link column of the Schedule table below. In addition to attending, you can suggest a paper for us to cover here. If you’d like to take even more initiative, you can volunteer to present a paper using this form, and we will get back to you.

<aside>

If you’re new here and wondering what this is all about, check out our guide “How to Eval”, where we explain what an eval is, how to get the most out of the reading group, and more!

</aside>

Schedule

| Date | Presenter | Paper |
| --- | --- | --- |
| January 27, 2026 | James Sykes | Scaling Up Active Testing to Large Language Models |
| February 3, 2026 | Yulun Jiang (author) | Meta-RL Induces Exploration in Language Agents |

FAQ

Attendance and Comms

Discussion Norms

Papers and Presenting

Presentation Archive

| Date | Presenter | Paper |
| --- | --- | --- |
| January 20, 2026 | Habeeb Abdulfatah | Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities |
| January 13, 2026 | Justin Dollman | RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents |
| December 16, 2025 | Matt Broerman | UK AISI Align Evaluation Case-Study |
| December 9, 2025 | Kanishk Gandhi (author) | Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs |
| December 2, 2025 | Hyunwoo Kim (author) | Hypothesis-Driven Theory-of-Mind Research for Large Language Models |
| November 18, 2025 | Lorenzo Pacchiardi (author) | PredictaBoard: Benchmarking LLM Score Predictability |
| November 11, 2025 | Mark Keavney | Agentic Reinforcement Learning for Search is Unsafe |
| November 4, 2025 | Preeti Ravindra | AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
| October 28, 2025 | Paolo Bova | TextQuests: How Good are LLMs at Text-Based Video Games? |
| October 21, 2025 | Chris Ackerman (author) | Evidence for Limited Metacognition in LLMs |
| October 14, 2025 | Wyatt Boyer | Building and Evaluating Alignment Auditing Agents |
| October 7, 2025 | Miguel Guirao | Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition |
| September 30, 2025 | Tegan Green | Secret Collusion among AI Agents: Multi-Agent Deception via Steganography |
| September 23, 2025 | Achu Menon | When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors |
| September 16, 2025 | Sydney Von Arx (author) | CoT May Be Highly Informative Despite “Unfaithfulness” |
| September 9, 2025 | Iván Arcuschin (author) | Chain-of-Thought Reasoning In The Wild Is Not Always Faithful |
| September 2, 2025 | Linda Liu | AI Sandbagging: Language Models can Strategically Underperform on Evaluations |
| August 26, 2025 | Ashly Jiju | Reasoning Models Don't Always Say What They Think |
| August 19, 2025 | Miguel Guirao | Language Models Don't Always Say What They Think |
| August 12, 2025 | Ceyda Guzelsevdi | Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks |
| August 5, 2025 | Tegan Green | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? |
| July 29, 2025 | Justin Dollman | Measuring Faithfulness in Chain-of-Thought Reasoning |
| July 22, 2025 | Matt Broerman | Audit Cards: Contextualizing AI Evaluations |
| July 15, 2025 | Morgan Sinclaire | An Example Safety Case for Safeguards Against Misuse |
| July 8, 2025 | Aditya Thomas | Alignment faking in large language models |
| July 1, 2025 | Morgan Sinclaire | Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation |
| June 24, 2025 | Paolo Bova | General Scales Unlock AI Evaluation with Explanatory and Predictive Power |
| June 17, 2025 | Justin Dollman | Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs |
| June 10, 2025 | Justin Dollman | Ctrl-Z: Controlling AI Agents via Resampling |
| May 27, 2025 | Morgan Sinclaire | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs |
| May 13, 2025 | — | 100+ concrete projects and open problems in evals |
| April 29, 2025 | Matt Broerman | Sabotage Evaluations for Frontier Models |
| April 1, 2025 | Matt Broerman | Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations |