I’m a Senior Research Scientist at the UK AI Safety Institute, working with Geoffrey Irving on safety cases for frontier models. Previously, I was a Member of Technical Staff at Anthropic, working on honesty. Before that, I did a PhD at the University of Sussex with Chris Buckley and Anil Seth, focusing on RL from human feedback (RLHF), and spent time as a visiting researcher at NYU working with Ethan Perez, Sam Bowman and Kyunghyun Cho. I studied cognitive science, philosophy and physics at the University of Warsaw.
Highlighted papers
- RL with KL penalties is better viewed as Bayesian inference. Findings of EMNLP 2022.
- Energy-based models for code generation under compilability constraints. NLP4Programming workshop, ACL 2021.
- Measuring non-trivial compositionality in emergent communication. Emergent communication workshop, NeurIPS 2020.
- Developmentally motivated emergence of compositional communication via template transfer. Emergent communication workshop, NeurIPS 2019.