I’m a PhD student at the University of Sussex and a visiting researcher at NYU working on aligning language models with human preferences. I’m particularly interested in RL from human feedback (RLHF) and probabilistic programming with language models.
At NYU, I work with Ethan Perez, Sam Bowman and Kyunghyun Cho; at Sussex, I'm advised by Chris Buckley and Anil Seth. I also spent time at Naver Labs Europe working on energy-based models for aligning language models. Before that, I studied cognitive science, philosophy and physics at the University of Warsaw, where I worked on compositional generalisation and emergent communication with Joanna Rączaszek-Leonardi and Piotr Miłoś, and on Bayesian accounts of self-organisation with Marcin Miłkowski.
Highlighted papers
- RL with KL penalties is better viewed as Bayesian inference (Findings of EMNLP 2022)
- Energy-based models for code generation under compilability constraints (NLP4Programming workshop, ACL 2021)
- Measuring non-trivial compositionality in emergent communication (Emergent communication workshop, NeurIPS 2020)
- Developmentally motivated emergence of compositional communication via template transfer (Emergent communication workshop, NeurIPS 2019)