All Posts

2023

Repetition suppression. Details on my inverse scaling prize submission

In this post, I provide some details on my submission to the Inverse Scaling Prize, a contest focused on finding important tasks where larger language models...

2022

Training a compute-optimal gpt2-small

Assume you’d like to train a gpt2-small-sized model (117M parameters). What is the optimal training set size? I’ll try to estimate that number following Trai...

RL with KL penalties is better viewed as Bayesian inference

TLDR: Naively applying RL to aligning language models (LMs) results in distribution collapse: turning an LM into a degenerate distribution putting all probab...

Tips on setting up a GPU cluster on Google Kubernetes Engine

This blog post is a bunch of unstructured notes to my future self on setting up a virtual GPU cluster for machine learning research (i.e. running experiments...

2021

EM for Gaussian mixtures using einsum

The goal of this blog post is to present a concise implementation of the Gaussian Mixture Model (GMM) using einsum notation. Along the way, I will also descri...

2020

Helmholtz machines and variational autoencoders

Helmholtz machines are the predecessors of variational autoencoders (VAEs). They were first proposed by Dayan et al. in 1995 as a probabilistic model of patt...

Triplet loss and quadruplet loss via tensor masking

In this blog post, I show how to implement triplet loss and quadruplet loss in PyTorch via tensor masking. The idea of triplet loss is to learn meaningful re...

Implementing additive and multiplicative attention in PyTorch

Attention mechanisms revolutionized machine learning in applications ranging from NLP through computer vision to reinforcement learning. Attention is the key...

Interpreting uncertainty in Bayesian linear regression

While vanilla linear regression predicts a maximum likelihood estimate of the target variable, Bayesian linear regression predicts a whole distribution over ...

Implementing shunting-yard parsing in Python

Consider the problem of parsing an arithmetic expression, such as 4*(1+6)/3, into a binary expression tree. The problem would be quite easy with postfix nota...

Where syntax ends and semantics begins and why should we care

The relation between syntax (how words are structured in a sentence) and semantics (how words contribute to the meaning of a sentence) is a long-standing ope...

2019

NeurIPS 2019 highlights

In this blog post, I sketch out a summary of the NeurIPS 2019 conference as I experienced it. Obviously, the motifs I highlight are specific to my somewhat u...

Introduction to Lewis signaling games with Python