### Tomek Korbak

machine learning,
cognitive science,
philosophy

# NeurIPS 2019 highlights

In this blog post, I sketch out a summary of the NeurIPS 2019 conference as I experienced it. Obviously, the motifs I highlight are specific to my somewhat unorthodox interests (cognitive (neuro)science and language): I will briefly discuss compositionality, uncovering mechanisms using machine learning models, and some new ideas in representation learning.

During the conference, I was privileged to present Developmentally motivated emergence of compositional communication via template transfer at the Emergent communication workshop.

## Compositionality

As noted by Jelle Zuidema on Twitter, compositionality was a recurring theme during NeurIPS this year. It was mentioned in Yoshua Bengio’s keynote From System 1 Deep Learning to System 2 Deep Learning as one of the key components missing in most of contemporary AI, there was a workshop dedicated to Compositionality in Biological and Artificial Neural Systems, and problems of compositional communication and compositional reasoning played an important role during the Emergent communication, Visually Grounded Interaction and Language, and Metalearning workshops. Finally, numerous impactful papers dealing with compositionality were presented, including Compositional generalization through meta sequence-to-sequence learning, Ease-of-Teaching and Language Structure from Emergent Communication, and a bunch of papers in hierarchical reinforcement learning.

But what does it mean for something to be compositional? Intuitively, a representational system (a communication protocol, a language et cetera) is compositional if the meaning of a complex representation is determined by its structure and the meanings of its constituent representations. Having said that, the machine learning community uses the concept quite liberally, not always in line with its linguistic roots. It is sometimes conflated with related notions of zero-shot generalization, structure, compressibility or being able to induce a formal grammar.

I particularly liked how Jacob Andreas defined compositionality in his keynote during the Emergent Communication workshop: as a homomorphism between a description space and an interpretation space. As an example, we will consider what it means for a natural language noun phrase $\text{purple cat}$ to be compositional or, more precisely, composed from $\text{purple}$ and $\text{cat}$.

The building blocks of a simple formal account of compositionality are

• a description space, i.e. a set of possible representations at various levels of abstraction, such as strings, words, or vectors,
• an interpretation space, i.e. a set of possible referents of representations, such as real-world objects, numbers or images,
• a syntactic composition function $+$ defined over descriptions, e.g. concatenation (for strings and vectors) or addition (for vectors),
• a semantic composition function $\circ$, e.g. set intersection for real-world objects,
• an interpretation function $[\mkern-3.5mu[\cdot]\mkern-3.5mu]: x \mapsto [\mkern-3.5mu[x]\mkern-3.5mu]$ mapping descriptions to interpretations.

Consider a complex description $x+y$ (think $\text{purple cat}$) and its interpretation $[\mkern-3.5mu[x + y]\mkern-3.5mu]$ (think the intersection of the set of purple things and the set of cats). Then, $[\mkern-3.5mu[\cdot]\mkern-3.5mu]$ is compositional if and only if $[\mkern-3.5mu[x + y]\mkern-3.5mu] = [\mkern-3.5mu[x]\mkern-3.5mu] \circ [\mkern-3.5mu[y]\mkern-3.5mu]$. That is, $[\mkern-3.5mu[\cdot]\mkern-3.5mu]$ should be a homomorphism from descriptions to interpretations. Intuitively, it doesn’t matter if we first combine representations ($+$) and then interpret them, or if we first interpret representations and then combine their meanings ($\circ$). For the $\text{purple cat}$ example, we have $[\mkern-3.5mu[\text{purple} +\text{cat}]\mkern-3.5mu] = [\mkern-3.5mu[\text{purple}]\mkern-3.5mu] \cap [\mkern-3.5mu[\text{cat}]\mkern-3.5mu]$ (where we take our interpretation space to be a set of sets of objects and our semantic composition function to be set intersection).
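
The condition above can be checked mechanically on a toy example. The sketch below is only an illustration: the table of meanings, the object names, and the helper names are all made up for this post. It tests whether interpreting a combined description gives the same result as combining the interpretations of its parts, and contrasts $\text{purple cat}$ with an idiom for which the condition fails.

```python
# Toy interpretation table. All phrases and object names are hypothetical.
MEANINGS = {
    "purple": frozenset({"purple_cat", "purple_car"}),
    "cat": frozenset({"purple_cat", "black_cat"}),
    "purple cat": frozenset({"purple_cat"}),
    "hot": frozenset({"hot_soup", "hot_stove"}),
    "dog": frozenset({"poodle", "beagle"}),
    "hot dog": frozenset({"sausage_in_bun"}),  # an idiom, stored holistically
}


def compose_descriptions(x: str, y: str) -> str:
    """Syntactic composition (+): concatenate two descriptions."""
    return f"{x} {y}"


def compose_meanings(a: frozenset, b: frozenset) -> frozenset:
    """Semantic composition: set intersection."""
    return a & b


def interpret(description: str) -> frozenset:
    """Interpretation function: look up the set of referents."""
    return MEANINGS[description]


def is_compositional(x: str, y: str) -> bool:
    """Check whether interpret(x + y) equals interpret(x) composed with interpret(y)."""
    combined_then_interpreted = interpret(compose_descriptions(x, y))
    interpreted_then_combined = compose_meanings(interpret(x), interpret(y))
    return combined_then_interpreted == interpreted_then_combined


print(is_compositional("purple", "cat"))  # True
print(is_compositional("hot", "dog"))     # False
```

The idiom fails the check because its stored meaning is not the intersection of the meanings of its parts, even though the description itself has exactly the same syntactic structure.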

This definition also works for formal languages. Consider a Lisp-like program $\text{neg} (\text{plus} (1 \ 2))$. Its meaning or interpretation is $[\mkern-3.5mu[\text{neg} (\text{plus} (1 \ 2))]\mkern-3.5mu]$. Assuming that the semantics of our programming language is compositional, we can decompose its meaning into $[\mkern-3.5mu[ \text{neg} ]\mkern-3.5mu] ([\mkern-3.5mu[\text{plus}(1 \ 2)]\mkern-3.5mu])$ and evaluate it sequentially into $[\mkern-3.5mu[ \text{neg} ]\mkern-3.5mu] (3)$ and then $-3$. Here $[\mkern-3.5mu[\cdot]\mkern-3.5mu]$ can be seen as an interpreter.
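
To make the interpreter reading concrete, here is a minimal sketch of a recursive evaluator for such programs. Representing programs as nested Python tuples and the names `PRIMITIVES` and `evaluate` are assumptions made for illustration. The meaning of a complex expression is computed only from the meaning of its head and the meanings of its arguments.

```python
import operator

# Meanings of primitive symbols in an assumed toy language with two operators.
PRIMITIVES = {"neg": operator.neg, "plus": operator.add}


def evaluate(expr):
    """Interpret an expression compositionally.

    A number denotes itself. A tuple (head, arg1, ...) denotes the meaning of
    the head applied to the meanings of the arguments.
    """
    if isinstance(expr, (int, float)):
        return expr
    head, *args = expr
    return PRIMITIVES[head](*(evaluate(arg) for arg in args))


program = ("neg", ("plus", 1, 2))
print(evaluate(program))  # -3
```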

Importantly, the presence of structure (or being able to fit a context-free grammar) in the description space or interpretation space is not a sufficient condition for compositionality. There is also no guarantee that a compositional representation system is interesting in any relevant sense or that an interesting representation system is compositional.

Compositionality is considered to be an essential feature of human languages and is assumed to be an important building block of general intelligence by being linked to productivity, systematicity, and generalization:

• Productivity is the property that an unbounded number of meanings can be created using a finite number of primitive elements. (This assumes a recursively defined semantic composition function $\circ$.) This property is fundamental to several theories of universal grammar developed within the generative approach in linguistics.
• Systematicity is the presence of definite and predictable patterns in the communication protocol, which could potentially improve the learnability of the protocol. Systematicity can also be understood as a symmetry of a communication protocol with respect to composition (e.g. understanding the meaning of “Eve loves Mary” entails understanding the meaning of “Mary loves Eve”).
• Finally, generalization is the ability to adapt to novel contexts. As such, it is central to machine learning, and productivity and systematicity are frequently seen simply as means of improving generalization in some settings. This in particular involves compositional or zero-shot generalization, i.e. adaptability to novel combinations of known elements.

It is probably compositional generalization that sparks the recent interest of the deep learning community in compositionality. A compositionally generalizing AI, of the kind Yoshua Bengio imagines, can factorize its knowledge into reusable components (think concepts), refer indirectly to the components (think variables), and then build novel abstractions out of these.

## Explaining not predicting

This year’s keynote talks and workshops featured a large number of machine learning applications in life sciences and healthcare. These areas pose interesting challenges (data missingness correlated with important features, etc.) and also enforce a particular shift of focus. The most successful model is frequently $l_1$-regularized logistic regression. More importantly, life sciences care more about uncovering causal mechanisms than predicting variables. During their Machine Learning for Computational Biology and Health workshop, Anna Goldenberg and Barbara Engelhardt put that nicely, saying that computational biologists usually solve a beta-hat problem, not a y-hat problem, meaning that they are interested in the estimated regression coefficients rather than the predictions themselves. The coefficients can then provide important biological insight into associations between measured variables. Associations, however, do not entail understanding and are not necessarily causal. Similarly, deep learning techniques are quite effective in finding low-dimensional manifolds (think: a space of phenotypes) in high-dimensional data (a space of genomes or transcriptomes), which can then be clustered, visualized (via t-SNE) or modeled as Markov chains, providing insights about cell differentiation.
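
To spell out the beta-hat versus y-hat distinction for $l_1$-regularized logistic regression, the standard formulation is written out below. The notation (features $x_i$, binary labels $y_i$, regularization strength $\lambda$, sigmoid $\sigma$) is my own and was not taken from the talks.

$$
\hat{\beta} = \arg\min_{\beta} \; -\sum_{i=1}^{n} \Big[ y_i \log \sigma(x_i^\top \beta) + (1 - y_i) \log\big(1 - \sigma(x_i^\top \beta)\big) \Big] + \lambda \lVert \beta \rVert_1,
\qquad
\hat{y}_i = \sigma(x_i^\top \hat{\beta})
$$

A y-hat problem cares about how well $\hat{y}_i$ matches held-out labels. A beta-hat problem cares about $\hat{\beta}$ itself: the $l_1$ penalty drives many coefficients to exactly zero, and the variables with non-zero coefficients are the candidates for biological interpretation.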

One other example that resonated with me was how model interpretability can be rooted in domain knowledge about the mechanism generating the data. During her Veridical data science keynote, Bin Yu mentioned that a thresholding behavior of a system on a molecular level can motivate using a random forest classifier and enable interpreting queries (branching operations on nodes of a tree) as molecular switches. Here again a fitted decision forest can be interpreted as a model of a molecular mechanism.

## The representation learning approach to fairness

I really liked Sanmi Koyejo’s Representation learning and fairness workshop. Koyejo understands fairness in terms of McNamara et al.’s framework, where fairness is guaranteed at the level of producing representations for downstream processing, and there is a separation of concerns between fairness and target task utility. More specifically, they propose to decompose a fair machine learning system into three components:

• Data regulator who determines fairness criteria (e.g. sensitive attributes, fairness metrics) and audits the predictions according to these criteria,
• Data producer who computes the representation based on data,
• Data user who trains the actual model on the sanitized representations (without access to the original data).

There are two families of fairness criteria:

1. Individual fairness requires similar individuals to be treated similarly. This can be interpreted as partitioning the dataset into disjoint cells and requiring examples from the same cell to be treated similarly.
2. Group fairness requires groups of samples to exhibit similar classifier statistics. The statistics in question are usually based on the confusion matrix of a classifier and amount to something like the average predicted label, which we usually want to be the same across groups.
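
As a small illustration of a group fairness statistic, the sketch below computes the average predicted label per group and the largest gap between groups (a criterion often called demographic parity). The function names and the toy predictions are assumptions for illustration, not part of the framework discussed in the talk.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the average predicted label (0 or 1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += prediction
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in average predicted label between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical binary predictions for two groups of four samples each.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(positive_rate_by_group(predictions, groups))  # {'a': 0.75, 'b': 0.25}
print(demographic_parity_gap(predictions, groups))  # 0.5
```

A gap of zero would mean that every group receives positive predictions at the same rate. Statistics based on other confusion matrix entries can be compared across groups in the same way.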

The idea is to guarantee individual and group fairness by imposing fairness constraints during representation learning. Representation learning is here understood as generating a concise and informative summary of the data, usually a non-linear low-dimensional transformation. The added benefit of this approach is that the data user is relieved of reasoning about fairness: it is the sole responsibility of the data producer. Therefore, auditing a machine learning system for fairness can be centralized (even if a representation is consumed by multiple data users), and violating fairness is effectively impossible (unless the data user can bypass the data producer and access the original data). The representation learning approach to fairness comes at the cost of less precise control over the fairness/performance trade-off (as the data user is not allowed to manipulate fairness). The alternatives are to optimize for fairness jointly with the target task during training, or to post-process a pre-trained model.

There is an interesting relationship between fairness, representation disentanglement, and generalization. Individual fairness imposes an upper bound on the generalization gap. Similarly, disentanglement correlates with fairness. Moreover, disentangled representations allow for a stronger notion of fairness, called flexible fairness, meaning that a representation can be adapted to be fair to a variety of protected groups and their intersections. In some settings, disentanglement allows for increasing fairness without knowing the protected attributes or can be a proxy metric for fairness.

## Miscellanea

• Anti-efficient encoding in emergent communication. While in natural languages word length follows a power law distribution (the most common words are short), languages developed by communicating neural agents are anti-efficient in the sense that the most common words are long. This happens because the listener network displays an a priori preference for longer messages, which is not counterbalanced by any pressure towards brevity (such as a need to minimize articulatory effort) on the side of the speaker. The effect is probably due to a bias for message discriminability, which correlates with message length. When a message length penalty is added to the cost, the power law reappears.
• MixMatch: A Holistic Approach to Semi-Supervised Learning achieves impressive results in a semi-supervised setting given its simplicity. It boils down to creating proxy labels by (i) averaging softmax predictions across augmentations of unlabeled examples and (ii) sharpening the distribution to obtain a low-entropy proxy label. Subsequently, mixup is applied. The technique reportedly allows for achieving 2% error on SVHN using as little as 250 labeled examples.
• Inducing brain-relevant bias in natural language processing models. The authors fine-tune BERT to predict, given a piece of text, the fMRI response of subjects reading that piece of text. The model generalizes across subjects and also to MEG signals.
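
The proxy-label step of MixMatch mentioned above is simple enough to sketch in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the temperature value of 0.5, and the example predictions are assumptions, and the subsequent mixup step is left out.

```python
def average_predictions(predictions):
    """Average the class distributions predicted for several augmentations
    of the same unlabeled example."""
    num_augmentations = len(predictions)
    num_classes = len(predictions[0])
    return [
        sum(p[c] for p in predictions) / num_augmentations
        for c in range(num_classes)
    ]


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a distribution by raising each probability
    to the power 1 / temperature and renormalizing."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical softmax outputs for two augmentations of one unlabeled image.
augmented_predictions = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]

averaged = average_predictions(augmented_predictions)
proxy_label = sharpen(averaged)
print(averaged)
print(proxy_label)
```

With a temperature below 1, sharpening moves probability mass towards the most likely class while keeping the ranking of classes unchanged.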