Contrastive Decoding: Open-ended Text Generation as Optimization
The paper proposes a new decoding method for open-ended text generation, called contrastive decoding (CD), which aims to generate fluent, coherent, and informative text by exploiting the contrast between the behaviors of a large expert model and a small amateur model.
Abstract
- The paper proposes a new decoding method for open-ended text generation, called contrastive decoding (CD).
- CD measures the difference between the log-probabilities of a large model (expert) and a small model (amateur), subject to a plausibility constraint that filters out low-probability tokens under the expert.
- CD aims to generate text that is fluent, coherent, and informative, by exploiting the contrasts between expert and amateur behaviors.
- CD requires no additional training and works across different model scales and domains.
Problem Statement
- Text sampled from LLMs is prone to incoherence and topic drift, because unlucky sampling choices compound over long sequences
- Maximizing sequence probability (e.g., greedy or beam search) often results in short, repetitive, and tedious text
Solution
- The authors propose contrastive decoding (CD), which searches for text that maximizes the difference between the expert and amateur log-probabilities, over a restricted search space
- Why maximize the difference between expert and amateur log-probabilities?
- The smaller (amateur) LM assigns high probability to short, repetitive, irrelevant, or uninteresting tokens
- The larger (expert) LM assigns high probability to desirable continuations
- Hence, maximizing the difference between the two rewards what the expert favors and penalizes the amateur's failure modes, improving text generation (a worked example follows this list)
- Advantage of CD:
- No training is required, unlike the unlikelihood training and contrastive learning objectives proposed in other work
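As a hypothetical numeric illustration (the probabilities below are made up, not taken from the paper): suppose the expert assigns $p_{\text{EXP}} = 0.4$ to a repetitive next token and $p_{\text{EXP}} = 0.3$ to an informative one, while the amateur assigns $p_{\text{AMA}} = 0.5$ and $p_{\text{AMA}} = 0.05$ to the same tokens. Then
\[\log\frac{0.4}{0.5} \approx -0.22, \qquad \log\frac{0.3}{0.05} \approx 1.79,\]
so CD prefers the informative token even though the expert alone slightly prefers the repetitive one.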
Methodology
- The expert model assigns high probability to good, non-repetitive tokens, whereas the amateur model assigns high probability to undesired tokens
- We want to reward text patterns favored by the expert LM and penalize patterns favored by the small amateur LM. Therefore, the authors propose the contrastive objective $L_{CD}(x_{\text{cont}}, x_{\text{pre}})$ (a code sketch follows this list):
\[L_{CD}(x_{\text{cont}}, x_{\text{pre}}) = \log p_{\text{EXP}}(x_{\text{cont}} | x_{\text{pre}}) - \log p_{\text{AMA}}(x_{\text{cont}} | x_{\text{pre}})\]
- If the amateur's preferred tokens are undesirable, why not penalize the amateur completely?
- Small LMs still capture simple aspects of English grammar and common sense
- The authors introduce an adaptive plausibility constraint to complement the CD objective
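As a rough sketch (not the paper's released code), the sequence-level objective can be computed with two Hugging Face causal LMs; the GPT-2 XL / GPT-2 small pair matches the paper's GPT-2 experiments, while the prompt, continuation, and function name here are purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Expert / amateur pair as in the paper's GPT-2 experiments (any causal LM pair works).
tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl")
amateur = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_logprob(model, full_ids, prefix_len):
    """Sum of the log-probabilities the model assigns to the continuation tokens."""
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    targets = full_ids[:, 1:]  # token at position i is predicted from position i-1
    token_lp = logprobs[:, :-1, :].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prefix_len - 1:].sum()

x_pre = "Barack Obama was born in Honolulu, Hawaii. He was born in"  # illustrative prompt
x_cont = " 1961 and grew up there."                                  # illustrative continuation
prefix_len = tok(x_pre, return_tensors="pt").input_ids.shape[1]
full_ids = tok(x_pre + x_cont, return_tensors="pt").input_ids

# L_CD(x_cont, x_pre) = log p_EXP(x_cont | x_pre) - log p_AMA(x_cont | x_pre)
l_cd = (continuation_logprob(expert, full_ids, prefix_len)
        - continuation_logprob(amateur, full_ids, prefix_len))
print(l_cd.item())
```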
Adaptive Plausibility Constraint
\[V_{\text{head}}(x_{<i}) = \{x_i \in V : p_{\text{EXP}}(x_i | x_{<i}) \geq \alpha \max_{w} p_{\text{EXP}}(w|x_{<i})\}\]
- The constraint keeps only tokens whose probability under the expert model is at least a fraction $\alpha$ of the probability of the expert's most likely token
- $\alpha \in [0, 1]$ truncates the next-token distribution of $p_{\text{EXP}}$
- A large $\alpha$ keeps only high-probability tokens
- A small $\alpha$ also allows lower-probability tokens to be considered
- $\alpha = 0.1$ is used in this paper.
- So the constraint keeps only tokens whose probability is at least 10% of the probability of the expert's most likely token (see the sketch below)
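A minimal sketch of the constraint, assuming `p_exp` holds the expert's next-token probabilities; the function name and toy numbers are illustrative, not from the paper's code:

```python
import torch

def plausible_token_mask(p_exp: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """True for tokens whose expert probability is at least alpha times
    the probability of the expert's most likely token."""
    return p_exp >= alpha * p_exp.max()

# Toy example over a 5-token vocabulary.
p_exp = torch.tensor([0.50, 0.30, 0.12, 0.05, 0.03])
print(plausible_token_mask(p_exp))  # tensor([True, True, True, True, False])
```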
Full Method
\[\text{CD-score}(x_i; x_{<i}) = \begin{cases} \log \frac{p_{\text{EXP}}(x_i|x_{<i})}{p_{\text{AMA}}(x_i|x_{<i})}, & \text{if } x_i \in V_{\text{head}}(x_{<i}), \\ -\infty, & \text{otherwise}.\end{cases}\]
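Putting the two pieces together, the sketch below scores every candidate token for a single decoding step; the paper maximizes this score with beam search over full continuations, whereas this example simply takes the per-step argmax, and the toy logits are made up:

```python
import math
import torch

def cd_score(logits_exp: torch.Tensor, logits_ama: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Token-level CD score: log p_EXP - log p_AMA inside V_head, -inf outside."""
    logp_exp = torch.log_softmax(logits_exp, dim=-1)
    logp_ama = torch.log_softmax(logits_ama, dim=-1)
    # Adaptive plausibility constraint in log space: p_EXP >= alpha * max p_EXP.
    cutoff = math.log(alpha) + logp_exp.max()
    scores = logp_exp - logp_ama
    return scores.masked_fill(logp_exp < cutoff, float("-inf"))

# Toy next-token logits over a 5-token vocabulary (made-up numbers).
logits_exp = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])
logits_ama = torch.tensor([2.5, 0.0, 0.3, -1.0, -3.0])
next_token = cd_score(logits_exp, logits_ama).argmax()
print(next_token)  # token 1: the expert likes it far more than the amateur does
```

Tokens outside $V_{\text{head}}$ can never be selected, which prevents the contrast from promoting implausible tokens that the amateur happens to dislike even more.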
Choice of Amateur
We should select an amateur LM that highlights the failure modes we want to downplay in the expert LM's output. Three aspects to consider:
- Scale: we prefer a small-scale LM because it is more prone to the errors we want to penalize
- Temperature: we prefer an amateur temperature $\tau$ close to 0 because it highlights the mode of the amateur distribution, which is more prone to errors
- Context window: restricting the amateur to a short context window weakens its coherence, so contrasting against it rewards coherent continuations (see the sketch below)
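A minimal sketch of how these knobs could be applied to the amateur before computing CD scores; the helper name, `tau = 0.5`, and `k = 32` are illustrative defaults rather than settings taken from the paper's code:

```python
import torch

def amateur_next_token_logprobs(amateur_model, input_ids, tau: float = 0.5, k: int = 32):
    """Next-token log-probs of the amateur with temperature tau, computed from a
    context window restricted to the last k tokens (to weaken its coherence)."""
    short_ctx = input_ids[:, -k:]                       # short context window
    with torch.no_grad():
        logits = amateur_model(short_ctx).logits[:, -1, :]
    return torch.log_softmax(logits / tau, dim=-1)      # small tau sharpens the amateur's mode

# Usage sketch with GPT-2 small as the amateur:
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   amateur = AutoModelForCausalLM.from_pretrained("gpt2")
#   ids = tok("Barack Obama was born in", return_tensors="pt").input_ids
#   logp_ama = amateur_next_token_logprobs(amateur, ids)
```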
Results
Generated text is more diverse and of higher quality when the XL model is used as the expert and the small model as the amateur
Reference
- Li et al. “Contrastive Decoding: Open-ended Text Generation as Optimization”, ACL 2023