NEFTune: Noisy Embedding Instruction Fine Tuning


This paper proposes NEFTune, a simple trick that adds noise to embedding vectors during training and improves the outcome of instruction fine-tuning by a large margin. If you are using the SFT Trainer by Hugging Face, you can use this trick by simply adding one line of code!


Abstract

  • The authors propose NEFTune, a simple trick that adds noise to embedding vectors during training, which improves the outcome of instruction fine-tuning by a large margin
  • The performance gain is shown below:

(Figure: AlpacaEval win-rate improvements from NEFTune)

Methodology

  • During fine-tuning, instruction-response pairs are sampled in the form of text
  • The text is tokenized and turned into embedding vectors.
  • Random noise sampled from a uniform distribution is then added to these embeddings
  • The noise is scaled by $\frac{\alpha}{\sqrt{Ld}}$, where $\alpha$ is a tunable parameter, $L$ is the sequence length, and $d$ is the embedding dimension
  • Details are shown in the figure below, followed by a code sketch of the noise step:

(Figure: details of the NEFTune procedure)
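A minimal sketch of the noise step described above (not the authors' exact implementation), assuming a PyTorch embedding tensor of shape (batch, L, d):

import math
import torch

def add_neftune_noise(embeddings: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add uniform noise with magnitude alpha / sqrt(L * d) to embedding vectors."""
    L, d = embeddings.size(1), embeddings.size(2)  # sequence length, embedding dimension
    mag = alpha / math.sqrt(L * d)                 # noise scale alpha / sqrt(L * d)
    noise = torch.zeros_like(embeddings).uniform_(-mag, mag)  # Uniform(-mag, mag) noise
    return embeddings + noise

# Example with hypothetical shapes: alpha = 5, two sequences of length 128, 4096-dim embeddings
# embeddings = torch.randn(2, 128, 4096)
# noisy = add_neftune_noise(embeddings, alpha=5.0)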

Results

  • Most of the experiments are conducted with 7B-parameter LLMs, including LLaMA-1, LLaMA-2, and OPT-6.7B

Conversational ability

  • NEFTune improves conversational ability and answer quality, as measured by AlpacaEval

(Figure: AlpacaEval results)


Further improvement on chat models

  • NEFTune can further improve chat models.
    • LLaMA-2-Chat (7B) is an extensively tuned model that went through multiple rounds of RLHF.
    • Further tuning LLaMA-2-Chat (7B) on Evol-Instruct gives another 3% boost
    • On top of that, NEFTune brings a sizable additional performance increase of 10%

(Figure: AlpacaEval results for LLaMA-2-Chat with NEFTune)

  • NEFTune also works with QLoRA, as sketched after the figure below

(Figure: results with QLoRA)
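The paper does not give code for this combination; a minimal sketch of enabling NEFTune alongside QLoRA with TRL's SFTTrainer might look like the following (the model name, LoRA hyperparameters, and older-style SFTTrainer arguments mirror the appendix example and are assumptions):

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
import torch

dataset = load_dataset("imdb", split="train")

# Load the base model with 4-bit quantization (the "Q" in QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA adapters
    neftune_noise_alpha=5,  # NEFTune noise on top of QLoRA
)
trainer.train()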


Does NEFTune sacrifice performance on other tasks to improve conversational ability?

  • The figure below shows that NEFTune preserves model capabilities on other tasks

(Figure: performance on standard benchmarks with and without NEFTune)

Reference

  1. Jain et al. "NEFTune: Noisy Embedding Instruction Fine Tuning." arXiv preprint arXiv:2310.05914 (2023)

Link

  1. Github
  2. Hugging Face documentation

Appendix

To use it with the SFT Trainer by Hugging Face:

from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5,  # the single extra argument that enables NEFTune
)
trainer.train()

More details can be found in the link above.

Detailed implementation from the Hugging Face SFT Trainer:

import torch


def neftune_post_forward_hook(module, input, output):
    """
    Implements the NEFTune forward pass for the model using forward hooks. Note this works only for
    torch.nn.Embedding layers. This method is slightly adapted from the original source code
    that can be found here: https://github.com/neelsjain/NEFTune

    Simply add it to your model as follows:
    ```python
    model = ...
    model.embed_tokens.neftune_noise_alpha = 0.1
    model.embed_tokens.register_forward_hook(neftune_post_forward_hook)
    ```

    Args:
        module (`torch.nn.Module`):
            The embedding module where the hook is attached. Note that you need to set
            `module.neftune_noise_alpha` to the desired noise alpha value.
        input (`torch.Tensor`):
            The input tensor to the model.
        output (`torch.Tensor`):
            The output tensor of the model (i.e. the embeddings).
    """
    if module.training:
        dims = torch.tensor(output.size(1) * output.size(2))
        mag_norm = module.neftune_noise_alpha / torch.sqrt(dims)
        output = output + torch.zeros_like(output).uniform_(-mag_norm, mag_norm)
    return output