Guide to Fine-Tuning Large Language Models (LLMs) with QLoRA
Part of an explainer series of practical guides to help you get the most out of machine learning and advances in artificial intelligence.
Large language models (LLMs) like Llama 2 have shown immense potential for generating human-like text. However, they still require task-specific fine-tuning to reach optimal performance. This process adjusts the model's weights to customise it with new data.
In this guide, we walk through fine-tuning LLMs using techniques like LoRA and quantisation. We explain the key parameters involved and how they impact model size, speed, and accuracy. By the end, you'll have the knowledge to effectively fine-tune an LLM for your own applications.
Introduction
LLMs contain billions of parameters, enabling them to produce remarkably coherent text. However, their knowledge remains confined to their training data. Fine-tuning adapts these giant pre-trained models to new domains where their capabilities can be specifically applied.
For instance, an LLM pre-trained on Wikipedia can summarise news articles after fine-tuning on media data. The technique adjusts the model's weights to reinforce patterns in the new data. This article details best practices to make your fine-tuning process efficient and effective.
We'll cover:
LoRA and quantisation to optimise the process
How key parameters affect model performance and training stability
Strategies to avoid common pitfalls like overfitting
A step-by-step walkthrough from data preparation to inference
By the end, you'll have an actionable approach to customise LLMs for your own tasks. Let's get started!
What is QLoRA?
Training an LLM from scratch requires prohibitively large amounts of data and compute, so fine-tuning offers a more practical alternative by adapting a pre-trained model. However, even fine-tuning can strain resources on massive models. Methods like LoRA, which adds small low-rank matrices and trains only those while keeping the pre-trained weights frozen, and quantisation, which compresses the model by storing its weights at reduced precision (here, 4 bits), make fine-tuning feasible on standard hardware. Together, LoRA and quantisation enable rapid adaptation of models with tens of billions of parameters on a single GPU, by limiting the number of trainable parameters and shrinking the model's memory footprint with minimal accuracy drop. With these efficient fine-tuning methods in place, properly tuning the key hyperparameters that control model size, memory use, and training stability becomes crucial for achieving good performance.
QLoRA is a method designed to refine large language models more efficiently. It optimises memory usage so that even substantial models can be adjusted using standard graphics hardware.
The method allows for rapid and efficient fine-tuning, achieving near top-tier performance levels with significantly reduced time and computational resources.
How It Works:
QLoRA employs a 4-bit representation of the original model to conserve memory.
It introduces a new 4-bit data type, NormalFloat (NF4), designed for the normally distributed weights it stores.
To further conserve memory, the method uses a technique termed "double quantization."
To handle unexpected surges in memory demands, it deploys a strategy known as "paged optimisers."
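As a rough sketch of how these ideas map onto code (assuming the Hugging Face transformers and bitsandbytes libraries used later in this guide), the first three are configured through a single BitsAndBytesConfig, while paged optimisers are selected by name in the training arguments:
import torch
from transformers import BitsAndBytesConfig

# 4-bit storage, the NF4 data type, and double quantisation are all set in one config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # keep the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # the NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,        # double quantisation
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual computations
)
# Paged optimisers are selected by name when defining the training arguments,
# e.g. TrainingArguments(..., optim="paged_adamw_32bit").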
Steps to Fine-Tune Llama 2 using QLoRA:
Set Up the Environment:
Ensure you're working in an environment with sufficient computational resources, such as Google Colab. Install necessary Python libraries: accelerate, peft, bitsandbytes, transformers, and trl.
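For example, in a Colab notebook these can all be installed in one cell (versions are omitted here; pin specific versions if you need reproducibility):
!pip install -q accelerate peft bitsandbytes transformers trl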
Load the Datasets:
Prepare your training (train.jsonl) and evaluation datasets (test.jsonl). Each dataset should have entries with prompt and response keys.
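A minimal sketch of this step, assuming the Hugging Face datasets library and that both files live under /content/ as in the configuration listing further down:
from datasets import load_dataset

# Each record is expected to contain "prompt" and "response" keys
train_dataset = load_dataset("json", data_files="/content/train.jsonl", split="train")
eval_dataset = load_dataset("json", data_files="/content/test.jsonl", split="train")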
Define Hyperparameters:
Set up model-specific parameters like model_name and new_model. Define training parameters such as learning_rate, weight_decay, and num_train_epochs.
Initialise the Model and Tokenizer:
Initialise the pre-trained model and tokenizer based on model_name.
Create a model with LoRA (Low-Rank Adaptation) layers for fine-tuning.
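A sketch of this step using the transformers, bitsandbytes, and peft libraries, with variable names taken from the hyperparameter listing further down; the exact setup in your own notebook may differ:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

# Quantisation settings, built from the variables in the listing below
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load the pre-trained model in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default
tokenizer.padding_side = "right"

# LoRA configuration; the adapter layers are attached when this is passed to the trainer
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)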
Start the Fine-Tuning Process:
Train the model using the defined hyperparameters and the training dataset.
Evaluate the model's performance on the evaluation dataset periodically.
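A sketch of the training step using trl's SFTTrainer and the hyperparameters defined below. The merged "text" column and the evaluation settings are assumptions, and the exact keyword arguments vary between trl versions (newer releases move some of these options into SFTConfig):
from transformers import TrainingArguments
from trl import SFTTrainer

# The datasets loaded earlier have "prompt" and "response" keys; this trainer setup
# expects a single text column, so merge them first (the exact format is an assumption)
def to_text(example):
    return {"text": example["prompt"] + "\n" + example["response"]}

train_dataset = train_dataset.map(to_text)
eval_dataset = eval_dataset.map(to_text)

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    fp16=fp16,
    bf16=bf16,
    group_by_length=group_by_length,
    save_steps=save_steps,
    logging_steps=logging_steps,
    evaluation_strategy="steps",  # evaluate periodically during training (assumed)
    eval_steps=save_steps,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()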
Perform Inference:
Use the fine-tuned model to generate text or predictions based on provided prompts.
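One simple way to do this, assuming the transformers pipeline API and Llama 2's chat prompt format (the prompt itself is just an illustrative example):
from transformers import pipeline

prompt = "What does fine-tuning an LLM mean?"  # hypothetical example prompt
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])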
Save and Load the Model:
Once fine-tuning is complete, save the model weights. You can later load these weights to resume training or for inference.
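A sketch of saving and reloading the adapter, assuming the trainer and variables from the previous steps; only the small LoRA adapter weights are written to disk, not the full base model:
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Save only the trained LoRA adapter weights (they are tiny compared to the base model)
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Later: reload the base model and attach the saved adapter for inference or further training
base_model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
model = PeftModel.from_pretrained(base_model, new_model)

The hyperparameter values referenced throughout these steps are listed below, and the sections that follow explain what each one controls.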
model_name = "meta-llama/Llama-2-7b-chat-hf"
dataset_name = "/content/train.jsonl"
new_model = "llama-2-7b-fine-tuned"
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 25
logging_steps = 25
max_seq_length = None
packing = False
device_map = {"": 0}
Fine-Tuning Parameters and Their Implications:
LoRA Parameters:
Low-Rank Adaptation (LoRA) is a technique for fine-tuning large pre-trained models without updating all of their weights. It adds small low-rank matrices to the model and trains only those, keeping the original pre-trained weights frozen.
lora_r (Rank of the low-rank approximation):
Effect: Determines the rank, and therefore the size, of the low-rank update matrices.
Implications:
A higher value captures more information but increases computational demands and risk of overfitting with small datasets.
A lower value decreases computation but might not capture sufficient information for adaptation.
Example: Adjust based on dataset size: smaller datasets might benefit from a lower value.
Possible values: Positive integers, typically ranging from 1 to 128.
lora_alpha (Scaling factor for the LoRA update):
Effect: Scales the contribution of the LoRA layers relative to the frozen pre-trained weights (in most implementations the update is scaled by lora_alpha / lora_r).
Implications:
A higher value gives the LoRA update more influence, potentially overshadowing the pre-trained behaviour.
A lower value offers a more subtle LoRA adaptation.
Example: Adjust if the model isn't deviating enough from pre-trained behaviour.
Possible values: Positive values; common settings include 8, 16, and 32.
lora_dropout (Dropout applied to the LoRA weights):
Effect: Introduces dropout to the LoRA layers for regularisation.
Implications:
A higher value can prevent overfitting but might also hinder learning.
A lower value reduces regularisation.
Example: Increase if observing overfitting.
Possible values: Floats between 0 and 1 (e.g., 0.1, 0.2, 0.3).
Quantisation Parameters:
Quantisation reduces the memory footprint of models, making them suitable for deployment on memory-limited devices.
use_4bit (Whether to use 4-bit quantisation):
Effect: Adjusts the bit-width of the model weights.
Implications:
Using 4-bit quantisation shrinks model size but may slightly degrade performance.
Example: Enable for deployments on memory-limited edge devices.
Possible values: True or False.
bnb_4bit_compute_dtype (Data type for 4-bit quantised computations):
Effect: Sets the computation data type.
Implications:
"float16" enhances speed but may introduce minor numerical inaccuracies compared to "float32".
Example: Opt for "float16" if speed is paramount and minor accuracy losses are acceptable.
Possible values: "float16" or "float32".
bnb_4bit_quant_type (Quantisation type for 4-bit computations):
Effect: Specifies the quantisation strategy.
Implications:
Different strategies trade off accuracy and computation speed in different ways.
Example: The ideal setting may depend on the nature of the data and the task.
Possible values: Depends on the library; bitsandbytes supports "fp4" and "nf4" (used here).
use_nested_quant (Whether to use nested quantisation):
Effect: Applies a second round of quantisation to the quantisation constants themselves (the "double quantization" described earlier).
Implications:
Further reduces memory use, usually with only a small additional impact on accuracy.
Example: Useful in extremely memory-constrained situations.
Possible values: True or False.
Training Hyperparameters:
learning_rate:
Effect: Dictates the optimisation step size.
Implications:
Too high values may cause loss oscillations or divergence.
Too low values lead to slow convergence or getting trapped in local minima.
Example: Adjust based on training stability.
Possible values: Positive floats (e.g., 1e-3, 5e-4).
weight_decay (Regularisation term):
Effect: Penalises large weight values.
Implications:
Higher values slow down learning but can prevent overfitting.
Lower values minimise regularisation.
Example: Increase to combat overfitting.
Possible values: Non-negative floats (e.g., 0, 0.0001).
num_train_epochs:
Effect: Determines how many times the model reviews the entire training dataset.
Implications:
More epochs might improve performance but risk overfitting.
Fewer epochs might result in underfitting.
Example: Adjust based on convergence observations.
Possible values: Positive integers; for LLM fine-tuning, typically a small number (1 to 5).
gradient_accumulation_steps:
Effect: Delays updates by accumulating gradients over multiple steps.
Implications:
Useful for circumventing memory constraints, effectively simulating a larger batch size without memory overflow.
Example: Increase when confronting out-of-memory errors without wanting to decrease batch size.
Possible values: Positive integers like 1, 2, 4.
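As a worked example, with the values listed earlier (per_device_train_batch_size = 4 and gradient_accumulation_steps = 1) each optimiser update sees an effective batch of 4 × 1 = 4 examples per device; raising gradient_accumulation_steps to 4 while keeping the batch size at 4 gives an effective batch of 4 × 4 = 16 examples with roughly the same peak memory.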
To try out this concept for yourself, please check out this fantastic Google Colab notebook from Matt Shumer, which also credits work from Maxime Labonne: https://colab.research.google.com/drive/1Zmaceu65d7w4Tcd-cfnZRb6k_Tcv2b8g?usp=sharing
Glossary:
Low-Rank Adaptation (LoRA):
A method to adapt large models by training only a small number of added weights. Think of it as adding a few new components to an existing machine to adjust its functionality, without altering the original parts.
Quantisation:
A technique to make a model smaller and faster by simplifying it, though this might slightly reduce its accuracy. It's similar to compressing a photo to save space, causing it to become a tad blurry.
Dropout:
A strategy to strengthen the model by randomly disabling some of its parts during training. Picture a sports team practising with some players occasionally sitting out, ensuring the team doesn't become overly dependent on any single player.
Overfitting:
Occurs when a model excels with familiar data but struggles with new, unseen data. It's akin to a student who memorises test answers but has difficulty applying the knowledge in real-world scenarios.
Gradient:
A signal that tells the model how far its predictions are from the actual answers and in which direction to adjust its weights to improve.
Backpropagation:
A process where the model reflects on its mistakes and tweaks itself. Imagine playing a game, learning from errors, and improving for the next round.
Exploding Gradients:
An issue where the model's corrections become excessively large, leading to unstable training. Picture trying to fine-tune a radio but adjusting the dial too drastically, causing signal disruption.
Gradient Clipping:
A method to restrict the corrections from becoming too extensive. It's like setting a limit on how much you can adjust a radio dial to maintain a clear signal.
Gradient Accumulation Steps:
A technique that processes several small batches and adds up their gradients before making a single model update, simulating a larger batch without extra memory. Think of consuming a large meal in several smaller portions.
Gradient Checkpointing:
A technique to conserve memory during training by recording only select vital information. Consider it similar to jotting down key points during a lecture, rather than transcribing every word.