<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[MLOps Consulting]]></title><description><![CDATA[An explainer series of practical and useful guides to understand how we can get the most out of machine learning and advances in artificial intelligence.]]></description><link>https://blog.mlops.consulting</link><image><url>https://substackcdn.com/image/fetch/$s_!8QtU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedba0c7-e317-4d2d-9be7-5415e7c04143_620x620.png</url><title>MLOps Consulting</title><link>https://blog.mlops.consulting</link></image><generator>Substack</generator><lastBuildDate>Tue, 07 Apr 2026 10:01:01 GMT</lastBuildDate><atom:link href="https://blog.mlops.consulting/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[MLOps Consulting]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mlopsconsulting@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mlopsconsulting@substack.com]]></itunes:email><itunes:name><![CDATA[Adam Knight]]></itunes:name></itunes:owner><itunes:author><![CDATA[Adam Knight]]></itunes:author><googleplay:owner><![CDATA[mlopsconsulting@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mlopsconsulting@substack.com]]></googleplay:email><googleplay:author><![CDATA[Adam Knight]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Pac-Man Lego and an LLM epiphany]]></title><description><![CDATA[On my 42nd birthday, I was handed a challenge wrapped in a gift box - a 2651-piece Lego Icons Pac-Man 
set.]]></description><link>https://blog.mlops.consulting/p/pac-man-lego-and-an-llm-epiphany</link><guid isPermaLink="false">https://blog.mlops.consulting/p/pac-man-lego-and-an-llm-epiphany</guid><dc:creator><![CDATA[Adam Knight]]></dc:creator><pubDate>Fri, 15 Sep 2023 09:00:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YCbe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YCbe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YCbe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YCbe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg" width="1400" height="1199" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1199,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YCbe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YCbe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73812659-c183-4a67-8417-aaefea0b9209_1400x1199.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft"></button></div></div></div></a></figure></div><p>On my 42nd birthday, I was handed a challenge wrapped in a gift box - a 2651-piece Lego Icons Pac-Man set. As I unwrapped the box, a wave of nostalgia washed over me both for Lego and of course Pac-Man. The bright, colourful blocks, the familiar click as they snapped together, and the anticipation of seeing a pile of pieces transform into a tangible, recognisable form &#8211; it was a trip back to the carefree days of my childhood.</p><p>But this was no ordinary Lego set. With its 2600+ pieces and an instruction guide an inch thick, it was a behemoth that demanded patience, precision, and a keen eye for detail. 
Having spoken to a large number of users and engineers over the last year, I couldn't help but draw parallels between my Lego endeavour and what is asked of large language models most of the time.</p><p>&#8216;Achieve this hard task, but I&#8217;m only going to give you little, if any, of the nuance and context that I would give to a person if I were asking them to do the same job&#8217;.</p><p>Language, like this Lego set, is not a simple, straightforward entity. It is a complex structure, a jigsaw puzzle with countless pieces. Each word, each phrase, each idiom carries a weight of history, nuance, and context. They are not just building blocks of communication; they are the carriers of culture, the markers of time, the reflections of our collective consciousness.</p><p>As I pored over the instructions, something particularly struck me: the organisation of pieces into numbered bags. These bags make it easier to locate the next brick required, but they also constrain my choice, effectively directing my actions towards the correct sequence for assembly. In much the same way, the addition of nuance and context to a prompt given to a large language model narrows down the pool of potential next tokens, making the model&#8217;s response more aligned with the intended query.</p><p>Consider a general prompt like, "Tell me about climate change." A large language model could respond with a broad spectrum of answers, ranging from the science behind climate change to its socio-political implications. Now, contrast this with a more nuanced prompt, such as, "Explain the impact of climate change on polar ice caps in the last decade." The latter, with its added context and specificity, restricts the model's potential outputs. 
The range of relevant tokens narrows significantly, honing the model's focus and driving it toward a more precise and relevant response.</p><p>Just as the numbered bags in my Lego set served to filter out irrelevant pieces, leading me to the exact components I needed to complete a specific section, contextual cues and nuanced prompts steer a language model toward generating more relevant, coherent, and valuable outputs. In both instances, whether assembling intricate Lego structures or generating human-like text, the devil is in the details&#8212;nuance and context act as invaluable guides in the building process.</p><p>Expecting AI to understand and replicate this complexity based on surface-level information is like expecting to build my Pac-Man set by just looking at the box. It's not just about knowing what the final product looks like; it's about understanding the 'why' and 'how' of each piece, each step in the process. It's about recognising patterns, making connections, and anticipating outcomes. </p><p>As engineers in this new world it's imperative that we do more than just write code; we must be architects of a new digital-human interface. While it's tempting to rely solely on the advancements in language models, or place the burden on users to adapt, such an approach sells the technology short and limits its transformative potential.</p><p>We must take the helm in actively shaping how these models interact with users. We are the mediators who must make AI not just powerful, but also accessible, intuitive, and deeply attuned to human needs. Failing to do so risks leaving an enormous gap between what AI could be and what it becomes, a gap filled with missed opportunities to genuinely improve the human experience.</p><p>The onus is on us to not just engineer but to envision, to pave the way for a future where technology truly serves humanity. 
Anything less is an abdication of our role as the architects of tomorrow's digital world.</p><p>As I sat there with my Lego set, painstakingly placing each block, following each step in the guide, for two whole days, I couldn't help but marvel at the complexity of the set and the attention to detail in the instructions.</p><p>Building the Lego Icons Pac-Man set was a journey, a challenge, and a joy. It was a reminder of the beauty of complexity and the thrill of creation. And it gave me an insight into the potential of AI, a promise of a future where technology understands us, aids us, and enriches our lives in ways we can't even imagine yet.</p><p>So, as AI engineers, our task is clear. We need to understand the complexity, appreciate the nuance, and respect the context. We need to create tools that help users be more productive and do better, more creative work. And we need to remember that our goal is not just to build technology, but to build bridges between technology and humanity.</p><p>Here's to the journey ahead, filled with challenges, discoveries, and breakthroughs. 
Because with the right tools, the right approach, and the right mindset, who knows what we can build?</p>]]></content:encoded></item><item><title><![CDATA[Guide to Fine-Tuning Large Language Models (LLMs) with QLoRA]]></title><description><![CDATA[Part of an explainer series of practical and useful guides to understand how we can get the most out of machine learning and advances in artificial intelligence.]]></description><link>https://blog.mlops.consulting/p/guide-to-fine-tuning-large-language</link><guid isPermaLink="false">https://blog.mlops.consulting/p/guide-to-fine-tuning-large-language</guid><dc:creator><![CDATA[Adam Knight]]></dc:creator><pubDate>Fri, 18 Aug 2023 08:08:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aS-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aS-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aS-T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aS-T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!aS-T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aS-T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aS-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253960,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aS-T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aS-T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!aS-T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aS-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c8d395f-33f4-4500-ac1b-63854af71763_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Large language models (LLMs) like Llama2 have shown immense potential for generating human-like text. 
However, they still require task-specific fine-tuning to reach optimal performance. This process adjusts the model's weights to customise it with new data.</p><p>This guide offers a practical approach to fine-tuning LLMs using techniques like LoRA and quantisation. We explain the key parameters involved and how they impact model size, speed, and accuracy. By the end, you'll have the knowledge to effectively fine-tune an LLM for your own applications.</p><h2><strong>Introduction</strong></h2><p>LLMs contain billions of parameters, enabling them to produce remarkably coherent text. However, their knowledge remains confined to their training data. Fine-tuning adapts these giant, pre-trained models to new domains where their capabilities can be specifically applied.</p><p>For instance, an LLM pre-trained on Wikipedia can summarise news articles after fine-tuning on media data. The technique adjusts the model's weights to reinforce patterns in the new data. This article details best practices to make your fine-tuning process efficient and effective.</p><h4><strong>We'll cover:</strong></h4><ul><li><p>LoRA and quantisation to optimise the process</p></li><li><p>How key parameters affect model performance and training stability</p></li><li><p>Strategies to avoid common pitfalls like overfitting</p></li><li><p>A step-by-step walkthrough from data preparation to inference</p></li></ul><p>By the end, you'll have an actionable approach to customise LLMs for your own tasks. Let's get started!</p><h4><strong>What is QLoRA?</strong></h4><p>Training an LLM from scratch requires prohibitively large amounts of data and computing resources, so fine-tuning offers a more practical alternative by slightly adapting a pre-trained model. However, even fine-tuning can strain resources on massive models. 
Methods like LoRA, which introduces small low-rank matrices to update the model while leaving the pre-trained weights frozen, and quantisation, which compresses the model by reducing the precision of its parameters to 4 bits, make fine-tuning feasible on standard hardware. Together, LoRA and quantisation enable rapid adaptation of models with tens of billions of parameters on a single GPU, by restricting which weights are updated and shrinking the stored model, with minimal accuracy drop. With these efficient fine-tuning methods in place, properly tuning the key hyper-parameters that control model size, memory use, and training stability becomes crucial for achieving good performance.</p><p>QLoRA is a method designed to refine large language models more efficiently. It optimises memory usage so that even substantial models can be adjusted using standard graphics hardware.</p><p>The method allows for rapid and efficient fine-tuning, achieving near top-tier performance levels with significantly reduced time and computational resources.</p><h4><strong>How It Works:</strong></h4><ul><li><p>QLoRA employs a 4-bit representation of the original model to conserve memory.</p></li><li><p>It introduces a new data format optimised for the kind of information it manages.</p></li><li><p>To further conserve memory, the method uses a technique termed "double quantisation."</p></li><li><p>To handle unexpected surges in memory demands, it deploys a strategy known as "paged optimisers."</p></li></ul><p></p><div><hr></div><h2>Steps to Fine-Tune LLaMA2 using QLoRA:</h2><h4><strong>Set Up the Environment:</strong></h4><p>Ensure you're working in an environment with sufficient computational resources, such as Google Colab. Install necessary Python libraries: accelerate, peft, bitsandbytes, transformers, and trl.</p><h4><strong>Load the Datasets:</strong></h4><p>Prepare your training (train.jsonl) and evaluation datasets (test.jsonl). 
Each dataset should have entries with prompt and response keys.</p><h4><strong>Define Hyperparameters:</strong></h4><p>Set up model-specific parameters like model_name and new_model. Define training parameters such as learning_rate, weight_decay, and num_train_epochs.</p><h4><strong>Initialise the Model and Tokenizer:</strong></h4><p>Initialise the pre-trained model and tokenizer based on model_name.<br>Create a model with LoRA (Low-Rank Adaptation) layers for fine-tuning.</p><h4><strong>Start the Fine-Tuning Process:</strong></h4><p>Train the model using the defined hyperparameters and the training dataset.<br>Evaluate the model's performance on the evaluation dataset periodically.</p><h4><strong>Perform Inference:</strong></h4><p>Use the fine-tuned model to generate text or predictions based on provided prompts.</p><h4><strong>Save and Load the Model:</strong></h4><p>Once fine-tuning is complete, save the model weights. You can later load these weights to resume training or for inference.</p><p></p><pre><code>model_name = "meta-llama/Llama-2-7b-chat-hf"
dataset_name = "/content/train.jsonl"  # path to the training data (JSONL)
new_model = "llama-2-7b-fine-tuned"    # name under which the tuned model is saved
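# --- Illustrative aside (not from the original post): each line of the
# train.jsonl / test.jsonl files is assumed to hold one JSON object with
# "prompt" and "response" keys, as described above. One common (but
# model-specific) way to fold such a record into a single training string:
import json

def to_training_text(jsonl_line):
    record = json.loads(jsonl_line)
    # Llama-2-style instruction wrapping; adapt to your model's chat template
    return f"<s>[INST] {record['prompt']} [/INST] {record['response']} </s>"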
lora_r = 64          # LoRA rank: size of the low-rank update matrices
lora_alpha = 16      # LoRA scaling factor
lora_dropout = 0.1   # dropout applied to the LoRA layers
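# --- Worked example (standard LoRA arithmetic, added for illustration):
# LoRA replaces the update to a d-by-k weight matrix with two small matrices
# of shapes (d, r) and (r, k), so trainable parameters grow linearly with r.
def lora_trainable_params(d, k, r):
    return r * (d + k)

# For a 4096x4096 projection with r = 64: 64 * (4096 + 4096) = 524288
# trainable parameters, versus 16777216 in the full matrix (about 3%).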
use_4bit = True                     # load the base model in 4-bit precision
bnb_4bit_compute_dtype = "float16"  # dtype used for computation in 4-bit layers
bnb_4bit_quant_type = "nf4"         # quantisation type (4-bit NormalFloat)
use_nested_quant = False            # nested ("double") quantisation
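# --- Back-of-the-envelope aside (added for illustration; weights only --
# activations, LoRA layers and optimiser state need extra memory): weight
# storage scales with bits per parameter, which is why 4-bit loading fits a
# 7B-parameter model on a single consumer GPU.
def weight_memory_gib(n_params, bits):
    return n_params * bits / 8 / 1024**3

# A 7B model: roughly 13 GiB at 16-bit versus roughly 3.3 GiB at 4-bit.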
output_dir = "./results"           # where checkpoints and logs are written
num_train_epochs = 1               # passes over the training set
fp16 = False                       # mixed-precision training (fp16)
bf16 = False                       # mixed-precision training (bf16)
per_device_train_batch_size = 4    # batch size per GPU for training
per_device_eval_batch_size = 4     # batch size per GPU for evaluation
gradient_accumulation_steps = 1    # steps to accumulate before an update
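# --- Note (standard trainer behaviour, added for clarity): gradients can be
# accumulated over several forward/backward passes before each optimiser
# step, so the effective batch size per update is the product below.
def effective_batch_size(per_device, accumulation_steps):
    return per_device * accumulation_steps

# With the values above, 4 * 1 = 4; raising gradient_accumulation_steps to 8
# would simulate a batch of 32 without extra GPU memory.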
gradient_checkpointing = True      # recompute activations to save memory
max_grad_norm = 0.3                # gradient clipping threshold
learning_rate = 2e-4               # optimiser step size
weight_decay = 0.001               # L2-style regularisation
optim = "paged_adamw_32bit"        # paged AdamW optimiser
lr_scheduler_type = "constant"     # keep the learning rate fixed
max_steps = -1                     # -1: derive total steps from num_train_epochs
warmup_ratio = 0.03                # fraction of steps for LR warm-up
group_by_length = True             # batch sequences of similar length
save_steps = 25                    # checkpoint every 25 steps
logging_steps = 25                 # log metrics every 25 steps
max_seq_length = None              # use the default sequence length
packing = False                    # don't pack multiple examples per sequence
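# --- Hedged sketch (not from the original post; library APIs move quickly,
# so treat this as an outline under the transformers/peft/trl versions
# current when QLoRA appeared, not a definitive implementation). It wires
# the settings above into a training run; imports sit inside the function so
# the outline can be read without the libraries installed.
def run_finetuning():
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, TrainingArguments)
    from trl import SFTTrainer

    # 4-bit (QLoRA-style) loading of the base model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=use_4bit,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),
        bnb_4bit_use_double_quant=use_nested_quant,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map=device_map)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    peft_config = LoraConfig(r=lora_r, lora_alpha=lora_alpha,
                             lora_dropout=lora_dropout,
                             bias="none", task_type="CAUSAL_LM")
    training_args = TrainingArguments(
        output_dir=output_dir, num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=gradient_checkpointing,
        max_grad_norm=max_grad_norm, learning_rate=learning_rate,
        weight_decay=weight_decay, optim=optim,
        lr_scheduler_type=lr_scheduler_type, max_steps=max_steps,
        warmup_ratio=warmup_ratio, group_by_length=group_by_length,
        save_steps=save_steps, logging_steps=logging_steps,
        fp16=fp16, bf16=bf16,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=load_dataset("json", data_files=dataset_name, split="train"),
        peft_config=peft_config,
        dataset_text_field="text",  # assumes a single pre-built "text" column
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_args,
        packing=packing,
    )
    trainer.train()
    trainer.model.save_pretrained(new_model)  # saves the LoRA adapter weights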
device_map = {"": 0}</code></pre><h2>Fine-Tuning Parameters and Their Implications:</h2><h3><strong>LoRA Parameters:</strong></h3><p>Low-Rank Adaptation (LoRA) is a technique to fine-tune large pre-trained models without requiring a large amount of new data. It introduces low-rank layers to the model, which are trained while keeping the original pre-trained weights static.</p><p></p><h4><strong>lora_r (Rank of the low-rank approximation):</strong></h4><p>Effect: Determines the size of the low-rank matrix.</p><p><strong>Implications:</strong></p><ul><li><p>A higher value captures more information but increases computational demands and risk of overfitting with small datasets.</p></li><li><p>A lower value decreases computation but might not capture sufficient information for adaptation.</p></li></ul><p>Example: Adjust based on dataset size: smaller datasets might benefit from a lower value.</p><p>Possible values: Positive integers, typically ranging from 1 to 128.</p><p></p><h4><strong>lora_alpha (Scaling factor for the LoRA update):</strong></h4><p>Effect: Scales the contribution of the LoRA layers; the update is multiplied by lora_alpha / lora_r.</p><p><strong>Implications:</strong></p><ul><li><p>A higher value gives the LoRA update more weight, potentially overshadowing the pre-trained behaviour.</p></li><li><p>A lower value offers a more subtle LoRA adaptation.</p></li></ul><p>Example: Adjust if the model isn't deviating enough from pre-trained behaviour.</p><p>Possible values: Positive numbers, commonly between 8 and 64 (16 is used here).</p><p></p><h4><strong>lora_dropout (Dropout applied to the LoRA weights):</strong></h4><p>Effect: Introduces dropout to the LoRA layers for regularisation.</p><p><strong>Implications:</strong></p><ul><li><p>A higher value can prevent overfitting but might also hinder learning.</p></li><li><p>A lower value reduces regularisation.</p></li></ul><p>Example: Increase if observing overfitting.</p><p>Possible values: Floats between 0 and 1 (e.g., 0.1, 0.2, 
0.3).</p><p></p><div><hr></div><h3><strong>Quantisation Parameters:</strong></h3><p>Quantisation reduces the memory footprint of models, making them suitable for deployment on memory-limited devices.</p><p></p><h4><strong>use_4bit<br>(Whether to use 4-bit quantisation):</strong></h4><p>Effect: Adjusts the bit-width of the model weights.</p><p><strong>Implications:</strong></p><ul><li><p>Using 4-bit quantisation shrinks model size but may slightly degrade performance.</p></li></ul><p>Example: Enable for deployments on memory-limited edge devices.</p><p>Possible values: True or False.</p><p></p><h4><strong>bnb_4bit_compute_dtype<br>(Data type for 4-bit quantised computations):</strong></h4><p>Effect: Sets the computation data type.</p><p><strong>Implications:</strong></p><ul><li><p>"float16" enhances speed but may introduce minor numerical inaccuracies compared to "float32".</p></li></ul><p>Example: Opt for "float16" if speed is paramount and minor accuracy losses are acceptable.</p><p>Possible values: "float16" or "float32".</p><p></p><h4><strong>bnb_4bit_quant_type<br>(Quantisation type for 4-bit computations):</strong></h4><p>Effect: Specifies the quantisation strategy.</p><p><strong>Implications:</strong></p><ul><li><p>Different strategies can influence accuracy and computation speed variably.</p></li></ul><p>Example: The ideal setting may depend on data nature and task.</p><p>Possible values: Depends on the library. 
Here, "nf4" is used.</p><p></p><h4><strong>use_nested_quant<br>(Whether to use nested quantisation):</strong></h4><p>Effect: Enables another layer of quantisation.</p><p><strong>Implications:</strong></p><ul><li><p>Potentially further diminishes model size at the cost of performance.</p></li></ul><p>Example: Useful for extreme memory-constrained situations.</p><p>Possible values: True or False.</p><p></p><div><hr></div><h2>Training Hyperparameters:</h2><p></p><h4><strong>learning_rate:</strong></h4><p>Effect: Dictates the optimisation step size.</p><p><strong>Implications:</strong></p><ul><li><p>Too high values may cause loss oscillations or divergence.</p></li><li><p>Too low values lead to slow convergence or getting trapped in local minima.</p></li></ul><p>Example: Adjust based on training stability.</p><p>Possible values: Positive floats (e.g., 1e-3, 5e-4).</p><p></p><h4><strong>weight_decay (Regularisation term):</strong></h4><p>Effect: Penalises large weight values.</p><p><strong>Implications:</strong></p><ul><li><p>Higher values slow down learning but can prevent overfitting.</p></li><li><p>Lower values minimise regularisation.</p></li></ul><p>Example: Increase to combat overfitting.</p><p>Possible values: Non-negative floats (e.g., 0, 0.0001).</p><p></p><h4><strong>num_train_epochs:</strong></h4><p>Effect: Determines how many times the model reviews the entire training dataset.</p><p><strong>Implications:</strong></p><ul><li><p>More epochs might improve performance but risk overfitting.</p></li><li><p>Fewer epochs might result in underfitting.</p></li></ul><p>Example: Adjust based on convergence observations.</p><p>Possible values: Positive integers, typically between 1 and several hundred.</p><p></p><h4><strong>gradient_accumulation_steps:</strong></h4><p>Effect: Delays updates by accumulating gradients over multiple steps.</p><p><strong>Implications:</strong></p><ul><li><p>Useful for circumventing memory constraints, effectively simulating a larger batch 
size without memory overflow.</p></li></ul><p>Example: Increase when confronting out-of-memory errors without wanting to decrease batch size.</p><p>Possible values: Positive integers like 1, 2, 4.</p><p></p><div><hr></div><p>To try out this concept for yourself please check out this fantastic Google Colab Notebook from <a href="https://twitter.com/mattshumer_">Matt Shumer</a> which also credits work from <a href="https://twitter.com/maximelabonne">Maxime Labonne</a>. <a href="https://colab.research.google.com/drive/1Zmaceu65d7w4Tcd-cfnZRb6k_Tcv2b8g?usp=sharing">https://colab.research.google.com/drive/1Zmaceu65d7w4Tcd-cfnZRb6k_Tcv2b8g?usp=sharing</a></p><div><hr></div><h2>Glossary:</h2><h4><strong>Low-Rank Adaptation (LoRA):</strong></h4><p>A method to tweak large models with small amounts of new data. Think of it as adding a few new components to an existing machine to adjust its functionality, without altering the original parts.</p><h4><strong>Quantisation:</strong></h4><p>A technique to make a model smaller and faster by simplifying it, though this might slightly reduce its accuracy. It's similar to compressing a photo to save space, causing it to become a tad blurry.</p><h4><strong>Dropout:</strong></h4><p>A strategy to strengthen the model by randomly disabling some of its parts during training. Picture a sports team practising with some players occasionally sitting out, ensuring the team doesn't become overly dependent on any single player.</p><h4><strong>Overfitting:</strong></h4><p>Occurs when a model excels with familiar data but struggles with new, unseen data. 
It's akin to a student who memorises test answers but has difficulty applying the knowledge in real-world scenarios.</p><h4><strong>Gradient:</strong></h4><p>A metric that informs the model about the discrepancy between its predictions and the actual answers, guiding its improvements.</p><h4><strong>Backpropagation:</strong></h4><p>A process where the model reflects on its mistakes and tweaks itself. Imagine playing a game, learning from errors, and improving for the next round.</p><h4><strong>Exploding Gradients:</strong></h4><p>An issue where the model's corrections become excessively large, leading to unstable training. Picture trying to fine-tune a radio but adjusting the dial too drastically, causing signal disruption.</p><h4><strong>Gradient Clipping:</strong></h4><p>A method to restrict the corrections from becoming too extensive. It's like setting a limit on how much you can adjust a radio dial to maintain a clear signal.</p><h4><strong>Gradient Accumulation Steps:</strong></h4><p>An approach to manage vast data by segmenting it and updating the model incrementally. Think of consuming a large meal in several smaller portions.</p><h4><strong>Gradient Checkpointing:</strong></h4><p>A technique to conserve memory during training by recording only select vital information. Consider it similar to jotting down key points during a lecture, rather than transcribing every word.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.mlops.consulting/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Adam&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div></div></div>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is MLOps Consulting.]]></description><link>https://blog.mlops.consulting/p/coming-soon</link><guid isPermaLink="false">https://blog.mlops.consulting/p/coming-soon</guid><dc:creator><![CDATA[Adam Knight]]></dc:creator><pubDate>Fri, 18 Aug 2023 06:52:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8QtU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcedba0c7-e317-4d2d-9be7-5415e7c04143_620x620.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is MLOps Consulting.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.mlops.consulting/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.mlops.consulting/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>