## Introduction

LLMs are highly complex pieces of software. They organize many components into hundreds of layers and, through the mathematics behind them, develop an understanding of human languages. Training and validating LLMs and other deep learning models is complicated, and more complexity is added daily.

This guide is about fine-tuning, low-rank adaptation, and other techniques developed so a business can adapt an LLM to its custom data and run it cost-effectively. Rather than relying on long code listings or a full treatment of the math behind these models, this guide focuses on developing an intuitive understanding of the key technologies involved in adapting and using LLMs.

## Low-rank adaptation (LoRA), quantization, and fine-tuning

LLMs, or large language models, are at the heart of the current generative AI revolution. They can understand human languages, and some models even integrate speech, video, and pictures to better understand prompts from humans. They provide a wealth of information and assist people in understanding complex legal and financial documents, generating code, or simply looking up information on the internet and giving a better answer than a search engine can.

Because LLMs are driving so much innovation so quickly, most IT managers want an intuitive understanding of how to adopt these models to solve their day-to-day business problems, and of the cost of running them.

The first step in customizing an LLM for a particular business need is fine-tuning. Fine-tuning lets an LLM learn more about your business, including business processes, workflows, standard operating procedures, HR policies, how to help employees with questions about their benefits, and other helpful information.

**LoRA**, or low-rank adaptation, is one such technique: it trains an LLM with billions of parameters on your data set cost-effectively, using readily accessible, lower-cost commodity GPUs. These models were originally trained on GPUs that cost hundreds of thousands of dollars, and without LoRA even a few hours of fine-tuning on a small data set would still cost thousands of dollars.

Quantization reduces the cost further. Instead of saving each parameter as a 2-byte floating point number (fp16), the range of parameter values is divided into 16 equal quantiles, and only a 4-bit quantized value of each parameter is saved in memory, so these huge models can fit on lower-cost servers with commodity GPUs.

Before we look at these concepts in detail, let’s look at the financial aspects of LLM training and inference.

## Cost and Benefits of Customizing a Superlarge LLM

Let’s assume a company needs an LLM with performance comparable to ChatGPT and chooses to run it with fine-tuning on a private dataset within its data center.

Meta’s Llama 2 LLM, with 70 billion (70B) parameters, can go head to head with ChatGPT, which has hundreds of billions of parameters.

To run inference jobs such as chat prompts, document generation, or code generation, you will need a server-class machine with at least four A100-class GPUs (80 GB of memory each), 128 to 256 vCPUs, and plenty of main memory and high-performance storage.

The current cost of such hardware is between 75K and 100K USD. Multiple machines may be needed to support critical business operations, and the cost of support and setup is additional.

These servers can be rented on demand from various public cloud and GPU computing providers.

Renting such a machine by the hour costs an average of 10K-15K USD per month. If we can use the techniques described in this article to run the model on hardware with a single 80 GB A100 GPU, the monthly rental cost of such a server drops to about $800. That is a saving of at least 92%.
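The arithmetic behind these numbers can be checked with a few lines of Python. This is a rough sketch that counts only the memory needed to hold the model weights (it ignores activations and other runtime overhead) and uses the rental prices quoted above; the function names are ours, chosen for illustration.

```python
# Back-of-the-envelope estimate: weight memory and rental savings.
# Assumption: only the weights are counted, not activations or caches.

PARAMS = 70_000_000_000  # Llama 2 70B


def weights_gb(num_params, bits_per_param):
    """Memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def monthly_saving(old_cost_usd, new_cost_usd):
    """Fraction of the monthly bill saved by moving to the cheaper server."""
    return 1 - new_cost_usd / old_cost_usd


fp16_gb = weights_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weights_gb(PARAMS, 4)   # 4 bits per parameter

print(f"fp16 weights:  {fp16_gb:.0f} GB (needs several 80 GB GPUs)")
print(f"4-bit weights: {int4_gb:.0f} GB (fits on one 80 GB GPU)")
print(f"saving vs 10K USD/month: {monthly_saving(10_000, 800):.0%}")
print(f"saving vs 15K USD/month: {monthly_saving(15_000, 800):.0%}")
```

The 92% figure corresponds to the lower end of the rental range; at the upper end the saving is larger.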

Aside from cost savings, even if the fine-tuning dataset is only about 1,000 proprietary documents about your business, this in-house fine-tuned LLM can outperform the latest LLMs from closed-source providers on questions specific to your business.

## Fine-tuning the LLM

Fine-tuning takes the trained weights of a pre-trained LLM and continues training them on your custom data set. An LLM performs much better on a task once it has been fine-tuned on task-specific proprietary data that is otherwise unavailable in internet-based data sets.

The following is a high-level flow of the fine-tuning process.

Fine-tuning is like pre-training, except that most pre-training is done via an unsupervised learning process in which part of the text in each row of data is randomly masked to generate the training data set. In contrast, during fine-tuning, each training data row contains both the text to train on and the expected output, so the model can be taught explicitly.

Building this carefully hand-crafted dataset makes fine-tuning a tedious but advantageous process. Once it is done, the LLM has skills comparable to those of employees with access to the company’s internal data.

The following row is from a dataset used to fine-tune the T5 LLM for a grammar correction task (Kaggle provides this dataset as part of a competition). The fine-tuning dataset has a text column holding the incorrect text, with spelling or punctuation mistakes, and an expected text column holding the corrected text.

| id | language | text | Expected text |
| --- | --- | --- | --- |
| 583b46d4-8246-451d-9590-993e1d473491 | en | The team fielded the Nos. 77 and 97 in the NASCAR Xfinity Series between 2015 and 2017 | The team fielded the Nos. 77 and 97 in the NASCAR Xfinity Series between 2015 and 2017. |

When pre-processing such data, the text column is used as the ‘input’ and the expected text column as the ‘label’, and the LLM is trained to generate the label text. We deliberately hide the labels in the validation dataset and ask the LLM to predict the corrected text for unseen rows of data. During pre-processing, we also balance the dataset so it has comparable numbers of positive and negative labels for a task, and we remove invalid or malformed data rows.
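The pre-processing step can be sketched in plain Python. This is a minimal illustration, not the pipeline used for the Kaggle task: only the first row comes from the dataset shown above, the other rows are made up for the example, and the helper names (`is_well_formed`, `to_examples`, `split_train_validation`) are hypothetical.

```python
# Sketch of pre-processing: drop malformed rows, build input/label pairs,
# and hold out a validation set whose labels are hidden from the model.

RAW_ROWS = [
    {"id": "583b46d4-8246-451d-9590-993e1d473491", "language": "en",
     "text": "The team fielded the Nos. 77 and 97 in the NASCAR Xfinity Series between 2015 and 2017",
     "expected_text": "The team fielded the Nos. 77 and 97 in the NASCAR Xfinity Series between 2015 and 2017."},
    # The rows below are invented for illustration only.
    {"id": "row-2", "language": "en", "text": "She go to school every day",
     "expected_text": "She goes to school every day."},
    {"id": "row-3", "language": "en", "text": "",
     "expected_text": "Missing input text."},
    {"id": "row-4", "language": "en", "text": "He have two cats",
     "expected_text": None},
    {"id": "row-5", "language": "en", "text": "I has a question",
     "expected_text": "I have a question."},
]


def is_well_formed(row):
    """A row is usable only if both text columns are non-empty strings."""
    return all(isinstance(row.get(key), str) and row[key].strip()
               for key in ("text", "expected_text"))


def to_examples(rows):
    """Map dataset columns to the 'input' and 'label' fields used in training."""
    return [{"input": row["text"].strip(), "label": row["expected_text"].strip()}
            for row in rows if is_well_formed(row)]


def split_train_validation(examples, validation_fraction=0.2):
    """Hold out the last rows for validation (at least one row)."""
    n_val = max(1, int(len(examples) * validation_fraction))
    return examples[:-n_val], examples[-n_val:]


examples = to_examples(RAW_ROWS)
train, validation = split_train_validation(examples)
# During validation the model sees only the inputs; labels are kept for scoring.
validation_inputs = [example["input"] for example in validation]
```

A real pipeline would typically shuffle before splitting and add the label-balancing step described above; both are left out here to keep the sketch short.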

The next step is encoding. The encoding step uses a tokenizer that is specific to the LLM. The tokenizer splits the text into tokens (words or sub-parts of words) and converts each one to a numeric ID, which the LLM maps to a vector so it can consume the data and learn to predict the next correct token. Most LLMs also encode other information, such as token position, to predict the right word for a given context. An uncased tokenizer will map the words “Hello” and “HELLO” to the same token ID.
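To make the idea of an uncased tokenizer concrete, here is a toy version in plain Python. It is a deliberately simplified sketch: real LLM tokenizers use trained sub-word vocabularies, while this hypothetical `ToyUncasedTokenizer` just splits on whitespace and assigns one ID per lowercased word.

```python
# Toy uncased tokenizer: whitespace splitting plus a lowercased vocabulary.
# This only illustrates the text -> token ID mapping, not a real tokenizer.

class ToyUncasedTokenizer:
    UNKNOWN_ID = 0  # ID returned for words that are not in the vocabulary

    def __init__(self, corpus):
        words = sorted({word for line in corpus for word in line.lower().split()})
        # IDs start at 1 because 0 is reserved for unknown words.
        self.vocab = {word: index for index, word in enumerate(words, start=1)}

    def encode(self, text):
        """Lowercase the text, split it on whitespace, and look up each word."""
        return [self.vocab.get(word, self.UNKNOWN_ID)
                for word in text.lower().split()]


tokenizer = ToyUncasedTokenizer(["Hello world", "fine tuning is fun"])
print(tokenizer.encode("Hello"), tokenizer.encode("HELLO"))
print(tokenizer.encode("hello llama"))  # 'llama' is unknown to this vocabulary
```

Because the text is lowercased before lookup, both spellings of “Hello” produce the same ID, which is the behavior described above.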

In the model download step, we specify which model the code should fetch and in what size. For example:

`model = TFMT5ForConditionalGeneration.from_pretrained("google/mt5-small")`

The code snippet downloads the small variant of Google mT5 from the Hugging Face model catalog. See the model documentation (https://huggingface.co/docs/transformers/model_doc/mt5) for details.

Most LLMs come with a specialized model class for fine-tuning that takes care of the details of which sets of weights are amended and when, and most gradually unfreeze and update weights as fine-tuning progresses and the model becomes more accurate.

In the case of mT5 fine-tuning, the model class TFMT5ForConditionalGeneration handles the lower-level weight adjustment and the freezing and unfreezing of the appropriate weight layers, and adjusts all weights at the end of fine-tuning.

## Low-Rank Adaptation (LoRA)

LLMs are built on layers of matrices that hold words and sentences in vectorized form, and those matrices are huge. One of the earliest LLMs (BERT base-uncased) has an input embedding matrix of 30522 x 768, and each cell contains a two- or four-byte floating point number. As data passes through the model’s layers, hundreds of other matrices are created to hold layer weights, layer inputs, and layer outputs. All of this has to be kept in memory, and the more GPU memory a model consumes, the more expensive and time-consuming it is to train and run. That is where low-rank adaptation becomes very useful.
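As a quick sanity check on those sizes, the short calculation below works out how much memory that single embedding matrix takes. It is plain arithmetic on the dimensions quoted above; the helper name `size_mib` is ours.

```python
# Memory footprint of one 30522 x 768 embedding matrix.

VOCAB_SIZE, HIDDEN_SIZE = 30522, 768
cells = VOCAB_SIZE * HIDDEN_SIZE  # number of floating point values


def size_mib(num_cells, bytes_per_cell):
    """Size in mebibytes (2**20 bytes) for the given storage width."""
    return num_cells * bytes_per_cell / 2**20


print(f"cells:        {cells:,}")
print(f"fp32 storage: {size_mib(cells, 4):.1f} MiB")
print(f"fp16 storage: {size_mib(cells, 2):.1f} MiB")
```

This is one matrix in a comparatively small model; the same arithmetic applied to every layer of a model with billions of parameters is what drives the GPU memory requirements discussed earlier.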

Instead of using a full-blown matrix, how about using a lower-resolution matrix that still carries nearly the same information? These matrices are called low-rank matrices, and there are well-known techniques for creating a lower-rank approximation of any matrix. One such technique is singular value decomposition (SVD). SVD takes a large matrix and returns component matrices that can be used to build a lower-rank matrix at any desired resolution.
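The following sketch shows a low-rank approximation with NumPy's `np.linalg.svd`. The matrix here is synthetic (random data built to be approximately rank 5), and the sizes and the helper name `low_rank_approx` are assumptions made for the example.

```python
# Low-rank approximation of a matrix via truncated SVD.
import numpy as np


def low_rank_approx(matrix, rank):
    """Keep only the `rank` largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]


rng = np.random.default_rng(0)
# Synthetic 100 x 80 matrix: a rank-5 signal plus a little noise.
signal = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
matrix = signal + 0.01 * rng.standard_normal((100, 80))

approx = low_rank_approx(matrix, rank=5)
relative_error = np.linalg.norm(matrix - approx) / np.linalg.norm(matrix)

full_values = matrix.size           # values stored by the full matrix
low_rank_values = 5 * (100 + 80 + 1)  # truncated U, V^T and singular values

print(f"relative error: {relative_error:.4f}")
print(f"values stored: {full_values} (full) vs {low_rank_values} (rank 5)")
```

The approximation stores roughly a ninth of the values while losing very little information, because this synthetic matrix was close to rank 5 to begin with. How well this works on real data depends on how much linear dependence the data contains.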

Companies like Facebook take full advantage of SVD to save memory and compute resources, since their algorithms serve millions of users. Take, for example, Facebook’s facial recognition feature for uploaded images. To quickly match a face against millions of other face pictures, a Facebook algorithm might create a huge matrix to hold facial image data. Such a matrix would store each user image as one very long column with 1 million pixels, and would have 1 million columns representing 1 million user pictures.

The figure below shows one such matrix.

If each image pixel is represented by 1 byte (256 colors or shades of gray), this matrix takes up one terabyte and will not fit in the memory of most available computing hardware. With SVD, Facebook engineers can reduce the rank of such a matrix to bring it down to a more manageable size of 4 GB, and then test whether the facial recognition algorithm is still accurate. Such a reduced matrix is called a lower-rank matrix. The rank of a matrix is a simple numerical value: the count of its linearly independent columns. For example, the rank of the matrix

$$
\begin{bmatrix} 2 & 4 & 6 \\ 4 & 8 & 12 \\ 6 & 12 & 18 \end{bmatrix}
$$

is 1, because columns 2 and 3 can be obtained by multiplying column 1 by 2 and 3 respectively, so they are not linearly independent of it. A lower-rank matrix

$$
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
$$

can still represent the matrix above, as long as the multipliers 1, 2, and 3 for the three columns are also stored.
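The rank-1 example can be verified in a few lines of Python: the full 3 x 3 matrix is rebuilt from a single column and one multiplier per column. The variable names are ours.

```python
# Rebuild the 3 x 3 rank-1 matrix from one column and three multipliers.

column = [2, 4, 6]       # the only linearly independent column
multipliers = [1, 2, 3]  # column j of the full matrix is column * multipliers[j]

rebuilt = [[value * m for m in multipliers] for value in column]

for row in rebuilt:
    print(row)
```

Six stored numbers are enough to recover all nine entries. The gap grows quickly with matrix size, which is what makes low-rank representations attractive for the very large matrices inside an LLM.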

Tech giants like Facebook, Google, and Microsoft save hundreds of millions of dollars annually using SVD and similar low-rank matrix methods. The low-rank adaptation technique works similarly for LLM training.

The authors of the LoRA technique (https://arxiv.org/abs/2106.09685) hypothesized that the weight changes learned from a fine-tuning dataset contain many linear dependencies, so the matrices that represent those changes can have a far lower rank than the weight matrices used in pre-training. That is why they proposed freezing the pre-trained weight matrix W0, with d rows and k columns, and applying the gradients calculated during fine-tuning only to two smaller matrices, B[d,r] and A[r,k], where r is a number between 1 and the rank of the weight matrix W0[d,k] (in practice much smaller than d and k), using the following formula:

$$
h = W_0 x + \Delta W x = W_0 x + B A x
$$

The saving comes from the fact that storing and updating A and B takes far less memory than storing and updating a change matrix ∆W with the same dimensions as the weight matrix. For example, if the weight matrix W0 is [2048, 2048] and r is chosen as 4 for LoRA, the A and B matrices are much smaller, A[4, 2048] and B[2048, 4], and need far less GPU memory.
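The formula and the size comparison can be sketched with NumPy as follows. The weights here are random placeholders rather than real model weights, and the function name `lora_forward` is ours. Initializing B to zero and A to small random values follows the LoRA paper, so the adapted layer starts out identical to the frozen one.

```python
# LoRA forward pass h = W0 x + B A x, with the sizes from the example above.
import numpy as np

d, k, r = 2048, 2048, 4
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k)).astype(np.float32)  # frozen pre-trained weights
A = (0.01 * rng.standard_normal((r, k))).astype(np.float32)  # trainable
B = np.zeros((d, r), dtype=np.float32)                       # trainable, starts at zero
x = rng.standard_normal(k).astype(np.float32)


def lora_forward(W0, A, B, x):
    """Compute W0 x + B (A x) without ever building the d x k matrix B A."""
    return W0 @ x + B @ (A @ x)


h = lora_forward(W0, A, B, x)

full_params = d * k          # values in a full change matrix
lora_params = d * r + r * k  # values in A and B together

print(f"full change matrix: {full_params:,} values")
print(f"LoRA A and B:       {lora_params:,} values")
print(f"reduction:          {full_params // lora_params}x")
```

Computing `B @ (A @ x)` as two small products is the design choice that keeps both memory and compute low: the intermediate result has only r entries.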

## Cost Optimized LLM fine-tuning

The core concepts of the LoRA method are:

Parameter-efficient tuning: We do not have to update every parameter weight in every model layer during fine-tuning, as this consumes a lot of time and GPU power. These billions of parameters were needed for the huge pre-training data set. For fine-tuning, we can freeze most of the weights and modify only a small set of trainable parameters, which are used to generate inferences and calculate the loss.

The figure below shows a simplified view of the trainable parameters used to make inferences for loss calculation during the fine-tuning loop.

Since a pre-trained model is trained on several tasks and fine-tuning is very task-specific, several methods have been proposed to freeze irrelevant parameters and update only the required ones (see https://arxiv.org/pdf/2104.08691.pdf for details).

LoRA for memory-efficient weight updates: When updating the weights of a trainable layer of the LLM, instead of updating the weight matrix directly, we can train two lower-rank matrices (A and B) whose product forms a change matrix. Because the change matrix has the same dimensions as the weight matrix, we can add it to the original weight matrix, and we only need to do so once, at the end of the training loop. During the fine-tuning loop, the model calculates the loss value to determine how far its prediction is from the expected output, and then adjusts the lower-rank matrices A and B rather than the full weight matrix of the layer. When the loss is acceptable, the change matrix is used to update the weight matrix a single time. This saves billions of matrix multiplication operations in each fine-tuning exercise.

The figure below shows the modified training loop of LoRA fine-tuning.

During this loop, the A and B matrices of every trainable hidden layer relevant to the fine-tuning data are updated. We avoid the compute-expensive matrix multiplication involving the full weight change matrix and the pre-trained weight matrix of the LLM.
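The loop described above can be sketched on a toy problem with NumPy. This is not an LLM: it is a single linear layer trained on synthetic data, with gradients written out by hand, and the sizes, learning rate, and step count are assumptions chosen for the example. It shows the three ideas from this section: W0 stays frozen, only A and B receive gradient updates, and the change matrix is merged into the weights once at the end.

```python
# Toy LoRA training loop on one linear layer with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 16, 16, 2, 256  # output dim, input dim, LoRA rank, samples

W0 = rng.standard_normal((d, k))  # frozen "pre-trained" weights
W0_before = W0.copy()
# The fine-tuning task differs from W0 by a low-rank change (by construction).
delta_true = (0.5 * rng.standard_normal((d, r))) @ (0.5 * rng.standard_normal((r, k)))
X = rng.standard_normal((n, k))
Y = X @ (W0 + delta_true).T

A = rng.standard_normal((r, k)) / np.sqrt(k)  # trainable, random start
B = np.zeros((d, r))                          # trainable, zero start


def loss_and_grads(A, B):
    """Mean squared error loss and its gradients with respect to A and B."""
    H = X @ A.T                 # (n, r): inputs projected to rank r
    E = X @ W0.T + H @ B.T - Y  # (n, d): prediction error
    loss = 0.5 * np.sum(E ** 2) / n
    G = E / n                   # gradient of the loss w.r.t. the predictions
    grad_B = G.T @ H            # (d, r)
    grad_A = (G @ B).T @ X      # (r, k)
    return loss, grad_A, grad_B


initial_loss, _, _ = loss_and_grads(A, B)
learning_rate = 0.02
for _ in range(1000):
    _, grad_A, grad_B = loss_and_grads(A, B)
    A -= learning_rate * grad_A  # only the low-rank matrices are updated
    B -= learning_rate * grad_B
final_loss, _, _ = loss_and_grads(A, B)

# Merge the change matrix into the weights once, after training.
W_merged = W0 + B @ A

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.6f}")
```

After the merge, inference uses a single weight matrix of the original size, so the fine-tuned layer costs no more to run than the pre-trained one.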

## 4-bit quantization of LLM

4-bit quantization is a technique to compress all matrices in a pre-trained LLM.

Most LLMs use 2-byte FP16 or 4-byte FP32 floating point numbers as values for embedding vectors and layer weights. FP32 has a very wide range (roughly ±3.4 × 10^38) and can represent both very small and very large values. FP16, or half-precision, floats have a range of ±65,504.

For 4-bit quantization, the FP16 floats are normalized to values between 0 and 1, and this range is then divided into 16 equidistant buckets. So, during inference, instead of storing the FP16 value, a 4-bit bucket index for the value is stored in memory. This bucket index is mapped back to an approximate FP16 value during training or fine-tuning, but for all other work, just this highly compressed 4-bit value is used.
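The bucket scheme described above can be sketched in plain Python. This is a simplified uniform quantizer written for illustration, with hypothetical function names and made-up weight values. It is not the NF4 data type used in the QLoRA paper, which places its 16 levels according to a normal distribution and quantizes weights in blocks.

```python
# Simplified 4-bit quantization: normalize to [0, 1], then 16 equal buckets.

def quantize_4bit(values):
    """Return one bucket index (0..15) per value, plus the range used."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero for constant input
    codes = []
    for value in values:
        normalized = (value - low) / span            # now in [0, 1]
        codes.append(min(int(normalized * 16), 15))  # bucket index 0..15
    return codes, low, high


def dequantize_4bit(codes, low, high):
    """Map each bucket index back to the value at the center of its bucket."""
    span = high - low
    return [low + (code + 0.5) / 16 * span for code in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.75]  # made-up example weights
codes, low, high = quantize_4bit(weights)
restored = dequantize_4bit(codes, low, high)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print(f"largest rounding error: {max_error:.4f}")
```

Each weight now needs 4 bits instead of 16, at the price of a rounding error of at most half a bucket width. Two codes can be packed into one byte for storage.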

When this 4-bit quantization technique was introduced, the original authors ran exhaustive tests on LLMs with 1 to 20 billion parameters and found little or no degradation in inference performance for a large number of LLMs across various benchmarks. See details in the paper QLoRA: Efficient Finetuning of Quantized LLMs.

QLoRA simply loads the pre-trained LLM using quantization, which significantly compresses its GPU memory footprint, and then fine-tunes it using the LoRA technique. During LoRA fine-tuning, the trainable parameters are kept as half-precision floats so their weights can be adjusted using regular gradient descent, with the gradients from the loss applied via the lower-rank matrices. Since the trainable parameters are a tiny percentage of the overall pre-trained weights, no additional GPUs with more memory are needed.

Testing Results of LoRA Techniques

## Coding for QLoRA, Libraries, and other resources

LoRA, fine-tuning, and quantization techniques are well-documented in several sample notebooks.

Some useful libraries include the Supervised Fine-tuning Trainer (**SFTTrainer**) from the Hugging Face TRL library and the **bitsandbytes** library, which does the actual quantization work. The parameter-efficient fine-tuning library (**PEFT**, from the Hugging Face ecosystem) has the LoRA config and regularization parameters to initialize the model for LoRA and select the appropriate layers of the model where low-rank linear algebra will be used.
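As a rough sketch of how these libraries fit together, the function below loads a model in 4-bit form with `bitsandbytes` (through the `transformers` integration) and wraps it with LoRA adapters from `peft`. Everything specific here is an assumption for illustration: the hyperparameter values, the `target_modules` names (which depend on the model architecture; `q_proj` and `v_proj` are the attention projection names in Llama-style models), and the function name `load_qlora_model`. It is not the exact configuration used in our labs, and running it requires a GPU with these libraries installed.

```python
# Sketch: load a 4-bit quantized model and attach LoRA adapters.
# All hyperparameter values below are illustrative assumptions.

LORA_KWARGS = dict(
    r=8,                                  # rank of the A and B matrices
    lora_alpha=16,                        # scaling applied to the LoRA update
    lora_dropout=0.05,                    # regularization on the LoRA path
    target_modules=["q_proj", "v_proj"],  # architecture-dependent layer names
    task_type="CAUSAL_LM",
)


def load_qlora_model(model_id):
    """Return a 4-bit quantized model with trainable LoRA adapters attached."""
    # Imported here so this file can be read without the GPU libraries installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights as 4-bit values
        bnb_4bit_quant_type="nf4",             # data type from the QLoRA paper
        bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

The returned model can then be handed to a trainer such as SFTTrainer together with the fine-tuning dataset. Only the LoRA adapter weights are trained, while the quantized base weights stay frozen.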

In our labs, we have fine-tuned the T5 v1.1 11B model and Mistral 13B using these techniques with great results. Work on the 70B Llama 2 model is in progress.

## Conclusion

It is possible to run LLMs with billions of parameters efficiently on commodity GPUs while still harnessing all their AI power.

## About Author

Ajmal Mahmood is the chief engineer for High Plains Computing, where he leads the platform engineering, Kubernetes, and AI/ML engineering teams.

He can be reached at ajmal.mahmood@highplains.io.