A Step by Step Guide to Fine Tuning LLMs for Text Summarization Using Hugging Face

Background

Among the many applications of LLMs, text summarization plays an important role, with uses ranging from condensing large volumes of news to summarizing legal documents and reports. In this article we discuss a step-by-step approach to fine-tuning an LLM for text summarization using a news dataset.

About Text Summarization

Summarization

Summarization is the task of producing short summaries of long documents such as news articles or research papers. It comes in two main types: extractive and abstractive summarization.

Extractive Summarization

Extractive summarization shortens long documents such as news articles, medical publications or research papers by pulling out the most important sentences verbatim, without rephrasing them or modelling the broader context.

Abstractive Summarization

Abstractive summarization is quite different from the extractive technique above. Extractive summaries may or may not read well because they are simply a selection of important sentences lifted from the document. An abstractive summary, in contrast, tries to take the context of the whole document into account and then rewrite it, so the wording of the summary need not match the original text.

Key Process Steps

The key steps may be summarized as follows:

  • Setting up the environment
  • Dataset handling and loading
  • Pre Processing
  • Training
  • Validation
  • Inference

Setting up the environment

We will use Google Colab to execute the Python code. For this project we set the runtime to CPU; those who wish to replicate the project may try the GPU options as well.

We also need the following installs and imports to set up PyTorch, Transformers and our dataset.
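
The original installation cell is not reproduced here; a minimal setup along the following lines should work (the package versions are an assumption and may need pinning):

    # Install the required libraries (run once in Colab; versions are indicative).
    # !pip install torch transformers datasets sentencepiece

    import torch
    from torch.utils.data import Dataset, DataLoader
    from transformers import T5ForConditionalGeneration, T5TokenizerFast
    from datasets import load_dataset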

Data set handling and loading

For the fine-tuning we use the CNN/DailyMail summarization dataset provided on the Hugging Face Hub, loaded as shown below.
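
A sketch of the dataset load; the "3.0.0" configuration name is an assumption, as the original snippet is not shown:

    # Download the CNN/DailyMail summarization dataset from the Hugging Face Hub.
    dataset = load_dataset("cnn_dailymail", "3.0.0")
    print(dataset)  # prints the train / validation / test splits and their sizes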

About the data set

The dataset takes the form of a dictionary which, at a high level, is split into train, validation and test subsets.

We focus on the train and validation subsets, and since the original dataset is large we choose 4,000 observations for training and 200 observations for validation. We also save these reduced subsets to local storage / Google Drive so that on subsequent runs we can reuse them without loading the entire dataset again.

Each of the three subsets is in turn a dictionary with the following keys:

  • id: a string containing the hexadecimal SHA-1 hash of the URL from which the story was retrieved
  • article: a string containing the body of the news article
  • highlights: a string containing the highlight of the article as written by the article author

Extract train and validation subsets and save to local storage / gdrive

The code snippet below shows how to extract the train and validation portions of the dataset and save them to local storage / Google Drive with appropriate names.
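
A possible version of that snippet, assuming Google Drive is already mounted at /content/drive (the paths and folder names are illustrative):

    # Keep only 4000 training and 200 validation examples.
    train_subset = dataset["train"].select(range(4000))
    val_subset = dataset["validation"].select(range(200))

    # Save the reduced subsets to Google Drive so later runs can skip the full download.
    train_subset.save_to_disk("/content/drive/MyDrive/cnn_dm_train_4000")
    val_subset.save_to_disk("/content/drive/MyDrive/cnn_dm_val_200")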

Load / Read from local storage

The code snippet below shows how to read the saved subsets back and inspect one example article and summary from the train set.
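
A sketch of the reload-and-inspect step, reusing the illustrative paths from above:

    from datasets import load_from_disk

    train_data = load_from_disk("/content/drive/MyDrive/cnn_dm_train_4000")
    val_data = load_from_disk("/content/drive/MyDrive/cnn_dm_val_200")

    # Look at one article/summary pair from the training subset.
    sample = train_data[0]
    print(sample["article"][:500])
    print(sample["highlights"])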

This results in the following output

Data preparation

Data cleaning

From the sample text we can see the name of the news agency in parentheses followed by double hyphens. The code snippet below defines a tidy function to clean up the text and remove them from the train and validation data.
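
The original tidy function is not reproduced here; the sketch below shows one way to strip the agency tag and double hyphens (the exact regular expressions are assumptions):

    import re

    def clean_text(text):
        """Remove leading agency tags such as "(CNN)  -- " and stray double hyphens."""
        text = re.sub(r"\(.*?\)\s*--\s*", "", text)   # "(CNN) -- " style prefixes
        text = text.replace("--", " ")                # any remaining double hyphens
        return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

    # Apply the cleanup to the article field of both subsets.
    train_data = train_data.map(lambda ex: {"article": clean_text(ex["article"])})
    val_data = val_data.map(lambda ex: {"article": clean_text(ex["article"])})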

Tokenization

A tokenizer is in charge of preparing the inputs for a model. The Transformers library contains tokenizers for all of its models. Most tokenizers are available in two flavours: a full Python implementation and a "Fast" implementation backed by the Rust-based 🤗 Tokenizers library. The "Fast" implementations allow:

  • A significant speed-up in particular when doing batched tokenization
  • Additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).

We will be using T5TokenizerFast in this example. The code below shows how to invoke the tokenizer: we select a sample text from the train dataset, pre-process it, pass it to the tokenizer, and then inspect the tokenized output.

The code snippet below extracts a sample text, pre-processes it, and then passes it to the tokenizer after instantiating the tokenizer.
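
A sketch of that call; max_length=512 and padding to the maximum length are assumptions chosen to match the notes below:

    # Instantiate the fast T5 tokenizer and tokenize one cleaned sample article.
    tokenizer = T5TokenizerFast.from_pretrained("t5-base")
    sample_text = clean_text(train_data[0]["article"])

    tokenized = tokenizer(
        sample_text,
        max_length=512,        # truncate anything longer than this
        truncation=True,
        padding="max_length",  # pad shorter inputs with the <pad> token
        return_tensors="pt",   # return PyTorch tensors
    )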

Notes:

  • We use the pre-trained tokenizer "t5-base" from the T5 family
  • The max_length parameter truncates the content to the specified length
  • Padding ensures shorter inputs are padded with a padding token
  • The output is returned as a PyTorch tensor

Notes on Batch Encoding

BatchEncoding holds the output of the PreTrainedTokenizerBase's encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).

Tokenized Output

The output is in the form of a dictionary; the code snippet below prints the output from the tokenizer.
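
A minimal way to inspect it:

    # BatchEncoding behaves like a dictionary.
    print(tokenized.keys())            # input_ids, attention_mask
    print(tokenized["input_ids"])
    print(tokenized["attention_mask"])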

For brevity a partial output is shown below

Post Process for Model Input

The attention mask and input IDs need to be fed to the model. The code below extracts them separately; .squeeze() reduces the dimensionality of the tensors where possible by dropping dimensions of size one.
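
For example:

    # .squeeze() drops the batch dimension of size 1, leaving 1-D tensors.
    input_ids = tokenized["input_ids"].squeeze()
    attention_mask = tokenized["attention_mask"].squeeze()
    print(input_ids.shape, attention_mask.shape)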

The code below may be used to print the decoded version of the input_ids.
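
For example:

    # Convert the token IDs back into text, keeping special tokens such as <pad> visible.
    decoded_text = tokenizer.decode(input_ids, skip_special_tokens=False)
    print(decoded_text[:300])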

For brevity only a partial output is shown.

Custom Dataset class for data loading for training

Background

  • We ideally want our dataset code to be decoupled from our model training code for better readability and modularity.
  • PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data.
  • Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

Creating a Custom Dataset Class

A custom Dataset class must implement three functions: __init__, __len__, and __getitem__.

The methods are explained below:

__init__

The __init__ function is run once when instantiating the Dataset object. It stores the tokenizer and extracts the 'article' and 'highlights' entries from the supplied data.

__len__

The __len__ function returns the number of samples in our dataset.

__getitem__

The __getitem__ function loads and returns a sample from the dataset at the given index idx.

Based on the index, it extracts a sample article and the corresponding highlight, tokenizes both, and returns the input IDs and attention masks for the text and the summary.

The code snippet below defines the class described in detail above
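
A sketch of such a class, consistent with the description above; the class name, dictionary keys and maximum lengths are illustrative choices rather than necessarily those of the original code:

    class SummaryDataset(Dataset):
        """Wraps the article/highlights pairs and tokenizes them on the fly."""

        def __init__(self, data, tokenizer, source_len=512, summary_len=128):
            self.tokenizer = tokenizer
            self.source_len = source_len
            self.summary_len = summary_len
            self.articles = data["article"]
            self.highlights = data["highlights"]

        def __len__(self):
            return len(self.articles)

        def __getitem__(self, idx):
            source = self.tokenizer(
                self.articles[idx],
                max_length=self.source_len,
                truncation=True,
                padding="max_length",
                return_tensors="pt",
            )
            target = self.tokenizer(
                self.highlights[idx],
                max_length=self.summary_len,
                truncation=True,
                padding="max_length",
                return_tensors="pt",
            )
            return {
                "source_ids": source["input_ids"].squeeze(),
                "source_mask": source["attention_mask"].squeeze(),
                "target_ids": target["input_ids"].squeeze(),
                "target_mask": target["attention_mask"].squeeze(),
            }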

Model Training

The model training involves the following key steps:

  • The train function loops through the data loader
  • Each batch is loaded to the device (CPU or GPU)
  • y_ids: all target token IDs except the last one; this becomes the decoder input
  • lm_labels: all target token IDs except the first (start) token; this becomes the label for the loss function
  • Any padding tokens in the labels are replaced with -100, since the internal loss computation ignores that ID
  • The source IDs and source masks are moved to the device
  • The model is invoked
  • The loss is printed every 10th step (i.e. every 10 batches)
  • The weights are optimized by back-propagating the loss

The code snippet below shows the custom function for model training.
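
A sketch of the training function following the steps listed above (the argument order and the printing frequency are assumptions):

    def train(epoch, model, tokenizer, loader, optimizer, device):
        model.train()
        for step, batch in enumerate(loader):
            y = batch["target_ids"].to(device)
            y_ids = y[:, :-1].contiguous()   # decoder input: drop the last token
            lm_labels = y[:, 1:].clone()     # labels: drop the start token
            lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # ignore padding in the loss

            source_ids = batch["source_ids"].to(device)
            source_mask = batch["source_mask"].to(device)

            outputs = model(
                input_ids=source_ids,
                attention_mask=source_mask,
                decoder_input_ids=y_ids,
                labels=lm_labels,
            )
            loss = outputs.loss

            if step % 10 == 0:
                print(f"epoch {epoch} | step {step} | loss {loss.item():.4f}")

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()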

We pass the following inputs to the above function; they are set or called from the main driver code:

  • epoch: In the current example we train for only 1 epoch
  • model: We fine-tune the base model "t5-base" from the T5 family
  • tokenizer: Here we use T5TokenizerFast
  • loader: The DataLoader described later will be passed here
  • optimizer: We use the AdamW optimizer
  • device: CPU or GPU, as chosen

The T5 model

  • T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format.
  • It is trained using teacher forcing. This means that for training we always need an input sequence and a target sequence.
  • The input sequence is fed to the model using input_ids.
  • The target sequence is shifted to the right, i.e. prepended by a start-sequence token and fed to the decoder using the decoder_input_ids.
  • In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the labels.
  • The PAD token is hereby used as the start-sequence token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.

Model input

  • labels represent the desired output and have two uses: as decoder_input_ids and as labels for the loss function.
  • These two are identical except labels do not include the right-shift token at the start. Therefore, we create two copies of encodings, one for decoder input and one for loss labels.
  • We remove the starting right-shift token from labels as this token is not part of the expected output.
  • We then remove the last token from decoder_input_ids to equalize tensor sizes.

Padding

  • Frequently, model inputs are padded to some maximum length to ensure consistent tensor sizes.
  • This is accomplished by appending padding tokens to the inputs.
  • These tokens need to be excluded from loss calculations.
  • Hugging Face's loss functions are defined to ignore the ID -100 during loss calculations.
  • Therefore, we need to convert all padding token IDs in labels to -100.

Model Validation

For validating our fine-tuned model, the validation function defined below performs the following key steps:

  • Put the model in evaluation mode
  • Loop through the validation data loader
  • Extract the source IDs, source mask and target IDs
  • Generate predictions with the model using the generation parameters described below
  • Decode the predictions and the labels using the decoding parameters described below
  • Extend a running list with the predictions at each iteration
  • Return the accumulated lists at the end

Prediction / Generation  Parameters

The following parameters are used in the model prediction during validation

  • input_ids: the validation-set input token IDs
  • attention_mask: the attention mask for the input tokens
  • max_length (int, optional, defaults to model.config.max_length) — The maximum length of the sequence to be generated
  • num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
  • repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty.
  • length_penalty (float, optional, defaults to 1.0) — Exponential penalty applied to the sequence length during beam search. The beam score is divided by the sequence length raised to this power, so values > 0.0 encourage longer sequences and values < 0.0 encourage shorter ones; 0.0 means no length penalty.

The output from the prediction is then decoded; the decoding parameters used are as follows:

  • skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
  • clean_up_tokenization_spaces (bool, optional, defaults to True) — Whether or not to clean up the tokenization spaces

The validation function is coded as below.
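
A sketch of the validation function following these steps; the specific generation values (num_beams=2, repetition_penalty=2.5, length_penalty=1.0, max_length=128) are illustrative rather than the article's exact settings:

    def validate(model, tokenizer, loader, device, max_summary_len=128):
        model.eval()
        predictions, actuals = [], []
        with torch.no_grad():
            for batch in loader:
                source_ids = batch["source_ids"].to(device)
                source_mask = batch["source_mask"].to(device)
                target_ids = batch["target_ids"].to(device)

                generated_ids = model.generate(
                    input_ids=source_ids,
                    attention_mask=source_mask,
                    max_length=max_summary_len,
                    num_beams=2,
                    repetition_penalty=2.5,
                    length_penalty=1.0,
                    early_stopping=True,
                )

                preds = tokenizer.batch_decode(
                    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                targets = tokenizer.batch_decode(
                    target_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                predictions.extend(preds)
                actuals.extend(targets)
        return predictions, actuals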

Main Driver code

Here we discuss the main driver code, which uses the custom functions described above for model fine-tuning and validation.

Define model and parameters

Here we define the transformer model to be fine-tuned and the tokenizer, and we set the device. The code portion is shown below.
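
A sketch of this setup; the model itself is instantiated in the "Define and train model" section below, so here we only fix the checkpoint name, the tokenizer and the device:

    MODEL_NAME = "t5-base"

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)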

Dataset Load and Pre processing

We use the cnn_dailymail summarization dataset for abstractive summarization, and only a part of it for training. The train and validation subsets are pre-processed using the custom cleaning function and then tokenized; these code portions are the same as described in the section "Data Preparation". The custom Dataset class defined earlier is instantiated for train and validation, and the resulting train_dataset and val_dataset objects, as per the code snippet below, are then fed as input to the DataLoader.
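
Instantiating the custom Dataset class might look like this (maximum lengths as assumed earlier):

    train_dataset = SummaryDataset(train_data, tokenizer, source_len=512, summary_len=128)
    val_dataset = SummaryDataset(val_data, tokenizer, source_len=512, summary_len=128)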

DataLoader

The Custom Data Set class retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval. DataLoader is an iterable that abstracts this complexity for us in an easy API.

The output from the "Dataset" is fed as input to DataLoader as shown below.

num_workers tells the DataLoader instance how many sub-processes to use for data loading. If num_workers is zero (the default) the GPU has to wait for the CPU to load the data; in theory, the greater num_workers is, the more efficiently the CPU can load data and the less the GPU has to wait.

Define and train model

We instantiate the model and optimizer and call the train function defined earlier with the code below, iterating over the number of epochs (in this case set to 1).
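
A sketch of this step; the learning rate of 1e-4 is an assumption:

    model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    EPOCHS = 1
    for epoch in range(EPOCHS):
        train(epoch, model, tokenizer, train_loader, optimizer, device)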

Save and Load trained Model

It is good practice to save the trained model, either on a local server or, in this case, to Google Drive. For further inference and validation we will load the model again from the saved location. This is achieved with the code snippet below.
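
One way to do this with the Transformers save/load helpers (the drive folder is illustrative):

    save_dir = "/content/drive/MyDrive/t5_cnn_dm_finetuned"
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)

    # Later, or in a new session, reload the fine-tuned model from the same folder.
    model = T5ForConditionalGeneration.from_pretrained(save_dir).to(device)
    tokenizer = T5TokenizerFast.from_pretrained(save_dir)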

Validation

The validation function defined previously is called as below; the predicted values and the actual targets are returned as lists. It is good practice to save the validation results for future reference, so the code snippet below calls the validation function and saves the results locally, in this case to Google Drive.
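
A sketch of that call, saving the results as a CSV file (the file name and format are assumptions):

    import pandas as pd

    predictions, actuals = validate(model, tokenizer, val_loader, device)

    results_df = pd.DataFrame({"predicted_summary": predictions,
                               "actual_summary": actuals})
    results_df.to_csv("/content/drive/MyDrive/validation_results.csv", index=False)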

Once the file is saved we can read it at any time from local / Google Drive storage, even in a later run, without worrying about losing the results. The code snippet below loads the saved file, chooses a random example from the validation data and displays it.
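
For example, reusing the illustrative file name from above:

    import random

    results_df = pd.read_csv("/content/drive/MyDrive/validation_results.csv")

    idx = random.randrange(len(results_df))
    print("PREDICTED:", results_df.loc[idx, "predicted_summary"])
    print("ACTUAL:   ", results_df.loc[idx, "actual_summary"])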

A sample random output from the above code is shown below

Inferencing

Once you have trained the model to your expected quality level and saved it, the model may be loaded at a later stage or deployed for inference.

We refer to the news chunk below, taken from the URL provided.

We have saved a chunk of news text from this link in a .txt file, which we will load and summarize with the fine-tuned model. The news chunk is shown below.

The code snippet below pre-processes this text and generates a summary by running inference with our fine-tuned model.
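
A sketch of the inference step, reusing the cleaning function and generation settings from earlier (the text file path is illustrative):

    # Load the saved news text, clean it and summarize it with the fine-tuned model.
    with open("/content/drive/MyDrive/news_sample.txt", encoding="utf-8") as f:
        news_text = f.read()

    inputs = tokenizer(
        clean_text(news_text),
        max_length=512,
        truncation=True,
        return_tensors="pt",
    ).to(device)

    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=2,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True,
    )
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))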

The summary is below

This concludes our example for Text Summarization

CODE DOWNLOAD


Author Bio's

Dr Anish Roychowdhury

Dr Anish Roychowdhury is a data science professional and educator with over 20 years of experience across industry and academia. He has taught in both full-time and part-time roles at leading B-schools and has held leadership roles in multiple organizations. He holds a Ph.D. in computational microsystems from IISc Bangalore, a Master's degree from Louisiana State University, USA, with thesis work in microfabrication, and an undergraduate degree from NIT Durgapur, and has published research on GA-fuzzy applications to medical diagnostics.

Yugal Jain

Yugal Jain is a data science professional with 5+ years of experience across startups and established companies. He has worked on complex NLP problems for industry-specific use cases and is experienced in building end-to-end data engineering (ETL) and NLP pipelines. He holds a bachelor's degree in Computer Science from Guru Gobind Singh Indraprastha University, New Delhi.
