Among the many applications of LLMs, text summarization plays an important role, with use cases such as summarizing large news chunks, legal documents and reports. In this article we discuss a step-by-step approach to fine-tuning an LLM for text summarization using a news dataset.
Summarization
Summarization is the task of producing short summaries of long documents such as news articles or research papers. It is broadly of two types: extractive and abstractive summarization.
Extractive Summarization
Extractive summarization shortens large documents such as news articles, medical publications or research papers by extracting the most important sentences verbatim, without taking the overall context into account.
Abstractive Summarization
Abstractive summarization is quite different from the extractive technique above. In extractive summarization the resulting summaries may or may not be meaningful, since the process simply pulls important sentences out of long documents. In abstractive summarization, the model considers the context of the whole document and then summarizes it accordingly, so the wording of the summary need not match the source document exactly.
The key steps may be summarized as below:
We will use Google Colab for executing the Python code. For this project we will set the "Runtime" to CPU. Those who wish to replicate the project may try the GPU options as well.
We will also need the following imports to install and load PyTorch, transformers and our dataset.
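A minimal set of installs and imports (package names as per the standard PyTorch and Hugging Face distributions) could look like this:

# Install the required packages (run once in the Colab notebook)
# !pip install torch transformers datasets sentencepiece

import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from datasets import load_dataset, load_from_disk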
For the fine-tuning we use the CNN-DailyMail news summarization dataset provided on the Hugging Face hub, loaded as below:
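One way to load it (the "3.0.0" configuration is the current standard version of this dataset) is:

from datasets import load_dataset

# Download the CNN / DailyMail summarization dataset from the Hugging Face hub
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)   # DatasetDict with 'train', 'validation' and 'test' splits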
About the dataset
The dataset is in the form of a dict which at a high level is split into train, validation and test subsets, as we see below.
We focus on the train and validation subsets, and since the original dataset is large we choose 4000 observations for training and 200 observations for validation. We also save these reduced-size subsets to local storage / gdrive so that we can use them the next time we run the code without having to load the entire dataset again.
Each of the three subsets is in turn a dictionary with the following keys, including 'article' (the full news text) and 'highlights' (the reference summary).
Extract train and validation subsets and save to local storage / gdrive
The code snippet below shows how to extract the train and validation portions of the dataset and save them to local storage / gdrive with appropriate naming.
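A sketch of this step is shown below; the gdrive paths and subset names are illustrative and should be adapted to your own drive layout:

from google.colab import drive
drive.mount("/content/drive")

# Reduced-size subsets: 4000 training and 200 validation examples
train_subset = dataset["train"].select(range(4000))
val_subset = dataset["validation"].select(range(200))

# Save to gdrive so later runs can skip the full download
train_subset.save_to_disk("/content/drive/MyDrive/cnn_dm_train_4000")
val_subset.save_to_disk("/content/drive/MyDrive/cnn_dm_val_200")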
Load / Read from local storage
The code snippet below shows how to read and check one example of an article and summary from the 'train' dataset.
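A possible version of this snippet, assuming the paths used when saving above, is:

from datasets import load_from_disk

# Read the reduced subsets back from gdrive
train_data = load_from_disk("/content/drive/MyDrive/cnn_dm_train_4000")
val_data = load_from_disk("/content/drive/MyDrive/cnn_dm_val_200")

# Inspect one article / summary pair from the training subset
sample = train_data[0]
print("ARTICLE:\n", sample["article"][:500])
print("\nHIGHLIGHTS:\n", sample["highlights"])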
This results in the following output
Data cleaning
We observe from the sample text the name of the news agency within parentheses and the double hyphens. The code snippet below is a tidy function that cleans up the text and removes these artefacts from the train and validation data.
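A simple regex-based version of such a function (the exact patterns are an assumption based on the artefacts described above) could be:

import re

def tidy_text(text):
    # Remove the news-agency tag in parentheses, e.g. "(CNN)"
    text = re.sub(r"\([^)]*\)", "", text)
    # Remove the double hyphens that often follow the agency tag
    text = text.replace("--", " ")
    # Collapse any repeated whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

# Apply the cleaning to the train and validation subsets
train_data = train_data.map(lambda x: {"article": tidy_text(x["article"]),
                                       "highlights": tidy_text(x["highlights"])})
val_data = val_data.map(lambda x: {"article": tidy_text(x["article"]),
                                   "highlights": tidy_text(x["highlights"])})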
Tokenization
A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" implementations allow:
We will be using T5TokenizerFast in this example. The code below shows how to invoke the tokenizer. As an example we will select a sample text from the train dataset, pre-process it for tokenization, then pass the text to the tokenizer and view the tokenized output to understand it.
The code snippet below extracts a sample text and pre-processes it offline.
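For instance, reusing the tidy_text helper sketched above:

# Select a sample article from the training data and pre-process it
sample_text = tidy_text(train_data[0]["article"])
print(sample_text[:300])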
The code shown below instantiates the tokenizer and then passes the sample text to it.
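A sketch of this step, assuming the t5-base checkpoint and a maximum source length of 512 tokens, is:

from transformers import T5TokenizerFast

# Instantiate the fast T5 tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# T5 is a text-to-text model, so the summarization task is signalled with a prefix
source = tokenizer(
    "summarize: " + sample_text,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(type(source))   # transformers BatchEncoding object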
Notes on Batch Encoding
BatchEncoding holds the output of the PreTrainedTokenizerBase's encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace tokenizers library), this class provides in addition several advanced alignment methods which can be used to map between the original string (character and words) and the token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token).
Tokenized Output
The output is in the form of a dictionary; the code snippet below prints the output from the tokenizer.
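For instance:

# BatchEncoding behaves like a dictionary: iterate over its keys and tensors
for key, value in source.items():
    print(key)
    print(value)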
For brevity a partial output is shown below
Post Process for Model Input
The attention masks and input ids need to be fed to the model. The code below extracts them separately. The .squeeze() call removes any singleton dimensions from the tensors.
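For example:

# Extract the model inputs and drop the singleton batch dimension
input_ids = source["input_ids"].squeeze()
attention_mask = source["attention_mask"].squeeze()
print(input_ids.shape, attention_mask.shape)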
The code below may be used to print out the decoded versions of the input_ids.
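One way to do this is:

# Convert the token ids back to text to inspect what the model will see
decoded_text = tokenizer.decode(input_ids, skip_special_tokens=False)
print(decoded_text)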
For brevity, only a partial output from the above code is shown below.
Background
Creating a Custom Dataset Class
A custom Dataset class must implement three functions: __init__, __len__, and __getitem__.
The methods are explained below:
__init__
The __init__ function is run once when instantiating the Dataset object. We instantiate the tokenizer and extract the occurrences of 'article' and 'highlights' from the supplied data.
__len__
The __len__ function returns the number of samples in our dataset.
__getitem__
The __getitem__ function loads and returns a sample from the dataset at the given index idx.
Based on the index, it extracts a sample text and the corresponding highlight.
It tokenizes the text and the summary and extracts the input ids and the attention masks.
It then returns the input ids and the attention masks for both the text and the summary.
The code snippet below defines the class described in detail above
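A sketch of such a class (the name SummaryDataset and the sequence lengths are illustrative choices) is shown below:

from torch.utils.data import Dataset

class SummaryDataset(Dataset):
    """Wraps the article/highlights pairs and tokenizes them on the fly."""

    def __init__(self, data, tokenizer, source_len=512, target_len=150):
        # Store the tokenizer and extract the articles and highlights
        self.tokenizer = tokenizer
        self.articles = data["article"]
        self.highlights = data["highlights"]
        self.source_len = source_len
        self.target_len = target_len

    def __len__(self):
        # Number of samples in the dataset
        return len(self.articles)

    def __getitem__(self, idx):
        # Tokenize the article (with the T5 "summarize:" prefix) and its summary
        source = self.tokenizer("summarize: " + self.articles[idx],
                                max_length=self.source_len,
                                padding="max_length",
                                truncation=True,
                                return_tensors="pt")
        target = self.tokenizer(self.highlights[idx],
                                max_length=self.target_len,
                                padding="max_length",
                                truncation=True,
                                return_tensors="pt")
        # Squeeze away the batch dimension added by return_tensors="pt"
        return {
            "source_ids": source["input_ids"].squeeze(),
            "source_mask": source["attention_mask"].squeeze(),
            "target_ids": target["input_ids"].squeeze(),
            "target_mask": target["attention_mask"].squeeze(),
        }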
The model training involves the following key steps as mentioned below:
The code snippet below shows the custom function for model training.
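One possible implementation, assuming the batch keys produced by the Dataset class sketched above, is:

def train(epoch, model, device, loader, optimizer, tokenizer):
    """Run one training epoch over the supplied DataLoader."""
    model.train()
    for step, batch in enumerate(loader):
        source_ids = batch["source_ids"].to(device)
        source_mask = batch["source_mask"].to(device)
        target_ids = batch["target_ids"].to(device)

        # Replace padding token ids in the labels with -100 so the loss ignores them
        labels = target_ids.clone()
        labels[labels == tokenizer.pad_token_id] = -100

        outputs = model(input_ids=source_ids,
                        attention_mask=source_mask,
                        labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % 100 == 0:
            print(f"Epoch {epoch}, step {step}, loss {loss.item():.4f}")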
We pass the following inputs to the above function; they are set in the primary calling module / code portion:
The T5 model
Model input
Padding
For validating our fine-tuned model, the key steps are carried out in the validation function defined below.
Prediction / Generation Parameters
The following parameters are used in the model prediction during validation
The output from the prediction is then decoded; the decoding parameters used are as follows:
The function for the validation is coded below.
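A possible implementation, using illustrative values for the generation and decoding parameters mentioned above, is:

import torch

def validate(model, device, loader, tokenizer):
    """Generate summaries for the validation set and return predictions and targets."""
    model.eval()
    predictions, actuals = [], []
    with torch.no_grad():
        for batch in loader:
            source_ids = batch["source_ids"].to(device)
            source_mask = batch["source_mask"].to(device)
            target_ids = batch["target_ids"]

            # Generation: beam search with length and repetition control
            generated_ids = model.generate(input_ids=source_ids,
                                           attention_mask=source_mask,
                                           max_length=150,
                                           num_beams=2,
                                           repetition_penalty=2.5,
                                           length_penalty=1.0,
                                           early_stopping=True)

            # Decode predictions and reference summaries, skipping special tokens
            preds = tokenizer.batch_decode(generated_ids, skip_special_tokens=True,
                                           clean_up_tokenization_spaces=True)
            targets = tokenizer.batch_decode(target_ids, skip_special_tokens=True,
                                             clean_up_tokenization_spaces=True)
            predictions.extend(preds)
            actuals.extend(targets)
    return predictions, actuals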
Here we discuss the main driving code portions, which use the previously defined custom functions for model fine-tuning and validation.
Define model and parameters
Here we define the transformer model to be fine-tuned and the tokenizer, and also set the device. The code portion is shown below.
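For instance (the t5-base checkpoint is an illustrative choice):

import torch
from transformers import T5TokenizerFast

MODEL_NAME = "t5-base"

# Use a GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)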
Dataset Load and Pre-processing
We use the cnn_dailymail summarization dataset for abstractive summarization, and only a part of the dataset for training. The train and validation subsets are pre-processed using the custom NLP function and then tokenized. The code portions are the same as described in the section "Data Preparation". The custom Dataset class defined earlier is instantiated for train and validation; the resulting objects train_dataset and val_dataset, as per the code snippet below, are then fed as input to the DataLoader.
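For example, reusing the SummaryDataset class sketched earlier and the cleaned train_data and val_data subsets:

# Wrap the cleaned train / validation splits in the custom Dataset class
train_dataset = SummaryDataset(train_data, tokenizer, source_len=512, target_len=150)
val_dataset = SummaryDataset(val_data, tokenizer, source_len=512, target_len=150)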
DataLoader
The custom Dataset class retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval. DataLoader is an iterable that abstracts this complexity for us in an easy API.
The output from the "Dataset" is fed as input to DataLoader as shown below.
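For example (the batch size of 4 is an illustrative choice):

from torch.utils.data import DataLoader

# Batch and shuffle the training data; keep validation order fixed
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=0)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False, num_workers=0)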
num_workers tells the DataLoader instance how many sub-processes to use for data loading. If num_workers is zero (the default), data is loaded in the main process and the GPU has to wait for the CPU to load data. In theory, the greater the num_workers, the more efficiently the CPU loads data and the less the GPU has to wait.
Define and train model
We instantiate the model and optimizer and call the train function defined earlier with the code below. We iterate through the number of epochs (in this case set to 1).
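A sketch of this step (the learning rate and choice of AdamW are illustrative) is:

from transformers import T5ForConditionalGeneration

# Instantiate the pre-trained T5 model and the optimizer
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

EPOCHS = 1
for epoch in range(EPOCHS):
    train(epoch, model, device, train_loader, optimizer, tokenizer)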
Save and Load trained Model
It is good practice to save the trained model, either on a local server or, in this case, on gdrive. For further inference and validation calls we will load the model again from the saved location. This is achieved via the code snippet shown below.
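For example (the gdrive path is an assumption):

SAVE_DIR = "/content/drive/MyDrive/t5_cnn_summarizer"

# Save the fine-tuned model and tokenizer
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

# Reload them later for validation / inference
model = T5ForConditionalGeneration.from_pretrained(SAVE_DIR).to(device)
tokenizer = T5TokenizerFast.from_pretrained(SAVE_DIR)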
Validation
The validation function defined previously is called as below. The predicted values and the actual targets are returned as lists. It is good practice to save the validation results for future reference. The code snippet below calls the validation function and saves the results locally, in this case to gdrive.
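One way to do this, saving the results as a CSV with pandas, is:

import pandas as pd

# Generate predictions on the validation set and keep the reference summaries
predictions, actuals = validate(model, device, val_loader, tokenizer)

# Persist the results for later inspection (the path is an assumption)
results_df = pd.DataFrame({"predicted": predictions, "actual": actuals})
results_df.to_csv("/content/drive/MyDrive/validation_results.csv", index=False)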
Once the file is saved we can read it anytime from local storage / gdrive, even in a later run, without worrying about losing the results. The code snippet below loads the saved file, chooses a random output from the validation data and displays it.
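For instance:

import random
import pandas as pd

# Reload the saved results and display a random prediction / target pair
results_df = pd.read_csv("/content/drive/MyDrive/validation_results.csv")
idx = random.randrange(len(results_df))
print("PREDICTED SUMMARY:\n", results_df.loc[idx, "predicted"])
print("\nACTUAL SUMMARY:\n", results_df.loc[idx, "actual"])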
A sample random output from the above code is shown below
Once you have trained the model to your expected quality level and saved it, the model may be loaded at a later stage or deployed for inference.
We refer to the news chunk below, taken from the URL provided.
We have saved a chunk of news text from this link in a txt file, which we will load and run inference on with the fine-tuned model. The news chunk is shown below.
The code snippet below preprocesses this text item and generates a summary by running inference with our fine-tuned model.
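A sketch of this step (the txt file name is an assumption) is:

# Load the saved news chunk
with open("news_chunk.txt", "r") as f:
    news_text = f.read()

# Pre-process and tokenize with the T5 summarization prefix
inputs = tokenizer(
    "summarize: " + tidy_text(news_text),
    max_length=512,
    truncation=True,
    return_tensors="pt",
).to(device)

# Generate the summary with the fine-tuned model
summary_ids = model.generate(
    **inputs,
    max_length=150,
    num_beams=2,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))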
The summary is below
This concludes our example of text summarization.
CODE DOWNLOAD
Dr Anish Roychowdhury
Dr Anish Roychowdhury is a data science professional and educator with more than 20 years of career experience across industry and academia. He has taught in both full-time and part-time roles at leading B-schools and has held leadership roles in multiple organizations. He holds a Ph.D. in computational microsystems from IISc Bangalore, completed a Master's thesis in the area of microfabrication at Louisiana State University, USA, and earned an undergraduate degree from NIT Durgapur, with published research in GA-fuzzy applications to medical diagnostics.
Yugal Jain
Yugal Jain is a data science professional with 5+ years of experience across various startups and companies. He has worked on complex NLP problems for industry-specific use cases and is experienced in building end-to-end data engineering (ETL) and NLP pipelines. He holds a bachelor's degree in Computer Science from Guru Gobind Singh Indraprastha University, New Delhi.