Text Summarization using LLMs


Text summarization has become an integral task in Natural Language Processing (NLP), allowing us to distill lengthy passages of text into concise and accurate summaries. It finds applications in various domains, from healthcare and legal documents to news articles and other long-form text.

In this blog, we’ll delve into text summarization using Large Language Models (LLMs), an innovative approach that has revolutionized how summaries are generated. Before the rise of LLMs, we relied on techniques like TextRank and Naive Bayes for summary extraction, but these approaches had limitations. LLMs, powered by deep learning, offer a more robust and abstract representation of the text, enabling them to create coherent and contextually relevant summaries. We’ll explore the different LLMs like BART, T5, Pegasus, and ProphetNet, each with unique architecture and capabilities. By the end of this blog, you’ll better understand how LLMs are transforming the landscape of text summarization and making it easier than ever to distill information from extensive textual sources.

LLMs for Text Summary Generation Task

Generating text summaries is a common NLP task. Text summarization aims to provide an accurate and concise summary of long paragraphs of text. This technique works with external data sources, including articles, documents, and social media blogs and posts. The table below (Table-1) presents common examples of machine-generated summaries.

Table-1: Examples of machine-generated summaries

Text: Following are the leading scorers in the English premier league after Saturday’s matches:##-Alan Shearer – lrb Newcastle United -rrb-, James Beattie.
Generated summary: English premier league leading scorers

Text: PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.
Generated summary: California’s largest electricity provider has turned off power to hundreds of thousands of customers.

The next few sections elaborate on how Large Language Models (built on the Transformer architecture) can summarize text.

New LLM-Based Summary Generation

Deep learning models have revolutionized the automated text summarization task. Before deep learning models, common approaches for extracting summary information from a longer text included:

  1. Graph-based models that create a graph structure of all sentences based on word similarity. This way, the sentences most similar to the rest of the document can be chosen. TextRank is one such summarizer.
  2. Dividing the document into sections and then extracting features section by section to find the theme of each section. Classifiers like Naive Bayes can be used to decide whether a feature belongs in a summary based on training data.
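
To make the graph-based approach concrete, here is a minimal sketch of a TextRank-style extractive summarizer in plain Python. It is an illustration under simplifying assumptions, not a reference implementation: the sentence splitter is naive, and the function names (split_sentences, similarity, textrank_summary) and default parameters are chosen for this example only.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity, normalized by the log of the sentence sizes."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration and keep the top ones."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= num_sentences:
        return sentences
    # Weighted sentence graph: edge weight = similarity between two sentences.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sums[j] * scores[j]
                       for j in range(n) if out_sums[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # keep original document order


sample = ("The cat sat on the mat. The cat chased a mouse on the mat. "
          "A cat and a mouse played on the mat. Stock prices fell sharply today.")
print(textrank_summary(sample, num_sentences=2))
```

Sentences that share vocabulary with many other sentences accumulate a higher score, so a sentence that has nothing in common with the rest of the document is unlikely to be selected.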

Instead of an extractive approach, deep learning models use a more robust abstract representation of the document: a neural network is trained to generate a condensed abstract representation of a text document. Figure 1 represents the deep learning model approach.


In the beginning, bidirectional LSTM models, which are based on Recurrent Neural Networks, were used extensively, and tools such as Gensim and spaCy offered satisfactory solutions. However, a common drawback of many LSTM-based NLP solutions was their tendency to lose context when dealing with lengthy sentences.

LLM-Based Encoder-Decoder Architecture

The early Transformer architectures fell into two major categories. 

  1. Encoder-focused BERT and its variants (RoBERTa, ELECTRA, etc.).
  2. Decoder-only autoregressive models of the GPT series.

The BERT-based encoder architecture is trained with a masked language modeling objective: some tokens in a sentence are hidden, and the model predicts them using the context both to the left and to the right of each missing token. This training lets BERT use its self-attention mechanism to build a representation of each word that reflects the whole sentence. Figure 2 below illustrates the bidirectional prediction of missing tokens by the BERT Encoder.


Because its output is not a compact representation of the input, BERT is not ideal for text summarization. It does have an abstract understanding of the text and can be fine-tuned for summarization, but a model specialized for the task can easily outperform it.

The GPT-2 and GPT-3 autoregressive models generate the next token based on all previous input tokens, with a remarkably large context size. Figure 3 below shows the autoregressive nature of the GPT models, which predict tokens in a single direction.
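
The difference between the two families can be sketched as a difference in which positions each token is allowed to see. The snippet below is a simplified illustration in plain Python, not code from either model; the helper names (bidirectional_mask, causal_mask, visible_context) are made up for this example.

```python
def bidirectional_mask(n):
    """BERT-style encoder: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style decoder: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def visible_context(tokens, mask, position):
    """Tokens that the given position is allowed to look at."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Encoder view of position 2: context from both the left and the right.
print(visible_context(tokens, bidirectional_mask(len(tokens)), 2))

# Decoder view of position 2: only the tokens generated so far.
print(visible_context(tokens, causal_mask(len(tokens)), 2))
```

In actual masked language model training, the token at the predicted position is replaced by a mask symbol in the input; the sketch only shows which surrounding positions contribute context.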


The GPT architecture enables the model to undergo unsupervised training using unlabeled data. With the virtually boundless text data available on the internet, extensive training datasets like BookCorpus, consisting of over 11,000 books with nearly 1 billion words, prove to be highly advantageous for GPT models. While GPT models excel at generating text, it’s important to note that they are not trained specifically to generate summaries.

Assessing LLM Text Summarization Performance

Several well-known text summarization datasets exist for training and benchmarking, along with various metrics for evaluating the performance of text summarizers. The most commonly used benchmark datasets are:

  1. CNN/Daily Mail dataset
  2. BillSum dataset

The CNN/Daily Mail dataset is generated from CNN and Daily Mail stories. It consists of long text stories with human-generated summaries and highlights of sections of the story. It is a very good test for accurate summary generators, and it is large enough, with sufficient training, test, and validation samples, to create a pre-trained model.

The BillSum dataset is a collection of US Congress and California state legislative documents from 1991 to 2019. The collection also contains summaries written by experts in the legislative domain for most documents. Generating abstractive summaries for this dataset is quite challenging.

In addition to these datasets used for training and assessment, there are well-known objective metrics for evaluating the generated summaries. One such family of metrics is ROUGE, an acronym for Recall-Oriented Understudy for Gisting Evaluation. ROUGE-N measures the overlap of n-grams between the generated summary and a reference summary, which is typically handwritten, by counting the n-grams the two share. An n-gram is a sequence of n consecutive words; n-grams are useful here because languages exhibit specific patterns of word association.

A simple example is the calculation of ROUGE-2 for the text:

“A cat with a hat sat on the mat”

Suppose the reference summary and its bigrams are as follows:

cat hat mat -> (cat, hat) (hat, mat)

The generated summary has the same bigrams: (cat, hat) (hat, mat)

ROUGE-2 is the count of bigrams in the generated summary that match reference bigrams (2) divided by the total count of bigrams in the reference summary (2), so the ROUGE-2 score here is a perfect 1. A score of 0.4 to 0.6 is generally considered good for text summarization.

ROUGE-N can be reported as recall (matching n-grams divided by the n-grams in the reference summary), as precision (matching n-grams divided by the n-grams in the generated summary), or as the ROUGE-N F1 score, which combines the two into a single number.
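
The ROUGE-N calculation can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization, lowercasing, and a single reference; the function names (ngrams, rouge_n) are chosen for this example, and established ROUGE packages add stemming and other details that are left out here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=2):
    """Return ROUGE-N precision, recall and F1 for one reference/candidate pair."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count, so repeats are not over-credited.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(scores)
```

In this usage example, three of the five reference bigrams, (the, cat), (on, the) and (the, mat), also appear among the five candidate bigrams, so precision, recall and F1 all come out at about 0.6.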

Initially designed as a metric to assess translation quality, BLEU (BiLingual Evaluation Understudy) can also be applied to compare the generated summary against a reference. It relies on n-gram precision and includes a brevity penalty so that very short outputs do not score well. A score ranging from 0.3 to 0.4 is generally deemed indicative of good performance.
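
For comparison with ROUGE, the following is a simplified BLEU sketch in Python. It is deliberately reduced and should be read as an assumption-laden illustration: it uses one reference, whitespace tokenization, n-grams up to length 2 by default (standard BLEU uses up to 4), and no smoothing. The function names (ngram_counts, simple_bleu) are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        ref = ngram_counts(ref_tokens, n)
        cand = ngram_counts(cand_tokens, n)
        matched = sum((ref & cand).values())  # clipped matches
        total = sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(cand_tokens) > len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu(reference, "the cat sat on the mat"))
print(simple_bleu(reference, "the cat"))
```

The second call shows why the brevity penalty matters: every n-gram in "the cat" appears in the reference, so precision alone would be perfect, but the short length pulls the score well below 1.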

Comprehending these benchmarks and metrics is instrumental in the process of choosing the most suitable model for text summarization. This is especially crucial as summarization models tend to be domain-specific, and the process of fine-tuning plays a pivotal role in achieving efficient text summary generation.

Contemporary LLMs for Generating Text Summaries

The following are a few state-of-the-art LLMs for text summarization.


BART employs an encoder-decoder architecture and is trained as a denoising autoencoder. The distinctive feature of a denoising autoencoder lies in its training: the encoder builds a representation of imperfect or corrupted documents, and an autoregressive decoder then uses this representation to reconstruct the original documents with minimal loss. To support this approach, BART’s corruption schemes include token deletion and token masking.

For sequence-to-sequence pre-training, BART also corrupts documents with techniques like text in-filling, document rotation, and sentence permutation, and the model is trained to recover the correct sequences from the corrupted ones. The accompanying figure (Figure-4) illustrates how the model is trained to recover original sequences from altered ones.
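
The corruption steps mentioned above (masking, deletion, in-filling, sentence permutation, and document rotation) can be illustrated with small Python functions. This is a sketch of the idea only, under stated assumptions: the "<mask>" string, the function names, and the fixed-length span in text_infilling are choices made for this example, and BART’s actual pre-training pipeline samples span lengths and operates on subword tokens.

```python
import random

MASK = "<mask>"  # placeholder symbol for this sketch


def token_masking(tokens, prob, rng):
    """Replace each token with the mask symbol with probability `prob`."""
    return [MASK if rng.random() < prob else t for t in tokens]


def token_deletion(tokens, prob, rng):
    """Drop each token with probability `prob`; the model must infer where."""
    return [t for t in tokens if rng.random() >= prob]


def text_infilling(tokens, start, length):
    """Replace a span of `length` tokens with a single mask symbol."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Shuffle the order of the sentences in a document."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng):
    """Rotate the document so that it starts at a randomly chosen token."""
    if not tokens:
        return []
    pivot = rng.randrange(len(tokens))
    return tokens[pivot:] + tokens[:pivot]


rng = random.Random(0)
doc = "the cat sat on the mat".split()
print(token_masking(doc, 0.3, rng))
print(text_infilling(doc, 1, 3))
print(document_rotation(doc, rng))
```

During pre-training, the corrupted sequence is the encoder input and the original, uncorrupted sequence is the target the decoder has to reproduce.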


BART can be fine-tuned for downstream tasks including sequence classification, token classification/question answering, sequence generation (BART has an autoregressive decoder like GPT models), and machine translation.


T5 has an architecture very similar to BART. It consists of an encoder trained with a denoising objective and a GPT-like autoregressive decoder. The encoder generates an abstract representation of the document to summarize based on pre-training, and the autoregressive decoder converts it into text. However, the creators of T5 viewed every NLP task as a sequence-to-sequence text generation problem, so they designed T5 for tasks beyond summary generation and focused heavily on pre-training. The idea is that when we ask the transformer to “summarize: xxx long text”, it provides the summary, and when we ask it to “translate to German: xxxy …some English text”, it does that job too, all based on pre-training. The tasks it is trained on include unsupervised training on a large text corpus as well as supervised training on various datasets, including text summarization tasks. Pre-trained T5 models come in various sizes, the largest having 11 billion parameters.
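
The text-to-text framing can be sketched as a small helper that turns a task name and an input into the single string the model reads. The helper and dictionary names (to_text_to_text, TASK_PREFIXES) are hypothetical, and the exact prefix wording is an assumption based on the prefixes commonly used with released T5 checkpoints; a model only responds to the prefixes it was trained with.

```python
# Assumed task prefixes; the exact strings depend on how the model was trained.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def to_text_to_text(task, text):
    """Build the single input string for a T5-style text-to-text model."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text.strip()


print(to_text_to_text("summarize", "The bill lowers prescription drug costs."))
print(to_text_to_text("translate_en_de", "The house is small."))
```

Because both tasks are expressed as plain strings, the same model, loss, and decoding procedure can serve summarization and translation without task-specific output layers.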


PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization)

PEGASUS is an encoder-decoder transformer specifically designed for text summary generation. BERT’s masked language model trains the encoder to predict masked tokens based on attention. The PEGASUS equivalent is “Gap Sentence Generation”, where whole sentences are masked out of the document and the model is trained to generate them. The core idea is that if a model can be trained to select and emit the few most important sentences from a document, it is already producing a summary. To choose which sentences to mask, a ROUGE-F1 score is calculated for each sentence against the rest of the document, and the top-scoring sentences are selected. Because the pre-trained model has learned to pick out significant and non-repetitive sentences, further fine-tuning can produce more effective summaries. PEGASUS achieved state-of-the-art results on summarization tasks when it was introduced, and various pre-trained models with sizes from 500 M to 1.5 Billion are available.
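
The gap-sentence selection step can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization, ROUGE-1 F1 as the scoring function, and independent scoring of each sentence against the rest of the document; the "<mask_1>" placeholder and the function names (rouge1_f1, make_gap_sentence_example) are choices made for this example.

```python
from collections import Counter

MASK_SENTENCE = "<mask_1>"  # placeholder for a masked sentence in this sketch


def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def make_gap_sentence_example(sentences, num_gaps=1):
    """Mask the sentences most similar to the rest of the document.

    Returns (masked_input, target): the document with the selected sentences
    replaced by a mask symbol, and the selected sentences as the target text.
    """
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, sent in enumerate(tokenized):
        rest = [tok for j, other in enumerate(tokenized) if j != i for tok in other]
        scores.append(rouge1_f1(sent, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:num_gaps])
    masked_input = " ".join(
        MASK_SENTENCE if i in selected else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in selected)
    return masked_input, target


doc = [
    "the storm cut power in the city",
    "power was restored in the city after the storm",
    "crews restored service by noon",
]
masked_input, target = make_gap_sentence_example(doc, num_gaps=1)
print(masked_input)
print(target)
```

In this usage example the second sentence shares the most words with the other two, so it is the one that gets masked and becomes the generation target.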

ProphetNet performs similarly to PEGASUS and is also a high-performance model for text summarization.

Implementation Notes

The Hugging Face Transformers library has made it straightforward to fine-tune a pre-trained model for text summarization. The following link outlines the steps:


Generating summaries from pre-trained models is also made easy by the Transformers library. The following code snippet shows how to summarize the provided text.

from transformers import pipeline

# Input text; the "summarize: " prefix is the task prompt used by T5-style models
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

# Load a summarization pipeline with a model fine-tuned on the BillSum dataset
summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")

# Run the model on the text and print the result
print(summarizer(text))

# Example output:
# [{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]


In conclusion, text summarization, facilitated by Large Language Models (LLMs) like BART, T5, Pegasus, and ProphetNet, has evolved into an essential tool for condensing vast amounts of text into concise summaries. These models leverage deep learning to generate coherent and contextually relevant summaries.

Using a pre-trained and fine-tuned model for generating summaries is fairly straightforward with Transformers library pipelines. The example above uses a T5 model fine-tuned on the BillSum dataset to generate a legal text summary. As technology progresses, text summarization powered by LLMs continues to play a vital role in simplifying complex documents and efficiently distilling information from extensive textual sources.

About the Author:

Ajmal Mahmood is the Chief Architect for High Plains Computing (HPC). HPC provides cloud DevOps and MLOps services and helps roll out ML models to production using AWS cloud and Kubernetes.
