Master the art of training Text Summarization AI models with Python! Expert insights, step-by-step guidance, and practical tips await. Dive in now!
Table of Contents
- Types of Text Summarization
- Implementing Text Summarization in Python (using the Seq2Seq model)
- Find A Valid Training Dataset
- Preprocess Data To Clean It
- Make Tokenizers (To Convert Words To Numbers)
- Finalize the Model
- Final Words
Text summarization means shortening a text while keeping its intended meaning the same. In today’s fast-paced world, there’s a rising need for clear, concise content that delivers all the key points quickly, so readers gain value in less time.
This is where AI summarizers come into play. These automated tools are built on machine learning models trained on vast datasets, and they use NLP (Natural Language Processing) to read the input and predict suitable words for the shortened text.
However, you can also train a model with Python to build your own text summarizer. This article walks you through the whole process, from choosing a dataset to generating abstractive summaries.
1. Types of Text Summarization
There are two possible types of text summarization.
- Extractive
This type of summary extracts important sentences from a sample text and displays them. The resulting text is much shorter than the original one, without altering the original meaning.
An extractive summarizer doesn’t alter the sentences in any way. It simply copies them from the source text to make a summary that fits the context.
- Abstractive
An abstractive summary is where the summarizer doesn’t copy the same sentences from a sample text. Instead, it learns whatever is in an input text and writes new sentences to complete the summary.
For today’s post, we will show the procedure of training an AI model to create abstractive summaries. We’re picking abstractive over extractive summaries because they read more like summaries a human would write.
So, you can use the summarized text to digest lengthy, intricate content at your convenience. This can be helpful for drawing inspiration from published content or for learning manual summarization techniques.
2. Implementing Text Summarization in Python (using the Seq2Seq model)
To effectively train a text summarization AI model in Python, we’ll use the Seq2Seq model. The Seq2Seq model consists of two main components: an encoder and a decoder. We recommend reading a complete LSTM tutorial to understand these concepts in detail.
Moving on, to train our AI model, let’s do the following steps.
Import the Right Libraries
There are many libraries out there for Python that can achieve our task. However, we’ll limit ourselves to the following ones.
- BeautifulSoup
- Nltk
- Numpy
- Pandas
- Re
- Keras
Let us briefly explain what each does, so that you have an idea of what we’re doing here.
BeautifulSoup is an HTML parser that makes it easy to extract content from web pages. It is helpful to remove HTML tags and extract pure text.
Nltk stands for Natural Language Toolkit. It is a library for Natural Language Processing; in this tutorial, we mainly use it for its list of English stop words.
Numpy is used to carry out mathematical operations on large, multidimensional arrays and matrices, which makes it well suited to the numerical work in this task.
Pandas is another popular Python library. It provides two-dimensional, heterogeneous data structures (DataFrames) and helps with loading and manipulating the dataset.
Re is Python’s regular-expression (regex) module. It matches patterns against text, which lets us find and clean up specific strings in an ocean of words and phrases.
Keras is an open-source, high-level deep-learning API created by François Chollet, an engineer at Google. It lets creators build AI models quickly. In 2017, it was integrated into TensorFlow as tf.keras, which is why several imports below go through tensorflow.keras.
Finally, import the libraries in a suitable Python environment by applying the following code.
```python
import numpy as np
import pandas as pd
import re
import warnings

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")
```
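One small setup note: the stop-word list used during preprocessing comes from NLTK’s downloadable corpus data, not from the package itself, and BeautifulSoup’s "lxml" parser requires the lxml package to be installed. If you have never fetched the NLTK data before, a one-time download makes it available (the punkt download is only needed for the optional TextRank sketch shown later):

```python
import nltk

# One-time downloads of the NLTK data used in this tutorial
nltk.download('stopwords')  # English stop-word list for text cleaning
nltk.download('punkt')      # sentence tokenizer, used only in the TextRank sketch
```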
3. Find a Valid Training Dataset
After creating a solid base of libraries to use, let’s move on to finding the right dataset. The one we’ll use for our demonstration is Amazon’s Fine Food Reviews. It includes around 560,000 reviews posted on Amazon between October 1999 and October 2012.
The reason to use this training dataset is that product reviews generally cover all kinds of vocabulary and writing styles. It gives the machine a good representation of how humans write in real life.
Thus, the data works well for teaching the AI model text summarization: the abstractive summaries it produces should resemble those written by humans and come out coherent, cohesive, and effective.
However, if you want to train your AI model to create extractive summaries instead, the TextRank algorithm is a common choice. It builds a similarity matrix over the sentences of a document and ranks them, PageRank-style, so the most representative sentences can be kept, as in the sketch below.
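As a rough illustration of that extractive approach (separate from the abstractive pipeline in the rest of this article), here is a minimal TextRank-style sketch. The sentence splitting, TF-IDF similarity measure, networkx dependency, and the choice of three output sentences are our own assumptions, not part of the original pipeline:

```python
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(document, num_sentences=3):
    # Split the document into sentences and build a TF-IDF similarity matrix
    sentences = sent_tokenize(document)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)

    # Treat sentences as graph nodes and similarities as edge weights,
    # then rank them with PageRank (the core idea behind TextRank)
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)

    # Return the highest-scoring sentences in their original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling textrank_summary() on a long piece of text would return its three most central sentences verbatim, which is exactly what an extractive summarizer is meant to do.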
Nevertheless, we’ll stick with the review dataset to train abstractive summaries. For ease of computation, we’ll limit the sample size to 100,000 by following the code below.
```python
data = pd.read_csv("Reviews.csv", nrows=100000)
data.drop_duplicates(subset=['Text'], inplace=True)
data.dropna(axis=0, inplace=True)
```
The code above also eliminates redundancy from the data by dropping duplicate rows and NULL/NaN values. With that done, your dataset is ready for training your AI model on summaries.
4. Preprocess Data To Clean It
Clearing out NULL values and duplicate rows was the initial step toward cleaning the data. However, cleaning requires much more than that. Below are some more steps you need to follow to make your training dataset pristine.
- Mapping contracted words
- Converting everything to lowercase
- Removing anything in parentheses
- Removing stop words and short words
- Eliminating special characters and punctuation
To do these steps, we need to follow some coding techniques. First, we’ll declare a dictionary that maps contracted words to their expanded forms, like:
```python
contraction_mapping = {"can't": "cannot", ...}  # and so on
```
We’ll also remove short words and stop words from our text, because stop words like ‘and’, ‘there’, ‘is’, and ‘the’ don’t add any value to the content of the summary. A similar argument applies to short words and abbreviations that aren’t meaningful to the overall understanding of the input text.
```python
stop_words = set(stopwords.words('english'))

def text_cleaner(text):
    newString = text.lower()
    newString = BeautifulSoup(newString, "lxml").text
    newString = re.sub(r'\([^)]*\)', '', newString)
    newString = re.sub('"', '', newString)
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    tokens = [w for w in newString.split() if not w in stop_words]
    long_words = []
    for i in tokens:
        if len(i) >= 3:  # removing short words
            long_words.append(i)
    return (" ".join(long_words)).strip()

cleaned_text = []
for t in data['Text']:
    cleaned_text.append(text_cleaner(t))
```
In the above code, the variable stop_words stores the English stop-word list (as mentioned above) from the nltk library.
Then, the text_cleaner function performs the steps listed above: it converts the text to lowercase and removes HTML tags, parentheses, special characters, punctuation, stop words, and short words.
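One step the snippets above don’t show is preparing the Summary column: the model below trains on data['cleaned_summary'], and the decoder relies on explicit ‘start’ and ‘end’ marker words to know where a summary begins and stops. A minimal sketch of that step (assuming the dataset’s Summary column, a lighter cleaner that keeps short words, and the literal markers 'start' and 'end') could look like this:

```python
def summary_cleaner(text):
    # A lighter clean-up for summaries: lowercase, strip quotes,
    # expand contractions, and keep only alphabetic characters
    newString = re.sub('"', '', text.lower())
    newString = ' '.join([contraction_mapping.get(t, t) for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    return " ".join(newString.split()).strip()

cleaned_summary = [summary_cleaner(t) for t in data['Summary']]

data['cleaned_text'] = cleaned_text
data['cleaned_summary'] = cleaned_summary

# Drop rows whose cleaned summary ended up empty, then wrap each summary
# in the start/end markers that the decoder looks for later
data['cleaned_summary'].replace('', np.nan, inplace=True)
data.dropna(axis=0, inplace=True)
data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x: 'start ' + x + ' end')
```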
5. Make Tokenizers (To Convert Words To Numbers)
Now, it’s time to tokenize the words so that the machine can convert them into integer sequences it can learn from.
Before that, we need to split our data into two parts: a training set and a holdout (validation) set. This lets us check whether our end product generalizes well.
```python
from sklearn.model_selection import train_test_split

max_len_text = 80
max_len_summary = 10

x_tr, x_val, y_tr, y_val = train_test_split(
    data['cleaned_text'], data['cleaned_summary'],
    test_size=0.1, random_state=0, shuffle=True
)
```
Notice that we set the maximum lengths of the text and summary to 80 and 10 words, respectively, since the cleaned reviews averaged around 80 words.
You can change these values according to the training dataset you have chosen for your AI model; in our case, these constraints worked best.
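If you want to verify these limits against your own dataset, a quick optional sanity check (not part of the original pipeline) is to look at the word-count distributions of the cleaned columns:

```python
# Inspect word counts to sanity-check max_len_text and max_len_summary
text_word_count = data['cleaned_text'].apply(lambda x: len(x.split()))
summary_word_count = data['cleaned_summary'].apply(lambda x: len(x.split()))

print(text_word_count.describe())     # length distribution of the cleaned reviews
print(summary_word_count.describe())  # length distribution of the cleaned summaries
```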
Now, let’s define a text tokenizer for our training dataset.
```python
# Prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(list(x_tr))

# Convert text sequences into integer sequences
x_tr = x_tokenizer.texts_to_sequences(x_tr)
x_val = x_tokenizer.texts_to_sequences(x_val)

# Pad with zeros up to the maximum length
x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')

x_voc_size = len(x_tokenizer.word_index) + 1
```
Next, prepare a summary tokenizer.
```python
# Prepare a tokenizer for the summaries of the training data
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(list(y_tr))

# Convert summary sequences into integer sequences
y_tr = y_tokenizer.texts_to_sequences(y_tr)
y_val = y_tokenizer.texts_to_sequences(y_val)

# Pad with zeros up to the maximum length
y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')

y_voc_size = len(y_tokenizer.word_index) + 1
```
Make sure the ‘start’ and ‘end’ marker words added during preprocessing are defined correctly, along with the word-index lookups shown below. Then, move on to the next and final step.
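The inference code further below also relies on three lookup tables, target_word_index, reverse_target_word_index, and reverse_source_word_index, which aren’t defined in the snippets above. Assuming the standard word_index/index_word attributes of the Keras Tokenizer, they can be built like this:

```python
# Word -> integer and integer -> word lookups used later during inference
reverse_source_word_index = x_tokenizer.index_word
reverse_target_word_index = y_tokenizer.index_word
target_word_index = y_tokenizer.word_index
```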
6. Finalize the Model
We are at the step where every element stitches together into a functional AI model. To do so, we need to touch on a few concepts: LSTMs, hidden states, cell states, and so on.
LSTM stands for long short-term memory. It is a type of recurrent neural network (RNN) that the AI model will use to predict the right words for our summary sentences.
```python
from keras import backend as K
K.clear_session()

latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM using the encoder states as its initial state
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention over the encoder and decoder outputs.
# Note: AttentionLayer is not part of Keras itself; this assumes a custom
# Bahdanau-style attention layer implementation is available in your project.
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concatenate attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```
In the above code, we’ve built a stacked, three-layer LSTM encoder that reads the review text, plus a single-layer LSTM decoder with attention that generates the summary.
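Before the inference step, the model also has to be trained; this is where the EarlyStopping callback imported earlier comes in. A minimal training setup, with the summary sequences shifted by one position so the decoder learns to predict the next word (the optimizer, batch size, and epoch count here are our own assumptions), could look like this:

```python
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Stop training once the validation loss stops improving
es = EarlyStopping(monitor='val_loss', mode='min', patience=2, verbose=1)

history = model.fit(
    [x_tr, y_tr[:, :-1]],                                   # encoder and decoder inputs
    y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:],   # targets shifted one step ahead
    epochs=10,
    batch_size=128,
    callbacks=[es],
    validation_data=([x_val, y_val[:, :-1]],
                     y_val.reshape(y_val.shape[0], y_val.shape[1], 1)[:, 1:]),
)
```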
Now, let’s set up inference for the encoder and decoder of the text summarizer.
```python
# Encoder inference model
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Decoder inference
# The tensors below will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_len_text, latent_dim))

# Get the embeddings of the decoder sequence
dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

# Attention inference (reusing the attention layer defined for the training model)
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate a probability distribution over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2]
)
```
Afterward, let’s define a function that runs the inference loop and decodes a summary word by word.
```python
def decode_sequence(input_seq):
    # Encode the input as state vectors
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1
    target_seq = np.zeros((1, 1))

    # Choose the 'start' word as the first word of the target sequence
    target_seq[0, 0] = target_word_index['start']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]

        if sampled_token != 'end':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either hit the maximum length or find the stop word
        if sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary - 1):
            stop_condition = True

        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence
```
Finally, we’ll define helper functions that convert the integer sequences back into word sequences so we can read the reviews, the reference summaries, and our abstractive predictions.
```python
def seq2summary(input_seq):
    newString = ''
    for i in input_seq:
        if (i != 0 and i != target_word_index['start']) and i != target_word_index['end']:
            newString = newString + reverse_target_word_index[i] + ' '
    return newString

def seq2text(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0:
            newString = newString + reverse_source_word_index[i] + ' '
    return newString

for i in range(len(x_val)):
    print("Review:", seq2text(x_val[i]))
    print("Original summary:", seq2summary(y_val[i]))
    print("Predicted summary:", decode_sequence(x_val[i].reshape(1, max_len_text)))
    print("\n")
Running the loop above prints a few sample reviews along with their original and predicted summaries.
The results are quite similar to those produced by AI summarizers available online. We must note, though, that commercial-level summarizers still come out ahead: they also handle punctuation and expressions, which our model doesn’t. However, there is always room for improvement in your model-training techniques.
7. Final Words
In this post, we learned the steps that you need to follow to effectively train your AI model for text summarization.
We learned how to stack a three-layer LSTM encoder with an attention-equipped decoder so the model can predict the right words in a summary. Remember that the code will only run if the required libraries are installed on your system.
Errors can still crop up along the way, so be patient and watch for bugs in the code. Debugging is a game of patience, and mastering it will make your AI models all the better!
That’s it for the post! We hope you enjoyed reading our content!
That’s a wrap!
Thanks!
Faraz 😊