This notebook implements text summarization with a combination of extractive and abstractive methods. It uses the CNN/Daily Mail dataset to train and evaluate models that generate summaries of news articles.
- Extractive Summarization: Identifies key sentences from articles.
- Abstractive Summarization: Generates human-like summaries using transformer models.
- Evaluation Metrics: Measures performance using ROUGE and BLEU scores.
- Pretrained Models: Utilizes pretrained BERT and BART models for summarization.
Ensure the following Python libraries are installed:
- transformers
- datasets
- rouge-score
- evaluate
- nltk
- torch
- matplotlib
- tqdm
You can install them by running:
!pip install transformers datasets rouge-score evaluate nltk torch matplotlib tqdm -q
The notebook uses the CNN/Daily Mail dataset, which is automatically downloaded using the `datasets` library.
- Loading: The CNN/Daily Mail dataset is loaded using the `datasets` library.
- Cleaning: Articles and summaries are cleaned to remove HTML tags, special characters, and extra whitespace.
- Splitting: The dataset is divided into training, validation, and testing subsets.
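The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`clean_text`, `clean_example`, `load_cleaned_splits`), the set of characters treated as "special", and the dataset identifier and version (`"cnn_dailymail"`, `"3.0.0"`) are assumptions. The `article` and `highlights` column names are those of the CNN/Daily Mail dataset on the Hugging Face Hub.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: keep letters, digits, whitespace, and basic punctuation only.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,!?;:'\"()\-]")
WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags, special characters, and extra whitespace."""
    text = html.unescape(text)          # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)        # drop HTML tags
    text = SPECIAL_RE.sub(" ", text)    # drop characters outside the allowed set
    return WS_RE.sub(" ", text).strip() # collapse runs of whitespace


def clean_example(example: dict) -> dict:
    """Clean one dataset row (article and reference summary)."""
    return {
        "article": clean_text(example["article"]),
        "highlights": clean_text(example["highlights"]),
    }


def load_cleaned_splits():
    """Load the dataset and return cleaned (train, validation, test) splits."""
    # Imported here so clean_text can be used without the datasets library.
    from datasets import load_dataset

    raw = load_dataset("cnn_dailymail", "3.0.0")  # assumed identifier/version
    cleaned = raw.map(clean_example)
    return cleaned["train"], cleaned["validation"], cleaned["test"]
```

For example, `clean_text("<p>Hello,   world!</p> &amp; more")` returns `"Hello, world! more"`.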
- Sentence Splitting: Articles are divided into sentences using NLTK.
- Sentence Selection: Sentences are classified using a pretrained BERT model for sequence classification.
- Key Sentences: Selected sentences form the extractive summary.
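The extractive steps can be illustrated with a small sketch. It makes two simplifying assumptions: a regular expression stands in for NLTK's sentence tokenizer, and `score_fn` is a hypothetical callable standing in for the BERT classifier's per-sentence score. The function names are illustrative, not taken from the notebook.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (a stand-in for nltk.sent_tokenize)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def extractive_summary(
    article: str,
    score_fn: Callable[[str], float],
    top_k: int = 1,
) -> str:
    """Keep the top_k highest-scoring sentences, in their original order."""
    sentences = split_sentences(article)
    # Rank sentence indices by score, highest first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_fn(sentences[i]),
        reverse=True,
    )
    # Restore document order so the summary reads naturally.
    keep = sorted(ranked[:top_k])
    return " ".join(sentences[i] for i in keep)
```

In the notebook, the score would come from the sequence classification model (for instance, the probability that a sentence belongs in the summary); any function mapping a sentence to a number can be plugged in here for experimentation.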
- Fine-Tuning: The BART model is fine-tuned on the cleaned dataset using `Seq2SeqTrainer`.
- Summary Generation: Summaries are generated using beam search for improved results.
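A rough sketch of beam-search generation is shown below. It assumes `model` and `tokenizer` are a Hugging Face seq2seq model (such as BART) and its tokenizer, already loaded elsewhere. The decoding values in `GENERATION_KWARGS` and the function name are illustrative assumptions, not the notebook's settings.

```python
# Illustrative decoding settings (assumed values, tune as needed).
GENERATION_KWARGS = {
    "num_beams": 4,             # beam search width
    "max_length": 128,          # maximum summary length in tokens
    "early_stopping": True,     # stop once all beams have finished
    "no_repeat_ngram_size": 3,  # avoid repeating any trigram
}


def generate_summary(article, model, tokenizer, max_input_length=1024):
    """Generate one summary with beam search."""
    inputs = tokenizer(
        article,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_length,
    )
    # Move input tensors to the same device as the model.
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Wider beams usually cost more time per summary, so a smaller `num_beams` can be a reasonable choice for quick experiments.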
- ROUGE Score: Measures the overlap of n-grams between generated and reference summaries.
- BLEU Score: Assesses the similarity of generated summaries to reference summaries.
- Visualization: Displays example articles, reference summaries, and generated summaries.
- Clone the Notebook: Save the notebook on your local system or Jupyter environment.
- Install Dependencies: Ensure all required Python libraries are installed (see requirements above).
- Execute the Notebook: Run each cell sequentially. Major sections are labeled clearly with comments.
- Model Training:
  - If fine-tuning BART, the training may take a few hours based on your hardware.
  - You can reduce the dataset size during training for faster results.
- Evaluate the Models: View the generated summaries and evaluate their performance using ROUGE and BLEU metrics.
Article:
"John went to the store to buy groceries. He forgot to bring his wallet but managed to find a way to pay."
Extractive Summary:
"John went to the store to buy groceries."
Abstractive Summary:
"John bought groceries without his wallet."
- ROUGE-1: Measures overlap of unigrams.
- ROUGE-2: Measures overlap of bigrams.
- ROUGE-L: Measures the longest common subsequence.
- BLEU: Measures n-gram precision against the reference summary, with a smoothing function applied.
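To make the n-gram overlap idea concrete, here is a simplified, pure-Python version of ROUGE-N. It is only an illustration: it tokenizes on whitespace and applies no stemming, so its numbers will not necessarily match those of the `rouge-score` package the notebook relies on. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, F1."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of six unigrams and three of five bigrams match, so ROUGE-1 F1 is 5/6 and ROUGE-2 F1 is 0.6.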
- Dataset Size: Modify the `train_dataset.select` and `val_dataset.select` calls to use fewer samples for quick testing.
- Hyperparameters: Adjust training arguments like `learning_rate`, `num_train_epochs`, and `batch_size` in `Seq2SeqTrainingArguments`.
- Model Selection: Replace `facebook/bart-base` with other transformer models for experimentation.
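The customization points above might look something like this in code. All values in `QUICK_RUN`, the `./results` output directory, and the helper names are hypothetical choices for a fast test run. Note that in `Seq2SeqTrainingArguments` the batch size is set through `per_device_train_batch_size` and `per_device_eval_batch_size`.

```python
# Hypothetical settings for a quick experiment.
QUICK_RUN = {
    "model_name": "facebook/bart-base",
    "train_samples": 1000,
    "val_samples": 200,
    "learning_rate": 5e-5,
    "num_train_epochs": 1,
    "batch_size": 4,
}


def subsample(dataset, n):
    """Keep at most the first n rows of a datasets.Dataset."""
    return dataset.select(range(min(n, len(dataset))))


def build_training_args(cfg=None, output_dir="./results"):
    """Build Seq2SeqTrainingArguments from a plain config dict."""
    # Imported lazily so the helpers above work without transformers.
    from transformers import Seq2SeqTrainingArguments

    cfg = cfg or QUICK_RUN
    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=cfg["learning_rate"],
        num_train_epochs=cfg["num_train_epochs"],
        per_device_train_batch_size=cfg["batch_size"],
        per_device_eval_batch_size=cfg["batch_size"],
        predict_with_generate=True,  # generate summaries during evaluation
    )
```

A quick run would then use `subsample(train_dataset, QUICK_RUN["train_samples"])` and `subsample(val_dataset, QUICK_RUN["val_samples"])` before passing the subsets to the trainer.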
The trained model and tokenizer are saved in the `./saved_model` directory and can be reloaded from there for inference.
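Reloading the saved files could look roughly like the sketch below. It assumes the directory was written with `save_pretrained` and holds a seq2seq model such as BART. The function names and the `max_new_tokens` default are illustrative.

```python
MODEL_DIR = "./saved_model"


def load_saved_model(model_dir=MODEL_DIR):
    """Reload the fine-tuned model and tokenizer from disk."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer


def summarize(text, model, tokenizer, max_new_tokens=64):
    """Summarize one article with the reloaded model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Typical usage would be `model, tokenizer = load_saved_model()` followed by `summarize(article_text, model, tokenizer)`.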
- CUDA Errors: Ensure PyTorch is installed with GPU support and a compatible CUDA version.
- Long Training Time: Reduce the dataset size or batch size for faster iterations.
- Missing Dependencies: Reinstall missing packages using pip.
Enjoy summarizing!