README for Text Summarization Notebook

Introduction

This notebook implements text summarization with both extractive and abstractive methods. It uses the CNN/Daily Mail dataset to train and evaluate models that summarize news articles.

Key Features:

Extractive Summarization: Identifies key sentences from articles.
Abstractive Summarization: Generates human-like summaries using transformer models.
Evaluation Metrics: Measures performance using ROUGE and BLEU scores.
Pretrained Models: Utilizes pretrained BERT and BART models for summarization.

Requirements

Libraries and Tools

Ensure the following Python libraries are installed:

transformers
datasets
rouge-score
evaluate
nltk
torch
matplotlib
tqdm

You can install them by running:

!pip install transformers datasets rouge-score evaluate nltk torch matplotlib tqdm -q

Dataset

The notebook uses the CNN/Daily Mail dataset, which is automatically downloaded using the datasets library.

Notebook Workflow

Step 1: Dataset Preparation

Loading: The CNN/Daily Mail dataset is loaded using the datasets library.
Cleaning: Articles and summaries are cleaned to remove HTML tags, special characters, and extra whitespace.
Splitting: The dataset is divided into training, validation, and testing subsets.
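
The cleaning step above could look roughly like the sketch below. It is not taken from the notebook: the function name `clean_text` and the exact set of characters kept are assumptions made for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: keep word characters, whitespace, and basic punctuation only.
SPECIAL_RE = re.compile(r"[^\w\s.,!?;:'\"()-]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags, special characters, and extra whitespace."""
    text = html.unescape(text)         # turn entities such as "&amp;" into characters
    text = TAG_RE.sub(" ", text)       # drop HTML tags
    text = SPECIAL_RE.sub(" ", text)   # drop characters outside the whitelist
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse runs of whitespace
```

With the datasets library, a function like this would typically be applied to the `article` and `highlights` fields of each example through `Dataset.map`.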

Step 2: Extractive Summarization

Sentence Splitting: Articles are divided into sentences using NLTK.
Sentence Selection: Sentences are classified using a pretrained BERT model for sequence classification.
Key Sentences: Selected sentences form the extractive summary.
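
As a rough sketch of the selection step, assume the classifier has already produced one relevance score per sentence. The helper name `select_top_sentences`, the top-k rule, and the scores in the usage example are hypothetical; the notebook may select sentences differently, for example with a probability threshold.

```python
from typing import List, Sequence


def select_top_sentences(
    sentences: Sequence[str], scores: Sequence[float], k: int = 3
) -> List[str]:
    """Return the k highest-scoring sentences in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore document order so the summary reads naturally.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


sentences = [
    "John went to the store to buy groceries.",
    "He forgot to bring his wallet but managed to find a way to pay.",
]
scores = [0.8, 0.4]  # made-up classifier scores, for illustration only
summary = " ".join(select_top_sentences(sentences, scores, k=1))
```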

Step 3: Abstractive Summarization

Fine-Tuning: The BART model is fine-tuned on the cleaned dataset using Seq2SeqTrainer.
Summary Generation: Summaries are generated with beam search, which keeps several candidate sequences at each decoding step instead of only the single most likely one.

Step 4: Evaluation

ROUGE Score: Measures the overlap of n-grams between generated and reference summaries.
BLEU Score: Measures the n-gram precision of generated summaries against reference summaries.
Visualization: Displays example articles, reference summaries, and generated summaries.

How to Run

Clone the Notebook: Save the notebook on your local system or Jupyter environment.
Install Dependencies: Ensure all required Python libraries are installed (see Requirements above).
Execute the Notebook: Run each cell sequentially. Major sections are labeled clearly with comments.
Model Training:
- If fine-tuning BART, training may take a few hours, depending on your hardware.
- You can reduce the dataset size during training for faster results.
Evaluate the Models: View the generated summaries and evaluate their performance using ROUGE and BLEU metrics.

Results

Example Output

Article:

"John went to the store to buy groceries. He forgot to bring his wallet but managed to find a way to pay."

Extractive Summary:

"John went to the store to buy groceries."

Abstractive Summary:

"John bought groceries without his wallet."

Metrics

ROUGE-1: Measures overlap of unigrams.
ROUGE-2: Measures overlap of bigrams.
ROUGE-L: Measures the longest common subsequence.
BLEU: Measures n-gram overlap with a smoothing function.
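
To make the ROUGE definitions above concrete, here is a simplified pure-Python sketch that computes ROUGE-N and ROUGE-L F1 on whitespace tokens. It is an illustration only: the rouge-score package applies its own tokenization and optional stemming, so its numbers can differ. The function names are hypothetical.

```python
from collections import Counter


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Combine precision and recall into F1, returning 0.0 when nothing overlaps."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """F1 over shared n-grams (n=1 gives ROUGE-1, n=2 gives ROUGE-2)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(a), len(b))


candidate = "John went to the store to buy groceries"
reference = "John bought groceries without his wallet"
r1, r2, rl = rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2), rouge_l(candidate, reference)
```

Here the two texts share the unigrams "john" and "groceries" but no bigram, so ROUGE-1 and ROUGE-L are both 2/7 (about 0.29) and ROUGE-2 is 0.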

Customization

Parameters

Dataset Size: Modify the train_dataset.select and val_dataset.select calls to use fewer samples for quick testing.
Hyperparameters: Adjust training arguments such as learning_rate, num_train_epochs, and per_device_train_batch_size in Seq2SeqTrainingArguments.
Model Selection: Replace facebook/bart-base with other transformer models for experimentation.
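
The sketch below shows one way these knobs could be grouped. All values, the `./results` output path, and the helper names `take_subset` and `build_training_args` are assumptions for illustration and are not the notebook's actual settings.

```python
# Illustrative defaults only; tune them for your hardware and dataset size.
TRAINING_KWARGS = {
    "output_dir": "./results",           # hypothetical checkpoint directory
    "learning_rate": 5e-5,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "predict_with_generate": True,       # generate summaries during evaluation
}


def take_subset(dataset, n):
    """Return the first n rows of a datasets.Dataset via Dataset.select."""
    return dataset.select(range(min(n, len(dataset))))


def build_training_args(**overrides):
    """Create Seq2SeqTrainingArguments from the defaults plus any overrides."""
    # Imported here so this module loads even where transformers is not installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

For a quick test run, something like `take_subset(train_dataset, 1000)` together with `build_training_args(num_train_epochs=1)` keeps training short.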

Saved Model

The trained model and tokenizer are saved in the ./saved_model directory and can be reloaded from there for inference.
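
A minimal sketch of reloading the saved files for inference, assuming they were written with `save_pretrained` and that the Auto classes from transformers can resolve the saved BART checkpoint. The `summarize` helper and the generation settings are illustrative assumptions, not values taken from the notebook.

```python
MODEL_DIR = "./saved_model"  # directory named in this README

# Assumed generation settings; beam search is enabled through num_beams > 1.
GENERATION_KWARGS = {"num_beams": 4, "max_length": 128, "early_stopping": True}


def summarize(text: str, model_dir: str = MODEL_DIR) -> str:
    """Load the saved model and tokenizer, then generate a summary for text."""
    # Imported here so the heavy dependency is only needed when the function runs.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    # Truncate long articles to the 1024-token input limit of BART.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

When summarizing many articles, loading the model and tokenizer once outside the function and reusing them avoids repeated disk reads.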

Troubleshooting

CUDA Errors: Ensure PyTorch is installed with GPU support and a compatible CUDA version.
Long Training Time: Reduce the dataset size or the number of training epochs for faster iterations.
Missing Dependencies: Reinstall missing packages using pip.

Enjoy summarizing!
