LSG-BART: Enhancing Medical Document Summarization with LSG Attention
An enhanced Transformer model for summarizing long medical documents using LSG Attention, addressing the computational challenges of processing extensive medical texts from PubMed and other sources.
Project Abstract
This project introduces LSG-BART, an enhanced Transformer model for summarizing long medical documents. It addresses the challenge of processing extensive medical texts (e.g., from PubMed) by combining the robust sequence-to-sequence (Seq2Seq) capabilities of the BART model with the computational efficiency of Local, Sparse, and Global (LSG) Attention. This innovative approach efficiently handles long sequences, reducing computational complexity while maintaining high summarization quality, making it a promising tool for evidence-based medicine and healthcare.
The Problem: The Computational Cost of Long Documents
Summarizing medical literature is crucial, but these documents are often thousands of words long. Standard Transformer models, like the original BART, are built on a full self-attention mechanism.
The primary limitation of this mechanism is its O(n²) computational complexity, where n is the sequence length. As the document length grows, the memory and processing power required grow quadratically: going from 1,024 to 16,384 tokens multiplies the size of the attention matrix by 256. This makes it prohibitively expensive and slow to summarize long medical texts.
The Solution: LSG-BART
Our solution is LSG-BART, a model that replaces the standard, computationally-heavy self-attention in BART’s encoder with LSG Attention.
This modification allows the model to efficiently process sequences up to 16,000 tokens and beyond, drastically reducing the computational and memory footprint while maintaining competitive performance. It’s an adaptable approach that can be applied to existing pre-trained models, saving significant training costs.
Core Concepts & Architecture
BART: The Foundation
BART is a sequence-to-sequence model, pre-trained as a denoising autoencoder, that combines the strengths of BERT and GPT:
- Bidirectional Encoder (like BERT): It reads the entire input text at once, allowing it to build a deep understanding of the context
- Autoregressive Decoder (like GPT): It generates the summary token-by-token, making it excellent for high-quality text generation
This architecture makes BART an ideal foundation for abstractive summarization.
LSG Attention: The Innovation
LSG (Local, Sparse, Global) Attention is a hybrid mechanism designed to approximate full self-attention with far less computation. Instead of every token attending to every other token, it combines three efficient attention patterns (a toy mask construction is sketched after this list):
- Local Attention: A sliding-window pattern (as in Longformer or Big Bird). Each token attends only to its immediate neighbors within a fixed-length window, which captures local context
- Sparse Attention: Expands the local context by selecting additional “sparse” tokens based on specific rules (e.g., max pooling within blocks). This allows the model to pick up on important words that are further away
- Global Attention: A few select tokens (e.g., the [CLS] token) are designated as “global.” These tokens can attend to every token in the sequence, and every token can attend back to them. This ensures that critical, high-level information is preserved and communicated across the entire document
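The exact block sizes, sparsity rules, and number of global tokens are configurable in LSG; the sketch below is only a toy illustration (in NumPy, with made-up window/stride/global values) of how the three patterns combine into one attention mask. It is not the actual LSG implementation, which works block-wise for efficiency rather than materializing an n × n mask.

```python
import numpy as np

def lsg_attention_mask(seq_len: int, window: int = 4, sparse_every: int = 8, n_global: int = 1) -> np.ndarray:
    """Toy boolean mask: entry [i, j] is True if token i may attend to token j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local attention: each token attends to a sliding window of neighbors.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Sparse attention: every token also attends to a strided subset of distant
    # positions (a stand-in for LSG's pooling/selection rules).
    mask[:, ::sparse_every] = True

    # Global attention: the first n_global tokens attend everywhere,
    # and every token attends back to them.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

if __name__ == "__main__":
    m = lsg_attention_mask(seq_len=32)
    full = 32 * 32
    print(f"attended pairs: {m.sum()} / {full} ({m.sum() / full:.0%} of full attention)")
```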
LSG-BART: The Combined Model
LSG-BART is created by integrating the LSG attention mechanism directly into the BART architecture, specifically replacing the full self-attention in its encoder.
This allows the encoder to efficiently process long input sequences (the medical article). The standard decoder remains, allowing it to generate a high-quality, abstractive summary.
This approach is “model-agnostic,” meaning we can convert existing BART checkpoints without having to retrain the entire model from scratch.
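As an illustration of this conversion-friendly design, the sketch below loads a publicly shared LSG-BART checkpoint from the Hugging Face Hub and summarizes a long article. The checkpoint name is an assumption (any LSG-BART model fine-tuned on PubMed would do), and LSG models require `trust_remote_code=True` because the attention code ships with the checkpoint.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint name; substitute your own converted/fine-tuned LSG-BART model.
checkpoint = "ccdv/lsg-bart-base-4096-pubmed"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, trust_remote_code=True)

article = open("long_medical_article.txt").read()
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=4096)

summary_ids = model.generate(
    **inputs,
    max_length=256,       # cap on generated summary length
    num_beams=4,          # beam search for higher-quality abstractive output
    length_penalty=2.0,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```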
Methodology & Implementation
Dataset: PubMed Summarize
We used the PubMed Summarize dataset, which is ideal for this task due to its long-form content (a loading sketch follows the statistics below).
- Training Set: 120,000 documents
- Average Input Length: 3,050 words
- Average Summary Length: 202 words
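A minimal loading sketch, assuming the community mirror `ccdv/pubmed-summarization` on the Hugging Face Hub (the `article`/`abstract` column names are taken from that mirror and may differ in other copies):

```python
from datasets import load_dataset

# Assumed Hub dataset id and column names; adjust if you use a different mirror.
dataset = load_dataset("ccdv/pubmed-summarization")

sample = dataset["train"][0]
print(len(sample["article"].split()), "words in the article")
print(len(sample["abstract"].split()), "words in the reference summary")
```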
Model Training & Hyperparameters
The model was fine-tuned on the PubMed dataset on a single P100 GPU with 16 GB of VRAM, using the hyperparameters below (a training-arguments sketch follows the table).
| Hyperparameter | Value |
|---|---|
| Learning Rate | 8e-05 |
| Train Batch Size | 2 |
| Eval Batch Size | 2 |
| Gradient Accumulation Steps | 16 |
| Total Train Batch Size | 32 |
| Optimizer | Adam (betas=(0.9, 0.999), epsilon=1e-08) |
| LR Scheduler Type | Linear |
| LR Scheduler Warmup Steps | 500 |
| Num Epochs | 12 |
| Seed | 42 |
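The table above maps directly onto the Hugging Face `Seq2SeqTrainingArguments` API; the sketch below is a reconstruction under that assumption (output directory and mixed-precision settings are illustrative, not taken from the original training script). Note that the Adam settings in the table are the library defaults.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="lsg-bart-pubmed",     # illustrative output path
    learning_rate=8e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,   # effective train batch size: 2 x 16 = 32
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=12,
    seed=42,
    predict_with_generate=True,       # generate summaries during evaluation for ROUGE
    fp16=True,                        # assumption: mixed precision to fit the 16 GB P100
)
```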
Evaluation Metrics
Model performance was evaluated using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, which measure the overlap between the generated summary and a human-written reference summary (a computation sketch follows the list below).
- ROUGE-1: Overlap of unigrams (single words)
- ROUGE-2: Overlap of bigrams (two-word phrases)
- ROUGE-L: Overlap based on the Longest Common Subsequence (LCS)
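A minimal sketch of computing these scores with the `evaluate` library's `rouge` metric, which returns aggregated F-measures as floats (the example predictions and references are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the model-generated summary ..."]
references = ["the human-written reference summary ..."]

scores = rouge.compute(predictions=predictions, references=references)
# Keys include rouge1, rouge2, and rougeL; reported as percentages in the tables below.
print({k: round(v * 100, 2) for k, v in scores.items()})
```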
Results & Performance
LSG-BART demonstrates highly competitive performance against other state-of-the-art long-sequence models.
Performance on Summarization Tasks (ROUGE Scores)
| Model | R1 | R2 | R-L |
|---|---|---|---|
| Pegasus (1K) | 45.49 | 19.90 | 27.69 |
| Big Bird-Pegasus (4K) | 46.32 | 20.65 | 42.33 |
| LongT5-Base (4K) | 47.77 | 22.58 | 44.38 |
| Our LSG-BART (4K) | 45.16 | 21.74 | 42.03 |
| LongT5-L (16K) | 49.98 | 24.69 | 46.46 |
| Our LSG-BART (16K) | 47.18 | 23.42 | 43.60 |
Key Insights
- Our LSG-BART (4K) model, while having a slightly lower R1 score, achieves a higher R2 score than Big Bird-Pegasus (4K). This suggests it is better at capturing meaningful phrases (bigrams)
- Our LSG-BART (16K) model scales effectively, showing strong results that are competitive with the much larger LongT5-L model
- This confirms that the LSG-BART approach successfully balances computational efficiency with high-quality summarization performance
Conclusion & Future Directions
Advantages
- Efficiency: Effectively processes long documents by avoiding the quadratic O(n²) cost of full self-attention
- High Performance: Achieves competitive ROUGE scores against other SOTA long-sequence models
- Resource-Optimized: Can be deployed on systems with more limited computational resources
Future Work
- Expand Vocabulary: Enhance the model with more domain-specific medical terminology
- Multilingual Models: Adapt the model to summarize medical documents in multiple languages
- Enhanced Fact-Checking: Integrate fact-checking mechanisms to further improve the accuracy and trustworthiness of summaries, which is critical for medical applications