NLP with Disaster Tweets
This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.
Type Competition
Role Competitor
Timeline 2025
Kaggle: NLP with Disaster Tweets (Top 10 Finish)
Project Overview
This repository contains the code and methodology for the Kaggle competition “Natural Language Processing with Disaster Tweets”. The goal is to build a machine learning model that can determine whether a given tweet is about a real disaster or not.
Rather than training a single model, this project fine-tunes a pre-trained RoBERTa model under a K-Fold Cross-Validation strategy and combines the resulting fold models into an ensemble.
🏆 Key Achievements
- Achieved a Top 10 Rank on the final Kaggle leaderboard out of thousands of participants.
- Final Kaggle Public Leaderboard Score: 0.84768.
- Utilized a robust 5-Fold Cross-Validation strategy to ensure model stability and performance.
- Successfully fine-tuned a RoBERTa-base model using the Hugging Face and PyTorch ecosystem.
🛠️ The Technical Pipeline
The final solution follows four steps:
1. Data Preprocessing
Initial cleaning of the tweet text was performed to standardize the input for the model. This included:
- Removing all URLs.
- Stripping HTML tags.
- Normalizing whitespace.
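The three cleaning steps above could be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the regular expressions and the function name `clean_tweet` are assumptions chosen for this example.

```python
import re

# Hypothetical patterns for illustration; the project's exact rules are not stated.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and HTML tags, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)          # 1. remove URLs
    text = TAG_RE.sub(" ", text)          # 2. strip HTML tags
    text = WHITESPACE_RE.sub(" ", text)   # 3. normalize whitespace
    return text.strip()


example = "Fire  near <b>La Ronge</b> http://t.co/abc  now"
cleaned = clean_tweet(example)
```

Replacing matches with a space rather than an empty string avoids gluing neighbouring words together; the final whitespace pass then removes the extra spaces.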
2. Model Architecture: RoBERTa
The core of the solution is a pre-trained RoBERTa-base model from the Hugging Face transformers library. RoBERTa is an optimized version of BERT that is highly effective for language understanding tasks. The model was fine-tuned specifically for this binary classification problem.
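Loading such a model for binary classification might look like the sketch below. It assumes the `transformers` package is installed and that the `roberta-base` checkpoint is used with two output labels; the helper name `build_classifier` is hypothetical and the project's own loading code may differ.

```python
def build_classifier(model_name: str = "roberta-base", num_labels: int = 2):
    """Return a (tokenizer, model) pair for sequence classification.

    Calling this downloads the checkpoint on first use, so it needs
    network access and the transformers package.
    """
    # Imported inside the function so that defining it has no heavy dependencies.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,  # adds a fresh 2-way classification head
    )
    return tokenizer, model
```

A call such as `tokenizer, model = build_classifier()` would give a model whose classification head is randomly initialized and still needs fine-tuning on the labelled tweets.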
3. Training Strategy: K-Fold Cross-Validation
To build a highly robust model and get a reliable performance estimate, a Stratified K-Fold strategy with 5 splits was implemented.
This involves training 5 separate RoBERTa models on different 80% subsets of the training data. Each model is validated on the remaining 20% of the data, ensuring that every sample is used for validation exactly once.
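To make the split concrete, here is a dependency-free sketch of stratified 5-fold splitting on toy labels. It is an illustration only: in practice a library routine such as scikit-learn's `StratifiedKFold` is the usual choice, and the function name, seed, and toy class balance below are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs that roughly preserve class ratios."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class and deal its indices round-robin into the folds.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves as the validation set once; the rest is training data.
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx


# Toy labels: 60 "not disaster" (0) and 40 "disaster" (1) tweets.
toy_labels = [0] * 60 + [1] * 40
splits = list(stratified_kfold(toy_labels))
```

With 100 toy samples, each of the 5 iterations trains on 80 samples and validates on the other 20, and every validation fold keeps the 60/40 class balance. One model would be fine-tuned per iteration.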
4. Final Prediction: Ensembling with Majority Vote
The final prediction for each tweet in the test set was determined by a majority vote from the 5 models trained during the cross-validation process. This ensembling technique leverages the “wisdom of the crowd” by combining the predictions of all models, leading to a more accurate and stable final submission.
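A majority vote over the fold models can be sketched as below. The prediction values are made up for illustration, and `majority_vote` is a hypothetical helper rather than the project's code.

```python
def majority_vote(fold_predictions):
    """Combine per-model 0/1 predictions into one label per sample.

    fold_predictions holds one list per model, each with one 0/1
    prediction per test sample.
    """
    n_models = len(fold_predictions)
    n_samples = len(fold_predictions[0])
    final = []
    for i in range(n_samples):
        votes_for_disaster = sum(preds[i] for preds in fold_predictions)
        # Label 1 wins only with a strict majority of the models.
        final.append(1 if votes_for_disaster * 2 > n_models else 0)
    return final


# Made-up predictions from 5 fold models on 4 test tweets.
fold_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]
final_labels = majority_vote(fold_predictions)  # [1, 0, 1, 0]
```

Because the number of models is odd and the task is binary, a tie cannot occur, so no tie-breaking rule is needed.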
📊 Results Summary
- Average 5-Fold Validation Accuracy: 83.61%
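- Final Kaggle Public Leaderboard Score: 0.84768 (the leaderboard metric is F1, so this is not directly comparable to the validation accuracy above)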
- Final Kaggle Rank: 10
Future Improvements
While this solution achieved a top rank, further improvements could be explored:
- Experiment with Larger Models: Fine-tuning a roberta-large model could yield further improvements, though it requires more computational resources.
- Advanced Ensembling: Combining predictions from different transformer architectures (e.g., DeBERTa, BERT, RoBERTa) could create an even more powerful ensemble.
- Meticulous Data Cleaning: A deeper dive into cleaning tweet-specific slang, abbreviations, and misspellings might provide a slight edge.