
NLP with Disaster Tweets

This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even without much personal computing power, all of the work can be done in Kaggle's free, no-setup Jupyter Notebooks environment, Kaggle Notebooks.

Type Competition

Role Competitor

Timeline 2025

Kaggle: NLP with Disaster Tweets (Top 10 Finish)

Project Overview

This repository contains the code and methodology for the Kaggle competition “Natural Language Processing with Disaster Tweets”. The goal is to build a machine learning model that can determine whether a given tweet is about a real disaster or not.

This project goes beyond a simple model, implementing a robust, professional-grade pipeline using a pre-trained RoBERTa model and a K-Fold Cross-Validation strategy.

🏆 Key Achievements

Achieved a Top 10 finish on the competition leaderboard.

🛠️ The Technical Pipeline

The final, high-scoring solution was achieved through a systematic and robust workflow:

1. Data Preprocessing

Initial cleaning of the tweet text was performed to standardize the input for the model; a sketch of typical steps is given below.
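
A minimal sketch of a typical tweet-cleaning pass (the specific steps below, such as decoding HTML entities and stripping URLs and mentions, are assumptions and may differ from the ones actually used):

```python
import html
import re

def clean_tweet(text: str) -> str:
    """Illustrative tweet cleanup; the real pipeline's steps may differ."""
    text = html.unescape(text)                 # decode HTML entities such as &amp;
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"@\w+", "", text)           # strip @user mentions
    return re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace

print(clean_tweet("Forest fire near La Ronge &amp; Sask. https://t.co/example"))
```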

2. Model Architecture: RoBERTa

The core of the solution is a pre-trained RoBERTa-base model from the Hugging Face transformers library. RoBERTa is an optimized version of BERT that is highly effective for language understanding tasks. The model was fine-tuned specifically for this binary classification problem.
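
As a rough sketch of how such a model is set up with the transformers library (the actual fine-tuning loop, hyperparameters, and maximum sequence length are assumptions, not the exact code used here):

```python
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

# Pre-trained RoBERTa-base with a fresh 2-class classification head on top.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Tokenize a small batch of tweets and run a forward pass.
batch = tokenizer(
    ["Forest fire near La Ronge Sask. Canada", "what a beautiful sunset tonight"],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
logits = model(**batch).logits  # shape: (batch_size, 2) -> disaster vs. not disaster
```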

3. Training Strategy: K-Fold Cross-Validation

To build a highly robust model and get a reliable performance estimate, a Stratified K-Fold strategy with 5 splits was implemented.

This involves training 5 separate RoBERTa models on different 80% subsets of the training data. Each model is validated on the remaining 20% of the data, ensuring that every sample is used for validation exactly once.
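
A minimal sketch of the splitting logic using scikit-learn's StratifiedKFold (the fine-tuning call is only indicated by a comment; the column names follow the competition's train.csv schema):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train_df = pd.read_csv("train.csv")  # competition training data: 'text' and 'target' columns

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold trains on ~80% of the tweets and validates on the held-out ~20%,
# with the disaster / not-disaster ratio preserved in every split.
for fold, (train_idx, val_idx) in enumerate(skf.split(train_df["text"], train_df["target"])):
    train_fold, val_fold = train_df.iloc[train_idx], train_df.iloc[val_idx]
    # fine_tune_roberta(train_fold, val_fold)  # one RoBERTa model trained per fold
    print(f"Fold {fold}: {len(train_fold)} train / {len(val_fold)} validation tweets")
```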

4. Final Prediction: Ensembling with Majority Vote

The final prediction for each tweet in the test set was determined by a majority vote from the 5 models trained during the cross-validation process. This ensembling technique leverages the “wisdom of the crowd” by combining the predictions of all models, leading to a more accurate and stable final submission.
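
A minimal sketch of the voting step, assuming each of the 5 fold models has already produced a 0/1 prediction for every test tweet (random placeholders stand in for the real model outputs):

```python
import numpy as np
import pandas as pd

test_df = pd.read_csv("test.csv")  # competition test data: 'id' and 'text' columns
n_test = len(test_df)

# Placeholder predictions; in the real pipeline these come from the 5 fine-tuned fold models.
rng = np.random.default_rng(0)
fold_preds = np.stack([rng.integers(0, 2, size=n_test) for _ in range(5)])

# Majority vote: label a tweet as a real disaster (1) if at least 3 of the 5 models agree.
final_preds = (fold_preds.sum(axis=0) >= 3).astype(int)

pd.DataFrame({"id": test_df["id"], "target": final_preds}).to_csv("submission.csv", index=False)
```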

📊 Results Summary

Future Improvements

While this solution achieved a top rank, further improvements could still be explored.
