- Gabriel de Olim Gaul
- Siya Gowda
- Ryan Patel
This repository contains the code for an NLP coursework submission at Imperial College London: a Natural Language Processing (NLP) classifier that detects patronizing or condescending language in text, trained on the Don't Patronize Me! dataset. All experiments and hyperparameter tuning can be found in the models/ directory.
The best-performing model in this project is a finetuned DeBERTa model, incorporating:
- Synonym replacement for data augmentation.
- Class-weighted sampling to handle data imbalance.
- Preprocessing (punctuation removal and lemmatization); the last two techniques are sketched below.
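The snippet below is a minimal, illustrative sketch of how the preprocessing and class-weighted sampling steps might look, assuming spaCy's en_core_web_sm pipeline for lemmatization and PyTorch's WeightedRandomSampler. It is not the exact training code, and the function names are hypothetical.

```python
import string

import spacy
import torch
from torch.utils.data import WeightedRandomSampler

# Assumes the en_core_web_sm pipeline installed in the setup step below.
nlp = spacy.load("en_core_web_sm")


def preprocess(text):
    """Strip punctuation, then lemmatize each token with spaCy."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok.lemma_ for tok in nlp(text))


def make_weighted_sampler(labels):
    """Sample each example with probability inversely proportional to its class frequency."""
    labels = torch.tensor(labels)
    counts = torch.bincount(labels)
    weights = 1.0 / counts[labels].float()
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)


if __name__ == "__main__":
    print(preprocess("These poor souls truly need our help!"))
    sampler = make_weighted_sampler([0, 0, 0, 0, 1])  # heavily imbalanced toy labels
```

Sampling with weights inversely proportional to class frequency makes the rare positive (patronizing) examples appear roughly as often as negative ones during training.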
The repository is organized as follows:
📂 analysis # Code for analyzing the dataset and the final trained model
📂 dataset # Scripts for reading and splitting the dataset into training and validation sets
📂 models # Implementation of different models and experiments
├── Baseline models: BoW and TF-IDF with logistic regression (see the sketch after this tree)
├── DeBERTa finetuning with hyperparameter tuning
├── Data augmentation, sampling, and preprocessing techniques
📄 dev.txt # Final predictions for the development dataset
📄 test.txt # Final predictions for the test dataset
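For orientation, here is a minimal sketch of the kind of TF-IDF plus logistic regression baseline implemented under models/, assuming scikit-learn. The toy data and hyperparameters are illustrative, not the ones used in the coursework.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the Don't Patronize Me! paragraphs and binary labels.
texts = ["these poor souls need our help", "the report summarises local housing policy"]
labels = [1, 0]

# class_weight="balanced" mirrors the class-imbalance handling used elsewhere in the repo.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["such a heartwarming story about the needy"]))
```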
To run the code in this repository, install the required dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m spacy download en_core_web_sm
The models were trained on the GPU lab machines at Imperial College London.
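As a quick sanity check that a GPU is visible before finetuning, the sketch below loads a DeBERTa sequence classifier with Hugging Face transformers and runs a toy forward pass. The checkpoint name (microsoft/deberta-base) and binary label count are assumptions, not necessarily those of the final model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint choice; the coursework may have used a different DeBERTa variant.
checkpoint = "microsoft/deberta-base"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)

# Toy forward pass to confirm the GPU setup works end to end.
batch = tokenizer(["a condescending example sentence"], return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))
```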
- The dataset used in this project: Don't Patronize Me! (Dataset Link)
- DeBERTa model from Microsoft for state-of-the-art NLP performance. (DeBERTa Paper)
This repository is part of a coursework submission for the NLP course at Imperial College London. Any unauthorized use, reproduction, or submission of this code as original work may constitute academic misconduct or plagiarism.