This repository contains the code used for the Kaggle challenge "AI and Ethics". The aim of the competition is to assign the correct job category to a job description.
The data is representative of what can be found on the English-speaking part of the Internet, and therefore contains a certain amount of bias. One of the goals of the competition is to design a solution that is both accurate and fair.
You can find an exploratory analysis of the data in the notebook folder of this repository.
There are two ways to solve this problem:
- Classical NLP methods, which you can train either locally or on the Google Cloud Platform.
- Transformers (BERT), for which Google Colaboratory is recommended.
Run the following line in your terminal:

pip install -r requirements.txt

Then, from a Python interpreter, download the required NLTK resources:

python
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
Download the data from the link.
There are multiple ways to run the code:
First, change the data path in the three scripts (cleaning.py, embedding.py, classification.py).
From the command line, go into the script folder and run the following lines:
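For illustration, each script presumably defines its input and output locations near the top, along these lines. The variable names below are hypothetical, not the ones actually used in the scripts — edit whatever path variables the scripts actually contain:

```python
# Hypothetical path variables -- adapt the names and values to your machine.
DATA_PATH = "/path/to/data"      # folder containing the downloaded competition files
OUTPUT_PATH = "/path/to/output"  # where cleaned data, embeddings and predictions go
```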
python cleaning.py
python embedding.py
python classification.py
From the code, you can modify the parameters, for example to change the embedding method.
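As one plausible configuration, a TF-IDF embedding feeding a linear classifier can be built and swapped out as sketched below; the parameter names and values here are illustrative, not the ones the scripts actually expose:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_pipeline(embedding="tfidf"):
    """Text-classification pipeline; change `embedding` to swap the vectoriser."""
    if embedding == "tfidf":
        vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
    elif embedding == "counts":
        vectorizer = CountVectorizer(max_features=20000)
    else:
        raise ValueError(f"unknown embedding: {embedding}")
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))

# Usage: fit on cleaned descriptions and their job-category labels (toy data here).
clf = build_pipeline("tfidf")
clf.fit(
    ["nurse hospital patient", "software engineer code", "teacher classroom student"],
    ["nurse", "engineer", "teacher"],
)
```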
In the instance.py file, modify the parameters for your virtual machine.
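For illustration, the virtual-machine parameters in instance.py are presumably constants along these lines. The names and values below are hypothetical, not taken from the actual file:

```python
# Hypothetical VM settings -- adapt to the names actually used in instance.py.
PROJECT_ID = "my-gcp-project"   # your Google Cloud project
ZONE = "europe-west1-b"         # zone where the VM is created
MACHINE_TYPE = "n1-standard-4"  # CPU/RAM profile of the VM
```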
Change the path in the four scripts (project.py, cleaning.py, embedding.py, classification.py).
Move the requirements.txt file into the script folder.
From the command line, run the following line:
python main.py
With this architecture, the trained models are saved in the model folder and the submission files in the result folder.
Upload bert.ipynb from the notebook folder to Google Colab.
Select a GPU runtime.
Install the packages and import the data.
Run the cells and download the submission file.
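The notebook itself is not reproduced here, but its fine-tuning setup is presumably close to the standard Hugging Face recipe sketched below. The model name, sequence length, learning rate, and label count are assumptions, not values taken from bert.ipynb:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

def build_model(num_labels, model_name="bert-base-uncased"):
    """Pretrained BERT encoder with a freshly initialised classification head."""
    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    return tokenizer, model

def train_step(model, optimizer, batch):
    """One gradient step; `batch` is the tokenizer output plus a `labels` tensor."""
    model.train()
    loss = model(**batch).loss  # cross-entropy computed internally from `labels`
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

if __name__ == "__main__":
    # Downloads the pretrained weights -- run on the Colab GPU runtime.
    tokenizer, model = build_model(num_labels=10)  # label count is an assumption
    batch = tokenizer(["an example job description"], return_tensors="pt",
                      truncation=True, padding=True, max_length=128)
    batch["labels"] = torch.tensor([0])
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    print(train_step(model, optimizer, batch))
```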