Clone the GitHub repository:

```bash
git clone https://github.com/ffaisal93/DialectBench.git
cd DialectBench
```

- Download all available data (everything except the machine translation data and the datasets loadable through Hugging Face):

```bash
bash download_data.sh --task all
```
- Download data for Turkish dialectal machine translation:

```bash
bash download_data.sh --task machine_translation_turkish
```
- Dependency parsing: install the adapter package.

```bash
bash install.sh --task install_adapter
```

- Extractive question answering [SDQA]: install Transformers 3.4.0.

```bash
bash install.sh --task install_transformers_qa
```

- Other structured prediction, QA, and classification tasks: install Transformers 4.21.1.

```bash
bash install.sh --task install_transformers
```
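Note that the SDQA setup (Transformers 3.4.0) conflicts with the version used for the other tasks (4.21.1). If `install.sh` does not already isolate these installations on your machine, one option is to keep each setup in its own virtual environment. This is a hedged sketch, not part of the repository's scripts; the environment names `venv_qa` and `venv_main` are illustrative:

```bash
# Hedged sketch: one virtual environment per Transformers version.
# venv_qa / venv_main are hypothetical names, not defined by the repo.
python -m venv venv_qa
source venv_qa/bin/activate
bash install.sh --task install_transformers_qa   # Transformers 3.4.0 for SDQA
deactivate

python -m venv venv_main
source venv_main/bin/activate
bash install.sh --task install_transformers      # Transformers 4.21.1 for the other tasks
deactivate
```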
Dependency parsing:

- Finetune all available language-specific models on both pretrained mBERT and XLM-R at once:

```bash
./all_commands.sh --action train_udp --execute bash
```

- Finetune a single available language-specific model:

```bash
bash install.sh --task train_udp --lang UD_English-EWT --MODEL_NAME mbert
```
- Run prediction with all finetuned models (for both pretrained mBERT and XLM-R); if no training data is available for a specific language variety, zero-shot prediction is done from the English variety "UD_English-EWT":

```bash
./all_commands.sh --action predict_udp --execute bash
```
- Do zero-shot prediction from a specific language variety (e.g. UD_English-EWT) on all available varieties defined in `--lang_config metadata/udp_metadata.json`:

```bash
bash install.sh --task predict_udp_zeroshot_all --lang UD_English-EWT --MODEL_NAME mbert
```
- Run test-set prediction for a single finetuned language variety (e.g. UD_English-EWT):

```bash
bash install.sh --task predict_udp_single --lang UD_English-EWT --MODEL_NAME mbert
```
Part-of-speech tagging:

- Finetune all available language-specific models on both pretrained mBERT and XLM-R at once (a hedged single-variety sketch follows this command):

```bash
./all_commands.sh --action train_pos --execute bash
```
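By analogy with the dependency parsing commands above, a single variety can presumably also be finetuned through `install.sh`; the `train_pos` task name is taken from the command above, but the full flag set is an assumption:

```bash
# Assumed by analogy with the train_udp example; verify against install.sh.
bash install.sh --task train_pos --lang UD_English-EWT --MODEL_NAME mbert
```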
- Run prediction with all finetuned models (for both pretrained mBERT and XLM-R); if no training data is available for a specific language variety, zero-shot prediction is done from the English variety "UD_English-EWT":

```bash
./all_commands.sh --action predict_pos --execute bash
```
Named entity recognition:

- Perform in-variety finetuning on all available language varieties on both pretrained mBERT and XLM-R at one go:

```bash
./all_commands.sh --action train_ner --execute bash
```

- Or, if you want to perform in-variety finetuning for a single language only, try the following:

```bash
bash install.sh --task train_ner --lang bokmaal --MODEL_NAME mbert --dataset scripts/ner/norwegian_ner.py
```
- We have two datasets supported in DialectBench at this point: `wikiann` and `norwegian_ner` (a hedged wikiann sketch follows this list).
  - `wikiann`: language varieties ("ar" "az" "ku" "tr" "hsb" "nl" "fr" "zh" "en" "mhr" "it" "de" "pa" "es" "hr" "lv" "hi" "ro" "el" "bn"). Use `--dataset wikiann` to finetune varieties from this dataset.
  - `norwegian_ner`: language varieties ("bokmaal" "nynorsk" "samnorsk"). Use `--dataset scripts/ner/norwegian_ner.py` to finetune varieties from this dataset.
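For instance, a `wikiann` variety such as Turkish would presumably be finetuned as follows; this mirrors the single-language example above and is an assumption, not a command taken verbatim from the repository:

```bash
# Assumed pattern, mirroring the single-language train_ner example above.
bash install.sh --task train_ner --lang tr --MODEL_NAME mbert --dataset wikiann
```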
- Run prediction using all in-variety finetuned models (for both pretrained mBERT and XLM-R), as well as zero-shot prediction from the English variety `en`, on the varieties available in `--lang_config metadata/ner_metadata.json`, at one go:

```bash
./all_commands.sh --action predict_ner --execute bash
```
Topic classification:

- Perform in-cluster finetuning (on both pretrained mBERT and XLM-R) on selected varieties from different language clusters:

```bash
./all_commands.sh --action train_topic_classification_lm --execute bash
```

- Add or remove specific varieties for finetuning from the SIB-200 dataset in this `command-bash.sh` block (the excerpt ends mid-block; a hedged reconstruction of the remainder follows):

```bash
if [[ "$task" = "train_topic_classification_lm" || "$task" = "predict_topic_classification_lm" ]]; then
    export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn")
```
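By analogy with the NLI and sentiment analysis blocks quoted later in this README, the rest of this block presumably loops over `ALL_LANGS` and dispatches each variety to `install.sh`. This reconstruction is an assumption; verify it against `command-bash.sh` before relying on it:

```bash
# Hedged reconstruction of the remainder of the block above, mirroring the
# NLI and sentiment analysis blocks from command-bash.sh quoted below.
    for lang in "${ALL_LANGS[@]}"; do
        echo ${base_model}
        echo ${lang}
        bash install.sh --task ${task} --lang ${lang} --MODEL_NAME ${base_model}
    done
fi
```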
- Perform inference on all available varieties across different language clusters (as defined in `--lang_config metadata/topic_metadata.json`) and on top of different pretrained models (mBERT, XLM-R):

```bash
./all_commands.sh --action predict_topic_classification_lm --execute bash
```

Natural language inference:

- Perform zero-shot finetuning from English (on top of both pretrained mBERT and XLM-R) on selected varieties from different language clusters:

```bash
./all_commands.sh --action train_nli --execute bash
```
- Add or remove specific varieties for finetuning from the translate-test `dialect_nli` dataset in this `command-bash.sh` block:

```bash
if [[ "$task" = "train_nli" || "$task" = "predict_nli" ]]; then
    # export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn" "ben_Beng")
    export ALL_LANGS=("eng_Latn")
    for lang in "${ALL_LANGS[@]}"; do
        echo ${base_model}
        echo ${lang}
        echo ${dataset}
        bash install.sh --task ${task} --lang ${lang} --MODEL_NAME ${base_model}
    done
fi
```
- `dialect_nli` dataset loading script: `--dataset_script scripts/nli/dialect_nli.py`
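Unrolled, each iteration of the loop above is a single `install.sh` call; for example, finetuning on `eng_Latn` with mBERT reduces to the following (derived from the loop body, with `${base_model}` set to mbert):

```bash
# One iteration of the loop above, with lang=eng_Latn and base_model=mbert.
bash install.sh --task train_nli --lang eng_Latn --MODEL_NAME mbert
```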
- Perform inference on all available varieties across different language clusters (as defined in `--lang_config metadata/nli_metadata.json`) and on top of different pretrained models (mBERT, XLM-R):

```bash
./all_commands.sh --action predict_nli --execute bash
```

Sentiment analysis:

- At this point, DialectBench only supports Arabic dialectal sentiment analysis. To finetune variety-specific models:

```bash
./all_commands.sh --action train_sa --execute bash
```

- To evaluate each variety-specific model at one go:

```bash
./all_commands.sh --action predict_sa --execute bash
```
- Add or remove specific varieties for finetuning in this `command-bash.sh` block (a single-variety equivalent follows the block):

```bash
if [[ "$task" = "train_sa" || "$task" = "predict_sa" ]]; then
    export ALL_LANGS=("aeb_Arab" "aeb_Latn" "arb_arab" "ar-lb" "arq_arab" "ary_arab" "arz_arab" "jor_arab" "sau_arab")
    for lang in "${ALL_LANGS[@]}"; do
        echo ${base_model}
        echo ${lang}
        echo ${dataset}
        bash install.sh --task ${task} --lang ${lang} --lang2 arabic --MODEL_NAME ${base_model}
    done
fi
```
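As with NLI, one iteration of this loop is a single call; for example, finetuning on `ary_arab` with mBERT reduces to the following (derived from the loop body, with `${base_model}` set to mbert):

```bash
# One iteration of the loop above, with lang=ary_arab and base_model=mbert.
bash install.sh --task train_sa --lang ary_arab --lang2 arabic --MODEL_NAME mbert
```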
Dialect identification:

- Finetune Arabic, English, Greek, Mandarin, Portuguese, Spanish, and Swiss-dialect identification models (mBERT- and XLM-R-based):

```bash
./all_commands.sh --action train_did --execute bash
```

- Finetune a dialect identification model for a single language (note that `${dataset}` must also be set before running the last command):

```bash
export lang="arabic" # "arabic" "english" "greek" "mandarin_simplified" "mandarin_traditional" "portuguese" "spanish" "swiss-dialects"
export base_model="mbert" # "mbert" "xlmr"
bash install.sh --task train_did --lang ${lang} --dataset ${dataset} --MODEL_NAME ${base_model}
```

- Run prediction with all finetuned dialect identification models:

```bash
./all_commands.sh --action predict_did_lm --execute bash
```

Reading comprehension:

- Finetune all models at one go:

```bash
./all_commands.sh --action train_reading_comprehension --execute bash
```

- Run prediction at one go:

```bash
./all_commands.sh --action predict_reading_comprehension --execute bash
```

Extractive question answering [SDQA]:

- Finetune on all languages at once, as well as on a single language and its varieties:
```bash
./all_commands.sh --action train_sdqa --execute bash
```

- Add or remove specific language clusters in this `command-bash.sh` block (a single-cluster equivalent is sketched at the end of this section):
```bash
if [[ "$task" = "train_sdqa" || "$task" = "predict_sdqa" ]]; then
    export ALL_MODELS=("all" "arabic" "bengali" "english" "finnish" "indonesian" "korean" "russian" "swahili" "telugu")
    for MODEL_NAME in "${ALL_MODELS[@]}"; do
        echo ${base_model}
        echo ${MODEL_NAME}
        bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset dev
        bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset test
    done
fi
```

- Run prediction on all finetuned models at one go:

```bash
./all_commands.sh --action predict_sdqa --execute bash
```
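Returning to the training loop above: unrolled, one iteration handles a single language cluster. For example, Bengali on mBERT reduces to the following calls (derived from the loop body, with `${base_model}` set to mbert):

```bash
# One iteration of the loop above, with lang=bengali and base_model=mbert.
bash install.sh --task train_sdqa --lang bengali --MODEL_NAME mbert --dataset dev
bash install.sh --task train_sdqa --lang bengali --MODEL_NAME mbert --dataset test
```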