A novel method to compare the phonetic similarity between words based on phonetic features. This is the official repository for the paper https://arxiv.org/pdf/2109.14796.pdf
Download the CMU Pronouncing Dictionary into the data directory.
wget -P data http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b
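Download the vocabulary of the state-of-the-art language model from the NLP for Hindi git repository. The URL is quoted so that the shell does not interpret the & character.
wget -O data/hindi_lm_large.vocab "https://drive.google.com/uc?export=download&id=1P6r8UBcegvVmr1kBDjqcYppmt_WgnbNt"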
Add the missing words to the CMU dictionary.
cat data/cmudict-0.7b res/cmudict_missing_words >> data/cmudict-0.7b-with-vitz-nonce
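For readers who want to inspect the merged dictionary, the following is a minimal sketch of a loader for CMUdict-style lines. It assumes the usual cmudict-0.7b layout (comment lines starting with `;;;`, a word followed by its phonemes, and alternate pronunciations marked as `WORD(1)`). The function name `parse_cmudict` and the sample lines are illustrative and are not taken from this repository.

```python
import re


def parse_cmudict(lines):
    """Parse CMUdict-style lines into {word: [pronunciation, ...]}.

    Each pronunciation is a list of phoneme strings.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with ";;;").
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Alternate pronunciations are written as WORD(1), WORD(2), ...
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word, []).append(phones)
    return entries


# Illustrative sample lines, not read from the real file.
sample = [
    ";;; a comment line",
    "RELATION  R IY0 L EY1 SH AH0 N",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
cmu = parse_cmudict(sample)
print(cmu["READ"])
```

To run this on the real file, pass an open file object instead of `sample`. If decoding fails with the default encoding, opening the file with `encoding="latin-1"` may help.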
Install all the dependencies.
pip install -r src/requirements.txt
Generate the Hindi dictionary from the LM vocabulary.
python src/preprocess/vocab2dict.py res/hindi_phones.csv data/hindi_lm_large.vocab data/dict_hindi
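The general idea of turning a vocabulary into a pronunciation dictionary can be sketched as a character-to-phone lookup. This is only an illustration and not the logic of src/preprocess/vocab2dict.py: the CSV column names (`char`, `phone`), the four mapping rows, and the helper names are all assumptions, and the sketch ignores details such as the inherent vowel of Devanagari consonants.

```python
import csv
import io

# Hypothetical mapping table. The real res/hindi_phones.csv may use a
# different layout and different phone symbols.
PHONES_CSV = "char,phone\nक,k\nम,m\nल,l\nा,aa\n"


def load_phone_map(fileobj):
    """Read a two-column CSV into a {character: phone} dictionary."""
    return {row["char"]: row["phone"] for row in csv.DictReader(fileobj)}


def word_to_phones(word, phone_map):
    """Map each character to its phone; unmapped characters are skipped."""
    return [phone_map[ch] for ch in word if ch in phone_map]


phone_map = load_phone_map(io.StringIO(PHONES_CSV))
vocab = ["कमल", "काम"]  # illustrative vocabulary entries
hindi_dict = {word: word_to_phones(word, phone_map) for word in vocab}
for word, phones in hindi_dict.items():
    print(word, " ".join(phones))
```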
results_method.ipynb contains the results for the algorithm. The results include:
Comparison between unigram, bigram, bigram with penalty, and bigram with penalty and vowel weight.

How we obtained the penalty of 2.5.

Comparison between Vitz and Winkler (1973), Parrish's Embeddings (2017), and our methods (with and without vowel weights).
^ The Parrish's Embeddings (PSSVec) results were generated from the author's published git code, with numpy.random.seed(0) set in generate.py. We cannot use the author's pretrained vectors because the dictionary they were built from lacks the word BELATION, which is used in the RELATION dataset by Vitz and Winkler (1973).
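To give a feel for why the unigram and bigram variants listed above can score the same word pair differently, here is a simplified sketch that computes Jaccard similarity over phoneme n-grams. It is a stand-in, not the repository's algorithm: it works on raw phonemes rather than phonetic features and omits the penalty and the vowel weight. The function names and the stress-free phoneme sequences (including the assumed pronunciation of the nonce word BELATION) are illustrative.

```python
def ngrams(phones, n):
    """Return the set of phoneme n-grams, padded with '#' at word edges."""
    padded = ["#"] * (n - 1) + list(phones) + ["#"] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity(phones_a, phones_b, n):
    """Jaccard similarity between the n-gram sets of two phoneme lists."""
    return jaccard(ngrams(phones_a, n), ngrams(phones_b, n))


# Illustrative phoneme sequences with stress markers removed.
relation = ["R", "IY", "L", "EY", "SH", "AH", "N"]
belation = ["B", "IY", "L", "EY", "SH", "AH", "N"]

print(similarity(relation, belation, 1))  # unigram overlap
print(similarity(relation, belation, 2))  # bigram overlap
```

In this toy setting the bigram score is lower than the unigram score, because a single changed phoneme affects two bigrams but only one unigram.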
The similarity vectors we used for calculating PSSVec can be downloaded with the command below. The URL is quoted so that the shell does not interpret the & character.
wget -O data/cmudict-0.7b-simvecs "https://drive.google.com/uc?export=download&id=1gCvwI8ldxGM52vCoN70wUKmJfFMdapNl"
Embedding scores can be re-generated using src/embedding.py by providing the learned embedding file and the output file.
python src/embedding.py data/cmudict-0.7b-simvecs res/PSSVec_results.csv
python src/embedding.py embedding_english/simvecs res/embedding_score.csv
^ These files are used to generate scores in the result section using results_method.ipynb.
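This README does not state how src/embedding.py scores a pair of words from their vectors. A common choice for embedding-based similarity is cosine similarity, so the sketch below assumes it. The toy vectors and the function name `cosine` are hypothetical and do not come from the simvecs files.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-dimensional vectors, for illustration only.
vectors = {
    "RELATION": [1.0, 2.0, 0.0],
    "BELATION": [1.0, 2.0, 1.0],
    "SIT": [0.0, 0.0, 3.0],
}

score = cosine(vectors["RELATION"], vectors["BELATION"])
print(round(score, 4))
```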
t-SNE plot for some English words
t-SNE plot for some Hindi words
Pun Dataset (see docs/puns.md for more details)
Docker is supported for development and training.
make build
make develop
This gives you a command prompt inside the Docker container. The current directory is mounted at /workspace.
The container is destroyed on exit, but all files and changes made in the directory persist.
You can also start it with GPU support:
make develop_gpu
make clean
Note that this does not delete the base image. To remove the base image, run:
make clean_base
This project is licensed under the MIT License. See the LICENSE file for details.




