aditya2211/CS565-2017-Assignment2

Faceless_Men CS565-Assignment2

Neural Probabilistic Language Model

The model proposed by Bengio et al. is implemented in TensorFlow. It can be run using nplm.py.

Implementation details:

  • Batch size = 5000
  • Window size = 5
  • Number of epochs = 5
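
As a rough illustration of what a window size of 5 could mean for this model, the sketch below builds (context, target) training pairs in which the first four word ids of each window predict the fifth. Whether nplm.py counts the target word as part of the window is an assumption here, and the function name make_ngram_pairs is hypothetical.

```python
def make_ngram_pairs(token_ids, window_size=5):
    """Return (context, target) pairs: the first window_size - 1 ids predict the last one."""
    pairs = []
    # Slide a fixed-size window over the sequence, one position at a time.
    for start in range(len(token_ids) - window_size + 1):
        window = token_ids[start:start + window_size]
        pairs.append((window[:-1], window[-1]))
    return pairs


# Toy sequence of word ids (hypothetical data).
pairs = make_ngram_pairs([0, 1, 2, 3, 4, 5, 6], window_size=5)
for context, target in pairs:
    print(context, "->", target)
```

In a batched setup, pairs like these would be grouped into batches of the stated batch size before being fed to the network.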

The obtained embeddings are saved in nplm_emb.pickle.

Singular Value Decomposition

The file svd.py contains the code for running both the library function for SVD and a self-implemented version of SVD. The code also generates the word-word co-occurrence matrix. The embeddings obtained using the library function are stored in a pickle file named svd_library.pickle, and the embeddings obtained through the self-implemented SVD are pickled in svd_selfimp.pickle.
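
The general recipe can be sketched as follows: count co-occurrences within a symmetric context window, then keep the leading singular directions as embeddings. This is a minimal NumPy sketch and not the code in svd.py; the function names, the window size, and the choice to scale the left singular vectors by the singular values are all assumptions.

```python
import numpy as np


def cooccurrence_matrix(sentences, vocab, window=2):
    """Count how often each pair of vocabulary words appears within `window` positions."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for sentence in sentences:
        # Out-of-vocabulary words are dropped before windowing.
        ids = [index[w] for w in sentence if w in index]
        for pos, center in enumerate(ids):
            lo = max(0, pos - window)
            hi = min(len(ids), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[center, ids[ctx_pos]] += 1.0
    return counts


def svd_embeddings(counts, dim):
    """Keep the top `dim` singular directions, scaled by their singular values."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dim] * s[:dim]


# Toy corpus (hypothetical data).
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = ["the", "cat", "dog", "sat"]
counts = cooccurrence_matrix(sentences, vocab, window=1)
emb = svd_embeddings(counts, dim=2)
print(emb.shape)
```

Each row of emb is the embedding of the vocabulary word at the same index.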

Word2Vec

The CBOW model with negative sampling proposed by Mikolov et al. is implemented in TensorFlow. It can be run using word2vec.py.

For negative sampling, the counts of all words in the corpus are passed to the model, which builds a unigram distribution from them and uses the built-in NCE loss function as follows.

neg_sample = tf.nn.fixed_unigram_candidate_sampler(y_in, 1, neg_sample_size, True, dictionary_size, unigrams=count)
self.losses = tf.nn.nce_loss(W, b, tf.to_float(y_in), x_emb_in, neg_sample_size, dictionary_size, sampled_values=neg_sample)
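
To make the sampling step concrete, here is a minimal NumPy sketch of the idea behind a count-based unigram sampler: the counts are normalized into probabilities, and distinct negative word ids are drawn from that distribution (loosely mirroring the unique flag set to True in the call above). It illustrates the concept only and is not how TensorFlow implements the sampler; sample_negatives and the toy counts are hypothetical.

```python
import numpy as np


def unigram_distribution(counts):
    """Normalize raw word counts into a probability distribution."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts / counts.sum()


def sample_negatives(counts, num_sampled, rng):
    """Draw `num_sampled` distinct word ids with probability proportional to their counts."""
    probs = unigram_distribution(counts)
    return rng.choice(len(probs), size=num_sampled, replace=False, p=probs)


rng = np.random.default_rng(0)
counts = [50, 30, 15, 5]  # hypothetical corpus counts for a 4-word vocabulary
probs = unigram_distribution(counts)
negatives = sample_negatives(counts, num_sampled=2, rng=rng)
print(probs, negatives)
```

Frequent words are drawn as negatives more often, which is the behaviour the unigram counts are meant to provide.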

Implementation details:

  • Batch size = 5000
  • Window size = 5
  • Number of epochs = 50

The obtained embeddings are saved in word2vec_emb.pickle.
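
Saved embeddings can be read back with the standard pickle module. The sketch below round-trips a toy object through a temporary file; the dictionary layout (word to vector) and the file name toy_emb.pickle are assumptions, since the actual structure stored in the .pickle files is not described here.

```python
import os
import pickle
import tempfile

# Hypothetical embedding table: word -> vector.
embeddings = {"king": [0.1, 0.2], "queen": [0.3, 0.4]}

# Write to a temporary location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "toy_emb.pickle")
with open(path, "wb") as f:
    pickle.dump(embeddings, f)

# Pickle files must be opened in binary mode for reading as well.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded))
```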

Named Entity Recognition (NER)

Generating the dataset

We use the enwik8 corpus, which consists of the first 10^8 bytes of an English Wikipedia dump. This file contains XML markup, so it is first converted to plain English text using a Perl script: perl convert.pl enwik8 > text8.

NER model

The data preprocessing, neural network implementation, model training and prediction generation are done in entityrecognition.py.

entityrecognition.py contains code for a single hidden layer model and a CNN model. An LSTM model was also implemented, but it did not perform well.

About

CS565-2017: NLP
