DAY 51-100 DAYS MLCODE: Word Embedding

December 31, 2018 | 100-Days-Of-ML-Code blog
Word2Vec

In the previous blogs we discussed LSTM and GRU; in this blog we'll discuss Word Embedding.

As per Wikipedia, word embedding:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

Wikipedia

When we work with NLP, we have to represent our sentences or words in a way that lets us train our model and transfer what it learns from one sentence to other tasks.

The first idea that comes to mind is to represent each word with a unique ID, for example Fox as X234 and Lion as Y567. In that case our model cannot reuse what it has learned about Fox, because the unique IDs of Fox and Lion carry no relationship to each other. Representing words as unique, discrete IDs furthermore leads to data sparsity, and usually means that we need more data in order to successfully train statistical models. Using vector representations can overcome some of these obstacles.

The simplest approach that comes to mind is to represent each word of the vocabulary as a one-hot vector. With a vocabulary of 50,000 words, the 100th word is represented by a vector of length 50,000 that has a 1 at the 100th position and zeros everywhere else. This results in very large, sparse vectors.
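As a quick illustration, here is a minimal sketch of such a one-hot vector; the vocabulary size of 50,000 and the 100th position just mirror the numbers above:

```python
import numpy as np

VOCAB_SIZE = 50_000  # vocabulary size from the example above

# One-hot vector for the 100th word (index 99 when counting from zero):
# a single 1 at that position, zeros everywhere else.
one_hot = np.zeros(VOCAB_SIZE)
one_hot[99] = 1.0

print(one_hot.shape)  # (50000,) -- one dimension per vocabulary word
print(one_hot.sum())  # 1.0 -- only a single non-zero entry
```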

The most common solution is to represent each word of the vocabulary with a small, dense vector. This short vector (say, of length 200) is called an embedding.

Now that we know what a word embedding is, let's see how we can implement it using TensorFlow.
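As a minimal sketch (assuming the Keras Embedding layer, not the tutorial code linked below), the dense vectors can be stored in a trainable lookup table that maps each word ID to a 200-dimensional vector; the sizes simply reuse the numbers from the example above:

```python
import tensorflow as tf

VOCAB_SIZE = 50_000   # vocabulary size from the example above
EMBEDDING_DIM = 200   # length of each dense word vector

# A trainable lookup table: one 200-dimensional vector per vocabulary word.
embedding_layer = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                            output_dim=EMBEDDING_DIM)

# Looking up a batch of word IDs returns their dense vectors.
word_ids = tf.constant([99, 1234, 42])
vectors = embedding_layer(word_ids)
print(vectors.shape)  # (3, 200)
```

The weights of this lookup table are learned during training, so words that appear in similar contexts end up with similar vectors.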

Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Sections 3.1 and 3.2 in Mikolov et al.).
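Here is a minimal sketch of how Skip-Gram training pairs can be generated with Keras utilities; the toy sentence, word-ID mapping, and window size are made up for illustration and are not part of the TensorFlow example:

```python
import tensorflow as tf

# Toy corpus already converted to word IDs (hypothetical mapping):
# "the quick brown fox jumps over the lazy dog"
sentence_ids = [1, 2, 3, 4, 5, 6, 1, 7, 8]
VOCAB_SIZE = 9

# Skip-Gram: each target word is paired with the words in its context window.
pairs, labels = tf.keras.preprocessing.sequence.skipgrams(
    sentence_ids,
    vocabulary_size=VOCAB_SIZE,
    window_size=2,         # how many words on each side count as context
    negative_samples=1.0,  # one randomly sampled negative pair per positive pair
)

for (target, context), label in zip(pairs[:5], labels[:5]):
    print(f"target={target} context={context} label={label}")
```

Pairs with label 1 are real (target, context) pairs from the text, and pairs with label 0 are negative samples; a Skip-Gram model is then trained to tell the two apart.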

TensorFlow has a very good example of word embeddings, and you can find the code here.