doc2vec
word2vec
\[f(g(x; \theta))\]

Continuous Bag of Words
Step1. Collect all of the words that appear in the training data
Step2. Label the words with unique IDs
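For concreteness, a minimal Python sketch of Steps 1 and 2, assuming the training data has already been tokenized into lists of words (the function and variable names here are illustrative, not from the papers):

```python
# Assign a unique ID to every word that appears in the training data.
# `documents` is assumed to be an iterable of already-tokenized documents.
def build_vocabulary(documents):
    word_to_id = {}
    for doc in documents:
        for word in doc:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)  # next unused ID
    return word_to_id

documents = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
vocab = build_vocabulary(documents)
K = len(vocab)  # K: the number of unique words
```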
Step3. Train a neural network with
- $M=3$,
- the number of layers including the input layer and the output layer
- in this case, input layer, hidden layer, output layer
- $K$,
- the number of unique words
- the dimension of the input vectors
- $W \in \mathbb{N}$,
- the size of the context window
- $N \in \mathbb{N}$,
- the number of unique words, so $N = K$ (the dimension of the input and output vectors)
- \(x^{(1)} \in \{0, 1\}^{N}\),
- the input vector
- \(y^{(M)} := x^{(M)} \in [0, 1]^{N}\),
- the output vector
- the $n$-th dimension represents the probability of the $n$-th word
- $(x_{i}, y_{i}) \ (i = 1, \ldots, W)$,
- $x_{i} \in \{0, 1\}^{N}$,
- input data
- $y_{i} \in [0, 1]^{N}$,
- output data
- the $n$-th component of $y_{i}$ is $1$ if the $n$-th word appears within $W$ words before or after the word encoded by $x_{i}$
- the $n$-th component is $0$ otherwise
Step4. The representation vector of the $i$-th unique word is the hidden-layer output of the trained neural network when the $i$-th one-hot vector is given as input
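Continuing the vocabulary sketch above, a minimal NumPy version of Steps 3 and 4 as described here (a plain softmax network, without the negative-sampling or hierarchical-softmax speedups used in the papers); the embedding dimension, learning rate, epoch count, and function name are illustrative assumptions:

```python
import numpy as np

def train_word_vectors(documents, word_to_id, window=2, dim=16, lr=0.1, epochs=50):
    K = len(word_to_id)
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(K, dim))   # input layer -> hidden layer
    W2 = rng.normal(scale=0.1, size=(dim, K))   # hidden layer -> output layer

    # Step 3: build (x_i, y_i) pairs; x_i encodes one word, y_i marks the words
    # appearing within `window` positions before or after it.
    pairs = []
    for doc in documents:
        ids = [word_to_id[w] for w in doc]
        for t, center in enumerate(ids):
            target = np.zeros(K)
            for j in range(max(0, t - window), min(len(ids), t + window + 1)):
                if j != t:
                    target[ids[j]] = 1.0
            if target.sum() > 0:
                pairs.append((center, target / target.sum()))  # normalize to a distribution

    for _ in range(epochs):
        for center, target in pairs:
            h = W1[center]                      # hidden layer: the one-hot input selects a row
            logits = h @ W2
            p = np.exp(logits - logits.max())
            p /= p.sum()                        # softmax output y^(M) in [0, 1]^K
            grad_logits = p - target            # gradient of cross-entropy w.r.t. the logits
            grad_h = W2 @ grad_logits
            W2 -= lr * np.outer(h, grad_logits)
            W1[center] -= lr * grad_h

    # Step 4: the hidden-layer output for the i-th one-hot input is row i of W1.
    return W1

# e.g. vectors = train_word_vectors(documents, vocab); vectors[vocab["cat"]] is the vector of "cat"
```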
■
Skip-Gram
Step1. Collect all of the words that appear in the training data
Step2. Label the words with unique IDs
Step3. Solve the maximization problem (a standard form of the objective is written after this list) with
- $M=3$,
- the number of layers including the input layer and the output layer
- in this case, input layer, hidden layer, output layer
- $K$,
- the number of unique words
- the dimension of the input vectors
- $W \in \mathbb{N}$,
- the size of the context window
- $T \in \mathbb{N}$,
- the number of context windows
- $K^{\prime} \in \mathbb{N}$,
- the number of words in a document
- $T = K^{\prime} - W - 1$,
- $w_{1}, \ldots, w_{K^{\prime}}$,
- the words
- \(A: \{w_{i}\} \rightarrow \{1, \ldots, K\}\),
- the map from each word to its unique ID
- \(x^{(1)} \in \{0, 1\}^{N}\),
- the input vector
- \(y^{(M)} := x^{(M)} \in [0, 1]^{N}\),
- the output vector
- the $n$-th dimension represents the probability of the $n$-th word
- $(x_{i}, y_{i}) \ (i = 1, \ldots, W)$,
- $x_{i} \in \{0, 1\}^{N}$,
- input data
- $y_{i} \in [0, 1]^{N}$,
- output data
- the $n$-th component of $y_{i}$ is $1$ if the $n$-th word appears within $W$ words before or after the word encoded by $x_{i}$
- the $n$-th component is $0$ otherwise
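The objective itself is not written out above. A standard form of the Skip-Gram objective (as in the word2vec papers), expressed with the symbols defined in this list, is shown below; here $\theta_{k}$ denotes the input-side vector of the $k$-th unique word and $\theta^{\prime}_{k}$ an output-side vector, and this split into two parameter sets is an assumption about the intended notation.

\[
\max_{\theta, \theta^{\prime}} \ \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-W \le j \le W \\ j \ne 0}} \log p\left(w_{t+j} \mid w_{t}\right),
\qquad
p\left(w_{O} \mid w_{I}\right) = \frac{\exp\left({\theta^{\prime}_{A(w_{O})}}^{\top} \theta_{A(w_{I})}\right)}{\sum_{k=1}^{K} \exp\left({\theta^{\prime}_{k}}^{\top} \theta_{A(w_{I})}\right)}
\]

Under this reading, the $\theta_{i}$ in Step 4 is the input-side vector of the $i$-th unique word.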
Step4. The representation vector of the $i$-th unique word is the $\theta_{i}$ obtained by maximizing the above objective.
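As a usage note (not part of the papers), the gensim library implements both Skip-Gram and CBOW; the snippet below assumes gensim ≥ 4.0, where the embedding size is called `vector_size`, and `sg=1` selects Skip-Gram (`sg=0` selects CBOW):

```python
from gensim.models import Word2Vec

documents = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

# window is the context-window size W; min_count=1 keeps every word in this tiny corpus.
model = Word2Vec(documents, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                         # the learned representation of "cat"
similar = model.wv.most_similar("cat", topn=3)   # nearest words by cosine similarity
```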
■
References
- Efficient Estimation of Word Representations in Vector Space
- the paper proposing the word2vec algorithm
- describes the algorithm only from the viewpoint of computational complexity
- Distributed Representations of Sentences and Documents
- the paper proposing the doc2vec algorithm
- A gentle introduction to Doc2Vec – ScaleAbout – Medium
- Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick
- Vector Representations of Words | TensorFlow