# Word Embedding

Word Embedding is a vector representation of words. For example, for a sequence like "A B A C B F G", we might end up with a vector of [0.1 0.6 -0.5] for A and [-0.2 0.9 0.7] for B.

The reason we turn each word into a vector is to make computation easier. For example, "find a synonym of word A" can be done by "finding the vector closest to word A's vector under cosine distance".

So how do we embed words? At present there are three main approaches:

## Embedding Layer

An Embedding Layer is a word embedding learned jointly with a neural network model for a specific natural language processing task. The words in the cleaned text are first one-hot encoded; the size (dimension) of the vector space is specified as part of the model, for example 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The Embedding Layer sits at the front of the neural network and is trained in a supervised way with backpropagation.

The one-hot encoded words are mapped to word vectors. If a multilayer perceptron (MLP) is used, the word vectors are concatenated before being fed into the model; if a recurrent neural network (RNN) is used, each word can be taken as one input in the sequence.

This approach to learning an embedding layer requires a lot of training data and can be slow, but it learns an embedding tailored both to the specific text data and to the NLP task.
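
As a minimal sketch of this setup (not from the original post; the vocabulary size, sequence length, and layer sizes are illustrative, chosen to match the snippets later in the post), a Keras model with an Embedding Layer at the front of an RNN might look like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
# 2800-word vocabulary, 380-token input sequences, 32-dimensional word vectors.
model.add(Embedding(output_dim=32, input_dim=2800, input_length=380))
# Each word vector becomes one step of the input sequence for the RNN.
model.add(SimpleRNN(units=16))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# The embedding weights are learned by backpropagation together with the rest of the model.
```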

## Word2Vec / Doc2Vec (Document to Vector)

Word2Vec was proposed by Tomas Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space". It is a statistical method for efficiently learning standalone word embeddings from a text corpus. Its core idea is context-based: each word is first represented by a vector, and the parameters of these vectors are then learned through a predictive objective function.

The algorithm provides two training models: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model). CBOW takes the words in a word's context as input and the word itself as output; that is, given a context, it tries to guess the word. Training on a large corpus yields a weight matrix from the input layer to the hidden layer. Skip-gram, in contrast, takes the word itself as input and the words in its context as output; that is, given a word, it tries to predict the word's likely context.

Training on a large corpus again yields a weight matrix from the input layer to the hidden layer. Given a context, the model's input is the word vectors; the probability of the target word is then predicted through some deep-learning model (a CNN or RNN) in the middle, and the values of these word vectors are finally obtained by optimizing the objective function.

Although Word2Vec achieves good results, the model still has clear shortcomings: for example, it does not take word order into account, nor does it use global statistical information.

Doc2Vec is similar to Word2Vec's CBOW model in that it also trains word vectors from context. The difference is that Word2Vec simply turns a single word into a vector, whereas Doc2Vec can additionally aggregate all the words of a sentence or paragraph into one vector; to do this, it simply treats the sentence (paragraph) label as a special word.
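
As an illustration (not part of the original post; the toy corpus and hyperparameters below are placeholders), the gensim library provides both models:

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice this would be a large tokenized text collection.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "barked", "at", "the", "cat"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
print(w2v.wv.most_similar("cat", topn=3))     # nearest words by cosine similarity

# Doc2Vec treats each sentence/paragraph tag like a special word trained with the context.
docs = [TaggedDocument(words, tags=[f"doc_{i}"]) for i, words in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)
print(d2v.dv["doc_0"][:5])                    # first dimensions of the paragraph vector
```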

## Topic Models

Topic models such as LSA and LDA build the relationship between words and topics.
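
As a brief illustration (not from the original post; the toy corpus is a placeholder), an LDA model in gensim maps each document to a distribution over topics and each topic to a distribution over words:

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["cat", "dog", "pet", "fur"],
         ["stock", "market", "trade", "price"]]
dictionary = corpora.Dictionary(texts)              # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words counts per document
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                           # each topic as a weighted list of words
```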

# RNN

* Like an ordinary neural network, an RNN has an input layer, an output layer, and a hidden layer. The difference is that an RNN has a different state at each time step t: the output of the hidden layer at time t-1 acts on the hidden layer at time t.

* The parameters are: $W_{hv}$, the weights from the input layer to the hidden layer; $W_{hh}$, the weights from the hidden layer to the hidden layer; $W_{oh}$, the weights from the hidden layer to the output layer; $b_h$, the bias of the hidden layer; $b_o$, the bias of the output layer; and $h_0$, the output of the hidden layer at the initial state, usually initialized to 0 (see the update equations sketched after this list).

* The states at different time steps share the same weights W and biases b.
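
Written out as a minimal sketch (the post gives no explicit formulas; the tanh hidden activation and softmax output are assumptions), the standard update at time step $t$ with input $x_t$ is:

$$h_t = \tanh(W_{hv}\, x_t + W_{hh}\, h_{t-1} + b_h), \qquad o_t = \mathrm{softmax}(W_{oh}\, h_t + b_o)$$

where $h_t$ is the hidden state and $o_t$ the output at time $t$, and $h_0$ is the initial state mentioned above.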

## How RNN Is Computed

Because an RNN incorporates a time sequence, its training differs from that of the networks above: RNNs are trained with BPTT (Back Propagation Through Time), a method proposed by Werbos et al. in 1990.

However, because the parameters at time t-1 appear when training time step t (line 10 of the algorithm above), the derivative with respect to a single state becomes a sum over all previous states.

It is precisely because of this long chain of dependencies that BPTT cannot solve the long-term dependency problem (that is, when the current output depends on part of the sequence far in the past, typically more than ten steps away, it can do nothing), because BPTT suffers from the so-called vanishing/exploding gradient problem.
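
A small numerical sketch of why this happens (not from the original post; the nonlinearity's derivative is ignored for simplicity): backpropagating through many time steps repeatedly multiplies the gradient by $W_{hh}^\top$, so its norm shrinks or explodes roughly geometrically depending on the singular values of $W_{hh}$:

```python
import numpy as np

np.random.seed(0)
hidden = 8
grad = np.ones(hidden)                    # gradient arriving at the last time step

for scale in (0.5, 1.5):                  # singular values below / above 1
    Q, _ = np.linalg.qr(np.random.randn(hidden, hidden))
    W_hh = scale * Q                      # orthogonal matrix scaled so every singular value = scale
    g = grad.copy()
    for t in range(50):                   # backpropagate through 50 time steps
        g = W_hh.T @ g
    print(f"scale={scale}: gradient norm after 50 steps = {np.linalg.norm(g):.3e}")

# scale=0.5 -> the norm collapses toward zero (vanishing gradient)
# scale=1.5 -> the norm blows up             (exploding gradient)
```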

## Parameter Count

```python
model.add(Embedding(output_dim=32, input_dim=2800, input_length=380))
...

model.summary()
# output:
# simple_rnn_1 (SimpleRNN)    Param #: 784
# dense_1 (Dense)             Param #: 4352
```
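
These counts are consistent with the elided layers being `SimpleRNN(units=16)` followed by `Dense(units=256)` (an assumption; the post omits those lines): a Keras SimpleRNN layer has units × (units + input_dim + 1) parameters, i.e. 16 × (16 + 32 + 1) = 784 here, and the Dense layer has 256 × (16 + 1) = 4,352.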


# LSTM

Suppose we try to predict the last word of "I grew up in France… I speak fluent French." The nearby information suggests that the next word is probably the name of a language, but to figure out which language, we need the earlier mention of France, which is far from the current position. This shows that the gap between the relevant information and the position where it is needed can become quite large.

Unfortunately, as this gap grows, an RNN loses the ability to connect information that far apart. In theory, an RNN can certainly handle such "long-term dependencies": one could carefully choose parameters to solve toy forms of the problem, but in practice an RNN fails to learn them. Bengio et al. studied this problem in depth and found fundamental reasons why RNNs are hard to train.

Fortunately, however, LSTM doesn’t have this problem!

LSTM was proposed by Hochreiter & Schmidhuber (1997) and was later refined and popularized by Alex Graves. LSTM has achieved considerable success on a wide range of problems and is widely used.

LSTM is deliberately designed to avoid the long-term dependency problem. Remembering information for long periods of time is effectively the default behavior of an LSTM, not something it has to struggle to learn!

All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure.

LSTM has the same chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way.

## Core Idea

The first step in an LSTM is to decide what information to throw away from the cell state. This decision is made by a layer called the forget gate.

The next step is to decide what new information to store in the cell state.

Finally, we need to decide what to output. This output is based on the cell state, but is a filtered version of it.
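
Written out (a standard formulation of these three steps; the post itself gives no equations), with $\sigma$ the sigmoid function, $[h_{t-1}, x_t]$ the previous hidden state concatenated with the current input, and $*$ element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{input gate and candidate} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t && \text{updated cell state} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t * \tanh(C_t) && \text{output gate and hidden state}
\end{aligned}
$$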

So far we have been describing the ordinary LSTM, but not all LSTMs look the same. In fact, almost every paper that uses an LSTM adopts a slightly different variant.

In the figure, the state on the top line, s(t), represents long-term memory, while h(t) below represents working or short-term memory.

## Parameter Count

```python
model.add(Embedding(output_dim=32, input_dim=2800, input_length=380))
...

model.summary()
# output:
# lstm_1 (LSTM)    Param #: 8320
```
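
This count is consistent with the elided layer being `LSTM(units=32)` (an assumption; the post omits that line): a Keras LSTM layer has 4 × units × (units + input_dim + 1) parameters, because the forget, input, and output gates plus the candidate cell state each have their own weight matrix and bias, i.e. 4 × 32 × (32 + 32 + 1) = 8,320 here.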


# GRU

LSTM has many variants. One with larger changes is the Gated Recurrent Unit (GRU), proposed by Cho, et al. (2014). It merges the forget gate and the input gate into a single update gate, and it also merges the cell state and the hidden state, among other changes. The resulting model is simpler than the standard LSTM. Its performance is similar to LSTM's, but with roughly a third fewer parameters it is less prone to overfitting.
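
For reference (a standard formulation; the post does not spell it out), the GRU update with update gate $z_t$ and reset gate $r_t$ is:

$$
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\
\tilde{h}_t &= \tanh(W \cdot [r_t * h_{t-1}, x_t]) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
$$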

# Reference

Original: https://www.cnblogs.com/houkai/p/9706716.html
Author: 侯凯
Title: Word Embedding/RNN/LSTM
