【NLP】The word2vec Model

Reference: 《深度学习从0到1-基于Tensorflow2》

【Reference: 深入浅出Word2Vec原理解析 – 知乎】

Summary

The predecessor of word2vec: NNLM (Neural Network Language Model)

【Reference: 词向量技术原理及应用详解(二) – 木屐呀 – 博客园】
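The linked post covers the details; purely as an illustration (not taken from that post), here is a minimal sketch of an NNLM-style model in PyTorch: the previous n words are embedded, concatenated, and passed through one hidden layer to predict the next word. All names and dimensions below are illustrative.

import torch
import torch.nn as nn

class NNLM(nn.Module):
  """Minimal NNLM sketch: predict the next word from the previous n words."""
  def __init__(self, voc_size, emb_size=16, n_prev=2, n_hidden=32):
    super().__init__()
    self.emb = nn.Embedding(voc_size, emb_size)            # word embedding lookup table
    self.hidden = nn.Linear(n_prev * emb_size, n_hidden)   # concatenated context -> hidden
    self.out = nn.Linear(n_hidden, voc_size)               # hidden -> scores over the vocabulary

  def forward(self, prev_ids):                   # prev_ids: [batch, n_prev] word indices
    x = self.emb(prev_ids)                       # [batch, n_prev, emb_size]
    x = x.reshape(x.size(0), -1)                 # concatenate the context embeddings
    return self.out(torch.tanh(self.hidden(x)))  # logits for the next word

model_nnlm = NNLM(voc_size=10)
logits = model_nnlm(torch.tensor([[1, 2]]))      # -> shape [1, 10]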

Which implementations does word2vec have?

There are two:
1) CBOW (Continuous Bag of Words): use the context words to predict the center word.
2) Skip-gram: use the center word to predict the context words.

In terms of implementation, the two differ only in what serves as the input and what serves as the output.
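A minimal sketch of that difference (the helper below is purely illustrative, not part of the referenced code): from the same tokenized sentence, CBOW produces (context words → center word) pairs, while Skip-gram produces (center word → one context word) pairs.

def make_pairs(tokens, window=1):
  """Build (input, target) pairs for CBOW and Skip-gram from one tokenized sentence."""
  cbow, skipgram = [], []
  for i in range(window, len(tokens) - window):
    center = tokens[i]
    context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
    cbow.append((context, center))                 # context predicts the center word
    skipgram.extend((center, w) for w in context)  # center word predicts each context word
  return cbow, skipgram

cbow, skipgram = make_pairs(["我", "爱", "北京", "天安门"], window=1)
# cbow     -> [(['我', '北京'], '爱'), (['爱', '天安门'], '北京')]
# skipgram -> [('爱', '我'), ('爱', '北京'), ('北京', '爱'), ('北京', '天安门')]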

What is word2vec, essentially?

It is unsupervised learning, because the outputs carry no labels. Judging purely by the form of the inputs and outputs — pairs of words — it may look supervised, but it is not.

A word-embedding model is essentially a neural network with a single hidden layer, so it needs an input and an output; but the goal of training is not the predicted word itself, nor word classification. What matters is the weight matrix of the hidden layer: the prediction task is only borrowed as a training objective, and the hidden-layer weights learned along the way are the word vectors.
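A small numerical sketch of that point (W here is just a stand-in for a trained input-to-hidden weight matrix): multiplying a one-hot input by W simply selects one row of W, and that row is the word's vector.

import numpy as np

voc_size, embedding_size = 5, 3
W = np.random.randn(voc_size, embedding_size)  # stand-in for the trained input->hidden weights

word_idx = 2
x = np.eye(voc_size)[word_idx]                 # one-hot encoding of the word

hidden = x @ W                                 # hidden-layer activation for this word
assert np.allclose(hidden, W[word_idx])        # ...which is exactly row word_idx of W: the word vector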

CBOW

Continuous Bag-of-Words (CBOW)

The CBOW model feeds the context words into the neural network and predicts the target (center) word.

For example, given the training sentence "我爱北京天安门" ("I love Beijing Tian'anmen"), we can feed the model "爱" and "天安门" and let it predict "北京" as the target word.

The simplest CBOW model just feeds in the previous word and predicts the following word.

Skip-Gram

The Skip-gram model feeds a single word into the neural network and predicts its context words.

PyTorch Implementation (bare-bones version)

【Reference: nlp-tutorial/Word2Vec-Skipgram.py at master · graykode/nlp-tutorial】

【Reference: Word2Vec的PyTorch实现_哔哩哔哩_bilibili】

【Reference: Word2Vec的PyTorch实现(乞丐版) – mathor】

Summary of the implementation:
Build word2idx (the word-to-index mapping).
Build the training data:
- the words inside a window are [C-2, C-1, C, C+1, C+2]
- the training pairs are [[C, C-2], [C, C-1], [C, C+1], [C, C+2]]
- np.eye(voc_size) turns each word into a one-hot vector

Feed the data into the model and train.
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sentences = ["jack like dog", "jack like cat", "jack like animal",
  "dog cat animal", "banana apple cat dog like", "dog fish milk like",
  "dog cat animal like", "jack like apple", "apple like", "jack like banana",
  "apple banana jack movie book music like", "cat dog hate", "cat dog like"]

word_sequence = " ".join(sentences).split()     # flatten the corpus into one token list
vocab = list(set(word_sequence))                # unique words
word2idx = {w: i for i, w in enumerate(vocab)}  # word -> index

batch_size = 8
embedding_size = 2   # 2-D embeddings so they can be plotted directly
C = 2                # window size: 2 words on each side of the center word
voc_size = len(vocab)

# build (center, context) skip-gram pairs
skip_grams = []

for idx in range(C, len(word_sequence) - C):
  center = word2idx[word_sequence[idx]]                                        # center word index
  context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # window positions around it
  context = [word2idx[word_sequence[i]] for i in context_idx]
  for w in context:
    skip_grams.append([center, w])                                             # one training pair per context word

def make_data(skip_grams):
  input_data = []
  output_data = []
  for i in range(len(skip_grams)):
    input_data.append(np.eye(voc_size)[skip_grams[i][0]])   # one-hot vector of the center word
    output_data.append(skip_grams[i][1])                    # context word index (class label)
  return input_data, output_data

input_data, output_data = make_data(skip_grams)
input_data, output_data = torch.Tensor(np.array(input_data)), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, True)   # shuffle=True

class Word2Vec(nn.Module):
  def __init__(self):
    super(Word2Vec, self).__init__()
    # W: input (one-hot) -> hidden, i.e. the embedding matrix we actually want
    self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))
    # V: hidden -> output scores over the vocabulary
    self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))

  def forward(self, X):
    # X: [batch_size, voc_size] (one-hot rows)
    hidden_layer = torch.matmul(X, self.W)             # [batch_size, embedding_size]
    output_layer = torch.matmul(hidden_layer, self.V)  # [batch_size, voc_size] logits
    return output_layer

model = Word2Vec().to(device)
criterion = nn.CrossEntropyLoss().to(device)   # takes raw logits, so no softmax in the model
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2000):
  for i, (batch_x, batch_y) in enumerate(loader):
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)
    pred = model(batch_x)
    loss = criterion(pred, batch_y)
    if (epoch + 1) % 1000 == 0:
      print(epoch + 1, i, loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# plot each word's 2-D embedding (the rows of W)
for i, label in enumerate(vocab):
  W, WT = model.parameters()
  x, y = float(W[i][0]), float(W[i][1])
  print(label)
  print(x, y)
  plt.scatter(x, y)
  plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()

1000 0 2.187922716140747
1000 1 2.1874611377716064
1000 2 2.1020612716674805
1000 3 2.1360023021698
1000 4 1.6479374170303345
1000 5 2.1080777645111084
1000 6 2.117255687713623
1000 7 2.5754618644714355
1000 8 2.375575065612793
1000 9 2.4812772274017334
1000 10 2.2279186248779297
1000 11 1.9958131313323975
1000 12 1.9666472673416138
1000 13 1.792773723602295
1000 14 1.9790289402008057
1000 15 2.150097370147705
1000 16 1.8230916261672974
1000 17 1.9916845560073853
1000 18 2.2354393005371094
1000 19 2.253058910369873
1000 20 1.8957509994506836
2000 0 2.1660408973693848
2000 1 1.9071791172027588
2000 2 1.9131343364715576
2000 3 2.0996546745300293
2000 4 1.9192123413085938
2000 5 1.6349347829818726
2000 6 2.433778762817383
2000 7 2.4247307777404785
2000 8 2.1594560146331787
2000 9 1.9543298482894897
2000 10 1.8078333139419556
2000 11 2.490055561065674
2000 12 2.1941933631896973
2000 13 2.463453531265259
2000 14 2.2849888801574707
2000 15 1.7784088850021362
2000 16 1.8803404569625854
2000 17 1.9645321369171143
2000 18 2.036078453063965
2000 19 1.9239177703857422
2000 20 2.261594772338867

animal
-0.5263756513595581 3.4223508834838867
apple
-0.3384515941143036 1.3274422883987427
milk
-1.2358342409133911 0.3438951075077057
hate
-1.556404709815979 9.134812355041504
music
0.31392836570739746 0.2262829840183258
movie
2.375382661819458 1.1577153205871582
dog
-0.9016568064689636 0.2671743929386139
jack
-0.5878503322601318 0.6020950078964233
cat
-0.9074932932853699 0.2849980890750885
banana
0.47850462794303894 1.1545497179031372
book
0.4761728048324585 0.21939511597156525
like
-0.1496874839067459 0.6957748532295227
fish
-2.37762188911438 0.04009028896689415

(scatter plot of the learned 2-D word embeddings after 2,000 epochs)

Because the corpus contains many sentences of the form "jack like <animal>", these words end up close to each other in the embedding space.
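One quick way to check this numerically (a sketch that assumes the trained model, vocab and word2idx from the script above are still in memory) is to rank words by cosine similarity to a query word using the learned matrix W:

import torch.nn.functional as F

def most_similar(word, topk=3):
  """Rank vocabulary words by cosine similarity to `word`, using the learned embedding matrix W."""
  W_emb = model.W.detach()                                  # [voc_size, embedding_size]
  vec = W_emb[word2idx[word]].unsqueeze(0)                  # [1, embedding_size]
  sims = F.cosine_similarity(vec, W_emb, dim=1)             # similarity of `word` to every vocabulary word
  best = torch.argsort(sims, descending=True)[1:topk + 1]   # skip the word itself
  return [(vocab[int(i)], sims[i].item()) for i in best]

print(most_similar("dog"))   # with this corpus, words like "cat" or "animal" tend to rank high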

Using the same sentences but training for 10,000 epochs instead, i.e.

for epoch in range(10000):

(scatter plot of the learned embeddings after 10,000 epochs)

Original: https://blog.csdn.net/Jruo911/article/details/123585597
Author: myaijarvis
Title: 【NLP】word2vec 模型
