Program design walkthrough with code
1. Read in the sentences
sequence = ['i like reading','i love dog','i miss you']
We start with a few simple sentences. When first getting into NLP it helps to make every sentence the same length; how to predict on sentences of varying length can be learned later.
2. Split the sentences into individual words and deduplicate
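As a taste of the variable-length case mentioned above, one common approach (not part of this post's code; the '<pad>' token and variable names are my own choices) is to pad every sentence out to the longest length:

```python
# Pad shorter sentences with a special token so all inputs share one length
sentences = ['i like reading', 'i love my dog']
tokenized = [s.split() for s in sentences]
max_len = max(len(t) for t in tokenized)
padded = [t + ['<pad>'] * (max_len - len(t)) for t in tokenized]
print(padded)   # the first sentence gains one '<pad>' token
```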
wordslist = ' '.join(sequence).split()
words_non = sorted(set(wordslist))
sorted puts the words into a fixed, reproducible order.
3. Build the word-to-index and index-to-word vocabularies
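Running these two lines on the example sentences gives the following vocabulary (nine tokens collapse to seven unique words):

```python
sequence = ['i like reading', 'i love dog', 'i miss you']
wordslist = ' '.join(sequence).split()   # 9 tokens in total
words_non = sorted(set(wordslist))       # 7 unique words, alphabetically ordered
print(words_non)
# ['dog', 'i', 'like', 'love', 'miss', 'reading', 'you']
```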
words_ix = {w:i for i,w in enumerate(words_non)}
ix_word = {i:w for i,w in enumerate(words_non)}
Whether it is words or images, everything is stored in a computer as numbers, and learning is just processing those numbers. Single characters could be handled directly via their ASCII codes, but once characters combine into words there is too much information, so we simplify by assigning each word an integer label that distinguishes it from the others, much like the class labels in image recognition. In fact, word prediction is itself a classification problem, as we will see below. The index-to-word table exists to make prediction convenient: the model ultimately outputs a probability for every word, and we need to map the winning index back to a word.
4. For each sentence, split off the last word as the target and use the remaining words as inputs, converting the inputs' labels to one-hot vectors
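Concretely, for the example vocabulary the two tables look like this, with one being the inverse of the other:

```python
# The example vocabulary from above, sorted alphabetically
words_non = ['dog', 'i', 'like', 'love', 'miss', 'reading', 'you']
words_ix = {w: i for i, w in enumerate(words_non)}   # word -> index
ix_word = {i: w for i, w in enumerate(words_non)}    # index -> word
print(words_ix)
# {'dog': 0, 'i': 1, 'like': 2, 'love': 3, 'miss': 4, 'reading': 5, 'you': 6}
```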
train_list = []
test_list = []
for seq in sequence:
    words = seq.split()
    a = [words_ix[x] for x in words[:-1]]    # indices of all words except the last
    a = np.eye(timestep)[a]                  # one-hot encode the input words
    train_list.append(a)
    test_list.append(words_ix[words[-1]])    # the last word's index is the target
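To make the np.eye trick concrete, here is what the one-hot encoding of the input 'i like' looks like, using the 7-word vocabulary above (the post's code calls this dimension timestep; n_class is my own name for it):

```python
import numpy as np

words_ix = {'dog': 0, 'i': 1, 'like': 2, 'love': 3, 'miss': 4, 'reading': 5, 'you': 6}
n_class = len(words_ix)                      # one-hot dimension = vocabulary size
idx = [words_ix[w] for w in ['i', 'like']]   # [1, 2]
onehot = np.eye(n_class)[idx]                # row k of the identity = one-hot for index k
print(onehot.shape)                          # (2, 7): one 7-dim row per input word
```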
Building the RNN model
As noted earlier, this kind of prediction is really a classification problem, so why not use a convolutional neural network? Briefly: a CNN can classify, but each extracted feature has no relation to the previous time step, i.e. it carries no temporal information. Temporal order partly reflects the logical relationships between words, and the RNN (recurrent neural network) handles this much better.
For more detail, see the blogger [蓝翔飞鸟], whose explanation I found quite good. Link: https://blog.csdn.net/level_code/article/details/108122808
The model code is as follows:
class NET(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NET, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, input):
        input = input.transpose(0, 1)    # (batch, seq, feat) -> (seq, batch, feat)
        _, hidden = self.rnn(input)
        out = self.linear(hidden[0])     # classify from the final hidden state
        return out
_,hidden = self.rnn(input)
This call returns two tensors. The first contains the hidden state at every time step, and its last element is identical to hidden for a single-layer RNN, so using the last step of the first return value works just as well. hidden has shape [num_layers, batch, hidden_size]; since it has three dimensions, we take hidden[0] to get a [batch, hidden_size] tensor, which the linear layer then maps to [batch, num_class], a convenient shape for computing the loss later.
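A quick check (the sizes here are my own example values) confirms that for a single-layer RNN the last time step of the first return value equals hidden[0]:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=7, hidden_size=14)   # single layer, so num_layers == 1
x = torch.randn(2, 1, 7)                     # (seq_len, batch, input_size)
output, hidden = rnn(x)
print(output.shape)   # (2, 1, 14): hidden state at every time step
print(hidden.shape)   # (1, 1, 14): final hidden state per layer
assert torch.allclose(output[-1], hidden[0]) # last step of output == hidden[0]
```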
Training the model
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)
for i in range(5):
    print('Epoch {} starting!'.format(i))
    for iword, oword in loader:
        out = net(iword)               # (batch, num_class)
        loss = criterion(out, oword)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print('loss:{}'.format(loss.item()))
Notice that the optimizer and loss function are basically the same ones used in ordinary classification, which again shows that this prediction task is essentially classification. You could also build a validation set to check accuracy.
Prediction
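That validation pass could be sketched as follows (a hypothetical helper, not from the original post, assuming a net and a DataLoader shaped like the ones above):

```python
import torch

def accuracy(net, loader):
    """Fraction of samples whose argmax prediction matches the label."""
    net.eval()                 # switch off training-only behavior
    correct = total = 0
    with torch.no_grad():      # no gradients needed for evaluation
        for iword, oword in loader:
            pred = net(iword).argmax(1)
            correct += (pred == oword).sum().item()
            total += oword.size(0)
    return correct / total
```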
seq1 = 'i like'
word2idx = [words_ix[i] for i in seq1.split()]
aa = torch.Tensor(np.eye(timestep)[word2idx]).unsqueeze(0)
print(aa.shape)
pre = net(aa)
label = pre.argmax(1).tolist()
print('{} {}'.format(seq1,ix_word[label[0]]))
The result is i like reading. unsqueeze adds the batch dimension so the input matches the shape used during training.
Full code
Model:
import torch
import torch.nn as nn

class NET(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NET, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, input):
        input = input.transpose(0, 1)    # (batch, seq, feat) -> (seq, batch, feat)
        _, hidden = self.rnn(input)
        out = self.linear(hidden[0])     # classify from the final hidden state
        return out

if __name__ == '__main__':
    net = NET(10, 20)
    input = torch.randn(3, 2, 10)        # (batch, seq_len, input_size)
    hidden = net(input)
    print(hidden.shape)                  # torch.Size([3, 10])
    print(hidden)
Training:
import numpy as np
import torch.nn as nn
import torch
import torch.utils.data as Data
import torch.optim as optim
import model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
sequence = ['i like reading','i love dog','i miss you']
wordslist = ' '.join(sequence).split()
words_non = sorted(set(wordslist))
words_ix = {w:i for i,w in enumerate(words_non)}
ix_word = {i:w for i,w in enumerate(words_non)}
batch = 1
timestep = len(words_non)   # actually the vocabulary size (number of unique words), used as the one-hot dimension
print(timestep)
train_list = []
test_list = []
for seq in sequence:
    words = seq.split()
    a = [words_ix[x] for x in words[:-1]]
    a = np.eye(timestep)[a]
    train_list.append(a)
    test_list.append(words_ix[words[-1]])
train_list = torch.Tensor(np.array(train_list))
test_list = torch.LongTensor(test_list)
dataset = Data.TensorDataset(train_list,test_list)
loader = Data.DataLoader(dataset,batch,shuffle=True)
net = model.NET(input_size=timestep,hidden_size=2*timestep)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(),lr=0.01)
for i in range(5):
    print('Epoch {} starting!'.format(i))
    for iword, oword in loader:
        out = net(iword)
        loss = criterion(out, oword)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print('loss:{}'.format(loss.item()))
seq1 = 'i like'
word2idx = [words_ix[i] for i in seq1.split()]
aa = torch.Tensor(np.eye(timestep)[word2idx]).unsqueeze(0)
print(aa.shape)
pre = net(aa)
label = pre.argmax(1).tolist()
print('{} {}'.format(seq1,ix_word[label[0]]))
Original: https://blog.csdn.net/qq_57082898/article/details/124144995
Author: 无忧阁阁主
Title: 利用pytorch自然语言探索(一):单词预测 (Exploring natural language with PyTorch, Part 1: word prediction)