How to feed text into the network
Text is different from images
An image can be turned into a tensor directly from the brightness or RGB value of each pixel and then fed to a neural network, but what about text?
Text first has to be encoded: each word is replaced by its number in a dictionary, so that a paragraph can be represented as a matrix of numbers.
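As a minimal sketch of the encoding idea described above, the toy example below builds a dictionary from two made-up sentences and turns each sentence into a row of integers. The sentences and the `build_vocab` and `encode` helpers are illustrative assumptions, not part of the original post; the actual pipeline uses the Keras `Tokenizer`.

```python
def build_vocab(sentences):
    """Assign each distinct lowercase word an integer id, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 is kept free for padding
    return vocab


def encode(sentence, vocab):
    """Replace each word by its id in the dictionary."""
    return [vocab[word] for word in sentence.lower().split()]


# Hypothetical example sentences
sentences = ["I love this movie", "I hate this movie"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
print(vocab)
print(encoded)
```

The two sentences share three words, so their rows differ only in the position where "love" and "hate" appear.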
Loading the data
The data we need is available in tensorflow-datasets.
Install it with pip install tensorflow-datasets.
import tensorflow_datasets as tfds
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
imdb_reviews contains a training set and a test set.
train_data, test_data = imdb['train'], imdb['test']
training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []
for s, l in train_data:
    # s is a byte-string tensor; decode it into a plain Python string
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())
for s, l in test_data:
    testing_sentences.append(s.numpy().decode('utf8'))
    testing_labels.append(l.numpy())
The network expects array inputs, so the label lists are converted to NumPy arrays.
import numpy as np
training_lable_final = np.array(training_labels)
testing_label_final = np.array(testing_labels)
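Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
num_words = 10000  # vocabulary size limit applied when encoding
oov_token = "##"   # placeholder for out-of-vocabulary words
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(training_sentences)
word_dict = tokenizer.word_index  # maps each word to its integer index
print(word_dict)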
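To make the roles of num_words and oov_token more concrete, here is a rough pure-Python approximation of how the Keras `Tokenizer` assigns indices: the OOV token takes index 1, the remaining words are ranked by frequency, and the num_words cap is applied only when text is encoded. The `fit_word_index` and `to_sequence` helpers and the sample texts are hypothetical, and the sketch ignores the punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter


def fit_word_index(texts, oov_token="##"):
    """Rank words by frequency; the OOV token takes index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def to_sequence(text, word_index, num_words, oov_token="##"):
    """Keep indices below num_words; map everything else to the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        seq.append(idx if idx is not None and idx < num_words else oov)
    return seq


# Hypothetical sample texts
texts = ["good good good movie", "good movie bad plot"]
word_index = fit_word_index(texts)
seq = to_sequence("good plot twist", word_index, num_words=4)
print(word_index)
print(seq)
```

In this sketch "plot" is in the dictionary but its index is not below num_words, and "twist" was never seen, so both are encoded as the OOV index.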
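Converting text to sequences
The words in each sentence are replaced by their numbers in word_dict, and the results are assembled into a matrix.
To do that, each sentence is first converted into a sequence of integers.
The maximum sequence length is 120; longer sequences are truncated at the end.
Shorter sequences are padded with 0. By default, pad_sequences adds the zeros at the front.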
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_length = 120
# texts_to_sequences replaces each word with its index in the word dictionary
train_sequence = tokenizer.texts_to_sequences(training_sentences)
# truncating='post' cuts sequences longer than max_length at the end
padded_train = pad_sequences(train_sequence, maxlen=max_length, truncating='post')
test_sequence = tokenizer.texts_to_sequences(testing_sentences)
padded_test = pad_sequences(test_sequence, maxlen=max_length, truncating='post')
print(padded_train)
print(padded_test)
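The padding and truncation behaviour used above can be illustrated without TensorFlow. The `pad_or_truncate` helper below is a hypothetical stand-in written for this example; it mirrors the defaults of `pad_sequences` (padding='pre', truncating='pre') for a single sequence and is not the library implementation.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return seq as a list of exactly maxlen items."""
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops them from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq


short = pad_or_truncate([5, 6, 7], maxlen=5, truncating="post")
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5, truncating="post")
print(short)
print(long)
```

With truncating='post' and the default padding, a short review gets zeros in front, while a long review keeps its first max_length tokens.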
Building the neural network
import tensorflow as tf
embedding_dim = 16
model = tf.keras.Sequential([
    # Map each word index to a 16-dimensional vector
    tf.keras.layers.Embedding(input_dim=num_words,
                              output_dim=embedding_dim,
                              input_length=max_length,
                              name='embed-1'),
    # Average the word vectors over the sequence
    tf.keras.layers.GlobalAveragePooling1D(name='globalave-1'),
    tf.keras.layers.Dense(6, activation='relu', name='fully-1'),
    # A single sigmoid output for binary classification
    tf.keras.layers.Dense(1, activation='sigmoid', name='sigmoid-1')
])
model.compile(loss=tf.losses.binary_crossentropy, optimizer=tf.optimizers.Adam(), metrics=['accuracy'])
model.summary()
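As a cross-check on what model.summary() reports, the sketch below works out the parameter counts by hand and traces the tensor shapes through the embedding and pooling steps with NumPy. The random embedding table and batch are stand-in data assumed for illustration only; they are not the trained weights.

```python
import numpy as np

num_words, embedding_dim, max_length = 10000, 16, 120

# Parameter counts: the embedding table, then weights + biases per Dense layer
embedding_params = num_words * embedding_dim
dense_6_params = embedding_dim * 6 + 6
dense_1_params = 6 * 1 + 1
total_params = embedding_params + dense_6_params + dense_1_params
print(total_params)

# Shapes for one batch of two padded reviews (random stand-in data)
rng = np.random.default_rng(0)
table = rng.normal(size=(num_words, embedding_dim))
batch = rng.integers(0, num_words, size=(2, max_length))
embedded = table[batch]          # lookup: one vector per token
pooled = embedded.mean(axis=1)   # average over the sequence axis
print(embedded.shape, pooled.shape)
```

Almost all of the parameters sit in the embedding table; the pooling step has none and simply reduces each review to one 16-dimensional vector.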
Feeding in the data
model.fit(padded_train, training_lable_final, epochs=10, validation_data=(padded_test, testing_label_final))
Visualization
import io
e = model.layers[0]  # the embedding layer
weights = e.get_weights()[0]
print(weights.shape)  # (num_words, embedding_dim)
# Invert the dictionary so that an index maps back to its word
reverse_word_dict = dict([(value, key) for (key, value) in word_dict.items()])
out_v = io.open("E:/datasets/tmp/language-splite/vecs.tsv", 'w', encoding='utf8')
out_m = io.open("E:/datasets/tmp/language-splite/meta.tsv", 'w', encoding='utf8')
# Index 0 is reserved for padding, so start at 1
for word_num in range(1, num_words):
    word = reverse_word_dict[word_num]
    embeddings = weights[word_num]
    out_m.write(word + '\n')
    out_v.write('\t'.join([str(x) for x in embeddings]) + '\n')
out_m.close()
out_v.close()
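Now visit projector.tensorflow.org and upload the two tsv files to see how the words are distributed in the embedding space. (The site is hosted abroad and may require a proxy to reach.)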
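The layout of the two files can be checked on a tiny example before uploading anything. The sketch below writes to in-memory buffers instead of disk; the 4-row weight matrix and the three-word dictionary are made-up stand-ins for the trained embedding and the real vocabulary.

```python
import io

import numpy as np

# Stand-in data: a 4-row embedding matrix and a tiny index -> word mapping
weights = np.arange(12, dtype=float).reshape(4, 3)
reverse_word_dict = {1: "##", 2: "good", 3: "movie"}

out_v, out_m = io.StringIO(), io.StringIO()
for word_num in range(1, weights.shape[0]):  # row 0 is the padding index
    out_m.write(reverse_word_dict[word_num] + "\n")
    out_v.write("\t".join(str(x) for x in weights[word_num]) + "\n")

vec_lines = out_v.getvalue().splitlines()
meta_lines = out_m.getvalue().splitlines()
print(vec_lines[0].split("\t"))
print(meta_lines)
```

Each line of the vectors file holds one tab-separated embedding, and the same line number in the metadata file holds the matching word, so the two files must have the same number of lines.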
Original: https://blog.csdn.net/m0_56104219/article/details/124543062
Author: 君子以阅川
Title: A neural network for the imdb_reviews movie review dataset