Text Sentiment Analysis: Neural Network Models

May 30, 2023, 9:37 PM • Artificial Intelligence • 102 views

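1. Method

How word vectors work: every word in a sentence can be converted into a vector. For a sentence of 16 words, for example, the input can be treated as a 16 × D matrix, where D is the dimension of each word vector.

(1) Build the vocabulary: store every word in the text together with its numeric index in a dictionary, and implement methods that use this dictionary to map a sentence to a list of numbers.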

Basic approach to building the vocabulary:

1) Tokenize all sentences.

2) Store the words in a dictionary, count how often each word occurs, and filter words by frequency.

3) Implement a method that converts text into a sequence of numbers.

4) Implement a method that converts a sequence of numbers back into text.


 sentences = [["今天","天气","很","好"],["今天","去","吃","什么"]]

 ws = Vocab()
 for sentence in sentences:
   ws.fit(sentence)

 ws.build_vocab(min_count = 1)
 print(ws.dict)
 >>> {'<UNK>':1, '<PAD>':0, '今天':2, '天气':3, '很':4, '好':5, '去':6, '吃':7, '什么':8}

 ret = ws.transform(["好","好","好","好","好","好","好","热","呀"], max_len = 13)
 print(ret)
 >>> [5,5,5,5,5,5,5,1,1,0,0,0,0]

 ret = ws.inverse_transform(ret)
 print(ret)
 >>> ['好','好','好','好','好','好','好','<UNK>','<UNK>','<PAD>','<PAD>','<PAD>','<PAD>']
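
The article does not show the Vocab class itself, so the following is only a minimal sketch of one way it could be written to match the four steps and the calls used in the example. The special-token names `<UNK>` and `<PAD>`, the attributes `count` and `inverse_dict`, and the truncation behaviour for sentences longer than max_len are assumptions, not the author's implementation.

```python
from collections import Counter


class Vocab:
    """Hypothetical word-to-index vocabulary (a sketch, not the original code)."""

    UNK_TAG, PAD_TAG = "<UNK>", "<PAD>"  # assumed names for the special tokens
    PAD, UNK = 0, 1                      # indices taken from the example output

    def __init__(self):
        self.dict = {self.UNK_TAG: self.UNK, self.PAD_TAG: self.PAD}
        self.count = Counter()           # word -> number of occurrences
        self.inverse_dict = {}           # index -> word, filled by build_vocab

    def fit(self, sentence):
        """Accumulate word frequencies from one tokenized sentence."""
        self.count.update(sentence)

    def build_vocab(self, min_count=1):
        """Keep words seen at least min_count times and give each an index."""
        for word, freq in self.count.items():
            if freq >= min_count and word not in self.dict:
                self.dict[word] = len(self.dict)
        self.inverse_dict = {index: word for word, index in self.dict.items()}

    def transform(self, sentence, max_len):
        """Map words to indices, truncating or padding to exactly max_len."""
        ids = [self.dict.get(word, self.UNK) for word in sentence[:max_len]]
        return ids + [self.PAD] * (max_len - len(ids))

    def inverse_transform(self, ids):
        """Map indices back to words; unknown indices become the UNK token."""
        return [self.inverse_dict.get(index, self.UNK_TAG) for index in ids]
```

Because Counter keeps insertion order, words receive indices in the order they are first seen, which is why 今天 ends up at index 2 in the example.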

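(2) Word vector representation (word embedding)

Text cannot be fed to a model directly, so it first has to be converted into vectors. Common choices are one-hot encoding and word embedding; word embedding is used here.

Word embedding is a common way to represent text in deep learning. Unlike one-hot encoding, word embedding represents each token with a dense vector of floating-point values. The vector dimension is usually chosen according to the vocabulary size, for example 100, 256, or 300. Every value in a vector is a parameter: it is initialized randomly and then learned during training. The resulting vectors are related to one another, so the similarity between them can be computed.

token -> num -> vector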
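
To make the token -> num -> vector pipeline concrete, here is a small NumPy sketch that contrasts a one-hot vector with a dense embedding row and computes a cosine similarity. The toy vocabulary, the dimension of 4, and all function names are made up for illustration, and the embedding values are random (untrained), so the similarity printed here carries no meaning yet.

```python
import numpy as np

# Toy vocabulary (assumed for illustration only)
vocab = {"<PAD>": 0, "<UNK>": 1, "today": 2, "weather": 3, "good": 4}
embedding_dim = 4  # real models typically use values such as 100 or 300

# Randomly initialized embedding matrix: one row per vocabulary entry
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))


def one_hot(index, size):
    """Sparse representation: a single 1 at the word's index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def embed(token):
    """token -> num -> vector: look up the token's row in the matrix."""
    index = vocab.get(token, vocab["<UNK>"])
    return embedding_matrix[index]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(one_hot(vocab["good"], len(vocab)))   # length equals the vocabulary size
print(embed("good"))                        # length equals embedding_dim
print(cosine_similarity(embed("good"), embed("weather")))
```

After training, the same cosine computation is what lets related words score higher than unrelated ones.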

2.1 Using the word embedding API: torch.nn.Embedding(num_embeddings, embedding_dim)

"""
   Parameters
   1. num_embeddings: size of the vocabulary
   2. embedding_dim: dimension of each embedding vector
"""
   embedding = nn.Embedding(vocab_size, 300)
   input_embeded = embedding(input)

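2.2 Using Word2Vec

Word2Vec represents words as high-dimensional vectors and places words with similar meanings close to each other. All it needs for training is a large corpus in the target language.

Suppose the input sentence is "I thought the movie was incredible and inspiring". To obtain its word vectors, we can use TensorFlow's embedding function tf.nn.embedding_lookup(). It takes two arguments: the embedding matrix (the word-vector matrix) and the index of each word. The result is the matrix of word vectors for the sentence.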
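
The lookup itself is just row selection from the embedding matrix. The NumPy sketch below imitates it for the example sentence; the word-to-index mapping, the dimension of 50, and the random matrix are assumptions standing in for a trained Word2Vec model, and `lookup` is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

sentence = "I thought the movie was incredible and inspiring".split()

# Hypothetical vocabulary built from the sentence itself
word_to_index = {word: i for i, word in enumerate(sorted(set(sentence)))}

# Stand-in for a trained word-vector matrix: one 50-dimensional row per word
embedding_dim = 50
embedding_matrix = np.random.default_rng(0).normal(
    size=(len(word_to_index), embedding_dim)
)


def lookup(matrix, ids):
    """Select one row per index, like an embedding lookup on a single matrix."""
    return matrix[np.asarray(ids)]


ids = [word_to_index[word] for word in sentence]
sentence_vectors = lookup(embedding_matrix, ids)
print(sentence_vectors.shape)  # (8, 50): 8 words, each a 50-dimensional vector
```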

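(3) Building the neural network and training the model
How RNNs are used: once Word2Vec has turned the words into high-dimensional vectors, a sentence corresponds to a sequence of word vectors. An RNN can encode this high-dimensional sentence representation into a single lower-dimensional vector while retaining most of the useful information.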
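
As a rough illustration of this encoding step, the sketch below runs a plain tanh RNN over a sequence of word vectors and keeps only the final hidden state as the sentence vector. The sizes (16 words, 100-dimensional inputs, 50 hidden units) and the random weights are assumptions chosen for the example; a real model would learn the weights.

```python
import numpy as np


def rnn_encode(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over inputs of shape (T, D).

    Returns the final hidden state of shape (H,), used as the sentence vector.
    """
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                      # one word vector per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h


rng = np.random.default_rng(0)
T, D, H = 16, 100, 50                       # words, input dim, hidden dim
inputs = rng.normal(size=(T, D))            # stand-in for 16 word vectors
W_x = rng.normal(scale=0.1, size=(H, D))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(H, H))    # hidden-to-hidden weights
b = np.zeros(H)

sentence_vector = rnn_encode(inputs, W_x, W_h, b)
print(sentence_vector.shape)                # (50,)
```

The 16 × 100 input collapses to a single 50-dimensional vector regardless of sentence length.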

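LSTM adds memory and forgetting mechanisms on top of the RNN to address the long-term dependency problem. Think of the machine's understanding of a sentence as its state. An input can enter the state only if it passes the input gate; most words do not pass, and only a few key words get in. As the state reads more and more words, the number of words held in the state grows. The words in the state loop back through the forget gate, and only those that pass it are retained, so the number of words in the state eventually reaches a balance. At output time there is an output gate, and only the words that pass it are output. These are the three gates of an LSTM:

1) Input gate: decides which words may enter the memory.
2) Forget gate: decides which words stay in the memory.
3) Output gate: decides which words are output.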
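
The three gates can be written down as a single time step of the standard LSTM equations. The NumPy sketch below is illustrative only: the function name, the stacked weight layout (input, forget, output, candidate), and the toy sizes are assumptions, and in practice a library layer such as the Keras LSTM used later would be used instead.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D), U: (4H, H), b: (4H,), stacked in the order
    input gate, forget gate, output gate, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: what may enter the memory
    f = sigmoid(z[H:2 * H])        # forget gate: what stays in the memory
    o = sigmoid(z[2 * H:3 * H])    # output gate: what is exposed as output
    g = np.tanh(z[3 * H:4 * H])    # candidate values computed from the input
    c = f * c_prev + i * g         # updated cell state (the memory)
    h = o * np.tanh(c)             # updated hidden state (the output)
    return h, c


rng = np.random.default_rng(0)
D, H = 100, 50                      # toy sizes, chosen for the example
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(16, D)):   # a stand-in sentence of 16 word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

A gate value near 0 blocks information and a value near 1 lets it through, which is the numerical counterpart of a word "passing" or "not passing" a gate.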

2. Code (BiLSTM)


import multiprocessing

import jieba
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary

import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM

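cpu_count = multiprocessing.cpu_count()  # worker threads for Word2Vec
vocab_dim = 100     # dimension of each word vector
n_iterations = 1    # Word2Vec training epochs
n_exposures = 10    # Word2Vec min_count: ignore words seen fewer than 10 times
window_size = 7     # Word2Vec context window
n_epoch = 30        # BiLSTM training epochs
maxlen = 100        # every sequence is padded or truncated to this length
batch_size = 32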

def loadfile():
    neg = pd.read_csv('data/train_neg.csv', header=None, index_col=None)
    pos = pd.read_csv('data/train_pos.csv', header=None, index_col=None)

    combined = np.concatenate((pos[0],neg[0]))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))

    return combined, y

def tokenizer(data):
    text = [jieba.lcut(document.replace('\n', '')) for document in data]
    return text

def create_dictionaries(model=None, combined=None):

    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)

        w2indx = {v: k + 1 for k, v in gensim_dict.items()}
        # Save the word -> index mapping, one "word index" pair per line
        with open("word2index.txt", 'w', encoding='utf8') as f:
            for key in w2indx:
                f.write(str(key))
                f.write(' ')
                f.write(str(w2indx[key]))
                f.write('\n')
        w2vec = {word: model.wv[word] for word in w2indx.keys()}

        def parse_dataset(combined):
            data = []
            for sentence in combined:
                new_txt = []
                for word in sentence:
                    try:
                        new_txt.append(w2indx[word])
                    except KeyError:
                        # words missing from the vocabulary map to index 0
                        new_txt.append(0)
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        combined = sequence.pad_sequences(combined, maxlen=maxlen)
        return w2indx, w2vec, combined
    else:
        print('No data provided...')

def word2vec_train(combined):
    # gensim 3.x parameter names; gensim 4 renamed size -> vector_size and iter -> epochs
    model = Word2Vec(size=vocab_dim,
                     min_count=n_exposures,
                     window=window_size,
                     workers=cpu_count,
                     iter=n_iterations)
    model.build_vocab(combined)
    model.train(combined, total_examples=model.corpus_count, epochs=model.iter)
    model.save('./model/Word2vec_model.pkl')
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined

def get_data(index_dict, word_vectors, combined, y):
    n_symbols = len(index_dict) + 1
    embedding_weights = np.zeros((n_symbols, vocab_dim))
    for word, index in index_dict.items():
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2,random_state=5)
    y_train = keras.utils.to_categorical(y_train, num_classes=2)
    y_test = keras.utils.to_categorical(y_test, num_classes=2)

    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test

def train_bilstm(n_symbols, embedding_weights, x_train, y_train):
    print('Defining a Simple Keras Model...')
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=maxlen))

    model.add(Bidirectional(LSTM(units=50, activation='tanh')))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch, verbose=2)

    model.save('./model/bilstm.h5')

if __name__ == '__main__':

    print('Loading the dataset...')
    combined, y = loadfile()
    print(len(combined), len(y))
    print('Preprocessing the data...')
    combined = tokenizer(combined)
    print('Training the word2vec model...')
    index_dict, word_vectors, combined = word2vec_train(combined)

    print('Converting the data to the model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined,
                                                                              y)
    print("Feature and label shapes:")
    print(x_train.shape, y_train.shape)

    print('Training the bilstm model...')
    train_bilstm(n_symbols, embedding_weights, x_train, y_train)

    print('Loading the bilstm model...')
    model = load_model('./model/bilstm.h5')

    # Convert softmax probabilities and one-hot labels to class indices
    y_pred = np.argmax(model.predict(x_test), axis=1)
    y_true = np.argmax(y_test, axis=1)

    print(classification_report(y_true, y_pred))
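
To see what the indexing and padding steps in create_dictionaries produce, here is a small NumPy sketch of the same idea: tokens are mapped to indices, unknown words become 0, and each sequence is padded or truncated at the front, which is the default behaviour of Keras pad_sequences. The function name `encode_and_pad` and the toy vocabulary are hypothetical and are not part of the script above.

```python
import numpy as np


def encode_and_pad(tokenized_sentences, word_to_index, maxlen):
    """Map tokens to indices and left-pad or left-truncate to maxlen.

    Unknown words get index 0, the same value that is used for padding.
    """
    batch = np.zeros((len(tokenized_sentences), maxlen), dtype=int)
    for row, sentence in enumerate(tokenized_sentences):
        # keep only the last maxlen tokens (truncation at the front)
        ids = [word_to_index.get(word, 0) for word in sentence][-maxlen:]
        if ids:
            batch[row, -len(ids):] = ids   # zeros remain at the front
    return batch


# Toy vocabulary; indices start at 1 because 0 is reserved
word_to_index = {"今天": 1, "天气": 2, "很": 3, "好": 4}
sentences = [["今天", "天气", "很", "好"], ["天气", "真", "好"]]

print(encode_and_pad(sentences, word_to_index, maxlen=6))
# [[0 0 1 2 3 4]
#  [0 0 0 2 0 4]]
```

Since index 0 stands for both padding and unknown words, and the Embedding layer is created with mask_zero=True, unknown words are skipped by the LSTM in the same way as padding.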


Original: https://blog.csdn.net/m0_46144891/article/details/118934203
Author: Yue_kk
Title: 文本情感倾向分析——神经网络模型
