[NLP] Movie Review Sentiment Analysis (Basics)

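Learning summary

(1) Differences between spaCy 3.0 and 2.0: https://spacy.io/usage/v3
(2) spaCy can train models on a GPU and can be combined with Hugging Face transformer models.
(3) Pay particular attention in this post to how spaCy's tokenizer works; see the tutorial on the official site.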

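Contents

Learning summary
Introduction
1. Data types represented as strings
2. The IMDb movie dataset
3. Turning text strings into numbers
3.1 Representing text data as a bag of words
3.2 Applying bag-of-words to a toy dataset
3.3 Applying bag-of-words to movie reviews
4. Rescaling the data with tf-idf
5. Investigating model coefficients
6. Bag-of-words with more than one word (n-grams)
7. Advanced tokenization, stemming, and lemmatization
8. Topic modeling and document clustering
8.1 Latent Dirichlet Allocation
9. Task summary
9.1 Summary
9.2 Further reading
10. Engineering mindset
Reference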

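Introduction

Besides the two typical kinds of data attributes:
Continuous features: describe a quantity.
Categorical features: items from a fixed list.

there is a third kind: text. For example, in customer service you may want to decide whether a message is a complaint or an inquiry: from its subject and content you infer what the customer wants, then forward the message to the relevant department.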

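1. Data types represented as strings

There are four kinds of string data:

Categorical data: for example, a survey that asks people to pick their favorite color from a fixed list, so the dataset takes one of 8 values such as red, green, or blue.
Free strings that can be semantically mapped to categories: for example, a color survey with a free-text field, where someone writes "the orange of my younger brother's room", which is hard to map automatically to orange.
Structured string data: addresses, names of people or places, dates, phone numbers. How to handle these depends on context and domain.
Text data: sentences made up of words. All examples below use English text.

Corpus: the dataset.
Document: each data point, represented as a single text.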

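2. The IMDb movie dataset

The IMDb (Internet Movie Database) dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
About the dataset: reviews on the IMDb site carry a score from 1 to 10. To simplify, these scores are collapsed into two classes: a score of 7 or higher is positive, a score of 4 or lower is negative, and neutral reviews are not included in the dataset.

The directory layout is as follows, where neg means negative and pos means positive: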

aclImdb_v1/aclImdb
├── test
│   ├── neg
│   └── pos
└── train
    ├── neg
    ├── pos
    └── unsup

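This folder structure can be loaded with scikit-learn's load_files function. The train folder also contains an unlabeled unsup subfolder; the categories argument below restricts loading to neg and pos, so that only the 25,000 labeled reviews are read:

from sklearn.datasets import load_files

# Load only the labeled classes; without `categories`, the unlabeled
# `unsup` folder would be read as a third class.
reviews_train = load_files("aclImdb_v1/aclImdb/train/", categories=["neg", "pos"])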

text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

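As shown below, text_train is a list with 25,000 elements, and the review at index 6 is printed as an example. Raw reviews can contain HTML line-break tags (<br />); it is best to remove this kind of markup during data cleaning.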

type of text_train: <class 'list'>

length of text_train: 25000

text_train[6]:
b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."

# Replace HTML line breaks with a space (the elements are bytes, hence the b"..." literals).
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

The type of the elements of text_train depends on your Python version. In Python 3 they are of type bytes, a binary encoding of the string data. In Python 2, text_train contains strings. For background on strings and Unicode, see the Python 2 (https://docs.python.org/2/howto/unicode.html) and Python 3 (https://docs.python.org/3/howto/unicode.html) documentation.
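
As a minimal sketch of the bytes handling described above, the snippet below cleans one hypothetical review. It assumes that line breaks appear as the literal markup `<br />` and that the files are UTF-8 encoded; both are assumptions, not something the loaded data guarantees.

```python
# Hypothetical raw review, as bytes (the type load_files returns in Python 3).
raw_review = b"Great plot.<br /><br />Weak ending."

# Replace the HTML line-break markup with a space while still in bytes.
cleaned_bytes = raw_review.replace(b"<br />", b" ")

# Decode to str when text (not bytes) is needed downstream.
cleaned_text = cleaned_bytes.decode("utf-8")
print(cleaned_text)
```

Note that both arguments to replace must be bytes; mixing bytes and str raises a TypeError in Python 3.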

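The dataset was collected so that the positive and negative classes are balanced:

import numpy as np

print("Samples per class (training): {}".format(np.bincount(y_train)))

The test dataset is loaded and processed in the same way: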

reviews_test = load_files("aclImdb_v1/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
# Replace HTML line breaks with a space, as for the training data.
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Output:

Number of documents in test data: 25000
Samples per class (test): [12500 12500]

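3. Turning text strings into numbers

3.1 Representing text data as a bag of words

A bag-of-words representation simply counts how often each word in the corpus appears in each text.

Computing the bag-of-words representation involves three steps.

Tokenization: split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.
Vocabulary building: collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
Encoding: for each document, count how often each vocabulary word appears in that document.

Figure: bag-of-words processing of the string "This is how you get ants".

Output: one vector of word counts per document.
For every word in the vocabulary, we get the number of times it appears in each document.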
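
The three steps above can be sketched in plain Python. This is only an illustration, not how CountVectorizer is implemented: the tokenizer is a simplified regular expression, and the second document is a made-up example added so the vocabulary covers more than one text.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # a simplification of splitting on whitespace and punctuation.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents):
    # Collect every token seen in any document and number them alphabetically.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(document, vocabulary):
    # Count how often each vocabulary word occurs in the document.
    counts = [0] * len(vocabulary)
    for tok in tokenize(document):
        if tok in vocabulary:
            counts[vocabulary[tok]] += 1
    return counts

# The first string is the example from the text; the second is hypothetical.
docs = ["This is how you get ants.", "You get ants this way."]
vocab = build_vocabulary(docs)
vectors = [encode(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a vector with one entry per vocabulary word, so word order is lost but every document has the same length.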

3.2 Applying bag-of-words to a toy dataset

bards_words =["The fool doth think he is wise,",
              "but the wise man knows himself to be a fool"]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

Fitting the CountVectorizer tokenizes the training data and builds the vocabulary, which can then be accessed through the vocabulary_ attribute:

Vocabulary size: 13
Vocabulary content:
 {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}

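The vocabulary has 13 words, from "be" to "wise". Calling transform creates the bag-of-words representation of the training data. The result is stored as a SciPy sparse matrix, so use the toarray method to convert it to a dense NumPy array: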

bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

print("Dense representation of bag_of_words:\n{}".format(
    bag_of_words.toarray()))

The dense array obtained with toarray:

Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]

3.3 Applying bag-of-words to movie reviews

(1) Load the data as a list of strings


vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

"""
X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
    with 3431196 stored elements in Compressed Sparse Row format>
"""

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = vect.get_feature_names_out().tolist()
print("Number of features: {}".format(len(feature_names)), '\n')
print("First 20 features:\n{}".format(feature_names[:20]), '\n')
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]), '\n')
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

"""
Number of features: 74849

First 20 features:
['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']

Features 20010 to 20030:
['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']

Every 2000th feature:
['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']
"""

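(1) Many entries are just numbers, and it is often hard to extract meaning from tokens like these.
(2) Some words appear in closely related forms, such as "draught", "drawback", and "drawer", whose singular and plural forms are both in the vocabulary. These should arguably not be treated as different words.

(2) Cross-validation

For high-dimensional sparse data like this, a linear model such as LogisticRegression is the simplest place to start. We first evaluate LogisticRegression with cross-validation.

This may look like it violates the usual rule about combining cross-validation with preprocessing, since the vectorizer was fit on the whole training set. However, CountVectorizer with its default settings does not actually collect any statistics, so our results are valid. For real applications, using a Pipeline from the start is the better choice, and we do that later.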

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))
"""
Mean cross-validation accuracy: 0.88
"""

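The mean cross-validation score is 88%, which is reasonable for a balanced binary classification task. LogisticRegression has a regularization parameter C, which can be tuned with cross-validation: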

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Cross-validation finds that C=0.1 gives a cross-validation score of 89%:

Best cross-validation score: 0.89
Best parameters:  {'C': 0.1}

X_test = vect.transform(text_test)
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))
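
The remark in this subsection about using a Pipeline from the start can be sketched as follows. The ten reviews and their labels are hypothetical toy data, chosen only so that 5-fold cross-validation has one positive and one negative review per fold; the point is the structure, not the scores.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; 1 = positive, 0 = negative.
texts = [
    "a wonderful and touching film", "great acting and a great story",
    "excellent movie, truly enjoyable", "wonderful cast, excellent script",
    "an enjoyable and refreshing story", "the worst film of the year",
    "a boring waste of time", "terrible acting and a dull plot",
    "dull, boring and laughable", "awful script, a complete waste",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The vectorizer is refit on the training part of every fold, so no
# vocabulary information leaks from the held-out part.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(pipe, texts, labels, cv=5)
print("Fold accuracies:", scores)
```

With a pipeline, options such as min_df, which do depend on the training data, can be added later without invalidating the cross-validation.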

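(3) Extracting tokens with a regular expression

The default regular expression is "\b\w\w+\b".
Meaning: it finds all sequences of at least two letters or digits (\w) that are delimited by word boundaries (\b). It does not match single-letter words, and it splits contractions like "doesn't" and strings like "bit.ly", but it matches "h8ter" as a single word.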
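
A quick way to see this behavior is to apply the same pattern with Python's re module. This is a sketch of the pattern only; the three sample strings are made up, and lowercasing is done by hand here.

```python
import re

# The pattern quoted above: two or more word characters between word boundaries.
token_pattern = r"\b\w\w+\b"

samples = ["I doesn't", "see bit.ly now", "a h8ter"]
for text in samples:
    print(text, "->", re.findall(token_pattern, text.lower()))
```

Single-character tokens such as "i", "a", and the "t" left over from "doesn't" are dropped, while "bit.ly" is split at the dot.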

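CountVectorizer then converts all words to lowercase, so that "soon", "Soon", and "sOon" all correspond to the same token (and therefore the same feature). This simple mechanism works well in practice, but as we saw earlier, it yields many uninformative features (such as numbers). One way to cut these down is to use only tokens that appear in at least 2 documents (or at least 5, and so on). A token that appears in only one document is unlikely to appear in the test set and is therefore not helpful. The min_df parameter sets the minimum number of documents a token must appear in.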

vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

"""
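X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
    with 3354014 stored elements in Compressed Sparse Row format>
"""

With min_df=5, every token must appear in at least 5 documents, which brings the number of features down to 27,271 (only about a third of the original features). Let's look at some tokens again:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = vect.get_feature_names_out().tolist()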

print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

First 50 features:
['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '100th', '101', '102', '103', '104', '105', '107', '108', '10s', '10th', '11', '110', '112', '116', '117', '11th', '12', '120', '12th', '13', '135', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '1600', '16mm', '16s', '16th']

Features 20010 to 20030:
['repentance', 'repercussions', 'repertoire', 'repetition', 'repetitions', 'repetitious', 'repetitive', 'rephrase', 'replace', 'replaced', 'replacement', 'replaces', 'replacing', 'replay', 'replayable', 'replayed', 'replaying', 'replays', 'replete', 'replica']

Every 700th feature:
['00', 'affections', 'appropriately', 'barbra', 'blurbs', 'butchered', 'cheese', 'commitment', 'courts', 'deconstructed', 'disgraceful', 'dvds', 'eschews', 'fell', 'freezer', 'goriest', 'hauser', 'hungary', 'insinuate', 'juggle', 'leering', 'maelstrom', 'messiah', 'music', 'occasional', 'parking', 'pleasantville', 'pronunciation', 'recipient', 'reviews', 'sas', 'shea', 'sneers', 'steiger', 'swastika', 'thrusting', 'tvs', 'vampyre', 'westerns']

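The results show fewer numbers, and fewer obscure words and misspellings. Running the grid search again (below) still gives 89%, but in many cases reducing the number of features to process makes the model easier to interpret.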

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

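If a document contains words that were not in the training data and we call the transform method of CountVectorizer on it, those words are ignored because they are not part of the dictionary. This is not a problem for classification, since nothing can be learned about words that are not in the training data. For some applications, though (spam detection, for example), it may help to add a feature that counts how many so-called "out of vocabulary" words a document contains. For this to work, you need to set min_df; otherwise, this feature will never be active during training.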
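
One possible way to compute such an out-of-vocabulary count is sketched below. The helper count_oov and the three training sentences are hypothetical; the sketch relies on build_analyzer, which returns the same preprocessing and tokenization callable that the vectorizer uses internally.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents.
train_docs = ["the movie was great", "the movie was awful", "great acting"]

# With min_df=2, words seen in only one training document stay out of the vocabulary.
vect = CountVectorizer(min_df=2).fit(train_docs)

# The analyzer applies the same preprocessing and tokenization as transform().
analyzer = vect.build_analyzer()

def count_oov(document):
    # Number of tokens that are not in the fitted vocabulary.
    return sum(token not in vect.vocabulary_ for token in analyzer(document))

print(sorted(vect.vocabulary_))
print(count_oov("the movie was surprisingly awful"))
```

Because min_df pushes rare training words such as "awful" out of the vocabulary, the count is already non-zero for some training documents, which is what lets a model learn from it.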

(4) Removing stop words

Inspect the stop-word list:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['cannot', 'across', 'describe', 'even', 'how', 'because', 'nothing', 'every', 'up', 'either', 'thru', 'call', 'ourselves', 'why', 'becomes', 'seemed', 'thereby', 'etc', 'whereby', 'than', 'yours', 'everyone', 'eg', 'these', 'by', 'mostly', 'always', 'during', 'among', 'to', 'alone', 'it']

Now remove the stop words from our own data:


vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

After removal, the number of features drops to 26,966, that is, 305 stop words (27,271 - 26,966) were removed.

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
    with 2149958 stored elements in Compressed Sparse Row format>

Finally, run the grid search again as before:

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

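Grid search performance drops slightly when the stop-word list is used. This is nothing to worry about, but since removing 305 features out of more than 27,000 is unlikely to change performance or interpretability much, using this list does not seem worthwhile. Fixed lists are mostly helpful for small datasets, which may not contain enough information for the model to work out from the data itself which words are stop words.

Exercise: discard the most frequent words by setting the max_df option of CountVectorizer, and see how that affects the number of features and the performance.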

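4. Rescaling the data with tf-idf

The approach so far drops features deemed unimportant. Another approach is to rescale features by how informative we expect them to be, as tf-idf (term frequency-inverse document frequency) does: a term that appears in many documents gets a low weight, while a term that appears often in one particular document but not in many others gets a high weight.

There are several variants of this rescaling scheme: https://en.wikipedia.org/wiki/Tf-idf
The tf-idf score of word w in document d, in the form scikit-learn uses by default (smooth_idf=True, natural logarithm), is:

$$\operatorname{tfidf}(w, d) = \operatorname{tf} \cdot \left( \log\left( \frac{N+1}{N_w+1} \right) + 1 \right)$$

N is the number of documents in the training set.
$N_w$ is the number of documents in the training set that contain the word w.
tf (term frequency) is the number of times the word w appears in the query document d.
Both classes (TfidfTransformer and TfidfVectorizer) also apply L2 normalization after computing the tf-idf representation: they rescale each document's representation to have Euclidean norm 1. With this rescaling, the length of a document (its number of words) does not change the vectorized representation.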
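
A small worked example may make the formula concrete. The numbers below are hypothetical, and the sketch assumes the reading tf · (log((N+1)/(N_w+1)) + 1) with a natural logarithm, which is my understanding of scikit-learn's default smoothed idf; it is plain arithmetic, not a call into the library.

```python
import math

# Hypothetical counts: N documents in total, n_w of which contain the word w.
N = 4
n_w = 1
tf = 3  # occurrences of w in the document being scored

# Smoothed inverse document frequency, then the tf-idf score.
idf = math.log((N + 1) / (n_w + 1)) + 1
tfidf = tf * idf
print(round(idf, 4), round(tfidf, 4))

# L2 normalization rescales a document vector to Euclidean length 1.
# The second entry is an arbitrary score standing in for another word.
vector = [tfidf, 4.0]
norm = math.sqrt(sum(v * v for v in vector))
unit = [v / norm for v in vector]
print(round(sum(v * v for v in unit), 6))
```

A word found in every document would get idf = log(1) + 1 = 1, the lowest possible value, so common words are down-weighted but never zeroed out.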

Because tf-idf makes use of statistical properties of the training data, we use a pipeline to make sure the grid search results are valid:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
                     LogisticRegression())
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

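There is some improvement with tf-idf over plain word counts, although at two decimals the score still prints as Best cross-validation score: 0.89.
(1) We can also inspect which words tf-idf found most important.
(2) The goal of tf-idf rescaling is to find words that distinguish documents, but it is a purely unsupervised technique. So "important" here does not necessarily relate to the "positive review" and "negative review" labels we care about.

First, extract the TfidfVectorizer from the pipeline:

vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]

# transform the training dataset
X_train = vectorizer.transform(text_train)

# find the maximum tf-idf value of each feature over the dataset
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
feature_names = vectorizer.get_feature_names_out()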

print("Features with lowest tfidf:\n{}".format(
      feature_names[sorted_by_tfidf[:20]]))

print("Features with highest tfidf: \n{}".format(
      feature_names[sorted_by_tfidf[-20:]]))

Features with lowest tfidf:
['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'
 'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'fond'
 'stinker' 'avoided' 'emphasis' 'commented' 'disappoint' 'realizing'
 'downhill' 'inane']
Features with highest tfidf:
['coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'taker' 'macarthur'
 'vargas' 'jesse' 'basket' 'dominick' 'the' 'victor' 'bridget' 'victoria'
 'khouri' 'zizek' 'rob' 'timon' 'titanic']

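Features with low tf-idf are either very common across documents, or used only sparingly and only in very long documents. Interestingly, many of the high tf-idf features actually identify particular shows or movies. These terms appear only in reviews of that particular show or movie, but tend to appear very often in those reviews.

This is obvious for names such as "pokemon", "smallville", and "doodlebops", and "scanners" is in fact also a movie title. (These four examples are not in the list printed above, which comes from a different run; "titanic" is a comparable entry there.) Such words are unlikely to help with our sentiment classification task (unless some franchises are reviewed as universally positive or negative), but they certainly carry a lot of specific information about the reviews.

We can also find the words with low inverse document frequency, that is, words that appear frequently and are therefore deemed less important. The inverse document frequency values found on the training set are stored in the idf_ attribute: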

sorted_by_idf = np.argsort(vectorizer.idf_)
print("Features with lowest idf:\n{}".format(
       feature_names[sorted_by_idf[:100]]))

Features with lowest idf:
['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'
 'was' 'as' 'on' 'movie' 'not' 'have' 'one' 'be' 'film' 'are' 'you' 'all'
 'at' 'an' 'by' 'so' 'from' 'like' 'who' 'they' 'there' 'if' 'his' 'out'
 'just' 'about' 'he' 'or' 'has' 'what' 'some' 'good' 'can' 'more' 'when'
 'time' 'up' 'very' 'even' 'only' 'no' 'would' 'my' 'see' 'really' 'story'
 'which' 'well' 'had' 'me' 'than' 'much' 'their' 'get' 'were' 'other'
 'been' 'do' 'most' 'don' 'her' 'also' 'into' 'first' 'made' 'how' 'great'
 'because' 'will' 'people' 'make' 'way' 'could' 'we' 'bad' 'after' 'any'
 'too' 'then' 'them' 'she' 'watch' 'think' 'acting' 'movies' 'seen' 'its'
 'him']

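As expected, these are mostly English stop words such as "the" and "no". But some are clearly specific to movie reviews, such as "movie", "film", "time", and "story". Interestingly, "good", "great", and "bad" are also among the most frequent words and therefore count as "less relevant" by the tf-idf measure, even though we might expect them to be very important for sentiment analysis.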

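5. Investigating model coefficients

Now let's look at what the logistic regression model actually learned from the data. Because there are so many features (27,271 after removing the infrequent ones), we clearly cannot look at all the coefficients at once. However, we can look at the largest coefficients and see which words they correspond to. We use the last model trained on the tf-idf features.

The bar chart produced by the code in this section shows the 40 largest and 40 most negative coefficients of the logistic regression model, with the bar height giving the size of each coefficient. It uses the mglearn helper package that accompanies the book.

For linear regression and logistic regression, the model computes a function such as g(x) = w1*x1 + w2*x2 + w3*x3 + w4*x4 + w0; passing g(x) through the sigmoid function turns it into a classifier. A fitted logistic regression exposes two attributes, coef_ and intercept_: here coef_ holds w1 to w4 and intercept_ holds w0.

import mglearn

mglearn.tools.visualize_coefficients(
    grid.best_estimator_.named_steps["logisticregression"].coef_,
    feature_names, n_top_features=40)

Figure: largest and smallest coefficients of logistic regression trained on tf-idf features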

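The negative coefficients on the left belong to words that, according to the model, indicate negative reviews, while the positive coefficients on the right belong to words that indicate positive reviews. Most of the words are quite intuitive: "worst", "waste", "disappointment", and "laughable" indicate bad movie reviews, while "excellent", "wonderful", "enjoyable", and "refreshing" indicate positive ones.

Some words are less clear, like "bit", "job", and "today", but these might be part of phrases like "good job" or "best today".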
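
To connect the coefficients to predictions, here is a minimal sketch of how a linear score and the sigmoid combine, following the g(x) formula from this section. The weights, intercept, and input vector are hypothetical values, not ones taken from the trained model.

```python
import math

# Hypothetical learned parameters for four features.
coef = [1.2, -0.8, 0.5, -2.0]  # w1..w4, the role played by coef_
intercept = 0.1                # w0, the role played by intercept_

def decision_value(x):
    # g(x) = w1*x1 + w2*x2 + w3*x3 + w4*x4 + w0
    return sum(w * xi for w, xi in zip(coef, x)) + intercept

def predict_proba_positive(x):
    # The sigmoid maps g(x) to a probability for the positive class.
    return 1.0 / (1.0 + math.exp(-decision_value(x)))

# Hypothetical feature vector, e.g. tf-idf values for four words.
x = [1.0, 0.0, 2.0, 0.0]
g = decision_value(x)
p = predict_proba_positive(x)
print(g, p, "positive" if p >= 0.5 else "negative")
```

A word with a large positive weight pushes g(x) up, and with it the predicted probability of a positive review, each time it occurs; a large negative weight does the opposite. That is why sorting the coefficients is informative.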

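6. Bag-of-words with more than one word (n-grams)

The bag-of-words model discards word order completely. There is, however, a way to capture context with a bag-of-words representation: count not only single tokens but also pairs or triplets of tokens that appear next to each other. Sequences of n adjacent tokens are called n-grams. The range of token sequences used as features is controlled by the ngram_range parameter of CountVectorizer or TfidfVectorizer.
ngram_range is a tuple giving the minimum and maximum length of the token sequences to consider.

(1) Using the toy dataset again, start with unigrams: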

bards_words =["The fool doth think he is wise,",
              "but the wise man knows himself to be a fool"]

cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
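# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print("Vocabulary:\n{}".format(cv.get_feature_names_out().tolist()))

The unigram result: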

Vocabulary size: 13
Vocabulary:
['be', 'but', 'doth', 'fool', 'he', 'himself', 'is', 'knows', 'man', 'the', 'think', 'to', 'wise']

(2) For bigrams only, use ngram_range=(2, 2):


cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
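# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print("Vocabulary:\n{}".format(cv.get_feature_names_out().tolist()))

The bigram result: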

Vocabulary size: 14
Vocabulary:
['be fool', 'but the', 'doth think', 'fool doth', 'he is', 'himself to', 'is wise', 'knows himself', 'man knows', 'the fool', 'the wise', 'think he', 'to be', 'wise man']

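(3) As before, transform creates the bag-of-words representation of the training data as a SciPy sparse matrix, which toarray converts to a dense NumPy array.

The code below continues with the bigram (n=2) vectorizer. A larger n yields more features, and too many features can lead to overfitting. In principle, the number of possible bigrams is the number of unigrams squared and the number of possible trigrams is the number of unigrams cubed, although far fewer actually occur in real text.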


bag_of_words = cv.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

print("Transformed data (dense):\n{}".format(cv.transform(bards_words).toarray()))

Transformed data (dense):
[[0 0 1 1 1 0 1 0 0 1 0 1 0 0]
 [1 1 0 0 0 1 0 1 1 0 1 0 1 1]]

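(4) Using unigrams, bigrams, and trigrams together on bards_words gives:

cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print("Vocabulary:\n{}".format(cv.get_feature_names_out().tolist()))

The resulting vocabulary: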

Vocabulary size: 39
Vocabulary:
['be', 'be fool', 'but', 'but the', 'but the wise', 'doth', 'doth think', 'doth think he', 'fool', 'fool doth', 'fool doth think', 'he', 'he is', 'he is wise', 'himself', 'himself to', 'himself to be', 'is', 'is wise', 'knows', 'knows himself', 'knows himself to', 'man', 'man knows', 'man knows himself', 'the', 'the fool', 'the fool doth', 'the wise', 'the wise man', 'think', 'think he', 'think he is', 'to', 'to be', 'to be fool', 'wise', 'wise man', 'wise man knows']
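
To see where the vocabulary sizes 13, 14, and 39 come from, the n-gram construction can be sketched in plain Python. The helpers tokenize, ngrams, and vocabulary are illustrative names; the tokenizer is a simplified stand-in for CountVectorizer's default (lowercase, tokens of at least two word characters).

```python
import re

def tokenize(text):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # All runs of n adjacent tokens, joined by a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(docs, min_n, max_n):
    # Union of all n-grams with min_n <= n <= max_n over all documents.
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(min_n, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

print(len(vocabulary(bards_words, 1, 1)))
print(len(vocabulary(bards_words, 2, 2)))
print(len(vocabulary(bards_words, 1, 3)))
```

The single-letter word "a" is dropped before n-grams are formed, which is why "be fool" appears as a bigram even though the two words are not adjacent in the original sentence.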

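(5) The best value of n can be chosen with a grid search. The search below takes a long time to run, because the grid is fairly large and includes trigrams: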

pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())

param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

The grid search results are below. The best setting is ngram_range=(1, 3), which also gives the best score so far (91%), thanks to the added bigram and trigram features:

Best cross-validation score: 0.91
Best parameters:
{'logisticregression__C': 100, 'tfidfvectorizer__ngram_range': (1, 3)}

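(6) Heat map: the cross-validation accuracy can also be visualized as a function of the ngram_range and C parameters:

import matplotlib.pyplot as plt

# extract scores from the grid search; after the transpose, rows are ngram_range and columns are C
scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T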

heatmap = mglearn.tools.heatmap(
    scores, xlabel="C", ylabel="ngram_range", cmap="viridis", fmt="%.3f",
    xticklabels=param_grid['logisticregression__C'],
    yticklabels=param_grid['tfidfvectorizer__ngram_range'])
plt.colorbar(heatmap)

Going from (1, 1) to (1, 2) improves the score considerably, so adding bigrams helps. (1, 3) and (1, 2) are about the same, so adding trigrams gives little further gain in accuracy.
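
The reshape(-1, 3).T step used for the heat map is easy to get wrong, so here is a sketch of why it yields rows for ngram_range and columns for C. It assumes that the grid search enumerates parameter names in alphabetical order with the last name varying fastest; that ordering is my understanding of scikit-learn's parameter grid and is reproduced here with itertools rather than by running a search.

```python
from itertools import product

import numpy as np

C_values = [0.001, 0.01, 0.1, 1, 10, 100]
ngram_ranges = [(1, 1), (1, 2), (1, 3)]

# Assumed ordering: "logisticregression__C" sorts before
# "tfidfvectorizer__ngram_range", and the last name varies fastest.
combos = list(product(C_values, ngram_ranges))

# Stand-in for cv_results_['mean_test_score']: one entry per combination.
flat_index = np.arange(len(combos))

# 18 entries -> 6 rows of 3 (one row per C), transposed to 3 rows of 6.
table = flat_index.reshape(-1, 3).T
for row, ngram_range in zip(table, ngram_ranges):
    print(ngram_range, [combos[i][0] for i in row])
```

Each printed row keeps one ngram_range fixed while C runs across the columns, matching the xlabel="C" and ylabel="ngram_range" arguments of the heat map call.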

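(7) Visualize the important coefficients of the best model:


vect = grid.best_estimator_.named_steps['tfidfvectorizer']
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
feature_names = vect.get_feature_names_out()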
coef = grid.best_estimator_.named_steps['logisticregression'].coef_
mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)
plt.ylim(-22, 22)

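Figure: most important features when using unigrams, bigrams, and trigrams with tf-idf rescaling

Some particularly interesting features contain the word "worth", which did not appear in the unigram model: "not worth" indicates a negative review, while "definitely worth" and "well worth" indicate a positive one. This is a prime example of context changing the meaning of the word "worth".

(8) Next, we visualize only the trigrams, to get more insight into why these features are helpful. Many of the useful bigrams and trigrams consist of common words that would not be informative on their own, as in the phrases "none of the", "the only good", "on and on", "this is one", and "of the most". However, the impact of these features is quite limited compared to that of the unigram features.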


mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3

mglearn.tools.visualize_coefficients(coef.ravel()[mask],
                                     feature_names[mask], n_top_features=40)
plt.ylim(-22, 22)

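Figure: only the trigram features that are important to the model

7. Advanced tokenization, stemming, and lemmatization

The vocabulary often contains the singular and plural forms of the same word, such as "drawback" and "drawbacks", "drawer" and "drawers", and "drawing" and "drawings". Treating these as separate features is usually unnecessary and hurts generalization.

Two normalization methods:
(1) Stemming: strip common suffixes, merging words that share the same stem.
(2) Lemmatization: uses a dictionary of known word forms (an explicit, human-verified system) and takes into account the role of the word in the sentence. The normalized form of a word is called its lemma.
(3) Another interesting case of normalization is spelling correction.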

import spacy
import nltk

en_nlp = spacy.load('en_core_web_sm')

stemmer = nltk.stem.PorterStemmer()

def compare_normalization(doc):

    doc_spacy = en_nlp(doc)

    print("Lemmatization:")
    print([token.lemma_ for token in doc_spacy])

    print("Stemming:")
    print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])

compare_normalization(u"Our meeting today was worse than yesterday, "
                       "I'm scared of meeting the clients tomorrow.")

The code above compares lemmatization with the Porter stemmer. The result:

Lemmatization:
['our', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday',
',', 'I', 'be', 'scared', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
Stemming:
['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday',
',', 'i', 'am', 'scare', 'of', 'meet', 'the', 'client', 'tomorrow', '.']

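Stemming is always restricted to trimming a word down to a stem, so "was" becomes "wa", while lemmatization can retrieve the correct base verb form "be". Similarly, lemmatization can normalize "worse" to "bad", while stemming produces "wors".

Another major difference is that stemming reduces both occurrences of "meeting" to "meet". With lemmatization, the first occurrence of "meeting" is recognized as a noun and left as is, while the second is recognized as a verb and reduced to "meet". In general, lemmatization is a much more involved process than stemming, but it usually produces better results when used to normalize tokens for machine learning.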
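
The difference between the two approaches can be illustrated with a deliberately tiny toy, without spaCy or nltk. The rule-based toy_stem below is not the Porter algorithm, and the LEMMAS dictionary is a hypothetical five-entry stand-in for a real lemmatizer's lexicon; both only mimic the behavior discussed above.

```python
def toy_stem(word):
    # Crude rule-based stemmer: strip the first matching suffix.
    # This is NOT the Porter algorithm, only an illustration of the idea.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("was", "VERB"): "be",
    ("worse", "ADJ"): "bad",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("clients", "NOUN"): "client",
}

def toy_lemmatize(word, pos):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get((word, pos), word)

tagged = [("was", "VERB"), ("worse", "ADJ"),
          ("meeting", "NOUN"), ("meeting", "VERB"), ("clients", "NOUN")]
for word, pos in tagged:
    print(word, pos, toy_stem(word), toy_lemmatize(word, pos))
```

The stemmer sees only the characters of a word, so it treats both uses of "meeting" alike and mangles "was"; the lemmatizer needs the part of speech as extra input, which is why a real one has to analyze the whole sentence.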

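8. Topic modeling and document clustering

One technique often applied to text data is topic modeling, an umbrella term for the task of assigning each document to one or more topics, usually without supervision. A good example is news data, which might be categorized into topics like "politics", "sports", and "finance". If each document is assigned a single topic, this is a document clustering task. If each document can have more than one topic, the task relates to the decomposition methods of unsupervised learning: each learned component corresponds to one topic, and the coefficients of the components in a document's representation tell us how strongly that document relates to each topic. When people talk about topic modeling, they often mean one particular decomposition method called Latent Dirichlet Allocation (LDA).

8.1 Latent Dirichlet Allocation

Intuitively, the LDA model tries to find groups of words (the topics) that frequently appear together. LDA also requires that each document can be understood as a "mixture" of a subset of the topics.

Note: a "topic" in the machine learning sense may not be what we would call a topic in everyday conversation. It is closer to the components extracted by PCA or NMF, and it may or may not have a semantic meaning.
Example: given finance and sports articles written by two reporters, the topics found might not be finance and sports. Other possible "topics" are "words often used by reporter A" and "words often used by reporter B", even though these are not topics in the usual sense.

We apply LDA to the movie review dataset.

(1) Remove stop words (very common words)

For unsupervised text models, it is usually best to remove very common words, which might otherwise dominate the analysis. We remove words that appear in more than 15% of the documents, and after that limit the bag-of-words model to the 10,000 most common remaining words: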

vect = CountVectorizer(max_features=10000, max_df=.15)
X = vect.fit_transform(text_train)

(2) Build the model and transform the data

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=25, random_state=0)

document_topics = lda.fit_transform(X)

print("lda.components_.shape: {}".format(lda.components_.shape))

lda.components_.shape: (10, 10000)

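(3) Sort and print the topics

import numpy as np

# for each topic (a row in components_), sort the features in descending order of weight
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
feature_names = vect.get_feature_names_out()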

import mglearn

mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4
horror        music         original      thing         action
gore          john          team          worst         police
effects       old           series        didn          murder
blood         young         jack          nothing       killer
pretty        girl          action        minutes       crime
budget        song          new           guy           plays
house         gets          down          actually      lee
zombie        dance         tarzan        want          gets
dead          songs         freddy        going         role
low           rock          indian        re            cop
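
The sorting step that produces such a table can be sketched on a tiny hypothetical topic-word matrix. The five words and their weights are invented for illustration; in the real model, components_ has one row per topic and one column per vocabulary word.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 5-word vocabulary.
feature_names = np.array(["blood", "dance", "gore", "music", "song"])
components = np.array([[5.0, 0.1, 4.0, 0.2, 0.3],
                       [0.2, 2.0, 0.1, 6.0, 3.0]])

# argsort is ascending, so reverse each row to list the heaviest words first.
sorting = np.argsort(components, axis=1)[:, ::-1]

for topic, order in enumerate(sorting):
    print("topic", topic, feature_names[order[:3]].tolist())
```

Indexing feature_names with a row of sorting turns column indices back into words, which is also what mglearn.tools.print_topics needs the sorting argument for.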

(4) 100 topics

lda100 = LatentDirichletAllocation(n_components=100, learning_method="batch",
                                   max_iter=25, random_state=0)
document_topics100 = lda100.fit_transform(X)

topics = np.array([7, 16, 24, 25, 28, 36, 37, 41, 45, 51, 53, 54, 63, 89, 97])

sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
feature_names = vect.get_feature_names_out()
mglearn.tools.print_topics(topics=topics, feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=20)


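# sort the documents by the weight of topic 45 (the "music" topic), highest first
music = np.argsort(document_topics100[:, 45])[::-1]

# print the first two sentences of the ten documents where this topic is strongest
for i in music[:10]:
    print(b".".join(text_train[i].split(b".")[:2]) + b".\n")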

fig, ax = plt.subplots(1, 2, figsize=(10, 10))
topic_names = ["{:>2} ".format(i) + " ".join(words)
               for i, words in enumerate(feature_names[sorting[:, :2]])]

for col in [0, 1]:
    start = col * 50
    end = (col + 1) * 50
    ax[col].barh(np.arange(50), np.sum(document_topics100, axis=0)[start:end])
    ax[col].set_yticks(np.arange(50))
    ax[col].set_yticklabels(topic_names[start:end], ha="left", va="top")
    ax[col].invert_yaxis()
    ax[col].set_xlim(0, 2000)
    yax = ax[col].get_yaxis()
    yax.set_tick_params(pad=130)
plt.tight_layout()

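9. Task summary

9.1 Summary

(1) This post classified movie reviews. Tasks such as spam detection or sentiment analysis are similar: a bag-of-words model works for them too, and it pays to inspect the extracted tokens and n-grams.

(2) The CountVectorizer and TfidfVectorizer classes implement only relatively simple text-processing methods. For more advanced text processing, the recommended Python packages are spacy (a relatively new but very efficient and well-designed package), nltk (a very well-established and complete but somewhat dated library), and gensim (an NLP package with an emphasis on topic modeling).

(3) Other directions in text processing include continuous vector representations (word vectors) and distributed word representations, as implemented in word2vec and available through spacy and gensim. Today, deep learning methods such as transformers are the usual choice.

9.2 Further reading

(1) Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper, which introduces nltk.
(2) The standard reference Introduction to Information Retrieval (http://nlp.stanford.edu/IR-book/) by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, which covers fundamental algorithms in information retrieval, NLP, and machine learning.

10. Engineering mindset

One blogger also notes that the book 《Python机器学习基础教程》 (Introduction to Machine Learning with Python) covers many topics but only at a surface level, and recommends another book, 《机器学习实战基于scikit-learn和tensorflow》 (Hands-On Machine Learning with Scikit-Learn and TensorFlow). [Mind map image]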

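Reference

(1) 《Python机器学习基础》 (Introduction to Machine Learning with Python)
(2) http://www.nltk.org/
(3) https://spacy.io/docs/
(4) Summary of 《Python 机器学习基础教程》
(5) spaCy 2.1 + Chinese model: a concise tutorial
(6) spaCy NLP: chunking with regular expressions
(7) spaCy official site: https://spacy.io/api
(8) Machine learning application: building a sentiment analysis model for movie reviews
(9) Forum thread on fixing spaCy lemmatization issues
(10) coef_ and intercept_ in sklearn
(11) sklearn: CountVectorizer explained in detail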

Original: https://blog.csdn.net/qq_35812205/article/details/121587112
Author: 山顶夕景
Title: 【NLP】电影评论情感分析（基础篇）
