[NLP Tool: NLTK] English Sentiment Analysis, Word Tokenization, Sentence Splitting, and POS Tagging with NLTK (with Code)

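NLP Tool series

NLP: Jieba for Chinese text (keyword extraction, custom dictionaries, word segmentation, POS tagging)

NLP: NLTK for English text (sentiment analysis, word tokenization, sentence splitting, POS tagging)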

About the tool

NLTK (Natural Language Toolkit) is a Python package that provides a variety of corpora and lexical resources, such as WordNet, along with a set of basic natural language processing tools, including sentence splitting, tokenization, stemming, part-of-speech (POS) tagging, and syntactic parsing. It is a commonly used tool for processing English text.

Note: NLTK is aimed mainly at English text. What happens when it is given Chinese text? The rest of this article looks at that in detail.

Installing the library

pip install nltk

Installing the corpora

Method 1: install directly from the Python console

import nltk
nltk.download()
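
Calling nltk.download() with no arguments opens the interactive downloader. If only the resources used in this article are needed, a smaller, targeted download is one option. The sketch below is an assumption about what this article's code requires: the helper name download_required is made up for illustration, and the resource identifiers may differ by NLTK version (newer releases look for punkt_tab and averaged_perceptron_tagger_eng rather than punkt and averaged_perceptron_tagger).

```python
import nltk

# Assumed resource identifiers for the code in this article.
REQUIRED = ["punkt", "averaged_perceptron_tagger", "vader_lexicon", "brown"]


def download_required(packages=REQUIRED):
    """Try to download each package and return the names that failed."""
    failed = []
    for name in packages:
        # nltk.download returns True on success and False otherwise.
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed


# Example call (needs network access):
# print("Failed downloads:", download_required())
```

If the returned list is not empty, the manual download described in Method 2 is the fallback.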

Method 2: the corpora are large, so Method 1 sometimes fails. In that case, consider downloading them manually.

1. Go to the official data repository, GitHub – nltk/nltk_data: NLTK Data, and download it.

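2. To see where on your machine the extracted corpora can be placed, run the following in the Python console. If the Brown corpus is not installed yet, the resulting LookupError lists the directories that NLTK searches.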

from nltk.corpus import brown
brown.categories()

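If you choose E:\nltk_data, for example, create a folder named nltk_data on drive E, then copy everything under the packages directory of the downloaded repository into nltk_data.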

Go to the nltk_data/tokenizers folder and unzip punkt.zip.
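
The search directories from step 2 can also be inspected directly, without provoking an error on purpose. The sketch below prints nltk.data.path, registers the E:\nltk_data folder from the example above (adjust it to your own location), and checks whether a resource can be found. The helper is_installed is a name introduced here for illustration.

```python
import nltk

# nltk.data.path is the list of directories NLTK searches for data, in order.
for directory in nltk.data.path:
    print(directory)

# Assumption: the data was copied to E:\nltk_data as in the example above.
custom_dir = r"E:\nltk_data"
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)


def is_installed(resource_path):
    """Return True if NLTK can locate the resource, e.g. 'tokenizers/punkt'."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False


print("punkt found:", is_installed("tokenizers/punkt"))
```

Appending to nltk.data.path only affects the current Python session, so a folder outside the default search locations has to be registered again in each script.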

Full code

import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import brown

brown.categories()
s = '近日,中国短道速滑队队员@武大靖,在直播中歪嘴喝水的画面走红,此后他本人还亲自教学。于是,短道速滑国家队的成员们相继挑战,还出了一人炫三瓶的升级版。网友:终于找到进短道速滑队的方法!'
s1 = 'Along with the development of society , more and more problems are brought to our attention , one of the most serious problems is involution and lying flat . Involution means that when social resources cannot meet the needs of everyone, people compete to obtain more resources. An important feature of involution is internal competition , Internal competition is becoming increasing prevalent at an amazing rate. '
englishTokens = word_tokenize(s1)
chineseTokens = word_tokenize(s)
# Sentence splitting and word tokenization
print("English sentences:", sent_tokenize(s1))
print("English tokens:", englishTokens)
print("Chinese sentences:", sent_tokenize(s))
print("Chinese tokens:", chineseTokens)

# POS tagging
# pos_tag expects a list of tokens, so tagging must come after tokenization.
englishTags = nltk.pos_tag(englishTokens)
chineseTags = nltk.pos_tag(chineseTokens)
print("English POS tags:", englishTags)
print("Chinese POS tags:", chineseTags)

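# Sentiment analysis
# neg, neu, and pos are the negative, neutral, and positive proportions of the text;
# compound is the overall normalized score, from -1 (most negative) to +1 (most positive).
from nltk.sentiment.vader import SentimentIntensityAnalyzer
s2 = ['This is a good book', 'This is a bad book']
# Chinese for "This is a good book" and "This is a terrible book"
s3 = ['这是一本好书', '这是一本糟糕的书']
# Create the analyzer (requires the vader_lexicon resource)
sid = SentimentIntensityAnalyzer()
# English sentiment analysis
for sentence in s2:
    print(sentence)
    print("Sentiment scores:", sid.polarity_scores(sentence))
# Chinese sentiment analysis
for sentence in s3:
    print(sentence)
    print("Sentiment scores:", sid.polarity_scores(sentence))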
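
To make the compound value in the output easier to read: VADER sums the valence scores of the words it recognizes and squashes the sum into the range -1 to +1 using x / sqrt(x² + α), with α = 15 by default. A common convention from the VADER authors treats a compound of 0.05 or more as positive and -0.05 or less as negative. The sketch below re-implements only that final step in plain Python, as a simplified illustration; the function names and the input sums are hypothetical and are not part of NLTK's API.

```python
import math


def normalize(score_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)


def label(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical valence sums, chosen only to show the shape of the function.
for total in (-4.0, 0.0, 2.0):
    compound = normalize(total)
    print(total, round(compound, 4), label(compound))
```

A sentence in which no word matches the lexicon has a valence sum of 0, so its compound score is 0 and it is labeled neutral.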

Results

Analysis of the results

1. Word tokenization, sentence splitting, and POS tagging

2. Sentiment analysis

Clearly, NLTK handles English well at the tokenization stage but performs poorly on Chinese. As a result, the sentence splitting, POS tagging, and sentiment analysis that build on tokenization also perform poorly for Chinese text.
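
One way to see why the Chinese results are poor is a toy model of the two steps involved. The sketch below is not NLTK code: it stands in for tokenization with a plain whitespace split and for the sentiment lexicon with a two-word dictionary whose entries and scores are invented for illustration. Chinese text has no spaces between words, and none of its words appear in an English lexicon, so nothing matches.

```python
# Toy English sentiment lexicon; the words and scores are made up for illustration.
TOY_LEXICON = {"good": 2.0, "bad": -2.0}


def toy_tokenize(text):
    """Stand-in for a tokenizer that relies on whitespace between words."""
    return text.split()


def toy_score(text):
    """Sum the lexicon scores of the tokens that appear in the lexicon."""
    return sum(TOY_LEXICON.get(token.lower(), 0.0) for token in toy_tokenize(text))


for sentence in ["This is a good book", "This is a bad book", "这是一本好书"]:
    print(toy_tokenize(sentence), toy_score(sentence))
```

The English sentences split into separate words and pick up a score from the lexicon, while the Chinese sentence stays a single unmatched token with a score of zero.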

Original: https://blog.csdn.net/ccaoshangfei/article/details/124625623
Author: 北村南
Title: 【NLP Tool — NLTK】NLTK进行英文情感分析、分词、分句、词性标注(附代码)
