NLTK Installation and Usage Walkthrough – Python

Preface

I needed sentiment analysis for an earlier experiment and installed NLTK for it; this post records the whole process.

Detailed steps from download and installation to a hands-on example

Downloading and installing NLTK

First install the package with pip install nltk.
Then run the two lines below; a GUI window like the one shown will pop up. Note the download location, then click Download to fetch everything — the full download is about 3.5 GB.

import nltk
nltk.download()


After the download succeeds, check that everything works by running the code below to see whether the brown corpus can be accessed:

from nltk.corpus import brown

print(brown.categories())  # print the categories of the brown corpus
print(len(brown.sents()))  # print the number of sentences in the brown corpus
print(len(brown.words()))  # print the number of words in the brown corpus

'''
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
57340
1161192
'''

At this point you may get an error saying that nltk_data could not be found in any of the listed folders.
Unzip the downloaded files and copy them into one of those folder locations — mind the folder name — and then everything works normally.


Hands-on: working with your own data

1. Training and analyzing with your own training set

Link: https://pan.baidu.com/s/1GrNg3ziWJGhcQIWBCr2PMg
Extraction code: 1fb8
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import os
from nltk.corpus import stopwords
import pandas as pd

def extract_features(word_list):
    return dict([(word, True) for word in word_list])

# stop words
stop = stopwords.words('english')
stop1 = ['!', ',', '.', '?', '-s', '-ly', ' ', 's', '...']
stop = stop1 + stop
print(stop)

# read a txt file and return its tokens with stop words removed
def readtxt(f, path):
    # open the given file with utf-8 encoding
    with open(path + f, encoding="utf-8") as fh:
        data = fh.read().split()
    # keep only the tokens that are not stop words
    return [word for word in data if word not in stop]

if __name__ == '__main__':

    # Load the positive and negative reviews. Stop words such as i, am, you, a, this
    # are extremely common in reviews and could skew the result, so they were removed
    # beforehand in the readtxt function.
    positive_fileids = os.listdir('pos')  # positive: a list of 42 items, each one a txt file
    print(type(positive_fileids), len(positive_fileids))  # <class 'list'> 42
    negative_fileids = os.listdir('neg')  # negative: a list of 22 items, each one a txt file (data I collected myself)
    print(type(negative_fileids), len(negative_fileids))

    # Split the review data into positive and negative reviews.
    # readtxt(f, path) returns the contents of one txt file as a list of words:
    # ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    # features_positive is a list of (feature_dict, label) pairs, e.g.:
    # [({'shakesp': True, 'limit': True, 'mouth': True, ..., 'prophetic': True}, 'Positive'), ...]
    path = 'pos/'
    features_positive = [(extract_features(readtxt(f,path=path)), 'Positive') for f in positive_fileids]
    path = 'neg/'
    features_negative = [(extract_features(readtxt(f,path=path)), 'Negative') for f in negative_fileids]

    # Split into a training set (80%) and a test set (20%)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))  # 33
    threshold_negative = int(threshold_factor * len(features_negative))  # 17
    # 33 positive + 17 negative texts form the training set; the remaining 9 + 5 form the test set
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training data points:", len(features_train))
    print("Number of test data points:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nClassifier accuracy:", nltk.classify.util.accuracy(classifier, features_test))
    print("\nTop five most informative words:")
    for item in classifier.most_informative_features()[:5]:
        print(item[0])

    # Some simple sample reviews
    input_reviews = [
        "works well with proper preparation.",
        ]

    # Run the classifier and get the predictions
    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        # print the result
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

print("Done")
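The heart of this pipeline is the text-to-features step: filter stop words out of a tokenized review, then build the word-presence dict that NaiveBayesClassifier consumes. A standalone sketch (the toy stop list here is an assumption for illustration, not the full NLTK English list):

```python
# Word-presence feature extraction, as used in the script above
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

stop = ['i', 'am', 'a', 'this', '.', ',']          # toy stop list (illustrative)
tokens = "this microwave works well .".split()     # naive whitespace tokenization
filtered = [w for w in tokens if w not in stop]    # drop stop words
features = extract_features(filtered)
print(features)  # {'microwave': True, 'works': True, 'well': True}
```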

Output: the accuracy here is suspiciously high. That is because the data I picked expresses positive and negative sentiment very explicitly, so the result should be taken with a grain of salt.

<class 'list'> 42
<class 'list'> 22

Number of training data points: 50
Number of test data points: 14

Classifier accuracy: 1.0

Top five most informative words:
microwave
product
works
ever
service

Predictions:

Review: works well with proper preparation.

Predicted sentiment: Positive
Probability: 0.77
Done
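The 50/14 split in this output follows directly from applying the 80% threshold to 42 positive and 22 negative files; `int()` truncation gives 33 + 17 training items and 9 + 5 test items. A minimal sketch of the arithmetic:

```python
# Reproduce the train/test sizes from the run above
n_pos, n_neg = 42, 22
threshold_factor = 0.8
tp = int(threshold_factor * n_pos)  # 33 positive training items
tn = int(threshold_factor * n_neg)  # 17 negative training items
train_size = tp + tn
test_size = (n_pos - tp) + (n_neg - tn)
print(train_size, test_size)  # 50 14
```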

2. Analysis with NLTK's built-in analyzer

import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Analyzing the sentiment of sentences: sentiment analysis is one of the most
# popular NLP applications. It is the process of determining whether a given
# piece of text is positive or negative; in some scenarios "neutral" is added
# as a third option. Sentiment analysis is often used to discover how people
# feel about a particular topic.

if __name__ == '__main__':

    # Load reviews (the hard-coded list below overrides the Excel data for this demo)
    #data = pd.read_excel('data3/microwave1.xlsx')
    name = 'hair_dryer1'
    data = pd.read_excel('../data3/'+name+'.xlsx')
    input_reviews = data[u'review_body']
    input_reviews = input_reviews.tolist()
    input_reviews = [
        "works well with proper preparation.",
        "i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.",
        "piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.",
        "am very happy with  this"
        ]

    # Run the analyzer and get the scores
    sid = SentimentIntensityAnalyzer()
    for sentence in input_reviews:
        ss = sid.polarity_scores(sentence)
        print("Sentence: " + sentence)
        for k in sorted(ss):
            print('{0}: {1}, '.format(k, ss[k]), end='')

        print()
print("Done")

Output:

Sentence: works well with proper preparation.
compound: 0.2732, neg: 0.0, neu: 0.656, pos: 0.344,
Sentence: i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.
compound: -0.7096, neg: 0.258, neu: 0.742, pos: 0.0,
Sentence: piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.
compound: -0.9432, neg: 0.395, neu: 0.605, pos: 0.0,
Sentence: am very happy with  this
compound: 0.6115, neg: 0.0, neu: 0.5, pos: 0.5,
Done

Interpreting the scores:
compound is an overall score, derived mainly from the positive and negative scores
neg: how negative the text is
pos: how positive the text is
neu: how neutral the text is
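To collapse these four scores into a single label, a widely used convention (not part of this post; the ±0.05 cut-offs on compound are the common heuristic rather than anything mandated by NLTK) looks like this:

```python
def label_from_compound(compound):
    # common heuristic thresholds for VADER's compound score
    if compound >= 0.05:
        return 'Positive'
    if compound <= -0.05:
        return 'Negative'
    return 'Neutral'

# applied to compound scores from the run above
print(label_from_compound(0.2732))   # Positive
print(label_from_compound(-0.9432))  # Negative
```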

Original: https://www.cnblogs.com/hjk-airl/p/16066851.html
Author: hjk-airl
Title: NLTK Installation and Usage Walkthrough – Python

