NLTK Installation and Usage Walkthrough – Python

Preface

I needed sentiment analysis for an earlier experiment and installed NLTK for it; this post records the whole process.

Detailed steps from download and installation to a hands-on example

Downloading and installing NLTK

First install the package with pip install nltk.
Then run the two lines below; a GUI window like the one shown will pop up. Note the download location, then click Download to fetch everything — the full download is about 3.5 GB.

import nltk
nltk.download()


After the download succeeds, check that everything works by running the code below to see whether the brown corpus can be accessed:

from nltk.corpus import brown

print(brown.categories())  # print the categories of the brown corpus
print(len(brown.sents()))  # print the number of sentences in the brown corpus
print(len(brown.words()))  # print the number of words in the brown corpus

'''
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
57340
1161192
'''

At this point you may get an error saying that nltk_data could not be found in any of the listed folders.
Unzip the downloaded files and copy them into one of those folder locations — mind the folder name — and then everything works normally.


Hands-on: working with your own data

1. Training and analyzing with your own training set

Link: https://pan.baidu.com/s/1GrNg3ziWJGhcQIWBCr2PMg
Extraction code: 1fb8
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import os
from nltk.corpus import stopwords
import pandas as pd

def extract_features(word_list):
    return dict([(word, True) for word in word_list])

# stop words
stop = stopwords.words('english')
stop1 = ['!', ',', '.', '?', '-s', '-ly', ' ', 's', '...']
stop = stop1 + stop
print(stop)

# read a txt file and return its tokens with stop words removed
def readtxt(f, path):
    # open the given file with utf-8 encoding
    with open(path + f, encoding="utf-8") as fh:
        data = fh.read().split()
    # keep only the tokens that are not stop words
    return [word for word in data if word not in stop]

if __name__ == '__main__':

    # Load the positive and negative reviews. Stop words such as i, am, you, a, this
    # are extremely common in reviews and could skew the result, so they were removed
    # beforehand in the readtxt function.
    positive_fileids = os.listdir('pos')  # positive: a list of 42 items, each one a txt file
    print(type(positive_fileids), len(positive_fileids))  # <class 'list'> 42
    negative_fileids = os.listdir('neg')  # negative: a list of 22 items, each one a txt file (data I collected myself)
    print(type(negative_fileids), len(negative_fileids))

    # Split the review data into positive and negative reviews.
    # readtxt(f, path) returns the contents of one txt file as a list of words:
    # ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    # features_positive is a list of (feature_dict, label) pairs, e.g.:
    # [({'shakesp': True, 'limit': True, 'mouth': True, ..., 'prophetic': True}, 'Positive'), ...]
    path = 'pos/'
    features_positive = [(extract_features(readtxt(f,path=path)), 'Positive') for f in positive_fileids]
    path = 'neg/'
    features_negative = [(extract_features(readtxt(f,path=path)), 'Negative') for f in negative_fileids]

    # Split into a training set (80%) and a test set (20%)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))  # 33
    threshold_negative = int(threshold_factor * len(features_negative))  # 17
    # 33 positive + 17 negative texts form the training set; the remaining 9 + 5 form the test set
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training data points:", len(features_train))
    print("Number of test data points:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nClassifier accuracy:", nltk.classify.util.accuracy(classifier, features_test))
    print("\nTop five most informative words:")
    for item in classifier.most_informative_features()[:5]:
        print(item[0])

    # Some simple sample reviews
    input_reviews = [
        "works well with proper preparation.",
        ]

    # Run the classifier and get the predictions
    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        # print the result
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

print("Done")
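The heart of this pipeline is the text-to-features step: filter stop words out of a tokenized review, then build the word-presence dict that NaiveBayesClassifier consumes. A standalone sketch (the toy stop list here is an assumption for illustration, not the full NLTK English list):

```python
# Word-presence feature extraction, as used in the script above
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

stop = ['i', 'am', 'a', 'this', '.', ',']          # toy stop list (illustrative)
tokens = "this microwave works well .".split()     # naive whitespace tokenization
filtered = [w for w in tokens if w not in stop]    # drop stop words
features = extract_features(filtered)
print(features)  # {'microwave': True, 'works': True, 'well': True}
```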

Output: the accuracy here is suspiciously high. That is because the data I picked expresses positive and negative sentiment very explicitly, so the result should be taken with a grain of salt.

<class 'list'> 42
<class 'list'> 22

Number of training data points: 50
Number of test data points: 14

Classifier accuracy: 1.0

Top five most informative words:
microwave
product
works
ever
service

Predictions:

Review: works well with proper preparation.

Predicted sentiment: Positive
Probability: 0.77
Done
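The 50/14 split in this output follows directly from applying the 80% threshold to 42 positive and 22 negative files; `int()` truncation gives 33 + 17 training items and 9 + 5 test items. A minimal sketch of the arithmetic:

```python
# Reproduce the train/test sizes from the run above
n_pos, n_neg = 42, 22
threshold_factor = 0.8
tp = int(threshold_factor * n_pos)  # 33 positive training items
tn = int(threshold_factor * n_neg)  # 17 negative training items
train_size = tp + tn
test_size = (n_pos - tp) + (n_neg - tn)
print(train_size, test_size)  # 50 14
```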

2. Analysis with NLTK's built-in analyzer

import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Analyzing the sentiment of sentences: sentiment analysis is one of the most
# popular NLP applications. It is the process of determining whether a given
# piece of text is positive or negative; in some scenarios "neutral" is added
# as a third option. Sentiment analysis is often used to discover how people
# feel about a particular topic.

if __name__ == '__main__':

    # Load reviews (the hard-coded list below overrides the Excel data for this demo)
    #data = pd.read_excel('data3/microwave1.xlsx')
    name = 'hair_dryer1'
    data = pd.read_excel('../data3/'+name+'.xlsx')
    input_reviews = data[u'review_body']
    input_reviews = input_reviews.tolist()
    input_reviews = [
        "works well with proper preparation.",
        "i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.",
        "piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.",
        "am very happy with  this"
        ]

    # Run the analyzer and get the scores
    sid = SentimentIntensityAnalyzer()
    for sentence in input_reviews:
        ss = sid.polarity_scores(sentence)
        print("Sentence: " + sentence)
        for k in sorted(ss):
            print('{0}: {1}, '.format(k, ss[k]), end='')

        print()
print("Done")

Output:

Sentence: works well with proper preparation.
compound: 0.2732, neg: 0.0, neu: 0.656, pos: 0.344,
Sentence: i hate that opening the door moves the microwave towards you and out of its place. thats my only complaint.
compound: -0.7096, neg: 0.258, neu: 0.742, pos: 0.0,
Sentence: piece of junk. got two years of use and it died. customer service says too bad. whirlpool dishwasher died a few months ago. whirlpool is dead to me.
compound: -0.9432, neg: 0.395, neu: 0.605, pos: 0.0,
Sentence: am very happy with  this
compound: 0.6115, neg: 0.0, neu: 0.5, pos: 0.5,
Done

Interpreting the scores:
compound is an overall score, derived mainly from the positive and negative scores
neg: how negative the text is
pos: how positive the text is
neu: how neutral the text is
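To collapse these four scores into a single label, a widely used convention (not part of this post; the ±0.05 cut-offs on compound are the common heuristic rather than anything mandated by NLTK) looks like this:

```python
def label_from_compound(compound):
    # common heuristic thresholds for VADER's compound score
    if compound >= 0.05:
        return 'Positive'
    if compound <= -0.05:
        return 'Negative'
    return 'Neutral'

# applied to compound scores from the run above
print(label_from_compound(0.2732))   # Positive
print(label_from_compound(-0.9432))  # Negative
```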

Original: https://www.cnblogs.com/hjk-airl/p/16066851.html
Author: hjk-airl
Title: NLTK Installation and Usage Walkthrough – Python

