Python: Classifying Text with a Naive Bayes Classifier

July 2, 2023, 1:24 AM • Artificial Intelligence • 78 reads

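Problem:

1. Given the following collection of texts:
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
The data set contains 6 texts with labels classVec=[0,1,0,1,0,1]. A label of 1 means the text contains insulting words; otherwise the label is 0. Requirements:
(1) Build a vocabulary of all distinct words that appear in the data set. For example (the order may differ, because the vocabulary is built from a set):
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
(2) Using the vocabulary, represent each of the 6 original texts as a word vector of 0s and 1s. For example, with the vocabulary order shown above, the first text corresponds to the vector:
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
(3) Train a naive Bayes classification model on these word vectors.
(4) Given two new texts, ['love', 'my', 'dalmation'] and ['stupid', 'garbage'], classify them with the naive Bayes classifier from (3), that is, output label 0 (non-insulting text) or 1 (insulting text).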
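
Requirements (1) and (2) can be illustrated with a small standalone sketch. It is not the article's code: the helper names build_vocab and to_set_of_words are made up for illustration, and the vocabulary is sorted (an assumption made here) so that the word order is the same on every run.

```python
# The six example posts from the problem statement.
posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]


def build_vocab(documents):
    """Return the distinct words of all documents, sorted for a stable order."""
    return sorted({word for doc in documents for word in doc})


def to_set_of_words(vocab, document):
    """Return a 0/1 vector marking which vocabulary words occur in the document."""
    present = set(document)
    return [1 if word in present else 0 for word in vocab]


vocab = build_vocab(posts)
first_vector = to_set_of_words(vocab, posts[0])
print(len(vocab))         # number of distinct words in the corpus
print(sum(first_vector))  # number of vocabulary words present in the first post
```

Because the order is sorted here, the vector positions will not match the example vector in (2), which depends on the particular order in which the vocabulary was listed.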

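2. Building on (1)-(4), classify the 50 provided emails, of which 25 are spam and 25 are non-spam (ham). Randomly select 20 spam and 20 ham emails as the training set and use the remaining 10 emails as the test set. Train a naive Bayes text classifier (multinomial or Bernoulli model) on the training set, then run it on the test set and report its accuracy.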
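
The split described in task 2 (hold out the same number of emails from each class, then measure accuracy) could be sketched as follows. This is only an illustration using the standard library: stratified_split and accuracy are hypothetical helper names, and the interleaved label list stands in for the real emails.

```python
import random


def stratified_split(labels, test_per_class, rng):
    """Split document indices so each class contributes test_per_class test items."""
    test_idx = []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        test_idx.extend(rng.sample(members, test_per_class))
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# 25 spam (1) and 25 ham (0) labels, interleaved as a stand-in for the real data.
labels = [1, 0] * 25
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
train_idx, test_idx = stratified_split(labels, 5, rng)
print(len(train_idx), len(test_idx))         # 40 training and 10 test indices
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 predictions correct
```

Sampling per class guarantees 20 spam and 20 ham emails in the training set, which a plain random draw of 10 indices from all 50 would not.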

Steps:

The code is as follows:


from numpy import array, log, ones, random  # random here is the numpy.random module
def loadDataSet():
    # Six tokenized example posts and their labels (1 = insulting, 0 = not insulting)
    postingList=[
        ['my','dog','has','flea','problems','help','please'],
        ['maybe','not','take','him','to','dog','park','stupid'],
        ['my','dalmation','is','so','cute','I','love','him'],
        ['stop','posting','stupid','worthless','garbage'],
        ['mr','licks','ate','my','steak','how','to','stop','him'],
        ['quit','buying','worthless','dog','food','stupid']
        ]
    classVec=[0,1,0,1,0,1]
    return postingList,classVec

def createVocabList(dataSet):
    # Union of the words of all documents; the list order is arbitrary because it comes from a set
    vocabSet=set()
    for document in dataSet:
        vocabSet |= set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList,inputSet):
    # Set-of-words model: 1 if the vocabulary word occurs in the document, otherwise 0
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else:
            print("the word: %s is not in my Vocabulary!" %word)
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs=len(trainMatrix)
    numWords=len(trainMatrix[0])
    # Prior probability of class 1
    pAbusive=sum(trainCategory)/float(numTrainDocs)
    # Start the counts at 1 and the denominators at 2 so that no word gets probability 0
    p0Num=ones(numWords)
    p1Num=ones(numWords)
    p0Denom=2.0
    p1Denom=2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            p1Num+=trainMatrix[i]
            p1Denom+=sum(trainMatrix[i])
        else:
            p0Num+=trainMatrix[i]
            p0Denom+=sum(trainMatrix[i])
    # Take logs so that classification can add log-probabilities instead of multiplying
    p1Vect=log(p1Num/p1Denom)
    p0Vect=log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    # Compare log P(words|class) + log P(class) for the two classes
    p1=sum(vec2Classify*p1Vec)+log(pClass1)
    p0=sum(vec2Classify*p0Vec)+log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0

def testingNB():
    # Train on the six example posts, then classify the two new texts from requirement (4)
    listOposts,listClasses=loadDataSet()
    myVocablist=createVocabList(listOposts)
    print(myVocablist)
    trainMat=[]
    for postinDoc in listOposts:
        trainMat.append(setOfWords2Vec(myVocablist, postinDoc))
    p0V,p1V,pAb=trainNB0(trainMat,listClasses)
    testEntry=['love','my','dalmation']
    thisDoc=array(setOfWords2Vec(myVocablist, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc, p0V, p1V,pAb))
    testEntry=['stupid','garbage']
    thisDoc=array(setOfWords2Vec(myVocablist, testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc, p0V, p1V,pAb))

def textParse(bigString):
    import re
    # Split on runs of non-word characters; keep lowercase tokens longer than 2 characters
    listOftokens=re.split(r'\W+',bigString)
    return [tok.lower() for tok in listOftokens if len(tok)>2]

def spamTest():
    docList=[]
    classList=[]
    # Load the 25 spam and 25 ham emails; adjust the paths to your local copy of the data
    for i in range(1,26):
        with open('E:/pywork/test/sy-7/email/spam/%d.txt' %i) as f:
            wordList=textParse(f.read())
        docList.append(wordList)
        classList.append(1)
        with open('E:/pywork/test/sy-7/email/ham/%d.txt' %i) as f:
            wordList=textParse(f.read())
        docList.append(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)
    trainingSet=list(range(50))
    testSet=[]
    times=0
    # Move 5 randomly chosen spam emails from the training set to the test set
    while True:
        randIndex=int(random.uniform(0,len(trainingSet)))
        if classList[trainingSet[randIndex]]==1:
            testSet.append(trainingSet[randIndex])
            del (trainingSet[randIndex])
            times +=1
        if times==5:
            break
    # Then move 5 randomly chosen ham emails, leaving 20 spam and 20 ham for training
    while True:
        randIndex=int(random.uniform(0,len(trainingSet)))
        if classList[trainingSet[randIndex]]==0:
            testSet.append(trainingSet[randIndex])
            del (trainingSet[randIndex])
            times+=1
        if times==10:
            break
    trainMat=[]
    trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam=trainNB0(array(trainMat),array(trainClasses))
    rightCount=0
    for docIndex in testSet:
        wordVector=setOfWords2Vec(vocabList,docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam)==classList[docIndex]:
            rightCount+=1
    # Accuracy on the 10 held-out emails
    return float(rightCount)/len(testSet)

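def multiTest():
    # Repeat the random split several times and report the average accuracy
    numTests=10
    rightSum=0.0
    for k in range(numTests):
        rightSum += spamTest()
    print('average accuracy over %d runs: %f' %(numTests,rightSum/numTests))

testingNB()
multiTest()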
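
Two details of trainNB0 and classifyNB are easy to miss: the counts start at 1 with denominators at 2.0, and the classifier adds logarithms instead of multiplying probabilities. The standalone sketch below illustrates why, using only the standard library. The helper smoothed_log_probs is a hypothetical name that mirrors the same initialisation, and the numbers are invented for illustration.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while the sum of their logarithms stays a perfectly ordinary number.
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)
print(product)  # the true value 1e-400 is too small for a float
print(log_sum)  # equals 200 * log(0.01)


def smoothed_log_probs(counts, total):
    """Log-probabilities with counts started at 1 and the denominator at 2.0."""
    return [math.log((c + 1) / (total + 2.0)) for c in counts]


# A word never seen in a class (count 0) still gets a finite log-probability.
print(smoothed_log_probs([0, 3], 3))
```

Without the initial count of 1, an unseen word would have probability 0, its logarithm would be undefined, and a single such word would rule out the class no matter what the other words say.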

Output:

Original: https://blog.csdn.net/weixin_45652976/article/details/122530659
Author: Y_ni
Title: Python: Classifying Text with a Naive Bayes Classifier
