[NLP] ⚠️Hit Me If You Can't Learn It! Master the Basics in Half an Hour, Part 8⚠️ News Classification

July 2, 2023, 4:52 PM • Artificial Intelligence • 81 views


Overview
TF-IDF keyword extraction
TF
IDF
TF-IDF
TfidfVectorizer
Data description
Code implementation

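Overview

Today we begin a journey into natural language processing (NLP). NLP lets machines process, understand, and use human language, building a bridge between machine language and human language.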

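TF-IDF keyword extraction

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique widely used in information retrieval and data mining. It is a numerical statistic that reflects how important a term is to one document in a corpus, which makes it useful for extracting a document's keywords.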

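TF

TF (Term Frequency) measures how often a term occurs in a document.

Formula:

TF = (number of times the term occurs in the document) / (total number of terms in the document)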
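As a small illustration, term frequency can be computed by hand with the standard library. This is only a sketch: the function name `term_frequency` and the sample tokens are made up for the example, and the document is assumed to be tokenized already.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical, already tokenized document
doc = ["economy", "grows", "as", "economy", "recovers"]
tf = term_frequency(doc)
print(tf["economy"])  # 2 occurrences out of 5 tokens -> 0.4
```

The values of one document always sum to 1, so TF alone only says how a document's words are distributed, not which of them are distinctive.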

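IDF

IDF (Inverse Document Frequency) measures how rare a term is across the corpus. It is based on the reciprocal of the share of documents that contain the term: the fewer documents a term appears in, the higher its IDF.

Formula:

IDF = log(total number of documents / number of documents containing the term)

The logarithm dampens the ratio; simplified presentations sometimes omit it.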
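The following sketch computes IDF by hand, assuming the common logarithmic form log(N / df). The function name `inverse_document_frequency` and the toy corpus are illustrative only.

```python
import math


def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing the term).

    corpus is a list of tokenized documents (lists of strings).
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)


# Hypothetical toy corpus
corpus = [
    ["the", "match", "ended"],
    ["the", "market", "fell"],
    ["the", "film", "opened"],
    ["the", "match", "resumed"],
]
idf_the = inverse_document_frequency("the", corpus)        # in 4 of 4 documents
idf_market = inverse_document_frequency("market", corpus)  # in 1 of 4 documents
print(idf_the, idf_market)
```

A word that occurs in every document gets an IDF of log(1) = 0, while a word found in a single document gets the largest value the corpus allows.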

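TF-IDF

Formula:

TF-IDF = TF × IDF = (number of times the term occurs in the document / total number of terms in the document) × log(total number of documents / number of documents containing the term)

If a term is very common across the corpus, its IDF is low; if it is rare, its IDF is high. TF-IDF therefore down-weights common words and helps surface keywords.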
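To make the filtering effect concrete, here is a minimal sketch that scores every term of a toy corpus. It assumes the logarithmic IDF form log(N / df); the function name `tf_idf` and the sample sentences are invented for the example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: tf * idf} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus
corpus = [
    ["the", "team", "won", "the", "match"],
    ["the", "market", "fell"],
    ["the", "film", "opened"],
]
scores = tf_idf(corpus)
print(scores[0]["the"])   # 0.0, because "the" occurs in every document
print(scores[0]["team"])  # positive, because "team" occurs in one document only
```

Although "the" is the most frequent word of the first document, its score is zero, whereas the rarer words "team", "won" and "match" receive positive scores and would be picked as keywords.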

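TfidfVectorizer

TfidfVectorizer converts raw text into a TF-IDF feature matrix, which can then be used for tasks such as similarity computation. By default (input='content'), sklearn's TfidfVectorizer takes a sequence of texts, one item per document, and each row of the resulting matrix corresponds to one document. The same term can have different tf values in different texts, because tf depends on the text the term appears in.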
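A short sketch of that workflow: build the TF-IDF matrix for three made-up English sentences and compare the documents with cosine similarity. The sample texts are assumptions for illustration; `cosine_similarity` comes from `sklearn.metrics.pairwise`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents
docs = [
    "football match final score",
    "football team wins match",
    "stock market falls sharply",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term
print(X.shape)                      # (3, 10): 3 documents, 10 distinct terms

similarity = cosine_similarity(X)   # pairwise similarity between the rows
print(similarity.round(2))
```

The first two documents share "football" and "match", so their similarity is positive, while the third shares no term with the first and their similarity is zero.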

Signature:

TfidfVectorizer(input='content', encoding='utf-8',
                decode_error='strict', strip_accents=None, lowercase=True,
                preprocessor=None, tokenizer=None, analyzer='word',
                stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                ngram_range=(1, 1), max_df=1.0, min_df=1,
                max_features=None, vocabulary=None, binary=False,
                dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
                sublinear_tf=False)
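With the defaults in this signature (use_idf=True, smooth_idf=True, norm='l2'), scikit-learn does not use the plain textbook formula: tf is the raw count, idf is smoothed as ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit L2 length. The sketch below recomputes the matrix by hand on made-up documents to check this; the sample texts and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical sample documents
docs = ["apple banana apple", "banana cherry", "cherry cherry apple"]

# Library result with default settings
X = TfidfVectorizer().fit_transform(docs).toarray()

# Manual recomputation from raw counts
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (smooth_idf=True)
manual = counts * idf                        # tf * idf with raw counts as tf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(np.allclose(X, manual))
```

Because of the "+ 1" in the idf, a term that occurs in every document still keeps a small non-zero weight instead of dropping to zero.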

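Parameters:

input: what the input items are; 'content' (the default) means the texts themselves
encoding: encoding used to decode byte input, default 'utf-8'
analyzer: 'word' or 'char' ('char_wb' is also accepted); features are built from words by default
stop_words: stop words to remove
ngram_range: lower and upper bounds of the n-gram sizes
lowercase: whether to convert text to lowercase before tokenizing
max_features: maximum vocabulary size; only the most frequent terms are kept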
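The effect of several of these parameters is easiest to see on the learned vocabulary. The sketch below uses three made-up sentences and a hypothetical helper `vocab` that fits a vectorizer and returns its sorted vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents
docs = ["The cat sat", "the cat ran", "the dog ran"]


def vocab(**kwargs):
    """Fit a TfidfVectorizer with the given options and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)


print(vocab())                      # ['cat', 'dog', 'ran', 'sat', 'the']
print(vocab(lowercase=False))       # ['The', 'cat', 'dog', 'ran', 'sat', 'the']
print(vocab(stop_words=["the"]))    # ['cat', 'dog', 'ran', 'sat']
print(vocab(max_features=3))        # ['cat', 'ran', 'the'], the three most frequent terms
print(len(vocab(ngram_range=(1, 2))))  # 10: five single words plus five word pairs
```

Note that with lowercase=False "The" and "the" become separate features, which matters little for Chinese text but can split counts for embedded English words.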

Data description

The dataset consists of news articles from 12 different websites. A preview:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...

Pipeline:

Load the data
Compute the TF-IDF values of the data
Classify with naive Bayes
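The last two steps can be tried end to end on a toy example before touching the real dataset. This is a sketch under assumptions: the four training sentences and their labels are invented, and `make_pipeline` is used here so that the vectorizer is fitted on the training texts only, which is a design choice of this example rather than something the article prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already space-separated training texts and labels
train_texts = [
    "team wins football match",
    "striker scores goal in final match",
    "stock market falls",
    "bank raises interest rates",
]
train_labels = ["sports", "sports", "finance", "finance"]

# Step 2 (TF-IDF) and step 3 (naive Bayes) chained into one estimator
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["team scores goal", "bank stock falls"])
print(predictions)
```

For Chinese text, each document would first be segmented into space-separated words (for example with jieba), since the default tokenizer splits on word boundaries that unsegmented Chinese does not have.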

Code implementation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
import jieba

def load_data():
    """Load the dataset and the stop words."""

    # Load the dataset (tab-separated columns: class, title, content)
    data = pd.read_csv("test.txt", sep="\t", names=["Class", "Title", "Content"])
    print(data.head())

    # Load the stop words into a plain Python list
    stop_words = pd.read_csv("stopwords.txt", names=["stop_words"], encoding="utf-8")
    stop_words = stop_words["stop_words"].values.tolist()
    print(stop_words)

    return data, stop_words

def main():
    """Main function."""

    # Load the data
    data, stop_words = load_data()

    # Segment each article with jieba and join the words with spaces,
    # so that TfidfVectorizer can split them again
    segs = data["Content"].apply(lambda x: ' '.join(jieba.cut(x)))

    # TF-IDF: drop stop words, keep the 1000 most frequent terms, do not lowercase
    tf_idf = TfidfVectorizer(stop_words=stop_words, max_features=1000, lowercase=False)

    # Fit on the whole corpus (the train/test split happens afterwards)
    tf_idf.fit(segs)

    # Transform the texts into a TF-IDF matrix
    X = tf_idf.transform(segs)

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, data["Class"], random_state=0)

    # Debug output
    print(X_train[:2])
    print(y_train[:2])

    # Instantiate the multinomial naive Bayes classifier
    classifier = MultinomialNB()

    # Fit on the training set
    classifier.fit(X_train, y_train)

    # Compute the accuracy on the test set
    acc = classifier.score(X_test, y_test)
    print("Accuracy:", acc)

    # Per-class precision, recall and F1
    report = classification_report(y_test, classifier.predict(X_test))
    print(report)

if __name__ == '__main__':
    main()

Output:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...

[5 rows x 3 columns]
['?', '、', '。', '“', '”', '《', '》', '！', '，', '：', '；', '？', '啊', '阿', '哎', '哎呀', '哎哟', '唉', '俺', '俺们', '按', '按照', ..., 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.797 seconds.
Prefix dict has been built successfully.
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ain', 'aren', 'couldn', 'daren', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'll', 'mayn', 'mightn', 'mon', 'mustn', 'needn', 'oughtn', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
  (0, 1)    0.172494787172401
  (0, 13)   0.03617578927683419
  (0, 19)   0.044685283861169885
  (0, 24)   0.04378669110667244
  (0, 32)   0.04770060616202845
  (0, 37)   0.08714906173699981
  (0, 42)   0.03262617791847282
  (0, 79)   0.03598272479613044
  (0, 80)   0.03384787551537572
  (0, 83)   0.105263111952599
  (0, 88)   0.1963277784717525
  (0, 89)   0.04008970433219022
  (0, 90)   0.1412328663779052
  (0, 98)   0.03994454517190565
  (0, 134)  0.04594067399577792
  (0, 142)  0.04422553495980068
  (0, 147)  0.05830540319790606
  (0, 149)  0.02851184938909845
  (0, 163)  0.03588762154160368
  (0, 166)  0.11695593718680143
  (0, 168)  0.06847746933720501
  (0, 170)  0.08047415949662211
  (0, 176)  0.03776179057073174
  (0, 185)  0.03924979201525634
  (0, 200)  0.04184963649844074
  : :
  (0, 855)  0.04946215984804544
  (0, 865)  0.04649241857748127
  (0, 869)  0.138252930106818
  (0, 870)  0.21384828173878706
  (0, 873)  0.1667028225043511
  (0, 876)  0.09298483715496254
  (0, 885)  0.19784215331311594
  (0, 888)  0.04008970433219022
  (0, 896)  0.04985458905113395
  (0, 897)  0.49416003160549193
  (0, 902)  0.03938479984191792
  (0, 909)  0.03885574129240138
  (0, 910)  0.03911664642497605
  (0, 917)  0.05050266101265748
  (0, 922)  0.03966061581314387
  (0, 933)  0.02711876416903459
  (0, 947)  0.0329697338326223
  (0, 955)  0.15044205095521537
  (0, 972)  0.03255891003473477
  (0, 998)  0.1578946679288985
  (1, 196)  0.37511948551579166
  (1, 224)  0.21966463546937165
  (1, 420)  0.24197656045436386
  (1, 460)  0.49133294291978213
  (1, 787)  0.7148930709574247
1394    news.cn
1365    news.cn
Name: Class, dtype: object
Accuracy: 0.7176165803108808
               precision    recall  f1-score   support

1688.autos.cn       0.00      0.00      0.00         3
     autos.cn       1.00      0.22      0.36         9
       biz.cn       0.63      0.73      0.68        45
          cpc       0.00      0.00      0.00         8
      dangshi       0.48      0.77      0.59        13
       ent.cn       0.86      0.57      0.68        44
        henan       0.98      0.77      0.86        69
         news       0.00      0.00      0.00        16
      news.cn       0.64      0.93      0.76       131
      society       0.00      0.00      0.00         5
    sports.cn       0.86      0.86      0.86        37
       theory       0.00      0.00      0.00         6

     accuracy                           0.72       386
    macro avg       0.45      0.40      0.40       386
 weighted avg       0.69      0.72      0.68       386
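The rows with 0.00 precision, recall and f1-score belong to classes the classifier never predicted on the test set, which is also what the UndefinedMetricWarning in the log refers to. The toy sketch below reproduces the situation with made-up labels; it assumes a scikit-learn version (0.22 or later) in which the `zero_division` parameter is available, which is newer than the version that produced the log.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: the model predicts "sports" for everything
y_true = ["news", "news", "sports", "sports"]
y_pred = ["sports", "sports", "sports", "sports"]

# Precision of "news" is 0 / 0 (no predicted samples); zero_division=0
# reports it as 0.0 without emitting a warning
precision = precision_score(
    y_true, y_pred, labels=["news", "sports"], average=None, zero_division=0
)
print(precision)  # "news" -> 0.0, "sports" -> 0.5
```

In the report above this affects the small classes (3 to 16 test samples each), which suggests the model mostly falls back on the large classes such as news.cn.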

Original: https://blog.csdn.net/weixin_46274168/article/details/120365553
Author: 我是小白呀
Title: [NLP] ⚠️Hit Me If You Can't Learn It! Master the Basics in Half an Hour, Part 8⚠️ News Classification
