Text Clustering with sklearn

K-means clustering analysis of text

Preface

Background

To study the factors influencing users' willingness to pay for digital music, we ran a survey using quota sampling and collected 765 valid questionnaires. The final item was an open-ended question: "Q42_H1.您认为当前数字音乐付费模式存在哪些问题以及相应的建议?" (What problems do you see in the current digital-music payment model, and what are your suggestions?)

Objective and approach

Objective: cluster the suggestion texts and end up with a few groups of topic words.
Method: after preprocessing the data, segment the text with jieba and remove stop words, convert the documents into a TF-IDF matrix, run K-means clustering, and finally extract the topic words of each cluster.

Data preprocessing


import pandas as pd

# Load the questionnaire and pull out the open-ended suggestion column
data = pd.read_excel('questionnaire_data.xlsx')
data.columns.values.tolist()
adv = data['Q42_H1.您认为当前数字音乐付费模式存在哪些问题以及相应的建议?']

# Drop empty answers and exact duplicates, then save the remainder
adv = adv.dropna()
l1 = len(adv)
adv1 = pd.DataFrame(adv.unique())
l2 = len(adv1)
adv1.to_csv('jianyi.csv', index=False, encoding='utf-8')
print(f'Removed {l1 - l2} duplicate suggestions')

Next, apply mechanical compression (机械压缩去词) to strip immediately repeated substrings from each answer, writing the result to jianyi2.csv.
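As an aside, the core idea of mechanical compression — collapsing immediately repeated substrings — can be sketched with a regular expression. This is a simplified stand-in for illustration, not the character-scanning implementation the article actually uses, and it differs on some nested or overlapping repeats:

```python
import re

def compress_repeats(text: str) -> str:
    # Collapse any substring that is immediately repeated one or more
    # times, e.g. '太贵了太贵了太贵了' -> '太贵了'.
    return re.sub(r'(.+?)\1+', r'\1', text)

print(compress_repeats('太贵了太贵了太贵了'))  # -> 太贵了
print(compress_repeats('好好听'))              # -> 好听
```

The non-greedy group tries the shortest repeating unit first, so single repeated characters are also collapsed.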


import codecs
import pandas as pd

f = codecs.open('jianyi2.csv', 'w', 'utf-8')

def cutword(strs, reverse=False):
    """Mechanical compression: scan each answer character by character,
    building a candidate substring (list1) and a tentative repetition of
    it (list2); indices of repeated copies are collected in del1 and
    deleted at the end."""
    for A_string in strs:
        # normalise line endings and a possible BOM
        temp1 = A_string[0].strip('\n')
        temp2 = temp1.lstrip('\ufeff')
        temp3 = temp2.strip('\r')
        char_list = list(temp3)
        list1 = ['']   # current candidate substring
        list2 = ['']   # tentative repetition of the candidate
        del1 = []      # indices of characters marked for deletion
        flag = ['']    # start indices of the candidate
        i = 0
        while i < len(char_list):
            if char_list[i] == list1[0]:
                # the current char restarts the candidate
                if list2 == ['']:
                    list2[0] = char_list[i]
                else:
                    if list1 == list2:
                        # a full repetition just finished: mark its
                        # characters for deletion
                        t = len(list1)
                        for m in range(t):
                            del1.append(i - m - 1)
                        list2 = ['']
                        list2[0] = char_list[i]
                    else:
                        # mismatch: restart the scan at the current char
                        list1 = ['']
                        list2 = ['']
                        flag = ['']
                        list1[0] = char_list[i]
                        flag[0] = i
            else:
                if (list1 == list2) and (list1 != ['']) and (list2 != ['']):
                    if len(list1) >= 2:
                        t = len(list1)
                        for m in range(t):
                            del1.append(i - m - 1)
                        list1 = ['']
                        list2 = ['']
                        list1[0] = char_list[i]
                        flag[0] = i
                else:
                    if list2 == ['']:
                        if list1 == ['']:
                            list1[0] = char_list[i]
                            flag[0] = i
                        else:
                            list1.append(char_list[i])
                            flag.append(i)
                    else:
                        list2.append(char_list[i])
            i = i + 1
            if i == len(char_list):
                # a repetition that runs to the very end of the string
                if list1 == list2:
                    t = len(list1)
                    for m in range(t):
                        del1.append(i - m - 1)
                    for m in range(t):
                        del1.append(flag[m])
        # delete marked indices from the back so earlier indices stay valid
        for idx in sorted(del1, reverse=True):
            del char_list[idx]
        str1 = ''.join(char_list)
        str2 = str1.strip()
        # keep only answers longer than 4 characters
        if len(str2) > 4:
            f.writelines(str2 + '\r\n')
    f.close()
    return

data1 = pd.read_csv('jianyi.csv', encoding='utf-8')
cutword(data1.values)
data2 = pd.read_csv('jianyi2.csv', encoding='utf-8', delimiter='\t', header=None)

Word segmentation

Segmentation is done with jieba.

import jieba

# copy the compressed answers into a plain-text file
doc = open('jianyi2.csv', encoding='utf-8').read()
f = open('wenben.txt', 'w', encoding='utf-8')
f.write(doc)
f.close()

with open('wenben.txt', 'r', encoding='utf-8') as fr:
    lines = fr.readlines()

jiebaword = []
for line in lines:
    line = line.strip('\n')
    # remove all whitespace inside the line
    line = ''.join(line.split())
    # precise-mode segmentation; tokens joined with '/'
    seg_list = jieba.cut(line, cut_all=False)
    word = '/'.join(seg_list)
    jiebaword.append(word)
jiebaword

The resulting jiebaword is a list of token strings, one per answer, with tokens joined by '/'.


Stop-word handling

Obtaining a stop-word list

Download a Chinese stop-word list from the web and save it as stopwords.txt.

stopword = []

with open('stopwords.txt', "r", encoding='utf-8') as fr:
    lines = fr.readlines()

for line in lines:
    line = line.strip('\n')
    stopword.append(line)
stopword

Removing stop words


# note: 'a+' appends across runs, so delete CleanWords.txt before re-running
fw = open('CleanWords.txt', 'a+', encoding='utf-8')
for words in jiebaword:
    words = words.split('/')
    for word in words:
        if word not in stopword:
            fw.write(word + '\t')
    fw.write('\n')
fw.close()
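One practical note: "word not in stopword" scans a Python list on every lookup, which is O(n) per test. Converting the stop-word list to a set makes each membership test O(1) on average. A small sketch with hypothetical tokens:

```python
# hypothetical stop words and tokens standing in for the real lists
stop_set = {'的', '了', '是'}
tokens = ['音乐', '的', '价格', '是', '太', '高', '了']

# keep only tokens that are not stop words
kept = [t for t in tokens if t not in stop_set]
print(kept)  # -> ['音乐', '价格', '太', '高']
```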

Generating the TF-IDF matrix

from sklearn.feature_extraction.text import TfidfVectorizer

with open('CleanWords.txt', 'r', encoding='utf-8') as fr:
    lines = fr.readlines()

transformer = TfidfVectorizer()
tfidf = transformer.fit_transform(lines)

tfidf_arr = tfidf.toarray()
tfidf_arr.shape
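Note that fit_transform returns a scipy sparse matrix; toarray() densifies it, which can get memory-hungry for large corpora. A tiny sketch on a hypothetical three-document corpus, showing the shape and the learned vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical pre-segmented corpus, tokens separated by spaces
corpus = ['价格 太高', '音质 太差', '价格 合理']
vec = TfidfVectorizer()
m = vec.fit_transform(corpus)   # scipy CSR sparse matrix

print(m.shape)                  # (3 documents, 5 unique terms)
print(sorted(vec.vocabulary_))  # the learned term -> column mapping
```

Also worth knowing: the default token_pattern keeps only tokens of two or more word characters, so single-character Chinese words are silently dropped.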

K-means clustering

Running the clustering

Based on experience, the number of clusters is set to num_means=3.

from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

kmeans = KMeansClusterer(num_means=3, distance=cosine_distance)
kmeans.cluster(tfidf_arr)

# assign each suggestion to a cluster and save index/label pairs
kinds = pd.Series([kmeans.classify(i) for i in tfidf_arr])
fw = open('ClusterText.txt', 'a+', encoding='utf-8')
for i, v in kinds.items():
    fw.write(str(i) + '\t' + str(v) + '\n')
fw.close()
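The article clusters with nltk's KMeansClusterer and cosine distance. Since TfidfVectorizer L2-normalizes rows by default, Euclidean K-means on those rows orders points the same way cosine distance does, so scikit-learn's KMeans is a close and usually faster substitute. A sketch with a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical pre-segmented documents: two about price, two about quality
docs = ['价格 太高 太贵', '太贵 价格', '音质 太差', '音质 不好 太差']

X = TfidfVectorizer().fit_transform(docs)  # rows L2-normalized by default
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

fit_predict accepts the sparse matrix directly, so no toarray() call is needed here.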

ClusterText.txt now pairs each suggestion's index with its cluster label.

Splitting the corpus by cluster

index_cluser = []

with open('ClusterText.txt', 'r', encoding='utf-8') as fr:
    lines = fr.readlines()

for line in lines:
    line = line.strip('\n')
    line = line.split('\t')
    index_cluser.append(line)

with open('CleanWords.txt', 'r', encoding='utf-8') as fr:
    lines = fr.readlines()

# write each line into the file of its cluster (cluster0.txt, cluster1.txt, ...);
# the original hard-coded the corpus size 410 here, which breaks on other data
for index, line in enumerate(lines):
    for i in range(len(index_cluser)):
        if str(index) == index_cluser[i][0]:
            fw = open('cluster' + index_cluser[i][1] + '.txt', 'a+', encoding='utf-8')
            fw.write(line)
            fw.close()
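The nested index-matching loop can also be expressed with pandas: pair each cleaned line with its label and group. A sketch with hypothetical data standing in for CleanWords.txt and ClusterText.txt:

```python
import pandas as pd

# hypothetical lines and their cluster labels
lines = ['价格 太高', '音质 太差', '太贵']
labels = [0, 1, 0]

df = pd.DataFrame({'text': lines, 'cluster': labels})
clusters = {k: list(g['text']) for k, g in df.groupby('cluster')}
print(clusters)  # -> {0: ['价格 太高', '太贵'], 1: ['音质 太差']}
```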


Extracting topic words

from collections import Counter

for i in range(3):
    with open('cluster' + str(i) + '.txt', 'r', encoding='utf-8') as fr:
        lines = fr.readlines()

    # collect all tokens of this cluster
    all_words = []
    for line in lines:
        line = line.strip('\n')
        line = line.split('\t')
        for word in line:
            all_words.append(word)

    # count tokens once per cluster (the original counted and printed
    # inside the line loop, emitting a running result for every line)
    c = Counter()
    for x in all_words:
        if len(x) > 1 and x != '\r\n':
            c[x] += 1

    print('Topic ' + str(i + 1) + '\ntop word by frequency:')
    for (k, v) in c.most_common(1):
        print(k, ':', v, '\n')


Conclusion


Original: https://blog.csdn.net/weixin_43194506/article/details/115276211
Author: 今天我吃好吃的了吗
Title: sklearn做文本聚类分析
