K-means Clustering Analysis of Text
Preface
Background
To study the factors that influence users' willingness to pay for digital music, we ran a quota-sampling survey and collected 765 valid questionnaires. The final item was an open-ended question: "Q42_H1.您认为当前数字音乐付费模式存在哪些问题以及相应的建议?" ("What problems do you think exist in the current digital music payment model, and what suggestions do you have?").
Objective and Approach
Objective: run a clustering analysis on the suggestion texts and end up with a few clusters of topic words.
Method: after preprocessing the data, apply jieba word segmentation and remove stop words, turn the documents into a TF-IDF matrix, cluster it with K-means, and finally extract the topic words of each cluster.
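The pipeline above can be sketched end to end in a few lines. The snippet below is a minimal illustration only: the documents are made up, whitespace tokenization stands in for jieba, and sklearn's KMeans stands in for the NLTK clusterer used later in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy pre-segmented documents (words joined by spaces, as jieba output would be)
docs = [
    "price too high",
    "price not fair",
    "sound quality poor",
    "sound quality low",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # documents -> TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
```

The documents sharing "price" should land in one cluster and those sharing "sound quality" in the other.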
Data Preprocessing
import pandas as pd

data = pd.read_excel('questionnaire_data.xlsx')
data.columns.values.tolist()
adv = data['Q42_H1.您认为当前数字音乐付费模式存在哪些问题以及相应的建议?']
adv = adv.dropna()                 # drop questionnaires that left the question blank
l1 = len(adv)
adv1 = pd.DataFrame(adv.unique())  # drop duplicate answers
l2 = len(adv1)
adv1.to_csv('jianyi.csv', index=False, encoding='utf-8')
print(f'Removed {l1 - l2} duplicate suggestions')
The text is then cleaned by mechanical compression (collapsing immediately repeated character runs), and the result is written to jianyi2.csv.
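The idea of mechanical compression is to collapse an immediately repeated run such as "好听好听好听" into a single occurrence. As a simplified stand-in for the character-by-character implementation that follows, the same effect can be approximated with one backreference regex:

```python
import re

def compress(text):
    """Collapse any immediately repeated substring run to one copy."""
    return re.sub(r'(.+?)\1+', r'\1', text)

compress('好听好听好听')  # -> '好听'
```

The non-greedy group finds the shortest repeating unit, and the `\1+` backreference consumes all of its adjacent copies.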
import codecs

f = codecs.open('jianyi2.csv', 'w', 'utf-8')

def cutword(strs, reverse=False):
    # Scan each string character by character; indices belonging to
    # immediately repeated runs are collected in del1 and deleted at the end.
    for A_string in strs:
        temp1 = A_string[0].strip('\n')
        temp2 = temp1.lstrip('\ufeff')
        temp3 = temp2.strip('\r')
        char_list = list(temp3)
        list1 = ['']   # candidate repeated unit
        list2 = ['']   # characters matched against the candidate
        del1 = []      # indices marked for deletion
        flag = ['']    # start indices of the candidate unit
        i = 0
        while i < len(char_list):
            if char_list[i] == list1[0]:
                if list2 == ['']:
                    list2[0] = char_list[i]
                else:
                    if list1 == list2:
                        # A full repetition was matched: mark it for deletion
                        t = len(list1)
                        m = 0
                        while m < t:
                            del1.append(i - m - 1)
                            m = m + 1
                        list2 = ['']
                        list2[0] = char_list[i]
                    else:
                        list1 = ['']
                        list2 = ['']
                        flag = ['']
                        list1[0] = char_list[i]
                        flag[0] = i
            else:
                if (list1 == list2) and (list1 != ['']) and (list2 != ['']):
                    if len(list1) >= 2:
                        t = len(list1)
                        m = 0
                        while m < t:
                            del1.append(i - m - 1)
                            m = m + 1
                    list1 = ['']
                    list2 = ['']
                    list1[0] = char_list[i]
                    flag[0] = i
                else:
                    if list2 == ['']:
                        if list1 == ['']:
                            list1[0] = char_list[i]
                            flag[0] = i
                        else:
                            list1.append(char_list[i])
                            flag.append(i)
                    else:
                        list2.append(char_list[i])
            i = i + 1
            if i == len(char_list):
                # Handle a repetition that runs to the end of the string
                if list1 == list2:
                    t = len(list1)
                    m = 0
                    while m < t:
                        del1.append(i - m - 1)
                        m = m + 1
                    m = 0
                    while m < t:
                        del1.append(flag[m])
                        m = m + 1
        a = sorted(del1)
        t = len(a) - 1
        while t >= 0:
            del char_list[a[t]]
            t = t - 1
        str1 = "".join(char_list)
        str2 = str1.strip()
        if len(str2) > 4:
            # Keep only compressed suggestions longer than 4 characters
            f.writelines(str2 + '\r\n')
    f.close()
    return

data1 = pd.read_csv('jianyi.csv', encoding='utf-8')
cutword(data1.values)  # writes the compressed text to jianyi2.csv
data2 = pd.read_csv('jianyi2.csv', encoding='utf-8', delimiter="\t", header=None)
Word Segmentation
Segmentation is done with jieba.
import jieba

# Copy the compressed suggestions into a plain text file
doc = open('jianyi2.csv', encoding='utf-8').read()
f = open("wenben.txt", "w", encoding='utf-8')
f.write(doc)
f.close()

with open('wenben.txt', "r", encoding='utf-8') as fr:
    lines = fr.readlines()
jiebaword = []
for line in lines:
    line = line.strip('\n')
    line = "".join(line.split())              # drop all whitespace inside the line
    seg_list = jieba.cut(line, cut_all=False)  # accurate (non-full) mode
    word = "/".join(seg_list)
    jiebaword.append(word)
jiebaword
This yields jiebaword, a list in which each suggestion is a '/'-joined string of segmented words.
Stop-Word Handling
Obtaining a stop-word list
Download a stop-word list stopwords.txt from the web.
stopword = []
with open('stopwords.txt', "r", encoding='utf-8') as fr:
lines = fr.readlines()
for line in lines:
line = line.strip('\n')
stopword.append(line)
stopword
Removing stop words
stop_set = set(stopword)  # set membership is much faster than scanning a list
fw = open('CleanWords.txt', 'w', encoding='utf-8')  # 'w' so reruns do not append duplicates
for words in jiebaword:
    words = words.split('/')
    for word in words:
        if word not in stop_set:
            fw.write(word + '\t')
    fw.write('\n')
fw.close()
Generating the TF-IDF Matrix
from sklearn.feature_extraction.text import TfidfVectorizer

with open('CleanWords.txt', "r", encoding='utf-8') as fr:
    lines = fr.readlines()
transformer = TfidfVectorizer()
tfidf = transformer.fit_transform(lines)
tfidf_arr = tfidf.toarray()
tfidf_arr.shape  # (number of documents, vocabulary size)
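As a quick sanity check on what that shape means (rows = documents, columns = distinct vocabulary terms), here is a toy example with made-up English documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["price too high", "sound quality good", "price unfair"]
vec = TfidfVectorizer()
m = vec.fit_transform(docs).toarray()
# One row per document, one column per vocabulary term
m.shape  # (3, 7): 3 documents, 7 distinct terms
```

Note that the default tokenizer keeps only tokens of two or more characters, which is worth remembering when feeding it single-character Chinese words.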
K-means Clustering
Obtaining the clusters
Based on experience, the number of clusters is set to num_means=3.
from nltk.cluster import KMeansClusterer, cosine_distance

# NLTK's K-means with cosine distance, which suits TF-IDF vectors
kmeans = KMeansClusterer(num_means=3, distance=cosine_distance)
kmeans.cluster(tfidf_arr)
kinds = pd.Series([kmeans.classify(i) for i in tfidf_arr])
fw = open('ClusterText.txt', 'w', encoding='utf-8')  # 'w' so reruns do not append duplicates
for i, v in kinds.items():
    fw.write(str(i) + '\t' + str(v) + '\n')
fw.close()
The txt file now pairs every suggestion index with its cluster label.
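Setting num_means=3 is an empirical choice; nothing in the data dictates it. A common way to sanity-check the number of clusters is the silhouette score. The sketch below runs it on synthetic data (not the questionnaire TF-IDF matrix) purely to illustrate the procedure, using sklearn's KMeans rather than the NLTK clusterer:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups, for illustration only
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # highest silhouette -> preferred k
```

On real survey text the scores are usually far lower and flatter, but the relative comparison across k is still informative.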
Writing out one document per cluster
index_cluser = []
with open('ClusterText.txt', "r", encoding='utf-8') as fr:
    lines = fr.readlines()
for line in lines:
    line = line.strip('\n')
    line = line.split('\t')
    index_cluser.append(line)

with open('CleanWords.txt', "r", encoding='utf-8') as fr:
    lines = fr.readlines()
for index, line in enumerate(lines):
    for i in range(len(index_cluser)):  # was a hard-coded 410
        if str(index) == index_cluser[i][0]:
            fw = open('cluster' + index_cluser[i][1] + '.txt', 'a+', encoding='utf-8')
            fw.write(line)
            fw.close()
Extracting the topic words
from collections import Counter

for i in range(3):
    with open('cluster' + str(i) + '.txt', "r", encoding='utf-8') as fr:
        lines = fr.readlines()
    all_words = []
    for line in lines:
        line = line.strip('\n')
        line = line.split('\t')
        for word in line:
            all_words.append(word)
    c = Counter()
    for x in all_words:
        if len(x) > 1 and x != '\r\n':  # skip single characters and line残余
            c[x] += 1
    print('Topic ' + str(i + 1) + '\nword frequency:')
    for (k, v) in c.most_common(1):  # the single most frequent word per cluster
        print(k, ':', v, '\n')
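most_common(1) reports only the single highest-frequency word per cluster; passing a larger n would yield a fuller topic word group. A toy example with made-up words:

```python
from collections import Counter

c = Counter(['price', 'price', 'ads', 'price', 'ads', 'ui'])
c.most_common(2)  # [('price', 3), ('ads', 2)]
```

Ties are broken by insertion order, so words first seen earlier come first among equal counts.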
Conclusion
Original: https://blog.csdn.net/weixin_43194506/article/details/115276211
Author: 今天我吃好吃的了吗
Title: sklearn做文本聚类分析