文章目录
1、Hamlet英文词频统计
txt = open('hamlet.txt','r').read()
txt = txt.lower()
for ch in ',./?;:'"<>=+-[]{}!~%@()
txt.replace(ch, ' ')
words = txt.split()
counts = {}
for word in words:
counts[word] = counts.get(word, 0) + 1
counts = sorted(counts.items(), key = lambda x: x[1], reverse = True)
for i in range(10):
word, count = counts[i]
print('{0:5}'.format(word,count)
运行之后发现高频单词大多数是冠词、代词、连接词等语法型词汇,并不能代表文章含义
建立一个排除词库encludes
excludes = {'the','and','of','you','a','i','my','in'}
txt = open('hamlet.txt', 'r').read()
txt = txt.lower()
for ch in ',./?;:'"<>=+-[]{}!~%@()
txt = txt.replace(ch, ' ')
words = txt.split()
counts = {}
for word in words:
counts[word] = counts.get(word, 0) + 1
for word in excludes:
del counts[word]
counts = sorted(counts.items(), key = lambda x:x[1],reverse = True)
for i in range(10):
print('{:5}'.format(counts[i][0],counts[i][1])
2、python之jieba库
1)重要的第三方中文分词函数库
2)安装 pip3 install jieba
3)常用函数
; 3、《三国演义》中文人物出场统计
import jieba
txt = open('三国演义.txt','r', encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
if len(word) == 1:
continue
else:
counts[word] = counts.get(word, 0) + 1
counts = sorted(counts.items(), key = lambda x: x[1], reverse = True)
for i in range(5):
word, count = counts[i]
print('{:5}'.format(word, count))
【代码改进】
1、排除与人名无关的词汇
2、同一个人有不同称谓
encludes = {'将军','却说','荆州','二人','不可','不能','如此'}
import jieba
txt = open('三国演义.txt','r', encoding='utf-8').read()
words = jiaba.lcut(s)
counts = {}
for word in words:
if len(word) == 1:
continue
elif word == '诸葛亮' or '孔明曰':
rword = '孔明'
elif word == '关公' or '云长':
rword = '关羽'
elif word == '玄德' or '玄德曰':
rword = '刘备'
elif word == '孟德' or '丞相':
rword = '曹操'
else:
rword = word
counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
del counts[word]
counts = sorted(counts.items(), key = lambda x:x[1], reverse=True)
for i in range(10):
print('{:5}'.format(counts[i][0], counts[i][1]))
Original: https://blog.csdn.net/weixin_54958866/article/details/123466990
Author: grittii
Title: python之词频统计
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/528218/
转载文章受原作者版权保护。转载请注明原作者出处!