Word Frequency Counting in Python

1. English word frequency count for Hamlet

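txt = open('hamlet.txt', 'r').read()

# Lowercase everything so that "The" and "the" count as the same word
txt = txt.lower()

# Replace punctuation with spaces; str.replace returns a new string, so reassign it
for ch in ',./?;:\'"<>=+-[]{}!~%@()':
    txt = txt.replace(ch, ' ')

words = txt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first
counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Print the 10 most frequent words with their counts
for i in range(10):
    word, count = counts[i]
    print('{0:<10}{1:>5}'.format(word, count))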
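
The same count can also be written with the standard library's collections.Counter. The following is a minimal sketch rather than the author's method: the function name count_words and the sample sentence are assumptions made for illustration, and the sample stands in for the contents of hamlet.txt.

```python
from collections import Counter


def count_words(text, punctuation=',./?;:\'"<>=+-[]{}!~%@()'):
    """Return a Counter of the lowercase words in text, ignoring punctuation."""
    text = text.lower()
    # Map every punctuation character to a space in a single pass
    table = str.maketrans({ch: ' ' for ch in punctuation})
    return Counter(text.translate(table).split())


# Hypothetical sample text standing in for the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
counts = count_words(sample)

# most_common(n) returns the n highest (word, count) pairs
for word, count in counts.most_common(3):
    print('{0:<10}{1:>5}'.format(word, count))
```

Counter.most_common replaces the manual sort, and str.translate handles all punctuation characters in one call instead of one replace per character.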

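Running this shows that most of the high-frequency words are articles, pronouns, conjunctions, and other grammatical words, which say little about what the text means.
To filter them out, build a set of excluded words, excludes.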

# Grammatical words to leave out of the ranking
excludes = {'the','and','of','you','a','i','my','in'}
txt = open('hamlet.txt', 'r').read()

txt = txt.lower()

# Replace punctuation with spaces
for ch in ',./?;:\'"<>=+-[]{}!~%@()':
    txt = txt.replace(ch, ' ')

words = txt.split()

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# pop with a default avoids a KeyError if an excluded word never occurs
for word in excludes:
    counts.pop(word, None)

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

for i in range(10):
    print('{0:<10}{1:>5}'.format(counts[i][0], counts[i][1]))
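
An alternative is to skip excluded words while counting, which needs no deletion step and works even when an excluded word never occurs in the text. This is a sketch under assumptions: the helper name top_words and the short word list are hypothetical and only illustrate the idea.

```python
def top_words(words, excludes, n=10):
    """Return the n most frequent words that are not in excludes."""
    counts = {}
    for word in words:
        if word in excludes:
            continue  # skip excluded words instead of deleting them afterwards
        counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    # Slicing is safe even when fewer than n distinct words remain
    return ranked[:n]


# Hypothetical word list; 'of' is excluded but never occurs, which is fine here
words = ['the', 'king', 'and', 'the', 'queen', 'and', 'the', 'king']
excludes = {'the', 'and', 'of'}
for word, count in top_words(words, excludes, n=2):
    print('{0:<10}{1:>5}'.format(word, count))
```

In practice the exclusion set is grown step by step: run the count, look at the top results, add the uninformative words to excludes, and run it again.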

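2. The jieba library

1) An important third-party library for Chinese word segmentation
2) Installation: pip3 install jieba
3) Commonly used functions, such as jieba.lcut(s), which returns the words of the string s as a list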

3. Character appearance counts in Romance of the Three Kingdoms (《三国演义》)

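import jieba

txt = open('三国演义.txt', 'r', encoding='utf-8').read()
# Segment the Chinese text into a list of words
words = jieba.lcut(txt)

counts = {}
for word in words:
    # Single-character tokens are mostly particles and punctuation, so skip them
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Print the 5 most frequent words with their counts
for i in range(5):
    word, count = counts[i]
    print('{0:<10}{1:>5}'.format(word, count))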

[Code improvements]
1. Exclude words that are not character names
2. Merge the different names used for the same character
import jieba

# Frequent words that are not character names
excludes = {'将军','却说','荆州','二人','不可','不能','如此'}
txt = open('三国演义.txt', 'r', encoding='utf-8').read()

words = jieba.lcut(txt)

counts = {}
for word in words:
    # Map alternative names to one name per character. A test such as
    # `word == 'a' or 'b'` is always true, so check membership instead.
    if len(word) == 1:
        continue
    elif word in ('诸葛亮', '孔明曰'):
        rword = '孔明'
    elif word in ('关公', '云长'):
        rword = '关羽'
    elif word in ('玄德', '玄德曰'):
        rword = '刘备'
    elif word in ('孟德', '丞相'):
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# pop with a default avoids a KeyError if an excluded word never occurs
for word in excludes:
    counts.pop(word, None)

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

for i in range(10):
    print('{0:<10}{1:>5}'.format(counts[i][0], counts[i][1]))
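
As the number of alternative names grows, the if/elif chain can be replaced by a dictionary lookup. The sketch below assumes the same name pairs as the code above; the names ALIASES and count_characters are hypothetical, and a hand-written token list stands in for the output of jieba.lcut so that the example runs without the novel's text.

```python
# Hypothetical alias table: each alternative name maps to one canonical name
ALIASES = {
    '诸葛亮': '孔明', '孔明曰': '孔明',
    '关公': '关羽', '云长': '关羽',
    '玄德': '刘备', '玄德曰': '刘备',
    '孟德': '曹操', '丞相': '曹操',
}


def count_characters(words, aliases, excludes):
    """Count multi-character words, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)


# Hand-written token list standing in for the output of jieba.lcut
tokens = ['却说', '玄德', '曰', '孔明', '诸葛亮', '玄德曰', '云长', '孔明曰']
result = count_characters(tokens, ALIASES, excludes={'却说'})
for name, count in result:
    print('{0:<10}{1:>5}'.format(name, count))
```

Adding another alias then means adding one dictionary entry, and the loop body stays unchanged.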

Original: https://blog.csdn.net/weixin_54958866/article/details/123466990
Author: grittii
Title: python之词频统计
