5.2 Data Visualization Analysis: Drawing Word Clouds

July 6, 2023, 6:08 AM • Artificial Intelligence

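5.2.1 Chinese word segmentation with the jieba library

To extract high-frequency words from Chinese text, we need Chinese word segmentation, which splits a text sequence into individual words. In English, words are separated by spaces; Chinese has no such explicit delimiter between words, so segmenting Chinese is more involved than segmenting English. In Python, the jieba library handles Chinese word segmentation quickly.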

1. Installing jieba and basic usage

Install the library with pip install jieba. A minimal segmentation example:

import jieba
word = jieba.cut('我爱北京天安门')
for i in word:
    print(i)

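Note: the word object returned by cut() is not a list but a generator, which is a kind of iterator. An iterator resembles a list, but it produces its elements one at a time as you loop over it, so you can think of it as a "hidden list". Its elements are accessed with a for loop, which is why lines 3 and 4 cannot be replaced with print(word): that would print the generator object itself rather than the words.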
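The behaviour described in the note can be seen without jieba installed. The sketch below uses toy_cut, a hypothetical stand-in for jieba.cut that simply splits on spaces; only the generator behaviour is the point here, not the segmentation itself.

```python
def toy_cut(text):
    """Hypothetical stand-in for jieba.cut: yields tokens one at a time."""
    for token in text.split():
        yield token


words = toy_cut("I love Beijing Tiananmen")
print(words)  # prints a generator object, not the tokens

first_pass = list(words)   # materialise the generator into a real list
second_pass = list(words)  # the generator is already exhausted at this point

print(first_pass)   # ['I', 'love', 'Beijing', 'Tiananmen']
print(second_pass)  # []
```

A generator can be consumed only once, so convert it to a list if the words are needed more than once. jieba also provides lcut(), which returns a list directly.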

2. Reading text from a file and segmenting it

The following code reads the contents of a text file and segments it:

import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)
for word in words:
    print(word)
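The listing above relies on the platform's default text encoding, which is typically GBK on Chinese-locale Windows and UTF-8 elsewhere. As a hedged sketch, assuming the report is saved as UTF-8 (the source does not say), the file can be read with an explicit encoding and a with block so that it is always closed. The helper name read_text and the temporary demo file are illustrative only.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file; the with block closes it even on errors."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_report.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("信托行业年度报告")
    report = read_text(demo_path)

print(len(report))  # 8
```

If decoding fails with UnicodeDecodeError, the file is probably stored in another encoding such as GBK, and the encoding argument should be changed accordingly.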

3. Keeping only words of a given length

As an example, the following code keeps only words that are at least 4 characters long:

import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)
report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)
print(report_words)
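The same filter can be written as a list comprehension. This is a minimal sketch on a hand-made token list; keep_long_words is a hypothetical helper name, and in the real script the input would be the generator returned by jieba.cut.

```python
def keep_long_words(words, min_len=4):
    """Keep only the words with at least min_len characters."""
    return [w for w in words if len(w) >= min_len]


tokens = ["信托", "信托公司", "的", "资产管理规模", "人工智能", "增长"]
report_words = keep_long_words(tokens)
print(len(report_words))  # 3
```

len() counts characters, not bytes, so a four-character Chinese word has length 4.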

4. Counting high-frequency words

Counting high-frequency words is straightforward with the Counter class from the collections module:

import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)
report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)

from collections import Counter
result = Counter(report_words)
print(result)

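This prints the number of occurrences of each word. To see only the 50 most frequent words, use the most_common() method by changing line 10 of the code above (result = Counter(report_words)) as shown here: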

import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)
report_words = []

for word in words:
    if len(word) >= 4:
        report_words.append(word)

from collections import Counter
result = Counter(report_words).most_common(50)
print(result)

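Printing result shows the 50 most frequent words together with their counts. Some of these high-frequency words do reflect the state of the industry. For example, recent annual industry reports frequently mention terms such as "信息技术" (information technology) and "人工智能" (artificial intelligence), which do point to where the industry is heading.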
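To make the shape of the result concrete, here is a small sketch on a made-up word list: most_common(n) returns a list of (word, count) tuples sorted by descending count.

```python
from collections import Counter

report_words = ["信息技术", "人工智能", "信息技术", "资产管理", "信息技术", "人工智能"]

counts = Counter(report_words)   # maps each word to its number of occurrences
top2 = counts.most_common(2)     # the two most frequent (word, count) pairs

for word, n in top2:
    print(word, n)
```

Here top2 is [('信息技术', 3), ('人工智能', 2)], so each entry can be unpacked into a word and its count.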

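5.2.2 Drawing word clouds with the wordcloud library

1. Basic usage of the wordcloud library

If installing wordcloud fails, see the post "python中安装wordcloud的方法" (how to install wordcloud in Python).

Using the length-filtered segmentation result report_words from above, the basic usage of the wordcloud library is as follows: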


import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)

report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)

from collections import Counter
result = Counter(report_words).most_common(50)

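from wordcloud import WordCloud
content = ' '.join(report_words)  # WordCloud expects space-separated words
wc = WordCloud(font_path='simhei.ttf',  # a font that contains Chinese glyphs
               background_color='white',
               width=1000,
               height=600,
               ).generate(content)
wc.to_file('词云图.png')  # save the image to disk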
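Note that the listing computes result (the top 50 words) but passes the full report_words text to generate(), so the top-50 list does not influence the image. If the cloud should be drawn from the counted words only, one option is WordCloud's generate_from_frequencies() method, which accepts a dict of word counts. The sketch below is an assumption about how that could be wired up, not the author's code; top_frequencies and render_from_frequencies are hypothetical helper names.

```python
from collections import Counter


def top_frequencies(words, n=50):
    """Return the n most common words as a {word: count} dict."""
    return dict(Counter(words).most_common(n))


def render_from_frequencies(freqs, out_path, font_path="simhei.ttf"):
    """Draw a word cloud directly from word counts and save it.

    wordcloud is imported here so that top_frequencies can be used
    even when the wordcloud package is not installed.
    """
    from wordcloud import WordCloud

    wc = WordCloud(font_path=font_path, background_color="white",
                   width=1000, height=600)
    wc.generate_from_frequencies(freqs)
    wc.to_file(out_path)


freqs = top_frequencies(["信息技术", "人工智能", "信息技术"], n=2)
print(len(freqs))  # 2

# Needs the wordcloud package and the font file, so it is left commented out:
# render_from_frequencies(freqs, "词云图_top50.png")
```

With this approach the font size of each word follows the counts that were computed explicitly, rather than counts that wordcloud derives from the joined text.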

2. Drawing a word cloud in a custom shape

A word cloud can also be drawn in a specific shape. First import the required libraries:

from PIL import Image
import numpy as np
from wordcloud import WordCloud

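The first line imports Image from the PIL library, which handles images. The second line imports NumPy, which handles numerical data; Chapter 6 of 《Python金融大数据挖掘与分析全流程详解》, by the same author as this book, covers it in detail. The third line imports the WordCloud class.
With the libraries imported, the custom-shaped word cloud can be drawn as follows: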


import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)

report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)

from collections import Counter
result = Counter(report_words).most_common(50)

from PIL import Image
import numpy as np
from wordcloud import WordCloud
background_pic = '微博.jpg'
images = Image.open(background_pic)
maskImages = np.array(images)
content = ' '.join(report_words)
wc = WordCloud(font_path='simhei.ttf',
               background_color='white',
               width=1000,
               height=600,
               mask=maskImages
               ).generate(content)
wc.to_file('词云图+自定义形状.png')

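If the mask argument is omitted from the WordCloud call, the result is the same as the first word cloud.
The word cloud generated here still uses the default colors (run the code to see the actual colors).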
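According to the wordcloud documentation, pure white entries in the mask array are treated as masked out and words are drawn only on the remaining pixels; when a mask is given, the width and height arguments are ignored and the mask's own shape is used. The sketch below builds a synthetic circular mask with NumPy instead of loading 微博.jpg; circle_mask is a hypothetical helper used purely for illustration.

```python
import numpy as np


def circle_mask(size=200):
    """Build a square mask: white (255) outside a circle, black (0) inside."""
    y, x = np.ogrid[:size, :size]
    centre = (size - 1) / 2
    outside = (x - centre) ** 2 + (y - centre) ** 2 > (size * 0.45) ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255  # white pixels are masked out: no words are drawn here
    return mask


mask = circle_mask(200)
print(mask.shape)                   # (200, 200)
print(mask[0, 0], mask[100, 100])   # 255 0
```

Passing mask=circle_mask() to WordCloud would confine the words to the circle. For an image file, this also means the background of the picture should be pure white for the shape to come out cleanly.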

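3. Drawing a word cloud in custom colors

Next, on top of the custom shape, the word cloud is drawn in specific colors. First import the required libraries: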

from wordcloud import WordCloud,ImageColorGenerator
from imageio import imread

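The first line imports the WordCloud class from the wordcloud library together with the ImageColorGenerator class, which extracts colors from an image; the second line imports the imread function from the imageio library, which reads an image file.
Then append the new code to the end of the custom-shape script. The complete script, with the additions at the end, is: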


import jieba
report = open('信托行业报告.txt','r').read()
words = jieba.cut(report)

report_words = []
for word in words:
    if len(word) >= 4:
        report_words.append(word)

from collections import Counter
result = Counter(report_words).most_common(50)

from PIL import Image
import numpy as np
from wordcloud import WordCloud
background_pic = '微博.jpg'
images = Image.open(background_pic)
maskImages = np.array(images)
content = ' '.join(report_words)
wc = WordCloud(font_path='simhei.ttf',
               background_color='white',
               width=1000,
               height=600,
               mask=maskImages
               ).generate(content)

from wordcloud import WordCloud,ImageColorGenerator
from imageio import imread
back_color = imread(background_pic)
image_colors = ImageColorGenerator(back_color)
wc.recolor(color_func=image_colors)
wc.to_file('词云图+自定义形状+颜色.png')

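In the resulting word cloud, not only does the shape follow the outline of the Sina Weibo logo, but the words also take on the logo's colors (run the code to see the actual colors).

5.2.3 Case study: drawing a word cloud from Sina Weibo

Based on the content scraped from Sina Weibo in Section 3.6, this case study applies what was covered in this section to draw a word cloud.

Fetching the page source with requests fails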

import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36'}

url = 'https://s.weibo.com/weibo?q=阿里巴巴'
res = requests.get(url, headers=headers).text
print(res)

Scraping and combining the content of each Weibo post


import time
from selenium import webdriver

def get_browser():
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_argument("--disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)

    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
                        Object.defineProperty(navigator, 'webdriver', {
                          get: () => undefined
                        })
"""
    })
    return driver

url = 'https://s.weibo.com/weibo?q=阿里巴巴'
browser = get_browser()
browser.get(url)
time.sleep(6)
data = browser.page_source

import re
p_source = '
source = re.findall(p_source,data)
p_title = '(.*?)'
title = re.findall(p_title,data,re.S)

title_all = ''
for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('','',title[i])
    title[i] = re.sub('[\u200b]','',title[i])
    title[i] = re.sub('（.*?）','',title[i])
    title[i] = re.sub(' ','',title[i])
    title_all = title_all + title[i]
    print(str(i + 1) + '.' + title[i] + '-' + source[i])
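The cleaning steps inside the loop can be sketched as a standalone function. The regex patterns here are assumptions chosen for illustration, in particular the tag-stripping pattern; the patterns actually needed depend on Weibo's page markup. clean_post and the sample string are hypothetical.

```python
import re


def clean_post(raw):
    """Strip markup and filler characters from one scraped post."""
    text = raw.strip()
    text = re.sub(r'<.*?>', '', text)    # drop HTML tags (assumed pattern)
    text = re.sub('\u200b', '', text)    # drop zero-width spaces
    text = re.sub(r'（.*?）', '', text)  # drop text inside full-width parentheses
    text = re.sub(r'\s', '', text)       # drop any remaining whitespace
    return text


sample = '  <a href="#">阿里巴巴</a> 发布\u200b财报（转发） '
cleaned = clean_post(sample)
print(len(cleaned))  # 8
```

The non-greedy .*? keeps each substitution local, so one tag or one pair of parentheses is removed at a time instead of everything between the first opening and the last closing delimiter.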

Next, segment the text


import jieba
from collections import Counter

words = jieba.cut(title_all)

report_words = []
for word in words:
    if len(word) == 2:
        report_words.append(word)

result = Counter(report_words).most_common(50)

Drawing the word cloud


import time
from selenium import webdriver

def get_browser():
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_argument("--disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)

    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
                        Object.defineProperty(navigator, 'webdriver', {
                          get: () => undefined
                        })
"""
    })
    return driver

url = 'https://s.weibo.com/weibo?q=阿里巴巴'
browser = get_browser()
browser.get(url)
time.sleep(6)
data = browser.page_source

import re
p_source = '
source = re.findall(p_source,data)
p_title = '(.*?)'
title = re.findall(p_title,data,re.S)

title_all = ''
for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('','',title[i])
    title[i] = re.sub('[\u200b]','',title[i])
    title[i] = re.sub('（.*?）','',title[i])
    title[i] = re.sub(' ','',title[i])
    title_all = title_all + title[i]
    print(str(i + 1) + '.' + title[i] + '-' + source[i])

import jieba
words = jieba.cut(title_all)

report_words = []
for word in words:
    if len(word) == 2:
        report_words.append(word)

from collections import Counter
result = Counter(report_words).most_common(50)

from PIL import Image
import numpy as np
from wordcloud import WordCloud
background_pic = '微博.jpg'
images = Image.open(background_pic)
maskImages = np.array(images)

content = ' '.join(report_words)
wc = WordCloud(font_path='simhei.ttf',
               background_color='white',
               width=1000,
               height=600,
               mask=maskImages
               ).generate(content)

from wordcloud import WordCloud,ImageColorGenerator
from imageio import imread
back_color = imread(background_pic)
image_colors = ImageColorGenerator(back_color)
wc.recolor(color_func=image_colors)
wc.to_file('微博内容词云图.png')

Original: https://blog.csdn.net/Triumph19/article/details/123937138
Author: Triumph19
Title: 5.2 Data Visualization Analysis: Drawing Word Clouds

