Basic Usage of BeautifulSoup

July 4, 2023, 2:35 AM • Artificial Intelligence • 89 reads

✅ About the author: Hi everyone, I'm hacker707; you can call me hacker
📃 Homepage: hacker707's CSDN blog
🔥 Series: Python web scraping

bs4

- Installing bs4
- Quick start with bs4
- Parser comparison (for reference only)
- Object types
- Simple usage of bs4
- Traversing the document tree

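Installing bs4

To use BeautifulSoup4 with the lxml parser, install lxml first and then bs4 (the bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4).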

pip install lxml

pip install bs4

Usage:

from bs4 import BeautifulSoup

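Comparing lxml and bs4 side by side

from lxml import etree
tree = etree.HTML(html)  # html: a string of HTML markup
tree.xpath('//a/@href')  # query the tree with an XPath expression, e.g. every href

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: a string of HTML markup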

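Note:
If you create the soup object without passing 'lxml' (or features="lxml"), bs4 picks a parser on its own and warns that no parser was explicitly specified.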
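The note above can be checked with a small sketch. It records warnings while building a soup with and without a parser name. The sample markup and the choice of 'html.parser' (used here only so the sketch runs without lxml installed) are assumptions, and bs4 may skip the warning in an interactive session, so the first count is printed rather than relied on.

```python
import warnings

from bs4 import BeautifulSoup

html_doc = "<p>Hello</p>"  # hypothetical sample markup

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html_doc)  # no parser named: bs4 has to guess one

guess_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]
print("warnings without a parser name:", len(guess_warnings))

with warnings.catch_warnings(record=True) as caught_explicit:
    warnings.simplefilter("always")
    soup = BeautifulSoup(html_doc, "html.parser")  # parser named explicitly

explicit_guess_warnings = [
    w for w in caught_explicit
    if "No parser was explicitly specified" in str(w.message)
]
print("warnings with a parser name:", len(explicit_guess_warnings))
```

Naming the parser also keeps the result the same on machines where a different set of parsers is installed.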

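Quick start with bs4

Parser comparison (for reference only)

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python; moderate speed | Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; tolerant of malformed documents | Requires a C library |
| lxml XML parser | BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml') | Fast; the only parser that supports XML | Requires a C library |
| html5lib | BeautifulSoup(markup, 'html5lib') | Best tolerance of malformed documents; parses the way a browser does; produces HTML5 | Slow; a separate Python package to install, though it needs no C extension |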
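Because lxml is an optional install, one way to act on this comparison is to prefer lxml and fall back to the built-in parser. The following is a minimal sketch, not part of bs4 itself: make_soup and its preferred parameter are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Build a soup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the named parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello, <b>bs4</b></p>")
print(soup.p.get_text())  # Hello, bs4
```

Keep in mind that parsers can build slightly different trees from the same broken markup, so a fallback like this suits simple extraction more than exact tree comparisons.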

Object types

Tag: a tag
BeautifulSoup: the soup object for the whole document
NavigableString: a navigable string (the text inside a tag)
Comment: a comment

from bs4 import BeautifulSoup

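# Assumption: the <span> near the end is added so that soup.span.string
# below has a Comment to return.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!-- a comment --></span>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))         # Tag
print(type(soup))               # BeautifulSoup
print(type(soup.title.string))  # NavigableString
print(type(soup.span.string))   # Comment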
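The four object types can also be told apart with isinstance. This is a minimal sketch on a hypothetical one-line document, parsed with 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = "<p>plain text<!-- a comment --></p>"  # hypothetical sample
soup = BeautifulSoup(markup, "html.parser")

# The p tag has two children: a text node and a comment node.
text_node, comment_node = soup.p.contents

print(isinstance(soup, BeautifulSoup))             # True
print(isinstance(soup.p, Tag))                     # True
print(isinstance(text_node, NavigableString))      # True
print(isinstance(comment_node, Comment))           # True

# Comment is a subclass of NavigableString, so check for Comment first
# when the two need to be handled differently.
print(isinstance(comment_node, NavigableString))   # True
print(isinstance(text_node, Comment))              # False
```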

Simple usage of bs4

Getting tag contents

from bs4 import BeautifulSoup

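html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)
print('body tag:\n', soup.body)
print('html tag:\n', soup.html)
print('p tag:\n', soup.p)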

✅ Note: printing soup.p shows only the first p tag. To get every p tag, use find_all.

print('p tags:\n', soup.find_all('p'))

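✅ Note that the tag name is passed to find_all as a string here; find_all also accepts other filters, such as a list of names.

Getting tag names

Use the name attribute to get a tag's name.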

from bs4 import BeautifulSoup

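html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)
print('body tag name:\n', soup.body.name)
print('html tag name:\n', soup.html.name)
# find_all returns a list-like ResultSet, so read name from each tag in it
print('p tag names:\n', [p.name for p in soup.find_all('p')])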

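✅ To match two different tags, pass find_all a list filter rather than string filters.
Passing two strings returns an empty list here, because the second positional argument is treated as a CSS class, not as another tag name: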

print(soup.find_all('title', 'p'))

[]

Use a list filter to get several tags at once:

print(soup.find_all(['title', 'p']))

[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
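To make the difference between the filters concrete, here is a small sketch that runs a string, a list, a regular expression, and a tag-plus-class filter over a shortened copy of the sample document. The shortened markup and the use of 'html.parser' (so the sketch runs without lxml) are assumptions made for illustration.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# String filter: one tag name.
by_string = [tag.name for tag in soup.find_all("a")]

# List filter: any of several tag names, returned in document order.
by_list = [tag.name for tag in soup.find_all(["b", "a"])]

# Regular-expression filter: tag names that start with "b".
by_regex = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Two strings: the second one is a CSS class, so this means p.story.
by_class = soup.find_all("p", "story")

# limit stops the search after the given number of matches.
limited = soup.find_all("a", limit=1)

print(by_string)       # ['a', 'a']
print(by_list)         # ['b', 'a', 'a']
print(by_regex)        # ['b']
print(len(by_class))   # 1
print(limited[0]["id"])  # link1
```

Read this way, find_all('title', 'p') asks for title tags whose class is p, which is why it comes back empty.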

Getting the href attribute of a tags

from bs4 import BeautifulSoup

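html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')

# Three equivalent ways to read the href attribute of each a tag
for a in a_list:
    print(a.get('href'))     # get() method
    print(a.attrs['href'])   # attrs dictionary
    print(a['href'])         # dictionary-style access on the tag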
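The three forms differ only when the attribute is missing. A minimal sketch, assuming a hypothetical second a tag that has no href and using 'html.parser' so it runs without lxml:

```python
from bs4 import BeautifulSoup

markup = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">No link</a>'  # hypothetical tag without an href
)
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("a")

print(first["href"])              # http://example.com/elsie
print(second.get("href"))         # None: get() does not raise
print(second.get("href", "#"))    # '#': get() accepts a default

try:
    second["href"]                # dictionary-style access raises KeyError
except KeyError:
    print("second a tag has no href")

# href=True keeps only the tags that actually carry the attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)                      # ['http://example.com/elsie']

# class can hold several values, so bs4 returns it as a list.
print(first["class"])             # ['sister']
```

When scraping real pages, get() or the href=True filter is usually the safer choice, since not every a tag is guaranteed to have a link.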

✅ Tip: prettify() indents the output so that the nesting of nodes is easier to see and analyze.

print(soup.prettify())

Output without prettify()

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Output with prettify()

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.

  </p>
  <p class="story">
   ...

  </p>
 </body>
</html>

Traversing the document tree

from bs4 import BeautifulSoup

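html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""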
soup = BeautifulSoup(html_doc, 'lxml')
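head = soup.head

# contents: the direct children, as a list
print(head.contents)

# children: the direct children, as an iterator
print(head.children)

for h in head.children:
    print(h)
html = soup.html

# descendants: a generator over every nested node, not just direct children
print(html.descendants)

for h in html.descendants:
    print(h)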

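'''
Key points to master
string: gets the content of a tag
strings: returns a generator, used to get the content of multiple tags
stripped_strings: much like strings, but strips the extra whitespace
'''
print(soup.title.string)
print(soup.html.string)  # None: html has more than one child, so string is ambiguous

print(soup.html.strings)
for h in soup.html.strings:
    print(h)

print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)
'''
parent: gets the direct parent node
parents: gets all ancestor nodes
'''
title = soup.title

print(title.parent)

print(title.parents)
for p in title.parents:
    print(p)

print(soup.html.parent)  # the parent of html is the BeautifulSoup object itself

print(type(soup.html.parent))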
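The traversal attributes are easier to compare on a very small tree. The sketch below uses a hypothetical two-paragraph fragment and 'html.parser' (so it runs without lxml) to show what each attribute yields.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = "<div><p>  one  </p>\n<p>two <b>bold</b></p></div>"  # hypothetical sample
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# contents and children cover direct children only (text nodes included).
print(len(div.contents))  # 3: p, a newline text node, p
direct_tags = [c.name for c in div.children if isinstance(c, Tag)]
print(direct_tags)        # ['p', 'p']

# descendants walks the whole subtree.
all_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
print(all_tags)           # ['p', 'p', 'b']

# string works only when a tag has a single child.
first_p, second_p = div.find_all("p")
print(repr(first_p.string))   # '  one  '
print(second_p.string)        # None: text plus a b tag

# strings keeps whitespace; stripped_strings trims it and drops empty nodes.
raw = list(div.strings)
clean = list(div.stripped_strings)
print(raw)    # ['  one  ', '\n', 'two ', 'bold']
print(clean)  # ['one', 'two', 'bold']

# parents climbs from a node up to the BeautifulSoup object.
ancestors = [parent.name for parent in soup.b.parents]
print(ancestors)  # ['p', 'div', '[document]']
```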

Practice example

Get all job titles

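# Assumed minimal markup: one tr per row, with the job title inside an a tag;
# attributes and link targets are omitted.
html = """
<table>
<tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""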

Approach

The data we want sits in the a tag inside each tr node, so iterate over the tr nodes and read the text of the a tag in each one.

Implementation

from bs4 import BeautifulSoup

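# Assumed minimal markup: one tr per row, with the job title inside an a tag;
# attributes and link targets are omitted.
html = """
<table>
<tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')

# Skip the first tr, which is the header row
tr_list = soup.find_all('tr')[1:]

for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)  # the job title is the text of the first a tag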

Output:

22989-金融云区块链高级研发工程师（深圳）
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师（深圳）
SNG16-腾讯音乐业务运维工程师（深圳）
TEG03-高级研发工程师（深圳）
TEG03-高级图像算法研发工程师（深圳）
TEG11-高级AI开发工程师（深圳）
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师（深圳）
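The same loop can be extended to keep every column rather than only the title. This is a sketch under stated assumptions: the two-row table is a shortened, hypothetical stand-in for the page above, the English header labels are invented for illustration, and 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample; header labels are illustrative.
html = """
<table>
<tr><td>Title</td><td>Category</td><td>Headcount</td><td>Location</td><td>Posted</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The first row supplies the keys for every later row.
header = [td.get_text(strip=True) for td in rows[0].find_all("td")]

jobs = []
for tr in rows[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    jobs.append(dict(zip(header, cells)))

for job in jobs:
    print(job["Title"], job["Posted"])
```

get_text(strip=True) is used instead of string because it still returns the text when a cell wraps it in a nested tag such as a.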

🔥 That covers the basic usage of bs4. If you have suggestions for improvement, feel free to leave a comment.

Original: https://blog.csdn.net/xqe777/article/details/123588660
Author: honker707
Title: BeautifulSoup的基本使用

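Original articles are protected by copyright. When reposting, please credit the source: https://www.johngo689.com/668814/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author.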
