Basic Usage of BeautifulSoup

August 1, 2023, 5:37 PM • Python • 50 views

✅About the author: Hi everyone, I'm hacker707; you can call me hacker
📃Home page: hacker707's CSDN blog
🔥Series: Python web scraping

bs4

- Installing bs4
- bs4 quick start
  - Parser comparison (for reference only)
  - Object types
- Simple usage of bs4
- Traversing the document tree

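Installing bs4

This tutorial uses the lxml parser with BeautifulSoup4, so install lxml first and then bs4 (on PyPI the library itself is published as beautifulsoup4; the bs4 package is a thin wrapper that pulls it in).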

pip install lxml

pip install bs4

Usage:

from bs4 import BeautifulSoup

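Comparing lxml and bs4

# lxml: parse the markup, then query it with XPath
from lxml import etree
tree = etree.HTML(html)        # html is a string of HTML markup
tree.xpath('//a/@href')        # xpath() needs an expression; this one is only an illustration

# bs4: parse the markup into a soup object, using lxml as the parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')   # html_doc is a string of HTML markup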

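Note:
If you create the soup object without passing 'lxml' or features="lxml" (that is, without naming a parser), bs4 emits a warning saying that no parser was explicitly specified and that it picked the best one available on the system.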
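To see the warning for yourself, here is a minimal sketch that records warnings while building two soup objects. It uses the built-in 'html.parser' for the explicit case so it runs even without lxml; that choice and the tiny markup string are assumptions made for the example.

```python
import warnings
from bs4 import BeautifulSoup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                 # do not let Python hide repeated warnings
    BeautifulSoup("<p>hello</p>")                   # no parser named: bs4 guesses and warns
    BeautifulSoup("<p>hello</p>", "html.parser")    # parser named: no parser warning

# Keep only the messages that talk about the parser choice
messages = [str(w.message) for w in caught if "parser" in str(w.message).lower()]
print(len(messages))
for m in messages:
    print(m.splitlines()[0])
```

Naming the parser also keeps results stable across machines, since the parser bs4 guesses depends on what is installed.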

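bs4 quick start

Parser comparison (for reference only)

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python; moderate speed | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; tolerant of malformed documents | Requires the lxml C library to be installed |
| lxml XML parser | BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml') | Fast; the only parser that supports XML | Requires the lxml C library to be installed |
| html5lib | BeautifulSoup(markup, 'html5lib') | Best tolerance of malformed documents; parses the way a browser does; produces HTML5 | Slow; requires an external Python package |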
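A quick way to feel the difference between parsers is to feed each one the same broken markup. The sketch below is only an illustration: the markup string is made up, and any parser that is not installed is skipped by catching bs4's FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"   # a stray closing tag, deliberately malformed

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None   # this parser is not installed on this machine

for parser, output in results.items():
    print(parser, "->", output)
```

The built-in html.parser simply drops the stray tag and returns `<a></a>`, while lxml and html5lib (when installed) wrap the result in a full html/body skeleton.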

Object types

Tag: a tag
BeautifulSoup: the soup object for the whole document
NavigableString: a navigable string (the text inside a tag)
Comment: a comment

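from bs4 import BeautifulSoup

# The <span> holding an HTML comment is included here so that
# soup.span.string has a Comment to return.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!-- an example comment --></span>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))          # Tag
print(type(soup))                # BeautifulSoup
print(type(soup.title.string))   # NavigableString
print(type(soup.span.string))    # Comment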

Simple usage of bs4

Getting tag contents

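from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
# soup.<tag> returns the first tag with that name
print('head tag:\n', soup.head)
print('body tag:\n', soup.body)
print('html tag:\n', soup.html)
print('p tag:\n', soup.p)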

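✅Note: printing soup.p shows only the first p tag. To get every p tag, use find_all:

print('p tags:\n', soup.find_all('p'))

✅Note that the tag name passed to find_all must be given as a string, such as 'p'.

Getting tag names

Use the name attribute to get a tag's name.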

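from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)
print('body tag name:\n', soup.body.name)
print('html tag name:\n', soup.html.name)
# find_all returns a list-like ResultSet, which has no .name of its own,
# so read .name from each tag in it
print('p tag names:\n', [p.name for p in soup.find_all('p')])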

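✅To find two different tags at once, pass a list filter rather than separate string filters.
Passing the tag names as separate strings returns an empty list, because the second positional argument of find_all is treated as a CSS class rather than as another tag name: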

print(soup.find_all('title', 'p'))

[]

Use a list filter to get several tags at once:

print(soup.find_all(['title', 'p']))

[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
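The different ways of calling find_all are easy to mix up, so here is a small sketch that puts them side by side. The markup is a shortened, made-up version of the document above, and 'html.parser' is used only so the snippet runs without lxml; both are assumptions for the example.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>
<p class="story">end</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Two strings: "find <title> tags whose class is 'p'", which matches nothing
no_match = soup.find_all("title", "p")
# Same form used on purpose: <p> tags whose class is 'story'
story_ps = soup.find_all("p", "story")
# Equivalent keyword form
story_ps_kw = soup.find_all("p", class_="story")
# List filter: any tag whose name is in the list, in document order
names = [tag.name for tag in soup.find_all(["title", "p"])]
# limit stops the search after the given number of matches
first_only = soup.find_all("p", limit=1)

print(no_match, len(story_ps), len(story_ps_kw), names, len(first_only))
```

So the empty list is not a bug in bs4: the call is valid, it just asks a different question from the one intended.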

Getting the href attribute of a tags

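from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')

for a in a_list:
    # Option 1: the get() method
    print(a.get('href'))
    # Option 2: the attrs dictionary
    print(a.attrs['href'])
    # Option 3: index the tag like a dictionary
    print(a['href'])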
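The three forms print the same value when the attribute exists, but they behave differently when it is missing. A minimal sketch, using a made-up two-link snippet and 'html.parser' (both assumptions for the example):

```python
from bs4 import BeautifulSoup

markup = '<a class="sister" href="http://example.com/elsie">Elsie</a><a>no link</a>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("a")

print(first["href"])              # the attribute exists, so indexing works

missing = second.get("href")      # get() returns None when the attribute is absent
fallback = second.get("href", "#")  # or a default that you choose
print(missing, fallback)

try:
    second["href"]                # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True
print(raised)

classes = first["class"]          # class can hold several values, so bs4 returns a list
print(classes)
```

When scraping real pages, where some links have no href, get() is usually the safer choice inside a loop.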

✅Extra: use prettify() to pretty-print the document, which makes the nesting of nodes clearer and easier to analyze

print(soup.prettify())

Output without prettify

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Output with prettify

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.

  </p>
  <p class="story">
   ...

  </p>
 </body>
</html>

Traversing the document tree

from bs4 import BeautifulSoup

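html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head

# contents: the direct children, as a list
print(head.contents)

# children: the direct children, as an iterator
print(head.children)

for h in head.children:
    print(h)
html = soup.html

# descendants: a generator over every node below the tag, at any depth
print(html.descendants)

for h in html.descendants:
    print(h)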

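'''
Key points to master
string: gets the text inside a tag
strings: returns a generator, used to get the text of several tags
stripped_strings: much like strings, but strips the surplus whitespace
'''
print(soup.title.string)
print(soup.html.string)  # None, because html has more than one child

print(soup.html.strings)  # a generator object
for h in soup.html.strings:
    print(h)

print(soup.html.stripped_strings)  # a generator object
for h in soup.html.stripped_strings:
    print(h)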
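'''
parent: gets the direct parent node
parents: gets all ancestor nodes
'''
title = soup.title

print(title.parent)  # the head tag

print(title.parents)  # a generator object
for p in title.parents:
    print(p)

print(soup.html.parent)  # the whole document

print(type(soup.html.parent))  # the BeautifulSoup object itself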
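Since string, strings and stripped_strings are the ones to master, here is a small side-by-side sketch on a one-line snippet. The markup is made up for the example, 'html.parser' is used so it runs without lxml, and get_text() is an extra bs4 method not covered above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello, <b>world</b>  </p>", "html.parser")
p = soup.p

single = p.string                      # None: p has several children, so there is no single string
inner = p.b.string                     # the b tag has exactly one child, its text
raw = list(p.strings)                  # every text node, whitespace included
clean = list(p.stripped_strings)       # each text node stripped, empty ones skipped
joined = p.get_text(" ", strip=True)   # all text joined with a separator

print(single, inner, raw, clean, joined, sep="\n")
```

This is also why soup.html.string prints None in the code above: string only works when a tag has a single child.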

Practice exercise

Get all the job titles

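html = """
<table>
    <tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
    <tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""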

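Approach

The data we want sits in the a tag inside each tr node. So we loop over all the tr nodes, skipping the first one (the header row), and read the text of the a tag in each.

Implementation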

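from bs4 import BeautifulSoup

html = """
<table>
    <tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
    <tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
    <tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
    <tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')

# Skip the first tr, which is the header row
tr_list = soup.find_all('tr')[1:]

for tr in tr_list:
    # The job title is the text of the first a tag in the row
    a_list = tr.find_all('a')
    print(a_list[0].string)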

The output is:

22989-金融云区块链高级研发工程师（深圳）
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师（深圳）
SNG16-腾讯音乐业务运维工程师（深圳）
TEG03-高级研发工程师（深圳）
TEG03-高级图像算法研发工程师（深圳）
TEG11-高级AI开发工程师（深圳）
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师（深圳）
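The same extraction can also be written with a CSS selector instead of nested find_all calls. This is only an alternative sketch, not the approach used above: the two-row table and its English job titles are made up for the example, and 'html.parser' is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the job table; the rows are invented sample data
html = """
<table>
    <tr><td>Job title</td><td>Category</td></tr>
    <tr><td><a>Backend developer</a></td><td>Tech</td></tr>
    <tr><td><a>Operations engineer</a></td><td>Tech</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "tr a" selects every a tag that sits somewhere inside a tr.
# The header row has no a tag, so it drops out without any slicing.
titles = [a.get_text(strip=True) for a in soup.select("tr a")]
print(titles)
```

The selector version is shorter, while the loop over tr nodes makes it easier to pull other cells from the same row later.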

🔥That covers the basic usage of bs4. If you have suggestions for improvement, feel free to leave a comment~

Original: https://blog.csdn.net/xqe777/article/details/123588660
Author: honker707
Title: BeautifulSoup的基本使用
