Chapter 3.4/3.5: Scrapy CSS Selectors and Chapter Summary

October 6, 2023, 6:00 AM • Python • 52 views

3.4 CSS Selectors

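CSS stands for Cascading Style Sheets, and its selectors are a language for locating parts of an HTML document.
CSS selector syntax is somewhat simpler than XPath, but it is also less powerful.
In fact, when we call the css method of a Selector object, it internally uses the Python library cssselect to translate the CSS selector expression into an XPath expression, and then calls the Selector object's xpath method.
Table 3-2 lists some of the basic CSS selector syntax.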

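Table 3-2 CSS selectors

| Expression | Description | Example |
|---|---|---|
| * | Selects all elements | * |
| E | Selects E elements | p |
| E1,E2 | Selects E1 and E2 elements | div,pre |
| E1 E2 | Selects E2 elements among the descendants of E1 | div p |
| E1>E2 | Selects E2 elements among the children of E1 | div>p |
| E1+E2 | Selects E2 elements that are the sibling immediately following an E1 | p+strong |
| .CLASS | Selects elements whose class attribute contains CLASS | .info |
| #ID | Selects the element whose id attribute is ID | #main |
| [ATTR] | Selects elements that have an ATTR attribute | [href] |
| [ATTR=VALUE] | Selects elements that have an ATTR attribute equal to VALUE | [method=post] |
| [ATTR~=VALUE] | Selects elements that have an ATTR attribute whose whitespace-separated value list contains VALUE | [class~=clearfix] |
| E:nth-child(n), E:nth-last-child(n) | Selects E elements that are the n-th child (or n-th child counting from the end) of their parent | a:nth-child(1), a:nth-last-child(2) |
| E:first-child, E:last-child | Selects E elements that are the first (or last) child of their parent | a:first-child, a:last-child |
| E:empty | Selects E elements that have no children | div:empty |
| E::text | Selects the text nodes of E elements | p::text |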

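As with XPath, we will go through some examples to show how CSS selectors are used.
First create an HTML document and construct an HtmlResponse object:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> # Element structure inferred from the outputs in this section;
>>> # the href, src, and style values are placeholders.
>>> body = '''
<html>
    <head>
        <base href='http://example.com/' />
        <title>Example website</title>
    </head>
    <body>
        <a href='image0.html'>Name: Image 0 <br/> <img src='image0.jpg' /></a>
        <div id='images-1' style="width: 1230px;">
            <a href='image1.html'>Name: Image 1 <br/> <img src='image1.jpg' /></a>
            <a href='image2.html'>Name: Image 2 <br/> <img src='image2.jpg' /></a>
            <a href='image3.html'>Name: Image 3 <br/> <img src='image3.jpg' /></a>
        </div>
        <div id='images-2' class='small'>
            <a href='image4.html'>Name: Image 4 <br/> <img src='image4.jpg' /></a>
            <a href='image5.html'>Name: Image 5 <br/> <img src='image5.jpg' /></a>
        </div>
    </body>
</html>
'''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')
>>> response
<200 http://www.example.com>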

● E: selects E elements.

>>> response.css('img')
[<Selector xpath='descendant-or-self::img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::img' data='<img ...>'>]

● E1,E2: selects E1 and E2 elements.

>>> response.css('base,title')
[<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base ...>'>,
 <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]

● E1 E2: selects E2 elements among the descendants of E1.

>>> response.css('div img')
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img ...>'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img ...>'>]

● E1>E2: selects E2 elements among the children of E1.

>>> response.css('body>div')
[<Selector xpath='descendant-or-self::body/div' data='<div ...>'>,
 <Selector xpath='descendant-or-self::body/div' data='<div ...>\n\t  ...'>]

● [ATTR]: selects elements that have an ATTR attribute.

>>> response.css('[style]')
[<Selector xpath='descendant-or-self::*[@style]' data='<div ...>'>]

● [ATTR=VALUE]: selects elements that have an ATTR attribute equal to VALUE.

>>> response.css('[id=images-1]')
[<Selector xpath="descendant-or-self::*[@id = 'images-1']" data='<div id="images-1" ...>'>]

● E:nth-child(n): selects E elements that are the n-th child of their parent.

>>> response.css('div>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a ...>Name: Image 1 ...'>,
 <Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a ...>Name: Image 4 ...'>]

>>> response.css('div:nth-child(3)>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 2]/a[count(preceding-sibling::*) = 0]' data='<a ...>Name: Image 4 ...'>]

● E:first-child: selects E elements that are the first child of their parent.
● E:last-child: selects E elements that are the last child of their parent.

>>> response.css('div:last-child>a:first-child')
[<Selector xpath='descendant-or-self::div[count(following-sibling::*) = 0]/a[count(preceding-sibling::*) = 0]' data='<a ...>Name: Image 4 ...'>]

● E::text: selects the text nodes of E elements.

Select the text of all a elements:

>>> sel = response.css('a::text')
>>> sel
[<Selector xpath='descendant-or-self::a/text()' data='Name: Image 0 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 1 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 2 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 3 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 4 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 5 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>]
>>> sel.extract()
['Name: Image 0 ',
 ' ',
 'Name: Image 1 ',
 ' ',
 'Name: Image 2 ',
 ' ',
 'Name: Image 3 ',
 ' ',
 'Name: Image 4 ',
 ' ',
 'Name: Image 5 ',
 ' ']

3.5 Chapter Summary

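This chapter covered how to extract data from pages. It first introduced the Selector object in Scrapy, then showed how to use a Selector object to select and extract data from a page, and finally walked through the usage of XPath and CSS selectors with a series of examples.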

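This post follows the PDF of 《精通Scrapy网络爬虫》 (by 刘硕). I ran the code myself with minor modifications and wrote these notes to record and explain how CSS selectors are used. It is intended only for reference and review.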

Original: https://blog.csdn.net/qq_27608761/article/details/121028927
Author: lee’s work
Title: Chapter 3.4/3.5: Scrapy CSS Selectors and Chapter Summary

