The Scrapy framework: CrawlSpider


CrawlSpider inherits from Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider adds a set of rules (Rule) that provide a convenient mechanism for following links, so it is better suited to extracting links from the crawled pages and continuing the crawl; you can also override its methods to implement custom behaviour. In short, it is a simple and efficient way to crawl sites whose URLs follow a fixed pattern.

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Rule constructor parameters:

Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)
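As a quick illustration (the URL pattern here is made up, not the one used in the example further down), a rule typically combines a LinkExtractor with a callback:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Hypothetical rule: follow links like /detail.php?id=123 and parse each matched page.
Rule(
    LinkExtractor(allow=r'/detail\.php\?id=\d+'),  # regex the candidate URLs must match
    callback='parse_item',   # name of the spider method that handles each response
    follow=True,             # keep extracting links from the pages this rule matches
)

Two things worth remembering: if callback is omitted, follow defaults to True (otherwise it defaults to False), and the callback should not be named parse, because CrawlSpider uses parse internally to drive the rules.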

The official documentation is here:

Spiders — Scrapy 2.5.1 documentation

Example:

1. Create the project: open a cmd window in the directory where the project should live and run scrapy startproject pigyitong
2. Create a CrawlSpider: scrapy genspider -t crawl pig "bj.zhue.com.cn"
Target URL: https://bj.zhue.com.cn/search_list.php?sort=&pid=22&s_id=19&cid=2209&county_id=0&mid=&lx=&page=1
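After these two commands the generated project layout looks roughly like this; spiders/pig.py is the file created by genspider that we edit next:

pigyitong/
├── scrapy.cfg
└── pigyitong/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── pig.py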

Because of publishing guidelines, only the code for the main project files is shown here:

pig.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from pigyitong.items import PigyitongItem

class PigSpider(CrawlSpider):
    name = 'pig'
    allowed_domains = ['bj.zhue.com.cn']
    start_urls = ['https://bj.zhue.com.cn/search_list.php?sort=&pid=22&s_id=19&cid=2209&county_id=0&mid=&lx=&page=1']

    # Follow every pagination link whose query string matches the pattern below
    # and hand each matched page to parse_item. follow=False means links found
    # on those pages are not extracted any further.
    rules = (
        Rule(LinkExtractor(allow=r'.*?sort=&pid=22&s_id=19&cid=2209&county_id=0&mid=&lx=&page=\d'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Locate a row with bgcolor="#efefef", go up to its parent table and take
        # every row; the first two rows are skipped below because they hold no data.
        tr = response.xpath('//tr[@bgcolor="#efefef"]/../tr')
        for i in tr[2:]:
            date = i.xpath('./td[1]/a/text()').get()
            province = i.xpath('./td[2]/a/text()').get()
            region = i.xpath('./td[3]/a/text()').get()
            p_name = i.xpath('./td[4]/a/text()').get()
            species = i.xpath('./td[5]/a/text()').get()
            price = i.xpath('./td[6]//li/text()').get()
            # The date goes into the field named 'data' in items.py.
            item = PigyitongItem(data=date, province=province, region=region,
                                 p_name=p_name, species=species, price=price)
            yield item
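Once pig.py is saved, the spider is started from the project root; every page matched by the rule goes through parse_item, and each yielded item is handed to the pipeline below:

scrapy crawl pig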

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class PigyitongPipeline:
    def open_spider(self, spider):
        # One JSON object per line; ensure_ascii=False keeps the Chinese text readable.
        self.f = open('猪易通.json', mode='wb')
        self.exporter = JsonLinesItemExporter(self.f, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.f.close()

items.py

import scrapy

class PigyitongItem(scrapy.Item):
    data = scrapy.Field()      # the listing date scraped in parse_item (field named 'data' in this project)
    province = scrapy.Field()
    region = scrapy.Field()
    p_name = scrapy.Field()
    species = scrapy.Field()
    price = scrapy.Field()

For settings.py the usual setup is enough: set a User-Agent, decide whether to obey robots.txt, and enable the pipeline in ITEM_PIPELINES so PigyitongPipeline actually runs.
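A minimal sketch of the relevant settings, assuming project defaults for everything else (the User-Agent string and download delay are just placeholders):

# settings.py (sketch)
BOT_NAME = 'pigyitong'
SPIDER_MODULES = ['pigyitong.spiders']
NEWSPIDER_MODULE = 'pigyitong.spiders'

ROBOTSTXT_OBEY = False            # assumption: skip robots.txt for this demo
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # placeholder UA
DOWNLOAD_DELAY = 1                # be gentle with the site

# Without this entry the exporter pipeline above never receives items.
ITEM_PIPELINES = {
    'pigyitong.pipelines.PigyitongPipeline': 300,
}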


Works perfectly!

Original: https://blog.csdn.net/zm024212/article/details/120628757
Author: Hill.
Title: scrapy框架之crawl spider
