Scrapy provides two kinds of customizable middleware plus one item processor:
| Name | Role | User settings |
| --- | --- | --- |
| Item processor (Item-Pipeline) | processes items | overridden |
| Downloader middleware (Downloader-Middleware) | processes request/response | merged |
| Spider middleware (Spider-Middleware) | processes item/response/request | merged |
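The "merged" vs. "overridden" distinction can be sketched with plain dicts: the middleware settings are combined with a non-empty built-in `*_BASE` default (setting an entry to `None` disables it), while `ITEM_PIPELINES_BASE` is empty, so a user pipelines dict effectively replaces it. The sketch below illustrates the idea only, not Scrapy's actual implementation; the two built-in middleware paths are real defaults but heavily abridged.

```python
def merge_settings(base, user):
    """Combine built-in *_BASE defaults with user-supplied entries.

    Entries set to None are disabled; the rest are returned in
    ascending priority order (lower number runs first).
    """
    combined = {**base, **user}
    enabled = {k: v for k, v in combined.items() if v is not None}
    return sorted(enabled, key=enabled.get)

# Abridged built-in defaults: middlewares ship with a non-empty
# _BASE dict, item pipelines with an empty one.
DOWNLOADER_MIDDLEWARES_BASE = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
}
ITEM_PIPELINES_BASE = {}

user_middlewares = {
    "scrapys.mymiddleware.MyMiddleware": 100,                    # merged in
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,  # disabled
}
user_pipelines = {"scrapys.mypipeline.MyPipeline": 100}

print(merge_settings(DOWNLOADER_MIDDLEWARES_BASE, user_middlewares))
print(merge_settings(ITEM_PIPELINES_BASE, user_pipelines))
```

The user middleware joins the defaults (and can knock one out with `None`), whereas the pipelines list ends up containing only what the user declared.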
Note: "user settings" here refers to the spider's custom_settings attribute.
Strangely, though, these classes just inherit from object, so you end up checking the documentation every time. Normally a framework would expose abstract methods as an interface for users to fill in with their own behavior; it is unclear why Scrapy was designed this way.
The code snippets and comments below briefly illustrate what the three components do.
1. Spider

baidu_spider.py

```python
from scrapy import Spider, cmdline


class BaiduSpider(Spider):
    name = "baidu_spider"

    start_urls = [
        "https://www.baidu.com/",
    ]

    custom_settings = {
        "SPIDER_DATA": "this is spider data",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapys.mymiddleware.MyMiddleware": 100,
        },
        "ITEM_PIPELINES": {
            "scrapys.mypipeline.MyPipeline": 100,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapys.myspidermiddleware.MySpiderMiddleware": 100,
        },
    }

    def parse(self, response):
        pass


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu_spider".split())
```
2. Pipeline

mypipeline.py

```python
class MyPipeline(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a Pipeline instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### pipeline get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_item(self, item, spider):
        """
        return item    -> continue processing
        raise DropItem -> discard the item
        """
        print("### call process_item")
        return item

    def open_spider(self, spider):
        """
        Called when the spider is opened.
        """
        print("### spider open {}".format(spider.name))

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        """
        print("### spider close {}".format(spider.name))
```
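To make the `raise DropItem` branch concrete, here is a minimal sketch of a pipeline that discards items missing a price field. `DropItem` normally comes from `scrapy.exceptions`; it is stubbed here so the sketch runs standalone, and the `price` field name is an assumed example.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class PriceFilterPipeline(object):
    """Drop any item lacking a 'price' field (hypothetical field name)."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price in {}".format(item))
        return item  # returned items continue down the pipeline chain


pipeline = PriceFilterPipeline()
print(pipeline.process_item({"name": "book", "price": 10}, spider=None))
try:
    pipeline.process_item({"name": "pen"}, spider=None)
except DropItem as exc:
    print("dropped:", exc)
```

A dropped item never reaches the lower-priority pipelines, which is the standard way to filter out incomplete records.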
3. Downloader-Middleware

mymiddleware.py

```python
class MyMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_request(self, request, spider):
        """
        return
            None:     continue processing the request
            Response: return this response instead of downloading
            Request:  reschedule the new request
        raise IgnoreRequest: process_exception -> Request.errback
        """
        print("### call process_request")

    def process_response(self, request, response, spider):
        """
        return
            Response: continue processing the response
            Request:  reschedule
        raise IgnoreRequest: Request.errback
        """
        print("### call process_response")
        return response

    def process_exception(self, request, exception, spider):
        """
        return
            None:     continue processing the exception
            Response: stop the exception chain, process this response
            Request:  reschedule
        """
        pass
```
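As a concrete use of `process_request`, the sketch below stamps a header on every outgoing request and returns None so the chain continues. The dummy `Request` class stands in for `scrapy.http.Request` so the example runs without a crawler, and the header name is an assumption.

```python
class Request(object):
    """Minimal stand-in for scrapy.http.Request."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


class HeaderMiddleware(object):
    def process_request(self, request, spider):
        # Tag every request; returning None lets processing continue
        # through the remaining downloader middlewares.
        request.headers["X-Crawler"] = "baidu_spider"
        return None


mw = HeaderMiddleware()
req = Request("https://www.baidu.com/")
mw.process_request(req, spider=None)
print(req.headers)
```

Returning a `Response` here instead would short-circuit the download entirely, which is how cache and mock middlewares work.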
4. Spider-Middleware

myspidermiddleware.py

```python
class MySpiderMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### spider middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_spider_input(self, response, spider):
        """
        Runs after a URL finishes downloading, before the response is
        handed to parse().
        return None: continue processing the response
        raise Exception
        """
        print("### call process_spider_input")

    def process_spider_output(self, response, result, spider):
        """
        Called with the result the spider callback returns; result must
        be an iterable of Item or Request objects, i.e. what you
        yield item / yield Request(url).
        return: iterable of Request, dict, or Item objects
        """
        print("### call process_spider_output")
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """
        return
            None
            iterable of Response, dict, or Item objects
        """
        pass

    def process_start_requests(self, start_requests, spider):
        """
        Runs when the spider first starts (before the start requests
        are processed).
        return: iterable of Request objects
        """
        return start_requests
```
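To make `process_spider_output` concrete, a common pattern is de-duplicating what the callback yields before it reaches the pipelines. A standalone sketch, with results modeled as plain dicts keyed by an assumed `url` field:

```python
class DedupSpiderMiddleware(object):
    """Drop duplicate results based on a (hypothetical) 'url' key."""

    def __init__(self):
        self.seen = set()

    def process_spider_output(self, response, result, spider):
        for item in result:
            url = item.get("url")
            if url in self.seen:
                continue  # silently drop the duplicate
            self.seen.add(url)
            yield item


mw = DedupSpiderMiddleware()
results = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
print(list(mw.process_spider_output(None, results, None)))
```

Because the method is a generator over `result`, it can drop, rewrite, or inject items and requests without the spider callback knowing.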
Run the spider and check the log:

```
middleware get spider_data: this is spider data
spider middleware get spider_data: this is spider data
pipeline get spider_data: this is spider data
spider open baidu_spider
call process_request
call process_response
call process_spider_input
call process_spider_output
spider close baidu_spider
```

Middleware startup order:
download middleware
spider middleware
pipeline
Handler call order:
spider open
process_request
process_response
process_spider_input
process_spider_output
spider close
Original: https://blog.csdn.net/weixin_36383052/article/details/114410255
Author: Kalimnos
Title: What is Scrapy middleware for in Python — Python crawlers: Scrapy Middleware and Pipeline