Scrapy's settings file: parameters explained across several versions


BOT_NAME = 'demo1'

SPIDER_MODULES = ['demo1.spiders']
NEWSPIDER_MODULE = 'demo1.spiders'

ROBOTSTXT_OBEY = True

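A few of these parameters explained:

ROBOTSTXT_OBEY = True ----------- whether to obey the target site's robots.txt rules

CONCURRENT_REQUESTS = 16 ----------- maximum number of concurrent requests performed by the downloader (default: 16); Scrapy is asynchronous, so these are not threads

AUTOTHROTTLE_START_DELAY = 3 ----------- initial download delay, in seconds, used when AutoThrottle is enabled

AUTOTHROTTLE_MAX_DELAY = 60 ----------- maximum download delay, in seconds, that AutoThrottle may set when latency is high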
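
To make the role of the two AutoThrottle values more concrete, here is a minimal sketch of one delay adjustment, loosely following the algorithm described in the Scrapy AutoThrottle documentation. The function name and its parameters are hypothetical, and the real extension may differ in detail; the point is only that the delay starts at AUTOTHROTTLE_START_DELAY and is never allowed to exceed AUTOTHROTTLE_MAX_DELAY.

```python
def next_autothrottle_delay(prev_delay, latency, status=200,
                            target_concurrency=1.0,
                            min_delay=0.0, max_delay=60.0):
    """Simplified, hypothetical sketch of one AutoThrottle adjustment step.

    prev_delay         -- current delay (starts at AUTOTHROTTLE_START_DELAY)
    latency            -- observed response latency in seconds
    target_concurrency -- stands in for AUTOTHROTTLE_TARGET_CONCURRENCY
    min_delay          -- stands in for DOWNLOAD_DELAY
    max_delay          -- stands in for AUTOTHROTTLE_MAX_DELAY
    """
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target delay.
    new_delay = (prev_delay + target_delay) / 2.0
    # Clamp the result between the minimum and maximum delays.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to decrease the delay.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay


if __name__ == "__main__":
    delay = 3.0  # AUTOTHROTTLE_START_DELAY
    for latency in (1.0, 0.5, 200.0):
        delay = next_autothrottle_delay(delay, latency)
        print(f"latency={latency:>6}s -> next delay={delay}s")
```

With a start delay of 3 seconds, fast responses pull the delay down gradually, while a very slow response pushes it up only as far as the configured maximum.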

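The last few settings in the file control the local HTTP cache. When it is enabled, Scrapy reads responses from the local cache first, which speeds up repeated crawls. Whether to enable it depends on the project.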

HTTPCACHE_ENABLED = True

HTTPCACHE_EXPIRATION_SECS = 0

HTTPCACHE_DIR = 'httpcache'

HTTPCACHE_IGNORE_HTTP_CODES = []

HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
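
As an illustration of what these cache settings mean, the sketch below shows a possible development-time configuration with comments, plus a small helper that mirrors the expiration rule (a value of 0 means cached responses never expire). The concrete values and the helper `is_cache_entry_expired` are assumptions made for this example, not part of Scrapy or of the original project.

```python
import time

# Illustrative development-time values; tune or disable for production crawls.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600   # seconds; 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'        # relative paths live under the project's data dir (.scrapy)
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # never cache these status codes
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


def is_cache_entry_expired(stored_at, expiration_secs, now=None):
    """Hypothetical helper: True if an entry stored at `stored_at` is too old.

    An expiration of 0 disables expiry, so the entry is always considered fresh.
    """
    now = time.time() if now is None else now
    return 0 < expiration_secs < now - stored_at
```

Ignoring server-error status codes is a common choice, since caching a temporary 5xx response would otherwise replay the failure on every later run.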

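The settings above can be enabled as the project requires. A couple of settings, however, are best enabled every time, and turning them on by hand in each project's settings file is tedious, so ideally they would be enabled automatically when the project is created.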


DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
SCHEDULER_ORDER = 'BFO'
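
The FIFO queue classes are what turn the crawl breadth-first: by default Scrapy keeps pending requests in LIFO queues, which gives a depth-first order. The toy sketch below shows the effect of the queue discipline on visiting order. The link graph and the `crawl_order` function are hypothetical, and the sketch ignores request priorities such as the adjustment made by DEPTH_PRIORITY.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(start, fifo):
    """Return the order in which pages are visited for a FIFO or LIFO queue."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        # FIFO takes the oldest pending page, LIFO takes the newest one.
        page = queue.popleft() if fifo else queue.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    print("FIFO (breadth-first):", crawl_order("home", fifo=True))
    print("LIFO (depth-first):  ", crawl_order("home", fifo=False))
```

With the FIFO queue every page at depth 1 is visited before any page at depth 2, whereas the LIFO queue follows one branch to the end before returning to the others.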

CONCURRENT_REQUESTS = 100

import time  # needed for the date stamp in LOG_FILE; normally placed at the top of settings.py

LOG_FILE = BOT_NAME + '_' + time.strftime("%Y%m%d", time.localtime()) + '.log'

LOG_LEVEL = 'INFO'

LOG_ENABLED = True

LOG_ENCODING = 'utf-8'

LOG_STDOUT = False
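
The LOG_FILE expression produces one log file per day, named after the bot. A minimal sketch of the same naming rule as a standalone function is given below. The helper name `daily_log_name` and its optional `when` argument are assumptions added so the result can be checked for a fixed date.

```python
import time


def daily_log_name(bot_name, when=None):
    """Build '<bot_name>_<YYYYMMDD>.log' for a struct_time (default: now)."""
    when = time.localtime() if when is None else when
    return bot_name + '_' + time.strftime("%Y%m%d", when) + '.log'


if __name__ == "__main__":
    fixed_day = time.strptime("2022-07-09", "%Y-%m-%d")
    print(daily_log_name("demo1", fixed_day))  # name for a fixed date
    print(daily_log_name("demo1"))             # name for today
```

Because the date is evaluated when the settings module is imported, a crawl that runs past midnight keeps writing to the file it started with.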

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Bot (project) name
BOT_NAME = 'step8_king'

# 2. Modules where the project's spiders live
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. User-Agent request header sent by the client
USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey robots.txt (the crawler exclusion rules)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html
# See also autothrottle settings and docs
# 6. Download delay in seconds
DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Maximum concurrent requests per domain; the download delay is also applied per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Maximum concurrent requests per IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN
# is ignored and the download delay is applied per IP
CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled; cookies are managed through a cookiejar
COOKIES_ENABLED = True
COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
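
The interaction between DOWNLOAD_DELAY and the two per-slot concurrency limits can be sketched as follows. Both function names are hypothetical. The sketch assumes Scrapy's default behaviour of randomizing the delay between 0.5 and 1.5 times DOWNLOAD_DELAY (the RANDOMIZE_DOWNLOAD_DELAY setting), and the rule stated above that a non-zero per-IP limit takes precedence over the per-domain limit.

```python
import random


def effective_delay(download_delay, randomize=True):
    """Seconds to wait before the next request to the same slot (domain or IP)."""
    if randomize:
        # Default behaviour: a random value in [0.5 * delay, 1.5 * delay].
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay


def slot_limit(per_domain, per_ip):
    """Return which limit applies: ('ip', n) if per_ip is non-zero, else ('domain', n)."""
    if per_ip:
        return ("ip", per_ip)
    return ("domain", per_domain)


if __name__ == "__main__":
    # Values from the settings file above.
    print(slot_limit(per_domain=2, per_ip=3))
    print(round(effective_delay(2), 2), "seconds before the next request")
```

With the values in this file, requests are grouped per IP with at most 3 in flight, and consecutive requests to the same IP are spaced roughly 1 to 3 seconds apart.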

Original: https://blog.csdn.net/qq_27109535/article/details/125692094
Author: 默默爬行的虫虫
Title: Scrapy's settings file: parameters explained across several versions
