Scrapy settings file: parameters explained across several versions of settings.py
BOT_NAME = 'demo1'
SPIDER_MODULES = ['demo1.spiders']
NEWSPIDER_MODULE = 'demo1.spiders'
ROBOTSTXT_OBEY = True
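A few of the parameters explained:
ROBOTSTXT_OBEY = True-----------whether to obey the site's robots.txt
CONCURRENT_REQUESTS = 16-----------maximum number of concurrent requests (Scrapy is asynchronous, so these are not threads), default 16
AUTOTHROTTLE_START_DELAY = 3-----------initial download delay, in seconds, when AutoThrottle is enabled
AUTOTHROTTLE_MAX_DELAY = 60-----------maximum download delay, in seconds, that AutoThrottle will set when latency is high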
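To make the two AutoThrottle values concrete, here is a minimal sketch of how the delay moves between requests, following the algorithm described in the Scrapy AutoThrottle documentation. The function name next_delay is made up for illustration, and the target_concurrency and min_delay parameters stand in for AUTOTHROTTLE_TARGET_CONCURRENCY and DOWNLOAD_DELAY, which are assumptions not shown in the settings above. It is not Scrapy's own code.

```python
def next_delay(current_delay, latency, *, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0, status=200):
    """Illustrative AutoThrottle-style delay update (hypothetical helper)."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the current delay towards the target...
    new_delay = (current_delay + target_delay) / 2.0
    # ...but never go below the target itself.
    new_delay = max(target_delay, new_delay)
    # Clamp between the minimum delay and AUTOTHROTTLE_MAX_DELAY.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


delay = 3.0  # AUTOTHROTTLE_START_DELAY
for latency in (0.5, 0.5, 120.0):
    delay = next_delay(delay, latency)
    print(latency, delay)
```

With a start delay of 3 seconds and fast responses, the delay shrinks step by step; a very slow response pushes it up, but never past the 60-second cap.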
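The last few settings in the file control the local HTTP cache. When it is enabled, Scrapy reads responses from the local cache first, which speeds up repeated crawls. Whether to enable it depends on the project.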
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
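One detail that is easy to misread is HTTPCACHE_EXPIRATION_SECS = 0: according to the Scrapy documentation, zero means cached responses never expire, not that they expire immediately. A small sketch of that rule follows; the function name is_cache_entry_fresh and its parameters are hypothetical and only illustrate the documented behaviour.

```python
import time


def is_cache_entry_fresh(stored_at, expiration_secs, now=None):
    """Return True if a response cached at `stored_at` may still be reused.

    Hypothetical helper: `stored_at` and `now` are Unix timestamps.
    """
    if now is None:
        now = time.time()
    age = now - stored_at
    # An entry is stale only when expiration is enabled (> 0)
    # and the entry is older than the configured number of seconds.
    return not (0 < expiration_secs < age)


# HTTPCACHE_EXPIRATION_SECS = 0: entries stay valid however old they are.
print(is_cache_entry_fresh(stored_at=0, expiration_secs=0, now=10**9))
# A 60-second expiry: 30 seconds old is fresh, 100 seconds old is stale.
print(is_cache_entry_fresh(stored_at=100, expiration_secs=60, now=130))
print(is_cache_entry_fresh(stored_at=100, expiration_secs=60, now=200))
```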
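The settings above can be enabled as the project requires. Two sets of settings, however, are worth enabling every time (breadth-first crawl order and logging). Turning them on by hand in every project file is tedious, so it is better to have them enabled automatically as soon as the project is created.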
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
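SCHEDULER_ORDER = 'BFO'  # setting from early Scrapy releases; recent versions take the crawl order from the three settings above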
CONCURRENT_REQUESTS = 100
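The scheduler settings above switch Scrapy from its default LIFO queues (roughly depth-first) to FIFO queues, which gives breadth-first order. The difference is easy to see on a toy link graph. Everything in this sketch (the LINKS dictionary, the page names, the crawl_order function) is invented for illustration and has nothing to do with Scrapy's internals.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}


def crawl_order(start, fifo):
    """Visit pages starting at `start` using a FIFO or LIFO pending queue."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # FIFO takes the oldest pending page, LIFO the newest one.
        page = pending.popleft() if fifo else pending.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


print(crawl_order("home", fifo=True))   # breadth-first: level by level
print(crawl_order("home", fifo=False))  # depth-first: follows one branch down
```

With the FIFO queue every page at depth 1 is visited before any page at depth 2, which is the behaviour the FIFO queue settings are meant to produce.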
import time  # required by LOG_FILE below; belongs at the top of settings.py
LOG_FILE = BOT_NAME + '_' + time.strftime("%Y%m%d", time.localtime()) + '.log'
LOG_LEVEL = 'INFO'
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_STDOUT = False
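The LOG_FILE expression builds one log file name per day from the bot name. The same logic is sketched below as a small function so it can be tried outside a project; the name daily_log_name and the optional when parameter are hypothetical additions for illustration.

```python
import time


def daily_log_name(bot_name, when=None):
    """Build '<bot_name>_<YYYYMMDD>.log' for the given local time."""
    if when is None:
        when = time.localtime()
    return bot_name + "_" + time.strftime("%Y%m%d", when) + ".log"


# A fixed date, so the result does not depend on today's date.
fixed_day = time.strptime("2022-07-09", "%Y-%m-%d")
print(daily_log_name("demo1", fixed_day))
```

Because settings.py is evaluated once when the crawl starts, the date in the file name is the start date: a crawl that runs past midnight keeps writing to the same file.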
# -*- coding: utf-8 -*-
# Scrapy settings for step8_king project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# 1. Bot (project) name
BOT_NAME = 'step8_king'
# 2. Module paths where the spiders live
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. User-Agent request header sent by the client
USER_AGENT = 'step8_king (+http://www.yourdomain.com)'
# Obey robots.txt rules
# 4. Whether to obey the site's robots.txt restrictions (False: they are ignored)
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
CONCURRENT_REQUESTS = 4
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html
# See also autothrottle settings and docs
# 6. Download delay in seconds
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
# 7. Concurrent requests per domain; the download delay is also applied per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrent requests per IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP instead
CONCURRENT_REQUESTS_PER_IP = 3
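The relationship between the two limits above can be written down as a tiny rule. This is only a sketch of the documented precedence, not Scrapy's downloader code; the function name slot_policy and its return format are hypothetical.

```python
def slot_policy(per_domain, per_ip):
    """Return (what requests are grouped by, the concurrency limit applied)."""
    # A non-zero per-IP limit takes over: requests are grouped by IP
    # and the per-domain limit is ignored.
    if per_ip:
        return ("ip", per_ip)
    return ("domain", per_domain)


print(slot_policy(per_domain=2, per_ip=3))  # values used in this settings file
print(slot_policy(per_domain=2, per_ip=0))  # per-IP limit switched off
```

With the values in this file (2 per domain, 3 per IP), the per-IP limit of 3 is the one in effect, so the per-domain value of 2 is never applied.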
# Disable cookies (enabled by default)
# 8. Whether cookies are enabled; cookies are handled through a cookiejar
COOKIES_ENABLED = True
COOKIES_DEBUG = True
# Disable Telnet Console (enabled by default)
Original: https://blog.csdn.net/qq_27109535/article/details/125692094
Author: 默默爬行的虫虫
Title: Scrapy中的settings配置文件多个版本的参数详解 (Scrapy settings file: parameters explained across several versions)