Basic crawler commands
Creating a project
Before you start crawling, you must create a new Scrapy project. Enter the directory where you want the project to live and run the following command in a terminal:
scrapy startproject mySpider
Generating a spider file
Generate a spider named itcast whose allowed domain is itcast.cn.
This value restricts the range of domains the spider is allowed to crawl;
if you later modify the code to add other domains, you must update it accordingly.
scrapy genspider itcast "itcast.cn"
The resulting changes in the generated spider file:
name = "itcast"
allowed_domains = ['itcast.cn']
Starting the spider
- Run from the command line in a terminal:
scrapy crawl itcast
- Or create a launcher script for the spider:
from scrapy import cmdline
# spider launcher script
cmdline.execute("scrapy crawl baidu_news".split())
Common scrapy.Request parameters:
encoding: the default 'utf-8' is usually fine.
dont_filter: tells the scheduler not to filter this request. Use it when you want to send the same request more than once, ignoring the duplicate filter. Defaults to False.
errback: specifies an error-handling callback.
method: usually does not need to be set; defaults to GET. Can be set to "GET", "POST", "PUT", etc., and the string must be uppercase.
Basic XPath syntax:
- /bookstore/book[price>35.00] : selects book elements under bookstore whose price child has a value greater than 35.00
- //title[@lang='eng'] : selects all title elements that have a lang attribute with the value eng
- /bookstore/book[1] : selects the first book element under bookstore
- //a/text() : selects the text content of a tags
- //a/@href : selects the value of the href attribute of a tags
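A quick way to try these expressions outside a spider is lxml, which Scrapy's own selectors are built on. The bookstore document below is sample data made up for this sketch:

```python
from lxml import etree

doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

# book elements whose price child is greater than 35.00
expensive = doc.xpath("/bookstore/book[price>35.00]")
# all title elements carrying lang='eng'
eng_titles = doc.xpath("//title[@lang='eng']")
# the first book under bookstore (XPath indexing starts at 1)
first = doc.xpath("/bookstore/book[1]")
# text content of every title element
texts = doc.xpath("//title/text()")
```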
Writing the modules
middlewares.py
In middlewares.py, build a pool of user-agents so that each request picks one at random.
You need to enable DOWNLOADER_MIDDLEWARES in settings.py and register this class there.
import random

# set a random user-agent on every request
class BaiduDownloaderMiddleware(object):
    user_agents = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # Chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        # Liebao browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # UC browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
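The corresponding settings.py entry might look like the fragment below. The module path baidu.middlewares is an assumption based on the class above; adjust it to your own project name:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'baidu.middlewares.BaiduDownloaderMiddleware': 543,
}
```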
items.py
Declare the fields you want to scrape:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Then import it in your spider file:
from ..items import DmozItem
pipelines.py
All of the pipelines below require enabling ITEM_PIPELINES in settings.py and registering the class there.
Saving as JSON
import json

class BibiPipeline(object):
    def process_item(self, item, spider):
        # convert the Item to a plain dict so json can serialize it
        with open('bibi.json', 'a', encoding='utf-8') as fp:
            json.dump(dict(item), fp, ensure_ascii=False)
            fp.write('\n')
        print("{} saved".format(item['题目']))
        return item
Saving as txt
class BookSpiderPipeline(object):
    def process_item(self, item, spider):
        with open('%s.txt' % item['分类'], 'a', encoding='utf-8') as fp:
            fp.write('title: {}\ncontent: {}\nurl: {}'.format(item['title'], item['content'], item['url']))
            fp.write('\n')
        return item
Saving as CSV
import csv

headers = ['展示面料', '羽毛球拍等级', '级别', '服装厚度指数', '服装弹力指数', '鞋软硬指数', '鞋透气指数', '乒乓配件分类', '故事包', '运动项目', '羽拍最高磅数', '支付方式', '产品名称', '羽拍重量等级', '商品类型', '羽拍性能特点', '羽毛球球速', '展示故事包', '商品名称', '羽拍中杆硬度', '展示科技', '产品规格/尺寸/厚度', '鞋底', '运动类型', '颜色', '商品价格', '款型', '展示产品系列', '店铺活动', '品牌名称', '产品系列', '乒乓成拍类型', '性别']

class JingdongPipeline(object):
    def process_item(self, item, spider):
        with open('lining.csv', 'a', encoding='utf-8', newline='') as fp:
            writer = csv.DictWriter(fp, headers)
            writer.writerow(item)
        print(item)
        return item
Saving to MySQL
from pymysql import *

class BaiduPipeline(object):
    def process_item(self, item, spider):
        # connect to the database
        conn = connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123456',
            database='news_baidu',
            charset="utf8",
        )
        # cursor
        cur = conn.cursor()
        # parameterized query: the driver handles quoting and escaping
        sql = "insert into b_news(title,url,content) values (%s,%s,%s)"
        try:
            cur.execute(sql, (item['title'], item['url'], item['content']))
            conn.commit()
        except Exception as err:
            print(err)
            conn.rollback()
        cur.close()
        conn.close()
        return item
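Opening a connection per item is simple but slow. A common refinement, sketched here under the same connection settings, is to open the connection once per spider run using Scrapy's open_spider/close_spider hooks:

```python
class BaiduPipeline(object):
    """Open one MySQL connection per spider run instead of one per item."""

    def open_spider(self, spider):
        import pymysql  # imported here so the sketch stands alone
        self.conn = pymysql.connect(
            host='127.0.0.1', port=3306, user='root',
            password='123456', database='news_baidu', charset='utf8',
        )

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            # parameterized query: the driver handles quoting and escaping
            cur.execute(
                "insert into b_news(title,url,content) values (%s,%s,%s)",
                (item['title'], item['url'], item['content']),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```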
settings.py:
LOG_LEVEL = "WARNING"  # only show messages at WARNING level and above
DOWNLOAD_DELAY = 1  # delay between consecutive requests
CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests. Note: DOWNLOAD_DELAY throttles requests and can prevent the configured concurrency from ever being reached.
ROBOTSTXT_OBEY = False  # True means the spider obeys robots.txt, False means it ignores it
DEFAULT_REQUEST_HEADERS overrides the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
Other scraping tips
COOKIES_ENABLED in settings.py defaults to True; it is best to leave it enabled. To send your own cookies, parse a cookie string into a dict and pass it to the request:
cookie = "gr_user_id=0be58042-245c-4efb-9647-942ad4a313f7; grwng_uid=c8d15705-e672-469f-b10f-541d77c6a2fe; AGL_USER_ID=2b2f99a4-ffe0-475c-90d9-0e3278de8277; 8de2f524d49e13e1_gr_last_sent_cs1=1546032; preference_select=preference_select; report_prompt_disc=1; 8de2f524d49e13e1_gr_session_id=cb2df81a-f14d-41cb-846b-e8ede5abebc0; 8de2f524d49e13e1_gr_last_sent_sid_with_cs1=cb2df81a-f14d-41cb-846b-e8ede5abebc0; 8de2f524d49e13e1_gr_session_id_cb2df81a-f14d-41cb-846b-e8ede5abebc0=true; Hm_lvt_ec1d5de03c39d652adb3b5432ece711d=1582278877,1582358901,1582359990,1582373249; Hm_lpvt_ec1d5de03c39d652adb3b5432ece711d=1582373249; 8de2f524d49e13e1_gr_cs1=1546032; Hm_lvt_40163307b5932c7d36838ff99a147621=1582278877,1582358901,1582359990,1582373249; Hm_lpvt_40163307b5932c7d36838ff99a147621=1582373249; userinfo_id=1546032; NO_REFRESH_JWT=1; POP_USER=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ3ZWJzaXRlIjoiMSIsImp3dFR5cGUiOiJvZmZpY2lhbCIsImV4cCI6MTU4NDk1MjAwMywidXNlciI6eyJjbGllbnRJZCI6IjVlNTExOTgwYmQxNzMyLjI3NDg0OTcwIiwidXNlcklkIjoxNTQ2MDMyLCJjaGlsZElkIjoiIiwibW9iaWxlIjoiMTc2NDAyNDQzMDYiLCJhY2NvdW50IjoiMTc2NDAyNDQzMDYiLCJpcCI6IjExMS4zNS4yMDkuMzEiLCJ1c2VyX2Zyb20iOiIxIiwiY2hlY2tJcE51bSI6ZmFsc2V9fQ.F1GEGa0W_tNNSeo-bt2ri3bRZGMCP6vBjXN-A6UkBWc; POP_UID=6d2d52a39457829a93155b1fb31125b0"
# split on the first '=' only: cookie values (e.g. the JWT above) contain '=' themselves
cookies = {i.split('=', 1)[0].strip(): i.split('=', 1)[1] for i in cookie.split(';')}
yield scrapy.Request(
    url=detail_url,
    callback=self.parse_detail,
    cookies=cookies,
    meta={"info": title}
)
Joining URLs
Use the built-in response.urljoin(url) method.
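response.urljoin(url) resolves a relative URL against the response's own URL; it behaves like urllib.parse.urljoin, which you can try standalone (the example URLs are the tutorial's lab.scrapyd.cn pages):

```python
from urllib.parse import urljoin

base = 'http://lab.scrapyd.cn/page/1/'
# an absolute path replaces the whole path of the base URL
second = urljoin(base, '/page/2/')
# a relative path is appended to the base URL's directory
tagged = urljoin(base, 'tag/art/')
```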
Passing scraped values between callbacks
When passing scraped values between callback functions via meta, use deepcopy() where necessary.
# sending:
meta = {"info": (title, detail_url, type_, content)}
yield scrapy.Request(
    url=detail_url,
    meta={"info": (title, detail_url, type_, content)},
    callback=self.parse_detail
)
# receiving in the next callback:
title, url, type_, content = response.meta.get('info')
# deepcopy example:
meta = {"item": deepcopy(item)}
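Why deepcopy matters: if the same mutable item is reused across loop iterations while requests are yielded, every request's meta would end up pointing at the item's final state. A minimal illustration with plain dicts:

```python
from copy import deepcopy

item = {}
metas = []
for title in ['a', 'b']:
    item['title'] = title
    metas.append({'item': deepcopy(item)})  # snapshot the current state

item['title'] = 'c'  # later mutation leaves the snapshots untouched
titles = [m['item']['title'] for m in metas]  # ['a', 'b'], not ['c', 'c']
```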
Encoding console output as gb18030:
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
Using a non-greedy regex to extract the domain
base = re.findall("(https|http)://(.*?)/", new_url)[0][1]
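Trying the pattern on this article's own URL as a sample input:

```python
import re

new_url = "https://blog.csdn.net/lijiamingccc/article/details/124219822"
# the non-greedy (.*?) stops at the first '/' after the scheme
base = re.findall("(https|http)://(.*?)/", new_url)[0][1]
```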
Converting between JSON and strings
Converting an object to a JSON string:
json_str = json.dumps(books, ensure_ascii=False)
ensure_ascii=False leaves non-ASCII characters unescaped.
Parsing a JSON string back into an object (avoid naming the result json, which would shadow the module):
data = json.loads(json_str)
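A round trip with sample data (the books list is made up for this sketch):

```python
import json

books = [{'title': '爬虫基础', 'price': 35.0}]     # sample data
json_str = json.dumps(books, ensure_ascii=False)  # Chinese stays readable in the output
data = json.loads(json_str)                       # back to Python objects
```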
start_requests
If you override this method, start_urls is ignored; it is typically used to build POST requests or to construct the initial list of requests.
def start_requests(self):
    urls = ['http://lab.scrapyd.cn/page/1/', 'http://lab.scrapyd.cn/page/2/']
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
scrapy.FormRequest
Used to build POST requests. scrapy.Request can also send a POST request,
but a plain Request cannot carry formdata (the POST parameters).
FormRequest:
yield scrapy.FormRequest(
    url="http://192.168.29.54:8080/resale/list{}.do".format(key.capitalize()),
    formdata={"page": "1"},
    meta={'key': key},
    callback=self.parse_is_detail
)
Request
yield scrapy.Request(
    url="http://192.168.29.54:8080/resale/list{}.do".format(key.capitalize()),
    method='POST',
    meta={'key': key},
    callback=self.parse_is_detail
)
PyExecJS: executing JS from Python
When a site's anti-scraping logic requires evaluating JavaScript, capture the relevant JS by inspecting the front end, execute it from Python, and use the return value as a request parameter for the query.
Original: https://blog.csdn.net/lijiamingccc/article/details/124219822
Author: 加油strive
Title: Scrapy爬虫基本命令 | 各类配置文件的使用 | 其他的爬虫小技巧