Scrapy Crawler Basic Commands | Using the Various Configuration Files | Other Crawler Tips

Basic Crawler Commands

Creating a New Project

Before you start scraping, you must create a new Scrapy project. Enter the directory where you want the project to live and run the following command in a terminal:

scrapy startproject mySpider

Generating the Spider File

Generate a spider named itcast whose allowed domain is itcast.cn.
This defines the range of domains the spider is allowed to crawl.
If you later change the code to add other domains, update this setting as well.

scrapy genspider itcast "itcast.cn"
>> The generated code looks like this:
name = "itcast"
allowed_domains = ['itcast.cn']

Running the Spider

  1. Run it from the terminal:
scrapy crawl itcast
  2. Or create a launcher script for the spider:
from scrapy import cmdline

# Spider launcher script
cmdline.execute("scrapy crawl baidu_news".split())

Commonly used scrapy.Request parameters:

encoding: the default 'utf-8' is fine.
dont_filter: tells the scheduler not to filter this request. Use it when you want to issue the same request more than once and bypass the duplicates filter. Defaults to False.
errback: specifies an error-handling callback.
method: usually does not need to be specified; defaults to GET. It can be set to "GET", "POST", "PUT", etc., and the string must be uppercase.
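A minimal sketch (inside a spider callback) putting these parameters together; the URL and the handle_error / parse_result callbacks are placeholder assumptions:

yield scrapy.Request(
    url="http://example.com/search",   # placeholder URL
    method="POST",                     # must be an uppercase string
    encoding="utf-8",                  # the default
    dont_filter=True,                  # skip the duplicates filter for this request
    errback=self.handle_error,         # hypothetical error handler
    callback=self.parse_result,        # hypothetical parse callback
)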

XPath basics:

Expression — What it selects
/bookstore/book[price>35.00] — book elements under bookstore whose price child has a value greater than 35.00
//title[@lang='eng'] — all title elements that have a lang attribute with the value 'eng'
/bookstore/book[1] — the first book element that is a child of bookstore
//a/text() — the text content of a tags
//a/@href — the value of the href attribute of a tags
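For example, inside a Scrapy callback these expressions are used through the response selector; the expressions below are illustrative:

# extract the text and href of every link on the page
titles = response.xpath('//a/text()').getall()
links = response.xpath('//a/@href').getall()
# take just the first match
first_title = response.xpath('//title/text()').get()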

Writing the Project Modules

middlewares.py

In middlewares.py, build a pool of user-agent strings and pick one at random for each request.
You also need to enable DOWNLOADER_MIDDLEWARES in settings.py and update its contents (the corresponding entry is shown after the middleware code below).

import random

# Set a random User-Agent on every request
class BaiduDownloaderMiddleware(object):
    user_agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    # Firefox
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    # Safari
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
    # chrome
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    # Liebao Browser (LBBROWSER)
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    # QQ Browser
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    # Sogou Browser
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
    # UC Browser
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]
    def process_request(self,request,spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
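The matching settings.py entry might look like the following; the module path 'baidu.middlewares' is an assumption based on the class name above, so adjust it to your own project package:

DOWNLOADER_MIDDLEWARES = {
    # 543 is the priority used in Scrapy's generated settings template
    'baidu.middlewares.BaiduDownloaderMiddleware': 543,
}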

items.py

Define the fields you want to scrape.

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Then add the following import in your spider file:

from ..items import DmozItem
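A short sketch of filling the item inside a spider callback; the XPath expressions are placeholders for whatever the target page actually needs:

def parse(self, response):
    item = DmozItem()
    item['title'] = response.xpath('//title/text()').get()
    item['link'] = response.url
    item['desc'] = response.xpath('//meta[@name="description"]/@content').get()
    yield item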

pipelines.py

All of the operations below require ITEM_PIPELINES to be enabled and updated in settings.py, for example:
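The entry for the first pipeline below could look like this (the module path is an assumption; use your own project package name):

ITEM_PIPELINES = {
    'bibi.pipelines.BibiPipeline': 300,
}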
Saving JSON

import json

class BibiPipeline(object):
    def process_item(self, item, spider):
        with open('bibi.json', 'a', encoding='utf-8') as fp:
            # convert the Item to a plain dict so json can serialize it
            json.dump(dict(item), fp, ensure_ascii=False)
            fp.write('\n')  # one JSON object per line
            print("{} saved successfully".format(item['题目']))
        return item

Saving TXT

class BookSpiderPipeline(object):
    def process_item(self, item, spider):
        with open('%s.txt' % item['分类'], 'a', encoding='utf-8') as fp:
            fp.write('Title: {}\nContent: {}\nURL: {}'.format(item['title'], item['content'], item['url']))
            fp.write('\n')
        return item

Saving CSV

import csv

headers = ['展示面料', '羽毛球拍等级', '级别', '服装厚度指数', '服装弹力指数', '鞋软硬指数', '鞋透气指数', '乒乓配件分类', '故事包', '运动项目', '羽拍最高磅数', '支付方式', '产品名称', '羽拍重量等级', '商品类型', '羽拍性能特点', '羽毛球球速', '展示故事包', '商品名称', '羽拍中杆硬度', '展示科技', '产品规格/尺寸/厚度', '鞋底', '运动类型', '颜色', '商品价格', '款型', '展示产品系列', '店铺活动', '品牌名称', '产品系列', '乒乓成拍类型', '性别']
class JingdongPipeline(object):
    def process_item(self, item, spider):
        with open('lining.csv', 'a', encoding='utf-8', newline='') as fp:
            writer = csv.DictWriter(fp, headers)
            # note: write the header row once (e.g. in open_spider), not per item
            writer.writerow(dict(item))
        print(item)
        return item

Saving to MySQL

from pymysql import *

class BaiduPipeline(object):
    def process_item(self, item, spider):
        # Configure the database connection (opened per item here; a reusable-connection variant is sketched after this block)
        conn = connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123456',
            database='news_baidu',
            charset="utf8",
        )
        # Cursor
        cur = conn.cursor()

        # use a parameterized query so quotes in the data cannot break the SQL
        sql = "insert into b_news(title, url, content) values (%s, %s, %s)"
        try:
            cur.execute(sql, (item['title'], item['url'], item['content']))
            conn.commit()
        except Exception as err:
            print(err)
            conn.rollback()

        cur.close()
        conn.close()
        return item
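Opening a new connection for every item works but is slow. A common refinement, sketched here reusing the same pymysql import, table, and credentials as above, is to open the connection once in open_spider and close it in close_spider:

class BaiduPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = connect(host='127.0.0.1', port=3306, user='root',
                            password='123456', database='news_baidu', charset='utf8')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = "insert into b_news(title, url, content) values (%s, %s, %s)"
        try:
            self.cur.execute(sql, (item['title'], item['url'], item['content']))
            self.conn.commit()
        except Exception as err:
            print(err)
            self.conn.rollback()
        return item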

settings.py:

LOG_LEVEL = "WARNING"  show only warnings and more severe messages
DOWNLOAD_DELAY = 1  the delay between consecutive requests, in seconds
CONCURRENT_REQUESTS = 32  the maximum number of concurrent requests

Note: DOWNLOAD_DELAY throttles CONCURRENT_REQUESTS, so with a delay set the concurrency may not be visible.

ROBOTSTXT_OBEY = False  True means obey robots.txt, False means ignore it
DEFAULT_REQUEST_HEADERS overrides the default request headers:

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

Other Crawling Tips

Crawling with Cookies

COOKIES_ENABLED in settings.py defaults to True, but it is still best to enable it explicitly.

cookie = "gr_user_id=0be58042-245c-4efb-9647-942ad4a313f7; grwng_uid=c8d15705-e672-469f-b10f-541d77c6a2fe; AGL_USER_ID=2b2f99a4-ffe0-475c-90d9-0e3278de8277; 8de2f524d49e13e1_gr_last_sent_cs1=1546032; preference_select=preference_select; report_prompt_disc=1; 8de2f524d49e13e1_gr_session_id=cb2df81a-f14d-41cb-846b-e8ede5abebc0; 8de2f524d49e13e1_gr_last_sent_sid_with_cs1=cb2df81a-f14d-41cb-846b-e8ede5abebc0; 8de2f524d49e13e1_gr_session_id_cb2df81a-f14d-41cb-846b-e8ede5abebc0=true; Hm_lvt_ec1d5de03c39d652adb3b5432ece711d=1582278877,1582358901,1582359990,1582373249; Hm_lpvt_ec1d5de03c39d652adb3b5432ece711d=1582373249; 8de2f524d49e13e1_gr_cs1=1546032; Hm_lvt_40163307b5932c7d36838ff99a147621=1582278877,1582358901,1582359990,1582373249; Hm_lpvt_40163307b5932c7d36838ff99a147621=1582373249; userinfo_id=1546032; NO_REFRESH_JWT=1; POP_USER=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ3ZWJzaXRlIjoiMSIsImp3dFR5cGUiOiJvZmZpY2lhbCIsImV4cCI6MTU4NDk1MjAwMywidXNlciI6eyJjbGllbnRJZCI6IjVlNTExOTgwYmQxNzMyLjI3NDg0OTcwIiwidXNlcklkIjoxNTQ2MDMyLCJjaGlsZElkIjoiIiwibW9iaWxlIjoiMTc2NDAyNDQzMDYiLCJhY2NvdW50IjoiMTc2NDAyNDQzMDYiLCJpcCI6IjExMS4zNS4yMDkuMzEiLCJ1c2VyX2Zyb20iOiIxIiwiY2hlY2tJcE51bSI6ZmFsc2V9fQ.F1GEGa0W_tNNSeo-bt2ri3bRZGMCP6vBjXN-A6UkBWc; POP_UID=6d2d52a39457829a93155b1fb31125b0"
cookies = {i.split('=', 1)[0].strip(): i.split('=', 1)[1] for i in cookie.split(';')}
yield scrapy.Request(
    url=detail_url,
    callback=self.parse_detail,
    cookies=cookies,
    meta={"info": title}
)

Joining URLs

Use the built-in response.urljoin(url) method.
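For example, turning a relative href into an absolute URL before following it; the XPath and callback name are placeholders:

detail_url = response.urljoin(response.xpath('//a/@href').get())
yield scrapy.Request(url=detail_url, callback=self.parse_detail)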

Passing Scraped Data Between Callbacks

Pass scraped data between callbacks via meta; use deepcopy() when necessary (for example when the same item object is shared by several yielded requests).

# Sending:
meta={"info": (title, detail_url, type_, content)}
yield scrapy.Request(
    url=detail_url,
    meta={"info": (title, detail_url, type_, content)},
    callback=self.parse_detail
)

# Receiving in the next callback:
title, url, type_, content = response.meta.get('info')

# deepcopy example (requires: from copy import deepcopy):
meta={"item": deepcopy(item)}

Decoding console output with gb18030:

import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

Extracting the domain with a non-greedy regex

import re

base = re.findall("(https|http)://(.*?)/", new_url)[0][1]

Converting Between JSON and Strings

Convert a Python object to a JSON string:
json_str = json.dumps(books, ensure_ascii=False)
ensure_ascii=False keeps non-ASCII characters unescaped.

Convert a JSON string back to a Python object (use a name other than json so the module is not shadowed):
data = json.loads(json_str)

start_requests

If you define this method, start_urls is ignored. It is typically used to build POST requests or to construct the initial list of requests.

def start_requests(self):
    urls = ['http://lab.scrapyd.cn/page/1/','http://lab.scrapyd.cn/page/2/']
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

scrapy.FormRequest

Used to build POST requests. scrapy.Request can also send a POST request,
but it cannot carry formdata (the POST parameters) the way FormRequest does.

FormRequest
yield scrapy.FormRequest(
    url="http://192.168.29.54:8080/resale/list{}.do".format(key.capitalize()),
    formdata={"page":"1"},
    meta={'key':key},
    callback=self.parse_is_detail
)

Request
yield scrapy.Request(
    url="http://192.168.29.54:8080/resale/list{}.do".format(key.capitalize()),
    method='POST',
    meta={'key': key},
    callback=self.parse_is_detail
)

PyExecJS: Running JS from Python

When a site's anti-scraping logic requires evaluating JavaScript, capture the relevant JS from the browser's developer tools, execute it with Python, and use the return value as a request parameter.
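A minimal sketch with PyExecJS; the file name sign.js and the function get_sign are hypothetical placeholders for whatever the target site actually uses:

import execjs  # pip install PyExecJS

# compile the JS captured from the page, then call one of its functions
with open('sign.js', encoding='utf-8') as f:
    ctx = execjs.compile(f.read())
sign = ctx.call('get_sign', 'search keyword')  # hypothetical function and argument
# pass `sign` along as a query parameter or form field in the next request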

Original: https://blog.csdn.net/lijiamingccc/article/details/124219822
Author: 加油strive
Title: Scrapy Crawler Basic Commands | Using the Various Configuration Files | Other Crawler Tips
