Learning Web Scraping: Scrapy

October 5, 2023, 10:17 PM • Python • 42 views

– Preface
– The Scrapy framework
– Scrapy + Selenium automation

Preface

These are my notes from learning Scrapy.

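The Scrapy framework

Scrapy is a crawling framework for fetching data and processing it. You implement a crawler by writing a few modules, and the framework lets you post-process the scraped data however you need.

[Figure: Scrapy architecture diagram]

For an introduction to Scrapy's components and the steps of its data flow, see this blog post: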
https://www.cnblogs.com/wcwnina/p/10399810.html

Using Scrapy
Create a project: scrapy startproject proname
Enter the project: cd proname
Create a spider: scrapy genspider spiname xxx.com (spiname is the spider name, xxx.com is the domain to crawl)

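As a simple example, we scrape book information from the dangdang site.
Write the crawler code in the generated spider (testdang) under Spiders.
The generated start_urls holds the URLs to crawl. Once it is filled in, inspect the site: the data is easy to locate in the page, so we parse it out.
With the data extracted, the next question is what structure to collect it in.
This is where Item comes in. Think of it as a container whose fields you define yourself.
Code: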

Spider

import scrapy

from studydang.items import StudydangItem


class TestdangSpider(scrapy.Spider):
    name = 'testdang'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=PYTHON&act=input&page_index=1']

    def parse(self, response):
        print('*' * 70)
        # Each <li> under ul.bigimg is one book in the search results.
        dlist = response.css("ul.bigimg li")

        for i in dlist:
            print('=' * 70)
            item = StudydangItem()
            item['pic'] = i.css("a.pic img::attr(src)").extract_first()
            # Fall back to the data-original attribute when src is missing.
            if item['pic'] is None:
                item['pic'] = i.css("a.pic img::attr(data-original)").extract_first()
            item['name'] = i.css("p.name a::attr(title)").extract_first()
            item['author'] = i.css("p.search_book_author a::attr(title)").extract_first()
            item['price'] = i.css("p.price span.search_now_price::text").extract_first()
            # NOTE: a lazy group with nothing after it matches the empty string;
            # anchor the pattern on the text that follows the date in the page.
            item['time'] = i.re_first(" /(.*?)")
            yield item
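
The pattern passed to re_first in the spider ends with a lazy group and nothing after it, which is worth checking in isolation. The following is a minimal sketch using only the standard re module; the sample string is made up for illustration and the real dangdang markup may differ.

```python
import re

# Hypothetical listing fragment (an assumption, not copied from dangdang).
sample = "<span>Author Name</span><span> /2021-06-01</span><span> /Some Press</span>"

# A lazy group with nothing after it is satisfied by zero characters.
unanchored = re.search(r" /(.*?)", sample)
print(repr(unanchored.group(1)))

# Anchoring the group on the text that follows forces it to capture the date.
anchored = re.search(r" /(.*?)</span>", sample)
print(repr(anchored.group(1)))
```

In a real spider the anchor would be whatever text reliably follows the publication date on the page.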

Items

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class StudydangItem(scrapy.Item):
    # define the fields for your item here like:
    pic = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    time = scrapy.Field()

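Another advantage of the framework is that it makes persisting the data straightforward.
Here we store the data in a MySQL database.
Storing to a database boils down to three steps: connect to the database, execute the SQL, and close the connection.
First, put the connection parameters (user, password, and so on) in settings, so they are defined in one place instead of being retyped and mistyped. Then uncomment ITEM_PIPELINES in settings. Note that the smaller a pipeline's number, the earlier it runs.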
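
A settings excerpt along these lines might look as follows. This is only a sketch: the MYSQL_* names match what the pipeline reads, but every value is a placeholder, and the priority 300 is an assumption rather than the author's actual configuration.

```python
# settings.py (excerpt): all values below are placeholders.
MYSQL_HOST = "127.0.0.1"
MYSQL_PORT = 3306
MYSQL_DB = "scrapy_demo"      # hypothetical database name
MYSQL_USER = "scrapy_user"    # hypothetical account
MYSQL_PASSWD = "change-me"    # placeholder password
MYSQL_CHARSET = "utf8mb4"

# Pipelines run in ascending order of this number.
ITEM_PIPELINES = {
    "studydang.pipelines.StudydangPipeline": 300,
}
```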
Then edit pipelines.
Code:


import pymysql

from studydang import settings

# twisted.enterprise.adbapi (Twisted's asynchronous database API) can be used for
# asynchronous writes; this pipeline uses a plain synchronous pymysql connection.


class StudydangPipeline:
    def __init__(self):
        # Connect to the database using the parameters prepared in settings.
        self.db = pymysql.connect(
            host=settings.MYSQL_HOST,
            database=settings.MYSQL_DB,
            user=settings.MYSQL_USER,
            password=settings.MYSQL_PASSWD,
            port=settings.MYSQL_PORT,
            charset=settings.MYSQL_CHARSET,
        )

        # Create a cursor.
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Run the SQL that stores the scraped data in the database.
        self.cursor.execute(
            "insert into test(pic,name,author,price,time) values (%s,%s,%s,%s,%s)",
            (item['pic'], item['name'], item['author'], item['price'], item['time']))

        # Commit the insert.
        self.db.commit()

        # Return the item so later pipelines and the -o feed export still receive it.
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes:
        # close the cursor and the database connection.
        self.cursor.close()
        self.db.close()
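
The connect, execute, close cycle can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 as a stand-in for MySQL and uses the open_spider, process_item and close_spider hooks that Scrapy calls on a pipeline. The class name and the sample record are hypothetical.

```python
import sqlite3


class SqliteSketchPipeline:
    """Hypothetical stand-in for a MySQL pipeline, backed by in-memory SQLite."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE test (pic, name, author, price, time)")

    def process_item(self, item, spider):
        # Called for every item: parameterized insert, then commit.
        self.db.execute(
            "INSERT INTO test (pic, name, author, price, time) VALUES (?, ?, ?, ?, ?)",
            (item["pic"], item["name"], item["author"], item["price"], item["time"]),
        )
        self.db.commit()
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.db.close()


# Drive the hooks by hand, in the order Scrapy would call them.
pipeline = SqliteSketchPipeline()
pipeline.open_spider(spider=None)
book = {"pic": "p.jpg", "name": "Demo", "author": "A", "price": "1.00", "time": "2021"}
returned = pipeline.process_item(book, spider=None)
row_count = pipeline.db.execute("SELECT COUNT(*) FROM test").fetchone()[0]
pipeline.close_spider(spider=None)
print(row_count)
```

Note that SQLite uses ? placeholders where pymysql uses %s; the lifecycle is otherwise the same.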

Next, run the spider:
scrapy crawl spiname -o file.json
That is all it takes.

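Along the way you may notice that my project layout is slightly different: it contains an extra init.py file.
That file is there for debugging. To run a Scrapy project in debug mode, create a .py file in the same directory as scrapy.cfg, put the following code in it, and debug that file: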

from scrapy.cmdline import execute
import os
import sys
if __name__ == '__main__':

    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy','crawl','testdang'])

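Scrapy + Selenium automation

When a site renders its data dynamically as the page loads, you need Selenium to open the site in an automated browser to get at the data.
Scrapy uses Selenium mainly through a downloader middleware: the middleware opens the page with Selenium and returns the fully loaded page as the response.
Let's get straight to it: the example automatically clicks "next page" on an image site to keep collecting data.
When the spider sends a URL to the scheduler and the request reaches the downloader, we need the downloader middleware enabled and overridden so that it opens the site with Selenium and grabs the page data.
Selenium itself is not covered in detail here; we simply create the driver and open the site to be crawled.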

Downloader middleware (in middlewares)

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware. Browser start-up and
        # shutdown are hooked to the spider's opened/closed signals.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_close, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # Start Chrome through Selenium. Selenium 4 takes the chromedriver path
        # via Service; older releases used webdriver.Chrome(executable_path=...).
        self.chrome = webdriver.Chrome(
            service=Service('xxxxxxxxxxxxxxxxxxxxxx'))

    def spider_close(self, spider):
        # Quit the browser when the spider finishes.
        self.chrome.quit()

    def process_response(self, request, response, spider):
        # First visit to the site's first page: load the requested URL.
        if not request.meta.get("nextPage", False):
            self.chrome.get(request.url)
        # Otherwise click the "下一页" (next page) link in the open page.
        else:
            self.chrome.find_element(By.LINK_TEXT, "下一页").click()
        webhtml = self.chrome.page_source

        # Wrap the rendered page in an HtmlResponse and return it
        # in place of the original response.
        response = HtmlResponse(url=request.url, body=webhtml, encoding="utf-8")

        return response
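
A downloader middleware only runs once it is enabled in settings. A minimal sketch of that entry follows; the module path assumes the class lives in studydang/middlewares.py, and the priority 543 is an arbitrary value chosen for illustration.

```python
# settings.py (excerpt)
DOWNLOADER_MIDDLEWARES = {
    # Assumed module path and an illustrative priority, not the author's actual values.
    "studydang.middlewares.SeleniumMiddleware": 543,
}
```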

Spider

import scrapy
from lxml import etree
from studydang.items import StudydangItem


class TestdangSpider(scrapy.Spider):
    name = 'pic'
    allowed_domains = ['netbian.com']
    start_urls = ['https://pic.netbian.com/4kmeinv/']
    # Page counter used to limit how many pages are crawled
    p = 2

    def parse(self, response):
        html = response.body.decode(encoding="utf-8")

        xpathBody = etree.HTML(html)
        listLie = xpathBody.xpath('//div[@class="slist"]/ul/li')
        number = 0

        for i in listLie:
            item = StudydangItem()
            item['name'] = str(i.xpath('./a/@href')[0])
            # Yield the first 20 entries of the page as items.
            if number < 20:
                number = number + 1
                yield item
            # Past the 20th entry, request the next page while under the page limit.
            else:
                if self.p < 3:
                    self.p = self.p + 1
                    # meta={"nextPage": True} tells the middleware to click "next page".
                    yield scrapy.Request(url=response.url, meta={"nextPage": True}, callback=self.parse, dont_filter=True)

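I won't go into the parsing either; an earlier post covers parsing this image site.
The conditionals only limit the number of pages crawled, and the logic is very simple.
Note that meta in scrapy.Request is how a parameter reaches the middleware: it carries the flag that decides whether to click "next page".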
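
The flag check can be isolated from Scrapy and Selenium entirely. The sketch below mirrors the branch on a plain dict standing in for request.meta; the function name and the returned labels are hypothetical.

```python
def choose_action(meta):
    """Mirror the middleware's branch: click 'next page' only when the flag is set."""
    return "click_next" if meta.get("nextPage", False) else "load_url"


first_request_meta = {}                  # start_urls requests carry no flag
follow_up_meta = {"nextPage": True}      # what the spider attaches for later pages
print(choose_action(first_request_meta))
print(choose_action(follow_up_meta))
```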

The item is the one already written, used as before to store the data in the database, so it is not shown again here.

Original: https://blog.csdn.net/weixin_42750816/article/details/117419680
Author: 胡萝卜粥
Title: 爬虫学习之scrapy (Learning Web Scraping: Scrapy)
