Learning Web Scraping with Scrapy

October 5, 2023, 10:17 PM • Python


– Preface
– The Scrapy framework
– Scrapy + Selenium automation

Preface

These notes record my study of Scrapy.

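The Scrapy framework

Scrapy is a crawling framework for fetching data and processing it. A working crawler takes only a few modules, and the scraped data can be processed however you need.

[Figure: Scrapy architecture diagram]

For an introduction to Scrapy's components and the steps of its data flow, see this blog post:
https://www.cnblogs.com/wcwnina/p/10399810.html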

Using Scrapy
Create a project: scrapy startproject proname
Enter the project: cd proname
Create a spider: scrapy genspider spiname xxx.com (spiname is the spider name, xxx.com is the domain to crawl)

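As a simple example, we scrape book information from the dangdang website.
The crawling code goes in the generated spider (testdang) under spiders.
start_urls in the generated file holds the URLs to crawl. Once it is filled in, inspect the site; the data is easy to locate in the page, and we parse it out.
With the data parsed, the next question is what format to receive it in.
This is where Item comes in: think of it as a container whose fields you define yourself.
Code: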

Spider

import scrapy

from studydang.items import StudydangItem


class TestdangSpider(scrapy.Spider):
    name = 'testdang'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=PYTHON&act=input&page_index=1']

    def parse(self, response):
        # Each search result is one <li> under ul.bigimg
        dlist = response.css("ul.bigimg li")

        for i in dlist:
            item = StudydangItem()
            # Cover image: use src, falling back to data-original when src is missing
            item['pic'] = i.css("a.pic img::attr(src)").extract_first()
            if item['pic'] is None:
                item['pic'] = i.css("a.pic img::attr(data-original)").extract_first()
            item['name'] = i.css("p.name a::attr(title)").extract_first()
            item['author'] = i.css("p.search_book_author a::attr(title)").extract_first()
            item['price'] = i.css("p.price span.search_now_price::text").extract_first()
            # Publication date; assumes the listing shows it as " /YYYY-MM-DD".
            # A bare lazy group such as " /(.*?)" would always capture an empty string.
            item['time'] = i.re_first(r" /(\d{4}-\d{2}-\d{2})")
            yield item
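
A regex detail that is easy to get wrong when pulling a field such as the publication date out of a listing: a lazy group at the very end of a pattern is satisfied by zero characters, so it captures an empty string. The sketch below uses only the standard `re` module. The sample text is a made-up stand-in for what one result's author line might look like (an assumption, not real site output), and `first_group` is a hypothetical helper that imitates what `re_first` returns.

```python
import re

# Hypothetical text of one result's author line (an assumption, not real data).
sample = "Some Author /2020-05-01 /Some Press"


def first_group(pattern, text):
    """Imitate Selector.re_first: return the first captured group, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# A trailing lazy group matches as little as possible, which is nothing at all.
lazy = first_group(r" /(.*?)", sample)

# Giving the group an explicit date shape captures the date itself.
dated = first_group(r"/\s*(\d{4}-\d{2}-\d{2})", sample)

print(repr(lazy), repr(dated))
```

If the site formats dates differently, the date-shaped pattern needs adjusting; the point is only that the group must be forced to consume something.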

Items

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class StudydangItem(scrapy.Item):
    # define the fields for your item here like:
    pic = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    time = scrapy.Field()

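Next comes another benefit of the framework: persisting the data.
Here we store the data in a MySQL database.
In its simplest form, storing to a database means connecting to it, executing SQL, and closing the connection.
First put the user, password and other connection parameters in settings (keeping them in one place avoids retyping them and the mistakes that come with it), then enable ITEM_PIPELINES in settings. Note that the smaller the number, the earlier the pipeline runs.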
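
A minimal sketch of the relevant part of settings.py. The MYSQL_* names are the ones the pipeline code reads; every value is a placeholder assumption, and 300 is simply the priority the project template suggests. The last few lines are an illustration of the ordering rule only and would not go in settings.py; `CleanPipeline` is a hypothetical second pipeline that does not exist in this project.

```python
# settings.py (fragment): all values below are placeholder assumptions.
MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "test"
MYSQL_USER = "root"
MYSQL_PASSWD = "change-me"
MYSQL_CHARSET = "utf8mb4"

# Enable the pipeline. The value is a priority, by convention in the
# 0-1000 range; pipelines with smaller numbers run first.
ITEM_PIPELINES = {
    "studydang.pipelines.StudydangPipeline": 300,
}

# Illustration only (not part of settings.py): with two pipelines enabled,
# sorting by priority gives the order in which items pass through them.
example = {
    "studydang.pipelines.StudydangPipeline": 300,
    "studydang.pipelines.CleanPipeline": 100,  # hypothetical pipeline
}
run_order = sorted(example, key=example.get)
print(run_order)
```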
Then edit pipelines.
Code:


import pymysql

from studydang import settings

# Note: twisted.enterprise.adbapi can make the writes asynchronous;
# this simple version writes synchronously with pymysql.


class StudydangPipeline:
    def __init__(self):
        # Connect to the database with the parameters prepared in settings
        self.db = pymysql.connect(host=settings.MYSQL_HOST,
                                  database=settings.MYSQL_DB,
                                  user=settings.MYSQL_USER,
                                  password=settings.MYSQL_PASSWD,
                                  port=settings.MYSQL_PORT,
                                  charset=settings.MYSQL_CHARSET)

        # Create a cursor
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Execute the SQL statement to store the scraped data in the database
        self.cursor.execute(
            "insert into test(pic,name,author,price,time) values (%s,%s,%s,%s,%s)",
            (item['pic'], item['name'], item['author'], item['price'], item['time']))

        # Commit
        self.db.commit()

        # Return the item so later pipelines and the -o exporter still receive it
        return item

    def close_spider(self, spider):
        # Scrapy calls this when the spider finishes: close the cursor and the connection
        self.cursor.close()
        self.db.close()
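
The connect / execute / commit / close cycle can be tried without a MySQL server. The sketch below swaps in the standard-library `sqlite3` module as a stand-in for `pymysql` (note that `sqlite3` uses `?` placeholders where `pymysql` uses `%s`). The class name `SqlitePipelineSketch`, the in-memory database and the sample row are all assumptions made for illustration, and Scrapy itself is not involved.

```python
import sqlite3


class SqlitePipelineSketch:
    """Same lifecycle as an item pipeline, minus Scrapy: open, process, close."""

    COLUMNS = ("pic", "name", "author", "price", "time")

    def open_spider(self, spider=None):
        # Connect once and create the target table in an in-memory database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE test (pic, name, author, price, time)")

    def process_item(self, item, spider=None):
        # Parameterized query: values are passed separately from the SQL text.
        self.db.execute(
            "INSERT INTO test (pic, name, author, price, time) VALUES (?, ?, ?, ?, ?)",
            tuple(item[column] for column in self.COLUMNS),
        )
        self.db.commit()
        return item  # hand the item on to whatever comes next

    def close_spider(self, spider=None):
        self.db.close()


pipeline = SqlitePipelineSketch()
pipeline.open_spider()
row = {"pic": "cover.jpg", "name": "Sample Book", "author": "Someone",
       "price": "10.00", "time": "2020-05-01"}
returned = pipeline.process_item(row)
stored = pipeline.db.execute("SELECT name, price FROM test").fetchall()
pipeline.close_spider()
print(stored)
```

Passing the values as parameters rather than formatting them into the SQL string keeps quotes in scraped titles from breaking the statement.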

Next,
run the spider: scrapy crawl spiname -o file.json
and that is all it takes.

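Along the way you may notice that my project layout is slightly different: it has an extra init.py file.
It is there for debugging. To run a Scrapy project under a debugger, create a .py file in the same directory as scrapy.cfg, put the following code in it, and debug that file.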

from scrapy.cmdline import execute
import os
import sys
if __name__ == '__main__':

    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy','crawl','testdang'])

Scrapy + Selenium automation

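When a site renders its data through dynamic loading, Selenium is needed to open the site in an automated browser and get the data.
Scrapy uses Selenium mainly through a downloader middleware: the middleware opens the page with Selenium and returns the fully loaded page as the response.
Let's get hands-on: automatically click "next page" on a picture site and keep collecting data.
The spider hands the URL to the scheduler. Around the download step, our overridden middleware (enabled under DOWNLOADER_MIDDLEWARES in settings) steps in, opens the site with Selenium, and fetches the page data.
Selenium usage itself is not covered here; we simply create the driver and open the site to crawl.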

The downloader middleware in middlewares

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware.

        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):

        # Start Chrome through Selenium (Selenium 4 takes the driver path via Service)
        self.chrome = webdriver.Chrome(
            service=Service(executable_path='xxxxxxxxxxxxxxxxxxxxxx'))

    def spider_closed(self, spider):
        # Shut the browser down when the spider finishes
        self.chrome.quit()

    def process_response(self, request, response, spider):

        # First visit: load the requested URL in the browser
        if not request.meta.get("nextPage", False):
            self.chrome.get(request.url)
        # Otherwise click "下一页" (next page) in the browser that is already open
        else:
            self.chrome.find_element(By.LINK_TEXT, "下一页").click()
        webhtml = self.chrome.page_source

        # Wrap the rendered page source in an HtmlResponse and hand it to the spider
        response = HtmlResponse(url=request.url, body=webhtml, encoding="utf-8")

        return response
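
To follow the branching in `process_response` without launching a browser, the sketch below replaces Chrome with a tiny fake driver. `FakeDriver`, `click_next`, `render` and the example URL are hypothetical names invented for this illustration; only the `nextPage` meta key comes from the middleware.

```python
class FakeDriver:
    """Stand-in for a Selenium driver: records actions instead of driving a browser."""

    def __init__(self):
        self.page = 0
        self.log = []

    def get(self, url):
        # Loading a URL always lands on the first page.
        self.page = 1
        self.log.append(f"get {url}")

    def click_next(self):
        # Clicking "next page" advances whatever page is currently open.
        self.page += 1
        self.log.append("click next")

    @property
    def page_source(self):
        return f"<html>page {self.page}</html>"


def render(driver, url, meta):
    """Mirror the middleware's decision: load the URL, or click through to the next page."""
    if not meta.get("nextPage", False):
        driver.get(url)
    else:
        driver.click_next()
    return driver.page_source


driver = FakeDriver()
first = render(driver, "https://example.com/list/", {})
second = render(driver, "https://example.com/list/", {"nextPage": True})
print(first, second, driver.log)
```

When `nextPage` is set, the URL plays no part in what gets rendered: the browser's current state does. That is why a follow-up request can reuse the same URL, as long as Scrapy's duplicate filter is bypassed for it.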

Spider

import scrapy
from lxml import etree

from studydang.items import StudydangItem


class TestdangSpider(scrapy.Spider):
    name = 'pic'
    allowed_domains = ['netbian.com']
    start_urls = ['https://pic.netbian.com/4kmeinv/']
    # Page counter used to limit how many pages are crawled
    p = 2

    def parse(self, response):
        html = response.body.decode(encoding="utf-8")

        xpathBody = etree.HTML(html)
        listLie = xpathBody.xpath('//div[@class="slist"]/ul/li')

        # Yield at most 20 items per page
        for i in listLie[:20]:
            links = i.xpath('./a/@href')
            if not links:
                continue
            item = StudydangItem()
            item['name'] = str(links[0])
            yield item

        # While under the page limit, ask for the next page. The URL stays the same
        # (hence dont_filter=True); meta tells the middleware to click "next page".
        if self.p < 3:
            self.p = self.p + 1
            yield scrapy.Request(url=response.url, meta={"nextPage": True},
                                 callback=self.parse, dont_filter=True)

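The parsing is not explained further here; an earlier post covered parsing this picture site.
The check is only there to limit how many pages are crawled; the logic is very simple.
Note that meta in scrapy.Request passes a parameter to the middleware, telling it whether to click through to the next page.

The item class is already written; as before it is for storing into the database, so it is not shown again.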

Original: https://blog.csdn.net/weixin_42750816/article/details/117419680
Author: 胡萝卜粥
Title: Learning Web Scraping with Scrapy (爬虫学习之scrapy)
