Scrapy Parsing and Database Storage

October 3, 2023, 12:00 AM • Python • 63 reads

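Learning `Scrapy` features

1 `scrapy` data extraction

Scrapy provides its own data extraction mechanism, the Selector. Selector is built on lxml and supports XPath selectors, CSS selectors, and regular expressions, so it covers the common extraction needs while keeping lxml's parsing speed and accuracy.

1.1 Direct use

Selector is a module that can be used on its own. We can construct a selector object directly from the Selector class and then call its methods, such as xpath and css, to extract data.

For example, given a piece of HTML, we can build a Selector object from it and extract data, as the test code in 1.2.1 does.

1.2 `xpath` selectors

1.2.1 Test code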

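# Minimal markup consistent with the queries and results in this section
html = '''
<html>
 <head>
  <title>Example website</title>
 </head>
 <body>
   <a>Name: My image 1 </a>
   <a>Name: My image 2 </a>
   <a>Name: My image 3 </a>
   <a>Name: My image 4 </a>
   <a>Name: My image 5 </a>
 </body>
</html>
'''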

Build the object

response = Selector(text=html)

Extract nodes

result = response.xpath('//a')

Note: ordinary XPath syntax works here.

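1.3 Regular expression matching

Scrapy selectors also support regular expression matching. For example, the text of each a node in the sample looks like Name: My image 1, and we only want the part after Name:. The re method handles this:

response.xpath('//a/text()').re(r'Name:\s(.*)')

The re() method takes a regular expression; the group (.*) is the content to capture. When the pattern has more than one group, the captures of every match come back in order in a single flat list:

print(response.xpath('//a/text()').re(r'(.*?):\s(.*)'))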
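As a rough model of what re() does with the extracted strings, the sketch below applies a pattern to each string with the standard re module and flattens the captured groups. The helper name re_like and the two sample strings are assumptions made for illustration; the sketch imitates the results shown in this section and is not Scrapy's own implementation.

```python
import re


def re_like(texts, pattern):
    """Apply pattern to every string and flatten all captured groups."""
    regex = re.compile(pattern)
    results = []
    for text in texts:
        for match in regex.findall(text):
            # findall yields a tuple when the pattern has several groups
            if isinstance(match, tuple):
                results.extend(match)
            else:
                results.append(match)
    return results


# Hypothetical strings, shaped like the text of the a nodes
texts = ['Name: My image 1 ', 'Name: My image 2 ']

print(re_like(texts, r'Name:\s(.*)'))
print(re_like(texts, r'(.*?):\s(.*)'))
```

Raw strings (the r prefix) keep \s from being read as a Python string escape.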

Extracting the first returned value

extract() returns every match as a list; extract_first() returns only the first match.

from scrapy import Selector

# Selector is the parsing module of the Scrapy framework

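# Minimal markup consistent with the queries and results in this section
html = '''
<html>
 <head>
  <title>Example website</title>
 </head>
 <body>
   <a>Name: My image 1 </a>
   <a>Name: My image 2 </a>
   <a>Name: My image 3 </a>
   <a>Name: My image 4 </a>
   <a>Name: My image 5 </a>
 </body>
</html>
'''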
response = Selector(text=html)
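# Extraction methods available on the selector:
# response.css()
# response.xpath()
# response.re()

# For comparison, lxml builds its tree like this:
# from lxml import etree
# etree.HTML(response.text)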

title = response.xpath('//title/text()').extract()
title1 = response.xpath('//title/text()').extract_first()
print(title)
print(title1)

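# extract() returns all matches as a list
# extract_first() returns a single value: the first match

# Regular expression usage
print(response.xpath('//a/text()').re(r'Name:\s(.*)'))
# Locate the data first, then split it with the regex
print(response.xpath('//a/text()').re(r'(.*?):\s(.*)'))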

Output

['Example website']
Example website
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']
['Name', 'My image 1 ', 'Name', 'My image 2 ', 'Name', 'My image 3 ', 'Name', 'My image 4 ', 'Name', 'My image 5 ']

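2 `scrapy` middleware

The Scheduler takes a Request off its queue and sends it to the Downloader, and on the way the Request passes through the Downloader Middleware. When the Downloader has finished and returns the Response to the Spider, the Response passes through the Downloader Middleware again.

In other words, Downloader Middleware acts at two points in the architecture:

Before a Request scheduled off the queue reaches the Downloader, so we can modify the Request before it is downloaded.
Before the Response produced by the download reaches the Spider, so we can modify the Response before the Spider parses it.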
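The two positions can be pictured with a small stand-alone sketch. Everything in it (TraceMiddleware, fake_download, fetch, and the dict-based request and response) is a hypothetical stand-in rather than Scrapy code; it only models the order in which the hooks run, with requests passing through the middlewares in one order and responses passing back in the reverse order.

```python
class TraceMiddleware:
    """Stand-in middleware that records when each hook runs."""

    def __init__(self, name):
        self.name = name

    def process_request(self, request):
        request['trace'].append(self.name + ':process_request')

    def process_response(self, request, response):
        response['trace'].append(self.name + ':process_response')
        return response


def fake_download(request):
    # Stand-in for the Downloader: no network access, just a dict
    return {'url': request['url'], 'trace': request['trace'] + ['download']}


def fetch(request, middlewares):
    # Position 1: every middleware sees the request before the download
    for mw in middlewares:
        mw.process_request(request)
    response = fake_download(request)
    # Position 2: every middleware sees the response before the spider,
    # in the reverse order
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


request = {'url': 'https://example.com', 'trace': []}
response = fetch(request, [TraceMiddleware('A'), TraceMiddleware('B')])
print(response['trace'])
```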

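2.1 Purpose

Downloader Middleware is where Scrapy handles tasks such as changing the User-Agent, following redirects, setting proxies, retrying failed requests, and setting Cookies. The sections below look at how to use it.

Note: without middleware, the request goes out bare, with none of this processing applied.

2.2 Middleware overview

A downloader middleware class mainly has five methods:

from_crawler: class method used to initialize the middleware
process_request: called for every request that passes through the downloader middleware
process_response: handles the response returned by the downloader
process_exception: called when the downloader or request processing raises an exception
spider_opened: callback for the built-in spider_opened signal

2.2.1 Activating middleware

The DOWNLOADER_MIDDLEWARES setting configures the downloader middlewares. Each key is the import path of a middleware and each value is its priority: the smaller the number, the earlier it runs. A value of None disables the middleware.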
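To make the ordering rule concrete, here is a minimal sketch in plain Python. The myproject.middlewares paths and the helper execution_order are hypothetical, and the sketch ignores the built-in middlewares that Scrapy merges into this setting; it only shows "None disables, smaller number runs first".

```python
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical project middlewares
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
    'myproject.middlewares.ProxyMiddleware': 350,
    # Built-in middleware switched off with None
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


def execution_order(setting):
    """Return the enabled middleware paths, lowest priority number first."""
    enabled = {path: order for path, order in setting.items() if order is not None}
    return sorted(enabled, key=enabled.get)


for path in execution_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```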

2.3 Custom middleware

import random

class RandomUserAgentMiddleware():
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1'
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
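The middleware can be tried outside a crawl with a stand-in request object. In this sketch the SimpleNamespace request, the placeholder agent strings, and spider=None are assumptions for illustration; a real scrapy.Request carries more than a headers mapping.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    def __init__(self):
        # Placeholder values; a real list holds full User-Agent strings
        self.user_agents = ['agent-a', 'agent-b', 'agent-c']

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)


# Stand-in for scrapy.Request: only the headers mapping is needed here
request = SimpleNamespace(url='https://example.com', headers={})
middleware = RandomUserAgentMiddleware()
result = middleware.process_request(request, spider=None)

print(request.headers['User-Agent'])
print(result)
```

process_request returns None here, which tells Scrapy to carry on with the request as usual.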

2.3.1 Activation settings

DOWNLOADER_MIDDLEWARES = {
   'scrapydownloadertest.middlewares.RandomUserAgentMiddleware': 543,
}

2.3.3 Extended topic: configuring browser automation

Download: https://phantomjs.org/download.html

import random
from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware():
    def __init__(self):
        self.logger = getLogger(__name__)
        # Page-load timeout in seconds, picked at random between 1 and 3
        self.timeout = random.randint(1, 3)
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)

    def process_request(self, request, spider):
        self.logger.debug('Chrome is starting')
        self.browser.get(request.url)
        body = self.browser.page_source
        # Returning a response here hands the rendered page to the spider
        # in place of a normal download
        return HtmlResponse(url=request.url, body=body, request=request, encoding='utf-8', status=200)

    def __del__(self):
        # quit() ends the browser and the driver process
        self.browser.quit()

Python test code

import scrapy
from scrapy import cmdline

class TestSpider(scrapy.Spider):

    name = 'test'
    start_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1630663331818&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

    def start_requests(self):
        for i in range(1,3):
            yield scrapy.Request(url=self.start_url.format(i),callback=self.parse)

    def parse(self, response):
        self.logger.info(response.text)

if __name__ == '__main__':
    cmdline.execute('scrapy crawl test'.split())

3 `scrapy` data storage

3.1 Storage with `mongo`

Start page: https://hot.online.sh.cn/node/node_65634.htm
Requirement: crawl the listing with the framework

3.1.1 Writing the spider file

from urllib.parse import urljoin
import scrapy
from scrapy import cmdline
from news.items import NewsItem

class HotSpider(scrapy.Spider):
    name = 'hot'
    start_urls = ['https://hot.online.sh.cn/node/node_65634.htm']

    def parse(self, response):
        news_list = response.css('div.list_thread')
        for news in news_list:
            items = NewsItem()
            items['title'] = news.xpath('.//h2/a/text()').extract_first()
            items['times'] = news.xpath('.//h3/text()').extract_first()
            items['info']  = news.xpath('.//p/text()').extract_first()
            yield  items

        # Handle pagination ("下一页" is the text of the page's "next page" link)
        next_page = response.xpath('//center/a[text()="下一页"]/@href').extract_first()
        if next_page:
            # Resolve the relative link against the listing directory
            url = 'https://hot.online.sh.cn/node/'
            print(urljoin(url, next_page))
            yield scrapy.Request(urljoin(url, next_page), callback=self.parse)

if __name__ == '__main__':
    cmdline.execute('scrapy crawl hot'.split())
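The pagination step relies on urljoin to turn the relative href into a full URL. The sketch below shows how it resolves a few cases; the href value node_65634_2.htm is a hypothetical example, not taken from the site.

```python
from urllib.parse import urljoin

base = 'https://hot.online.sh.cn/node/'
href = 'node_65634_2.htm'  # hypothetical value of the next-page href

# A relative href is appended to the base directory
print(urljoin(base, href))
# A full page URL works as the base too: its last path segment is replaced
print(urljoin('https://hot.online.sh.cn/node/node_65634.htm', href))
# An href starting with "/" replaces the whole path
print(urljoin(base, '/node/' + href))
```

All three calls print https://hot.online.sh.cn/node/node_65634_2.htm. Inside a callback, response.urljoin(href) does the same resolution against the URL of the current response.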

3.1.2 Writing the pipeline file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
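from itemadapter import ItemAdapter
import pymongo

class NewsPipeline:

    def open_spider(self, spider):
        self.client = pymongo.MongoClient()
        self.db = self.client.news  # select the database

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.client.close()

    def process_item(self, item, spider):
        # ItemAdapter gives one interface for every item type
        data = ItemAdapter(item).asdict()
        # insert() was removed in PyMongo 4; insert_one() replaces it
        self.db['xl'].insert_one(data)
        # Return the item so later pipelines still receive it
        return item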
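As the template comment says, the pipeline only runs once it is listed in ITEM_PIPELINES. A minimal settings fragment might look like the following; the module path news.pipelines is an assumption inferred from the news.items import used by the spider, and 300 is just a conventional priority value.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed path: the pipeline class lives in news/pipelines.py
    'news.pipelines.NewsPipeline': 300,
}
```

Pipelines with lower numbers run first, so any further pipeline gets its own entry with a number that places it where it should run.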

3.2 Storage with `MySQL`

3.2.1 Writing the settings

DATA_CONFIG = {
   'config' : {
      'host':'127.0.0.1',
      'port':3306,
      'user':'root',
      'password':'',
      'db':'yy',
      'charset':'utf8'
   }
}

3.2.2 Writing the storage file

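import pymysql

from news.items import NewsItem

class NewsPipeline_mysql:

    def open_spider(self, spider):
        data_config = spider.settings['DATA_CONFIG']
        self.conn = pymysql.connect(**data_config['config'])
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the item into the database
        # (NewsItem is the class the spider in 3.1.1 yields)
        if isinstance(item, NewsItem):
            try:
                sql = 'insert into info (title,crate_time,info) values (%s,%s,%s)'
                self.cursor.execute(sql, (
                    item['title'],
                    item['times'],
                    item['info'],
                ))
                # Commit the transaction
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                # The item has no url field, so report the title instead
                print('Failed to write item %s: %s' % (item['title'], e))
        # Return the item so later pipelines still receive it
        return item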
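The insert, commit, and rollback pattern can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 as a stand-in for pymysql, so the placeholder is ? where pymysql uses %s; the table definition, the save helper, and the sample values are assumptions for illustration.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL connection
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table info (title text not null, crate_time text, info text)')


def save(item):
    """Insert one item; commit on success, roll back on failure."""
    try:
        # Parameters are passed separately, never formatted into the SQL
        cursor.execute(
            'insert into info (title, crate_time, info) values (?, ?, ?)',
            (item['title'], item['times'], item['info']),
        )
        conn.commit()
        return True
    except (sqlite3.Error, KeyError) as e:
        conn.rollback()
        print('write failed: %s' % e)
        return False


ok = save({'title': 'headline', 'times': '2023-10-03', 'info': 'summary'})
# A missing title violates the NOT NULL constraint and is rolled back
bad = save({'title': None, 'times': '2023-10-03', 'info': 'summary'})
rows = cursor.execute('select title, crate_time, info from info').fetchall()
print(ok, bad, rows)
```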

Original: https://blog.csdn.net/shifengboy/article/details/127237010
Author: 尘世风
Title: Scrapy Parsing and Database Storage
