Item Exporters - Scrapy Framework 8 - Python

October 6, 2023, 4:49 AM • Python

Contents

1 Preface
2 Item exporters
2.1 Item Exporters
2.2 BaseItemExporter
2.3 Instantiation
- 2.3.1 Requirements
- 2.3.2 Field serialization
2.4 Project example
2.5 Custom ItemExporter
3 Postscript

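1 Preface

We scrape data so that it can be used in other applications or systems. To make that convenient, we usually either persist the scraped data or export it. Persistent storage is covered in the earlier pipeline chapter and in the Python and databases section; this article focuses on exporting data.

For this purpose, Scrapy provides a set of item exporters for different output formats such as XML, CSV, and JSON, exposed as classes named XxxItemExporter.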

2 item exporters

As usual, an XxxItemExporter has to be instantiated before it can be used, so let's first look at which classes are available.

2.1 Item Exporters

Class | Parameters | Description
BaseItemExporter | (fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False) | Base class
PythonItemExporter | (*, dont_fail=False, **kwargs) | Python format
XmlItemExporter | (file, item_element='item', root_element='items', **kwargs) | XML format
CsvItemExporter | (file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs) | CSV format
PickleItemExporter | (file, protocol=0, **kwargs) | pickle format
PprintItemExporter | (file, **kwargs) | Pretty-print format
JsonItemExporter | (file, **kwargs) | JSON format
JsonLinesItemExporter | (file, **kwargs) | JSON Lines format
MarshalItemExporter | (file, **kwargs) | marshal format

JsonItemExporter compared with JsonLinesItemExporter
Typical JsonItemExporter output:

[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]

Typical JsonLinesItemExporter output:

{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}

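JsonItemExporter produces a single well-formed JSON array, which suits small amounts of data; JsonLinesItemExporter writes one JSON object per line, which suits large outputs.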
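
The difference mostly matters to whoever reads the file back. The following is a minimal sketch using only the standard library (none of it is Scrapy API, and the helper name iter_json_lines is made up for illustration): a JSON array has to be parsed as a whole, while JSON Lines can be consumed one record at a time.

```python
import io
import json

records = [
    {"name": "Color TV", "price": "1200"},
    {"name": "DVD player", "price": "200"},
]

# JsonItemExporter-style output: one JSON array, parsed in a single call.
array_text = json.dumps(records)
all_at_once = json.loads(array_text)

# JsonLinesItemExporter-style output: one JSON object per line.
lines_text = "".join(json.dumps(r) + "\n" for r in records)


def iter_json_lines(stream):
    """Yield one record per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


one_by_one = list(iter_json_lines(io.StringIO(lines_text)))
```

With a real file, iter_json_lines could be given the open file object directly, so memory use stays roughly constant however many items were exported.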

2.2 BaseItemExporter

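BaseItemExporter is the base class that all the other exporters inherit from. Below we use BaseItemExporter to introduce its attributes and methods.

Methods

Method | Parameters | Description
export_item() | item | Exports an item
serialize_field() | field, name, value | Serializes a field
start_exporting() | | Begins the export; preparation work
finish_exporting() | | Ends the export; cleanup work

Attributes

Attribute | Default | Description
fields_to_export | None | Fields to export; all fields are exported by default
encoding | | Encoding
indent | 0 | Indentation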

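2.3 Instantiation

2.3.1 Requirements

Once an exporter has been instantiated, using it means calling the following three methods, in this order: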

start_exporting()
export_item()
finish_exporting()

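2.3.2 Field serialization

By default, field values are serialized by the default serialization library. Serialization can also be customized, in either of the following two ways:

Declare a serializer in the field, for example: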

import scrapy

def serialize_price(value):
    return f'$ {str(value)}'

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field(serializer=serialize_price)

Override the serialize_field() method, for example:

from scrapy.exporters import XmlItemExporter

class ProductXmlExporter(XmlItemExporter):

    def serialize_field(self, field, name, value):
        if name == 'price':
            return f'$ {str(value)}'
        return super().serialize_field(field, name, value)

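2.4 Project example

Taking our earlier crawl of a CSDN personal blog as the example, we now want to write the scraped data to a file in JSON format. Sample code for pipelines.py: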

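from scrapy.exporters import JsonItemExporter


class JSONPipeline:
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode.
        self.fp = open("../../output/csdn.json", "wb")
        self.exporter = JsonItemExporter(self.fp, encoding='utf-8')

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        # Return the item so that later pipelines still receive it.
        return item

    def close_spider(self, spider):
        # Finish the export before closing the file.
        self.exporter.finish_exporting()
        self.fp.close()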

Output:

[{"title": "process-进程详解-python", "publish": "2022-01-12 18:10:09", "approval": 0, "comment": 0, "collection": 0},...
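
For a pipeline like this to run, Scrapy has to be told about it through the ITEM_PIPELINES setting. A minimal settings.py fragment might look like the following, assuming the project package is called myproject (a placeholder, since the article does not name the package):

```python
# settings.py
# "myproject" is a placeholder package name, not taken from the article.
ITEM_PIPELINES = {
    # The integer sets the order (conventionally 0-1000); lower values run first.
    "myproject.pipelines.JSONPipeline": 300,
}
```

The value 300 is arbitrary here; it only matters relative to other pipelines enabled in the same project.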

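2.5 Custom ItemExporter

Many applications today, office software in particular, need to work with Excel, but Scrapy does not ship a corresponding exporter. We can therefore implement our own ExcelItemExporter, modeled on BaseItemExporter.

For the detailed steps, see: https://www.jianshu.com/p/a50b19b6258d

Example: the pipeline code, again for the CSDN personal blog crawl: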

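# ExcelItemExporter is the custom exporter built in the linked article;
# import it from wherever you define it in your project.
class ExcelPipeline: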
    def __init__(self):
        self.fp = open("../../output/csdn.xls", "wb")
        self.exporter = ExcelItemExporter(self.fp, encoding='utf-8')

    def open_spider(self, spider):
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()

3 Postscript

References:

Scrapy导出Excel By Exporter (exporting to Excel from Scrapy with an exporter)

Code repository: https://gitee.com/gaogzhen/python-study.git
QQ group: 433529853

Original: https://blog.csdn.net/gaogzhen/article/details/125171386
Author: gaog2zh
Title: Item Exporters - Scrapy Framework 8 - Python
