Item Exporters - Scrapy Framework 8 - Python

October 6, 2023, 4:49 AM • Python

Contents

1 Introduction
2 Item exporters
2.1 Item Exporters
2.2 BaseItemExporter
2.3 Instantiation
2.3.1 Requirements
2.3.2 Field serialization
2.4 Project example
2.5 Custom ItemExporter
3 Afterword

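1 Introduction

We crawl data so that other applications or systems can use it. To make that convenient, the scraped data is usually either persisted to storage or exported. Persistence is covered in the earlier pipeline chapter and in the Python-and-databases posts; this post focuses on exporting.

For this purpose, Scrapy provides a set of item exporters for different output formats such as XML, CSV, and JSON, each implemented as a class named XxxItemExporter.

2 Item exporters

As usual, an XxxItemExporter has to be instantiated before it can be used, so let's first look at which classes are available.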

2.1 Item Exporters

| Class | Parameters | Description |
|---|---|---|
| BaseItemExporter | (fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False) | Base class |
| PythonItemExporter | (*, dont_fail=False, **kwargs) | Python format |
| XmlItemExporter | (file, item_element='item', root_element='items', **kwargs) | XML format |
| CsvItemExporter | (file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs) | CSV format |
| PickleItemExporter | (file, protocol=0, **kwargs) | pickle format |
| PprintItemExporter | (file, **kwargs) | Pretty-print format |
| JsonItemExporter | (file, **kwargs) | JSON format |
| JsonLinesItemExporter | (file, **kwargs) | JSON Lines format |
| MarshalItemExporter | (file, **kwargs) | marshal format |

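JsonItemExporter vs. JsonLinesItemExporter

Typical JsonItemExporter output:

[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]

Typical JsonLinesItemExporter output:

{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}

JsonItemExporter writes all items as one well-formed JSON array, which suits small outputs because a consumer generally has to parse the whole array at once. JsonLinesItemExporter writes one JSON object per line, so the file can be processed record by record, which suits large outputs.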
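
The difference on the consuming side can be sketched with the standard json module alone. This is a minimal illustration, not Scrapy code: the helper name iter_json_lines is an assumption made up for this example, and the two strings merely imitate the two output shapes shown above.

```python
import io
import json

items = [
    {"name": "Color TV", "price": "1200"},
    {"name": "DVD player", "price": "200"},
]

# JSON array: a single document, so the reader parses all of it
# before it can look at any individual item.
array_text = json.dumps(items)
all_items = json.loads(array_text)

# JSON Lines: one object per line, so the reader can decode
# one line at a time without holding the whole file in memory.
lines_text = "".join(json.dumps(item) + "\n" for item in items)


def iter_json_lines(stream):
    """Yield one decoded item per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


streamed_items = list(iter_json_lines(io.StringIO(lines_text)))
# Only the first line is decoded here; the rest of the stream is untouched.
first_item = next(iter_json_lines(io.StringIO(lines_text)))
```

With a real export file, the io.StringIO object would be replaced by an open file handle, and the generator would keep memory use roughly constant however many items the file holds.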

2.2 BaseItemExporter

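BaseItemExporter is the base class that all the other exporters inherit from. Using it as the example, here are its methods and attributes.

Methods

| Method | Parameters | Description |
|---|---|---|
| export_item() | item | Export the given item |
| serialize_field() | field, name, value | Serialize a field value |
| start_exporting() | | Begin the export; setup work |
| finish_exporting() | | End the export; cleanup work |

Attributes

| Attribute | Default | Description |
|---|---|---|
| fields_to_export | None | Fields to export; all fields are exported by default |
| export_empty_fields | False | Whether empty fields are included in the output |
| encoding | 'utf-8' | Output encoding |
| indent | 0 | Indentation |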

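2.3 Instantiation

2.3.1 Requirements

After instantiating an exporter, the following three methods must be called in this order, with export_item() called once per item: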

start_exporting()
export_item()
finish_exporting()
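
To see why all three calls matter, here is a minimal sketch of the lifecycle. MiniJsonArrayExporter is a hypothetical plain-Python stand-in written for this illustration, not a Scrapy class; it only imitates the three-method protocol so the example runs without Scrapy installed.

```python
import io
import json


class MiniJsonArrayExporter:
    """Hypothetical stand-in that imitates the exporter lifecycle."""

    def __init__(self, file):
        self.file = file          # binary file-like object
        self.first_item = True

    def start_exporting(self):
        # Setup: open the JSON array.
        self.file.write(b"[")

    def export_item(self, item):
        # Called once per item; items are separated by commas.
        if not self.first_item:
            self.file.write(b",\n")
        self.first_item = False
        self.file.write(json.dumps(dict(item)).encode("utf-8"))

    def finish_exporting(self):
        # Cleanup: close the array so the output is valid JSON.
        self.file.write(b"]")


buffer = io.BytesIO()
exporter = MiniJsonArrayExporter(buffer)
exporter.start_exporting()
for item in [{"name": "Color TV"}, {"name": "DVD player"}]:
    exporter.export_item(item)
exporter.finish_exporting()

exported = json.loads(buffer.getvalue().decode("utf-8"))
```

In this sketch, skipping finish_exporting() would leave the array unclosed and the output unparseable, which is the kind of problem the required call order guards against.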

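2.3.2 Field serialization

By default, field values are passed unmodified to the underlying serialization library. The behavior can also be customized, in either of two ways:

Declare a serializer on the field, for example: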

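import scrapy

def serialize_price(value):
    # Prefix the price with a currency sign when it is exported.
    return f'$ {value}'

class Product(scrapy.Item):
    name = scrapy.Field()
    # The exporter calls this serializer for the price field.
    price = scrapy.Field(serializer=serialize_price)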

Override the serialize_field() method, for example:

from scrapy.exporters import XmlItemExporter

class ProductXmlExporter(XmlItemExporter):

    def serialize_field(self, field, name, value):
        # Format the price field; defer to the default behavior for the rest.
        if name == 'price':
            return f'$ {value}'
        return super().serialize_field(field, name, value)
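
How a declared serializer gets picked up can be sketched without Scrapy. The code below is a simplified stand-in, not the library's implementation: the FIELDS dictionary plays the role of the scrapy.Field() declarations, and the module-level serialize_field function is an assumption for illustration that uses the field's serializer when one is declared and otherwise returns the value unchanged.

```python
def serialize_price(value):
    return f"$ {value}"


# Per-field metadata, standing in for scrapy.Field(...) declarations.
FIELDS = {
    "name": {},
    "price": {"serializer": serialize_price},
}


def serialize_field(field, name, value):
    """Use the field's declared serializer, or return the value unchanged."""
    serializer = field.get("serializer", lambda x: x)
    return serializer(value)


item = {"name": "Color TV", "price": 1200}
serialized = {
    name: serialize_field(FIELDS[name], name, value)
    for name, value in item.items()
}
```

Declaring the serializer on the field keeps the formatting rule with the item definition, while overriding serialize_field() keeps it with one specific exporter; which is preferable depends on whether the rule should apply to every output format.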

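2.4 Project example

Taking the earlier crawl of a personal CSDN blog as the example, we now want to write the scraped data to a file in JSON format. Sample pipelines.py code: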

from scrapy.exporters import JsonItemExporter

class JSONPipeline:
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode.
        self.fp = open("../../output/csdn.json", "wb")
        self.exporter = JsonItemExporter(self.fp, encoding='utf-8')

    def open_spider(self, spider):
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Finish the export before closing the file.
        self.exporter.finish_exporting()
        self.fp.close()

Output:

[{"title": "process-进程详解-python", "publish": "2022-01-12 18:10:09", "approval": 0, "comment": 0, "collection": 0},...

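2.5 Custom ItemExporter

Many applications today, office software in particular, need to work with Excel, but Scrapy does not ship a corresponding exporter. So we implement a custom ExcelItemExporter modeled on BaseItemExporter.

For the detailed steps, see: https://www.jianshu.com/p/a50b19b6258d

Example: the pipeline code, again using the crawl of a personal CSDN blog: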

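class ExcelPipeline:
    def __init__(self):
        self.fp = open("../../output/csdn.xls", "wb")
        # ExcelItemExporter is the custom exporter described in the linked article;
        # import it from the module where your project defines it.
        self.exporter = ExcelItemExporter(self.fp, encoding='utf-8')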

    def open_spider(self, spider):
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()

3 Afterword

References:

Scrapy导出Excel By Exporter

Code repository: https://gitee.com/gaogzhen/python-study.git
QQ group: 433529853

Original: https://blog.csdn.net/gaogzhen/article/details/125171386
Author: gaog2zh
Title: itemexporters-scrapy框架8-python
