Scrapy: saving scraped data as JSON files and inserting into MySQL asynchronously

1. Saving scraped data as a JSON file

First, create a dedicated pipeline class for saving JSON files. `codecs.open` works like the built-in `open`, but takes care of some encoding issues for us.

```python
import codecs
import json

class JsonWithEncodingPipeline:

    def __init__(self):
        # codecs.open is like open, but handles the encoding for us
        self.file = codecs.open('save_file.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called by Scrapy when the spider finishes
        self.file.close()
```
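To see why `ensure_ascii=False` is passed above, here is a minimal standalone sketch (the item dict is made up for illustration):

```python
import json

# A made-up item dict containing non-ASCII data
item = {'name': '张三', 'age': 20}

# Default behavior escapes non-ASCII characters to \uXXXX sequences
print(json.dumps(item))                      # {"name": "\u5f20\u4e09", "age": 20}

# ensure_ascii=False writes the characters literally, which is what we
# want when the file itself is opened with encoding='utf-8'
print(json.dumps(item, ensure_ascii=False))  # {"name": "张三", "age": 20}
```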

The pipeline above is one we wrote ourselves; Scrapy also ships built-in support for saving JSON. The `exporters.py` module additionally provides exporter classes for CSV, XML, pickle, marshal, and other formats.

```python
from scrapy.exporters import JsonItemExporter

class JsonExporterPipeline:

    def __init__(self):
        # the exporter writes bytes, so open the file in binary mode
        self.file = open('save_file.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8',
                                         ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
```

Whichever of the two pipelines above you choose, add it to `ITEM_PIPELINES` in settings.py and set its priority.
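The registration might look like the following sketch; `myproject` is a placeholder for your actual project module, and 300 is an arbitrary priority (lower numbers run earlier):

```python
# settings.py -- 'myproject.pipelines' is a hypothetical module path
ITEM_PIPELINES = {
    'myproject.pipelines.JsonExporterPipeline': 300,
}
```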

2. Saving data to MySQL asynchronously

Scrapy scrapes with high concurrency using asynchronous I/O, so crawling can be very fast, while plain pymysql and SQLAlchemy are both synchronous and blocking. If the spider and downloader produce items quickly but each MySQL insert blocks, the database writes become a bottleneck that drags down the whole project. When that happens, it is worth inserting into MySQL asynchronously.

```python
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class MysqlTwistedPipeline:

    def __init__(self, db_pool):
        self.db_pool = db_pool

    @classmethod
    def from_settings(cls, settings):
        # the keys in db_params must match the parameters that
        # adbapi.ConnectionPool forwards to MySQLdb.connect
        db_params = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        db_pool = adbapi.ConnectionPool('MySQLdb', **db_params)
        return cls(db_pool)

    def process_item(self, item, spider):
        # runInteraction runs do_insert in a thread pool and
        # returns a Deferred, so the insert does not block
        query = self.db_pool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle failures
        return item

    def do_insert(self, cursor, item):
        # db_pool.runInteraction passes in a cursor and commits
        # the transaction automatically on success
        insert_sql = 'insert into table_name(name, age) values(%s, %s)'
        cursor.execute(insert_sql, (item['name'], item['age']))

    def handle_error(self, failure, item, spider):
        # error callback for the insert
        print(failure)
```
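The `from_settings` method above reads its connection values from settings.py; those entries might look like the following sketch (every value below is a placeholder to replace with your own):

```python
# settings.py -- placeholder connection values read via settings['...']
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'scrapy_db'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'secret'
```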

Original: https://blog.csdn.net/weixin_30697437/article/details/113456907
Author: 吕宸昊
Title: Scrapy: saving scraped data as JSON files and inserting into MySQL asynchronously

