GitHub – yanceyblog/scrapy-mysql: a spider example that stores scraped data in MySQL

A worked example of storing spider data

[TOC]

Data storage

Scrapy supports exporting scraped data to files in formats such as CSV, JSON Lines (jl), pickle, marshal, JSON, and XML. That works for small amounts of data, but storing very large amounts in plain files (images, of course, should still go to files) quickly becomes unwieldy, since the data ultimately has to be queried and reused.

For that reason we usually store the data in a database; this post covers the most common choice, MySQL. You may also have noticed that the pipelines file in a Scrapy project has gone unused so far. Its job is to process the items the spider emits, so the storage logic belongs in a pipeline.

Adding the MySQL library (PyMySQL)

In PyCharm, open File -> Default Settings -> Project Interpreter, click the "+" in the lower-left corner, and search for PyMySQL, as shown:

(Screenshot: searching for PyMySQL in PyCharm's package installer)

Click Install Package. If the installation fails, try ticking "Install to user's site packages directory" above to install into the user directory instead.
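Alternatively, PyMySQL can be installed from the command line; this fetches the same package from PyPI that PyCharm would:

```
pip install pymysql
```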

Configuring the MySQL service

1. Install MySQL

```
root@ubuntu:~# sudo apt-get install mysql-server
root@ubuntu:~# apt install mysql-client
root@ubuntu:~# apt install libmysqlclient-dev
```

During installation a dialog will prompt you to set a password for the MySQL root account; enter the same password twice.

2. Verify the installation

```
root@ubuntu:~# netstat -tap | grep mysql
tcp6       0      0 [::]:mysql              [::]:*                  LISTEN      7510/mysqld
```

3. Enable remote access to MySQL

Edit the MySQL configuration file and comment out the line bind-address = 127.0.0.1:

```
root@ubuntu:~# vi /etc/mysql/mysql.conf.d/mysqld.cnf
# bind-address = 127.0.0.1
```

Log in to MySQL as root:

```
root@ubuntu:~# mysql -u root -p123456
```

At the mysql prompt, grant the remote user privileges (replace username and password with your own):

```
mysql> GRANT ALL ON *.* TO 'username'@'%' IDENTIFIED BY 'password';
```

or, if the user should also be able to grant privileges to others:

```
mysql> GRANT ALL ON *.* TO 'username'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
```

For example:

```
mysql> GRANT ALL ON *.* TO 'china'@'%' IDENTIFIED BY '123456';
```

Then flush the privilege tables at the mysql prompt, and restart MySQL from the shell with /etc/init.d/mysql restart:

```
mysql> FLUSH PRIVILEGES;
root@ubuntu:~# /etc/init.d/mysql restart
```

Client settings for a remote connection:

(Screenshot: client settings for a remote MySQL connection)

4. Common problems

If you see ERROR 1045: Access denied for user 'root'@'localhost' (using password: YES), log back in and re-grant privileges:

```
root@ubuntu:~# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON *.* TO 'myuser'@'%' IDENTIFIED BY 'mypassword' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;
```
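Before wiring anything into Scrapy, it can save time to confirm the account and connection with a short standalone script. A minimal sketch using the example credentials from this post (substitute your own host, user, and password):

```python
import pymysql

# Example values from this post; replace with your own.
connection = pymysql.connect(host='localhost',
                             user='china',
                             passwd='123456',
                             charset='utf8')
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())  # e.g. ('5.7.33-...',)
finally:
    connection.close()
```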

Create the four item tables in MySQL:

(Screenshot: the four item tables created in MySQL)
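The exact DDL is not preserved in this copy, but the column names can be read off the pipeline code below. A sketch of one of the four tables, music_douban, with assumed column types (the other three follow the same pattern with their own fields):

```sql
-- Column types are assumptions; adjust lengths and types to your data.
CREATE TABLE music_douban (
    id INT AUTO_INCREMENT PRIMARY KEY,
    music_name VARCHAR(255),
    music_alias VARCHAR(255),
    music_singer VARCHAR(255),
    music_time VARCHAR(64),
    music_rating VARCHAR(16),
    music_votes VARCHAR(32),
    music_tags VARCHAR(255),
    music_url VARCHAR(255)
) DEFAULT CHARSET=utf8;
```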

Creating the project

With PyMySQL installed, we can put the storage logic in the pipeline. First create the project: scrapy startproject mysql. This example reuses the multi-spider project from the previous chapter and stores four of its items in the MySQL database.

Then open the newly created mysql project and add the database connection constants to settings.py.

```python
# -*- coding: utf-8 -*-

BOT_NAME = 'mysql'

SPIDER_MODULES = ['mysql.spiders']
NEWSPIDER_MODULE = 'mysql.spiders'

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'spider'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'

DOWNLOAD_DELAY = 1

ITEM_PIPELINES = {
    'mysql.pipelines.DoubanPipeline': 301,
}
```
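As an aside: the pipeline below imports the settings module directly, which works, but Scrapy's preferred pattern is to receive settings through a from_crawler class method. A minimal sketch of that alternative, reusing the MYSQL_* names defined above:

```python
import pymysql


class DoubanPipeline(object):
    def __init__(self, host, db, user, passwd):
        # One connection per pipeline instance.
        self.connect = pymysql.connect(host=host, db=db, user=user,
                                       passwd=passwd, charset='utf8')
        self.cursor = self.connect.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, whose settings
        # include everything declared in settings.py.
        s = crawler.settings
        return cls(s['MYSQL_HOST'], s['MYSQL_DBNAME'],
                   s['MYSQL_USER'], s['MYSQL_PASSWD'])
```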

pipelines.py configuration

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import logging

import pymysql

from mysql import settings
from mysql.items import MusicItem, MusicReviewItem, VideoItem, VideoReviewItem

# scrapy's old log module is deprecated; use stdlib logging instead.
logger = logging.getLogger(__name__)


class DoubanPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        if isinstance(item, MusicItem):
            try:
                # Deduplicate by URL: update the row if it exists, insert otherwise.
                self.cursor.execute(
                    """select * from music_douban where music_url = %s""",
                    (item['music_url'],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update music_douban set music_name = %s, music_alias = %s, music_singer = %s,
                        music_time = %s, music_rating = %s, music_votes = %s, music_tags = %s, music_url = %s
                        where music_url = %s""",
                        (item['music_name'],
                         item['music_alias'],
                         item['music_singer'],
                         item['music_time'],
                         item['music_rating'],
                         item['music_votes'],
                         item['music_tags'],
                         item['music_url'],
                         item['music_url']))
                else:
                    self.cursor.execute(
                        """insert into music_douban(music_name, music_alias, music_singer, music_time,
                        music_rating, music_votes, music_tags, music_url)
                        values (%s, %s, %s, %s, %s, %s, %s, %s)""",
                        (item['music_name'],
                         item['music_alias'],
                         item['music_singer'],
                         item['music_time'],
                         item['music_rating'],
                         item['music_votes'],
                         item['music_tags'],
                         item['music_url']))
                self.connect.commit()
            except Exception as error:
                logger.error(error)
            return item

        elif isinstance(item, MusicReviewItem):
            try:
                self.cursor.execute(
                    """select * from music_review_douban where review_url = %s""",
                    (item['review_url'],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update music_review_douban set review_title = %s, review_content = %s,
                        review_author = %s, review_music = %s, review_time = %s, review_url = %s
                        where review_url = %s""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_music'],
                         item['review_time'],
                         item['review_url'],
                         item['review_url']))
                else:
                    self.cursor.execute(
                        """insert into music_review_douban(review_title, review_content, review_author,
                        review_music, review_time, review_url)
                        values (%s, %s, %s, %s, %s, %s)""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_music'],
                         item['review_time'],
                         item['review_url']))
                self.connect.commit()
            except Exception as error:
                logger.error(error)
            return item

        elif isinstance(item, VideoItem):
            try:
                self.cursor.execute(
                    """select * from video_douban where video_url = %s""",
                    (item['video_url'],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update video_douban set video_name = %s, video_alias = %s, video_actor = %s,
                        video_year = %s, video_time = %s, video_rating = %s, video_votes = %s,
                        video_tags = %s, video_url = %s, video_director = %s, video_type = %s,
                        video_bigtype = %s, video_area = %s, video_language = %s, video_length = %s,
                        video_writer = %s, video_desc = %s, video_episodes = %s
                        where video_url = %s""",
                        (item['video_name'],
                         item['video_alias'],
                         item['video_actor'],
                         item['video_year'],
                         item['video_time'],
                         item['video_rating'],
                         item['video_votes'],
                         item['video_tags'],
                         item['video_url'],
                         item['video_director'],
                         item['video_type'],
                         item['video_bigtype'],
                         item['video_area'],
                         item['video_language'],
                         item['video_length'],
                         item['video_writer'],
                         item['video_desc'],
                         item['video_episodes'],
                         item['video_url']))
                else:
                    self.cursor.execute(
                        """insert into video_douban(video_name, video_alias, video_actor, video_year,
                        video_time, video_rating, video_votes, video_tags, video_url, video_director,
                        video_type, video_bigtype, video_area, video_language, video_length,
                        video_writer, video_desc, video_episodes)
                        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                        (item['video_name'],
                         item['video_alias'],
                         item['video_actor'],
                         item['video_year'],
                         item['video_time'],
                         item['video_rating'],
                         item['video_votes'],
                         item['video_tags'],
                         item['video_url'],
                         item['video_director'],
                         item['video_type'],
                         item['video_bigtype'],
                         item['video_area'],
                         item['video_language'],
                         item['video_length'],
                         item['video_writer'],
                         item['video_desc'],
                         item['video_episodes']))
                self.connect.commit()
            except Exception as error:
                logger.error(error)
            return item

        elif isinstance(item, VideoReviewItem):
            try:
                self.cursor.execute(
                    """select * from video_review_douban where review_url = %s""",
                    (item['review_url'],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update video_review_douban set review_title = %s, review_content = %s,
                        review_author = %s, review_video = %s, review_time = %s, review_url = %s
                        where review_url = %s""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_video'],
                         item['review_time'],
                         item['review_url'],
                         item['review_url']))
                else:
                    self.cursor.execute(
                        """insert into video_review_douban(review_title, review_content, review_author,
                        review_video, review_time, review_url)
                        values (%s, %s, %s, %s, %s, %s)""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_video'],
                         item['review_time'],
                         item['review_url']))
                self.connect.commit()
            except Exception as error:
                logger.error(error)
            return item
        else:
            # Not one of our four item types; pass it through unchanged.
            return item
```

The pipeline above already deduplicates against the database: each item is first looked up by its URL, then updated if a matching row exists and inserted otherwise.
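That select-then-update/insert pattern costs two round trips per item. If you add a UNIQUE index on the URL column, MySQL can perform the same upsert in a single statement; a sketch for music_douban (the UNIQUE index is an assumption, not part of the original schema):

```sql
-- One-time setup: ALTER TABLE music_douban ADD UNIQUE KEY uq_music_url (music_url);
INSERT INTO music_douban (music_name, music_alias, music_singer, music_time,
                          music_rating, music_votes, music_tags, music_url)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    music_name = VALUES(music_name),
    music_rating = VALUES(music_rating),
    music_votes = VALUES(music_votes);
```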

Running the spider

Run run.py in PyCharm; afterwards the MySQL tables contain the data we wanted.
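run.py is carried over from the previous chapter's multi-spider example. If you need to recreate it, a minimal sketch that launches a single spider from a script (the spider name music is a placeholder; use a name registered under mysql/spiders):

```python
from scrapy import cmdline

# Placeholder spider name; replace with one of your own spiders.
cmdline.execute('scrapy crawl music'.split())
```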

(Screenshot: the scraped data stored in the MySQL tables)

Original: https://blog.csdn.net/weixin_29602351/article/details/113267491
Author: Martin awodey
