scrapy mysql Douban_A simple Scrapy crawl of Douban's 9-point book list, stored in a MySQL database

October 4, 2023, 12:57 AM • Python • 28 views

(.*?)

author = author.replace(' ', '').replace('\n', '')
print(author)
rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
print(rate)

Save the file.
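The chained replace() calls above strip every space and newline from the extracted text. A minimal sketch of that cleanup as a helper follows; the name clean_text and the sample string are assumptions made for illustration, not part of the original spider.

```python
def clean_text(raw):
    """Remove all spaces and newlines, mirroring the chained replace() calls."""
    return raw.replace(' ', '').replace('\n', '')


# Hypothetical raw value, shaped like text pulled from an indented HTML node.
raw_title = '\n        Example Title\n      '

print(clean_text(raw_title))   # spaces inside the title are removed as well
print(raw_title.strip())       # strip() trims only the two ends
```

Because replace(' ', '') removes every space, a title that contains spaces loses them too. If that matters, str.strip() is a possible alternative that only trims leading and trailing whitespace.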

To make the spider easier to run, we create a main.py file:

kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ cd ../..
kuku@ubuntu:~/pachong/douban9fen$ vim main.py

Add the following content:

# -*- coding:utf8 -*-
import scrapy.cmdline as cmd

cmd.execute('scrapy crawl db9fen'.split())  # db9fen matches the value of the name variable in db_9fen_spider.py

Save the file.

Now we can run it:

kuku@ubuntu:~/pachong/douban9fen$ python main.py


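At this point, however, the spider only scrapes the page it starts on. Look at how the page marks up its "next page" (后页) link.

The link sits under a span tag with class="next", so we only need to extract it and then crawl it:

'//span[@class="next"]/link/@href'

Once the link has been extracted, how does our Scrapy spider handle it?

We can use yield: Scrapy then schedules a request for the URL automatically, and the response is again handled by our parse function:

yield scrapy.http.Request(url, callback=self.parse)

Now edit db_9fen_spider.py and add the following to the parse function, after the for loop (one request per page is enough):

nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
if nextpage:
    print(nextpage)
    next = nextpage[0]
    print(next)
    yield scrapy.http.Request(next, callback=self.parse)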


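Some readers may ask what next = nextpage[0] means. The variable nextpage is a list that holds a single link string; next = nextpage[0] takes that link out of the list and assigns it to the variable next. The if nextpage: check skips this step on the last page, where the list is empty.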
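The extract-then-index pattern can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, whose limited XPath support is enough for this selector, on a simplified page fragment. The fragment, its URL, and the helper name next_page_url are assumptions made for illustration; the real page is larger and is parsed by Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment of a list page (well-formed XML).
PAGE = """
<div class="paginator">
  <span class="next">
    <link rel="next" href="https://www.douban.com/doulist/1264675/?start=25"/>
  </span>
</div>
"""


def next_page_url(xml_text):
    """Return the first next-page link, or None when there is none."""
    root = ET.fromstring(xml_text)
    # ElementTree cannot select attributes with /@href, so read them via get().
    links = root.findall(".//span[@class='next']/link")
    hrefs = [link.get('href') for link in links]
    if hrefs:            # same guard as "if nextpage:"
        return hrefs[0]  # same idea as "next = nextpage[0]"
    return None


print(next_page_url(PAGE))
print(next_page_url('<div class="paginator"/>'))
```

The second call shows why the guard matters: on a page without the span, the list is empty and indexing it with [0] would raise an IndexError.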

Now we can define the fields we want to scrape in the items file:

kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim items.py

Edit items.py so that it contains:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class Douban9FenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    rate = Field()

With the fields defined, edit db_9fen_spider.py again so that the three scraped values are stored in an instance of the class from items.py, as its field values.

kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd spiders/
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ vim db_9fen_spider.py

# -*- coding:utf8 -*-
import scrapy
import re

from douban9fen.items import Douban9FenItem


class Db9fenSpider(scrapy.Spider):
    name = "db9fen"
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/1264675/"]

    # parse the data
    def parse(self, response):
        print(response.body)
        ninefenbook = response.xpath('//div[@class="bd doulist-subject"]')
        for each in ninefenbook:
            item = Douban9FenItem()
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            print(title)
            item['title'] = title

author = re.search(‘

(.*?)

            author = author.replace(' ', '').replace('\n', '')
            print(author)
            item['author'] = author
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            print(rate)
            item['rate'] = rate
            yield item

        # follow the next-page link once per page, after all items are yielded
        nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextpage:
            print(nextpage)
            next = nextpage[0]
            print(next)
            yield scrapy.http.Request(next, callback=self.parse)
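The parse function above is a generator that yields two kinds of things: items to store and a request for the next page. The following sketch imitates that flow in plain Python so the mechanics can be seen without a network connection. The PAGES dictionary, the page names, and the crawl function are all hypothetical stand-ins; Scrapy's real engine schedules requests asynchronously and is far more involved.

```python
# Hypothetical in-memory "site": each page has book titles and an optional next page.
PAGES = {
    'page1': {'titles': ['Book A', 'Book B'], 'next': 'page2'},
    'page2': {'titles': ['Book C'], 'next': None},
}


def parse(url):
    """Yield one dict per book, then the next URL to follow (if any)."""
    page = PAGES[url]
    for title in page['titles']:
        yield {'title': title}
    if page['next']:
        # Stands in for: yield scrapy.http.Request(next, callback=self.parse)
        yield page['next']


def crawl(start_url):
    """Tiny stand-in for the engine: items are collected, URLs are queued."""
    items, queue = [], [start_url]
    while queue:
        for result in parse(queue.pop(0)):
            if isinstance(result, dict):
                items.append(result)
            else:
                queue.append(result)
    return items


print(crawl('page1'))
```

Each yielded URL leads to another call of parse, which is how a single callback can walk through every page of the list.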

Edit settings.py and add the database configuration:

USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 Firefox/44.0.2'

# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'douban9fen'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'openstack'
# end of MySQL database configure setting

ITEM_PIPELINES = {
    'douban9fen.pipelines.Douban9FenPipeline': 300,
}
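A pipeline can read the MYSQL_* values from settings.py through Scrapy's from_crawler class method, which receives the crawler and its settings. The sketch below shows only that lookup and builds the connection arguments; it does not open a database connection. The class name SettingsBackedPipeline and the FakeCrawler stand-in are assumptions made for illustration so the snippet runs without Scrapy installed.

```python
class SettingsBackedPipeline(object):
    """Hypothetical pipeline skeleton that takes its MySQL arguments from settings."""

    def __init__(self, dbargs):
        self.dbargs = dbargs

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings supports get(name, default), like a dict.
        settings = crawler.settings
        dbargs = dict(
            host=settings.get('MYSQL_HOST', 'localhost'),
            db=settings.get('MYSQL_DBNAME'),
            user=settings.get('MYSQL_USER'),
            passwd=settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True,
        )
        return cls(dbargs)


class FakeCrawler(object):
    """Stand-in for Scrapy's crawler object, for demonstration only."""
    settings = {
        'MYSQL_HOST': 'localhost',
        'MYSQL_DBNAME': 'douban9fen',
        'MYSQL_USER': 'root',
        'MYSQL_PASSWD': 'openstack',
    }


pipeline = SettingsBackedPipeline.from_crawler(FakeCrawler())
print(pipeline.dbargs['db'])
```

Keeping the credentials in one place means a change of host or password only touches settings.py.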

Note that MySQL must already be installed. The configuration above names the database douban9fen, so we first need to create a database called douban9fen in MySQL:

kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ mysql -uroot -p

Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 46

Server version: 5.5.52-0ubuntu0.14.04.1 (Ubuntu)

Oracle is a registered trademark of Oracle Corporation and/or its

affiliates. Other names may be trademarks of their respective

owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database douban9fen;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)

The database has been created successfully. Switch to it:

mysql> use douban9fen;

Next, create the table:

mysql> create table douban9fen (
    id int(4) not null primary key auto_increment,
    title varchar(100) not null,
    author varchar(40) not null,
    rate varchar(20) not null ) CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.04 sec)

Edit pipelines.py so that the data is stored in the database:

kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Store the data in the MySQL database
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class Douban9FenPipeline(object):
    # database parameters
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            db='douban9fen',
            user='root',
            passwd='openstack',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=True
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        # run the insert on the connection pool, then pass the item on
        res = self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    # insert into the table; the table must be created beforehand
    def insert_into_table(self, conn, item):
        conn.execute('insert into douban9fen(title, author, rate) values(%s, %s, %s)', (
            item['title'],
            item['author'],
            item['rate']
        ))
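The insert in insert_into_table passes the values separately from the SQL text, so the driver handles quoting. The sketch below demonstrates the same parameterized-insert idea against an in-memory SQLite database from the standard library, used here only as a stand-in for MySQL so the example runs anywhere. Note the differences: sqlite3 uses ? placeholders where MySQLdb uses %s, and the column types are simplified. The sample item is made up.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL douban9fen database.
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table douban9fen ('
    ' id integer primary key autoincrement,'
    ' title varchar(100) not null,'
    ' author varchar(40) not null,'
    ' rate varchar(20) not null)'
)

# Hypothetical item; the apostrophe would break naive string concatenation.
item = {'title': 'Example Title', 'author': "O'Example", 'rate': '9.1'}

# Values travel as a tuple, never spliced into the SQL string.
conn.execute(
    'insert into douban9fen (title, author, rate) values (?, ?, ?)',
    (item['title'], item['author'], item['rate']),
)
conn.commit()

rows = conn.execute('select title, author, rate from douban9fen').fetchall()
print(rows)
```

The same select statement, run in the MySQL client, is what the final verification step of this walkthrough relies on.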

After editing the files above, go back to the project root:

kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd ..
kuku@ubuntu:~/pachong/douban9fen$

Then run main.py again:

kuku@ubuntu:~/pachong/douban9fen$ python main.py


Open MySQL and check whether the data has been written to the database:

kuku@ubuntu:~/pachong/douban9fen$ mysql -uroot -p

Enter the password openstack to log in:

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)

mysql> use douban9fen;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------------+
| Tables_in_douban9fen |
+----------------------+
| douban9fen           |
+----------------------+
1 row in set (0.00 sec)

mysql> select * from douban9fen;

The query output shows that the data was written to the database successfully.

Original: https://blog.csdn.net/weixin_42514736/article/details/113998898
Author: 格林的雪国
Title: scrapy mysql 豆瓣_使用scrapy简易爬取豆瓣9分榜单图书并存放在mysql数据库中


