Crawling multi-page content into MySQL: using the Scrapy framework to scrape blog information and store it in a MySQL database

October 5, 2023, 5:08 PM • Python • 42 views

1. Required libraries

(1) Scrapy

(2) pymysql

2. Create the database and table

Create database hexun;

Use hexun;

Create table myhexun(id int(10) auto_increment primary key not null,name varchar(30),url varchar(100),hits int(15),comment int(15));

3. Create the Scrapy project

(1) Create the Scrapy project: scrapy startproject hexunpjt

(2) Create the spider: scrapy genspider -t basic myhexunspd hexun.com

(3) Start crawling: scrapy crawl myhexunspd

Or, to suppress the log output: scrapy crawl myhexunspd --nolog

4. Writing items.py

import scrapy


class HexunpjtItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # name stores the article titles
    name = scrapy.Field()
    # url stores the article URLs
    url = scrapy.Field()
    # hits stores the article view counts
    hits = scrapy.Field()
    # comment stores the article comment counts
    comment = scrapy.Field()
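As the pipeline and spider sections show, each field ends up holding a list with one entry per post on a list page, so the four lists have to line up by index. Here is a minimal sketch of turning those parallel lists into rows. It assumes a plain dict stands in for HexunpjtItem, and the helper name item_to_rows is hypothetical (it is not part of Scrapy or of the original project).

```python
def item_to_rows(item):
    """Turn the four parallel lists of an item into (name, url, hits, comment) tuples."""
    fields = ("name", "url", "hits", "comment")
    lengths = {field: len(item[field]) for field in fields}
    # All four lists must describe the same posts; a mismatch means one
    # extraction step missed something, so fail loudly instead of mis-pairing.
    if len(set(lengths.values())) > 1:
        raise ValueError("field lengths differ: %r" % lengths)
    return list(zip(*(item[field] for field in fields)))


# Hypothetical sample data standing in for one crawled list page
sample_item = {
    "name": ["First post", "Second post"],
    "url": ["http://example.com/1.html", "http://example.com/2.html"],
    "hits": ["120", "45"],
    "comment": ["3", "0"],
}
rows = item_to_rows(sample_item)
print(rows[0])
```

Checking the lengths up front is a design choice for this sketch: indexing the lists with a shared counter, as the pipeline does, would raise an IndexError partway through a page if one list came back shorter.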

5. Writing the pipeline

# -*- coding: utf-8 -*-
import pymysql

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class HexunpjtPipeline(object):
    def __init__(self):
        # Connect to the database when the pipeline is created
        self.conn = pymysql.connect(host="127.0.0.1", user="root", password="root", database="hexun")

    def process_item(self, item, spider):
        # Each blog list page contains several posts, so a for loop handles them one at a time
        sql = "insert into myhexun(name,url,hits,comment) values(%s,%s,%s,%s)"
        with self.conn.cursor() as cursor:
            for j in range(0, len(item["name"])):
                # Assign the extracted name, url, hits and comment to local variables
                name = item["name"][j]
                url = item["url"][j]
                hits = item["hits"][j]
                comment = item["comment"][j]
                # Pass the values as parameters so the driver escapes them,
                # instead of concatenating them into the SQL text
                cursor.execute(sql, (name, url, hits, comment))
        # pymysql does not autocommit by default, so commit the inserted rows
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the database connection at the end
        self.conn.close()
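Passing values as query parameters, rather than concatenating them into the SQL text, keeps the insert working when a title contains a quote character. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in, so that it runs without a MySQL server. That swap is an assumption made purely for illustration: sqlite3 uses ? placeholders where pymysql uses %s, the table definition is simplified, and the helper name save_rows and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL "hexun" database
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table myhexun("
    "id integer primary key autoincrement,"
    "name varchar(30), url varchar(100), hits int, comment int)"
)


def save_rows(conn, rows):
    """Insert (name, url, hits, comment) tuples using bound parameters."""
    # sqlite3 uses "?" placeholders; with pymysql the same statement uses "%s"
    sql = "insert into myhexun(name, url, hits, comment) values(?, ?, ?, ?)"
    conn.executemany(sql, rows)
    conn.commit()


# Hypothetical rows; the first title contains a single quote on purpose
rows = [
    ("It's a test", "http://example.com/1.html", "120", "3"),
    ("Second post", "http://example.com/2.html", "45", "0"),
]
save_rows(conn, rows)
stored = conn.execute("select name, hits, comment from myhexun order by id").fetchall()
print(stored)
```

With string concatenation, the first title would end the quoted SQL literal early and the statement would fail to parse; with bound parameters the driver handles the quoting.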

6. Settings configuration

ITEM_PIPELINES = {
    'hexunpjt.pipelines.HexunpjtPipeline': 300,
}

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

7. Writing the spider

# -*- coding: utf-8 -*-
import scrapy
import re
import urllib.request
from hexunpjt.items import HexunpjtItem
from scrapy.http import Request


class MyhexunspdSpider(scrapy.Spider):
    name = "myhexunspd"
    allowed_domains = ["hexun.com"]
    # uid of the user whose blog is crawled; used later to build the URLs
    uid = "19940007"

    # start_requests defines the first request of the crawl
    def start_requests(self):
        # Send the first request with a browser User-Agent
        yield Request("http://" + str(self.uid) + ".blog.hexun.com/p1/default.html", headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0"})

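    def parse(self, response):
        item = HexunpjtItem()
        item['name'] = response.xpath("//span[@class='ArticleTitleText']/a/text()").extract()
        item["url"] = response.xpath("//span[@class='ArticleTitleText']/a/@href").extract()
        # Next, use the urllib and re modules to get each post's comment and view counts
        # First, the regular expression that extracts the URL holding those counts
        # (the pattern is cut off in the source text; only its opening quote survives)
        pat1='
        # hcurl is the URL that holds the comment and view counts
        hcurl = re.compile(pat1).findall(response.text)[0]
        # Pose as a browser
        headers2 = ("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
        opener = urllib.request.build_opener()
        opener.addheaders = [headers2]
        # Install the opener globally
        urllib.request.install_opener(opener)
        # data holds the view and comment counts of every post on this list page;
        # decode the bytes so the regexes see the text itself rather than a bytes repr
        data = urllib.request.urlopen(hcurl).read().decode("utf-8", errors="ignore")
        # pat2 is the regular expression that extracts the view counts
        # (\d*? rather than \d? so that multi-digit ids and counts match)
        pat2 = r"click\d*?','(\d*?)'"
        # pat3 is the regular expression that extracts the comment counts
        pat3 = r"comment\d*?','(\d*?)'"
        # Extract the view and comment counts and assign them to the item's hits and comment fields
        item["hits"] = re.compile(pat2).findall(data)
        item["comment"] = re.compile(pat3).findall(data)
        yield item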

        # Extract the total number of blog list pages
        pat4 = "blog.hexun.com/p(.*?)/"
        # The regex returns a list; its second-to-last element is the total page count
        data2 = re.compile(pat4).findall(response.text)
        if len(data2) >= 2:
            totalurl = data2[-2]
        else:
            totalurl = 1
        # The print below is useful while debugging and can be commented out for real runs
        print("Total pages: " + str(totalurl))
        # Loop over the remaining blog list pages and crawl each in turn
        for i in range(2, int(totalurl) + 1):
            # Build the URL of the next blog list page to crawl
            nexturl = "http://" + str(self.uid) + ".blog.hexun.com/p" + str(i) + "/default.html"
            # Request it, again with a browser User-Agent
            yield Request(nexturl, callback=self.parse, headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0"})
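The regex steps in parse can be tried out in isolation, without running a crawl. The sketch below applies the count patterns and the page-count pattern to small hand-written strings. The sample strings are hypothetical: they only mimic the shape the patterns expect and are not real responses from hexun.com. The helper names are likewise made up for illustration, and the count patterns use \d*? so that multi-digit ids and counts match.

```python
import re

HITS_PATTERN = re.compile(r"click\d*?','(\d*?)'")
COMMENT_PATTERN = re.compile(r"comment\d*?','(\d*?)'")
PAGE_PATTERN = re.compile(r"blog.hexun.com/p(.*?)/")


def extract_counts(text):
    """Return (hits, comments) as two lists of digit strings."""
    return HITS_PATTERN.findall(text), COMMENT_PATTERN.findall(text)


def list_page_urls(uid, html):
    """Build the URLs of list pages 2..N, where N is read from the page links."""
    pages = PAGE_PATTERN.findall(html)
    # Following the article: the second-to-last match is the total page count
    total = int(pages[-2]) if len(pages) >= 2 else 1
    return [
        "http://%s.blog.hexun.com/p%d/default.html" % (uid, i)
        for i in range(2, total + 1)
    ]


# Hypothetical counter data: an id after "click"/"comment", then the count
sample_counts = (
    "('click1001','245');('comment1001','3');"
    "('click1002','18');('comment1002','0');"
)
hits, comments = extract_counts(sample_counts)
print(hits, comments)

# Hypothetical pager markup: links to pages 1-3 followed by a "next" link
sample_html = (
    '<a href="http://19940007.blog.hexun.com/p1/default.html">1</a>'
    '<a href="http://19940007.blog.hexun.com/p2/default.html">2</a>'
    '<a href="http://19940007.blog.hexun.com/p3/default.html">3</a>'
    '<a href="http://19940007.blog.hexun.com/p2/default.html">next</a>'
)
urls = list_page_urls("19940007", sample_html)
print(urls)
```

Because every list page yields requests for pages 2 through N again, the crawl relies on Scrapy's built-in duplicate request filter to visit each page only once.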

8. Results


Original: https://blog.csdn.net/weixin_33324007/article/details/113720730
Author: 米诺大魔王
Title: 爬取多页资讯到mysql_利用Scrapy框架爬取博客信息并存到mysql数据库

