I think this lesson explains it better!
1. Python 3: use PyMySQL
PyMySQL is the library used to connect to a MySQL server from Python 3.x; under Python 2 the MySQLdb library was used instead.
PyMySQL follows the Python Database API v2.0 specification and is a pure-Python MySQL client library.
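As a minimal sketch of what that means in practice (the credentials and database name below are placeholders, not values from this article), opening a connection with PyMySQL looks like this:

import pymysql

# Placeholder credentials -- replace with your own.
conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='baidutieba', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        cursor.execute('SELECT VERSION()')
        print(cursor.fetchone())
finally:
    conn.close()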
2. Basic MySQL operations
See the runoob tutorial "Python3 MySQL database connection – the PyMySQL driver" (www.runoob.com).
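The pipeline further down inserts into a baidu_tieba table, so that table has to exist first. The article never shows its DDL; a plausible schema matching the insert statement, created through PyMySQL, could look like this (the column types are assumptions):

import pymysql

# Assumed schema -- the original article does not show the CREATE TABLE statement.
create_sql = '''
CREATE TABLE IF NOT EXISTS baidu_tieba (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    content TEXT,
    device VARCHAR(100),
    floor VARCHAR(50),
    posttime VARCHAR(50)
) DEFAULT CHARSET = utf8
'''

conn = pymysql.connect(host='localhost', user='root', password='***',
                       database='baidutieba', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute(create_sql)
    conn.commit()
finally:
    conn.close()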
3. Code
main.py
from scrapy.cmdline import execute
import os
import sys

# Put the project directory on sys.path so Scrapy can locate the project.
project_dir = os.path.dirname(os.path.abspath(__file__))
print(project_dir)
sys.path.append(project_dir)

# Equivalent to running "scrapy crawl baidu" on the command line.
execute(["scrapy", "crawl", "baidu"])
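main.py only exists so the crawl can be started from the IDE: it puts the project directory on sys.path and then calls execute(), which is equivalent to running "scrapy crawl baidu" on the command line.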
The spider (e.g. spiders/baidu.py):

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from baidu_tieba.items import TiebaItem


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?ie=utf-8&kw=%E9%98%B2%E8%AF%88%E9%AA%97']

    def parse(self, response):
        # Collect the link of every thread on the list page.
        url_list = response.xpath('//a[@class="j_th_tit "]/@href').extract()
        print(url_list)
        for url in url_list:
            yield scrapy.Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)

    def parse_detail(self, response):
        # The thread title; the original snippet used an undefined "title"
        # variable, so a minimal extraction step is assumed here.
        title = response.xpath('//title/text()').extract()

        post_content_main = response.xpath('//div[contains(@class,"d_post_content_main")]')
        content_list = []
        device_list = []
        floor_list = []
        posttime_list = []
        # This part extracts the reply content, device, floor and post time in order.
        for r in post_content_main:
            content = r.xpath('.//div[contains(@class,"d_post_content")]/text()').extract()
            content = "".join(content).strip()
            content_list.append(content)
            print(content)
            post_tail = r.xpath('.//div[@class="post-tail-wrap"]')
            tail_info = r.xpath('.//div[@class="post-tail-wrap"]/span[@class="tail-info"]')
            for pt in post_tail:
                if len(tail_info) == 3:
                    # Three tail-info spans: device, floor and post time.
                    device = pt.xpath('./span[@class="tail-info"][1]//text()').extract()
                    device = "".join(device)
                    floor = pt.xpath('./span[@class="tail-info"][2]/text()').extract()
                    floor = floor[0]
                    post_time = pt.xpath('./span[@class="tail-info"][3]/text()').extract()
                    post_time = post_time[0]
                else:
                    # No device information recorded for this post.
                    device = "无"
                    floor = pt.xpath('./span[@class="tail-info"][1]/text()').extract()
                    floor = floor[0]
                    post_time = pt.xpath('./span[@class="tail-info"][2]/text()').extract()
                    post_time = post_time[0]
                device_list.append(device)
                floor_list.append(floor)
                posttime_list.append(post_time)

        for i in range(len(content_list)):
            tieba_item = TiebaItem()
            tieba_item['title'] = title[0]
            tieba_item['content'] = content_list[i]
            tieba_item['device'] = device_list[i]
            tieba_item['floor'] = floor_list[i]
            tieba_item['posttime'] = posttime_list[i]
            yield tieba_item
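In short: parse() collects every thread link (the a.j_th_tit anchors) from the list page and hands each one to parse_detail(), which walks the d_post_content_main blocks of the thread page and yields one TiebaItem per post with its title, content, device, floor and post time.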
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduTiebaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


# Custom item class.
class TiebaItem(scrapy.Item):
    # Fields for one Tieba post.
    title = scrapy.Field()
    content = scrapy.Field()
    device = scrapy.Field()
    floor = scrapy.Field()
    posttime = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = '''
            insert into baidu_tieba(title, content, device, floor, posttime)
            values (%s, %s, %s, %s, %s)
        '''
        params = (self['title'], self['content'], self['device'], self['floor'], self['posttime'])
        return insert_sql, params
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from twisted.enterprise import adbapi
import pymysql
import pymysql.cursors


class BaiduTiebaPipeline(object):
    def process_item(self, item, spider):
        return item


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    '''
    MYSQL_HOST='localhost'
    MYSQL_DBNAME='baidutieba'
    MYSQL_USER='root'
    MYSQL_PASSWORD='Mysql86mysql.'
    '''

    # Load the database settings from settings.py.
    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # Connection pool backed by pymysql.
        dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction runs do_insert asynchronously in a thread pool.
        query = self.dbpool.runInteraction(self.do_insert, item)
        return item

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
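One caveat: runInteraction() returns a Twisted Deferred, and the process_item() above ignores it, so a failed insert disappears silently. An optional refinement (not in the original code; handle_error is a name chosen here for illustration) is to attach an errback:

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Report the failure instead of swallowing it.
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # failure is a twisted.python.failure.Failure wrapping the exception
        print(failure)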
settings.py
ITEM_PIPELINES = {
    'baidu_tieba.pipelines.BaiduTiebaPipeline': 300,
    'baidu_tieba.pipelines.MysqlTwistedPipeline': 1,
}

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'baidutieba'
MYSQL_USER = '***'
MYSQL_PASSWORD = '***'
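The numbers in ITEM_PIPELINES are priorities: items pass through pipelines with lower numbers first, so MysqlTwistedPipeline (1) receives each item before the default BaiduTiebaPipeline (300).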
Tested and working!
Original: https://blog.csdn.net/weixin_39541189/article/details/113202026
Author: weixin_39541189
Title: scrapy and mysql – 贪心学院 (Greedy Academy) assignment: how to use the Scrapy framework with a MySQL database