Scraping the Douban Movie Top 250 with Scrapy and saving the data to MySQL/JSON/CSV

October 1, 2023, 1:34 PM • Python • 53 views


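I was helping a first-year student with a homework assignment and took a look at Scrapy along the way, so I am writing it down here for anyone who might find it useful.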

1. Project overview

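There is plenty of code online for scraping the Douban Movie Top 250 with Scrapy. I had not used the framework before and did not dig into every detail, so the scraping approach here is borrowed from the article "Scrapy实战" (Scrapy in Practice). I will not go through the scraping part in depth; you can read the code directly, and many blogs already explain it. This post mainly covers how to store the scraped data.
In my directory structure, the files and picture folders are reserved for the exported data files and the downloaded images; the other files are generated automatically when the Scrapy project is created.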
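
Because the spider and the pipelines later write to ./picture and ./files, both folders have to exist before the crawl starts, otherwise open() fails. A minimal sketch of creating them up front, assuming it is run from the project directory; the helper name ensure_output_dirs is hypothetical and not part of the original project:

```python
from pathlib import Path


def ensure_output_dirs(base_dir="."):
    """Create the files/ and picture/ folders under base_dir if they are missing."""
    created = []
    for name in ("files", "picture"):
        folder = Path(base_dir) / name
        # exist_ok=True makes the call safe to repeat on every run
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling ensure_output_dirs() at the top of main.py, before execute(...), would be one place to use it.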

2. Code walkthrough

main.py is a file I added myself to serve as the project's entry point, so the crawl can be run directly from PyCharm. The code is as follows:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
@File    :   main
@Author :   GrowingSnake
@Version :   1.0
@Description :
@Modify Time : 2021/6/8 17:22
"""
from scrapy.cmdline import execute

import sys
import os

# Make the directory containing main.py importable, then start the spider
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'moviespider'])
# Scrapy's built-in way to save the data as a JSON or CSV file
# execute(['scrapy', 'crawl', 'moviespider', '-o', 'moviespider.json'])
# execute(['scrapy', 'crawl', 'moviespider', '-o', 'moviespider.csv'])

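Scrapy itself provides a way to save the data as a JSON or CSV file: just run one of the commented-out commands in main.py. This is the simplest way to save the data, but it is not very flexible, so other approaches are described below.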
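
As a middle ground between the -o flag and hand-written pipelines, Scrapy 2.1 and later also accept a FEEDS setting in settings.py. The following is only a sketch under assumptions: the output paths and the field order are illustrative choices, not something the original project defines.

```python
# settings.py (sketch): declare the feed exports once instead of passing -o on each run.
# Paths and field order below are assumptions made for illustration.
FEEDS = {
    "files/movie.json": {
        "format": "json",
        "encoding": "utf8",  # write readable Chinese text instead of \uXXXX escapes
        "indent": 2,
    },
    "files/movie.csv": {
        "format": "csv",
        "encoding": "utf-8-sig",  # adds a BOM so Excel detects UTF-8
        "fields": ["rank", "title", "director", "score", "introduction", "picture"],
    },
}
```

With this in place, a plain `scrapy crawl moviespider` would write both files, at the cost of less control over each row than a custom pipeline gives.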

items.py defines the movie fields to scrape:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DoubanmovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields to collect
    # Movie rank
    rank = scrapy.Field()
    # Movie title
    title = scrapy.Field()
    # Director
    director = scrapy.Field()
    # Score
    score = scrapy.Field()
    # One-line introduction
    introduction = scrapy.Field()
    # Poster image URL
    picture = scrapy.Field()

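moviespider.py contains the XPath extraction rules. After extracting a poster's image URL, it downloads the poster itself with a request from the urllib library and saves it to the picture folder.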

# -*- coding: utf-8 -*-
from urllib import request

import scrapy
# Works because main.py appends the project directory to sys.path
from items import DoubanmovieItem

class BookspiderSpider(scrapy.Spider):
    name = 'moviespider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Select the root node of every movie entry on the page
        movie_item = response.xpath('//div[@class="item"]')
        # Iterate over the entries and parse each one
        for item in movie_item:
            # Create the item object
            movie = DoubanmovieItem()
            # Movie rank
            movie['rank'] = item.xpath('div[@class="pic"]/em/text()').extract()
            # Movie title
            movie['title'] = item.xpath('div[@class="info"]/div[@class="hd"]/a/span[1]/text()').extract()
            # Director (the text node also contains the cast; it is split off in the pipeline)
            movie['director'] = item.xpath('div[@class="info"]/div[@class="bd"]/p/text()').extract()
            # Score
            movie['score'] = item.xpath(
                'div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[2]/text()').extract()
            # One-line introduction
            movie['introduction'] = item.xpath('div[@class="info"]/div[@class="bd"]/p[2]/span/text()').extract()
            # Poster image URL
            movie['picture'] = item.xpath('div[@class="pic"]/a/img/@src').extract()
            # Save the poster locally
            try:
                if movie['picture']:
                    # Create a new Request object for the image URL
                    req_img = request.Request(url=movie['picture'][0])
                    img_data = request.urlopen(req_img)
                    img_name = str(movie['title'][0]) + '.jpg'
                    # Write the image bytes to the picture folder
                    with open('./picture/'+img_name, 'wb') as f:
                        f.write(img_data.read())
            except Exception:
                print('Failed to download the poster')
            # Hand the item over to the pipelines
            yield movie
        # Follow the "next page" link until the last page is reached
        next_page = response.css('span.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
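
The spider uses the movie title directly as the poster file name. That works for most titles, but a title containing a character such as "/" or ":" cannot be used as a file name on every platform. A small sketch of one way to guard against this; safe_image_name is a hypothetical helper, not part of the original spider:

```python
import re


def safe_image_name(title, extension=".jpg"):
    """Build a file name from a movie title.

    Characters that are not allowed in Windows or POSIX file names
    are replaced with an underscore.
    """
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title.strip())
    # Fall back to a fixed name if nothing is left after stripping
    return (cleaned or "untitled") + extension


print(safe_image_name("Face/Off"))  # Face_Off.jpg
```

In the spider, the line that builds img_name could call this helper with movie['title'][0]. Scrapy also ships an ImagesPipeline for downloading images, which would be an alternative to calling urllib inside parse.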

pipelines.py
Here I built three pipelines, which store the scraped data in a MySQL database, a JSON file, and a CSV file respectively.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from scrapy.exporters import JsonItemExporter, CsvItemExporter, JsonLinesItemExporter

class DoubanmoviePipeline(object):
    """
    Database pipeline
    """
    def __init__(self):
        # Connect to the MySQL database (replace the placeholders with your own values)
        self.connect = pymysql.connect(host='localhost', user='your_username', password='your_password', db='your_database', port=3306)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # The text node holds the director followed by '\xa0\xa0\xa0主演: ' and the cast;
        # keep only the director part
        director = item['director'][0].strip().split('\xa0\xa0\xa0主演: ')[0]
        print('Rank:', item['rank'][0])
        print('Title:', item['title'][0])
        print(director)
        print('Score:', item['score'][0])
        if len(item['introduction']) > 0:
            introduction = item['introduction'][0]
            print('Introduction:', introduction)
        else:
            # Some movies have no one-line introduction
            introduction = 0
        print('Poster URL:', item['picture'][0])

        # Pass the values as query parameters so that quotes in the text cannot break the SQL
        self.cursor.execute(
            'insert into doubanmovie(id,name,director,score,introduction,picture) '
            'VALUES (%s,%s,%s,%s,%s,%s)',
            (item['rank'][0], item['title'][0].strip(), director, item['score'][0],
             introduction, item['picture'][0]))

        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor, then the connection
        self.cursor.close()
        self.connect.close()

class JsonPipeline(object):
    """
    JSON storage pipeline
    """
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode
        self.file = open('./files/movie.json', 'wb')
        # JsonLinesItemExporter would write one JSON object per line instead of a single array
        # self.exporter = JsonLinesItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

class CsvPipeline(object):
    """
    CSV storage pipeline
    """
    def __init__(self):
        self.file = open('./files/booksdata.csv', 'wb')
        # utf-8-sig writes a BOM so that Excel opens the Chinese text correctly
        self.exporter = CsvItemExporter(self.file, encoding='utf-8-sig')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
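
Values scraped from a web page can contain quotes, so it is safer to pass them to the database driver as query parameters than to format them into the SQL string. The sketch below illustrates the idea with the standard-library sqlite3 module purely so that it runs without a MySQL server; that substitution, the column types, and the sample row are all assumptions, since the original post does not show the table definition. With pymysql the placeholder is %s instead of ?.

```python
import sqlite3

# Stand-in database: sqlite3 replaces MySQL here only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doubanmovie ("
    "id INTEGER PRIMARY KEY, name TEXT, director TEXT, "
    "score REAL, introduction TEXT, picture TEXT)"  # column types are assumed
)


def insert_movie(connection, row):
    """Insert one movie row using placeholders instead of string formatting."""
    connection.execute(
        "INSERT INTO doubanmovie (id, name, director, score, introduction, picture) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        row,
    )
    connection.commit()


# A title containing a double quote would break a query assembled with str.format.
insert_movie(
    conn,
    (1, 'A "quoted" title', "Some Director", 9.7, "", "https://example.com/p.jpg"),
)
stored_name = conn.execute("SELECT name FROM doubanmovie WHERE id = 1").fetchone()[0]
```

The driver escapes each value itself, so the stored title comes back exactly as it was scraped.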

Note: once the pipelines are built, remember to register them in settings.py.

ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 300,
    'doubanmovie.pipelines.JsonPipeline': 200,
    'doubanmovie.pipelines.CsvPipeline': 100
}
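
The integer assigned to each pipeline sets its position in the chain: Scrapy passes every item through the pipelines in ascending order of these values, which are conventionally chosen from the 0 to 1000 range. The small sketch below only sorts the dictionary to make that order visible; pipeline_order is a hypothetical helper and not a Scrapy API.

```python
ITEM_PIPELINES = {
    "doubanmovie.pipelines.DoubanmoviePipeline": 300,
    "doubanmovie.pipelines.JsonPipeline": 200,
    "doubanmovie.pipelines.CsvPipeline": 100,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by priority value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With the values above, each item is written to the CSV file first, then to the JSON file, and finally inserted into MySQL.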

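3. Summary

I had never used Scrapy before, but today's frameworks are so mature that basic tasks take very little effort, which is really convenient. Since this started as homework help for a first-year student, I also want to say to newcomers: do not be afraid of code. It is not that hard; write more, read more, and things get easier once you are past the basics.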

Original: https://blog.csdn.net/nc514819873/article/details/117918051
Author: Growing_Snake
Title: Scraping the Douban Movie Top 250 with Scrapy and saving the data to MySQL/JSON/CSV

