Scrapy in Practice with Python: Scraping Douban Books

According to the official Scrapy website, Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages; in other words, a crawler. It can be used for data mining, data monitoring, and automated testing.

With a Python environment in place, you can install it directly with pip.

pip install scrapy

After installation, run scrapy in the terminal to verify that it installed correctly. The output looks like this:

Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

By default, it simply prints the list of available commands.

To create a Scrapy project, run the following command; it generates a hello_scrapy project in the current directory. This is an ordinary Python project and can be opened with any IDE.

scrapy startproject hello_scrapy

Once the project is created, change into the hello_scrapy directory and take a look at the directory structure:

├── hello_scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files

With the project in place, the next step is to create a db_book spider:

scrapy genspider -t basic db_book douban.com

This command creates a spider named db_book for the domain douban.com, using the basic template.

If creation succeeds, you will see the following message:

Created spider 'db_book' using template 'basic' in module:
  hello_scrapy.spiders.db_book

Now let's look at the directory structure again:

├── hello_scrapy
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-37.pyc
│       └── db_book.py
└── scrapy.cfg

4 directories, 11 files

Comparing the tree before and after creating the db_book spider, you can see a few new files. The key one is ./spiders/db_book.py, which is the db_book spider generated by the command-line tool.

With the project and spider created, open the db_book spider to define what it should actually do. The file lives at hello_scrapy/spiders/db_book.py. By default it defines a DbBookSpider class with a stub parse method; parse is where the spider's actual parsing logic goes.

import scrapy

class DbBookSpider(scrapy.Spider):
    name = 'db_book'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/']

    def parse(self, response):
        pass

Note the start_urls field: it lists the entry-point URLs the spider starts crawling from. Since we want to crawl the Programming (编程) tag of Douban Books, we adjust it accordingly. The updated code looks like this:

import scrapy

class DbBookSpider(scrapy.Spider):
    name = 'db_book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    def parse(self, response):
        pass

To quickly verify that the spider works, let's simply save the fetched page to disk. Modify the parse method as follows:

    def parse(self, response):
        # Save the raw HTML so we can confirm the page was actually fetched
        file_name = 'douban_python.html'
        with open(file_name, 'wb') as f:
            f.write(response.body)

The spider is ready; how do we run it?

In the terminal, change into the project root directory, which in this case is:

/Users/michaelkoo/work/env/csdn/code/hello_scrapy

This is the root of the project we just created. From there, run the spider with:

scrapy crawl db_book

After it runs, you will see output similar to this:

2021-02-20 10:28:23 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: hello_scrapy)
2021-02-20 10:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (default, Dec 13 2019, 19:58:14) - [Clang 11.0.0 (clang-1100.0.33.17)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Darwin-19.6.0-x86_64-i386-64bit
2021-02-20 10:28:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-02-20 10:28:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'hello_scrapy',
 'NEWSPIDER_MODULE': 'hello_scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['hello_scrapy.spiders']}

 ...

 ...

 ...

 2021-02-20 10:28:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 227,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 536,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.490386,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 2, 20, 2, 28, 25, 451064),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 50585600,
 'memusage/startup': 50585600,
 'response_received_count': 1,
 'robotstxt/forbidden': 1,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 2, 20, 2, 28, 23, 960678)}
2021-02-20 10:28:25 [scrapy.core.engine] INFO: Spider closed (finished)

Note: if you see something like the following, you need to adjust the configuration:

2021-02-20 15:31:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-20 15:31:31 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/robots.txt> (referer: None)
2021-02-20 15:31:31 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/tag/%E7%BC%96%E7%A8%8B> (referer: None)
2021-02-20 15:31:31 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://book.douban.com/tag/%E7%BC%96%E7%A8%8B>: HTTP status code is not handled or not allowed

Open the settings.py file, uncomment the USER_AGENT entry, and change its value to something like this:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_7) AppleWebKit/601.1 (KHTML, like Gecko) Version/13.1.3 Safari/601.1'
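
If you would rather not change the project-wide settings.py, Scrapy also supports a per-spider custom_settings dictionary; a minimal sketch, with the user-agent string being just an example value:

import scrapy

class DbBookSpider(scrapy.Spider):
    name = 'db_book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    # Per-spider settings override the project-wide values in settings.py.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_7) '
                       'AppleWebKit/601.1 (KHTML, like Gecko) Version/13.1.3 Safari/601.1'),
    }

    # ... parse() as defined earlier ...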

Then run the spider again, and you should see output like this:

2021-02-20 15:33:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/robots.txt> (referer: None)
2021-02-20 15:33:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/tag/%E7%BC%96%E7%A8%8B> (referer: None)
2021-02-20 15:33:23 [scrapy.core.engine] INFO: Closing spider (finished)

This means the crawl succeeded.

Looking at the project directory, the page content has been saved:

michaelkoo@MacBook hello_scrapy % ls
douban_python.html  hello_scrapy        scrapy.cfg

At this point we have fetched the content of the target page.

So far we have only saved the raw page without extracting any data, so let's look at how to extract data.

  • Scrapy ships with an interactive debugging environment of its own: the shell tool. From the command line, enter it with the following command:
scrapy shell https://book.douban.com/tag/%E7%BC%96%E7%A8%8B

Once it starts, you will see a banner like this:

2021-02-20 15:39:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://book.douban.com/tag/%E7%BC%96%E7%A8%8B> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x108cb6128>
[s]   item       {}
[s]   request    <GET https://book.douban.com/tag/%E7%BC%96%E7%A8%8B>
[s]   response   <200 https://book.douban.com/tag/%E7%BC%96%E7%A8%8B>
[s]   settings   <scrapy.settings.Settings object at 0x108ca9d68>
[s]   spider     <DbBookSpider 'db_book' at 0x108e95860>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

From here we can experiment step by step and work out the selectors for the content we need.

  • Analyze the page structure and collect all book nodes
    Inspecting the page shows that every book sits in an li tag carrying the subject-item class, so we can grab all of the book nodes at once. In the shell:
items = response.css('li.subject-item')
items[0]  # inspect the first node
  • Extract the child nodes of each book node
    Looking inside each subject-item, there is a div with the class info that holds the book's details, so we can keep going in the shell:
info = items[0].css('div.info')
title = info.css('a::text').get()  # main title
sub_title = info.css('span::text').get()  # subtitle
pub = info.css('div.pub::text').get()  # publication info
rating = info.css('span.rating_nums::text').get()  # rating score
pl = info.css('span.pl::text').get()  # number of ratings
desp = info.css('p::text').get()  # description

In the selectors above, "::text" means extract the text content of the matched element.
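
A quick aside you can try in the same shell session: .get() returns only the first match, while .getall() returns every match as a list, which helps when a selector hits more than one text node.

info.css('a::text').get()     # first matching text node only
info.css('a::text').getall()  # every matching text node, as a list of strings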

The cover image can be extracted like this:

pic_div = items[0].css('div.pic')
pic = pic_div.css('img::attr(src)').get()

Here "::attr(src)" means extract the value of the src attribute.
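
As an equivalent alternative (assuming the same shell session as above), a selector also exposes an attrib mapping over the first matched element's attributes, so this sketch should return the same URL:

pic_div.css('img').attrib['src']  # dict-style access to the <img> tag's attributes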

  • Now let's see how to fetch the next page
    From the page structure, the "next page" (后页) link sits in a span with the class next, whose parent is the div with the class paginator. In the shell:
paginator = response.css('div.paginator')
next = paginator.css('span.next').css('a::attr(href)').get()

  • Writing the full spider
import scrapy
from scrapy.http.response.html import HtmlResponse

class DbBookSpider(scrapy.Spider):
    name = 'db_book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    def parse(self, response: HtmlResponse):
        for item in response.css('li.subject-item'):
            info = item.css('div.info')
            title = info.css('a::text').get()  # main title
            sub_title = info.css('span::text').get()  # subtitle
            pub = info.css('div.pub::text').get()  # publication info
            rating = info.css('span.rating_nums::text').get()  # rating score
            pl = info.css('span.pl::text').get()  # number of ratings
            description = info.css('p::text').get()  # description

            pic_div = item.css('div.pic')
            pic = pic_div.css('img::attr(src)').get()
            yield {
                'title': title,
                'sub_title': sub_title,
                'pub': pub,
                'rating': rating,
                'pl': pl,
                'description': description,
                'pic': pic
            }

        paginator = response.css('div.paginator')
        next_page = paginator.css('span.next').css('a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
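
One detail about the pagination code above: the href in the paginator is a relative path, and response.follow resolves it against the current page automatically. If you ever need the absolute URL yourself (for logging, deduplication, and so on), response.urljoin performs the same resolution; a small sketch of an equivalent ending for parse():

        # Equivalent to the response.follow() call above, but building the
        # absolute URL explicitly with response.urljoin().
        if next_page is not None:
            next_url = response.urljoin(next_page)
            self.logger.info('Following next page: %s', next_url)
            yield scrapy.Request(next_url, callback=self.parse)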

To save the extracted results to a JSON file, run the following command (the capital -O overwrites the file if it already exists):

scrapy crawl db_book -O db_book.json
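
If you prefer not to pass the output flag on every run, Scrapy (2.1 and later) can also be configured to export results via the FEEDS setting in settings.py; a minimal sketch, with the file name and options as example choices:

# settings.py -- roughly equivalent to "scrapy crawl db_book -O db_book.json"
FEEDS = {
    'db_book.json': {
        'format': 'json',
        'encoding': 'utf8',
        'overwrite': True,  # like -O: replace the file instead of appending
    },
}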

Here are the statistics from that crawl:

2021-02-20 17:29:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 22533,
 'downloader/request_count': 52,
 'downloader/request_method_count/GET': 52,
 'downloader/response_bytes': 671738,
 'downloader/response_count': 52,
 'downloader/response_status_count/200': 52,
 'elapsed_time_seconds': 52.904263,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 2, 20, 9, 29, 19, 530091),
 'item_scraped_count': 1000,
 'log_count/DEBUG': 1052,
 'log_count/INFO': 11,
 'memusage/max': 50323456,
 'memusage/startup': 50323456,
 'request_depth_max': 50,
 'response_received_count': 52,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 51,
 'scheduler/dequeued/memory': 51,
 'scheduler/enqueued': 51,
 'scheduler/enqueued/memory': 51,
 'start_time': datetime.datetime(2021, 2, 20, 9, 28, 26, 625828)}

Let's also take a look at the data in db_book.json:

[
{"title": "\n\n    Python\u7f16\u7a0b\n\n\n    \n      ", "sub_title": " : \u4ece\u5165\u95e8\u5230\u5b9e\u8df5 ", "pub": "\n        \n  \n  [\u7f8e] \u57c3\u91cc\u514b\u00b7\u9a6c\u745f\u65af / \u8881\u56fd\u5fe0 / \u4eba\u6c11\u90ae\u7535\u51fa\u7248\u793e / 2016-7-1 / 89.00\u5143\n\n      ", "rating": "9.1", "pl": "\n        (3559\u4eba\u8bc4\u4ef7)\n    ", "description": "\u672c\u4e66\u662f\u4e00\u672c\u9488\u5bf9\u6240\u6709\u5c42\u6b21\u7684Python \u8bfb\u8005\u800c\u4f5c\u7684Python \u5165\u95e8\u4e66\u3002\u5168\u4e66\u5206\u4e24\u90e8\u5206\uff1a\u7b2c\u4e00\u90e8\u5206\u4ecb\u7ecd\u7528Python \u7f16\u7a0b\u6240\u5fc5\u987b\u4e86\u89e3\u7684\u57fa\u672c\u6982\u5ff5\uff0c\u5305\u62ecmatplot... ", "pic": "https://img9.doubanio.com/view/subject/s/public/s28891775.jpg"},

Since there is far too much data to show, here is just a small excerpt, pretty-printed:

[{
        "title": "\n\n    Python编程\n\n\n    \n      ",
        "sub_title": " : 从入门到实践 ",
        "pub": "\n        \n  \n  [美] 埃里克·马瑟斯 / 袁国忠 / 人民邮电出版社 / 2016-7-1 / 89.00元\n\n      ",
        "rating": "9.1",
        "pl": "\n        (3559人评价)\n    ",
        "description": "本书是一本针对所有层次的Python 读者而作的Python 入门书。全书分两部分：第一部分介绍用Python 编程所必须了解的基本概念，包括matplot... ",
        "pic": "https://img9.doubanio.com/view/subject/s/public/s28891775.jpg"
    },
    {
        "title": "\n\n    编码\n\n\n    \n      ",
        "sub_title": " : 隐匿在计算机软硬件背后的语言 ",
        "pub": "\n        \n  \n  [美] Charles Petzold / 左飞、薛佟佟 / 电子工业出版社 / 2010 / 55.00元\n\n      ",
        "rating": "9.3",
        "pl": "\n        (3659人评价)\n    ",
        "description": "本书讲述的是计算机工作原理。作者用丰富的想象和清晰的笔墨将看似繁杂的理论阐述得通俗易懂，你丝毫不会感到枯燥和生硬。更重要的是，你会因此而获得对计算机工作原理... ",
        "pic": "https://img2.doubanio.com/view/subject/s/public/s27331702.jpg"
    }
]

The output is not perfectly clean (there is a lot of surrounding whitespace), but all of the essential data is there.
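
Most of that noise is just whitespace, so a light clean-up goes a long way. A sketch using a hypothetical clean() helper (not part of the spider above), which guards against None because .get() returns None when nothing matches:

import re

def clean(value):
    # Collapse internal runs of whitespace and strip the ends;
    # .get() can return None, so pass that through unchanged.
    if value is None:
        return None
    return re.sub(r'\s+', ' ', value).strip()

# Example: the raw title scraped above becomes a tidy string.
print(clean('\n\n    Python编程\n\n\n    \n      '))  # -> 'Python编程'

In the spider, you could wrap each field with clean() before yielding the item dictionary.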

At this point we have walked through the complete workflow, from crawling to parsing to storage, and this example gives a good picture of what Scrapy can do. More advanced features will be covered in follow-up articles.

Original: https://blog.csdn.net/m0_52973494/article/details/114296565
Author: 慕容卡卡
Title: Scrapy in Practice with Python: Scraping Douban Books
