Scrapy Beginner Notes (2) – Dangdang Monthly Bestseller Data

Goal: use Scrapy to crawl every page of Dangdang's monthly bestseller list.

New in this post: extracting data with XPath, multi-page crawling, (multiple) item pipelines, and pipeline setup.

url = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s" % (page)

page ranges from 1 to 25
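The URL template above can be expanded into the full list of page URLs with plain Python string formatting, a minimal sketch:

```python
# Sketch: expanding the bestseller URL template for pages 1 through 25.
base_url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s'
urls = [base_url % page for page in range(1, 26)]

print(len(urls))   # 25
print(urls[0])     # ...-recent30-0-0-1-1
print(urls[-1])    # ...-recent30-0-0-1-25
```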

Fields needed: {name, author, press, price, image src}

Corresponding XPath expressions:

name://div[@class="bang_list_box"]/ul/li/div[@class="name"]/a/text()
author://div[@class="bang_list_box"]/ul/li/div[@class="publisher_info"][1]/a[1]/text()
press://div[@class="bang_list_box"]/ul/li/div[@class="publisher_info"][2]/a[1]/text()
price://div[@class="bang_list_box"]/ul/li/div[@class="price"]/p[1]/span[1]/text()
src://div[@class="bang_list_box"]/ul/li/div[@class="pic"]/a/img/@src
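As a rough illustration of how these relative paths map onto the page structure (not Scrapy itself, and using a made-up fragment, not real page data), the same lookups can be replayed with the standard library's ElementTree. ElementTree only supports a subset of XPath, so `.text` and `.get()` stand in for `text()` and `@src`:

```python
# Made-up fragment mimicking one <li> of the bestseller list.
import xml.etree.ElementTree as ET

fragment = """
<div class="bang_list_box">
  <ul>
    <li>
      <div class="name"><a>Some Book</a></div>
      <div class="publisher_info"><a>Some Author</a></div>
      <div class="publisher_info"><a>Some Press</a></div>
      <div class="price"><p><span>¥25.00</span></p></div>
      <div class="pic"><a><img src="http://img.example.com/cover.jpg"/></a></div>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(fragment)
li = root.find('./ul/li')

name = li.find("./div[@class='name']/a").text
publishers = li.findall("./div[@class='publisher_info']")  # [0] author, [1] press
author = publishers[0].find('./a').text
press = publishers[1].find('./a').text
price = li.find("./div[@class='price']/p/span").text
src = li.find("./div[@class='pic']/a/img").get('src')

print(name, author, press, price, src)
```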

Spider file:

import scrapy

class DangdangbookSpider(scrapy.Spider):
    name = 'dangdangbook'
    allowed_domains = ['bang.dangdang.com']
    start_urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-1']
    page = 1
    base_url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s'

    def parse(self, response):
        msg_list = response.xpath('//div[@class="bang_list_box"]/ul/li')
        for msg in msg_list:
            # Build a fresh dict per book; reusing a single dict would make
            # every yielded item point at the same object.
            msg_dict = {}
            msg_dict['name'] = msg.xpath('./div[@class="name"]/a/text()').extract_first()
            msg_dict['author'] = msg.xpath('./div[@class="publisher_info"][1]/a[1]/text()').extract_first()
            msg_dict['press'] = msg.xpath('./div[@class="publisher_info"][2]/a[1]/text()').extract_first()
            msg_dict['price'] = msg.xpath('./div[@class="price"]/p[1]/span[1]/text()').extract_first()
            msg_dict['src'] = msg.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            yield msg_dict

        # Keep requesting the next page until all 25 pages are crawled.
        if self.page < 25:
            self.page += 1
            url = self.base_url % self.page
            yield scrapy.Request(url=url, callback=self.parse)

Analysis: `parse` selects every `<li>` under `bang_list_box`, builds a dict per book, and yields it to the pipelines. A class-level `page` counter drives pagination: after each page is parsed, the spider increments the counter and yields a `scrapy.Request` for the next page URL (up to page 25), reusing `parse` as the callback.
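One subtlety worth noting: when yielding plain dicts, a fresh dict should be created for each item. If a single dict is reused, any consumer that buffers items ends up holding several references to the same object, all showing the last book's values. A minimal demonstration:

```python
# Reusing one dict: the list holds three references to the SAME object.
def reuse_one_dict():
    d = {}
    out = []
    for i in range(3):
        d['n'] = i
        out.append(d)          # same object appended three times
    return out

# Fresh dict per iteration: three independent objects.
def fresh_dict_each_time():
    out = []
    for i in range(3):
        out.append({'n': i})
    return out

print(reuse_one_dict())        # [{'n': 2}, {'n': 2}, {'n': 2}]
print(fresh_dict_each_time())  # [{'n': 0}, {'n': 1}, {'n': 2}]
```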

Pipeline file: pipelines.py

import json
import os
import urllib.request


class DangdangbookPipeline:
    def open_spider(self, spider):
        # Open the file once for the whole crawl, not once per item.
        self.fp = open('dd_sell_well_books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # json.dumps (not str) so the file contains valid JSON lines.
        self.fp.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


class DangDangImgPipeline:
    def process_item(self, item, spider):
        url = item.get('src')
        # Make sure the target directory exists, and replace '/' in titles
        # so they don't get interpreted as sub-paths.
        os.makedirs('dd_imgs', exist_ok=True)
        filename = 'dd_imgs/' + item.get('name').replace('/', '-') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
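Two small details from the pipelines above, replayed in isolation with hypothetical sample data: sanitizing `/` out of a title before using it as a filename, and serializing an item as one JSON line:

```python
import json

# Hypothetical item: a title containing '/' would otherwise create a bogus
# subdirectory inside dd_imgs/.
item = {'name': 'C/C++ Primer', 'src': 'http://img.example.com/cover.jpg'}

filename = 'dd_imgs/' + item['name'].replace('/', '-') + '.jpg'
line = json.dumps(item, ensure_ascii=False)

print(filename)   # dd_imgs/C-C++ Primer.jpg
```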

Analysis: `DangdangbookPipeline` opens the output file in `open_spider` and closes it in `close_spider`, so the file handle is created once per crawl instead of once per item. `DangDangImgPipeline` downloads each cover image with `urllib.request.urlretrieve`. Both pipelines must `return item` so the next pipeline in the chain still receives it.

settings.py

ITEM_PIPELINES = {
    'DangDangBook.pipelines.DangdangbookPipeline': 300,
    'DangDangBook.pipelines.DangDangImgPipeline': 301,
}

Note: the number is the pipeline's priority, which controls the order items flow through the pipelines. Valid values run from 1 to 1000; the lower the value, the earlier the pipeline runs.
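The ordering rule amounts to sorting the `ITEM_PIPELINES` dict by value, a minimal sketch:

```python
ITEM_PIPELINES = {
    'DangDangBook.pipelines.DangdangbookPipeline': 300,
    'DangDangBook.pipelines.DangDangImgPipeline': 301,
}

# Lower value = higher priority = runs earlier in the item chain.
ordered = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(ordered)
```

So here each item is written to the JSON file first (300), then its cover image is downloaded (301).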

Original: https://blog.csdn.net/qq_50300933/article/details/123038729
Author: 归琳
Title: Scrapy Beginner Notes (2) – Dangdang Monthly Bestseller Data
