Goal: use Scrapy to crawl every page of Dangdang's monthly bestseller list
New concepts: XPath data extraction, multi-page crawling, (multiple) pipelines, pipeline configuration
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s' % page
page: 1~25
Fields to collect: {name, author, press, price, image src}
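To make the multi-page plan concrete, all 25 page URLs can be generated from the template above; a minimal sketch (the variable names here are mine, not from the post):

```python
# URL template from the post: %s is the 1-based page number
BASE_URL = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s"

# Pages run from 1 to 25 inclusive
urls = [BASE_URL % page for page in range(1, 26)]

print(len(urls))   # 25
print(urls[0])     # ends in -1
print(urls[-1])    # ends in -25
```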
Corresponding XPath expressions (the book name lives inside an <a> tag, so the name expression needs the extra /a step):
name://div[@class="bang_list_box"]/ul/li/div[@class="name"]/a/text()
author://div[@class="bang_list_box"]/ul/li/div[@class="publisher_info"][1]/a[1]/text()
press://div[@class="bang_list_box"]/ul/li/div[@class="publisher_info"][2]/a[1]/text()
price://div[@class="bang_list_box"]/ul/li/div[@class="price"]/p[1]/span[1]/text()
src://div[@class="bang_list_box"]/ul/li/div[@class="pic"]/a/img/@src
Spider file:

import scrapy

class DangdangbookSpider(scrapy.Spider):
    name = 'dangdangbook'
    allowed_domains = []  # left empty, so no offsite filtering is applied
    start_urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-1']
    page = 1
    base_url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s'

    def parse(self, response):
        msg_list = response.xpath('//div[@class="bang_list_box"]/ul/li')
        for msg in msg_list:
            # Create a fresh dict per book; reusing one dict across
            # iterations would make every yielded item the same object
            msg_dict = {}
            msg_dict['name'] = msg.xpath('./div[@class="name"]/a/text()').extract_first()
            msg_dict['author'] = msg.xpath('./div[@class="publisher_info"][1]/a[1]/text()').extract_first()
            msg_dict['press'] = msg.xpath('./div[@class="publisher_info"][2]/a[1]/text()').extract_first()
            msg_dict['price'] = msg.xpath('./div[@class="price"]/p[1]/span[1]/text()').extract_first()
            msg_dict['src'] = msg.xpath('./div[@class="pic"]/a/img/@src').extract_first()
            yield msg_dict
        # Follow the next page until page 25
        if self.page < 25:
            self.page += 1
            yield scrapy.Request(url=self.base_url % self.page, callback=self.parse)
Analysis: parse yields one dict per book on the current page, then increments self.page and schedules a new Request back to itself until page 25 has been crawled.
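The pagination check at the end of parse can be restated as a standalone helper for testing; a sketch, where the function name next_page_url is my own, not from the post:

```python
BASE_URL = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent30-0-0-1-%s"

def next_page_url(page, base_url=BASE_URL, last_page=25):
    """Mirror the spider's check: return the next page's URL, or None on the last page."""
    if page < last_page:
        return base_url % (page + 1)
    return None

print(next_page_url(1))    # URL ending in -2
print(next_page_url(25))   # None: crawling stops here
```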
Pipeline file: pipelines.py
import json
import os
import urllib.request

class DangdangbookPipeline:
    def open_spider(self, spider):
        self.fp = open('dd_sell_well_books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # json.dumps (rather than str) keeps the file valid JSON Lines
        self.fp.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

class DangDangImgPipeline:
    def process_item(self, item, spider):
        url = item.get('src')
        os.makedirs('dd_imgs', exist_ok=True)  # urlretrieve fails if the directory is missing
        filename = 'dd_imgs/' + item.get('name').replace('/', '-') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
Analysis: the first pipeline serializes every item to a JSON file; the second downloads each cover image. Both must return item so that pipelines later in the chain keep receiving it.
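The image pipeline only replaces '/' in the book name, but titles can contain other characters that are illegal in filenames on some systems. A broader sanitizer sketch (safe_filename is my own helper name, not part of the post's code):

```python
import re

def safe_filename(name, ext=".jpg"):
    # Replace characters that are illegal on Windows (and '/' everywhere)
    return re.sub(r'[\\/:*?"<>|]', "-", name).strip() + ext

print(safe_filename("C/C++ Primer"))   # C-C++ Primer.jpg
```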
settings.py
ITEM_PIPELINES = {
'DangDangBook.pipelines.DangdangbookPipeline': 300,
'DangDangBook.pipelines.DangDangImgPipeline':301,
}
Note: the priority value sets the order in which pipelines run, from 1 to 1000; the smaller the number, the earlier the pipeline runs.
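Scrapy sorts the ITEM_PIPELINES dict by value and passes each item through every process_item in that order; a rough simulation of just the ordering step (this mimics, not calls, Scrapy's internals):

```python
# Same mapping as in settings.py
ITEM_PIPELINES = {
    'DangDangBook.pipelines.DangdangbookPipeline': 300,
    'DangDangBook.pipelines.DangDangImgPipeline': 301,
}

# Lower number = earlier in the chain, matching how Scrapy orders components
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print([p.rsplit('.', 1)[-1] for p in order])
# ['DangdangbookPipeline', 'DangDangImgPipeline']
```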
Original: https://blog.csdn.net/qq_50300933/article/details/123038729
Author: 归琳
Title: scrapy入门笔记(2)–当当热销月榜数据