Scraping Autohome Car Images
- Goal: scrape the photos of a given car series from Autohome (汽车之家).
1. Plain Scrapy
Step 1: Page analysis
- Target URLs:
  - https://car.autohome.com.cn/photolist/series/265/p1/ (page 1)
  - https://car.autohome.com.cn/photolist/series/265/p2/ (page 2)
  - https://car.autohome.com.cn/photolist/series/265/p3/ (page 3)
- Clearly, 265 is the code for this car series, and p1/p2/p3 encode the page number.
- Image URLs:
  - Large image: https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/800x0_1_q95_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg
  - Thumbnail: https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/240x180_0_q95_c42_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg
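The two URLs above differ only in the size/quality segment, so the large image can be derived from the thumbnail with a single string replacement (the hash is the sample one from the page):

```python
# Derive the large-image URL from the thumbnail URL by swapping the
# size/quality segment of the path.
small = ('https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/'
         '240x180_0_q95_c42_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg')
large = small.replace('240x180_0_q95_c42', '800x0_1_q95')
print(large)
```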
Step 2: Implementation
1. Create the Scrapy project:
   scrapy startproject lsls
2. Generate the spider (the spider name should not be the same as the project name):
   scrapy genspider hy car.autohome.com.cn
3. Implement the logic as follows.

(1) Project setup
Run in the terminal:

scrapy startproject lsls
scrapy genspider hy car.autohome.com.cn

- Create a start.py file in the same directory as scrapy.cfg; to run the whole crawl, just run this file:

```python
from scrapy import cmdline

cmdline.execute('scrapy crawl hy'.split())
# Equivalent list form (use one or the other, not both):
# cmdline.execute(['scrapy', 'crawl', 'hy'])
```
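The string form works because str.split() produces exactly the argument list that the explicit list form spells out:

```python
# The string form and the list form pass identical arguments to Scrapy.
args = 'scrapy crawl hy'.split()
print(args)  # → ['scrapy', 'crawl', 'hy']
```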
(2) settings.py
- Standard settings:

```python
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'lsls.pipelines.LslsPipeline': 300,
}

# Enable the custom downloader middleware that sets a random User-Agent
DOWNLOADER_MIDDLEWARES = {
    # 'lsls.middlewares.LslsDownloaderMiddleware': 543,
    'lsls.middlewares.UserAgentDownloaderMiddleware': 543,
}
```
(3) hy.py

```python
import scrapy

from lsls.items import LslsItem


class HySpider(scrapy.Spider):
    name = 'hy'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/265/p1/']
    n = 1  # current page number

    def parse(self, response):
        img_list = response.xpath('//ul[@id="imgList"]/li')
        for img in img_list:
            src = img.xpath('./a/img/@src').get()
            # Lazily loaded images keep the real URL in @src2
            if src[-1] != 'g':
                src = img.xpath('./a/img/@src2').get()
            # Build the full URL and swap the thumbnail segment for the large size
            url = 'https:' + src.replace('240x180_0_q95_c42', '800x0_1_q95')
            title = img.xpath('./div/a/text()').get()
            yield LslsItem(title=title, url=url)
        # Pagination
        next_btn = response.xpath('//div[@class="page"]/a[@class="page-item-next"]')
        if next_btn:
            self.n += 1
            print(f'Scraping page {self.n}')
            url = f'https://car.autohome.com.cn/photolist/series/265/p{self.n}/'
            yield scrapy.Request(url=url)
        else:
            print('All pages scraped')
```
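The lazy-load fallback and size swap inside parse() can be exercised in isolation; build_large_url and its arguments are illustrative names introduced here, not part of the spider:

```python
# Sketch of the URL-building logic in parse(): fall back to the lazy-load
# attribute when @src is not a real image URL, then prepend the scheme and
# request the large size. Names are illustrative, not from the original code.
def build_large_url(src, src2):
    if src is None or not src.endswith('g'):
        src = src2
    return 'https:' + src.replace('240x180_0_q95_c42', '800x0_1_q95')

thumb = ('//car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/'
         '240x180_0_q95_c42_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg')
print(build_large_url(thumb, None))
```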
(4) items.py

```python
import scrapy


class LslsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
```
(5) middlewares.py
- The generated boilerplate stays unchanged; add the following middleware:

```python
import random

from fake_useragent import UserAgent


class UserAgentDownloaderMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

    # Option 1: pick from the hard-coded list above
    # def process_request(self, request, spider):
    #     request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

    # Option 2: let fake_useragent generate a random User-Agent
    def process_request(self, request, spider):
        request.headers['User-Agent'] = UserAgent().random
```
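If fake_useragent is unavailable, option 1 with random.choice is a dependency-free alternative; the standalone sketch below uses a shortened User-Agent list for illustration:

```python
import random

# Dependency-free random User-Agent selection (option 1 above),
# shown with a shortened list for illustration.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
]
ua = random.choice(USER_AGENTS)
print(ua)
```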
(6) pipelines.py

```python
import os
import urllib.request


class LslsPipeline:
    def open_spider(self, spider):
        # Per-title counter so repeated titles get distinct filenames
        self.title_list = {}

    def process_item(self, item, spider):
        url = item['url']  # the spider already prepends 'https:'
        title = item['title']
        if title in self.title_list:
            self.title_list[title] += 1
        else:
            self.title_list[title] = 1
        path = r'D:\python_lec\全栈开发\爬虫项目\爬虫小练习\qczj\图片下载'
        filename = os.path.join(path, f'{title} {self.title_list[title]}.jpg')
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
```
- The saved images are the 800-pixel-wide large versions.
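The per-title counter in the pipeline exists so that several photos of the same model get distinct filenames; its behavior can be sketched on its own:

```python
# Sketch of the pipeline's per-title counter: repeated titles get an
# incrementing suffix so earlier downloads are not overwritten.
counts = {}
filenames = []
for title in ['A4L', 'A4L', 'A6L']:
    counts[title] = counts.get(title, 0) + 1
    filenames.append(f'{title} {counts[title]}.jpg')
print(filenames)  # → ['A4L 1.jpg', 'A4L 2.jpg', 'A6L 1.jpg']
```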
2. CrawlSpider
- Pagination becomes much simpler.

(1) Project setup

scrapy startproject qczj
cd qczj
scrapy genspider -t crawl lsls car.autohome.com.cn

(The spider name should not be the same as the project name; the -t crawl flag generates the CrawlSpider template.)
(2) lsls.py

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from qczj.items import QczjItem


class LslsSpider(CrawlSpider):
    name = 'lsls'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/265/p1/']

    rules = (
        # List pages: p\d+ matches any page number
        Rule(LinkExtractor(allow=r'https://car.autohome.com.cn/photolist/series/265/p\d+/'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'https://car.autohome.com.cn/photo/series/31145/\d+/\d+\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = QczjItem()
        item['img'] = response.xpath('//*[@id="img"]/@src').get()
        item['name'] = response.xpath('//*[@id="czts"]/div/div/p[1]/a/text()').get()
        return item
```
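One regex pitfall worth checking when writing the list-page rule: inside a character class, r'p[1-17]/' matches only the single characters '1' and '7' (not the numbers 1 through 17), so p\d+ is the safe way to match any page number:

```python
import re

# [1-17] is a character class containing only '1' and '7', so page 2
# would never match; \d+ matches any page number.
bad = r'p[1-17]/'
good = r'p\d+/'
print(re.search(bad, '/photolist/series/265/p2/'))         # → None
print(bool(re.search(good, '/photolist/series/265/p2/')))  # → True
```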
(3) pipelines.py

```python
import os
import urllib.request


class QczjPipeline:
    def open_spider(self, spider):
        # Per-name counter so repeated names get distinct filenames
        self.title_list = {}

    def process_item(self, item, spider):
        url = 'https:' + item['img']  # @src comes without a scheme
        name = item['name']
        if name in self.title_list:
            self.title_list[name] += 1
        else:
            self.title_list[name] = 1
        path = r'D:\python_lec\全栈开发\爬虫项目\爬虫小练习\qczj\图片下载'
        filename = os.path.join(path, f'{name} {self.title_list[name]}.jpg')
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
```
Original: https://blog.csdn.net/weixin_43761516/article/details/117636488
Author: 洋芋本人
Title: Scraping Autohome Car Images – scrapy – crawlspider – Python crawler example