Simple use of Scrapy crawlers, and executing cmd commands from Python

1. Installation

pip install scrapy

2. Basic Scrapy usage and architecture

1. Creating and running a project

  1. Create a project
aaa@localhost pyspace % scrapy startproject demo1
New Scrapy project 'demo1', using template directory '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/scrapy/templates/project', created in:
    /Users/aaa/app/pyspace/demo1

You can start your first spider with:
    cd demo1
    scrapy genspider example example.com
  2. Project layout


1.spiders: this folder holds the spider files. We will add our own spider .py files under spiders (created below); this is where the core crawling logic is implemented
2.items.py: defines the data structures (items)
3.middlewares.py: middlewares, e.g. proxies
4.pipelines.py: the pipeline file, used for post-processing the downloaded data. It contains a single class by default, but you can define several. The priority ranges from 1 to 1000, default 300, and a lower value means a higher priority (see the settings sketch after this list)
5.settings.py: configuration, e.g. whether to obey the robots protocol, the User-Agent definition, and so on
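
For instance, pipeline priorities are declared in settings.py roughly like this (a minimal sketch; the second pipeline name is hypothetical, and the real configuration used in this article appears later):

ITEM_PIPELINES = {
    # lower number = higher priority, allowed range 1-1000
    'demo1.pipelines.Demo1Pipeline': 300,
    # a hypothetical extra pipeline that would run after the first one
    'demo1.pipelines.AnotherPipeline': 400,
}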

  3. Create a spider file

Go into the project folder and create a spider:

aaa@localhost pyspace % cd demo1
aaa@localhost demo1 % ls
demo1       scrapy.cfg
aaa@localhost demo1 % scrapy genspider baidu baidu.com
Created spider 'baidu' using template 'basic' in module:
  demo1.spiders.baidu

Running the command above creates a new baidu.py under the project's spiders directory. The modified content is as follows:

import scrapy

class BaiduSpider(scrapy.Spider):
    # name of the spider
    name = 'baidu'
    # allowed domains (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # start URLs, i.e. the URLs requested first
    start_urls = ['http://baidu.com/']

    # callback for start_urls; the response argument is the returned response object,
    # roughly equivalent to response = urllib.request.urlopen(url)
    def parse(self, response):
        print("======")
        pass

  4. Run the spider above

Syntax:

scrapy crawl <spider name>

Note: the robots protocol is involved here. It can be understood as a specification of which pages may and may not be crawled; you can see Baidu's rules at

https://www.baidu.com/robots.txt
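
A robots.txt file generally follows a simple format along these lines (an illustrative sketch of the format only, not Baidu's actual file):

User-agent: *
Disallow: /private/
Allow: /public/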

1》 To stop obeying the robots protocol, modify settings.py

ROBOTSTXT_OBEY = True

Change this setting to False, or simply comment it out.

2》 Run the baidu spider

aaa@localhost demo1 % pwd
/Users/aaa/app/pyspace/demo1
aaa@localhost demo1 % scrapy crawl baidu

3》 In the result you can see the message printed by the spider

  5. Modify the code to locate the Baidu search button element
import scrapy

class BaiduSpider(scrapy.Spider):
    # name of the spider
    name = 'baidu'
    # allowed domains (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # start URLs, i.e. the URLs requested first
    start_urls = ['http://baidu.com/']

    # callback for start_urls; the response argument is the returned response object,
    # roughly equivalent to response = urllib.request.urlopen(url)
    def parse(self, response):
        # print("======")
        # the response body as a string
        # print(response.text)
        print("******")
        # the response body as raw bytes
        # print(response.body)

        # response.xpath parses the response content with an XPath expression.
        # It returns a scrapy.selector.unified.SelectorList
        subList = response.xpath('//*[@id="su"]')
        print(subList)
        print(subList.__class__)
        # Index into it to get the first element, or call extract_first() directly.
        # extract() and extract_first() return the extracted data of the element.
        print(subList[0].extract())
        print(subList[0].extract().__class__)
        print(subList.extract_first())
        print(subList.extract_first().__class__)
        # .get() is equivalent to .extract_first()
        # print(subList.get())
        # For example, read the button's value attribute directly:
        # print(response.xpath('//*[@id="su"]/@value').extract_first())

Of course, you can also use a CSS selector (or bs4):

subList = response.css('#su')
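
With CSS selectors an attribute can also be read directly, for example (a small sketch, equivalent to the commented-out XPath version above):

# read the button's value attribute via a CSS ::attr() selector
print(response.css('#su::attr(value)').extract_first())
# .get() works the same way
print(response.css('#su::attr(value)').get())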

1》 Rerun the spider

scrapy crawl baidu

2》 Result

******
[]

百度搜索 (Baidu Search)

2. Architecture and basic principles

1. Architecture

1.Engine: runs automatically and needs no attention; it organizes all request objects and dispatches them to the downloader
2.Downloader: gets request objects from the engine and fetches the data
3.Spiders: define the crawling actions and the sites to crawl
4.Scheduler: has its own scheduling rules
5.Pipelines: process the Items in a given order. Think of it as post-processing of the data; saving to a database or to files is usually done in a pipeline

2. How it works

(Scrapy architecture diagram)

3. Scrapy example

1. Crawling the dushu.com book site

We will crawl the book information in the prose/essay category on dushu.com. The starting address is:

https://www.dushu.com/book/1163_1.html

Here we use CrawlSpider to define rules that extract the links on a page matching those rules and then continue crawling them. The link-extraction rules are as follows:

allow=() regular expressions; extract the links that match
deny=() regular expressions; reject the links that match
allow_domains=() allowed domains
deny_domains=() denied domains
restrict_xpaths=() extract links matching the given XPath rules
restrict_css=() extract links matching the given CSS rules
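
These parameters belong to scrapy.linkextractors.LinkExtractor. As a rough illustration (a standalone sketch; the rule actually used by the project is shown further below), a link extractor can be tried on its own against a response:

from scrapy.linkextractors import LinkExtractor

# extract pagination links of the form /book/1163_<n>.html from a response
extractor = LinkExtractor(allow=r'/book/1163_\d+\.html', allow_domains=('www.dushu.com',))
links = extractor.extract_links(response)   # a list of scrapy.link.Link objects
for link in links:
    print(link.url, link.text)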

2. Create and run the project

  1. Create the project
scrapy startproject dushu
cd dushu
scrapy genspider -t crawl read_dushu www.dushu.com
# list the existing spider names
aaa@localhost dushu % scrapy list
read_dushu
  2. Modify the code

1》 Modify read_dushu.py; for the item data structure we simply use a plain dict:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow indicates whether the rule keeps being applied to pages crawled later on.
    False: apply the rule only to the current page
    True: pages crawled later keep applying the rule, so the number of crawled pages grows
          (as later pages are visited more page numbers appear; the first page only shows up to page 13, the rest are shown as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # here the item data is stored in a plain dict
        item = {}
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original: the images are lazy-loaded, so the real URL is in data-original rather than in the src attribute
            item['src'] = div.xpath('./div/a/img/@data-original').extract_first()
            item['name'] = div.xpath('./div/a/img/@alt').extract_first()
            item['author'] = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            yield item
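
Before wiring XPath expressions into the spider, it can help to verify them interactively with scrapy shell (a sketch of such a session; output omitted):

scrapy shell 'https://www.dushu.com/book/1163_1.html'
>>> response.xpath('//div[@class="bookslist"]/ul/li/div')
>>> response.xpath('//div[@class="bookslist"]/ul/li/div/div/a/img/@data-original').extract_first()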

2》 Modify pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class DushuPipeline:
    '''
    open_spider / close_spider are each called only once; they are typically used to open and close resources
    '''

    def open_spider(self,spider):
        self.fp = open('dushu.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self,spider):
        self.fp.close()
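
Note that str(item) does not produce valid JSON. A variant of the same pipeline (a sketch with a hypothetical class name, not what this article uses) could serialize each item properly with ItemAdapter and json:

import json
from itemadapter import ItemAdapter

class DushuJsonPipeline:
    def open_spider(self, spider):
        self.fp = open('dushu.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one JSON object per line (JSON Lines), keeping non-ASCII characters readable
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()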

3》 Modify settings.py: stop obeying the robots protocol and enable the pipeline

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
}
  3. Run the project
scrapy crawl read_dushu
  4. Modify the item code: use the Item data structure instead of a plain dict

1》 Modify items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()

2》 Modify spiders/read_dushu.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem

class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow indicates whether the rule keeps being applied to pages crawled later on.
    False: apply the rule only to the current page
    True: pages crawled later keep applying the rule, so the number of crawled pages grows
          (as later pages are visited more page numbers appear; the first page only shows up to page 13, the rest are shown as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # the scraped data is now wrapped in a DushuItem instead of a plain dict
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original: the images are lazy-loaded, so the real URL is in data-original rather than in the src attribute
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            yield DushuItem(src=src, name=name, author=author)

3. Extend the project further: also crawl the price from the book detail page

The goal is to also scrape the price that is shown after clicking into a book on dushu.com.

  1. Modify items.py to add a price field
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
  2. Modify spiders/read_dushu.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem

class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow indicates whether the rule keeps being applied to pages crawled later on.
    False: apply the rule only to the current page
    True: pages crawled later keep applying the rule, so the number of crawled pages grows
          (as later pages are visited more page numbers appear; the first page only shows up to page 13, the rest are shown as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # parse the list page, then request each book's detail page, passing the data on via meta
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original: the images are lazy-loaded, so the real URL is in data-original rather than in the src attribute
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            url = div.xpath('./div/a/@href').extract_first()
            url = "https://www.dushu.com" + url
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name, 'src': src, 'author': author})

    def parse_second(self, response):
        price = response.xpath('//div[@class="book-details"]//span/text()').get()
        name = response.meta['name']
        src = response.meta['src']
        author = response.meta['author']
        yield DushuItem(src=src, name=name, author=author, price=price)
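
Passing data through meta works; since Scrapy 1.7 the documentation recommends cb_kwargs for this purpose. A sketch of the same hand-off using cb_kwargs instead of meta:

    def parse_item(self, response):
        ...
        # instead of meta, pass the values as cb_kwargs
        yield scrapy.Request(url=url, callback=self.parse_second,
                             cb_kwargs={'name': name, 'src': src, 'author': author})

    # the extra keyword arguments arrive directly as parameters of the callback
    def parse_second(self, response, name, src, author):
        price = response.xpath('//div[@class="book-details"]//span/text()').get()
        yield DushuItem(src=src, name=name, author=author, price=price)
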
  3. Test run

4. Extend further: add a pipeline that downloads the images

  1. Install pillow (required)
pip install pillow
  2. Modify pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class DushuPipeline:
    '''
    open_spider / close_spider are each called only once; they are typically used to open and close resources
    '''

    def open_spider(self,spider):
        self.fp = open('dushu.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self,spider):
        self.fp.close()

class ImgsPipLine(ImagesPipeline):

    # issue one image download request per item; pass the item along via meta
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # just return the image file name; the storage directory is configured in the global settings file (IMAGES_STORE)
    def file_path(self, request, response=None, info=None):
        print("******")
        item = request.meta['item']
        filePath = item['name']
        return filePath

    # called once all image requests for the item have finished
    def item_completed(self, results, item, info):
        return item
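
The file name returned above has no extension; a small variation of file_path (a sketch) could append one so the files land on disk as .jpg:

    def file_path(self, request, response=None, info=None):
        # use the book name from the request meta and add a .jpg extension
        item = request.meta['item']
        return item['name'] + '.jpg'
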
  3. Modify settings.py to add the related configuration
LOG_LEVEL = "WARNING"
IMAGES_STORE = './result'   # directory where the downloaded images are stored
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  # these two headers are required; without the referer, image requests fail with a 403 error
  'referer': 'https://www.dushu.com/book/1163_11.html',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36'
}

ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
   'dushu.pipelines.ImgsPipLine': 301,
}
  4. Test

After running, a result directory is created in the project root and the corresponding jpg images are downloaded into it.

5. Changing the Scrapy log level

Modify settings.py:

LOG_LEVEL = "WARNING"

6. Writing a main script to start the Scrapy program

  1. Method one
from scrapy.cmdline import execute
import sys
import os

# make sure the project directory is on sys.path, then run `scrapy crawl read_dushu`
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "read_dushu"])
  2. Method two
import os

# os.system cannot capture the console output; it simply runs the cmd command
# and returns the command's exit status (0 means success)
# retValue = os.system("ipconfig")
# print(retValue)

# os.popen can capture the console output; it returns a file-like object
# 'r' opens the stream for reading (the default mode)
# retValue = os.popen('ipconfig', 'r')
# res = retValue.read()
# for line in res.splitlines():
#     print(line)
# retValue.close()

# run a scrapy command
retValue = os.popen('scrapy list', 'r')
res = retValue.read()
for line in res.splitlines():
    print(line)
retValue.close()
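
os.system and os.popen work, but the subprocess module is the more modern way to run a command and capture its output (a sketch, mirroring the scrapy list example above):

import subprocess

# run `scrapy list` and capture its output as text
result = subprocess.run(['scrapy', 'list'], capture_output=True, text=True)
print('exit code:', result.returncode)
for line in result.stdout.splitlines():
    print(line)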

References:

https://docs.scrapy.org/en/latest/topics/commands.html

https://docs.scrapy.org/en/latest/topics/architecture.html

Original: https://www.cnblogs.com/qlqwjy/p/16523021.html
Author: QiaoZhi
