Python crawler framework scrapy (2): hands-on practice

Hands-on exercise: crawl all the news updates from the training site "http://www.tipdm.org".

(This is the spider project folder we created.) After opening it, you can see that the scrapy framework's components are already in place; all that's left is to write the spider code.

(This is the page being scraped: the original site, where we locate the tag positions.)
Now let's write the spider files.
(1) pachong.py (the spider file)
import scrapy
from zwhSpider.items import ZwhspiderItem

class PachongSpider(scrapy.Spider):

    name = 'pachong'

    start_urls = ['https://www.tipdm.org/bdrace/notices/']

    url = 'https://www.tipdm.org/bdrace/notices/index_%d.html'
    page_sum = 2

    def parse(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div/div/div[2]/div[2]/ul/li')
        for div in div_list:
            tit = div.xpath('./div[1]/a/text()')[0].extract()
            time = div.xpath('./div[2]/span[1]/text()').extract()
            contnet = div.xpath('./div[3]//text()').extract()
            contnet = ''.join(contnet)

            item = ZwhspiderItem()
            item['tit'] = tit
            item['time'] = time
            item['contnet'] = contnet

            yield item

        # Follow the listing pages 2 through 5 via the index_%d.html pattern.
        if self.page_sum <= 5:
            new_url = self.url % self.page_sum
            self.page_sum += 1
            yield scrapy.Request(url=new_url, callback=self.parse)
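The page-following logic above boils down to filling the %d placeholder in the URL template with the next page number. A minimal standalone sketch of that part (plain Python, no scrapy needed):

```python
# URL template from the spider; %d is replaced by the listing page number.
url_template = 'https://www.tipdm.org/bdrace/notices/index_%d.html'

def page_urls(first=2, last=5):
    # Build the follow-up page URLs the spider would request (pages 2..5).
    return [url_template % n for n in range(first, last + 1)]

for u in page_urls():
    print(u)   # index_2.html through index_5.html
```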

(2) items.py (the item definitions)


import scrapy

class ZwhspiderItem(scrapy.Item):

    tit = scrapy.Field()
    time = scrapy.Field()
    contnet = scrapy.Field()
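A scrapy Item behaves like a dict that only accepts the fields declared with Field(). A rough stand-in illustrating that behaviour (a hypothetical class, no scrapy import required):

```python
class ItemSketch(dict):
    # Mimics scrapy.Item's key restriction: only declared fields may be set.
    fields = {'tit', 'time', 'contnet'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super().__setitem__(key, value)

item = ItemSketch()
item['tit'] = 'a news title'   # allowed: 'tit' is declared
print(item['tit'])
```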

(3) middlewares.py (note that settings.py does not actually enable these middlewares)


from scrapy import signals

from itemadapter import is_item, ItemAdapter

class ZwhspiderSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):

        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):

        return None

    def process_spider_output(self, response, result, spider):

        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):

        pass

    def process_start_requests(self, start_requests, spider):

        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class ZwhspiderDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):

        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):

        return None

    def process_response(self, request, response, spider):

        return response

    def process_exception(self, request, exception, spider):

        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
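Since the scaffolded middleware classes above are disabled, none of this code runs unless the corresponding entries in settings.py are uncommented. The class paths match the generated project; the order numbers (lower runs closer to the engine) are scrapy's defaults:

```python
# settings.py: enable the generated middlewares.
SPIDER_MIDDLEWARES = {
    'zwhSpider.middlewares.ZwhspiderSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'zwhSpider.middlewares.ZwhspiderDownloaderMiddleware': 543,
}
```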

(4) pipelines.py (the item pipeline)


from itemadapter import ItemAdapter

class ZwhspiderPipeline(object):
    fp = None

    def open_spider(self, spider):
        print("Starting spider...")
        self.fp = open('./新闻动态.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):

        tit = item['tit']
        time = item['time']
        contnet = item['contnet']

        self.fp.write(str(tit) + ':' + str(time) + ':' + str(contnet) + '\n\n')

        return item

    def close_spider(self, spider):
        print("Spider finished!")
        self.fp.close()
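To see what one pipeline write produces without running scrapy at all, the same line format can be exercised against a plain dict and an in-memory buffer (the item values below are made-up samples, not real scraped data):

```python
import io

# A fake item standing in for what the spider yields.
item = {'tit': 'Example title', 'time': ['2021-01-20'], 'contnet': 'Example body text'}

# Same formatting as ZwhspiderPipeline.process_item, written to a StringIO
# instead of the on-disk 新闻动态.txt file.
fp = io.StringIO()
fp.write(str(item['tit']) + ':' + str(item['time']) + ':' + str(item['contnet']) + '\n\n')

print(fp.getvalue())
```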

(5) settings.py (the project settings)


BOT_NAME = 'zwhSpider'

SPIDER_MODULES = ['zwhSpider.spiders']
NEWSPIDER_MODULE = 'zwhSpider.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'

ROBOTSTXT_OBEY = False
LOG_LEVEL='ERROR'

ITEM_PIPELINES = {
    'zwhSpider.pipelines.ZwhspiderPipeline': 300,

}

Start the spider with: scrapy crawl pachong

Program output:

The crawl succeeded; the scraped content was saved to the local file '新闻动态.txt'.
It also shows up inside the scrapy project, so let's first open the file from within the project to check whether the crawl worked.

As you can see, we successfully scraped the content we wanted. Next, find the file in the local folder;

opening it shows the content we were after.

(This article is only a simple exercise in web scraping and is not directed at any website.)

Original: https://blog.csdn.net/qq_45976312/article/details/113101545
Author: ruowenz
Title: Python crawler framework scrapy (2): hands-on practice

Original articles are protected by copyright. Please credit the source when reprinting: https://www.johngo689.com/790617/

