Hands-on exercise (Exercise 1: crawl all the news updates from http://www.tipdm.org)
(This is the crawler project folder we created.) This is what it looks like when opened: the components of the Scrapy framework are already in place, so all we need to do is write the code and start crawling.
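For reference, a project skeleton like this is generated by Scrapy's own command-line tool. The original post doesn't show the exact commands used, but based on the project name (zwhSpider) and spider name (pachong) that appear in the code below, they would look something like:

scrapy startproject zwhSpider
cd zwhSpider
scrapy genspider pachong www.tipdm.org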
(This is the page being crawled): the original site (here we are locating the target tags in the page source).
Now let's write the spider files.
(1) pachong.py (the spider file)
import scrapy
from zwhSpider.items import ZwhspiderItem

class PachongSpider(scrapy.Spider):
    name = 'pachong'
    # first page of the news list
    start_urls = ['https://www.tipdm.org/bdrace/notices/']
    # URL template for the following pages
    url = 'https://www.tipdm.org/bdrace/notices/index_%d.html'
    page_sum = 2

    def parse(self, response):
        # each <li> in this list is one news entry
        li_list = response.xpath('/html/body/div/div[3]/div/div/div[2]/div[2]/ul/li')
        for li in li_list:
            tit = li.xpath('./div[1]/a/text()')[0].extract()
            time = li.xpath('./div[2]/span[1]/text()').extract()
            # the summary is spread over several text nodes, so join them
            content = ''.join(li.xpath('./div[3]//text()').extract())
            item = ZwhspiderItem()
            item['tit'] = tit
            item['time'] = time
            item['content'] = content
            yield item
        # paginate: the comparison operator was lost in the original post
        # ('page_sum5'); '<= 5' is a plausible reconstruction
        if self.page_sum <= 5:
            new_url = self.url % self.page_sum
            self.page_sum += 1
            yield scrapy.Request(url=new_url, callback=self.parse)
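Before hard-coding XPath expressions like the ones above, it is worth checking them interactively. scrapy shell (a standard part of Scrapy) lets you try selectors against the live page; a quick session might look like this:

scrapy shell "https://www.tipdm.org/bdrace/notices/"
>>> li_list = response.xpath('/html/body/div/div[3]/div/div/div[2]/div[2]/ul/li')
>>> li_list[0].xpath('./div[1]/a/text()').extract_first()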
(2) items.py file
import scrapy

class ZwhspiderItem(scrapy.Item):
    # one field per value the spider extracts
    tit = scrapy.Field()      # news title
    time = scrapy.Field()     # publication date
    content = scrapy.Field()  # summary text (spelled 'contnet' in the original post)
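A scrapy.Item behaves like a dictionary restricted to its declared fields: values are read and written with bracket syntax, and assigning to an undeclared key raises a KeyError. A minimal illustration (the values are made up):

item = ZwhspiderItem()
item['tit'] = 'Example title'   # fine: 'tit' is a declared Field
# item['author'] = 'x'          # would raise KeyError: no such declared field
print(dict(item))               # {'tit': 'Example title'}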
(3) middlewares.py file (the middlewares are not enabled in settings.py at this point)
from scrapy import signals
from itemadapter import is_item, ItemAdapter


class ZwhspiderSpiderMiddleware:
    # unchanged template code generated by `scrapy startproject`

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ZwhspiderDownloaderMiddleware:
    # unchanged template code generated by `scrapy startproject`

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
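Since neither class is enabled in settings.py, these defaults never run. If you did want per-request behavior, such as rotating the User-Agent header, a downloader middleware is the usual place for it. A minimal sketch, not part of the original project (the class name and the UA strings are made up for illustration):

import random

class RandomUserAgentMiddleware:
    # hypothetical example, not in the original project
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # let Scrapy continue processing the request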
(4) pipelines.py file (the item pipeline)
from itemadapter import ItemAdapter


class ZwhspiderPipeline(object):
    fp = None  # file handle, opened when the spider starts

    def open_spider(self, spider):
        print("Starting the spider...")
        # '新闻动态' means 'news updates'; the original output filename is kept
        self.fp = open('./新闻动态.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        tit = item['tit']
        time = item['time']
        content = item['content']
        self.fp.write(str(tit) + ':' + str(time) + ':' + str(content) + '\n\n')
        return item

    def close_spider(self, spider):
        print("Spider finished!")
        self.fp.close()
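The pipeline above writes a plain text file. If structured output were preferred, a JSON-lines variant would be a small change; the following is a sketch under that assumption (the class and output file names are made up), not part of the original project:

import json
from itemadapter import ItemAdapter

class JsonLinesPipeline:
    # hypothetical alternative to ZwhspiderPipeline
    def open_spider(self, spider):
        self.fp = open('./news.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter gives dict-style access regardless of the item type
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()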
(5) settings.py file
BOT_NAME = 'zwhSpider'
SPIDER_MODULES = ['zwhSpider.spiders']
NEWSPIDER_MODULE = 'zwhSpider.spiders'
# pretend to be a regular browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
# ignore robots.txt for this exercise
ROBOTSTXT_OBEY = False
# only show errors, so the item output stays readable
LOG_LEVEL = 'ERROR'
# enable our pipeline; the number is its priority among pipelines
ITEM_PIPELINES = {
    'zwhSpider.pipelines.ZwhspiderPipeline': 300,
}
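If the hypothetical RandomUserAgentMiddleware sketched in step (3) were actually used, it would also need to be registered here, in the same style as ITEM_PIPELINES (543 is the priority Scrapy's generated settings template suggests for the project's downloader middleware):

DOWNLOADER_MIDDLEWARES = {
    'zwhSpider.middlewares.RandomUserAgentMiddleware': 543,
}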
Start the spider with: scrapy crawl pachong
Program output: this shows the crawl succeeded; I saved the scraped content to the local file '新闻动态.txt'.
The result also shows up inside the Scrapy project: open the project folder there first to check whether the crawl succeeded. As you can see, we got exactly the content we wanted. The same file then appears in the local directory, and opening it shows the news we were after.
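Incidentally, for simple jobs like this Scrapy can also write the items out by itself through its feed exports, with no custom pipeline at all:

scrapy crawl pachong -o news.json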
(This article is only meant as a simple crawling exercise and is not directed at any particular website.)
Original: https://blog.csdn.net/qq_45976312/article/details/113101545
Author: ruowenz
Title: Python crawler framework Scrapy (2): hands-on practice