settings.py
添加如下代码指定输出字段顺序
FEED_EXPORT_FIELDS = ['code', 'name', 'new', 'rise_fall', 'price_limit', 'harvest', 'opening', 'high', 'low', 'volume', 'turnover', 'ratio', 'rate', 'capital', 'currency', 'company', 'trade', 'time', 'capitals', 'A_shares']
item.py
Define here the models for your scraped items
#
See documentation in:
https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
代码 名称 最新 涨跌 涨跌幅 前收 开盘 最高 最低 成交量 成交额 市盈率 换手率 总股本 流通股本
公司名称 所属行业 成立日期 总股本(亿) 流通A股(亿)
code, name, new, rise_fall, price_limit, harvest, opening, high, low, volume, turnover, ratio, rate, capital, currency
company, trade, time, capitals, A_shares
class ZhongcaiItem(scrapy.Item):
    # Container for one stock scraped from quote.cfi.cn.
    #
    # Quote-table columns (filled in parse):
    code = scrapy.Field()         # stock code
    name = scrapy.Field()         # stock name
    new = scrapy.Field()          # latest price
    rise_fall = scrapy.Field()    # price change
    price_limit = scrapy.Field()  # price change percentage
    harvest = scrapy.Field()      # previous close
    opening = scrapy.Field()      # opening price
    high = scrapy.Field()         # day high
    low = scrapy.Field()          # day low
    volume = scrapy.Field()       # trade volume
    turnover = scrapy.Field()     # turnover amount
    ratio = scrapy.Field()        # price/earnings ratio
    rate = scrapy.Field()         # turnover rate
    capital = scrapy.Field()      # total share capital
    currency = scrapy.Field()     # circulating share capital
    # Company-detail-page columns (filled in new_parse):
    company = scrapy.Field()      # company name
    trade = scrapy.Field()        # industry sector
    time = scrapy.Field()         # founding date
    capitals = scrapy.Field()     # total shares (in 100 millions)
    A_shares = scrapy.Field()     # circulating A shares (in 100 millions)
spider文件
-*- coding: utf-8 -*-
import scrapy
from ..items import ZhongcaiItem
code, name, new, rise_fall, price_limit,harvest, opening, high, low, volume, turnover, ratio, rate, capital, currency
company, trade, time, capitals, A_shares
class ZcspiderSpider(scrapy.Spider):
    """Crawl stock quotes from quote.cfi.cn.

    ``parse`` scrapes one item per row of the quote table, then follows
    each stock's detail link; ``new_parse`` adds the company-profile
    fields and yields the finished item.
    """
    name = 'ZCSpider'
    # NOTE(review): requests actually go to quote.cfi.cn, which is why
    # dont_filter=True is needed below (see "问题2" in the post); adding
    # 'quote.cfi.cn' here would be the cleaner fix — confirm.
    allowed_domains = ['data.cfi.cn']
    max_page = 2
    start_urls = ['https://quote.cfi.cn/quotelist.aspx?sectypeid=1&cfidata=1']
    # Fixed: the original URL contained '§ypeid=1' — the '&sect' of
    # '&sectypeid' had been mangled into the '§' character.
    base_url = ('https://quote.cfi.cn/quotelist.aspx'
                '?sortcol=stockcode&sortway=asc&pageindex={}&sectypeid=1')

    # Order of the <nobr> cells in each quote-table row.
    NOBR_FIELDS = ('new', 'rise_fall', 'price_limit', 'harvest', 'opening',
                   'high', 'low', 'volume', 'turnover', 'ratio', 'rate',
                   'capital', 'currency')
    # Order of the td[2] cells read from the detail page's vertical table.
    DETAIL_FIELDS = ('company', 'trade', 'time', 'capitals', 'A_shares')

    def start_requests(self):
        """Generate one listing-page request per page (1..max_page)."""
        for page in range(1, self.max_page + 1):
            yield scrapy.Request(self.base_url.format(page), callback=self.parse)

    def parse(self, response):
        """Extract quote-table rows and follow each stock's detail link."""
        for row in response.css('.table_data tr'):
            links = row.xpath('.//a/text()').extract()
            # Skip the header row (its first cell is the literal "代码")
            # and any malformed row lacking the code/name link cells —
            # the original code crashed with IndexError on such rows.
            if len(links) < 2 or '代码' in links[0]:
                continue
            item = ZhongcaiItem()
            item['code'], item['name'] = links[0], links[1]
            # Extract all <nobr> cells once instead of re-running the
            # same XPath 13 times per row.
            cells = row.xpath('.//nobr/text()').extract()
            for field, value in zip(self.NOBR_FIELDS, cells):
                item[field] = value
            href = row.xpath('.//a/@href').extract_first()
            # dont_filter=True: the detail pages live on quote.cfi.cn,
            # which the offsite middleware would otherwise filter out.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.new_parse,
                                 meta={'item': item},
                                 dont_filter=True)

    def new_parse(self, response):
        """Fill in the company-profile fields and yield the item."""
        item = response.meta['item']
        # One extraction for all rows; zip tolerates tables shorter than
        # expected instead of raising IndexError.
        values = (response.css('.vertical_table')
                          .xpath('.//tr/td[2]/text()').extract())
        for field, value in zip(self.DETAIL_FIELDS, values):
            item[field] = value
        yield item
代码编写完成之后在cmd或pycharm终端使用命令,scrapy crawl ZCSpider -o data.csv
就可以将爬取下来的数据按照指定的格式顺序保存为csv文件
写在最后
爬取中财网的过程中遇到的问题
问题1:
在爬取中财网的过程中,解析 robots.txt 时出现了编码错误(UnicodeDecodeError),日志如下
2022-01-07 11:47:22 [scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file. Traceback (most recent call last): File “C:\Users\admin\AppData\Roaming\Python\Python38\site-packages\twisted\inter net\defer.py”, line 1661, in _inlineCallbacks result = current_context.run(gen.send, result) StopIteration:
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File “C:\Users\admin\AppData\Roaming\Python\Python38\site-packages\scrapy\robots txt.py”, line 16, in decode_robotstxt robotstxt_body = robotstxt_body.decode(‘utf-8’) UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xd5 in position 248: invalid continuation byte
解决:
在settings.py将robots协议改成False,如下
ROBOTSTXT_OBEY = False
问题2:
2022-01-07 10:17:52 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite req uest to ‘quote.cfi.cn’:
解决:
在yield scrapy.Request() 里添加 dont_filter=True即可
Original: https://blog.csdn.net/weixin_45971950/article/details/122361122
Author: bug智造
Title: scrapy爬虫练习-中财网股票数据爬取
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/789732/
转载文章受原作者版权保护。转载请注明原作者出处!