Scraping Nationwide COVID-19 Risk Levels by Region with Python

Requirements

Use a crawler to fetch each region's COVID-19 risk level from the official government site and save it to a spreadsheet, ending up like this:


Data source

http://bmfw.www.gov.cn/yqfxdjcx/risk.html

Analyzing the page

  1. Page structure
    Top: an "as of" timestamp and three buttons: High / Medium / Low. Clicking one switches the information shown in the middle.
    Middle: the risk-area information
    Bottom: pagination
  2. Determine the request method
    Since there are pagination buttons, the data is presumably updated via AJAX.
    Open DevTools (F12), click a pagination button, and watch the Network tab for the request.
  3. Analyze the request
url=http://bmfw.www.gov.cn/bjww/interface/interfaceJson

headers={
    Accept: application/json, text/plain, */*
    Accept-Encoding: gzip, deflate
    Accept-Language: zh-CN,zh;q=0.9
    Connection: keep-alive
    Content-Length: 235
    Content-Type: application/json;charset=UTF-8
    Cookie: wdcid=57661336733ee69d; _gscu_1088464070=62382713kmnu2p11; __auc=5e75e3b61830dbb29f71ed50e8e; wdses=7e7e0e45f5b9f4e6; _gscbrs_1088464070=1; __asc=578b530b18312ed51c75133a6b5; acw_tc=2760823f16624698875982607ee72639a46b54881093f71cc7ab12b65f0a17; wdlast=1662469913; _gscs_1088464070=62469886ezpysq11|pv:2; SERVERID=edf8bc70025336506334b22603ae1cc6|1662469904|1662469877
    Host: bmfw.www.gov.cn
    Origin: http://bmfw.www.gov.cn
    Referer: http://bmfw.www.gov.cn/yqfxdjcx/risk.html
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
    x-wif-nonce: QkjjtiLM2dCratiA
    x-wif-paasid: smt-application
    x-wif-signature: B80277094A20F7C04C735C8413BE2B014332114512BFB51BADD30E21D5C368D9
    x-wif-timestamp: 1662469914
}

from_data={
    "key":"3C502C97ABDA40D0A60FBEE50FAAD1DA",
    "appId":"NcApplication",
    "paasHeader":"zdww",
    "timestampHeader":"1662469914",
    "nonceHeader":"123456789abcdefg",
    "signatureHeader":"B0BF67E09448D9A8A6C0538B259E715FD51CB51FCD6822E85000C2196354EB0B"
}

Most of these are routine request fields. The ones that need analysis are:
x-wif-nonce
x-wif-paasid
x-wif-signature
x-wif-timestamp
timestampHeader
signatureHeader
  4. Confirm the request header parameters
As the screenshot above shows, the server only validates these headers:
Accept
Content-Type
x-wif-nonce
x-wif-paasid
x-wif-signature
x-wif-timestamp
The request is a POST, and the request body is a string containing a JSON dictionary.
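Since the body must be a JSON string rather than form fields, it is easiest to build it with json.dumps. A minimal sketch using the captured values above (a real request would also need the headers listed earlier, and fresh timestampHeader/signatureHeader values per request):

```python
import json

# Captured form data from the analysis above; timestampHeader and
# signatureHeader must be regenerated for every request.
payload = {
    "key": "3C502C97ABDA40D0A60FBEE50FAAD1DA",
    "appId": "NcApplication",
    "paasHeader": "zdww",
    "timestampHeader": "1662469914",
    "nonceHeader": "123456789abcdefg",
    "signatureHeader": "B0BF67E09448D9A8A6C0538B259E715FD51CB51FCD6822E85000C2196354EB0B",
}
body = json.dumps(payload)  # serialize to the JSON string the server expects

# requests.post(url, data=body, headers=...) would then send this string verbatim
print(type(body).__name__)  # str
```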

  5. Analyze the JS
    5.1 Search for keywords such as x-wif-nonce
e is a timestamp
n is SHA-256(e + "fTN2pfuisxTavbTuYVSsNJHetwq5bJvCQkjjtiLM2dCratiA" + e)

To further verify that this is the standard SHA-256 algorithm, compare it against the string the page itself computes (the purple box in the screenshot):

So the page uses the standard SHA-256 algorithm.
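One quick way to confirm that Python's hashlib implements the same standard SHA-256 is to hash the well-known FIPS test vector "abc" and compare against its published digest:

```python
import hashlib

# Standard SHA-256 test vector: the message "abc"
digest = hashlib.sha256(b"abc").hexdigest()
print(digest)
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```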

    5.2 Search for timestampHeader

timestampHeader is a timestamp
signatureHeader is SHA-256(timestampHeader + '23y0ufFl5YxIyGrI8hWRUZmKkvtSjLQA' + '123456789abcdefg' + timestampHeader)

Writing the code

1. Crawler code

import csv
import hashlib
import json
import os
import time

import requests

def show_level_count(x_list):
    """Print and return the total number of communities in one risk list."""
    total = 0
    for area in x_list:
        total += len(area["communitys"])
    print(total)
    return total

def writer_to_csv(risk_txt):
    risk_json = json.loads(risk_txt)

    so_far_time = risk_json["data"]["end_update_time"]

    highlist = risk_json["data"]["highlist"]
    middlelist = risk_json["data"]["middlelist"]
    lowlist = risk_json["data"]["lowlist"]

    # utf_8_sig writes a BOM so Excel opens the CSV with the correct encoding
    filename = 'risk_data_' + so_far_time + '.csv'
    f = open(filename, 'w', encoding='utf_8_sig', newline='')
    csv_writer = csv.writer(f)

    level_dict = {"高风险": highlist, "中风险": middlelist, "低风险": lowlist}

    # one row per community: level, province, city, county, community
    for risk_level, areas in level_dict.items():
        for area in areas:
            for community in area["communitys"]:
                csv_writer.writerow(
                    [risk_level, area["province"], area["city"], area["county"], community])

    f.close()

    print("Finished writing " + filename + ".")

def get_risk_area_data():
    timestamp = str(int(time.time()))

    x_wif_timestamp = timestamp
    timestampHeader = timestamp

    x_wif_nonce = 'QkjjtiLM2dCratiA'
    x_wif_paasid = 'smt-application'

    x_wif_signature_str = timestamp + \
        'fTN2pfuisxTavbTuYVSsNJHetwq5bJvCQkjjtiLM2dCratiA'+timestamp
    x_wif_signature = hashlib.sha256(
        x_wif_signature_str.encode('utf-8')).hexdigest().upper()

    signatureHeader_str = timestamp + \
        '23y0ufFl5YxIyGrI8hWRUZmKkvtSjLQA'+'123456789abcdefg'+timestamp
    signatureHeader = hashlib.sha256(
        signatureHeader_str.encode('utf-8')).hexdigest().upper()

    url = 'http://bmfw.www.gov.cn/bjww/interface/interfaceJson'

    headers = {
        'Accept': "application/json, text/plain, */*",
        'Content-Type': "application/json;charset=utf-8",
        'x-wif-nonce': x_wif_nonce,
        'x-wif-paasid': x_wif_paasid,
        'x-wif-signature': x_wif_signature,
        'x-wif-timestamp': x_wif_timestamp,
    }

    # the server expects the body as a raw JSON string, not form fields
    form_data = json.dumps({
        "key": "3C502C97ABDA40D0A60FBEE50FAAD1DA",
        "appId": "NcApplication",
        "paasHeader": "zdww",
        "timestampHeader": timestampHeader,
        "nonceHeader": "123456789abcdefg",
        "signatureHeader": signatureHeader,
    })

    response = requests.post(url=url, data=form_data, headers=headers)
    if response.status_code != 200:
        return "", response.status_code

    # strip the bullet character some entries contain
    return response.text.replace('\u2022', ''), response.status_code

if __name__ == '__main__':
    risk_txt, status_code = get_risk_area_data()
    if status_code == 200:
        with open('./risk_data.json', 'w', encoding='utf-8') as f:
            f.write(risk_txt)
        print("Finished writing risk_data.json.")

    with open('risk_data.json', 'r', encoding='utf-8') as f:
        risk_txt = f.read()

    writer_to_csv(risk_txt)

    print('All done. Please do not run this too frequently!')
    os.system('pause')
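The per-level counting that show_level_count does can also be applied to the whole parsed response at once. A minimal sketch, with a hypothetical sample dict mirroring the API's structure:

```python
def count_communities(risk_json):
    """Count listed communities per risk level from the parsed API response."""
    data = risk_json["data"]
    return {
        "high": sum(len(a["communitys"]) for a in data["highlist"]),
        "middle": sum(len(a["communitys"]) for a in data["middlelist"]),
        "low": sum(len(a["communitys"]) for a in data["lowlist"]),
    }

# hypothetical sample with the same shape as the real response
sample = {"data": {
    "highlist": [{"province": "X", "city": "Y", "county": "Z", "communitys": ["a", "b"]}],
    "middlelist": [],
    "lowlist": [{"province": "X", "city": "Y", "county": "Z", "communitys": ["c"]}],
}}
print(count_communities(sample))  # {'high': 2, 'middle': 0, 'low': 1}
```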

Original: https://blog.csdn.net/LILI00000/article/details/126710407
Author: LILI00000
Title: Python爬取全国各地区疫情风险等级
