Fetching the Latest Housing Prices with Python: Beijing as an Example

The data is collected from a housing listing platform. The whole process consists of downloading the web pages, then extracting and analyzing the data they contain.

Import the libraries for downloading web pages, parsing HTML, and processing data.

from fake_useragent import UserAgent  # generates random User-Agent strings
from bs4 import BeautifulSoup  # HTML parsing
import numpy as np  # numerical computing
import requests  # HTTP client used to download pages
from requests.exceptions import RequestException  # base class for request errors
import pandas as pd  # data processing

Before starting, create a UserAgent object. It is used to generate a random User-Agent header each time a page is downloaded.

user_agent = UserAgent()
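
The same idea can be sketched with the standard library alone: pick a random User-Agent string from a small pool for every request. The pool FALLBACK_USER_AGENTS and the helper random_headers are illustrative assumptions, not part of the original code, and the strings are only examples.

```python
import random

# Hypothetical pool of User-Agent strings; examples only, not a current or complete list.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers(pool=FALLBACK_USER_AGENTS):
    """Build request headers with a randomly chosen User-Agent."""
    return {"user-agent": random.choice(pool)}
```

A call such as random_headers() returns a dictionary that can be passed as the headers argument of a request.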

Write a page download function, get_html_txt, that downloads the HTML text of the page at the given URL.

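def get_html_txt(url, page_index):
    '''
    Download the HTML text of a page.
    :param url: address of the page to crawl
    :param page_index: current page number (used in the error message)
    :return: the HTML text, or '' if the request fails
    '''
    try:
        headers = {
            'user-agent': user_agent.random
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
        return response.text
    except RequestException as e:
        print('Failed to download page {0}: {1}'.format(page_index, e))
        return ''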
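
Because the function returns an empty string on failure, a caller can retry before giving up. The following is a minimal sketch under that assumption; fetch_with_retry, its parameters, and the backoff schedule are hypothetical and not part of the original article.

```python
import time


def fetch_with_retry(fetch, url, page_index, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, page_index) until it returns non-empty text or retries run out.

    `fetch` is expected to behave like get_html_txt and return '' on failure.
    """
    for attempt in range(retries):
        html_txt = fetch(url, page_index)
        if html_txt.strip():
            return html_txt
        if attempt < retries - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return ''
```

It could be used as html_txt = fetch_with_retry(get_html_txt, url, page_index). Passing the fetch function and the sleep function as arguments keeps the helper easy to try out without touching the network.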

Write a page processing function, catch_html_data, that parses the page elements and saves the extracted data to a CSV file.

def catch_html_data(url, page_index):
    '''
    Parse one listing page and append its rows to the CSV file.
    :param url: address of the page to crawl
    :param page_index: current page number
    :return:
    '''

    # Download the page
    html_txt = get_html_txt(url, page_index)

    if html_txt.strip() != '':

        # Build the parse tree (the 'lxml' parser requires the lxml package)
        beautifulSoup = BeautifulSoup(html_txt, 'lxml')

        # Select the list of listings
        h_list = beautifulSoup.select('.resblock-list-wrapper li')

        # Walk through the details of each listing
        for h_detail in h_list:

            # Listing name
            h_detail_name = h_detail.select('.resblock-name a.name')
            h_detail_name = ' '.join(m.get_text() for m in h_detail_name)

            # Property type
            h_detail_type = h_detail.select('.resblock-name span.resblock-type')
            h_detail_type = ' '.join(m.get_text() for m in h_detail_type)

            # Sale status
            h_detail_status = h_detail.select('.resblock-name span.sale-status')
            h_detail_status = ' '.join(m.get_text() for m in h_detail_status)

            # Unit price
            h_detail_price = h_detail.select('.resblock-price .main-price .number')
            h_detail_price = ' '.join(m.get_text() for m in h_detail_price)

            # Total price
            h_detail_total_price = h_detail.select('.resblock-price .second')
            h_detail_total_price = ' '.join(m.get_text() for m in h_detail_total_price)

            # Build a one-row DataFrame and append it to the CSV file.
            # header=False means the column names below are not written to the file.
            h_info = [h_detail_name, h_detail_type, h_detail_status, h_detail_price, h_detail_total_price]
            h_info = np.array(h_info)
            h_info = h_info.reshape(-1, 5)
            h_info = pd.DataFrame(h_info, columns=['name', 'type', 'status', 'unit price', 'total price'])
            h_info.to_csv('北京房源信息.csv', mode='a+', index=False, header=False)

        print('Page {0}: listing data saved.'.format(page_index))
    else:
        print('Page {0}: no HTML to parse.'.format(page_index))
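
Since this function is later called from several threads that all append to the same file, one option is to serialize the writes with a lock and to write the header row once. The sketch below uses only the standard csv module; append_rows, CSV_COLUMNS, and the English column names are assumptions made for illustration.

```python
import csv
import os
import threading

# Hypothetical English header names, in the same order as the extracted fields.
CSV_COLUMNS = ['name', 'type', 'status', 'unit_price', 'total_price']
_csv_lock = threading.Lock()


def append_rows(path, rows, columns=CSV_COLUMNS):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    with _csv_lock:  # only one thread writes at a time
        need_header = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            if need_header:
                writer.writerow(columns)
            writer.writerows(rows)
```

With this helper, a page's listings could be collected into a list of five-item rows inside the loop and written with a single append_rows call after the loop.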

Write a multithreaded driver function that builds the page URLs and starts threads that call catch_html_data, which carries out the whole workflow.

import threading  # thread handling module

def thread_catch():
    '''
    Crawl the pages three at a time, one thread per page.
    :return:
    '''
    for num in range(1, 50, 3):
        url_pre = "https://bj.fang.lianjia.com/loupan/pg{0}/".format(str(num))
        url_cur = "https://bj.fang.lianjia.com/loupan/pg{0}/".format(str(num + 1))
        url_aft = "https://bj.fang.lianjia.com/loupan/pg{0}/".format(str(num + 2))

        thread_pre = threading.Thread(target=catch_html_data, args=(url_pre, num))
        thread_cur = threading.Thread(target=catch_html_data, args=(url_cur, num + 1))
        thread_aft = threading.Thread(target=catch_html_data, args=(url_aft, num + 2))
        thread_pre.start()
        thread_cur.start()
        thread_aft.start()

        # Wait for this batch to finish so that at most three requests run at once
        thread_pre.join()
        thread_cur.join()
        thread_aft.join()

thread_catch()
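
An alternative to creating the threads by hand is a thread pool from the standard library, which caps the number of simultaneous requests. This is a sketch rather than the author's method; crawl_pages, BASE_URL, and the default page range 1 to 51 (the pages covered by the loop above) are names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://bj.fang.lianjia.com/loupan/pg{0}/"


def crawl_pages(worker, first_page=1, last_page=51, max_workers=3):
    """Run worker(url, page_index) for every page with at most max_workers threads."""
    pages = range(first_page, last_page + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, BASE_URL.format(page), page) for page in pages]
        # result() re-raises any exception raised inside the worker thread
        return [future.result() for future in futures]
```

Calling crawl_pages(catch_html_data) would process the same pages while never running more than three downloads at a time.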

The stored data looks like this:

[Screenshot of the saved CSV file]
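
Once the crawl has finished, the saved file can be checked with a short script. The sketch below averages the unit-price column (the fourth field in the row order used above) and skips cells that are not plain numbers, on the assumption that some listings show text instead of a price. The helper average_unit_price is hypothetical.

```python
import csv
from statistics import mean


def average_unit_price(path, price_column=3):
    """Average the numeric cells in the unit-price column, skipping non-numeric ones."""
    prices = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            if len(row) <= price_column:
                continue
            cell = row[price_column].strip().replace(',', '')
            if cell.isdecimal():
                prices.append(int(cell))
    return mean(prices) if prices else None
```

For the file written above, the call would be average_unit_price('北京房源信息.csv'). It returns None when no numeric price is found.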

Original: https://www.cnblogs.com/lwsbc/p/16154263.html
Author: Python集中营
Title: python 获取最新房价信息-以北京房价为例
