Scraping Project Listings from Freelance Outsourcing Sites with Python

July 31, 2023, 6:53 PM • Python

Contents

Preface
I. Requirements
II. Analysis
III. Implementation
IV. Summary

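Preface

To get better at data processing, I am starting a series of small Python web-scraping projects.

Each project is driven by a small, concrete requirement
Each one summarizes the available approaches
Data returned in the page source (XPath, Bs4, PyQuery, regex)
Data returned by an API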

I. Requirements

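II. Analysis

一品威客 (epwk.com)
1. View the page source
2. Locate the data
3. Fetch the detail page (bounty, task requirements, description, status)

软件项目交易网 (sxsoft.com)
1. View the page source
2. Search the source for the data

YesPMP task listings
1. View the page source
2. Search the source for the data

码市 (codemart.com)
1. Capture traffic in the browser DevTools (F12) to find the request that returns the data
2. Build the same request to fetch the data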
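
Step 2 for 码市 can be sketched with the standard library alone. The endpoint and the `labelId` and `page` parameters come from the listing later in this article; the helper name `build_project_url` is hypothetical, and the idea that later pages are reached by incrementing `page` is an assumption that has not been checked against the live API.

```python
from urllib.parse import urlencode

# Endpoint taken from the article's 码市 listing.
API_URL = "https://codemart.com/api/project"


def build_project_url(page, label_id=""):
    """Return the listing API URL for one page (hypothetical helper)."""
    query = urlencode({"labelId": label_id, "page": page})
    return f"{API_URL}?{query}"


# Assumption: later pages are reached by incrementing `page`.
urls = [build_project_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Building the query string from a dict keeps the parameters in one place, which makes it easier to add a filter later than editing a hand-written URL.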

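III. Implementation

一品威客
1. Tasks on the listing page
2. Detail pages (with direct-hire tasks handled)
3. Extract the bounty, task requirements, and times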


__author__ = "Nick"
__created_date__ = "2022/11/12"

import re

import requests
from bs4 import BeautifulSoup

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
           "Content-Type": "text/html; charset=utf-8"}

# Category tag that prefixes each task title on the listing page
CATEGORY_PREFIX = "【数据采集】"


def get_index_source(url):
    """Fetch a page and return its HTML decoded as UTF-8."""
    res = requests.get(url, headers=HEADERS, timeout=10)
    res.raise_for_status()
    res.encoding = "utf-8"
    return res.text


def method_bs4(html):
    """Parse HTML with BeautifulSoup's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def method_zz(code):
    """Return True if the page source marks the task as direct hire."""
    # The HTML around the named group was lost from the original listing
    # (its pattern was empty), so this pattern matches the marker text only.
    deal = re.compile(r"(?P<is_direct>直接雇佣任务)", re.S)
    return deal.search(code) is not None


def get_task_url(html):
    """Map each task title on the listing page to its detail-page URL."""
    page = method_bs4(html)

    url_dict = {}
    for _div in page.select(".title.marginLeft"):
        link = _div.find("a")
        if link is None or not link.get("href"):
            continue
        # Drop the category prefix when present; otherwise keep the full title
        task = _div.text.split(CATEGORY_PREFIX)[-1].strip()
        url_dict[task] = link["href"]
    return url_dict


def get_task_content(url_dict):
    with open("一品威客任务.txt", mode="a+", encoding="utf-8") as f:
        for name, url in url_dict.items():
            code_source = get_index_source(url)
            page = method_bs4(code_source)

            # Defaults, so a missing element cannot raise NameError below
            task_money = start_time = end_time = ""

            # Bounty
            for _money in page.select(".nummoney.f_l span"):
                task_money = _money.text.strip()
                print(task_money)

            # Direct-hire tasks get an extra line in the output
            if method_zz(code_source):
                f.write(f"直接雇佣-{name}{task_money}\n")

            # Start and end times are attributes of the countdown element
            for _time in page.select("#TimeCountdown"):
                start_time = _time.get("starttime", "")
                end_time = _time.get("endtime", "")
                print(start_time, end_time)

            # Task requirements: join every paragraph instead of keeping
            # only the last one
            paragraphs = [p.text.strip() for p in page.select(".task-info-content p")]
            content_data = " ".join(paragraphs)
            print(content_data)
            f.write(f"{name}---{content_data},{task_money},{start_time},{end_time}\n")


if __name__ == '__main__':
    url = "https://task.epwk.com/sjcj/"
    html = get_index_source(url)
    url_dict = get_task_url(html)
    get_task_content(url_dict)

软件项目交易网
The data can be extracted with XPath.


__author__ = "Nick"
__created_date__ = "2022/11/12"

import requests
from lxml import etree

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
           "Content-Type": "text/html; charset=utf-8"}


def get_index_source(url):
    """Fetch a page and return its HTML decoded as UTF-8."""
    res = requests.get(url, headers=HEADERS, timeout=10)
    res.raise_for_status()
    res.encoding = "utf-8"
    return res.text


def method_xpath(html):
    """Parse HTML into an lxml element tree."""
    return etree.HTML(html)


def get_task_info(html):
    with open("软件交易网站需求.txt", mode="w", encoding="utf-8") as f:
        parse = method_xpath(html)

        # One <li> per project in the listing
        for li in parse.xpath('//*[@id="projectLists"]/div/ul/li'):
            # Status: the second text node matched under the left_2 block
            status = li.xpath('./div[@class="left_2"]/span/text()')[1].strip()

            # Task title: the last text node of the link
            task = li.xpath('./div[@class="left_8"]/h4/a/text()')
            task_content = task[-1].strip()

            # The four <em> values, in page order (names follow the original)
            bond = li.xpath('./div[@class="left_8"]/span[1]/em/text()')[0]
            hot = li.xpath('./div[@class="left_8"]/span[2]/em/text()')[0]
            start_time = li.xpath('./div[@class="left_8"]/span[3]/em/text()')[0]
            end_time = li.xpath('./div[@class="left_8"]/span[4]/em/text()')[0]
            f.write(f"{status},{task_content},{bond},{hot},{start_time},{end_time}\n")


if __name__ == '__main__':
    url = "https://www.sxsoft.com/page/project"
    html = get_index_source(url)
    get_task_info(html)
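
The XPath listing indexes each query result directly with `[0]` or `[1]`, which raises IndexError as soon as one item lacks a field. A minimal sketch of a guard is shown here; `pick` is a hypothetical helper, not part of lxml, and plain lists stand in for the lists that `li.xpath(...)` returns.

```python
def pick(nodes, index=0, default=""):
    """Return nodes[index] stripped, or `default` when the index is missing.

    `nodes` is the kind of list an XPath text() query returns.
    """
    try:
        return str(nodes[index]).strip()
    except IndexError:
        return default


# Plain lists stand in for XPath results here.
print(pick(["  open  ", " Recruiting "], index=1))
print(pick([], default="N/A"))
```

With such a helper, a line like `pick(li.xpath(...), 1)` would leave the field empty instead of stopping the whole run on one malformed item.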

YesPMP task listings
The data can be extracted with PyQuery.


__author__ = "Nick"
__created_date__ = "2022/11/12"

import requests
from pyquery import PyQuery as pq

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
           "Content-Type": "text/html; charset=utf-8"}


def get_index_source(url):
    """Fetch a page and return its HTML decoded as UTF-8."""
    res = requests.get(url, headers=HEADERS, timeout=10)
    res.raise_for_status()
    res.encoding = "utf-8"
    return res.text


def method_pq(html):
    """Wrap HTML in a PyQuery document."""
    return pq(html)


def get_task_info(html):
    # Append mode, because this function is called once per listing page
    with open("yespmp网站需求.txt", mode="a", encoding="utf-8") as f:
        parse = method_pq(html)

        # One .promain block per project
        for item in parse.find(".promain").items():
            task_name = item.find(".name").text()
            price = item.find(".price").text()
            date = item.find(".date").text()
            bid_num = item.find(".num").text()
            f.write(f"{task_name},{price},{date},{bid_num}\n")


if __name__ == '__main__':
    # Listing pages 2 through 9
    for i in range(2, 10):
        url = f"https://www.yespmp.com/project/index_i{i}.html"
        html = get_index_source(url)
        get_task_info(html)
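
The listings above write comma-joined lines with `f.write`, so a task title that itself contains a comma will shift the columns when the file is read back. One possible alternative, sketched with the standard `csv` module, is shown below; the rows are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up rows; the second title contains a comma on purpose.
rows = [
    ("Build a crawler", "1000-3000", "2022-11-12", "5"),
    ("App, backend and admin panel", "5000", "2022-11-13", "2"),
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["task_name", "price", "date", "bid_num"])
writer.writerows(rows)

# Reading back still gives four fields per row, because the writer
# quoted the field that contains a comma.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[2])
```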

码市
Basic requests usage (request headers and parameters)


__author__ = "Nick"
__created_date__ = "2022/11/12"

import requests

# Headers copied from the browser request. The cookie comes from the author's
# browser session, so it likely needs replacing with one from your own session.
headers = {
    'cookie': 'mid=6c15e915-d258-41fc-93d9-939a767006da; JSESSIONID=1hfpjvpxsef73sbjoak5g5ehi; _gid=GA1.2.846977299.1668222244; _hjSessionUser_2257705=eyJpZCI6ImI3YzVkMTc5LWM3ZDktNTVmNS04NGZkLTY0YzUxNGY3Mzk5YyIsImNyZWF0ZWQiOjE2NjgyMjIyNDM0NzgsImV4aXN0aW5nIjp0cnVlfQ==; _ga_991F75Z0FG=GS1.1.1668245580.3.1.1668245580.0.0.0; _ga=GA1.2.157466615.1668222243; _gat=1',
    'referer': 'https://codemart.com/projects?labelId=&page=1',
    'accept': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}


def get_data():
    url = "https://codemart.com/api/project?labelId=&page=1"
    # A GET request needs no body, so the empty payload is dropped
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # The request asks for JSON, so decode the body directly
    print(response.json())


if __name__ == '__main__':
    get_data()
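
The listing above only prints the decoded response. If the data should be kept, one option is to serialize it with the standard `json` module, as sketched below. The payload here is entirely hypothetical, since the article does not show the structure of the real response.

```python
import json

# Hypothetical payload: the real response structure is not shown in the article.
data = {"items": [{"id": 1, "name": "数据采集脚本", "price": "3000"}]}

# ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
text = json.dumps(data, ensure_ascii=False, indent=2)
print(text)

# The dumped text parses back to the same object.
restored = json.loads(text)
print(restored == data)
```

When writing `text` to disk, opening the file with `encoding="utf-8"` matches what the other listings in this article already do.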

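IV. Summary

XPath
Best when the target data sits under a specific tag and the tag hierarchy is clear: locate it by path, then iterate with a for loop
Bs4
Best when the target data is scattered but can be located with selectors (a unique class or id)
PyQuery
Best when the target data is scattered but can be located with selectors (a unique class or id)
Regex
A non-greedy (.*?) group can step in when element selectors fail, or when only a small amount of data needs locating
Not suitable when the page source contains many other symbols, since the pattern then fails to match
API response data
When the API is not encrypted, building the request with requests is enough to get the data
Pay attention to the parameters in the request headers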
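
The regex item above can be illustrated with a short sketch. The HTML fragment and its class names are made up; the point is only the difference between the non-greedy `(.*?)` and a greedy `(.*)`, and the effect of `re.S`.

```python
import re

# Made-up HTML fragment; the class names are hypothetical.
html = """
<div class="task"><span class="price">
  3000
</span></div>
<div class="task"><span class="price">800</span></div>
"""

# (.*?) is non-greedy: it stops at the first closing tag.
# re.S lets "." match newlines, so values that span lines are captured too.
pattern = re.compile(r'<span class="price">(?P<price>.*?)</span>', re.S)
prices = [m.group("price").strip() for m in pattern.finditer(html)]
print(prices)

# A greedy (.*) runs on to the last closing tag and merges both entries
# into a single match.
greedy = re.compile(r'<span class="price">(.*)</span>', re.S)
greedy_matches = greedy.findall(html)
print(len(greedy_matches))
```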


Original: https://blog.csdn.net/Uncle_wangcode/article/details/127819205
Author: 不秃头的测开
Title: Python爬取各大外包网站需求 (Scraping Project Listings from Freelance Outsourcing Sites with Python)

