【Python】Xiuren Photo Album Crawler 2.0

Long time no see, everyone. [smirk]

It has been a long time since the last article was published, so here I am with the promised version 2.0. After all, the comments section has started piling up, and it can't be put off any longer.

Emm… I won't write the specific page links in the main text; I'll put them in the comments in the code section.

Without further ado, here is the code for this update:

Target URL: https://www.xiurenb.com

Import the libraries
import time, os, requests
from lxml import etree
from urllib import parse

Define the request headers

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
    }

Initialize the lists
img_list = []
url_list = []
page_list = []

Encode the input
human_unencode = input('Enter the human_name:')
human_encode = parse.quote(human_unencode)

Build the search index URL from the encoded input
url_human = 'https://www.xiurenb.com/plus/search/index.asp?keyword=' + str(human_encode) + '&searchtype=title'
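
For reference, `parse.quote` percent-encodes the UTF-8 bytes of the name so it is safe to put in a URL; for example:

>>> parse.quote('唐安琪')
'%E5%94%90%E5%AE%89%E7%90%AA'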

Get the number of result-list pages for the specified model's photo albums
res_first = requests.get(url=url_human, headers=headers)
tree_first = etree.HTML(res_first.text)
Num_first = len(tree_first.xpath('/html/body/div[3]/div[1]/div/div/ul/div[3]/div/div[2]/a'))
print(f'Page_total:{Num_first}')

Get the URL of each photo album on the specified page and append it to the list
i = input('Enter the PageNumber:')
print(f'Getting the page-{i}...')
res_human = requests.get(url=url_human + '&p=' + str(i), headers=headers)
tree_human = etree.HTML(res_human.text)
jihe_human = tree_human.xpath('/html/body/div[3]/div[1]/div/div/ul/div[3]/div/div[1]/div/div[1]/h2/a/@href')
for page in jihe_human:
    page_list.append(page)
time.sleep(2)

Get all the pictures of each photo album
for Page_Num in page_list:
    url = 'https://www.xiurenb.com' + str(Page_Num)
    Num_res = requests.get(url=url, headers=headers)
    Num_tree = etree.HTML(Num_res.text)
    Num = len(Num_tree.xpath('/html/body/div[3]/div/div/div[4]/div/div/a'))
    url_list.append(url)
    for idx in range(1, Num - 2):
        url_other = url[:-5] + '_' + str(idx) + '.html'
        url_list.append(url_other)
    # Collect every image URL for this album
    for url_img in url_list:
        res = requests.get(url=url_img, headers=headers)
        tree = etree.HTML(res.text)
        img_src = tree.xpath('/html/body/div[3]/div/div/div[5]/p/img/@src')
        for img in img_src:
            img_list.append(img)
        time.sleep(0.5)
    # Create the save directory
    res = requests.get(url=url_list[0], headers=headers)
    res.encoding = 'utf-8'
    tree = etree.HTML(res.text)
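    # Site-specific: slice off the first 11 characters of the <h1> title (presumably a fixed prefix)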
    path_name = tree.xpath('/html/body/div[3]/div/div/div[1]/h1//text()')[0][11:]
    print(path_name)
    if not os.path.exists(f'C:/Users/liu/Pictures/{human_unencode}'):
        os.mkdir(f'C:/Users/liu/Pictures/{human_unencode}')
    the_path_name = f'C:/Users/liu/Pictures/{human_unencode}/' + path_name
    if not os.path.exists(the_path_name):
        os.mkdir(the_path_name)
        # Save the image data
        num = 0
        for j in img_list:
            img_url = 'https://www.xiurenb.com' + j
            img_data = requests.get(url=img_url, headers=headers).content
            img_name = img_url.split('/')[-1]
            num += 1
            finish_num = str(num) + '/' + str(len(img_list))
            with open(f'C:/Users/liu/Pictures/{human_unencode}/' + path_name + '/' + img_name, 'wb') as f:
                print(f'Downloading the img:{img_name}/{finish_num}')
                f.write(img_data)
            time.sleep(0.5)
        # Reset the lists
        img_list = []
        url_list = []
    else:
        print('gone>>>')
        # Reset the lists
        img_list = []
        url_list = []

Print the finish message
print('Finished!')
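
One small robustness note: the album loop above builds each URL with plain string concatenation ('https://www.xiurenb.com' + str(Page_Num)), which assumes every href is root-relative. If the site ever returns absolute hrefs, urljoin from urllib handles both forms; a minimal sketch, reusing the parse import from the top of the script:

# Assumption: Page_Num may be '/photo/xxx.html' or a full 'https://...' URL;
# urljoin produces the correct absolute URL in either case.
url = parse.urljoin('https://www.xiurenb.com', Page_Num)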

The code is fairly long this time, so I won't explain it line by line. One important note: remember to change the save path to your own, since the user name will be different.
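
For example, here is a minimal sketch of a portable save path using pathlib (my own suggestion, not part of the original script; it keeps the same <home>/Pictures/<search name>/<album title> layout without hard-coding the user name):

from pathlib import Path

# Build <home>/Pictures/<search name>/<album title> without a hard-coded user name;
# human_unencode and path_name come from the script above.
save_root = Path.home() / 'Pictures' / human_unencode
album_dir = save_root / path_name
album_dir.mkdir(parents=True, exist_ok=True)  # also creates the parent folder if missing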

This version searches for photo albums by name, for example 唐安琪 (Tang Anqi). When you run the code, enter what you want to search for, and then, partway through, enter the number of the page you want to download.
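
A sample run might look like this (the page total and file names here are made up for illustration):

Enter the human_name:唐安琪
Page_total:3
Enter the PageNumber:1
Getting the page-1...
Downloading the img:0.jpg/1/54
Downloading the img:1.jpg/2/54
...
Finished!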

If you have any other questions, you can ask me in the comments section.

Of course, if there's something I can't solve, I'll go catch up on my studies [sob], after all, I haven't been learning Python for very long…

Original: https://www.cnblogs.com/moxing-wanqian/p/moxingwanqian_1.html
Author: 魔性万千
