【Python】Xiuren Photo Album Crawler 2.0

Long time no see, everyone. [smirk]

It has been a long time since the last article was published, so here I am with the promised version 2.0. After all, the comments section has started piling up, and it can't be put off any longer.

Emm… I won't write the specific page links in the main text; I'll put them in the comments in the code section.

Without further ado, here is the code for this update:

Target URL: https://www.xiurenb.com

Import the libraries
import time, os, requests
from lxml import etree
from urllib import parse

Define the request headers

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
    }

Initialize the lists
img_list = []
url_list = []
page_list = []

Encode the input
human_unencode = input('Enter the human_name:')
human_encode = parse.quote(human_unencode)

Build the search index URL from the encoded input
url_human = 'https://www.xiurenb.com/plus/search/index.asp?keyword=' + str(human_encode) + '&searchtype=title'
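
For reference, `parse.quote` percent-encodes the UTF-8 bytes of the name so it is safe to put in a URL; for example:

>>> parse.quote('唐安琪')
'%E5%94%90%E5%AE%89%E7%90%AA'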

Get the number of result-list pages for the specified model's photo albums
res_first = requests.get(url=url_human, headers=headers)
tree_first = etree.HTML(res_first.text)
Num_first = len(tree_first.xpath('/html/body/div[3]/div[1]/div/div/ul/div[3]/div/div[2]/a'))
print(f'Page_total:{Num_first}')

Get the URL of each photo album on the specified page and append it to the list
i = input('Enter the PageNumber:')
print(f'Getting the page-{i}...')
res_human = requests.get(url=url_human + '&p=' + str(i), headers=headers)
tree_human = etree.HTML(res_human.text)
jihe_human = tree_human.xpath('/html/body/div[3]/div[1]/div/div/ul/div[3]/div/div[1]/div/div[1]/h2/a/@href')
for page in jihe_human:
    page_list.append(page)
time.sleep(2)

Get all the pictures of each photo album
for Page_Num in page_list:
    url = 'https://www.xiurenb.com' + str(Page_Num)
    Num_res = requests.get(url=url, headers=headers)
    Num_tree = etree.HTML(Num_res.text)
    Num = len(Num_tree.xpath('/html/body/div[3]/div/div/div[4]/div/div/a'))
    url_list.append(url)
    for idx in range(1, Num - 2):
        url_other = url[:-5] + '_' + str(idx) + '.html'
        url_list.append(url_other)
    # Collect every image URL for this album
    for url_img in url_list:
        res = requests.get(url=url_img, headers=headers)
        tree = etree.HTML(res.text)
        img_src = tree.xpath('/html/body/div[3]/div/div/div[5]/p/img/@src')
        for img in img_src:
            img_list.append(img)
        time.sleep(0.5)
    # Create the save directory
    res = requests.get(url=url_list[0], headers=headers)
    res.encoding = 'utf-8'
    tree = etree.HTML(res.text)
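    # Site-specific: slice off the first 11 characters of the <h1> title (presumably a fixed prefix)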
    path_name = tree.xpath('/html/body/div[3]/div/div/div[1]/h1//text()')[0][11:]
    print(path_name)
    if not os.path.exists(f'C:/Users/liu/Pictures/{human_unencode}'):
        os.mkdir(f'C:/Users/liu/Pictures/{human_unencode}')
    the_path_name = f'C:/Users/liu/Pictures/{human_unencode}/' + path_name
    if not os.path.exists(the_path_name):
        os.mkdir(the_path_name)
        # Save the image data
        num = 0
        for j in img_list:
            img_url = 'https://www.xiurenb.com' + j
            img_data = requests.get(url=img_url, headers=headers).content
            img_name = img_url.split('/')[-1]
            num += 1
            finish_num = str(num) + '/' + str(len(img_list))
            with open(f'C:/Users/liu/Pictures/{human_unencode}/' + path_name + '/' + img_name, 'wb') as f:
                print(f'Downloading the img:{img_name}/{finish_num}')
                f.write(img_data)
            time.sleep(0.5)
        # Reset the lists
        img_list = []
        url_list = []
    else:
        print('gone>>>')
        # Reset the lists
        img_list = []
        url_list = []

Print the finish message
print('Finished!')
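
One small robustness note: the album loop above builds each URL with plain string concatenation ('https://www.xiurenb.com' + str(Page_Num)), which assumes every href is root-relative. If the site ever returns absolute hrefs, urljoin from urllib handles both forms; a minimal sketch, reusing the parse import from the top of the script:

# Assumption: Page_Num may be '/photo/xxx.html' or a full 'https://...' URL;
# urljoin produces the correct absolute URL in either case.
url = parse.urljoin('https://www.xiurenb.com', Page_Num)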

The code is fairly long this time, so I won't explain it line by line. One important note: remember to change the save path to your own, since the user name will be different.
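
For example, here is a minimal sketch of a portable save path using pathlib (my own suggestion, not part of the original script; it keeps the same <home>/Pictures/<search name>/<album title> layout without hard-coding the user name):

from pathlib import Path

# Build <home>/Pictures/<search name>/<album title> without a hard-coded user name;
# human_unencode and path_name come from the script above.
save_root = Path.home() / 'Pictures' / human_unencode
album_dir = save_root / path_name
album_dir.mkdir(parents=True, exist_ok=True)  # also creates the parent folder if missing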

This version searches for photo albums by name, for example 唐安琪 (Tang Anqi). When you run the code, enter what you want to search for, and then, partway through, enter the number of the page you want to download.
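
A sample run might look like this (the page total and file names here are made up for illustration):

Enter the human_name:唐安琪
Page_total:3
Enter the PageNumber:1
Getting the page-1...
Downloading the img:0.jpg/1/54
Downloading the img:1.jpg/2/54
...
Finished!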

If you have any other questions, you can ask me in the comments section.

Of course, if there's something I can't solve, I'll go catch up on my studies [sob], after all, I haven't been learning Python for very long…

Original: https://www.cnblogs.com/moxing-wanqian/p/moxingwanqian_1.html
Author: 魔性万千
