Scrape thousands of meme images with Python in one go and win over your whole friends list in minutes

These days, no young person wants to admit to chatting without a stash of memes. Memes have become an indispensable part of online conversation.

With someone you have just met, tossing out a few memes closes the distance in minutes. When your girlfriend is angry, a couple of memes can cheer her up, and they defuse awkward moments too. Without a few memes on hand, all I can do is stay polite and feel awkward.


I. Preparation

Preparation matters. First work out what we are going to do, which tools we will use, and how to do it; then proceed step by step, steadily.

Development environment

Python 3.6
PyCharm

Open your browser and search for the name of each program you need to install.

Python

Pick the official website from the results. Any result labeled as an advertisement is exactly that, an ad, so do not click it.

Click the Python 3.10.2 button below the menu to download the latest version; there is no need to click the Download entry itself.

PyCharm

Click either Download button.

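Installing the modules

requests
parsel
re (part of the standard library, so there is nothing to install)

Press Win+R, type cmd, and press Enter to open a command prompt. Then type pip install followed by the name of the module (for example, pip install requests) and press Enter to install it.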
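Before running the crawler it can help to confirm that the modules are actually importable. The following is a minimal sketch, not part of the original article; the helper name `missing_modules` is made up for illustration, and it only relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# requests and parsel come from pip; re and time ship with Python itself.
print(missing_modules(["requests", "parsel", "re", "time"]))
```

An empty list means everything is installed; any name that is printed still needs a pip install.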

II. Code

Target: fabiaoqing
The beginning and end of the site address are left out here and in the code below; fill them in yourself, which should not be hard.

Import the modules

import requests
import parsel
import re
import time

Request URL

url = f'fabiaoqing/biaoqing/lists/page/{page}.html'
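The `{page}` placeholder means one URL per list page. As a rough sketch of how the page URLs could be generated up front: the function name `build_page_urls` and the base address `https://example.com` are placeholders chosen for illustration, not the real site address, which the article leaves for the reader to fill in.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return the list-page URLs for first_page..last_page, inclusive."""
    template = base_url.rstrip("/") + "/biaoqing/lists/page/{page}.html"
    return [template.format(page=page) for page in range(first_page, last_page + 1)]


# "https://example.com" is a placeholder, not the real site address.
urls = build_page_urls("https://example.com", 1, 3)
for url in urls:
    print(url)
```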

Request headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}

Fetch the page source

response = requests.get(url=url, headers=headers)

Parse the data

selector = parsel.Selector(response.text)  # convert response.text into a Selector object

First pass: extract every matching div element

divs = selector.css('#container div.tagbqppdiv')  # select elements with a CSS selector

Extract the image URL from each div (the lines below run inside a for div in divs loop)

img_url = div.css('img::attr(data-original)').get()

Extract the title

title = div.css('img::attr(title)').get()

Get the image's file extension

name = img_url.split('.')[-1]
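Splitting on `.` works when the URL ends in the file name, but it would return the wrong piece if the URL carried a query string containing a dot. Here is a slightly more defensive sketch using only the standard library; the function name `image_suffix` and the `jpg` fallback are assumptions made for illustration, not something the article uses.

```python
import os
from urllib.parse import urlparse


def image_suffix(img_url, default="jpg"):
    """Return the file extension of img_url without the dot, ignoring any query string."""
    path = urlparse(img_url).path            # drops ?query and #fragment
    ext = os.path.splitext(path)[1].lstrip(".")
    return ext.lower() or default            # fall back when the path has no extension


print(image_suffix("https://example.com/a/b/pic.GIF?v=1.2"))
```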

Sanitize the title

new_title = change_title(title)

Request the image and get its binary data

img_content = requests.get(url=img_url, headers=headers).content

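Save the data

def save(title, img_url, name):
    # Download the image and write it to img\<title>.<name>; the img folder must already exist
    img_content = get_response(img_url).content
    try:
        with open('img\\' + title + '.' + name, mode='wb') as f:
            # Write the image's binary data
            f.write(img_content)
            print('Saving:', title)
    except OSError as exc:
        # Report the failure instead of silently skipping the file
        print('Failed to save:', title, exc)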
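The save step above writes into an `img` folder with a Windows-style path, so the folder has to exist beforehand. A portable variant could look like the sketch below; `save_bytes` is a hypothetical helper that takes the already-downloaded bytes, so it can be tried without touching the network.

```python
from pathlib import Path


def save_bytes(folder, title, suffix, data):
    """Write data to <folder>/<title>.<suffix>, creating the folder if needed."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    target = target_dir / f"{title}.{suffix}"
    target.write_bytes(data)
    return target
```

In the crawler this would be called with the sanitized title, the extension, and the `.content` of the image response.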

Replace special characters in the title

The titles come from the site, so we cannot know them in advance, and they may contain characters that are not allowed in file names. We replace those characters with a regular expression.

def change_title(title):
    # Characters that Windows does not allow in file names: \ / : * ? " < > |
    mode = re.compile(r'[\\/:*?"<>|]')
    new_title = re.sub(mode, "_", title)
    return new_title
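Scraped titles can also be missing, empty, or very long. The sketch below extends the same regular-expression idea to cover those cases; the name `safe_filename`, the `untitled` fallback, and the 80-character limit are illustrative assumptions rather than anything the article prescribes.

```python
import re

# Characters Windows rejects in file names, plus line breaks and tabs.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, fallback="untitled", max_length=80):
    """Return a file-name-safe version of title, or fallback if nothing usable is left."""
    if not title:
        return fallback
    cleaned = _ILLEGAL.sub("_", title).strip(" .")
    cleaned = cleaned[:max_length].rstrip(" .")   # Windows dislikes trailing dots and spaces
    return cleaned or fallback


print(safe_filename('what? a "meme": v1/2'))
```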

Record the elapsed time

# time_1 = time.time() is recorded before the crawl starts
time_2 = time.time()

use_time = int(time_2) - int(time_1)
print(f'Total time: {use_time} s')
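Truncating both timestamps to whole seconds before subtracting can be off by up to a second. For short runs, `time.perf_counter()` gives a finer measurement. The helper below is a small sketch with a made-up name, `timed`, and `sum` stands in for the real crawl function.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# sum() is only a stand-in for the crawl; any callable works.
result, elapsed = timed(sum, range(1000))
print(f"Total time: {elapsed:.3f} s")
```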

Folks, everything above is single-threaded. The multithreaded version follows; I will go straight to the code.

import requests
import parsel
import re
import time
import concurrent.futures

def change_title(title):
    # Replace characters that Windows does not allow in file names: \ / : * ? " < > |
    mode = re.compile(r'[\\/:*?"<>|]')
    new_title = re.sub(mode, "_", title)
    return new_title

def get_response(html_url):
    # Send a GET request with a browser-like User-Agent and return the response
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

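def save(title, img_url, name):
    # Download the image and write it to img\<title>.<name>; the img folder must already exist
    img_content = get_response(img_url).content
    try:
        with open('img\\' + title + '.' + name, mode='wb') as f:
            # Write the image's binary data
            f.write(img_content)
            print('Saving:', title)
    except OSError as exc:
        # Report the failure instead of silently skipping the file
        print('Failed to save:', title, exc)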

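def main(html_url):
    # Download one list page and save every image found on it
    html_data = get_response(html_url).text
    selector = parsel.Selector(html_data)
    divs = selector.css('#container div.tagbqppdiv')
    for div in divs:
        # Image URL
        img_url = div.css('img::attr(data-original)').get()
        # Image title
        title = div.css('img::attr(title)').get()
        if not img_url or not title:
            continue  # skip entries that lack a URL or a title
        # File extension
        name = img_url.split('.')[-1]
        # Title with illegal file-name characters replaced
        new_title = change_title(title)
        save(new_title, img_url, name)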

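if __name__ == '__main__':
    time_1 = time.time()
    # Up to 10 list pages are processed at the same time
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 201):
        # The beginning and end of the site address are omitted, as noted earlier
        url = f'fabiaoqing/biaoqing/lists/page/{page}.html'
        exe.submit(main, url)
    # Wait for all submitted pages to finish
    exe.shutdown()
    time_2 = time.time()
    use_time = int(time_2) - int(time_1)
    print(f'Total time: {use_time} s')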

Folks, that is over a thousand images in 18 seconds. It was over rather quickly.
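One thing to keep in mind with `exe.submit`: an exception raised inside a worker is stored on the returned future and never shown unless someone asks for the result. The sketch below collects results and errors with `concurrent.futures.as_completed`; `crawl_all` and `fake_worker` are hypothetical names, and `fake_worker` only simulates a page crawl so the example runs offline.

```python
import concurrent.futures


def crawl_all(worker, urls, max_workers=10):
    """Run worker(url) for each URL in a thread pool; return (results, errors)."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(worker, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the failure instead of losing it
                errors[url] = exc
    return results, errors


def fake_worker(url):
    """Stand-in for a real page crawl; fails on one page to show error collection."""
    if url.endswith("2.html"):
        raise ValueError("simulated failure")
    return len(url)
```

Passing the article's `main` function as `worker` would make failed pages show up in `errors` instead of disappearing silently.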

If you found this useful, please like and bookmark it.

As you can see, the code is fast: it needs only 18 seconds. Let's hope nobody is that quick in everyday life; that would not be so great.

Original: https://www.cnblogs.com/hahaa/p/15990045.html
Author: 轻松学Python
Title: 用python一键爬取几千张表情包斗图,分分钟征服朋友圈所有好友

