A Crawler Example in 85 Lines of Code: Multithreading + Data File Operations + Database Storage

May 23, 2023, 11:58 PM • Python

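Preface

This is the second crawler I have written since I first came across web scraping. It is also the second real project, however small, that I have written since learning Python; the first was my first crawler.

I have been learning Python for less than three weeks. I have a fairly full course load and study Python on the side, so I only get about an hour a day for it.
I therefore don't know Python especially well yet; if there are mistakes in the code or in my understanding, corrections are welcome.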

The website crawled

This is a website that recommends web novels.

https://www.tuishujun.com/

I used the code example below to crawl the data for every novel on the site, about 170,000 records in total.

It took roughly 6 hours, which is decent efficiency. With a single thread, I estimate it would take several days even when crawling non-stop around the clock.
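As a rough sanity check on these numbers, here is a back-of-the-envelope sketch. The crawler requests 199,999 page ids (`range(1, 200000)`) and finished in about 6 hours. The 1 second per page used for the single-threaded case is purely an assumed figure, not a measurement, and `crawl_hours` is a hypothetical helper introduced only for this illustration.

```python
def crawl_hours(total_pages, seconds_per_page, workers=1):
    """Estimated wall-clock hours, assuming perfect scaling across workers."""
    return total_pages * seconds_per_page / workers / 3600


TOTAL_PAGES = 199_999  # ids 1..199999, matching range(1, 200000) in the crawler

# Average request rate implied by finishing in about 6 hours.
requests_per_second = TOTAL_PAGES / (6 * 3600)

# Single-threaded estimate under the ASSUMED latency of 1 second per page.
single_thread_hours = crawl_hours(TOTAL_PAGES, seconds_per_page=1.0)

print(round(requests_per_second, 1))  # about 9.3 requests per second
print(round(single_thread_hours, 1))  # about 55.6 hours, a little over two days
```

Under that assumption the single-threaded run lands in the "several days" range, which is consistent with the estimate above.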

When I first started writing this crawler I ran into quite a few problems. First, although there is plenty of material online about multithreaded crawlers in Python, …

In addition, there are very few examples of crawlers that write to a database from multiple threads.

To solve these problems I went through a lot of material, took many wrong turns, and fumbled around for a few days before arriving at the example below.

You can use it as a reference for extending and writing your own multithreaded crawler.

Points to note:
In this example I build a thread pool with ThreadPoolExecutor (it is worth looking up some material on it). If you want to store data in a database while using multiple threads, I recommend this approach; otherwise you will run into all sorts of errors when the code runs.
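The pattern described here can be sketched in isolation. This is a minimal illustration, not the crawler itself: it assumes SQLite (from the standard library) as a stand-in for MySQL so that it runs without a server, and `DB_PATH`, `init_db`, `store`, `crawl`, and `count_rows` are hypothetical names. The point it shows is that every task opens and closes its own connection instead of sharing one across threads.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# ASSUMPTION: a temporary SQLite file stands in for the MySQL database.
DB_PATH = os.path.join(tempfile.mkdtemp(), "demo.db")


def init_db():
    """Create the demo table once, before any worker thread starts."""
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, name TEXT)")
    con.commit()
    con.close()


def store(page):
    """Insert one row using a connection owned by this task only."""
    con = sqlite3.connect(DB_PATH, timeout=30)  # wait if another thread holds the write lock
    try:
        con.execute("INSERT INTO book (id, name) VALUES (?, ?)", (page, "book-%d" % page))
        con.commit()
    finally:
        con.close()  # always release the connection, even if the insert fails
    return page


def crawl(pages, workers=4):
    init_db()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes the results, so an exception in any task is re-raised here.
        return list(pool.map(store, pages))


def count_rows():
    con = sqlite3.connect(DB_PATH)
    try:
        return con.execute("SELECT COUNT(*) FROM book").fetchone()[0]
    finally:
        con.close()
```

Calling `crawl(range(1, 21))` inserts twenty rows from four worker threads. Opening one connection per task costs a little extra time, but it avoids the errors that tend to appear when a single connection or cursor is used from several threads at once.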

Code example

import requests
import pymysql
import os
from lxml import etree
from fake_useragent import UserAgent
from concurrent.futures import ThreadPoolExecutor

class tuishujunSpider(object):
    def __init__(self):
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs('db/tuishujun', exist_ok=True)
        self.f = open('./db/tuishujun/tuishujun.txt', 'a', encoding='utf-8')
        self.con = pymysql.connect(host='localhost', user='root', password='123456789', database='novel',
                                   charset='utf8', port=3306)
        self.cursor = self.con.cursor()
        # IF NOT EXISTS replaces the manual SHOW TABLES check.
        self.cursor.execute("""create table if not exists tuishujun
                            ( id BIGINT NOT NULL AUTO_INCREMENT,
                              cover VARCHAR(255),
                              name VARCHAR(255),
                              author VARCHAR(255),
                              source VARCHAR(255),
                              intro LONGTEXT,
                              PRIMARY KEY (id))
                           """)
        self.con.commit()
        self.cursor.close()
        self.con.close()

    def start(self, page):
        # Crawl a single book page, then store the result in the text file and in MySQL.
        headers = {
            'User-Agent': UserAgent().random
        }
        url = 'https://www.tuishujun.com/books/' + str(page)
        r = requests.get(url, headers=headers)
        # Skip ids for which the site answers with HTTP 500.
        if r.status_code == 500:
            return
        html = etree.HTML(r.text)
        # Every field sits under the same container, so the shared XPath prefix is kept in one place.
        base = '//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]'
        book = {'id': str(page)}
        try:
            book['cover'] = html.xpath(base + '/div[1]/div[1]/img/@src')[0]
        except IndexError:
            # No cover image was found on this page.
            book['cover'] = ''
        book['name'] = html.xpath(base + '/div[1]/div[2]/div/div[1]/h3/text()')[0]
        author = html.xpath(base + '/div[1]/div[2]/div/div[2]/a/text()')[0].strip()
        book['author'] = author.replace("\n", "")
        book['source'] = html.xpath(base + '/div[1]/div[2]/div/div[5]/text()')[0]
        intro = html.xpath(base + '/div[2]/text()')[0]
        book['intro'] = intro.replace(" ", "").replace("\n", "")
        self.f.write(str(book) + '\n')
        # Each task opens its own connection; a pymysql connection should not be shared between threads.
        # It is opened only after parsing succeeds, so skipped or failed pages do not leak connections.
        con = pymysql.connect(
            host='localhost', user='root', password='123456789', database='novel', charset='utf8', port=3306)
        try:
            with con.cursor() as cursor:
                cursor.execute("insert into tuishujun(id,cover,name,author,source,intro) "
                               "values(%s,%s,%s,%s,%s,%s)",
                               (book['id'], book['cover'], book['name'], book['author'],
                                book['source'], book['intro']))
            con.commit()
        finally:
            # Close the connection even if the insert fails.
            con.close()
        print(book)

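    def run(self):
        pages = range(1, 200000)
        with ThreadPoolExecutor() as pool:
            # The results of map() are never read, so exceptions raised inside start() are silently dropped.
            pool.map(self.start, pages)
        # All tasks have finished once the with-block exits, so the output file can be closed.
        self.f.close()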

if __name__ == '__main__':
    spider = tuishujunSpider()
    spider.run()
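One thing to be aware of in `run`: `pool.map` only re-raises a task's exception when its result is read, and the results are never read here, so pages that fail to parse disappear without a trace. The sketch below shows one possible way to keep track of failures with `submit` and `as_completed`. It is an illustration under stated assumptions, not part of the original crawler: `fetch` is a hypothetical stand-in for the page handler that deliberately fails on every fifth page, and `run_all` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(page):
    # Hypothetical stand-in for the real page handler: fails on every 5th page.
    if page % 5 == 0:
        raise ValueError("page %d could not be parsed" % page)
    return page


def run_all(pages, workers=8):
    """Run fetch() for every page and report which pages succeeded or failed."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its page id so failures can be attributed.
        futures = {pool.submit(fetch, page): page for page in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                succeeded.append(future.result())  # re-raises the task's exception, if any
            except Exception as exc:
                failed[page] = exc
    return sorted(succeeded), failed


if __name__ == '__main__':
    ok, bad = run_all(range(1, 11))
    print(ok)
    print(sorted(bad))
```

Because the results are handled in the main thread, writing them to the text file inside this loop would also avoid having several threads write to the same file handle. Submitting every page up front creates one future per page, so for very large ranges it may be worth submitting in batches.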

Original: https://www.cnblogs.com/ouhouyi/p/16412282.html
Author: 蚂蚁追风筝
Title: A Crawler Example in 85 Lines of Code: Multithreading + Data File Operations + Database Storage
