A crawler example in 85 lines of code: multithreading, data file operations, and database storage

May 23, 2023, 11:58 PM • Python

Preface

This is the second crawler I have written since I started learning about web crawlers.

It is also the second real project I have written since picking up Python; the first was my first crawler.

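I have been learning Python for less than three weeks. I have a heavy course load and study Python on the side, about an hour a day.
So I do not know Python very well yet. If there are mistakes in the code or in my understanding, corrections are welcome.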

The target site

This is a website that recommends online novels.

https://www.tuishujun.com/

I used the code example below to crawl all the novel data on this site, about 170,000 records.

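It took about 6 hours, which I think is reasonably efficient. With a single thread, I estimate it would take several days even if the crawler ran around the clock.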
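As a rough illustration of why threads help with I/O-bound work like this, here is a minimal sketch. It assumes a hypothetical `fake_fetch` function that sleeps instead of making a real request, so the timings only show the general effect and say nothing about the real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page, delay=0.05):
    """Hypothetical stand-in for requests.get: sleeps to simulate network latency."""
    time.sleep(delay)
    return page * 2


def crawl_sequential(pages):
    # One page at a time: total time is roughly len(pages) * delay.
    return [fake_fetch(p) for p in pages]


def crawl_threaded(pages, workers=8):
    # Several pages wait on the "network" at once; map keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, pages))


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


pages = range(1, 17)
seq_result, seq_time = timed(crawl_sequential, pages)
thr_result, thr_time = timed(crawl_threaded, pages)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Both versions return the same results; only the waiting overlaps in the threaded one.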

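When I first started writing this crawler I ran into a lot of problems. First, although there is plenty of material online about multithreaded crawlers in Python, …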

In addition, there are very few examples of crawlers that use multiple threads to write to a database.

To solve these problems I read a lot of material, took many detours, and experimented for several days before arriving at the example below.

You can use the example below as a starting point for writing your own multithreaded crawler.

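Points to note:
In this example I build a thread pool with ThreadPoolExecutor (it is worth looking up some material on it). If you want to store data in a database from multiple threads, I recommend this approach; otherwise you will run into all sorts of errors when the code runs. Note that in the code each call to start opens and closes its own database connection instead of sharing one connection across threads.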
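The pattern described here, a thread pool in which every task uses its own database connection, can be sketched as follows. This is only an illustration under stated assumptions: the standard-library sqlite3 module is swapped in for pymysql so the sketch runs without a MySQL server, and `DB_PATH`, the `book` table, and `save_book` are hypothetical names.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical throwaway database file; the article itself uses MySQL via pymysql.
DB_PATH = os.path.join(tempfile.mkdtemp(), "demo.db")


def init_db():
    con = sqlite3.connect(DB_PATH)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, name TEXT)")
        con.commit()
    finally:
        con.close()


def save_book(page):
    """Each task opens, uses, and closes its own connection."""
    con = sqlite3.connect(DB_PATH, timeout=30)  # wait if another writer holds the lock
    try:
        con.execute("INSERT INTO book (id, name) VALUES (?, ?)", (page, f"book-{page}"))
        con.commit()
    finally:
        con.close()  # runs even if the insert fails
    return page


def count_books():
    con = sqlite3.connect(DB_PATH)
    try:
        return con.execute("SELECT COUNT(*) FROM book").fetchone()[0]
    finally:
        con.close()


init_db()
with ThreadPoolExecutor(max_workers=4) as pool:
    saved = list(pool.map(save_book, range(1, 21)))
print(count_books())
```

Keeping the connection local to the task means no connection object is ever used by two threads at once, which is a common source of the errors mentioned above.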

Code example

import requests
import pymysql
import os
from lxml import etree
from fake_useragent import UserAgent
from concurrent.futures import ThreadPoolExecutor

class tuishujunSpider(object):
    def __init__(self):
        # Create the output directory if it does not exist yet.
        os.makedirs('db/tuishujun', exist_ok=True)
        self.f = open('./db/tuishujun/tuishujun.txt', 'a', encoding='utf-8')
        self.con = pymysql.connect(host='localhost', user='root', password='123456789', database='novel',
                                   charset='utf8', port=3306)
        self.cursor = self.con.cursor()
        # Create the table on the first run; later runs leave it untouched.
        self.cursor.execute("""create table if not exists tuishujun
                            ( id BIGINT NOT NULL AUTO_INCREMENT,
                              cover VARCHAR(255),
                              name VARCHAR(255),
                              author VARCHAR(255),
                              source VARCHAR(255),
                              intro LONGTEXT,
                              PRIMARY KEY (id))
                           """)
        self.con.commit()
        self.cursor.close()
        self.con.close()

    def start(self, page):
        con = pymysql.connect(
            host='localhost', user='root', password='123456789', database='novel', charset='utf8', port=3306)
        cursor = con.cursor()
        headers = {
            'User-Agent': UserAgent().random
        }
        url = 'https://www.tuishujun.com/books/' + str(page)
        r = requests.get(url, headers=headers)
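        if r.status_code == 500:
            # Skip pages the server rejects with a 500, but close the
            # connection first so it is not leaked.
            cursor.close()
            con.close()
            return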
        else:
            html = etree.HTML(r.text)
            book = {}
            book['id'] = str(page)
            try:
                cover = html.xpath('//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]/div[1]/div[1]/img/@src')[0]
            except IndexError:
                cover = ''
            book['cover'] = cover
            name = \
                html.xpath(
                    '//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]/div[1]/div[2]/div/div[1]/h3/text()')[0]
            book['name'] = name
            author = \
                html.xpath(
                    '//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]/div[1]/div[2]/div/div[2]/a/text()')[
                    0].strip()
            author = author.replace("\n", "")
            book['author'] = author
            source = \
                html.xpath('//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]/div[1]/div[2]/div/div[5]/text()')[
                    0]
            book['source'] = source
            intro = html.xpath('//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]/div[2]/text()')[0]
            intro = intro.replace(" ", "")
            intro = intro.replace("\n", "")
            book['intro'] = intro
            self.f.write(str(book) + '\n')
            cursor.execute("insert into tuishujun(id,cover,name,author,source,intro) "
                           "values(%s,%s,%s,%s,%s,%s)",
                           (book['id'], book['cover'], book['name'], book['author'],
                            book['source'], book['intro']))
            con.commit()
            cursor.close()
            con.close()
            print(book)

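    def run(self):
        pages = range(1, 200000)
        # Leaving the with-block waits for every submitted page to finish.
        with ThreadPoolExecutor() as pool:
            pool.map(self.start, pages)
        # Flush and close the shared text file once all workers are done.
        self.f.close()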

if __name__ == '__main__':
    spider = tuishujunSpider()
    spider.run()
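One thing to be aware of in the code above: the results of pool.map are never read, and Executor.map only re-raises a worker's exception when its result is retrieved, so a page that fails (for example an IndexError from an XPath that matches nothing) is dropped silently. The sketch below shows one possible way to record failures with submit and as_completed. The `parse_page` worker and its failure rule are hypothetical stand-ins, not part of the original crawler.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_page(page):
    """Hypothetical worker: fails on every fifth page, like an XPath with no match."""
    if page % 5 == 0:
        raise IndexError(f"no title found on page {page}")
    return {"id": page}


def crawl(pages, workers=4):
    books, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Remember which page each future belongs to.
        futures = {pool.submit(parse_page, p): p for p in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                books.append(future.result())  # re-raises the worker's exception
            except Exception as exc:
                failed[page] = exc  # keep the error instead of losing it
    return sorted(books, key=lambda b: b["id"]), failed


books, failed = crawl(range(1, 11))
print(len(books), sorted(failed))
```

The `failed` dictionary can then be logged or retried after the run.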

Original: https://www.cnblogs.com/ouhouyi/p/16412282.html
Author: 蚂蚁追风筝
Title: A crawler example in 85 lines of code: multithreading, data file operations, and database storage

