Learning Python? How can you not learn Scrapy!

Abstract: this article explains how to write a Scrapy crawler.

Before formally writing any crawler examples, let's take a systematic look at scrapy.

Installing and running scrapy

Install it with the command pip install scrapy. After a successful installation, it is also worth collecting a few web addresses for later learning and reference:

  • scrapy official site: https://scrapy.org;
  • scrapy documentation: https://docs.scrapy.org;
  • scrapy release notes: https://docs.scrapy.org/en/latest/news.html.

After the installation is complete, type scrapy directly in the console; output like the following indicates that the installation succeeded.

The screenshot above lists scrapy's built-in commands and the standard command format.

scrapy provides two kinds of commands: global commands and project commands; the latter can only be run from inside a scrapy project directory.

You do not need to memorize all of these commands up front; you can look them up at any time. A few of them are used more often than the rest, for example:

scrapy startproject <project_name>

This command first creates a folder named after the project name, and then creates a scrapy project inside that folder; this is the starting point for all the code that follows.

Some comments have been added to the content above for comparison. By default, the generated files are placed in the directory Python is running from. If you want to choose the project directory yourself, use a command of the form scrapy startproject <project_name> <project_dir>.

The project structure created by the command from the template is shown below: the outer my_scrapy folder is the project directory, while the folder inside it is the actual scrapy project. If you want to run project commands, you must first enter the my_scrapy folder; only from the project directory can you control the project.

scrapy genspider [-t template] <name> <domain>

You can view all available templates with the command scrapy genspider -l; the default template is basic.

Create the first scrapy crawler file with a test command of the form scrapy genspider pm <target_domain>.

At this point a pm.py file appears in the spiders folder, and its contents are as follows:
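
The original screenshot of the file is not preserved here; a spider generated by the basic template normally looks like the sketch below, where example.com is only a placeholder for the domain actually passed to genspider.

```python
import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    # placeholder domain; substitute the domain that was passed to genspider
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
```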

scrapy crawl <spider_name>

Use this command to run the spider that was just generated, e.g. scrapy crawl pm.

Basic scrapy usage

The scrapy workflow is very simple: the spider issues requests for the addresses in start_urls, the downloader fetches the responses, the parse callback extracts data (and possibly new requests) from each response, and the extracted items are handed to the pipelines for processing.

Next, we will walk through a complete scrapy application as the first case in the scrapy part of the 120 Crawler Examples series.

The structure of the project is as follows:
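
The original screenshot is not preserved here; assuming the project module is named my_project and was created inside the my_scrapy directory (consistent with the MyProjectItem class name mentioned below), the layout is roughly:

```
my_scrapy/                  # project directory; run scrapy commands from here
    scrapy.cfg              # deployment configuration
    my_project/             # the scrapy project itself
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            pm.py           # the spider generated by genspider
```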

A brief description of some of these files:

  • scrapy.cfg: configuration file paths and deployment configuration;
  • items.py: the structure of the target data;
  • middlewares.py: middleware definitions;
  • pipelines.py: item pipelines;
  • settings.py: project settings.

After running the crawler with scrapy crawl pm, the full output and an explanation of it are as follows:

The run above issues 7 requests because www is not included by default in the pm.py file; if you add it, the number of requests becomes 4 (most likely because the non-www address gets redirected, and the redirect requests are counted as well).

The code of the pm.py file now looks like this:
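
A minimal sketch of pm.py at this stage, again using a placeholder domain (the www prefix relates to the request count discussed above):

```python
import scrapy


class PmSpider(scrapy.Spider):
    name = 'pm'
    # placeholder domain; the tutorial targets a specific site not reproduced here
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # print the raw page source of the response
        print(response.text)
```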

Here parse is the callback that receives the response after the addresses in start_urls have been requested; it simply prints the page source via the .text attribute of the response parameter.

Before storing anything, you need to define a data structure by hand. This is done in the items.py file by renaming the default MyProjectItem class to ArticleItem.
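
The exact fields are not shown in this copy of the article; a sketch of items.py after the rename, with title and url fields inferred from the selectors used later, would be:

```python
import scrapy


class ArticleItem(scrapy.Item):
    # fields inferred from the .title text and .a_block href selectors used later
    title = scrapy.Field()
    url = scrapy.Field()
```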

Next, modify the parse function in pm.py to add the page-parsing logic. It is similar to what you may already know from pyquery, and you can grasp it by reading the code directly (see the sketches below).

The response.css method returns a list of selectors; you can iterate over it and call the css method again on each object in it, as shown in the sketch after this list:

  • item.css('.title::text'): gets the text inside the tag;
  • item.css('.a_block::attr(href)'): gets the value of the tag's attribute;
  • extract_first(): extracts the first element of the list;
  • extract(): extracts the whole list.
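
A hedged sketch of the parsing logic inside parse: the .title and .a_block class names come from the article, while the outer .item selector is an assumption, since the original code is not reproduced here.

```python
def parse(self, response):
    # this method belongs to the PmSpider class shown earlier;
    # '.item' as the outer selector is an assumption
    for item in response.css('.item'):
        title = item.css('.title::text').extract_first()         # text inside the title tag
        url = item.css('.a_block::attr(href)').extract_first()   # value of the href attribute
        print(title, url)
```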

Import the ArticleItem class from items.py into pm.py, then modify the code as follows:
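
A sketch of the revised pm.py under the same assumptions (placeholder domain, assumed .item outer selector, and the project module assumed to be named my_project):

```python
import scrapy

from my_project.items import ArticleItem  # module name assumed from the MyProjectItem default


class PmSpider(scrapy.Spider):
    name = 'pm'
    allowed_domains = ['www.example.com']      # placeholder domain
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for item in response.css('.item'):     # '.item' outer selector is an assumption
            article = ArticleItem()
            article['title'] = item.css('.title::text').extract_first()
            article['url'] = item.css('.a_block::attr(href)').extract_first()
            yield article
```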

When you run the scrapy crawler again, the following prompt appears.

At this point, the single-page crawler is complete.

Next, modify the parse function again so that after parsing page 1 it goes on to parse the data on page 2.
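
The original code is not reproduced in this copy; a sketch of the parse method with the next-page request added would look like the following, where the .next selector is an assumption:

```python
# parse method of the PmSpider class shown above
def parse(self, response):
    for item in response.css('.item'):
        article = ArticleItem()
        article['title'] = item.css('.title::text').extract_first()
        article['url'] = item.css('.a_block::attr(href)').extract_first()
        yield article

    # address of the next page; the '.next' selector is an assumption
    next = response.css('.next::attr(href)').extract_first()
    if next is not None:
        # urljoin turns a relative link into an absolute one if needed
        next = response.urljoin(next)
        # issue another request whose callback is parse itself
        yield scrapy.Request(url=next, callback=self.parse)
```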

In the code above, the variable next holds the address of the next page, and the link is obtained through the response.css function; the css selector used here is worth studying carefully.

yield scrapy.Request(url=next, callback=self.parse) creates another request whose callback is parse itself; the effect of running the code is shown below.

If you want to save the results of the run, use a command such as scrapy crawl pm -o pm.json (the -o option writes the scraped items to a file).

If you want each item stored on its own line, use the command scrapy crawl pm -o pm.jl instead.

The exported files also support the csv, xml, marshal and pickle formats, which you can try on your own.

Open the pipelines.py file, change the class name from the default MyProjectPipeline to TitlePipeline, and write the following code:
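
The exact body is not preserved in this copy; a sketch of pipelines.py along those lines would be:

```python
class TitlePipeline:
    def process_item(self, item, spider):
        # remove the leading and trailing spaces from the title
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item
```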

This code removes the leading and trailing spaces from the title.

Once that is written, you need to enable the ITEM_PIPELINES setting in the settings.py file.
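
A sketch of that setting in settings.py, assuming the project module is named my_project:

```python
ITEM_PIPELINES = {
    'my_project.pipelines.TitlePipeline': 300,
}
```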

The 300 is the priority with which the pipeline runs; change it as needed. Run the crawler again and you will see that the leading and trailing spaces of the title have been removed.

At this point, a basic scrapy crawler has been written.

Original: https://www.cnblogs.com/huaweiyun/p/16550829.html
Author: 华为云开发者联盟
Title: 学python,怎么能不学习scrapy呢!
