Python app market crawler: [Python in Practice] A Scrapy crawler for the Wandoujia app market

October 4, 2023, 12:08 AM • Python • 47 reads

'#j-search-list>li::attr(data-pn)'

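Next, let's analyze the app detail page. The HTML element for the app name is shown in the figure:

[Figure: HTML element for the app name]

The element for the app category is shown in the figure:

The element for the app description is shown in the figure:

From these, the CSS selectors for the three elements are easy to derive: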

.app-name>span::text

.crumb>.second>a>span::text

.desc-info>.con::text

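Based on the analysis above, the crawling strategy is:

Read the app file line by line and build the search page URL for each app name;

Parse the search results page and follow the first result to its detail page;

Scrape the relevant fields from the detail page and write them to the output file.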

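Crawler implementation

With the page analysis done, we can start coding the crawler. Writing it in bare Python would mean handling download delays, requests, page parsing, and serialization of the results ourselves. Scrapy is a lightweight, fast web crawling framework that takes care of these concerns; its Chinese documentation covers it in some detail.

Data cleaning

Some names in the app file may be irregular and need cleaning: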

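# -*- coding: utf-8 -*-
import re


def clean_app_name(app_name):
    # remove non-breaking spaces
    space = u'\u00a0'
    app_name = app_name.replace(space, '')
    # strip bracketed annotations: (...), [...], 【...】, （...）
    brackets = r'\(.*?\)|\[.*?\]|【.*?】|（.*?）'
    return re.sub(brackets, '', app_name)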
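To see what the cleaning step does in isolation, here is a minimal Python 3 sketch. It assumes the bracket pattern is meant to strip parenthesized and bracketed annotations (both ASCII and full-width); the sample app names are made up for illustration.

```python
import re

# Assumed pattern: drop (...), [...], 【...】 and full-width （...） annotations.
BRACKETS = re.compile(r'\(.*?\)|\[.*?\]|【.*?】|（.*?）')


def clean_app_name(app_name):
    # Remove non-breaking spaces, then any bracketed annotation.
    app_name = app_name.replace('\u00a0', '')
    return BRACKETS.sub('', app_name)


# Hypothetical sample names, not taken from the original app file.
samples = ['微信(官方版)', '支付宝【新春版】', 'QQ\u00a0音乐[HD]']
cleaned = [clean_app_name(s) for s in samples]
print(cleaned)
```

The non-greedy `.*?` keeps a name with two separate annotations from losing the text between them.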

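URL handling

Using the cleaned app name, build the URL of the search results page. Because a URL cannot carry Chinese and other non-ASCII characters directly, the name must be URL-encoded with urllib.quote (Python 2; in Python 3 the function is urllib.parse.quote):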

# -*- coding: utf-8 -*-
from appMarket import clean
import urllib


def get_kw_url(kw):
    """concatenate the url for searching"""
    base_url = u"http://www.wandoujia.com/search?key=%s"
    kw = clean.clean_app_name(kw)
    return base_url % (urllib.quote(kw.encode("utf8")))


def get_pkg_url(pkg):
    """get the detail url according to pkg"""
    return 'http://www.wandoujia.com/apps/%s' % pkg
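For readers on Python 3, the same URL building might look like the sketch below. It uses urllib.parse.quote instead of urllib.quote and leaves out the cleaning step; the strip() call is an added assumption, there to drop the trailing newline of a line read from a file.

```python
from urllib.parse import quote

BASE_URL = "http://www.wandoujia.com/search?key=%s"


def get_kw_url(kw):
    """Build the search URL for an already cleaned keyword."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
    return BASE_URL % quote(kw.strip().encode("utf8"))


url = get_kw_url("微信\n")
print(url)
```

Each Chinese character becomes three percent-encoded bytes, since it takes three bytes in UTF-8.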

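Crawling

All Scrapy spiders inherit from the scrapy.Spider class. Its main attributes and methods are:

name: the spider's name. Passing it to the scrapy crawl command starts the spider.

allowed_domains: the list of domains the spider is allowed to crawl.

start_requests(): the method that starts the crawl. It returns an iterable, usually of scrapy.Request objects.

parse(response): processes the response and returns the extracted data; it can also return URLs to follow for further processing.

Items are containers for the scraped data, similar to a Python dict: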

import scrapy


class AppMarketItem(scrapy.Item):
    # define the fields for your item here like:
    kw = scrapy.Field()  # key word
    name = scrapy.Field()  # app name
    tag = scrapy.Field()  # app tag
    desc = scrapy.Field()  # app description

The Wandoujia spider code:

# -*- coding: utf-8 -*-
# @Time : 2016/6/23
# @Author : rain
import scrapy
import codecs
from appMarket import util
from appMarket.util import wandoujia
from appMarket.items import AppMarketItem


class WandoujiaSpider(scrapy.Spider):
    name = "WandoujiaSpider"
    allowed_domains = ["www.wandoujia.com"]

    def __init__(self, *args, **kwargs):
        super(WandoujiaSpider, self).__init__(*args, **kwargs)
        self.apps_path = './input/apps.txt'

    def start_requests(self):
        with codecs.open(self.apps_path, 'r', 'utf-8') as f:
            for line in f:
                # drop the trailing newline before building the URL
                app_name = line.rstrip()
                yield scrapy.Request(url=wandoujia.get_kw_url(app_name),
                                     callback=self.parse_search_result,
                                     meta={'kw': app_name})

    def parse(self, response):
        # default callback: parses the app detail page
        item = AppMarketItem()
        item['kw'] = response.meta['kw']
        item['name'] = response.css('.app-name>span::text').extract_first()
        item['tag'] = response.css('.crumb>.second>a>span::text').extract_first()
        desc = response.css('.desc-info>.con::text').extract()
        item['desc'] = util.parse_desc(desc)
        item['desc'] = u"" if not item["desc"] else item["desc"].strip()
        self.log(u'crawling the app %s' % item["name"])
        yield item

    def parse_search_result(self, response):
        # package name of the first search result
        pkg = response.css("#j-search-list>li::attr(data-pn)").extract_first()
        # no callback given, so the detail page is handled by parse()
        yield scrapy.Request(url=wandoujia.get_pkg_url(pkg), meta=response.meta)

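The app name from the app file serves as the search keyword and should also be written to the output file. But the crawl jumps from the search page to the detail page, so how can a variable be passed between Requests at different levels? The meta (dict) parameter of Request provides this.

For the app description selector .desc-info>.con::text, extract() returns a list, which is concatenated into a single string as follows: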

def parse_desc(desc):
    return reduce(lambda a, b: a.strip()+b.strip(), desc, '')
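In Python 3, reduce is no longer a builtin and has to be imported from functools. The sketch below shows that variant next to a str.join version that gives the same result for a list of text fragments; the fragments are made-up sample data.

```python
from functools import reduce


def parse_desc(desc):
    # Same fold as above: strip each fragment and concatenate.
    return reduce(lambda a, b: a.strip() + b.strip(), desc, '')


def parse_desc_join(desc):
    # Alternative with str.join; strips each fragment once.
    return ''.join(part.strip() for part in desc)


# Hypothetical fragments, as extract() might return them.
fragments = ['  First line. ', '\nSecond line.\n', '']
joined = parse_desc(fragments)
print(joined)
```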

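Result processing

Scrapy's recommended serialization format is JSON. The benefits of JSON are obvious:

It is cross-language;

It has a clear schema, so it is less error-prone to read than plain text separated by '\t'.

The crawl results may contain duplicates and empty items (keywords with no search result). In addition, when Python 2 serializes JSON, Chinese characters are written as \uXXXX Unicode escapes by default. A custom Pipeline can handle these issues: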

import codecs
import json

from scrapy.exceptions import DropItem


class CheckPipeline(object):
    """check item, and drop the duplicate one"""

    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        if item['name']:
            if item['name'] in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.names_seen.add(item['name'])
                return item
        else:
            # empty name: the keyword had no usable search result
            raise DropItem("Missing name in %s" % item)


class JsonWriterPipeline(object):

    def __init__(self):
        self.file = codecs.open('./output/output.json', 'wb', 'utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable in the file
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close the output file when the spider finishes
        self.file.close()
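The effect of ensure_ascii=False can be checked on its own. A small sketch with a made-up item follows; the default escaping also applies in Python 3.

```python
import json

# Hypothetical item with a Chinese app name.
item = {"name": "微信"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```

Both strings decode back to the same dict; only the on-disk representation differs.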

The pipelines also need to be enabled in settings.py:

ITEM_PIPELINES = {
    'appMarket.pipelines.CheckPipeline': 300,
    'appMarket.pipelines.JsonWriterPipeline': 800,
}

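The integer assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.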
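The ordering rule can be illustrated by sorting the setting by its values. This is only a sketch of the rule, not of how Scrapy implements it internally; the dict is deliberately written out of order.

```python
ITEM_PIPELINES = {
    'appMarket.pipelines.JsonWriterPipeline': 800,
    'appMarket.pipelines.CheckPipeline': 300,
}

# Sort by the integer value to get the order in which items pass through.
run_order = [path for path, _ in
             sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
```

With these values, CheckPipeline sees each item first, so duplicates and empty items are dropped before anything is written to the JSON file.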

Original: https://blog.csdn.net/weixin_31300407/article/details/113980779
Author: 王亚昌
Title: Python app market crawler: [Python in Practice] A Scrapy crawler for the Wandoujia app market
