[Learn Web Scraping from Scratch] Collecting Google Search Result List Data

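Target website

[Scenario] Collect the list of web results that Google returns for a keyword search.

[Tool] The ForeSpider data collection system from 前嗅 (Forenose), available as a free download:

http://www.forenose.com/view/forespider/view/download.html

[Entry URL]

[Content to collect] All list data returned by Google for the keyword "apple", including each result's source, title, and snippet.

[Content to be collected]

Approach analysis

Configuration overview:

Configuration steps

1. Create a new collection task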

[Creating a new collection task]

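2. Template configuration

① Find the pagination link and its pattern

On the entry page, press F12 to open the browser developer tools, follow the steps below to locate the pagination request, and copy the pagination link address shown after refreshing.

[Location of the pagination link]

Compare the pagination links to find the pattern.

[Pagination links]

Observation: as the page changes, the number after "start=" in the Request URL tracks the page number. The pattern is therefore:

"https://www.google.com/search?q=apple&ei=K_UaY4yIFNqBxc8PuNiymAk&start=" + (page number - 1) × 10 + "&sa=N&ved=2ahUKEwjMyfG7oYf6AhXaQPEDHTisDJM4KBDy0wN6BAgBEEE&biw=553&bih=755&dpr=1"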
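The rule can be checked by reading the page number back out of a copied link. The following is a minimal Python sketch, not part of the ForeSpider workflow; the helper name `page_number_from_url` and the shortened sample URLs (without the session-specific `ei` and `ved` values) are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

RESULTS_PER_PAGE = 10  # each page advances "start" by 10, per the observed pattern


def page_number_from_url(url: str) -> int:
    """Recover the page number from the "start" query parameter."""
    query = parse_qs(urlsplit(url).query)
    # A link without "start" is treated as the first page (assumption).
    start = int(query.get("start", ["0"])[0])
    return start // RESULTS_PER_PAGE + 1


# Simplified, hypothetical pagination links following the pattern above.
SAMPLE_URLS = [
    "https://www.google.com/search?q=apple&start=0&sa=N",
    "https://www.google.com/search?q=apple&start=10&sa=N",
    "https://www.google.com/search?q=apple&start=20&sa=N",
]

for url in SAMPLE_URLS:
    print(page_number_from_url(url), url)
```

If the numbers printed for links copied from the developer tools match the pages they were copied from, the "(page number - 1) × 10" rule holds for that search.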

② Create and write the script

[Creating and writing the script]

Script text:

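url u; // declare u as a url object
var url_beg=DOM.FindClass("AaVjTc","table"); // url_beg: the node of <table class="AaVjTc">
var ur=url_beg.child.child.next.next.next; // ur: the second-level child of url_beg, moved three siblings along
for(int i=0;i<…;i++){ // the loop's upper bound (the number of pages) is truncated in the source
    u.title="Google page "+(i+1); // set the title to "Google page N"
    var ur="https://www.google.com/search?q=apple&ei=K_UaY4yIFNqBxc8PuNiymAk&start="+i*10+"&sa=N&ved=2ahUKEwjMyfG7oYf6AhXaQPEDHTisDJM4KBDy0wN6BAgBEEE&biw=553&bih=755&dpr=1"; // assemble the full link from the pagination pattern
    u.urlname=ur; // use the assembled link as the URL
    u.entryid=CHANN.id;
    u.tmplid=2; // associate the link with template 02
    RESULT.AddLink(u); // output the link
    ur=ur.next; // move on to the next pagination link
}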
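For readers who want to see the link generation outside ForeSpider, the loop above can be sketched in Python. This is an illustration under stated assumptions, not the tool's API: `build_page_links` and the `num_pages` argument are hypothetical (the original loop bound is not recoverable), and the URL keeps only the `q` and `start` parameters on the assumption that the session-specific values in the copied link are optional.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"
RESULTS_PER_PAGE = 10  # "start" grows by 10 per page


def build_page_links(keyword: str, num_pages: int, template_id: int = 2) -> list[dict]:
    """Build one link record per result page, mirroring the fields set in the script."""
    links = []
    for i in range(num_pages):
        # Page i+1 starts at offset i * 10, i.e. (page number - 1) * 10.
        query = urlencode({"q": keyword, "start": i * RESULTS_PER_PAGE})
        links.append(
            {
                "title": f"Google page {i + 1}",
                "urlname": f"{BASE_URL}?{query}",
                "tmplid": template_id,  # the template that will parse this page
            }
        )
    return links


for link in build_page_links("apple", 3):
    print(link["title"], link["urlname"])
```

Generating the links from a counter, rather than walking the pagination table in the page, means the crawl does not depend on how many page numbers Google happens to render.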

③ View the collection preview

[Collection preview]

② Create the data table structure

[Creating the table structure]

③ Associate the form

[Associating the form]

④ Create and write the script

[Creating and writing the script]

Script text:

record re; // declare re as a record object
var ret=DOM.FindClass("hlcw0c","div"); // ret: the node of <div class="hlcw0c">
while (ret){ // walk ret and each of its following siblings
    var beg=DOM.FindClass("MjjYud","div",ret); // beg: the <div class="MjjYud"> node under ret
    var pu=beg.child.child.child; // pu: the <div class="Z26q7c UK95Uc jGGQ5e VGXe8"> node
    var tit=pu.child.child.child.next; // tit: the <h3 class="LC20lb MBeuO DKV0Md"> node
    var pag=tit.next; // pag: the next sibling of tit
    var con=pu.next; // con: the next sibling of pu
    re.page=DOM.GetTextAll(pag); // source of the result
    re.title=DOM.GetTextAll(tit); // title of the result
    re.content=DOM.GetTextAll(con); // snippet of the result
    RESULT.AddRec(re,this.schemaid); // output the record
    ret=ret.next; // move on to the next result block
}
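The child and sibling navigation in this script can be tried out in Python on a small sample. The sketch below is an illustration only: the markup in `SAMPLE_HTML` is a hypothetical, well-formed stand-in shaped to match the node path in the script (real Google pages are not valid XML, and their class names change over time), and the helpers `child`, `make_next_sibling`, `text_all`, and `extract_records` are names invented here, not ForeSpider functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that follows the node path used in the script.
SAMPLE_HTML = """
<body>
  <div class="hlcw0c">
    <div class="MjjYud"><div><div>
      <div class="Z26q7c"><div><a href="https://example.com/a">
        <br/><h3>First title</h3><div><cite>example.com</cite></div>
      </a></div></div>
      <div><span>First snippet.</span></div>
    </div></div></div>
  </div>
  <div class="hlcw0c">
    <div class="MjjYud"><div><div>
      <div class="Z26q7c"><div><a href="https://example.org/b">
        <br/><h3>Second title</h3><div><cite>example.org</cite></div>
      </a></div></div>
      <div><span>Second snippet.</span></div>
    </div></div></div>
  </div>
</body>
"""


def child(node):
    """First child element, like `.child` in the script (None if absent)."""
    return node[0] if node is not None and len(node) else None


def make_next_sibling(root):
    """Return a function that behaves like `.next` (the following sibling element)."""
    parent = {c: p for p in root.iter() for c in p}

    def next_sibling(node):
        if node is None or node not in parent:
            return None
        siblings = list(parent[node])
        i = siblings.index(node)
        return siblings[i + 1] if i + 1 < len(siblings) else None

    return next_sibling


def text_all(node):
    """All text under a node, joined and stripped (a rough analogue of GetTextAll)."""
    return "".join(node.itertext()).strip() if node is not None else ""


def extract_records(html: str) -> list[dict]:
    root = ET.fromstring(html.strip())
    nxt = make_next_sibling(root)
    records = []
    for block in root.iter("div"):
        if block.get("class") != "hlcw0c":
            continue
        beg = next((d for d in block.iter("div") if d.get("class") == "MjjYud"), None)
        pu = child(child(child(beg)))          # beg.child.child.child
        tit = nxt(child(child(child(pu))))     # pu.child.child.child.next
        pag = nxt(tit)                         # tit.next
        con = nxt(pu)                          # pu.next
        records.append(
            {"page": text_all(pag), "title": text_all(tit), "content": text_all(con)}
        )
    return records


for rec in extract_records(SAMPLE_HTML):
    print(rec)
```

Positional paths like these are brittle: if a wrapper element is added or removed in the page, every `child` and `next` step after it shifts, so the path should be rechecked against the live page before each collection run.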

⑤ View the collection preview

[Collection preview]

Original: https://www.cnblogs.com/forenose/p/16690376.html
Author: 前嗅
Title: [Learn Web Scraping from Scratch] Collecting Google Search Result List Data
