Data Collection in Practice (5): Dangdang Children's Book Rankings

May 26, 2023, 11:14 PM • Big Data • 76 reads

Overview

Schools now place more and more emphasis on what children learn outside the classroom, and the extracurricular books children read are usually recommended by the school or by parents.

Sometimes I am also curious about which books are popular with children these days.

So I wrote this small crawler to collect the top 20 best-selling children's books.

To collect more best-selling children's books, or best sellers in other categories, just adjust the corresponding parameters and URL.

Collection process

Dangdang's book rankings can be viewed without logging in, and the top 20 fit on a single page, so no pagination is needed. The process is therefore very simple: open the page, parse it, and save the result.

The core code is as follows:

import { saveContent } from "../utils.js";

const http_prefix = "http://bang.dangdang.com/books/childrensbooks";

const age_start = 1;
const age_end = 4;
const month_start = 1;
const month_end = 11; // data is currently available only through November

const age_map = { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" };

const childrensbooks = async (page) => {
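    // Age groups: 0~2, 3~6, 7~10, 11~14
    for (let i = age_start; i <= age_end; i++) {
        for (let j = month_start; j <= month_end; j++) {
            // NOTE: the loop body was lost when this page was extracted.
            // The URL path and the saveContent arguments below are
            // placeholders (assumptions), not the author's original code.
            const url = `${http_prefix}/${i}/${j}`; // placeholder path format
            const lines = await parseData(page, url);
            await saveContent(`${age_map[i]}-${j}.csv`, lines.join("\n")); // assumed signature
        }
    }
};

// Open one ranking page and return its rows as CSV lines.
// The name parseData is an assumption; the original name was lost in extraction.
const parseData = async (page, url) => {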
    await page.goto(url);

    const listContent = await page.$$("ul.bang_list>li");
    let lines = [
        "排名order,书名name,评论数comment,推荐率recommend_pct,作者author,出版日期publish_date,出版社publisher",
    ];
    for (let i = 0; i < listContent.length; i++) {
        const order = await listContent[i].$eval(
            "div.list_num",
            (node) => node.innerText
        );

        const name = await listContent[i].$eval(
            "div.name>a",
            (node) => node.innerText
        );
        const comment = await listContent[i].$eval(
            "div.star>a",
            (node) => node.innerText
        );
        const recommend_pct = await listContent[i].$eval(
            "div.star>span.tuijian",
            (node) => node.innerText
        );
        const publisher_info = await listContent[i].$$("div.publisher_info");
        const authors = await publisher_info[0].$$eval("a", (nodes) =>
            nodes.map((node) => node.innerText)
        );

        const author = authors.join("&");
        const publish_date = await publisher_info[1].$eval(
            "span",
            (node) => node.innerText
        );
        const publisher = await publisher_info[1].$eval(
            "a",
            (node) => node.innerText
        );

        const line = `${order},${name},${comment},${recommend_pct},${author},${publish_date},${publisher}`;
        lines.push(line);
        console.log(line);
    }

    return lines;
};

export default childrensbooks;
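
One thing to watch in the code above: the fields are joined with bare commas, so a book title that itself contains a comma would shift the columns of that row. The following is a minimal sketch of one way to guard against that, assuming the usual CSV quoting convention (wrap the field in double quotes and double any embedded quotes). The helper names csvField and csvLine are hypothetical and not part of the original code, and the sample title is made up.

```javascript
// Quote a single CSV field when it contains a comma, a double quote,
// or a line break. Embedded double quotes are doubled.
const csvField = (value) => {
    const text = String(value).trim();
    if (/[",\r\n]/.test(text)) {
        return `"${text.replace(/"/g, '""')}"`;
    }
    return text;
};

// Join already-extracted values into one CSV line.
const csvLine = (fields) => fields.map(csvField).join(",");

// Made-up sample row: the title contains a comma, so it gets quoted.
console.log(csvLine([1, "Book A, Part 1", "99%"]));
```

With helpers like these, the template string that builds each line could be replaced by a call such as csvLine([order, name, comment, recommend_pct, author, publish_date, publisher]).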

The collected content is grouped and saved by month and age group.

The files are in CSV format (the figure in the original post shows some of the fields).
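
The article does not show how the output files are named, so the following is only a sketch of one possible layout: one folder per age group and one file per month. The function name csvPathFor, the folder layout, and the "month-" prefix are all assumptions made for illustration. Zero-padding the month keeps the files in calendar order when sorted by name.

```javascript
// Build a relative path such as "0~2岁/month-03.csv".
// The layout and the "month-" prefix are assumptions, not the author's scheme.
const csvPathFor = (ageLabel, month) => {
    const mm = String(month).padStart(2, "0");
    return `${ageLabel}/month-${mm}.csv`;
};

// Same age labels as in the article's age_map.
const age_map = { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" };

console.log(csvPathFor(age_map[1], 3)); // prints "0~2岁/month-03.csv"
```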

Summary

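The data above was collected with puppeteer. Besides the children's book ranking, Dangdang also has a best-seller list, a hot new releases list, a fastest-rising list, a special-offer list, a five-star list, and more.
These lists share a similar page structure, so collecting their data only requires changing http_prefix in the code above and adjusting the loop over children's age groups.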

Notes

The data is crawled for research and learning purposes only. The code in this article follows these rules:

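If the site has a robots.txt, follow its rules.
Crawl at a rate that mimics normal browsing, so as not to add load to the server.
Collect only fully public data, and never touch data that might involve privacy.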
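
As an illustration of the second rule, here is a minimal sketch of pausing for a random interval between page visits. The function names, the default 1 to 3 second range, and the visit callback are assumptions for illustration. The article does not say how its own crawler is throttled.

```javascript
// Pick a wait time in the range [minMs, maxMs).
// `rand` is injectable so the function can be checked deterministically.
const politeDelayMs = (minMs, maxMs, rand = Math.random) =>
    Math.floor(minMs + rand() * (maxMs - minMs));

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit each URL in turn, pausing before every request except the first.
// `visit` stands in for whatever does the real work on one page.
const crawlPolitely = async (urls, visit, minMs = 1000, maxMs = 3000) => {
    const results = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) {
            await sleep(politeDelayMs(minMs, maxMs));
        }
        results.push(await visit(urls[i]));
    }
    return results;
};
```

In the crawler above, visit could be a small wrapper that opens one ranking page and returns its parsed rows, so that consecutive requests are spaced out instead of being fired back to back.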

Original: https://www.cnblogs.com/wang_yb/p/15650185.html
Author: wang_yb
Title: Data Collection in Practice (5): Dangdang Children's Book Rankings
