Data Collection in Practice (5): Dangdang Children's Book Rankings

  1. Overview

Schools now place growing emphasis on what children learn outside the classroom, and the extracurricular books children read are usually recommended by the school or by parents.

Sometimes I am also curious about which books are popular with children these days.

So I wrote this small crawler and collected the top 20 best-selling children's books.

To collect more best-selling children's books, or best sellers in other categories, adjust the corresponding parameters and URL.

  2. Collection Process

Dangdang's book rankings can be viewed without logging in, and the top 20 fit on a single page with no pagination, so the process is simple: open the page, parse it, and save the result.
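The core code receives a ready-made `page` object, and the post does not show how that object is created. Here is a minimal sketch of one way to provide it, assuming Puppeteer's standard `launch`, `newPage`, and `close` calls. The helper name `withPage` is hypothetical, and the Puppeteer module is passed in as an argument so the sketch does not depend on how it is imported.

```javascript
// Hypothetical helper: opens a browser, hands a fresh page to `task`,
// and always closes the browser afterwards, even if `task` throws.
async function withPage(puppeteer, task) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        return await task(page);
    } finally {
        await browser.close();
    }
}
```

With the `childrensbooks` function from the core code, a call might look like `await withPage(puppeteer, childrensbooks)`.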

The core code is as follows:

import { saveContent } from "../utils.js";

const http_prefix = "http://bang.dangdang.com/books/childrensbooks";

const age_start = 1;
const age_end = 4;
const month_start = 1;
const month_end = 11; // data is available only through November so far

const age_map = { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" };

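const childrensbooks = async (page) => {
    // Age groups: 0~2, 3~6, 7~10, 11~14
    for (let i = age_start; i <= age_end; i++) {
        for (let j = month_start; j <= month_end; j++) {
            // The loop body is missing from this copy of the listing. Judging by the
            // constants and imports above, it built the ranking URL for age group i
            // and month j from http_prefix, parsed that page with the function below,
            // and passed the resulting lines to saveContent under a name based on
            // age_map[i] and the month.
        }
    }
};

// Opens one ranking page and returns its rows as CSV lines.
// The name parseData is an assumption; the original declaration is also missing.
const parseData = async (page, url) => {
    await page.goto(url);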

    const listContent = await page.$$("ul.bang_list>li");
    let lines = [
        "排名order,书名name,评论数comment,推荐率recommend_pct,作者author,出版日期publish_date,出版社publisher",
    ];
    for (let i = 0; i < listContent.length; i++) {
        const order = await listContent[i].$eval(
            "div.list_num",
            (node) => node.innerText
        );

        const name = await listContent[i].$eval(
            "div.name>a",
            (node) => node.innerText
        );
        const comment = await listContent[i].$eval(
            "div.star>a",
            (node) => node.innerText
        );
        const recommend_pct = await listContent[i].$eval(
            "div.star>span.tuijian",
            (node) => node.innerText
        );
        const publisher_info = await listContent[i].$$("div.publisher_info");
        const authors = await publisher_info[0].$$eval("a", (nodes) =>
            nodes.map((node) => node.innerText)
        );

        const author = authors.join("&");
        const publish_date = await publisher_info[1].$eval(
            "span",
            (node) => node.innerText
        );
        const publisher = await publisher_info[1].$eval(
            "a",
            (node) => node.innerText
        );

        const line = `${order},${name},${comment},${recommend_pct},${author},${publish_date},${publisher}`;
        lines.push(line);
        console.log(line);
    }

    return lines;
};

export default childrensbooks;

The collected data is saved in separate files, organized by month and age group.

The files are in CSV format (the screenshot in the original post shows some of the fields).
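Because each row is built by joining fields with commas, a title or author name that itself contains a comma would shift the columns. Here is a minimal sketch of RFC 4180 style quoting that could be applied before joining. `csvField` and `csvLine` are hypothetical names and are not part of the original code.

```javascript
// Quote a field only when it contains a comma, a double quote, or a line break.
// Embedded double quotes are doubled, as RFC 4180 describes.
function csvField(value) {
    const text = String(value).trim();
    if (/[",\r\n]/.test(text)) {
        return `"${text.replace(/"/g, '""')}"`;
    }
    return text;
}

// Build one CSV row from an array of raw field values.
function csvLine(fields) {
    return fields.map(csvField).join(",");
}
```

In the parsing loop, this would mean building the row with something like `csvLine([order, name, comment, recommend_pct, author, publish_date, publisher])` instead of a plain template string.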

  3. Summary

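The data above was collected with puppeteer. Besides the children's book ranking, Dangdang also has a best-seller list, a hot new releases list, a fastest-rising list, a special-offer list, a five-star list, and more.
The lists share a similar structure, so collecting data from another one only requires changing http_prefix in the code above, along with the loop controls for the children's age groups.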
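One way to make that kind of adjustment easier is to keep the loop controls in a single settings object and generate the (age group, month) pairs from it. This is a sketch under that assumption. `rankingTasks` and the field names are hypothetical, and building the actual URL for each task is left to the caller.

```javascript
// Hypothetical helper: yields one task per age group and month.
function* rankingTasks({ ageStart, ageEnd, monthStart, monthEnd, ageLabels }) {
    for (let age = ageStart; age <= ageEnd; age++) {
        for (let month = monthStart; month <= monthEnd; month++) {
            yield { age, month, label: ageLabels[age] };
        }
    }
}

// Settings mirroring the constants in the code above.
const childrensSettings = {
    ageStart: 1,
    ageEnd: 4,
    monthStart: 1,
    monthEnd: 11,
    ageLabels: { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" },
};

const tasks = [...rankingTasks(childrensSettings)];
console.log(tasks.length); // 4 age groups x 11 months
```

Targeting a different list would then mean supplying a different settings object rather than editing the loops.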

  4. Notes

The data is crawled for research and learning purposes only. The code in this article follows these rules:

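  1. If the site has a robots.txt, follow its rules.
  2. Crawl at a rate close to normal browsing so as not to add load to the server.
  3. Collect only fully public data, and never touch data that might involve privacy.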
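The second rule can be approximated by pausing between page loads. Here is a minimal sketch, assuming a pause of a few seconds is acceptable. The names `sleep`, `pauseLength`, and `visitSlowly` and the 2 to 5 second defaults are illustrative assumptions, not values from the original code.

```javascript
// Resolve after roughly `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick a pause between minMs and maxMs; `random` is injectable for testing.
function pauseLength(minMs, maxMs, random = Math.random) {
    return Math.round(minMs + (maxMs - minMs) * random());
}

// Visit URLs one at a time, pausing before every visit except the first.
async function visitSlowly(urls, visit, { minMs = 2000, maxMs = 5000 } = {}) {
    const results = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) {
            await sleep(pauseLength(minMs, maxMs));
        }
        results.push(await visit(urls[i]));
    }
    return results;
}
```

Here `visit` would be whatever function opens and parses one ranking page. Running the visits sequentially, rather than in parallel, keeps the request rate closer to that of a person browsing.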

Original: https://www.cnblogs.com/wang_yb/p/15650185.html
Author: wang_yb
Title: Data Collection in Practice (5): Dangdang Children's Book Rankings
