Data Collection in Practice (5): Dangdang Children's Book Rankings

  1. Overview



Schools now pay more and more attention to children's extracurricular knowledge, and children's extracurricular books are usually chosen on the recommendation of the school or of parents.

Sometimes I also want to see which children's books are popular right now.

So I wrote this small crawler to collect the top 20 best-selling children's books.


  2. Collection process



Dangdang's book rankings can be viewed without logging in, and the top 20 fit on a single page with no pagination, so the process is simple: open the page, parse it, and save the result.


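import { saveContent } from "../utils.js";

// NOTE: parts of this listing were lost when the page was extracted.
// The URL suffix, the saveContent() arguments and the CSS selectors passed
// to $eval are marked below as placeholders or assumptions; check them
// against the live page before running.

const http_prefix = "http://bang.dangdang.com/books/childrensbooks";

const age_start = 1;
const age_end = 4;
const month_start = 1;
const month_end = 11; // data only runs through November so far

// labels for the four age groups: 0~2, 3~6, 7~10 and 11~14 years
const age_map = { 1: "0~2岁", 2: "3~6岁", 3: "7~10岁", 4: "11~14岁" };

const childrensbooks = async (page) => {
    // one pass per age group, then one per month
    for (let i = age_start; i <= age_end; i++) {
        for (let j = month_start; j <= month_end; j++) {
            // Placeholder: the original suffix for age group i and month j was lost.
            const url = `${http_prefix}/...`;
            const lines = await parseData(page, url);
            // Assumed arguments: (directory, file name, content).
            await saveContent(age_map[i], `${j}.csv`, lines.join("\n"));
        }
    }
};

// Open one ranking page and return one comma-separated line per book.
const parseData = async (page, url) => {
    await page.goto(url);

    const listContent = await page.$$("ul.bang_list>li");
    let lines = [];
    for (let i = 0; i < listContent.length; i++) {
        const order = await listContent[i].$eval(
            "div.list_num", // assumed selector
            (node) => node.innerText
        );
        const name = await listContent[i].$eval(
            "div.name > a", // assumed selector
            (node) => node.innerText
        );
        const comment = await listContent[i].$eval(
            "div.star > a", // assumed selector
            (node) => node.innerText
        );
        const recommend_pct = await listContent[i].$eval(
            "span.tuijian", // assumed selector
            (node) => node.innerText
        );
        const publisher_info = await listContent[i].$$("div.publisher_info");
        const authors = await publisher_info[0].$$eval("a", (nodes) =>
            nodes.map((node) => node.innerText)
        );
        const author = authors.join("&");
        const publish_date = await publisher_info[1].$eval(
            "span", // assumed selector
            (node) => node.innerText
        );
        const publisher = await publisher_info[1].$eval(
            "a", // assumed selector
            (node) => node.innerText
        );

        const line = `${order},${name},${comment},${recommend_pct},${author},${publish_date},${publisher}`;
        lines.push(line);
    }

    return lines;
};

export default childrensbooks;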



The collected data is saved in separate files by month and age group.
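Each row in the listing is built by joining the fields with plain commas, so a title or author name that itself contains a comma would shift the columns in the saved file. Below is a minimal sketch of a quoting helper, assuming the output is meant to be read as CSV. The names `csvField` and `toCsvLine` are hypothetical and not part of the original code.

```javascript
// Hypothetical helper: quote a field only when it contains a comma,
// a double quote, or a line break (the usual CSV convention).
const csvField = (value) => {
    const text = String(value).trim();
    if (/[",\r\n]/.test(text)) {
        // double any embedded quotes, then wrap the whole field in quotes
        return `"${text.replace(/"/g, '""')}"`;
    }
    return text;
};

// Join already-scraped fields into one CSV row.
const toCsvLine = (fields) => fields.map(csvField).join(",");

console.log(toCsvLine(["1", "Title, with comma", "98%"]));
// prints: 1,"Title, with comma",98%
```

With a helper like this, the statement that builds `line` could pass the seven scraped values as an array to `toCsvLine` instead of using a template string.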

  3. Summary

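The data above was collected with puppeteer. Besides the children's book ranking, Dangdang also has a bestseller ranking, a new-release hot sellers ranking, a fastest-rising ranking, a special-offer ranking, a five-star ranking, and others.
The rankings all share a similar structure, so collecting data from another one only requires changing http_prefix in the code above and adjusting the loop over the children's age groups.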
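Since the rankings differ mainly in their URL prefix and in how the list is split up, one option is to keep those values in a small config object and generate the list of pages from it. This is only a sketch: `planPages`, the config field names and the `suffixFor` callback are hypothetical, and the suffix used in the example is a stand-in, not Dangdang's real URL scheme.

```javascript
// Hypothetical helper: expand a ranking config into the list of pages to visit.
// `suffixFor(group, month)` must return the real URL suffix for that ranking.
const planPages = ({ prefix, groups, months, suffixFor }) => {
    const pages = [];
    for (const group of groups) {
        for (const month of months) {
            pages.push({ group, month, url: `${prefix}/${suffixFor(group, month)}` });
        }
    }
    return pages;
};

// Example config for the children's ranking; the suffix is a placeholder.
const pages = planPages({
    prefix: "http://bang.dangdang.com/books/childrensbooks",
    groups: [1, 2, 3, 4], // the four age groups
    months: Array.from({ length: 11 }, (_, k) => k + 1), // months 1 to 11
    suffixFor: (group, month) => `PLACEHOLDER-${group}-${month}`,
});

console.log(pages.length); // 44 pages: 4 age groups x 11 months
```

Switching to another ranking would then mean passing a different `prefix`, and a single-element `groups` array for rankings that have no age split.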

  4. Notes



The data is crawled for research and learning only, and the code in this article follows these rules:

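  1. If the site has a robots.txt, follow its rules
  2. Crawl at a rate that mimics normal browsing, so as not to add load to the server
  3. Only collect fully public data, and never touch data that might involve privacy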
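
The second rule, keeping the request rate close to that of a normal visitor, can be sketched as a pause before each page load. The helper names (`sleep`, `randomDelay`, `politeGoto`) and the 2 to 5 second range are assumptions chosen for illustration, not values from the original code; `page` is expected to be a puppeteer `Page`.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick a random whole-number delay in [minMs, maxMs] so that requests
// are not evenly spaced.
const randomDelay = (minMs, maxMs) =>
    minMs + Math.floor(Math.random() * (maxMs - minMs + 1));

// Hypothetical wrapper: wait first, then navigate.
const politeGoto = async (page, url, minMs = 2000, maxMs = 5000) => {
    await sleep(randomDelay(minMs, maxMs));
    return page.goto(url);
};
```

Replacing `await page.goto(url)` with `await politeGoto(page, url)` in the parsing function would space out the page loads without changing anything else.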

Original: https://www.cnblogs.com/wang_yb/p/15650185.html
Author: wang_yb
Title: 数据采集实战(五)– 当当网童书排名




