Data Collection in Practice (4): Downloading Linear Algebra Exercise Solutions

May 26, 2023, 11:26 PM • Big Data • 79 views

Overview

A while ago I was reading the third edition of Linear Algebra Done Right (Chinese title: 《线性代数应该这样学》), a linear algebra textbook that many people recommend. Every chapter comes with a large number of exercises.

The official site provides exercise solutions chapter by chapter, but it is hosted abroad, so access is slow and unreliable, and the solutions are interspersed with ads that get in the way of reading.

So I decided to crawl the solutions and turn them into a PDF, which is easier to browse and does not depend on the network.

Collection process

Simply fetching a web page is easy and needs no discussion. This article differs from the earlier data collection articles in two ways:

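The pages contain math formulas that only display correctly after front-end JavaScript has transformed them, so reading the content straight from the raw HTML is no use; the complete rendered HTML has to be captured.
After a page is fetched, unnecessary elements (header, footer, menu, ads, and so on) have to be removed before it is saved. In other words, only part of the page is collected.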

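The steps on a green background in the flow diagram are done with puppeteer.
The steps on a blue background are done after collection, with PDF-related command-line tools.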
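
To illustrate the first difference above, here is a minimal sketch of fetching a page only after its scripts have run. The function name `fetchRenderedHtml` and the choice of `waitUntil: "networkidle0"` are assumptions, not taken from the original code; depending on how the site renders its formulas, a longer or more specific wait may be needed.

```javascript
// Sketch: return the HTML of a page after its front-end scripts have run.
// `browser` is expected to be a puppeteer Browser, e.g. from puppeteer.launch().
// (fetchRenderedHtml is a hypothetical helper, not part of the article's code.)
async function fetchRenderedHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // "networkidle0" treats navigation as finished once there have been
    // no network connections for at least 500 ms.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Serialize the DOM as it is now, including script-generated markup.
    return await page.content();
  } finally {
    // Always release the tab, even if navigation failed.
    await page.close();
  }
}
```

With puppeteer installed, the typical call would be `const browser = await puppeteer.launch();` followed by `await fetchRenderedHtml(browser, url)` for each chapter page.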

2.1 Removing elements from the page (green background)

    await page.evaluate(() => {
      // CSS selectors for everything that should not end up in the PDF
      const domToRemove = [
        "#top-bar-wrap",
        "#site-header",
        "#main > .page-header",
        "#content > article > ul",
        "#content > article > .entry-content > center",
        "#content > article > .entry-content > .google-auto-placed",
        "#content > article > .entry-content > #amzn_assoc_ad_div_adunit0_0",
        "#content > article > .entry-content > #related_posts",
        ".post-tags",
        "nav",
        "section",
        ".addthis-smartlayers",
        "#right-sidebar",
        "footer",
      ];
      for (const selector of domToRemove) {
        for (const dom of document.querySelectorAll(selector)) {
          // This is the key step: detach the element from the DOM tree
          dom.remove();
        }
      }
    });

    // Save the page as an HTML file so it can be converted to PDF later
    // (exercies and i come from the surrounding loop, which is not shown)
    await savePage(
      page,
      "./output/linearAlgebraExercises",
      exercies[i] + ".html"
    );
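
The `savePage` helper called above is not shown in the article. A minimal sketch, assuming it simply serializes the current DOM with puppeteer's `page.content()` and writes it to disk, might look like this. The signature is inferred from the call site; the body is a guess, not the author's implementation.

```javascript
const fs = require("fs/promises");
const path = require("path");

// Hypothetical reconstruction of savePage; the original helper is not shown.
// `page` only needs a content() method that resolves to an HTML string,
// which is what puppeteer's Page.content() provides.
async function savePage(page, dir, fileName) {
  // Serialize the DOM as it is now, after scripts ran and elements were removed.
  const html = await page.content();
  // Create the output directory if it does not exist yet.
  await fs.mkdir(dir, { recursive: true });
  const target = path.join(dir, fileName);
  await fs.writeFile(target, html, "utf8");
  return target;
}
```

Because the removal happens inside `page.evaluate` before this call, the saved file contains only the trimmed page.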

2.2 Generating the PDF (blue background)

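There are many tools for converting HTML files to PDF, including plenty of Python and Node.js libraries; pick whichever you know best.
I used pandoc, and the conversion quality is quite good: the math formulas all display correctly.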

Example command for converting an HTML file:

    pandoc input.html -t latex -o output.pdf
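
Since there is one HTML file per chapter, the command above has to be run many times. A minimal sketch of batching it from Node.js follows, assuming pandoc (and the LaTeX engine it needs to produce PDF) is installed. The helper names `pandocArgs` and `convertAll` are made up for this illustration.

```javascript
const { execFileSync } = require("child_process");
const fs = require("fs");
const path = require("path");

// Build the argument list for the pandoc command shown above.
function pandocArgs(htmlFile) {
  const pdfFile = htmlFile.replace(/\.html$/i, ".pdf");
  return [htmlFile, "-t", "latex", "-o", pdfFile];
}

// Convert every .html file in `dir`, in alphabetical order.
// `run` is injectable so the loop can be exercised without pandoc installed;
// by default it shells out to the real pandoc binary.
function convertAll(dir, run = execFileSync) {
  const htmlFiles = fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith(".html"))
    .sort();
  for (const name of htmlFiles) {
    run("pandoc", pandocArgs(path.join(dir, name)), { stdio: "inherit" });
  }
  return htmlFiles.length; // number of files converted
}
```

Calling `convertAll("./output/linearAlgebraExercises")` would then write one PDF next to each saved HTML file.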

The result looks like this: [screenshot of the converted PDF]

There are also many small tools for merging multiple PDFs; I used pdftk.

Example command for merging PDFs:

    pdftk input1.pdf input2.pdf input3.pdf cat output output.pdf
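
When the chapter files are numbered, plain alphabetical order would put chapter 10 before chapter 2. The sketch below builds the pdftk argument list with a numeric-aware sort; the file names and helper names are assumptions for illustration, and pdftk must be installed for the default runner to work.

```javascript
const { execFileSync } = require("child_process");

// Order file names so that "2.pdf" sorts before "10.pdf".
function naturalSort(names) {
  return [...names].sort((a, b) => a.localeCompare(b, "en", { numeric: true }));
}

// Build the pdftk argument list: sorted inputs, then `cat output <file>`.
function pdftkArgs(inputs, output) {
  return [...naturalSort(inputs), "cat", "output", output];
}

// Merge the PDFs. `run` is injectable so this can be tried without pdftk.
function mergePdfs(inputs, output, run = execFileSync) {
  run("pdftk", pdftkArgs(inputs, output), { stdio: "inherit" });
}
```

For example, `mergePdfs(["10.pdf", "2.pdf", "1.pdf"], "all.pdf")` would invoke pdftk with the inputs in the order 1, 2, 10.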

Summary

The whole process is simple. The only technical point really worth discussing is that unneeded parts of the page are removed on the fly while the page is being collected.

Simple as it is, the process is complete from end to end. With some refinement of the details, it amounts to an automated e-book production pipeline.

Notes

The data is crawled for research and study only. The code in this article follows these rules:

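If the site has a robots.txt, follow what it specifies.
Crawl at a rate that mimics normal browsing, so the server carries no extra load.
Collect only fully public data, and never touch data that might involve privacy.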
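
The second rule above can be sketched as a sequential loop with a randomized pause between page visits. The 2 to 5 second range and the names `pauseMs` and `crawlPolitely` are assumptions chosen for illustration, not values from the original code.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick a pause of at least minMs and less than maxMs (exactly minMs if they are equal).
// `random` is injectable so the result can be made deterministic.
function pauseMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs));
}

// Visit each URL in turn, never in parallel, pausing between visits.
// `visit` is whatever async function fetches and saves a single page.
async function crawlPolitely(urls, visit, { minMs = 2000, maxMs = 5000 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await visit(urls[i]));
    // No pause is needed after the last page.
    if (i < urls.length - 1) await sleep(pauseMs(minMs, maxMs));
  }
  return results;
}
```

Keeping the visits strictly sequential means the site never sees more than one request from the crawler at a time.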

Original: https://www.cnblogs.com/wang_yb/p/15380917.html
Author: wang_yb
Title: 数据采集实战（四）– 线性代数习题答案下载

