Advanced Python Crawler Features


In the previous article we covered the basic implementation of a crawler and how it downloads pages. Here we tackle several practical issues: respecting a site's robots.txt file, which lists URLs that must not be crawled; adding proxy support; coping with the anti-crawling measures some sites deploy; and throttling download speed.

1. Parsing robots.txt
First, we need to parse the robots.txt file so that we avoid downloading URLs that are disallowed for crawling. Python's built-in robotparser module makes this straightforward, as the following code shows.

The robotparser module first loads the robots.txt file; the can_fetch() function then tells us whether a given user agent is allowed to access a page.

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_robots</span><span class="hljs-params">(url)</span>:</span>
    <span class="hljs-string">"""Initialize robots parser for this domain
    """</span>
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, <span class="hljs-string">'/robots.txt'</span>))
    rp.read()
    <span class="hljs-keyword">return</span> rp

To integrate this functionality into the crawler, we need to add the check inside the crawl loop.

<span class="hljs-keyword">while</span> crawl_queue:
    url = crawl_queue.pop()

    <span class="hljs-keyword">if</span> rp.can_fetch(user_agent,url):
        <span class="hljs-keyword">...</span>
    <span class="hljs-keyword">else</span>&#xFF1A;
        print <span class="hljs-string">'Blocked by robots.txt:'</span>,url

2. Proxy support
Sometimes we need to access a site through a proxy. Netflix, for example, blocks visitors from most countries outside the United States.

Supporting proxies with urllib2 is not as easy as you might expect (the friendlier Python HTTP module requests can also be used for this; see the sketch after the snippet below). The following code adds proxy support with urllib2.
proxy = ...

opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
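
As mentioned above, the requests module offers a friendlier interface for the same task. A rough sketch of the idea, assuming a placeholder proxy address:

import requests

url = 'http://example.webscraping.com'
proxy = 'http://127.0.0.1:8080'            # placeholder; replace with a real proxy
proxies = {'http': proxy, 'https': proxy}
# requests picks the proxy entry matching the URL scheme
response = requests.get(url, proxies=proxies)
html = response.text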
Here is a new version of the download function that integrates this feature.

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download</span><span class="hljs-params">(url, headers, proxy, num_retries, data=None)</span>:</span>
    <span class="hljs-keyword">print</span> <span class="hljs-string">'Downloading:'</span>, url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    <span class="hljs-keyword">if</span> proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    <span class="hljs-keyword">try</span>:
        response = opener.open(request)
        html = response.read()
        code = response.code
    <span class="hljs-keyword">except</span> urllib2.URLError <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">print</span> <span class="hljs-string">'Download error:'</span>, e.reason
        html = <span class="hljs-string">''</span>
        <span class="hljs-keyword">if</span> hasattr(e, <span class="hljs-string">'code'</span>):
            code = e.code
            <span class="hljs-keyword">if</span> num_retries > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> <span class="hljs-number">500</span> <= code < <span class="hljs-number">600:

                html = download(url, headers, proxy, num_retries - <span class="hljs-number">1</span>, data)
        <span class="hljs-keyword">else</span>:
            code = <span class="hljs-keyword">None</span>
    <span class="hljs-keyword">return</span> html</=>

3. Download throttling
If we crawl a site too fast, we risk being blocked or overloading the server. To reduce these risks, we can add a delay between downloads to limit the crawler's speed. The following class implements this.

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Throttle</span>:</span>
    <span class="hljs-string">"""Throttle downloading by sleeping between requests to same domain
    """</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, delay)</span>:</span>

        self.delay = delay

        self.domains = {}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">wait</span><span class="hljs-params">(self, url)</span>:</span>
        <span class="hljs-string">"""Delay if have accessed this domain recently
        """</span>
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        <span class="hljs-keyword">if</span> self.delay > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> last_accessed <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            <span class="hljs-keyword">if</span> sleep_secs > <span class="hljs-number">0</span>:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
The Throttle class records the time of the last access to each domain. If the time elapsed since that access is shorter than the specified delay, it sleeps before continuing. We can call Throttle before every download to rate-limit the crawler.
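
For example, combined with the download() function above, the throttle could be driven like this (the 2-second delay and URLs are illustrative):

throttle = Throttle(delay=2)       # at least 2 seconds between requests to the same domain
headers = {'User-agent': 'wswp'}
for url in ['http://example.webscraping.com/view/1', 'http://example.webscraping.com/view/2']:
    throttle.wait(url)             # sleeps if this domain was hit less than 2 seconds ago
    html = download(url, headers, proxy=None, num_retries=1)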

4. Avoiding crawler traps
So far, our crawler follows every link it has not seen before.

However, some sites generate page content dynamically, so the number of pages is effectively unlimited.

For example, suppose a site has an online calendar that provides links to the next month and year; the next month's page will in turn link to the month after that, and so on without end. This situation is known as a crawler trap.

To avoid falling into a crawler trap, a simple approach is to track how many links were followed to reach the current page, i.e. its depth. Once the maximum depth is reached, the crawler no longer adds links from that page to the queue.

To do this, we need to change the seen variable. It used to record only which page links had been visited; now it becomes a dictionary that also records the depth of each page.

def link_crawler(..., max_depth=2):
    seen = {}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
With this feature in place we can be confident that the crawl will eventually finish. To disable it, simply set max_depth to a negative number; the current depth will then never equal it.
Final version

<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> urlparse
<span class="hljs-keyword">import</span> urllib2
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> robotparser
<span class="hljs-keyword">import</span> Queue
<span class="hljs-keyword">from</span> scrape_callback3 <span class="hljs-keyword">import</span> ScrapeCallback

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">link_crawler</span><span class="hljs-params">(seed_url, link_regex=None, delay=<span class="hljs-number">5</span>, max_depth=-<span class="hljs-number">1</span>, max_urls=-<span class="hljs-number">1</span>, headers=None, user_agent=<span class="hljs-string">'wswp'</span>,
                 proxy=None, num_retries=<span class="hljs-number">1</span>, scrape_callback=None)</span>:</span>
    <span class="hljs-string">"""Crawl from the given seed URL following links matched by link_regex
    """</span>

    crawl_queue = [seed_url]

    seen = {seed_url: <span class="hljs-number">0</span>}

    num_urls = <span class="hljs-number">0</span>
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers <span class="hljs-keyword">or</span> {}
    <span class="hljs-keyword">if</span> user_agent:
        headers[<span class="hljs-string">'User-agent'</span>] = user_agent

    <span class="hljs-keyword">while</span> crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]

        <span class="hljs-keyword">if</span> rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            <span class="hljs-keyword">if</span> scrape_callback:
                links.extend(scrape_callback(url, html) <span class="hljs-keyword">or</span> [])

            <span class="hljs-keyword">if</span> depth != max_depth:

                <span class="hljs-keyword">if</span> link_regex:

                    links.extend(link <span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> get_links(html) <span class="hljs-keyword">if</span> re.match(link_regex, link))

                <span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> links:
                    link = normalize(seed_url, link)

                    <span class="hljs-keyword">if</span> link <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> seen:
                        seen[link] = depth + <span class="hljs-number">1</span>

                        <span class="hljs-keyword">if</span> same_domain(seed_url, link):

                            crawl_queue.append(link)

            num_urls += <span class="hljs-number">1</span>
            <span class="hljs-keyword">if</span> num_urls == max_urls:
                <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">print</span> <span class="hljs-string">'Blocked by robots.txt:'</span>, url

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Throttle</span>:</span>
    <span class="hljs-string">"""Throttle downloading by sleeping between requests to same domain
    """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, delay)</span>:</span>

        self.delay = delay

        self.domains = {}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">wait</span><span class="hljs-params">(self, url)</span>:</span>
        <span class="hljs-string">"""Delay if have accessed this domain recently
        """</span>
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        <span class="hljs-keyword">if</span> self.delay > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> last_accessed <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            <span class="hljs-keyword">if</span> sleep_secs > <span class="hljs-number">0</span>:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download</span><span class="hljs-params">(url, headers, proxy, num_retries, data=None)</span>:</span>
    <span class="hljs-keyword">print</span> <span class="hljs-string">'Downloading:'</span>, url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    <span class="hljs-keyword">if</span> proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    <span class="hljs-keyword">try</span>:
        response = opener.open(request)
        html = response.read()
        code = response.code
    <span class="hljs-keyword">except</span> urllib2.URLError <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">print</span> <span class="hljs-string">'Download error:'</span>, e.reason
        html = <span class="hljs-string">''</span>
        <span class="hljs-keyword">if</span> hasattr(e, <span class="hljs-string">'code'</span>):
            code = e.code
            <span class="hljs-keyword">if</span> num_retries > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> <span class="hljs-number">500</span> <= code < <span class="hljs-number">600:

                html = download(url, headers, proxy, num_retries - <span class="hljs-number">1</span>, data)
        <span class="hljs-keyword">else</span>:
            code = <span class="hljs-keyword">None</span>
    <span class="hljs-keyword">return</span> html

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">normalize</span><span class="hljs-params">(seed_url, link)</span>:</span>
    <span class="hljs-string">"""Normalize this URL by removing hash and adding domain
    """</span>
    link, _ = urlparse.urldefrag(link)
    <span class="hljs-keyword">return</span> urlparse.urljoin(seed_url, link)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">same_domain</span><span class="hljs-params">(url1, url2)</span>:</span>
    <span class="hljs-string">"""Return True if both URL's belong to same domain
    """</span>
    <span class="hljs-keyword">return</span> urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_robots</span><span class="hljs-params">(url)</span>:</span>
    <span class="hljs-string">"""Initialize robots parser for this domain
    """</span>
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, <span class="hljs-string">'/robots.txt'</span>))
    rp.read()
    <span class="hljs-keyword">return</span> rp

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_links</span><span class="hljs-params">(html)</span>:</span>
    <span class="hljs-string">"""Return a list of links from html
    """</span>

    webpage_regex = re.compile(<span class="hljs-string">'<a[^>]+href=["\'](.*?)["\']'</a[^></span>, re.IGNORECASE)

    <span class="hljs-keyword">return</span> webpage_regex.findall(html)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:

    link_crawler(<span class="hljs-string">'http://fund.eastmoney.com'</span>,<span class="hljs-string">r'/fund.html#os_0;isall_0;ft_;pt_1'</span>,max_depth=-<span class="hljs-number">1</span>,scrape_callback=ScrapeCallback
</=>


Original: https://www.cnblogs.com/yangykaifa/p/7402905.html
Author: yangykaifa
Title: python爬虫高级功能
