Integrating Selenium ChromeDriver with Scrapy


References:
Official chromedriver site
chromedriver-downloads
Running Selenium Headless with Chrome

I. Install the Chrome browser

1. Windows
Check the installed Chrome version via Help -> About Google Chrome.

2. Linux
TODO

II. Download chromedriver

Download link:
https://sites.google.com/a/chromium.org/chromedriver/downloads
Mirror for users in mainland China: http://npm.taobao.org/mirrors/chromedriver/
1. Choose the version that matches your installed Chrome

2. Choose the build for your operating system
For example, download and unzip the win32 build on Windows, or the linux64 build on Linux.
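
To make the version-matching step concrete, here is a minimal sketch that pulls the major version out of a version string so the Chrome and chromedriver versions can be compared. The function names and the sample version strings are illustrative assumptions, not part of the original setup, and the sketch assumes that agreeing on the major version is enough, which may not hold for every release.

```python
import re


def chrome_major_version(version_text):
    """Return the major version from text such as 'Google Chrome 92.0.4515.131'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in: {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when Chrome and chromedriver report the same major version."""
    return chrome_major_version(chrome_text) == chrome_major_version(driver_text)


# Hypothetical version strings, for illustration only.
print(versions_match("Google Chrome 92.0.4515.131", "ChromeDriver 92.0.4515.107"))
```

The strings could come from the About Google Chrome page and from running chromedriver with its version flag.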

III. Test chromedriver

First, install selenium:

pip install selenium

Test chromedriver on Windows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Run Chrome without opening a window
chrome_options.add_argument("--headless")

# executable_path is the Selenium 3 style; Selenium 4 replaces it with a Service object
browser = webdriver.Chrome(
    executable_path=r"D:\programs\chromedriver_win32\chromedriver.exe",
    options=chrome_options  # the older "chrome_options=" keyword is deprecated
)
browser.get("https://www.baidu.com/")
print("Title: %s" % browser.title)
browser.quit()

Output:

Title: 百度一下,你就知道

Note:
If you comment out the chrome_options.add_argument("--headless") statement, a Chrome window pops up, and it closes automatically after browser.quit().


IV. Parsing JSON with chromedriver

https://stackoverflow.com/questions/37121843/how-to-get-a-json-response-from-a-google-chrome-selenium-webdriver-client

In other words, Chrome wraps a JSON response in body > pre by default:
<html>
 <head>
  <style></style>
  <script src="chrome-extension://mooikfkahbdckldjjndioackbalphokd/assets/prompt.js"></script>
 </head>
 <body>
  <pre>json content...</pre>
  ...

 </body>
</html>
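
As a rough illustration of undoing this wrapping, the sketch below pulls the text out of the first pre element of a page source and parses it as JSON, using only the standard library. The names PreTextExtractor and json_from_page_source and the sample payload are assumptions made for the example; with Selenium the input would be driver.page_source.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text inside the first <pre> element of a document."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def json_from_page_source(page_source):
    """Parse the JSON that Chrome wrapped in <pre>; raises ValueError if absent or invalid."""
    parser = PreTextExtractor()
    parser.feed(page_source)
    parser.close()
    return json.loads("".join(parser.chunks))


# Hypothetical payload, for illustration only.
sample = '<html><head></head><body><pre>{"code": 0, "items": [1, 2]}</pre></body></html>'
data = json_from_page_source(sample)
print(data["items"])
```

Because the parser unescapes HTML entities, text such as `&amp;` inside the pre element comes back as a plain `&` before it reaches json.loads.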

V. Disabling image loading in chromedriver

Method 1: https://tarunlalwani.com/post/selenium-disable-image-loading-different-browsers/

from selenium import webdriver

option = webdriver.ChromeOptions()
# A content-setting value of 2 means "block"
chrome_prefs = {
    "profile.default_content_settings": {"images": 2},
    "profile.managed_default_content_settings": {"images": 2},
}
option.add_experimental_option("prefs", chrome_prefs)

driver = webdriver.Chrome(options=option)
driver.get("http://www.baidu.com")

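In testing, method 1 had no effect in headless mode, but it did work once the headless option was removed (that is, with a visible browser window).

Method 2 (recommended): https://stackoverflow.com/questions/48773031/how-to-prevent-chrome-headless-from-loading-images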

from selenium import webdriver

option = webdriver.ChromeOptions()
# Tell the Blink rendering engine not to load images
option.add_argument('--blink-settings=imagesEnabled=false')

driver = webdriver.Chrome(options=option)
driver.get("http://www.baidu.com")

In testing, method 2 worked in both headless mode and with a visible window.
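
Since the headless flag and the image flag are toggled independently, one way to keep them in a single place is a small helper that returns the argument list. This is only a sketch: build_chrome_args and its parameters are hypothetical names, while the two flag strings are the ones used in this post.

```python
def build_chrome_args(headless=True, load_images=False):
    """Return the Chrome command-line arguments for the requested mode."""
    args = []
    if headless:
        args.append("--headless")
    if not load_images:
        # Method 2 from above: works with and without a visible window.
        args.append("--blink-settings=imagesEnabled=false")
    return args


print(build_chrome_args())

# Applying the result with Selenium (not executed here):
#   option = webdriver.ChromeOptions()
#   for arg in build_chrome_args(headless=True, load_images=False):
#       option.add_argument(arg)
```

Passing headless=False while debugging brings the browser window back without touching the image setting.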


VI. Integrating Selenium + ChromeDriver with Scrapy

1. Modify settings.py:


CHROME_DRIVER_PATH = 'D:/programs/chromedriver_win32/chromedriver.exe'

DOWNLOADER_MIDDLEWARES = {
    'mx_crawl_spider.middlewares.MxCrawlSpiderDownloaderMiddleware': 543,
}

2. Downloader middleware implementation that integrates ChromeDriver:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By


class MxCrawlSpiderDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy uses this hook to build the middleware; connect the spider signals here.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        # Load the page in Chrome and hand the rendered result back to Scrapy.
        try:
            spider.logger.info(f"Chrome driver get: {request.url}")
            self.driver.get(request.url)
            return HtmlResponse(url=request.url,
                                body=self.convert_resp_body(request, spider),
                                request=request,
                                encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            spider.logger.info('Chrome driver end...')

    def convert_resp_body(self, request, spider):
        # Chrome wraps JSON responses in body > pre; return the inner text when present.
        try:
            json_text = self.driver.find_element(By.CSS_SELECTOR, "body > pre").text
            spider.logger.info(f"convert {request.url} to json resp")
            return json_text
        except NoSuchElementException:
            # Not a JSON response: fall back to the rendered page source.
            return self.driver.page_source

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info(f'Spider opened: {spider.name}')
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--blink-settings=imagesEnabled=false')
        chrome_driver_path = spider.settings.get("CHROME_DRIVER_PATH")
        # executable_path is the Selenium 3 style; Selenium 4 replaces it with a Service object.
        self.driver = webdriver.Chrome(options=options, executable_path=chrome_driver_path)

    def spider_closed(self, spider):
        # Quit Chrome so no browser process is left running after the crawl.
        self.driver.quit()

Dealing with the Cloudflare firewall

References:
https://stackoverflow.com/questions/33247662/how-to-bypass-cloudflare-bot-ddos-protection-in-scrapy
https://stackoverflow.com/questions/55480924/how-to-enable-javascript-in-selenium-webdriver-chrome-using-python
https://stackoverflow.com/questions/64842858/selenium-app-redirect-to-cloudflare-page-when-hosted-on-heroku

In the newly opened Chrome window, check whether JavaScript is enabled:
chrome://settings/content/javascript
The Cloudflare challenge page looks like this:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
  <meta charset="UTF-8" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
  <meta name="robots" content="noindex, nofollow" />
  <meta name="viewport" content="width=device-width,initial-scale=1" />
  <title>Just a moment...</title>
  <style type="text/css">
    html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
    body {background-color: #ffffff; color: #000000; font-family:-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Helvetica Neue",Arial, sans-serif; font-size: 16px; line-height: 1.7em;-webkit-font-smoothing: antialiased;}
    h1 { text-align: center; font-weight:700; margin: 16px 0; font-size: 32px; color:#000000; line-height: 1.25;}
    p {font-size: 20px; font-weight: 400; margin: 8px 0;}
    p, .attribution, {text-align: center;}
    #spinner {margin: 0 auto 30px auto; display: block;}
    .attribution {margin-top: 32px;}
    @keyframes fader     { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
    @-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
    #cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
    #cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
    #cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
    .bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
    a { color: #2c7cb0; text-decoration: none; -moz-transition: color 0.15s ease; -o-transition: color 0.15s ease; -webkit-transition: color 0.15s ease; transition: color 0.15s ease; }
    a:hover{color: #f4a15d}
    .attribution{font-size: 16px; line-height: 1.5;}
    .ray_id{display: block; margin-top: 8px;}
    #cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
    #cf-hcaptcha-container { text-align:center;}
    #cf-hcaptcha-container iframe { display: inline-block;}
  </style>

      <meta http-equiv="refresh" content="12">
  <script type="text/javascript"></script>

</head>
<body>
  <table width="100%" height="100%" cellpadding="20">
    <tr>
      <td align="center" valign="middle">
          <div class="cf-browser-verification cf-im-under-attack">
  <noscript>
    <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
  </noscript>
  <div id="cf-content" style="display:none">

    <div id="cf-bubbles">
      <div class="bubbles"></div>
      <div class="bubbles"></div>
      <div class="bubbles"></div>
    </div>
    <h1><span data-translate="checking_browser">Checking your browser before accessing</span> investing.com.</h1>

    <div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
      <p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
    </div>
    <p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
    <p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds...</p>
    <p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
  </div>

  <form class="challenge-form" id="challenge-form" action="/instruments/HistoricalDataAjax?__cf_chl_jschl_tk__=pmd_b44e6894f26381ec65b7ce23b86e8129a364b849-1627892247-0-gqNtZGzNAjijcnBszQbO" method="POST" enctype="application/x-www-form-urlencoded">
    <input type="hidden" name="md" value="451c537d80e1bea80c69dc737c5769b09b670734-1627892247-0-AaEUgkAqbIiw24JsE46HbzTyxeOEhQmi8EbAWnTlRKLCuY_9e8BK9vsyTztGDpOgOKGeHJ6c2nGZlaC9GD7yXQgT3yayH4hyysTGh0qAA8Cohn_rXsoVKEy5sQELH-n4w4O7ueEyvl-qpKfI1OcD-NwUfpABwRYqlgZ8IFAtYpWfWSWJZV6-a_jc_KTmZWYEsQgcrL7ymfTZ9GRGWrWe0gXh6Nd_3Lnix8qjaq-D2PfTLLdnMMK6dR2QRfEGsDndMrYjd0wITStlRWRn-vIbWqlDgUe6ZquuYNAinXRsaRS0pFGBkrmgLAgCEWD2Qucurcw5chay51tk_bKrFsSuAKEf2j8EG4x47F_QKQjhrCpKwdAGymcrxlrzuKY4iTL4RVfXOD-1oG6OkC-hLxfUriL41rHF7n069gS_7CPGruyQudaZT-G7JSP6ziEB7ewhlg_0wesnWQvLRS-38NXv4FKxPXFh-y7yVf4M5CW1qsEudiiYr7IllPoURvr-jEmMVg" />
    <input type="hidden" name="r" value="79ae4aef921ee4d89184402e18dc0eef7ef2799e-1627892247-0-ATyPW3cpfiZHnKMzYYz6kWN5wLP8bh69u0RAqk0c0NVdhf5rZYvFYEI1XErZIsXclkv+OiQk3wyP4UCWpqGdHkl34vCx28J66C3QHxcXHWmltizpewOrPNzIV39l0t3tos+LRohlQVGEd7CD2DN+2w3eNIzmj8IcxhRDIWa4kXvLdGF10zxehh9dB/zaRLJtUPnPk3fKshXcQbRTT9Uz207nUrk6N3qoCh6baJwAccK6tYPAsuf8jYesH+oKWGT1ZavzujhFvaPAMqEmOELZGRCq/Cq9s3HJdg3njBknHmKIkBoYaecpaewpGqBIZeXPOvgdr11FEPhvamjJATEyhss6r8/P3UooX3OiPgix0ePXcIzhtXaNTY5bftVmVyTiHKcwpLQx2SQH4lKzC273CxsIGQH91SI0/TmFqJx+e1cz8K9SrPcDq4nBJX9NuwP56NM9jkZA2hPjOkf9kNOD/sF9KAEXPIumhP1/k5Gnyp5O+u3gxiAfJHHeAclxUsFLXA/TBGdP4+qXb/2D6/wRPoBQKuXPK3QaiBf6KhzmvaEhktTgXqBX7E8m7tWetIlpXSsNGW7oVEabVd14BJfXNh7wnFhbxadYBrL7jPs8F7fiTlyekw4omUcTq52kpgW5KanEcuTksGn4yldb4O9C086LTasGLPkd0Qz25RIuGvXUmXfwnfoJQ8pg2mNV0GykXcyIVs09r0Wz5+IgclodF4WhZ2GMaL1ZDeGJbRQQHrmsF8a74cEgm4/HaZW0rn0xrAxNovhwPkWTaz7UxMNQaxVa0uoJ7c1g5j83wHkrKwnX12P9TSI8X375B8l7P4PS56i3iDd7edcsyqgtq8F8FKfh3BLl+MU4ZNJu/nKa43GqlD+YEkfK2aK7MXfuhT+vuVF5fSUkOo05TQ+td8VxYDmwTjN8vl5XXTEgLCKBIN1QTaxQN+YSwbGsmHy9brqRKcvCTFzJjneQZ8XDWuLh2FUIeqD6N8viDlFcJB4VMT9p5hNEKQLV7bEg84o3UKtOtBYj3M4k+hfz664fGg/giI7gYhQ7l9W+FbT3zKnni6wlxjgCWwo8h78b/S/4UPXqaT9p8eKIpOZVpIe9AXQRQNtl7uQntf1xiW00XupiAHC0N95rRQy5KrSAtwMiuFbxPP+ttfodwESvakbQ0rzQk5t2huYKljcNF4rzdexJe4c1iPetB97VmOo9vXixvwwQGds5iQIMqSrRxi/PCjowSK8JjReH1qLeGU0I/9RKJZA30Sz8jMQM7S2FC/kqSsr2rUAH7Ku/UIjblXUpaoCxEsC57YnhhMUxC5f8wNWuxMZz2+IfZfHqZXMtv2L6APd2LEnOaSk9DIlUectu4kwhsQ+59x9zWicVOPDoPQJ5DvB0TP5BYj+jvxmXUi3vls3TsMNaDSzhxv6IOQorhkBVuu5Md973zksLLUy7kH9E9ffH1jyEo/4G6/MkDTFpfqRBcES7s6zdHJgyuFO5rVC72SWTmj7bWvKZHLs3UFA7vWyOxdBSbuYidY6fKS/qgR1CkcipEy+5YsPYkSTpqqllGJrKV6fWWBT2hLG2DFkoBfgSKdoPEJ/pvMia1qBOI7t4G5Cfvcb+G4j4g3AdPUF1axJwPPXjQeHK7xpzJ9baNE5gGzxLihp3JUcdoFmdZaZ5Sv1qMq5UdKrKqZBSQ1j52Bo="/>
    <input type="hidden" value="62f859a005e28f79ef2069d9de198947" id="jschl-vc" name="jschl_vc"/>

    <input type="hidden" name="pass" value="1627892251.661-yYtA6C6xg3"/>
    <input type="hidden" id="jschl-answer" name="jschl_answer"/>
  </form>

    <script type="text/javascript"></script>

  <div id="trk_jschal_nojs" style="background-image:url('/cdn-cgi/images/trace/jschal/nojs/transparent.gif?ray=6785deb3cd503aec')"> </div>
</div>

          <div class="attribution">
            DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
            <br />
            <span class="ray_id">Ray ID: <code>6785deb3cd503aec</code></span>
          </div>
      </td>

    </tr>
  </table>
</body>
</html>
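
A crawler that receives a page like this has not reached the real content, so it can help to detect the challenge page before parsing. The sketch below checks a page source for a few strings taken from the HTML above. The function name, the marker list, and the two-hit threshold are assumptions for illustration, and Cloudflare may change this markup at any time.

```python
# Strings copied from the challenge page shown above; treat them as a heuristic only.
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    "cf-browser-verification",
    'id="challenge-form"',
)


def looks_like_cloudflare_challenge(page_source, min_hits=2):
    """Return True when at least `min_hits` challenge markers appear in the page."""
    hits = sum(1 for marker in CHALLENGE_MARKERS if marker in page_source)
    return hits >= min_hits


# Shortened, hypothetical page sources for illustration.
challenge = (
    "<html><head><title>Just a moment...</title></head>"
    '<body><div class="cf-browser-verification cf-im-under-attack"></div></body></html>'
)
normal = "<html><head><title>Historical data</title></head><body></body></html>"
print(looks_like_cloudflare_challenge(challenge), looks_like_cloudflare_challenge(normal))
```

In the middleware, a check like this could run on self.driver.page_source before the response is built, so that challenge pages are logged or retried instead of being parsed as data.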

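VII. From getting started to giving up 💔

The selenium + chromedriver combination handles page rendering (JavaScript execution) well,
but using selenium + chromedriver inside Scrapy has the following problems:
(1) The Python + scrapy + selenium + chromedriver + chrome environment is tedious to set up;
(2) It blocks Scrapy's thread: HTTP requests run serially, so crawling is far too slow 😭 and Scrapy's performance goes largely unused;
(3) Running several spiders at once starts several Chrome instances, and the system load becomes too high.
Given the current requirement to crawl 500+ sites at the same time, I ended up dropping the Selenium + ChromeDriver combination 😓
After looking into it further, I decided to use a Scrapy + Splash architecture…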
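
To put a rough number on problem (2), the following back-of-the-envelope sketch compares an idealized serial crawl with one that keeps several requests in flight. The page count and the per-page time are made-up figures; 16 is Scrapy's default CONCURRENT_REQUESTS value. Real crawls will not scale this cleanly.

```python
import math


def crawl_seconds(num_pages, seconds_per_page, concurrency):
    """Idealized wall-clock time when `concurrency` pages load in parallel."""
    return math.ceil(num_pages / concurrency) * seconds_per_page


# Hypothetical workload: 1000 pages at 3 seconds each.
serial = crawl_seconds(1000, 3, 1)     # one Chrome instance, one page at a time
parallel = crawl_seconds(1000, 3, 16)  # 16 requests in flight
print(serial, parallel)
```

Under these assumptions the serial crawl takes 3000 seconds against 189 seconds with 16 concurrent requests, which is the gap the blocking driver.get call gives up.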


Original: https://blog.csdn.net/luo15242208310/article/details/114978290
Author: 罗小爬EX
Title: Scrapy集成Selenium ChromeDriver
