Python Crawlers: Configuring Proxies in Scrapy

When crawling a website, the most common problem is that the site limits or blocks individual IPs as an anti-scraping measure. The best workaround is to rotate the crawling IP by routing requests through proxies.
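As a minimal sketch of that idea (not from the original article; the middleware name and PROXY_LIST are hypothetical placeholders), a Scrapy downloader middleware can assign a different proxy to each outgoing request:

import random

# Hypothetical pool of proxy URLs; replace with your own proxies.
PROXY_LIST = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

class RandomProxyMiddleware(object):
    # Called by Scrapy's downloader for every outgoing request.
    def process_request(self, request, spider):
        # Assign a randomly chosen proxy so successive requests
        # leave from different IPs.
        request.meta['proxy'] = random.choice(PROXY_LIST)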

The content below is in two parts: the first part comes from the web, and the second is my own code for using the 大蚂蚁 (Big Ant) proxy service.

##################### Part 1

Here is how to configure Scrapy to crawl through a proxy.

1. Create a new "middlewares.py" in the Scrapy project:

# Import the base64 library; it is needed only if the proxy
# requires authentication.
import base64

# Start your middleware class.
class ProxyMiddleware(object):
    # Overwrite process_request.
    def process_request(self, request, spider):
        # Set the location of the proxy.
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication.
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy.
        # (base64.b64encode replaces the Python 2-era base64.encodestring,
        # which appended a newline that would corrupt the header.)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2. Add the following to the project settings file (./pythontab/settings.py):

DOWNLOADER_MIDDLEWARES = {
    # The original post used the 'scrapy.contrib.*' path, which only
    # exists in older Scrapy versions; current versions use this path.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}
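To confirm the proxy is actually being used, one quick check (my own suggestion, not from the original article) is a throwaway spider against httpbin.org/ip, which echoes back the IP address the server sees:

import scrapy

class IPCheckSpider(scrapy.Spider):
    name = "ipcheck"
    # httpbin returns the client IP, so the response should show
    # the proxy's address rather than your own.
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        self.logger.info("Outgoing IP as seen by the server: %s", response.text)

Run it with "scrapy crawl ipcheck"; if the logged IP matches the proxy rather than your own machine, the middleware chain is working.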

###################### Part 2

import hashlib
import time

# Start your middleware class.
class ProxyMiddleware(object):
    # Overwrite process_request.
    def process_request(self, request, spider):
        # Set the location of the proxy (the 大蚂蚁 proxy host and port).
        request.meta['proxy'] = "http://PROXY_HOST:PORT"

        appkey = "your app key"
        secret = "your secret num string"
        paramMap = {"app_key": appkey,
                    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}

        # Sign the parameters: secret + key/value pairs in sorted key
        # order + secret, hashed with MD5 and uppercased. (sorted()
        # replaces the Python 2 idiom keys = paramMap.keys(); keys.sort().)
        keys = sorted(paramMap.keys())
        codes = "%s%s%s" % (secret,
                            "".join("%s%s" % (key, paramMap[key]) for key in keys),
                            secret)
        sign = hashlib.md5(codes.encode()).hexdigest().upper()
        paramMap["sign"] = sign

        # Build the vendor-specific authorization header. Keys are emitted
        # in sorted order here for determinism; the original Python 2 code
        # relied on arbitrary dict order.
        authHeader = "MYH-AUTH-MD5 " + "&".join(
            "%s=%s" % (key, paramMap[key]) for key in sorted(paramMap.keys()))
        request.headers['Proxy-Authorization'] = authHeader
        print(authHeader)

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'yourproject.middlewares.ProxyMiddleware': 100,
}
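Because the signing scheme above is easy to get wrong, here is the same computation factored into a standalone function that can be run outside Scrapy. The function name and sample credential strings are placeholders of mine; the MYH-AUTH-MD5 header format comes from the article:

import hashlib
import time

def make_auth_header(appkey, secret):
    # Parameters that go into the signature.
    paramMap = {"app_key": appkey,
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
    # secret + sorted key/value pairs + secret, uppercase MD5 hex digest.
    codes = secret + "".join("%s%s" % (k, paramMap[k]) for k in sorted(paramMap)) + secret
    paramMap["sign"] = hashlib.md5(codes.encode()).hexdigest().upper()
    # Vendor-specific header format: MYH-AUTH-MD5 key=value&key=value...
    return "MYH-AUTH-MD5 " + "&".join("%s=%s" % (k, paramMap[k]) for k in sorted(paramMap))

if __name__ == "__main__":
    print(make_auth_header("your app key", "your secret num string"))

Printing the header this way makes it easy to compare against the proxy vendor's documentation before wiring the middleware into a spider.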

Original: https://blog.csdn.net/weixin_39946767/article/details/113540027
Author: weixin_39946767
Title: Python 爬虫之 Scrapy 使用代理配置 (Python Crawlers: Configuring Proxies in Scrapy)
