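When crawling websites, the most common problem is that the site restricts access per IP address as an anti-scraping measure. The most practical workaround is to rotate IPs while crawling (that is, to use proxies).
The content below has two parts: the first comes from the web, and the second is code I wrote for the 大蚂蚁代理 (Damayi) proxy service.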
##################### Part 1
Here is how to configure a proxy in Scrapy for crawling.
1. Create "middlewares.py" in the Scrapy project
# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (b64encode replaces base64.encodestring, which was removed in Python 3.9)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
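The authentication step can be tried on its own, outside Scrapy. The following is a minimal sketch, assuming Python 3; the helper name `basic_proxy_auth` is made up for illustration and is not part of Scrapy. It uses `base64.b64encode` because `base64.encodestring` was removed in Python 3.9 and, when it existed, appended a newline that does not belong in a header value.

```python
import base64


def basic_proxy_auth(username: str, password: str) -> str:
    """Return a Proxy-Authorization header value for HTTP Basic auth.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    # b64encode returns bytes with no trailing newline, so the result
    # can be placed directly into a header after decoding to str.
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    # Placeholder credentials, same as in the middleware above.
    print(basic_proxy_auth("USERNAME", "PASSWORD"))
```

Inside the middleware, the returned string would be assigned to `request.headers['Proxy-Authorization']`.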
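2. Add the following to the project settings file (./pythontab/settings.py)
DOWNLOADER_MIDDLEWARES = {
    # Older Scrapy releases used 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}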
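The middleware in Part 1 always uses a single proxy, while the introduction talks about rotating IPs. One possible way to rotate is to pick a proxy at random for each request. This is only a sketch: the class name `RotatingProxyMiddleware` and the addresses in `PROXIES` are placeholders made up for illustration, and a real setup would probably load the list from settings or from the proxy provider.

```python
import random


class RotatingProxyMiddleware:
    """Downloader middleware that assigns a randomly chosen proxy per request."""

    # Hypothetical placeholder addresses; replace with proxies you actually use.
    PROXIES = [
        "http://PROXY_IP_1:PORT",
        "http://PROXY_IP_2:PORT",
        "http://PROXY_IP_3:PORT",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

To try it, register this class in `DOWNLOADER_MIDDLEWARES` in place of `ProxyMiddleware`, with the same priority.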
###################### Part 2
import hashlib
import time

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://PROXY_ADDRESS:PORT"
        appkey = "your app key"
        secret = "your secret num string"
        paramMap = {"app_key": appkey, "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
        # Sort the keys so the signature is computed in a fixed order
        keys = sorted(paramMap.keys())
        codes = "%s%s%s" % (secret, "".join('%s%s' % (key, paramMap[key]) for key in keys), secret)
        # hashlib.md5 requires bytes in Python 3
        sign = hashlib.md5(codes.encode('utf-8')).hexdigest().upper()
        paramMap["sign"] = sign
        keys = paramMap.keys()
        authHeader = "MYH-AUTH-MD5 " + '&'.join('%s=%s' % (key, paramMap[key]) for key in keys)
        request.headers['Proxy-Authorization'] = authHeader
        print(authHeader)
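The signing logic is easier to check when pulled out of the middleware into a plain function. The sketch below follows the same steps as the code above (sort the keys, wrap the joined pairs in the secret, take the upper-case MD5 hex digest), assuming Python 3. The function name `build_auth_header` and the sample values are made up for illustration; the header format itself is taken from the code above, not from the provider's documentation.

```python
import hashlib


def build_auth_header(app_key: str, secret: str, timestamp: str) -> str:
    """Build the MYH-AUTH-MD5 header value (hypothetical helper)."""
    params = {"app_key": app_key, "timestamp": timestamp}
    # Join key/value pairs in sorted key order, wrapped in the secret.
    joined = "".join(f"{key}{params[key]}" for key in sorted(params))
    codes = f"{secret}{joined}{secret}"
    # hashlib.md5 needs bytes in Python 3, hence the explicit encode.
    params["sign"] = hashlib.md5(codes.encode("utf-8")).hexdigest().upper()
    # Dicts keep insertion order, so "sign" comes last in the header.
    return "MYH-AUTH-MD5 " + "&".join(f"{k}={v}" for k, v in params.items())


if __name__ == "__main__":
    # Sample values only; use your real app key, secret and current time.
    print(build_auth_header("your app key", "your secret", "2020-01-01 00:00:00"))
```

Passing the timestamp in as an argument keeps the function deterministic, which makes it easy to compare its output with a value computed by hand.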
DOWNLOADER_MIDDLEWARES = {
    # Older Scrapy releases used 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'yourproject.middlewares.ProxyMiddleware': 100,
}
Original: https://blog.csdn.net/weixin_39946767/article/details/113540027
Author: weixin_39946767
Title: scrapy python proxy unsolved_python crawler: Scrapy proxy configuration (python爬虫之Scrapy 使用代理配置)