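Web Crawlers, Part 2

June 12, 2023, 4:41 AM • Python • 75 views

selenium was originally an automated testing tool. Crawlers use it mainly to get around the fact that requests cannot execute JavaScript.

At its core, selenium drives a real browser and fully simulates browser operations such as navigating, typing, clicking, and scrolling, in order to obtain the page as rendered. It supports multiple browsers: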

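from selenium import webdriver
# Each line below starts a different browser; use the one whose driver is installed.
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.PhantomJS()  # only in older selenium releases; removed in Selenium 4
browser = webdriver.Safari()
browser = webdriver.Edge()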

2, Basic usage

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By  # lookup strategy: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.common.keys import Keys  # keyboard keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait  # wait for certain elements to load

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')

    # Selenium 4 syntax; older versions wrote browser.find_element_by_id('kw')
    input_tag = browser.find_element(By.ID, 'kw')
    input_tag.send_keys('美女')  # in Python 2, prefix Chinese strings with u to avoid input errors
    input_tag.send_keys(Keys.ENTER)  # press Enter

    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))  # wait up to 10 seconds for the element with id content_left

    print(browser.page_source)
    print(browser.current_url)
    print(browser.get_cookies())

finally:
    browser.close()

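Example:

Opening a visible browser window on every run gets annoying. To keep the window from popping up, run the browser headless.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # keyboard keys
from selenium.webdriver.chrome.options import Options  # browser configuration, e.g. hiding the window

chrome_options = Options()
chrome_options.add_argument('window-size=1920x3000')  # set the browser resolution
chrome_options.add_argument('--disable-gpu')  # the Google documentation mentions this flag as a bug workaround
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
chrome_options.add_argument('--headless')  # no visible window; on Linux without a display, startup fails without this
chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # manually specify the Chrome binary to use
# bro = webdriver.PhantomJS()  # older headless alternative; removed in Selenium 4
bro = webdriver.Chrome(options=chrome_options)  # pass the configuration in (older versions: chrome_options=chrome_options)
# bro = webdriver.Chrome()  # without options, a visible browser window opens
bro.get("http://www.baidu.com")  # load the URL; the browser renders the page, JavaScript included
print(bro.page_source)  # print the full rendered page

# Get the search box. Tags can also be selected by class attribute or CSS selector.
inp = bro.find_element(By.ID, 'kw')  # look up by id when there is one, because ids are unique
inp.send_keys('美女')
inp.send_keys(Keys.ENTER)  # press Enter
time.sleep(10)
bro.close()  # close the browser page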
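
The flags in the example above can be gathered in one place so that headless and visible runs share the same configuration. The following is a minimal sketch, assuming a hypothetical helper named build_chrome_args (it is not part of selenium). It only builds the argument list, using the same flag strings as the example, so it runs without a browser:

```python
def build_chrome_args(headless=True, load_images=False, window_size="1920x3000"):
    """Return the Chrome command-line arguments used in the example above.

    build_chrome_args is a hypothetical helper, not a selenium API.
    """
    args = [
        "window-size=%s" % window_size,  # browser resolution
        "--disable-gpu",
        "--hide-scrollbars",
    ]
    if not load_images:
        args.append("blink-settings=imagesEnabled=false")  # skip images for speed
    if headless:
        args.append("--headless")  # no visible window
    return args


if __name__ == "__main__":
    for arg in build_chrome_args():
        print(arg)
    # To apply them (requires selenium and a Chrome driver):
    #   from selenium.webdriver.chrome.options import Options
    #   opts = Options()
    #   for arg in build_chrome_args():
    #       opts.add_argument(arg)
```

Switching headless=False while debugging brings the window back without touching the other flags.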

3, Getting cookies, to put into a cookie pool

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

bro = webdriver.Chrome()
bro.get("http://www.baidu.com")
bro.implicitly_wait(10)
# Lookup methods (Selenium 3 names; Selenium 4 writes bro.find_element(By.<STRATEGY>, value)):
# 1. find_element_by_id                 by id  ****
# 2. find_element_by_link_text          by the text of an a tag
# 3. find_element_by_partial_link_text  by part of the text of an a tag (fuzzy match)
# 4. find_element_by_tag_name           by tag name
# 5. find_element_by_class_name         by class name
# 6. find_element_by_name               by the name attribute
# 7. find_element_by_css_selector       by CSS selector  *******
# 8. find_element_by_xpath              by XPath  ****

dl_button = bro.find_element(By.LINK_TEXT, "登录")
dl_button.click()
user_login = bro.find_element(By.ID, 'TANGRAM__PSP_10__footerULoginBtn')
user_login.click()  # click
time.sleep(1)
input_name = bro.find_element(By.NAME, 'userName')  # by the name attribute; the lookups below could also use CSS selectors
input_name.send_keys("30323545@qq.com")
input_password = bro.find_element(By.ID, "TANGRAM__PSP_10__password")
input_password.send_keys("xxxxxx")
submit_button = bro.find_element(By.ID, 'TANGRAM__PSP_10__submit')
time.sleep(1)
submit_button.click()

time.sleep(100)  # give the login time to complete before reading the cookies

print(bro.get_cookies())
bro.close()
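
get_cookies() returns a list of dicts, each with at least a name and a value key. The sketch below shows one way to turn that list into forms a cookie pool could store. The helper names are hypothetical, and the JSON string is an assumed storage format, since the text does not say how the pool is kept:

```python
import json


def cookies_to_dict(selenium_cookies):
    """Flatten the list returned by get_cookies() into {name: value}."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def cookies_to_header(selenium_cookies):
    """Build the value of a Cookie request header."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in selenium_cookies)


def dump_for_pool(selenium_cookies):
    """Serialize the cookies to a JSON string (assumed pool storage format)."""
    return json.dumps(cookies_to_dict(selenium_cookies), sort_keys=True)


if __name__ == "__main__":
    # Made-up sample in the shape that get_cookies() returns.
    sample = [
        {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
        {"name": "token", "value": "xyz", "domain": ".example.com", "path": "/"},
    ]
    print(cookies_to_dict(sample))
    print(cookies_to_header(sample))
    print(dump_for_pool(sample))
```

The plain {name: value} dict is also the shape that requests accepts through its cookies= argument, which is what makes a pool filled by selenium usable from ordinary requests calls.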

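Explicit and implicit waits
Implicit wait: applies to every element lookup.
browser.implicitly_wait(10)   # applies to all lookups; if the element shows up after 3 seconds, execution continues right away; if it still has not shown up after 10 seconds, an error is raised

Explicit wait: wait explicitly for one particular element to load; you specify which element to wait for.
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.ID, 'content_left')))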
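
To make the behavior of an explicit wait concrete: it polls a condition until the condition returns a truthy value or the timeout expires. The following pure-Python sketch illustrates that idea only and is not Selenium's implementation. wait_until, its parameters, and WaitTimeout are hypothetical names:

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still falsy after the timeout."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = []

    def ready_on_third_call():
        # Stand-in for "is the element present yet?"
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_until(ready_on_third_call, timeout=2, poll_interval=0.01))
    print(len(attempts))
```

As with the real explicit wait, the call returns as soon as the condition holds, so the timeout is an upper bound rather than a fixed delay like time.sleep.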


4, Scraping data with CSS selectors

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

pro = webdriver.Chrome()
pro.get('http://www.jd.com')
pro.implicitly_wait(10)

def get_goods(pro):
    # we are now on another page (the search results)
    print('---------------------------------------')
    good_list = pro.find_elements(By.CLASS_NAME, 'gl-item')
    for good in good_list:
        img = good.find_element(By.CSS_SELECTOR, '.p-img a img')
        img_url = img.get_attribute('src')
        if not img_url:
            # images that have not loaded yet keep their URL in data-lazy-img
            img_url = 'https:' + img.get_attribute('data-lazy-img')
        url = good.find_element(By.CSS_SELECTOR, '.p-img a').get_attribute('href')
        # print(url)  # the product link; the other product details follow
        # print(img_url)
        price = good.find_element(By.CSS_SELECTOR, '.p-price i').text
        name = good.find_element(By.CSS_SELECTOR, '.p-name em').text.replace('\n', '')
        commit = good.find_element(By.CSS_SELECTOR, '.p-commit a').text
        print("""
        Product link: %s
        Product image: %s
        Product name: %s
        Product price: %s
        Number of reviews: %s
        """ % (url, img_url, name, price, commit))
    next_page = pro.find_element(By.PARTIAL_LINK_TEXT, '下一页')
    time.sleep(1)
    next_page.click()
    time.sleep(1)
    get_goods(pro)  # recurse into the next page; the recursion ends when a lookup raises

input_seach = pro.find_element(By.ID, 'key')
input_seach.send_keys('内衣')
input_seach.send_keys(Keys.ENTER)
try:
    get_goods(pro)
except Exception as e:
    print('Finished')
finally:
    pro.close()  # close the browser whether the crawl finished or failed
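
Two pieces of the crawler above can be written as small browser-independent helpers: choosing the image URL, and walking through pages with a loop instead of recursion. This is a sketch under stated assumptions. pick_image_url, crawl_pages, and the max_pages cap are hypothetical, and the scheme is only prepended when the lazy-load value starts with "//", whereas the code above prepends it unconditionally:

```python
def pick_image_url(src, lazy_src):
    """Prefer src; otherwise fall back to the lazy-load value, adding a scheme if missing."""
    if src:
        return src
    if not lazy_src:
        return None
    if lazy_src.startswith("//"):
        return "https:" + lazy_src
    return lazy_src


def crawl_pages(scrape_page, go_to_next_page, max_pages=100):
    """Scrape page after page in a loop rather than recursing.

    scrape_page() returns the list of items on the current page.
    go_to_next_page() returns False when there is no next page.
    """
    items = []
    for _ in range(max_pages):
        items.extend(scrape_page())
        if not go_to_next_page():
            break
    return items


if __name__ == "__main__":
    # Stand-in pages instead of a live browser.
    pages = [["item-a", "item-b"], ["item-c"]]
    position = {"index": 0}

    def scrape_page():
        return pages[position["index"]]

    def go_to_next_page():
        if position["index"] + 1 >= len(pages):
            return False
        position["index"] += 1
        return True

    print(crawl_pages(scrape_page, go_to_next_page))
    print(pick_image_url(None, "//img.example.com/x.jpg"))
```

With a real driver, scrape_page would wrap the per-item lookups and go_to_next_page would click the next-page link, returning False when the link cannot be found.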


Get an attribute:
tag.get_attribute('src')
Get the text content:
tag.text
Get the element's ID, location, tag name, and size (for reference):
print(tag.id)
print(tag.location)
print(tag.tag_name)
print(tag.size)

Simulating the browser's back and forward buttons
browser.back()
time.sleep(10)
browser.forward()

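Cookie management
print(browser.get_cookies())  # get the cookies
browser.add_cookie({'name': 'k1', 'value': 'xxx'})  # set a cookie; the dict needs 'name' and 'value' keys
browser.add_cookie({'name': 'k2', 'value': 'yyy'})
print(browser.get_cookies())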
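
add_cookie() takes one cookie per call, as a dict with name and value keys. When the cookies are held as a plain {name: value} mapping, for example after being read back from a cookie pool, a small conversion step is needed. A minimal sketch, assuming a hypothetical helper named to_selenium_cookies:

```python
def to_selenium_cookies(pairs):
    """Convert {name: value} into a list of dicts in the shape add_cookie() takes."""
    return [{"name": str(name), "value": str(value)} for name, value in pairs.items()]


if __name__ == "__main__":
    for cookie in to_selenium_cookies({"k1": "xxx", "k2": "yyy"}):
        print(cookie)
    # With a live browser already on the target site:
    #   for cookie in to_selenium_cookies({"k1": "xxx", "k2": "yyy"}):
    #       browser.add_cookie(cookie)
```

Cookies are normally added after the browser has loaded a page on the site they belong to, since the browser ties each cookie to the current domain.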

Running JavaScript
from selenium import webdriver
import time

bro = webdriver.Chrome()
bro.get("http://www.baidu.com")
bro.execute_script('alert("hello world")')  # show an alert dialog
time.sleep(5)
Tab management
import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')

print(browser.window_handles)  # all open tabs
browser.switch_to.window(browser.window_handles[1])  # older versions: browser.switch_to_window(...)
browser.get('https://www.taobao.com')
time.sleep(3)
browser.switch_to.window(browser.window_handles[0])
browser.get('https://www.sina.com.cn')
browser.close()
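
Indexing window_handles by position works when the number of tabs is known in advance. A slightly more defensive variant compares the handles before and after opening the tab. The sketch below assumes a hypothetical helper named open_in_new_tab. It uses only the driver members already shown above (window_handles, execute_script, switch_to.window, get), so it needs no selenium import of its own:

```python
def open_in_new_tab(driver, url):
    """Open url in a new tab, switch to that tab, and return its handle."""
    before = set(driver.window_handles)
    driver.execute_script("window.open()")
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    driver.switch_to.window(new_handles[0])
    driver.get(url)
    return new_handles[0]


# Usage with a live browser (requires selenium and a Chrome driver):
#   from selenium import webdriver
#   browser = webdriver.Chrome()
#   browser.get('https://www.baidu.com')
#   handle = open_in_new_tab(browser, 'https://www.taobao.com')
```

Keeping the returned handle makes it easy to switch back to that tab later with switch_to.window(handle).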

Action chains
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait  # wait for certain elements to load
import time

driver = webdriver.Chrome()
driver.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
wait = WebDriverWait(driver, 3)
driver.implicitly_wait(3)  # use an implicit wait

try:
    driver.switch_to.frame('iframeResult')  # switch into the iframe named iframeResult
    source = driver.find_element(By.ID, 'draggable')
    target = driver.find_element(By.ID, 'droppable')

    # Approach 1: queue the actions on a single chain and run them in sequence
    actions = ActionChains(driver)  # get an action chain object
    actions.drag_and_drop(source, target)  # queue the action for sequential execution
    actions.perform()

    # Approach 2: a separate chain for each step, moving a small offset at a time
    # (an alternative to approach 1; normally only one of the two is used)
    ActionChains(driver).click_and_hold(source).perform()
    distance = target.location['x'] - source.location['x']

    track = 0
    while track < distance:
        ActionChains(driver).move_by_offset(xoffset=2, yoffset=0).perform()
        track += 2

    ActionChains(driver).release().perform()

    time.sleep(10)

finally:
    driver.close()
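
The loop in the second approach always moves 2 pixels, so for an odd distance it ends one pixel past the target. One way around that is to compute the list of offsets first and make the last step shorter. A minimal sketch, assuming a hypothetical helper named build_track:

```python
def build_track(distance, step=2):
    """Split `distance` pixels into offsets of at most `step` that sum exactly to it."""
    if step <= 0:
        raise ValueError("step must be positive")
    offsets = []
    remaining = distance
    while remaining > 0:
        move = min(step, remaining)
        offsets.append(move)
        remaining -= move
    return offsets


if __name__ == "__main__":
    print(build_track(7))  # [2, 2, 2, 1]
    # With a live browser, each offset would be passed on like this:
    #   for offset in build_track(distance):
    #       ActionChains(driver).move_by_offset(xoffset=offset, yoffset=0).perform()
```

Because the offsets are plain numbers, the same list can later be replaced by a non-uniform sequence (larger steps first, smaller at the end) without changing the dragging code.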


5, XPath selection

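# Sample document. Only the text survived in the source; the tags and attributes
# below are an assumed reconstruction chosen to be consistent with the queries that follow.
doc='''
<html>
 <head>
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1</a>
   <a href='image2.html'>Name: My image 2</a>
   <a href='image3.html'>Name: My image 3</a>
   <a href='image4.html'>Name: My image 4</a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5</a>
   <a href='image6.html' name='items'><span>test</span>Name: My image 6</a>
  </div>
 </body>
</html>
'''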
from lxml import etree

html = etree.HTML(doc)
# html = etree.parse('search.html', etree.HTMLParser())  # alternatively, parse from a file
# 1. All nodes
a = html.xpath('//*')  # matches every tag
# 2. A specific node (the result is a list)
a = html.xpath('//head')
# 3. Child nodes and descendant nodes
a = html.xpath('//div/a')
a = html.xpath('//body/a')  # no results: the a tags are not direct children of body
a = html.xpath('//body//a')
# 4. Parent node
a = html.xpath('//body//a[@href="image1.html"]/..')
a = html.xpath('//body//a[1]/..')  # positions start at 1
# this form works too
a = html.xpath('//body//a[1]/parent::*')
# 5. Attribute matching
a = html.xpath('//body//a[@href="image1.html"]')

# 6. Getting text
a = html.xpath('//body//a[@href="image1.html"]/text()')
a = html.xpath('//body//a/text()')

# 7. Getting attributes
a = html.xpath('//body//a/@href')
# note that positions start at 1, not 0
a = html.xpath('//body//a[2]/@href')
# 8. Matching an attribute with multiple values
# when an a tag has several classes, an exact match does not work; use contains()
a = html.xpath('//body//a[@class="li"]')
a = html.xpath('//body//a[contains(@class,"li")]/text()')
# 9. Matching on multiple attributes
a = html.xpath('//body//a[contains(@class,"li") or @name="items"]')
a = html.xpath('//body//a[contains(@class,"li") and @name="items"]/text()')
a = html.xpath('//body//a[contains(@class,"li")]/text()')
# 10. Selecting by position
a = html.xpath('//a[2]/text()')
a = html.xpath('//a[2]/@href')
# the last one
a = html.xpath('//a[last()]/@href')
# positions less than 3
a = html.xpath('//a[position()<3]/@href')
# the second from last
a = html.xpath('//a[last()-1]/@href')
# 11. Selecting along axes
# ancestor: ancestor nodes
# * selects all ancestor nodes
a = html.xpath('//a/ancestor::*')
# only the div among the ancestors
a = html.xpath('//a/ancestor::div')
# attribute: attribute values
a = html.xpath('//a[1]/attribute::*')
# child: direct child nodes
a = html.xpath('//a[1]/child::*')
# descendant: all descendant nodes
a = html.xpath('//a[6]/descendant::*')
# following: all nodes after the current node
a = html.xpath('//a[1]/following::*')
a = html.xpath('//a[1]/following::*[1]/@href')
# following-sibling: sibling nodes after the current node
a = html.xpath('//a[1]/following-sibling::*')
a = html.xpath('//a[1]/following-sibling::a')
a = html.xpath('//a[1]/following-sibling::*[2]/text()')
a = html.xpath('//a[1]/following-sibling::*[2]/@href')

print(a)
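
A few of these queries can be tried without installing lxml. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() or @attr steps, and no axes such as ancestor::), so text and attributes are read from the element objects instead. The sample document is an assumption made for illustration and has to be well-formed XML:

```python
import xml.etree.ElementTree as ET

# Assumed sample document (well-formed, so ElementTree can parse it).
DOC = """
<html>
  <head><title>Example website</title></head>
  <body>
    <div id="images">
      <a href="image1.html">Name: My image 1</a>
      <a href="image2.html">Name: My image 2</a>
      <a href="image3.html" class="li li-item">Name: My image 3</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

all_links = root.findall(".//a")                       # every a element
hrefs = [a.get("href") for a in all_links]             # attribute values
first_text = root.find(".//a[@href='image1.html']").text   # attribute match, then text
second_href = root.find(".//div/a[2]").get("href")     # positions start at 1
last_href = root.find(".//div/a[last()]").get("href")  # last a inside the div
parent_tag = root.find(".//a[@href='image1.html']/..").tag  # parent node
direct_children = root.findall("./body/a")             # empty: the links sit inside the div
# Whole-token class match; XPath contains() would do a substring match instead.
multi_class = [a.get("href") for a in all_links if "li" in a.get("class", "").split()]

if __name__ == "__main__":
    print(hrefs)
    print(first_text, second_href, last_href, parent_tag)
    print(direct_children, multi_class)
```

For real-world HTML, which is rarely well-formed, lxml's etree.HTML as used above remains the more practical choice.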

Original: https://www.cnblogs.com/Fzhiyuan/p/11945883.html
Author: 在于折腾
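Title: 爬虫二 (Web Crawlers, Part 2)

Original articles are protected by copyright. When reposting, please credit the source: https://www.johngo689.com/602967/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author!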
