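A Detailed Guide to selenium, a Common Library for Python Web Scraping

Introduction

selenium is an automated testing tool that supports many browsers. In web scraping it is mainly used to handle pages that are rendered by JavaScript.

Basic usage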

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    # Type the query into the search box and submit it with Enter.
    input = browser.find_element_by_id('kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    # Wait up to 10 seconds for the results container to appear.
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()

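Running this code opens a Chrome browser, loads the Baidu home page, finds the element whose id is kw, types Python into it and presses Enter, then waits up to 10 seconds for the element whose id is content_left to load. Finally it prints the current URL, the cookies and the page source: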

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=Python&rsv_pq=ecec174c000446b6&rsv_t=e937VWYSLXgLXfqehqBQAlxkvD%2BxfJgrJbQShd9tlzleUDgLT79hlx3OY4I&rqlang=cn&rsv_enter=1&rsv_sug3=6&rsv_sug2=0&inputT=151&rsv_sug4=152
[{'domain': '.baidu.com', 'httpOnly': False, 'name': 'H_PS_PSSID', 'path': '/', 'secure': False, 'value': '1431_25548_21083_17001_20927'}, {'domain': '.baidu.com', 'expiry': 3667193230.07363, 'httpOnly': False, 'name': 'BAIDUID', 'path': '/', 'secure': False, 'value': 'A7794CCB63F72DF1DB89412B90FD7594:FG=1'}, {'domain': '.baidu.com', 'expiry': 3667193230.073748, 'httpOnly': False, 'name': 'BIDUPSID', 'path': '/', 'secure': False, 'value': 'A7794CCB63F72DF1DB89412B90FD7594'}, {'domain': '.baidu.com', 'expiry': 3667193230.073789, 'httpOnly': False, 'name': 'PSTM', 'path': '/', 'secure': False, 'value': '1519709584'}, {'domain': 'www.baidu.com', 'httpOnly': False, 'name': 'BD_HOME', 'path': '/', 'secure': False, 'value': '0'}, {'domain': 'www.baidu.com', 'expiry': 1520573584, 'httpOnly': False, 'name': 'BD_UPN', 'path': '/', 'secure': False, 'value': '12314753'}, {'domain': 'www.baidu.com', 'httpOnly': False, 'name': 'BD_CK_SAM', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.baidu.com', 'httpOnly': False, 'name': 'PSINO', 'path': '/', 'secure': False, 'value': '6'}, {'domain': 'www.baidu.com', 'expiry': 1519712177, 'httpOnly': False, 'name': 'H_PS_645EC', 'path': '/', 'secure': False, 'value': '73c5tON4OP%2BcqrdDlwHz6rwaG1DOdU1Z3%2F9ptI2btWk%2BMk40sI5n%2BTm0W0M'}]
<!DOCTYPE html><!--STATUS OK--><html xmlns="http://www.w3.org/1999/xhtml"><head><script type="text/javascript" charset="gb2312" src="//www.baidu.com/cache/aladdin/ui/tabs5/tabs5.js?v=20170208" data-for="A.ui"></script><script charset="utf-8" async="" src="https://ss0.bdstatic.com/-0U0bnSm1A5BphGlnYG/tam-ogel/5d4e9b24-dcc5-483a-b6da-be1e9e621891.js"></script>
...
...
</body></html>
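
The try/finally pattern used in the basic example can be wrapped in a small context manager so that the browser is always shut down. This is only a sketch: the helper name managed_browser is hypothetical, and it calls quit(), which ends the whole WebDriver session, whereas close() only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(browser):
    """Yield the browser and always end its session afterwards.

    `browser` is any WebDriver instance, e.g. webdriver.Chrome().
    """
    try:
        yield browser
    finally:
        # quit() closes every window and stops the driver process;
        # close() would only close the current window.
        browser.quit()


# Usage (needs selenium and a matching browser driver installed):
#   from selenium import webdriver
#   with managed_browser(webdriver.Chrome()) as browser:
#       browser.get('https://www.baidu.com')
#       print(browser.current_url)
```

The cleanup runs even if the body of the with block raises, which is the same guarantee the finally clause gives in the original example.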

Declaring the browser object

selenium supports many browsers:

from selenium import webdriver

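# Create only the one you need; each call launches a separate browser.
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()  # PhantomJS support was removed in Selenium 4
browser = webdriver.Safari()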

Visiting a page

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
print(browser.page_source)
browser.close()

Result:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="zh-CN" class="ks-webkit537 ks-webkit ks-chrome63 ks-chrome"><head><script charset="utf-8" src="https://g.alicdn.com/mm/tb-page-peel/0.0.5/index-min.js" async=""></script><script src="https://tce.alicdn.com/api/data.htm?ids=1017579&amp;callback=tce_fixedtool_callback" async=""></script>
...
...
</body></html>
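
page_source is an ordinary string, so once selenium has rendered the page, any HTML parser can process it. The following sketch uses only the standard library to pull the links out of such a string; LinkCollector and extract_links are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def extract_links(page_source):
    """Return all non-empty href values found in page_source."""
    collector = LinkCollector()
    collector.feed(page_source)
    return collector.links


# Usage: links = extract_links(browser.page_source)
```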

Finding elements

Single element

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element_by_id('q')
input_second = browser.find_element_by_css_selector('#q')
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first, '\n', input_second, '\n', input_third)
browser.close()

This opens the Taobao home page and looks up the element whose id is q in three different ways:

<selenium.webdriver.remote.webelement.WebElement (session="a1d52c692ca1e8a73e7238f098e13ce9", element="0.678986922879961-1")> 
<selenium.webdriver.remote.webelement.WebElement (session="a1d52c692ca1e8a73e7238f098e13ce9", element="0.678986922879961-1")>
<selenium.webdriver.remote.webelement.WebElement (session="a1d52c692ca1e8a73e7238f098e13ce9", element="0.678986922879961-1")>

selenium supports many lookup methods, including CSS selectors and XPath:

  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

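Another way is to pass the lookup strategy as an argument, for example browser.find_element(By.ID, 'q'). Recent Selenium 4 releases removed the find_element_by_* helpers, so this is the form to use there: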

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
print(input_first)
browser.close()

Result:

<selenium.webdriver.remote.webelement.WebElement (session="1f209c0d11551c40d9d20ad964fef244", element="0.07914603542731591-1")>
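
Each find_element_by_* helper corresponds to one By strategy, and the By constants are plain strings. The sketch below spells out that correspondence; BY_VALUES and to_locator are hypothetical names, and the string values are the ones defined by selenium.webdriver.common.by.By.

```python
# String values of the By constants in selenium.webdriver.common.by,
# keyed by the suffix of the matching find_element_by_* helper.
BY_VALUES = {
    'id': 'id',
    'name': 'name',
    'xpath': 'xpath',
    'link_text': 'link text',
    'partial_link_text': 'partial link text',
    'tag_name': 'tag name',
    'class_name': 'class name',
    'css_selector': 'css selector',
}


def to_locator(legacy_suffix, value):
    """Translate e.g. ('css_selector', '#q') into a (by, value) tuple."""
    try:
        return (BY_VALUES[legacy_suffix], value)
    except KeyError:
        raise ValueError(f'unknown strategy: {legacy_suffix!r}') from None


# Usage: browser.find_element(*to_locator('id', 'q'))
#        does the same lookup as browser.find_element(By.ID, 'q')
```

The same (by, value) tuples are what the expected_conditions helpers accept as locators.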

Multiple elements

Use find_elements to look up multiple elements:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

The result is a list:

[<selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-1")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-2")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-3")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-4")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-5")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-6")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-7")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-8")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-9")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-10")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-11")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-12")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-13")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-14")>, <selenium.webdriver.remote.webelement.WebElement (session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-15")>, <selenium.webdriver.remote.webelement.WebElement 
(session="c688cf3c4681d66e813217aa5311a77e", element="0.3350212468864553-16")>]

Correspondingly, there are several methods for finding multiple elements:

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector
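
Because the find_elements methods return an ordinary Python list, the usual next step in a scraper is to loop over it. A minimal sketch, assuming each item is a WebElement with a text attribute; element_texts is a hypothetical helper name.

```python
def element_texts(elements):
    """Return the stripped, non-empty text of each element."""
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:
            texts.append(text)
    return texts


# Usage:
#   lis = browser.find_elements_by_css_selector('.service-bd li')
#   print(element_texts(lis))
```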

Interacting with elements

Elements you have found expose interaction methods, for example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input = browser.find_element_by_id('q')
input.send_keys('iPhone')
time.sleep(1)
input.clear()
input.send_keys('iPad')
button = browser.find_element_by_class_name('btn-search')
button.click()

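This code opens Chrome, finds the search box, types iPhone, waits 1 second, clears the box, types iPad and then clicks the search button.

More on element interaction can be found in the documentation.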
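
The clear, type and click steps can be bundled into one reusable function. This is a sketch under stated assumptions: the function name search is hypothetical, and the default locators simply reuse the id q and class btn-search from the Taobao example, written as (by, value) tuples.

```python
def search(browser, query,
           box=('id', 'q'), button=('class name', 'btn-search')):
    """Replace the search box content with `query` and click the button.

    `box` and `button` are (by, value) locator tuples; 'id' and
    'class name' are the string values of By.ID and By.CLASS_NAME.
    """
    input_box = browser.find_element(*box)
    input_box.clear()            # drop whatever was typed before
    input_box.send_keys(query)
    browser.find_element(*button).click()


# Usage: search(browser, 'iPad')
```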

Action chains

Actions are appended to an action chain and executed in order, for example:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

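This code performs a drag-and-drop operation inside an iframe.

More interaction actions are described in the documentation.

Executing JavaScript

Use execute_script() to run JavaScript code: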

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')

The code above opens the Zhihu page, scrolls to the bottom and then shows an alert.
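
execute_script() also returns whatever the script returns, which makes it possible to keep scrolling until a lazily loading page stops growing. A rough sketch: scroll_to_bottom is a hypothetical helper, and the pause and max_rounds defaults are arbitrary assumptions.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing.

    Returns the number of scroll rounds that were performed.
    """
    get_height = 'return document.body.scrollHeight'
    last_height = browser.execute_script(get_height)
    for round_number in range(1, max_rounds + 1):
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = browser.execute_script(get_height)
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds


# Usage: scroll_to_bottom(browser, pause=2)
```

A fixed pause is the simplest choice; on slow pages a longer pause or an explicit wait on a specific element is likely to be more reliable.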

Getting element information

Getting attributes

Use get_attribute() to read an attribute:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))

Result:

<selenium.webdriver.remote.webelement.WebElement (session="767d4093cfd43cd8c5d9cd4dc12dc204", element="0.4229578279847983-1")>
zu-top-link-logo

Getting text

Use the .text attribute to read an element's text:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)

Result:

提问

Getting the ID, location, tag name and size

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)

Result:

0.6822924344980397-1
{'y': 7, 'x': 774}
button
{'height': 32, 'width': 66}
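
location and size are plain dictionaries, so they can be combined with ordinary arithmetic. As one illustration, the sketch below computes the center point of an element; element_center is a hypothetical helper name.

```python
def element_center(element):
    """Return the (x, y) center of an element.

    Coordinates are relative to the top-left corner of the page,
    like the values in element.location.
    """
    location = element.location   # e.g. {'x': 774, 'y': 7}
    size = element.size           # e.g. {'width': 66, 'height': 32}
    return (location['x'] + size['width'] / 2,
            location['y'] + size['height'] / 2)


# Usage: print(element_center(input))
```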

Frame

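When a page contains frames, elements inside a frame cannot be found directly. You have to switch to the frame that contains the element first: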

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
# Switch into the iframe that holds the demo before looking up its elements.
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    # The logo belongs to the parent document, so it cannot be found from here.
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
# Switch back to the parent document, where the logo can be found.
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)

Result:

<selenium.webdriver.remote.webelement.WebElement (session="4bb8ac03ced4ecbdefef03ffdc0e4ccd", element="0.44746093888932004-1")>
NO LOGO
<selenium.webdriver.remote.webelement.WebElement (session="4bb8ac03ced4ecbdefef03ffdc0e4ccd", element="0.13792611320464965-2")>
RUNOOB.COM
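
Forgetting to switch back out of a frame is an easy mistake to make, so the switch can be tied to a with block. A minimal sketch, assuming the browser object offers switch_to.frame() and switch_to.parent_frame() as in the example; in_frame is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def in_frame(browser, frame_reference):
    """Switch into a frame for the duration of a with block."""
    browser.switch_to.frame(frame_reference)
    try:
        yield browser
    finally:
        # Return to the parent document even if the block raised.
        browser.switch_to.parent_frame()


# Usage:
#   with in_frame(browser, 'iframeResult'):
#       source = browser.find_element_by_css_selector('#draggable')
#   logo = browser.find_element_by_class_name('logo')  # parent document again
```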

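Waits

Implicit wait

With an implicit wait, if WebDriver does not find an element in the DOM right away, it keeps looking until the configured time has elapsed and then raises an exception saying the element could not be found. In other words, when an element is not immediately available, the implicit wait polls the DOM for a while before giving up. The default wait time is 0.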

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)

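implicitly_wait(10) means that if an element has not loaded yet, for example because the network is slow, selenium waits up to 10 seconds for it and raises an exception if it still has not appeared by then. In most cases an implicit wait is unnecessary.

Explicit wait

Explicit waits are used more often. You specify a condition and a maximum wait time; selenium checks the condition repeatedly within that time, returns as soon as the condition holds, and raises an exception if the time runs out. For example: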

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)

presence_of_element_located((By.ID, 'q')) checks whether the element is present in the DOM, and element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')) checks whether the button can be clicked. Note that both take a tuple. Result:

<selenium.webdriver.remote.webelement.WebElement (session="07dd2fbc2d5b1ce40e82b9754aba8fa8", element="0.5642646294074107-1")> <selenium.webdriver.remote.webelement.WebElement (session="07dd2fbc2d5b1ce40e82b9754aba8fa8", element="0.5642646294074107-2")>
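
Conceptually, an explicit wait is a polling loop: evaluate the condition, return its result if it is truthy, otherwise sleep briefly and try again until the deadline. The sketch below illustrates that idea in plain Python. It is a simplified illustration and not Selenium's actual implementation; poll_until and its parameters are hypothetical.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(interval)


# Usage: poll_until(lambda: browser.find_elements('id', 'q'), timeout=10)
```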

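Commonly used conditions:

  • title_is: the title equals the given text
  • title_contains: the title contains the given text
  • presence_of_element_located: the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
  • visibility_of_element_located: the element is visible; takes a locator tuple
  • visibility_of: the element is visible; takes an element object
  • presence_of_all_elements_located: at least one matching element is present; returns all matches
  • text_to_be_present_in_element: the element's text contains the given text
  • text_to_be_present_in_element_value: the element's value contains the given text
  • frame_to_be_available_and_switch_to_it: the frame has loaded, and the driver switches to it
  • invisibility_of_element_located: the element is invisible or absent
  • element_to_be_clickable: the element can be clicked
  • staleness_of: the element is no longer attached to the DOM; useful for detecting that the page has refreshed
  • element_to_be_selected: the element is selected; takes an element object
  • element_located_to_be_selected: the element is selected; takes a locator tuple
  • element_selection_state_to_be: takes an element object and a state; returns True if they match, otherwise False
  • element_located_selection_state_to_be: takes a locator tuple and a state; returns True if they match, otherwise False
  • alert_is_present: an alert is displayed

See the documentation for details.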
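
When none of the built-in conditions fits, WebDriverWait.until() accepts any callable that takes the driver and returns a truthy value on success. A sketch of such a custom condition follows; at_least_n_elements is a hypothetical name, not part of selenium.

```python
def at_least_n_elements(locator, n):
    """Build a wait condition satisfied once `n` or more elements match.

    `locator` is a (by, value) tuple such as (By.CSS_SELECTOR, 'li').
    """
    def condition(driver):
        elements = driver.find_elements(*locator)
        # A falsy result tells the wait to keep polling.
        return elements if len(elements) >= n else False
    return condition


# Usage:
#   wait = WebDriverWait(browser, 10)
#   items = wait.until(at_least_n_elements((By.CSS_SELECTOR, '.service-bd li'), 5))
```

Returning the element list rather than True lets the caller use the result of wait.until() directly, which mirrors how the built-in located conditions hand back the element they found.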

Back and forward

back() and forward() move backward and forward in the browsing history:

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
browser.get('https://www.taobao.com/')
browser.get('https://www.python.org/')
browser.back()
time.sleep(1)
browser.forward()
browser.close()

This visits Baidu, then Taobao, then the Python website, goes back to Taobao, waits 1 second and then goes forward to the Python website again.

Cookies

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'jeff'})
print(browser.get_cookies())

The code above opens the Zhihu page and prints its cookies, deletes all cookies and prints the now empty list, then adds a cookie and prints the cookies again:

[{'domain': '.zhihu.com', 'expiry': 1614321990.390978, 'httpOnly': False, 'name': 'd_c0', 'path': '/', 'secure': False, 'value': '"AAAsNHqUNQ2PTmyB9_dLW6YtcYCfwvIaBac=|1519713991"'}, {'domain': 'www.zhihu.com', 'httpOnly': True, 'name': 'aliyungf_tc', 'path': '/', 'secure': False, 'value': 'AQAAADeoZFYSBAYACrwVt1hMfg3RqO/a'}, {'domain': '.zhihu.com', 'httpOnly': False, 'name': 'l_n_c', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.zhihu.com', 'expiry': 1519715791, 'httpOnly': False, 'name': '__utmb', 'path': '/', 'secure': False, 'value': '51854390.0.10.1519713992'}, {'domain': '.zhihu.com', 'expiry': 1614321988.412313, 'httpOnly': False, 'name': 'q_c1', 'path': '/', 'secure': False, 'value': 'bff234c12f284b39b3b49bb9be57735d|1519713988000|1519713988000'}, {'domain': 'www.zhihu.com', 'httpOnly': False, 'name': '_xsrf', 'path': '/', 'secure': False, 'value': '6f356dbeb54748552c389235dc679975'}, {'domain': '.zhihu.com', 'expiry': 1522305988.412491, 'httpOnly': False, 'name': 'r_cap_id', 'path': '/', 'secure': False, 'value': '"YjAzMTliOWEyOGRiNGEyMGE5NzVmYzY2NDg1MWZjZjQ=|1519713988|a448a57c0fa2941a00ca61a0bb8ba1c521adb298"'}, {'domain': '.zhihu.com', 'expiry': 1522305988.412595, 'httpOnly': False, 'name': 'cap_id', 'path': '/', 'secure': False, 'value': '"NGUzM2U2ZGVmNmFiNGY0MWI0M2MwMGE4ZGJhMjc0NGE=|1519713988|707d2d360638930efdf20eec53827e09891b1d53"'}, {'domain': '.zhihu.com', 'expiry': 1522305988.412739, 'httpOnly': False, 'name': 'l_cap_id', 'path': '/', 'secure': False, 'value': '"MzdlNzAyOWU4NmNiNDBlMjhlZDBhNGI4NWE1MGYwMDM=|1519713988|3a82264b18882e235680e56235a7dee96be9e1c2"'}, {'domain': '.zhihu.com', 'httpOnly': False, 'name': 'n_c', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.zhihu.com', 'expiry': 1582785991, 'httpOnly': False, 'name': '_zap', 'path': '/', 'secure': False, 'value': '310e0d7a-3b5f-4293-867e-e7c2052b3734'}, {'domain': '.zhihu.com', 'expiry': 1582785991, 'httpOnly': False, 'name': '__utma', 'path': '/', 'secure': 
False, 'value': '51854390.1700258999.1519713992.1519713992.1519713992.1'}, {'domain': '.zhihu.com', 'httpOnly': False, 'name': '__utmc', 'path': '/', 'secure': False, 'value': '51854390'}, {'domain': '.zhihu.com', 'expiry': 1535481991, 'httpOnly': False, 'name': '__utmz', 'path': '/', 'secure': False, 'value': '51854390.1519713992.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'}, {'domain': '.zhihu.com', 'expiry': 1582785991, 'httpOnly': False, 'name': '__utmv', 'path': '/', 'secure': False, 'value': '51854390.000--|3=entry_date=20180227=1'}]
[]
[{'domain': 'www.zhihu.com', 'expiry': 2150433993, 'httpOnly': False, 'name': 'name', 'path': '/', 'secure': True, 'value': 'jeff'}]

The cookies can also be inspected in the browser's developer tools, under Cookies in the Application panel.
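
In a scraper, cookies collected by the browser are often handed over to a plain HTTP client afterwards. get_cookies() returns a list of dictionaries, so the conversion is short. This is a sketch; cookies_to_dict and cookie_header are hypothetical helper names, and the usage line assumes the third-party requests library is installed.

```python
def cookies_to_dict(cookies):
    """Flatten Selenium's list of cookie dicts into {name: value}."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


def cookie_header(cookies):
    """Render cookies as the value of an HTTP Cookie request header."""
    return '; '.join(f"{c['name']}={c['value']}" for c in cookies)


# Usage (assumes the requests library is available):
#   import requests
#   jar = cookies_to_dict(browser.get_cookies())
#   response = requests.get('https://www.zhihu.com/explore', cookies=jar)
```

The flattened form drops the domain, path and expiry fields, so it is only suitable when all requests go to the site the cookies came from.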

Tab management

The simplest approach is to open a new window with the JavaScript call window.open():

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to_window(browser.window_handles[1])
browser.get('https://www.taobao.com')
time.sleep(1)
browser.switch_to_window(browser.window_handles[0])
browser.get('https://python.org')

Output:

['CDwindow-(CCB96210C849FFC0EC59E7230C77B934)', 'CDwindow-(C67B5EFB619A8F76ED9A2609C0E79842)']

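window_handles lists the open tabs, and switch_to_window() switches to a tab so that you can operate on it. Selenium 4 removed switch_to_window(); the equivalent call there is browser.switch_to.window().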
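
Indexing window_handles by position works for two tabs but gets fragile with more. An alternative is to compare the handles before and after opening the tab. The sketch below does that; open_tab is a hypothetical helper, and it uses the switch_to.window() spelling rather than switch_to_window().

```python
def open_tab(browser, url):
    """Open `url` in a new tab, switch to it and return its handle."""
    before = set(browser.window_handles)
    browser.execute_script('window.open()')
    # The one handle that was not there before is the new tab.
    (new_handle,) = set(browser.window_handles) - before
    browser.switch_to.window(new_handle)
    browser.get(url)
    return new_handle


# Usage:
#   first = browser.current_window_handle
#   open_tab(browser, 'https://www.taobao.com')
#   browser.switch_to.window(first)   # back to the original tab
```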

Exception handling

For example, looking up an element that does not exist:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.find_element_by_id('hello')

This raises an error:

---------------------------------------------------------------------------
NoSuchElementException Traceback (most recent call last)
<ipython-input-23-978945848a1b> in <module>()
3 browser = webdriver.Chrome()
4 browser.get('https://www.baidu.com')
----> 5 browser.find_element_by_id('hello')
...
...
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"id","selector":"hello"}
(Session info: chrome=63.0.3239.132)
(Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.16299 x86_64)

Handling the exceptions:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()

Output:

No Element

More on exception handling can be found in the documentation.
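
When a missing element is an expected case rather than an error, the exception can be avoided altogether: find_elements returns an empty list instead of raising. A minimal sketch, with find_or_none as a hypothetical helper name.

```python
def find_or_none(browser, by, value):
    """Return the first matching element, or None if there is none."""
    matches = browser.find_elements(by, value)
    return matches[0] if matches else None


# Usage:
#   element = find_or_none(browser, By.ID, 'hello')
#   if element is None:
#       print('No Element')
```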

Conclusion

Detailed explanations and usage are available in the documentation.