Python Web Scraping: A Detailed Guide to the urllib Library

Introduction

urllib is Python's built-in HTTP request library. The official documentation sums it up well:

urllib is a package that collects several modules for working with URLs:

  • urllib.request for opening and reading URLs
  • urllib.error containing the exceptions raised by urllib.request
  • urllib.parse for parsing URLs
  • urllib.robotparser for parsing robots.txt files

Usage differs between Python 2 and Python 3; the mapping below shows how the old names correspond to the new ones:

Python 2 → Python 3

urllib.urlretrieve() → urllib.request.urlretrieve()
urllib.urlcleanup() → urllib.request.urlcleanup()
urllib.quote() → urllib.parse.quote()
urllib.quote_plus() → urllib.parse.quote_plus()
urllib.unquote() → urllib.parse.unquote()
urllib.unquote_plus() → urllib.parse.unquote_plus()
urllib.urlencode() → urllib.parse.urlencode()
urllib.pathname2url() → urllib.request.pathname2url()
urllib.url2pathname() → urllib.request.url2pathname()
urllib.getproxies() → urllib.request.getproxies()
urllib.URLopener → urllib.request.URLopener
urllib.FancyURLopener → urllib.request.FancyURLopener
urllib.ContentTooShortError → urllib.error.ContentTooShortError
urllib2.urlopen() → urllib.request.urlopen()
urllib2.install_opener() → urllib.request.install_opener()
urllib2.build_opener() → urllib.request.build_opener()
urllib2.URLError → urllib.error.URLError
urllib2.HTTPError → urllib.error.HTTPError
urllib2.Request → urllib.request.Request
urllib2.OpenerDirector → urllib.request.OpenerDirector
urllib2.BaseHandler → urllib.request.BaseHandler
urllib2.HTTPDefaultErrorHandler → urllib.request.HTTPDefaultErrorHandler
urllib2.HTTPRedirectHandler → urllib.request.HTTPRedirectHandler
urllib2.HTTPCookieProcessor → urllib.request.HTTPCookieProcessor
urllib2.ProxyHandler → urllib.request.ProxyHandler
urllib2.HTTPPasswordMgr → urllib.request.HTTPPasswordMgr
urllib2.HTTPPasswordMgrWithDefaultRealm → urllib.request.HTTPPasswordMgrWithDefaultRealm
urllib2.AbstractBasicAuthHandler → urllib.request.AbstractBasicAuthHandler
urllib2.HTTPBasicAuthHandler → urllib.request.HTTPBasicAuthHandler
urllib2.ProxyBasicAuthHandler → urllib.request.ProxyBasicAuthHandler
urllib2.AbstractDigestAuthHandler → urllib.request.AbstractDigestAuthHandler
urllib2.HTTPDigestAuthHandler → urllib.request.HTTPDigestAuthHandler
urllib2.ProxyDigestAuthHandler → urllib.request.ProxyDigestAuthHandler
urllib2.HTTPHandler → urllib.request.HTTPHandler
urllib2.HTTPSHandler → urllib.request.HTTPSHandler
urllib2.FileHandler → urllib.request.FileHandler
urllib2.FTPHandler → urllib.request.FTPHandler
urllib2.CacheFTPHandler → urllib.request.CacheFTPHandler
urllib2.UnknownHandler → urllib.request.UnknownHandler

I am using Python 3.6 here.

urllib.request: the request module

The urlopen function

It belongs to the urllib.request module and has the following signature:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url is the address to request, data is optional extra data to send to the server, timeout sets a timeout in seconds, and the remaining parameters relate to certificates and SSL.
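
The cafile, capath, and context parameters come into play for HTTPS requests. As a minimal sketch (not from the original post), you can build a default SSL context from the system CA store and pass it explicitly; the URL and timeout here are just illustrative:

import ssl
import urllib.request

# Sketch: an explicit SSL context built from the system's default CA certificates.
context = ssl.create_default_context()
response = urllib.request.urlopen('https://httpbin.org/get', timeout=5, context=context)
print(response.status)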

In [1]: import urllib.request

In [2]: response = urllib.request.urlopen('http://www.baidu.com')

In [3]: print(response.read().decode('utf-8'))
<!DOCTYPE html>
<!--STATUS OK-->
...
...
</body>
</html>

In [4]:

https://httpbin.org/ is a site for testing HTTP requests. Let's test a POST request:

In [10]: import urllib.parse

In [11]: import urllib.request

In [12]: data = bytes(urllib.parse.urlencode({'name':'Jeff'}), encoding='utf8')

In [13]: response = urllib.request.urlopen('http://httpbin.org/post', data=data)

In [14]: print(response.read())
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "name": "Jeff"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "9", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.6"\n }, \n "json": null, \n "origin": "183.21.190.87", \n "url": "http://httpbin.org/post"\n}\n'

Here is the timeout parameter in use:

In [15]: import urllib.request

In [16]: response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)

In [17]: print(response.read())
b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.6"\n }, \n "origin": "183.21.190.87", \n "url": "http://httpbin.org/get"\n}\n'

In [18]: import socket

In [19]: import urllib.request

In [20]: import urllib.error

In [21]: try:
    ...:     response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    ...: except urllib.error.URLError as e:
    ...:     if isinstance(e.reason, socket.timeout):
    ...:         print('TIME OUT')
    ...:
TIME OUT

If the request takes longer than the configured timeout, an exception is raised.

Requests

You cannot attach headers to a request made directly with urlopen(); to add headers you need to build a Request object:

In [33]: from urllib import request, parse

In [34]: url = 'http://httpbin.org/post'

In [35]: headers = {
    ...:     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    ...:     'Host': 'httpbin.org'
    ...: }

In [36]: dict = {
    ...:     'name': 'Jeff'
    ...: }

In [37]: data = bytes(parse.urlencode(dict), encoding='utf8')

In [38]: req = request.Request(url=url, headers=headers, method='POST')

In [39]: response = request.urlopen(req)

In [40]: print(response.read().decode('utf-8'))
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
  },
  "json": null,
  "origin": "183.21.190.87",
  "url": "http://httpbin.org/post"
}

There is also an add_header method for adding headers to a Request:

req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
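
Note that in the session above the encoded data was never actually passed to Request, which is why the form in the response came back empty. A minimal sketch that submits it, reusing the same hypothetical form field and showing add_header on the same object:

from urllib import request, parse

# Sketch: pass the encoded form data to Request so it is sent as the POST body.
data = bytes(parse.urlencode({'name': 'Jeff'}), encoding='utf8')
req = request.Request('http://httpbin.org/post', data=data, method='POST')
# add_header() sets (or overrides) a single header on the Request.
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))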

Responses

Once you have the response, you can inspect some basic information about it:

In [22]: import urllib.request

In [23]: response = urllib.request.urlopen('https://www.baidu.com')

In [24]: print(type(response))
<class 'http.client.HTTPResponse'>

In [25]: print(response.status)
200

In [26]: print(response.getheader('Server'))
BWS/1.1

In [27]: print(response.getheaders())
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ('Content-Type', 'text/html'), ('Date', 'Sun, 25 Feb 2018 14:38:41 GMT'), ('Last-Modified', 'Sun, 11 Feb 2018 04:46:00 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'), ('Set-Cookie', 'BIDUPSID=B15F4542153F7063286F0B770AC1501D; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1519569521; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Strict-Transport-Security', 'max-age=0'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close')]
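
The response can also be used as a context manager so that it is closed automatically; a small sketch:

import urllib.request

# Sketch: 'with' closes the underlying connection when the block exits.
with urllib.request.urlopen('https://www.baidu.com') as response:
    print(response.status, response.reason)
    body = response.read().decode('utf-8')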

Using a proxy

A proxy disguises where the request comes from, so the target server sees it as originating from a different address or region, which helps keep a crawler from being blocked:

In [68]: import urllib.request

In [69]: proxy_handler = urllib.request.ProxyHandler({
    ...:     'http': 'http://113.121.242.122:30041',
    ...:     'https': 'https://113.121.242.122:30041'
    ...: })

In [70]: opener = urllib.request.build_opener(proxy_handler)

In [71]: response = opener.open('http://httpbin.org/get')

In [72]: print(response.read())
b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Cache-Control": "max-age=259200", \n "Connection": "close", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.6"\n }, \n "origin": "113.121.242.122", \n "url": "http://httpbin.org/get"\n}\n'

You can see that "origin": "113.121.242.122" is the proxy we configured.
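
If the proxy requires authentication, the credentials can be embedded in the proxy URL. A minimal sketch with a hypothetical host and hypothetical credentials:

import urllib.request

# Sketch: hypothetical authenticated proxy in user:password@host:port form.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())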

Cookies

Cookies are used to preserve our login state. The following code prints the cookies received when visiting Baidu:

In [73]: import http.cookiejar, urllib.request

In [74]: cookie = http.cookiejar.CookieJar()

In [75]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [76]: opener = urllib.request.build_opener(handler)

In [77]: response = opener.open('http://www.baidu.com')

In [78]: for item in cookie:
    ...:     print(item.name + '=' + item.value)
    ...:
BAIDUID=B158E71ED10979B2873EAD4F92F69BD8:FG=1
BIDUPSID=B158E71ED10979B2873EAD4F92F69BD8
H_PS_PSSID=1469_21112_18559_20930
PSTM=1519575810
BDSVRTM=0
BD_HOME=0

Cookies can also be saved to a file:

In [79]: import http.cookiejar, urllib.request

In [80]: filename = "cookie.txt"

In [81]: cookie = http.cookiejar.MozillaCookieJar(filename)

In [82]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [83]: opener = urllib.request.build_opener(handler)

In [84]: response = opener.open('http://www.baidu.com')

In [85]: cookie.save(ignore_discard=True, ignore_expires=True)

This is MozillaCookieJar's save format; opening cookie.txt shows:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.

.baidu.com TRUE / FALSE 3667060046 BAIDUID 241959CDD6CF8465BDF2D583B828249D:FG=1
.baidu.com TRUE / FALSE 3667060046 BIDUPSID 241959CDD6CF8465BDF2D583B828249D
.baidu.com TRUE / FALSE H_PS_PSSID 1428_21107_17001_20930
.baidu.com TRUE / FALSE 3667060046 PSTM 1519576400
www.baidu.com FALSE / FALSE BDSVRTM 0
www.baidu.com FALSE / FALSE BD_HOME 0

There is another save format, LWPCookieJar:

In [87]: import http.cookiejar, urllib.request

In [88]: filename = "cookie.txt"

In [89]: cookie = http.cookiejar.LWPCookieJar(filename)

In [90]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [91]: opener = urllib.request.build_opener(handler)

In [92]: response = opener.open('http://www.baidu.com')

In [93]: cookie.save(ignore_discard=True, ignore_expires=True)

Opening cookie.txt now shows:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="04A75DA71590BD4D6D9839D727E42ABB:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-03-15 20:26:18Z"; version=0
Set-Cookie3: BIDUPSID=04A75DA71590BD4D6D9839D727E42ABB; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-03-15 20:26:18Z"; version=0
Set-Cookie3: H_PS_PSSID=25641_1427_21103_22160; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1519578732; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-03-15 20:26:18Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Reading the saved cookies back:

In [94]: import http.cookiejar, urllib.request

In [95]: cookie = http.cookiejar.LWPCookieJar()

In [96]: cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

In [97]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [98]: opener = urllib.request.build_opener(handler)

In [99]: response = opener.open('http://www.baidu.com')

In [100]: print(response.read().decode('utf-8'))
<!DOCTYPE html>
<!--STATUS OK-->
...
...
</body>
</html>

The overall flow is to save the cookies and then load them for the next request; this is useful for pages that require a login.
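
If you want every plain urlopen() call to reuse these cookies without passing the opener around, you can install the cookie-aware opener globally. A small sketch, assuming cookie.txt was saved in LWP format as above:

import http.cookiejar
import urllib.request

# Sketch: load the saved cookies and make the cookie-aware opener the default,
# so plain urllib.request.urlopen() reuses the stored login state.
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)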

urllib.error: the exception-handling module

In [102]: from urllib import request, error

In [103]: try:
     ...:     response = request.urlopen('http://jeffyang.top/404.html')
     ...: except error.URLError as e:
     ...:     print(e.reason)
     ...:
Not Found

URLError has only a reason attribute, while HTTPError additionally has code and headers; see the official documentation for details.

In [104]: from urllib import request, error

In [105]: try:
     ...:     response = request.urlopen('http://jeffyang.top/404.html')
     ...: except error.HTTPError as e:
     ...:     print(e.reason, e.code, e.headers, sep='\n')
     ...: except error.URLError as e:
     ...:     print(e.reason)
     ...: else:
     ...:     print('Request Successfully!')
     ...:
Not Found
404
Server: GitHub.com
Content-Type: text/html; charset=utf-8
ETag: "5952c2dc-247c"
Access-Control-Allow-Origin: *
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
X-GitHub-Request-Id: EF30:11359:B2B55C:BCEF88:5A92F18B
Content-Length: 9340
Accept-Ranges: bytes
Date: Sun, 25 Feb 2018 17:31:09 GMT
Via: 1.1 varnish
Age: 338
Connection: close
X-Served-By: cache-hnd18736-HND
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1519579870.603620,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: 6603cd54ba10d110b51972d94037a3ecab5a6b17

We also used this earlier with the timeout example:

In [107]: import socket

In [108]: import urllib.request

In [109]: import urllib.error

In [110]: try:
     ...:     response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
     ...: except urllib.error.URLError as e:
     ...:     print(type(e.reason))
     ...:     if isinstance(e.reason, socket.timeout):
     ...:         print('TIME OUT')
     ...:
<class 'socket.timeout'>
TIME OUT

As you can see, e.reason is a <class 'socket.timeout'> object, so isinstance can be used to check it.
A note on the difference between isinstance() and type() (see the short sketch after this list):

  • type() does not treat a subclass instance as an instance of the parent class; it ignores inheritance.
  • isinstance() does treat a subclass instance as an instance of the parent class; it takes inheritance into account.
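
A short sketch of that difference:

# Sketch: a subclass instance counts as the parent type for isinstance(), not for type().
class Animal:
    pass

class Dog(Animal):
    pass

d = Dog()
print(type(d) == Animal)      # False: type() ignores inheritance
print(isinstance(d, Animal))  # True: isinstance() follows inheritance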

The urllib.parse module

The urlparse() function splits a URL into its components:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

For example:

In [111]: from urllib.parse import urlparse

In [112]: result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

In [113]: print(type(result), result)
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
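
ParseResult is a named tuple, so its components can be read by attribute or by index; a small sketch based on the result above:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
# Components are available both as attributes and by position.
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com
print(result.query)              # id=5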

The scheme parameter supplies a default protocol:

In [114]: from urllib.parse import urlparse

In [115]: result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')

In [116]: print(result)
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

If the URL already contains a scheme, this parameter has no effect:

In [117]: from urllib.parse import urlparse

In [118]: result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')

In [119]: print(result)
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The allow_fragments parameter controls how the fragment (anchor) is handled; when it is False, everything after the # is merged into the preceding component:

In [120]: from urllib.parse import urlparse

In [121]: result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)

In [122]: print(result)
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

In [123]: from urllib.parse import urlparse

In [124]: result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

In [125]: print(result)
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

The urlunparse() function does the opposite of urlparse(): it assembles a URL from its components:

In [126]: from urllib.parse import urlunparse

In [127]: data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']

In [128]: print(urlunparse(data))
http://www.baidu.com/index.html;user?a=6#comment

The urljoin() function joins URLs:

In [129]: from urllib.parse import urljoin

In [130]: print(urljoin('http://www.baidu.com', 'FAQ.html'))
http://www.baidu.com/FAQ.html

In [131]: print(urljoin('http://www.baidu.com', 'https://jeffyang.top/404.html'))
https://jeffyang.top/404.html

In [132]: print(urljoin('http://www.baidu.com?wd=abc', 'http://jeffyang.top/index.html'))
http://jeffyang.top/index.html

In [133]: print(urljoin('http://www.baidu.com', '?category=2#comment'))
http://www.baidu.com?category=2#comment

In [134]: print(urljoin('www.baidu.com#comment', '?category=2'))
www.baidu.com?category=2

Both URLs can be split into six components; when a component appears in both, the value from the second URL takes precedence.

urlencode() converts a dictionary into a GET query string:

In [136]: from urllib.parse import urlencode

In [137]: params = {
...: 'name': 'jeff',
...: 'age': 21
...: }

In [138]: base_url = 'http://www.baidu.com?'

In [139]: url = base_url + urlencode(params)

In [140]: print(url)
http://www.baidu.com?name=jeff&age=21
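
The same module also provides quote() and unquote() (they appear in the Python 2/3 mapping table above), which percent-encode and decode pieces of a URL; this is handy when parameters contain spaces or non-ASCII characters. A minimal sketch:

from urllib.parse import quote, unquote

# Sketch: percent-encode a string with a space and Chinese characters, then decode it back.
encoded = quote('python 爬虫')
print(encoded)           # python%20%E7%88%AC%E8%99%AB
print(unquote(encoded))  # python 爬虫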

urllib.robotparser: parsing robots.txt

This module parses robots.txt rules and determines whether a URL you want to crawl is allowed by the site's robots.txt file.
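
A minimal sketch of how it is typically used (the robots.txt URL here is just an example):

import urllib.robotparser

# Sketch: download and parse a site's robots.txt, then ask whether a URL may be crawled.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Python-urllib/3.6', 'https://www.baidu.com/s?wd=python'))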

This module is not used very often in practice; see the documentation for the details.

Conclusion

Detailed descriptions and usage can be found in the official documentation.