requests in Detail: a Staple Library for Python Crawlers

Introduction

The official description reads:

Requests: HTTP for Humans
Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

Warning: non-professional use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.

Requests is developed around the aphorisms of PEP 20:

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Readability counts.

In short, it is a simple and convenient HTTP library implemented in Python.

Making Requests

GET Requests

In [1]: import requests

In [2]: response = requests.get('https://www.baidu.com/')

In [3]: print(type(response))
<class 'requests.models.Response'>

In [4]: print(response.status_code)
200

In [5]: print(type(response.text))
<class 'str'>

In [6]: print(response.cookies)
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

In [7]: print(response.text)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge
...
...
</body> </html>

response.status_code returns the status code and response.text the page source, much like read() in urllib but with no decode() step needed.
response.cookies exposes the cookies directly, with no extra module to import.

GET requests with parameters:

In [14]: import requests

In [15]: response = requests.get("http://httpbin.org/get?name=jeff&age=21")

In [16]: print(response.text)
{
"args": {
"age": "21",
"name": "jeff"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "183.21.190.87",
"url": "http://httpbin.org/get?name=jeff&age=21"
}

You can also pass them through the params argument:

In [17]: import requests

In [18]: data = {
...: 'name': 'jeff',
...: 'age': 21
...: }

In [19]: response = requests.get("http://httpbin.org/get", params=data)

In [20]: print(response.text)
{
"args": {
"age": "21",
"name": "jeff"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "183.21.190.87",
"url": "http://httpbin.org/get?name=jeff&age=21"
}
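Under the hood, the params dict is URL-encoded into the query string. The same encoding can be sketched with the standard library's urllib.parse, shown here only to illustrate what requests does for you (the library's actual internals differ):

```python
from urllib.parse import urlencode

# The same dict as above; requests turns it into the query string for you.
data = {'name': 'jeff', 'age': 21}

query = urlencode(data)  # values are stringified and percent-escaped
url = "http://httpbin.org/get?" + query
print(url)  # http://httpbin.org/get?name=jeff&age=21
```

Note that the integer 21 comes back as the string "21" in the httpbin echo above for the same reason: query strings carry only text.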

Parsing JSON

In [21]: import requests

In [22]: response = requests.get("http://httpbin.org/get")

In [23]: print(type(response.text))
<class 'str'>

In [24]: print(response.json())
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '183.21.190.87', 'url': 'http://httpbin.org/get'}

In [25]: print(type(response.json()))
<class 'dict'>

In [26]: import json

In [27]: print(json.loads(response.text))
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '183.21.190.87', 'url': 'http://httpbin.org/get'}

As shown, calling response.json() directly and calling json.loads(response.text) produce the same result.
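One caveat: if the body is not valid JSON (an HTML error page, say), decoding raises an exception, so it can pay to guard the call. A minimal offline sketch with the standard json module (the helper name is just for illustration):

```python
import json

def parse_json_safely(text):
    """Return the decoded object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(parse_json_safely('{"args": {}}'))        # {'args': {}}
print(parse_json_safely('<html>error</html>'))  # None
```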

Fetching Binary Data

In [33]: import requests

In [34]: response = requests.get("https://github.com/favicon.ico")

In [35]: print(type(response.text), type(response.content))
<class 'str'> <class 'bytes'>

In [36]: print(response.content)
b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x01\x00 \x00(\x05\x00\x00&\x00\x00\x00 \x00\x00\x01\x00 \x00(\x14\x00\x00N\x05\x00\x00(\x00\x00\x00\x10\x00\x00\x00 \x00\x00\x00\x01\x00 \x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11\x11\x13v\x13\x13\x13\xc5\x0e\x0e\x0e\x12\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x0f\x0f\x11\x11\x11\x14\xb1\x13\x13\x13i\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x14\x14\x96\x13\x13\x14\xfc\x13\x13\x14\xed\x00\x00\x00\x19\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18\x15\x15\x17\xff\x15\x15\x17\xff\x11\x11\x13\x85\x00\x00\x00
...
...

Saving it to disk:

In [37]: import requests

In [38]: response = requests.get("https://github.com/favicon.ico")

In [39]: with open('favicon.ico', 'wb') as f:
    ...:     f.write(response.content)
...:
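For large files, response.content loads the entire body into memory. A gentler pattern uses requests' documented stream=True and iter_content() to write the body chunk by chunk; the file-writing part is factored out below so it works with any iterable of byte chunks:

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to disk without holding them all in memory."""
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)

# Streaming download with requests (network required):
# import requests
# response = requests.get("https://github.com/favicon.ico", stream=True)
# save_chunks(response.iter_content(chunk_size=1024), 'favicon.ico')
```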

Adding Headers

Zhihu, for example, returns a 500 when requested without headers:

In [40]: import requests

In [41]: response = requests.get("https://www.zhihu.com")

In [42]: print(response.text)
<html><body><h1>500 Server Error</h1>
An internal server error occured.
</body></html>

Using the headers argument:

In [45]: import requests

In [46]: headers = {
    ...:     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    ...: }

In [47]: response = requests.get("https://www.zhihu.com", headers=headers)

In [48]: print(response.text)
<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charset="utf-8"/><title data-react-helmet="true">知 乎 - 发现更大的世界</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/>
...
...
</body></html>

POST Requests

In [52]: import requests

In [53]: data = {'name': 'jeff', 'age': '21'}

In [54]: response = requests.post("http://httpbin.org/post", data=data)

In [55]: print(response.text)
{
"args": {},
"data": "",
"files": {},
"form": {
"age": "21",
"name": "jeff"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "16",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"json": null,
"origin": "183.21.190.87",
"url": "http://httpbin.org/post"
}

With headers:

In [56]: import requests

In [57]: data = {'name': 'jeff', 'age': '21'}

In [58]: headers = {
    ...:     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    ...: }

In [59]: response = requests.post("http://httpbin.org/post", data=data, headers=headers)

In [60]: print(response.json())
{'args': {}, 'data': '', 'files': {}, 'form': {'age': '21', 'name': 'jeff'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '16', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}, 'json': None, 'origin': '183.21.190.87', 'url': 'http://httpbin.org/post'}

Other Request Types

import requests
requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

Responses

Response Attributes

In [62]: import requests

In [63]: response = requests.get('http://jeffyang.top/')

In [64]: print(type(response.status_code), response.status_code)
<class 'int'> 200

In [65]: print(type(response.headers), response.headers)
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Sat, 24 Feb 2018 12:56:30 GMT', 'Access-Control-Allow-Origin': '*', 'Expires': 'Mon, 26 Feb 2018 09:14:14 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'BB66:6BA9:E59020:F12729:5A93CD8D', 'Content-Length': '12829', 'Accept-Ranges': 'bytes', 'Date': 'Mon, 26 Feb 2018 09:58:00 GMT', 'Via': '1.1 varnish', 'Age': '14', 'Connection': 'keep-alive', 'X-Served-By': 'cache-hnd18737-HND', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1519639081.677214,VS0,VE0', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '38d73d44dc2f717cdb15f5a317f4ae28b5761489'}

In [66]: print(type(response.cookies), response.cookies)
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>

In [67]: print(type(response.url), response.url)
<class 'str'> http://jeffyang.top/

In [68]: print(type(response.history), response.history)
<class 'list'> []

Checking the Status Code

In [69]: import requests

In [70]: response = requests.get('http://jeffyang.top/')

In [71]: exit() if not response.status_code == 200 else print('Request Successfully')
Request Successfully
In [72]: import requests

In [73]: response = requests.get('http://jeffyang.top/404.html')

In [74]: exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')
404 Not Found

As shown, you can compare against the numeric code or against its named alias, e.g. response.status_code == 200 or response.status_code == requests.codes.not_found.
The full mapping of codes to names:

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),
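The names above are requests' own aliases. For comparison, the standard library exposes a similar mapping via http.HTTPStatus, which is handy for turning a numeric code into its reason phrase without requests installed:

```python
from http import HTTPStatus

print(HTTPStatus(404).phrase)       # Not Found
print(HTTPStatus(200).name)         # OK
print(HTTPStatus.NOT_FOUND == 404)  # True: members are IntEnum, comparable to ints
```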

Advanced Usage

File Upload

In [75]: import requests

In [76]: files = {'ico': open('favicon.ico', 'rb')}

In [77]: response = requests.post("http://httpbin.org/post", files=files)

In [78]: print(response.text)
{
"args": {},
"data": "",
"files": {
"ico": "data:application/octet-stream;base64,AAABAAIAEB... ...AAAAAAAA="
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"Content-Length": "6664",
"Content-Type": "multipart/form-data; boundary=e98882dde92244bf8e7fdaa6f03255fc",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"json": null,
"origin": "183.21.190.87",
"url": "http://httpbin.org/post"
}

The key 'ico' here can be any name you like.

Getting Cookies

In [79]: import requests

In [80]: response = requests.get("https://www.baidu.com")

In [81]: print(response.cookies)
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

In [82]: for key, value in response.cookies.items():
    ...:     print(key + '=' + value)
    ...:
BDORZ=27315

With requests there is no need for urllib's CookieJar, handler, and opener machinery.

Session Persistence

Useful for simulating a logged-in session:

In [83]: import requests

In [84]: requests.get('http://httpbin.org/cookies/set/number/123456789')
Out[84]: <Response [200]>

In [85]: response = requests.get('http://httpbin.org/cookies')

In [86]: print(response.text)
{
"cookies": {}
}

http://httpbin.org/cookies is a test endpoint for cookies. The number cookie set above does not appear in the output because the two get() calls are independent operations with nothing linking them. For this, requests provides the Session object:

In [87]: import requests

In [88]: s = requests.Session()

In [89]: s.get('http://httpbin.org/cookies/set/number/123456789')
Out[89]: <Response [200]>

In [90]: response = s.get('http://httpbin.org/cookies')

In [91]: print(response.text)
{
"cookies": {
"number": "123456789"
}
}
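Besides cookies, a Session also carries defaults that apply to every request made through it, so common headers can be set once. A small sketch (no network needed to see the merged defaults; the 'my-crawler/1.0' string is just an illustrative value):

```python
import requests

s = requests.Session()
# Headers set on the session are sent with every subsequent request.
s.headers.update({'User-Agent': 'my-crawler/1.0'})

print(s.headers['User-Agent'])   # my-crawler/1.0
# Lookup is case-insensitive (session headers are a CaseInsensitiveDict):
print(s.headers['user-agent'])   # my-crawler/1.0
```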

Certificate Verification

Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default, and Requests raises an SSLError when verification fails.

In [92]: import requests

In [93]: response = requests.get('https://www.12306.cn')
---------------------------------------------------------------------------
Error Traceback (most recent call last)
d:\program files\python3.6.1\lib\site-packages\urllib3\contrib\pyopenssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname)
...
...
SSLError: HTTPSConnectionPool(host='www.12306.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))

This can be bypassed by setting verify=False:

In [94]: import requests

In [95]: response = requests.get('https://www.12306.cn', verify=False)
d:\program files\python3.6.1\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)

In [96]: print(response.status_code)
200

Although the status code is 200, a warning still reminds us that "Adding certificate verification is strongly advised." It can be suppressed with urllib3.disable_warnings():

In [97]: import requests

In [98]: from requests.packages import urllib3

In [99]: urllib3.disable_warnings()

In [100]: response = requests.get('https://www.12306.cn', verify=False)

In [101]: print(response.status_code)
200

You can also supply a local certificate as the client-side certificate, either a single file (containing key and certificate) or a tuple of both files' paths:

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))

The documentation also mentions other usages:

You can pass verify the path to a CA_BUNDLE file, or to a directory containing trusted CA certificates:

requests.get('https://github.com', verify='/path/to/certfile')

Or keep it on a session:

s = requests.Session()
s.verify = '/path/to/certfile'

See the documentation for more usage details.

Proxy Settings

In [105]: import requests

In [106]: proxies = {
...: "http": "http://183.145.201.137:28500",
...: "https": "https://183.145.201.137:28500",
...: }

In [107]: response = requests.get("http://httpbin.org/get", proxies=proxies)

In [108]: print(response.text)
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "max-age=259200",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "183.145.201.137",
"url": "http://httpbin.org/get"
}

If the proxy requires a username and password, write it like this:

In [109]: import requests

In [110]: proxies = {
...: "http": "http://user:password@183.145.201.137:28500/",
...: }

In [111]: response = requests.get("http://httpbin.org/get", proxies=proxies)

In [112]: print(response.text)
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "max-age=259200",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.18.4"
},
"origin": "183.145.201.137",
"url": "http://httpbin.org/get"
}
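One gotcha with credentialed proxy URLs: if the username or password contains characters like @ or :, they must be percent-encoded before being embedded, or the URL parses incorrectly. The standard library's urllib.parse.quote handles that (the credentials below are made up for illustration):

```python
from urllib.parse import quote

user = 'user'
password = 'p@ss:word'  # characters that would break the URL if left raw

# safe='' forces reserved characters such as ':' and '@' to be escaped too
proxy_url = 'http://{}:{}@183.145.201.137:28500/'.format(
    quote(user, safe=''), quote(password, safe=''))
print(proxy_url)  # http://user:p%40ss%3Aword@183.145.201.137:28500/
```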

Besides basic HTTP proxies, Requests also supports proxies using the SOCKS protocol. This is an optional feature that requires a third-party library:

pip install requests[socks]

Usage:

proxies = {
'http': 'socks5://user:pass@host:port',
'https': 'socks5://user:pass@host:port'
}

Timeouts

In [133]: import requests

In [134]: from requests.exceptions import ConnectTimeout

In [135]: try:
     ...:     response = requests.get("http://httpbin.org/get", timeout=0.2)
     ...:     print(response.status_code)
     ...: except ConnectTimeout:
     ...:     print('ConnectTimeout')
     ...:
ConnectTimeout

This timeout value applies to both the connect and the read timeouts. To specify them separately, pass a tuple:

response = requests.get('https://github.com', timeout=(3, 2))

Passing timeout=None makes requests wait forever.
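Timeouts pair naturally with retries: give each attempt a short timeout and try a few times before giving up. A minimal retry wrapper is sketched below; the helper name, attempt count, and delay are illustrative, not part of requests' API:

```python
import time

def fetch_with_retries(fetch, attempts=3, delay=0.1):
    """Call fetch() up to `attempts` times, re-raising the last error on failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as e:  # with requests, catch requests.exceptions.Timeout instead
            last_error = e
            time.sleep(delay)
    raise last_error

# Usage with requests (network required):
# import requests
# result = fetch_with_retries(
#     lambda: requests.get('https://github.com', timeout=(3, 2)))
```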

Authentication

In [145]: import requests

In [146]: from requests.auth import HTTPBasicAuth

In [147]: response = requests.get('https://httpbin.org/basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'w'))

In [148]: print(response.status_code)
401

The above uses a wrong password; below is the result of a request with the correct one:

In [150]: import requests

In [151]: from requests.auth import HTTPBasicAuth

In [152]: response = requests.get('https://httpbin.org/basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'passwd'))

In [153]: print(response.status_code)
200

In [154]: print(response.text)
{
"authenticated": true,
"user": "user"
}
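Behind the scenes, HTTP Basic Auth is just an Authorization header: the literal "Basic " followed by the base64 of "user:password". The construction can be reproduced with the standard library (the helper name is illustrative):

```python
import base64

def basic_auth_header(user, password):
    """Build the value of the Authorization header for HTTP Basic Auth."""
    token = base64.b64encode('{}:{}'.format(user, password).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')

print(basic_auth_header('user', 'passwd'))  # Basic dXNlcjpwYXNzd2Q=
```

As a shortcut, requests also accepts a plain tuple, auth=('user', 'passwd'), which is equivalent to HTTPBasicAuth.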

Exception Handling

When handling exceptions, catch subclass exceptions first, then their parent classes:

In [155]: import requests

In [156]: from requests.exceptions import ReadTimeout, ConnectionError, RequestException

In [157]: try:
     ...:     response = requests.get("http://httpbin.org/get", timeout=0.5)
     ...:     print(response.status_code)
     ...: except ReadTimeout:
     ...:     print('Timeout')
     ...: except ConnectionError:
     ...:     print('Connection error')
     ...: except RequestException:
     ...:     print('Error')
     ...:
200
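The order matters because these exceptions form a hierarchy: ReadTimeout subclasses Timeout, ConnectTimeout subclasses both ConnectionError and Timeout, and everything descends from RequestException, the common base for errors requests raises. Catching RequestException first would swallow the more specific cases. This can be checked directly:

```python
from requests.exceptions import (
    ReadTimeout, ConnectTimeout, Timeout, ConnectionError, RequestException)

print(issubclass(ReadTimeout, Timeout))               # True
print(issubclass(ConnectTimeout, ConnectionError))    # True
print(issubclass(Timeout, RequestException))          # True
print(issubclass(ConnectionError, RequestException))  # True
```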

More on exceptions can be found in the documentation.

Conclusion

For detailed explanations and usage, see the official documentation.