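A Detailed Guide to pyquery, a Common Library for Python Web Scraping

Introduction

pyquery is a powerful library for parsing web pages. If you are familiar with jQuery, pyquery will feel easy to use.

The official description reads: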

pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.

Initialization

Initializing from a string

The import is conventionally written as: from pyquery import PyQuery as pq

In [1]: from pyquery import PyQuery as pq

In [2]: html = '''
...: <div>
...: <ul>
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: '''

In [3]: doc = pq(html)

In [4]: print(doc('li'))
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

After initializing the object with pq(html), you can use selectors just as in jQuery.

Initializing from a URL

In [5]: from pyquery import PyQuery as pq

In [6]: doc = pq(url='http://www.baidu.com')

In [7]: print(doc('head'))
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>

Initializing from a file

In [8]: from pyquery import PyQuery as pq

In [9]: doc = pq(filename='demo.html')

In [10]: print(doc('li'))
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

This, of course, requires a demo.html file in the current directory.

Basic CSS selectors

In [11]: from pyquery import PyQuery as pq

In [12]: html = '''
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: '''

In [13]: doc = pq(html)

In [14]: print(doc('#container .list li'))
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

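doc('#container .list li') looks for li tags inside elements with class list that are themselves inside the element with id container. The elements do not need to be direct parent and child; any ancestor-descendant relationship is enough.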

Finding elements

Child elements

In [15]: from pyquery import PyQuery as pq

In [16]: html = '''
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: '''

In [17]: doc = pq(html)

In [18]: items = doc('.list')

In [19]: print(type(items))
<class 'pyquery.pyquery.PyQuery'>

In [20]: print(items)
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>


In [21]: lis = items.find('li')

In [22]: print(type(lis))
<class 'pyquery.pyquery.PyQuery'>

In [23]: print(lis)
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

find() returns all matching descendant elements. As shown, the results are all PyQuery objects.

There is also children(), which returns only direct child elements:

In [24]: lis = items.children()

In [25]: print(type(lis))
<class 'pyquery.pyquery.PyQuery'>

In [26]: print(lis)
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

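Here the output happens to be the same as above, because every li is a direct child of the ul.
children() also accepts a selector, for example to find child elements with class active: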

In [27]: lis = items.children('.active')

In [28]: print(lis)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

Parent elements

In [29]: from pyquery import PyQuery as pq

In [30]: html = '''
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: '''

In [31]: doc = pq(html)

In [32]: items = doc('.list')

In [33]: container = items.parent()

In [34]: print(type(container))
<class 'pyquery.pyquery.PyQuery'>

In [35]: print(container)
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

parent() returns the direct parent element, of which there is only one. There is also parents(), which returns all ancestor elements:

In [36]: from pyquery import PyQuery as pq

In [37]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [38]: doc = pq(html)

In [39]: items = doc('.list')

In [40]: parents = items.parents()

In [41]: print(type(parents))
<class 'pyquery.pyquery.PyQuery'>

In [42]: print(parents)
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div><div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

parents() also accepts a CSS selector to filter the result:

In [43]: parent = items.parents('.wrap')

In [44]: print(parent)
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>

Sibling elements

In [45]: from pyquery import PyQuery as pq

In [46]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [47]: doc = pq(html)

In [48]: li = doc('.list .item-0.active')

In [49]: print(li.siblings())
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

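Here doc('.list .item-0.active') searches inside the element with class list for elements whose class contains both item-0 and active. Note that there is no space inside .item-0.active: the two class selectors are chained, so both must apply to the same element. Only one element matches, <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>, and siblings() then returns its sibling nodes.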

siblings() also accepts a selector:

In [50]: from pyquery import PyQuery as pq

In [51]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [52]: doc = pq(html)

In [53]: li = doc('.list .item-0.active')

In [54]: print(li.siblings('.active'))
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

Iteration

In [55]: from pyquery import PyQuery as pq

In [56]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [57]: doc = pq(html)

In [58]: lis = doc('li').items()

In [59]: print(type(lis))
<class 'generator'>

In [60]: for li in lis:
...:     print(li)
...:
<li class="item-0">first item</li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

Calling .items() turns the result into a generator, which can then be traversed with a for loop.

Getting information

Getting attributes

In [61]: from pyquery import PyQuery as pq

In [62]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [63]: doc = pq(html)

In [64]: a = doc('.item-0.active a')

In [65]: print(a)
<a href="link3.html"><span class="bold">third item</span></a>

In [66]: print(a.attr('href'))
link3.html

In [67]: print(a.attr.href)
link3.html

An attribute can be read either with .attr('href') or with attribute-style access, .attr.href.

Getting text

Call .text() to get the text:

In [68]: from pyquery import PyQuery as pq

In [69]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [70]: doc = pq(html)

In [71]: a = doc('.item-0.active a')

In [72]: print(a)
<a href="link3.html"><span class="bold">third item</span></a>

In [73]: print(a.text())
third item

Getting HTML

.html() returns the HTML contained in the selected tag:

In [74]: from pyquery import PyQuery as pq

In [75]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [76]: doc = pq(html)

In [77]: li = doc('.item-0.active')

In [78]: print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>


In [79]: print(li.html())
<a href="link3.html"><span class="bold">third item</span></a>

DOM manipulation

addClass(), removeClass()

Add or remove a class:

In [80]: from pyquery import PyQuery as pq

In [81]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [82]: doc = pq(html)

In [83]: li = doc('.item-0.active')

In [84]: print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>


In [85]: li.removeClass('active')
Out[85]: [<li.item-0>]

In [86]: print(li)
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>


In [87]: li.addClass('active')
Out[87]: [<li.item-0.active>]

In [88]: print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

attr(), css()

Modify an attribute or a CSS style:

In [89]: from pyquery import PyQuery as pq

In [90]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [91]: doc = pq(html)

In [92]: li = doc('.item-0.active')

In [93]: print(li)
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>


In [94]: li.attr('name', 'link')
Out[94]: [<li.item-0.active>]

In [95]: print(li)
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>


In [96]: li.css('font-size', '14px')
Out[96]: [<li.item-0.active>]

In [97]: print(li)
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

remove()

Remove the specified element:

In [98]: from pyquery import PyQuery as pq

In [99]: html = '''
...: <div class="wrap">
...: Hello, World
...: <p>This is a paragraph.</p>
...: </div>
...: '''

In [100]: doc = pq(html)

In [101]: wrap = doc('.wrap')

In [102]: print(wrap.text())
Hello, World
This is a paragraph.

In [103]: wrap.find('p').remove()
Out[103]: [<p>]

In [104]: print(wrap.text())
Hello, World

Other DOM methods

For other DOM methods, refer to the documentation.

Pseudo-class selectors

In [105]: from pyquery import PyQuery as pq

In [106]: html = '''
...: <div class="wrap">
...: <div id="container">
...: <ul class="list">
...: <li class="item-0">first item</li>
...: <li class="item-1"><a href="link2.html">second item</a></li>
...: <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
...: <li class="item-1 active"><a href="link4.html">fourth item</a></li>
...: <li class="item-0"><a href="link5.html">fifth item</a></li>
...: </ul>
...: </div>
...: </div>
...: '''

In [107]: doc = pq(html)

In [108]: li = doc('li:first-child')

In [109]: print(li)
<li class="item-0">first item</li>


In [110]: li = doc('li:last-child')

In [111]: print(li)
<li class="item-0"><a href="link5.html">fifth item</a></li>


In [112]: li = doc('li:nth-child(2)')

In [113]: print(li)
<li class="item-1"><a href="link2.html">second item</a></li>


In [114]: li = doc('li:gt(2)')

In [115]: print(li)
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>


In [116]: li = doc('li:nth-child(2n)')

In [117]: print(li)
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>


In [118]: li = doc('li:contains(second)')

In [119]: print(li)
<li class="item-1"><a href="link2.html">second item</a></li>

Conclusion

For more on CSS selectors, see w3school; for more on pyquery, see the official documentation.