Scrapy Selectors in Detail

Introduction

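Scrapy's mechanism for extracting data is called selectors: they "select" parts of an HTML document using XPath or CSS expressions. For a detailed description of selectors, see the official Scrapy documentation.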

Usage

Scrapy provides a sample page for testing, https://doc.scrapy.org/en/latest/_static/selectors-sample1.html, with the following source:

<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>

Open a command line and run:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

This gives you a response variable to experiment with in the terminal, for example through response.selector.xpath() and response.selector.css():

In [3]: response.selector.xpath('//*[@id="images"]/a[2]')
Out[3]: [<Selector xpath='//*[@id="images"]/a[2]' data='<a href="image2.html">Name: My image 2 <'>]

In [4]: response.selector.xpath('//*[@id="images"]/a[2]/text()')
Out[4]: [<Selector xpath='//*[@id="images"]/a[2]/text()' data='Name: My image 2 '>]

In [5]: response.selector.xpath('//*[@id="images"]/a[2]/text()').extract()
Out[5]: ['Name: My image 2 ']

In [6]: response.selector.css('title::text').extract()
Out[6]: ['Example website']

extract() returns a list of the matched results as strings, while extract_first() returns only the first result.
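
The queries above can also be checked without the Scrapy shell. The following is a minimal sketch using only Python's standard library, not Scrapy's own API: xml.etree.ElementTree supports just a small XPath subset (no text() or @href steps), so text and attributes are read from the matched elements instead. It assumes the markup is well-formed XML, which happens to hold for this sample page but not for HTML in general. SAMPLE_HTML and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Local copy of the sample page; it parses as XML only because it is well-formed.
SAMPLE_HTML = """<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>"""

root = ET.fromstring(SAMPLE_HTML)

# Rough counterpart of //*[@id="images"]/a[2]/text():
# .text holds the text that precedes the element's first child.
second_link = root.find(".//*[@id='images']/a[2]")
second_text = second_link.text

# Rough counterpart of title::text.
title = root.findtext(".//title")

# Rough counterpart of //a/@href: read the attribute from each matched element.
hrefs = [a.get("href") for a in root.findall(".//a")]

print(second_text)
print(title)
print(hrefs)
```

For real-world pages, which are rarely well-formed XML, the Scrapy selectors shown in the transcripts remain the appropriate tool.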

For convenience, Scrapy also provides the shortcuts response.xpath() and response.css(). Both kinds of selector return the same type of selector list, so they can be chained:

In [7]: response.selector.xpath('//div[@id="images"]').css('img::attr(src)').extract()
Out[7]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

In [8]: response.selector.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()
Out[8]: 'image1_thumb.jpg'

extract_first() also accepts a default argument, which is returned when nothing matches:

In [11]: response.selector.xpath('//div[@id="images"]').css('img::attr(src2)').extract_first(default='test')
Out[11]: 'test'
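
The fallback behaviour can be pictured with a small plain-Python sketch. first_or_default is a hypothetical helper, not part of Scrapy; it only mimics what extract_first(default=...) is described as doing above, applied to a list of strings that has already been extracted.

```python
def first_or_default(values, default=None):
    """Return the first item of values, or default when values is empty."""
    for value in values:
        return value
    return default


# A non-empty result list, like the src values extracted earlier.
found = first_or_default(['image1_thumb.jpg', 'image2_thumb.jpg'], default='test')

# An empty result list, as when img::attr(src2) matches nothing.
missing = first_or_default([], default='test')

print(found)
print(missing)
```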

Selecting attributes:

In [14]: response.xpath('//a/@href').extract()
Out[14]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [15]: response.xpath('//a/@href')
Out[15]:
[<Selector xpath='//a/@href' data='image1.html'>,
<Selector xpath='//a/@href' data='image2.html'>,
<Selector xpath='//a/@href' data='image3.html'>,
<Selector xpath='//a/@href' data='image4.html'>,
<Selector xpath='//a/@href' data='image5.html'>]

In [16]: response.xpath('//a/@href').extract()
Out[16]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [17]: response.css('a::attr(href)')
Out[17]:
[<Selector xpath='descendant-or-self::a/@href' data='image1.html'>,
<Selector xpath='descendant-or-self::a/@href' data='image2.html'>,
<Selector xpath='descendant-or-self::a/@href' data='image3.html'>,
<Selector xpath='descendant-or-self::a/@href' data='image4.html'>,
<Selector xpath='descendant-or-self::a/@href' data='image5.html'>]

In [18]: response.css('a::attr(href)').extract()
Out[18]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selecting text:

In [22]: response.xpath('//a/text()')
Out[22]:
[<Selector xpath='//a/text()' data='Name: My image 1 '>,
<Selector xpath='//a/text()' data='Name: My image 2 '>,
<Selector xpath='//a/text()' data='Name: My image 3 '>,
<Selector xpath='//a/text()' data='Name: My image 4 '>,
<Selector xpath='//a/text()' data='Name: My image 5 '>]

In [23]: response.xpath('//a/text()').extract()
Out[23]:
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']

In [24]: response.css('a::text')
Out[24]:
[<Selector xpath='descendant-or-self::a/text()' data='Name: My image 1 '>,
<Selector xpath='descendant-or-self::a/text()' data='Name: My image 2 '>,
<Selector xpath='descendant-or-self::a/text()' data='Name: My image 3 '>,
<Selector xpath='descendant-or-self::a/text()' data='Name: My image 4 '>,
<Selector xpath='descendant-or-self::a/text()' data='Name: My image 5 '>]

In [25]: response.css('a::text').extract()
Out[25]:
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']

Selecting links whose href attribute contains image:

In [27]: response.xpath('//a[contains(@href, "image")]/@href')
Out[27]:
[<Selector xpath='//a[contains(@href, "image")]/@href' data='image1.html'>,
<Selector xpath='//a[contains(@href, "image")]/@href' data='image2.html'>,
<Selector xpath='//a[contains(@href, "image")]/@href' data='image3.html'>,
<Selector xpath='//a[contains(@href, "image")]/@href' data='image4.html'>,
<Selector xpath='//a[contains(@href, "image")]/@href' data='image5.html'>]

In [28]: response.xpath('//a[contains(@href, "image")]/@href').extract()
Out[28]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [29]: response.css('a[href*=image]::attr(href)')
Out[29]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]

In [30]: response.css('a[href*=image]::attr(href)').extract()
Out[30]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selecting the src of the <img> inside each <a> whose href contains image:

In [31]: response.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[31]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

In [32]: response.css('a[href*=image] img::attr(src)').extract()
Out[32]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

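Selectors can also be combined with regular expressions through re(). Note that re() returns a list of strings rather than selector objects, so its result cannot be chained further. There is also re_first(), analogous to extract_first(), which returns the first match.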

For example, to get the content after Name: in the <a> text:

In [33]: response.css('a::text').extract()
Out[33]:
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']

In [34]: response.css('a::text').re(r'Name:(.*)')
Out[34]:
[' My image 1 ',
' My image 2 ',
' My image 3 ',
' My image 4 ',
' My image 5 ']

In [35]: response.css('a::text').re_first(r'Name:(.*)')
Out[35]: ' My image 1 '

In [36]: response.css('a::text').re_first(r'Name:(.*)').strip()
Out[36]: 'My image 1'
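
To see what the pattern does on its own, the same extraction can be sketched with Python's standard re module on the plain strings returned by extract(). This is only an approximation of re() and re_first(), under the assumption that the pattern has a single capturing group; regex_all and regex_first are hypothetical helper names, not Scrapy functions.

```python
import re

# The strings that response.css('a::text').extract() returned above.
texts = [
    'Name: My image 1 ',
    'Name: My image 2 ',
    'Name: My image 3 ',
    'Name: My image 4 ',
    'Name: My image 5 ',
]


def regex_all(pattern, strings):
    """Collect the captured group from every match in every string, in order."""
    return [match for text in strings for match in re.findall(pattern, text)]


def regex_first(pattern, strings, default=None):
    """Return the first captured match, or default when nothing matches."""
    matches = regex_all(pattern, strings)
    return matches[0] if matches else default


names = regex_all(r'Name:(.*)', texts)
first_name = regex_first(r'Name:(.*)', texts).strip()

print(names)
print(first_name)
```

The colon has no special meaning in a regular expression, so it needs no backslash, and a raw string (r'...') keeps Python from interpreting any backslashes the pattern does contain.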

Tips

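Besides debugging JavaScript, the Console panel in Chrome's developer tools can also evaluate XPath and CSS selectors, so you can test selectors there and check them against the page directly. Just note that Scrapy-specific syntax such as a::text is not supported there.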

For XPath selectors, use $x():

$x('//*[@id="images"]/a[1]')

For CSS selectors, use $$():

$$('#images > a:nth-child(1)')

Also, pressing Esc in the Elements panel opens the console right there, and right-clicking a selected element and choosing Copy -> Copy XPath or Copy -> Copy selector copies that element's selector.