Scrapy response.css学习

摘自官网

you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()
['Quotes to Scrape']
There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .extract() is a list, because we’re dealing with an instance of SelectorList. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
As an alternative, you could’ve written:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
However, using .extract_first() avoids an IndexError and returns None when it doesn’t find any element matching the selection.

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

示例

for href in response.css('a::attr(href)').extract():
- 注意:用css选择器时,response返回的是带标签的,使用::text等可以选择属性或文本

猜你喜欢

转载自blog.csdn.net/sspmii/article/details/81459375