Scrapy has its own mechanism for extracting data, called selectors, which select a portion of an HTML document by means of an XPath or CSS expression.
XPath is a language for selecting nodes in XML documents; it can also be used on HTML.
CSS is a language for applying styles to HTML documents; it defines selectors that associate styles with particular HTML elements.
Commonly used XPath expressions:
nodename Selects all nodes with the name nodename
/ Selects from the root node
// Selects matching nodes anywhere in the document below the current node, regardless of their position
. Selects the current node
.. Selects the parent of the current node
@ Selects attributes
* Matches any element node
@* Matches any attribute node
node() Matches a node of any type
Commonly used CSS selectors:
.class .color Selects all elements with class="color"
#id #info Selects all elements with id="info"
* * Selects all elements
element p Selects all p elements
element,element div, p Selects all div elements and all p elements
element element div p Selects all p elements inside div elements
[attribute] [target] Selects all elements that have a target attribute
[attribute=value] [target=_blank] Selects all elements with target="_blank"
Task | XPath selector | CSS selector |
No match found | Returns None by default; the return value can be customized: response.xpath('//title/text()').extract_first(default='not-found') | same |
Extract the matching elements (returns a list) | .extract() method | same |
Extract the first matching element (returns a string) | .extract_first() method | same |
Get the text | response.xpath('//title/text()') | response.css('title::text') |
Get an attribute | response.xpath('//base/@href') | response.css('base::attr(href)') |
Get the href attribute of every a tag whose href contains "image" | response.xpath('//a[contains(@href, "image")]/@href') | response.css('a[href*=image]::attr(href)') |
Get an attribute of a tag nested inside another tag | response.xpath('//a[contains(@href, "image")]/img/@src') | response.css('a[href*=image] img::attr(src)') (note the space before img) |
Use with regular expressions: re() returns a list, re_first() returns the first matching string | response.xpath('//a/text()').re(r'Name:\s*(.*)') | response.css('a::text').re(r'Name:\s*(.*)') |