Using XPath:
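Common matching rules:
/  | Selects direct children of the current node (at the start of an expression, selection starts from the document root)
// | Selects descendants of the current node at any depth
.  | Selects the current node
.. | Selects the parent of the current node
@  | Selects an attribute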
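As a rough illustration of the rules above, the sketch below applies each one to a small made-up HTML fragment with lxml. The fragment and the variable names (`html`, `root`, `link`) are assumptions for this demo only.

```python
from lxml import etree

# Hypothetical fragment, used only for this demo.
html = '<div id="box"><p><a class="du" href="http://www.baidu.com">百度</a></p></div>'
root = etree.HTML(html)  # parses the string and returns the root <html> element

# "//" searches descendants at any depth.
link = root.xpath('//a')[0]

# "/" steps to direct children only: <a> is a grandchild of <div>, not a child.
direct_children = root.xpath('//div/a')
via_p = root.xpath('//div/p/a')

# "." is the current node, ".." is its parent.
same_node = link.xpath('.')[0]
parent_tag = link.xpath('..')[0].tag

# "@" selects an attribute value.
href = link.xpath('./@href')

print(direct_children, len(via_p), same_node.tag, parent_tag, href)
```

Note that `//div/a` comes back empty here because `/` only descends one level, while `//div/p/a` or `//div//a` would reach the link.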
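Getting an attribute:
from lxml import etree

html = '<div><a class="du" href="http://www.baidu.com">百度</a></div>'
root = etree.HTML(html)  # parse the string; returns the root element
# Select the href attribute of every <a> whose class is exactly "du"
result = root.xpath('//a[@class="du"]/@href')
print(result)  # ['http://www.baidu.com']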
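Getting text:
from lxml import etree

html = '<div><a class="du" href="http://www.baidu.com">百度</a></div>'
root = etree.HTML(html)  # parse the string; returns the root element
# text() selects the text nodes directly inside the matched <a>
result = root.xpath('//a[@class="du"]/text()')
print(result)  # ['百度']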
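One detail worth knowing when getting text: `text()` only returns text nodes that are direct children of the matched element. The sketch below, which assumes a made-up fragment with a nested `<b>` tag, compares `/text()`, `//text()`, and the XPath `string()` function.

```python
from lxml import etree

# Hypothetical fragment: the link text is split across a nested <b> element.
html = '<div><a class="du" href="http://www.baidu.com">百度<b>搜索</b></a></div>'
root = etree.HTML(html)

# Direct text children of <a> only.
direct = root.xpath('//a[@class="du"]/text()')

# All text nodes under <a>, at any depth, as a list.
nested = root.xpath('//a[@class="du"]//text()')

# string() concatenates all descendant text into a single string.
joined = root.xpath('string(//a[@class="du"])')

print(direct)  # ['百度']
print(nested)  # ['百度', '搜索']
print(joined)  # 百度搜索
```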
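Matching an attribute that has multiple values:
from lxml import etree

html = '<div><a class="du baidu" href="http://www.baidu.com">百度</a></div>'
root = etree.HTML(html)  # parse the string; returns the root element
# @class="du" would not match here, because the full value is "du baidu".
# contains() tests whether the attribute value contains the given substring.
result = root.xpath('//a[contains(@class,"du")]/text()')
print(result)  # ['百度']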
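Because `contains()` is a plain substring test, `contains(@class,"du")` also matches a class such as `baidu`. A common XPath 1.0 workaround, sketched below on a made-up fragment, pads the attribute value with spaces and searches for the space-delimited token instead.

```python
from lxml import etree

# Hypothetical fragment: only the first link actually has the class token "du".
html = '<div><a class="du baidu">A</a><a class="baidu">B</a></div>'
root = etree.HTML(html)

# Substring test: "baidu" contains "du", so both links match.
loose = root.xpath('//a[contains(@class, "du")]/text()')

# Token test: wrap the normalized class value in spaces and look for " du ".
strict = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " du ")]/text()'
)

print(loose)   # ['A', 'B']
print(strict)  # ['A']
```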
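Matching on multiple attributes:
from lxml import etree

html = '<div><a name="item" class="du baidu" href="http://www.baidu.com">百度</a></div>'
root = etree.HTML(html)  # parse the string; returns the root element
# Combine conditions with "and": class contains "du" AND name equals "item"
result = root.xpath('//a[contains(@class,"du") and @name="item"]/text()')
print(result)  # ['百度']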
Selecting by position:
from lxml import etree

html = """
<li>item1</li>
<li>item2</li>
<li>item3</li>
<li>item4</li>
<li>item5</li>
"""
root = etree.HTML(html)  # parse the string; returns the root element

result = root.xpath('//li[1]/text()')  # the first <li> (XPath positions start at 1)
print(result)  # ['item1']
result = root.xpath('//li[last()]/text()')  # the last <li>
print(result)  # ['item5']
result = root.xpath('//li[position()<3]/text()')  # the first and second <li>
print(result)  # ['item1', 'item2']
result = root.xpath('//li[last()-2]/text()')  # the third <li> from the end
print(result)  # ['item3']
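A positional predicate such as `[1]` is evaluated relative to each parent, which only becomes visible once the document has more than one list. The sketch below assumes a made-up fragment with two `<ul>` elements and contrasts `//li[1]` with the parenthesized form `(//li)[1]`.

```python
from lxml import etree

# Hypothetical fragment with two separate lists.
html = '<ul><li>a1</li><li>a2</li></ul><ul><li>b1</li><li>b2</li></ul>'
root = etree.HTML(html)

# The first <li> within each parent: one result per list.
first_per_parent = root.xpath('//li[1]/text()')

# Parentheses collect all <li> nodes first, then pick the first overall.
first_overall = root.xpath('(//li)[1]/text()')

# The same distinction applies to last().
last_per_parent = root.xpath('//li[last()]/text()')
last_overall = root.xpath('(//li)[last()]/text()')

print(first_per_parent, first_overall)  # ['a1', 'b1'] ['a1']
print(last_per_parent, last_overall)    # ['a2', 'b2'] ['b2']
```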
For more XPath functions, see: http://www.w3school.com.cn/xpath/xpath_functions.asp