XPath: the XML Path Language

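XPath overview
- XPath is a language for locating information in XML documents. It offers concise, readable path expressions for selecting nodes, and the same expressions work on HTML once it has been parsed into a tree.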

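Common XPath rules

Expression	Description
nodename	Selects the child elements of the current node that are named nodename
/	Selects direct children of the current node (at the start of an expression, begins at the document root)
//	Selects descendants of the current node at any depth (at the start of an expression, searches the whole document)
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes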

Example: //title[@lang='eng'] selects every node named title whose lang attribute has the value eng.
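The rules and the example above can be tried on a small document. The sketch below uses lxml (the library used throughout this post) on a made-up bookstore XML string; the element names, attribute names and titles are assumptions chosen only for illustration.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration
xml = """
<bookstore>
  <book category="web">
    <title lang="eng">Learning XML</title>
  </book>
  <book category="kids">
    <title lang="fr">Le Petit Prince</title>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)                  # root is the <bookstore> element

books = root.xpath('book')                    # nodename: children of root named book
titles = root.xpath('//title')                # //: title elements at any depth
first_title = root.xpath('book/title')[0]     # /: step from book to its direct title child
parent = first_title.xpath('..')[0]           # ..: parent of the current node
categories = root.xpath('//book/@category')   # @: attribute values
eng = root.xpath("//title[@lang='eng']")      # predicate: lang must equal "eng"

print(len(books), len(titles), parent.tag)    # 2 2 book
print(categories)                             # ['web', 'kids']
print([t.text for t in eng])                  # ['Learning XML']
```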

Introductory example

Processing HTML from a file

from lxml import etree

# Parse the HTML file directly; HTMLParser tolerates and repairs incomplete markup
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)   # serialize the repaired tree to bytes
print(result.decode('utf-8'))


# Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>

Processing HTML text

The HTML text has the same content as the file parsed above.
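When the HTML is already in memory as a string, etree.HTML() parses it without touching the file system. The sketch below assumes a shortened two-item version of the list; the last li is left unclosed on purpose to show that the parser repairs it.

```python
from lxml import etree

# Shortened sample markup (an assumption for illustration); the last <li> is not closed
text = '''
<div>
<ul>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)                       # returns the root <html> element
fixed = etree.tostring(html).decode('utf-8')  # serialize the repaired tree

print(html.tag)                # html
print(fixed.count('</li>'))    # 2: the missing closing tag was added
print('<body>' in fixed)       # True: html and body wrappers were added too
```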

All nodes

# An XPath expression that starts with // selects every matching node in the document

Selecting nodes with xpath()

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# * matches an element of any name, so //* returns every element in the document
result = html.xpath('//*')
print(result)


# Output:
[<Element html at 0x112829bc8>, <Element body at 0x112829d08>, <Element div at 0x112829d48>, <Element ul at 0x112829d88>, <Element li at 0x112829dc8>, <Element a at 0x112829e48>, <Element li at 0x112829e88>, <Element a at 0x112829ec8>, <Element li at 0x112829f08>, <Element a at 0x112829e08>, <Element li at 0x112fc21c8>, <Element a at 0x112fc2208>, <Element li at 0x112fc2248>, <Element a at 0x112fc2288>]


from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
# Select only the li elements; the result is a list, so a single node is reached by index
result = html.xpath('//li')
print(result, result[0], sep='\n')


# Output:
[<Element li at 0x10dffcec8>, <Element li at 0x10dffcf08>, <Element li at 0x10dffcf48>, <Element li at 0x10dffcf88>, <Element li at 0x10dffcfc8>]
<Element li at 0x10dffcec8>
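The memory addresses in that output say little about what was matched. Each item is an lxml Element, so its tag, attrib and text can be inspected directly. A minimal sketch, assuming a shortened two-item copy of the list held in a string:

```python
from lxml import etree

# Shortened sample markup, assumed for illustration
text = '''
<ul>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

items = html.xpath('//li')          # a plain Python list of Element objects
for li in items:
    # tag is the element name, attrib behaves like a dict, li[0] is the first child (the a)
    print(li.tag, dict(li.attrib), li[0].text)

all_tags = [el.tag for el in html.xpath('//*')]
print(all_tags)   # ['html', 'body', 'ul', 'li', 'a', 'li', 'a']
```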


Comparing the two examples: //* returns every element, while //li returns only the li elements.

Child nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
result = html.xpath('//li/a')
print(result)


# Output:
[<Element a at 0x11b316dc8>, <Element a at 0x11b316e08>, <Element a at 0x11b316e48>, <Element a at 0x11b316e88>, <Element a at 0x11b316ec8>]


# //li selects all li nodes and /a selects the direct a children of those nodes, so the expression returns every a element that is a direct child of an li


from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
# // after ul selects a elements at any depth below the ul, not only direct children
result = html.xpath('//ul//a')
print(result)


# Output:
[<Element a at 0x112930d88>, <Element a at 0x112930dc8>, <Element a at 0x112930e08>, <Element a at 0x112930e48>, <Element a at 0x112930e88>]


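The first example uses //li/a and the second uses //ul//a. The output is the same, because every a in this document is both a direct child of an li and a descendant of the ul.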
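The two expressions agree only because of how this document is shaped. A small sketch (using a shortened, assumed copy of the list) shows where / and // diverge:

```python
from lxml import etree

# Shortened sample markup, assumed for illustration
text = '''
<ul>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(text)

direct = html.xpath('//li/a')        # a elements that are direct children of an li
nested = html.xpath('//ul//a')       # a elements anywhere below a ul
too_shallow = html.xpath('//ul/a')   # a elements that are direct children of a ul

print(len(direct), len(nested), len(too_shallow))   # 2 2 0
```

//ul/a finds nothing because the a elements sit one level deeper, inside the li elements.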

Parent nodes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
# Find the a whose href is link4.html, step up to its parent, then read the parent's class
result = html.xpath('//a[@href="link4.html"]/../@class')
# result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)


# Output:
['item-1']


# Two ways to select the parent node:
#   ..
#   parent::
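Both spellings can be checked side by side. In this sketch the HTML string is a shortened, assumed copy of the sample list:

```python
from lxml import etree

# Shortened sample markup, assumed for illustration
text = '''
<ul>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(text)

# Abbreviated syntax: .. steps up to the parent
short_form = html.xpath('//a[@href="link4.html"]/../@class')
# Axis syntax: parent::* selects the parent element, whatever its name
axis_form = html.xpath('//a[@href="link4.html"]/parent::*/@class')

print(short_form, axis_form)   # ['item-1'] ['item-1']
```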


Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
# [@class="item-0"] keeps only li nodes whose class attribute is exactly "item-0"
result = html.xpath('//li[@class="item-0"]')
print(result)


# Output:
[<Element li at 0x115357d08>]


Text retrieval

# The text() node test in XPath selects the text inside a node

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())   # parse the HTML file directly
result = html.xpath('//li[@class="item-0"]/text()')
print(result)


# Output:
['\n']


# The result holds no real text, only a newline. Here text() is preceded by /, which selects direct children only. The direct child of the li is an a node and the text sits inside that a node, so the only text node directly under the matched li is the newline between </a> and the closing </li> tag that the parser added when it repaired the markup.

# HTML matched by the XPath above, before repair (only the fifth li matches; the first li has class "item-O" with a capital letter O):
<li class="item-0"><a href="link5.html">fifth item</a>
# After repair:
<li class="item-0"><a href="link5.html">fifth item</a>
</li>


# Two ways to get the text inside the li node
# Option 1: select the a node first, then take its text
# Option 2: use //

# Option 1
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Step into the a child first, then take its text
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
# Output:
['fifth item']


# Option 2
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# // collects text nodes at any depth below the li, including the trailing newline
result = html.xpath('//li[@class="item-0"]//text()')
print(result)
# Output:
['fifth item', '\n']
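Both options return raw text nodes, so whitespace such as the trailing newline comes along. One possible follow-up, sketched here on an assumed one-item snippet, is to strip and drop the empty pieces in Python, or to let the XPath functions string() and normalize-space() do the joining and trimming:

```python
from lxml import etree

# One-item sample markup, assumed for illustration
text = '''
<ul>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>
</ul>
'''
html = etree.HTML(text)

raw = html.xpath('//li[@class="item-0"]//text()')
print(raw)                                   # ['fifth item', '\n']

# Clean up in Python: strip each piece and discard the ones that become empty
cleaned = [t.strip() for t in raw if t.strip()]
print(cleaned)                               # ['fifth item']

# Or let XPath take the string value of the first matching li and trim it
joined = html.xpath('normalize-space(string(//li[@class="item-0"]))')
print(joined)                                # fifth item
```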


Attribute retrieval

# Use @ to retrieve an attribute value

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# @href at the end of the path returns the attribute values themselves, not the a elements
result = html.xpath('//li/a/@href')
print(result)

# Output:
['linkl.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
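An alternative to selecting @href in the expression is to select the elements and read the attribute in Python with Element.get(), which is handy when the element itself is also needed. A minimal sketch on an assumed two-link snippet:

```python
from lxml import etree

# Shortened sample markup, assumed for illustration
text = '''
<ul>
<li><a href="link2.html">second item</a></li>
<li><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Option 1: ask XPath for the attribute values directly
hrefs = html.xpath('//li/a/@href')

# Option 2: select the a elements, then read href (and text) from each one
pairs = [(a.get('href'), a.text) for a in html.xpath('//li/a')]

print(hrefs)   # ['link2.html', 'link3.html']
print(pairs)   # [('link2.html', 'second item'), ('link3.html', 'third item')]
```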


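Matching attributes with multiple values

# Some attributes hold several values; matching them requires the contains() function

contains() takes the attribute (for example @class) as its first argument and the value to look for as its second. It is a substring test: it is true whenever the attribute's string contains that value.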

from lxml import etree

# The class attribute of the li node in this HTML text has two values, li and li-first
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
# Get the link text of every li whose class attribute contains "li"
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)


# Output:
['first item']
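Because contains() is a plain substring test, contains(@class, "li") would also match a class such as "li-last" on its own. A widely used XPath 1.0 idiom for matching a whole class token pads the attribute with spaces first. The sketch below uses invented class names and file names to show the difference:

```python
from lxml import etree

# Hypothetical markup: only the first li carries the class token "li"
text = '''
<ul>
<li class="li li-first"><a href="a.html">first item</a></li>
<li class="li-last"><a href="b.html">last item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: "li-last" contains "li", so both items match
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: surround the normalized class list with spaces and look for " li "
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)    # ['first item', 'last item']
print(strict)   # ['first item']
```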


Matching on multiple attributes

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
# Both conditions must hold: class contains "li" and name equals "item"
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)


# Output:
['first item']
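The and in that predicate is one of several XPath operators; arithmetic, comparison and union operators can be evaluated through xpath() in the same way. In the sketch below the XML document and its prices are made up for illustration. Note that lxml returns XPath numbers as Python floats.

```python
from lxml import etree

# Hypothetical catalogue used only to exercise the operators
xml = """
<shop>
  <book><price>9.80</price></book>
  <book><price>8.50</price></book>
  <cd><price>9.70</price></cd>
</shop>
"""
root = etree.fromstring(xml)

print(root.xpath('5 mod 2'))    # 1.0
print(root.xpath('6 div 6'))    # 1.0
print(root.xpath('6 * 6'))      # 36.0

# Comparisons in a predicate filter nodes: books priced between 9.00 and 9.90
mid_priced = root.xpath('//book[price > 9.00 and price < 9.90]/price/text()')
print(mid_priced)               # ['9.80']

# | builds the union of two node sets
union = root.xpath('//book | //cd')
print([el.tag for el in union]) # ['book', 'book', 'cd']
```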


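Operators and their descriptions

Operator	Description	Example	Result
or	Logical or	price=9.80 or price=9.70	Returns true if price is 9.80. Returns false if price is 9.50.
and	Logical and	price>9.00 and price<9.90	Returns true if price is 9.80. Returns false if price is 8.50.
mod	Remainder of a division	5 mod 2	1
|	Union of two node sets	//book | //cd	Returns a node set containing all book and cd elements
+	Addition	6 + 6	12
-	Subtraction	6 - 6	0
*	Multiplication	6 * 6	36
div	Division	6 div 6	1
=	Equal to	price=9.80	Returns true if price is 9.80. Returns false if price is 9.90.
!=	Not equal to	price!=9.80	Returns true if price is 9.90. Returns false if price is 9.80.
<	Less than	age<20	Returns true if age is less than 20. Returns false otherwise.
<=	Less than or equal to	age<=20	Returns true if age is 20 or less. Returns false if age is greater than 20.
>	Greater than	age>20	Returns true if age is greater than 20. Returns false otherwise.
>=	Greater than or equal to	age>=20	Returns true if age is 20 or more. Returns false if age is less than 20.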

Selecting by position

# Put an index in square brackets to select the node at a given position

from lxml import etree

text = '''
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
# XPath positions start at 1, not 0
result1 = html.xpath('//li[1]/a/text()')                    # the first li node
result2 = html.xpath('//li[last()]/a/text()')               # the last li node
result3 = html.xpath('//li[position()<3]/a/text()')         # li nodes at positions 1 and 2
result4 = html.xpath('//li[last()-2]/a/text()')             # the third li node from the end

print(result1, result2, result3, result4, sep='\n')


# Output:
['first item']
['fifth item']
['first item', 'second item']
['third item']
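Two details are easy to miss here: XPath positions start at 1, and a predicate such as [1] is applied within each parent, not to the whole result. The sketch below uses an assumed document with two lists to show the difference between //li[1] and (//li)[1]:

```python
from lxml import etree

# Hypothetical markup with two separate lists
text = '''
<div>
<ul><li>a1</li><li>a2</li></ul>
<ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

# [1] is evaluated within each parent: the first li of every ul
first_in_each = html.xpath('//li[1]/text()')

# Parentheses collect all li nodes first, then pick the first one overall
first_overall = html.xpath('(//li)[1]/text()')

# The list handed back to Python is an ordinary list, indexed from 0
second_overall = html.xpath('//li/text()')[1]

print(first_in_each)    # ['a1', 'b1']
print(first_overall)    # ['a1']
print(second_overall)   # a2
```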


Selecting along node axes

# Axes include ancestor, attribute, child, descendant, following, following-sibling, and others

from lxml import etree

text = '''
<div>
<ul>
<li class="item-O"><a href="linkl.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
result1 = html.xpath('//li[1]/ancestor::*')                 # all ancestors of the first li
result2 = html.xpath('//li[1]/ancestor::div')               # only the div ancestor of the first li
result3 = html.xpath('//li[1]/attribute::*')                # all attribute values of the first li
result4 = html.xpath('//li[1]/child::a[@href="link.html"]') # direct a children whose href is link.html (none here, so the result is empty)
result5 = html.xpath('//li[1]/descendant::span')            # span nodes among all descendants
result6 = html.xpath('//li[1]/following::*[2]')             # the 2nd node after the first li in document order
result7 = html.xpath('//li[1]/following-sibling::*')        # all siblings that follow the first li

print(result1, result2, result3, result4, result5, result6, result7, sep='\n')


# Output:
[<Element html at 0x102e9f088>, <Element body at 0x10350fe08>, <Element div at 0x10350fd88>, <Element ul at 0x10350fd08>]
[<Element div at 0x10350fd88>]
['item-O']
[]
[<Element span at 0x10350fec8>]
[<Element a at 0x10350fe88>]
[<Element li at 0x10350ff48>, <Element li at 0x10350ff88>, <Element li at 0x10350ffc8>, <Element li at 0x111ba0048>]
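Other axes work the same way. As a sketch on an assumed three-item list, preceding-sibling and following-sibling select neighbours on either side of a node; an index on a reverse axis such as preceding-sibling counts backwards from the current node:

```python
from lxml import etree

# Hypothetical three-item list, invented for this illustration
text = '''
<ul>
<li class="item-1">one</li>
<li class="item-2">two</li>
<li class="item-3">three</li>
</ul>
'''
html = etree.HTML(text)

# Start from the middle li and look in both directions among its siblings
before = html.xpath('//li[@class="item-2"]/preceding-sibling::li/text()')
after = html.xpath('//li[@class="item-2"]/following-sibling::li/text()')

# From the last li, [1] on the reverse axis is the nearest preceding sibling
nearest = html.xpath('//li[@class="item-3"]/preceding-sibling::li[1]/text()')

print(before)    # ['one']
print(after)     # ['three']
print(nearest)   # ['two']
```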

