Three XPath methods for extracting all the text of multiple tags nested under a given tag

When crawling data, we sometimes need the text content of several nested tags at once, or need to keep the tag attributes, in which case the content has to be extracted together with its tags. You could write regular expressions for this, but today I will introduce a way to do it with XPath.
The first method below extracts the content together with its HTML tags; the latter two extract all the text, but without tags or attributes:
① The first method extracts the HTML string inside a tag, including all tag attributes. The output is the HTML of the article body as it is normally displayed on the web page.

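    from html import unescape

    import requests
    from lxml import etree

    # details_url is assumed to be defined earlier (the URL of the article page)
    html_content3 = requests.get(details_url).text
    html = etree.HTML(html_content3)
    # Alternative: string(.) returns all the text inside the tag, but without tags or attributes.
    # content = html.xpath('//div[@class="article-entry"]')[0].xpath('string(.)').strip()
    # First select the tag that contains the article body
    contents = html.xpath('//div[@class="article-entry"]')[0]
    # The result is an Element object, so it has to be serialized to a string
    name1 = etree.tostring(contents, method='html')
    # After serialization, Chinese characters show up as character references, so unescape them.
    # (HTMLParser().unescape was removed in Python 3.9; html.unescape replaces it.)
    name2 = unescape(name1.decode())
    content = name2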
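
The same idea can be tried without a network request. The following is a minimal sketch on an inline string; the sample markup and the variable names are assumptions made for illustration, not part of the original code. It also shows that lxml can return a str directly with encoding='unicode', which may make the unescape step unnecessary:

```python
from html import unescape

from lxml import etree

# Hypothetical sample page standing in for the downloaded HTML.
SAMPLE = (
    '<html><body><div class="article-entry">'
    '<p id="p1">你好 <b>world</b></p>'
    '</div></body></html>'
)

root = etree.HTML(SAMPLE)
node = root.xpath('//div[@class="article-entry"]')[0]

# Default serialization returns ASCII bytes, so non-ASCII characters
# are written as numeric character references.
raw = etree.tostring(node, method='html')

# Route used in the article: decode the bytes, then unescape the references.
via_unescape = unescape(raw.decode())

# Alternative: ask lxml for a str, which keeps the characters as they are.
direct = etree.tostring(node, method='html', encoding='unicode')

print(raw)
print(via_unescape)
print(direct)
```

One difference worth noting: unescape also decodes entities such as &lt; and &amp; that occur in the page text, whereas encoding='unicode' leaves them escaped, so the second route is likely the safer one when the result must remain valid HTML.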
    

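② The second method uses string(.), which concatenates all the text under the tag into one string. Judging by the .extract() call, response here is presumably a Scrapy response:

    content = response.xpath('//div[@class="t1"]').xpath('string(.)').extract()[0]

③ The third method uses //text(), which selects every text node under the tag separately, so the pieces have to be joined to get the full text (taking only element [0] would return just the first text node):

    content = ''.join(response.xpath('//div[@class="t1"]//text()').extract())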
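
To see how string(.) and //text() differ, here is a small sketch that uses lxml on an inline string instead of a Scrapy response. The sample markup is hypothetical; only the class name "t1" comes from the examples above. The point it illustrates is that string(.) yields a single string, while //text() yields one item per text node, so indexing that result with [0] keeps only the first fragment:

```python
from lxml import etree

# Hypothetical markup; only the class name "t1" is taken from the article.
SAMPLE = '<div class="t1"><p>First   <b>bold</b> part. </p><p> Second part.</p></div>'

root = etree.HTML(SAMPLE)
node = root.xpath('//div[@class="t1"]')[0]

# string(.) concatenates every descendant text node into a single str.
full_text = node.xpath('string(.)')

# normalize-space(.) additionally trims the ends and collapses whitespace runs.
clean_text = node.xpath('normalize-space(.)')

# .//text() returns a list with one item per text node.
pieces = node.xpath('.//text()')
first_only = pieces[0]      # only the first text node
joined = ''.join(pieces)    # all text under the div

print(repr(first_only))
print(repr(joined))
print(repr(clean_text))
```

If whitespace matters, normalize-space(.) is often the most convenient of the three, since it avoids a separate strip or join step.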

Origin blog.csdn.net/Python_BT/article/details/108222583