Knowledge Point Supplement 1 (xpath)

An example of taking the wrong approach while troubleshooting an xpath problem

My original approach was the following:

import requests
from lxml import etree
url = "https://editor.csdn.net/md/?not_checkout=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
html = requests.get(url,headers=headers)
page = etree.HTML(html.text)
content = page.xpath("/html/body/div[1]/div[1]/div[2]/div/div[2]/div[4]/div/div[1]/div")
print(content)

The result is empty.
Solution:
At first I suspected that the returned HTML was incomplete. I printed it with pretty_print and found that the page source contained no Chinese characters at all, and that the serialized output was of type bytes rather than str, so running xpath on it again raised an error. I also ran into entity references (the &...; notation). These are escape sequences used by SGML-like languages such as HTML and XML; they are not character encodings, which means they cannot be handled by decoding with utf-8, gbk and so on, and have to be unescaped with an HTML parser (HTMLParser). In the end none of this solved the problem of the empty result, and the investigation only got messier.
Then I switched to a different website and kept crawling with xpath, and discovered that the class attributes on some sites change dynamically. This overturned my earlier assumption that xpath itself was broken: the real problem was that my locating expression was written wrongly. As for the source-code encoding issue I had considered before, after reading other articles I found that almost nobody mentions it; most solutions are a few simple lines of code, and very few involve "completing" the page source. So the whole initial line of investigation was wrong, and it comes back to the positioning problem with dynamic elements mentioned above: such elements can only be located through relative relationships (see: dynamic element positioning).
Then, after understanding the real reason the result was empty, I tried crawling again and it succeeded. A rough sketch of relative positioning follows.
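As an illustration of locating by relative relationships rather than by an absolute path (the URL and the "article-title" class below are placeholders I made up, not the site from the example above):

import requests
from lxml import etree

url = "https://example.com/list"          # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}
page = etree.HTML(requests.get(url, headers=headers).text)

# An absolute path like /html/body/div[1]/... breaks as soon as the layout or a
# generated class name changes. Anchoring on a stable attribute and walking
# relative to it is more robust:
titles = page.xpath('//div[contains(@class, "article-title")]//a/text()')
print(titles)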
When crawling Baidu documents, I tried to filter out some repeated advertising data during the crawl and could not manage it; perhaps that goal is better left to the data-cleaning step.
Finally, the return on learning xpath is high: it shows up in many places such as scrapy and selenium, so it is worth spending extra time getting familiar with it, for example:
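A minimal sketch of the same kind of expression in both tools (the URL, class name and expression are placeholders; this only illustrates that the xpath itself carries over):

# Scrapy: inside a spider's parse() callback, xpath is built into the response object:
#     titles = response.xpath('//div[@class="title"]/a/text()').getall()

# Selenium 4: the same expression locates live elements in a real browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
elements = driver.find_elements(By.XPATH, '//div[@class="title"]/a')
print([e.text for e in elements])
driver.quit()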

Commonly used xpath steps

1. Check that the request itself succeeds

html = requests.get(url,headers=headers)
print(html)

If this returns <Response [200]>, the request is normal.
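The same check in code (a small sketch; status_code is the standard attribute on a requests Response):

html = requests.get(url, headers=headers)
# 200 means OK; other codes (403, 302, ...) usually point at missing headers or cookies
if html.status_code != 200:
    print("request failed with status", html.status_code)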
2. Parse the response into an element tree

html = requests.get(url,headers=headers)
page = etree.HTML(html.text)
print(page)

This returns

<Element html at 0x216c24f1f40>

This is not the desired HTML source. It is actually a parsed element object of type lxml.etree._Element, so it has to be serialized back into HTML text with a dedicated method.


The correct way is

data = requests.get(url).text
html = etree.HTML(data)
# tostring() serializes the element tree back to HTML; decode() turns the bytes into a str
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
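Once the tree has been parsed, xpath can be applied to it directly; a tiny sketch (the //title expression is only an illustration):

data = requests.get(url, headers=headers).text
html = etree.HTML(data)
# text() selects the text node; xpath always returns a list, empty when nothing matches
print(html.xpath('//title/text()'))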

Since 'utf-8' encoding was mentioned: the way to put Chinese characters into a URL and encode them is like this

import urllib.parse
import urllib.request

keyword = input("请输入关键词:")
keyword = urllib.parse.urlencode({"word": keyword})
response = urllib.request.urlopen("http://baike.baidu.com/search/word?%s" % keyword)
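For comparison, and purely as an aside of mine (the original uses urllib), the requests library percent-encodes query parameters automatically through params:

import requests

keyword = input("请输入关键词:")
# requests percent-encodes the Chinese characters in the query string for you
response = requests.get("http://baike.baidu.com/search/word", params={"word": keyword})
print(response.url)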

3. xpath syntax usage rules

symbol    Description
count     counts the selected elements

To be supplemented as more of them come up; a quick illustration of a few everyday expressions is below.
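A quick illustration of count() and a few other common expressions, run against a small inline HTML snippet so it does not depend on any real site:

from lxml import etree

# a tiny inline document, just to demonstrate the expressions
doc = etree.HTML("<div><ul><li class='a'>one</li><li class='a'>two</li><li>three</li></ul></div>")

print(doc.xpath("count(//li)"))              # 3.0 - count the selected elements
print(doc.xpath("//li/text()"))              # ['one', 'two', 'three']
print(doc.xpath("//li[@class='a']/text()"))  # ['one', 'two'] - filter by attribute
print(doc.xpath("//li[last()]/text()"))      # ['three'] - positional predicate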

4. If the request headers are not set, the xpath step will report an error (the site rejects the request, so there is no valid page to parse).
5. A redirect (302) indicates that a login cookie is required.
Method one: set the cookies yourself. You can first look at what the server sends back:

response = requests.get(url,headers=headers)
print(response.cookies)
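Those cookies can then be carried into the next request (a minimal sketch; whether this is enough depends on the site's login flow):

response = requests.get(url, headers=headers)
# re-send whatever cookies the server set on the follow-up request
next_page = requests.get(url, headers=headers, cookies=response.cookies)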

Method Two

A detailed write-up on the requests library and cookie operations (separate article).
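I have not summarized that write-up here, so purely as an assumption of mine about one common approach (not necessarily what method two refers to): a requests Session keeps cookies across requests automatically.

session = requests.Session()
session.headers.update(headers)
first = session.get(url)   # any Set-Cookie from this response is stored in the session
second = session.get(url)  # ...and re-sent here automatically
print(session.cookies)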

Problems to be solved

Cases where a correct xpath expression still finds nothing, which is probably the page's anti-crawling measures; the typical example is crawling Douban images.
How Scrapy sets cookies still needs to be learned. The cost of studying without really understanding and without taking notes is huge. This article will continue to be updated.

Origin blog.csdn.net/qq_51598376/article/details/113773847