Setting request headers in the Scrapy shell

1. The scrapy shell command requests a web page:

scrapy shell "https://www.baidu.com"

This fetches the page. After the request we can read the page source through response.text and then match the content we want with a regular expression.
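As a rough sketch of the matching step, the snippet below runs regular expressions over a short hand-written HTML string that stands in for response.text. The sample markup and both patterns are assumptions made for illustration, not the actual page source. Inside the shell, the same calls would take response.text instead of sample_html.

```python
import re

# Hypothetical stand-in for response.text; a real page is far longer.
sample_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

# Capture the href value of every <a> tag (illustrative pattern only).
link_pattern = re.compile(r'<a\s+href="([^"]+)"')
links = link_pattern.findall(sample_html)

# Capture the text between <title> and </title>.
title_match = re.search(r"<title>(.*?)</title>", sample_html, re.S)
title = title_match.group(1) if title_match else None

print(links)
print(title)
```

Trying a pattern on a small string first makes it easier to see why it does or does not match before running it against a full page.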

 

 

 

 2. The request method above works for sites that place no restrictions on requests. As discussed before, however, many sites refuse requests that arrive without request headers, which is why our crawlers set a headers section. So how do we set request headers in scrapy?

Problem analysis: most of what we set in the request headers is really the User-Agent, so setting the user-agent in scrapy effectively takes care of the request headers.

Running scrapy shell -s USER_AGENT="<user-agent string>" request_url adds the header to the request (note that there are no spaces around the = sign). For example, a request to Zhihu, which returns a 400 error when sent without the header:

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0" https://www.zhihu.com/question/285908404

 

 

This way we can view the original page of our request in scrapy and verify the regular expressions we write.

3. After the request above, we can get the source of the requested page through response.text. How do we save it?

In fact, it can be saved with code like this:

# 'wb' opens the file in binary mode, so the text is encoded to UTF-8 bytes first
with open('d:/zhihu_question.html', 'wb') as f:
    f.write(response.text.encode('utf-8'))
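A slightly more general sketch of the same save step, written as a small helper so it can be tried outside the shell. The function name save_html, the sample string, and the temporary output path are all hypothetical; in the shell one would pass response.text and a path of one's own choosing. Writing response.body in 'wb' mode is an alternative that keeps the bytes exactly as the server sent them.

```python
from pathlib import Path
import tempfile


def save_html(text: str, path: str) -> int:
    """Write page source to `path` as UTF-8 and return the number of bytes written."""
    data = text.encode("utf-8")
    target = Path(path)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return len(data)


# Hypothetical stand-in for response.text, including non-ASCII characters.
sample_text = "<html><body>知乎</body></html>"
out_path = Path(tempfile.gettempdir()) / "zhihu_question.html"
written = save_html(sample_text, str(out_path))
print(written, "bytes written to", out_path)
```

Encoding explicitly to UTF-8 avoids depending on the platform's default encoding, which matters on Windows for pages that contain Chinese text.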

 

 


Original link: https://blog.csdn.net/godot06/article/details/81587242


Origin www.cnblogs.com/yoyowin/p/12348029.html