Copyright notice: This is the blogger's original article; reproduction without permission is prohibited. https://blog.csdn.net/weixin_40567229/article/details/84545576
The original code was as follows:
import requests
from requests.exceptions import RequestException

def one_page_code(url):
    try:
        page = requests.get(url)
        if page.status_code == 200:
            return page.text
        print("Failed\nStatus code: %d" % page.status_code)
    except RequestException:
        print("Exception")

def main():
    url = 'http://maoyan.com'
    print(one_page_code(url))

if __name__ == '__main__':
    main()
This code displays the page source without any problem when requesting Baidu, Taobao, or Douban, but requesting Maoyan returns a 403 error.
It turns out that an important detail was overlooked when making the request: the request headers.
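The cause is easy to see: when no headers are supplied, requests sends its own default User-Agent, which identifies the client as a Python script, and a server that filters scrapers can reject it on that alone. A quick offline check of the default header (no network request is made):

    import requests

    # A fresh Session carries requests' default headers, including the
    # tell-tale User-Agent that scraper-blocking servers can match on.
    s = requests.Session()
    print(s.headers['User-Agent'])  # e.g. 'python-requests/2.31.0'
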
Copy the request headers from the Network tab of the browser's developer tools and add them to the request function:
import requests
from requests.exceptions import RequestException

def one_page_code(url):
    try:
        # User-Agent copied from the browser's developer tools
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
        page = requests.get(url, headers=header)
        if page.status_code == 200:
            return page.text
        print("Failed\nStatus code: %d" % page.status_code)
    except RequestException:
        print("Exception")

def main():
    url = 'http://maoyan.com/board/4'
    print(one_page_code(url))

if __name__ == '__main__':
    main()
With the header added, the page source is retrieved normally.
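As a side note, instead of comparing status_code by hand, requests also offers response.raise_for_status(), which raises an HTTPError for any 4xx/5xx response. A minimal offline sketch (it constructs a Response object directly instead of making a real request, so the URL shown is only illustrative):

    import requests

    # Build a Response object by hand to illustrate the behavior offline.
    resp = requests.models.Response()
    resp.status_code = 403
    resp.url = 'http://maoyan.com'  # illustrative URL for the error message

    try:
        resp.raise_for_status()
        error = None
    except requests.exceptions.HTTPError as e:
        error = e

    print(error)

In real code you would call page.raise_for_status() right after requests.get() and let the try/except around it handle the failure in one place.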