Datawhale Web Crawler: Task 1

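Task requirements
Task1 (3 days)
1.1 Learn GET and POST requests
Learn GET and POST requests. Using requests or urllib, send a GET request to https://www.baidu.com/ and print the response.
Disconnect from the network, send the request again, and observe the result. Learn the status codes that a request can return.
Learn what request headers are and how to add them.
1.2 Regular expressions
Learn what regular expressions are and try matching text with a few of them.
Then combine requests and re to scrape the content of https://movie.douban.com/top250
The fields to extract include rank, film title, year, and director.
Reference: https://desmonday.github.io/2019/03/02/python爬虫学习-day2正则表达式/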
------------

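GET and POST are both HTTP methods for making a request and sending data to a server. For the main differences in detail, refer to this article.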

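GET: the client sends data to the server directly in the URL, for example https://www.baidu.com/s?wd=boy

         https://www.baidu.com/s is the request address

         wd=boy sends the parameter wd with the value boy to the backend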
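
The way the query string is attached to the request address can be sketched with the standard library's urllib.parse. This is only an illustration: the helper name build_get_url is hypothetical, and the address and parameter are the ones from the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_get_url(base_url, params):
    """Append params to base_url as a URL-encoded query string."""
    return base_url + "?" + urlencode(params)


url = build_get_url("https://www.baidu.com/s", {"wd": "boy"})
print(url)

# Split the URL back into the request address and the parameters.
parts = urlsplit(url)
print(parts.netloc + parts.path)  # host and path of the request address
print(parse_qs(parts.query))      # parameters sent to the backend
```

With requests, passing params={'wd': 'boy'} to requests.get builds the same kind of URL.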

POST: the data is not exposed in the URL; it is sent in the request body, which travels together with the request headers
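
A minimal sketch of a POST request built with urllib.request, to show where the data and the headers end up. The request is only constructed, never sent, and nothing here claims that this address accepts POST. The User-Agent value is a made-up placeholder used only to show how a header is added.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode the form fields; they travel in the request body, not in the URL.
body = urlencode({"wd": "boy"}).encode("utf-8")

# Hypothetical User-Agent value, used only to show how a header is attached.
req = Request(
    "https://www.baidu.com/s",
    data=body,
    headers={"User-Agent": "Mozilla/5.0 (example)"},
)

print(req.get_method())  # urllib uses POST when data is supplied
print(req.full_url)      # the URL carries no parameters
print(req.data)          # the parameters are in the body
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

With requests, the equivalent arguments are data= for the body and headers= for the request headers.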

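In everyday use the two behave much the same: both carry data from the client to the server. The practical difference is where the data travels, in the URL for GET and in the request body for POST.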

Requesting the Baidu page with the GET method

import requests

print(requests.get('http://www.baidu.com'))

Output on success (printing the Response object shows its status code): <Response [200]>

With the network disconnected, the same request raises an error

Traceback (most recent call last):
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f28e60b8588>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/capture/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/capture/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.baidu.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f28e60b8588>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/capture/PycharmProjects/class19/class19/baidu.py", line 3, in <module>
    print(requests.get('http://www.baidu.com'))
  File "/home/capture/.local/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/capture/.local/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/capture/.local/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/capture/.local/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/capture/.local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.baidu.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f28e60b8588>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Process finished with exit code 1
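
The traceback shows that a failed connection surfaces as an exception, not as a status code. A minimal sketch of handling that case follows, written with the standard library's urllib so that it runs without extra packages. The function name fetch_status is hypothetical. With requests, the exception to catch would be requests.exceptions.ConnectionError, as named in the last line of the traceback.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code, or None if no connection could be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except OSError as err:
        # URLError (DNS failure, refused connection) and timeouts land here.
        print("request failed:", err)
        return None
```

Calling fetch_status('http://www.baidu.com') with the network down prints the failure reason and returns None, so the program can continue instead of exiting with code 1.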

Common status codes

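  • 200 - the request succeeded
  • 301 - the resource has been moved permanently to another URL
  • 404 - the requested resource does not exist
  • 500 - internal server error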

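Status code classes

  • 1** : informational; the server has received the request and the client should continue
  • 2** : success; the request was received and processed
  • 3** : redirection; further action is needed to complete the request
  • 4** : client error; the request contains a syntax error or cannot be fulfilled
  • 5** : server error; the server failed while handling the request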
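
The class of a status code follows from its first digit, which the sketch below obtains by integer division by 100. The table and the function name describe_status are illustrative; the reason phrases come from the standard library's http.HTTPStatus.

```python
from http import HTTPStatus

# First digit of the status code mapped to its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return (class name, standard reason phrase) for an HTTP status code."""
    status_class = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not registered in http.HTTPStatus.
        phrase = ""
    return status_class, phrase


for code in (200, 301, 404, 500):
    print(code, *describe_status(code))
```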


Reposted from blog.csdn.net/qq_42091045/article/details/89049048