Python Study Notes: Web Crawlers (Detailed)

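Contents

1. Introduction to Crawlers
2. The Requests Library
3. The Robots Protocol
4. Five Crawling Examples
5. Extraction for Web Crawlers: The BeautifulSoup Library
6. Organizing and Extracting Information
7. Case Study: A Crawler for Chinese University Rankings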

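A note up front: the figures in these notes come from Professor Song's slides, and his online course is available on China University MOOC (中国大学MOOC). After taking the course I distilled its key points to share here, and I hope they help you learn.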

1. Introduction to Crawlers

Goal: build basic skills in targeted web data crawling and web page parsing.

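2. The Requests Library

Install with: pip install requests

1. The requests.get(url) function

get builds a Request object that asks the server for a resource, and returns a Response object containing the server's resource.

requests.get(url, params=None, **kwargs)
# url: URL of the page to fetch
# params: extra parameters appended to the URL; a dict or byte stream, optional
# **kwargs: 12 optional parameters that control the request

2. Attributes of the Response object

r.status_code        # HTTP status of the response; 200 means success, 404 or another code means failure
r.text               # response body as a string, i.e. the page content at the URL
r.encoding           # encoding of the response body guessed from the HTTP header
r.apparent_encoding  # encoding of the response body inferred from the content itself
r.content            # response body in binary form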

>>> import requests
>>> r = requests.get("http://www.baidu.com")  # request the Baidu home page
>>> r.status_code     # 200 means the request succeeded
200
>>> r.text            # look at the fetched content
'...v\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp- ...'
>>> r.encoding             # encoding taken from the header
'ISO-8859-1'
>>> r.apparent_encoding    # encoding inferred from the content
'utf-8'
>>> r.encoding = 'utf-8'   # after switching the encoding, the Chinese text becomes readable
>>> r.text
'... id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a ...'

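Every resource on the web has its own encoding, and its content is readable only when it is decoded with the right one.

r.encoding comes from the HTTP header. If the header contains a charset field, the server is declaring the encoding and that value is returned. If the header has no charset, the encoding is assumed to be ISO-8859-1, which cannot represent Chinese.

r.apparent_encoding is obtained by analyzing the page content itself, so it can be used to decode the content correctly when the header gives no usable charset.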
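To see why the wrong encoding produces unreadable text, here is a minimal sketch that needs no network access. It assumes a hypothetical page body containing the string 关于百度 encoded as UTF-8, and decodes the same bytes in two ways; the variable names are illustrative and not part of Requests.

```python
# Hypothetical page body: the text "关于百度" encoded as UTF-8 by the server.
raw = "关于百度".encode("utf-8")

# Decoding with ISO-8859-1 (the fallback when the header has no charset)
# never fails, but every byte becomes a separate Latin-1 character.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding the content was really written in recovers the text.
readable = raw.decode("utf-8")

print(len(garbled), len(readable))
print(readable)
```

Setting r.encoding = r.apparent_encoding before reading r.text has the same effect as the second decode.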

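3. Requests library exceptions

[Figure: table of Requests library exceptions]

r.raise_for_status() offers a way to check whether the request failed:
if the status code indicates an error (4xx or 5xx), it raises requests.HTTPError.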
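The behavior of raise_for_status() can be tried without a network connection. The sketch below builds a bare Response object by hand and sets only its status_code, which is purely an assumption for illustration; in real use the Response comes from requests.get(). The helper name describe is hypothetical.

```python
import requests

def describe(status_code):
    """Return "ok" or the name of the exception raised for this status code."""
    r = requests.Response()        # hand-built response, for illustration only
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as exc:
        return type(exc).__name__

print(describe(200))
print(describe(404))
print(describe(503))
```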

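Below is a general code framework for fetching a web page.

It makes crawling more effective and reliable: when something goes wrong, the error is caught and reported.

Network connections are unreliable, so exception handling matters.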

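import requests

def getext(url):
    try:
        r = requests.get(url, timeout=30)   # request the page
        r.raise_for_status()                # raise an exception on an error status
        r.encoding = r.apparent_encoding    # decode using the encoding inferred from the content
        return r.text                       # return the page content
    except requests.RequestException:
        return "An exception occurred"      # report the failure

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getext(url))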

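4. The HTTP protocol

HTTP stands for Hypertext Transfer Protocol.
It is a stateless application-layer protocol based on a request/response model (the user sends a request, the server responds).
Stateless: one request has no connection to the next.

URL format: http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; defaults to 80 when omitted
path: the path of the requested resource
http://www.bit.edu.cn          # home page of Beijing Institute of Technology
http://220.181.111.188/duty    # the resource under the duty directory on the host with this IP

A URL is the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.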
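The host, port, and path parts can be pulled out of a URL with the standard library. This is a minimal sketch: the helper name split_http_url and the example.com URL are hypothetical, and the fallback to port 80 simply encodes the default described above.

```python
from urllib.parse import urlsplit

def split_http_url(url):
    """Split an http URL into (host, port, path), filling in the defaults."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port if parts.port is not None else 80   # default HTTP port
    path = parts.path or "/"                               # empty path means the root
    return host, port, path

print(split_http_url("http://www.bit.edu.cn"))
print(split_http_url("http://220.181.111.188/duty"))
print(split_http_url("http://example.com:8080/a/b.html"))  # hypothetical URL
```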

HTTP methods:
[Figure: table of HTTP methods]

5. Methods of the Requests library
[Figure: table of Requests library methods]

>>> r = requests.head("http://httpbin.org/get")
>>> r.headers
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Fri, 27 Jul 2018 04:35:39 GMT', 'Content-Type': 'application/json', 'Content-Length': '267', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
>>> r.text   # empty, because a HEAD request returns only the headers
''

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1",   # a dict POSTed to the URL is automatically encoded as form data
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1"
  }, 
  "json": null, 
  "origin": "123.161.129.23", 
  "url": "http://httpbin.org/post"
}
# A string POSTed to the URL is automatically encoded as data

requests.request(method, url, **kwargs)
requests.get(url, params=None, **kwargs)
requests.head(url, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.delete(url, **kwargs)
# **kwargs: 13 optional parameters that control the request, including:
1. params: a dict or byte sequence, added to the URL as query parameters
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
https://python123.io/ws?key1=value1&key2=value2
# the server can use these parameters to filter what it returns
2. data: a dict, byte sequence, or file object, sent as the body of the request
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
3. json: data in JSON format, sent as the body of the request
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
4. headers: a dict of custom HTTP headers
>>> hd = {'user-agent': 'Chrome/10'}  # with this header the server sees the user-agent field as Chrome/10 (version 10 of Chrome); any browser can be imitated this way
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
5. cookies: a dict or CookieJar
6. auth: a tuple, enabling HTTP authentication
7. files: a dict, used to upload files to the server
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
8. timeout: the timeout in seconds; if no response arrives within that time, a timeout exception is raised
>>> r = requests.request('GET', 'http://python123.io/ws', timeout=10)
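How params, data, and json end up in the outgoing request can be inspected without sending anything. The sketch below uses requests.Request(...).prepare(), which builds the request that would be sent; the helper name preview is hypothetical, and the python123.io URL is reused from the examples above only for illustration.

```python
import json
import requests

kv = {'key1': 'value1', 'key2': 'value2'}
url = 'http://python123.io/ws'

def preview(method, **kwargs):
    """Build the request without sending it and return the prepared form."""
    return requests.Request(method, url, **kwargs).prepare()

get_req = preview('GET', params=kv)     # params go into the query string
form_req = preview('POST', data=kv)     # a dict in data is form-encoded
json_req = preview('POST', json=kv)     # json is serialized into the body

print(get_req.url)
print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

Because nothing is sent, this is a convenient way to check a request before pointing a crawler at a real server.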

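In general, no server lets users upload information without limits, since uploads may contain malicious files and pose a security risk. That is why get and head are the methods most commonly used for crawling.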

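3. The Robots Protocol

Most websites publish a Robots protocol in a robots.txt file in the site's root directory. It tells web crawlers which pages may be crawled and which may not.
https://www.jd.com/robots.txt   # JD.com's robots protocol

User-agent: *              # applies to every crawler
Disallow: /?*              # paths beginning with /? may not be accessed
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider     # a crawler the site treats as malicious
Disallow: /                # nothing on the site may be crawled
User-agent: HuihuiSpider   
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /

# * means all; Disallow lists the paths that may not be accessed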
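A crawler can check such rules programmatically before fetching a page. The sketch below uses Python's standard urllib.robotparser on a simplified, hypothetical robots.txt (the /private/ rule and the MyCrawler name are made up for illustration). It parses the text directly instead of downloading it. Note that this standard-library parser matches paths by prefix, so wildcard rules such as /pop/*.html in the real file may not be interpreted the way a search engine would interpret them.

```python
from urllib.robotparser import RobotFileParser

# Simplified, hypothetical rules in the same style as the file above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: EtaoSpider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse text directly; no network access

def allowed(agent, url):
    """Return True if the rules let this user agent fetch the URL."""
    return rp.can_fetch(agent, url)

print(allowed("EtaoSpider", "https://www.jd.com/"))
print(allowed("MyCrawler", "https://www.jd.com/index.html"))
print(allowed("MyCrawler", "https://www.jd.com/private/a.html"))
```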

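4. Five Crawling Examples

1. Fetching a JD.com product page
*First open JD.com, find a product, and copy its URL.

import requests

url = "https://item.jd.com/28245104630.html"
try:
    r = requests.get(url)               # request the page
    r.raise_for_status()                # raise an exception on an error status
    r.encoding = r.apparent_encoding    # decode using the inferred encoding
    print(r.text[:1000])                # print the first 1000 characters of the page
except requests.RequestException:
    print("An exception occurred")      # report the failure

The fetched content: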
<!DOCTYPE HTML>
<html lang="zh-CN">
<head>
    <!-- shouji -->
    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
    <title>华为（HUAWEI） 荣耀9i手机 幻夜黑 全网通4+64G【图片 价格 品牌 报价】-京东</title>
    <meta name="keywords" content="华为（HUAWEI） 荣耀9i手机 幻夜黑 全网通4+64G,华为（HUAWEI）,,京东,网上购物"/>
    <meta name="description" content="华为（HUAWEI） 荣耀9i手机 幻夜黑 全网通4+64G图片、价格、品牌样样齐全！【京东正品行货，全国配送，心动不如行动，立即购买享受更多优惠哦！】" />
    <meta name="format-detection" content="telephone=no">
    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/28245104630.html">
    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/28245104630.html">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="canonical" href="//item.jd.com/28245104630.html"/>
        <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
    <link rel="dns-prefetch" href="//static.360buyimg.com"/>
    <link rel="dns-prefetch" href="//img10.360buyimg.com"/>
    <link rel


2. Fetching an Amazon product page

>>> import requests
>>> r = requests.get("http://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503                       # the request was rejected
>>> r.encoding
'ISO-8859-1'
>>> r.encoding = r.apparent_encoding
>>> r.request.headers     # the headers we sent; User-Agent identifies the client making the request
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
# Next we change the User-Agent header so the site treats the request as a normal visit
>>> kv = {'user-agent': 'Mozilla/5.0'}   # a key-value pair; 'Mozilla/5.0' is a standard browser identifier
>>> url = "http://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url, headers=kv)    # send the request with the headers in kv
>>> r.status_code         # the request now succeeds
200
>>> r.request.headers     # the User-Agent has changed
{'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

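Many websites restrict crawlers, and a common way to do so is to inspect the User-Agent header of each request. A browser sends a browser-like User-Agent, whereas our request identifies itself as 'python-requests/2.19.1'. The site's maintainers can therefore tell that the request probably comes from a crawler and configure the server to refuse it.

The approach above replaces our User-Agent with a browser-like one so that the site treats the request as a normal visit.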
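The difference between the default and the overridden User-Agent can be seen without contacting any server. This sketch asks a requests.Session to prepare (but not send) two requests to a placeholder URL; example.com and the helper name prepared_user_agent are assumptions standing in for a real site and real crawler code.

```python
import requests

URL = "http://example.com/"   # placeholder URL; nothing is actually sent

def prepared_user_agent(headers=None):
    """Return the User-Agent header Requests would send with these headers."""
    session = requests.Session()
    req = requests.Request("GET", URL, headers=headers)
    return session.prepare_request(req).headers["User-Agent"]

default_ua = prepared_user_agent()
custom_ua = prepared_user_agent({"user-agent": "Mozilla/5.0"})

print(default_ua)   # starts with "python-requests/"
print(custom_ua)
```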

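3. Submitting search keywords to Baidu and 360

Baidu's keyword interface:
http://www.baidu.com/s?wd=keyword
360's keyword interface:
http://www.so.com/s?q=keyword
keyword is the term being searched for.


>>> import requests
>>> kv = {'wd': 'python'}   # the key-value pair supplies the term that replaces keyword
>>> r = requests.get("http://www.baidu.com/s", params=kv)
>>> r.status_code      # check whether the request succeeded
200
>>> r.request.url      # inspect the URL; keyword has been replaced
'http://www.baidu.com/s?wd=python'
>>> len(r.text)        # only the length is shown here; filtering the content is covered later
304705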

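4. Fetching and saving an image from the web

import requests

path = "E:\\robocup\\python\\abc.jpg"   # where to save the file (remember to include the image file name)
url = "http://d100.paixin.com/1154062/2634/v/380/depositphotos_26348807-stock-illustration-grey-abstract-background-for-design.jpg"   # image URL, found by right-clicking the image
r = requests.get(url)
with open(path, 'wb') as f:     # open the file; the with block closes it automatically
    f.write(r.content)          # save the image in binary form
print("File saved successfully")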
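A common refinement is to take the file name from the URL and to skip files that already exist. The sketch below shows only that file-handling part, with the image bytes passed in as an argument so it runs without a network connection; the function name save_image_bytes and the example.com URL are hypothetical.

```python
import os
import tempfile

def save_image_bytes(content, url, root):
    """Save content under root, named after the last URL segment.

    Returns the file path; an existing file is left untouched.
    """
    filename = url.split("/")[-1]          # e.g. "abc.jpg"
    os.makedirs(root, exist_ok=True)       # create the folder if needed
    path = os.path.join(root, filename)
    if not os.path.exists(path):
        with open(path, "wb") as f:        # binary mode for image data
            f.write(content)
    return path

# Demonstration with fake bytes in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image_bytes(b"\xff\xd8\xff", "http://example.com/img/abc.jpg", tmp)
    print(os.path.basename(saved))
```

In a real crawler the first argument would be r.content from requests.get(url).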

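5. Looking up the location of an IP address
The lookup goes through www.ip138.com.
Its query interface resembles Baidu's: http://m.ip138.com/ip.asp?ip=ipaddress
so a request simply replaces ipaddress at the end with the IP to look up.

>>> import requests
>>> url = "http://m.ip138.com/ip.asp?ip="
>>> r = requests.get(url + '202.204.80.112')
>>> r.status_code
200
>>> r.text[-500:]              # the last 500 characters of the response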
'value="查询" class="form-btn" />\r\n\t\t\t\t\t</form>\r\n\t\t\t\t</div>\r\n\t\t\t\t<div class="query-hd">ip138.com IP查询(搜索IP地址的地理位置)</div>\r\n\t\t\t\t<h1 class="query">您查询的IP：202.204.80.112</h1><p class="result">本站主数据：北京市海淀区 北京理工大学 教育网</p><p class="result">参考数据一：北京市 北京理工大学</p>\r\n\r\n\t\t\t</div>\r\n\t\t</div>\r\n\r\n\t\t<div class="footer">\r\n\t\t\t<a href="http://www.miitbeian.gov.cn/" rel="nofollow" target="_blank">沪ICP备10013467号-1</a>\r\n\t\t</div>\r\n\t</div>\r\n\r\n\t<script type="text/javascript" src="/script/common.js"></script></body>\r\n</html>\r\n'

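5. Extraction for Web Crawlers: The BeautifulSoup Library

1. Installing and testing BeautifulSoup
pip install beautifulsoup4

The library treats an HTML page like a pot of soup and provides methods for cooking it.

# First pick an HTML page: http://python123.io/ws/demo.html
>>> import requests        # use requests to fetch the page source
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text    # the result is one long string that is hard to read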
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text   # store the page source in a variable
>>> from bs4 import BeautifulSoup     # import the BeautifulSoup class
>>> soup = BeautifulSoup(demo, "html.parser")   # parse it with the 'html.parser' parser
>>> print(soup.prettify())      # print the result, which is much tidier
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

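Every page's source is made up of many pairs of angle-bracket tags plus their content, which together form a tag tree.

The beautifulsoup library parses, traverses, and maintains the tag tree.
from bs4 import BeautifulSoup   # import the BeautifulSoup class
soup = BeautifulSoup('<p>data</p>', 'html.parser')
The first argument is the markup to analyze; the second is the parser to use.

2. Basic elements of the BeautifulSoup library:

Tag

<p class="title"> ... </p>
The name 'p' appears in a pair; the closing tag has an extra '/' compared with the opening tag.
class is an attribute with the value 'title'. A tag can have zero or more attributes, each of which is a key-value pair.

BeautifulSoup is a class that corresponds to the tag tree, that is, the entire content of an HTML/XML document. Once the parsed content is assigned to a variable, it can be manipulated and extracted from.

*BeautifulSoup parsers: [Figure: table of available parsers]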

>>> soup.title    # the page title shown in the browser's title bar
<title>This is a python demo page</title>
>>> tag = soup.a    # get the first a tag
>>> print(tag)
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name     # name of the a tag
'a'
>>> soup.a.parent.name   # name of the a tag's parent
'p'
>>> tag.attrs           # attributes of the a tag, as a dict
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['href']   # look up one attribute of the a tag
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)    # type of the attributes
<class 'dict'>
>>> type(tag)    # type of the tag
<class 'bs4.element.Tag'>
>>> soup.a.string     # the string inside the a tag
'Basic Python'
>>> soup.p.string    # the string inside the first p tag
'The demo python introduces several python courses.'
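The .string of a tag can be either ordinary text or an HTML comment, and the two are different types. The sketch below parses a small hand-written snippet (not the demo page) to show the difference; the helper is_visible_text is hypothetical and is just one way to avoid mistaking a comment for visible text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hand-written snippet for illustration: one comment, one ordinary string.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

comment = soup.b.string   # the comment markers are stripped
text = soup.p.string

def is_visible_text(node):
    """True for ordinary text, False for HTML comments."""
    return isinstance(node, NavigableString) and not isinstance(node, Comment)

print(type(comment).__name__, repr(str(comment)))
print(type(text).__name__, repr(str(text)))
print(is_visible_text(comment), is_visible_text(text))
```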

3. Traversing HTML content with the bs4 library

a. Downward traversal

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents    # the children of head, collected in a list
[<title>This is a python demo page</title>]
>>> soup.body.contents     # note that '\n' also counts as a node
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)        # number of child nodes
5
>>> soup.body.contents[1]  # the child at index 1, which is the first p tag
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> for child in soup.body.children:   # loop over the child nodes
...     print(child)
    
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

b. Upward traversal

>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent   # the parent of html is the soup object itself, so the whole document is shown
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent    # soup itself has no parent, so nothing is shown
>>> for parent in soup.a.parents:   # loop over all ancestors
...     if parent is None:     # the loop reaches soup, which has no parent
...         print(parent)
...     else:                 # otherwise print the tag name
...         print(parent.name)
        
p
body
html
[document]

c. Sibling traversal: only between nodes that share the same parent

>>> soup.a.next_sibling   # siblings of a tag may be non-tag nodes, such as the string ' and '
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling    # the sibling node before the a tag
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> for sibling in soup.a.next_siblings:  # loop over the siblings after a
...     print(sibling)
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
>>> for sibling in soup.a.previous_siblings:  # loop over the siblings before a
...     print(sibling)
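Because whitespace strings such as '\n' are nodes too, traversal code usually filters by type. The sketch below rebuilds a shortened copy of the demo page as a string so that it runs offline (the shortened HTML is an assumption, not the exact page) and keeps only Tag nodes in each direction.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page, written by hand for illustration.
DEMO = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO, "html.parser")

# Downward: keep only the children of <body> that are tags.
tag_children = [child.name for child in soup.body.children if isinstance(child, Tag)]

# Upward: names of all ancestors of the first <a> tag.
ancestors = [parent.name for parent in soup.a.parents]

# Sideways: tag siblings that follow the first <a> tag.
later_links = [sib["id"] for sib in soup.a.next_siblings if isinstance(sib, Tag)]

print(tag_children)
print(ancestors)
print(later_links)
```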

4. Formatted HTML output with the bs4 library

>>> print(soup.a.prettify())   # prettify() puts each tag and string on its own line, making the markup easier to read
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
>>>print(soup.prettify())

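6. Organizing and Extracting Information

1. Three ways to represent information

XML marks up information with tags written in angle brackets.

JSON represents information as key-value pairs.

YAML represents information through indentation.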
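As a small illustration, the sketch below writes the same record in two of these formats using only the standard library. The record itself (a university name and city) is made up for the example, and YAML is left out because the standard library has no YAML module.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
record = {"name": "Beijing Institute of Technology", "city": "Beijing"}

# JSON: key-value pairs.
json_text = json.dumps(record)

# XML: one tag per field, wrapped in a root tag.
root = ET.Element("university")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)
```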

2. Extracting marked-up information

>>> soup.find_all('a')   # find all a tags
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])   # find all a tags and b tags
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
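>>> for tag in soup.find_all(True):   # True matches every tag, so this visits all descendants
...     print(tag.name)
html
head
title
body
p
b
p
a
a
>>> import re   # regular expressions (covered later), used here to search by pattern
>>> for tag in soup.find_all(re.compile('b')):   # tags whose name matches the pattern 'b'
...     print(tag.name)
body
b
>>> soup.find_all('p', 'course')   # p tags whose class is 'course'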
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')   # tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')   # no id is exactly 'link', so the result is empty; use a regular expression to match ids containing 'link'
[]
>>>soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)   # search only the direct children
[]
>>> soup.find_all(string=re.compile('python'))   # search the string content for matching text
['This is a python demo page', 'The demo python introduces several python courses.']
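A typical use of find_all is collecting every link on a page. The sketch below runs on a short hand-written fragment modeled on the demo page (an assumption so that it works offline) and also combines a tag name with a regular-expression attribute filter.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment modeled on the demo page, for illustration only.
HTML = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(HTML, "html.parser")

# Collect the href attribute of every <a> tag.
links = [a.get("href") for a in soup.find_all("a")]

# Combine a tag name with an attribute filter given as a regular expression.
ids = [a["id"] for a in soup.find_all("a", id=re.compile("^link"))]

# find() returns only the first match, or None when nothing matches.
first = soup.find("a")
missing = soup.find("table")

print(links)
print(ids)
print(first.string, missing)
```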


7. Case Study: A Crawler for Chinese University Rankings

# Best Chinese Universities Ranking site: http://www.zuihaodaxue.cn/zuihaodaxuepaiming2018.html
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLtext(url):
    # fetch the page and return its text, or an empty string on failure
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print("Page fetched successfully")
        return r.text
    except requests.RequestException:
        print("Failed to fetch the page")
        return ""

def fillist(ulist, html):
    # collect [rank, name, score] from every table row into ulist
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:                          # nothing to parse, e.g. the download failed
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):    # skip the strings between rows
            tds = tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printlist(ulist, num):
    # write the first num entries to a tab-separated text file
    with open("E:\\robocup\\python\\中国大学排名2018.txt", 'w', encoding='utf-8') as f:
        f.write('Rank\tUniversity\t\tScore\n')
        for u in ulist[:num]:
            f.write(u[0])
            f.write('\t')
            f.write(u[1])
            f.write('\t\t')
            f.write(u[2])
            f.write('\n')
    print("File saved successfully")
    # print("{0:^10}\t{1:{3}^10}\t{2:^10}".format("Rank", "University", "Score", chr(12288)))
    # for u in ulist[:num]:
    #     print("{0:^10}\t{1:{3}^10}\t{2:^10}".format(u[0], u[1], u[2], chr(12288)))

def main():
    mylist = []
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2018.html"
    html = getHTMLtext(url)
    fillist(mylist, html)
    printlist(mylist, 20)

main()
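The row-extraction step can be tried without the live site. The sketch below applies the same idea to a tiny hand-written table; the rows, scores, and column layout (rank, name, province, score) are invented for illustration and are not taken from the real ranking page. It also shows the chr(12288) trick from the commented-out lines, which pads with a full-width space so that Chinese names line up.

```python
import bs4
from bs4 import BeautifulSoup

# Invented sample table; the real page layout may differ.
SAMPLE = """
<table><tbody>
<tr><td>1</td><td>甲大学</td><td>北京</td><td>95.0</td></tr>
<tr><td>2</td><td>乙大学</td><td>上海</td><td>90.5</td></tr>
</tbody></table>
"""

def fill_list(html):
    """Return [rank, name, score] for every <tr> inside <tbody>."""
    rows = []
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find("tbody").children:
        if isinstance(tr, bs4.element.Tag):      # skip '\n' strings between rows
            tds = tr.find_all("td")
            rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows

def format_rows(rows):
    """Center each column; the name column is padded with full-width spaces."""
    template = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    return [template.format(r[0], r[1], r[2], chr(12288)) for r in rows]

rows = fill_list(SAMPLE)
for line in format_rows(rows):
    print(line)
```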
