Learning Python: A First Look at Web Crawlers

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2851

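1. A brief overview of how crawlers work

General-purpose crawlers

That is, search engines. They learn about sites either from information such as domain names that the sites submit themselves, or through cooperation with DNS providers, and they crawl the content of most sites.

Focused crawlers

They imitate the way a user (that is, a client browser) accesses a server, so that the server treats the request as ordinary browser traffic and returns the data.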
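
As a minimal sketch of what imitating a browser can look like, a request can carry a browser-like User-Agent header. The example below only builds the request with the standard library and inspects it; nothing is sent. The header string and the helper name build_browser_like_request are illustrative assumptions, not values taken from the assignment.

```python
from urllib import request

# Hypothetical browser-like User-Agent string, for illustration only
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_browser_like_request(url):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return request.Request(url, headers={"User-Agent": BROWSER_UA})


req = build_browser_like_request("http://news.gzcc.cn/html/xiaoyuanxinwen")
# urllib stores header names in capitalized form, so the key is "User-agent"
print(req.get_header("User-agent"))
```

Passing req to request.urlopen would then send it with that header attached.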

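2. Understanding the (focused) crawler development process

Send the request

Send the target server a request message crafted to look like it came from a browser.

Get the response

Receive the data the server sends back.

Parse the content

Parse the received data in a suitable way.

Save the data

Write the parsed data to a text file or a database.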
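
The four steps above can be sketched as three small functions. This is only one possible layout, using the standard library alone: the names fetch, parse, save and TitleParser are hypothetical, and extracting the page title is an assumed stand-in for whatever data a real crawler would want. The demonstration at the end runs steps 3 and 4 on a literal page, so it needs no network access.

```python
import os
import tempfile
from html.parser import HTMLParser
from urllib import request


class TitleParser(HTMLParser):
    """Collect the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(url):
    """Steps 1 and 2: send the request and return the response body as text."""
    with request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    """Step 3: pull the page title out of the HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save(title, path):
    """Step 4: append the parsed result to a text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")


# Offline demonstration of steps 3 and 4 on a literal page
sample_html = "<html><head><title>Campus news</title></head><body></body></html>"
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
save(parse(sample_html), out_path)
```

For a live page, save(parse(fetch(url)), out_path) chains all four steps.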

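How a browser works:

It sends a request message to the server and, on receiving the response message, parses the data inside it and caches part of that data.

Fetching a site

Using the third-party library requests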


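import requests

url = "http://news.gzcc.cn/html/xiaoyuanxinwen"

def use_requests(url):
    '''
    Fetch the response with the third-party library requests.
    '''
    response = requests.get(url)
    # Set the encoding explicitly so that response.text decodes the page as UTF-8
    response.encoding = "utf-8"
    return response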

Run result (screenshot not preserved)

Using Python's built-in library urllib

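from urllib import request

def use_urllib(url):
    '''
    Fetch the response with the built-in urllib library.
    '''
    # urlopen returns a response object; call .read() on it to get the raw bytes
    response = request.urlopen(url)
    return response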
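
Unlike requests, urllib hands back raw bytes, so the body still has to be decoded before it can be parsed. The sketch below shows one way to do that; fetch_text is a hypothetical helper, and falling back to UTF-8 when the server declares no charset is an assumption. The demonstration uses a data: URL so that it runs offline.

```python
from urllib import request


def fetch_text(url, fallback_encoding="utf-8", timeout=10):
    """Open url, read the body, and decode it to str."""
    with request.urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header, if any
        charset = response.headers.get_content_charset() or fallback_encoding
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the demonstration offline; a real crawl would pass an http URL
demo = fetch_text("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E")
print(demo)
```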

A simple HTML page

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>This is a simple web page</title>
    <!-- Simple style definitions -->
    <style>
        .class1 {
            background: green;
        }
        .class2 {
            background: yellow;
        }
    </style>
</head>
<body>
    <div class="class1">
        <strong id="strong">This is a bold tag</strong><br/>
        <b id="b">This is still a bold tag</b><br/>
        <big id="big">This also looks like a bold tag</big><br/>
    </div>
    <div class="class2">
        <del id="del">This is a strikethrough</del><br/>
        <s id="s">This is also a strikethrough</s><br/>
        <strike id="strike">This is a strikethrough as well</strike><br/>
    </div>
</body>
</html>

Parsing the page with BeautifulSoup

from bs4 import BeautifulSoup

# Read the sample page, saved locally as simple.html
with open(r'simple.html', 'r', encoding='utf-8') as f:
    text = f.read()
dom_tree = BeautifulSoup(text, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
from_label = dom_tree.select('strong')[0].text   # by tag name
from_class = dom_tree.select('.class1')[0].text  # by class
from_id = dom_tree.select('#strong')[0].text     # by id

print(from_label, from_class, from_id)
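
Indexing with [0] raises IndexError when a selector matches nothing, which can easily happen on a real page. As a hedged alternative, select_one returns either the first match or None, so the missing case can be handled explicitly. The helper name text_or_default and the inline HTML are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Small inline page so the example does not depend on simple.html
html = """
<div class="class1">
    <strong id="strong">bold text</strong>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_default(soup, selector, default=""):
    """Return the stripped text of the first match, or default if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else default


found = text_or_default(soup, "#strong")
missing = text_or_default(soup, ".no-such-class", default="(not found)")
print(found, missing)
```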

Extracting a news article


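# Reuses use_requests and BeautifulSoup from the sections above
dom_tree = BeautifulSoup(use_requests("http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html").text,
                         'html.parser')
# The headline sits in the element with class show-title
title_from_class = dom_tree.select(".show-title")[0].text
print(title_from_class)

# The metadata line sits in the element with class show-info;
# split it on whitespace and drop the last item
infos_from_class = dom_tree.select(".show-info")[0].text
info_list = infos_from_class.split()[0:-1]
for i in info_list:
    print(i)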
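
Printing the split items one by one is fine for a first look, but a dictionary is usually easier to save or query later. The sketch below assumes each item has the form label, separator, value; the sample line, its labels, and the helper parse_info_fields are all made up for illustration, and the real page may use different labels or a different separator.

```python
def parse_info_fields(info_text, separator=":"):
    """Turn 'key:value key:value ...' text into a dict, skipping tokens without a separator."""
    fields = {}
    for token in info_text.split():
        key, sep, value = token.partition(separator)
        if sep and key:
            fields[key] = value
    return fields


# Made-up sample line; the real page's labels and separator may differ
sample = "Published:2019-03-20 Author:Alice Source:NewsOffice Clicks"
fields = parse_info_fields(sample)
print(fields)
```

Tokens without the separator, such as the trailing item here, are skipped rather than raising an error.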
