Beginner Python Web Scraping (1)

Copyright notice: 所爱隔山海. https://blog.csdn.net/tongxinzhazha/article/details/78847724

Preface

Beginner-level Python scraping only requires mastering the following four techniques (a minimal sketch of all four follows the list):

  • the find string method
  • list slicing, list[-x:-y]
  • file read/write operations
  • the while loop
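
A minimal sketch of all four (the sample string and filename here are hypothetical, just to show the mechanics):

s = '<a href="http://example.com/post.html">Title</a>'

start = s.find('href="')          # find: index of a substring, or -1 if absent
end = s.find('.html')
url = s[start + 6:end + 5]        # slicing cuts the URL out of the string

i = 0
while i < 1:                      # while drives the repeat-until-done loop
    print url                     # Python 2 print, matching the code below
    i += 1

with open("demo.txt", "w") as f:  # file write saves what we fetch
    f.write(url)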

How it works:
Everything on a web page corresponds to source code, so scraping comes down to two parts: fetching a page's source and extracting what you need from it.
Step 1: cut out the code for the target content. For a single article, the fragment is:

 <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a>

This is the code fragment for the article; from it we need to slice out the URL:
http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
Step 2: fetch the URL and save it to disk.
Import the built-in urllib library:

content = urllib.urlopen(url).read()
filename = "xxx"

To save as HTML:

with open(filename, "w") as f:
    f.write(content)

Or to save as TXT:

with open("…../" + filename + ".txt", "w") as f:
    f.write(content)
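
Note: urllib.urlopen exists only in Python 2, which is what this article uses throughout. On Python 3 a rough equivalent (assuming the same url and filename variables) lives in urllib.request, and read() returns bytes, so the file should be opened in binary mode:

import urllib.request

content = urllib.request.urlopen(url).read()   # bytes under Python 3
with open(filename, "wb") as f:                # write in binary mode
    f.write(content)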
For all the articles on the index page, the idea is the same: read the whole index page with urllib.urlopen(url).read(), then cut each article's URL out of that content and save each article to disk.

Data source: Han Han's blog
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Part 1: Download a single blog post and save it locally

Step 1: analyze the HTML source
Right-click → Inspect Element to view the page source (in Chrome, press F12), then search the source for the article title 写给那个茶水妹的《乘风破浪》诞生…. Within the body, press Ctrl+F to search for it.
The article-title fragment follows this pattern:

 <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a>

We then search this string for the parts we need.

Step 2: process the code and extract the URL
Import the built-in urllib library:

content = urllib.urlopen(url).read()
filename = "xxx"

To save as HTML:

with open(filename, "w") as f:
    f.write(content)

Or to save as TXT:

with open("…../" + filename + ".txt", "w") as f:
    f.write(content)

Implementation

import urllib


# The anchor tag, with escaped quotes
str0 = "<a title=\"\" target=\"_blank\" href=\"http://blog.sina.com.cn/s/blog_4701280b0102wrup.html\">写给那个茶水妹的《乘风破浪》诞生…</a>"
# The pattern is: href="url" followed by ">title</a>"

# Locate the title
title_1 = str0.find(r">")
title_2 = str0.find(r"</a>")
title = str0[title_1+1:title_2]
print title
# Locate the http link
href = str0.find(r"href=")
html = str0.find(r".html")
# Slice out the URL
url = str0[href+6:html+5]

# read() returns the raw HTML; its type is str
content = urllib.urlopen(url).read()

m = url.find("blog_")
filename = url[m:]
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
filename_1 = filename_0 + filename

# Save as HTML
with open(filename_1, "w+") as f:
    f.write(content)
# Save as a txt file
with open(filename_1 + ".txt", "w+") as f:
    f.write(content)
# Save as a txt file named after the title.
# This source file is UTF-8 encoded, so the path is decoded to unicode first
with open(unicode(filename_0 + title + ".txt", "utf-8"), "w+") as f:
    f.write(content)
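
The last open call wraps the path in unicode() because this source file is UTF-8 encoded: title is a UTF-8 byte string, and on Windows a byte-string path containing Chinese characters tends to produce a garbled filename, while a unicode path reaches the filesystem intact. A minimal illustration (the filename here is hypothetical):

# Python 2: decode a UTF-8 byte-string path before opening it
name = "题目.txt"                             # hypothetical UTF-8 byte string
with open(unicode(name, "utf-8"), "w") as f:  # unicode path, safe for Chinese
    f.write("content")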

Output: the title is printed, and the page is saved under pacong_data as an HTML file, a .txt copy, and a .txt file named after the title.

Part 2: Scrape every article on the index page and save locally

Just as with a single article, scraping every article on the index page means reading the index page's full source and slicing out each article's URL. Examining the article links:

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a></span> 

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《论电影的七个元素》——关于我对电…</a></span>                                               

So we need a pattern within the full page content that marks each article. The distinguishing field is <a title="" target="_blank", as used in the code below.

# -*- coding: utf-8 -*-
# author : santi
# function :
# time :

import urllib
import time

# Read the entire index page directly
str0 = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.urlopen(str0).read()

# Print con to inspect the pattern around each article title:
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102elmo.html">2013年09月27日</a></span>
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wruo.html">写给那个茶水妹的《乘风破浪》诞生…</a></span>


with open(r"F:\python\PyCharmWorkpalce\Crawler\pacong_data\context.txt","w") as f:
    f.write(con)


url_all = [""] * 60
url_name = [""] * 60
index = con.find("<a title=\"\" target=\"_blank")
href = con.find("href=\"http:", index)
html = con.find(".html\">", href)
title = con.find("</a></span>", html)

i = 0
# find returns -1 when the marker is missing, meaning every article has been
# located, so the while loop exits. The i < 50 guard also bounds the loop to
# the 50 articles on one page.
while index != -1 and href != -1 and html != -1 and title != -1 and i < 50:
    url_all[i] = con[href+6:html+5]
    url_name[i] = con[html+7:title]

    print "finding...   " + url_all[i]
    # Continue the search from just past the previous match
    index = con.find("<a title=\"\" target=\"_blank", title)
    href = con.find("href=\"http:", index)
    html = con.find(".html\">", href)
    title = con.find("</a></span>", html)
    i += 1

else:
    # A while loop's else clause runs once the condition turns false
    print "Find End!"


# Save each article locally
# Example URL: http://blog.sina.com.cn/s/blog_4701280b0102wrup.html

# Every article URL has the same layout, so compute the slice bounds once
m_0 = url_all[0].find("blog_")
m_1 = url_all[0].find(".html") + 5
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"

j = 0
while j < i:
    filename_1 = url_all[j][m_0:m_1]
    content = urllib.urlopen(url_all[j]).read()
    print "downloading.... " + filename_1

    with open(filename_0 + filename_1, "w+") as f:
        f.write(content)
    with open(filename_0 + filename_1 + ".txt", "w+") as f:
        f.write(content)
    with open(unicode(filename_0 + url_name[j] + ".txt", "utf-8"), "w+") as f:
        f.write(content)
    time.sleep(15)   # wait between requests to be polite to the server
    j += 1

print "Download article finished! "

Output:

finding...   http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102wruo.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eo83.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102elmo.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eksm.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ek51.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102egl0.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ef4t.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102edcd.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ecxd.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eck1.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ec39.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eb8d.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eb6w.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eau0.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e85j.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7wj.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7vx.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7er.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e63p.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e5np.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4qq.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4gf.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4c3.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e490.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e42a.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e3v6.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e3nr.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e150.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e11n.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0th.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0p3.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0l4.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0ib.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0hj.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0fm.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0eu.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0ak.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e07s.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e074.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e06b.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e061.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e02q.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz9f.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz84.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz5s.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dyao.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
Find End!
downloading.... blog_4701280b0102wrup.html
downloading.... blog_4701280b0102wruo.html
downloading.... blog_4701280b0102eohi.html
downloading.... blog_4701280b0102eo83.html
downloading.... blog_4701280b0102elmo.html
downloading.... blog_4701280b0102eksm.html
downloading.... blog_4701280b0102ek51.html
downloading.... blog_4701280b0102egl0.html
downloading.... blog_4701280b0102ef4t.html
downloading.... blog_4701280b0102edcd.html
downloading.... blog_4701280b0102ecxd.html
downloading.... blog_4701280b0102eck1.html
downloading.... blog_4701280b0102ec39.html
downloading.... blog_4701280b0102eb8d.html
downloading.... blog_4701280b0102eb6w.html
downloading.... blog_4701280b0102eau0.html
downloading.... blog_4701280b0102e85j.html
downloading.... blog_4701280b0102e7wj.html
downloading.... blog_4701280b0102e7vx.html
downloading.... blog_4701280b0102e7pk.html
downloading.... blog_4701280b0102e7er.html
downloading.... blog_4701280b0102e63p.html
downloading.... blog_4701280b0102e5np.html
downloading.... blog_4701280b0102e4qq.html
downloading.... blog_4701280b0102e4gf.html
downloading.... blog_4701280b0102e4c3.html
downloading.... blog_4701280b0102e490.html
downloading.... blog_4701280b0102e42a.html
downloading.... blog_4701280b0102e3v6.html
downloading.... blog_4701280b0102e3nr.html
downloading.... blog_4701280b0102e150.html
downloading.... blog_4701280b0102e11n.html
downloading.... blog_4701280b0102e0th.html
downloading.... blog_4701280b0102e0p3.html
downloading.... blog_4701280b0102e0l4.html
downloading.... blog_4701280b0102e0ib.html
downloading.... blog_4701280b0102e0hj.html
downloading.... blog_4701280b0102e0fm.html
downloading.... blog_4701280b0102e0eu.html
downloading.... blog_4701280b0102e0ak.html
downloading.... blog_4701280b0102e07s.html
downloading.... blog_4701280b0102e074.html
downloading.... blog_4701280b0102e06b.html
downloading.... blog_4701280b0102e061.html
downloading.... blog_4701280b0102e02q.html
downloading.... blog_4701280b0102dz9f.html
downloading.... blog_4701280b0102dz84.html
downloading.... blog_4701280b0102dz5s.html
downloading.... blog_4701280b0102dyao.html
downloading.... blog_4701280b0102dxmp.html
Download article finished! 
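
For comparison, the same (url, title) extraction can be done in one pass with the built-in re module. This is only a sketch under the assumption that the markup matches the markers used above; the pattern has not been verified against the live page:

import re

# One regex captures the URL and the title of every article anchor
pattern = r'<a title="" target="_blank" href="(http://blog\.sina\.com\.cn/s/[^"]+\.html)">([^<]+)</a>'
for url, name in re.findall(pattern, con):
    print "finding...   " + url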

