Scraping Articles from the CloudBlog Site with a Python Crawler

---------------------------------------------------------------------------------------------
[Copyright notice: this is an original article by the author; please credit the source when reposting]
Source: http://blog.csdn.net/sdksdk0/article/details/76208980
Author: 朱培, ID: sdksdk0
--------------------------------------------------------------------------------------------

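This article uses a Python crawler to download the articles of a website, including basic information such as the title, publication time, author, and body text, and stores the data in a database, so it covers the whole workflow from start to finish. The crawler first collects all article links on the home page into a URL set, then visits the collected links one by one and parses the full article content out of each page.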

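I recently needed to collect financial and economic news articles. This article uses the blog posts on CloudBlog as the example, as shown in the figure below. The first step is to crawl the home page, mainly its article list, and put the URLs found in that list into a URL set. To avoid visiting a page twice, a set of already-visited URLs is kept and used to filter newly discovered URLs. The DOM tree of each page is then parsed to obtain the article information, which is stored in an Article object.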
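The history filter described above can be sketched roughly as follows. This is only an illustration, not the code used later in the article: the helper name `add_links`, the made-up sample paths, and the use of `urllib.parse.urljoin` to turn relative links into absolute ones are all assumptions made for the example.

```python
from urllib.parse import urljoin

BASE = 'https://www.tianfang1314.cn/index.html'  # start page used in this article


def add_links(hrefs, pending, visited, base=BASE):
    """Queue each link that has not been crawled yet (hypothetical helper).

    hrefs   -- href values taken from <a> tags, relative or absolute
    pending -- set of absolute URLs still to be crawled
    visited -- set of absolute URLs that were already crawled
    """
    for href in hrefs:
        url = urljoin(base, href)  # resolve relative paths against the start page
        if url not in visited:
            pending.add(url)
    return pending


# Made-up sample paths, only to show the filtering
pending = set()
visited = {'https://www.tianfang1314.cn/blog/articles/a/1.html'}
add_links(['/blog/articles/a/1.html', '/blog/articles/b/2.html'], pending, visited)
print(pending)
```

Because both sets hold absolute URLs, a link is recognised as visited whether the page wrote it as a relative path or as a full address.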

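Once the URLs have been collected, the crawler visits each address in a loop and extracts the title, author, publication time, and body text from the detail page, as in the figure below. The data in the Article object is then saved to the MySQL database. Each time a record is stored successfully, the counter is incremented and the article title is printed; otherwise the error message is printed. The program ends when every URL in the set has been read or the number of stored records reaches the configured limit.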

The implementation is as follows:

1. Database structure

    SET FOREIGN_KEY_CHECKS=0;

    -- ----------------------------
    -- Table structure for news
    -- ----------------------------
    DROP TABLE IF EXISTS `news`;
    CREATE TABLE `news` (
      `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
      `url` varchar(255) NOT NULL,
      `title` varchar(45) NOT NULL,
      `author` varchar(12) DEFAULT NULL,
      `date` varchar(25) DEFAULT NULL,
      `content` longtext,
      `zq_date` varchar(25) DEFAULT NULL,
      PRIMARY KEY (`id`),
      UNIQUE KEY `url_UNIQUE` (`url`)
    ) ENGINE=InnoDB AUTO_INCREMENT=122 DEFAULT CHARSET=utf8;
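
The `UNIQUE KEY` on `url` means the database itself refuses a second row for the same article. As a rough, self-contained illustration of that behaviour, the sketch below uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL. That substitution is an assumption made purely so the example runs without a server, and the trimmed-down table and the `save_url` helper are hypothetical as well.

```python
import sqlite3

# Throwaway in-memory database (SQLite, not the MySQL table above)
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE news ('
    ' id INTEGER PRIMARY KEY AUTOINCREMENT,'
    ' url TEXT NOT NULL UNIQUE,'
    ' title TEXT NOT NULL)'
)


def save_url(conn, url, title):
    """Insert one row; return False if the url is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO news (url, title) VALUES (?, ?)', (url, title))
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE constraint on url is violated
        return False


print(save_url(conn, '/blog/articles/a/1.html', 'first'))
print(save_url(conn, '/blog/articles/a/1.html', 'again'))
```

With a constraint like this in place, an explicit SELECT before each INSERT becomes a convenience rather than the only safeguard against duplicates.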

2. Python code

import re  # regular expression module
import bs4  # DOM parsing module
import pymysql  # database connection module
import urllib.request  # network access module
import time  # time module

# Configuration
maxcount = 100  # maximum number of articles to store
home = 'https://www.tianfang1314.cn/index.html'  # start page
# Database connection parameters
db_config = {
    'host': 'localhost',
    'port': '3306',
    'username': 'root',
    'password': '123456',
    'database': 'news',
    'charset': 'utf8'
}

url_set = set()  # URLs waiting to be crawled
url_old = set()  # URLs already visited

# Fetch the home page
request = urllib.request.Request(home)
response = urllib.request.urlopen(request)
html = response.read()
# Decode the response bytes
html = html.decode('utf-8')

# Collect the article links from the home page
soup = bs4.BeautifulSoup(html, 'html.parser')
pattern = r'/blog/articles/\w+/\w+\.html'  # article links are relative paths
links = soup.find_all('a', href=re.compile(pattern))
for link in links:
    url_set.add(link['href'])

# Article class definition
class Article(object):
    def __init__(self):
        self.url = None  # URL
        self.title = None  # title
        self.author = None  # author
        self.date = None  # publication time
        self.content = None  # body text
        self.zq_date = None  # time the article was crawled

# Connect to the database
connect = pymysql.connect(
    host=db_config['host'],
    port=int(db_config['port']),
    user=db_config['username'],
    password=db_config['password'],
    database=db_config['database'],
    charset=db_config['charset']
)
cursor = connect.cursor()

# Process the collected URLs
base = 'https://www.tianfang1314.cn'  # prefix for the site's relative links
count = 0
while url_set and count < maxcount:
    try:
        # Take one link, make it absolute, and mark it as visited
        url = base + url_set.pop()
        url_old.add(url)

        # Fetch the detail page and decode it
        dresponse = urllib.request.urlopen(url)
        dhtml = dresponse.read().decode('utf-8')

        # Parse the DOM
        dsoup = bs4.BeautifulSoup(dhtml, 'html.parser')

        # Queue article links on this page that have not been visited yet
        # (reuses the relative-link pattern defined for the home page)
        for link in dsoup.find_all('a', href=re.compile(pattern)):
            if base + link['href'] not in url_old:
                url_set.add(link['href'])

        # Skip articles that are already in the database
        cursor.execute("SELECT id FROM news WHERE url = %s", (url,))
        if cursor.fetchone() is not None:
            raise Exception('Duplicate record: ' + url)

        # Extract the article fields
        article = Article()
        article.url = url
        page = dsoup.find('div', {'class': 'data_list'})
        article.title = page.find('div', {'class': 'blog_title'}).get_text()
        # Article info, e.g. 发布时间：『 2016-12-14 11:26 』用户名：sdksdk0 阅读(938) 评论(3)
        infoStr = page.find('div', {'class': 'blog_info'}).get_text()
        infoStr = infoStr.rsplit('『', 1)
        infoStr = infoStr[1].rsplit('』', 1)
        article.date = infoStr[0].strip()  # publication time
        article.author = infoStr[1].rsplit('\xa0\xa0', 1)[0].rsplit('用户名：', 1)[1]  # user name
        article.content = page.find('div', {'class': 'blog_content'}).get_text()  # body text
        article.zq_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # crawl time

        # Store the record; passing the values as parameters lets the driver escape them
        sql = ("INSERT INTO news (url, title, author, date, content, zq_date) "
               "VALUES (%s, %s, %s, %s, %s, %s)")
        data = (article.url, article.title, article.author, article.date, article.content, article.zq_date)
        cursor.execute(sql, data)
        connect.commit()

    except Exception as e:
        # Report the error and move on to the next URL
        print(e)
    else:
        print(article.title)
        count += 1

# Close the database connection
cursor.close()
connect.close()

3. Results

The collected data can then be viewed in the database: select * from news;

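Summary: I ran into a few pitfalls while writing this crawler. The main one is that the CloudBlog pages are not well-formed, so when BeautifulSoup parses them some nodes yield a lot of duplicated data. In addition, the links on this site follow the pattern /blog/articles/\w+/\w+.html rather than starting with https://, which is why the code above prepends a URL prefix. The publication time and user name are extracted with rsplit, and as you can see it takes quite a few split operations; you could of course use a regular expression to match the data instead.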
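
As a sketch of the regular-expression alternative mentioned in the summary, the snippet below pulls the publication time and user name out of the info string in one match. The pattern, the `parse_info` helper, and the assumption that the fields are separated by non-breaking spaces (`\xa0`) are illustrative guesses based on the sample string in the code comments, not something verified against the live page.

```python
import re

# Matches strings such as:
# 发布时间：『 2016-12-14 11:26 』用户名：sdksdk0 阅读(938) 评论(3)
INFO_RE = re.compile(r'『\s*(?P<date>.+?)\s*』\s*用户名：\s*(?P<author>\S+)')


def parse_info(info_str):
    """Return (date, author), or (None, None) if the text does not match."""
    match = INFO_RE.search(info_str)
    if match is None:
        return None, None
    return match.group('date'), match.group('author')


# Sample text; the \xa0 separators are an assumption
sample = '发布时间：『 2016-12-14 11:26 』用户名：sdksdk0\xa0\xa0阅读(938)\xa0\xa0评论(3)'
print(parse_info(sample))  # ('2016-12-14 11:26', 'sdksdk0')
```

Returning `(None, None)` on a mismatch lets the caller decide whether to skip the article, whereas the chained rsplit calls raise an IndexError when the layout differs.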
