Using a crawler to scrape a Hexo-built blog and generate md documents

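Motivation

All of my old blog posts were lost when I reinstalled my computer. Copying and pasting them back one by one would be far too tedious, and I had not written any Python for a while, so I decided to write a crawler that scrapes the posts and generates md documents.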

Analysis

Technologies and libraries

Python + BeautifulSoup4 (loading and parsing pages) + urllib (sending requests) + codecs (writing files)

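Home page

First, here is where a single article sits on the home page:
[Screenshot: one article in the home page DOM]

And here is how all the articles are laid out:
[Screenshot: the article list in the home page DOM]
This is about as simple as a list structure gets.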

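Pagination

The articles span more than one page, so handling pagination comes down to working out the page URLs. Once we know the pattern, we can crawl everything in one go.
Look at the URL of the second page:
[Screenshot: URL of page 2]
By inference, page 6 should be http://wintersmilesb101.online/page/6

And indeed it is.
So can the first page be written as http://wintersmilesb101.online/page/1 ?
[Screenshot: result of requesting /page/1]

It cannot, so the first page needs special handling: its URL is simply http://wintersmilesb101.online
[Screenshot: the home page]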
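
The URL rule above can be captured in one small function. This is only a sketch: `BASE_URL` and `page_url` are illustrative names that do not appear in the script below, and it assumes every page after the first follows the /page/N pattern.

```python
BASE_URL = "http://wintersmilesb101.online"


def page_url(index: int) -> str:
    """Return the list-page URL for a 1-based page index."""
    if index < 1:
        raise ValueError("page index starts at 1")
    # Page 1 is the home page itself and has no /page/N suffix
    if index == 1:
        return BASE_URL
    return f"{BASE_URL}/page/{index}"


print(page_url(1))
print(page_url(6))
```

Keeping the special case in one place means the crawl loop never has to know that the home page is different.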

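Getting the page count

The DOM structure that holds the page count looks like this:
[Screenshot: DOM of the pagination bar]

As you can see, all the page index elements share the same class. So we either use BeautifulSoup to select the element just before the last page number (the screenshot shows it is unique within this structure) and step to its sibling, or use BeautifulSoup's select method to get the list of elements with class = page-number, where the last one holds the total number of pages.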
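
The second option might look like the sketch below. The sample HTML is made up for illustration and only assumes what the text states (page links carry the class page-number and the last one is the total); `total_pages` is a hypothetical helper, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, modeled on the structure described above
SAMPLE_PAGINATION = """
<nav class="pagination">
  <span class="page-number current">1</span>
  <a class="page-number" href="/page/2/">2</a>
  <span class="space">&hellip;</span>
  <a class="page-number" href="/page/6/">6</a>
</nav>
"""


def total_pages(html: str) -> int:
    """Return the number shown by the last .page-number element."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = soup.select(".page-number")
    # Assumption: no pagination bar at all means the blog has a single page
    if not numbers:
        return 1
    return int(numbers[-1].get_text(strip=True))


print(total_pages(SAMPLE_PAGINATION))
```

Reading the number from the element text avoids depending on the exact shape of the href.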

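Article information

Which pieces of article information do we need?
The blog is built with Hexo, so we have to follow its rules. In general, each post needs front matter like this:

---
title: Python3.7 Crawler (2) - Fetching and Parsing Web Pages with Urllib2 and BeautifulSoup
date: 2017-04-08
updated: 2017-04-09
categories:
- Crawler
- Python Crawler
tags:
- Python3
- Crawler
- Urllib2
- BeautifulSoup4
---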
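
A front matter block of this shape can be assembled with a small helper. This is a sketch under a few assumptions: `front_matter` is an illustrative name, the modification time goes under Hexo's `updated` key, and values are written without YAML escaping (a title containing `: ` would need quoting).

```python
def front_matter(title, date, updated, categories, tags):
    """Build a Hexo-style front matter block as a single string."""
    lines = [
        "---",
        f"title: {title}",
        f"date: {date}",
        f"updated: {updated}",
        "categories:",
    ]
    lines += [f"- {name}" for name in categories]
    lines.append("tags:")
    lines += [f"- {name}" for name in tags]
    lines.append("---")
    return "\n".join(lines) + "\n"


print(front_matter("Demo", "2017-04-08", "2017-04-09",
                   ["Crawler"], ["Python3", "BeautifulSoup4"]))
```

Building the whole block as one string before writing it makes the output easy to check without touching the file system.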

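All of this information except the tags is already available on the article list page, as shown here:
[Screenshot: article metadata on the list page]

The article page itself has all of it as well. Its link can be read from the title element and joined with the site's base URL to get the final address. Some URLs contain Chinese characters, so we use urllib.request.quote(link) to percent-encode them. This also encodes : as %3A, so after the conversion we have to turn %3A back into :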
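
As an alternative to encoding and then restoring the colon, `quote` accepts a `safe` argument that lists characters to leave alone. The sketch below uses `urllib.parse.quote` (the same function that `urllib.request` re-exports); the article path is invented for illustration and `encode_article_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote


def encode_article_url(link: str) -> str:
    """Percent-encode a URL while leaving ':' and '/' untouched."""
    # Characters listed in safe stay literal; non-ASCII text is encoded as UTF-8
    return quote(link, safe=":/")


# Hypothetical article link containing Chinese characters
link = "http://wintersmilesb101.online/2017/04/08/爬虫/"
encoded = encode_article_url(link)
print(encoded)
```

This only suits simple links like the ones on this blog: a URL with a query string would also need `?`, `=` and `&` in `safe`.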

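Converting the body

For the body, we grab the element with class = post-body and iterate over its child elements through the children attribute. Children of type bs4.element.NavigableString are not real elements and must be skipped. Each remaining element is then converted to its Markdown equivalent according to the mapping between HTML and Markdown. BeautifulSoup has quite a few pitfalls along the way. The comments in the code explain them, so I will not repeat them here.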
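
A stripped-down version of that loop is sketched below. It covers only headings, lists and plain paragraphs, the sample HTML is invented, and `body_to_markdown` and `HEADING_PREFIX` are illustrative names rather than part of the script that follows.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# HTML heading tag -> Markdown prefix
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "h4": "#### "}


def body_to_markdown(body) -> str:
    """Convert the direct children of a body element to Markdown lines."""
    lines = []
    for child in body.children:
        # Text nodes between tags (usually just '\n') are not elements
        if isinstance(child, NavigableString):
            continue
        text = child.get_text().strip()
        if child.name in HEADING_PREFIX:
            lines.append(HEADING_PREFIX[child.name] + text)
        elif child.name == "ul":
            lines += ["- " + li.get_text().strip() for li in child.find_all("li")]
        elif child.name == "ol":
            items = child.find_all("li")
            lines += [f"{i}. " + li.get_text().strip() for i, li in enumerate(items, 1)]
        elif child.name == "p":
            lines.append(text)
    return "\n".join(lines) + "\n"


SAMPLE = ('<div class="post-body">\n<h2>Intro</h2>\n<p>Hello</p>\n'
          '<ul><li>a</li><li>b</li></ul>\n</div>')
soup = BeautifulSoup(SAMPLE, "html.parser")
markdown = body_to_markdown(soup.find("div", class_="post-body"))
print(markdown)
```

Collecting lines in a list and joining them at the end keeps the conversion separate from file writing, which makes it easier to test on small snippets.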

Implementation

import codecs
import os
import urllib.request

import bs4
from bs4 import BeautifulSoup

filePath = r"H:/GIT/Blog/WinterSmileSB101/source/_posts/old/"
url = "http://wintersmilesb101.online"

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = urllib.request.Request(url, headers={
    'User-Agent': user_agent
})
print('Sending page request')
response = urllib.request.urlopen(req)
content = response.read().decode('utf-8')
# Uncomment to dump the raw HTML of the page
# print(content)

soup = BeautifulSoup(content, "lxml")

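# Get the number of pages
spans = soup.select('span.space')

# The link right after the last span.space points at the last page
pageHref = spans[-1].next_sibling['href']
# The page number is the third '/'-separated piece of the href (e.g. '/page/6/' gives '6')
pageNum = int(pageHref.split('/')[2])
print(pageNum)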

# get other page
urlBase = "http://wintersmilesb101.online/page/"
index = 1
while index <= pageNum:
    # Page 1 was already fetched above; later pages need their own URL
    if index > 1:
        url = urlBase+str(index)
        print(url)
        req = urllib.request.Request(url, headers={
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        })
        print('Sending page request: '+url)
        response = urllib.request.urlopen(req)
        content = response.read().decode('utf-8')
        soup = BeautifulSoup(content, "lxml")

    # Get the list of articles
    articles = soup.find_all('article')

    # Process each article
    for article in articles:
        # Creation time
        createTime = article.find('time', title="创建于").text.strip()
        # Last-updated time
        updateTime = article.find('time', title="更新于").text.strip()
        # Categories
        categories = article.find_all('a', attrs = {'itemprop': "url", 'rel': "index"})

        # Comma-separated category URLs and names
        categoryUrl = ''
        categoryName = ''
        for category in categories:
            categoryUrl += category['href']+','
            categoryName += category.text.strip()+','
        # Drop the trailing comma
        categoryUrl = categoryUrl[:-1]
        categoryName = categoryName[:-1]
        # Fetch the article page
        urlMain = ''
        link = article.link['href']
        # The second-to-last path segment of the link serves as the title
        articleTitle = link.split('/')[-2]
        # print(articleTitle)
        # Percent-encode the Chinese characters in the URL
        urlMain = urllib.request.quote(link)
        # quote() also turns ':' into '%3A'; restore it
        urlMain = urlMain.replace('%3A', ':')
        # print(urlMain)
        print('Sending article request')
        req = urllib.request.Request(urlMain, headers={
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        })
        response = urllib.request.urlopen(req)
        mainContent = response.read().decode('utf-8')
        # Uncomment to dump the raw HTML of the article page
        # print(mainContent)
        mainSoup = BeautifulSoup(mainContent,'lxml')
        body = mainSoup.find('div', itemprop="articleBody")

        blockquote = body.blockquote
        if blockquote is not None:
            blockquoteText = blockquote.p.text
            # print(blockquote.p)
            externalUrl = None
            mineUrl = blockquote.p.a['href']
            if blockquote.p.find('a', rel="external"):
                externalUrl = blockquote.p.find('a', rel="external")['href']
            # print(externalUrl)
            # Rewrite the links inside the quote in md syntax
            if externalUrl:
                blockquoteText = blockquoteText.replace("原文地址", "[原文地址](" + externalUrl + ")")

            blockquoteText = blockquoteText.replace(mineUrl, "[" + mineUrl + "](" + mineUrl + ")")
        # Get the tags
        tags = mainSoup.find_all('a', rel='tag')
        # print(tags)
        # Write the md file
        # Create the output directory for this page if it does not exist yet
        if not os.path.exists(filePath + str(index) + '/'):
            os.makedirs(filePath + str(index) + '/')
        file = codecs.open(filePath + str(index) + '/' + articleTitle + '.md', "w", encoding='utf8')  # set the file encoding explicitly

        # Write the front matter
        file.write('---\n')
        file.write("title: " + articleTitle + '\n')
        file.write("date: " + createTime + '\n')
        # Hexo reads the modification time from 'updated'; a second 'date' key would be a duplicate
        file.write("updated: " + updateTime + '\n')
        file.write("categories:\n")
        for category in categoryName.split(','):
            file.write('- ' + category + '\n')
        # The list items must start on the line after 'tags:'
        file.write("tags:\n")
        for tag in tags:
            tag = tag.text.replace('# ', '')
            file.write('- ' + tag + '\n')
        file.write('---\n')
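        # Write the quote block
        if blockquote is not None:
            file.write('> ' + blockquoteText + '\n')
            # Note when walking the tree: next_sibling is the immediately adjacent node,
            # which here is a '\n' text node, so reaching the next tag takes two hops
            # print(blockquote.next_sibling.next_sibling)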

        for nextTag in body.children:
            # Skip bare text nodes (such as the '\n' between tags); only tags are converted
            if isinstance(nextTag, bs4.element.NavigableString):
                continue
            tagName = ''
            codeType = ''
            codeStart = ''
            codeEnd = ''
            tagContent = nextTag.text.strip()
            if nextTag.name == 'h1':
                tagName = '# '
                file.write(tagName + tagContent + '\n')
                continue
            if nextTag.name == 'h2':
                tagName = '## '
                file.write(tagName + tagContent + '\n')
                continue
            if nextTag.name == 'h3':
                tagName = '### '
                file.write(tagName + tagContent + '\n')
                continue
            if nextTag.name == 'h4':
                # A level-4 heading takes four '#' characters
                tagName = '#### '
                file.write(tagName + tagContent + '\n')
                continue
            # Code block
            if len(nextTag.select('figure')) > 0 or nextTag.name == 'figure':
                # A non-empty select() result means this element wraps a figure
                if len(nextTag.select('figure')) > 0:
                    nextTag = nextTag.select('figure')[0]

                # The last class on the figure is used as the language tag
                codeType = nextTag['class'][-1] + '\n'
                codeStart = '``` '
                codeEnd = '```\n'
                codeLine = ''
                lineNumber = nextTag.table.tr.find('td', attrs={'class': 'gutter'}).text
                code = nextTag.table.tr.find('td', attrs={'class': 'code'}).text
                # Whatever is left after removing line numbers and code is written before the fence
                tagContent = tagContent.replace(lineNumber, '').replace(code, '')
                # print(lineNumber)
                # print(code)
                # print(tagContent)
                for line in nextTag.table.tr.find('td', attrs={'class' : 'code'}).find_all('div'):
                    # rstrip() rather than strip(), so leading indentation in the code survives
                    codeLine += line.text.rstrip()+'\n'
                file.write(tagContent+'\n')
                file.write(codeStart + codeType + codeLine + '\n' + codeEnd)
                continue

            # Unordered list
            if nextTag.name == 'ul':
                for li in nextTag.find_all('li'):
                    file.write('- ' + li.text.strip() + '\n')
                continue
            # Ordered list
            if nextTag.name == 'ol':
                olIndex = 1
                for li in nextTag.find_all('li'):
                    # olIndex is an int and must be converted before concatenation
                    file.write(str(olIndex) + '. ' + li.text.strip() + '\n')
                    olIndex += 1
                continue
            if nextTag.name == 'p':
                tagContent = nextTag.text.strip()
                if tagContent == '':
                    # A paragraph without text is treated as an image
                    img = nextTag.find('img')
                    if img is not None:
                        file.write("![image](" + img['src'] + ")\n")
                    continue
                else:
                    links = nextTag.find_all('a')
                    for link in links:
                        # Skip anchors without text: replace('') would insert the link between every character
                        if link.text:
                            tagContent = tagContent.replace(link.text, "[" + link.text + "](" + link['href'] + ")")
                    file.write(tagContent + '\n')
                    continue
        file.close()
    index = index+1
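
The script builds the same request in three places. One way to tidy that up is a pair of small helpers, sketched here; `build_request`, `fetch_html` and the 10-second timeout are my own additions and not part of the original script.

```python
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url: str) -> urllib.request.Request:
    """Create a request carrying the User-Agent header used by the script."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it as UTF-8 (the timeout value is an assumption)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode('utf-8')


# Building the request does not touch the network, so it is safe to inspect
req = build_request("http://wintersmilesb101.online")
print(req.full_url, req.get_header('User-agent'))
```

With these in place, each fetch in the loop would shrink to a single `fetch_html(url)` call followed by `BeautifulSoup(...)`.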

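Results

Articles from the first page

[Screenshot: md files generated from page 1]

Articles from the second page

[Screenshot: md files generated from page 2]

The first article. I think the result looks pretty good.
[Screenshot: the first generated article]

Code

All the code for this post has been pushed to git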