Web Scraping in Practice

I recently got hooked on a novel, but reading it required payment, so I decided to grab the book with a web scraper.

The site uses GBK encoding, though, and everything I fetched came back as mojibake. After several hours of digging I finally found a fix, which I've written up here.

Step 1: Check the site's exact encoding
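You can read the encoding out of the page source in the browser (look for the `<meta charset=...>` tag), or pull it out programmatically with a helper that requests ships. A small sketch; the sample HTML here is made up for illustration:

```python
import requests

# A tiny GBK page, invented for illustration; real pages declare
# their charset the same way in a <meta> tag.
raw = '<meta charset="gbk"><title>书名</title>'.encode('gbk')

# requests falls back to ISO-8859-1 when the HTTP headers carry no
# charset, so .text would look like this mojibake:
text = raw.decode('ISO-8859-1')

# pull the declared charset straight out of the HTML
print(requests.utils.get_encodings_from_content(text))  # ['gbk']
```

`get_encodings_from_content()` returns every charset declaration it finds in the markup, which is why the main script below indexes `[0]`.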

Once you know the encoding, the problem is easy to solve. I wrote the following code:

#!/usr/bin/env python
# coding=utf-8

from lxml import etree
import requests

# Target URL
url = "https://www.sangwu.org/book/"
# Spoof the User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

# Send the request
page_text = requests.get(url=url, headers=headers).text
# Key step: encode back to the raw bytes first (requests decoded them as
# ISO-8859-1), then decode with the charset declared in the page itself
page_text = page_text.encode('ISO-8859-1').decode(requests.utils.get_encodings_from_content(page_text)[0])
tree = etree.HTML(page_text)
dd_list = tree.xpath('//div[@class="main"]/dl/dd')

fp = open('book.txt', mode='a', encoding='utf-8')
for dd in dd_list:
    i_url = dd.xpath('a/@href')[0]          # chapter URL
    title = dd.xpath('a/text()')[0]         # chapter title
    # print(i_url, title)

    new_url = url + i_url                   # build the chapter's full URL
    html = requests.get(url=new_url, headers=headers).text
    html = html.encode('ISO-8859-1').decode(requests.utils.get_encodings_from_content(html)[0])
    ret_tree = etree.HTML(html)
    book = ret_tree.xpath('//div[@class="centent"]/text()')
    fp.write(title + '\n')
    print(title)
    for line in book:
        fp.write(line.replace("\n", ""))
        # print(line)

print("over !!!")
fp.close()
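The core of the fix is the encode/decode round trip: requests decoded the server's GBK bytes as ISO-8859-1, which maps every byte to a character and so loses nothing. Re-encoding as ISO-8859-1 recovers the original bytes exactly, and decoding them as GBK yields the real text. A minimal stdlib-only illustration:

```python
# -*- coding: utf-8 -*-

# GBK bytes for "你好" (hello), as a web server would send them
raw = "你好".encode("gbk")

# Decoding them as ISO-8859-1 produces mojibake -- but losslessly,
# since ISO-8859-1 assigns a character to every possible byte
mojibake = raw.decode("ISO-8859-1")

# The round trip recovers the original bytes, then decodes them correctly
fixed = mojibake.encode("ISO-8859-1").decode("gbk")
print(fixed)  # 你好
```

An equivalent, simpler route in requests is to set `response.encoding = response.apparent_encoding` (or just `'gbk'`) before reading `response.text`, which skips the round trip entirely.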

  


Reprinted from www.cnblogs.com/happlyp/p/10273623.html