Learning Python Web Scraping (8) -- Hands-On Intermediate: Scraping Douban Book Titles

Author: IT小样
The previous articles only implemented a bare-bones crawler, so there is plenty left to build on. This article rounds out the scraping: it fetches the title and author information of every book under tag/漫画, wraps the logic into functions, and adds a step that saves the data.
The URL scraped previously was https://book.douban.com/tag/漫画 . Open that page and scroll the window to the bottom: it clearly holds only the first page of books, as shown below:
[Figure: the first page of results, with pagination links at the bottom]
How, then, do we scrape the book titles from every page? Clicking through a few of the pagination links shows that the URL carries two query parameters; the first page's real URL is https://book.douban.com/tag/漫画?start=0&type=T . The two parameters are start and type: clicking around shows that type never changes, while start = (page number - 1) * 20.
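To make that arithmetic concrete, here is a minimal sketch (the tag segment 漫画 is simply URL-encoded as %E6%BC%AB%E7%94%BB) that prints the URLs of the first three pages:

url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"
for page in range(1, 4):           # pages 1, 2 and 3
    start = (page - 1) * 20        # start = (page number - 1) * 20
    print(url_temp.format(start))  # ...start=0..., ...start=20..., ...start=40...

With that formula in hand, here is the straight, unencapsulated version of the code: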

import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
# get page count (probe kept for reference; encapsulated below as get_page_count())
# text = requests.get("https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start=0&type=T",headers=header,verify=False).text
# soup_count = BeautifulSoup(text,'lxml')
# div_soup = soup_count.find("span",attrs={"class":"break"})
# print(div_soup)
# a_soup = div_soup.next_sibling.next_sibling.next_sibling.next_sibling
# page_count = int(a_soup.string)

# get page text
url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"
offset = 20  # 20 books per listing page
result = []
for page in range(1):  # range(1) scrapes only the first page; widen the range for more
    start = offset * page
    url = url_temp.format(start)
    text = requests.get(url,headers=header,verify=False).text
    # print(text)
    soup = BeautifulSoup(text,'lxml')
    ul_soup = soup.find(attrs={"class":"subject-list"})
    li_soup = ul_soup.find_all("li",attrs={"class":"subject-item"})
    for li in li_soup:
        result_list = []
        title = li.h2.get_text().replace('  ','').replace('\n','')
        author = li.find("div",attrs={"class":"pub"}).get_text().replace('  ','').replace('\n','')
        #comment_count = li.find("div",attrs={"class":"star"}).find("span",attrs={"class":"p1"}).get_text()
        result_list.append(title)
        result_list.append(author)
        #result_list.append(comment_count)
        result.append(result_list)
    print(result)

with open('a.txt','w',encoding='utf-8') as f:
    for result_list in result:
        s = ' '.join(result_list)
        f.write(s)
        f.write('\n')
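One aside: verify=False turns off TLS certificate verification, so each request emits an InsecureRequestWarning. If the warning noise gets in the way, it can be silenced as below (or better, drop verify=False entirely if your environment can validate Douban's certificate):

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)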

That is the straight-line version. We can wrap it into functions to make the code reusable. Here is the encapsulated version:

import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}

def get_page_count(url):
    # Get the total number of listing pages: the last page number sits in an <a>
    # a few siblings past the "..." span (whitespace between tags counts as siblings).
    text = requests.get(url, headers=header, verify=False).text
    soup_count = BeautifulSoup(text, 'lxml')
    div_soup = soup_count.find("span", attrs={"class": "break"})
    a_soup = div_soup.next_sibling.next_sibling.next_sibling.next_sibling
    page_count = int(a_soup.string)
    return page_count
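
# The chain of next_sibling hops above is brittle: it counts whitespace text
# nodes, so any markup reformatting breaks it. A sturdier sketch, assuming the
# pagination links sit inside <div class="paginator"> (worth verifying against
# the live page source), just takes the largest numeric link:
def get_page_count_v2(url):
    text = requests.get(url, headers=header, verify=False).text
    soup = BeautifulSoup(text, 'lxml')
    page_links = soup.select("div.paginator a")
    return max(int(a.get_text()) for a in page_links if a.get_text().strip().isdigit())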

def get_page_text(page_count):
    results = []  # accumulated [title, author] pairs across all pages
    offset = 20   # 20 books per listing page
    url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"
    for page in range(0, page_count):
        start = offset * page
        text = requests.get(url_temp.format(start), headers=header, verify=False).text
        soup = BeautifulSoup(text, 'lxml')
        ul_soup = soup.find(attrs={"class": "subject-list"})
        li_soup = ul_soup.find_all("li", attrs={"class": "subject-item"})
        for li in li_soup:
            result_list = []
            title = li.h2.get_text().replace('  ', '').replace('\n', '')
            author = li.find("div", attrs={"class": "pub"}).get_text().replace('  ', '').replace('\n', '')
            result_list.append(title)
            result_list.append(author)
            results.append(result_list)
    return results
    
def save_data(book_text):
    with open('a.txt', 'w', encoding='utf-8') as f:
        for book in book_text:
            book_author = ' '.join(book)
            f.write(book_author)
            f.write('\n')
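
# A space-joined text file is fine for eyeballing, but titles and publisher
# strings contain spaces themselves. A minimal variant (an addition, not part
# of the original) using the standard csv module keeps the fields separated:
import csv

def save_data_csv(book_text, path='books.csv'):
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'pub'])  # header row
        writer.writerows(book_text)        # each item is [title, author/publisher]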

if __name__ == "__main__":
    page_count = get_page_count("https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start=0&type=T")
    book_text = get_page_text(page_count)
    save_data(book_text)

The code above first fetches the total page count, then scrapes the title and author of every book across all the pages, and finally saves the data. You can of course cap how many pages of listings to fetch, for example get_page_text(min(page_count, 5)) for just the first five.
Lastly, the program is quite slow as written, since it requests the pages one at a time; there is plenty of room for further improvement.
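As a taste of the next article, here is a rough sketch (my addition, not the original code) of parallelizing the page fetches with the standard library's thread pool. It assumes the imports, header, and parsing logic from the encapsulated version above:

from concurrent.futures import ThreadPoolExecutor

def get_one_page(page):
    # Fetch and parse a single listing page; same parsing as get_page_text()
    url = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T".format(page * 20)
    text = requests.get(url, headers=header, verify=False).text
    soup = BeautifulSoup(text, 'lxml')
    books = []
    for li in soup.find(attrs={"class": "subject-list"}).find_all("li", attrs={"class": "subject-item"}):
        title = li.h2.get_text().replace('  ', '').replace('\n', '')
        author = li.find("div", attrs={"class": "pub"}).get_text().replace('  ', '').replace('\n', '')
        books.append([title, author])
    return books

def get_page_text_threaded(page_count, workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves page order while fetching up to `workers` pages at once
        for page_books in pool.map(get_one_page, range(page_count)):
            results.extend(page_books)
    return results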

Previous: Hands-On Basics
Next: Hands-On Advanced -- Multithreading

Reposted from blog.csdn.net/weixin_31315135/article/details/89145676