Want to read novels for free? Want to crawl novels? Then follow along with me.

Preface

I started using Python this semester. In my view, simplicity and ease of use are Python's biggest strengths: the formatting requirements are not strict, and writing code in it is a pleasure. Within Python I am most interested in crawlers, because more data enables better data analysis and yields more value.

Ideas for crawling the novel

First observe the novel site and determine whether it is static or dynamic. Click into any chapter of the novel; through the Elements tab of the F12 developer tools, you can see that the chapter text is stored inside a div with id='content', which indicates the site is static. You could still crawl it dynamically with Selenium, but since the site is static, dynamic crawling is unnecessary.
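As a quick sanity check, a minimal sketch like the one below fetches one chapter page with requests and tests whether the content div is already in the raw HTML, which confirms the text is server-rendered (the chapter URL here is a placeholder; substitute any real one from the catalog):

import requests

# Placeholder chapter URL; substitute any /96_96293/*.html link from the catalog
chapter_url = 'https://www.xsbiquge.com/96_96293/xxxxxxx.html'
r = requests.get(chapter_url)
r.encoding = 'utf-8'
# True means the body text is in the static HTML, so no Selenium is needed
print('<div id="content">' in r.text)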

Then, after selecting the target novel, open its catalog page; through the Elements tab you can see that the URLs of all the novel's chapters follow a regular pattern.

The plan: crawl the URLs of all chapters and save them, then complete each relative URL, then enter each chapter to crawl its title and body content, and finally save everything to a txt file.

Implementing the function modules

After clarifying the idea, follow the steps to complete the function step by step.

1. Import the requests library for HTTP requests and the re library for data cleaning and matching

import requests
import re

The re module is Python's built-in string matching module. Many of its functions are implemented with regular expressions, which perform fuzzy matching on strings to extract the parts you need. Regular expressions themselves work the same way in every language. Note:

(1) The re module is unique to Python;

(2) Regular expressions can be used in all programming languages;

(3) The re module and regular expressions operate on strings.
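For instance, here is a minimal illustration of extracting substrings with re.findall (the HTML string and pattern below are made up for demonstration):

import re

html = '<a href="/96_96293/1.html">Chapter 1</a>'
# findall returns every substring captured by the parenthesized group
links = re.findall(r'<a href="(.*?)">', html)
print(links)   # ['/96_96293/1.html']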


2. Send a URL request to the target website

s = requests.Session()
url = 'https://www.xsbiquge.com/96_96293/'
html = s.get(url)
html.encoding = 'utf-8'

3. Find the URLs of all chapters on the catalog page

# Get the chapter links
caption_title_1 = re.findall(r'<a href="(/96_96293/.*?\.html)">.*?</a>', html.text)

4. To make the follow-up requests easier, complete each chapter URL

for i in caption_title_1:
    # Join the site's domain onto each relative link
    caption_title_1 = 'https://www.xsbiquge.com' + i

5. Access each obtained URL to extract the chapter title and body content

s1 = requests.Session()
r1 = s1.get(caption_title_1)
r1.encoding = 'utf-8'

# The chapter name lives in the meta keywords tag of the page head, so grab it from there
name = re.findall(r'<meta name="keywords" content="(.*?)" />', r1.text)[0]
# Print the chapter name so you can check for missing chapters after the program runs
print(name)

chapters = re.findall(r'<div id="content">(.*?)</div>', r1.text, re.S)[0]

6. Clean up the acquired body content

chapters = chapters.replace(' ', '')
chapters = chapters.replace('readx();', '')
chapters = chapters.replace('&lt;!--go--&gt;', '')
chapters = chapters.replace('()', '')

# Convert to a string and replace each <br/> with a newline
s = str(chapters)
s_replace = s.replace('<br/>', "\n")

# Strip any remaining HTML tags
while True:
    index_begin = s_replace.find("<")
    index_end = s_replace.find(">", index_begin + 1)
    if index_begin == -1:
        break
    s_replace = s_replace.replace(s_replace[index_begin:index_end + 1], "")

# Replace &nbsp; entities with spaces (re.I makes the match case-insensitive)
pattern = re.compile(r'&nbsp;', re.I)
fiction = pattern.sub(' ', s_replace)
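As a side note, the manual find-and-replace loop above can also be written as a single regular expression; this one-liner assumes every leftover tag is well formed, with no stray < or > inside the text:

# Equivalent tag stripping in one call: remove every <...> span
s_replace = re.sub(r'<[^>]+>', '', s_replace)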

7. Save the data to the preset txt file

path = r'F:\title.txt'     # This is where I save the file; change it as needed
# 'a' opens the file in append mode
file_name = open(path, 'a', encoding='utf-8')
file_name.write(name)
file_name.write('\n')
file_name.write(fiction)
file_name.write('\n')
# Close the file after saving
file_name.close()

Running results


Program source code

import requests
import re

s = requests.Session()
url = 'https://www.xsbiquge.com/96_96293/'
html = s.get(url)
html.encoding = 'utf-8'

# Get the chapter links
caption_title_1 = re.findall(r'<a href="(/96_96293/.*?\.html)">.*?</a>', html.text)

# Open the output file
path = r'F:\title.txt'     # This is where I save the file; change it as needed
# 'a' opens the file in append mode
file_name = open(path, 'a', encoding='utf-8')

# Loop over and download every chapter
for i in caption_title_1:
    # Join the site's domain onto each relative link
    caption_title_1 = 'https://www.xsbiquge.com' + i
    # Fetch the page source
    s1 = requests.Session()
    r1 = s1.get(caption_title_1)
    r1.encoding = 'utf-8'
    # The chapter name lives in the meta keywords tag of the page head
    name = re.findall(r'<meta name="keywords" content="(.*?)" />', r1.text)[0]
    print(name)
    file_name.write(name)
    file_name.write('\n')
    # Get the chapter body. The string contains newlines, so re.S is needed:
    # without it the pattern is matched line by line and fails; with it the
    # regular expression treats the whole string as one block.
    chapters = re.findall(r'<div id="content">(.*?)</div>', r1.text, re.S)[0]
    # Clean up the chapter text
    chapters = chapters.replace(' ', '')
    chapters = chapters.replace('readx();', '')
    chapters = chapters.replace('&lt;!--go--&gt;', '')
    chapters = chapters.replace('()', '')
    # Convert to a string and replace each <br/> with a newline
    s = str(chapters)
    s_replace = s.replace('<br/>', "\n")
    # Strip any remaining HTML tags
    while True:
        index_begin = s_replace.find("<")
        index_end = s_replace.find(">", index_begin + 1)
        if index_begin == -1:
            break
        s_replace = s_replace.replace(s_replace[index_begin:index_end + 1], "")
    # Replace &nbsp; entities with spaces (re.I makes the match case-insensitive)
    pattern = re.compile(r'&nbsp;', re.I)
    fiction = pattern.sub(' ', s_replace)
    file_name.write(fiction)
    file_name.write('\n')

file_name.close()

Summary

Through this project I was able to design and implement the system's functional modules from end to end, which benefited me greatly and improved my self-study ability, especially around data mining and data analysis.

The program code implements crawling a novel and cleaning the data, but it only crawls a single novel. If you wrap the function modules in an outer loop, you can get the URLs of every novel on the site; a sketch of that is below. These are my thoughts and ideas. If you have other ideas, feel free to comment and exchange them.
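A minimal sketch of that outer loop, assuming every novel's catalog lives under a numeric path like /96_96293/ (the ID list below is hypothetical; in a real run you would scrape the IDs from the site's index pages):

import requests
import re

# Hypothetical list of novel catalog IDs; scrape these from the site's
# index pages in a real run.
novel_ids = ['96_96293', '96_96294']

for novel_id in novel_ids:
    catalog_url = 'https://www.xsbiquge.com/' + novel_id + '/'
    html = requests.get(catalog_url)
    html.encoding = 'utf-8'
    # Same chapter-link pattern as above, parameterized by the novel ID
    chapter_links = re.findall(
        r'<a href="(/' + novel_id + r'/.*?\.html)">.*?</a>', html.text)
    for link in chapter_links:
        chapter_url = 'https://www.xsbiquge.com' + link
        # ...fetch, clean, and save each chapter exactly as in the main program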
