Python web scraping in practice: crawling articles from the article-reading site duwenzhang.com

In order to have a story to tell my girlfriend every night, I decided to crawl some articles, save them to my phone, and read one to her each night. So I set my sights on the love column of the article-reading site.

First, some preparation to clarify the crawling plan:

  1. Go to the article-reading site and click into the love-article column.

  2. Our goal is to crawl each article's title and body. Scrolling down shows that the love-article column has 109 pages in total.

  3. Knowing this, you know what needs to be done. Open PyCharm and start preparing the code.

  4. In PyCharm, create a new file named sanwen.py:

from lxml import etree
import requests

# First, define a request header
headers = {
	'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

To get this value, open the developer tools on the site, refresh the page, click the Network tab, find the User-Agent in a request's headers, and copy it into the dictionary above.

As mentioned, there are 109 pages in total, and all of them need to be crawled. Looking at the URL of each successive page reveals a pattern: only the page number in the link changes from page to page. So I did the following:
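The pattern can be sketched as a simple list comprehension; the base URL and the `list_1_N.html` suffix are taken from the page links observed above:

```python
# Build the URLs of all 109 list pages from the observed pattern
base = 'http://www.duwenzhang.com/wenzhang/aiqingwenzhang/'
page_urls = [base + 'list_1_' + str(i) + '.html' for i in range(1, 110)]

print(len(page_urls))  # 109
print(page_urls[0])    # http://www.duwenzhang.com/wenzhang/aiqingwenzhang/list_1_1.html
```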

# Base URL; combined with the per-page suffix below, it yields each page's URL
url_1 = 'http://www.duwenzhang.com/wenzhang/aiqingwenzhang/'
for i in range(1, 110):
	page_url = 'list_1_' + str(i) + '.html'
	# URL of the next page
	next_page = url_1 + page_url
	# Request the page, sending the User-Agent header defined above
	res = requests.get(next_page, headers=headers)
	# The page source declares gb2312 encoding
	res.encoding = 'gb2312'
	# Parse the HTML so XPath can be used for extraction
	res_xpath = etree.HTML(res.text)
	# Get the URLs of all article titles on this page
	titles_url = res_xpath.xpath("//table[@class='tbspan']//tr[2]//b/a/@href")
	# Iterate over the article URLs on this page
	for title_url in titles_url:
		# Request the article page
		content = requests.get(title_url, headers=headers)
		content.encoding = 'gb2312'
		content_xpath = etree.HTML(content.text)
		content_text = content_xpath.xpath(".//div[@id='wenzhangziti']/p//text()")
		# The line above returns a list; join it into a single string
		content_text_strs = " ".join(content_text)
		# Get the article title
		title = content_xpath.xpath(".//td/h1/text()")
		# Join into a string
		title_strs = " ".join(title)
		# Append to a txt file (closed automatically by the with block)
		with open("sanwen.txt", "a", encoding='utf-8') as f:
			f.write(title_strs)
			f.write("\n")
			f.write(content_text_strs)
			f.write("\n完篇=======")
	# Print progress so the run is easy to monitor
	print("Wrote page " + str(i))

A note on the XPath queries used in the code: `@href` extracts an attribute value, while the `text()` function extracts the text of a node, such as the article title inside the `h1` element.
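As a quick illustration, here are the same two article-page queries run against a made-up HTML fragment that mimics the page structure (the fragment itself is hypothetical, not taken from the site):

```python
from lxml import etree

# Hypothetical fragment mirroring the structure of an article page
html = etree.HTML(
    "<html><body><table><tr><td><h1>A Love Story</h1></td></tr></table>"
    "<div id='wenzhangziti'><p>First paragraph.</p>"
    "<p>Second paragraph.</p></div></body></html>"
)

# text() yields a list of text nodes; join them into strings
title = " ".join(html.xpath(".//td/h1/text()"))
body = " ".join(html.xpath(".//div[@id='wenzhangziti']/p//text()"))
print(title)  # A Love Story
print(body)   # First paragraph. Second paragraph.
```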

This site's anti-crawler measures are lax (there is hardly any anti-crawling at all), so it is relatively easy to scrape.
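Even so, it is good practice to pause briefly between requests. A minimal sketch; the `fetch` helper, the one-second default delay, and the injectable `getter` parameter are my own additions, not part of the original script:

```python
import time
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

def fetch(url, delay=1.0, getter=requests.get):
    """Politely fetch a page: sleep before the request, then decode as gb2312."""
    time.sleep(delay)
    res = getter(url, headers=headers)
    res.encoding = 'gb2312'
    return res.text
```

Passing `getter` as a parameter also makes the helper easy to exercise without touching the network.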

Run it, and the whole crawl completes.



Origin blog.csdn.net/weixin_44730235/article/details/104563615