To have a story to tell my girlfriend every night, I decided to crawl some articles and store them on my phone, so I could read her one each evening. I settled on the love column of the article reading network (duwenzhang.com).
First, prepare for the crawl and clarify the plan.

Open the article reading network and click into the love article column. Our goal is to crawl the title and content of each article. Scrolling down shows that the love column has 109 pages in total.

With that figured out, open PyCharm and start preparing the code.
Open PyCharm and create a new file, sanwen.py:
from lxml import etree
import requests

# First, define a request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
To get this value, open the developer tools on the site (F12), refresh the page, click the Network tab, find the User-Agent field in any request, and copy it.
As mentioned, there are 109 pages in total, and all of them need to be crawled. Looking at the URL of the next page reveals a pattern: only the page number at the end of the URL changes (list_1_2.html, list_1_3.html, and so on), so I handled it like this:
# Base URL; combine it with the page suffix below to get each page's URL
url_1 = 'http://www.duwenzhang.com/wenzhang/aiqingwenzhang/'
for i in range(1, 110):
    page_url = 'list_1_' + str(i) + '.html'
    # URL of the current list page
    next_page = url_1 + page_url
    # request the list page (passing the headers defined above)
    res = requests.get(next_page, headers=headers)
    # the page source declares the gb2312 encoding
    res.encoding = 'gb2312'
    # parse the HTML so nodes can be extracted with XPath
    res_xpath = etree.HTML(res.text)
    # get the URLs of all article titles on this page
    titles_url = res_xpath.xpath("//table[@class='tbspan']//tr[2]//b/a/@href")
    # iterate over the article URLs on this page
    for title_url in titles_url:
        # request the article page
        content = requests.get(title_url, headers=headers)
        content.encoding = 'gb2312'
        content_xpath = etree.HTML(content.text)
        content_text = content_xpath.xpath(".//div[@id='wenzhangziti']/p//text()")
        # the line above returns a list; join it into a single string
        content_text_strs = " ".join(content_text)
        # get the article title
        title = content_xpath.xpath(".//td/h1/text()")
        # join it into a string as well
        title_strs = " ".join(title)
        # append to a txt file
        with open("sanwen.txt", "a", encoding='utf-8') as f:
            f.write(title_strs)
            f.write("\n")
            f.write(content_text_strs)
            f.write("\n==== end of article ====")
    # show progress while the program runs
    print("Wrote page " + str(i))
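The page-URL construction used in the loop above can be checked on its own (a minimal sketch; the page count of 109 comes from the column's page listing):

```python
# Build the list-page URLs the same way the crawler does
url_1 = 'http://www.duwenzhang.com/wenzhang/aiqingwenzhang/'
page_urls = [url_1 + 'list_1_' + str(i) + '.html' for i in range(1, 110)]

print(len(page_urls))   # 109 pages in total
print(page_urls[0])     # http://www.duwenzhang.com/wenzhang/aiqingwenzhang/list_1_1.html
print(page_urls[-1])    # http://www.duwenzhang.com/wenzhang/aiqingwenzhang/list_1_109.html
```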
A note on the XPath expressions used in the code: appending text() to an expression returns the text content of the matched nodes, which is how the title text inside the h1 tag is extracted.
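To illustrate, here is how text() behaves on a small self-contained HTML fragment (a made-up fragment mimicking the site's structure, not its real markup):

```python
from lxml import etree

html = etree.HTML("<html><body><td><h1>A Love Story</h1></td>"
                  "<div id='wenzhangziti'><p>First paragraph.</p>"
                  "<p>Second paragraph.</p></div></body></html>")

# text() returns a list of text nodes, one entry per match
title = html.xpath(".//td/h1/text()")
paragraphs = html.xpath(".//div[@id='wenzhangziti']/p//text()")

print(title)                 # ['A Love Story']
print(" ".join(paragraphs))  # First paragraph. Second paragraph.
```

This is why the crawler joins the results with " ".join(...) before writing them to the file.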
The site's anti-crawler measures are not strict (arguably nonexistent), so crawling it is relatively easy.
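Even when a site is lenient, it is good practice to slow the crawler down a little and tolerate the occasional failed request. A minimal sketch of a polite request helper (the delay, timeout, and retry count are my own choices, not part of the original code):

```python
import time
import requests

def polite_get(url, headers=None, delay=1.0, retries=3):
    """Fetch a URL with a pause between requests and simple retries."""
    for attempt in range(retries):
        try:
            res = requests.get(url, headers=headers, timeout=10)
            res.raise_for_status()
            # wait between requests so the site isn't hammered
            time.sleep(delay)
            return res
        except requests.RequestException:
            # back off a bit longer after each failure
            time.sleep(delay * (attempt + 1))
    return None
```

In the crawler above, each requests.get(...) call could be replaced with polite_get(...), skipping pages that return None.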
Run it, and the job is done.