[Python Web Crawler 20] Crawling the Zhihu Hot List


If reprinted, please indicate the source, thank you!

1. Target URL

The page to be crawled is the Zhihu hot list: https://www.zhihu.com/billboard

(screenshot omitted)
Content to crawl: the title, popularity, news excerpt, and picture of each entry. The focus is on exception handling, since some hot-search entries are missing some of these fields.

2. Hands-on analysis

First import the commonly used crawler modules and set the headers used to request the target URL. The code is as follows:

import requests
from bs4 import BeautifulSoup
import os
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

url = 'https://www.zhihu.com/billboard'
html = requests.get(url, headers=headers)
print(html)

The output shows the request succeeds, which means the page can be fetched normally, so we can proceed to the next step:

(screenshot omitted)
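Note that `print(html)` above only prints the Response object's repr, not the page content. A small offline illustration (the Response object below is built by hand purely for demonstration; in the real script it comes from `requests.get`):

```python
import requests

# Build a Response object by hand just to illustrate what print() shows;
# in the crawler the object is returned by requests.get(url, headers=headers).
resp = requests.models.Response()
resp.status_code = 200

print(resp)     # <Response [200]>
print(resp.ok)  # True for status codes below 400
```

To read the actual HTML, use `html.text`, as the parsing code below does.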
Then right-click on a blank area of the hot-list page and choose "View page source". You can see that the title, popularity, and image link all sit in ordinary HTML tags, while the full news excerpt lives in an embedded script. For convenience, we therefore parse the tags with the bs4 library and extract the excerpts with a regular expression.

(screenshot omitted)

2.1 Title information crawling

Every entry that appears on the hot list is guaranteed to have a title, so we only need to inspect the tag that holds it. Back on the hot-list page, right-click and open the inspector, locate the title tag (there are 50 entries in total), and verify the match as follows:

(screenshot omitted)
This part of the code is as follows

soup = BeautifulSoup(html.text, 'lxml')
titles = soup.select('.HotList-itemTitle')
for title in titles:
    print('title:', title.text)

The output result is as follows (only part of it is shown):

(screenshot omitted)

2.2 Popularity crawling

Similar to the title extraction above, locate the tag holding the popularity figure (again 50 entries in total) and verify the match as follows:

(screenshot omitted)
This part of the code is as follows:

hots = soup.select('.HotList-itemMetrics')
for hot in hots:
    print('hot:', hot.text)

The output result is as follows (only part of it is shown):

(screenshot omitted)
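The two selectors above can be verified offline against a small handcrafted HTML fragment. The markup and numbers below are made up, but the class names match the ones used in this article, and `zip()` pairs each title with its popularity:

```python
from bs4 import BeautifulSoup

# A miniature, offline copy of the hot-list markup using the same class
# names as the selectors above, so the extraction logic can be checked
# without touching the network. The sample content is made up.
sample = '''
<div class="HotList-item">
  <div class="HotList-itemTitle">Example title</div>
  <div class="HotList-itemMetrics">12.34 million hot</div>
</div>
<div class="HotList-item">
  <div class="HotList-itemTitle">Another title</div>
  <div class="HotList-itemMetrics">5.67 million hot</div>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')
titles = [t.text for t in soup.select('.HotList-itemTitle')]
hots = [h.text for h in soup.select('.HotList-itemMetrics')]

# zip() keeps each title together with its own popularity figure
for title, hot in zip(titles, hots):
    print(title, '->', hot)
```

The built-in `html.parser` is used here so the sketch runs without lxml installed; the article's code uses `'lxml'`, which behaves the same for this fragment.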

2.3 Image crawling

Locating the image tag is similar, as shown below, but some hot searches have no picture at all; here, for example, there are only 37 images.

(screenshot omitted)
The code is as follows; the basics were covered in earlier posts, so it is given directly here.

number = 0
imgs = soup.select('.HotList-itemImgContainer img')
for img in imgs:
    print('img', img['src'])
    alt = img['alt']
    # create the zhihuImg folder if it does not exist yet
    if not os.path.exists('zhihuImg'):
        os.mkdir('zhihuImg')

    with open('zhihuImg/{}.jpeg'.format(number),'wb') as f:
        f.write(requests.get(img['src']).content)
        print('Image {} saved'.format(alt))

    number += 1

The output result is as follows (only part of it is shown):

(screenshot omitted)
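Because not every entry has a picture, the download step is a natural place for the exception handling mentioned at the start. A more defensive sketch of the loop body (the `save_image` helper is my own name, not part of the original post): entries without a usable URL, or whose download fails, are skipped instead of crashing the whole run.

```python
import os
import requests

def save_image(src, index, folder='zhihuImg'):
    """Download one image; return the saved path, or None if skipped."""
    if not src:                          # no usable URL on this <img> tag
        return None
    os.makedirs(folder, exist_ok=True)   # create the folder if missing
    path = os.path.join(folder, '{}.jpeg'.format(index))
    try:
        resp = requests.get(src, timeout=10)
        resp.raise_for_status()          # treat 4xx/5xx as a failed download
    except requests.RequestException:
        return None                      # skip unreachable images
    with open(path, 'wb') as f:
        f.write(resp.content)
    return path
```

With this helper, the loop above becomes `save_image(img.get('src'), number)` per entry, and one broken image no longer stops the remaining downloads.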

2.4 News excerpt crawling

This part of the information sits inside a script tag, so it is more convenient to match it with a regular expression: simply replace the target content with (.*?). The page source is parsed as follows. (Note that this is the raw page source; the first three fields were located through the inspector, but in fact all three could also be matched directly in the page source.)
(screenshot omitted)
This part of the code is as follows; note the handling of missing data:

int_re = re.compile(r'"excerptArea":{"text":"(.*?)"}', re.S | re.I)
int_results = int_re.findall(html.text)
for int_r in int_results:
    if not int_r:  # findall returns strings, so only the empty case needs skipping
        continue
    print(int_r)
    print('-'*20)

The output result is as follows (only part of it is shown):

(screenshot omitted)
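The regex itself can be exercised offline on a handcrafted fragment shaped like the page's embedded JSON (the field name comes from the article; the excerpt text below is made up), including the missing-data case:

```python
import re

# The same pattern used above, tried against a handmade JSON-like fragment.
int_re = re.compile(r'"excerptArea":{"text":"(.*?)"}', re.S | re.I)

sample = ('{"excerptArea":{"text":"First excerpt"}}'
          '{"excerptArea":{"text":""}}'            # entry with no excerpt
          '{"excerptArea":{"text":"Second excerpt"}}')

# drop empty matches, mirroring the missing-data check in the loop above
results = [r for r in int_re.findall(sample) if r]
print(results)   # ['First excerpt', 'Second excerpt']
```

The non-greedy (.*?) stops at the first closing quote-brace pair, so each excerpt is captured separately rather than the whole script being swallowed in one match.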

3. Full code

At this point all four fields have been obtained. The full code is as follows:

import requests
from bs4 import BeautifulSoup
import os
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

url = 'https://www.zhihu.com/billboard'
html = requests.get(url, headers=headers)
# print(html)
soup = BeautifulSoup(html.text, 'lxml')
titles = soup.select('.HotList-itemTitle')
for title in titles:
    print('title:', title.text)

hots = soup.select('.HotList-itemMetrics')
for hot in hots:
    print('hot:', hot.text)

number = 0
imgs = soup.select('.HotList-itemImgContainer img')
for img in imgs:
    print('img', img['src'])
    alt = img['alt']
    # create the zhihuImg folder if it does not exist yet
    if not os.path.exists('zhihuImg'):
        os.mkdir('zhihuImg')

    with open('zhihuImg/{}.jpeg'.format(number),'wb') as f:
        f.write(requests.get(img['src']).content)
        print('Image {} saved'.format(alt))

    number += 1

int_re = re.compile(r'"excerptArea":{"text":"(.*?)"}', re.S | re.I)
int_results = int_re.findall(html.text)
for int_r in int_results:
    if not int_r:  # findall returns strings, so only the empty case needs skipping
        continue
    print(int_r)
    print('-'*20)
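As a possible extension (not part of the original post), the parallel lists of titles and popularity figures could be zipped into rows and written out as a CSV file next to the downloaded images. A minimal sketch, with `save_rows` and the output filename being my own choices:

```python
import csv

def save_rows(titles, hots, path='zhihu_hot.csv'):
    """Write paired title/popularity lists to a UTF-8 CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'hot'])       # header row
        writer.writerows(zip(titles, hots))     # one row per hot-list entry

# example call with made-up data
save_rows(['Example title'], ['12.34 million hot'])
```

Because `zip()` stops at the shorter list, mismatched lengths (e.g. missing images) never raise; for the title/popularity pair the lengths match, since every entry has both fields.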


Origin: blog.csdn.net/lys_828/article/details/108592466