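A Simple Zhihu Crawler
Basic approach
1. Import BeautifulSoup
2. Open the URL with urllib.request
3. Parse the response with bs4
4. Use soup.findAll to find every element with a given class
5. Extract the tags you need from each of those elements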
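The parsing steps above can be tried offline before touching the network. The following is a minimal sketch, assuming a hand-written HTML snippet (SAMPLE_HTML) that only imitates the class names used in the script; the helper name extract_items and the max_len parameter are made up for illustration. Note that it searches inside each item with find, which is a slightly different lookup from the findNext call in the script.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; it only imitates the class names the script looks for.
SAMPLE_HTML = """
<div class="zm-item">
  <h2 class="zm-item-title">Question one?</h2>
  <div class="zh-summary summary clearfix">Short answer.显示全部</div>
</div>
<div class="zm-item">
  <h2 class="zm-item-title">Question two?</h2>
  <div class="zh-summary summary clearfix">Another answer.</div>
</div>
"""


def extract_items(html, max_len=200):
    """Return (question, answer) pairs for answers no longer than max_len."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all(class_="zm-item"):
        # Look only inside the current item for its title and summary.
        title = item.find(class_="zm-item-title")
        summary = item.find(class_="zh-summary")
        if title is None or summary is None:
            continue
        # Drop the "Show all" link text (显示全部) and any line breaks.
        answer = summary.get_text().replace("显示全部", "").replace("\n", "").strip()
        if len(answer) > max_len:
            continue
        results.append((title.get_text().strip(), answer))
    return results


pairs = extract_items(SAMPLE_HTML)
for number, (question, answer) in enumerate(pairs, start=1):
    print(f"{number}. Question: {question}")
    print(f"   Reply: {answer}")
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors against a saved page before running the full crawl.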
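import time
import urllib.request
from bs4 import BeautifulSoup

# Walk through pages 1 to 75 of the collection
for p in range(1, 76):
    url = "http://www.zhihu.com/collection/27109279?page=" + str(p)
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')  # parse with Python's built-in HTML parser
    allp = soup.findAll(class_='zm-item')
    print(' Page ' + str(p) + '\n')
    for index, each in enumerate(allp, start=1):
        answer = each.findNext(class_='zh-summary summary clearfix')
        problem = each.findNext(class_='zm-item-title')
        if answer is None or problem is None:
            continue  # skip items that lack a summary or a title
        # Strip the '显示全部' ("Show all") link text and line breaks
        answer = answer.text.replace('显示全部', '')
        answer = answer.replace('\n', '')
        if len(answer) > 200:
            continue  # keep only short replies
        print(str(index) + '. Question: ' + problem.text)
        print(' Witty reply: ' + answer)
    time.sleep(5)  # pause between pages to go easy on the server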
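The script above calls urlopen with no request headers and no error handling, so a single failed page stops the whole run. The sketch below shows one way to wrap the fetch, under the assumption (not confirmed by this post) that the site may reject requests without a browser-like User-Agent. The helper names build_request and fetch_page, the User-Agent string, and the timeout value are all hypothetical.

```python
import urllib.request

BASE_URL = "http://www.zhihu.com/collection/27109279?page="


def build_request(page_number, user_agent="Mozilla/5.0 (compatible; example-crawler)"):
    """Build (but do not send) a request for one page of the collection."""
    # The User-Agent value is an illustrative placeholder, not a required value.
    return urllib.request.Request(
        BASE_URL + str(page_number),
        headers={"User-Agent": user_agent},
    )


def fetch_page(page_number, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page_number), timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Page {page_number} failed: {error}")
        return None
```

In the main loop, the result of fetch_page(p) could be passed to BeautifulSoup in place of the urlopen result, skipping the page when it returns None.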
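Source code download: "Crawling Zhihu in 20 Lines of Python"
The witty replies the crawler collects are very funny: "Let's Have a Laugh Together"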