BeautifulSoup documentation (Chinese edition): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Commonly used functions:
soup.select():
Searches for elements by CSS selector path and returns every match as a list.
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
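A minimal sketch of select() in use. The sample HTML is made up for illustration and is not taken from any real page; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this sketch.
html_doc = """
<html><body>
<p class="title">The Dormouse's story</p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The third <p> among its siblings; note the colon before nth-of-type.
third_p = soup.select("p:nth-of-type(3)")

# Direct <p> children of <body> that carry the class "story".
story_ps = soup.select("body > p.story")

# select_one() returns the first match (or None) instead of a list.
first_story = soup.select_one("p.story")

print(third_p)
print(len(story_ps))
print(first_story.get_text())
```

select() always returns a list, even for a single match, so index into it or use select_one() when only one element is expected.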
soup.find_all():
If you want all the <a> tags, or anything more than the first tag with a given name, use one of the methods described in "Searching the tree", such as find_all():
from bs4 import BeautifulSoup
# html is assumed to be a response object fetched earlier (e.g. with requests.get(url))
soup = BeautifulSoup(html.text, "html.parser")
soup.find_all('a')
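find_all() also accepts filters. The sketch below shows a few common ones on a made-up snippet (the HTML, class names and ids are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="other" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Every <a> tag in the document.
all_links = soup.find_all("a")

# Only <a> tags with the CSS class "sister" (class_ avoids the Python keyword).
sisters = soup.find_all("a", class_="sister")

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# Filter by attribute value, regardless of tag name.
by_id = soup.find_all(id="link3")

print([a.get_text() for a in sisters])
print(by_id[0]["href"])
```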
soup.get_text():
If you only want the text inside a tag, use get_text(). It collects all the text in the tag, including text inside descendant tags, and returns it as a single Unicode string.
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")
soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'
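The raw result of get_text() keeps the surrounding newlines. As a small sketch on the same markup, the separator and strip arguments, or the stripped_strings generator, give cleaner output:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Default: text pieces are concatenated with whitespace left untouched.
raw = soup.get_text()

# Join the pieces with "|" and strip whitespace from each piece.
joined = soup.get_text("|", strip=True)

# Iterate over the non-empty, stripped text pieces one by one.
pieces = list(soup.stripped_strings)

print(repr(raw))     # '\nI linked to example.com\n'
print(joined)        # I linked to|example.com
print(pieces)        # ['I linked to', 'example.com']
```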
soup.get():
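Called on an individual tag, get() returns the value of one of its attributes; useful for extracting image URLs, e.g. img.get('src').
Code example: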
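A minimal sketch of get() on <img> tags. The HTML and file names are hypothetical; the second image deliberately has no src attribute to show how get() differs from square-bracket access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html_doc = (
    '<ul>'
    '<li><img src="images/0001.jpg" alt="first"></li>'
    '<li><img alt="no source"></li>'
    '</ul>'
)

soup = BeautifulSoup(html_doc, "html.parser")
imgs = soup.find_all("img")

# get() returns None when the attribute is missing,
# whereas img["src"] would raise KeyError.
srcs = [img.get("src") for img in imgs]

# A default value can be supplied as the second argument.
fallback = imgs[1].get("src", "")

print(srcs)
print(repr(fallback))
```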
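from bs4 import BeautifulSoup

data = []
path = './web/new_index.html'

# Parse the local HTML file (the 'lxml' parser requires the lxml package)
with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
    # Each select() call returns one list, with one element per article
    titles = Soup.select('ul > li > div.article-info > h3 > a')
    pics = Soup.select('ul > li > img')
    descs = Soup.select('ul > li > div.article-info > p.description')
    rates = Soup.select('ul > li > div.rate > span')
    cates = Soup.select('ul > li > div.article-info > p.meta-info')

# Combine the parallel lists into one dict per article
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),               # image URL from the src attribute
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        'cate': list(cate.stripped_strings)  # category labels without whitespace
    }
    data.append(info)

# Print articles whose rate text is at least 3 characters long
# (if 'rate' holds a number, float(i['rate']) >= 3 may be what is intended)
for i in data:
    if len(i['rate']) >= 3:
        print(i['title'], i['cate'])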
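One thing to watch with the zip() approach: if any article lacks one of the elements (say, an image), the parallel lists fall out of step and zip() silently pairs fields from different articles. A possible alternative, sketched here on made-up HTML shaped like the selectors above (the markup itself is an assumption, not the real new_index.html), is to loop over each <li> and look up its fields with select_one():

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second article has no <img> on purpose.
html_doc = """
<ul>
  <li>
    <img src="images/0001.jpg">
    <div class="article-info">
      <h3><a href="#">First article</a></h3>
      <p class="meta-info"><span>fun</span><span>tech</span></p>
      <p class="description">A short description.</p>
    </div>
    <div class="rate"><span>4.5</span></div>
  </li>
  <li>
    <div class="article-info">
      <h3><a href="#">Second article</a></h3>
      <p class="meta-info"><span>news</span></p>
      <p class="description">No picture here.</p>
    </div>
    <div class="rate"><span>2.0</span></div>
  </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
data = []

for li in soup.select("ul > li"):
    # Look up each field inside this <li> only, so fields stay aligned.
    title = li.select_one("div.article-info > h3 > a")
    pic = li.select_one("img")
    rate = li.select_one("div.rate > span")
    cate = li.select_one("div.article-info > p.meta-info")
    data.append({
        "title": title.get_text() if title else None,
        "pic": pic.get("src") if pic else None,
        "rate": rate.get_text() if rate else None,
        "cate": list(cate.stripped_strings) if cate else [],
    })

for item in data:
    print(item["title"], item["pic"], item["cate"])
```

Missing elements show up as None (or an empty list) in the affected record instead of shifting every later record.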