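1. Pretty-printed output
The prettify() method returns a Beautiful Soup document as a formatted string, with each tag and each string on its own line.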
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'lxml')
print(soup.prettify())
<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>
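prettify() can also be called on a single Tag rather than the whole document. The sketch below is only an illustration: it assumes the built-in "html.parser" instead of lxml (so no <html>/<body> wrapper is added), and the variable name pretty_link is made up for this example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: "html.parser" is used here so the example needs no extra install.
soup = BeautifulSoup(markup, "html.parser")

# Pretty-print only the <a> tag instead of the whole soup.
pretty_link = soup.a.prettify()
print(pretty_link)
```

Because prettify() inserts newlines and indentation, it changes the whitespace of the document. It is best used for reading the structure, not for producing HTML to save back to disk.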
2. Compact output
If you only need the string and do not care about formatting, call str() on the object.
str(soup)
'<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
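As a small complement to str(), the sketch below shows encode(), which returns bytes rather than a string. It assumes the built-in "html.parser" (so the output has no <html>/<body> wrapper); the names compact and as_bytes are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: "html.parser" is used so nothing is wrapped around the fragment.
soup = BeautifulSoup(markup, "html.parser")

# str() gives the document as a single unformatted string.
compact = str(soup)

# encode() gives the same document as bytes in the requested encoding.
as_bytes = soup.encode("utf-8")

print(compact)
print(as_bytes)
```

The bytes form is convenient when writing to a file opened in binary mode; the string form is convenient for printing or further text processing.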
3. HTML special characters
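HTML entities such as &ldquo; in the input are converted to the corresponding Unicode characters when the document is parsed.
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.","lxml")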
str(soup)
'<html><body><p>“Dammit!” he said.</p></body></html>'
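If entities are wanted in the output again, one option is the "html" output formatter. The sketch below is a hedged illustration: it assumes the built-in "html.parser", and the exact entity names that come out can depend on the Beautiful Soup version, so the code only checks that the result unescapes to the same text.

```python
import html

from bs4 import BeautifulSoup

# Assumption: "html.parser" is used, so no <html>/<body>/<p> wrapper is added.
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

# After parsing, the entities have become real Unicode quotation marks.
as_unicode = str(soup)

# The "html" formatter converts such characters to named entities on output.
as_entities = soup.decode(formatter="html")

print(as_unicode)
print(as_entities)
# Unescaping the entity form should give back the Unicode form.
print(html.unescape(as_entities) == as_unicode)
```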
4. Getting all the text inside a tag: get_text()
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup,'lxml')
print(soup.get_text())
print(soup.i.get_text())
I linked to example.com
example.com
Specifying a separator
soup.get_text("|")
'\nI linked to |example.com|\n'
Stripping whitespace
soup.get_text("|",strip=True)
'I linked to|example.com'
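The separator and strip options can be compared side by side, together with the stripped_strings generator, which yields the same pieces without joining them. This is a minimal sketch assuming the built-in "html.parser"; the variable names raw, joined and pieces are made up for the example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
# Assumption: "html.parser" is used so the example needs no extra install.
soup = BeautifulSoup(markup, "html.parser")

# Plain get_text() keeps the surrounding newlines.
raw = soup.get_text()

# A separator plus strip=True trims each piece and drops empty ones.
joined = soup.get_text(" ", strip=True)

# stripped_strings yields the trimmed pieces one at a time.
pieces = list(soup.stripped_strings)

print(repr(raw))     # '\nI linked to example.com\n'
print(repr(joined))  # 'I linked to example.com'
print(pieces)        # ['I linked to', 'example.com']
```

Using stripped_strings is handy when each text fragment needs separate handling, for example to put the pieces into different columns.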