1. Development environment
Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
2. Encoding
The site is encoded in gb2312, as its meta tag declares:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
So after fetching the page, set the response encoding explicitly:
req = requests.get(url=target)
req.encoding = 'gb2312'
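A small sketch of what setting the encoding changes. As far as I understand, requests decodes the response body with the chosen codec and substitutes U+FFFD for any bytes it cannot decode, which would explain the \ufffd characters that turn up later. The snippet mimics that on plain bytes without touching the network; decode_page and the sample bytes are hypothetical, made up for illustration.

```python
def decode_page(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode raw page bytes; bytes that cannot be decoded become U+FFFD."""
    return raw.decode(encoding, errors="replace")


# Well-formed gb2312 bytes decode cleanly.
sample = "网站的编码".encode("gb2312")
decoded = decode_page(sample)

# A stray invalid byte at the end is replaced instead of raising an error.
broken = decode_page(sample + b"\xff")
```

If many characters come out as U+FFFD, decoding as gb18030 (a superset of gb2312) may be worth trying, though I have not verified that against this site.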
Writing the txt file:
with open("test.txt","a",encoding='gb2312') as f:
Some characters in the page cannot be written to the txt file as gb2312 and raise an error:
UnicodeEncodeError: 'gb2312' codec can't encode character '\xa0' in position 5217: illegal multibyte sequence
So replace all of them before writing:
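with open("test.txt", "a") as f:  # no encoding argument, so the platform default is used
    # \xa0   -> non-breaking space
    # \ufffd -> replacement character (a byte that could not be decoded)
    # \u30fb -> katakana middle dot
    # Replace two consecutive <br><br> with two line breaks plus the leading space of a paragraph's first line
    f.write(text_delete_bmp.replace('\ufffd', '')
            .replace('\u30fb', '')
            .replace('\xa0', '')
            .replace(' ', "\n ")
            .replace('\n\n', "\n "))
# The with block closes the file automatically, so no f.close() is needed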
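Listing every offending character by hand gets tedious. A more general alternative (not what the code above does, just a sketch) is to round-trip the text through the target codec and let it drop whatever it cannot encode. keep_gb2312_only and the sample string are hypothetical names for illustration.

```python
def keep_gb2312_only(text: str) -> str:
    """Drop every character that the gb2312 codec cannot encode."""
    # errors="ignore" silently skips unencodable characters;
    # decoding the result gives back an ordinary str.
    return text.encode("gb2312", errors="ignore").decode("gb2312")


# \xa0 and \ufffd are not in gb2312, so both disappear.
cleaned = keep_gb2312_only("第一章\xa0开始\ufffd结束")
```

The trade-off is that characters vanish silently, so the explicit replace chain may still be preferable when you want to see exactly what is being removed.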
3. Removing specific strings
The article text contains some strings that are not needed, for example:
{
ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}{
ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}
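Use a regular expression to remove all of them.
String pattern: starts with "{ewc" (in the samples above a line break follows the brace) and ends with ".bmp}".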
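import re
# \s* tolerates the line break that follows "{" in the samples; braces are escaped to be matched literally
text_delete_bmp = re.sub(r'\{\s*ewc.*?\.bmp\}', "", text_context[0].text)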
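To check the pattern without scraping anything, here is a minimal self-contained sketch run against the two sample strings shown earlier. It assumes the page text really has a line break after each opening brace, as in the samples. strip_bmp_tags and the surrounding "before"/"after" text are hypothetical.

```python
import re

# "{", optional whitespace (including a line break), "ewc",
# then the shortest possible run of characters up to ".bmp}".
BMP_TAG = re.compile(r"\{\s*ewc.*?\.bmp\}")


def strip_bmp_tags(text: str) -> str:
    """Remove every {ewc ... .bmp} image marker from the text."""
    return BMP_TAG.sub("", text)


sample = (
    "before{\n"
    "ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}{\n"
    "ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}after"
)
result = strip_bmp_tags(sample)
```

The non-greedy `.*?` matters here: with a greedy `.*`, two markers on the same line would be swallowed as one match together with any text between them.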