Background
- Training a text classification model requires preprocessing the numbers and special symbols in the text.
Ideas
1 The goal is to extract numbers, which generally take one of these forms: an integer, a decimal, or an integer part plus a decimal part;
2 So the general shape is digits, an optional decimal point, then more digits: ----.-----;
3 That shape translates into the regular expression "\d+\.?\d*";
4 \d+ matches one or more digits. Do not write * here, because even a decimal must have at least one digit before the decimal point. \.? matches the decimal point, which may or may not be present; the dot is escaped because an unescaped . matches any character. \d* matches the digits after the decimal point, of which there may be zero or more.
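To see why the dot has to be escaped, the following minimal sketch compares the escaped pattern with an unescaped variant on the sample string used in this article. The variable names are illustrative only.

```python
import re

text = "A1.45,b5,6.45,8.82"

# Escaped dot: "\." matches only a literal decimal point.
escaped = re.findall(r"\d+\.?\d*", text)

# Unescaped dot: "." matches any character, so a comma can be swallowed
# and two neighbouring numbers get glued together.
unescaped = re.findall(r"\d+.?\d*", text)

print(escaped)    # ['1.45', '5', '6.45', '8.82']
print(unescaped)  # ['1.45', '5,6', '45,8', '82']
```

With the unescaped dot, the comma between 5 and 6.45 is consumed as if it were a decimal point, which corrupts every number after it.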
Code
import re
string = "A1.45,b5,6.45,8.82"
print(re.findall(r"\d+\.?\d*", string))  # find all numbers
# ['1.45', '5', '6.45', '8.82']
res = re.sub(r"\d+\.?\d*", "", string)  # filter the numbers out
print(res)
# A,b,,
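One edge case worth knowing: because the decimal point and the fractional digits are optional independently, the pattern above also accepts a number followed by a sentence-ending period, such as "5.". If that matters for your data, a stricter variant that only accepts the dot when digits follow it is one option. This is a sketch on a made-up sample string, not part of the original recipe.

```python
import re

sample = "Total 5. Next 3.14, and 7"  # made-up sample text

# Original pattern: the trailing period is captured along with the 5.
loose = re.findall(r"\d+\.?\d*", sample)

# Stricter variant: the non-capturing group (?:\.\d+)? requires at least
# one digit after the decimal point, otherwise the dot is left alone.
strict = re.findall(r"\d+(?:\.\d+)?", sample)

print(loose)   # ['5.', '3.14', '7']
print(strict)  # ['5', '3.14', '7']
```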
- Other similar cases:
- Filtering Chinese and English punctuation and special symbols
- Filtering special characters such as line breaks
# Remove spaces, \t, \r, and \n
import re
str1 = '123 456 7\t8\r9\n10'
str1 = re.sub(r'\s+', '', str1)  # \s matches any whitespace character
print(str1)
# 12345678910
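For the punctuation case mentioned in the list above, a minimal sketch might look like the following. It assumes that "English punctuation" means Python's string.punctuation and that a small hand-picked set of full-width characters is enough for the Chinese side. The names CN_PUNCT, PUNCT_RE, and strip_punctuation are hypothetical and chosen for this example.

```python
import re
import string

# Assumption: a hand-picked set of common full-width (Chinese) punctuation.
# Extend it to suit your own corpus.
CN_PUNCT = ",。!?;:、“”‘’()《》【】"

# re.escape keeps characters such as "]", "^", "-" and "\" from being
# interpreted as syntax inside the character class.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + CN_PUNCT) + "]")

def strip_punctuation(text):
    """Remove ASCII punctuation and the listed Chinese punctuation."""
    return PUNCT_RE.sub("", text)

cleaned = strip_punctuation("Hello, world! 你好,世界!(test)")
print(cleaned)  # Hello world 你好世界test
```

Compiling the pattern once is a small convenience when the same filter is applied to every document in a training set.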