In NLP (natural language processing) tasks, we often need to identify and extract provinces, cities, and districts from text. We could do this by looking up each name in a keyword table, but that requires collecting the keyword tables for every province and city first, which is cumbersome.
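For comparison, the keyword-table approach might look like the minimal sketch below. The table is a tiny hand-written sample made up for illustration, not a complete list, and find_regions is a hypothetical helper name.

```python
# A tiny hand-made keyword table (illustrative only); a real one
# would need entries for every province, city, and district.
KEYWORDS = {
    "province": ["广东省", "四川省"],
    "city": ["深圳市", "广汉市"],
    "district": ["福田区"],
}


def find_regions(text):
    """Return the first keyword of each level found in text, or None."""
    found = {}
    for level, names in KEYWORDS.items():
        # Scan the keyword list one entry at a time.
        found[level] = next((name for name in names if name in text), None)
    return found


print(find_regions("广东省深圳市福田区巴丁街深南中路1025号新城大厦1层"))
```

Even this small version shows the maintenance burden: every name has to be collected and kept up to date by hand.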
Today I will introduce a module that does this for you. Pass it a string, and it returns the province, city, and district found in that string; it can also plot them on a map. It is the cpca module.
1. Preparation
Before starting, make sure that Python and pip are installed on your computer. If not, see the article "Super detailed Python installation guide".
(Optional 1) If you use Python for data analysis, you can install Anaconda directly, which comes with Python and pip built in; see "Anaconda, a good helper for Python data analysis and mining".
(Optional 2) The VSCode editor is also recommended for its many advantages; see "The best partner for Python programming: VSCode detailed guide".
Open a terminal in one of the following ways and enter the command below to install the dependency:
1. On Windows, open Cmd (Start > Run > CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. If you use VSCode or PyCharm, you can use the Terminal at the bottom of the interface directly.
pip install cpca
Note that the cpca module currently supports only Python 3.
On Windows, an error like the following may occur:
Building wheel for pyahocorasick (setup.py) ... error
To fix it, download and install Microsoft Visual C++ Build Tools, then run pip install cpca again.
2. Basic use
The most basic extraction of provinces and cities can be achieved with two lines of code:
# WeChat official account: Python Practical Collection (Python 实用宝典)
# 2022/06/23
import cpca
location_str = [
"广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
"特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
"三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str)
print(df)
The output is as follows:
省 市 区 地址 adcode
0 广东省 深圳市 福田区 巴丁街深南中路1025号新城大厦1层 440304
1 上海市 None None 。 310000
2 四川省 德阳市 广汉市 城西三星堆镇的鸭子河畔,属青铜时代文化遗址 510681
Note Guanghan City in the third example: cpca not only recognizes Guanghan, a county-level city, in the sentence, but also automatically fills in Deyang, the prefecture-level city that administers it.
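The adcode column makes this parent-child relationship visible. Assuming the codes follow the standard six-digit administrative division format (two digits for the province, two for the prefecture-level city, two for the county level), the parent codes can be derived by truncation. This is only a sketch; split_adcode is a hypothetical helper, not part of cpca.

```python
def split_adcode(adcode):
    """Split a six-digit adcode into province, city, and county-level codes.

    Assumes the standard layout: digits 1-2 province, 3-4 city, 5-6 county.
    """
    code = str(adcode)
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit code, got {adcode!r}")
    province = code[:2] + "0000"  # zero out the city and county digits
    city = code[:4] + "00"        # zero out the county digits
    return province, city, code


# Guanghan (510681) sits under Deyang (510600) in Sichuan (510000).
print(split_adcode("510681"))
```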
If you want to know the position in the string at which each province, city, or district name was found, add the pos_sensitive=True parameter:
# WeChat official account: Python Practical Collection (Python 实用宝典)
# 2022/06/23
import cpca
location_str = [
"广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
"特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
"三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str, pos_sensitive=True)
print(df)
The output is as follows:
(base) G:\push\20220623>python 1.py
省 市 区 地址 adcode 省_pos 市_pos 区_pos
0 广东省 深圳市 福田区 巴丁街深南中路1025号新城大厦1层 440304 0 3 6
1 上海市 None None 。 310000 38 -1 -1
2 四川省 德阳市 广汉市 城西三星堆镇的鸭子河畔,属青铜时代文化遗址 510681 9 -1 12
The extra columns give the index in the string at which the province, city, and district were found. A name that was inferred rather than found in the text, such as Deyang City here, is marked as -1.
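Because these positions are plain string indices, they can be used to slice the original text. Here is a minimal sketch, with the positions from the output above hard-coded rather than read from the DataFrame; name_at is a hypothetical helper, not a cpca function.

```python
def name_at(text, pos, name):
    """Return the slice of text starting at pos with the length of name.

    Returns None when pos is -1, i.e. the name was inferred and does not
    appear in the text itself.
    """
    if pos == -1:
        return None
    return text[pos:pos + len(name)]


text = "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
print(name_at(text, 9, "四川省"))   # 四川省
print(name_at(text, -1, "德阳市"))  # None
print(name_at(text, 12, "广汉市"))  # 广汉市
```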
3. Advanced use
cpca can also identify multiple regions in a long piece of text in one pass:
# WeChat official account: Python Practical Collection (Python 实用宝典)
# 2022/06/23
import cpca
long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
"在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
"想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
"香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
print(df)
The output is as follows:
(base) G:\push\20220623>python 1.py
省 市 区 地址 adcode 省_pos 市_pos 区_pos
0 广东省 广州市 None 440100 -1 44 -1
1 香港特别行政区 None None 810000 47 -1 -1
2 广东省 深圳市 None 440300 -1 58 -1
3 北京市 None None 110000 71 -1 -1
4 广东省 广州市 None 440100 -1 86 -1
5 广东省 深圳市 None 440300 -1 89 -1
6 香港特别行政区 None None 810000 92 -1 -1
7 北京市 None None 110000 100 -1 -1
8 广东省 广州市 None 440100 -1 110 -1
9 香港特别行政区 None None 810000 115 -1 -1
10 广东省 深圳市 None 440300 -1 120 -1
11 北京市 None None 110000 128 -1 -1
12 广东省 广州市 None 440100 -1 143 -1
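Since every mention becomes its own row, a frequency count is a natural next step. The sketch below uses only the standard library, with the (省, 市) pairs copied by hand from the 13 rows above; with the real DataFrame you would more likely count the rows with pandas.

```python
from collections import Counter

# (省, 市) pairs copied from rows 0-12 of the output above.
mentions = [
    ("广东省", "广州市"), ("香港特别行政区", None), ("广东省", "深圳市"),
    ("北京市", None), ("广东省", "广州市"), ("广东省", "深圳市"),
    ("香港特别行政区", None), ("北京市", None), ("广东省", "广州市"),
    ("香港特别行政区", None), ("广东省", "深圳市"), ("北京市", None),
    ("广东省", "广州市"),
]

# Use the city when present, otherwise fall back to the province-level name.
counts = Counter(city or province for province, city in mentions)
for place, n in counts.most_common():
    print(place, n)
```

Guangzhou is mentioned four times and the other three places three times each, which is the kind of weighting a heat map would reflect.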
The module also comes with some simple drawing tools, which can plot the output above as a heat map on a map:
# WeChat official account: Python Practical Collection (Python 实用宝典)
# 2022/06/23
import cpca
from cpca import drawer
long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
"在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
"想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
"香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
drawer.draw_locations(df[cpca._ADCODE], "df.html")
If folium is not installed, running this code raises the following error:
(base) G:\push\20220623>python 1.py
Traceback (most recent call last):
File "1.py", line 12, in <module>
drawer.draw_locations(df[cpca._ADCODE], "df.html")
File "G:\Anaconda3\lib\site-packages\cpca\drawer.py", line 41, in draw_locations
import folium
ModuleNotFoundError: No module named 'folium'
Use pip to install:
pip install folium
Then re-run the code. A file named df.html will be generated in the current directory; double-click it to open the heat map in a browser.
Convenient, isn't it? For most location identification tasks, this module should be all you need.
For more details, visit the GitHub homepage of this project. Its README is written in Chinese and is very easy to read:
https://github.com/DQinYuan/chinese_province_city_area_mapper
If you cannot access GitHub, you can also reply "cpca" to the Python Practical Collection official account to download the complete project.
Python Practical Collection (pythondict.com)