A handy Python tool: automatically identify provinces, cities, and districts in text and plot them on a map

In NLP (natural language processing) tasks, we often need to identify and extract provinces, cities, and districts from text. We could do this by matching the text against a keyword table, but that requires collecting a complete table of province, city, and district names first, which is cumbersome.
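To make the keyword-table approach concrete, here is a minimal sketch. The tiny KEYWORDS table and the naive_extract helper are made up for illustration and are not part of any library:

```python
# A tiny hand-made keyword table (hypothetical, for illustration only).
# A real one would need every province, city, and district in China,
# which is the tedious part.
KEYWORDS = {
    "province": ["广东省", "四川省"],
    "city": ["深圳市"],
    "district": ["福田区", "广汉市"],
}


def naive_extract(text):
    """Return the first keyword of each level found in text, or None."""
    result = {}
    for level, names in KEYWORDS.items():
        result[level] = next((name for name in names if name in text), None)
    return result


print(naive_extract("广东省深圳市福田区巴丁街深南中路1025号新城大厦1层"))
# {'province': '广东省', 'city': '深圳市', 'district': '福田区'}
```

Besides needing a complete table, a lookup like this only returns names that literally appear in the text. It cannot fill in a city that is implied by a district but not mentioned.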

Today I will introduce a module that does this for you. Pass it a string and it returns the province, city, and district found in that string, and it can also plot them on a map. It is the cpca module.

1. Preparation

Before starting, make sure Python and pip are installed on your computer. If not, see this article for installation instructions: Super detailed Python installation guide.

(Optional 1) If you use Python for data analysis, you can install Anaconda directly, which ships with Python and pip: Anaconda, a good helper for Python data analysis and mining.

(Optional 2) The VSCode editor is also recommended for its many advantages; see The best partner for Python programming, a detailed VSCode guide.

Open a terminal in one of the following ways and enter the command below to install the dependency:
1. On Windows, open Cmd (Start > Run > CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. In VSCode or PyCharm, use the Terminal panel at the bottom of the window.

pip install cpca

Note that the cpca module currently supports only Python 3.

On Windows, an error similar to the following may occur:

Building wheel for pyahocorasick (setup.py) ... error

To fix it, download and install Microsoft Visual C++ Build Tools, then run pip install cpca again.
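If you want to confirm the environment before moving on, a quick check like the following may help. It is only a sketch using the standard library; the is_installed helper is a name made up for this example:

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the named top-level module can be found."""
    return importlib.util.find_spec(module_name) is not None


# cpca needs Python 3, so check the interpreter version first.
print("Python 3:", sys.version_info.major >= 3)
print("cpca installed:", is_installed("cpca"))
```

If the second line prints False, the pip install step did not succeed in the interpreter you are running.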

2. Basic use

Basic extraction of provinces, cities, and districts takes just two lines of code, the import and the call to cpca.transform:

# Official account: Python Practical Collection
# 2022/06/23

import cpca

location_str = [
    "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
    "特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
    "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str)
print(df)

The output is as follows (the columns are 省 = province, 市 = city, 区 = district, 地址 = the rest of the address):

省 市 区 地址 adcode
0 广东省 深圳市 福田区 巴丁街深南中路1025号新城大厦1层 440304
1 上海市 None None 。 310000
2 四川省 德阳市 广汉市 城西三星堆镇的鸭子河畔,属青铜时代文化遗址 510681

Note Guanghan City in the third example: cpca not only recognizes Guanghan, a county-level city mentioned in the sentence, but also automatically fills in Deyang City, the prefecture-level city that administers it, even though Deyang does not appear in the text.
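The adcode column hints at how a district relates to its parent city. Assuming the value follows the usual 6-digit administrative division code layout (two digits each for province, city, and district), the parent codes can be read off the prefix. The sketch below only illustrates that layout; split_adcode is a hypothetical helper, and this is not necessarily how cpca performs the lookup internally:

```python
def split_adcode(adcode):
    """Split a 6-digit adcode into province, city, and district level codes."""
    code = str(adcode)
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit code, got {adcode!r}")
    return {
        "province": code[:2] + "0000",  # first two digits identify the province
        "city": code[:4] + "00",        # first four digits identify the city
        "district": code,               # all six digits identify the district
    }


print(split_adcode("510681"))
# {'province': '510000', 'city': '510600', 'district': '510681'}
```

For Guanghan (510681) this gives 510600, the code of Deyang City. Likewise, 440304 (Futian District) maps to 440300, which is Shenzhen.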

If you also want to know where in the string each province, city, and district name was found, add the pos_sensitive=True parameter:

# Official account: Python Practical Collection
# 2022/06/23

import cpca

location_str = [
    "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
    "特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
    "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str, pos_sensitive=True)
print(df)

The output is as follows:

(base) G:\push\20220623>python 1.py
     省 市 区 地址 adcode 省_pos 市_pos 区_pos
0  广东省 深圳市 福田区 巴丁街深南中路1025号新城大厦1层 440304      0      3      6
1  上海市 None None 。 310000     38     -1     -1
2  四川省 德阳市 广汉市 城西三星堆镇的鸭子河畔,属青铜时代文化遗址 510681      9     -1     12

The 省_pos, 市_pos, and 区_pos columns give the position (index) in the string where the province, city, and district were matched. A value that was inferred rather than found in the text, such as Deyang City here, is marked as -1.
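The positions are ordinary 0-based string indexes, which you can check with Python's built-in str.find. This is only a way to read the numbers, not a description of how cpca matches internally:

```python
text = "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"

for name in ["四川省", "德阳市", "广汉市"]:
    # str.find returns the 0-based index of the first match, or -1 if absent.
    print(name, text.find(name))
# 四川省 9
# 德阳市 -1
# 广汉市 12
```

The results line up with the 省_pos, 市_pos, and 区_pos values in the third row of the output above.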

3. Advanced use

It can also identify multiple regions in a large block of text in one pass:

# Official account: Python Practical Collection
# 2022/06/23

import cpca

long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
    "在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
    "想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
    "香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
print(df)

The output is as follows:

(base) G:\push\20220623>python 1.py
          省 市 区 地址 adcode 省_pos 市_pos 区_pos
0       广东省 广州市 None     440100     -1     44     -1
1   香港特别行政区 None  None     810000     47     -1     -1
2       广东省 深圳市 None     440300     -1     58     -1
3       北京市 None  None     110000     71     -1     -1
4       广东省 广州市 None     440100     -1     86     -1
5       广东省 深圳市 None     440300     -1     89     -1
6   香港特别行政区 None  None     810000     92     -1     -1
7       北京市 None  None     110000    100     -1     -1
8       广东省 广州市 None     440100     -1    110     -1
9   香港特别行政区 None  None     810000    115     -1     -1
10      广东省 深圳市 None     440300     -1    120     -1
11      北京市 None  None     110000    128     -1     -1
12      广东省 广州市 None     440100     -1    143     -1
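Each mention of a place gets its own row, so the same adcode appears several times. To see how often each place is mentioned, you can count the codes. The sketch below uses a plain list copied by hand from the output above rather than the DataFrame itself, so it runs without cpca:

```python
from collections import Counter

# adcode column copied by hand from the output above, in row order.
adcodes = [
    "440100", "810000", "440300", "110000",
    "440100", "440300", "810000", "110000",
    "440100", "810000", "440300", "110000",
    "440100",
]

mentions = Counter(adcodes)
for adcode, count in mentions.most_common():
    print(adcode, count)
# 440100 4
# 810000 3
# 440300 3
# 110000 3
```

Guangzhou (440100) is mentioned four times, while Hong Kong, Shenzhen, and Beijing are mentioned three times each.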

The module also comes with some simple drawing tools that can plot the output above as a heat map on a map:

# Official account: Python Practical Collection
# 2022/06/23

import cpca
from cpca import drawer

long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
    "在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
    "想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
    "香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
drawer.draw_locations(df[cpca._ADCODE], "df.html")

Running it may report the following error if folium is not installed:

(base) G:\push\20220623>python 1.py
Traceback (most recent call last):
  File "1.py", line 12, in <module>
    drawer.draw_locations(df[cpca._ADCODE], "df.html")
  File "G:\Anaconda3\lib\site-packages\cpca\drawer.py", line 41, in draw_locations
    import folium
ModuleNotFoundError: No module named 'folium'

Use pip to install:

pip install folium

Then run the code again. It generates df.html in the current directory; double-click the file to open the heat map in a browser:

[Image: the generated heat map]

As you can see, it is very convenient to use. For most location-recognition tasks, this module should be enough.

For more details, visit the project's GitHub page. Its README is written in Chinese and is easy to read:

https://github.com/DQinYuan/chinese_province_city_area_mapper

If you can't access GitHub, you can also reply cpca in the official account Python Practical Collection to download the complete project.

That's the end of this article. If you liked today's Python practical tutorial, please keep following Python Practical Collection.

Python Practical Collection (pythondict.com): not just a collection.
Follow the official account: Python Practical Collection

Origin blog.csdn.net/u010751000/article/details/125437750