Entity Linking open-source tool: dexter2

Copyright notice: this is the author's original article; reproduction without the author's permission is prohibited. https://blog.csdn.net/qq_37043191/article/details/81906433

Entity Linking

In natural language processing, entity linking — also called named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), or named-entity normalization (NEN) — is the task of determining the identity of the entities mentioned in text. For example, given the sentence "Paris is the capital of France", entity linking should determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be called "Paris". Likewise, for the sentence "James Bond is cool", we expect to obtain the full linked name "James_Bond".

Dexter2

Dexter is an open-source entity-linking framework that links mentions to entries in the (English) Wikipedia.

Download

dexter on github
Both precompiled binaries and the source code are available there; this article uses the precompiled bin file directly.
On Windows, run the following in the extracted directory:

java -Xmx4000m -jar dexter-2.1.0.jar

or, on Linux:

wget http://hpc.isti.cnr.it/~ceccarelli/dexter2.tar.gz
tar -xvzf dexter2.tar.gz
cd dexter2
java -Xmx4000m -jar dexter-2.1.0.jar

The service then starts on local port 8080. On Windows, or on Linux with a desktop environment, open http://localhost:8080/dexter-webapp/dev/ in a browser to view the API. If Dexter is running on a remote server, you can instead fetch results from the URLs with Python's request module (see below).

Usage

All API endpoints are documented both locally and on the official site, each with runnable examples. This article walks through a few of them.

1. annotate, spot

  • annotate
    Performs the entity linking on a given text, annotating maximum n entities.

  • spot
    It only performs the first step of the entity linking process, i.e., find all the mentions that could refer to an entity

Both perform entity linking on the terms of a query; the difference is that annotate resolves each mention and returns at most the n most relevant links. Use whichever fits your needs.


For example, suppose we want to find the linked entities in

Bob Dylan and Johnny Cash had formed a mutual admiration society even before they met in the early 1960s

You can also try the demo by entering the request URL directly in a browser. Here the linking confidence threshold is set to 0.5:

http://localhost:8080/dexter-webapp/api/rest/annotate?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&n=50&wn=false&debug=false&format=text&min-conf=0.5
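As an aside, percent-encoding the query by hand is error-prone; the same URL can be assembled with Python's standard urllib.parse. A minimal sketch, using only the endpoint and parameters shown in the request above:

```python
from urllib.parse import urlencode, quote

# Endpoint and parameters copied from the request URL above.
base = "http://localhost:8080/dexter-webapp/api/rest/annotate"
params = {
    "text": "Bob Dylan and Johnny Cash had formed a mutual admiration society "
            "even before they met in the early 1960s",
    "n": 50,
    "wn": "false",
    "debug": "false",
    "format": "text",
    "min-conf": 0.5,
}
# quote_via=quote encodes spaces as %20, matching the hand-built URL above.
url = base + "?" + urlencode(params, quote_via=quote)
print(url)
```

The resulting URL can then be opened in a browser or fetched with urllib.request.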

This returns the annotate result:

"value": "<a href=\"#\" onmouseover='manage(4637590)' >Bob Dylan</a> and <a href=\"#\" onmouseover='manage(11983070)' >Johnny Cash</a> had formed a mutual admiration society even before they met in the early 1960s"

The annotate response also includes the spot results:

"spots": [
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "entity": 11983070,
      "field": "body",
      "entityFrequency": 2540,
      "commonness": 0.9929632525410477,
      "score": 0.9929632525410477
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "entity": 4637590,
      "field": "body",
      "entityFrequency": 5547,
      "commonness": 0.9926628489620616,
      "score": 0.9926628489620616
    }
  ]

As the output shows, the program linked two entities, bob dylan and johnny cash, both with confidence above 0.5, and returned the id of each. These ids can be used for further operations, covered later in this article.
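As a small illustration of working with the response, the mention-to-id mapping can be pulled out of the spots array in a few lines of Python (the sample below is a copy of the response above, trimmed to the fields it uses):

```python
# "spots" array from the annotate response above, trimmed to three fields.
spots = [
    {"mention": "johnny cash", "entity": 11983070, "score": 0.9929632525410477},
    {"mention": "bob dylan", "entity": 4637590, "score": 0.9926628489620616},
]

# Map each surface mention to the id of the entity it was linked to.
mention_to_id = {s["mention"]: s["entity"] for s in spots}
print(mention_to_id)  # {'johnny cash': 11983070, 'bob dylan': 4637590}
```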

If instead we run the spot api:

http://localhost:8080/dexter-webapp/api/rest/spot?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&wn=false&debug=false&format=text

we get the result:

"spots": [
    {
      "mention": "mutual admiration society",
      "linkProbability": 1,
      "field": "body",
      "start": 39,
      "end": 64,
      "linkFrequency": 33,
      "documentFrequency": 31,
      "candidates": [
        {
          "entity": 2319591,
          "freq": 13,
          "commonness": 0.3939393939393939
        },
        {
          "entity": 2648616,
          "freq": 9,
          "commonness": 0.2727272727272727
        },
        {
          "entity": 2319544,
          "freq": 6,
          "commonness": 0.18181818181818182
        },
        {
          "entity": 3001631,
          "freq": 4,
          "commonness": 0.12121212121212122
        },
        {
          "entity": 32742,
          "freq": 1,
          "commonness": 0.030303030303030304
        }
      ]
    },
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "field": "body",
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "candidates": [
        {
          "entity": 11983070,
          "freq": 2540,
          "commonness": 0.9929632525410477
        },
        {
          "entity": 12326526,
          "freq": 14,
          "commonness": 0.00547302580140735
        }
      ]
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "field": "body",
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "candidates": [
        {
          "entity": 4637590,
          "freq": 5547,
          "commonness": 0.9926628489620616
        },
        {
          "entity": 438899,
          "freq": 35,
          "commonness": 0.006263421617752327
        }
      ]
    }
  ],
  "nSpots": 3,
  "querytime": 264

Notice that dexter in fact found not only bob dylan and johnny cash but also mutual admiration society. However, mutual admiration society matches several Wikipedia entries, e.g. Mutual_Admiration_Society_(song), Mutual_Admiration_Society_(album), Mutual_Admiration_Society_(collaboration), and Mutual_Admiration_Society_–Joe_Locke&_David_Hazeltine_Quartet.
Yet at a glance we can tell which of these the sentence refers to; the fact that dexter cannot decide suggests that its algorithm is context-free, i.e. it ignores the surrounding text. In other words, dexter only provides the linking interface; resolving this kind of ambiguity requires other tools.
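One simple fallback, equally context-free, is to resolve each spot to the candidate with the highest commonness; this appears to be what annotate does, since its score equals the top candidate's commonness. A sketch over the candidates of the mutual admiration society spot above:

```python
# Candidate list for "mutual admiration society", copied from the spot output above.
candidates = [
    {"entity": 2319591, "freq": 13, "commonness": 0.3939393939393939},
    {"entity": 2648616, "freq": 9, "commonness": 0.2727272727272727},
    {"entity": 2319544, "freq": 6, "commonness": 0.18181818181818182},
    {"entity": 3001631, "freq": 4, "commonness": 0.12121212121212122},
    {"entity": 32742, "freq": 1, "commonness": 0.030303030303030304},
]

# Context-free resolution: pick the globally most common sense.
best = max(candidates, key=lambda c: c["commonness"])
print(best["entity"])  # 2319591
```

A context-aware linker would instead rescore these candidates against the surrounding sentence.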

2. get-id

Given an entity title, get-id returns its id (the number assigned to the entry in wiki):

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny%20cash

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny_cash

Both requests return the same result:

{
  "title": "Johnny_cash",
  "url": "",
  "id": 11983070
}

3. get-desc

Given an id, get-desc returns its description; you can think of it as mapping an id back to an entity.

http://localhost:8080/dexter-webapp/api/rest/get-desc?id=11983070&title-only=true

Remember to set the title-only parameter to true, or the entity will not be output.

4. Batch processing with Python

Once port 8080 is open, you can use urllib and json to process queries in bulk. For example:

from urllib import request, parse
import json


def GetAnnotateUrl(query, n = 5, conf = 0.5):
  url = 'http://localhost:8080/dexter-webapp/api/rest/annotate?text='
  url += parse.quote(query)  # percent-encode spaces and any other special characters
  url += ('&n=' + str(n))
  url += ('&min-conf=' + str(conf))
  url += '&wn=false&debug=false&format=text'
  return url

def GetId2EntityUrl(id):
  url = 'http://localhost:8080/dexter-webapp/api/rest/get-desc?title-only=true&id='
  url += str(id)
  return url

def GetRequest(url):
  req = request.Request(url)
  data = request.urlopen(req).read().decode('utf-8')
  Json = json.loads(data)
  return Json

def GetEntitiesByQuery(query, n = 5, conf = 0.5):
  url = GetAnnotateUrl(query, n, conf)
  AnnoData = GetRequest(url)
  # AnnoData = json.dumps(AnnoData, indent = 4, separators = (',', ':'))
  # print(AnnoData) # Use the above dumps command to print structured json
  Spots = AnnoData['spots']
  Entities = {}
  for session in Spots:
    url = GetId2EntityUrl(session["entity"])
    Entities[session["entity"]] = GetRequest(url)["title"]
  return Entities

Entities = GetEntitiesByQuery('bob dylan and johnny cash')
print(Entities)
output:
{4637590: 'Bob_Dylan', 11983070: 'Johnny_Cash'}

This makes batch processing easy. And unlike the entity services in the Tencent/Alibaba clouds, it imposes no rate limit, so you can use it as much as you like.

Enjoy!
