Exploring NLP concepts using Apache OpenNLP

Introduction

After looking at a lot of Java/JVM based NLP libraries listed on Awesome AI/ML/DL, I decided to pick the Apache OpenNLP library. One reason is that another developer (who had looked at it previously) recommended it. Besides, it’s an Apache project, and the Apache Software Foundation has been a great supporter of F/OSS Java projects for the last two decades or so (see Wikipedia). It also goes without saying that Apache OpenNLP is released under the Apache 2.0 license.

In addition, this tweet from an NLP researcher added to my confidence:

I’d like to say that my personal experience with Apache OpenNLP has been similar so far, and I echo the praise for its simple, user-friendly API and design. You will see that this is the case as we explore it further.

Exploring NLP using Apache OpenNLP

Java bindings

We won’t be covering the Java API of the Apache OpenNLP tool in this post, but you can find a number of examples in its docs. A bit later you will also need some of the resources listed in the Resources section at the bottom of this post in order to progress further.

Command-line Interface

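I was drawn to the simplicity of the CLI available: it works out of the box, both where a model is needed and where one is provided. It works without any additional configuration.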

To make it easier to use, and to avoid having to remember all the CLI parameters it supports, I have put together some shell scripts. Have a look at the README to get more insight into what they are and how to use them.

Getting started

From this point on, you will need the following:

  • Git client 2.x or higher (an account on GitHub to fork the repo)
  • Java 8 or higher (installing GraalVM CE 19.x or higher is suggested)
  • Docker CE 19.x or higher (check that it is running before going further)
  • Ability to run shell scripts from the CLI
  • Ability to read/write shell scripts (optional)

Note: At the time of writing, version 1.9.1 of Apache OpenNLP was available.

We have put together scripts to make these steps easy for everyone:

    $ git clone git@github.com:valohai/nlp-java-jvm-example.git
    or
    $ git clone https://github.com/valohai/nlp-java-jvm-example.git
    $ cd nlp-java-jvm-example

This takes us into a folder containing the following files:

    LICENSE.txt      
    README.md        
    docker-runner.sh     <=== only this one concerns us at startup
    images
    shared               <=== created just when you run the container

Note: a Docker image has been provided so that you can run a Docker container with all the tools you need to go further. You can see that the shared folder has been created; it is mounted into your container as a volume, but it is actually a directory on your local machine that is mapped to this volume. So anything created or downloaded there will be available even after you exit the container!

Have a quick read of the main README file to get an idea of how to use the docker-runner.sh shell script, and take a quick glance at the Usage section as well. After that, also look at the Apache OpenNLP README file to see how the scripts provided therein are used.

Run the NLP Java/JVM docker container

At your local machine’s command prompt, from the project root directory, run:

    $ ./docker-runner.sh --runContainer

You may see the following output first, before you get the prompt:

    Unable to find image 'neomatrix369/nlp-java:0.1' locally
    0.1: Pulling from neomatrix369/nlp-java
    f476d66f5408: ...
    .
    .
    .
    Digest: sha256:53b89b166d42ddfba808575731f0a7a02f06d7c47ee2bd3622e980540233dcff
    Status: Downloaded newer image for neomatrix369/nlp-java:0.1

You will then be presented with a prompt inside the container:

    Running container neomatrix369/nlp-java:0.1

    ++ pwd
    + time docker run --rm --interactive --tty --workdir /home/nlp-java --env JDK_TO_USE= --env JAVA_OPTS=<--snipped>
    nlp-java@cf9d493f0722:~$

The container is packed with all the Apache OpenNLP scripts/tools you need to get started with exploring various NLP solutions.

Installing Apache OpenNLP inside the container

Once you are inside the container, this is how to go further from the container’s command prompt:

    nlp-java@cf9d493f0722:~$ cd opennlp


    nlp-java@cf9d493f0722:~$ ./opennlp.sh

You will see the apache-opennlp-1.9.1-bin.tar.gz artifact being downloaded and expanded into the shared folder:

    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 10.6M  100 10.6M    0     0  4225k      0  0:00:02  0:00:02 --:--:-- 4225k
    apache-opennlp-1.9.1/
    apache-opennlp-1.9.1/NOTICE
    apache-opennlp-1.9.1/LICENSE
    apache-opennlp-1.9.1/README.html
    .
    .
    .
    apache-opennlp-1.9.1/lib/jackson-jaxrs-json-provider-2.8.4.jar
    apache-opennlp-1.9.1/lib/jackson-module-jaxb-annotations-2.8.4.jar

Viewing and accessing the shared folder

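As soon as you run the container, a shared folder is created. It may be empty in the beginning, but as we go along we will find it filling up with different files and folders.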

It’s also where you will find the downloaded models and the Apache OpenNLP binary exploded into its own directory (by the name apache-opennlp-1.9.1).

You can also access it and view its contents from the command prompt (outside the container):

    ### Open a new command prompt
    $ cd nlp-java-jvm-example
    $ cd images/java/opennlp
    $ ls ..
    Dockerfile       corenlp.sh       opennlp          reverb.sh        word2vec.sh
    cogcomp-nlp.sh   mallet.sh        openregex.sh     shared
    common.sh        nlp4j.sh         rdrposttagger.sh version.txt

    $ ls ../shared
    apache-opennlp-1.9.1   en-ner-date.bin        en-sent.bin
    en-chunker.bin         en-parser-chunking.bin langdetect-183.bin

    ### In your case the contents of the shared folder may vary but the way to get to the folder is above.

From inside the container, this is what you will see:

    nlp-java@cf9d493f0722:~$ ls 
    cogcomp-nlp.sh   corenlp.sh  nlp4j.sh  openregex.sh        reverb.sh  word2vec.sh
    common.sh        mallet.sh   opennlp   rdrposttagger.sh        shared

    nlp-java@cf9d493f0722:~$ ls shared
    MyFirstJavaNotebook.ipynb      en-ner-date.bin           en-pos-maxent.bin          langdetect-183.bin
    apache-opennlp-1.9.1           en-ner-time.bin           en-pos-perceptron.bin      notebooks
    en-chunker.bin                 en-parser-chunking.bin    en-token.bin

    ### In your case the contents of the shared folder may vary but the way to get to the folder is above.

Performing NLP actions inside the container

The good thing is that, without ever leaving your current folder, you can perform these NLP actions (check out the Exploring NLP Concepts section in the README):

  • Usage help for any of the scripts: at any point in time you can query a script by calling it this way:

    nlp-java@cf9d493f0722:~$ ./[script-name.sh] --help

For example:

    nlp-java@cf9d493f0722:~$ ./detectLanguage.sh --help

gives us this usage text as output:

           Detecting language in a single-line text or article

           Usage: ./detectLanguage.sh --text [text]
                     --file [path/to/filename]
                     --help

           --text      plain text surrounded by quotes
           --file      name of the file containing text to pass as command arg
           --help      shows the script usage help text

  • Detecting language in a single line text or article.
    nlp-java@cf9d493f0722:~$ ./detectLanguage.sh --text "This is an english sentence"

    eng This is an english sentence

See Detecting languages section in the README for more examples and detailed output.
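The language code is the first field of the output line, so it is easy to pick out in a script. The following is a minimal sketch, assuming the output has the shape shown above (a three-letter code, whitespace, then the text); `lang_code` is a hypothetical helper and not one of the scripts in the repo.

```shell
#!/bin/sh
# lang_code (hypothetical helper): reads detector output on stdin and prints
# the leading three-letter language code of each line that starts with one.
# Other lines (e.g. model-loading or timing messages) are skipped.
lang_code() {
  awk '$1 ~ /^[a-z][a-z][a-z]$/ { print $1 }'
}

# Sample line copied from the output above. Inside the container you could
# pipe the script output in instead, e.g.:
#   ./detectLanguage.sh --text "..." | lang_code
printf '%s\n' "eng This is an english sentence" | lang_code
```

Matching on the shape of the first field, rather than just taking the first word of every line, keeps the helper from picking up the timing lines that the tool also prints.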

  • Detecting sentences in a single line text or article.
    nlp-java@cf9d493f0722:~$ ./detectSentence.sh --text "This is an english sentence. And this is another sentence."


    This is an english sentence.
    And this is another sentence.

See Detecting sentences section in the README for more examples and detailed output.

  • Finding person name, organisation name, date, time, money, location, percentage information in a single line text or article.
    nlp-java@cf9d493f0722:~$ ./nameFinder.sh --method person  --text "My name is John"


    My name is <START:person> John <END>

See Finding names section in the README for more examples and detailed output. There are a number of types of name finder examples in this section.
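The name finder marks each entity inline with `<START:type> ... <END>`. If you want just the entities, a small filter can pull them out. This is a sketch under the assumption that the markup always looks like the example above; `extract_entities` is a hypothetical helper, not part of the repo or of Apache OpenNLP.

```shell
#!/bin/sh
# extract_entities (hypothetical helper): reads name finder output on stdin
# and prints one "type: text" line per <START:type> text <END> span.
extract_entities() {
  grep -o '<START:[^>]*> [^<]* <END>' |
    sed -E 's/^<START:([^>]*)> (.*) <END>$/\1: \2/'
}

# Sample line copied from the output above.
printf '%s\n' 'My name is <START:person> John <END>' | extract_entities
```

The same filter works for the other entity types, since the type is read from the tag rather than hard-coded.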

  • Tokenize a line of text or an article into its smaller components (i.e. words, punctuation, numbers).
    nlp-java@cf9d493f0722:~$ ./tokenizer.sh --method simple --text "this-is-worth,tokenising.and,this,is,another,one"


    this - is - worth , tokenising . and , this , is , another , one

See Tokenise section in the README for more examples and detailed output.
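The output above suggests that the simple method separates runs of letters from individual punctuation characters. The sketch below imitates that behaviour with `sed` for this kind of input; it is only a rough approximation for illustration, not how Apache OpenNLP implements its tokenizer, and `rough_tokenize` is a hypothetical name.

```shell
#!/bin/sh
# rough_tokenize (hypothetical helper): puts a space around every punctuation
# character, then squeezes repeated spaces and trims both ends of the line.
# This only approximates the output shown above; it is not OpenNLP's code.
rough_tokenize() {
  sed -E 's/[[:punct:]]/ & /g; s/ +/ /g; s/^ //; s/ $//'
}

# Same sample text as in the tokenizer.sh call above.
printf '%s\n' "this-is-worth,tokenising.and,this,is,another,one" | rough_tokenize
```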

  • Parsing a line of text or an article into its phrase-structure tree.
    nlp-java@cf9d493f0722:~$ ./parser.sh --text "The quick brown fox jumps over the lazy dog ."


    (TOP (NP (NP (DT The) (JJ quick) (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

See Parser section in the README for more examples and detailed output.

  • Tagging the parts of speech of each token in a line of text or an article.
    nlp-java@cf9d493f0722:~$ ./posTagger.sh --method maxent --text "This is a simple text to tag"


    This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN

See Tag Parts of Speech section in the README for more examples and detailed output.
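Each token in the tagger output has the form `word_TAG`. To work with the tags (for example to count them), it can help to split the line into one word and tag per row. A minimal sketch, assuming the `word_TAG` format shown above; `split_tags` is a hypothetical helper.

```shell
#!/bin/sh
# split_tags (hypothetical helper): reads "word_TAG word_TAG ..." on stdin
# and prints one "word<TAB>TAG" row per token.
split_tags() {
  tr -s ' ' '\n' | awk '
    /_/ {
      tag  = $0; sub(/^.*_/, "", tag)      # text after the last underscore
      word = $0; sub(/_[^_]*$/, "", word)  # text before the last underscore
      printf "%s\t%s\n", word, tag
    }'
}

# Sample line copied from the output above.
printf '%s\n' "This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN" | split_tags
```

Splitting on the last underscore keeps words that themselves contain an underscore intact.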

  • Text chunking: dividing a text or an article into syntactically correlated groups of words, like noun groups and verb groups. Chunking is applied to text that has already been tagged by the PoS tagger (see the Penn Treebank tag set for the legend of token types, and also see https://nlpforhackers.io/text-chunking/).
    nlp-java@cf9d493f0722:~$ ./chunker.sh --text "This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN"


    [NP This_DT ] [VP is_VBZ ] [NP a_DT simple_JJ text_NN ] [PP to_TO ] [NP tag_NN]

See Chunking section in the README for more examples and detailed output.
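In the chunker output, every group is wrapped in square brackets and labelled with its type (NP, VP, PP and so on). As an illustration, the sketch below lists only the noun phrases, with the tags stripped. It assumes the bracket format shown above; `noun_phrases` is a hypothetical helper.

```shell
#!/bin/sh
# noun_phrases (hypothetical helper): reads chunker output on stdin and prints
# the words of each [NP ...] group, one group per line, without the PoS tags.
noun_phrases() {
  grep -o '\[NP [^]]*\]' |
    sed -E 's/^\[NP +//; s/ *\]$//; s/_[^ _]+//g'
}

# Sample line copied from the output above.
printf '%s\n' "[NP This_DT ] [VP is_VBZ ] [NP a_DT simple_JJ text_NN ] [PP to_TO ] [NP tag_NN]" | noun_phrases
```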

Exiting from the NLP Java/JVM docker container

It is as simple as this:

    nlp-java@f8562baf983d:~/opennlp$ exit
    exit
           67.41 real         0.06 user         0.05 sys

You are then back at your local machine’s prompt.

Benchmarking

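One of the salient features of this tool is that it records and reports metrics about its actions at different points of execution, that is, the time taken at both the micro and the macro level. Here is a sample output to illustrate this feature: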

    Loading Token Name Finder model ... done (1.200s)
    My name is <START:person> John <END>


    Average: 24.4 sent/s
    Total: 1 sent
    Runtime: 0.041s
    Execution time: 1.845 seconds

From the above, I came across five metrics that are useful to me as a scientist, analyst, or engineer:

    Took 1.200s to load the model into memory

    (Average) Processed at an average rate of 24.4 sentences per second
    (Total) Processed 1 sentence
    (Runtime) It took 0.040983606557377 seconds (reported as 0.041s) to process this 1 sentence
    (Execution time) The whole process ran for 1.845 seconds (startup, processing sentence(s) and shutdown)
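These numbers are related to each other: 1 sentence in roughly 0.041 seconds works out to about 24.4 sentences per second, which matches the reported average. The sketch below recomputes the rate from the Total and Runtime lines. It assumes the metrics block has the layout shown above; `sentences_per_second` is a hypothetical helper, not part of the tool.

```shell
#!/bin/sh
# sentences_per_second (hypothetical helper): reads the metrics block on stdin
# and recomputes the average rate as Total / Runtime.
sentences_per_second() {
  awk '
    /^Total:/   { total   = $2 + 0 }   # "1"      -> 1
    /^Runtime:/ { runtime = $2 + 0 }   # "0.041s" -> 0.041 (numeric prefix)
    END { if (runtime > 0) printf "%.1f sent/s\n", total / runtime }
  '
}

# Metrics copied from the sample output above.
sentences_per_second <<'EOF'
Average: 24.4 sent/s
Total: 1 sent
Runtime: 0.041s
Execution time: 1.845 seconds
EOF
```

A helper like this makes it straightforward to collect the same figure from several runs when comparing models or environments.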

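Information like this is invaluable when making performance comparisons:

  • between two or more models (load-time and run-time performance)
  • between two or more environments or configurations
  • between applications doing the same NLP action, put together using different tech stacks (including different languages)
  • when looking for correlations between different corpora of text data processed (quantitative and qualitative comparisons)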

Empirical example

The BetterNLP library, written in Python, does something similar; see the Kaggle kernels Better NLP Notebook and Better NLP Summarisers Notebook (search for time_in_secs inside both notebooks to see the metrics reported).

Personally, I find this quite inspiring, and it validates that this is a useful feature (or action) to offer to the end user.

Other concepts, libraries and tools

There are other Java/JVM based NLP libraries mentioned in the Resources section below; for brevity we won’t cover them. The links provided will lead to further information for your own pursuit.

Within the Apache OpenNLP tool itself, we have only covered command-line access and not the Java bindings. In addition, we haven’t gone through all the NLP concepts or features of the tool; again for brevity, we have only covered a handful of them. But the documentation and the resources on the GitHub repo should help in further exploration.

You can also find out how to build the docker image for yourself, by examining the docker-runner script.

Conclusion

After going through the above, we can conclude the following about the Apache OpenNLP tool by exploring its pros and cons:

Pros

  • The API is easy to use and understand
  • Shallow learning curve and detailed documentation with lots of examples
  • Covers a lot of NLP functionality; there’s more in the docs to explore than we did above
  • Easy shell scripts and Apache OpenNLP scripts have been provided to play with the tool
  • Lots of resources available to learn more about NLP (see the Resources section below)
  • Resources provided to quickly get started and explore the Apache OpenNLP tool

Cons

  • Looking at the GitHub repo, development seems slow or to have stagnated (the last two commits are far apart, i.e. May 2019 and Oct 15, 2019)
  • A few models are missing when going through the examples in the documentation (manual)
  • The current models provided may need further training as per your use case(s), see this tweet:

Resources

Apache OpenNLP

About me

Mani Sarkar is a passionate developer mainly in the Java/JVM space, currently strengthening teams and helping them accelerate when working with small teams and startups, as a freelance software/data/ML engineer.

Twitter: @theNeomatrix369 | GitHub: @neomatrix369

Originally published at https://blog.valohai.com.

from: https://dev.to//neomatrix369/exploring-nlp-concepts-using-apache-opennlp-5g3o

Reposted from blog.csdn.net/cunxiedian8614/article/details/105691213