【NLP】Play with Stanford NLP


PlayNLP on GitHub

A Powerful Parser with xinhuaFactoredSegmenting.ser.gz

#!/bin/bash

java    -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -encoding utf-8 \
    -outputFormat "penn,typedDependenciesCollapsed" \
    edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz \
    $1

The above command line can be used as a general-purpose utility to parse Chinese sentences.

note: the xinhuaFactoredSegmenting model performs word segmentation as part of parsing, so the input can be raw, unsegmented Chinese text.
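For example, a minimal sketch (parse.sh is just an assumed name for the script above; run it from the directory that holds the Stanford parser jars, since the classpath is "*"):

# the input file is plain UTF-8 text, one raw (unsegmented) sentence per line
echo '目前,《新华日报》国内外总发行量40万份' > /tmp/2.txt
sh parse.sh /tmp/2.txt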

example input and output:

目前,《新华日报》国内外总发行量40万份
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: /tmp/2.txt
Parsing [sent. 1 len. 11]: 目前 , 《 新华 日报 》 国内外 总 发行量 40万 份
(ROOT
  (IP
    (NP (NT 目前))
    (PU ,)
    (NP (PU 《) (NR 新华) (NN 日报) (PU 》))
    (NP
      (NP (NN 国内外))
      (ADJP (JJ 总))
      (NP (NN 发行量)))
    (VP
      (QP (CD 40万)
        (CLP (M 份))))))

nmod:tmod(40万-10, 目前-1)
punct(40万-10, ,-2)
punct(日报-5, 《-3)
compound:nn(日报-5, 新华-4)
nmod:topic(40万-10, 日报-5)
punct(日报-5, 》-6)
compound:nn(发行量-9, 国内外-7)
amod(发行量-9, 总-8)
nsubj(40万-10, 发行量-9)
root(ROOT-0, 40万-10)
mark:clf(40万-10, 份-11)

Parsed file: /tmp/2.txt [1 sentences].
Parsed 11 words in 1 sentences (9.58 wds/sec; 0.87 sents/sec).

Segment with Custom Dictionary

Custom Dictionary

java -mx1g -cp seg.jar edu.stanford.nlp.ie.crf.CRFClassifier \
     -sighanCorporaDict data \
     -loadClassifier data/ctb.gz \
     -testFile preprocess-$1.txt \
     -inputEncoding UTF-8 \
     -sighanPostProcessing true \
     -serDictionary data/dict-chris6.ser.gz,data/cedict.txt,data/ntusd.txt \
     -keepAllWhitespaces false > $1_seged.txt

Check the segment.sh in stanford-segmenter-3.8.0.zip; the command line below and the full (modified) script that follows are based on it.
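To inspect the stock script before modifying it, a minimal sketch (the name of the extracted directory may differ):

unzip stanford-segmenter-3.8.0.zip
cat stanford-segmenter-*/segment.sh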

command-line:

java -mx2g -cp $BASEDIR/*: edu.stanford.nlp.ie.crf.CRFClassifier \
     -sighanCorporaDict ./data \
     -textFile shit.txt \
     -inputEncoding UTF-8 \
     -sighanPostProcessing true \
     -keepAllWhitespaces false \
     -loadClassifier ./data/ctb.gz \
     -serDictionary ./data/dict-chris6.ser.gz,./names.txt
#!/bin/sh

usage() {
  echo "Usage: $0 [ctb|pku] filename encoding kBest" >&2
  echo "  ctb : use Chinese Treebank segmentation" >&2
  echo "  pku : Beijing University segmentation" >&2
  echo "  kBest: print kBest best segmenations; 0 means kBest mode is off." >&2
  echo >&2
  echo "Example: $0 ctb test.simp.utf8 UTF-8 0" >&2
  echo "Example: $0 pku test.simp.utf8 UTF-8 0" >&2
  exit
}

if [ $# -lt 4 -o $# -gt 5 ]; then
    usage
fi

ARGS="-keepAllWhitespaces false"
if [ $# -eq 5 -a "$1" = "-k" ]; then
        ARGS="-keepAllWhitespaces true"
        lang=$2
        file=$3
        enc=$4
        kBest=$5
else 
    if [ $# -eq 4 ]; then
        lang=$1
        file=$2
        enc=$3
        kBest=$4
    else
        usage   
    fi
fi

if [ $lang = "ctb" ]; then
    echo "(CTB):" >&2
elif [ $lang = "pku" ]; then
    echo "(PKU):" >&2
else
    echo "First argument should be either ctb or pku. Abort"
    exit
fi

echo -n "File: " >&2
echo $file >&2
echo -n "Encoding: " >&2
echo $enc >&2
echo "-------------------------------" >&2

BASEDIR=`dirname $0`
DATADIR=$BASEDIR/data
#LEXDIR=$DATADIR/lexicons
JAVACMD="java -mx2g -cp $BASEDIR/*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict $DATADIR -textFile $file -inputEncoding $enc -sighanPostProcessing true $ARGS"
DICTS=$DATADIR/dict-chris6.ser.gz,./names.txt
KBESTCMD=""

if [ $kBest != "0" ]; then
    KBESTCMD="-kBest $kBest"
fi

if [ $lang = "ctb" ]; then
  $JAVACMD -loadClassifier $DATADIR/ctb.gz -serDictionary $DICTS $KBESTCMD
elif [ $lang = "pku" ]; then
  $JAVACMD -loadClassifier $DATADIR/pku.gz -serDictionary $DICTS $KBESTCMD
fi

Note the modified line, which appends the custom dictionary names.txt:

DICTS=$DATADIR/dict-chris6.ser.gz,./names.txt

demo:

$ cat names.txt 
哈马尼克斯
啊部
阿三的
猫
跳
上
树枝
黑色的
$ cat shit.txt 
哈马尼克斯啊部阿三的。
$ sh segment.sh ctb shit.txt UTF-8 0
(CTB):
File: shit.txt
Encoding: UTF-8
-------------------------------
Invoked on Tue Aug 08 07:16:58 CST 2017 with arguments: -sighanCorporaDict ./data -textFile shit.txt -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier ./data/ctb.gz -serDictionary ./data/dict-chris6.ser.gz,./names.txt
serDictionary=./data/dict-chris6.ser.gz,./names.txt
loadClassifier=./data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=shit.txt
sighanPostProcessing=true
keepAllWhitespaces=false
Loading Chinese dictionaries from 2 files:
  ./data/dict-chris6.ser.gz
  ./names.txt
  ./names.txt: 8 entries
Done. Unique words in ChineseDictionary is: 423204.
Loading classifier from ./data/ctb.gz ... done [20.0 sec].
Loading character dictionary file from ./data/dict/character_list [done].
Loading affix dictionary from ./data/dict/in.ctb [done].
哈马尼克斯 啊部 阿三的 。
CRFClassifier tagged 11 words in 1 documents at 81.48 words per second.
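To add more entries later, append them to names.txt, one UTF-8 word per line, and re-run the segmenter. A minimal sketch (the appended word is only a placeholder):

echo '新词条' >> names.txt
sh segment.sh ctb shit.txt UTF-8 0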

Train a Custom Parser with a Corpus in Penn Treebank Format

corpus

$ cat train.txt 
(ROOT
  (IP
    (NP (NN 哈马尼克斯))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 阿三的)))
    (. 。)))

nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)

(ROOT
  (IP
    (NP
      (ADJP (JJ 黑色的))
      (NP (NN 猫)))
    (VP
      (ADVP (VBD 跳))
      (VP (IN 上)
        (NP (NN 树枝))))
    (. 。)))

amod(猫-2, 黑色的-1)
nsubj(树枝-5, 猫-2)
advmod(树枝-5, 跳-3)
dep(树枝-5, 上-4)
root(ROOT-0, 树枝-5)
dep(树枝-5, 。-6)

train.sh

$ cat train.sh 
#!/bin/bash

java  -cp "*"  -mx800m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -evals "factDA,tsv" \
    -chineseFactored -PCFG -hMarkov 1 -nomarkNPconj -compactGrammar 0 \
    -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams \
    -PCFG \
    -chinesePCFG \
    -saveToSerializedFile ./trained.ser.gz \
    -maxLength 40 \
    -encoding utf-8 \
    -train $1 \
    -test $1

command-line:

sh train.sh train.txt

output:

$ sh train.sh train.txt
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
done [read 12 trees]. Time elapsed: 0 ms
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType true
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
forceCNF false
doPCFG true
doDep false
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags false
nPrune false
Using ChineseTreebankParserParams chineseSplitDouHao=false chineseSplitPunct=true chineseSplitPunctLR=true markVVsisterIP=true markVPadjunct=true chineseSplitVP=0 mergeNNVV=false unaryIP=false unaryCP=false paRootDtr=false markPsisterIP=false markIPsisterVVorP=true markADgrandchildOfIP=false gpaAD=false markIPsisterBA=true markNPmodNP=true markNPconj=false markMultiNtag=false markIPsisDEC=false markIPconj=false markIPadjsubj=false markPostverbalP=false markPostverbalPP=false baseNP=false headFinder=levy discardFrags=false dominatesV=false
done. Time elapsed: 35 ms
done. Time elapsed: 22 ms
done. Time elapsed: 32 ms
done Time elapsed: 0 ms
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType true
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
forceCNF false
doPCFG true
doDep false
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags false
nPrune false
Using ChineseTreebankParserParams chineseSplitDouHao=false chineseSplitPunct=true chineseSplitPunctLR=true markVVsisterIP=true markVPadjunct=true chineseSplitVP=0 mergeNNVV=false unaryIP=false unaryCP=false paRootDtr=false markPsisterIP=false markIPsisterVVorP=true markADgrandchildOfIP=false gpaAD=false markIPsisterBA=true markNPmodNP=true markNPconj=false markMultiNtag=false markIPsisDEC=false markIPconj=false markIPadjsubj=false markPostverbalP=false markPostverbalPP=false baseNP=false headFinder=levy discardFrags=false dominatesV=false
Parsing [len. 4]: 哈马尼克斯 啊部 阿三的 。
(ROOT
  (IP
    (NP (NN 哈马尼克斯))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 阿三的)))
    (. 。)))

 P: 100.0 R: 100.0
pcfg LP/LR F1: 100.0 N: 1.0
 P: 100.0 R: 100.0
factor LP/LR F1: 100.0 N: 1.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 1.0

Parsing [len. 1]: 哈马尼克斯-1
(ROOT (FRAG 哈马尼克斯-1))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 2.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 2.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 2.0

Parsing [len. 1]: 啊部-2
(ROOT (FRAG 啊部-2))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 3.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 3.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 3.0

Parsing [len. 1]: 阿三的-3
(ROOT (FRAG 阿三的-3))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 4.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 4.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 4.0

Parsing [len. 1]: 。-4
(ROOT (FRAG 。-4))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 5.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 5.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 5.0

Parsing [len. 6]: 黑色的 猫 跳 上 树枝 。
(ROOT
  (IP
    (NP
      (ADJP (JJ 黑色的))
      (NP (NN 猫)))
    (VP
      (ADVP (VBD 跳))
      (VP (IN 上)
        (NP (NN 树枝))))
    (. 。)))

 P: 100.0 R: 100.0
pcfg LP/LR F1: 100.0 N: 6.0
 P: 100.0 R: 100.0
factor LP/LR F1: 100.0 N: 6.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 6.0

Parsing [len. 1]: 黑色的-1
(ROOT (FRAG 黑色的-1))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 7.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 7.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 7.0

Parsing [len. 1]: 猫-2
(ROOT (FRAG 猫-2))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 8.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 8.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 8.0

Parsing [len. 1]: 跳-3
(ROOT (FRAG 跳-3))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 9.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 9.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 9.0

Parsing [len. 1]: 上-4
(ROOT (FRAG 上-4))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 10.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 10.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 10.0

Parsing [len. 1]: 树枝-5
(ROOT (FRAG 树枝-5))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 11.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 11.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 11.0

Parsing [len. 1]: 。-6
(ROOT (FRAG 。-6))

 P: 0.0 R: 0.0
pcfg LP/LR F1: 0.0 N: 12.0
 P: 0.0 R: 0.0
factor LP/LR F1: 0.0 N: 12.0
 P: 100.0 R: 100.0
factor Tag F1: 100.0 N: 12.0

pcfg LP/LR summary evalb: LP: 100.0 LR: 61.9 F1: 76.47 Exact: 16.66 N: 12
dep DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor LP/LR summary evalb: LP: 100.0 LR: 61.9 F1: 76.47 Exact: 16.66 N: 12
factor DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor Tag summary evalb: LP: 100.0 LR: 100.0 F1: 100.0 Exact: 100.0 N: 12
factF1  factDA  factEx  pcfgF1  depDA   factTA  num
76.47       16.67   76.47       100.00  12

Fallback: Parse a Manually Tagged Sentence

If you try to parse a sentence the model has never seen, like this one:

哈马尼克斯啊部阿三的。

then the parse may fail or come out wrong.

But if you POS-tag the sentence first:

$ cat shit.txt

哈马尼克斯/NN 啊部/VBD 阿三的/NN 。/.

and assume that these tags capture the intended reading of the sentence.

Now use the following script, parse-pre-tagged.sh, to parse the pre-tagged sentence:

#!/bin/bash
java    -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
        -encoding utf-8 \
        -sentences newline \
        -tokenized \
        -tagSeparator / \
        -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer \
        -tokenizerMethod newCoreLabelTokenizerFactory \
        -outputFormat "penn,typedDependenciesCollapsed" \
        edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz \
        $1

Note: here we use the PCFG model chinesePCFG.ser.gz.

demo:

sh parse-pre-tagged.sh shit.txt

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: shit.txt
Parsing [sent. 1 len. 4]: 哈马尼克斯 啊部 阿三的 。
(ROOT
  (IP
    (NP (NN 哈马尼克斯))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 阿三的)))
    (. 。)))

nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)

Parsed file: shit.txt [1 sentences].
Parsed 4 words in 1 sentences (10.78 wds/sec; 2.70 sents/sec).

Save the Words to the Custom Dictionary

While parsing the manually tagged sentence, the words (tokens) should also be added to the custom dictionary for future use.
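A minimal sketch of doing that from the pre-tagged file, assuming the word/TAG format shown above:

# split tokens on spaces, strip the /TAG suffix, and append the unique
# words to the custom dictionary, one word per line (note that the
# punctuation token 。 gets appended as well)
tr ' ' '\n' < shit.txt | sed 's|/[^/]*$||' | sort -u >> names.txt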

Use the Manually Parsed PTB Trees to Train a Parser

With parse-pre-tagged.sh, we obtained a small corpus in PTB format:

$ cat corpus.txt

(ROOT
  (IP
    (NP (NN 哈马尼克斯))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 阿三的)))
    (. 。)))

nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)

(ROOT
  (IP
    (NP
      (ADJP (JJ 黑色的))
      (NP (NN 猫)))
    (VP
      (ADVP (VBD 跳))
      (VP (IN 上)
        (NP (NN 树枝))))
    (. 。)))

amod(猫-2, 黑色的-1)
nsubj(树枝-5, 猫-2)
advmod(树枝-5, 跳-3)
dep(树枝-5, 上-4)
root(ROOT-0, 树枝-5)
dep(树枝-5, 。-6)

view train.sh

#!/bin/bash

java  -cp "*"  -mx800m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -evals "factDA,tsv" \
    -chineseFactored -PCFG -hMarkov 1 -nomarkNPconj -compactGrammar 0 \
    -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams \
    -PCFG \
    -chinesePCFG \
    -saveToSerializedFile ./trained.ser.gz \
    -maxLength 40 \
    -encoding utf-8 \
    -train $1 \
    -test $1

demo of training as follows:

sh train.sh corpus.txt

After training, we got trained.ser.gz.

Use this model to parse new sentences built from the same vocabulary, such as:

$ cat test.txt

output:

猫 啊部 阿三的 。
猫 啊部 树枝 。
树枝 跳 上 哈马尼克斯 。
$ cat parse-with-model.sh

output:

#!/bin/bash
java    -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -encoding utf-8 \
    -outputFormat "penn,typedDependenciesCollapsed" \
    ./trained.ser.gz \
    $1

demo:

sh parse-with-model.sh test.txt

output:

Parsing file: ./test.txt
Parsing [sent. 1 len. 4]: 猫 啊部 阿三的 。
(ROOT
  (IP
    (NP (NN 猫))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 阿三的)))
    (. 。)))

nsubj(阿三的-3, 猫-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)

Parsing [sent. 2 len. 4]: 猫 啊部 树枝 。
(ROOT
  (IP
    (NP (NN 猫))
    (VP
      (ADVP (VBD 啊部))
      (VP (NN 树枝)))
    (. 。)))

nsubj(树枝-3, 猫-1)
advmod(树枝-3, 啊部-2)
root(ROOT-0, 树枝-3)
dep(树枝-3, 。-4)

Parsing [sent. 3 len. 5]: 树枝 跳 上 哈马尼克斯 。
(ROOT
  (IP
    (NP (NN 树枝))
    (VP
      (ADVP (VBD 跳))
      (VP (IN 上)
        (NP (NN 哈马尼克斯))))
    (. 。)))

nsubj(哈马尼克斯-4, 树枝-1)
advmod(哈马尼克斯-4, 跳-2)
dep(哈马尼克斯-4, 上-3)
root(ROOT-0, 哈马尼克斯-4)
dep(哈马尼克斯-4, 。-5)

Parsed file: ./test.txt [3 sentences].
Parsed 13 words in 3 sentences (61.03 wds/sec; 14.08 sents/sec).

Parse a Non-segmented Sentence with the Custom Model

If your input sentence is not segmented, first segment it with the segmenter and custom dictionary described above.

The output of the segmenter can then be used as the input to parse-with-model.sh.
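A minimal sketch of the full pipeline (raw.txt is a hypothetical file of raw, unsegmented UTF-8 text):

# segment with the CTB model plus the custom dictionary;
# the segmented text goes to stdout, the log goes to stderr
sh segment.sh ctb raw.txt UTF-8 0 > segmented.txt
# parse the segmented output with the custom-trained model
sh parse-with-model.sh segmented.txt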
