Stanford CoreNLP Shift-Reduce Parser for Japanese (日本語係り受け解析)

Chart-based Japanese parsers already exist, so why bother? Because shift-reduce parsing is much faster, and Stanford's implementation is accurate.

Build

> ./gradlew build
> ./gradlew copyRuntimeLibs

Prepare

Annotate Japanese sentences with word boundaries and part-of-speech (POS) tags. For example, use KyTea:

> kytea -notag 2 < sentence_file.txt > tagged_sentence_file.txt

Note that words must be delimited by half-width spaces, and each word must be followed by its POS tag, with a slash as the separator:

すもも/名詞 も/助詞 もも/名詞 も/助詞 もも/名詞 の/助詞 うち/名詞
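
Before parsing, it can help to check that every token really follows the `word/POS` convention. The following is a minimal sketch, not part of this repository: the class name `TaggedLineCheck` is hypothetical, and splitting each token at its last slash is an assumption about how a word containing a slash should be handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this repository): checks the word/POS input format. */
final class TaggedLineCheck {

    /** Splits one line into {word, POS} pairs and throws if any token is malformed. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        // Split on half-width spaces and tabs only. A full-width space (U+3000)
        // is not a delimiter, so it stays inside a token and is rejected below.
        for (String token : trimmed.split("[ \\t]+")) {
            // Assumption: the POS tag follows the last slash in the token.
            int slash = token.lastIndexOf('/');
            boolean missingPart = slash <= 0 || slash == token.length() - 1;
            boolean hasFullWidthSpace = token.indexOf('\u3000') >= 0;
            if (missingPart || hasFullWidthSpace) {
                throw new IllegalArgumentException("Malformed token: " + token);
            }
            pairs.add(new String[] {token.substring(0, slash), token.substring(slash + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints one word and its POS tag per line, separated by a tab.
        for (String[] pair : parse("すもも/名詞 も/助詞 もも/名詞")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

A token with no slash, an empty word, an empty tag, or an embedded full-width space raises `IllegalArgumentException`, which catches the most common formatting slips before the file reaches the parser.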

Parse

Make sure the classpath and the model path are correct.

> java -cp build/libs/yaraku-nlp-0.1.jar:lib/* \
    com.yaraku.nlp.parser.shiftreduce.demo.JapaneseShiftReduceParserDemo \
    -model ja.beam.rightmost.model.ser.gz \
    < tagged_sentence_file.txt > parsed_sentence_file.txt

For example, given the input

すもも/名詞 も/助詞 もも/名詞 も/助詞 もも/名詞 の/助詞 うち/名詞

the output should look like

(ROOT (名詞P (助詞P (助詞P (名詞 すもも) (助詞 も)) (助詞P (名詞 もも) (助詞 も))) (名詞P (助詞P (名詞 もも) (助詞 の)) (名詞 うち))))
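
The output is a bracketed tree in which each symbol directly after an opening parenthesis is a label and every other symbol is a word. As a quick sanity check, the words can be read back from the tree and compared with the input. The sketch below is a dependency-free illustration, not code from this repository: the class name `BracketedTreeLeaves` is hypothetical, and it assumes words never contain parentheses or whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this repository): reads words back from a bracketed tree. */
final class BracketedTreeLeaves {

    /** Returns the leaf words of a bracketed tree in left-to-right order. */
    static List<String> leaves(String tree) {
        List<String> words = new ArrayList<>();
        // Pad parentheses with spaces so that each one becomes its own token.
        String[] tokens = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        int depth = 0;
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.equals("(")) {
                depth++;
            } else if (token.equals(")")) {
                depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("Unbalanced ')'");
                }
            } else if (i > 0 && !tokens[i - 1].equals("(")) {
                // A symbol right after "(" is a label; any other symbol is a word.
                words.add(token);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unbalanced '('");
        }
        return words;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (名詞P (助詞P (名詞 もも) (助詞 の)) (名詞 うち)))";
        // Prints: もも の うち
        System.out.println(String.join(" ", leaves(tree)));
    }
}
```

If the recovered words differ from the words in the tagged input, the input line was probably split or tagged differently from what was intended.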

Train

If you must...

  1. Get a Japanese treebank such as the Japanese Dependency Corpus (JDC).
  2. Prepare the trees in Penn Treebank S-expression format.
  3. Build a model with training and development sets, e.g. JDC/cfg/train/all.cfg and JDC/cfg/dev/all.cfg:
> java -cp build/libs/yaraku-nlp-0.1.jar:lib/* \
    edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser \
    -headFinder com.yaraku.nlp.trees.RightHeadFinder \
    -trainTreebank JDC/cfg/train/all.cfg \
    -devTreebank JDC/cfg/dev/all.cfg \
    -serializedPath ja.beam.rightmost.model.ser.gz \
    -trainingThreads 8 \
    -trainingIterations 60 \
    -stalledIterationLimit 20 \
    -trainingMethod REORDER_BEAM \
    -trainBeamSize 4 \
    -randomSeed 31337

Acknowledgement