LING 581: Advanced Computational Linguistics Lecture Notes February 16th.

Bikel Collins
From treebank search to stochastic parsers trained on the WSJ Penn treebank
Java re-implementation of Collins’ parser
Paper
– Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), pp. 479-511.
– http://www.cis.upenn.edu/~dbikel/papers/collins-intricacies.pdf
Software
– http://www.cis.upenn.edu/~dbikel/software.html#stat-parser

Bikel Collins
The wrapper is syntactic sugar for various commands
Scripting language is Tcl/Tk (“tickle T K”)
Assume variables
– set prefix "/Users/sandiway/research/"
– set dbprefix "$prefix/dbparser"
– set tbvprefix "/Applications/treebankviewer.app/Contents/MacOS"
POS tagging (MXPOST, in directory jmx)
– $prefix/jmx/mxpost $prefix/jmx/tagger.project /tmp/err.txt
Parsing
– $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt 2>@ stdout
Training
– $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg 2>@ stdout

Bikel Collins
POS tagging (MXPOST, in directory jmx)
– tagger_input
– $prefix/jmx/mxpost $prefix/jmx/tagger.project /tmp/err.txt
Parsing
– set ddf "wsj-02-21.obj.gz"
– set properties "collins.properties"
– parser_input
– $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt 2>@ stdout
Training
– set mrg "wsj-02-21.mrg"
– set properties "collins.properties"
– $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg 2>@ stdout
Unix file descriptors
0 Standard input (stdin)
1 Standard output (stdout)
2 Standard error (stderr)
GUI components
    frame .input
    text .input.t -height 4 -yscrollcommand {.input.s set}
    scrollbar .input.s -command {.input.t yview}
    frame .tagged
    text .tagged.t -height 9 -yscrollcommand {.tagged.s set}
    scrollbar .tagged.s -command {.tagged.t yview}
Code
    # Write the contents of the input text widget to /tmp/test.txt for the tagger
    proc tagger_input {} {
        set lines [.input.t get 1.0 end]
        set infile [open "/tmp/test.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }
    # Write the contents of the tagged text widget to /tmp/test2.txt for the parser
    proc parser_input {} {
        set lines [.tagged.t get 1.0 end]
        set infile [open "/tmp/test2.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }
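
The `2>@ stdout` at the end of the parse and train commands sends file descriptor 2 (stderr) to the same channel as stdout, so the wrapper sees progress messages and errors in one stream. As a rough illustration of the same idea outside Tcl, here is a minimal Python sketch. The helper name `run_merged` and the toy child process are assumptions made for this example and are not part of the wrapper.

```python
import subprocess
import sys

def run_merged(cmd):
    """Run cmd and return stdout and stderr as one combined string."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,      # capture descriptor 1
        stderr=subprocess.STDOUT,    # point descriptor 2 at the same pipe
        text=True,
    )
    return result.stdout

# Toy child process (hypothetical): writes one line to each descriptor.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]
output = run_merged(child)
print(output)
```

Both lines end up in `output`; their relative order depends on buffering in the child, so the sketch makes no claim about it.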

Bikel Collins
There’s also a simple tree viewer I wrote, but it may not run on your system…

Bikel Collins
Relevant files and directories
bikeldemo
– wrapper2.tcl (prefix set to /Users/sandiway)
jmx
– mxpost (shell script)
– mxpost.jar (Java code)
dbparser
– dbparser/bin/parse (shell script)
– dbparser/bin/train (shell script)
– dbparser/dbparser.jar (Java code)
– dbparser/userguide/guide.pdf

EVALB
How to evaluate parsing accuracy?
– count bracketing matches
– (LR) Bracketing recall = (number of correct constituents) / (number of constituents in the gold file)
– (LP) Bracketing precision = (number of correct constituents) / (number of constituents in the parsed file)
Program called evalb
– http://nlp.cs.nyu.edu/evalb/
– written in C
– get it to compile on your system (Makefile)
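
To make the two ratios concrete, here is a minimal Python sketch that counts labeled constituents in a gold tree and a parsed tree and computes LP and LR from them. It is not evalb itself: the function names `constituents` and `bracket_scores` are made up for this example, it assumes every bracket carries a label, it skips POS-tag brackets over single words, and it ignores the extra normalization that evalb applies through its parameter file.

```python
from collections import Counter

def constituents(tree):
    """Count (label, start, end) spans in a Penn-style bracketed tree.

    Assumes every "(" is followed by a label. Brackets that only cover
    a word (POS tags) are not counted as constituents.
    """
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []   # entries: [label, start position, has a bracketed child]
    pos = 0      # number of words read so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append([tokens[i + 1], pos, False])
            i += 2
        elif tok == ")":
            label, start, has_child = stack.pop()
            if has_child:
                spans[(label, start, pos)] += 1
            if stack:
                stack[-1][2] = True
            i += 1
        else:
            pos += 1   # a word
            i += 1
    return spans

def bracket_scores(gold, parsed):
    """Return (precision, recall) over labeled constituents."""
    g, p = constituents(gold), constituents(parsed)
    correct = sum((g & p).values())   # multiset intersection
    return correct / sum(p.values()), correct / sum(g.values())

gold = "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"
parsed = "(S (NP (DT the) (NN dog)) (VP (VP (VBZ barks))))"
lp, lr = bracket_scores(gold, parsed)
print(f"LP = {lp:.2f}, LR = {lr:.2f}")   # LP = 0.75, LR = 1.00
```

The parsed tree contains one extra VP bracket, so all 3 gold constituents are found (recall 3/3) while only 3 of the 4 proposed constituents are correct (precision 3/4).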

EVALB
– http://www.aclweb.org/anthology-new/H/H91/H91-1060.pdf

Homework Task 2
Part 1
– Run the examples you showed on your slides from Homework Task 1 using the Bikel Collins parser.
– Evaluate how close the parses are to the “gold standard”.
Part 2
– WSJ corpus: sections 00 through 24
– Evaluation: on section 23
– Training: normally 02-21 (20 sections)
– How does the accuracy of the Bikel Collins parser vary if you randomly pick 1, 2, 3, …, 20 sections to train on? Use evalb to score each run and plot the results as a graph.
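
One possible way to set up the random picks for Part 2 is sketched below in Python. It assumes the sections are drawn from the usual training range 02-21 (so the evaluation section 23 is never trained on) and that a fixed seed is wanted so the experiment can be repeated. The names `TRAINING_SECTIONS` and `sample_training_sets` are made up for this sketch; building the .mrg files, training, and running evalb are left out.

```python
import random

# Assumption: candidate sections are the standard training range 02..21.
TRAINING_SECTIONS = [f"{n:02d}" for n in range(2, 22)]

def sample_training_sets(seed=0):
    """Map each size k = 1..20 to a random choice of k training sections."""
    rng = random.Random(seed)   # fixed seed keeps the picks reproducible
    return {
        k: sorted(rng.sample(TRAINING_SECTIONS, k))
        for k in range(1, len(TRAINING_SECTIONS) + 1)
    }

plan = sample_training_sets()
for k, sections in plan.items():
    print(k, " ".join(sections))
```

Each row gives the sections to concatenate into one training file for that run; the evalb score for each run then supplies one point on the graph.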
