LING 581: Advanced Computational Linguistics Lecture Notes February 16th
Bikel Collins
From treebank search to stochastic parsers trained on the WSJ Penn Treebank
– Java re-implementation of Collins’ parser
– Paper: Daniel M. Bikel, Intricacies of Collins’ Parsing Model, in Computational Linguistics, 30(4)
  – intricacies.pdf
– Software
  – parser
Bikel Collins
The wrapper is syntactic sugar for various commands
Scripting language is Tcl/Tk (“tickle T K”)
Assume variables
– set prefix "/Users/sandiway/research/"
– set dbprefix "$prefix/dbparser"
– set tbvprefix "/Applications/treebankviewer.app/Contents/MacOS"
POS tagging (MXPOST, in directory jmx)
– $prefix/jmx/mxpost $prefix/jmx/tagger.project /tmp/err.txt
Parsing
– $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt stdout
Training
– $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg stdout
Bikel Collins
POS tagging (MXPOST, in directory jmx)
– tagger_input
– $prefix/jmx/mxpost $prefix/jmx/tagger.project /tmp/err.txt
Parsing
– set ddf "wsj obj.gz"
– set properties "collins.properties"
– parser_input
– $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt stdout
Training
– set mrg "wsj mrg"
– set properties "collins.properties"
– $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg stdout

Unix file descriptors
– 0 Standard input (stdin)
– 1 Standard output (stdout)
– 2 Standard error (stderr)

GUI components
    frame .input
    text .input.t -height 4 -yscrollcommand {.input.s set}
    scrollbar .input.s -command {.input.t yview}
    frame .tagged
    text .tagged.t -height 9 -yscrollcommand {.tagged.s set}
    scrollbar .tagged.s -command {.tagged.t yview}

Code
    # Save the contents of the input text widget to /tmp/test.txt for the tagger
    proc tagger_input {} {
        set lines [.input.t get 1.0 end]
        set infile [open "/tmp/test.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }

    # Save the contents of the tagged text widget to /tmp/test2.txt for the parser
    proc parser_input {} {
        set lines [.tagged.t get 1.0 end]
        set infile [open "/tmp/test2.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }
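The three file descriptors listed above are what let the wrapper feed a saved text file to a tool, read its result, and keep its diagnostics apart. The following is a minimal Python sketch of that pattern, not the wrapper's own code: the child process is a hypothetical stand-in (it is not mxpost), and the temporary file names are chosen only for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a tagger: upper-cases stdin to stdout, logs to stderr.
child_code = (
    "import sys\n"
    "text = sys.stdin.read()\n"
    "sys.stdout.write(text.upper())\n"
    "sys.stderr.write('read %d characters\\n' % len(text))\n"
)

workdir = Path(tempfile.mkdtemp())
in_path = workdir / "test.txt"
err_path = workdir / "err.txt"

# Trim trailing whitespace before saving, as [string trimright $lines] does in the Tcl procs.
in_path.write_text("the cat sat   \n\n".rstrip())

with open(in_path) as fin, open(err_path, "w") as ferr:
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stdin=fin,               # descriptor 0: input comes from the saved file
        stdout=subprocess.PIPE,  # descriptor 1: result is captured by the caller
        stderr=ferr,             # descriptor 2: diagnostics go to a separate file
        text=True,
        check=True,
    )

tagged = result.stdout
log = err_path.read_text()
print(tagged)
print(log, end="")
```

Keeping descriptor 2 in its own file means that progress messages from a tool never get mixed into the text that is passed on to the next stage.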
Bikel Collins
– There’s also a simple tree viewer I wrote, but it may not run on your system…
Bikel Collins
Relevant files and directories
bikeldemo
– wrapper2.tcl (prefix set to /Users/sandiway)
jmx
– mxpost (shell script)
– mxpost.jar (Java code)
dbparser
– dbparser/bin/parse (shell script)
– dbparser/bin/train (shell script)
– dbparser/dbparser.jar (Java code)
– dbparser/userguide/guide.pdf
EVALB
How to evaluate parsing accuracy?
– count bracketing matches
– (LR) Bracketing recall = (number of correct constituents) / (number of constituents in the goldfile)
– (LP) Bracketing precision = (number of correct constituents) / (number of constituents in the parsed file)
Program called evalb
– written in C
– get it to compile on your system (Makefile)
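The two ratios above can be worked through on a toy example. This is a minimal Python sketch, not evalb's implementation: it assumes each constituent is represented as a (label, start, end) tuple, a representation chosen here for illustration, and it ignores any further scoring conventions that evalb applies. The function name bracket_scores is hypothetical.

```python
from collections import Counter

def bracket_scores(gold, parsed):
    """Return (recall, precision) for two lists of (label, start, end) constituents."""
    gold_counts = Counter(gold)
    parsed_counts = Counter(parsed)
    # Multiset intersection: each gold constituent can be matched at most once.
    correct = sum((gold_counts & parsed_counts).values())
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(parsed) if parsed else 0.0
    return recall, precision

# Toy data (invented for illustration): 4 gold constituents, 5 parsed, 3 in common.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parsed = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

recall, precision = bracket_scores(gold, parsed)
print(f"LR = {recall:.2f}, LP = {precision:.2f}")  # LR = 0.75, LP = 0.60
```

Here three constituents match, so recall is 3/4 and precision is 3/5: the parser found most of the gold brackets but also proposed two that are not in the gold file.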
Homework Task 2
Part 1
– Run the examples you showed on your slides from Homework Task 1 using the Bikel Collins parser.
– Evaluate how close the parses are to the “gold standard”.
Part 2
– WSJ corpus: sections 00 through 24
– Evaluation: on section 23
– Training: normally (20 sections)
– How does the Bikel Collins parser vary in accuracy if you randomly pick 1, 2, 3, … 20 sections to do the training with? Plot a graph with evalb…
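One way to set up the Part 2 experiment is to fix the random choice of training sections in advance so that each run can be repeated. The sketch below is only a planning aid under stated assumptions: the candidate pool is assumed to be sections 02 through 21 (the slide says only "20 sections"), section 23 is kept out of the pool because it is the evaluation set, and the names TRAINING_POOL and sample_training_sections are hypothetical. It does not run the parser or evalb.

```python
import random

# Assumption: the pool of candidate training sections is 02 through 21.
# Section 23 is deliberately excluded because it is reserved for evaluation.
TRAINING_POOL = [f"{n:02d}" for n in range(2, 22)]

def sample_training_sections(k, seed):
    """Pick k distinct sections from the pool, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(TRAINING_POOL, k))

# One random subset for each training size 1, 2, 3, ..., 20.
plan = {k: sample_training_sections(k, seed=k) for k in range(1, 21)}

for k, sections in plan.items():
    print(k, " ".join(sections))
```

Each line of output lists the sections to concatenate into one training file; training on that file, parsing section 23, and scoring with evalb gives one point on the accuracy curve.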