Published by Isaac Harper. Modified over 9 years ago.
1
Conversion of Penn Treebank Data to Text
2
Penn TreeBank Project
“A Bank of Linguistic Trees” (as of 11/1992)
University of Pennsylvania, LINC Laboratory
4.5 million words of American English
Annotation of naturally-occurring text for linguistic structure
3
Tree Linguistic Components
Tokenization
–Treatment of punctuation, words, etc. as separate tokens
–Example: Children’s → Children ’s
Part-of-speech (POS) tagging
–Text first assigned POS tags automatically
–Human annotators correct first-pass POS tags
Bracketing
–Fidditch, a deterministic parser (Hindle 1983, 1989)
–Two-stage parsing process made explicit with brackets
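The tokenization step above can be illustrated with a minimal sketch. The `tokenize` function and its regular expression are hypothetical and greatly simplified; they only split off the possessive clitic and single punctuation marks, and are not the Treebank project's actual tokenizer.

```python
import re

# Hypothetical, simplified pattern (an assumption, not the Treebank tokenizer):
#   1. a possessive clitic ('s or ’s) becomes its own token
#   2. a word, optionally with internal periods or hyphens
#   3. any other single non-space character (punctuation)
_TOKEN_RE = re.compile(r"[\u2019']s\b|\w+(?:[.-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return _TOKEN_RE.findall(text)


print(tokenize("Children\u2019s toys."))
```

Real Treebank tokenization covers many more cases (contractions, abbreviations, quotation marks), so this is only meant to show the idea of treating the clitic and punctuation as separate tokens.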
4
Penn TreeBank: Brown Corpus (as of 11/1992)
POS Tags (Tokens): 1,172,041
Skeletal Parsing (Tokens): 1,172,041
5
You know you’re in trouble when …
Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
robertm@unagi.cis.upenn.edu
ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”
6
Tree Conversion: Clean Case
( END_OF_TEXT_UNIT ) ( (`` ``) (S (NP (PRP I) ) (VP (VBP leave) (NP (DT this) (NN church) ) (PP (IN with) (NP (DT a) (NN feeling) (SBAR (IN that) (S (NP (DT a) (JJ great) (NN weight) ) (AUX (VBZ has) ) (VP (VBN been) (VP (VBN lifted) (PP (IN off) (NP (PRP$ my) (NN heart) )))))))))) (,,) (S (NP (PRP I) ) (AUX (VBP have) ) (VP (VP (VBN left) (NP (PRP$ my) (NN grudge) ) (PP (IN at) (NP (DT the) (NN altar) ))) (CC and) (VP (VBN forgiven) (NP (PRP$ my) (NN neighbor) ))))) ('' '') (..) ) ( END_OF_TEXT_UNIT )
cb08_42 ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
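One way to turn a bracketed tree like this back into text is to collect its `(TAG word)` leaves in order. The sketch below is an illustration under stated assumptions, not the conversion script used for the presentation: the `leaves` function is a hypothetical name, and it assumes the tag and word are separated by whitespace, so fused entries such as `(,,)` and bare markers such as `( END_OF_TEXT_UNIT )` are skipped.

```python
import re

# Split a bracketing into "(", ")", and whitespace-delimited symbols.
_TREE_TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def leaves(tree_text):
    """Return (tag, word) pairs for each preterminal '(TAG word)' in order."""
    tokens = _TREE_TOKEN_RE.findall(tree_text)
    pairs = []
    # A leaf is the four-token window: "(", TAG, WORD, ")".
    for i in range(len(tokens) - 3):
        opener, tag, word, closer = tokens[i:i + 4]
        if (opener == "(" and closer == ")"
                and tag not in ("(", ")") and word not in ("(", ")")):
            pairs.append((tag, word))
    return pairs


fragment = "(PP (IN off) (NP (PRP$ my) (NN heart) ))"
print(" ".join(word for _, word in leaves(fragment)))
```

Joining the words with single spaces is only a first approximation; restoring the original spacing around punctuation and quotation marks needs additional rules.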
7
Tree Conversion: Problematic Case
( (S (NP (PRP He) ) (VP (VBD reported) (SBAR (IN that) (S (NP (NP (DT the) (NN city) ) (POS 's) (NNS contributions) (PP (IN for) (NP (NN animal) (NN care) ))) (VP (VBD included) (NP (NP ($ $) (CD 67,000) (PP (TO to) (NP (NP (DT the) (NNS Women) ) (POS 's) (NN S.P.C.A.) ))) (: ;) (: ;) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB pay) (NP (NP (CD six) (NNS policemen) ) (VP (VBN assigned) (PP (IN as) (NP (NN dog) (NNS catchers) ))))))) (CC and) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB investigate) (NP (NN dog) (NNS bites) )))))))))) (..) ) ( END_OF_TEXT_UNIT )
ca09_46 He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
Problem region (duplicated “;” token and null element): (NP (DT the) (NNS Women) ) (POS 's) (NN S.P.C.A.) ))) (: ;) (: ;) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB pay) (NP (NP (CD six) (NNS policemen) )
8
Summary of Problems Encountered
Typing Errors
–Punctuation duplication in data
Special notation for delimiter characters
–RRB, LRB, RSB, LSB, RCB, LCB
Special Null Elements
–( -NONE- ) * 0 T NIL **
Conventions for final output need to consider these lessons
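The three problem classes above suggest a cleanup pass over the `(tag, word)` leaves before text is emitted. The following is a minimal sketch under assumptions: `clean_leaves`, `BRACKETS`, and `PUNCT_TAGS` are hypothetical names, the bracket codes are mapped to the characters they conventionally stand for, and collapsing adjacent identical punctuation is a heuristic that could also remove a duplicate that was intended.

```python
# Assumed mapping from the delimiter codes to literal bracket characters.
BRACKETS = {"LRB": "(", "RRB": ")", "LSB": "[", "RSB": "]", "LCB": "{", "RCB": "}"}
# Assumed set of punctuation tags whose adjacent repeats are treated as errors.
PUNCT_TAGS = {",", ".", ":"}


def clean_leaves(pairs):
    """Drop null elements, restore brackets, collapse repeated punctuation."""
    cleaned = []
    for tag, word in pairs:
        if tag == "-NONE-":
            continue  # null elements have no surface text
        # Accept the code with or without surrounding hyphens.
        word = BRACKETS.get(word.strip("-"), word)
        if tag in PUNCT_TAGS and cleaned and cleaned[-1] == (tag, word):
            continue  # same punctuation token twice in a row
        cleaned.append((tag, word))
    return cleaned


sample = [("NN", "S.P.C.A."), (":", ";"), (":", ";"), ("$", "$"),
          ("CD", "15,000"), ("-NONE-", "T"), ("TO", "to"), ("VB", "pay")]
print(clean_leaves(sample))
```

Keeping the tags alongside the words until the last step makes it possible to tell a null element `T` apart from a real token with the same spelling.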
9
Future Recommendations
Put POS tree data into proper database
–Increases confidence in correctness of data
–Minimizes error
Spend more effort upfront *once* to clean data
SQL queries more reusable than (write-only) Perl scripts
–Script quality varies with graduate student ability
If DB option not available
–Avoid duplication of data in final output
–Avoid text delimiters that exist as data tokens (“ ‘, \s )
–Use thoughtful labeling conventions
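As a rough illustration of the database recommendation, the sketch below stores one row per token in SQLite and rebuilds a sentence with a reusable query. The `token` table, its columns, and both function names are assumptions made for this example; the presentation does not specify a schema.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database with one row per token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token ("
        " sentence_id TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " tag TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        # The key rejects a second token at the same position in a sentence.
        " PRIMARY KEY (sentence_id, position))"
    )
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)
    return conn


def sentence_words(conn, sentence_id):
    """Return the surface words of a sentence, skipping null elements."""
    cur = conn.execute(
        "SELECT word FROM token"
        " WHERE sentence_id = ? AND tag <> '-NONE-'"
        " ORDER BY position",
        (sentence_id,),
    )
    return [word for (word,) in cur]


rows = [("ca09_46", 0, "CD", "15,000"), ("ca09_46", 1, "-NONE-", "T"),
        ("ca09_46", 2, "TO", "to"), ("ca09_46", 3, "VB", "pay")]
conn = build_db(rows)
print(sentence_words(conn, "ca09_46"))
```

Because tags and words live in separate columns, no text delimiter can collide with a data token, and the null-element filter is a single `WHERE` clause.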