1
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17th
2
Today's Topics
– Tregex Homework 3 review
– Treebanks and statistical parsers
– Homework 4: install the Bikel-Collins Parser
3
Homework 3 review docs/prsguid1.pdf (page 252)
4
Homework 3 review
Using tregex, find all the small clauses in the WSJ corpus.
– Question 1: How many are there in total?
– Question 2: Would you expect the NP-PRD type to be the most numerous? Is it?
– Question 3: List all the different types and their frequencies (in descending order).
Show your work:
– i.e. show the tregex search terms you used
5
Homework 3 Review
Search: /^VP/ < ( /^S/ < /^NP-SBJ/ < /-PRD/)
– Note: /^S/ also matches subtagged clauses such as S-MNR (manner)
– Found: 1842 in total
– 746 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-PRD/)
– 938 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADJP-PRD/)
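If you prefer the command line to the Tregex GUI, the counts can be reproduced with the tregex.sh script that ships with Stanford Tregex. This is only a sketch: it assumes the WSJ trees live under TREEBANK_3/parsed/mrg/wsj, and you should check tregex.sh's usage message for the exact flags on your version.
# -C prints only the number of matches instead of the matching trees
./tregex.sh -C '/^VP/ < (/^S/ < /^NP-SBJ/ < /^NP-PRD/)' TREEBANK_3/parsed/mrg/wsj
./tregex.sh -C '/^VP/ < (/^S/ < /^NP-SBJ/ < /^ADJP-PRD/)' TREEBANK_3/parsed/mrg/wsj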
6
Homework 3 Review
1842 is approx. 1% of all the clauses in the WSJ Treebank (/^S/ matches 178,231 times).
158 is just under 10% of all the small clauses in the WSJ Treebank.
7
Homework 3 Review 57 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-PRD/)
8
Homework 3 Review 57 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-PRD/)
9
Homework 3 Review 101 is approx. 5.5% of all the small clauses in the WSJ Treebank
10
Homework 3 Review 39 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-PRD/)
11
Homework 3 Review 62 is approx. 3.4% of all the small clauses in the WSJ Treebank
12
Homework 3 Review 16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/) Subtag -TTL = Title
13
Homework 3 Review 16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/)
14
Homework 3 Review 29 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-LOC-PRD/)
15
Homework 3 Review 17 is approx. 0.9% of all the small clauses in the WSJ Treebank
16
Homework 3 Review 7 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^UCP-PRD/) – Tag UCP = Unlike Coordinated Phrase
17
Homework 3 Review 7 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^UCP-PRD/)
18
Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)
19
Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)
20
Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)
21
Homework 3 Review 5 is approx. 0.3% of all the small clauses in the WSJ Treebank
22
Homework 3 Review 1 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-TTL-PRD/) Subtag -TTL = Title
23
Homework 3 Review 3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^S-TTL-PRD/) Subtag -TTL = Title
24
Homework 3 Review 3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^S-TTL-PRD/) Subtag -TTL = Title
25
Homework 3 Review
1 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-DIR-PRD/)
– Subtag -DIR = Direction
– PP-DIR-PRD: frequency 5
26
Homework 3 Review
27
16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/)
3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-PRD-TTL/)
28
Homework 3 Review
29
Today's Topics
– Tregex Homework 3 review
– Treebanks and statistical parsers
– Homework 4: install the Bikel-Collins Parser
30
Relevance of Treebanks
Statistical parsers typically construct syntactic phrase structure
– they're trained on treebank corpora like the Penn Treebank
– Note: some use dependency graphs, not trees
31
Berkeley Parser – Note: no empty categories or subtags …
32
Stanford Parser
33
Statistical Parsers Don’t recover fully-annotated trees – not trained using nodes with indices or empty ( -NONE- ) nodes – not trained using functional tags, e.g. –SBJ, –PRD Therefore they don’t fully parse the PTB Berkeley parser HW3 -PRD
34
Statistical Parsers
docs/prsguid1.pdf (page 20): SBAR (sentence level above S)
35
Statistical Parsers
Example: no SBAR node in … a movie to see (Stanford parser output)
36
Parsers trained on the Treebank SBAR can be forced by the presence of an overt relative pronoun, but note there is no subject gap: -NONE-
37
Parsers trained on the Treebank
Probabilities are estimated from frequency information for each node given its surrounding context (e.g. the parent node, or the word that heads the node).
Still, these systems have enormous problems with prepositional phrase (PP) attachment.
Examples (borrowed from Igor Malioutov):
– A boy with a telescope kissed Mary on the lips
– Mary was kissed by a boy with a telescope on the lips
The PP with a telescope should adjoin to the noun phrase (NP) a boy; the PP on the lips should adjoin to the verb phrase (VP) headed by kissed.
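To try these sentences yourself, here is a minimal command-line sketch using the Stanford parser's LexicalizedParser main class. Run it from the unpacked stanford-parser directory; the classpath and model path are the standard ones in the distribution, but adjust them to your install.
echo "A boy with a telescope kissed Mary on the lips." > pp-examples.txt
echo "Mary was kissed by a boy with a telescope on the lips." >> pp-examples.txt
# -mx500m raises the Java heap; the englishPCFG model ships inside the models jar
java -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz pp-examples.txt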
38
Active/passive sentences
Examples using the Stanford Parser: both the active and the passive sentence are parsed incorrectly.
39
Active/passive sentences
Examples using the Stanford Parser: both the active and the passive sentence are parsed incorrectly.
40
Active/passive sentences
Examples of the misattachments: on the lips modifies Mary; on the lips modifies telescope (both incorrect)
41
Treebank Rules
– Just how many rules are there in the WSJ treebank?
– What's the most common POS tag?
– What's the most common syntax rule?
42
Penn Treebank
How many grammar rules are there in the treebank?
Total number of rules: 978,873
Number of different rules: 31,338
Top 5:
1. S → NP-SBJ VP
2. PP → IN NP
3. NP-SBJ → -NONE-
4. NP → NP PP
5. NP → DT NN
43
Treebank
44
Total # of rules: 978,873 # of different rules: 17,554
45
Treebank
46
Penn Treebank
How about POS tags?
Total number of tags: 1,253,013
Top 5:
1. NN (singular common noun)
2. IN (preposition)
3. NNP (singular proper noun)
4. DT (determiner)
5. JJ (adjective)
Category   Frequency   Percentage
N          355,039     28.3%
V          154,975     12.4%
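These POS-tag counts can be approximated directly from the bracketed .mrg files, since preterminal nodes look like (TAG word) with no nested brackets. A rough shell sketch, assuming the treebank path below; note that trace tags such as -NONE- are included unless you filter them out.
cat TREEBANK_3/parsed/mrg/wsj/*/*.mrg |
  grep -oE '\([^ ()]+ [^()]+\)' |         # pull out the "(TAG word)" leaves
  awk '{sub(/^\(/, "", $1); print $1}' |  # keep just the TAG
  sort | uniq -c | sort -rn | head -5     # five most frequent tags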
47
Using the Treebank
What is the grammar of the Treebank?
– We can extract the phrase structure rules used, and
– count the frequency of the rules, and construct a stochastic parser
48
Using the Treebank
Breakthrough in parsing accuracy with lexicalized trees
– think of expanding the nonterminal names to include head information, i.e. the words at the leaves of the subtrees (e.g. VP → VBD NP becomes VP(kissed) → VBD(kissed) NP(Mary))
49
Bikel Collins Parser
Java re-implementation of Collins' parser
Paper:
– Daniel M. Bikel. 2004. Intricacies of Collins' Parsing Model. Computational Linguistics, 30(4), pp. 479–511.
Software:
– http://www.cis.upenn.edu/~dbikel/software.html#stat-parser (page no longer exists)
50
Bikel Collins
Download and install Dan Bikel's parser
– dbp.zip (on course homepage)
– File: install.sh
– Java code, but at this point I think Windows won't work because of the shell script (.sh) – maybe after the files are extracted?
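A sketch of the install steps on macOS/Linux; the unpacked directory name is a guess, so use whatever dbp.zip actually creates.
unzip dbp.zip
cd dbparser        # or whatever directory dbp.zip unpacks to (assumption)
sh install.sh      # on Windows, try a Unix-like shell such as Cygwin or Git Bash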
51
Bikel Collins
Download and install the MXPOST POS tagger – actually, the parser doesn't need a separate tagger…
52
Bikel Collins
Training the parser with the WSJ PTB
– See guide: userguide/guide.pdf
– directory: TREEBANK_3/parsed/mrg/wsj
– sections 02-21: create one single .mrg file
– events: wsj-02-21.obj.gz
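A sketch of that training step; the heap size and settings-file name are the ones I recall from the guide, so confirm the exact arguments in userguide/guide.pdf (the brace expansion requires bash).
# concatenate sections 02-21 into one training file
cat TREEBANK_3/parsed/mrg/wsj/{02..21}/*.mrg > wsj-02-21.mrg
# first argument is the max Java heap in MB; training writes wsj-02-21.obj.gz
bin/train 800 settings/collins.properties wsj-02-21.mrg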
53
Bikel Collins Settings:
54
Bikel Collins
Parsing:
– Command
– Input file format (sentences)
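A sketch of the parsing command and the input format as I recall them; confirm both in userguide/guide.pdf before relying on them.
# sentences.txt: one sentence per line as a list of (word (POS)) pairs, e.g.
#   ((The (DT)) (boy (NN)) (kissed (VBD)) (Mary (NNP)) (. (.)))
bin/parse 400 settings/collins.properties wsj-02-21.obj.gz sentences.txt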
55
Bikel Collins Verify the trainer and parser work on your machine
56
Bikel Collins File: bin/parse is a shell script that sets up program parameters and calls java
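For orientation, this is roughly the shape such a wrapper takes; it is illustrative only, not the real bin/parse, so open the script itself to see the actual class name, classpath, and options it passes to java.
#!/bin/sh
# set JVM options, then hand everything else to the parser's main class
exec java -Xmx500m -cp dbparser.jar danbikel.parser.Parser "$@"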
57
Bikel Collins
58
File: bin/train is another shell script
59
Bikel Collins Relevant WSJ PTB files