1 LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17th

2 Today's Topics
– Tregex homework 3 review
– Treebanks and statistical parsers
– Homework 4: install the Bikel-Collins Parser

3 Homework 3 review docs/prsguid1.pdf (page 252)

4 Homework 3 review
Using tregex, find all the small clauses in the WSJ corpus.
– Question 1: How many are there in total?
– Question 2: Would you expect the NP-PRD type to be the most numerous? Is it?
– Question 3: List all the different types and their frequencies (in descending order).
Show your work: i.e. show the tregex search terms you used.

5 Homework 3 Review
Search: /^VP/ < ( /^S/ < /^NP-SBJ/ < /-PRD/)
– Note: /^S/ also matches subtagged clauses such as S-MNR (manner)
– Found: 1842
The two largest types:
– 746: /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-PRD/)
– 938: /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADJP-PRD/)
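The same search can be approximated in plain Python by walking the trees directly. This is only a hedged sketch: it assumes the small WSJ sample that ships with NLTK (nltk.corpus.treebank), so the counts come out much smaller than the full-treebank figures on these slides.

from collections import Counter
from nltk.corpus import treebank
from nltk.tree import Tree

def small_clause_types(tree):
    """Yield the predicate label of each small clause: a VP dominating an
    S-type node whose children include an NP-SBJ and a -PRD constituent."""
    for vp in tree.subtrees(lambda t: t.label().startswith('VP')):
        for s in vp:
            if not (isinstance(s, Tree) and s.label().startswith('S')):
                continue
            kids = [k.label() for k in s if isinstance(k, Tree)]
            if any(k.startswith('NP-SBJ') for k in kids) and any('-PRD' in k for k in kids):
                yield next(k for k in kids if '-PRD' in k)

counts = Counter(label for tree in treebank.parsed_sents()
                       for label in small_clause_types(tree))
for label, n in counts.most_common():   # types in descending order of frequency
    print(n, label)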

6 Homework 3 Review
1842 is approx. 1% of all the clauses in the WSJ Treebank (/^S/ matches 178,231).
The remaining 158 (= 1842 - 746 - 938) is just under 10% of all the small clauses in the WSJ Treebank.

7 Homework 3 Review 57 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-PRD/)

8 Homework 3 Review 57 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-PRD/)

9 Homework 3 Review 101 is approx. 5.5% of all the small clauses in the WSJ Treebank

10 Homework 3 Review 39 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-PRD/)

11 Homework 3 Review 62 is approx. 3.4% of all the small clauses in the WSJ Treebank

12 Homework 3 Review 16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/) Subtag -TTL = Title

13 Homework 3 Review 16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/)

14 Homework 3 Review 29 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-LOC-PRD/)

15 Homework 3 Review 17 is approx. 0.9% of all the small clauses in the WSJ Treebank

16 Homework 3 Review 7 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^UCP-PRD/) Tag UCP = Unlike Coordinated Phrase

17 Homework 3 Review 7 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^UCP-PRD/)

18 Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)

19 Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)

20 Homework 3 Review 5 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^ADVP-LOC-PRD/)

21 Homework 3 Review 5 is approx. 0.3% of all the small clauses in the WSJ Treebank

22 Homework 3 Review 1 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-TTL-PRD/) Subtag -TTL = Title

23 Homework 3 Review 3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^S-TTL-PRD/) Subtag -TTL = Title

24 Homework 3 Review 3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^S-TTL-PRD/) Subtag -TTL = Title

25 Homework 3 Review 1 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^PP-DIR-PRD/) Subtag -DIR = Direction (PP-DIR-PRD: frequency 5)

26 Homework 3 Review

27 16 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-TTL-PRD/)
3 /^VP/ < ( /^S/ < /^NP-SBJ/ < /^NP-PRD-TTL/)

28 Homework 3 Review

29 Today's Topics
– Tregex homework 3 review
– Treebanks and statistical parsers
– Homework 4: install the Bikel-Collins Parser

30 Relevance of Treebanks
Statistical parsers typically construct syntactic phrase structure – they're trained on Treebank corpora like the Penn Treebank.
Note: some use dependency graphs, not trees.

31 Berkeley Parser Note: no empty categories or subtags …

32 Stanford Parser

33 Statistical Parsers
Statistical parsers don't recover fully annotated trees:
– not trained using nodes with indices or empty (-NONE-) nodes
– not trained using functional tags, e.g. -SBJ, -PRD
Therefore they don't fully parse the PTB (e.g. the Berkeley parser on the HW3 -PRD examples).
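To make this concrete, here is a rough sketch (not any particular parser's code) of the information that is lost: it takes a fully annotated tree from the NLTK WSJ sample and strips the -NONE- empty elements, coindexation, and functional tags, leaving the bare labels these parsers actually output.

from nltk.corpus import treebank
from nltk.tree import Tree

def strip_annotation(tree):
    """Remove -NONE- subtrees and reduce labels like NP-SBJ-1 to NP."""
    if not isinstance(tree, Tree):
        return tree                               # a word is kept as-is
    if tree.label() == '-NONE-':
        return None                               # drop empty elements
    children = [c for c in (strip_annotation(c) for c in tree) if c is not None]
    if not children:
        return None                               # a node emptied of children disappears too
    label = tree.label()
    if not label.startswith('-'):                 # keep -LRB-/-RRB- punctuation tags intact
        label = label.split('=')[0].split('-')[0]
    return Tree(label, children)

print(strip_annotation(treebank.parsed_sents()[0]))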

34 Statistical Parsers docs/prsguid1.pdf (page 20) SBAR (sentence level above S)

35 Statistical Parsers Example: no SBAR node in … a movie to see (Stanford parser output)

36 Parsers trained on the Treebank An SBAR can be forced by the presence of an overt relative pronoun, but note there is no -NONE- subject gap.

37 Parsers trained on the Treebank
Probabilities are estimated from frequency information for each node given its surrounding context (e.g. the parent node, or the word that heads the node).
Still, these systems have enormous problems with prepositional phrase (PP) attachment.
Examples (borrowed from Igor Malioutov):
– A boy with a telescope kissed Mary on the lips
– Mary was kissed by a boy with a telescope on the lips
The PP with a telescope should adjoin to the noun phrase (NP) a boy; the PP on the lips should adjoin to the verb phrase (VP) headed by kiss.
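As a toy illustration of the kind of frequency information involved (again over the NLTK WSJ sample, and ignoring the head words a lexicalized model would condition on), one can count how often a PP occurs as the child of an NP versus a VP. Bare counts like these are exactly why PP attachment goes wrong.

from collections import Counter
from nltk.corpus import treebank
from nltk.tree import Tree

pp_parents = Counter()
for tree in treebank.parsed_sents():
    for node in tree.subtrees():
        base = node.label().split('-')[0]
        if base in ('NP', 'VP') and any(isinstance(c, Tree) and c.label().startswith('PP')
                                        for c in node):
            pp_parents[base] += 1

total = sum(pp_parents.values())
for parent, n in pp_parents.most_common():
    print(f'PP under {parent}: {n} ({100 * n / total:.1f}%)')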

38 Active/passive sentences Examples using the Stanford Parser: both active and passive sentences are parsed incorrectly.

39 Active/passive sentences Examples using the Stanford Parser: both active and passive sentences are parsed incorrectly.

40 Active/passive sentences
Examples (X marks the incorrect attachment):
– X: on the lips modifies Mary
– X: on the lips modifies telescope

41 Treebank Rules
– Just how many rules are there in the WSJ treebank?
– What's the most common POS tag?
– What's the most common syntax rule?

42 Penn Treebank
How many grammar rules are there in the treebank?
Total number of rules: 978,873
Number of different rules: 31,338
Top 5:
1. S → NP-SBJ VP
2. PP → IN NP
3. NP-SBJ → -NONE-
4. NP → NP PP
5. NP → DT NN
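These counts can be reproduced with a few lines of NLTK. This is a hedged sketch over the small WSJ sample bundled with NLTK, so the absolute numbers come out far smaller than the full-treebank figures above.

from collections import Counter
from nltk.corpus import treebank

rule_counts = Counter()
for tree in treebank.parsed_sents():
    for prod in tree.productions():               # one production per local tree
        if prod.is_nonlexical():                  # keep syntax rules, skip POS -> word rules
            rule_counts[prod] += 1

print('total rules:    ', sum(rule_counts.values()))
print('different rules:', len(rule_counts))
for rule, n in rule_counts.most_common(5):
    print(n, rule)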

43 Treebank

44 Total # of rules: 978,873 # of different rules: 17,554

45 Treebank

46 Penn Treebank
How about POS tags?
Total number of tags: 1,253,013
Top 5:
1. NN  singular common noun
2. IN  preposition
3. NNP singular proper noun
4. DT  determiner
5. JJ  adjective

Category   Frequency   Percentage
N          355,039     28.3%
V          154,975     12.4%
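The tag counts can be checked the same way; again this runs over the NLTK sample rather than the full 1,253,013-tag WSJ Treebank.

from collections import Counter
from nltk.corpus import treebank

tag_counts = Counter(tag for word, tag in treebank.tagged_words())
total = sum(tag_counts.values())
print('total tags:', total)
for tag, n in tag_counts.most_common(5):
    print(f'{tag:4s} {n:7d} {100 * n / total:5.1f}%')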

47 Using the Treebank
What is the grammar of the Treebank?
– We can extract the phrase structure rules used, and
– count the frequency of rules, and construct a stochastic parser (see the sketch below).
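A minimal sketch of that pipeline, assuming the NLTK WSJ sample and NLTK's induce_pcfg and ViterbiParser; a real system would also need smoothing and unknown-word handling.

import nltk
from nltk.corpus import treebank

# collect the productions used in the treebank trees
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

# estimate rule probabilities by relative frequency and build a PCFG
grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)

# a simple stochastic (Viterbi) parser over that grammar; every input word
# must have appeared in the training trees, or parsing will fail
parser = nltk.ViterbiParser(grammar)
for parse in parser.parse(['the', 'company', 'bought', 'shares', '.']):
    print(parse)
    break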

48 Using the Treebank Breakthrough in parsing accuracy with lexicalized trees – think of expanding the nonterminal names to include head information: the head word drawn from the leaves of each subtree.
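For instance, a VP node becomes something like VP[join]. The toy sketch below illustrates the idea with a deliberately crude head-finding heuristic; Collins' model uses hand-written per-category head tables, so treat this only as an illustration.

from nltk.tree import Tree

def lexicalize(tree):
    """Rename each node LABEL to LABEL[head], percolating a head word up from
    the leaves.  The head choice here is a crude heuristic, not Collins' rules."""
    if not isinstance(tree, Tree):
        return tree, tree                         # a leaf word is its own head
    children, heads = [], []
    for child in tree:
        new_child, head = lexicalize(child)
        children.append(new_child)
        label = child.label() if isinstance(child, Tree) else ''
        heads.append((label, head))
    head = heads[-1][1]                           # default: head of the last child
    for preferred in ('VP', 'VB', 'NP', 'NN'):    # crude preference order
        match = next((h for lab, h in heads if lab.startswith(preferred)), None)
        if match is not None:
            head = match
            break
    return Tree(f'{tree.label()}[{head}]', children), head

t = Tree.fromstring('(S (NP (DT the) (NN board)) (VP (MD will) (VP (VB join))))')
print(lexicalize(t)[0])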

49 Bikel Collins Parser
Java re-implementation of Collins' parser.
Paper:
– Daniel M. Bikel. 2004. Intricacies of Collins' Parsing Model. Computational Linguistics, 30(4), pp. 479-511.
Software:
– http://www.cis.upenn.edu/~dbikel/software.html#stat-parser (page no longer exists)

50 Bikel Collins
Download and install Dan Bikel's parser:
– dbp.zip (on course homepage)
– File: install.sh
– Java code – but at this point I think Windows won't work because of the shell script (.sh) – maybe after the files are extracted?

51 Bikel Collins Download and install the POS tagger MXPOST – though the parser doesn't actually need a separate tagger…

52 Bikel Collins
Training the parser with the WSJ PTB – see the guide: userguide/guide.pdf
– directory: TREEBANK_3/parsed/mrg/wsj
– sections 02-21: create one single .mrg file
– events: wsj-02-21.obj.gz
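A small helper sketch for the "create one single .mrg file" step, assuming the TREEBANK_3/parsed/mrg/wsj layout above with one subdirectory per section; adjust the paths to your local installation.

import glob
import os

wsj = 'TREEBANK_3/parsed/mrg/wsj'                 # path assumed from the slide
sections = [f'{n:02d}' for n in range(2, 22)]     # training sections 02-21

with open('wsj-02-21.mrg', 'w') as out:
    for section in sections:
        for path in sorted(glob.glob(os.path.join(wsj, section, '*.mrg'))):
            with open(path) as f:
                out.write(f.read())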

53 Bikel Collins Settings:

54 Bikel Collins Parsing – Command – Input file format (sentences)

55 Bikel Collins Verify the trainer and parser work on your machine

56 Bikel Collins File: bin/parse is a shell script that sets up program parameters and calls java

57 Bikel Collins

58 File: bin/train is another shell script

59 Bikel Collins Relevant WSJ PTB files

