Parsing Long and Complex Natural Language Sentences


1 Parsing Long and Complex Natural Language Sentences
Shonan Meeting, November 27, 2014
Yuji Matsumoto, Nara Institute of Science and Technology (NAIST)

2 Problems in Parsing Long Sentences
Sentences in scientific or legal domains tend to be long and complex, which is a major hindrance to syntactic analysis of natural language. Difficult syntactic structures in long sentences include:
- Coordinate structures
- Complex syntactic patterns (with multiple clauses)
Ordinary CFG grammars and lexicons have difficulty handling (or representing) such phenomena.

3 Issues in Parsing that lie between Lexicon and Grammar
Between the lexicon (word tokens) and the grammar (grammar rules), several phenomena need special treatment:
- Coordinate structures: an extra-grammatical phenomenon
- Grammatical units and multiword expressions (functional MWEs): syntactically or semantically idiosyncratic expressions that should be registered in the lexicon
- Complex sentence patterns: subordinate clauses, embedded clauses, and other complex sentence patterns

4 Problems of Coordinate Structures
Any constituents can be coordinated.
Non-constituent structures (sequences of constituents) can also be coordinated: “John saw Mary yesterday and Bill today.”
Coordinate structures can be nested: “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
Scope ambiguity: “old [men and women]” vs. “[old men] and [women]”

5 Identification of coordinate structure helps improve parsing accuracy
“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
⇓
“Median times … were 6.1 months in arm A”
“Median times … were 7.2 months in arm B”
“median survival times were 8.9 months in arm A”
“median survival times were 9.5 months in arm B”

6 Joint analysis of grammatical and alignment methods [Hara et al., 2009]
Learning scores for alignment: feature-based learning [Shimbo & Hara, 2007].
Phrase-structure grammar rules for coordinate structures are defined to enforce the structural constraints, and the weights for alignment are learned jointly with those constraints: a combination of the CKY parsing algorithm with perceptron learning of alignment weights (sketched below).

Masashi Shimbo and Kazuo Hara, "A Discriminative Learning Model for Coordinate Conjunctions," EMNLP-CoNLL, June 2007.
Kazuo Hara, Masashi Shimbo, Hideharu Okuma and Yuji Matsumoto, "Coordinate Structure Analysis with Global Structural Constraints and Alignment-Based Local Features," Proceedings of ACL-IJCNLP 2009, August 2009.
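To make the learning scheme concrete, here is a minimal sketch of a structured-perceptron training loop of the kind the slide names; `decode` (the CKY-based coordination parser under the current weights) and `features` (the alignment feature extractor) are hypothetical stand-ins, not the authors' API.

```python
# Minimal structured-perceptron sketch (illustrative, not the authors' code).
from collections import defaultdict

def perceptron_train(corpus, decode, features, epochs=5, lr=1.0):
    """corpus: (sentence, gold) pairs; decode(sentence, w) returns the
    best coordination structure under weights w (via CKY decoding);
    features(sentence, structure) returns a feature-count dict."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold in corpus:
            pred = decode(sentence, w)
            if pred != gold:
                for f, v in features(sentence, gold).items():
                    w[f] += lr * v        # promote features of the gold structure
                for f, v in features(sentence, pred).items():
                    w[f] -= lr * v        # demote features of the prediction
    return w
```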

7 Coordination structure analysis
Alignment of corresponding parts: “the standard arm and the dose dense arm”
[Figure: word-by-word alignment of the two conjuncts “the standard arm” and “the dose dense arm”]

8 DP matching method for alignment
[Figure: DP alignment table between the word sequences “the standard arm” and “the dose dense arm”]
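The DP matching in the figure is essentially global alignment over word sequences. A minimal sketch, with illustrative hand-set weights rather than the learned scores of the actual method:

```python
# Needleman-Wunsch-style global alignment over word sequences.
# The weights (match=2, mismatch=0, gap=-1) are illustrative assumptions;
# in the actual method the scores are learned from features.

def align(left, right, match=2, mismatch=0, gap=-1):
    n, m = len(left), len(right)
    # score[i][j] = best score aligning left[:i] with right[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if left[i - 1] == right[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two words
                              score[i - 1][j] + gap,        # skip a left word
                              score[i][j - 1] + gap)        # skip a right word
    return score[n][m]

print(align("the standard arm".split(),
            "the dose dense arm".split()))   # prints 3 with these toy weights
```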

9 Our first method represents coordinate structure as a path on a triangular alignment graph
To search for coordinations in a given sentence, our previous method [EMNLP 2007] represents a coordinate structure as a path on a triangular alignment graph.
[Figure: triangular alignment graph for “Median times to progression and median survival times”, with start and end nodes]

10 A path representing correct structure
This is the path representing the correct coordinate structure.
[Figure: alignment graph for “Median times to progression and median survival times” with the correct path marked]

11 Drawback of path-based method
The path-based method cannot cope with nested coordinations, such as:
“Median times to progression and median survival times were 6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B.”
The underlined part contains three coordinations in a nested structure:
- “6.1 months and 8.9 months”
- “7.2 months and 9.5 months”
- “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”

12 The alignment graph for the example
“6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”
[Figure: triangular alignment graph for this word sequence, with start and end nodes]

13 A path representing the first coordination
[Figure: the path for “6.1 months and 8.9 months” on the alignment graph]

14 A path representing the second coordination
[Figure: the path for “7.2 months and 9.5 months” on the alignment graph]

15 A path representing the larger coordination
[Figure: the path for “6.1 months and 8.9 months in arm A and 7.2 months and 9.5 months in arm B”, which includes the first and second coordinations]

16 No single path represents all three coordinations at once
But no single path can represent these three coordinations at the same time.
[Figure: the three coordination paths overlaid on the alignment graph]

17 There is no single path to connect all three segments
[Figure: the alignment graph showing that no single path connects all three segments]

18 Constituent tree structure can represent coordinate structure as a tree
Our new method [ACL 2009] represents coordinate structure as a tree.
[Figure: constituent tree for “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”]

19 We use Grammar rules only to ensure consistent global coordinate structure
The proposed method requires a grammar only to ensure a consistent global coordinate structure. For any two coordinate structures in a sentence, the following must hold: either their scopes are completely disjoint (non-overlapping flat coordinate structures), or one is embedded in a conjunct of the other (nested structures).

20 Grammar rules: just to ensure global consistency of coordinate structure
Coordination:
  COORD  → CJT CC CJT
  COORD  → CJT SEP COORD′
  COORD′ → CJT CC CJT
  COORD′ → CJT SEP COORD′
Conjunct:
  CJT → (COORD | N)
Non-coordination:
  N → COORD N
  N → W N
  N → W
Pre-terminal:
  CC  → (and | or | but)
  SEP → (, | ;)
  W   → ∗
The COORD rules state that a coordination consists of two conjuncts and one coordinate conjunction. The partial coordination COORD′ and the separator SEP handle coordinations with three or more conjuncts, such as “a, b, and c”. In the pre-terminal rules, coordinate conjunctions need not be limited to “and”, “or”, “but”; other coordination clues such as “as well as” and “versus” can be added if necessary (see the sketch below).
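To make the rules concrete, here is a minimal sketch (not the authors' implementation) that encodes this grammar with NLTK and parses the two toy sentences from the next slide. Assumptions: COORD′ is spelled COORDP, the wildcard W is instantiated with the toy vocabulary, and a start rule S is added so that a sentence may be exactly one coordination.

```python
# The slide's coordination grammar in NLTK form (toy instantiation).
import nltk

grammar = nltk.CFG.fromstring("""
    S      -> N | COORD
    COORD  -> CJT CC CJT | CJT SEP COORDP
    COORDP -> CJT CC CJT | CJT SEP COORDP
    CJT    -> COORD | N
    N      -> COORD N | W N | W
    CC     -> 'and' | 'or' | 'but'
    SEP    -> ',' | ';'
    W      -> 'a' | 'b' | 'c'
""")

parser = nltk.ChartParser(grammar)
for sentence in ["a , b and c", "a or b and c"]:
    print(sentence)
    for tree in parser.parse(sentence.split()):
        tree.pretty_print()
```

Running this yields one flat three-conjunct tree (via COORDP) for "a , b and c", and two trees for "a or b and c", one for each nesting, which is exactly the ambiguity the grammar is meant to expose.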

21 Parse trees produced by the grammar rules
These are example parse trees constructed by the grammar: “a , b and c” is a single coordination with three conjuncts, and “a or b and c” is a nested coordination. The two trees are almost the same except for the use of COORD vs. COORD′.
[Figure: the two parse trees]

22 We incorporate “sequence alignment” into the tree-based method
We want to measure the local similarity between the conjuncts of each coordination by sequence alignment. To do so, we attach an alignment graph to each COORD node in the tree.

23 Attach an alignment graph to each COORD node (in the correct tree)
We attach an alignment graph to each COORD node in order to calculate a score for the tree. Each rectangular alignment graph is constructed from the node's conjunct pair.
[Figure: the correct tree for “Median times to progression and median survival times were 6.1 months and 8.9 months in group A and 7.2 months and 9.5 months in group B”, with an alignment graph attached to each COORD node]

24 Attach an alignment graph to each COORD node (in an incorrect tree)
This is an example of the attachment for an incorrect tree.
[Figure: an incorrect tree for the same sentence, with alignment graphs attached to its COORD nodes]

25 Score of a tree
The score of a tree is the sum of the scores of all COORD/COORD′ nodes in the tree. For example, in the tree for “a or b and c”, if one COORD node's score is 5.5 and the other's is 3.3, the score of the whole tree is 8.8.
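A small sketch of this scoring rule over nltk-style trees from the grammar sketch above (COORD′ spelled COORDP), reusing the `align` scorer sketched earlier; in the real model the node scores come from learned alignment features.

```python
# Score a candidate tree by summing alignment scores at coordination nodes.
from nltk import Tree

def tree_score(tree, align):
    total = 0.0
    for node in tree.subtrees(lambda t: t.label() in ("COORD", "COORDP")):
        # conjuncts are the CJT children of the coordination node
        conjuncts = [kid.leaves() for kid in node
                     if isinstance(kid, Tree) and kid.label() == "CJT"]
        for left, right in zip(conjuncts, conjuncts[1:]):
            total += align(left, right)   # DP alignment score of the pair
    return total
```

In this simplification, a COORD node built with SEP contributes no pair itself; the conjunct pairs of a three-or-more-conjunct coordination are scored at the nested COORDP node.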

26 Experiments: Comparison with other parsers (on the GENIA corpus)
Coordination type   Number   Proposed method   Bikel-Collins (with gold POS)
Overall             3598     61.5              52.1

Coordination type   Number   Proposed method   Charniak-Johnson (with auto-tagged POS)
Overall             3598     57.5              52.9

There are 3598 coordinations in GENIA; the results are obtained from 5-fold cross-validation. Because the Charniak-Johnson parser takes raw sentences without POS tags as input, for a fair comparison we gave the proposed method the POS tags output by the Charniak-Johnson parser in that setting.

27 Breakdown of the results per coordination of different types
Coordination type   Number   Proposed method   Bikel-Collins
NP                  2317     64.2              45.5
VP                   465     54.2              67.7
ADJP                 321     80.4              66.4
S                    188     22.9              67.0
PP                   167     59.9              53.3
UCP                   60     36.7              18.3
SBAR                  56     51.8              85.7
ADVP                  21     90.5
Others                 3     66.7              33.3

28 Breakdown of the results per coordination of different types
Coordination type   Number   Proposed method   Charniak-Johnson
NP                  2317     62.5              50.1
VP                   465     42.6              61.9
ADJP                 321     76.3              48.6
S                    188     15.4              63.3
PP                   167     53.9              58.1
UCP                   60     38.3              26.7
SBAR                  56     33.9              83.9
ADVP                  21     85.7              90.5
Others                 3     33.3               0.0

Our method was superior to the parsers on noun-phrase and adjective-phrase coordinations. In contrast, the parsers performed quite well on verb-phrase and sentence coordinations.

29 Our current annotation scheme for coordination and dependency
ChaKi: a general annotation tool for POS tags, chunks, dependencies, and links in natural language sentences. Coordinate structure and dependency structure are annotated independently.

30 ChaKi: Corpus annotation and management tool
[Screenshot: the ChaKi corpus annotation and management tool]

31 Current Project: Joint coordination and dependency parsing
- Coordinate structure analysis: alignment-based coordination structure analysis
- Dependency analysis: the Eisner algorithm (CKY-style dynamic programming; sketched below)
- Joint model: an extended Eisner algorithm
- Need to accumulate training examples
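For reference, here is a compact sketch of the plain first-order Eisner dynamic program mentioned on the slide, in its standard textbook form with toy hand-set arc scores; the project's extended version is not shown.

```python
# First-order Eisner algorithm: O(n^3) best projective dependency tree.

def eisner_best_score(scores):
    """scores[h][m] = score of the arc from head h to modifier m;
    index 0 is an artificial ROOT token. Returns the best tree score."""
    n = len(scores)
    NEG = float("-inf")
    # chart[i][j][d][c]: best score for span i..j, where d = 1 if the head
    # is the left end i, 0 if it is the right end j; c = 1 if the span is
    # complete (its head takes no more dependents inside the span).
    chart = [[[[NEG, NEG], [NEG, NEG]] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        chart[i][i] = [[0.0, 0.0], [0.0, 0.0]]
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # incomplete spans: add the arc between the two endpoints
            for k in range(i, j):
                both = chart[i][k][1][1] + chart[k + 1][j][0][1]
                chart[i][j][0][0] = max(chart[i][j][0][0], both + scores[j][i])
                chart[i][j][1][0] = max(chart[i][j][1][0], both + scores[i][j])
            # complete spans: combine an incomplete span with a complete one
            for k in range(i, j):
                chart[i][j][0][1] = max(chart[i][j][0][1],
                                        chart[i][k][0][1] + chart[k][j][0][0])
            for k in range(i + 1, j + 1):
                chart[i][j][1][1] = max(chart[i][j][1][1],
                                        chart[i][k][1][0] + chart[k][j][1][1])
    return chart[0][n - 1][1][1]   # complete span headed by ROOT

# Toy check with hand-set scores: ROOT=0, "saw"=1, "movies"=2.
scores = [[0, 10, 1],    # ROOT -> saw = 10, ROOT -> movies = 1
          [0,  0, 8],    # saw -> movies = 8
          [0,  2, 0]]    # movies -> saw = 2
print(eisner_best_score(scores))   # 18 = ROOT->saw (10) + saw->movies (8)
```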

32 Complex Sentence Patterns
Long sentences with subordinate or embedded clauses are difficult to parse. We investigated the variation of clause patterns around SBAR: we extracted SBAR patterns from a corpus auto-parsed with the Berkeley parser, then merged the patterns down to a manageable size.

33 Analysis of SBAR patterns in complex sentences
We examine each SBAR and its relations to its parent, sister, and children nodes. To analyze and extract the patterns of relative clauses, we look at the tree structure of each sentence and take out the relations of each SBAR with its parent, sisters, and children. For the SBAR's parent and sister nodes, we extract the main POS tags. For the SBAR's children, which contain the SBAR's function words (like “who”), we extract both the POS tag and the surface word, because SBAR function words are important elements that help define the pattern and the relations among the components of a sentence.

34 Extracted SBAR Pattern
(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))
This pattern expresses the relations between an SBAR and its parent, sister, and two children nodes. Each SBAR pattern is summarized as the POS tags of the SBAR's parent and sisters, together with POS-and-surface pairs for the SBAR's function words, plus the SBAR's children. Above is an example of an SBAR pattern extracted from the corpus data.
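A sketch of this extraction step over nltk trees, under the simplifying assumptions that the SBAR's first child is the function-word slot and that other children are reduced to their labels; this is a reconstruction from the slide, not the project's code.

```python
# Extract SBAR patterns: parent label, sister labels, and SBAR children,
# keeping POS + surface only for the function-word slot.
from nltk import Tree

def fw_slot(c):
    """Function-word slot: keep the POS tag and the surface form."""
    if len(c) == 1 and isinstance(c[0], str):   # preterminal, e.g. (IN that)
        return f"({c.label()} {c[0]})"
    word, tag = c.pos()[0]                      # one-word phrase, e.g. (WHNP (WP who))
    return f"({c.label()} ({tag} {word}))"

def sbar_patterns(tree):
    patterns = []
    for parent in tree.subtrees():
        kids = [k for k in parent if isinstance(k, Tree)]
        for child in kids:
            if child.label() != "SBAR":
                continue
            sisters = " ".join(f"({k.label()})" for k in kids if k is not child)
            inner = " ".join(fw_slot(c) if i == 0 else f"({c.label()})"
                             for i, c in enumerate(child))
            patterns.append(f"({parent.label()} {sisters} (SBAR {inner}))")
    return patterns

t = Tree.fromstring(
    "(NP (NP (DT the) (NN man)) (SBAR (WHNP (WP who)) (S (VP (VBD ran)))))")
print(sbar_patterns(t))   # ['(NP (NP) (SBAR (WHNP (WP who)) (S)))']
```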

35 Corpus data: Hiragana Times (English-Japanese parallel text)
The corpus used for the pattern analysis is Hiragana Times, a well-known magazine that introduces Japan to non-Japanese readers. Articles are written in both English and Japanese, so they can be used as a parallel corpus in future analysis. Here we analyze the English part of the corpus.

36 Statistics of patterns extracted from Hiragana Times (English part)
Number of sentences               171,098
Number of complex sentences        70,134
Number of SBARs                   114,840
Number of distinct SBAR patterns   21,090

The number of distinct patterns is far too large for a human to manage, so we group similar patterns together, working from the most frequent patterns down to less frequent ones.

37 Top 10 SBAR patterns of high frequency
Rank   SBAR Pattern                                        Freq.
1      (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))            6921
2      (NP (NP) (SBAR (S (NP) (VP))))                       4982
3      (NP (NP) (SBAR (WHNP (WDT that)) (S (VP))))          4772
4                                                           3017
5      (VP (VBP) (SBAR (S (NP) (VP))))                      2768
6      (NP (NP) (,) (SBAR (WHNP (WDT which)) (S (VP))))     1619
7      (VP (VBD) (SBAR (IN that) (S (NP) (VP))))            1583
8      (VP (VB) (SBAR (IN that) (S (NP) (VP))))             1564
9      (VP (VBD) (SBAR (S (NP) (VP))))                      1542
10     (VP (VBZ) (SBAR (IN that) (S (NP) (VP))))            1389
(sum)                                                      30157 (29.9%)

The top 10 high-frequency patterns cover about 30% of all the relative clauses in the data.

38 Grouping SBAR patterns
Grouping criteria:
- The head function words should be the same
- The parent nodes of the SBAR should be the same
- The c-commanding nodes of the SBAR should be the same
- The clause structure under the SBAR should be the same

39 Examples of Distinct Patterns to be Grouped Together
Rank   SBAR Pattern                                                  Freq.
1      (NP (NP) (SBAR (WHNP (WP who)) (S (VP))))                      6921
21     (NP (NP) (,) (SBAR (WHNP (WP who)) (S (VP))))                   646
151    (NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))             51
557    (NP (NP) (ADJP) (SBAR (WHNP (WP who)) (S (VP))))                 13
2359   (NP (NP) (PP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))         2

The full set of distinct patterns is too large for a human to summarize, so we group patterns that share certain structure and expressions. For example, the patterns above all contain the function word “who” and share the main structure (an SBAR modifying an NP), so we group them together under the pattern (NP (NP) (SBAR (WHNP (WP who)) (S (VP)))), as sketched below. After grouping, the top 300 grouped patterns cover 87% of all relative clauses, and the top 150 cover 86%.
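One way to realize the grouping is to normalize away optional material before comparing patterns. The normalization below is an illustrative assumption (dropping comma, ADJP, PP, and ADVP nodes), not the actual implementation of the criteria.

```python
# Illustrative grouping key: strip optional material, then compare.
import re

def group_key(pattern):
    key = re.sub(r"\((?:,|ADJP|PP|ADVP)\) ?", "", pattern)
    return re.sub(r"\s+", " ", key).strip()

p1 = "(NP (NP) (SBAR (WHNP (WP who)) (S (VP))))"
p2 = "(NP (NP) (,) (SBAR (WHNP (WP who)) (S (ADVP) (VP))))"
print(group_key(p1) == group_key(p2))   # True: both fall into the rank-1 group
```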

40 Grouping SBAR patterns reduces the number of distinct patterns
Number of grouped patterns   Coverage
150                          86,553/100,537 (86%)
300                          87,789/100,537 (87.3%)

41 …since… : time-related meaning
One grouped pattern may have different meanings (different translations in Japanese).
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
“Business has doubled every year since we began,” Stuart says.
「私たちが始めてから、取引高は毎年、倍に成長しています」とスチュウアートさんは言う。
Here “since” has the time-related meaning (“in the intervening period between the time mentioned and the time under consideration”); it can also have a reason-related meaning (“for the reason that …”). In sentences where “since” is translated with the time-related meaning, the SBAR usually has a sister node (VBN) expressing the present or past perfect tense, while the clause under the SBAR is in the past tense.

42 …since… : reason-related meaning
One grouped pattern may have different meanings (different translations in Japanese).
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
“This is so great that you can also practice Japanese pronunciation since the guide vocal and Japanese song book in Romaji are attached.”
また、日本語のガイドボーカルが付き、ローマ字付きの日本語歌詞本も付いているので、日本語の発音の練習もできるというすぐれものだ。
Here “since” has the reason-related meaning. For such a pattern, we have to decide which of the meanings a given sentence should be translated with.

43 Relations between the nodes in SBAR patterns
(VP (VB.*) (NP) (SBAR (IN since) (S (NP) (VP))))
The tense combination of the SBAR's sister nodes and of the clause under the SBAR gives a disambiguation cue:
- Sister in the present perfect and since-clause in the past tense (or time expressions in either) → time-related meaning
- Sister and since-clause both in the present tense, or both in the future tense → reason-related meaning
When a grouped pattern has multiple possible translations, we analyze the individual patterns in the group to find rules that decide which meaning to translate with, as in the sketch below.
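A sketch of such a disambiguation rule, encoding only the tense cues stated above; the tense labels and the fallback behavior are assumptions.

```python
# Rule-based meaning choice for "since", from the slide's tense cues.

def since_meaning(matrix_tense, clause_tense):
    if matrix_tense == "present_perfect" and clause_tense == "past":
        return "time"
    if matrix_tense == clause_tense and matrix_tense in ("present", "future"):
        return "reason"
    return "unknown"   # fall back to other cues (e.g. time expressions)

print(since_meaning("present_perfect", "past"))   # time
print(since_meaning("present", "present"))        # reason
```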

44 Evaluation of grouped patterns via Statistical MT
Among the top 100 SBAR patterns (coverage 82%), 13 patterns have multiple translations. We hand-write translation templates and disambiguation rules and take a “divide and rewrite” approach to translation, as sketched below:
- Match the sentence against the complex sentence patterns
- Translate the sub-clauses with an existing SMT system
- Put the translated clauses into the translation templates
Existing SMT systems used: Google Translate, Moses and GIZA++.
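Schematically, the pipeline might look as follows; every helper here (`pattern.match`, `match.slots`, `pattern.template`, `smt_translate`) is a hypothetical stand-in, since the slide does not specify an API.

```python
# Schematic sketch of "divide and rewrite" (all helper names hypothetical).

def divide_and_rewrite(sentence, patterns, smt_translate):
    for pattern in patterns:
        match = pattern.match(sentence)          # complex-sentence pattern matching
        if match is None:
            continue
        # sub-clauses are translated independently by the existing SMT system
        parts = {slot: smt_translate(clause)
                 for slot, clause in match.slots.items()}
        # translated clauses are put into the hand-written translation template
        return pattern.template.format(**parts)
    return smt_translate(sentence)               # no pattern matched: plain SMT
```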

45 Experiment
Training (17,000 sentences), dev (500), and test (500) sentences from Hiragana Times (English). The test set contains 232 complex sentences, of which 185 matched a pattern (80%).

                                     all test sentences    complex sentences only
                                     Moses     Google      Moses     Google
without complex sentence patterns    15.26     24.36       12.84     15.61
with complex sentence patterns       17.49     24.73       15.97     16.43

46 Current Projects
- Joint coordination and dependency parsing: an extended Eisner algorithm
- MWE lexicon: functional expressions (prepositions, determiners, conjunctions, adverbs); phrasal verbs; flexible MWEs (MWEs with gaps); training data for disambiguation
- Complex sentence patterns: the previous evaluation was done only with sentences that have one SBAR structure; sentence pattern acquisition and disambiguation
- Example flexible MWEs and sentence patterns: “a JJ kind of”, “not only … (but) also”, “The JJR … V …, the JJR … V”

