Detecting Multiword Verbs (MWVs) in MEDLINE Abstracts Chun Xiao and Dietmar Rösner Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität.


1 Detecting Multiword Verbs (MWVs) in MEDLINE Abstracts Chun Xiao and Dietmar Rösner Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg

2 Outline
1. MWVs in MEDLINE abstracts
2. Collecting MWV candidates
3. Case study of overgeneration
4. Ranking proper MWV candidates
5. Evaluation
6. Summary

3 MEDLINE Abstracts
  Access: PubMed®
  Domain: clinical medicine, biomedicine, biological and physical sciences.
  Source: articles from over 4,600 journals published throughout the world.
  Coverage: abstracts are included for about 52% of the articles (over 10 million abstracts).
GENIA Corpus (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
  2000 MEDLINE abstracts collected using the keywords human, blood cell, transcription factor (1800 in test).
  A POS-tag-annotated version.
  An NE-annotated version.

4 Information Extraction Beyond NER
We show that in this work ….
Example sentence: High levels of UDG expression in a transient transfection assay result in the down-regulation of transcriptional activity through elements specific for E2F-mediated transcription.
  Named entities (NEs): "high levels of UDG expression", "down-regulation of transcriptional activity"
  Domain-specific verb: "result in"
  Named entities and the domain-specific verb together give relational information.

5 MWVs in MEDLINE Abstracts
Examples of MWVs: be able to, shed light on, take place, result in
Ambiguities that can arise without MWV detection:
  shed light on: light is not an NE
  take place: place is not an NE
  result in/from: in or from should not start a prepositional phrase as it would in the general case
Appropriate handling of MWVs simplifies the processing.
Reliable detection of MWVs (such as interact with) contributes to relational information extraction.

6 Collecting MWV Candidates from Corpus
Definition of an automaton T = {S, I, O, F, G, START}, where
  S is the set of automaton states, S = {nextOut, stop, nextIn, head, halt};
  I is the input set, namely the chunks in both POS tag sequence and lexical sequence;
  O is the output set, namely the MWV candidates, O = {o_i | o_i is a successful MWV candidate};
  F is the set of output controlling functions;
  G is the set of automaton state transition functions;
  START is the beginning state for MWV collecting, START = head.

7 Working Mechanism: Non-contiguous Model
SEPR: sentence separator.
EVP: verb phrase.
IVP: infinitive verb phrase.
a: legal chunks, which are
  limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …};
  limited by tokens in chunks, i.e., containing only one token.
b: illegal chunks, i.e., chunks that are not legal.
c >= 0.
x is limited by the given right-side window size s, i.e., x <= s-1.
b/a does not include SEPR, EVP and IVP.
[Figure: state-transition diagram over the states nextOut, head, nextIn, stop and halt; edge labels include EVP/IVP, a, b, a^c, b^x, b/a, (b/a)^c, SEPR and SEPR/Ø.]

8 Working Mechanism: Contiguous Model
SEPR: sentence separator.
EVP: verb phrase.
IVP: infinitive verb phrase.
a: legal chunks, which are
  limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …};
  limited by tokens in chunks, i.e., containing only one token.
b: illegal chunks, i.e., chunks that are not legal.
The stop state will trigger an output operation.
c >= 0.
b/a does not include SEPR, EVP and IVP.
[Figure: state-transition diagram over the states nextOut, head, nextIn, stop and halt; edge labels include EVP/IVP, a, b, a^c, b/a, (b/a)^c, SEPR and SEPR/Ø.]

9 Fragment of the Automaton for MWV Collecting
SEPR: sentence separator.
EVP: verb phrase.
IVP: infinitive verb phrase.
a: legal chunks, which are
  limited by chunk type, i.e., not in a stoplist {EVP, COMMA, SEPR, PUNC, …};
  limited by tokens in chunks, i.e., containing only one token.
b: illegal chunks, i.e., chunks that are not legal.
c: c >= 0.
stop: the stop state will trigger an output operation.
[Figure: fragment of the state-transition diagram with the states head, nextIn and stop; edge labels include EVP/IVP, a, b and a^c.]

10 An Example
Chunks of a sentence                           Chunk tag  State transition  Output
(initialization)                                          nextOut           Ø
The 3'NF-E2/AP1 motif                          ENP        nextOut           Ø
is                                             EVP        head              O_i = “be”
able                                           ADJP       nextIn            O_i = “be able”
to exert                                       IVP        stop              O_i = “be able to” (success)
                                                          head              O_i+1 = “exert”
both positive and negative regulatory effects  ENPS       stop, nextOut     O_i+1 = “exert” (failure)
on                                             IN         nextOut           Ø
the zeta 2-globin promoter activity            ENP        nextOut           Ø
in                                             IN         nextOut           Ø
K562 cells                                     ENP        nextOut           Ø
(sentence end)                                 SEPR       halt              Ø
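The trace in this example can be approximated with a short Python sketch of the contiguous model. It is a simplification under stated assumptions, not the authors' automaton: chunks are assumed to arrive as (tag, lemmatised tokens) pairs, the function and variable names are hypothetical, the stoplist holds only the members named on the slides, and a candidate counts as successful only if it has more than one word.

```python
from typing import List, Tuple

# Chunk = (chunk tag, lemmatised tokens). Lemmatisation is assumed to have
# happened upstream, so "is" arrives as ["be"].
Chunk = Tuple[str, List[str]]

VERB_TAGS = {"EVP", "IVP"}
# Stoplist members named on the slides; the original list is longer ("…").
STOPLIST = {"EVP", "COMMA", "SEPR", "PUNC"}


def is_legal(tag: str, tokens: List[str]) -> bool:
    """A legal chunk is outside the stoplist and holds exactly one token."""
    return tag not in STOPLIST and len(tokens) == 1


def collect_contiguous(chunks: List[Chunk]) -> List[str]:
    candidates: List[str] = []
    current: List[str] = []  # an empty list plays the role of state nextOut

    def stop() -> None:
        # Output only multiword candidates; a bare verb counts as a failure.
        if len(current) > 1:
            candidates.append(" ".join(current))
        current.clear()

    for tag, tokens in chunks:
        if tag in VERB_TAGS:
            if tag == "IVP" and current:
                current.append(tokens[0])  # infinitive marker, e.g. "to"
            stop()
            current.append(tokens[-1])     # state head: the verb opens a candidate
        elif current and is_legal(tag, tokens):
            current.append(tokens[0])      # state nextIn: extend the candidate
        else:
            stop()                         # illegal chunk or SEPR ends the candidate
    stop()
    return candidates


sentence = [
    ("ENP", ["the", "3'NF-E2/AP1", "motif"]),
    ("EVP", ["be"]),
    ("ADJP", ["able"]),
    ("IVP", ["to", "exert"]),
    ("ENPS", ["both", "positive", "and", "negative", "regulatory", "effect"]),
    ("IN", ["on"]),
    ("ENP", ["the", "zeta", "2-globin", "promoter", "activity"]),
    ("IN", ["in"]),
    ("ENP", ["K562", "cell"]),
    ("SEPR", ["."]),
]
print(collect_contiguous(sentence))
```

For the example sentence the sketch returns the single candidate "be able to" and discards the bare verb "exert", as in the trace. Because any one-token chunk extends a candidate, a sequence such as bind DNA would also be collected, which is the kind of overgeneration the later slides address.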

11 Case Study of Overgeneration
Case 1: take place, take place in, take place at
  Pattern of POS tags: Verb Noun (Preposition)
Case 2: be able to, be important for
  83% of the occurrences of able are in be able to; 8.4% of the occurrences of important are in be important for
Case 3: take place, bind DNA
  DNA is a named entity, but place is not
Case 4: be able to, be unaffected
  Acceptable boundary words should not be adjectives
Case 5: associate with, be associated with, associated with
  be associated with vs. associated with: no difference
  use to, be used to, used to: sometimes be used to differs from used to
  Example: “He used to smoke a pipe.”

12 Ranking Proper MWV Candidates
Head(c): the head word of a candidate is selected from left to right, but a word that is one of the most frequent verbs (be, have, do, …) or a preposition is excluded.
  Examples: result in, be able to, …
Assumption: the following aspects can be important for ranking a proper MWV candidate c.
  Aspect 1: f(c), the absolute frequency of c.
  Aspect 2: f(c) / f(head(c)), the ratio of f(c) to the frequency of the MWV head of c, i.e., head(c).
  Aspect 3: F(c) / f(head(c)), the ratio of F(c), the sum of all occurrences of candidates that share the same MWV head with c, to the frequency of the MWV head of c.
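The head-selection rule can be sketched in a few lines of Python. The word lists below are illustrative assumptions: the slide names only be, have and do and elides the rest, and the preposition set (which here includes the infinitive marker to) is a hypothetical stand-in for whatever POS information the original system used.

```python
from typing import Optional

# Illustrative lists only; the slide elides the full set of frequent verbs.
FREQUENT_VERBS = {"be", "have", "do"}
PREPOSITIONS = {"at", "for", "from", "in", "of", "on", "to", "upon", "with"}


def head(candidate: str) -> Optional[str]:
    """Return the first word, scanning left to right, that is neither a
    frequent verb nor a preposition; None if no such word exists."""
    for word in candidate.split():
        if word not in FREQUENT_VERBS and word not in PREPOSITIONS:
            return word
    return None


for c in ("result in", "be able to", "take place in"):
    print(c, "->", head(c))
```

Under these assumptions the head of result in is result, while the head of be able to is able, since be is skipped as a frequent verb.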

13 Flowchart of Ranking MWVs
[Flowchart, box labels only:]
  Input: contiguous MWV candidates; non-contiguous MWV candidates
  Controls of overgeneration:
    domain-specific terms
    verb chunker for common verb lemmata
    examine the compatibility of long and short candidates
    examine candidates of passive and active forms
    filter candidates with open boundaries
  Ranking of reliability
  If R(c) > t, output: reliable MWV candidate c
Symbols:
  head(c): the MWV head of a candidate c;
  f(c): the frequency of c;
  f(head(c)): the frequency of head(c);
  F(c): the sum of all occurrences of candidates that share the same MWV head with c;
  c1, c2 and c3: coefficients;
  R(c): the value of the reliability evaluation;
  t: threshold.
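The transcript preserves the ingredients of R(c) but not the formula itself. A minimal sketch, assuming (and this is only an assumption) that R(c) is a weighted sum of the three ranking aspects with the coefficients c1, c2 and c3, could look as follows. The function name and the counts in the usage line are hypothetical.

```python
def reliability(f_c: float, f_head: float, F_c: float,
                c1: float, c2: float, c3: float) -> float:
    """Assumed linear combination of the three aspects:
    c1 * f(c) + c2 * f(c)/f(head(c)) + c3 * F(c)/f(head(c)).
    The slides do not state this formula; it is an illustration only."""
    if f_head <= 0:
        raise ValueError("f(head(c)) must be positive")
    return c1 * f_c + c2 * (f_c / f_head) + c3 * (F_c / f_head)


# Hypothetical counts: candidate seen 7 times, its head 50 times,
# all candidates sharing that head 14 times.
print(reliability(f_c=7, f_head=50, F_c=14, c1=0.003, c2=0.5, c3=10))
```

A candidate would then be kept when its score exceeds the threshold t, as in the flowchart's R(c) > t test.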

14 Selection of Sample Set for Result Evaluation
Selection of candidates for result evaluation:
  the 33 most frequent candidates (f(c) >= 60)
  31 candidates with moderate frequencies (19 >= f(c) >= 14)
  95 candidates with low frequencies (f(c) = 6 or 7)
Evaluation according to:
  Oxford Advanced Learner’s Dictionary of Current English (encyclopedic edition)
  LEO German-English online dictionary (http://dict.leo.org/)

15 Baseline Performance
t = a · min R(c) + b (baseline: a = 1, b = 0)
Note: in this case, all candidates in the sample set are given a positive ranking value.

c1      c2    c3   t      Precision  Recall  F-measure
0.003   0.5    8   2.27   0.4516     1       0.6222
0.003   0.5   10   2.81   0.4565     1       0.6268
0.003   0.5   12   2.81   0.4565     1       0.6268
0.003   1     10   2.88   0.4565     1       0.6268
0.01    0.5   10   2.86   0.4421     1       0.6131
0.1     0.5   10   3.49   0.4158     1       0.5874
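The threshold rule t = a · min R(c) + b can be sketched as below. The scores are made-up values, the function names are hypothetical, and the comparison uses >= (an assumption) so that the baseline a = 1, b = 0 keeps every candidate in the sample, which is what a recall of 1 suggests.

```python
from typing import Dict, List


def threshold(scores: Dict[str, float], a: float = 1.0, b: float = 0.0) -> float:
    """t = a * min R(c) + b over all scored candidates."""
    return a * min(scores.values()) + b


def select(scores: Dict[str, float], a: float = 1.0, b: float = 0.0) -> List[str]:
    """Keep candidates whose score reaches the threshold (>= is an assumption)."""
    t = threshold(scores, a, b)
    return [c for c, r in scores.items() if r >= t]


# Made-up reliability scores for illustration.
scores = {"cand_a": 2.0, "cand_b": 3.0, "cand_c": 5.0}
print(select(scores))                # baseline: nothing is filtered out
print(select(scores, a=2.3, b=0.1))  # stricter threshold
```

Raising a or b trades recall for precision, which is the effect the evaluation explores.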

16 Evaluation
Baseline: a = 1, b = 0, then P = 0.4565, R = 1, F = 0.6273.
Let a = 2.3, b = 0.1 (or 0.2), then P = 0.6863, R = 0.8333, F = 0.7527.
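The reported figures are related by the usual balanced F-measure, F = 2PR / (P + R). A small sketch, with hypothetical function names and toy sets in the usage lines:

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted candidate set against a gold set."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    return p, r


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Toy sets for illustration only.
p, r = precision_recall({"result in", "bind DNA"}, {"result in", "lead to"})
print(p, r, f_measure(p, r))
print(round(f_measure(0.6863, 0.8333), 4))
```

With P = 0.6863 and R = 0.8333 the function gives about 0.7527, the value reported for a = 2.3.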

17 Result: A List of MWVs in Ranking Order
Note: Occ. = occurrences; * = dominated by the passive form.

No.  Occ.  MWV candidate
 1      8  be subject to
 2      5  subject* to
 3      7  give rise to
 4      7  take place
 5    325  result in
 6    271  lead to
 7    293  associate* with
 8     89  fail to
 9      7  culminate in
10      5  challenge* with
11      5  coincide with
12      5  submit* to
13      5  synergizes with
14     35  base* on
15     42  interfere with
16     91  derive* from
17     62  consist of
18     19  belong to
19    111  contribute to
20     17  attribute* to
21     41  compose* of
22     31  result from
23     56  be present in
24      5  base* upon

18 Summary
Our results present a sound balance between the low- and high-frequency MWV candidates in the sublanguage corpus.
  MWVs were found that share the same head but have different accessories (base on and base upon) or different perspectives (result in vs. result from);
  POS tag errors affect the ranking process (related JJ to);
  Some specific entries are difficult to evaluate (synergize, pretreat, etc.);
  The most frequent verbs/auxiliaries (be, have, do) were not considered in this experiment.
Ongoing and future work
  UMLS (Unified Medical Language System) SPECIALIST Lexicon instead of WordNet for verb stemming;
  Recognition of derivational forms of specific verbs;
  Combination with domain-specific analysis.

19 Thank you for your attention! Questions?

