
1 Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts
G. Melli, M. Ester, A. Sarkar
Dec. 6, 2007
http://www.gabormelli.com/2007/2007_MultiNaryBio_Melli_Presentation.ppt

2 Introduction
We propose a method for detecting n-ary relations that may span multiple sentences.
The motivation is to support the semi-automated population of subcellular localizations in db.psort.org.
–Organism / Protein / Location
We cast each document as a text graph and use machine learning to detect patterns in the graph.

3 Is there an SCL (subcellular localization) relation in this text?

4 Yes: (V. cholerae, TcpC, outer membrane)
Current algorithms are restricted to the detection of binary relations within one sentence: (TcpC, outer membrane).
Here is the relevant passage.

5 Challenge #1
A significant number of the relation cases (~40%) span multiple sentences.
Proposed solution:
–Create a text graph for the entire document.
–The graph can contain a superset of the information used by current binary-relation, single-sentence approaches (Jiang and Zhai, 2007; Zhou et al., 2007).
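
For illustration, a minimal sketch of one way to assemble such a document graph, with nodes for token positions and edges for within-sentence adjacency, sentence-to-sentence links, and coreference. The node and edge types here are assumptions for the sketch, not the paper's exact construction; parse-tree edges are omitted for brevity.

```python
from collections import defaultdict

def build_text_graph(sentences, coref_pairs):
    """Minimal document graph: nodes are (sentence_idx, token_idx) positions.

    sentences:   list of token lists, one per sentence.
    coref_pairs: list of ((s1, t1), (s2, t2)) coreferent token positions.
    """
    edges = defaultdict(set)

    def connect(u, v):
        edges[u].add(v)
        edges[v].add(u)

    for s, tokens in enumerate(sentences):
        for t in range(len(tokens) - 1):
            connect((s, t), (s, t + 1))  # within-sentence adjacency
        if s + 1 < len(sentences) and tokens:
            connect((s, len(tokens) - 1), (s + 1, 0))  # sentence-to-sentence link
    for u, v in coref_pairs:
        connect(u, v)  # coreference shortcut edges across sentences
    return edges
```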

6 Automated Markup
Syntactic analysis: 1. End-of-sentence detection 2. Part-of-speech tagging 3. Parse tree
Semantic analysis: 1. Named-entity recognition 2. Coreference resolution
[Figure: abstract text marked up with ORG, PROT, and LOC entity tags, e.g. "pilus" tagged as LOC]

7 A Single Relation Case

8 Challenge #2: An n-ary Relation
–The task involves three entity mentions: Organism, Protein, Subcellular Location.
–Current approaches are designed for detecting mentions with two entities.
Proposed solution:
–Create a feature vector that contains the information for all three pairings (see the sketch below).
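
One way to realize such a vector (a sketch under assumed mention and feature representations, not the paper's exact feature set) is to apply the same pairwise feature template to each of the three pairings and concatenate the results:

```python
def pairwise_features(a, b, prefix):
    """Illustrative pairwise features for one entity pairing.

    a, b: mentions as (sentence_index, token_index, text) triples.
    """
    return {
        prefix + "_sent_distance": abs(a[0] - b[0]),
        prefix + "_same_sentence": a[0] == b[0],
        prefix + "_token_distance": abs(a[1] - b[1]) if a[0] == b[0] else -1,
    }

def nary_feature_vector(org, prot, loc):
    # Concatenate the same pairwise template over all three pairings, so the
    # classifier sees evidence for (ORG, PROT), (PROT, LOC), and (ORG, LOC).
    features = {}
    features.update(pairwise_features(org, prot, "org_prot"))
    features.update(pairwise_features(prot, loc, "prot_loc"))
    features.update(pairwise_features(org, loc, "org_loc"))
    return features
```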

9 3-ary Relation Feature Vector

10 PPLRE v1.4 Data Set
540 true and 4,769 false curated relation cases drawn from 843 research paper abstracts.
267 of the 540 true relation cases (~49%) span multiple sentences.
Data available at koch.pathogenomics.ca/pplre/

11 Performance Results
Tested against two baselines that were tuned to this task: YSRL and Zparser.
TeGRR achieved the highest F-score (by significantly increasing the Recall).
Results are 5-fold cross-validated.

12 Research Directions
1. Actively grow the PSORTdb curated set.
2. Qualify the certainty of a case.
–E.g. label cases with: "experiment", "hypothesized", "assumed", and "false".
3. Ontology-constrained predictions.
–E.g. Gram-positive bacteria do not have a periplasm, therefore do not predict periplasm.
4. Application to other tasks.

13 Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts
G. Melli, M. Ester, A. Sarkar
Dec. 6, 2007
http://www.gabormelli.com/2007/2007_MultiNaryBio_Melli_Presentation.ppt

14 Extra Slides for Questions

15 Shortened Reference List
M. Craven and J. Kumlien. (1999). Constructing Biological Knowledge-bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molecular Biology.
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx: Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3).
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
Y. Liu, Z. Shi, and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
Z. Shi, A. Sarkar, and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven, and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction Using a Convolution Tree Kernel. In Proc. of NAACL/HLT-2006.

16 Pipelined Process Framework

17 Relation Case Generation
Input (D, R): a text document D and a set of semantic relations R with a arguments.
Output (C): a set of unlabelled semantic relation cases.
Method: Identify all e entity mentions E_i in D, then create every combination of a entity mentions from the e mentions in the document (without replacement); see the sketch after this slide.
–For intrasentential semantic relation detection and classification tasks, limit the entity mentions to be from the same sentence.
–For typed semantic relation detection and classification tasks, limit the combinations to those where the semantic class of each entity mention E_i matches the semantic class of its corresponding relation argument A_i.
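
A minimal sketch of this enumeration, assuming each mention is represented as a (text, semantic class, sentence index) triple; the typed check below is order-insensitive for simplicity:

```python
from itertools import combinations

def generate_relation_cases(mentions, arity, arg_classes=None,
                            intrasentential=False):
    """Enumerate unlabelled candidate cases from a document's entity mentions.

    mentions:    list of (text, semantic_class, sentence_index) triples.
    arity:       number of relation arguments `a`.
    arg_classes: optional list of the relation's argument classes,
                 e.g. ["ORGANISM", "PROTEIN", "LOCATION"].
    """
    cases = []
    for combo in combinations(mentions, arity):  # without replacement
        # Intrasentential variant: all mentions must share one sentence.
        if intrasentential and len({m[2] for m in combo}) > 1:
            continue
        # Typed variant: mention classes must match the relation's argument
        # classes (checked as multisets, ignoring argument order).
        if arg_classes is not None and \
                sorted(m[1] for m in combo) != sorted(arg_classes):
            continue
        cases.append(combo)
    return cases
```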

18 Relation Case Labeling

19 Naïve Baseline Algorithms
Predict True: always predicts "True" regardless of the contents of the relation case.
–Attains the maximum Recall achievable by any algorithm on the task.
–Attains the maximum F1 of any naïve algorithm.
–Most commonly used naïve baseline.

20 Prediction Outcome Labels
true positive (tp): predicted to have the label True and whose label is indeed True.
false positive (fp): predicted to have the label True but whose label is instead False.
true negative (tn): predicted to have the label False and whose label is indeed False.
false negative (fn): predicted to have the label False but whose label is instead True.
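
These four outcomes can be tallied directly from parallel lists of predicted and gold labels; a minimal sketch:

```python
def count_outcomes(predicted, gold):
    # predicted, gold: parallel lists of booleans (True = relation holds).
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    tn = sum(not p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    return tp, fp, tn, fn
```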

21 Performance Metrics
Precision (P): probability that a test case predicted to have label True is a tp.
Recall (R): probability that a True test case will be a tp.
F-measure (F1): harmonic mean of the Precision and Recall estimates.
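
These definitions correspond to the standard formulas P = tp/(tp+fp), R = tp/(tp+fn), and F1 = 2PR/(P+R); a small helper:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0   # P = tp / (tp + fp)
    recall = tp / (tp + fn) if tp + fn else 0.0      # R = tp / (tp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of P and R
    return precision, recall, f1
```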

22 Token-based Features
Example: "Protein1 is a Location1..."
Token Distance
–2 intervening tokens
Token Sequence(s)
–Unigrams
–Bigrams
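
For the example above, a sketch of extracting these features from the tokens between two mention positions (the token-list representation is assumed for illustration):

```python
def token_features(tokens, i, j):
    """Token-based features for mentions at positions i < j in one sentence."""
    between = tokens[i + 1:j]
    feats = {"token_distance": len(between)}   # number of intervening tokens
    for w in between:                          # unigram features
        feats["uni_" + w] = True
    for a, b in zip(between, between[1:]):     # bigram features
        feats["bi_" + a + "_" + b] = True
    return feats

# "Protein1 is a Location1" -> distance 2, unigrams {is, a}, bigram (is, a)
print(token_features(["Protein1", "is", "a", "Location1"], 0, 3))
```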

23 Token-based Features (cont.)
Token Part-of-Speech Role Sequences

24 Additional Features/Knowledge
Expose additional features that can identify the more esoteric ways of expressing a relation.
Features from outside of the "shortest path":
–Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007).
–(Zhou et al., 2007) add heuristics for five common situations.
Use domain-specific background knowledge:
–E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm, therefore do not predict periplasm (see the filter sketch below).
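
Such a constraint could be applied as a post-hoc filter over predicted (organism, protein, location) triples; a sketch, where the Gram-positive lookup set is a placeholder for real background knowledge:

```python
# Organisms treated as lacking a periplasm; illustrative entry from the slide.
GRAM_POSITIVE = {"M. tuberculosis"}

def ontology_filter(predicted_triples):
    kept = []
    for organism, protein, location in predicted_triples:
        # Constraint: never predict "periplasm" for a Gram-positive organism.
        if organism in GRAM_POSITIVE and location == "periplasm":
            continue
        kept.append((organism, protein, location))
    return kept
```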

25 Challenge: Qualifying the Certainty of a Relation Case
It would be useful to qualify the certainty that can be assigned to a relation mention.
E.g. in the news domain, distinguish relation mentions based on first-hand information from those based on hearsay.
Idea: add an additional label to each relation case that qualifies the certainty of the statement.
E.g. in the PPLRE task, label cases with: "directly validated", "indirectly validated", "hypothesized", and "assumed".

