Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts G. Melli, M. Ester, A. Sarkar Dec. 6, 2007


Introduction
We propose a method for detecting n-ary relations that may span multiple sentences.
The motivation is to support the semi-automated population of subcellular localizations in db.psort.org.
– Organism / Protein / Location
We cast each document as a text graph and use machine learning to detect patterns in the graph.

Is there an SCL in this text?

Yes: (V. cholerae, TcpC, outer membrane).
Current algorithms are restricted to the detection of binary relations within one sentence: (TcpC, outer membrane).
Here is the relevant passage.

Challenge #1
A significant number of the relation cases (~40%) span multiple sentences.
Proposed solution:
– Create a text graph for the entire document.
– The graph can contain a superset of the information used by the current single-sentence binary relation approaches (Jiang and Zhai, 2007; Zhou et al., 2007).

Automated Markup
[Figure: example passage annotated with ORG, PROT, and LOC entity labels]
Syntactic analysis
1. End of sentence
2. Part-of-speech
3. Parse tree
Semantic analysis
1. Named-entity recognition
2. Coreference resolution

A Single Relation Case

Challenge #2
An n-ary relation:
– The task involves three entity mentions: Organism, Protein, Subcellular Location.
– Current approaches are designed for detecting mentions with two entities.
Proposed solution:
– Create a feature vector that contains the information for the three pairings.
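The three-pairings idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `pair_features` helper, its two features, the feature-name prefixes, and the token and sentence positions are all hypothetical and chosen only to show how pairwise information can be merged into one vector for a 3-ary case.

```python
from itertools import combinations

# Hypothetical pairwise features; the slides do not list the real ones here.
def pair_features(m1, m2):
    return {
        # Number of tokens strictly between the two mentions.
        "token_distance": abs(m1["token"] - m2["token"]) - 1,
        # 1 if both mentions fall in the same sentence, else 0.
        "same_sentence": int(m1["sentence"] == m2["sentence"]),
    }

def nary_feature_vector(case, roles=("ORG", "PROT", "LOC")):
    """Merge the features of every role pairing into a single dict."""
    vector = {}
    for role_a, role_b in combinations(roles, 2):
        feats = pair_features(case[role_a], case[role_b])
        for name, value in feats.items():
            # Prefix each feature with the pairing it came from.
            vector[f"{role_a}-{role_b}:{name}"] = value
    return vector

# Made-up positions for one candidate (Organism, Protein, Location) case.
case = {
    "ORG": {"token": 2, "sentence": 0},
    "PROT": {"token": 15, "sentence": 1},
    "LOC": {"token": 20, "sentence": 1},
}
vector = nary_feature_vector(case)
for name in sorted(vector):
    print(name, vector[name])
```

Because each pairing contributes its own prefixed features, a learner that expects fixed-length input sees the same set of feature names for every 3-ary case, whether or not the mentions share a sentence.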

3-ary Relation Feature Vector

PPLRE v1.4 Data Set
540 true and 4,769 false curated relation cases drawn from 843 research paper abstracts.
267 of the 540 true relation cases (~49%) span multiple sentences.
Data available at koch.pathogenomics.ca/pplre/

Performance Results
Tested against two baselines that were tuned to this task: YSRL and Zparser.
TeGRR achieved the highest F-score (by significantly increasing the Recall).
Results are 5-fold cross-validated.

Research Directions
1. Actively grow the PSORTdb curated set.
2. Qualifying the certainty of a case.
– E.g. label cases with: “experiment”, “hypothesized”, “assumed”, and “False”.
3. Ontology-constrained predictions.
– E.g. Gram-positive bacteria do not have a periplasm; therefore do not predict periplasm.
4. Application to other tasks.


Extra Slides for Questions

Shortened Reference List
M. Craven and J. Kumlien. (1999). Constructing Biological Knowledge-bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molec. Bio.
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx--Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3).
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
Y. Liu, Z. Shi, and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
Z. Shi, A. Sarkar, and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven, and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. In Proc. of NAACL/HLT-2006.

Pipelined Process Framework

Relation Case Generation
Input: (D, R): a text document D and a set of semantic relations R with a arguments.
Output: (C): a set of unlabelled semantic relation cases.
Method:
1. Identify all e entity mentions E_i in D.
2. Create every combination of a entity mentions from the e mentions in the document (without replacement).
– For intrasentential semantic relation detection and classification tasks, limit the entity mentions to be from the same sentence.
– For typed semantic relation detection and classification tasks, limit the combinations to those where there is a match between the semantic classes of each of the entity mentions E_i and the semantic class of their corresponding relation argument A_i.
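The generation method above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the `Mention` type, the `generate_cases` signature, and the sentence indices in the example are assumptions. For the typed variant the sketch enumerates ordered selections so that each position can be checked against its argument class.

```python
from itertools import combinations, permutations
from typing import NamedTuple

class Mention(NamedTuple):
    text: str
    sem_class: str  # semantic class, e.g. "ORG", "PROT", "LOC"
    sentence: int   # index of the sentence containing the mention

def generate_cases(mentions, arity, arg_classes=None, intrasentential=False):
    """Return unlabelled relation cases as tuples of `arity` mentions."""
    if arg_classes is None:
        # Untyped task: every combination without replacement.
        candidates = combinations(mentions, arity)
    else:
        # Typed task: keep selections whose i-th mention matches the
        # semantic class of the i-th relation argument.
        candidates = (
            p for p in permutations(mentions, arity)
            if all(m.sem_class == c for m, c in zip(p, arg_classes))
        )
    return [
        case for case in candidates
        # Intrasentential task: all mentions must share one sentence.
        if not intrasentential or len({m.sentence for m in case}) == 1
    ]

# Entity mentions from the example slide; sentence indices are made up.
mentions = [
    Mention("V. cholerae", "ORG", 0),
    Mention("TcpC", "PROT", 1),
    Mention("outer membrane", "LOC", 1),
    Mention("pilus", "LOC", 2),
]
typed = generate_cases(mentions, 3, arg_classes=("ORG", "PROT", "LOC"))
untyped = generate_cases(mentions, 3)
binary_same_sentence = generate_cases(
    mentions, 2, arg_classes=("PROT", "LOC"), intrasentential=True
)
print(len(typed), len(untyped), len(binary_same_sentence))
```

With these made-up positions, the typed 3-ary setting yields the two (ORG, PROT, LOC) candidates, while the single-sentence binary setting yields only the (TcpC, outer membrane) pair, which mirrors the restriction of the binary approaches described earlier.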

Relation Case Labeling

Naïve Baseline Algorithms
Predict True: always predicts “True” regardless of the contents of the relation case.
– Attains the maximum Recall achievable by any algorithm on the task.
– Attains the maximum F1 achievable by any naïve algorithm.
– Most commonly used naïve baseline.

Prediction Outcome Labels
true positive (tp)
– predicted to have the label True and whose label is indeed True.
false positive (fp)
– predicted to have the label True but whose label is instead False.
true negative (tn)
– predicted to have the label False and whose label is indeed False.
false negative (fn)
– predicted to have the label False and whose label is instead True.

Performance Metrics
Precision (P): probability that a test case that is predicted to have label True is tp.
Recall (R): probability that a True test case will be tp.
F-measure (F1): harmonic mean of the Precision and Recall estimates.
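As a worked illustration of these definitions, the following Python sketch counts the four outcome labels and derives P, R, and F1. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention. The example applies the metrics to the Predict True baseline using the PPLRE v1.4 class counts (540 true, 4,769 false).

```python
def confusion_counts(gold, predicted):
    """Count tp, fp, tn, fn from parallel lists of boolean labels."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif not p and not g:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision_recall_f1(tp, fp, fn):
    # Assumed convention: a metric with a zero denominator is 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Predict True baseline on the PPLRE v1.4 class distribution.
gold = [True] * 540 + [False] * 4769
predicted = [True] * len(gold)
tp, fp, tn, fn = confusion_counts(gold, predicted)
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

On these counts the baseline reaches a recall of 1.0 with a precision of roughly 0.10, so its F1 is roughly 0.18; any learned model has to beat that figure to be useful.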

Token-based Features
“Protein1 is a Location1...”
Token Distance
– 2 intervening tokens
Token Sequence(s)
– Unigrams
– Bigrams
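A small Python sketch of these token-based features, using the slide's example phrase. The `token_features` function and its feature-name scheme are hypothetical, and computing the unigrams and bigrams over the intervening tokens only is an assumption; the slide does not say which token span they cover.

```python
def token_features(tokens, i, j):
    """Features for the mentions at token positions i and j."""
    lo, hi = sorted((i, j))
    between = tokens[lo + 1:hi]  # tokens strictly between the mentions
    feats = {"token_distance": len(between)}
    for tok in between:
        feats[f"uni={tok}"] = 1          # unigram indicator
    for a, b in zip(between, between[1:]):
        feats[f"bi={a}_{b}"] = 1         # bigram indicator
    return feats

tokens = ["Protein1", "is", "a", "Location1"]
feats = token_features(tokens, 0, 3)
print(feats)
```

For the example phrase the two intervening tokens are "is" and "a", which gives a token distance of 2, two unigram indicators, and one bigram indicator.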

Token-based Features (cont.)
Token Part-of-Speech Role Sequences

Additional Features/Knowledge
Expose additional features that can identify the more esoteric ways of expressing a relation.
Features from outside of the “shortest-path”:
– Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007).
– (Zhou et al., 2007) add heuristics for five common situations.
Use domain-specific background knowledge:
– E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm; therefore do not predict periplasm.
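One way the background-knowledge idea could be applied is as a filter over candidate predictions. The Python sketch below is only an illustration under stated assumptions: the `DISALLOWED` table, the `apply_constraints` function, and the scores in the example are made up, and the slides do not describe how such constraints would actually be integrated.

```python
# Hypothetical background knowledge: locations that an organism class
# cannot have. Only the periplasm rule comes from the slides.
DISALLOWED = {
    "Gram-positive": {"periplasm"},
}

def apply_constraints(organism_class, scored_locations):
    """Drop predicted locations that the organism class rules out."""
    banned = DISALLOWED.get(organism_class, set())
    return {
        loc: score
        for loc, score in scored_locations.items()
        if loc not in banned
    }

# Made-up classifier scores for one protein.
scores = {"periplasm": 0.7, "cytoplasm": 0.6}
kept = apply_constraints("Gram-positive", scores)
print(kept)
```

A filter of this kind leaves predictions for other organism classes unchanged, so it can only remove false positives that contradict the stated knowledge.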

Challenge: Qualifying the Certainty of a Relation Case
It would be useful to qualify the certainty that can be assigned to a relation mention.
E.g. in the news domain, distinguish relation mentions based on first-hand information from those based on hearsay.
Idea: add an additional label to each relation case that qualifies the certainty of the statement.
E.g. in the PPLRE task, label cases with: “directly validated”, “indirectly validated”, “hypothesized”, and “assumed”.