ISMB 2003 presentation Extracting Synonymous Gene and Protein Terms from Biological Literature Hong Yu and Eugene Agichtein Dept. Computer Science, Columbia.

Presentation transcript:

ISMB 2003 presentation Extracting Synonymous Gene and Protein Terms from Biological Literature Hong Yu and Eugene Agichtein Dept. Computer Science, Columbia University, New York, USA {hongyu,

Significance and Introduction
Genes and proteins are often associated with multiple names
  e.g. Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death
Authors often use different synonyms
Information extraction benefits from identifying those synonyms
Synonym knowledge sources are not complete
Goal: develop automated approaches for identifying gene/protein synonyms from the literature

Background: synonym identification
Semantically related words
  Distributional similarity [Lin 98] [Li and Abe 98] [Dagan et al 95]
  “beer” and “wine” share contexts such as “drink”, “people”, “bottle” and “make”
Mapping abbreviations to full forms
  Map LARD to lymphocyte associated receptor of death
  [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02]
Methods for detecting biomedical multiword synonyms
  Sharing a word(s) [Hole 00]
    cerebrospinal fluid ↔ cerebrospinal fluid protein assay
  Information retrieval approach
    Trigram matching algorithm [Wilbur and Kim 01]
    Vector space model
    cerebrospinal fluid → cer, ere, …, uid
    cerebrospinal fluid protein assay → cer, ere, …, say
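The trigram idea above can be illustrated with a small sketch. It assumes plain overlapping character trigrams and unweighted cosine similarity; the weighting used in [Wilbur and Kim 01] is not given in the slides, and the function names are made up for illustration.

```python
import math
from collections import Counter

def char_trigrams(term):
    """Return a bag (Counter) of overlapping 3-character substrings."""
    text = term.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse trigram-count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Terms taken from the slides; the scoring itself is an assumption.
short = char_trigrams("cerebrospinal fluid")
long_ = char_trigrams("cerebrospinal fluid protein assay")
unrelated = char_trigrams("lymphocyte associated receptor of death")

print(cosine(short, long_) > cosine(short, unrelated))
```

Because every trigram of the shorter term also occurs in the longer one, the two score as close neighbours in the vector space even though they are not exact matches.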

Background: synonym identification (GPE)
GPE [Yu et al 02]
  A rule-based approach for detecting synonymous gene/protein terms
  Manually identify the patterns authors use to list synonyms
    Apo3/TRAMP/WSL/DR3/LARD
  Extract synonym candidates and apply heuristics to filter out unrelated terms
    ng/kg/min
Advantages and disadvantages
  High precision (90%)
  Recall might be low; rules are expensive to build
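A rough illustration of this kind of rule, not the actual GPE rule set: the regular expression, the unit stop-list, and the function name below are all hypothetical, chosen only to reproduce the two examples on the slide.

```python
import re

# Hypothetical stop-list of measurement units; the real GPE heuristics
# are not spelled out in the slides.
UNIT_TOKENS = {"ng", "mg", "kg", "ml", "min", "h"}

# Two or more word-like tokens joined by slashes, e.g. A/B/C.
SLASH_LIST = re.compile(r"\b\w[\w-]*(?:/\w[\w-]*)+")

def slash_synonym_candidates(sentence):
    """Return slash-separated term groups, skipping groups made only of units."""
    groups = []
    for match in SLASH_LIST.finditer(sentence):
        terms = match.group(0).split("/")
        if all(term.lower() in UNIT_TOKENS for term in terms):
            continue  # looks like a dosage such as ng/kg/min, not synonyms
        groups.append(terms)
    return groups

print(slash_synonym_candidates("Apo3/TRAMP/WSL/DR3/LARD was given at 5 ng/kg/min."))
```

A hand-written filter like this is what makes the approach precise but costly: every new false-positive family needs another rule.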

Background: machine learning
Machine learning reduces manual effort by automatically acquiring rules from data
Unsupervised and supervised
Semi-supervised
  Bootstrapping [Hearst 92, Yarowsky 95] [Agichtein and Gravano 00]
    Hyponym detection [Hearst 92]
      “The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.”
      A Bambara ndang is a kind of bow lute
  Co-training [Blum and Mitchell 98]

Method: outline
Machine learning
  Unsupervised: similarity [Dagan et al 95]
  Semi-supervised: bootstrapping, SNOWBALL [Agichtein and Gravano 02]
  Supervised: Support Vector Machine
Comparison between machine learning and GPE
Combined approach

Method: unsupervised
Contextual similarity [Dagan et al 95]
Hypothesis: synonyms have similar surrounding words
Mutual information
Similarity
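The transcript names mutual information and similarity without their formulas. As a rough sketch, assuming pointwise mutual information over a fixed context window and a min/max overlap ratio in the spirit of [Dagan et al 95] (the definitions used in the talk may differ; the window size, function names, and toy corpus are illustrative only):

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """Count context words seen within `window` tokens of each target word."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[target][tokens[j]] += 1
    return pair_counts

def pmi_vectors(pair_counts):
    """Map each target word to its contexts with positive pointwise MI."""
    total = sum(sum(c.values()) for c in pair_counts.values())
    target_totals = {t: sum(c.values()) for t, c in pair_counts.items()}
    context_totals = Counter()
    for contexts in pair_counts.values():
        context_totals.update(contexts)
    vectors = {}
    for target, contexts in pair_counts.items():
        vec = {}
        for ctx, n in contexts.items():
            pmi = math.log2(n * total / (target_totals[target] * context_totals[ctx]))
            if pmi > 0:
                vec[ctx] = pmi
        vectors[target] = vec
    return vectors

def similarity(vec_a, vec_b):
    """Sum of per-context minima divided by sum of per-context maxima."""
    keys = set(vec_a) | set(vec_b)
    num = sum(min(vec_a.get(k, 0.0), vec_b.get(k, 0.0)) for k in keys)
    den = sum(max(vec_a.get(k, 0.0), vec_b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

# Toy corpus (made up) built around the "beer"/"wine" example.
corpus = [
    ["people", "drink", "beer"], ["people", "drink", "wine"],
    ["people", "make", "beer"], ["people", "make", "wine"],
    ["dogs", "chase", "cats"],
]
vecs = pmi_vectors(context_counts(corpus))
print(similarity(vecs["beer"], vecs["wine"]), similarity(vecs["beer"], vecs["cats"]))
```

Words that occur in the same contexts score near 1, and words with disjoint contexts score 0, which is the hypothesis stated on the slide.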

Method: semi-supervised
SNOWBALL [Agichtein and Gravano 02]
Bootstrapping
  Starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples.
  {Apo3, DR3}
  “Apo3, also known as DR3…”
  “, also known as ”
  {DR3, LARD}
  “DR3, also called LARD…”
  “, also called ”
  {LARD, Apo3}
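The bootstrapping loop can be sketched as follows. This is a heavily simplified illustration, not the SNOWBALL system: it keeps only the exact text between two terms as a pattern, and omits pattern scoring, left/right contexts, and confidence thresholds. The sentences are invented, and all names are hypothetical.

```python
import re

TERM = r"[A-Za-z0-9-]+"

def learn_middles(sentences, pairs):
    """Collect the text found between two terms already known as synonyms."""
    middles = set()
    for a, b in pairs:
        gap = re.compile(rf"\b{re.escape(a)}\b(.{{1,30}}?)\b{re.escape(b)}\b")
        for sentence in sentences:
            match = gap.search(sentence)
            if match:
                middles.add(match.group(1))
    return middles

def apply_middles(sentences, middles):
    """Extract every (left, right) term pair joined by a learned middle."""
    found = set()
    for middle in middles:
        pattern = re.compile(rf"({TERM}){re.escape(middle)}({TERM})")
        for sentence in sentences:
            found.update(pattern.findall(sentence))
    return found

def bootstrap(sentences, seeds, max_rounds=5):
    """Alternate pattern learning and tuple extraction until nothing new appears."""
    pairs = set(seeds)
    for _ in range(max_rounds):
        new_pairs = apply_middles(sentences, learn_middles(sentences, pairs)) - pairs
        if not new_pairs:
            break
        pairs |= new_pairs
    return pairs

# Invented sentences mirroring the chain of examples on the slide.
SENTENCES = [
    "Apo3, also known as DR3, was studied.",
    "DR3, also known as LARD, was studied.",
    "DR3, also called LARD, was studied.",
    "LARD, also called Apo3, was studied.",
]
print(sorted(bootstrap(SENTENCES, {("Apo3", "DR3")})))
```

Starting from one seed, the first round learns ", also known as " and finds (DR3, LARD); the second round learns ", also called " from that new pair and finds (LARD, Apo3).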

Method: supervised
Support Vector Machine
  State-of-the-art text classification method
  SVM light
Training sets: the same sets of positive and negative tuples as SNOWBALL
Features: the same terms and term weights used by SNOWBALL
Kernel function: radial basis function (RBF) kernel
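For reference, the standard RBF kernel is K(x, y) = exp(-gamma * ||x - y||^2). A minimal sketch over sparse term-weight vectors follows; the gamma value and the dictionary representation are assumptions, since the slides do not give the settings used with SVM light.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel between two sparse vectors stored as {term: weight} dicts."""
    keys = set(x) | set(y)
    squared_distance = sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * squared_distance)

# Hypothetical term weights for two candidate contexts.
a = {"also": 0.5, "known": 0.5, "as": 0.2}
b = {"also": 0.5, "called": 0.6}
print(rbf_kernel(a, a), rbf_kernel(a, b))
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the contexts diverge.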

Method: combined
Rationale
  Machine-learning approaches increase recall
  The manual rule-based approach GPE has high precision with lower recall
  Combining them will boost both recall and precision
Method
  Assume each system is an independent predictor
  Prob = 1 - Prob that all systems extracted incorrectly
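Under the independence assumption, the combined score for a pair is one minus the product of each system's probability of being wrong. A minimal sketch, with a hypothetical function name and example confidences:

```python
def combined_confidence(confidences):
    """Probability that at least one independent system extracted the pair correctly."""
    prob_all_wrong = 1.0
    for p in confidences:
        prob_all_wrong *= 1.0 - p
    return 1.0 - prob_all_wrong

# Example: one system is 90% confident in a pair, another 50% confident.
print(combined_confidence([0.9, 0.5]))
```

A pair proposed by several systems therefore scores at least as high as its best single score, while a pair proposed by no system scores 0.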

Evaluation: data
Data
  GeneWays corpora [Friedman et al 01]
  52,000 full-text journal articles
  Science, Nature, Cell, EMBO, Cell Biology, PNAS, Journal of Biochemistry
Preprocessing
  Gene/protein named entity tagging: Abgene [Tanabe and Wilbur 02]
  Segmentation: SentenceSplitter
Training and testing
  20,000 articles for training
    Tuning SNOWBALL parameters such as context window, etc.
  32,000 articles for testing

Evaluation: metrics
Estimating precision
  Randomly select 20 synonyms with confidence scores ( , , …, )
  Biological experts judged the correctness of synonym pairs
Estimating recall
  SWISSPROT as gold standard
  989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set
  Biological experts judged that 588 pairs were indeed synonyms
  “…and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2…”
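The two estimates can be sketched as below. The case-insensitive, order-insensitive pair matching is an assumption made for illustration, as are the function names and sample data; the slides do not say how pairs were matched against the gold standard.

```python
def normalize(pair):
    """Treat a synonym pair as unordered and case-insensitive (an assumption)."""
    return frozenset(term.lower() for term in pair)

def recall(extracted_pairs, gold_pairs):
    """Fraction of gold-standard pairs that the system extracted."""
    gold = {normalize(p) for p in gold_pairs}
    found = {normalize(p) for p in extracted_pairs} & gold
    return len(found) / len(gold) if gold else 0.0

def sampled_precision(judgments):
    """Fraction of a judged random sample that experts marked correct."""
    return sum(judgments) / len(judgments) if judgments else 0.0

# Made-up data: two gold pairs, one of which was extracted.
gold = [("Apo3", "DR3"), ("DR3", "LARD")]
extracted = [("dr3", "apo3"), ("TRAMP", "WSL")]
print(recall(extracted, gold), sampled_precision([True, True, False, True]))
```

In the evaluation described above, the gold set would be the 588 expert-confirmed pairs rather than all 989 co-occurring SWISSPROT pairs.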

Results
Patterns SNOWBALL found
  [Pattern table with columns Conf, Left, Middle, Right; row contents not recoverable]
Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT

Results

System performance
System   Tagging   Similarity   Snowball   SVM     GPE
Time     7 h       40 min       2 h        1.5 h   35 min

Conclusions
Extraction techniques can be used as a valuable supplement to resources such as SWISSPROT
Identification of synonym relations can be automated through machine-learning approaches
SNOWBALL can be applied successfully to recognize synonym patterns