Playing Biology’s Name Game: Identifying Protein Names In Scientific Text. Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer. Pac Symp.

Playing Biology’s Name Game: Identifying Protein Names In Scientific Text. Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer. Pac Symp Biocomput. 2003

Abstract
- Construction of a comprehensive, general-purpose name dictionary
- An accompanying automatic curation procedure based on a simple token model of protein names
- An efficient search algorithm to analyze all abstracts in MEDLINE
- Parameters are optimized using machine learning techniques

Model for protein and gene names
- Protein names are often composed of more than one word (token)
- The order of these words is not very important: permutations of tokens may occur
- General-purpose dictionaries of protein names must be composed automatically

Token classes (1/3)

Token classes (2/3)
- Extract all words from the dictionary with a frequency of occurrence > 100
- Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match
- Modifier tokens: words crucial for correct recognition

Token classes (3/3)
- Specifier tokens: Arabic and Roman numerals and Greek letters
- Delimiter tokens: used to gain specificity in the matching procedure; they help identify name boundaries
- Common words: obtained by comparison to a standard English dictionary
- Standard tokens: gene identifiers, as they cannot easily be assigned to a separate class
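The token classes above can be illustrated with a minimal Python sketch. The word lists, the tokenization rule (splitting at letter/digit boundaries and punctuation), and the names `tokenize` and `token_class` are assumptions made for illustration; the slides do not list the actual class members.

```python
import re

# Hypothetical, tiny word lists; the real lists are derived from the
# dictionary and a standard English word list and are not given in the slides.
NON_DESCRIPTIVE = {"precursor", "fragment", "protein"}
MODIFIER = {"receptor", "ligand", "inhibitor"}
DELIMITER = {"and", "or", "of", ",", "(", ")"}
COMMON = {"the", "was", "that"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}


def tokenize(text):
    """Split text into runs of letters, runs of digits, and single symbols.

    Assumption: this splitting rule is only an illustrative choice.
    """
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)


def token_class(token):
    """Return the (assumed) token class name for a single token."""
    t = token.lower()
    if t.isdigit() or t in ROMAN or t in GREEK:
        return "specifier"
    if t in DELIMITER:
        return "delimiter"
    if t in MODIFIER:
        return "modifier"
    if t in NON_DESCRIPTIVE:
        return "non_descriptive"
    if t in COMMON:
        return "common"
    # Everything else, e.g. gene identifiers, falls into the standard class.
    return "standard"
```

For example, `[token_class(t) for t in tokenize("IL-2 receptor")]` classifies "IL" and "-" as standard, "2" as a specifier, and "receptor" as a modifier under these toy lists.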

Automatic generation of the dictionary
- Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database
- Create an entry for each official gene symbol and add the corresponding names from the OMIM database
- Extract all synonyms from the SWISSPROT and TREMBL databases and match these to HUGO entries
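The three generation steps can be sketched in Python as follows. The input shapes, the function name `build_dictionary`, and the matching rule (a SWISSPROT/TREMBL synonym list is attached to a HUGO entry when it shares a name with exactly one entry) are assumptions; the slides do not specify how the matching is done.

```python
from collections import defaultdict


def build_dictionary(hugo_records, omim_names, protein_synonyms):
    """Merge name sources into one entry per official gene symbol.

    hugo_records:     iterable of (symbol, aliases, full_name)
    omim_names:       dict mapping symbol -> list of names
    protein_synonyms: iterable of synonym lists, one per protein record
    All three input formats are hypothetical.
    """
    dictionary = defaultdict(set)

    # Steps 1 and 2: one entry per official symbol, plus OMIM names.
    for symbol, aliases, full_name in hugo_records:
        dictionary[symbol].update([symbol, full_name, *aliases])
        dictionary[symbol].update(omim_names.get(symbol, []))

    # Case-insensitive index from known names to their entry.
    index = {}
    for symbol, names in dictionary.items():
        for name in names:
            index.setdefault(name.lower(), symbol)

    # Step 3: attach protein synonyms that match exactly one entry.
    for synonyms in protein_synonyms:
        owners = {index[s.lower()] for s in synonyms if s.lower() in index}
        if len(owners) == 1:
            dictionary[owners.pop()].update(synonyms)

    return dict(dictionary)
```

Synonym lists that match no entry, or more than one, are skipped in this sketch; in practice such cases would need separate handling.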

Curation of the dictionary (1/3)
- Goal: resolve ambiguities and remove nonsensical names from the dictionary
- The curation procedure consists of two phases: expansion and pruning
- Expansion:

Curation of the dictionary (2/3)
- Pruning: remove redundancies, ambiguities, and irrelevant synonyms
- First, each synonym is converted to a sequence of token class identifiers
- Regular expressions are then used to search for unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.)
- Finally, a list of ambiguous names is stored separately, with references to their original records
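A minimal Python sketch of the pruning idea: map each synonym to a string of one-letter class codes, then test that string against a regular expression. The class codes, the pattern, the toy classifier, and the function names are all assumptions for illustration, not the patterns used by the authors.

```python
import re
from collections import defaultdict

# Hypothetical one-letter codes for the token classes.
CLASS_CODES = {
    "non_descriptive": "n", "modifier": "m", "specifier": "s",
    "delimiter": "d", "common": "c", "standard": "t",
}

# Assumed pattern: a synonym is unspecific if it consists only of
# non-descriptive, only specifier, only common, or only delimiter tokens.
UNSPECIFIC = re.compile(r"^(?:n+|s+|c+|d+)$")

# Toy classifier standing in for the real token classification.
TOY_CLASSES = {
    "precursor": "non_descriptive", "receptor": "modifier",
    "2": "specifier", "alpha": "specifier", "of": "delimiter",
}


def toy_classify(token):
    return TOY_CLASSES.get(token.lower(), "standard")


def class_sequence(tokens, classify=toy_classify):
    """Encode a tokenized synonym as a string of class codes."""
    return "".join(CLASS_CODES[classify(t)] for t in tokens)


def is_unspecific(tokens, classify=toy_classify):
    return bool(UNSPECIFIC.match(class_sequence(tokens, classify)))


def ambiguous_names(dictionary):
    """Return names (lower-cased) that occur under more than one entry."""
    owners = defaultdict(set)
    for entry, names in dictionary.items():
        for name in names:
            owners[name.lower()].add(entry)
    return {n: sorted(e) for n, e in owners.items() if len(e) > 1}
```

Encoding synonyms as class-code strings keeps the pruning rules declarative: adding a rule means adding a pattern rather than writing new matching logic.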

Curation of the dictionary (3/3)
- The ambiguity list can be used to identify ambiguous entries and, based on their frequency of occurrence, move them to the manual curation list

Efficient detection of names (1/3)
- MEDLINE contains about 11 million abstracts
- The search runs in time linear in the number of tokens of the parsed text
- The algorithm sweeps over the abstract one token at a time and keeps, for the present position, a set of candidate solutions with two associated scoring measures: a boundary score s_b and an acceptance score s_a

Efficient detection of names (2/3)
- Boundary score s_b: controls the end of the extension of a candidate match and is increased on a token mismatch. The candidate is pruned if s_b > boundary threshold
- Acceptance score s_a: determines whether the candidate is reported as a match. s_a is a linear combination of token-class-specific match and mismatch terms. In other words, the significance of the token classes varies

Efficient detection of names (3/3)
- Example: if only the non-descriptive token “precursor” is unmatched in the candidate, a nearly maximal match score is computed (provided non-descriptive tokens receive a small weight)
- However, an unmatched, semantically significant modifier token such as “receptor” leads to a substantial mismatch term (if the weights are set appropriately)
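The two scores can be illustrated with a simplified Python sketch for a single dictionary synonym. It is not the authors' search structure (which handles the whole dictionary in one linear sweep); the weights, thresholds, toy class table, and function names are assumptions chosen to reproduce the "precursor" versus "receptor" example.

```python
# Toy token classes and weights (hypothetical values).
TOY_CLASSES = {"receptor": "modifier", "precursor": "non_descriptive"}
MATCH_W = {"standard": 1.0, "modifier": 1.0, "non_descriptive": 0.1}
MISMATCH_W = {"standard": 1.0, "modifier": 2.0, "non_descriptive": 0.1}


def classify(token):
    return TOY_CLASSES.get(token, "standard")


def acceptance_score(synonym, matched):
    """Linear combination of class-specific match and mismatch terms."""
    score = 0.0
    for tok in synonym:
        cls = classify(tok)
        score += MATCH_W[cls] if tok in matched else -MISMATCH_W[cls]
    return score


def find_matches(text_tokens, synonym, boundary_threshold=0.0,
                 accept_threshold=1.0, boundary_step=1.0):
    """Return (start, end, s_a) for accepted candidates of one synonym."""
    synonym = [t.lower() for t in synonym]
    wanted = set(synonym)
    hits = []
    for start, first in enumerate(text_tokens):
        if first.lower() not in wanted:
            continue  # candidates are opened only on a matching token
        matched, s_b, best = set(), 0.0, None
        for pos in range(start, len(text_tokens)):
            tok = text_tokens[pos].lower()
            if tok in wanted:
                matched.add(tok)  # token order is ignored
                s_a = acceptance_score(synonym, matched)
                if best is None or s_a > best[2]:
                    best = (start, pos + 1, s_a)
            else:
                s_b += boundary_step  # mismatch raises the boundary score
                if s_b > boundary_threshold:
                    break  # candidate is pruned
        if best is not None and best[2] >= accept_threshold:
            hits.append(best)
    return hits
```

With these toy weights, the synonym "TNF receptor precursor" scores about 1.9 against the text "TNF receptor" (only "precursor" is missing), but about -0.9 against "TNF precursor" (the modifier "receptor" is missing).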

Parameter optimization
- Robust linear programming (RLP) was used to compute a set of sensible weights
- This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones
- The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned
- The optimized weightings penalize mismatches of modifier and number tokens and reward matches of other token classes to varying extents
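For orientation, the linear program usually associated with robust linear programming (the formulation attributed to Bennett and Mangasarian) is shown below. The slides do not spell out the formulation, so the notation is an assumption: the rows of $A \in \mathbb{R}^{m \times n}$ are the feature vectors of the $m$ positive samples, the rows of $B \in \mathbb{R}^{k \times n}$ those of the $k$ negative samples, $w$ is the weight vector, $\gamma$ the decision threshold, $e$ a vector of ones, and $y$, $z$ are slack variables measuring misclassification.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z} \quad & \frac{1}{m}\, e^{\top} y + \frac{1}{k}\, e^{\top} z \\
\text{subject to} \quad & A w - e\gamma + y \ge e, \\
                        & -B w + e\gamma + z \ge e, \\
                        & y \ge 0, \quad z \ge 0 .
\end{aligned}
```

In this reading, each feature would be a token-class-specific match or mismatch term, so the learned $w$ plays the role of the weights in the acceptance score and $\gamma$ that of the acceptance threshold.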

Evaluation
- The test dataset is based on the TRANSPATH database of regulatory interactions
- All human proteins with SWISSPROT annotations were extracted
- Abstracts were discarded if no text was available or if a protein was described for the first time
- The resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)
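One way to score a name detector against such a benchmark is to compare predicted and annotated (abstract, protein) associations. The Python sketch below assumes this pairwise comparison and reports precision and recall; the slides do not state which measures were used, and the function name and identifiers are hypothetical.

```python
def evaluate(predicted, gold):
    """Compare predicted and gold sets of (abstract_id, protein_id) pairs.

    Returns (precision, recall); either is 0.0 when undefined.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, not taken from the benchmark.
gold = {("a1", "P1"), ("a1", "P2"), ("a2", "P3")}
predicted = {("a1", "P1"), ("a2", "P3"), ("a2", "P9")}
precision, recall = evaluate(predicted, gold)
```

Working on sets of pairs means that a protein mentioned several times in one abstract counts once, which matches the benchmark's unit of "associations".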

Results: 5-fold cross-validation