Linking Text Mentions to Biological Identifiers Alexander A. Morgan MITRE Corporation



Copyright 2005, MITRE Corporation
MITRE (Bedford) Bioinformatics: Marc Colosimo, Lynette Hirschman, Benjamin Wellner, Alexander S. Yeh
For references, copies of slides, or more information, please contact:

Outline
Overview of problem
Evaluations
– Numerical results
Methodology
Analysis of results (Importance of lexical resources)
– Baseline system
– BioCreAtIvE Task 1B
Improved matching
– GO concept normalization
Developing a concept mention lexicon
Final comments

Obligatory Information Explosion in Biomedical Research Slide
Exponential increase in published research literature
Massive growth in genes being sequenced and studied
Demand for large-scale studies that require (computable) knowledge of thousands of genes
Figure: FlyBase References per Year

What does it mean?
All public human knowledge of biology and medicine is stored in the research literature in "natural language".
One of the main goals of the automatic processing of biomedical text is to extract some of the underlying semantics of the text.
Extracting meaning requires understanding what is being discussed and mapping the entities (e.g. molecules) and concepts (e.g. physiological processes) under discussion to some underlying definition or model.

Biological Nomenclature: “V-SNARE”
V-SNARE = Vesicle SNARE
SNARE = SNAP Receptor
SNAP = Soluble NSF Attachment Protein
NSF = N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide = Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

Different Meanings of Meaning?
The V-SNARE example highlights that biology is understood at a variety of levels of complexity and a variety of scales.
Even something as fundamental as the mention of a gene can mean different things (protein, allele, sequence of bases, chromosomal locus, inheritable trait).
We read biological literature and are immediately able to associate meaning with the text; how do we describe this formally so a machine can do it?
Linguists have yet to solve these problems in semantics. Are we all doomed to become philosophers?

There is Salvation: Biological Databases
Annotation in a biological database is a model for putting the knowledge from free text into a formal representation.
Curators develop controlled vocabularies and ontologies to do the annotation.
The vocabularies provide targets for the "grounding" of mentions, linking mentions in the text with entities and concepts in the vocabularies.

Example Database
Biological database entry (FlyBase GeneID followed by its synonyms):
FBgn : Toll, Tl, CG5490, Fs(1)Tl, dToll, CT17414, Toll-1, Fs(3)Tl, mat(3)9, mel(3)10, mel(3)9

Linking Literature, Databases, Ontologies, Data
Figure labels: MEDLINE, Literature Collections, Genbank, Databases, SwissProt, Ontologies, Data integration via metaschemas, Experimental Data

Task Definition
Given a piece of text, associate all the relevant mentions in the text with the unique identifiers associated with the entities or concepts mentioned.


Evaluations
Systematic evaluations are the only way to show the relative efficacy of different techniques on a task.
We (the MITRE BioNLP group) helped organize two community evaluations that included mapping a mention to a unique identifier as a component:
– KDD Cup Challenge 2002 (with Mark Craven): included an aspect requiring the linkage of mentions of transcripts or proteins to given genes (defined by their unique identifiers in FlyBase)
– BioCreAtIvE 2004 (with Blaschke & Valencia): Task 1B required groups to return all the unique identifiers from model organism databases (mouse, fly, and yeast) for the genes/proteins of a given organism mentioned, and Task 2 involved linking text mentions with Gene Ontology concepts

BioCreAtIvE - Critical Assessment of Information Extraction in Biology
BioCreAtIvE Task 1B focused on extracting normalized gene names for 3 model organisms.
Input (for each organism)
– Lexicon of unique IDs and synonyms
– Noisy training data
– Set of unannotated abstracts
Output (systems produced this)
– List of unique identifiers for all the genes mentioned in each unannotated abstract
Evaluation
– Comparison of the system-produced lists to one produced by human effort
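The comparison of a system-produced list with the human-produced list can be sketched as set overlap. This is a minimal illustration, not the official Task 1B scorer: the function name and the placeholder identifiers are assumptions, and precision, recall, and F-measure are used here only because those are the figures quoted elsewhere in this talk.

```python
def score_gene_list(system_ids, gold_ids):
    """Compare one abstract's system gene list with the human-produced list.

    Returns (precision, recall, f_measure). Both inputs are treated as sets,
    so a gene ID counts once no matter how often it is mentioned.
    """
    system, gold = set(system_ids), set(gold_ids)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Placeholder IDs (hypothetical, not real database identifiers).
precision, recall, f_measure = score_gene_list(
    ["GENE_1", "GENE_2", "GENE_3"], ["GENE_1", "GENE_4"]
)
```

Here one of three returned IDs is correct and one of two expected IDs is found, so precision is 1/3 and recall is 1/2.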

Task 1B: Normalized Gene List
Given an organism’s synonym list, for each abstract, return the unique gene IDs mentioned in the abstract.
Example abstract: A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to on the standard genetic map (Est-6 is at ). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.
Normalized gene list (abstract ID, gene ID):
fly_00035_training FBgn
fly_00035_training FBgn
Excerpt from Fly synonym list (Gene ID and synonyms):
FBgn : Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est, EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase

Task 1B Results
How did systems do?

Task 1B Performance
Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.


Methodology
How does one link text mentions with the unique identifiers in the controlled vocabulary?

General Steps
1) Locate candidate mentions in text
2) Match possible mentions against the unique identifier list
3) Disambiguate mentions that can link to multiple identifiers or to meanings not represented in the list of identifiers

Identifying Candidate Mentions
Two alternate approaches:
– Use some sort of tagger or sentence classifier to identify candidate phrases
– Search for known mention forms from a lexicon

Tagging Approach
Forms of mentions never seen before can be identified based on context.
Named entity tagging in the biomedical domain generally has a performance ceiling of about 0.8 F-measure*.
Such an approach is inherently limited by the accuracy of the tagger: missed mentions are never considered.
Figure labels: Training data, Trainable Tagger
*Alexander S. Yeh, Lynette Hirschman, Alexander A. Morgan, Marc Colosimo. "BioCreAtIve Task 1A: Gene Mention Finding Evaluation," accepted by BMC Bioinformatics.

Scan & Lexical Lookup
By considering all possible phrases as candidates, nothing is lost, but the search can become computationally intractable.
Filtering using part-of-speech features or phrase chunking can help alleviate the problem, with corresponding tradeoffs in accuracy.
Searching for phrases from a lexicon "inverts" the problem and facilitates later matching.
Lexicon development can be a labor-intensive process.

Matching Mentions Against UID List
Associating candidate mentions with unique identifiers is typically done by comparing the mention with known synonyms for the entity or concept associated with the UID.
The match process is trivial for systems that find candidate mentions through lexical search.
Matching can be exact or "fuzzy".
Correctly matching novel mention forms with the correct UID is very difficult.
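The difference between exact and "fuzzy" matching can be sketched as follows. This is only an illustration under stated assumptions: it swaps in the generic string-similarity ratio from Python's standard difflib module, which is not the matching method of any system described in this talk, and the function name, cutoff value, and gene IDs are hypothetical.

```python
import difflib

# Hypothetical synonym table; the IDs are placeholders, not real database IDs.
SYNONYMS = {"Esterase 6": "GENE_EST6", "Est-6": "GENE_EST6", "Toll": "GENE_TOLL"}


def match_mention(mention, synonym_to_id, cutoff=0.8):
    """Return (gene_id, matched_synonym), or None if nothing is close enough.

    An exact synonym match is tried first; otherwise the closest synonym
    with a similarity ratio of at least `cutoff` is accepted.
    """
    if mention in synonym_to_id:
        return synonym_to_id[mention], mention
    close = difflib.get_close_matches(mention, list(synonym_to_id), n=1, cutoff=cutoff)
    if close:
        return synonym_to_id[close[0]], close[0]
    return None
```

With this table, "Esterase-6" has no exact entry but is close enough to "Esterase 6" to be linked, while an unrelated word such as "insulin" is rejected. A lower cutoff recovers more novel forms at the price of more spurious links.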

Disambiguation
Certain mentions may map to many UIDs (polysemy/ambiguity).
Disambiguation may be aided by contextual information.
Filtering out mentions that are spuriously linked to UIDs may be viewed as a distinct step or performed simultaneously.
Disambiguation and filtering may be done using heuristics or through a statistical classifier.
Good study of gene name ambiguity: Tuason O, Chen L, Liu H, Blake JA, Friedman C. Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:

Ambiguity in Practice
This is a histogram of the multiple name forms occurring in the Task 1B devtest set for fly. The horizontal axis is the number of variant forms appearing in the abstract, and the vertical axis is the count at each level of variation.
Expectation value: 2.7 variant mentions/gene


Key Problem - Lexical Resources
Lack of high-quality lexical resources or annotated data
Names associated with biological identifiers are often unrelated to how the concept/entity is described in text
Synonym lists, when present, are often very incomplete and/or full of noise


Baseline: Extracting Gene Names
Given the FlyBase synonym list:
– Apply longest-first pattern matching to text
– Generate the unique FlyBase identifiers for genes found in an abstract or in full text
– Compare these to the gene lists from the model organism DB
Figure labels: Synonym List, Pattern Matching, FB Gene List
Synonym list entry: FBgn : Toll, Tl, CG5490, Fs(1)Tl, dToll, CT17414, Toll-1, Fs(3)Tl, mat(3)9, mel(3)10, mel(3)9
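A rough sketch of longest-first pattern matching against a synonym list is given below. It is not the original baseline code: the function names are assumptions, the gene IDs are placeholders standing in for the FlyBase identifiers, and only a few of the synonyms listed on the slides are included. Sorting the synonyms by length before building the regular expression makes the longest synonym win at each text position.

```python
import re
from collections import defaultdict

# Placeholder IDs; the real FlyBase identifiers are not reproduced here.
LEXICON = {
    "GENE_TOLL": ["Toll", "Tl", "CG5490", "dToll", "Toll-1"],
    "GENE_EST6": ["Est-6", "Esterase 6", "Est", "EST6"],
}


def build_matcher(lexicon):
    """Invert the lexicon and compile one longest-first alternation pattern."""
    synonym_to_ids = defaultdict(set)
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            synonym_to_ids[synonym].add(gene_id)
    # Regex alternation takes the first alternative that matches, so the
    # longest synonyms must come first.
    longest_first = sorted(synonym_to_ids, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in longest_first)
    # The lookarounds keep matches from starting or ending inside a word.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern, dict(synonym_to_ids)


def find_mentions(text, pattern, synonym_to_ids):
    """Return (mention, gene_ids) pairs in text order."""
    return [(m.group(0), synonym_to_ids[m.group(0)]) for m in pattern.finditer(text)]


def gene_list(text, pattern, synonym_to_ids):
    """Return the sorted set of gene IDs mentioned in the text."""
    ids = set()
    for _, gene_ids in find_mentions(text, pattern, synonym_to_ids):
        ids |= gene_ids
    return sorted(ids)


PATTERN, SYNONYM_TO_IDS = build_matcher(LEXICON)
```

On the sentence "Toll-1 and Esterase 6 were assayed." the matcher reports "Toll-1" rather than the shorter "Toll", and "Esterase 6" rather than "Est". Matching here is case-sensitive and exact, which is one reason such a baseline misses typographic variants.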

Results of Baseline
Recall 95%, Precision 2.9%
Recall 72%, Precision 50%
Alexander A. Morgan, Lynette Hirschman, Marc Colosimo, Alexander Yeh, Jeff Colombe. "Gene Name Identification and Normalization Using a Model Organism Database," Journal of Biomedical Informatics: 2004 Dec;37(6):

Problems: Text to Gene List
Typography: Mapping of Greek letters, italics, sub/superscripts, and capitalization into ASCII loses semantic distinctions
Synonymy: Many synonyms are missing from the list due to typographic variants and hyphenation
Abbreviations: often look like common English words (not, we, if, for), causing false positives
Ambiguity: more false positives
– Some genes share synonyms: “P450” is a synonym for 20 different Drosophila genes
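Two of the problems listed above, shared synonyms and synonyms that look like common English words, can be detected directly from the lexicon. The sketch below is illustrative only: the toy lexicon, the gene IDs, and the function names are assumptions, and the word list contains just the four examples named on the slide.

```python
from collections import defaultdict

# The four common-word examples given on the slide.
COMMON_ENGLISH = {"not", "we", "if", "for"}

# Toy lexicon with placeholder gene IDs (hypothetical).
TOY_LEXICON = {
    "GENE_A": ["P450", "Cyp-a"],
    "GENE_B": ["P450", "we"],
    "GENE_C": ["not"],
}


def invert_lexicon(lexicon):
    """Map each synonym to the set of gene IDs that list it."""
    synonym_to_ids = defaultdict(set)
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            synonym_to_ids[synonym].add(gene_id)
    return dict(synonym_to_ids)


def ambiguous_synonyms(synonym_to_ids):
    """Return only the synonyms shared by more than one gene."""
    return {s: ids for s, ids in synonym_to_ids.items() if len(ids) > 1}


def looks_like_common_word(synonym, common_words=COMMON_ENGLISH):
    """Flag synonyms that collide with common English words, ignoring case."""
    return synonym.lower() in common_words
```

Flagged synonyms can then be dropped from the lexicon or passed to a later disambiguation step, depending on how much recall one is willing to trade for precision.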

Some Other Complications
Mapping to a controlled vocabulary of unique identifiers has intrinsic problems.
Tokenization: what is a word?
– "Toll/IL-1R" [one word or two?]
– "Toll-6, -7, -8" [how to know “-8” is Toll-8?]
What constitutes the mention of a gene?
– "Toll-related" [= Toll?]
– "Six Toll-related genes (Toll-3 to Toll-8)" = [Toll-3, Toll-4, Toll-5, Toll-6, Toll-7, Toll-8]
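The range and elision examples above can be handled, in simple cases, by pattern-based expansion. The sketch below is a heuristic written for this illustration, not a method from the talk; the regular expressions and function names are assumptions and cover only the two surface forms shown on the slide.

```python
import re

# "Toll-3 to Toll-8": the same stem on both sides of "to".
RANGE = re.compile(r"\b([A-Za-z][A-Za-z0-9]*)-(\d+) to \1-(\d+)\b")
# "Toll-6, -7, -8": a full name followed by bare "-N" continuations.
ELIDED = re.compile(r"\b([A-Za-z][A-Za-z0-9]*)-(\d+)((?:, -\d+)+)")


def expand_ranges(text):
    """Expand 'X-a to X-b' into the full list X-a ... X-b."""
    names = []
    for match in RANGE.finditer(text):
        stem, low, high = match.group(1), int(match.group(2)), int(match.group(3))
        names.extend(f"{stem}-{n}" for n in range(low, high + 1))
    return names


def expand_elided(text):
    """Expand 'X-a, -b, -c' into X-a, X-b, X-c."""
    names = []
    for match in ELIDED.finditer(text):
        stem = match.group(1)
        numbers = [match.group(2)] + re.findall(r"-(\d+)", match.group(3))
        names.extend(f"{stem}-{n}" for n in numbers)
    return names
```

Whether every expanded name is a real gene still has to be checked against the synonym list; the expansion only proposes candidates.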


Lexicon, Name Length, Ambiguity
Yeast: smallest vocabulary, shortest names, least ambiguity
Mouse: largest vocabulary, longest names, less ambiguity than fly
Fly: large vocabulary, medium names, most ambiguity

Organism    Name Length (SDev)
Yeast       1.00 (0.05)
Mouse       2.77 (2.57)
Fly         1.47 (0.97)

Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.



Improving Mention Finding
Entity names, particularly gene/protein names, are often modified in a relatively consistent manner, according to an underlying 'grammar'.

Mention                   Status
Interleukin 2             Valid
Interleukin-2             Valid
Interleukin2              Valid
IL-2                      Valid
IL2                       Valid
IL-4                      Invalid
Interleukin 2 promoter    Invalid
IL-2 receptor             Invalid

This suggests three approaches:
1) Develop a set of rules that transforms the dictionary in a controlled number of ways (Hanisch D, Fluck J, Mevissen HT, Zimmer R. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput 2003: )
2) Define a distance metric that evaluates the deviance of a candidate mention from the allowed forms (Tsuruoka Y, Tsujii J. Improving the performance of dictionary-based approaches in protein name recognition. J Biomed Inform 2004;37(6):461-70)
3) Train a generative statistical model that can provide a likelihood estimate for a candidate mention (us)
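The first approach, transforming the dictionary by rule, can be illustrated with a single rule that varies the separator between a name and its trailing number. This is a toy rule written for this example, not the rule set of Hanisch et al.; the function name and the regular expression are assumptions.

```python
import re


def separator_variants(name):
    """Generate space, hyphen, and no-separator variants of 'Name N'.

    Names that do not end in a number are returned unchanged, so forms
    such as 'IL-2 receptor' are never produced from 'IL-2'.
    """
    match = re.fullmatch(r"(.*[A-Za-z])[ -]?(\d+)", name)
    if not match:
        return {name}
    stem, number = match.groups()
    return {f"{stem}{separator}{number}" for separator in (" ", "-", "")}
```

Applied to "Interleukin 2" this yields "Interleukin 2", "Interleukin-2", and "Interleukin2", all listed as valid above. Because the number itself is never changed, "IL-4" is not generated from "IL-2".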

Improved Match with CRFs
Ben Wellner (MITRE) is developing character-based and word-based models of the allowed sense-preserving transformations of gene names.
The models learn the probability that a particular mention is an allowed transform of a known gene name, e.g. P("IL-2" | "Interleukin-2").


BioCreAtIvE Task 2
This task involved identifying mentions of Gene Ontology concepts described in papers.
Groups had a very difficult time linking concepts from the GO ontology to particular mentions in text.

GO as a Lexicon
GO concepts are named to assist annotators, but the names do not necessarily overlap much with how the concept is mentioned in text.
Often concepts are described rather than directly named.
Example text: Insulin isolated from the pancreas of a diabetic patient with fasting hyperinsulinaemia showed decreased activity in binding to cell membrane insulin receptors and in stimulating cellular 2-deoxyglucose transport and glucose oxidation.
GO:6006 Glucose metabolism

Compositional Nature of GO
Philip Ogren has investigated the compositional nature of GO, showing that many GO terms may be viewed as arising from finite state automata.
60% of GO terms contain other GO terms.
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 2004:
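The kind of containment behind a figure like "60% of GO terms contain other GO terms" can be checked with a simple whole-word substring test. The sketch below is not Ogren's procedure: the function names are assumptions, and the short term list is made up for illustration rather than taken from GO.

```python
def contains_other_term(term, all_terms):
    """True if some other term occurs inside `term` as whole words."""
    padded = f" {term} "
    return any(other != term and f" {other} " in padded for other in all_terms)


def fraction_compositional(terms):
    """Fraction of terms that contain at least one other term from the list."""
    terms = list(terms)
    if not terms:
        return 0.0
    hits = sum(contains_other_term(term, terms) for term in terms)
    return hits / len(terms)


# Illustrative term names only (hypothetical, not an excerpt of GO).
EXAMPLE_TERMS = [
    "transport",
    "glucose transport",
    "regulation of glucose transport",
    "binding",
]
```

In this toy list, two of the four terms contain another term, giving a fraction of 0.5. On a real ontology the quadratic comparison would likely need an index, which is omitted here for brevity.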

We Can Use Inter-Ontology Mappings to Infer Some Labels

Example Matches for GO #8121
GO Name: ubiquinol-cytochrome-c reductase activity
EC Name: Ubiquinol--cytochrome c reductase
MESH Name: Electron Transport Complex III
TIGR Name: ubiquinol-cytochrome c reductase, iron-sulfur subunit
Pfam Name: UcrQ

Lexicon development process
1. Expand vocabulary by using mappings from ontology to ontology
2. Use heuristics to expand the vocabulary further (calcium -> Ca 2+)
3. Measure mutual information between suggested phrases and document-level concept associations
4. Iterate with human editing to cull poor phrases and to suggest new ones
Note: since 500 GO concepts account for 69% of annotation, it is possible to maintain reasonable coverage with a subset of GO concepts.
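Step 3 above can be sketched as the mutual information between two document-level binary events: the phrase occurs in the document, and the document is annotated with the concept. The slides do not give the exact formulation used, so the standard definition is assumed here, and the function name and document IDs are hypothetical.

```python
import math


def mutual_information(docs_with_phrase, docs_with_concept, n_docs):
    """Mutual information (in bits) between phrase occurrence and concept annotation.

    Both arguments are collections of document IDs drawn from a corpus of
    `n_docs` documents. Empty cells of the 2x2 table contribute nothing.
    """
    phrase, concept = set(docs_with_phrase), set(docs_with_concept)
    both = len(phrase & concept)
    phrase_only = len(phrase - concept)
    concept_only = len(concept - phrase)
    neither = n_docs - both - phrase_only - concept_only
    # Each cell: (joint count, phrase marginal count, concept marginal count).
    cells = [
        (both, len(phrase), len(concept)),
        (phrase_only, len(phrase), n_docs - len(concept)),
        (concept_only, n_docs - len(phrase), len(concept)),
        (neither, n_docs - len(phrase), n_docs - len(concept)),
    ]
    total = 0.0
    for joint, row, column in cells:
        if joint > 0:
            total += (joint / n_docs) * math.log2(joint * n_docs / (row * column))
    return total
```

A phrase that appears in exactly the documents annotated with a concept scores high, while a phrase distributed independently of the annotation scores zero, so ranking candidate phrases by this value gives human editors a list to cull in step 4.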


Key Points
Extracting meaning from biomedical text involves linking mentions with controlled definitions of entities and concepts.
Creative use of existing resources is important.
Improved lexical resources can facilitate linking mentions to identifiers.
Research in text mining for biology needs to be driven by the real needs of biologists.

Thanks!
MITRE (Bedford) Bioinformatics: Marc Colosimo, Lynette Hirschman, Benjamin Wellner, Alexander S. Yeh
For references, copies of slides, or more information, please contact:

References
Marc Colosimo, Lynette Hirschman, Alex Morgan, Alexander S. Yeh. "Data Preparation and Interannotator Agreement BioCreAtivE Task 1B," accepted by BMC Bioinformatics.
Hanisch D, Fluck J, Mevissen HT, Zimmer R. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput 2003:
Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.
Hirschman L, Morgan AA, Yeh AS. Rutabaga by any other name: extracting biological names. J Biomed Inform 2002;35(4):
Alexander A. Morgan, Lynette Hirschman, Marc Colosimo, Alexander Yeh, Jeff Colombe. "Gene Name Identification and Normalization Using a Model Organism Database," Journal of Biomedical Informatics: 2004 Dec;37(6):
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 2004:
Tsuruoka Y, Tsujii J. Improving the performance of dictionary-based approaches in protein name recognition. J Biomed Inform 2004;37(6):
Tuason O, Chen L, Liu H, Blake JA, Friedman C. Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:
Yeh AS, Hirschman L, Morgan AA. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 2003;19 Suppl 1:i
Alexander S. Yeh, Lynette Hirschman, Alexander A. Morgan, Marc Colosimo. "BioCreAtIve Task 1A: Gene Mention Finding Evaluation," accepted by BMC Bioinformatics.

Abstract
There are many motivations for the mining of biological literature, but one of the major aims is to automatically extract semantics (meaning) from the natural language text that is used to describe the results of scientific enquiry, namely the reviewed and published scientific literature. Researchers in natural language processing have done quite a bit of work in trying to associate text mentions of entities with some underlying representation of that entity (grounding) or to link mentions of the same entity into an equivalence class (co-reference). This is facilitated in the biomedical domain because many of the entities mentioned have already been categorized, ordered, and assigned unique identifiers in a variety of biologically relevant databases (GENBANK, HUGO, Swiss-Prot, FlyBase, MGI, SGD, PDB, etc.). With the advent of function and process ontologies such as the Gene Ontology, we now also have targets for linking descriptive mentions in text with a structured representation of the concepts they describe. The start of efforts such as the Semantic Web for Life Sciences continues this trend. We have investigated problems in the grounding of gene and protein names and helped run a community evaluation in this area (BioCreAtIvE). We are currently working on improving methods to link these entity mentions with unique biological identifiers and to develop a phrase lexicon of terms associated with the Gene Ontology. The linking of text mentions to biological identifiers is an important step toward extracting the semantics encoded in the biomedical research literature.