Linking Text Mentions to Biological Identifiers Alexander A. Morgan MITRE Corporation
2 Copyright 2005, MITRE Corporation MITRE (Bedford) Bioinformatics Marc Colosimo Lynette Hirschman Benjamin Wellner Alexander S. Yeh For references, copies of slides, or more information, please contact:
Outline
Overview of problem
Evaluations
–Numerical results
Methodology
Analysis of results (Importance of lexical resources)
–Baseline system
–BioCreAtIvE Task 1B
Improved matching
–GO concept normalization
Developing a concept mention lexicon
Final comments
Obligatory Information Explosion in Biomedical Research Slide
Exponential increase in published research literature
Massive growth in genes being sequenced and studied
Demand for large scale studies that require (computable) knowledge of thousands of genes
[Figure: FlyBase references per year]
5 Copyright 2005, MITRE Corporation What does it mean? All the public human knowledge of biology and medicine is stored in the research literature in 'natural language' One of the main goals of the automatic processing of biomedical text is to extract some of the underlying semantics of the text Extracting meaning requires understanding what is being discussed and mapping the entities (e.g. molecules) and concepts (e.g. physiological processes) being discussed to some underlying definition or model
Biological Nomenclature: “V-SNARE”
V-SNARE = Vesicle SNARE
SNARE = SNAP Receptor
SNAP = Soluble NSF Attachment Protein
NSF = N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide = Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
7 Copyright 2005, MITRE Corporation Different Meanings of Meaning? The V-SNARE example highlights that biology is understood at a variety of levels of complexity, a variety of scales. Even something as fundamental as the mention of a gene can mean different things (protein, allele, sequence of bases, chromosomal locus, inheritable trait). We read biological literature and immediately are able to associate meaning with the text; how do we describe this formally so a machine can do it? Linguists have yet to solve these problems in semantics. Are we all doomed to become philosophers?
8 Copyright 2005, MITRE Corporation There is Salvation: Biological Databases Annotation in a biological database is a model for putting the knowledge from free text into a formal representation Curators develop controlled vocabularies and ontologies to do the annotation The vocabularies provide targets for the "grounding" of mentions, linking mentions in the text with entities and concepts in the vocabularies
9 Copyright 2005, MITRE Corporation Example Database Biological Database FBgn : Toll, Tl, CG5490, Fs(1)Tl, dToll, CT17414, Toll-1, Fs(3)Tl, mat(3)9, mel(3)10, mel(3)9 FlyBase GeneID
Linking Literature, Databases, Ontologies, Data
[Diagram: literature collections (MEDLINE), databases (Genbank, SwissProt), ontologies, and experimental data, connected by data integration via metaschemas]
11 Copyright 2005, MITRE Corporation Task Definition Given a piece of text, associate all the relevant mentions in the text with the unique identifiers associated with the entities or concepts mentioned.
Evaluations
Systematic evaluations are the only way to show the relative efficacy of different techniques on a task
We (MITRE BioNLP group) helped organize two community evaluations that included mapping a mention to a unique identifier as a component
–KDD Cup Challenge 2002 (with Mark Craven): included an aspect requiring the linkage of mentions of transcripts or proteins of given genes (defined by their unique identifiers in FlyBase)
–BioCreAtIvE 2004 (with Blaschke & Valencia): Task 1B required groups to return all the unique identifiers from model organism databases (mouse, fly, and yeast) for the genes/proteins of a given organism mentioned, and Task 2 involved linking text mentions with Gene Ontology concepts
BioCreAtIvE - Critical Assessment of Information Extraction in Biology
BioCreAtIvE Task 1B focused on extracting normalized gene names for 3 model organisms
Input (for each organism)
–Lexicon of unique IDs and synonyms
–Noisy training data
–Set of unannotated abstracts
Output (systems produced this)
–List of unique identifiers for all the genes mentioned in each unannotated abstract
Evaluation
–Comparison of the system-produced lists to one produced by human effort
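The list comparison used for evaluation can be sketched as a set comparison per abstract. This is only an illustration: the function name `score_gene_list` and the identifiers are hypothetical, and the slides do not say how the official Task 1B scorer was implemented, so the balanced F-measure below is an assumption based on common practice.

```python
def score_gene_list(predicted, gold):
    """Compare a system-produced ID list with a human-produced one.

    Returns (precision, recall, balanced F-measure). Assumes each
    identifier counts once per abstract, so both lists are treated as sets.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical identifiers, for illustration only.
gold_ids = {"GENE:0001", "GENE:0002", "GENE:0003"}
system_ids = {"GENE:0001", "GENE:0002", "GENE:0009"}
precision, recall, f_measure = score_gene_list(system_ids, gold_ids)
```

Here two of the three returned identifiers are correct and two of the three gold identifiers are found, so precision, recall, and F-measure all equal 2/3.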
Given an organism’s synonym list, for each abstract, return unique gene IDs mentioned in the abstract
A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to on the standard genetic map (Est-6 is at ). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.
Task 1B: Normalized Gene List
fly_00035_trainingFBgn fly_00035_trainingFBgn
Excerpt from Fly synonym list (Gene ID and synonyms):
FBgn : Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est, EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase
16 Copyright 2005, MITRE Corporation Task 1B Results How did systems do?
17 Copyright 2005, MITRE Corporation Task 1B Performance Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.
19 Copyright 2005, MITRE Corporation Methodology How does one link text mentions with the unique identifiers in the controlled vocabulary?
General Steps
1)Locate candidate mentions in text
2)Match possible mentions against unique identifier list
3)Disambiguate mentions that can link to multiple identifiers or to meanings not represented in the list of identifiers
Identifying Candidate Mentions
Two alternate approaches:
–Use some sort of tagger or sentence classifier to identify candidate phrases
–Search for known mention forms from a lexicon
Tagging Approach
Forms of mentions never seen before can be identified based on context
Named entity tagging in the biomedical domain generally has performance limitations at about 0.8 F-measure*
Such an approach is inherently limited by the accuracy of the tagger; mentions the tagger misses are never considered
[Diagram: training data feeding a trainable tagger]
*Alexander S. Yeh, Lynette Hirschman, Alexander A. Morgan, Marc Colosimo. "BioCreAtIve Task 1A: Gene Mention Finding Evaluation," accepted by BMC Bioinformatics.
Scan & Lexical Lookup
By considering all possible phrases as candidates, nothing is lost, but it can become computationally intractable
Filtering using part of speech features or phrase chunking can help alleviate the problem, with corresponding tradeoffs in accuracy
Searching for phrases from a lexicon "inverts" the problem and facilitates later matching
Lexicon development can be a labor-intensive process
Matching Mentions Against UID List
Associating candidate mentions with unique identifiers is typically done by comparing the mention with known synonyms for the entity or concept associated with the UID
The match process is trivial for systems that find candidate mentions through lexical search
Matching can be exact or "fuzzy"
Correctly matching novel mention forms with the correct UID is very difficult
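The difference between exact and "fuzzy" matching can be made concrete with a small sketch. The normalization rule, the `cutoff` value, and the identifiers below are assumptions chosen for illustration; the similarity score comes from Python's standard `difflib.SequenceMatcher`, not from any system described in the talk.

```python
import difflib
import re


def normalize(name):
    """Lowercase and drop spaces, hyphens, underscores, and slashes."""
    return re.sub(r"[\s\-_/]+", "", name.lower())


def match_mention(mention, lexicon, cutoff=0.85):
    """Return (uid, score) pairs whose best synonym scores at least `cutoff`.

    `lexicon` maps a unique identifier to its list of known synonyms.
    A score of 1.0 means the normalized strings are identical (an exact
    match after normalization); lower scores are fuzzy matches.
    """
    key = normalize(mention)
    results = {}
    for uid, synonyms in lexicon.items():
        best = max(
            (difflib.SequenceMatcher(None, key, normalize(s)).ratio() for s in synonyms),
            default=0.0,
        )
        if best >= cutoff:
            results[uid] = best
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Hypothetical identifiers; synonyms taken from the slides' examples.
lexicon = {
    "GENE:A": ["Esterase 6", "Est-6"],
    "GENE:B": ["Toll", "Tl"],
}
matches = match_mention("esterase-6", lexicon)
```

Lowering `cutoff` trades precision for recall: more novel spellings are caught, but short names such as "Tl" start to match unrelated strings.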
Disambiguation
Certain mentions may map to many UIDs (polysemy/ambiguity)
Disambiguation may be aided by contextual information
Filtering out mentions that are spuriously linked to UIDs may be viewed as a distinct step or performed simultaneously
Disambiguation and filtering may be done using heuristics or through a statistical classifier
Good study of gene name ambiguity: Tuason O, Chen L, Liu H, Blake JA, Friedman C. Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:
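One simple heuristic for using context is sketched below. It is an assumption made for illustration, not a method described in the talk: an ambiguous mention is assigned to a candidate gene only when exactly one of its candidates is also supported by an unambiguous mention in the same abstract. The gene names and identifiers are hypothetical.

```python
from collections import defaultdict


def build_index(lexicon):
    """Invert a {uid: [synonyms]} lexicon into {lowercased synonym: {uids}}."""
    index = defaultdict(set)
    for uid, synonyms in lexicon.items():
        for synonym in synonyms:
            index[synonym.lower()].add(uid)
    return index


def disambiguate(mentions, index):
    """Map each mention found in one abstract to a UID, or None if unresolved."""
    candidates = {m: index.get(m.lower(), set()) for m in mentions}
    # UIDs that some mention in this abstract points to unambiguously.
    anchored = {next(iter(uids)) for uids in candidates.values() if len(uids) == 1}
    resolved = {}
    for mention, uids in candidates.items():
        if len(uids) == 1:
            resolved[mention] = next(iter(uids))
        else:
            supported = uids & anchored
            # Resolve only when the context points to exactly one candidate.
            resolved[mention] = next(iter(supported)) if len(supported) == 1 else None
    return resolved


# Hypothetical lexicon in which two genes share the synonym "P450".
lexicon = {
    "GENE:1": ["P450", "Cyp-alpha"],
    "GENE:2": ["P450", "Cyp-beta"],
}
index = build_index(lexicon)
result = disambiguate(["P450", "Cyp-beta"], index)
```

A mention left as `None` can be dropped (favoring precision) or passed to a statistical classifier, as the slide suggests.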
Ambiguity in Practice
This is a histogram of the multiple name forms occurring in the Task 1B devtest set for fly. The horizontal axis is the number of variant forms appearing in the abstract, and the vertical axis is the count at each level of variation.
Expectation Value: 2.7 variant mentions/gene
28 Copyright 2005, MITRE Corporation Key Problem - Lexical Resources Lack of high quality lexical resources or annotated data Names associated with biological identifiers often unrelated to how concept/entity is described in text Synonym lists, when present, often very incomplete and/or full of noise
Baseline: Extracting Gene Names
Given the FlyBase synonym list
Apply longest-first pattern matching to text
Generate the unique FlyBase identifiers for genes found in an abstract or in full text
Compare these to the gene lists from the model organism DB
[Diagram: Synonym List, Pattern Matching, FB Gene List]
Synonym list entry: FBgn : Toll, Tl, CG5490, Fs(1)Tl, dToll, CT17414, Toll-1, Fs(3)Tl, mat(3)9, mel(3)10, mel(3)9
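A minimal sketch of longest-first pattern matching follows, assuming the synonym list has been loaded into a dictionary. The function names, the boundary rule (a synonym may not touch a letter or digit on either side), and the identifiers are illustrative assumptions, not the baseline system's actual implementation.

```python
import re


def build_pattern(lexicon):
    """Compile one regex that tries longer synonyms before shorter ones."""
    synonym_to_uids = {}
    for uid, synonyms in lexicon.items():
        for synonym in synonyms:
            synonym_to_uids.setdefault(synonym, set()).add(uid)
    # Python tries alternatives left to right, so sorting by length
    # (longest first) makes "Toll-1" win over "Toll" at the same position.
    ordered = sorted(synonym_to_uids, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    pattern = re.compile(r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])")
    return pattern, synonym_to_uids


def find_gene_ids(text, lexicon):
    """Return the set of identifiers whose synonyms occur in `text`."""
    pattern, synonym_to_uids = build_pattern(lexicon)
    found = set()
    for match in pattern.finditer(text):
        found |= synonym_to_uids[match.group(0)]
    return found


# Hypothetical identifiers; synonyms taken from the slides' examples.
lexicon = {
    "GENE:TOLL": ["Toll", "Tl", "Toll-1"],
    "GENE:EST6": ["Est-6", "Esterase 6", "Est"],
}
ids = find_gene_ids("Esterase 6 and Toll-1 were assayed; Est-6 maps nearby.", lexicon)
```

Matching is case-sensitive and purely lexical, so synonyms that look like ordinary English words would match wherever those words occur in the text.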
Results of Baseline
Recall 95%, Precision 2.9%
Recall 72%, Precision 50%
Alexander A. Morgan, Lynette Hirschman, Marc Colosimo, Alexander Yeh, Jeff Colombe. "Gene Name Identification and Normalization Using a Model Organism Database," Journal of Biomedical Informatics: 2004 Dec;37(6):
Problems: Text to Gene List
Typography: Mapping of Greek letters, italics, sub/superscripts, and capitalization into ASCII loses semantic distinctions
Synonymy: Many synonyms are missing from the list due to typographic variants and hyphenation
Abbreviations: often look like common English words (not, we, if, for), causing false positives
Ambiguity: more false positives
–Some genes share synonyms: “P450” is a synonym for 20 different Drosophila genes
Some Other Complications
Mapping to a controlled vocabulary of unique identifiers has intrinsic problems
Tokenization: what is a word?
–"Toll/IL-1R" [one word or two?]
–"Toll-6, -7, -8" [how to know "-8" is Toll-8?]
What constitutes the mention of a gene?
–"Toll-related" [= Toll?]
–"Six Toll-related genes (Toll-3 to Toll-8)" = [Toll-3, Toll-4, Toll-5, Toll-6, Toll-7, Toll-8]
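The two enumeration cases above can be handled by pattern-based expansion before lexicon lookup. The sketch below is an assumption about how one might do this; the regular expressions cover only the exact surface forms shown on the slide ("X-3 to X-8" and "X-6, -7, -8") and the function names are hypothetical.

```python
import re

# "Toll-3 to Toll-8": the same stem on both ends of a numeric range.
RANGE = re.compile(r"\b([A-Za-z]+)-(\d+) to \1-(\d+)\b")
# "Toll-6, -7, -8": a full name followed by elided, number-only items.
ELIDED_LIST = re.compile(r"\b([A-Za-z]+)-(\d+)((?:, -\d+)+)")


def expand_range(text):
    """Expand 'Toll-3 to Toll-8' into every name in the numeric range."""
    names = []
    for stem, start, end in RANGE.findall(text):
        names.extend(f"{stem}-{i}" for i in range(int(start), int(end) + 1))
    return names


def expand_elided_list(text):
    """Expand 'Toll-6, -7, -8' by restoring the stem on each elided item."""
    names = []
    for stem, first, rest in ELIDED_LIST.findall(text):
        names.append(f"{stem}-{first}")
        names.extend(f"{stem}-{n}" for n in re.findall(r"-(\d+)", rest))
    return names


range_names = expand_range("Six Toll-related genes (Toll-3 to Toll-8)")
list_names = expand_elided_list("Toll-6, -7, -8")
```

Neither function decides whether "Toll-related" counts as a mention of Toll; that remains a question of annotation guidelines rather than string processing.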
Lexicon, Name Length, Ambiguity
Yeast: smallest vocabulary, shortest names, least ambiguity
Mouse: largest vocabulary, longest names, less ambiguity than fly
Fly: large vocabulary, medium names, most ambiguity

Organism   Name Length (SDev)
Yeast      1.00 (0.05)
Mouse      2.77 (2.57)
Fly        1.47 (0.97)

Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.
Improving Mention Finding
Entity names, particularly gene/protein names, are often modified in a relatively consistent manner, according to an underlying 'grammar'

Interleukin 2            Valid
Interleukin-2            Valid
Interleukin2             Valid
IL-2                     Valid
IL2                      Valid
IL-4                     Invalid
Interleukin 2 promoter   Invalid
IL-2 receptor            Invalid

This suggests three approaches:
1)Develop a set of rules that transforms the dictionary in a controlled number of ways (Hanisch D, Fluck J, Mevissen HT, Zimmer R. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput 2003: )
2)Define a distance metric that evaluates the deviance of a candidate mention from the allowed forms (Tsuruoka Y, Tsujii J. Improving the performance of dictionary-based approaches in protein name recognition. J Biomed Inform 2004;37(6):461-70)
3)Train a generative statistical model that can provide a likelihood estimate for a candidate mention (us)
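The first approach, transforming the dictionary with a small set of rules, can be sketched with a single rule: vary the separator before a trailing number. This one rule is an illustrative assumption and is not the rule set of Hanisch et al.; the function name `generate_variants` is hypothetical.

```python
import re


def generate_variants(name):
    """Return spellings of `name` that differ only in the separator
    (space, hyphen, or nothing) before a trailing number.

    Names that do not end in a number are returned unchanged.
    """
    match = re.fullmatch(r"(.*?)[\s\-]?(\d+)", name)
    if not match or not match.group(1):
        return {name}
    stem, number = match.group(1), match.group(2)
    return {f"{stem}{separator}{number}" for separator in (" ", "-", "")}


long_forms = generate_variants("Interleukin 2")
short_forms = generate_variants("IL-2")
```

Because the rule never changes the number or appends words, forms such as "IL-4" or "Interleukin 2 promoter" are not produced from an entry for interleukin 2, which mirrors the valid/invalid distinction in the table above. Linking "IL-2" to "Interleukin 2" still requires both to be present as synonyms, or a separate abbreviation rule.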
Improved Match with CRFs
Ben Wellner (MITRE) is trying to develop character-based and word-based models of the allowed sense-preserving transformations of gene names
The models learn the probability that a particular mention is an allowed transform of a known gene name, e.g. P("IL-2" | "Interleukin-2")
41 Copyright 2005, MITRE Corporation BioCreAtIvE Task 2 This task involved identifying mentions of Gene Ontology concepts described in papers Groups had a very difficult time linking concepts from the GO ontology to particular mentions in text
GO as a Lexicon
GO concepts are named to assist annotators, but these names do not necessarily overlap much with how the concept is mentioned in text
Often concepts are described rather than directly named
Insulin isolated from the pancreas of a diabetic patient with fasting hyperinsulinaemia showed decreased activity in binding to cell membrane insulin receptors and in stimulating cellular 2-deoxyglucose transport and glucose oxidation.
GO:6006 Glucose metabolism
Compositional Nature of GO
Philip Ogren has investigated the compositional nature of GO, showing that many GO terms may be viewed as arising from finite state automata
60% of GO terms contain other GO terms
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 2004:
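A statistic such as "60% of GO terms contain other GO terms" can be approximated with a plain substring test, sketched below. The slides do not describe how Ogren et al. measured containment, so the case-insensitive substring rule is an assumption, and the term names are illustrative strings rather than entries checked against GO.

```python
def fraction_containing_other_terms(terms):
    """Fraction of terms that contain at least one other term as a substring."""
    normalized = [term.lower() for term in terms]
    if not normalized:
        return 0.0
    containing = 0
    for i, term in enumerate(normalized):
        # Compare against every other entry, skipping the term itself.
        if any(j != i and other in term for j, other in enumerate(normalized)):
            containing += 1
    return containing / len(normalized)


# Illustrative term names only.
terms = [
    "glucose metabolism",
    "regulation of glucose metabolism",
    "transport",
    "glucose transport",
    "binding",
]
fraction = fraction_containing_other_terms(terms)
```

In this toy list two of the five terms contain another term, giving 0.4. The quadratic loop is fine for illustration but would need an index for an ontology of GO's size.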
44 Copyright 2005, MITRE Corporation We Can Use Inter-Ontology Mappings to Infer Some Labels
Example Matches for GO #8121
GO Name:
–ubiquinol-cytochrome-c reductase activity
EC Name:
–Ubiquinol--cytochrome c reductase
MESH Name:
–Electron Transport Complex III
TIGR Name:
–ubiquinol-cytochrome c reductase, iron-sulfur subunit
Pfam Name:
–UcrQ
Lexicon development process
1.Expand vocabulary by using mappings from ontology to ontology
2.Use heuristics to expand the vocabulary further (calcium -> Ca2+)
3.Measure mutual information between suggested phrases and document-level concept associations
4.Iterate with human editing to cull poor phrases and to suggest new ones
Note: since 500 GO concepts account for 69% of annotation, it is possible to maintain reasonable coverage with a subset of GO concepts
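The mutual information measurement in step 3 can be sketched from a 2x2 table of document counts. Treating "phrase occurs in the document" and "concept is annotated to the document" as two binary variables is an assumption about the setup; the slides do not give the exact formulation, and the function name is hypothetical.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between phrase presence and concept annotation.

    n11: documents with the phrase and the concept
    n10: documents with the phrase only
    n01: documents with the concept only
    n00: documents with neither
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        return 0.0
    # Each cell: (joint count, marginal count for phrase state, marginal for concept state).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, phrase_marginal, concept_marginal in cells:
        if joint:  # empty cells contribute nothing
            mi += (joint / total) * math.log2(
                joint * total / (phrase_marginal * concept_marginal)
            )
    return mi


independent = mutual_information(25, 25, 25, 25)
perfectly_associated = mutual_information(50, 0, 0, 50)
```

A phrase that occurs independently of the annotation scores 0 bits, while a phrase that occurs in exactly the annotated documents scores 1 bit when half the documents are annotated. Phrases with low scores are candidates for culling in step 4.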
48 Copyright 2005, MITRE Corporation Key Points Extracting meaning from biomedical text involves linking mentions with controlled definitions of entities and concepts. Creative use of existing resources is important. Improved lexical resources can facilitate linking mentions to identifiers. Research in text mining for biology needs to be driven by the real needs of biologists.
49 Copyright 2005, MITRE Corporation Thanks! MITRE (Bedford) Bioinformatics Marc Colosimo Lynette Hirschman Benjamin Wellner Alexander S. Yeh For references, copies of slides, or more information, please contact:
References
Marc Colosimo, Lynette Hirschman, Alex Morgan, Alexander S. Yeh. "Data Preparation and Interannotator Agreement BioCreAtivE Task 1B," accepted by BMC Bioinformatics.
Hanisch D, Fluck J, Mevissen HT, Zimmer R. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput 2003:
Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.
Hirschman L, Morgan AA, Yeh AS. Rutabaga by any other name: extracting biological names. J Biomed Inform 2002;35(4):
Alexander A. Morgan, Lynette Hirschman, Marc Colosimo, Alexander Yeh, Jeff Colombe. "Gene Name Identification and Normalization Using a Model Organism Database," Journal of Biomedical Informatics: 2004 Dec;37(6):
Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 2004:
Tsuruoka Y, Tsujii J. Improving the performance of dictionary-based approaches in protein name recognition. J Biomed Inform 2004;37(6):
Tuason O, Chen L, Liu H, Blake JA, Friedman C. Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput 2004:
Yeh AS, Hirschman L, Morgan AA. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 2003;19 Suppl 1:i
Alexander S. Yeh, Lynette Hirschman, Alexander A. Morgan, Marc Colosimo. "BioCreAtIve Task 1A: Gene Mention Finding Evaluation," accepted by BMC Bioinformatics.
51 Copyright 2005, MITRE Corporation Abstract There are many motivations for the mining of biological literature, but one of the major aims is to automatically extract semantics (meaning) from the natural language text that is used to describe the results of scientific enquiry, namely the reviewed and published scientific literature. Researchers in natural language processing have done quite a bit of work in trying to associate text mentions of entities with some underlying representation of that entity (grounding) or to link mentions of the same entity into an equivalence class (co-reference). This is facilitated in the biomedical domain because many of the entities mentioned have been already categorized, ordered and assigned unique identifiers in a variety of biologically relevant databases (GENBANK, HUGO, Swiss-Prot, FlyBase, MGI, SGD, PDB, etc.). With the advent of function and process ontologies such as the Gene Ontology, we now also have targets for linking descriptive mentions in text with a structured representation of the concepts they describe. The start of efforts such as the Semantic Web for Life Sciences continues this trend. We have investigated problems in the grounding of gene and protein names and helped run a community evaluation in this area (BioCreAtIvE). We are currently working on improving methods to link these entity mentions with unique biological identifiers and to develop a phrase lexicon of terms associated with the Gene Ontology. The linking of text mentions to biological identifiers is an important step toward extracting the semantics encoded in the biomedical research literature.