BioNLP Tutorial PSB 2006 Wailea, Maui, HI K. Bretonnel Cohen Olivier Bodenreider Lynette Hirschman.

The Biological Data Cycle
Diagram labels: MEDLINE, Literature Collections, Experimental Data, Ontologies, Expert Curation, Databases (SwissProt, GenBank)
Bottleneck: getting knowledge from literature to databases
Solution: text mining

Model Organism Curation Pipeline
MEDLINE
1. Select papers
2. List genes for curation
3. Curate genes from paper

Double exponential growth in the literature
New entries in MEDLINE with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)

Examples of BioNLP in action

Application types
Information retrieval: find documents in response to an “information need”
Query: p53
Result: Resistance to apoptosis, increased growth potential, and altered gene expression in cells that survived genotoxic hexavalent chromium exposure. PMID:

Application types
Question-answering: question as input, answer as output
Q: What is BRCA1?
A: A gene located on the seventeenth chromosome associated with a risk of breast and ovarian cancer
(Yu and Sable 2005)

Application types
Summarization
–Input: one or more texts
–Output: single (shorter) text
Ling et al. (multiple documents)
Lu et al. (single document)
Notes:
Information extraction: Information extraction systems find statements about some specified type of relationship in text. Entity identification is a necessary prerequisite to information extraction.
Information retrieval: Information retrieval is classically defined as the location of documents that are relevant to some information need. PubMed is a premier example of a sophisticated biomedical information retrieval system.
Summarization: Summarization systems benefit from high-performance entity identification and normalization. Other approaches involve information extraction.

Application types
Information extraction: relationships between things
BINDING_EVENT
Binder:
Bound:

Application types
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
Papers in the session:
Lussier (gene/phenotype)
Maguitman (protein/family)
Chun (gene/disease)
Höglund (protein/location)
Stoica (protein/function)
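
A minimal sketch of how the BINDING_EVENT template above could be filled by a pattern-based extractor. The regular expression, the function name extract_binding_event, and the dictionary layout are illustrative assumptions, not something taken from the tutorial; the pattern covers only the single phrasing "X binds to Y".

```python
import re
from typing import Optional

# Hypothetical pattern for one surface form only: "<binder> binds to <bound>".
BINDS_TO = re.compile(
    r"\b(?P<binder>[A-Za-z0-9-]+) binds to (?P<bound>[A-Za-z0-9-]+)"
)


def extract_binding_event(sentence: str) -> Optional[dict]:
    """Return a filled BINDING_EVENT template, or None if the pattern does not match."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return {
        "type": "BINDING_EVENT",
        "Binder": match.group("binder"),
        "Bound": match.group("bound"),
    }


event = extract_binding_event("Met28 binds to DNA.")
print(event)
```

Any sentence that expresses the same fact with different wording returns None here, which is the gap that more general extraction systems have to close.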

Application types
Entity identification
Gene name examples: HSP60, Hsp-60, heat shock protein 60, Cerberus, wingless, Ken and Barbie, the

Application types
Entity normalization: find concepts in text and map them to unique identifiers
A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to on the standard genetic map (Est-6 is at ). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.

Application types
Perfect entity identification finds 5 mentions in the abstract above; they correspond to just 2 genes:
–FBgn (esterase 6)
–FBgn (leucine aminopeptidase)

Partial list of synonyms for FBgn :
–Esterase 6
–Carboxyl ester hydrolase
–CG6917
–Est6
–Est-D
–Est-5
Papers in the session:
Chun (gene/disease)
Johnson (ontology alignment)
Stoica (gene/function)
Vlachos (FlyBase mapping)
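
A minimal sketch of dictionary-based entity normalization using the synonym list above. The FlyBase identifier is elided in the transcript, so the code uses a clearly labeled placeholder; the key-normalization rule (lowercase, drop spaces and hyphens) and all function names are assumptions made for illustration.

```python
import re

# Placeholder only: the real FlyBase identifier is not given in the transcript.
GENE_ID = "PLACEHOLDER_ID_FOR_ESTERASE_6"

# Synonyms copied from the slide.
SYNONYMS = {
    GENE_ID: [
        "Esterase 6",
        "Carboxyl ester hydrolase",
        "CG6917",
        "Est6",
        "Est-D",
        "Est-5",
    ],
}


def normalize_key(name):
    """Lowercase and drop whitespace and hyphens so 'Est-6' and 'Est6' collide."""
    return re.sub(r"[\s-]+", "", name.lower())


# Reverse index: normalized synonym -> identifier.
LOOKUP = {
    normalize_key(synonym): gene_id
    for gene_id, synonyms in SYNONYMS.items()
    for synonym in synonyms
}


def normalize(mention):
    """Map a text mention to an identifier, or None if it is not in the dictionary."""
    return LOOKUP.get(normalize_key(mention))


for mention in ["esterase 6", "Est-6", "leucine aminopeptidase"]:
    print(mention, "->", normalize(mention))
```

Exact lookup like this handles spelling variants of known synonyms but says nothing about ambiguous names, which is where most of the difficulty lies.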

Biological Nomenclature: “V-SNARE”
V-SNARE = Vesicle SNARE
SNARE = SNAP Receptor
SNAP = Soluble NSF Attachment Protein
NSF = N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide = Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(A. Morgan)

The Biological Data Cycle
Diagram labels: MEDLINE, Literature Collections, Experimental Data, Ontologies, Expert Curation, Databases (SwissProt, GenBank)
What’s the organizing principle for all of this?

Organizing principles
Biomedical literature: MeSH
Genome annotations: GO
Model organisms: NCBI Taxonomy
Genetic knowledge bases: OMIM
Clinical repositories: SNOMED
Anatomy: UWDA
Other subdomains: …
UMLS

Ontologies as text mining resources
Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.
(Uppal, S., and A. P. Coatesworth. “Neurofibromatosis Type 2.” Int J Clin Pract, 57, no. 8, 2003, pp )

Ontologies as text mining resources
(same passage as on the previous slide)
Entity types: Disease, Tumor, Gene, Chromosome
Relations in the text:
–vestibular schwannoma manifestation of neurofibromatosis 2
–neurofibromatosis 2 associated with mutation of merlin
–merlin located on chromosome 22
Relations in the ontology:
–Tumor manifestation of Disease
–Disease associated with mutation of Gene
–Gene located on Chromosome
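
One way to picture how an ontology can serve as a text mining resource is to treat its type-level relations as a schema and check extracted statements against it. The sketch below uses only the entities and relations named on the slide; the data layout and the function name is_licensed are assumptions made for illustration.

```python
# Entity types for the mentions named on the slide.
ENTITY_TYPES = {
    "vestibular schwannoma": "Tumor",
    "neurofibromatosis 2": "Disease",
    "merlin": "Gene",
    "chromosome 22": "Chromosome",
}

# Type-level relations from the slide: (subject type, relation, object type).
SCHEMA = {
    ("Tumor", "manifestation of", "Disease"),
    ("Disease", "associated with mutation of", "Gene"),
    ("Gene", "located on", "Chromosome"),
}

# Instance-level statements read off the passage.
FACTS = [
    ("vestibular schwannoma", "manifestation of", "neurofibromatosis 2"),
    ("neurofibromatosis 2", "associated with mutation of", "merlin"),
    ("merlin", "located on", "chromosome 22"),
]


def is_licensed(fact):
    """True if the ontology has a relation between the types of subject and object."""
    subject, relation, obj = fact
    return (ENTITY_TYPES.get(subject), relation, ENTITY_TYPES.get(obj)) in SCHEMA


for fact in FACTS:
    print(fact, is_licensed(fact))
```

A candidate such as ("merlin", "located on", "neurofibromatosis 2") is rejected because the schema has no "Gene located on Disease" relation.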

What’s the state of the art?
Tasks differ greatly: finding human protein interactions (Bunescu ‘05) may be harder than finding “inhibition” relations (Pustejovsky ‘02)
Need a CASP-style competitive evaluation
Precision ≈ specificity (strictly, positive predictive value)
Recall ≈ sensitivity

What’s the state of the art?
KDD Cup (2002)
TREC Genomics (2003, 2004, 2005)
BioCreAtIvE (2004)
BioNLP (2004)

What’s the state of the art?
MEDLINE
1. Select papers: KDD 2002, TREC Genomics
2. List genes for curation: BioCreAtIvE entity identification and entity normalization tasks
3. Curate genes from paper: BioCreAtIvE information extraction task: PDB → Gene Ontology

What’s the state of the art?
Yeast results good: High: 0.93 F
–Smallest vocab
–Short names
–Little ambiguity
Fly: 0.82 F
–High ambiguity
Mouse: 0.79 F
–Large vocabulary
–Long names
**F-measure is balanced precision and recall: 2*P*R/(P+R)
Recall: # correctly identified/# possible correct
Precision: # correctly identified/# identified
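
The three definitions above translate directly into code. This is a minimal sketch; the function name and the example counts (80 correct, 100 identified, 160 possible) are made up for illustration and are not results from any evaluation.

```python
def precision_recall_f(correct, identified, possible):
    """Compute precision, recall and balanced F-measure from raw counts.

    correct:    number of correctly identified items
    identified: number of items the system returned
    possible:   number of items in the gold standard
    """
    precision = correct / identified if identified else 0.0
    recall = correct / possible if possible else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, chosen only to exercise the formula.
p, r, f = precision_recall_f(correct=80, identified=100, possible=160)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")
```

Because F is a harmonic mean, it sits closer to the lower of the two scores, so a system cannot reach a high F by maximizing precision or recall alone.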

Blaschke et al.

What’s the state of the art?
Cellular Component: 34.61% (561/1621)
Molecular Function: 33.00% (933/2827)
Biological Process: 23.02% (1011/4391)
Cellular component is easier because task is relation between “entities”: located_in (protein, cell_component)
Biological process is hardest because it is the most abstract
Blaschke et al.
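
As a quick arithmetic check, the percentages above can be recomputed from the counts in parentheses. The dictionary layout and function name are assumptions; the numbers are the ones on the slide.

```python
# (numerator, denominator) per Gene Ontology branch, as given on the slide.
RESULTS = {
    "Cellular Component": (561, 1621),
    "Molecular Function": (933, 2827),
    "Biological Process": (1011, 4391),
}


def percent(numerator, denominator):
    """Return the ratio as a percentage string with two decimals."""
    return f"{100 * numerator / denominator:.2f}"


for branch, (numerator, denominator) in RESULTS.items():
    print(f"{branch}: {percent(numerator, denominator)}%")
```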

2.5 types of solutions
Rule-based
–Patterns
–Grammars
Statistical/machine learning
–Labelled training data
–Noisy training data
Hybrid statistical/rule-based
Papers in the session:
Höglund (information extraction, gene → localization)
Maguitman (information extraction, SWISSPROT → Pfam)
Vlachos (entity normalization, gene → FlyBase)
Stoica (gene → GO code)
Chun (IE, multiple gene → UMLS disease)
Ling (summarization, FlyBase)
Johnson (ontology alignment, GO → other OBO)
Lu (summarization, Entrez Gene → GeneRIFs)
Lussier (information extraction, GOA → phenotype)
Vlachos (coreference, FlyBase & Sequence Ontology)

Common tools/techniques
“Stop word” removal: eliminate features that are rarely helpful
–the, a, and…
(Porter) stemming: reduce inflected words to a common stem, which need not be a real word
–promot, mitochondri, cytochrom
POS (“part of speech”) tagging: ≈80 categories
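
A minimal sketch of stop word removal followed by stemming. To stay dependency-free it uses a crude suffix stripper, which is NOT the Porter algorithm; the suffix list, the minimum stem length, and the example phrase are all assumptions chosen so that the output matches the stems shown on the slide.

```python
# Stop words taken from the slide's examples.
STOP_WORDS = {"the", "a", "and"}

# Hypothetical suffix list, longest first. This is a toy stand-in for Porter stemming.
SUFFIXES = ("ing", "ed", "er", "es", "al", "s", "e", "a")


def crude_stem(word):
    """Strip the first matching suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def features(text):
    """Lowercase, split on whitespace, drop stop words, stem what is left."""
    tokens = text.lower().split()
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(features("the promoter and a mitochondrial cytochrome"))
```

A real system would use a tested Porter implementation; the point here is only that both steps shrink the feature space before any learning or matching happens.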

Why text mining is difficult
Variability
Pervasive ambiguity at every level of analysis

Why text mining is difficult
Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…

Why text mining is difficult
…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…

Why text mining is difficult
…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…
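
The variability shown on the last three slides can be made concrete with a small rule set: one hand-written pattern per paraphrase. Everything below (the patterns, the NAME token definition, the choice to call the first-mentioned entity the binder in symmetric phrasings) is an illustrative assumption, not a method from the tutorial.

```python
import re

NAME = r"[A-Za-z0-9-]+"  # assumed shape of an entity name: a single token

# One hypothetical pattern per paraphrase of "Met28 binds to DNA".
PATTERNS = [
    re.compile(rf"(?P<binder>{NAME}) binds to (?P<bound>{NAME})"),
    re.compile(rf"binding of (?P<binder>{NAME}) to (?P<bound>{NAME})"),
    re.compile(rf"(?P<binder>{NAME}) and (?P<bound>{NAME}) bind\b"),
    re.compile(rf"binding between (?P<binder>{NAME}) and (?P<bound>{NAME})"),
    re.compile(rf"(?P<binder>{NAME}) is sufficient to bind (?P<bound>{NAME})"),
    re.compile(rf"(?P<bound>{NAME}) bound by (?P<binder>{NAME})"),
]


def extract(sentence):
    """Return (binder, bound) from the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("binder"), match.group("bound")
    return None


for text in [
    "DNA bound by Met28",
    "binding under unspecified conditions of Met28 to DNA",
    "binding of Met28 to upstream regions of DNA",
]:
    print(text, "->", extract(text))
```

Six patterns cover the six plain paraphrases, but an inserted modifier either blocks every pattern or, worse, produces a wrong answer such as ("Met28", "upstream"). This is one reason purely surface-level rules scale poorly.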

Why text mining is difficult
Document segmentation
Sentence segmentation
Tokenization
Part of speech tagging
Parsing

Why text mining is difficult
Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn. (Ruan et al. 2002)
Sentence segmenter F-measure (Baumgartner, in prep.):
MaxEnt_1  .40
MaxEnt_2  .67
KeX  .95
LingPipe  .96
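
A sketch of why the example above trips up simple sentence segmenters: the second sentence starts with a lowercase gene name. The two splitting rules below are assumptions made for illustration and are unrelated to the systems in the table.

```python
import re

TEXT = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)


def split_before_capital(text):
    """Split after sentence-final punctuation only if a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def split_after_any_period(text):
    """Split after any sentence-final punctuation followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


print(len(split_before_capital(TEXT)))    # misses the boundary before "bif"
print(len(split_after_any_period(TEXT)))  # finds it
print(split_after_any_period("(Ruan et al. 2002)"))  # but breaks the citation
```

The stricter rule under-splits on lowercase gene names, and the looser rule over-splits on abbreviations such as "et al.", so neither simple heuristic is adequate for biomedical text.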

Why text mining is difficult
lead: 69 tokens in GENIA
–“bare stem” verb: 34
–3rd person singular present tense verb: 29
–Noun: 3
–Past tense verb: 2
–Past participle: 1
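
The counts above show how far a context-free guess can get. The sketch below computes the accuracy of a "most frequent tag" baseline for this one word; the tag labels are paraphrased from the slide and the baseline itself is an assumption added for illustration.

```python
from collections import Counter

# Tag counts for "lead" in GENIA, as listed on the slide.
lead_tags = Counter({
    "bare-stem verb": 34,
    "3rd person singular present verb": 29,
    "noun": 3,
    "past tense verb": 2,
    "past participle": 1,
})

total = sum(lead_tags.values())
best_tag, best_count = lead_tags.most_common(1)[0]

# Accuracy if every occurrence were assigned the single most frequent tag.
baseline_accuracy = best_count / total

print(f"{best_tag}: {best_count}/{total} = {baseline_accuracy:.1%}")
```

Always choosing the most frequent tag would be wrong for more than half of these tokens, so the tagger has to use context.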

Why text mining is difficult
HUNK
–Human natural killer (cell type)
–HUN kinase (gene/protein)
–Radiological/orthopedic classification scheme
–Piece of something

Why text mining is difficult
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF : )
NACT:
–neoadjuvant chemotherapy (PMID )
–N-acetyltransferase (PMID )
–Na+-coupled citrate transporter (PMID )

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF : )
Possible bracketings:
–(liver), (testis) and (brain in rat)
–liver, (testis and brain in rat)
–(liver, testis and brain in rat)

Why text mining is difficult
NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF : )
Possible bracketings:
–shows preference for (citrate over dicarboxylates)
–shows preference (for citrate) (over dicarboxylates)

Why text mining is difficult
regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID )
!proliferation and regulation of cell migration
!regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation

Why text mining is difficult
regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID )
!degradation of IRS-1, translocation, and serine phosphorylation
!serine phosphorylation, serine translocation, and serine degradation (of IRS-1)

Most biomedical text mining to date: “ungrounded”
Drosophila OBP76a is necessary for fruit flies to respond to the aggregation pheromone 11-cis vaccenyl acetate (PMID )
lush is completely devoid of evoked activity to the pheromone 11-cis vaccenyl acetate (VA), revealing that this binding protein is absolutely required for activation of pheromone-sensitive chemosensory neurons (PMID )
Entrez Gene ID: 40136

The next step
Text mining can be a key tool for linking biological knowledge from the literature to structured data in biological databases…
…and databases to each other.

Papers in the text mining session
5 papers on linkage to ontologies
–Höglund et al.: generating cellular localization annotations
–Lussier et al.: PhenoGO for capture of phenome data
–Stoica and Hearst: functional annotation of proteins
–Johnson et al.: ontology alignments
–Vlachos et al.: ontology for name extraction, anaphora
2 papers linking other sets of resources
–Maguitman et al. on “bibliome” to reproduce Pfam classes
–Chun et al. on linking genes and diseases
2 papers on summarization, using linked resources
–Lu et al.: automated GeneRIF extraction
–Ling et al.: automated gene summary generation

Acknowledgements
Alex Morgan for several slides
Christian Blaschke for data and slides
Bill Baumgartner for sentence segmenter performance data
Helen Johnson for data on POS ambiguity in GENIA
Lu Zhiyong for syntactic ambiguity examples
Larry Hunter for current PubMed graph

How big is a humuhumunukunukuapua’a?