Michigan, 2005 Alfonso Valencia CNB-CSIC Text Mining ISMB05 Alfonso Valencia CNB-CSIC.


SLIDE WINDOW APPROACH
Krallinger, Valencia, Drug Discovery Today 2005
ISMB-Biolink

BioLINK SIG: Linking Literature, Information and Knowledge for Biology
A Joint Meeting of The ISMB BioLINK Special Interest Group on Text Data Mining and The ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Christian Blaschke, Hagit Shatkay, Kevin B. Cohen, Lynette Hirschman
1. IntEx: a Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text. S. T. Ahmed, D. Chidambaram, H. Davulcu, C. Baral
2. Corpus Design for Biomedical Natural Language Processing. K. B. Cohen, L. Fox, P. V. Ogren, L. Hunter
3. Unsupervised Gene/Protein Named Entity Normalization using Automatically Extracted Dictionaries. A. M. Cohen
4. Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions. A. Ramani, E. Marcotte, R. Bunescu, R. Mooney
5. MedTag: a Collection of Biomedical Annotations. L. H. Smith, L. Tanabe, T. Rindflesch, W. John Wilbur
6. A Machine Learning Approach to Acronym Generation. Y. Tsuruoka, S. Ananiadou, J. Tsujii
7. Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. B. Wellner
8. Adaptive String Similarity Metrics for Biomedical Reference Resolution. B. Wellner, J. Castaño, J. Pustejovsky
9. A Cross-Domain Application of Natural Language Processing in Biology. I. Chiu, L. H. Shu
10. Functional Annotation of Genes Using Hierarchical Text Categorization. S. Kiritchenko, S. Matwin, A. F. Famili
11. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. P. Nakov, A. Schwartz, B. Wolf, M. Hearst
12. Searching for High-Utility Text in the Biomedical Literature. H. Shatkay, A. Rzhetsky, W. J. Wilbur
13. Automatic Highlighting of Bioscience Literature. H. Wang, S. Bradshaw, M. Light
BioLINK SIG / BioOntologies in ECCB05 Madrid Sept.

Competitions
- BioCreAtIvE
  Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
  Task 2: GO to protein via text for a collection of human genes.
- TREC I, II
- KDD
- JNLPBA
- others
Text Mining vs. Curation
- Text Mining supports curation
- Curators build and maintain ontologies and databases
- Text Mining profits from data from different resources: ontologies, databases
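Task 1 above combines finding gene / protein names in text with mapping them to database identifiers. The dictionary-lookup idea behind the mapping step might be sketched as follows. This is a minimal illustration only: the synonym table, the identifiers and the function name `normalize_mentions` are hypothetical and are not taken from BioCreAtIvE or from any participating system.

```python
import re

# Hypothetical synonym dictionary: lower-cased name -> made-up identifier.
SYNONYMS = {
    "cdc2": "GENE:0001",
    "p34(cdc2)": "GENE:0001",
    "pcna": "GENE:0002",
    "msh2": "GENE:0003",
}

def normalize_mentions(text, synonyms=SYNONYMS):
    """Return the sorted identifiers whose names occur as whole tokens in text."""
    found = set()
    lowered = text.lower()
    for name, identifier in synonyms.items():
        # Require non-alphanumeric boundaries so a short name does not
        # match inside a longer word.
        pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found.add(identifier)
    return sorted(found)
```

Several synonyms can point to the same identifier, which is why the function collects identifiers in a set before sorting. Real systems also have to handle ambiguous names and species, which this sketch ignores.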

Text mining in a nutshell (task: status)
1. Protein / gene names: 80% prec/recall (BioCreative)
   - Interspecies: far less than that
   - Linking to DBs: essential (Bioinformatics not NLP)
2. Relations between entities: easy on the surface
   - Protein-protein: best known one (accessible?)
   - Other entities (regulation, drugs): dictionaries
   - Function: very difficult (i.e. GO in BioCreative)
3. Type of Relation: semantic
   - Proteins: summaries very difficult
   - Metabolic pathways: new challenge, unexplored
Hoffmann et al., Science STKE 2005; Krallinger et al., Genome Biology 2005; Krallinger et al., DDToday 2005

Krallinger et al., Genome Biology 2005

Text mining in a nutshell (task: status)
1. Protein / gene names: 80% prec/recall (BioCreative)
   1. Interspecies: far less than that
   2. Linking to DBs: essential (not NLP)
2. Relations: easy on the surface
   1. Protein-protein: best known one (accessible?)
   2. Others (regulation, drugs): dictionaries
   3. Function: very difficult (to GO, BioCreative)
3. Type of Relation: semantic
   1. Proteins: summaries very difficult
   2. Metabolic pathways: new challenge, unexplored
4. Concepts for groups of genes: knowledge discovery
   1. Existing: summaries and generalization
   2. Creating new ones: not yet
Hoffmann et al., Science STKE 2005; Krallinger et al., Genome Biology 2005
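The "80% prec/recall" figure refers to the usual evaluation of predicted names against a gold standard. A minimal sketch of that arithmetic, assuming predictions and gold annotations are compared as plain sets of strings; the function name `precision_recall_f1` and the toy data are illustrative, not the BioCreative scoring script:

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of items and return (precision, recall, F1)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy example: 5 predictions, 4 of them among the 5 gold names.
p, r, f = precision_recall_f1(
    {"PCNA", "CDC2", "MSH2", "LBR", "XYZ"},
    {"PCNA", "CDC2", "MSH2", "LBR", "TOP2A"},
)
```

In the toy example four of five predictions are correct and four of five gold names are found, so precision and recall are both 4/5.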

Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
Words: Dipeptidyl, Prolyl, nmr, Collagen-binding
17 genes: PCNA, CDC2, MSH2, LBR, TOP2A
genes: ABCA5, CAT, ELF2, PIM1, WNT2...
GO codes: Cell cycle; Unknown; DNA replication; DNA metabolism; Cell Cycle control
Sentences:
- PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
- LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.
Blaschke et al., Funct. Integ. Genomics 2001

AC Intro 1:30-1:45pm
Text Mining: Dietrich Rebholz-Schuhmann
7. High-recall Protein Entity Recognition Using a Dictionary. Kou, Cohen, Murphy 1:45-2:10pm
9. Beyond The Clause: Extraction of Phosphorylation Information from Medline Abstracts. Narayanaswamy, Ravikumar, Vijay-Shanker 2:10-2:35pm

Exponential Growth in Data
[Figure panels: EMBL Total Entries / year; Medline Total Articles / year; Medline New Articles / year]

OFFICIAL % ALIAS % PROTEIN %
The 2492 selected genes in the year 2002 were cited times
Tamames et al., 2005

KEGG links to literature
Leon et al., pathways with more than one step (information available for 73) individual steps.
Protein-compound links in abstracts
- Total: 2111 steps, 856 linked (40 %)
- Bacterial chemotaxis (89 %)
- Glutathione metabolism: 7 steps, 6 linked (85 %)
- Fatty acid biosynthesis -path (78 %)
Protein-compound links in sentences
- Total: 2111 steps, 611 linked (29 %)
- Bacterial chemotaxis (65 %)
- Two-component system (61 %)
- Citrate cycle (TCA cycle) (63 %)
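The abstract-level and sentence-level counts above differ because co-occurrence within one sentence is a stricter criterion than co-occurrence anywhere in the abstract. A rough sketch of that distinction, assuming naive substring matching and a simple punctuation-based sentence splitter; the function names and the example text are illustrative and do not reproduce the method of Leon et al.:

```python
import re

def cooccurs(text, protein, compound):
    """True if both names appear in text (case-insensitive substring match)."""
    lowered = text.lower()
    return protein.lower() in lowered and compound.lower() in lowered

def link_level(abstract, protein, compound):
    """Return 'sentence', 'abstract', or None for the tightest co-occurrence."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    if any(cooccurs(s, protein, compound) for s in sentences):
        return "sentence"
    if cooccurs(abstract, protein, compound):
        return "abstract"
    return None

def percent_linked(linked, total):
    """Share of pathway steps with a literature link, to one decimal place."""
    return round(100.0 * linked / total, 1)
```

With the totals on the slide, `percent_linked(856, 2111)` and `percent_linked(611, 2111)` come out at about 40.5 and 28.9, in line with the rounded figures of 40 % and 29 %.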

Evolution of gene names
Hoffmann, Valencia, TIGs 2003
[Figure axes: Years; Gene names]
The evolution of gene names over time is a “scale free” process
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names

Hoffmann, Valencia, Nat Genet 2004

SOTA clustering versus significance of GEISHA terms.
Oliveros, Blaschke, GIW 2000

SOTA and GEISHA mixed information
Blaschke, Herrero, Dopazo, Valencia 2002
- Expression based clustering
- Weight (expression) + Weight (text)
- Term (text) based clustering
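The "Weight (expression) + Weight (text)" idea can be read as a weighted sum of an expression-based distance and a text-based distance between two genes. The sketch below is one possible reading under stated assumptions: Euclidean distance on expression profiles, Jaccard distance on term sets, and equal default weights are all choices made here for illustration, not the published SOTA/GEISHA procedure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(terms_a, terms_b):
    """1 - |intersection| / |union| of two term sets (0.0 if both are empty)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def mixed_distance(expr_a, expr_b, terms_a, terms_b, w_expr=0.5, w_text=0.5):
    """Weighted sum of the expression distance and the text distance."""
    return (w_expr * euclidean(expr_a, expr_b)
            + w_text * jaccard_distance(terms_a, terms_b))
```

Setting `w_text=0` reduces this to purely expression-based clustering and `w_expr=0` to purely term-based clustering, the two extremes named on the slide. In practice the two distances would need to be put on comparable scales before mixing.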

Stable clusters > central processes where expression and functional information agree
Unstable groups > contradictory information: “jumping” genes, divergent expression and functional classifications.
(Genes of very unstable behavior > related to insufficient information)