Retrospective study of a gene by mining texts: The Hepcidin use-case
Fouzia Moussouni-Marzolf



Introduction
Life science is becoming the most voluminous science, for 3 major reasons:
1. The modern digital revolution: the Internet.
2. An increasing incentive to publish: competitive pressure and evaluation concerns at several levels.
3. The sharing of knowledge at a global scale.

Introduction
Rapid expansion of the biomedical literature: the number of available papers is exploding, which increases the demand for effective text mining tools that quickly find relevant information.
Hepcidin, since Dec. 2000: a boom of publications since 2000 (MLTrends).
Comprehension of the iron regulation system is still difficult. Comprehension of the associated diseases by medical experts.

Introduction
These tools extract a deluge of information: very dense data.
Example: text mining with Ali-Baba and a global query « Hepcidin » [1] (figures: Hepcidin, January 2011; Hepcidin, February 2011).
For a non-expert, the information is dense and unreadable.
For an expert, there is a considerable amount of well-known data (background): many common events and little that is new. The pertinent information is hidden.
Biologists are rapidly discouraged from using these tools.
[1] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J. & Leser, U. AliBaba: PubMed as a graph. Bioinformatics. 22, (2006).

Introduction
Which solutions for managing this increasing flood of extracted information?
Ability to locate trivial information that is repeatedly published and extracted [2].
Unfolding time during the process of text mining: reduce the density of information at each period of time, and perceive a chronology in the sequence of events linked to a gene, which enhances comprehension.
Select the most relevant events over time, which reduces the density of information.
[2] Jensen, L.J., Saric, J. & Bork, P., Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 7, (2006).

Methods
Focus on 2 frames of study.
1. Exploit the text mining engine Ali-Baba (HU-Berlin), an information extraction tool that works on the Medline abstracts returned by a PubMed query (example: Hepcidin 2005 [dp]).
Ali-Baba is not a simple pattern matching tool that counts keyword occurrences. It recognizes actual biological entities located in the abstracts, using dictionaries.
Different sorts of bio-entities are extracted: protein, cell type, disease, tissue, drug, species.

Methods
Ali-Baba extracts relationships between recognized bio-entities, namely bio-events.
Example sentence: "… STAT3 inhibitors, including curcumin, AG490 and a peptide (PpYLKTK), reduced hepcidin1 …"
Biological events:
Source Entity        Relationship   Target Entity
Curcumin             reduce         hepcidin1
AG490                reduce         hepcidin1
Peptide (PpYLKTK)    reduce         hepcidin1
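
A minimal sketch of how the source / relationship / target triples on this slide could be held in memory. The class name `BioEvent` and the helper `targets_of` are hypothetical names chosen for illustration; they are not part of Ali-Baba.

```python
from typing import NamedTuple


class BioEvent(NamedTuple):
    """One extracted bio-event: source entity, relationship, target entity."""
    source: str
    relationship: str
    target: str


# The three events read off the example sentence on the slide.
slide_events = [
    BioEvent("curcumin", "reduce", "hepcidin1"),
    BioEvent("AG490", "reduce", "hepcidin1"),
    BioEvent("peptide (PpYLKTK)", "reduce", "hepcidin1"),
]


def targets_of(events, relationship):
    """Return the set of entities that are the target of a given relationship."""
    return {e.target for e in events if e.relationship == relationship}
```

Because a `NamedTuple` is hashable, the same events can later be placed in sets, which is convenient when comparing the events of one period with those of earlier periods.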

Methods
Extraction of bio-events from the abstracts of « Hepcidin 2005 [dp] », by co-occurrence and natural language processing (NLP), produces a graph of events.

Methods
2. Focus on the Hepcidin gene.
Retrospective study of Hepcidin over time, with period = 1 month, from Dec. 2000 to June 2012.
Corpus of linked biological events published since the gene discovery until today.
Filter trivial bio-events. Select relevant bio-entities.

Methods
What is a time-relevant biological entity?
Definition: a biological entity e recognized by an IE-based text mining system is time relevant for period t if it achieves at time t a maximum of relationships with other biological entities recognized by the same IE-based system.
In the graph G(Nodes, Edges) of extracted bio-events, e is highly targeted by other bio-entities: e is a t-relevant biological entity at time t.
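
The definition can be sketched in a few lines, under one reading of it: within the graph of a single period t, the t-relevant entities are those linked to the largest number of distinct other entities. This reading, the function name `t_relevant_entities` and the toy data are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def t_relevant_entities(events):
    """Return the entities with the most distinct partners in one period's events.

    `events` is an iterable of (source, relationship, target) triples
    extracted for a single period t.
    """
    partners = defaultdict(set)
    for source, _relationship, target in events:
        if source != target:  # ignore self-loops
            partners[source].add(target)
            partners[target].add(source)
    if not partners:
        return set()
    best = max(len(p) for p in partners.values())
    return {entity for entity, p in partners.items() if len(p) == best}


# Toy period built from the example events of the previous slide.
period_events = [
    ("curcumin", "reduce", "hepcidin1"),
    ("AG490", "reduce", "hepcidin1"),
    ("peptide (PpYLKTK)", "reduce", "hepcidin1"),
]
relevant = t_relevant_entities(period_events)
```

Ties are returned together as a set, since the definition does not say how to break them.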

Methods
T-relevance can be computed for different sorts of biological entities: protein, cell type, disease, tissue, drug, species.
Both the source entity and the target entity of a relationship can be of any of these sorts.
Each kind of relevance gives different valuable information.

Methods
What is a trivial biological event at time t?
$G_0$ = graph of events at time $t_0$
$G_1$ = graph of events at time $t_1 = t_0 + p$
$G_2$ = graph of events at time $t_2 = t_0 + 2p$
and so on for $t_0 + 3p$, ...
A trivial event $T_e$ is an event already published before t:
$T_e \in G_1$ and $T_e \in G_0$
$T_e \in G_2$ and ($T_e \in G_1$ or $T_e \in G_0$)
...
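
The rule above amounts to a running set difference: an event of period k is trivial if it already appears in any earlier graph. A minimal sketch, assuming each graph is reduced to a set of hashable (source, relationship, target) triples; the function name `split_trivial` and the sample events e1, e2, e3 are hypothetical.

```python
def split_trivial(graphs):
    """Split each period's events into (new, trivial).

    `graphs` is a list of event collections ordered by time: G0, G1, G2, ...
    An event is trivial in period k if it belongs to any earlier graph.
    """
    seen = set()
    result = []
    for events in graphs:
        events = set(events)
        trivial = events & seen   # already published before this period
        new = events - seen       # first publication of the event
        result.append((new, trivial))
        seen |= events
    return result


e1 = ("curcumin", "reduce", "hepcidin1")
e2 = ("AG490", "reduce", "hepcidin1")
e3 = ("peptide (PpYLKTK)", "reduce", "hepcidin1")
periods = split_trivial([{e1}, {e1, e2}, {e2, e3}])
```

Here `periods[1]` holds `e2` as new and `e1` as trivial, and `periods[2]` holds `e3` as new and `e2` as trivial.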

Methods
Data processing pipeline.
For each period t in [t_0, t_n]: Query(t) = « Gene t [dp] » is sent to the Ali-Baba web service, and the events extracted and drawn for period t are exported as GraphML and inserted into a GraphML database.
Data transformation, data integration and data stamping produce the integrated time-based events of the decade.
Final retrospective data analysis: clearing of trivial data, selection of t-relevant bio-entities.
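
The first step of the pipeline, building one query per period, can be sketched as follows. The slides only show a yearly example (Hepcidin 2005 [dp]); writing a monthly period as YYYY/MM before the [dp] tag is an assumption about the query syntax, and `monthly_queries` is a hypothetical helper. The call to the Ali-Baba web service is deliberately left out, since its interface is not described in the slides.

```python
def monthly_queries(gene, start, end):
    """Build one date-of-publication query per month, both ends inclusive.

    `start` and `end` are (year, month) tuples.
    """
    year, month = start
    queries = []
    while (year, month) <= end:
        # Assumed format: "<gene> YYYY/MM [dp]"
        queries.append(f"{gene} {year}/{month:02d} [dp]")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return queries


# Period range taken from the slides: Dec. 2000 to June 2012.
hepcidin_queries = monthly_queries("Hepcidin", (2000, 12), (2012, 6))
```

With a one-month period, this range yields 139 queries, each of which would be sent to the extraction service and stamped with its period before integration.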

Results
Hepcidin gene use case, from t_0 = 12/2000 to t_n = 12/…
Database of more than 50,000 published biological events.
Considerable amount of trivial events (background?): …% of the published events over the whole Hepcidin decade are trivial.
Figure: cumulative quantification of trivial events over time.
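
A sketch of how a cumulative quantification of trivial events could be computed from the per-period graphs. It assumes the measure is the running share of trivial events among all events published so far; the slides do not state the exact formula, and the function name and toy data are hypothetical.

```python
def cumulative_trivial_share(graphs):
    """Return, for each period, the share of trivial events published so far.

    `graphs` is a list of event collections ordered by time. An event counts
    as trivial in a period if it already appeared in an earlier period.
    """
    seen = set()
    total = 0
    trivial = 0
    shares = []
    for events in graphs:
        events = set(events)
        total += len(events)
        trivial += len(events & seen)
        seen |= events
        shares.append(trivial / total if total else 0.0)
    return shares


# Toy data: letters stand in for extracted bio-events.
toy_shares = cumulative_trivial_share([{"a", "b"}, {"a", "c"}, {"a", "b", "c", "d"}])
```

On the toy data the share grows from 0.0 to 0.25 to 0.5, the kind of rising curve the slide's figure title describes.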

Results: Hepcidin gene use case
Relevant bio-entities over time. Figure: relevant proteins over time, before clearing trivial events and after clearing.
Permanent visibility of Hepcidin as relevant.
New information emerges as highly targeted: several proteins regulate Hepcidin transcription.

Results: Hepcidin gene use case
Relevant bio-entities over time. Figure: relevant diseases over time, before clearing trivial events and after clearing.
Permanent visibility of hemochromatosis and iron overload.
New diseases linked to Hepcidin and iron, such as the neurological diseases, emerge as highly targeted.

Results
More annotations of the “relevant entities”.

Conclusion
A new straightforward approach for retrospective studies of genes has been proposed.
Time has been coupled to the process of information extraction to improve comprehension of the considerable amount of biological events linked to the Hepcidin gene since its discovery in Dec.
This work is still ongoing. Current developments:
Toward a generalization to queries on any biological entity.
Exclude review papers and the “background” and “methods” sections from mining, to minimize trivial events and entities.
Threshold of relevance, threshold of triviality.

Acknowledgments
Contributors:
Bertrand De-Cadeville, Master 2 MSB
Olivier Loréal, resp. Iron Team, INSERM UMR 991
Ulf Leser, resp. Bioinformatics Team, HU-Berlin
Astrid Rheinlander, Ali-Baba Team at Berlin