Published by Lewis Davis. Modified over 9 years ago.
1
Retrospective study of a gene by text mining: the Hepcidin use case
Fouzia Moussouni-Marzolf
2
Introduction
Life science is becoming the most voluminous science, for 3 major reasons:
- The modern digital revolution: the Internet
- An increasing incentive to publish: competitive pressure and evaluation concerns at several levels
- Sharing of knowledge at a global scale
3
Introduction
Rapid expansion of the biomedical literature: the number of available papers is exploding.
Increased demand for effective text mining tools to quickly find relevant information.
Hepcidin (since Dec 2000):
- The iron regulation system is still difficult to comprehend
- Comprehension of the associated diseases by medical experts
[Figure: boom of Hepcidin publications since 2000 (source: MLTrends)]
4
Introduction
These tools extract a deluge of information: very dense data.
Text mining with Ali-Baba and a global query "Hepcidin" [1]
[Figure panels: Hepcidin, January 2011; Hepcidin, February 2011]
For a non-expert: the information is dense and unreadable.
For an expert:
- A considerable amount of well-known data (background)
- Many common events, little news
- The pertinent information is hidden
Biologists are rapidly discouraged from using these tools.
[1] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J. & Leser, U. AliBaba: PubMed as a graph. Bioinformatics 22, 2444-2445 (2006).
5
Introduction
Which solutions exist for managing this increasing flood of extracted information?
- The ability to locate trivial information that is repeatedly published and extracted [2]
- Unfolding time during the text mining process:
  - reduces the density of information in each period of time
  - gives a perception of chronology in the sequence of events linked to a gene, which enhances comprehension
- Selecting the most relevant events over time = reduced density of information
[2] Jensen, L.J., Saric, J. & Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 7, 119-129 (2006).
6
Methods
Focus on 2 frames of study
1. Exploit the text mining engine Ali-Baba (HU-Berlin)
An information extraction (IE) tool that works on the Medline abstracts returned by a PubMed query.
Ali-Baba is not a simple pattern-matching tool that counts keyword occurrences: it uses dictionaries to recognize actual biological entities located in the abstracts.
Different sorts of bio-entities are extracted: proteins, cell types, diseases, tissues, drugs, species.
[Figure: entities extracted for the query "Hepcidin 2005 [dp]"]
7
Methods
Ali-Baba extracts relationships between recognized bio-entities, namely bio-events.
Example sentence: "… STAT3 inhibitors, including curcumin, AG490 and a peptide (PpYLKTK), reduced hepcidin1 …"
Biological events extracted:
Source entity        Relationship   Target entity
Curcumin             reduce         hepcidin1
AG490                reduce         hepcidin1
Peptide (PpYLKTK)    reduce         hepcidin1
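A bio-event as described on this slide can be held as a (source, relationship, target) triple. The sketch below is only an illustration of that data structure: the `BioEvent` type and the `events_from_enumeration` helper are hypothetical names, and nothing here reflects Ali-Baba's actual output format or extraction logic.

```python
from typing import NamedTuple


class BioEvent(NamedTuple):
    """One extracted relationship (hypothetical representation, not Ali-Baba's format)."""
    source: str
    relationship: str
    target: str


def events_from_enumeration(sources, relationship, target):
    """Expand one sentence with several subjects into one event per subject."""
    return [BioEvent(s, relationship, target) for s in sources]


# The slide's example sentence yields three events sharing a relationship and a target.
events = events_from_enumeration(
    ["curcumin", "AG490", "peptide (PpYLKTK)"], "reduce", "hepcidin1"
)
for event in events:
    print(event.source, event.relationship, event.target)
```

Keeping each event as an immutable, hashable tuple makes it easy to store events in sets and compare them across time periods.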
8
Methods
Extraction of bio-events
Input: abstracts of "Hepcidin 2005 [dp]"
Techniques: co-occurrence, natural language processing (NLP)
Output: graph of events
9
Methods
2. Focus on the Hepcidin gene
Retrospective study of Hepcidin over time, with a period of 1 month:
- Filter trivial bio-events
- Select relevant bio-entities
Corpus of linked biological events published from the gene's discovery until today
[Timeline: Dec 2000 to June 2012]
10
Methods
What is a time-relevant biological entity?
Definition: a biological entity e recognized by an IE-based text mining system is time-relevant for period t if, at time t, it achieves a maximum number of relationships with other biological entities recognized by the same IE-based system.
In the graph G(Nodes, Edges) of extracted bio-events, e is highly targeted by other bio-entities: e is the t-relevant biological entity at time t.
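One possible reading of this definition is sketched below: for the events of a single period, count how many relationships point at each entity and keep the entities with the highest count. Counting incoming edges (rather than all edges), ignoring self-loops, and returning ties together are assumptions of this sketch; the slides do not specify them. The function name and the `entity_x` event are made up for illustration.

```python
from collections import Counter


def t_relevant_entities(events):
    """Return (entities, score) for one period.

    events: iterable of (source, relationship, target) triples.
    Assumption: relevance = number of incoming relationships from other entities.
    """
    incoming = Counter(
        target for source, _rel, target in events if source != target
    )
    if not incoming:
        return set(), 0
    best = max(incoming.values())
    return {entity for entity, n in incoming.items() if n == best}, best


# Illustrative events for one period t (the last triple is hypothetical).
events_t = [
    ("curcumin", "reduce", "hepcidin1"),
    ("AG490", "reduce", "hepcidin1"),
    ("peptide (PpYLKTK)", "reduce", "hepcidin1"),
    ("hepcidin1", "associate", "entity_x"),
]
relevant, score = t_relevant_entities(events_t)
print(relevant, score)
```

Filtering the event list by entity type before calling the function would give a per-type relevance (proteins, diseases, and so on).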
11
Methods
T-relevance can be computed for different sorts of biological entities: protein, cell type, disease, tissue, drug, species.
Each kind of relevance provides different valuable information.
[Diagram: source entity, relationships, target entity; the target entity can be a protein, cell type, disease, tissue, drug or species]
12
Methods
What is a trivial biological event at time t?
A trivial event Te is an event already published before t.
G0 = graph of events at time t0
G1 = graph of events at time t1 = t0 + p
G2 = graph of events at time t2 = t0 + 2p
Te is trivial at t1 if Te ∈ G1 and Te ∈ G0
Te is trivial at t2 if Te ∈ G2 and (Te ∈ G1 or Te ∈ G0)
...
[Timeline: t0 + p, t0 + 2p, t0 + 3p]
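The definition above amounts to a running set of already-seen events: an event in the graph of period k is trivial if it appears in any earlier graph. A minimal sketch follows, assuming each period's graph is reduced to a set of hashable events and that two events are the same when they compare equal (how event identity is decided is not stated on the slides). The function name and the sample data are hypothetical.

```python
def split_trivial(events_by_period):
    """Split each period's events into (novel, trivial).

    events_by_period: list of sets of events in time order (G0, G1, G2, ...).
    An event is trivial in period k if it occurs in any earlier period.
    """
    seen = set()
    result = []
    for events in events_by_period:
        trivial = events & seen   # already published before this period
        novel = events - seen     # first appearance
        result.append((novel, trivial))
        seen |= events
    return result


# Hypothetical event labels standing in for (source, relationship, target) triples.
g0, g1, g2 = {"a", "b"}, {"a", "c"}, {"b", "c", "d"}
for period, (novel, trivial) in enumerate(split_trivial([g0, g1, g2])):
    print(period, sorted(novel), sorted(trivial))
```

Because only the union of earlier graphs is kept, the check for period k costs one set intersection regardless of how many periods came before.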
13
Methods
Data processing pipeline
For each period t in [t0, tn]:
1. Query(t) = "Gene t [dp]"
2. Ali-Baba web service for Query(t): events extracted and drawn for period t
3. GraphML export
4. Insert into the GraphML database
Processing steps: data transformation, data integration, data stamping
Result: integrated time-based events of the decade
Final retrospective data analysis:
- Clearing of trivial data
- Selection of t-relevant bio-entities
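The per-period loop of the pipeline starts by building one query per month. The sketch below only generates those query strings; it does not call the Ali-Baba web service, whose interface is not described here. The `YYYY/MM` period format is an assumption modelled on the "Hepcidin 2005 [dp]" example, and `monthly_queries` is a hypothetical helper.

```python
def monthly_queries(gene, start, end):
    """Yield (period, query) for every month from start to end inclusive.

    start, end: (year, month) tuples.
    Assumption: a month is written as YYYY/MM in front of the [dp] tag.
    """
    year, month = start
    while (year, month) <= end:
        period = f"{year:04d}/{month:02d}"
        yield period, f"{gene} {period} [dp]"
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Periods of the use case: t0 = 12/2000 to tn = 12/2011.
queries = list(monthly_queries("Hepcidin", (2000, 12), (2011, 12)))
print(len(queries), queries[0], queries[-1])
```

Each period label can then be used to stamp the events returned for that query before they are stored.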
14
Results: Hepcidin gene use case
From t0 = 12/2000 to tn = 12/2011
Database of more than 50,000 published biological events.
A considerable amount of trivial events (background?): 52% of the events published over the whole Hepcidin decade are trivial.
[Figure: cumulative quantification of trivial events over time]
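A cumulative quantification of this kind can be derived from per-period counts of total and trivial events. The sketch below shows the arithmetic only; the counts are invented for illustration and are not the Hepcidin data, and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_trivial_share(counts):
    """Return the cumulative fraction of trivial events after each period.

    counts: list of (total_events, trivial_events) per period, in time order.
    """
    totals = accumulate(total for total, _ in counts)
    trivials = accumulate(trivial for _, trivial in counts)
    return [tr / to if to else 0.0 for to, tr in zip(totals, trivials)]


# Made-up counts for three periods (NOT the study's data).
shares = cumulative_trivial_share([(10, 0), (10, 5), (20, 15)])
print(shares)
```

The last value of the returned list is the overall share of trivial events across all periods.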
15
Results: Hepcidin gene use case
Relevant bio-entities over time: relevant proteins
- Permanent visibility of Hepcidin as relevant
- New information emerges as highly targeted: several proteins regulate Hepcidin transcription
[Figure: relevant proteins over time, before and after clearing trivial events]
16
Results: Hepcidin gene use case
Relevant bio-entities over time: relevant diseases
- Permanent visibility of hemochromatosis and iron overload
- New diseases linked to Hepcidin and iron, such as neurological diseases, emerge as highly targeted
[Figure: relevant diseases over time, before and after clearing trivial events]
17
Results
More annotations of the "relevant entities"
18
Conclusion
A new, straightforward approach for retrospective studies of genes has been proposed.
Time has been coupled to the information extraction process to improve comprehension of the considerable number of biological events linked to the Hepcidin gene since its discovery in Dec 2000.
This work is still ongoing. Current developments:
- Toward a generalization to queries on any biological entity
- Exclude review papers and the "background" and "methods" sections from mining, to minimize trivial events and entities
- Threshold of relevance, threshold of triviality
19
Acknowledgments
Contributors:
- Bertrand De-Cadeville, Master 2 MSB
- Olivier Loréal, head of the Iron Team, INSERM UMR 991
- Ulf Leser, head of the Bioinformatics Team, HU-Berlin
- Astrid Rheinlander, Ali-Baba Team at Berlin