PhD defense: Patrick Glenisson. Integrating Scientific Literature With Large Scale Gene Expression Analysis. Promotor: Prof. Bart De Moor. June 11th, 2004.

Presentation transcript:

PhD defense: Patrick Glenisson. Integrating Scientific Literature With Large Scale Gene Expression Analysis. Promotor: Prof. Bart De Moor. June 11th, 2004.

Overview
• Genes & microarrays
• Gene expression data analysis
• Text mining in biology: principles
• Text mining in practice: TXTGate
• Combining text and gene expression data
• Conclusion

Genes and Microarrays: DNA, genes, proteins and cells

Genes and Microarrays: Genes are expressed and regulated

Genes and Microarrays: Microarrays measure gene expression
[Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), accompanied by gene annotations and sample annotations]

Genes and Microarrays: Representing expression information
• Gene expression experiments are complex:
  • Too verbose to include in a scientific publication
  • Too important to compromise on reproducibility
  • Too valuable for post-genome research to be scattered across various websites
• Hence, a standard for reporting on microarray (MA) experiments
  • As a guideline for databases hosting expression compendia
[Figure label: conditions in which expression occurs]

Genes and Microarrays: MIAME standard
• Minimum Information About a Microarray Experiment
• Internationally proposed standard
• Published in Dec 2001 by the international consortium MGED
• Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data
• Some hurdles:
  • Significant overhead in filling out the questionnaire
  • Scooping of leads (!)
  • Proprietary information about probe sequences

Gene expression data analysis: Questions asked with microarrays
• Fundamental
  • Functional roles of genes (and transcriptional regulation)
  • Genetic network reconstruction
• Clinical
  • Correlation of genes with a given disease
  • Diagnosis of disease stage in patients
• Pharmacological
  • Toxicological drug response assessment

Gene expression data analysis: Microarray data analysis
• Fundamental
  • Functional roles of genes (and transcriptional regulation)
  • Genetic network reconstruction
• Clinical
  • Correlation of genes with a given disease
  • Diagnosis of disease stage in patients
• Pharmacological
  • Toxicological drug response assessment

Gene expression data analysis: Clustering
[Figure: expression data (genes by conditions) is converted into a gene-by-gene distance matrix, which is then clustered (clusters C1, C2, C3) by hierarchical clustering or k-means]
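
The pipeline on this slide (expression data, then a distance matrix, then clustering) can be sketched as follows. This is a minimal illustration, not the method used in the thesis: the Pearson-correlation distance, the average-linkage rule and the toy expression values are all assumptions chosen for the example.

```python
from itertools import combinations
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Symmetric gene-by-gene distance matrix as a dict keyed by gene pairs."""
    dist = {}
    for g, h in combinations(sorted(profiles), 2):
        d = pearson_distance(profiles[g], profiles[h])
        dist[(g, h)] = dist[(h, g)] = d
    return dist

def hierarchical_clustering(profiles, k):
    """Average-linkage agglomerative clustering, stopped when k clusters remain."""
    dist = distance_matrix(profiles)
    clusters = [[g] for g in sorted(profiles)]
    while len(clusters) > k:
        def linkage(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return sum(dist[(g, h)] for g in a for h in b) / (len(a) * len(b))
        # Merge the two clusters with the smallest average pairwise distance.
        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Toy expression matrix (hypothetical values): one profile per gene, one value per condition.
toy_expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [1.1, 2.1, 2.9, 4.2],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [3.9, 3.2, 1.8, 1.1],
}
clusters = hierarchical_clustering(toy_expression, k=2)
```

The number of clusters k is passed in by hand here, which is exactly the parameter the following slides on cluster validation try to choose.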

Gene expression data analysis: Cluster validation
• Define ‘optimal’? Optimal number of clusters?
• Data-centered statistical scores
  • Coherence vs. separation of clusters
  • Stability of a cluster solution when leaving out data
[Figure: clusters C1, C2, C3]

Gene expression data analysis: Cluster validation
• Define ‘optimal’? Optimal number of clusters?
• Data-centered statistical scores
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring

Gene expression data analysis: Cluster validation
• Define ‘optimal’? Optimal number of clusters?
• Data-centered statistical scores
• Knowledge-based scores
• Motif-based
  • DNA patterns in regulatory regions of gene groups
[Figure: regulatory DNA patterns (motifs) and gene]

Gene expression data analysis: DNA patterns in expression clusters
• Significant occurrences of known motifs in each cluster
[Figure: gene clusters (A, B, C, ..) are scored against known motifs, giving a cluster-by-motif (motif enrichment) matrix with -log(p-value) entries, which is summarized by the M-score]

Gene expression data analysis: Cluster-by-motif matrix
• M-score for the entire clustering solution: a one-shot estimate of its ‘biological relevance’
[Figure: cluster-by-motif matrix]
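
A cluster-by-motif matrix of -log(p-value) entries can be sketched as below. The slides do not say which test produces the p-values, so the one-sided hypergeometric test used here is an assumption, as are the gene names, motif names and genome size. The M-score formula itself is not given in the transcript and is not reproduced.

```python
from math import comb, log10

def hypergeom_tail(hits, cluster_size, motif_total, genome_size):
    """P(X >= hits) when cluster_size genes are drawn from genome_size genes,
    of which motif_total carry the motif."""
    upper = min(cluster_size, motif_total)
    total = comb(genome_size, cluster_size)
    tail = sum(
        comb(motif_total, x) * comb(genome_size - motif_total, cluster_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total

def cluster_by_motif(clusters, motif_genes, genome_size):
    """Matrix of -log10(p-value) entries, indexed as matrix[cluster][motif]."""
    matrix = {}
    for cname, genes in clusters.items():
        matrix[cname] = {}
        for mname, carriers in motif_genes.items():
            hits = len(genes & carriers)
            p = hypergeom_tail(hits, len(genes), len(carriers), genome_size)
            matrix[cname][mname] = -log10(p)
    return matrix

# Hypothetical toy data: a genome of 20 genes, two clusters, two known motifs.
toy_clusters = {
    "A": {"g1", "g2", "g3", "g4", "g5"},
    "B": {"g6", "g7", "g8", "g9", "g10"},
}
toy_motifs = {
    "M1": {"g1", "g2", "g3", "g4", "g11"},
    "M2": {"g6", "g12", "g13"},
}
enrichment = cluster_by_motif(toy_clusters, toy_motifs, genome_size=20)
```

In this toy case motif M1 is concentrated in cluster A, so `enrichment["A"]["M1"]` is large while `enrichment["B"]["M1"]` is zero.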

Gene expression data analysis: M-score
• A motif is less interesting when it occurs (significantly) in many clusters.
• A cluster that contains a large portion of the (significant) motifs is less likely to be biologically relevant.
• A ‘too large’ number of clusters is less likely to reflect the true biological diversity underlying the experiment.

Gene expression data analysis: M-score validation
• A simplification of reality
• No absolute quantification of biological relevance
• Useful tool when experimenting with
  • Multiple clustering methods
  • Multiple parameterizations
• To economize on biological validations
• Optimal k in yeast cell cycle expression data
  • Original studies by Tavazoie et al. used k=30
  • Overestimation, confirmed by analyses of De Smet et al. (AQBC) and Gibbons et al. (GO-based scoring)
[Figure: M-score as a function of k]

Text Mining (principles): Problem setting
• Given a set of documents,
• compute a representation, called an index,
• to retrieve, summarize, classify or cluster them.

Text Mining (principles): Problem setting
• Given a set of genes (and their literature),
• compute a representation, called a gene index,
• to retrieve, summarize, classify or cluster them.

Text Mining (principles): Vector space model
• Document processing
  • Remove punctuation & grammatical structure (‘bag of words’)
  • Define a vocabulary
    • Identify multi-word terms, e.g. tumor suppressor (phrases)
    • Eliminate low-content words, e.g. and, thus, gene, ... (stopwords)
    • Map words with the same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
  • Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
• Compute an index of textual resources
[Figure: a gene represented as a vector over vocabulary terms T1, T2, T3]
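
The document-processing steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stopword list, the single phrase, the plural-stripping rule (a crude stand-in for a real stemmer) and the plain tf × log(N/df) weighting are all chosen for the example, and synonym mapping is left out.

```python
import re
from collections import Counter
from math import log

STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}  # illustrative list
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}  # illustrative multi-word term

def tokenize(text):
    """Lowercase, drop punctuation, merge known phrases, remove stopwords, strip plural 's'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    terms = [w for w in merged if w not in STOPWORDS]
    # Crude stand-in for stemming: strip a trailing plural 's' from longer words.
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in terms]

def tfidf_index(documents):
    """Return {doc_id: {term: tf-idf weight}} for a dict of raw texts."""
    bags = {d: Counter(tokenize(t)) for d, t in documents.items()}
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags.values() for term in bag)
    return {
        d: {t: tf * log(n_docs / doc_freq[t]) for t, tf in bag.items()}
        for d, bag in bags.items()
    }

# Hypothetical toy documents.
toy_docs = {
    "d1": "The tumor suppressor regulates cell proliferation.",
    "d2": "Growth factors and cell proliferation in tumors.",
    "d3": "Heparin binding of the growth factor.",
}
index = tfidf_index(toy_docs)
```

Terms that occur in a single document, such as `tumor_suppressor`, receive a higher weight than terms shared across documents, such as `cell`.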

Text Mining (principles): Validity of gene index
• Genes that are functionally related should be close in text space
• Text-based coherence score
  • Modeled with respect to a background distribution of [formula not recovered]
  • through random and permuted gene groups
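
One way to read this slide is as a permutation-style test: score a gene group by how close its members are in text space, then compare that score with scores of random groups of the same size. The sketch below follows that reading. The mean pairwise cosine similarity, the empirical p-value and the toy gene vectors are assumptions for illustration; the actual coherence score of the thesis is not given in the transcript.

```python
import random
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, gene_index):
    """Mean pairwise cosine similarity of the genes in a group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(gene_index[g], gene_index[h]) for g, h in pairs) / len(pairs)

def empirical_p_value(group, gene_index, n_random=1000, seed=0):
    """Fraction of random same-size groups that are at least as coherent as the group."""
    rng = random.Random(seed)
    observed = coherence(group, gene_index)
    genes = sorted(gene_index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), gene_index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)

# Hypothetical gene index: g1..g3 share literature terms, the others do not.
toy_gene_index = {
    "g1": {"cell_cycle": 1.0, "cyclin": 0.5},
    "g2": {"cell_cycle": 0.9, "mitosis": 0.4},
    "g3": {"cell_cycle": 0.8, "cyclin": 0.3},
    "g4": {"glycolysis": 1.0},
    "g5": {"ribosome": 1.0},
    "g6": {"transport": 1.0},
    "g7": {"heparin": 1.0},
    "g8": {"kinase": 1.0},
}
toy_group = ["g1", "g2", "g3"]
score = coherence(toy_group, toy_gene_index)
p_value = empirical_p_value(toy_group, toy_gene_index)
```

A functionally related group should give a high score and a small empirical p-value, while a random group should not.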

TXTGate - a platform to profile groups of genes: Motivation 1
“Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001)
[Figure: example annotations for VEGF. GeneRIF: “VEGF is associated with the development and prognosis of colorectal cancer”; “PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression”; “Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex”. GO: cell proliferation; heparin binding; growth factor activity]

TXTGate - a platform to profile groups of genes: Motivation 2
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• A number of structured vocabularies have already arisen:
  • Gene Ontology (GO)
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO

TXTGate - a platform to profile groups of genes: Motivation 3
(Figure courtesy: S. Van Vooren)

TXTGate - a platform to profile groups of genes: TXTGate
[Figure: TXTGate workflow with profile, distance matrix & clustering, and other vocabulary]

TXTGate - a platform to profile groups of genes: TXTGate – a case study
• Gene modules over various expression data sets
• Reported two sub-modules of the TCA cycle
• Two ‘new’ genes, ACN9 & CAT8, in module 2

Fusion of text and expression data: Problem setting
“How can we analyze data in an integrated fashion to extract more information than from expression data alone?”

Fusion of text and expression data: Integration of text and data
• In each information space
  • Appropriate preprocessing
  • Choice of distance measures

Fusion of text and expression data: Integration of text and data
• Combine data with a weight that reflects the confidence attributed to either of the two data types
• For distances, this weight can be seen as a scaling constant between the norms of the data and text representations
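
The weighted combination described on this slide can be sketched as a convex combination of two distance matrices. The formula on the original slide was not recovered, so the form weight × d_expr + (1 - weight) × d_text, the function name and the toy distances are assumptions.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination of two distance matrices given as {pair: distance} dicts.

    weight is the confidence placed on the expression data (0 <= weight <= 1);
    the remainder, 1 - weight, goes to the text data.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return {
        pair: weight * d_expr[pair] + (1.0 - weight) * d_text[pair]
        for pair in d_expr
    }

# Hypothetical distances for three gene pairs.
toy_d_expr = {("g1", "g2"): 0.2, ("g1", "g3"): 1.8, ("g2", "g3"): 1.6}
toy_d_text = {("g1", "g2"): 0.9, ("g1", "g3"): 0.7, ("g2", "g3"): 0.95}
combined = combine_distances(toy_d_expr, toy_d_text, weight=0.5)
```

Note that the toy expression distances span a wider range than the text distances, so with equal weights the expression data still dominates the ranking of pairs. This is the scaling issue that a plain weighted sum does not address.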

Fusion of text and expression data: Integration of text and data
• However, the distributions of distances introduce a bias
  • Scaling problem
• Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)
[Figure: expression distance histogram and text distance histogram]
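
The slide names an omnibus procedure without spelling it out. The sketch below uses Fisher's method, a well-known omnibus test from meta-analysis, as a stand-in: each distance is first replaced by its empirical cumulative probability within its own matrix, which removes the scale difference, and the two probabilities are then combined. Whether the thesis uses this exact variant is not stated in the transcript. The function names and toy distances are assumptions, and the independence that Fisher's method presumes is only an approximation for distances.

```python
from math import log

def empirical_p_values(distances):
    """Map each distance to the fraction of distances in the same matrix
    that are less than or equal to it (small distance -> small p-value)."""
    values = list(distances.values())
    n = len(values)
    return {pair: sum(v <= d for v in values) / n for pair, d in distances.items()}

def fisher_combined(p1, p2):
    """Fisher's omnibus p-value for two p-values.

    -2 * ln(p1 * p2) is referred to a chi-square distribution with 4 degrees of
    freedom, whose survival function has the closed form q * (1 - ln q), q = p1 * p2.
    """
    q = p1 * p2
    return q * (1.0 - log(q))

def omnibus_distance(d_expr, d_text):
    """Integrated distance in (0, 1]: small only when both sources say 'close'."""
    p_expr = empirical_p_values(d_expr)
    p_text = empirical_p_values(d_text)
    return {pair: fisher_combined(p_expr[pair], p_text[pair]) for pair in d_expr}

# Hypothetical distances for three gene pairs, on deliberately different scales.
toy_d_expr = {("g1", "g2"): 0.2, ("g1", "g3"): 1.8, ("g2", "g3"): 1.6}
toy_d_text = {("g1", "g2"): 0.9, ("g1", "g3"): 0.7, ("g2", "g3"): 0.95}
integrated = omnibus_distance(toy_d_expr, toy_d_text)
```

Because only the ranks within each matrix matter, rescaling either input (for example multiplying every expression distance by 10) leaves the integrated distances unchanged.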

Fusion of text and expression data: Overview of meta-clustering
[Figure: pipeline involving clustering and the M-score]

Fusion of text and expression data: Integration improves M-score
[Figure: M-score for expression data only vs. M-score for integrated clustering, at various cutoffs k of the cluster tree; optimal k?]

Fusion of text and expression data: A look inside the integration
[Figure: expression profile and text profile; strong reinforcement]

Conclusion: Contributions
• Representation of a gene expression experiment
  • MIAME
  • Laboratory Information Management System v. at the VIB MicroArray Facility
• Gene expression analysis
  • Iterative clustering to determine optimal k
  • M-score
• Text-based gene representation
  • To represent functional information about genes
  • To score gene groups based on literature
  • To cluster genes based on literature
• TXTGate text mining application
  • To profile, in a flexible and interactive manner, gene groups from different ‘views’
• Integration of text and expression data in clustering

Conclusion: Future work
• Semantically-oriented text mining representations
  • Algorithm-based:
    • Improved phrases (word co-locations)
    • Latent Semantic Indexing
    • Concept clustering, bi-clustering
  • Knowledge-based:
    • Gene Ontology → distance in a taxonomy
    • Basic natural language processing + statistics = shallow parsing
• Advanced ways of integrating data
  • Combine link information with term information
  • Ways to determine

Conclusion: Publications

Questions?

TXTGate - a platform to profile groups of genes: TXTGate – final considerations
• Flexible tool for analyzing gene groups (~200 genes) due to various term- and gene-centric vocabularies
  • ... that allow some level of interoperability with external annotation databases
• Sub-clustering gene groups is useful to detect biological sub-patterns or shortcomings of the text representation
• Reasonably robust to corrupted groups
• Gene index normalizes for unbalanced references and handles multiple gene functions by ‘overruling’

Genes and Microarrays: Representing expression information
• Rationale:
  • Gene expression experiments are a chain of biotechnological operations, protocols and data processing steps
  • Too verbose to include in a scientific publication
  • Too important to compromise on reproducibility
  • Too valuable for post-genome research to be scattered across various websites
• Standards for reporting on MA experiments
• MIAME-compliant databases hosting expression compendia
[Figure label: conditions in which expression occurs]

Gene expression data analysis: Clustering parameterization
• Clustering: hierarchical clustering, k-means
• Optimal number of clusters? Define ‘optimal’?
• Data-centered statistical scores exist (Gap statistic, FOM, Silhouette coefficient, ...)
  • ... but they are built on the data that produced the result, and are not necessarily biologically relevant
• Knowledge-based (GO- or text-based) scores (Neighborhood divergence, Gibbons et al.)
  • ... but cyclic confirmations of truth (as will be explained later on)

Gene expression data analysis: Optimal k by looking at DNA patterns
• Evaluation:
  • We constructed a motif-based heuristic
  • in terms of upstream regulatory sequence patterns in clusters,
  • to have a one-shot estimate of the ‘biological relevance’ of a clustering result.

TXTGate - a platform to profile groups of genes: TXTGate
• Multiple ‘views’ (through use of different vocabularies)
• on vast amounts of (gene-based) free-text information
• available in selected curated database entries & linked scientific publications

TXTGate - a platform to profile groups of genes: TXTGate
• Incorporates term-based indices (cf. before) and uses them as a starting point
  • to explore terms generated through different domain vocabularies
  • to link out to other resources by query building, or
  • to sub-cluster genes based on text

TXTGate - a platform to profile groups of genes: TXTGate – case 2

Text Mining (principles): How to construct a gene index
[Figure: the gene index is derived from the document index and the gene-literature associations]
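
A gene index can be built from a document index and gene-literature associations roughly as follows. Averaging the document vectors per gene is an assumption made for this sketch (it is one simple way to normalize for genes with many references); the identifiers and weights are hypothetical.

```python
from collections import defaultdict

def build_gene_index(document_index, gene_to_docs):
    """Average the term-weight vectors of all documents linked to each gene."""
    gene_index = {}
    for gene, doc_ids in gene_to_docs.items():
        summed = defaultdict(float)
        for doc_id in doc_ids:
            for term, weight in document_index[doc_id].items():
                summed[term] += weight
        # Dividing by the number of documents keeps heavily cited genes comparable.
        gene_index[gene] = {t: w / len(doc_ids) for t, w in summed.items()}
    return gene_index

# Hypothetical document index (term weights per abstract) and gene-literature links.
toy_document_index = {
    "doc_a": {"angiogenesis": 1.0, "cancer": 0.5},
    "doc_b": {"angiogenesis": 0.6, "receptor": 0.8},
    "doc_c": {"glycolysis": 1.0},
}
toy_gene_to_docs = {"VEGF": ["doc_a", "doc_b"], "geneX": ["doc_c"]}
toy_gene_index = build_gene_index(toy_document_index, toy_gene_to_docs)
```

Each gene ends up as a single vector over the vocabulary, which is the representation the distance and coherence computations in text space operate on.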

TXTGate - a platform to profile groups of genes: TXTGate – case 1
• Gene clusters from a microarray experiment on human immune response
• Comparative study with Chaussabel et al.
• TXTGate’s disease vocabulary

Fusion of text and expression data: Various ways to integrate data