Published by Rosalyn Russell. Modified over 9 years ago.
1
PhD defense: Patrick Glenisson. Integrating Scientific Literature With Large Scale Gene Expression Analysis. Promotor: Prof. Bart De Moor. June 11th, 2004.
2
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
3
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion (callouts: M-score; Cluster analysis)
4
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion (callout: Literature analysis)
5
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion (callout: TXTGate)
6
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion (callout: Integrated clustering)
7
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
9
Genes and Microarrays: DNA, genes, proteins and cells
10
Genes and Microarrays: DNA, genes, proteins and cells. [Figure label: protein]
11
Genes and Microarrays: Genes are expressed and regulated
12
Genes and Microarrays: Microarrays measure gene expression. [Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ...) by conditions (C1, C2, C3, ...), with sample annotations and gene annotations attached.]
13
Genes and Microarrays: Representing expression information. Gene expression experiments are complex: too verbose to include in a scientific publication; too important to compromise on reproducibility; too valuable for post-genome research to leave scattered across various websites. Hence the need for a standard for reporting on microarray (MA) experiments, which also serves as a guideline for databases hosting expression compendia. [Figure label: conditions in which expression occurs]
14
Genes and Microarrays: MIAME standard. Minimum Information About a Microarray Experiment: an internationally proposed standard, published in December 2001 by the international MGED consortium. Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant data submissions. Some hurdles: significant overhead in filling out the questionnaire; scooping of leads (!); proprietary information about probe sequences.
15
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
16
Gene expression data analysis: Questions asked with microarrays. Fundamental: functional roles of genes (and transcriptional regulation); genetic network reconstruction. Clinical: correlation of genes with a given disease; diagnosis of disease stage in patients. Pharmacological: toxicological drug response assessment.
17
Gene expression data analysis: Microarray data analysis. Fundamental: functional roles of genes (and transcriptional regulation); genetic network reconstruction. Clinical: correlation of genes with a given disease; diagnosis of disease stage in patients. Pharmacological: toxicological drug response assessment.
18
Gene expression data analysis: Clustering. [Figure: expression data (genes by conditions C1, C2, C3) are converted into a gene-by-gene distance matrix, which is then clustered.] Clustering methods: hierarchical clustering; k-means.
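The pipeline on this slide (expression data, then distance matrix, then clustering) can be sketched as follows. This is a minimal illustration, not the method used in the thesis: the choice of Pearson correlation distance, average linkage, and the toy expression values are all assumptions made for the example.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Gene-by-gene distance matrix from a genes-by-conditions table."""
    n = len(profiles)
    return [[pearson_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, p, q = min(
            ((linkage(clusters[p], clusters[q]), p, q)
             for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda t: t[0],
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Toy data (made up): rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 4.0, 2.1],
]
dist = distance_matrix(expression)
clusters = average_linkage(dist, k=2)
```

Cutting the same merge sequence at different values of k gives the family of clustering solutions that the validation slides compare.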
19
Gene expression data analysis: Cluster validation. How to define ‘optimal’? What is the optimal number of clusters? Data-centered statistical scores: coherence vs. separation of clusters; stability of a cluster solution when leaving out data. [Figure labels: C1, C2, C3]
20
Gene expression data analysis: Cluster validation. How to define ‘optimal’? What is the optimal number of clusters? Data-centered statistical scores. Knowledge-based scores: enrichment of GO annotations in clusters; literature-based scoring.
21
Gene expression data analysis: Cluster validation. How to define ‘optimal’? What is the optimal number of clusters? Data-centered statistical scores. Knowledge-based scores. Motif-based: DNA patterns in regulatory regions of gene groups. [Figure: regulatory DNA patterns (motifs) and gene]
22
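Gene expression data analysis: DNA patterns in expression clusters. Significant occurrences of known motifs in each gene cluster are collected in a cluster-by-motif matrix (motif enrichment matrix). [Figure: cluster-by-motif matrix with axis labels 1, 2, 3, ... and A, B, C, ...; entries are -log(p-value); the M-score is computed from this matrix]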
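A cluster-by-motif enrichment matrix of this kind could be computed along the following lines. The slide only states that the entries are -log(p-value); the hypergeometric test, the base-10 logarithm, and the toy gene and motif names below are assumptions for illustration. The M-score formula itself is not given in the transcript and is deliberately not reproduced here.

```python
import math

def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n genes out of N, of which K carry the motif."""
    upper = min(K, n)
    total = math.comb(N, n)
    favourable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total

def cluster_by_motif(clusters, motif_hits, all_genes):
    """Return {cluster: {motif: -log10(p)}} for motif enrichment per cluster."""
    N = len(all_genes)
    matrix = {}
    for cname, genes in clusters.items():
        row = {}
        for motif, carriers in motif_hits.items():
            k = len(genes & carriers)  # cluster genes carrying the motif
            p = hypergeom_tail(k, N, len(carriers), len(genes))
            row[motif] = -math.log10(max(p, 1e-300))  # guard against log(0)
        matrix[cname] = row
    return matrix

# Hypothetical toy data: 10 genes, 2 clusters, 2 motifs.
all_genes = {f"g{i}" for i in range(1, 11)}
clusters = {
    "1": {"g1", "g2", "g3", "g4"},
    "2": {"g5", "g6", "g7", "g8", "g9", "g10"},
}
motif_hits = {
    "A": {"g1", "g2", "g3"},   # concentrated in cluster 1
    "B": {"g2", "g6", "g9"},   # spread over both clusters
}
matrix = cluster_by_motif(clusters, motif_hits, all_genes)
```

In this toy case motif A scores high for cluster 1 only, while motif B shows no strong enrichment anywhere.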
23
Gene expression data analysis: Cluster-by-motif matrix. [Figure: cluster-by-motif matrix] The M-score is computed for the entire clustering solution: a one-shot estimate of the ‘biological relevance’.
24
Gene expression data analysis: M-score. A motif is less interesting when it occurs (significantly) in many clusters. A cluster that contains a large portion of the (significant) motifs is less likely to be biologically relevant. A ‘too large’ number of clusters is less likely to reflect the true biological diversity underlying the experiment.
25
Gene expression data analysis: M-score validation. A simplification of reality: no absolute quantification of biological relevance. A useful tool when experimenting with multiple clustering methods or multiple parameterizations, and to economize on biological validations. Optimal k in yeast cell cycle expression data: the original study by Tavazoie et al. used k=30; this overestimation is confirmed by the analyses of De Smet et al. (AQBC) and Gibbons et al. (GO-based scoring). [Figure: M-score versus k]
26
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
27
Text Mining: principles. Problem setting: given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them.
28
Text Mining: principles. Problem setting: given a set of genes (and their literature), compute a representation, called a gene index, to retrieve, summarize, classify or cluster them.
29
Text Mining: principles. Vector space model. Document processing: remove punctuation and grammatical structure (‘bag of words’). Define a vocabulary: identify multi-word terms, or phrases (e.g., tumor suppressor); eliminate words with low content, or stopwords (e.g., and, thus, gene, ...); map words with the same meaning (synonyms); strip plurals, conjugations, ... (stemming). Define a weighting scheme and/or transformations (tf-idf, SVD, ...). Compute an index of textual resources. [Figure: gene represented as a vector over vocabulary terms T1, T2, T3]
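The processing steps listed on this slide can be sketched as a small indexing routine. Everything concrete here is an assumption for illustration: the stopword list, the single phrase entry, the crude plural-stripping rule (a stand-in for a real stemmer), the plain tf-idf weighting, and the three toy documents. Synonym mapping and SVD are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical resources; a real vocabulary would be far larger.
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a", "by"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}

def stem(word):
    """Crude plural stripping, standing in for a proper stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word

def tokenize(text):
    """Bag of words: drop punctuation, join phrases, remove stopwords, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            tokens.append(PHRASES[pair])
            i += 2
            continue
        if words[i] not in STOPWORDS:
            tokens.append(stem(words[i]))
        i += 1
    return tokens

def tfidf_index(docs):
    """Return {doc_id: {term: tf * log(N / df)}}."""
    token_lists = {d: tokenize(t) for d, t in docs.items()}
    df = Counter(term for toks in token_lists.values() for term in set(toks))
    n_docs = len(docs)
    return {
        d: {t: c * math.log(n_docs / df[t]) for t, c in Counter(toks).items()}
        for d, toks in token_lists.items()
    }

docs = {
    "doc1": "p53 is a tumor suppressor gene and regulates apoptosis",
    "doc2": "Cyclins regulate the cell cycle",
    "doc3": "Apoptosis and cell cycle arrest",
}
index = tfidf_index(docs)
```

Terms that occur in a single document (such as the phrase `tumor_suppressor`) receive the highest weight, while terms shared by several documents are down-weighted.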
30
Text Mining: principles. Validity of gene index: genes that are functionally related should be close in text space. This is modeled with respect to a background distribution obtained through random and permuted gene groups, yielding a text-based coherence score.
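The idea of scoring a gene group against a background of random groups can be sketched as below. The transcript does not define the coherence score, so the mean pairwise cosine similarity and the empirical p-value used here are assumptions; the term weights in the toy index are made up.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in a group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p(group, index, n_random=1000, seed=0):
    """Compare a group's coherence with that of random groups of equal size."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed - 1e-12
        for _ in range(n_random)
    )
    return observed, (hits + 1) / (n_random + 1)

# Toy gene index with made-up term weights.
gene_index = {
    "CDC28": {"cell_cycle": 2.0, "kinase": 1.0},
    "CLN1": {"cell_cycle": 2.0, "cyclin": 1.0},
    "CLN2": {"cell_cycle": 2.0, "cyclin": 1.0},
    "HXT1": {"glucose": 2.0, "transport": 1.0},
    "ACT1": {"actin": 2.0, "cytoskeleton": 1.0},
    "RPL3": {"ribosome": 2.0, "translation": 1.0},
}
observed, p_value = empirical_p(["CDC28", "CLN1", "CLN2"], gene_index)
```

A functionally related group should show a coherence well above what random groups achieve, which is reflected in a small empirical p-value.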
31
Text Mining: principles Validity of gene index Genes that are functionally related should be close in text space:
33
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion (callout: TXTGate)
34
TXTGate - a platform to profile groups of genes. Motivation 1: “Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001). Example GeneRIF entries: 12133521: VEGF is associated with the development and prognosis of colorectal cancer. 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex. Example GO annotations: cell proliferation; heparin binding; growth factor activity.
35
TXTGate - a platform to profile groups of genes. Motivation 2: controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. A number of structured vocabularies have already arisen: Gene Ontology (GO); MeSH; eVOC. Standards are systematically being adopted to store biological concepts or annotations: HUGO; GOA@EBI.
36
TXTGate - a platform to profile groups of genes Motivation 3 (Figure courtesy: S. Van Vooren)
37
TXTGate - a platform to profile groups of genes. [Figure: TXTGate workflow with elements: profile; distance matrix & clustering; other vocabulary]
38
TXTGate - a platform to profile groups of genes. TXTGate, a case study: gene modules over various expression data sets. Two sub-modules of the TCA cycle were reported; two ‘new’ genes, ACN9 and CAT8, appear in module 2.
39
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
40
Fusion of text and expression data. Problem setting: “How can we analyze data in an integrated fashion to extract more information than from expression data alone?”
41
Fusion of text and expression data: Integration of text and data. In each information space: appropriate preprocessing; choice of distance measures.
42
Fusion of text and expression data: Integration of text and data. Combine data: a confidence is attributed to each of the two data types. In the case of distances, it can be seen as a scaling constant between the norms of the data and text representations.
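One simple reading of this slide is a weighted sum of the expression distance and the text distance. The slide's own formula did not survive extraction, so the convex combination below, the parameter name `weight`, and the toy matrices are assumptions rather than the thesis's definition.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination: `weight` on expression, (1 - weight) on text."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    n = len(d_expr)
    return [
        [weight * d_expr[i][j] + (1.0 - weight) * d_text[i][j] for j in range(n)]
        for i in range(n)
    ]

# Made-up distances for three genes in each information space.
d_expr = [[0.0, 0.2, 1.8], [0.2, 0.0, 1.5], [1.8, 1.5, 0.0]]
d_text = [[0.0, 0.9, 0.5], [0.9, 0.0, 0.7], [0.5, 0.7, 0.0]]

expression_only = combine_distances(d_expr, d_text, weight=1.0)
balanced = combine_distances(d_expr, d_text, weight=0.5)
```

Because the two distance measures may live on different scales, the weight acts both as a confidence and as a scaling constant, which is why choosing it by hand is delicate.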
43
Fusion of text and expression data: Integration of text and data. However, the distributions of distances introduce a bias (a scaling problem). Therefore, use a technique from statistical meta-analysis (a so-called omnibus procedure). [Figure: expression distance histogram; text distance histogram]
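A scale-free combination in the spirit of an omnibus procedure can be sketched as follows. The transcript does not say which omnibus statistic was used; this sketch assumes Fisher's method, and it assumes that each distance is first mapped to its empirical cumulative probability within its own space. For two values p1 and p2, the chi-square (4 degrees of freedom) tail of -2 ln(p1 p2) has the closed form P(1 - ln P) with P = p1 p2.

```python
import math
from bisect import bisect_right

def to_cumulative(dist):
    """Replace each distance by the fraction of pairwise distances <= it."""
    n = len(dist)
    values = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    m = len(values)
    return [
        [0.0 if i == j else bisect_right(values, dist[i][j]) / m for j in range(n)]
        for i in range(n)
    ]

def fisher_combine(p1, p2):
    """Fisher's omnibus combination of two p-like values (closed form, 4 df)."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def omnibus_distance(d_expr, d_text):
    """Integrated distance in [0, 1], insensitive to the scale of either input."""
    c_expr, c_text = to_cumulative(d_expr), to_cumulative(d_text)
    n = len(d_expr)
    return [
        [0.0 if i == j else fisher_combine(c_expr[i][j], c_text[i][j])
         for j in range(n)]
        for i in range(n)
    ]

# Made-up distances: expression spans [0, 2], text is squeezed near 1.
d_expr = [[0.0, 0.2, 1.8], [0.2, 0.0, 1.5], [1.8, 1.5, 0.0]]
d_text = [[0.0, 0.90, 0.99], [0.90, 0.0, 0.95], [0.99, 0.95, 0.0]]
combined = omnibus_distance(d_expr, d_text)
```

Only the rank of a distance within its own histogram matters here, so the narrow spread of the text distances no longer lets the expression distances dominate.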
44
Fusion of text and expression data: Overview. [Figure: meta-clustering workflow with elements: clustering; M-score]
45
Fusion of text and expression data: Integration improves M-score. [Figure: M-score for expression data only versus M-score for integrated clustering, at various cutoffs k of the cluster tree. Optimal k?]
46
Fusion of text and expression data: A look inside the integration
47
Fusion of text and expression data: A look inside the integration. [Figure: expression profile and text profile, showing strong reinforcement]
48
Overview: Genes & microarrays; Gene expression data analysis; Text mining in biology: principles; Text mining in practice: TXTGate; Combining text and gene expression data; Conclusion
49
Conclusion: Contributions. Representation of a gene expression experiment: MIAME; Laboratory Information Management System v. at the VIB MicroArray Facility. Gene expression analysis: iterative clustering to determine optimal k; M-score. Text-based gene representation: to represent functional information about genes; to score gene groups based on literature; to cluster genes based on literature. TXTGate text mining application: to profile, in a flexible and interactive manner, gene groups from different ‘views’. Integration of text and expression data in clustering.
50
Conclusion: Future work. Semantically-oriented text mining representations. Algorithm-based: improved phrases (word co-locations); Latent Semantic Indexing; concept clustering, bi-clustering. Knowledge-based: Gene Ontology; distance in a taxonomy. Basic natural language processing + statistics = shallow parsing. Advanced ways of integrating data: combine link information with term information. Ways to determine
51
Conclusion Publications
52
Questions?
53
TXTGate - a platform to profile groups of genes. TXTGate, final considerations: a flexible tool for analyzing gene groups (~200 genes) thanks to various term- and gene-centric vocabularies, which allow some level of interoperability with external annotation databases. Sub-clustering gene groups is useful to detect biological sub-patterns or shortcomings of the text representation. Reasonably robust to corrupted groups. The gene index normalizes for unbalanced references and handles multiple gene functions by ‘overruling’.
54
Genes and Microarrays: Representing expression information. Rationale: gene expression experiments are a chain of biotechnological operations, protocols and data processing steps. They are too verbose to include in a scientific publication; too important to compromise on reproducibility; too valuable for post-genome research to leave scattered across various websites. Hence: standards for reporting on microarray (MA) experiments, and MIAME-compliant databases hosting expression compendia. [Figure label: conditions in which expression occurs]
55
Gene expression data analysis: Clustering parameterization. Clustering: hierarchical clustering; k-means. What is the optimal number of clusters? How to define ‘optimal’? Data-centered statistical scores exist (Gap statistic, FOM, Silhouette coefficient, ...), but they are built on the data that produced the result and are not necessarily biologically relevant. Knowledge-based (GO- or text-based) scores also exist (neighborhood divergence, Gibbons et al.), but they risk cyclic confirmations of truth (as will be explained later on).
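As an example of a data-centered score, the Silhouette coefficient mentioned on this slide can be computed directly from a distance matrix and a cluster assignment. This is a minimal sketch of the standard definition; the toy distances and labels are made up, and the score of 0 for singleton clusters is a common convention rather than something stated in the slides. At least two clusters are required.

```python
def silhouette(dist, labels):
    """Mean silhouette coefficient over all items (needs >= 2 clusters)."""
    n = len(labels)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n

# Made-up distances: items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
good = silhouette(dist, [0, 0, 1, 1])
bad = silhouette(dist, [0, 1, 0, 1])
```

The score rewards coherent, well separated clusters, but as the slide notes it is computed from the same data that produced the clustering.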
56
Gene expression data analysis: Optimal k by looking at DNA patterns. Evaluation: we constructed a motif-based heuristic, in terms of upstream regulatory sequence patterns in clusters, to obtain a one-shot estimate of the ‘biological relevance’ of a clustering result.
57
TXTGate - a platform to profile groups of genes. TXTGate offers multiple ‘views’ (through the use of different vocabularies) on the vast amounts of gene-based free-text information available in selected curated database entries and linked scientific publications.
58
TXTGate - a platform to profile groups of genes. TXTGate incorporates term-based indices (cf. before) and uses them as a starting point to explore terms generated through different domain vocabularies, to link out to other resources by query building, or to sub-cluster genes based on text.
59
TXTGate - a platform to profile groups of genes TXTGate – case 2
60
Text Mining: principles. How to construct a gene index. [Figure: the document index and the gene-literature associations are combined into the gene index]
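The construction on this slide, going from a document index to a gene index through gene-literature associations, can be sketched as follows. Averaging the document vectors per gene is an assumption made for this example (the transcript does not state the aggregation rule), and the document identifiers, gene names and weights are hypothetical.

```python
def build_gene_index(doc_index, gene_to_docs):
    """Average the term vectors of all documents associated with each gene."""
    index = {}
    for gene, docs in gene_to_docs.items():
        vec = {}
        for doc_id in docs:
            for term, weight in doc_index[doc_id].items():
                # dividing by the number of documents keeps heavily cited
                # genes from dominating purely through volume
                vec[term] = vec.get(term, 0.0) + weight / len(docs)
        index[gene] = vec
    return index

# Hypothetical document index (term weights) and gene-literature associations.
doc_index = {
    "doc1": {"angiogenesis": 1.0, "cancer": 2.0},
    "doc2": {"angiogenesis": 3.0},
    "doc3": {"cell_cycle": 1.0},
}
gene_to_docs = {
    "geneA": ["doc1", "doc2"],
    "geneB": ["doc3"],
}
gene_vectors = build_gene_index(doc_index, gene_to_docs)
```

Each gene ends up as a single vector over the vocabulary, which is the representation that the retrieval, scoring and clustering steps operate on.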
61
TXTGate - a platform to profile groups of genes. TXTGate, case 1: gene clusters from a microarray experiment on human immune response. Comparative study with Chaussabel et al.; TXTGate’s disease vocabulary.
62
Fusion of text and expression data Various ways to integrate data