 PhD defense Patrick Glenisson Integrating Scientific Literature With Large Scale Gene Expression Analysis Promotor Prof. Bart De Moor June 11 th 2004.

1  PhD defense Patrick Glenisson Integrating Scientific Literature With Large Scale Gene Expression Analysis Promotor Prof. Bart De Moor June 11 th 2004

2  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion

3  Overview (as on slide 2)  Highlighted: cluster analysis, M-score

4  Overview (as on slide 2)  Highlighted: literature analysis

5  Overview (as on slide 2)  Highlighted: TXTGate

6  Overview (as on slide 2)  Highlighted: integrated clustering

9  Genes and Microarrays DNA, genes, proteins and cells

10  Genes and Microarrays DNA, genes, proteins and cells protein

11  Genes and Microarrays Genes are expressed and regulated

12  Genes and Microarrays Microarrays measure gene expression  [Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), with sample annotations and gene annotations attached]

13  Genes and Microarrays Representing expression information  Gene expression experiments are complex:  Too verbose to include in a scientific publication  Too important to compromise on reproducibility  Too valuable for post-genome research to be scattered across various websites  Hence, a standard for reporting on microarray (MA) experiments  As a guideline for databases hosting expression compendia  [Figure label: conditions in which expression occurs]

14  Genes and Microarrays MIAME standard  Minimum Information About a MicroArray Experiment  Internationally proposed standard  Published in Dec 2001 by the international MGED consortium  Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data  Some hurdles:  Significant overhead in filling out the questionnaire  Scooping of leads (!)  Proprietary information about probe sequences

15  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion

16  Gene expression data analysis Questions asked with microarrays  Fundamental  Functional roles of genes (and transcriptional regulation)  Genetic network reconstruction  Clinical  Correlation of genes with a given disease  Diagnosis of disease stage with patients  Pharmacological  Toxicological drug response assessment

17  Gene expression data analysis Microarray data analysis  Fundamental  Functional roles of genes (and transcriptional regulation)  Genetic network reconstruction  Clinical  Correlation of genes with a given disease  Diagnosis of disease stage with patients  Pharmacological  Toxicological drug response assessment

18  Gene expression data analysis Clustering  [Figure: expression data (genes by conditions) are turned into a distance matrix between genes, which is then clustered (hierarchical clustering, k-means) into clusters C1, C2, C3]
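The pipeline on this slide (expression data, then distance matrix, then clustering) can be sketched in a few lines. This is a minimal illustration only: the choice of Pearson-correlation distance and average linkage is an assumption, as are the function names; the slides do not state which distance or linkage was used.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / (sx * sy)

def distance_matrix(profiles):
    """Pairwise gene-by-gene distances (rows of `profiles` are genes)."""
    n = len(profiles)
    return [[pearson_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all cross-cluster gene pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]
```

Cutting the tree at different values of k is what later slides refer to when they ask for the "optimal" number of clusters.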

19  Gene expression data analysis Cluster validation  Optimal number of clusters? How to define 'optimal'?  Data-centered statistical scores  Coherence vs separation of clusters  Stability of a cluster solution when leaving out data

20  Gene expression data analysis Cluster validation  Optimal number of clusters? How to define 'optimal'?  Data-centered statistical scores  Knowledge-based scores  Enrichment of GO annotations in clusters  Literature-based scoring

21  Gene expression data analysis Cluster validation  Optimal number of clusters? How to define 'optimal'?  Data-centered statistical scores  Knowledge-based scores  Motif-based  DNA patterns in regulatory regions of gene groups  [Figure: regulatory DNA patterns (motifs) upstream of a gene]

22  Gene expression data analysis DNA patterns in expression clusters  Significant occurrences of known motifs in a cluster  [Figure: cluster-by-motif matrix (motif enrichment matrix) of gene clusters (1, 2, 3, ..) by motifs (A, B, C, ..), with entries -log(p-value), summarized into the M-score]
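A cluster-by-motif matrix of -log(p-value) entries could be computed along the following lines. The sketch assumes a one-sided hypergeometric enrichment test and base-10 logarithms; the slides do not say which test was used, and the function names are hypothetical. It stops at the matrix and does not attempt the M-score itself, whose formula is not given in the transcript.

```python
import math

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when a cluster of n genes is drawn from N genes,
    K of which carry the motif (one-sided enrichment p-value)."""
    upper = min(n, K)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1))
    return hits / math.comb(N, n)

def cluster_by_motif(clusters, motif_genes, all_genes):
    """Return {cluster: {motif: -log10(p)}} for every cluster/motif pair.

    clusters    -- {cluster name: set of gene ids}
    motif_genes -- {motif name: set of gene ids whose upstream region has the motif}
    all_genes   -- set of all gene ids on the array (the background)
    """
    N = len(all_genes)
    matrix = {}
    for cname, members in clusters.items():
        row = {}
        for mname, carriers in motif_genes.items():
            carriers = carriers & all_genes
            k = len(members & carriers)
            p = hypergeom_tail(k, len(members), len(carriers), N)
            row[mname] = -math.log10(p)
        matrix[cname] = row
    return matrix
```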

23  Gene expression data analysis Cluster-by-motif matrix  [Figure: cluster-by-motif matrix]  M-score for the entire clustering solution: a one-shot estimate of the 'biological relevance'

24  Gene expression data analysis M-score  A motif is less interesting when it (significantly) occurs in many clusters  A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant.  A `too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.

25  Gene expression data analysis M-score validation  A simplification of reality  No absolute quantification of biological relevance  Useful tool when experimenting with multiple clustering methods and multiple parameterizations  To economize on biological validations  Optimal k in yeast cell cycle expression data  Original studies by Tavazoie et al. used k=30  Overestimation, confirmed by analyses of De Smet et al. (AQBC) and Gibbons et al. (GO-based scoring)  [Figure: M-score as a function of k]

26  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion 

27  Text Mining: principles Problem setting  Given a set of documents,  compute a representation, called index  to retrieve, summarize, classify or cluster them 

28  Text Mining: principles Problem setting  Given a set of genes (and their literature),  compute a representation, called gene index  to retrieve, summarize, classify or cluster them 

29  Text Mining: principles Vector space model  Document processing  Remove punctuation & grammatical structure ('bag of words')  Define a vocabulary: identify multi-word terms (phrases, e.g., tumor suppressor); eliminate low-content words (stopwords, e.g., and, thus, gene, ...); map words with the same meaning (synonyms); strip plurals, conjugations, ... (stemming)  Define weighting scheme and/or transformations (tf-idf, SVD, ...)  Compute index of textual resources  [Figure: a gene represented as a vector over vocabulary terms T1, T2, T3]
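The document-processing steps above can be illustrated with a small bag-of-words index. This is a simplified sketch under stated assumptions: the stopword list is a toy example, tf-idf is taken as raw term frequency times log(N/df), and phrase detection, synonym mapping and stemming are left out. None of the names below come from the thesis software.

```python
import math
from collections import Counter

# Toy stopword list for illustration; a real vocabulary would be curated.
STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a", "by", "with"}

def tokenize(text):
    """Lowercase, drop punctuation, and remove stopwords (bag of words)."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOPWORDS]

def tfidf_index(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    index = []
    for toks in tokenized:
        tf = Counter(toks)
        index.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return index

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Terms that occur in every document receive weight zero under this scheme, which is the intended effect of the idf factor.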

30  Text Mining: principles Validity of gene index  Genes that are functionally related should be close in text space  Text-based coherence score  Modeled with respect to a background distribution obtained through random and permuted gene groups
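One way to make the coherence idea concrete is to score a gene group by its mean pairwise cosine similarity in text space and compare that against randomly drawn groups of the same size. The sketch below assumes exactly that definition and an empirical p-value; the actual score used in the thesis is not spelled out on the slide, and all names are illustrative.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, gene_index):
    """Mean pairwise text similarity of the genes in `group`."""
    pairs = list(combinations(group, 2))
    if not pairs:
        raise ValueError("a group needs at least two genes")
    return sum(cosine(gene_index[a], gene_index[b]) for a, b in pairs) / len(pairs)

def coherence_pvalue(group, gene_index, n_random=1000, seed=0):
    """Empirical p-value against random gene groups of the same size."""
    rng = random.Random(seed)
    observed = coherence(group, gene_index)
    genes = sorted(gene_index)
    hits = sum(coherence(rng.sample(genes, len(group)), gene_index) >= observed
               for _ in range(n_random))
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_random + 1)
```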

31  Text Mining: principles Validity of gene index Genes that are functionally related should be close in text space:

32  Text Mining: principles Validity of gene index Genes that are functionally related should be close in text space:

33  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion  TXTGate

34  TXTGate - a platform to profile groups of genes Motivation 1  “Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001)  GeneRIF entries:  12133521: VEGF is associated with the development and prognosis of colorectal cancer.  12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.  11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex  GO annotations: cell proliferation, heparin binding, growth factor activity

35  TXTGate - a platform to profile groups of genes Motivation 2  Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.  A number of structured vocabularies have already arisen: Gene Ontology (GO), MeSH, eVOC  Standards are systematically being adopted to store biological concepts or annotations: HUGO, GOA@EBI

36  TXTGate - a platform to profile groups of genes Motivation 3 (Figure courtesy: S. Van Vooren)

37  TXTGate - a platform to profile groups of genes TXTGate Profile Distance matrix & Clustering Other vocabulary

38  TXTGate - a platform to profile groups of genes TXTGate – a case study  Gene modules over various expression data sets  Reported two sub-modules of the TCA cycle  Two 'new' genes, ACN9 & CAT8, in module 2

39  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion

40  Fusion of text and expression data Problem setting “How can we analyze data in an integrated fashion to extract more information than solely from expression data ? ”

41  Fusion of text and expression data Integration of text and data  In each information space:  Appropriate preprocessing  Choice of distance measures

42  Fusion of text and expression data Integration of text and data  Combine data:  confidence attributed to either of the two data types  in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
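The combination formula on this slide did not survive in the transcript. One plausible reading, shown here purely as an assumption, is a convex combination of the two distance matrices in which a hypothetical parameter `lam` expresses the confidence placed in the expression data.

```python
def combine_distances(d_expr, d_text, lam=0.5):
    """Weighted sum of an expression and a text distance matrix.

    lam is the (assumed) confidence in the expression data;
    1 - lam is the confidence in the text data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [[lam * e + (1.0 - lam) * t for e, t in zip(row_e, row_t)]
            for row_e, row_t in zip(d_expr, d_text)]
```

With raw distances, `lam` also has to absorb any difference in scale between the two spaces, which is the scaling issue raised on the following slide.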

43  Fusion of text and expression data Integration of text and data  However, the differing distributions of distances introduce a bias  Scaling problem  Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)  [Figure: expression distance histogram and text distance histogram]
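The slide names only an "omnibus procedure" from meta-analysis. A common instance is Fisher's method for combining p-values, and the sketch below assumes that choice: each distance is first replaced by its empirical left-tail p-value within its own space (which removes the scale difference), and the two p-values are then combined. Function names are hypothetical.

```python
import math
from bisect import bisect_right

def empirical_pvalues(dist):
    """Replace each off-diagonal distance by the fraction of gene pairs
    that are at least as close (empirical left-tail p-value)."""
    n = len(dist)
    flat = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return [[bisect_right(flat, dist[i][j]) / len(flat) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def fisher_combine(p1, p2):
    """Fisher's method for two p-values.

    X = -2 (ln p1 + ln p2) is chi-square with 4 degrees of freedom under
    the null; its survival function has the closed form used here.
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def omnibus_distance(d_expr, d_text):
    """Scale-free combined distance: small when a pair is close in both spaces."""
    p_e, p_t = empirical_pvalues(d_expr), empirical_pvalues(d_text)
    n = len(d_expr)
    return [[0.0 if i == j else fisher_combine(p_e[i][j], p_t[i][j])
             for j in range(n)] for i in range(n)]
```

The combined matrix can then be fed to the same clustering routine as either individual distance matrix.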

44  Fusion of text and expression data Overview meta-clustering M-score Clustering

45  Fusion of text and expression data Integration improves M-score  [Figure: M-score for expression data only vs M-score for integrated clustering, at various cutoffs k of the cluster tree]  Optimal k?

46  Fusion of text and expression data A look inside the integration

47  Fusion of text and expression data A look inside the integration  [Figure: expression profile and text profile, showing strong reinforcement]

48  Overview  Genes & microarrays  Gene expression data analysis  Text mining in biology: principles  Text mining in practice: TXTGate  Combining text and gene expression data  Conclusion

49  Conclusion Contributions  Representation of a gene expression experiment  MIAME  Laboratory Information Management System v.  at the VIB MicroArray Facility  Gene expression analysis  Iterative clustering to determine optimal k  M-score  Text-based gene representation  To represent functional information about genes  To score gene groups based on literature  To cluster genes based on literature  TXTGate text mining application  To profile, in a flexible and interactive manner, gene groups from different ‘views’  Integration of text and expression data in clustering

50  Conclusion Future work  Semantically-oriented text mining representations  Algorithm-based: improved phrases (word co-locations), Latent Semantic Indexing, concept clustering, bi-clustering  Knowledge-based: Gene Ontology, distance in a taxonomy  Basic natural language processing + statistics = shallow parsing  Advanced ways of integrating data  Combine link information with term information  Ways to determine

51  Conclusion Publications

52  Questions?

53  TXTGate - a platform to profile groups of genes TXTGate – final considerations  Flexible tool for analyzing gene groups (~200 genes) due to various term- and gene-centric vocabularies  … that allow some level of interoperability with external annotation databases  Sub-clustering gene groups useful to detect biological sub-patterns, or, shortcomings of the text representation.  Reasonably robust to corrupted groups  Gene index normalizes for unbalanced references and handles multiple gene function by ‘overruling’

54  Genes and Microarrays Representing expression information  Rationale:  Gene expression experiments are a chain of biotechnological operations, protocols and data processing steps  Too verbose to include in a scientific publication  Too important to compromise on reproducibility  Too valuable for post-genome research to be scattered across various websites  Standards for reporting on MA experiments  MIAME-compliant databases hosting expression compendia  [Figure label: conditions in which expression occurs]

55  Gene expression data analysis Clustering parameterization  Clustering: hierarchical clustering, k-means  Optimal number of clusters? How to define 'optimal'?  Data-centered statistical scores exist (Gap statistic, FOM, Silhouette coefficient, …)  … but they are built on the data that produced the result, and are not necessarily biologically relevant  Knowledge-based (GO- or text-based) scores (neighborhood divergence, Gibbons et al.)  … but these risk cyclic confirmations of truth (as will be explained later on)
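As an illustration of a data-centered score, the Silhouette coefficient named above contrasts within-cluster coherence with separation from the nearest other cluster. The sketch below follows the standard definition on a precomputed distance matrix; the function name and input layout are assumptions for this example.

```python
def silhouette(dist, labels):
    """Mean silhouette width over all items.

    dist   -- symmetric distance matrix (list of lists)
    labels -- cluster label for each item, in the same order
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton clusters contribute a width of 0
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

Because the score is computed from the same distances that produced the clustering, it illustrates the limitation noted on the slide: a high value does not by itself establish biological relevance.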

56  Gene expression data analysis Optimal k by looking at DNA patterns  Evaluation:  We constructed a motif-based heuristic  in terms of upstream regulatory sequence patterns in clusters,  to have a one-shot estimate of the 'biological relevance' of a clustering result.

57  TXTGate - a platform to profile groups of genes TXTGate  multiple ‘views’ (through use of different vocabularies)  on vast amounts of (gene-based) free-text information  available in selected curated database entries & linked scientific publications.

58  TXTGate - a platform to profile groups of genes TXTGate  incorporates term-based indices (cf. before) and uses them as a starting point  to explore terms generated through different domain vocabularies  to link out to other resources by query building, or  to sub-cluster genes based on text.

59  TXTGate - a platform to profile groups of genes TXTGate – case 2

60  Text Mining: principles How to construct a gene index  [Figure: the document index is combined with gene-literature associations to yield the gene index]
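A gene index can be derived from a document index once gene-literature associations are known. The sketch below assumes the simplest aggregation, averaging the term vectors of all documents linked to a gene, so that genes with many references are not favored over genes with few. The aggregation rule and all names are assumptions for illustration.

```python
from collections import defaultdict

def gene_index(doc_index, gene_to_docs):
    """Build {gene: {term: weight}} by averaging linked document vectors.

    doc_index    -- {document id: {term: weight}}
    gene_to_docs -- {gene: list of document ids associated with that gene}
    """
    index = {}
    for gene, doc_ids in gene_to_docs.items():
        acc = defaultdict(float)
        for doc_id in doc_ids:
            for term, weight in doc_index[doc_id].items():
                # Dividing by the number of documents gives the mean vector.
                acc[term] += weight / len(doc_ids)
        index[gene] = dict(acc)
    return index
```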

61  TXTGate - a platform to profile groups of genes TXTGate – case 1  Gene clusters from microarray experiment on human immune response  Comparative study with Chaussabel et al.  TXTGate’s disease vocabulary

62  Fusion of text and expression data Various ways to integrate data

