Published by Leah Kelley. Modified over 11 years ago.
1
Statistical Data Fusion to Prioritize Lists of Genes
Bert Coessens, Stein Aerts
Departement ESAT - SCD, Katholieke Universiteit Leuven
Promotor: Bart De Moor
Assessor: Yves Moreau
Database Issues in Biological Databases (DBiBD), January 8-9, 2005
2
Context
[Figure labels: Linkage Analysis; Positional Cloning; High-throughput technologies; genes NEFL, RAB7, GARS, GIB1, LMNA]
3
Concept
Pathology / Biological process / …
- Gene Expression
- Literature
- Anatomical Expression
- Gene Regulation
- Protein Domains
- Functional Annotation
- Evolutionary Conservation
- …
4
Concept
Model with multiple submodels
- Training genes (training set): choose submodels, then TRAIN
- Candidate genes (test set): SCORE gene i, giving one ranking for each submodel
- Order statistics: combined ranking
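The step from per-submodel scores to per-gene rank ratios can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_ratios`, the input layout, the convention that a higher score is better, and the definition of a rank ratio as rank divided by the number of ranked genes are all assumptions.

```python
def rank_ratios(scores_by_submodel):
    """Turn per-submodel scores into a list of rank ratios per gene.

    scores_by_submodel maps a submodel name to {gene: score}.
    Assumption: a higher score means a better match, and the rank
    ratio is rank / number of genes ranked by that submodel.
    """
    ratios = {}
    for scores in scores_by_submodel.values():
        # Best-scoring gene first; ties keep their input order.
        ordered = sorted(scores, key=scores.get, reverse=True)
        n = len(ordered)
        for position, gene in enumerate(ordered, start=1):
            ratios.setdefault(gene, []).append(position / n)
    return ratios


# Hypothetical scores for two genes under two submodels.
example = {
    "expression": {"A": 0.9, "B": 0.1},
    "annotation": {"A": 0.2, "B": 0.8},
}
result = rank_ratios(example)
```

Each gene ends up with one rank ratio per submodel, which is the input the order statistics step expects.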
5
Order Statistics
Given a set of n rank ratios for gene i, what is the probability of getting these ratios by chance alone?
Joint probability density function of all n order statistics (equation shown on the slide)
Complexity: O(n^2)
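The slide's formula did not survive in this transcript. As a hedged illustration, the sketch below uses a recurrence known from the rank-combination literature that evaluates the joint cumulative probability of n order statistics in O(n^2) time: with the rank ratios sorted ascending, V_0 = 1, V_k = sum over i = 1..k of (-1)^(i-1) * V_(k-i) / i! * r_(n-k+1)^i, and Q = n! * V_n. Whether this is the exact expression shown on the slide is an assumption, and the name `q_statistic` is hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Probability that n uniform order statistics all fall at or
    below the observed sorted rank ratios (an O(n^2) recurrence).

    Assumption: this recurrence stands in for the slide's formula.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0] + [0.0] * n  # v[0] = V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]  # r_(n-k+1) in 1-based notation
        v[k] = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
    return factorial(n) * v[n]


# Two submodels: 2*r1*r2 - r1^2 = 2*0.1*0.5 - 0.01 = 0.09
q_two = q_statistic([0.1, 0.5])
```

For two rank ratios the result can be checked by hand, since the probability that both uniforms are at most r2 and the smaller is at most r1 equals 2*r1*r2 - r1^2.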
6
Statistical Validation: Setup
- 29 lists of disease genes from OMIM
- 5 lists of random genes from the human genome

For each disease or random gene set do:
  For each gene in the set do:
    a. Leave one gene out
    b. TRAIN all submodels on the set minus the left-out gene
    c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
    d. SCORE the test set with all trained submodels
    e. RANK the genes in the test set according to their order statistics p-value
  end

For a given cut-off x, calculate:
- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x

Calculate sensitivity and specificity from these values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
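The counting of TP, FP, TN, and FN and the resulting Rank ROC curve can be sketched as below. This is an illustration under stated assumptions: every validation run uses the same test set size (at least 2), "ranked above x" is read as "within the top x positions", and the function names `rank_roc` and `area_under_curve` are hypothetical.

```python
def rank_roc(left_out_ranks, test_set_size):
    """Return (1 - specificity, sensitivity) points for cut-offs
    x = 1..test_set_size, preceded by the origin.

    left_out_ranks holds the rank (1 = best) that the left-out gene
    reached in each leave-one-out run. Assumption: a gene counts as
    "above x" when its rank is <= x.
    """
    runs = len(left_out_ranks)
    negatives = runs * (test_set_size - 1)  # all non-left-out genes
    points = [(0.0, 0.0)]
    for x in range(1, test_set_size + 1):
        tp = sum(1 for rank in left_out_ranks if rank <= x)
        fp = runs * x - tp  # other genes inside the top x
        points.append((fp / negatives, tp / runs))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical ranks from four runs with 10 genes per test set.
auc = area_under_curve(rank_roc([1, 1, 2, 5], 10))
```

An area of 1 corresponds to the left-out gene always ranking first, and an area of 0 to it always ranking last.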
7
Statistical Validation: Disease genes
8
Statistical Validation: Disease genes
- 29 human diseases (OMIM) = 29 gene sets
- 627 disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes
9
Statistical Validation: Submodels
Textual data: TXTGate
Sequence similarity: BLAST
+ Rank genes according to e-value
Example: Presenilin 1 vs. Presenilin 2, e-value = 10^-133
10
Statistical Validation: Submodels
Functional annotation: GO
Functional annotation: KEGG
[Diagram: set of genes -> GO IDs -> observed frequencies; full genome -> GO IDs -> expected frequencies]
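One common way to compare the observed frequency of a GO ID in a gene set with its expected frequency in the full genome is a hypergeometric tail probability. The slide does not say which test is used, so the choice of test, the function name `enrichment_p_value`, and the parameter names below are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(genome_size, genome_with_term,
                       set_size, set_with_term):
    """Probability of seeing at least set_with_term annotated genes
    when drawing set_size genes at random from the genome.

    Assumption: a hypergeometric test stands in for whatever
    statistic the submodel actually uses.
    """
    upper = min(set_size, genome_with_term)
    total = comb(genome_size, set_size)
    tail = sum(
        comb(genome_with_term, j)
        * comb(genome_size - genome_with_term, set_size - j)
        for j in range(set_with_term, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in the genome, 5 carry the term,
# and both genes of a 2-gene set carry it.
p = enrichment_p_value(10, 5, 2, 2)
```

A small value suggests the term is over-represented in the gene set relative to the genome-wide expectation.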
11
Statistical Validation: Submodels
Protein information: InterPro
Protein information: BIND
[Diagram: training genes + interaction partners versus test gene + interaction partners. Overlap?]
12
Statistical Validation: Submodels
Gene expression: Microarray data
Gene expression: ESTs
- Model is average expression profile of training genes
- Score test gene by calculating Pearson correlation
Human gene expression atlas: Su et al., 47 normal human tissues
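The two bullet points on the expression submodel can be sketched as follows: the model is the element-wise mean of the training genes' expression profiles, and a test gene is scored by its Pearson correlation with that mean. The function names and the toy profiles are hypothetical, and the sketch assumes all profiles have the same length and are not constant.

```python
from math import sqrt


def mean_profile(profiles):
    """Element-wise average of equally long expression profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy profiles over three tissues (the atlas on the slide has 47).
training = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
model = mean_profile(training)
score = pearson(model, [2.0, 4.0, 6.0])
```

A score near 1 means the test gene's profile rises and falls across tissues in the same way as the average training profile.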
13
Statistical Validation: Submodels
Cis-regulatory elements: TFBSs
Cis-regulatory elements: TFBS modules
- Check human-mouse CNS blocks in the upstream sequence of a test gene
- Compare the motifs found with the motifs in the training set
ModuleSearcher: searches for the best combination of 3 TFs in the 300 bp upstream (US) of genes in the training set
ModuleScanner: scores a test gene with the model
14
Statistical Validation: Similarity
Statistical meta-analysis: Fisher's method
Assume there are m independent tests of H0.
1. For the i-th test, calculate the corresponding p-value, p_i.
2. If each p_i has a uniform distribution on [0,1], then -2 Σ log p_i has a χ² distribution with 2m degrees of freedom.
Vector-based similarity
[Figure labels: T1, T2, T3]
- Euclidean distance
- Pearson correlation
- Cosine similarity
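Fisher's method as stated above can be sketched with the standard library alone, because the χ² survival function has a closed form when the degrees of freedom are even: for 2m degrees of freedom, P(X >= x) = exp(-x/2) * sum over k = 0..m-1 of (x/2)^k / k!. The function name `fisher_combined_p` is hypothetical, and the sketch assumes every p-value is strictly greater than 0.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine m independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) is compared against a chi-square
    distribution with 2m degrees of freedom, using the closed-form
    survival function for even degrees of freedom.
    """
    m = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half_stat / k  # (x/2)^k / k!
        total += term
    return exp(-half_stat) * total


# Two tests with p = 0.5 each: 0.25 * (1 + 2 ln 2), about 0.597
combined = fisher_combined_p([0.5, 0.5])
```

With a single p-value the method returns that p-value unchanged, which is a quick sanity check on the implementation.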
15
Statistical Validation: Correlation
16
Statistical Validation: Rank ROC
17
Statistical Validation: Submodel Rank ROC
18
Statistical Validation: Bias towards known genes
19
Endeavour Application: Screenshot
20
Endeavour Application: Architecture
[Diagram labels: ESAT; Web server; Linux cluster; Java RMI; SOAP messages]
21
Conclusions and Future
Conclusions:
- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance
Future:
- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels
22
Acknowledgements
Bart De Moor, Stein Aerts, Yves Moreau, Patrick Glenisson, Steven Van Vooren, Joke Allemeersch
23
Endeavour Application: Demo
Load training set
24
Endeavour Application: Demo
Add submodels
25
Endeavour Application: Demo
Train submodels
26
Endeavour Application: Demo
Load candidate genes
27
Endeavour Application: Demo
Score candidate genes with all submodels
28
Endeavour Application: Demo
Results of scoring
29
Endeavour Application: Demo
Ranking visualized in sprintplot