SAMSI Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration Katerina Kechris, PhD Associate Professor Biostatistics and Informatics Colorado School of Public Health University of Colorado Denver
Omics Large-scale analyses for studying a population of molecules or molecular mechanisms High-throughput data Examples – Genomics (entire genome – DNA) – Proteomics (study of protein repertoire) – Epigenomics (study of DNA and histone modifications)
Omics Epigenome Phenome Adapted from
Large-scale Projects & Databases NCI 60 Database
Integration of Omics Data Each type of data gives a different snapshot of the biological or disease system Why integrate data? Reduce false positives/negatives Identify interactions between different molecules Explore functional mechanisms
Challenges 1. When to integrate? 2. Dimensionality 3. Resolution 4. Heterogeneity 5. Interactions and Pathways
Challenge 1: When to integrate? Early – Merging data to increase sample size Intermediate – Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis Late – Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
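As one illustration of late integration, per-study p-values for the same gene can be combined with Fisher's method. The sketch below is a minimal version that assumes the studies are independent and that every p-value is strictly greater than zero; the function name and the example p-values are hypothetical.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom when all k null hypotheses are true.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values for one gene from three independent studies.
study_pvalues = [0.04, 0.10, 0.03]
combined = fisher_combine(study_pvalues)
print(f"combined p-value: {combined:.4f}")
```

Fisher's method is sensitive to a single very small p-value, so it suits questions of the form "is the gene significant in at least one study"; other combination rules answer other questions.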
Genomic Meta-analysis: Combining Multiple Transcriptomic Studies Tseng Lab, U. of Pitt.
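Combining effect sizes rather than p-values is often done with an inverse-variance weighted (fixed-effect) estimate. The sketch below is a generic textbook version, not the specific method of the lab cited on this slide; the function name and the example log fold changes are hypothetical.

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted pooled effect, its standard error, and z-score."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se, pooled / pooled_se

# Hypothetical log fold changes for one gene in three transcriptomic studies.
log_fold_changes = [0.8, 0.5, 1.1]
standard_errors = [0.4, 0.3, 0.5]
pooled, pooled_se, z = fixed_effect_meta(log_fold_changes, standard_errors)
print(f"pooled effect {pooled:.3f}, SE {pooled_se:.3f}, z {z:.2f}")
```

A fixed-effect model assumes one shared true effect across studies; when studies differ in platform or population, a random-effects model is usually the safer choice.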
Assessing Genomic Overlap: Permutation-based Strategies Bickel Lab, Berkeley & ENCODE Ann. Appl. Stat. (2010) 4:
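A rough feel for permutation-based overlap testing can be had from the sketch below. It is a deliberately naive stand-in, not the method of the cited paper: the genome is assumed to be one circular chromosome cut into equal bins, each feature set is a list of bin indices, and the null distribution comes from random circular shifts of one set (which preserves the spacing within that set). All names and data are hypothetical.

```python
import random

def overlap(a_bins, b_bins):
    """Number of bins occupied by both feature sets."""
    return len(set(a_bins) & set(b_bins))

def shift_test(a_bins, b_bins, n_bins, n_perm=1000, seed=0):
    """Empirical p-value for the observed overlap under random circular shifts."""
    rng = random.Random(seed)
    observed = overlap(a_bins, b_bins)
    hits = 0
    for _ in range(n_perm):
        offset = rng.randrange(1, n_bins)  # non-zero shift
        shifted = [(b + offset) % n_bins for b in b_bins]
        if overlap(a_bins, shifted) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical example: two feature sets that share 40 of their 50 bins.
set_a = list(range(0, 50))
set_b = list(range(10, 60))
observed, p_value = shift_test(set_a, set_b, n_bins=1000)
print(observed, p_value)
```

Shuffling individual positions independently would ignore the clustering of genomic features and tends to overstate significance, which is why structure-preserving resampling schemes are preferred.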
Challenge 2: Dimensionality Most technologies produce tens to hundreds of thousands of measurements per sample – Exponential increase with 2+ data types Dimension reduction – Process each data type separately (filtering) – Combine with model fitting – Multivariate analysis
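The filtering option can be as simple as keeping the most variable features within each data type before integration. The sketch below assumes a dictionary that maps feature names to measurements across samples; the function name, gene names, and values are hypothetical.

```python
from statistics import pvariance

def top_variance_features(data, n_keep):
    """Return the names of the n_keep features with the largest variance."""
    ranked = sorted(data, key=lambda name: pvariance(data[name]), reverse=True)
    return ranked[:n_keep]

# Hypothetical expression values for four genes across four samples.
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],
    "geneB": [2.0, 8.0, 3.0, 9.0],
    "geneC": [1.0, 1.0, 1.0, 1.0],
    "geneD": [4.0, 6.0, 4.5, 6.5],
}
kept = top_variance_features(expression, n_keep=2)
print(kept)
```

Variance filtering is unsupervised, so it does not use the outcome and does not bias later tests, but it can discard low-variance features that matter biologically.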
Sparse Multivariate Methods Variable Selection, Discriminant Analysis, Visualization Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity) Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS) Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford Stat Appl Genet Mol Biol January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
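How a penalty produces exact zeros can be seen in a toy rank-one sparse CCA. The sketch below alternates soft-thresholded power iterations on the cross-product matrix of two data types. It is a simplification under stated assumptions: both data matrices are column-centered, the within-data-type covariance is treated as identity, and the thresholds are fixed by hand rather than tuned. It is not the implementation of any package from the cited groups, and all names and numbers are hypothetical.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become exactly 0."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def sparse_cca_rank1(C, lam_u, lam_v, n_iter=50):
    """Sparse weight vectors u, v for a p x q cross-product matrix C = X^T Y."""
    p, q = len(C), len(C[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        # Update u from C v, then v from C^T u, thresholding each time.
        u = normalize([soft_threshold(sum(C[i][j] * v[j] for j in range(q)), lam_u)
                       for i in range(p)])
        v = normalize([soft_threshold(sum(C[i][j] * u[i] for i in range(p)), lam_v)
                       for j in range(q)])
    return u, v

# Hypothetical cross-product matrix: features 0 and 1 of each data type co-vary.
C = [[4.0, 3.5, 0.1],
     [3.8, 4.1, 0.0],
     [0.2, 0.1, 0.3]]
u, v = sparse_cca_rank1(C, lam_u=1.0, lam_v=1.0)
print(u, v)
```

The third entry of each weight vector is driven to exactly zero, which is the variable-selection effect that the sparsity penalty is meant to deliver.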
Challenge 3: Genomic Resolution Base level (conservation, motif scores) Regular intervals (expression/binding from tiling arrays) Irregular intervals – Gene/ncRNA level data (expression) – Individual positions (SNP, methylation sites)
Challenge 4: Heterogeneity Technology-specific sources of error Different pre-processing, normalization Different amounts of missing values Data matching – Different identifiers – Not always one-to-one (microarrays) – Imputation
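The data-matching problem can be sketched as follows: collapse many-to-one identifiers (several microarray probes per gene) to a single value per gene, then keep only genes present in both data types. Averaging probes is one of several reasonable choices, and the probe IDs, gene names, and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes(probe_values, probe_to_gene):
    """Average all probes that map to the same gene; drop unannotated probes."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}

def match_on_gene(first_by_gene, second_by_gene):
    """Pair up values for genes measured in both data types."""
    shared = sorted(first_by_gene.keys() & second_by_gene.keys())
    return {gene: (first_by_gene[gene], second_by_gene[gene]) for gene in shared}

# Hypothetical inputs: p1 and p2 both target geneA; p4 has no annotation.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 5.0, "p4": 1.0}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}
binding_by_gene = {"geneA": 0.9, "geneC": 0.2}

expr_by_gene = collapse_probes(probe_values, probe_to_gene)
matched = match_on_gene(expr_by_gene, binding_by_gene)
print(matched)
```

Taking the intersection discards genes missing from either source; imputation is the alternative when too much data would be lost.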
Challenge 4: Heterogeneity Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance Counts – expression data from sequencing 0-1 – conservation (UCSC), DNA methylation Binary/Categorical – Thresholding (e.g., motif scores), genotype
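One common way to cope with mixed data types is to transform each onto a scale where similar models apply. The sketch below shows three generic transforms, offered as assumptions rather than as the approach of this talk: a log transform for counts, a logit transform for 0-1 values such as methylation proportions, and thresholding of a continuous score. The pseudocount, clipping constant, and cutoff are hypothetical defaults.

```python
import math

def log_count(count, pseudocount=1.0):
    """Log2-transform a sequencing count; the pseudocount avoids log(0)."""
    return math.log2(count + pseudocount)

def logit_proportion(value, eps=1e-6):
    """Map a 0-1 value to the real line; clip first so 0 and 1 stay finite."""
    value = min(max(value, eps), 1.0 - eps)
    return math.log2(value / (1.0 - value))

def threshold(score, cutoff):
    """Convert a continuous score (e.g., a motif score) to a binary call."""
    return 1 if score >= cutoff else 0

print(log_count(7))            # count data
print(logit_proportion(0.8))   # 0-1 data
print(threshold(6.2, 5.0))     # continuous score to binary
```

Thresholding discards information, so it is usually reserved for cases where a binary call is what the downstream model expects.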
Case Study: Development Ci is important for the differentiation of appendages during development Transcription factor – binds to DNA near target genes Kechris Lab, CU Denver
Hierarchical Mixture Model Data – Transcriptome: Ci pathway mutants (expr) – irregular interval – Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level Goal: Predict gene targets of Ci Hidden variable is gene target – hierarchical mixture model Dvorkin et al., 2013 (under review)
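The role of the hidden "gene target" variable can be illustrated with a much simpler two-component mixture. The sketch below is not the hierarchical model of Dvorkin et al.; it assumes that the three gene-level summaries (expr, bind, cons) are conditionally independent given target status and each follow a normal distribution, and it uses made-up parameters that in practice would be estimated, for example by EM.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_target(obs, params, prior_target):
    """Posterior probability that a gene is a target, by Bayes' rule.

    params[name] = ((mean, sd) if target, (mean, sd) if not target).
    """
    like_target = prior_target
    like_background = 1.0 - prior_target
    for name, x in obs.items():
        (m1, s1), (m0, s0) = params[name]
        like_target *= normal_pdf(x, m1, s1)
        like_background *= normal_pdf(x, m0, s0)
    return like_target / (like_target + like_background)

# Hypothetical parameters and one hypothetical gene.
params = {
    "expr": ((2.0, 1.0), (0.0, 1.0)),
    "bind": ((1.5, 1.0), (0.0, 1.0)),
    "cons": ((0.8, 0.2), (0.5, 0.2)),
}
gene = {"expr": 2.1, "bind": 1.4, "cons": 0.85}
prob = posterior_target(gene, params, prior_target=0.1)
print(f"posterior probability of being a target: {prob:.2f}")
```

Each data type alone gives only modest evidence here; it is the product of the three likelihood ratios that overcomes the low prior, which is the basic argument for integrating them in one model.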
Challenge 5: Interactions and Pathways Known Pathways – Incorporate information in databases (curated but sparse) – e.g., KEGG pathways have metabolite – protein interactions (directed graphs) De novo Pathways – Discover novel interactions
Known Pathways Jornsten, Chalmers & Michailidis, U. Michigan Biostatistics (2012) 13: Joint modeling of metabolite and transcript data to identify active pathways metabolite gene
de novo Interactions Pair-wise – Correlations (e.g., eQTL) – Bayesian networks Multiple – Kernel-based methods – Probabilistic graphical models – Network analysis [Figure: single data type vs. integration; node labels: gene, SNP, protein, metabolite, methylation site, phenotype]
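The simplest pair-wise approach is a correlation screen between features of two data types, as in an eQTL scan that relates SNP genotypes to gene expression. The sketch below computes Pearson correlations for every SNP-gene pair and keeps those above a cutoff; the genotype coding (0, 1, 2 copies of the minor allele), the cutoff, and all names and values are hypothetical, and a real analysis would also test significance and correct for multiple testing.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_screen(snps, genes, cutoff):
    """Return (snp, gene, r) for every pair with |r| at or above the cutoff."""
    pairs = []
    for snp_name, genotype in snps.items():
        for gene_name, expression in genes.items():
            r = pearson(genotype, expression)
            if abs(r) >= cutoff:
                pairs.append((snp_name, gene_name, r))
    return pairs

# Hypothetical data for six samples.
snps = {"snp1": [0, 0, 1, 1, 2, 2], "snp2": [0, 1, 0, 2, 1, 2]}
genes = {"geneA": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
         "geneB": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9]}
candidates = correlation_screen(snps, genes, cutoff=0.9)
print(candidates)
```

Pair-wise screens scale as the product of the feature counts, which is one reason the multivariate and network methods on this slide become attractive with more than two data types.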
de novo Interactions Shojaie Lab U. Washington Biometrika (2010) 97 (3):
Summary Methodology 1. Meta-analysis 2. Permutation-based Methods 3. Sparse Multivariate Methods 4. Graphical Models 5. Network Analysis