SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration Katerina Kechris, PhD Associate Professor.

Presentation transcript:

SAMSI Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration Katerina Kechris, PhD Associate Professor Biostatistics and Informatics Colorado School of Public Health University of Colorado Denver

Omics
Large-scale analyses for studying a population of molecules or molecular mechanisms
High-throughput data
Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

Omics Epigenome Phenome Adapted from

Large-scale Projects & Databases NCI 60 Database

Integration of Omics Data
Each type of data gives a different snapshot of the biological or disease system
Why integrate data?
– Reduce false positives/negatives
– Identify interactions between different molecules
– Explore functional mechanisms

Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Challenge 1: When to integrate?
Early
– Merging data to increase sample size
Intermediate
– Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis
Late
– Meta-analysis (combine effect sizes or p-values), aggregate voting for classifiers, genomic enrichment and overlap of significant results
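As one concrete instance of late integration, p-values for the same gene from several independent studies can be combined with Fisher's method. The sketch below is illustrative only: the function name is hypothetical, the studies are assumed to be independent, and it is not presented as the procedure used in any of the cited work.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene across studies (Fisher's method)."""
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Under the null, -2 * sum(log p) follows a chi-square distribution with 2k d.f.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # Closed-form chi-square survival function, valid for even degrees of freedom
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical p-values for one gene in three studies
combined = fisher_combined_pvalue([0.01, 0.01, 0.01])
```

Three moderately small p-values combine into a much smaller one, which is the appeal of meta-analysis when individual studies are underpowered.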

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies Tseng Lab, U. of Pitt.

Assessing Genomic Overlap: Permutation-based Strategies Bickel Lab, Berkeley & ENCODE Ann. Appl. Stat. (2010) 4:
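A minimal sketch of a permutation-style overlap test follows. It uses random circular shifts of one feature set along a single chromosome as the null, which is a deliberately simplified stand-in and not the block-based procedure of the cited Ann. Appl. Stat. paper. All function names, the half-open interval convention, and the example coordinates are assumptions made for illustration.

```python
import random

def count_overlaps(a, b):
    """Number of intervals in `a` overlapping at least one interval in `b` (half-open)."""
    return sum(any(s1 < e2 and s2 < e1 for s2, e2 in b) for s1, e1 in a)

def shift_intervals(intervals, offset, genome_length):
    """Circularly shift intervals; one that crosses the end is split in two."""
    shifted = []
    for s, e in intervals:
        s2, e2 = (s + offset) % genome_length, (e + offset) % genome_length
        if s2 < e2:
            shifted.append((s2, e2))
        else:
            shifted.append((s2, genome_length))
            if e2 > 0:
                shifted.append((0, e2))
    return shifted

def overlap_pvalue(a, b, genome_length, n_perm=1000, seed=0):
    """Observed overlap count and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    hits = 0
    for _ in range(n_perm):
        offset = rng.randrange(genome_length)
        if count_overlaps(a, shift_intervals(b, offset, genome_length)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical peaks (a) and annotated sites (b) on a 100 kb chromosome
peaks = [(100, 200), (1000, 1100), (5000, 5100)]
sites = [(150, 160), (1050, 1060), (5050, 5060)]
observed, pvalue = overlap_pvalue(peaks, sites, genome_length=100_000)
```

Shifting preserves the spacing between features within a set, which is the reason a shift-based null is often preferred over placing each interval independently at random.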

Challenge 2: Dimensionality
Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
Dimension reduction
– Process each data type separately (filtering)
– Combine with model fitting
– Multivariate analysis

Sparse Multivariate Methods
Variable Selection, Discriminant Analysis, Visualization
Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)
Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
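To show how a penalty produces sparsity, the sketch below fits a single pair of sparse canonical vectors by alternating soft-thresholding on the cross-correlation matrix of two data types. It is written in the spirit of penalized CCA but is not the exact algorithm of the cited papers: it assumes within-data-type covariances can be treated as identity, uses fixed thresholds `lam_u` and `lam_v` rather than tuned constraints, and every name in it is hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink entries toward zero; entries with |x| <= lam become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_cca_rank1(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100):
    """One pair of sparse weight vectors (u for X, v for Y); samples are rows."""
    # Standardize columns (assumes no constant columns)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = X.T @ Y / X.shape[0]  # p x q cross-correlation matrix
    # Start from the leading right singular vector of K
    v = np.linalg.svd(K, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = soft_threshold(K @ v, lam_u)
        if not u.any():
            raise ValueError("lam_u is too large: all X weights were zeroed")
        u /= np.linalg.norm(u)
        v = soft_threshold(K.T @ u, lam_v)
        if not v.any():
            raise ValueError("lam_v is too large: all Y weights were zeroed")
        v /= np.linalg.norm(v)
    return u, v
```

Larger thresholds select fewer variables in each data type; in practice they would be chosen by cross-validation or permutation rather than fixed by hand.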

Challenge 3: Genomic Resolution
Base level (conservation, motif scores)
Regular intervals (expression/binding from tiling arrays)
Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)

Challenge 4: Heterogeneity
Technology-specific sources of error
Different pre-processing, normalization
Different amounts of missing values
Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
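The data-matching step can be sketched as follows: several microarray probes may map to one gene, so probe values are first collapsed to the gene level and then joined with a second data type on the shared gene identifier. Averaging is only one possible collapsing rule, and the probe names, gene names, and values below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes(probe_values, probe_to_gene):
    """Average probe-level values per gene; unmapped or missing probes are dropped."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None and value is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}

def match_by_gene(expr_by_gene, other_by_gene):
    """Keep only genes measured in both data types (an inner join)."""
    shared = sorted(expr_by_gene.keys() & other_by_gene.keys())
    return {g: (expr_by_gene[g], other_by_gene[g]) for g in shared}

# Hypothetical identifiers and values
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 1.0, "p4": 9.0}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}  # p4 has no mapping
binding = {"geneA": 0.5, "geneC": 0.1}

expression = collapse_probes(probe_values, probe_to_gene)
matched = match_by_gene(expression, binding)
```

An inner join discards genes seen in only one data type; keeping them instead would require the imputation step listed on the slide.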

Challenge 4: Heterogeneity
Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
Counts – expression data from sequencing
Values in [0, 1] – conservation (UCSC), DNA methylation
Binary/Categorical – thresholding (e.g., motif scores), genotype

Case Study: Development
Ci is important for differentiation of appendages during development
Ci is a transcription factor – it binds to DNA near target genes
Kechris Lab, CU Denver

Hierarchical Mixture Model
Data
– Transcriptome: Ci pathway mutants (expr) – irregular interval
– Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
Goal: Predict gene targets of Ci
Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)
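The idea of treating target status as a hidden variable can be illustrated with a much simpler, non-hierarchical stand-in: a two-component mixture in which each gene has gene-level features (for example, summaries of expr, bind, and cons) that are modeled as independent Gaussians given the hidden label. This is not the model of Dvorkin et al.; the function name, the Gaussian assumption, and the initialization rule are all assumptions made for the sketch.

```python
import numpy as np

def em_two_component(X, n_iter=200, tol=1e-8):
    """EM for a 2-component mixture with independent Gaussian features.

    X: genes x features. Returns the posterior probability that each gene is a
    target, plus the mixing weight, means, and variances from the last M-step.
    """
    # Initialize: genes with an above-median standardized score start as targets
    score = ((X - X.mean(axis=0)) / X.std(axis=0)).sum(axis=1)
    resp = (score > np.median(score)).astype(float)
    prev = -np.inf
    for _ in range(n_iter):
        # M-step: weighted proportions, means, and variances per component
        pi = np.clip(resp.mean(), 1e-6, 1 - 1e-6)
        w = np.stack([1.0 - resp, resp])                      # 2 x n
        mu = (w @ X) / w.sum(axis=1, keepdims=True)           # 2 x d
        var = (w @ X**2) / w.sum(axis=1, keepdims=True) - mu**2 + 1e-6
        # E-step: log joint density of each gene under each component
        sq = (X[None, :, :] - mu[:, None, :]) ** 2 / var[:, None, :]
        ll = -0.5 * (np.log(2 * np.pi * var)[:, None, :] + sq).sum(axis=2)
        ll += np.log([1.0 - pi, pi])[:, None]                 # 2 x n
        m = ll.max(axis=0)
        log_norm = m + np.log(np.exp(ll - m).sum(axis=0))     # log-sum-exp
        resp = np.exp(ll[1] - log_norm)
        total = log_norm.sum()
        if total - prev < tol:
            break
        prev = total
    return resp, pi, mu, var
```

Genes can then be ranked by `resp`, the posterior probability of being a target, rather than by any single data type.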

Challenge 5: Interactions and Pathways
Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite–protein interactions (directed graphs)
De novo Pathways
– Discover novel interactions

Known Pathways
Joint modeling of metabolite and transcript data to identify active pathways
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:
[Figure labels: metabolite, gene]

de novo Interactions
Single data
INTEGRATION
Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
[Figure labels: gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE]
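The simplest pair-wise approach, correlating genotypes with expression as in an eQTL scan, can be sketched in a few lines. The function name and the additive 0/1/2 genotype coding are assumptions; a real analysis would also adjust for covariates and correct for multiple testing.

```python
import numpy as np

def snp_expression_correlations(genotypes, expression):
    """Pearson correlation of every SNP with every gene.

    genotypes: samples x SNPs (coded 0/1/2); expression: samples x genes.
    Returns a SNPs x genes matrix. A SNP with no variation yields NaN.
    """
    G = genotypes - genotypes.mean(axis=0)
    E = expression - expression.mean(axis=0)
    G = G / np.linalg.norm(G, axis=0)
    E = E / np.linalg.norm(E, axis=0)
    return G.T @ E

# Simulated example: SNP 3 affects the expression of gene 2
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 5))
expression = rng.normal(size=(200, 4))
expression[:, 2] += 1.5 * genotypes[:, 3]

R = snp_expression_correlations(genotypes, expression)
top_pair = tuple(int(i) for i in np.unravel_index(np.argmax(np.abs(R)), R.shape))
```

Computing all pairs as one matrix product scales to many SNPs and genes, but it captures only marginal, linear association between two data types at a time.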

de novo Interactions Shojaie Lab U. Washington Biometrika (2010) 97 (3):

Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis