Empirical evaluation of prediction- and correlation network methods applied to genomic data Steve Horvath University of California, Los Angeles.



Content
– Review of weighted correlation network analysis (WGCNA)
– When is hub gene selection better than standard meta-analysis? Evaluating systems biologic gene selection methods
– The epigenetic clock: a highly accurate genomic predictor of age

What is weighted correlation network analysis (WGCNA)?

1. Construct a network. Rationale: make use of interaction patterns between genes.
2. Identify modules. Rationale: module (pathway) based analysis.
3. Relate modules to external information. Array information: clinical data, SNPs, proteomics. Gene information: gene ontology, EASE, IPA. Rationale: find biologically interesting modules.
4. Find the key drivers in interesting modules. Tools: intramodular connectivity, causality testing. Rationale: experimental validation, therapeutics, biomarkers.
5. Study module preservation across different data. Rationale: in the same data, to check the robustness of the module definition; in different data, to find interesting modules.

Weighted correlation networks are valuable for a biologically meaningful…
reduction of high dimensional data
– expression: microarray, RNA-seq
– gene methylation data, fMRI data, etc.
integration of multiscale data
– expression data from multiple tissues
– SNPs (module QTL analysis)
– complex phenotypes

An anatomically comprehensive atlas of the adult human brain transcriptome MJ Hawrylycz, E Lein,..,AR Jones (2012) Nature 489, Allen Brain Institute

Data
– Brains from two healthy males (ages 24 and 39)
– 170 brain structures
– over 900 microarray samples per individual
– 64K Agilent microarray
This data set provides a neuroanatomically precise, genome-wide map of transcript distributions.

Global gene networks. Modules in brain 1

How to construct a weighted correlation network? Systems biology as a field of study: interactions between the components of biological systems

Network = adjacency matrix
A network can be represented by an adjacency matrix, A = [a_ij], that encodes whether and how a pair of nodes is connected.
– A is a symmetric matrix with entries in [0,1].
– For unweighted networks, entries are 1 or 0 depending on whether or not two nodes are adjacent (connected).
– For weighted networks, the adjacency matrix reports the connection strength between node pairs.
– Our convention: diagonal elements of A are all 1.

Two types of weighted correlation networks: unsigned and signed
Default values: β=6 for unsigned and β=12 for signed networks. We prefer signed networks.
Zhang et al SAGMB Vol. 4: No. 1, Article 17.
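The two network types differ only in how a correlation is mapped to an adjacency before the power β is applied. The sketch below assumes the commonly cited definitions, a_ij = |cor(x_i, x_j)|^β for unsigned networks and a_ij = (0.5 + 0.5·cor(x_i, x_j))^β for signed networks; the formulas do not appear in this transcript, and the function names and toy profiles are hypothetical. It is plain Python, not the WGCNA R code.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(profiles, beta, signed):
    """Soft-thresholded adjacency matrix from a list of expression profiles."""
    n = len(profiles)
    adj = [[1.0] * n for _ in range(n)]  # convention: diagonal elements are 1
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            # Assumed definitions: signed maps [-1, 1] to [0, 1]; unsigned uses |r|.
            base = 0.5 + 0.5 * r if signed else abs(r)
            adj[i][j] = adj[j][i] = base ** beta
    return adj

# Toy data (hypothetical): the second gene is perfectly anti-correlated with the first.
genes = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
unsigned_adj = adjacency(genes, beta=6, signed=False)
signed_adj = adjacency(genes, beta=12, signed=True)
```

In this toy example the anti-correlated pair is strongly connected in the unsigned network and unconnected in the signed one, which is the practical difference between the two types.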

Figure: adjacency versus correlation in unsigned and signed networks (panels: unsigned network, signed network).

Advantages of soft thresholding with the power function
1. Robustness: network results are highly robust with respect to the choice of the power β (Zhang et al 2005).
2. Calibration of different networks becomes straightforward, which facilitates consensus module analysis.
3. Module preservation statistics are particularly sensitive for measuring connectivity preservation in weighted networks.
4. Math reason: Geometric Interpretation of Gene Co-Expression Network Analysis. PLoS Computational Biology. 4(8): e

How to detect network modules? Systems biology as a paradigm, usually defined in antithesis to the so-called reductionist paradigm (biological organization)

Module definition
– Based on the resulting cluster tree, we define modules as branches.
– Modules are labeled either by integers (1, 2, 3, …) or equivalently by colors (turquoise, blue, brown, etc).
– We often use average linkage hierarchical clustering coupled with the topological overlap dissimilarity measure.
– Next we use the dynamic tree cutting method to define clusters (Langfelder et al 2007).
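The topological overlap dissimilarity used as input to the clustering can be sketched as follows. This assumes the standard weighted definition, TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 − a_ij), where l_ij = Σ_u a_iu·a_uj over nodes u other than i and j, and k_i is the sum of a_iu over u ≠ i. The formula is not given in this transcript, and the function names and toy matrix are hypothetical.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a symmetric adjacency with entries in [0, 1]."""
    n = len(adj)
    # Connectivity k_i: summed connection strength to all other nodes.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Shared-neighbor term: strength of two-step paths i -> u -> j.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            denom = min(k[i], k[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom

def tom_dissimilarity(adj):
    """Dissimilarity 1 - TOM, suitable as input to hierarchical clustering."""
    return [[1.0 - t for t in row] for row in topological_overlap(adj)]

# Toy adjacency (hypothetical): nodes 1 and 2 are linked only through node 0.
toy_adj = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]]
toy_tom = topological_overlap(toy_adj)
toy_diss = tom_dissimilarity(toy_adj)
```

Nodes 1 and 2 have zero adjacency but a nonzero topological overlap because they share a neighbor, which is why this measure tends to give more cohesive branches than the raw adjacency.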

Defining modules based on a hierarchical cluster tree
Langfelder P, Zhang B et al (2007) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics (5):
– Module = branch of a cluster tree
– The dynamic hybrid branch cutting method combines advantages of hierarchical clustering and PAM clustering

How does one find “consensus” modules based on multiple gene expression data (networks)?

Example: multiple human brain expression data sets from Huntington's disease (HD)
Publicly available caudate nucleus gene expression data from HD subjects and controls:
1) Durrenberger et al (2012). Selection of novel reference genes for use in the human central nervous system: a BrainNet Europe Study. Acta Neuropathol Dec;124(6):
2) Hodges et al, Luthi-Carter (2006). Regional and cellular gene expression changes in human Huntington’s disease brain. Human Molecular Genetics, 2006, Vol. 15, No. 6

Analysis steps of WGCNA
1. Construct a signed weighted correlation network based on 2 human gene expression data sets. Purpose: keep track of co-expression relationships.
2. Identify consensus modules. Purpose: find robustly defined and reproducible modules. Technique: the consensus adjacency is a quantile of the input adjacencies, e.g. minimum, lower quartile, median.
3. Relate modules to external information: HD disease status; gene information (gene ontology, cell marker genes). Purpose: find biologically meaningful modules.
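The consensus step can be illustrated with an entry-wise summary across the input adjacency matrices. This is a minimal sketch that covers only the minimum and the median; it skips any calibration of the input networks, and the function name and toy matrices are hypothetical.

```python
from statistics import median

def consensus_adjacency(adjacencies, summary=min):
    """Entry-wise summary (default: minimum) across same-sized adjacency matrices."""
    n = len(adjacencies[0])
    return [
        [summary([adj[i][j] for adj in adjacencies]) for j in range(n)]
        for i in range(n)
    ]

# Toy adjacencies (hypothetical) for the same three genes in two data sets.
set1 = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
set2 = [[1.0, 0.6, 0.4], [0.6, 1.0, 0.2], [0.4, 0.2, 1.0]]

strict = consensus_adjacency([set1, set2])           # minimum: a strong link must hold in every set
relaxed = consensus_adjacency([set1, set2], median)  # median: tolerates disagreement between sets
```

With the minimum, a pair of genes is strongly connected in the consensus network only if it is strongly connected in every input network, which is what makes the resulting modules reproducible across data sets.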

Consensus dendrogram with module colors and meta-analysis significance for diagnosis. The colors correspond to the meta-analysis Z score (with weights proportional to the square root of the number of degrees of freedom); blue denotes genes that are down in HD vs controls, and red denotes genes that are up in HD vs controls.
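A weighted meta-analysis Z score of this kind can be sketched as below. It assumes the weighted Stouffer form, Z = Σ w_i·Z_i / sqrt(Σ w_i²) with w_i = sqrt(DOF_i); the transcript only states that the weights are proportional to the root of the degrees of freedom, so the exact combination rule is an assumption. The function name and the input numbers are hypothetical.

```python
import math

def meta_z(z_scores, dofs):
    """Weighted Stouffer Z score with weights sqrt(degrees of freedom) (assumed form)."""
    weights = [math.sqrt(d) for d in dofs]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))

# Hypothetical per-data-set Z scores for one gene, with 16 and 9 degrees of freedom.
z_meta = meta_z([3.0, -1.0], [16, 9])
```

A positive combined score would be drawn in red (up in HD) and a negative one in blue (down in HD) under the color convention of the figure.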

Question: How does one summarize the expression profiles in a module?
Answer: This has been solved.
– Math answer: module eigengene = first principal component
– Network answer: the most highly connected intramodular hub gene
Both turn out to be equivalent.

Module eigengene = measure of over-expression = average redness
Figure: heat map of the brown module (rows = genes, columns = microarray samples) and the brown module eigengene across samples.

Module eigengenes are very useful
1) They allow one to relate modules to each other
– Allows one to determine whether modules should be merged
2) They allow one to relate modules to clinical traits (HD status) and genetic variation (e.g. CAG trinucleotide repeat length)
– Avoids the multiple comparison problem
3) They allow one to define a measure of module membership: kME=cor(x,ME)
– Can be used for finding centrally located hub genes
– Can be used to define gene lists for GO enrichment
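The module eigengene and the membership measure kME=cor(x,ME) can be sketched in plain Python. The example standardizes each gene, finds the first principal component across samples by power iteration, and correlates each gene with it. Power iteration is used here only to avoid external libraries and is not a claim about how WGCNA computes the eigengene; the function names and the toy module are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(x):
    """Scale a profile to mean 0 and sample variance 1."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / (n - 1))
    return [(a - m) / sd for a in x]

def module_eigengene(profiles, iterations=200):
    """First principal component across samples (rows = genes, columns = samples)."""
    z = [standardize(g) for g in profiles]
    n_samples = len(z[0])
    # Start from the summed profile so the result points in the direction of
    # average expression (assumes the sum is not the zero vector).
    v = [sum(g[s] for g in z) for s in range(n_samples)]
    for _ in range(iterations):
        # One multiplication by X^T X: project genes onto v, then map back to samples.
        scores = [sum(g[s] * v[s] for s in range(n_samples)) for g in z]
        v = [sum(scores[i] * z[i][s] for i in range(len(z))) for s in range(n_samples)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v

# Toy module (hypothetical): three positively correlated genes over five samples.
module = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.9, 5.1],
    [2.0, 1.0, 4.0, 3.0, 6.0],
]
eigengene = module_eigengene(module)
kme = [pearson(gene, eigengene) for gene in module]  # module membership per gene
```

Genes with the highest kME are the candidates for intramodular hubs; in this toy module the third gene, which tracks the others least closely, receives the lowest value.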

When Is Hub Gene Selection Better than Standard Meta-Analysis? Evaluating systems biologic gene selection methods

When does hub gene selection lead to more meaningful gene lists than a standard statistical analysis based on significance testing? Here we address this question for the special case when multiple data sets are available. This is of great practical importance since for many research questions multiple gene expression or other omics data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules.

Intramodular hub genes versus whole network hubs
Intramodular hubs have high intramodular connectivity kME with respect to a given module of interest.
Whole network hubs have high values of whole network connectivity k:
– k = row sum of the adjacency matrix
– k = number of direct neighbors in case of an unweighted network
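Whole network connectivity is a row sum, which the sketch below computes while leaving out the diagonal so that a node does not count as its own neighbor (an assumption, given the convention that diagonal elements are 1). The optional `members` argument restricts the sum to one module; this gives a sum-based intramodular connectivity, which is a different measure from the correlation-based kME named on the slide. All names and numbers are hypothetical.

```python
def connectivity(adj, members=None):
    """Row sums of the adjacency matrix, excluding the diagonal.

    If members is given, only connections among those node indices are summed.
    """
    nodes = list(range(len(adj))) if members is None else list(members)
    return {i: sum(adj[i][j] for j in nodes if j != i) for i in nodes}

# Toy weighted network (hypothetical): nodes 0-2 form a module, node 3 is peripheral.
weighted = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
k_whole = connectivity(weighted)
k_within = connectivity(weighted, members=[0, 1, 2])

# For an unweighted network the same sum counts direct neighbors.
unweighted = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
neighbors = connectivity(unweighted)
```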

Q & A
1. Are whole-network hub genes relevant or should one exclusively focus on intramodular hubs?
Answer: Focus exclusively on intramodular hubs in trait-related modules.
2. Do network-based gene selection strategies lead to gene lists that are biologically more informative than those based on standard marginal approaches?
Answer: Yes, gene selection based on intramodular connectivity leads to biologically more informative gene lists than marginal approaches.
3. Do network-based gene selection strategies lead to gene lists that have more reproducible trait associations than those based on standard marginal approaches?
Answer: Overall no. But when the signal is weak, networks can help.

Criteria for judging gene selection methods Criterion 1 evaluates the biological insights gained, i.e. it is relevant in basic research. Criterion 2 evaluates the validation success in independent data sets, i.e. it is relevant when it comes to developing diagnostic or prognostic biomarkers.

Data sets used in the empirical evaluation
We compare standard meta-analysis with consensus network analysis in three comprehensive and unbiased empirical studies:
(1) Find genes predictive of lung cancer survival
– Gold standard = cell proliferation related genes
(2) Find age related DNA methylation markers
– Gold standard = Polycomb group target genes
(3) Find genes related to total cholesterol in mouse liver tissues
– Gold standard = immune system related genes

R code in the WGCNA package
– For standard screening, we used the metaAnalysis function.
– For finding hubs in consensus modules, we used the consensusKME function.

Results
The results demonstrate that intramodular hub gene status is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (criterion 1). However, meta-analysis methods perform as well as (if not better than) a co-expression network approach in terms of validation success (criterion 2).

Overview of biological aging clocks

Here a biological aging clock is defined as a method for predicting the age (in years) of a subject or biological sample.
Examples:
1. based on telomere length
2. based on gene expression levels
3. based on protein expression levels
4. based on DNA methylation levels

Telomere length versus age in white blood cells
Relation between age and TRF in men (r=−0.45) and in women (r=−0.48).
Benetos A, et al (2001) Telomere Length as an Indicator of Biological Aging: The Gender Effect and Relation With Pulse Pressure and Pulse Wave Velocity. Hypertension. 2001

p16INK4a clock

CDKN2A = p16Ink4A = tumor suppressor
– p16 is a tumor suppressor protein encoded by the CDKN2A gene.
– Cyclin-dependent kinase inhibitor 2A (CDKN2A, p16Ink4A) is also known as multiple tumor suppressor 1 (MTS-1).
– p16 plays an important role in regulating the cell cycle, and mutations in p16 increase the risk of developing a variety of cancers, notably melanoma.
– Increased expression of the p16 gene as organisms age reduces the proliferation of stem cells. This reduction in the division and production of stem cells protects against cancer while increasing the risks associated with cellular senescence.

p16INK4a clock
R^2=0.40 means that the age correlation is 0.63.
Liu Y et al (May 2009). "Expression of p16INK4a in peripheral blood T-cells is a biomarker of human aging". Aging Cell 8 (4): 439–48.

Disruptive clock technology based on DNA methylation levels
State of the art of biological clocks before epigenetic markers:
– Gene products (mRNA, protein levels) lead to an age correlation of 0.63
DNA methylation levels (epigenetics) can be used to define drastically more accurate clocks:
– The epigenetic clock leads to an age correlation of 0.96
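The two accuracy figures can be put on the same scale with a one-line conversion. This assumes a simple linear fit with a single predictor, where the correlation is the square root of R^2; the variable names are hypothetical.

```python
import math

# p16INK4a clock: R^2 = 0.40 corresponds to a correlation of about 0.63.
r_p16 = math.sqrt(0.40)

# Epigenetic clock: a correlation of 0.96 corresponds to R^2 of about 0.92.
r2_epigenetic = 0.96 ** 2
```

On the variance-explained scale the gap is therefore roughly 40% versus 92%, which is larger than the difference between the two correlations suggests.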

DNA methylation age of human tissues and cell types. Genome Biol (10):R115 PMID:

Training data sets

Test data sets

Illumina data sets
– The first 39 data sets were used to construct ("train") the age predictor.
– Data sets were used to test (validate) the age predictor.
– Data sets served other purposes, e.g. to estimate the DNAm age of embryonic stem and iPS cells.
– Training data were chosen i) to represent a wide spectrum of tissues/cell types, ii) to involve samples whose mean age (43 years) is similar to that in the test data, and iii) to involve a high proportion of samples (37%) measured on the Illumina 450K platform, since many ongoing studies use this recent Illumina platform.
– The analysis only considered CpGs (measured with the Infinium type II assay) that were present on both Illumina platforms (Infinium 450K and 27K) and had fewer than 10 missing values across the data sets.

Age predictor
– To ensure an unbiased validation in the test data, only the training data were used to define the age predictor.
– A transformed version of chronological age was regressed on the CpGs using a penalized regression model (elastic net).
– The elastic net regression model automatically selected 353 CpGs.
– I refer to the 353 CpGs as (epigenetic) clock CpGs since their weighted average (formed by the regression coefficients) amounts to an epigenetic clock.
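Once the coefficients are fitted, applying the clock is a weighted sum of methylation levels followed by the inverse of the age transformation. The sketch below shows only that prediction step. The coefficients, intercept, and CpG names are made up for illustration and are not the 353 clock CpGs, and because the transformation is not specified in this transcript, the inverse is left as a caller-supplied function that defaults to the identity.

```python
def predict_age(betas, coefficients, intercept, inverse_transform=lambda y: y):
    """Weighted sum of methylation levels, mapped back to years.

    betas: CpG name -> methylation level in [0, 1] for one sample.
    coefficients: CpG name -> regression weight (hypothetical values below).
    inverse_transform: undoes the age transformation used in training (assumed identity).
    """
    linear = intercept + sum(w * betas[cpg] for cpg, w in coefficients.items())
    return inverse_transform(linear)

# Hypothetical two-CpG "clock" and one sample; CpGs without a weight are ignored.
toy_coefficients = {"cpg_a": 40.0, "cpg_b": -10.0}
toy_sample = {"cpg_a": 0.5, "cpg_b": 0.2, "cpg_c": 0.9}
toy_age = predict_age(toy_sample, toy_coefficients, intercept=20.0)
```

Because the elastic net sets most coefficients to zero, only the selected CpGs need to be measured to apply the predictor to a new sample.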

Accuracy across tissues and cell types (training)

Accuracy across test data

Accuracy in brain tissue

Results sent to me: blood data from Marco Boks (Jan 2014) and blood data from Jim Pankow (Jan 2014). Median error=3.5 years

Aging clock applied to urine
Figure created by Wei Guo from Zymo Research. Median error=2.7 years, Cor=0.98
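The two accuracy measures quoted on these slides can be computed as follows. The sketch assumes that "median error" means the median absolute difference between predicted and chronological age, and that "Cor" is the Pearson correlation between the two; the ages are made-up numbers, not data from the studies.

```python
import math
from statistics import median

def median_abs_error(predicted, actual):
    """Median absolute difference between predicted and chronological age (assumed definition)."""
    return median(abs(p - a) for p, a in zip(predicted, actual))

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predicted (DNAm) ages and chronological ages, in years.
dnam_age = [30.0, 42.0, 55.0, 61.0]
chrono_age = [28.0, 45.0, 55.0, 66.0]

error_years = median_abs_error(dnam_age, chrono_age)
age_cor = pearson(dnam_age, chrono_age)
```

The two measures answer different questions: the correlation can be high even when predictions are systematically offset, whereas the median error reports the typical deviation in years.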

Acknowledgements
WGCNA analysis
– Lin Song
– Peter Langfelder