Bioinformatics Dealing with expression data Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore 2008-2009.

Bioinformatics Dealing with expression data Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be) Université de Liege - Institut Montefiore 2008-2009

Acknowledgements: Material based on slides from Patrik D’haeseleer, Shoudan Liang and Roland Somogyi (genetic network inference); slides from Steve Horvath and Jun Dong (co-expression networks); slides from Sargur Srihari (bagging and boosting).

Class Outline: Genetic networks. A primer to co-expression network analysis. Bagging and boosting (as promised …). Consensus microarray data analysis: theory and application.

Genetic networks

Outline: Introduction. A conceptual approach to complex network dynamics. Inference of regulation through clustering of gene expression data. Modeling methodologies. Gene network inference: reverse engineering.

Genes encode proteins, some of which in turn regulate other genes → determine the structure of this intricate network of genetic regulatory interactions.

Traditional approach: local. Examining and collecting data on a single gene, a single protein or a single reaction at a time → functional genomics.

Functional Genomics: Specifically, functional genomics refers to the development and application of global experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics. It relies on high-throughput, large-scale experimental methodologies combined with statistical and computational analysis of the results.

Functional Genomics (cont.): We need to define the mapping from sequence space to functional space.

Intermediate representation: Focus at the level of single cells. A biological system can be considered to be a state machine, where the change in internal state of the system depends on both its current internal state and any external inputs.

The goal: Observe the state of a cell and how it changes under different circumstances, and from this derive a model of how these state changes are generated. The state of a cell: all those variables determining its behavior.

Example: A simple, 6-node regulatory network.

Outline: Introduction. A conceptual approach to complex network dynamics. Inference of regulation through clustering of gene expression data. Modeling methodologies. Gene network inference: reverse engineering. Conclusions and Outlook.

The global gene expression pattern is the result of the collective behavior of individual regulatory pathways. Gene function depends on its cellular context; thus understanding the network as a whole is essential.

Boolean Networks: Each gene is considered as a binary variable (either ON or OFF) regulated by other genes through logical or Boolean functions. Even with this simplification, the network behavior is already extremely rich.
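To make the Boolean abstraction concrete, here is a minimal sketch of a synchronous Boolean network. The three genes and their update rules are invented for illustration and are not taken from the slides; synchronous updating is also an assumption.

```python
# Hypothetical 3-gene network; the rules are illustrative only.
RULES = {
    "A": lambda s: s["C"],                 # A is activated by C
    "B": lambda s: s["A"] and not s["C"],  # B needs A ON and C OFF
    "C": lambda s: not s["B"],             # C is repressed by B
}

def step(state):
    """Synchronously update every gene from the current global state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def trajectory(state, max_steps=20):
    """Follow the state until it repeats; return the visited states in order."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state)
    return seen + [state]

start = {"A": True, "B": False, "C": False}
path = trajectory(start)
for s in path:
    print(s)
```

Even this toy network shows two kinds of long-run behavior: the start state above falls into a cycle, while the state A=ON, B=OFF, C=ON is a fixed point.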

Boolean Networks (cont.): Cell differentiation corresponds to transitions from one global gene expression pattern to another.

Scoring methods: Whether there has been a significant change at any one condition; whether there has been a significant aggregate change over all conditions; whether the fluctuation pattern shows high diversity according to Shannon entropy.
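The entropy criterion can be sketched as follows. The slides do not say how expression values are discretized, so the equal-width binning and the default of three bins are assumptions made only for this example.

```python
import math
from collections import Counter

def shannon_entropy(values, n_bins=3):
    """Entropy (in bits) of expression values binned into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a flat profile shows no diversity
    width = (hi - lo) / n_bins
    # Clamp the top edge so the maximum value falls in the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())

flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]      # made-up profile with no fluctuation
varied = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1]    # made-up profile spread over all bins
print(shannon_entropy(flat), shannon_entropy(varied))
```

A gene whose values spread evenly over all bins reaches the maximum of log2(n_bins) bits, while a constant gene scores zero.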

Guilt By Association: Select a gene. Determine its nearest neighbors in expression space within a certain user-defined distance cut-off.

Clustering: Extract groups of genes that are tightly co-expressed over a range of different experiments.

Caution: Different clustering methods can have very different results. It’s not yet clear which clustering methods are most useful for gene expression analysis.

Definition: Gene Expression Profile. An expression profile $e_j$ of an ordered list of N samples (k = 1 to N) for a particular gene j is a vector of scaled expression values $v_{jk}$. The expression profile is: $e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$

Definition: Gene Expression Profile (cont.). A difference between two genes p and q may be estimated as an N-dimensional metric “distance” between $e_p$ and $e_q$. Euclidean distance: $d_{pq} = \sqrt{\sum_{k=1}^{N} (v_{pk} - v_{qk})^2}$
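As a small illustration, the Euclidean distance between two profiles and a guilt-by-association neighbor search with a user-defined cut-off might look like this. The gene names, expression values and cut-off are made up for the example.

```python
import math

def euclidean(e_p, e_q):
    """Euclidean distance between two expression profiles of equal length N."""
    if len(e_p) != len(e_q):
        raise ValueError("profiles must cover the same N samples")
    return math.sqrt(sum((vp - vq) ** 2 for vp, vq in zip(e_p, e_q)))

def neighbors(profiles, gene, cutoff):
    """Genes whose profile lies within `cutoff` of `gene`, nearest first."""
    ref = profiles[gene]
    hits = [(euclidean(ref, e), g) for g, e in profiles.items() if g != gene]
    return [g for d, g in sorted(hits) if d <= cutoff]

# Made-up scaled expression values for four hypothetical genes, N = 3 samples.
profiles = {
    "g1": [0.0, 1.0, 2.0],
    "g2": [0.1, 1.1, 2.1],
    "g3": [2.0, 1.0, 0.0],
    "g4": [0.0, 1.0, 2.5],
}
print(neighbors(profiles, "g1", cutoff=0.6))
```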

Clustering algorithms: Non-hierarchical methods. Cluster N objects into K groups in an iterative process until certain goodness criteria are optimized, e.g. K-means.
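A bare-bones K-means (Lloyd's algorithm) illustrates the iterative process. Starting the centroids at the first K points is a simplification chosen here to keep the example deterministic; practical implementations usually use random or smarter initialization.

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up 2-D "expression profiles" forming two obvious groups.
points = [[0, 0], [5, 5], [0, 1], [5, 6], [1, 0], [6, 5]]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```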

Clustering algorithms: Hierarchical methods. Return a hierarchy of nested clusters, where each cluster typically consists of the union of two or more smaller clusters. Agglomerative methods: start with single-object clusters and recursively merge them into larger clusters. Divisive methods: start with the cluster containing all objects and recursively divide it into smaller clusters.
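The agglomerative variant can be sketched in a few lines. Average linkage is used here as one possible merge criterion; the slides do not prescribe a linkage rule, so that choice and the toy data are assumptions.

```python
import math

def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters clusters."""
    clusters = [[i] for i in range(len(points))]  # start: one object per cluster

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    merges = []  # record of (cluster, cluster, distance), i.e. the tree
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return clusters, merges

points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]  # made-up data
clusters, merges = agglomerate(points, n_clusters=3)
print(clusters)
```

The `merges` list holds the nested structure: reading it in order reproduces the tree from the leaves upward.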

Other applications of co-expression clusters: Extraction of regulatory motifs: genes in the same expression cluster share biological functions. Inference of functional annotation: functions of unknown genes may be hypothesized from genes with known function within the same cluster. As a molecular signature in distinguishing cell or tissue types: mRNA expression.

Which clustering method to use? There is no single best criterion for obtaining a partition because no precise and workable definition of ‘cluster’ exists. Clusters can be of any arbitrary shapes and sizes in a multidimensional pattern space.

Challenge in cluster analysis: A gene could be a member of several clusters, each reflecting a particular aspect of its function and control. Solutions: clustering methods that partition genes into non-exclusive clusters; several clustering methods could be used simultaneously.

Level of biochemical detail: Abstract: Boolean networks. Concrete: full biochemical interaction models with stochastic kinetics, as in Arkin et al. (1998).

Forward and inverse modeling: Forward modeling approach. Inverse modeling, or reverse engineering: given an amount of data, what can we deduce about the unknown underlying regulatory network? Requires the use of a parametric model, the parameters of which are then fit to the real-world data.

Goal of network inference: Construct a coarse-scale model of the network of regulatory interactions between the genes. It’s possible to reverse engineer a network from its activity profiles.

Data requirements: To infer how a gene is regulated, we need to observe the expression of that gene under many different combinations of expression levels of its regulatory inputs. Use data from different sources. Deal with different data types.

Estimates for network models: A sparse network model of N genes, where each gene is only affected by K other genes on average → a sparsely connected, directed graph with N nodes and NK edges.

Co-expression network analysis

Outline: Network and network concepts. Approximately factorizable networks. Gene Co-expression Network. Eigengene Factorizability, Eigengene Conformity. Eigengene-based network concepts. What can we learn from the geometric interpretation?

Network = Adjacency Matrix: A network can be represented by an adjacency matrix, $A = [a_{ij}]$, that encodes whether/how a pair of nodes is connected. A is a symmetric matrix with entries in [0,1]. For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected). For weighted networks, the adjacency matrix reports the connection strength between node pairs. Our convention: diagonal elements of A are all 1.

Motivational example I: Pair-wise relationships between genes across different mouse tissues and genders Challenge: Develop simple descriptive measures that describe the patterns. Solution: The following network concepts are useful: density, centralization, clustering coefficient, heterogeneity

Motivational example (continued) Challenge: Find a simple measure for describing the relationship between gene significance and connectivity Solution: network concept called hub gene significance

Background: Network concepts are also known as network statistics or network indices. Examples: connectivity (degree), clustering coefficient, topological overlap, etc. Network concepts underlie network language and systems biological modeling. Dozens of potentially useful network concepts are known from graph theory.

Review of some fundamental network concepts which are defined for all networks (not just co-expression networks)

Connectivity: Node connectivity = row sum of the adjacency matrix. For unweighted networks = number of direct neighbors. For weighted networks = sum of connection strengths to other nodes.

Density: Density = mean adjacency. Highly related to mean connectivity.
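Connectivity and density can be computed directly from the adjacency matrix. The sketch below assumes that the diagonal (set to 1 by convention) is left out of the row sum and of the mean, so that density equals mean connectivity divided by n - 1; the example matrix is made up.

```python
def connectivity(A):
    """k_i: sum of adjacencies between node i and the other nodes."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def density(A):
    """Mean off-diagonal adjacency, which equals mean(k) / (n - 1)."""
    n = len(A)
    return sum(connectivity(A)) / (n * (n - 1))

# Unweighted 4-node path 0-1-2-3, with the diagonal set to 1.
A = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
k = connectivity(A)
print(k, density(A))
```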

Centralization: = 1 if the network has a star topology; = 0 if all nodes have the same connectivity. [Figure: a star network (centralization = 1) and a network in which all nodes have connectivity 2 (centralization = 0).]
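The slide gives only the two extreme cases. One common definition that reproduces both is n/(n-2) · (max(k)/(n-1) − density); it is used below as an assumption, not as the formula from the slides. The `star` and `ring` helpers are hypothetical test networks.

```python
def connectivity(A):
    """k_i: row sum of the adjacency matrix, leaving out the diagonal."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def centralization(A):
    """n/(n-2) * (max(k)/(n-1) - density); assumed definition, needs n >= 3."""
    n = len(A)
    k = connectivity(A)
    dens = sum(k) / (n * (n - 1))
    return n / (n - 2) * (max(k) / (n - 1) - dens)

def star(n):
    """Node 0 is the hub; every other node connects only to it."""
    return [[1 if i == j or 0 in (i, j) else 0 for j in range(n)] for i in range(n)]

def ring(n):
    """Each node connects to its two neighbours on a cycle (connectivity 2)."""
    return [[1 if i == j or (i - j) % n in (1, n - 1) else 0 for j in range(n)]
            for i in range(n)]

print(centralization(star(5)), centralization(ring(5)))
```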

Heterogeneity: Coefficient of variation of the connectivity. Highly heterogeneous networks exhibit hubs.
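A short sketch of the coefficient of variation of the connectivity. Whether the population or the sample standard deviation is meant is not stated on the slide; the population version is assumed here, and the two example networks are made up.

```python
import statistics

def connectivity(A):
    """k_i: row sum of the adjacency matrix, leaving out the diagonal."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def heterogeneity(A):
    """Coefficient of variation of k (population standard deviation assumed)."""
    k = connectivity(A)
    return statistics.pstdev(k) / statistics.mean(k)

n = 5
# Star network: node 0 is a hub connected to all others.
star = [[1 if i == j or 0 in (i, j) else 0 for j in range(n)] for i in range(n)]
# Ring network: every node has exactly two neighbours, so there are no hubs.
ring = [[1 if i == j or (i - j) % n in (1, n - 1) else 0 for j in range(n)]
        for i in range(n)]
print(heterogeneity(star), heterogeneity(ring))
```

The star, with its single hub, is the heterogeneous case; the ring has identical connectivities and therefore zero heterogeneity.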

Clustering Coefficient: Measures the cliquishness of a particular node: “a node is cliquish if its neighbors know each other”. [Figure: two example networks, one in which the clustering coefficient of the white node is 0 and one in which it is 1.] This generalizes directly to weighted networks (Zhang and Horvath 2005).
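The following sketch uses a formula that works for both unweighted and weighted adjacency matrices, in the spirit of the weighted generalization cited above. The slide itself shows no formula, so the exact expression should be read as an assumption; for 0/1 adjacencies it reduces to the usual "fraction of connected neighbor pairs".

```python
def clustering_coef(A, i):
    """Clustering coefficient of node i for a symmetric adjacency matrix A."""
    n = len(A)
    others = [u for u in range(n) if u != i]
    # Numerator: strength of triangles through i.
    num = sum(A[i][l] * A[l][m] * A[m][i] for l in others for m in others if m != l)
    s1 = sum(A[i][l] for l in others)
    s2 = sum(A[i][l] ** 2 for l in others)
    denom = s1 ** 2 - s2  # maximum possible value of the numerator
    return num / denom if denom > 0 else 0.0

# Star: the hub's neighbours do not know each other.
star = [[1, 1, 1, 1],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1]]
# Triangle: all neighbours of every node are connected.
triangle = [[1, 1, 1],
            [1, 1, 1],
            [1, 1, 1]]
print(clustering_coef(star, 0), clustering_coef(triangle, 0))
```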

The topological overlap dissimilarity is used as input of hierarchical clustering. Generalized in Zhang and Horvath (2005) to the case of weighted networks. Generalized in Li and Horvath (2006) to multiple nodes. Generalized in Yip and Horvath (2007) to higher order interactions.
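The slide names the measure without defining it. A widely used form of the pairwise topological overlap is (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), with dissimilarity 1 − overlap; the sketch below assumes that form and uses a made-up network.

```python
def topological_overlap(A, i, j):
    """Topological overlap of nodes i != j (assumed formula, see lead-in)."""
    n = len(A)
    k_i = sum(A[i][u] for u in range(n) if u != i)
    k_j = sum(A[j][u] for u in range(n) if u != j)
    shared = sum(A[i][u] * A[u][j] for u in range(n) if u not in (i, j))
    return (shared + A[i][j]) / (min(k_i, k_j) + 1 - A[i][j])

def tom_dissimilarity(A):
    """Matrix of 1 - overlap, usable as input to hierarchical clustering."""
    n = len(A)
    return [[0.0 if i == j else 1.0 - topological_overlap(A, i, j)
             for j in range(n)] for i in range(n)]

# Nodes 0 and 1 are linked and share neighbours 2 and 3; 2 and 3 are not linked.
A = [[1, 1, 1, 1],
     [1, 1, 1, 1],
     [1, 1, 1, 0],
     [1, 1, 0, 1]]
D = tom_dissimilarity(A)
print(topological_overlap(A, 0, 1), topological_overlap(A, 2, 3))
```

Note that nodes 2 and 3 get a non-zero overlap although they are not directly connected, because they share all their neighbors.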

Network Significance: Defined as average gene significance. We often refer to the network significance of a module network as module significance.

Hub Gene Significance= slope of the regression line (intercept=0)
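A zero-intercept regression slope has the closed form Σ x·y / Σ x². The sketch below regresses gene significance GS on connectivity scaled by its maximum; the scaling and the numbers are assumptions for illustration. Network significance (the average GS) is included for comparison.

```python
def network_significance(gs):
    """Average gene significance over the nodes of the network."""
    return sum(gs) / len(gs)

def hub_gene_significance(gs, k):
    """Slope of the zero-intercept regression of GS on scaled connectivity."""
    k_max = max(k)
    K = [x / k_max for x in k]  # scaled connectivity in [0, 1] (assumption)
    return sum(g * x for g, x in zip(gs, K)) / sum(x * x for x in K)

# Made-up example in which GS is exactly proportional to connectivity.
k = [4, 2, 1, 1]
gs = [0.8, 0.4, 0.2, 0.2]
print(network_significance(gs), hub_gene_significance(gs, k))
```

A large positive slope means that highly connected genes tend to be the significant ones.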

Q: What do all of these fundamental network concepts have in common? They are functions of the adjacency matrix A and/or a gene significance measure GS.

CHALLENGE: Find relationships between these and other seemingly disparate network concepts. For general networks, this is a difficult problem. But a solution exists for a special subclass of networks: approximately factorizable networks.

Definition of an approximately factorizable network. Why is this relevant? Answer: because modules are often approximately factorizable.

Observation: Approximate relationships among network concepts in approximately factorizable networks

Weighted Gene Co-expression Network

Steps for constructing a co-expression network: A) Microarray gene expression data. B) Measure concordance of gene expression with a Pearson correlation. C) The Pearson correlation matrix is either dichotomized to arrive at an adjacency matrix → unweighted network, or transformed continuously with the power adjacency function → weighted network.
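Steps B and C can be sketched as follows. Taking the absolute value of the correlation (an unsigned network), the parameter names `tau` and `beta`, and the example values are assumptions for this illustration; the power adjacency function is taken to be |cor|^beta.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_adjacency(expr, beta=None, tau=None):
    """Adjacency from gene profiles: give beta (weighted) or tau (unweighted)."""
    if (beta is None) == (tau is None):
        raise ValueError("specify exactly one of beta or tau")
    n = len(expr)
    A = [[1.0] * n for _ in range(n)]  # diagonal set to 1 by convention
    for i in range(n):
        for j in range(i + 1, n):
            s = abs(pearson(expr[i], expr[j]))  # unsigned similarity (assumption)
            a = s ** beta if beta is not None else float(s >= tau)
            A[i][j] = A[j][i] = a
    return A

# Made-up expression data: rows = genes, columns = microarrays.
expr = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
A_weighted = coexpression_adjacency(expr, beta=2)
A_unweighted = coexpression_adjacency(expr, tau=0.9)
print(A_weighted[0], A_unweighted[0])
```

The hard threshold discards the moderate correlation between genes 0 and 3, whereas the power function keeps it as a weaker connection.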

Definition of module (cluster): Module = cluster of highly connected nodes. Any clustering method that results in such sets is suitable. We define modules as branches of a hierarchical clustering tree using the topological overlap matrix.

Module Eigengene = measure of over-expression = average redness. [Figure: heat map (rows = genes, columns = microarrays) and the module eigengene across samples.]

The module eigengene is highly correlated with the most highly connected hub gene.

Some insights: Intramodular hub gene = a gene that is highly correlated with the module eigengene, i.e. it is a good representative of a module. Gene screening strategies that use intramodular connectivity amount to pathway-based gene screening methods. Intramodular connectivity is a highly reproducible “fuzzy” measure of module membership. Network concepts are useful for describing pairwise interaction patterns.

Bagging and Boosting

Bagging

Boosting

Creating a classifier sequence

Creating a 2 nd training set

Creating a 3rd data set

Boosting vs Bagging

Consensus microarray analysis

Theory (Allison et al 2006 !!!) Practical IBD application
