Identifying Causal Genes and Dysregulated Pathways in Complex Diseases Discussion leader: Nafisah Islam Scribe: Matthew Computational Network Biology BMI.


Identifying Causal Genes and Dysregulated Pathways in Complex Diseases
Discussion leader: Nafisah Islam
Scribe: Matthew
Computational Network Biology, BMI 826/Computer Sciences
Paper by Yoo-Ah Kim, Stefan Wuchty, Teresa M. Przytycka*, PLoS Computational Biology, 2011

Problem overview
- Different patients might have different combinations of molecular perturbations.
- These combinations of perturbations can dysregulate the same cellular pathway.
- For complex diseases, the information flow from potential causal genes to affected genes had not been investigated.
Problem:
- A set of differentially expressed genes and
- Genotype alterations for a particular disease case.
Goal:
- Find a set of potential causal genes and
- The dysregulated pathways through which they affect molecular entities.

Approach
The method uses associations between gene expression and genomic alterations. It solves the problem in four main steps:
- Selection of differentially expressed target genes that cover the underlying disease cases
- Finding associations between altered genomic loci and changed expression levels of target genes (eQTL mapping)
- Identification of potential causal genes
- Determination of a subset of causal genes that best explains the underlying disease cases

Outline of the method
Figure 1.
(A) Selection of target genes that were differentially expressed in disease cases, using a multi-set cover approach.
(B) Detection of genome-wide associations between gene expression changes of target genes and genomic alterations, which allows potential causal genomic areas to be found.
(C) Determination of causal paths from genomic alterations (i.e. causal genes) to target genes by modeling and solving a current flow problem through a circuit of molecular interactions.
(D) To select a final set of causal genes, a weighted multi-set cover algorithm is designed. A bipartite graph between candidate causal genes and disease cases is constructed, where each edge is labeled with the associated set of target genes that were affected by the causal gene and were differentially expressed in the corresponding disease case. In the final set cover, causal genes in boxes covered each disease case with at least two target genes, allowing one exception.

Step 1: Selecting Target Genes
- Formulated as a minimum multi-set cover problem and solved with a greedy algorithm.
- Goal: determine a set of genes that were differentially expressed in 158 glioblastoma (GBM) cases compared to 32 non-tumor control cases.
- A gene is differentially expressed in a given case if its normalized expression value has a p-value less than 0.01 in that case, using a Z-test.
- A bipartite graph B(T, S) between genes T and disease cases S is built by adding an edge between gene g and case s if and only if g was differentially expressed in s.
- Multi-set cover instance SC = {B(T, S), α, β}, where α is the number of times each case needs to be covered and β is the maximum number of outliers.
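The differential-expression test above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the Z-score of a case is computed against the mean and standard deviation of the control values, and the function names and toy numbers are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def z_test_p_value(value, control_values):
    """Two-sided Z-test p-value of `value` against the control values.

    Assumption: the reference distribution is estimated from the controls.
    """
    mu = mean(control_values)
    sigma = stdev(control_values)
    z = (value - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def differentially_expressed_cases(expr_by_case, control_values, threshold=0.01):
    """Return the cases in which the gene has a p-value below `threshold`."""
    return {
        case
        for case, value in expr_by_case.items()
        if z_test_p_value(value, control_values) < threshold
    }


# Toy expression values for one gene (made up for illustration).
controls = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0]
cases = {"case1": 5.05, "case2": 7.5, "case3": 2.0}
hits = differentially_expressed_cases(cases, controls)
```

The resulting set of cases per gene is exactly the edge list of the bipartite graph B(T, S).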

Algorithm 1: Pseudocode for selecting target genes.
1. Construct a bipartite graph B(T, S) of genes T and disease cases S by adding edges between gene g and case s if gene g is differentially expressed in case s. Let S(g) denote the set of cases to which gene g has edges.
2. Create a multi-set cover instance SC = {B(T, S), α, β}
3. U = the set of cases covered fewer than α times
4. TG = the set of selected genes
5. Repeat the following until |U| ≤ β:
   a. Select a gene with maximum |U ∩ S(g)|
   b. Include the selected gene in TG
   c. Update U
They obtain 74 target genes.
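A small Python sketch of this greedy multi-set cover may make the loop concrete. It follows the pseudocode as written; the function name, the tie-breaking rule (alphabetical order) and the toy input are assumptions made for illustration only.

```python
def greedy_multi_set_cover(cases_by_gene, all_cases, alpha, beta):
    """Greedy multi-set cover in the spirit of Algorithm 1.

    cases_by_gene: dict gene -> set of cases S(g) in which g is
                   differentially expressed.
    alpha: number of times each case must be covered.
    beta:  number of cases allowed to stay under-covered (outliers).
    Returns (selected genes TG, remaining under-covered cases U).
    """
    coverage = {s: 0 for s in all_cases}
    uncovered = {s for s in all_cases if alpha > 0}  # U
    remaining = dict(cases_by_gene)
    selected = []  # TG

    while len(uncovered) > beta and remaining:
        # Gene covering the most under-covered cases; ties broken by name.
        best = max(sorted(remaining), key=lambda g: len(uncovered & remaining[g]))
        if not uncovered & remaining[best]:
            break  # no remaining gene helps any under-covered case
        selected.append(best)
        for s in remaining.pop(best):
            if s in coverage:
                coverage[s] += 1
        uncovered = {s for s in uncovered if coverage[s] < alpha}

    return selected, uncovered


# Toy instance (hypothetical genes and cases).
toy = {
    "g1": {"s1", "s2", "s3"},
    "g2": {"s1", "s2"},
    "g3": {"s3", "s4"},
    "g4": {"s4"},
}
target_genes, outliers = greedy_multi_set_cover(toy, ["s1", "s2", "s3", "s4"], alpha=2, beta=1)
```

With α = 2 and β = 1, the sketch stops as soon as only one case is still covered fewer than twice, so that case is tolerated as an outlier.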

Step 2: Finding Association
- A set of loci L = {l_1, l_2, ..., l_m}, where each locus l_i is characterized by its copy number cn_{i,j} in each case j: CN_i = {cn_{i,1}, cn_{i,2}, ..., cn_{i,n}}
- Identify tag loci: a tag locus tl_k represents a region of neighboring loci l_i whose copy number profiles CN_i are highly correlated (Pearson's correlation) with CN_k.
- Given: a set of tag loci TL = {tl_1, tl_2, ..., tl_m} and a set of target genes TG = {tg_1, tg_2, ..., tg_n}.
- Find: candidate causal loci using eQTL association analysis.
- For each tg_i, select the tag loci with a p-value less than 0.01.

Algorithm 2: Pseudocode of eQTL mapping.
1. For each chromosome chr, let L_chr be the set of loci on the chromosome, sorted in increasing order of their genomic locations.
2. tl = L_chr[0]  // the first locus
3. Add tl to TL  // TL ... set of TAG loci
4. Consider loci i in sorted order:
   a. If corr(tl, L_chr[i]) ≤ θ_TL:  (corr(x, y) ... Pearson's correlation coefficient)
      i. right(tl) = L_chr[i-1]  // set the right boundary of the old TAG locus
      ii. tl = L_chr[i] and add tl to TL  // select a new TAG locus
      iii. Consider loci in reverse sorted order starting from j = i - 1
      iv. If corr(tl, L_chr[j]) ≤ θ_TL: left(tl) = L_chr[j+1]  // set the left boundary of the new TAG locus
          Go to 4.a
5. For each target gene dg_i:
   a. TL(i) = []  // set of tag loci associated with disease gene i
   b. For each tag locus tl_j:
      Run linear regression between E(dg_i) and CN(tl_j) and compute the p-value
      If p < θ_eqtl: add tl_j to TL(i)
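The tag-locus part of this pseudocode (steps 1 to 4) can be sketched in plain Python as below. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names, the index-based return format and the toy profiles are assumptions, and the regression step (step 5) is left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / sqrt(vx * vy)


def select_tag_loci(profiles, theta_tl):
    """Scan one chromosome and return (tag, left, right) index triples.

    profiles: copy-number profiles CN_i, one per locus, sorted by position.
    A new tag locus starts when the correlation with the current tag
    drops to theta_tl or below.
    """
    if not profiles:
        return []
    regions = []
    tag, left = 0, 0
    for i in range(1, len(profiles)):
        if pearson(profiles[tag], profiles[i]) <= theta_tl:
            regions.append((tag, left, i - 1))  # close the old tag's region
            tag, left = i, i
            # Walk backwards to find the left boundary of the new tag.
            j = i - 1
            while j >= 0 and pearson(profiles[tag], profiles[j]) > theta_tl:
                left = j
                j -= 1
    regions.append((tag, left, len(profiles) - 1))
    return regions


# Toy copy-number profiles for four loci across four cases (made up).
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 3.2, 3.9],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 1.0],
]
tag_regions = select_tag_loci(toy_profiles, theta_tl=0.9)
```

In the toy data the first two loci move together and the last two move together, so the scan yields two tag loci, each representing a two-locus region.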

Step 3: Identification of Candidate Causal Genes
- Let G = (N, E) be a gene network, where N is a set of genes and E is a set of molecular interactions.
- I = [I(e) for e ∈ E] holds the current passing through the edges, and V = [V(n) for n ∈ N] holds the voltage at the nodes.
- For an edge e = (u, v) connecting genes u and v, the conductance w(e) is defined as the mean of corr(u, tg) and corr(v, tg), where tg is the target gene.
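To illustrate the circuit model, the sketch below solves for node voltages and edge currents using Ohm's law, I(u, v) = w(e) (V(u) - V(v)), and Kirchhoff's current law at every interior node. It is a generic resistor-network solver under stated assumptions, not the paper's code: the source is fixed at voltage 1 and the sink at 0, the graph is treated as undirected, and all names and conductances are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def node_voltages(conductance, source, sink):
    """Voltages with V(source) = 1 and V(sink) = 0 (assumed boundary values).

    conductance: dict (u, v) -> w(e) for each undirected edge.
    Each free node satisfies sum_v w(u, v) * (V(u) - V(v)) = 0.
    """
    nodes = sorted({n for edge in conductance for n in edge})
    free = [n for n in nodes if n not in (source, sink)]
    index = {n: i for i, n in enumerate(free)}
    fixed = {source: 1.0, sink: 0.0}
    a = [[0.0] * len(free) for _ in free]
    b = [0.0] * len(free)
    for (u, v), w in conductance.items():
        for x, y in ((u, v), (v, u)):
            if x in index:
                a[index[x]][index[x]] += w
                if y in index:
                    a[index[x]][index[y]] -= w
                else:
                    b[index[x]] += w * fixed[y]
    volts = dict(fixed)
    if free:
        volts.update(zip(free, solve_linear(a, b)))
    return volts


def edge_currents(conductance, volts):
    """Ohm's law on every edge; a negative value means flow from v to u."""
    return {(u, v): w * (volts[u] - volts[v]) for (u, v), w in conductance.items()}


# Toy circuit with two parallel two-edge routes from "src" to "snk".
toy_w = {("src", "a"): 1.0, ("a", "snk"): 3.0, ("src", "b"): 0.5, ("b", "snk"): 0.5}
toy_v = node_voltages(toy_w, "src", "snk")
toy_i = edge_currents(toy_w, toy_v)
```

The route through node "a" has the higher conductances and therefore carries more current, which is the intuition behind ranking genes by the current they receive.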

Assumption: direct regulation of the expression of target genes is mediated by transcription factors.
- The resulting problem is solved with a heuristic approach.
- Edges are removed repeatedly until only a small number of directed edges carry current in the reverse direction.
- Candidate causal genes are selected using an empirical p-value less than 0.05.

Algorithm 3: Pseudocode for selecting candidate causal genes.
1. For each disease gene dg_i:
   a. CG(dg_i) ← ∅
   b. For each tag locus tl_j ∈ TL(i) and associated region R(tl_j):
      i. Compute C(tl_j), the set of genes located in R(tl_j)
      ii. Repeat the following:
          Construct an electric circuit G = {N, E}
          Compute the current I(g) to each gene in C(tl_j)
          If |{e in E | e in reverse direction}| < θ_r: go to 1.b.iii
          Else: remove those edges and repeat 1.b.ii
      iii. Compute current in random networks and p-values
      iv.

Algorithm 4: Finding the dysregulated pathway from a causal gene c to a target gene d.
1. r_max(c, d) ← the region in r(c) for which c has the most significant p-value (where r(c) is the set of regions that contain the causal gene c)
2. tl_max(c, d) ← the corresponding tag locus
3. G' ← subgraph of G consisting of nodes with p-value < 0.05 in Sol(d, tl_max(c, d))
4. For each gene g_i in G': I(g_i) ← the total current passing through the gene g_i in Sol(d, tl_max(c, d))
5. P_max(d, c) ← the paths p ∈ P(d, c) from d to c that maximize min_{g_i ∈ p} I(g_i)
6. Choose the shortest path in P_max(d, c)
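Steps 5 and 6 of Algorithm 4 amount to a maximum-bottleneck ("widest") path search followed by a shortest-path choice among the widest paths. The sketch below shows one way to do this; it is an assumption-laden illustration rather than the authors' procedure: the graph is given as an adjacency dict, the bottleneck includes both endpoints, and the gene names and currents are made up.

```python
import heapq
from collections import deque


def max_bottleneck(graph, current, start, goal):
    """Largest achievable min I(g) over the nodes of a start-goal path."""
    best = {start: current[start]}
    heap = [(-current[start], start)]
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(u, float("-inf")):
            continue  # stale heap entry
        if u == goal:
            return width
        for v in graph.get(u, ()):
            new_width = min(width, current[v])
            if new_width > best.get(v, float("-inf")):
                best[v] = new_width
                heapq.heappush(heap, (-new_width, v))
    return None  # goal not reachable


def shortest_widest_path(graph, current, start, goal):
    """Shortest path among those attaining the maximum bottleneck current."""
    width = max_bottleneck(graph, current, start, goal)
    if width is None:
        return None
    # Breadth-first search restricted to nodes carrying at least `width`.
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            break
        for v in graph.get(u, ()):
            if v not in parent and current[v] >= width:
                parent[v] = u
                queue.append(v)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Toy network: two routes from target gene "d" to causal gene "c".
toy_graph = {
    "d": ["x", "y"],
    "x": ["d", "c"],
    "y": ["d", "z"],
    "z": ["y", "c"],
    "c": ["x", "z"],
}
toy_current = {"d": 1.0, "c": 1.0, "x": 0.2, "y": 0.6, "z": 0.5}
causal_path = shortest_widest_path(toy_graph, toy_current, "d", "c")
```

Here the shorter route through "x" is rejected because its bottleneck current (0.2) is lower than that of the longer route through "y" and "z" (0.5); path length only breaks ties among equally wide paths.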

Step 4: Finalizing Causal Genes
- Build a weighted bipartite graph WB(C, S), with an edge between gene cg_k and case s_i if and only if gene cg_k explains case s_i.
- Let W(C_0, s) be the total number of target genes by which the genes in C_0 cover case s. A case is covered if the total weight covering it exceeds a certain threshold.
- This gives a minimum weighted multi-set cover problem.

Algorithm 5: Pseudocode for the selection of final causal genes.
1. Create a weighted multi-set cover instance WSC = {B, γ, δ}
2. U = the set of cases covered less than γ
3. MCG = the set of selected causal genes
4. Repeat the following until |U| ≤ δ:
   a. Select a gene with maximum
   b. Include the selected gene in MCG
   c. Update U
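A possible shape for this weighted greedy selection is sketched below. The selection criterion is not given in the transcript, so the one used here (the largest amount of still-needed coverage weight a gene contributes) is an assumption of this sketch, as are the parameter names gamma and delta, the function name and the toy input.

```python
def greedy_weighted_cover(weights, all_cases, gamma, delta):
    """Greedy weighted multi-set cover, loosely following Algorithm 5.

    weights[gene][case]: number of target genes through which `gene`
                         explains `case` (the edge weight).
    gamma: total weight a case needs in order to count as covered.
    delta: number of cases allowed to stay under-covered.
    Returns (selected causal genes, remaining under-covered cases).
    """
    covered = {s: 0 for s in all_cases}
    uncovered = {s for s in all_cases if gamma > 0}
    remaining = dict(weights)
    selected = []

    def gain(gene):
        # Assumed criterion: useful weight added to under-covered cases.
        return sum(
            min(w, gamma - covered[s])
            for s, w in remaining[gene].items()
            if s in uncovered
        )

    while len(uncovered) > delta and remaining:
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining gene improves coverage
        selected.append(best)
        for s, w in remaining.pop(best).items():
            if s in covered:
                covered[s] += w
        uncovered = {s for s in uncovered if covered[s] < gamma}

    return selected, uncovered


# Toy instance with hypothetical candidate causal genes cg1..cg3.
toy_weights = {
    "cg1": {"s1": 2, "s2": 1},
    "cg2": {"s2": 1, "s3": 2},
    "cg3": {"s3": 1},
}
final_genes, uncovered_cases = greedy_weighted_cover(
    toy_weights, ["s1", "s2", "s3"], gamma=2, delta=0
)
```

Capping each contribution at the weight a case still needs keeps a gene from being favored for piling extra weight onto cases that are already nearly covered.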

Validation/Evaluation
- In the early steps of the algorithm, they determined associations between copy number variations and the expression of target genes, yielding 16,056 associated genes.
- The next step reduced this set to 701 candidate causal genes, with a significant enrichment of 10 GBM-specific and 25 glioma-related genes. They also obtained 1,763 pairs.
- Using the weighted set cover approach with information from the DAVID database gave consistent results: 128 causal genes, including 6 GBM-relevant genes (CDKN2A, EGFR, ERBB4, PTEN, RB1 and TP53).
- With AceView, 280 causal genes were obtained, including only 4 GBM-related genes.

Chromosomal Analysis of Causal Genes
- Genomic alterations in GBM include:
  - A genomic amplification on chromosome 7
  - A deletion on chromosome 10
- These alterations occur at the genomic locations of EGFR and PTEN.
- Among the final 128 causal genes and their corresponding target gene pairs, causal genes on chromosomes 7 and 10 were connected to target genes.

Dysregulated Pathways and Subnetworks
Figure: The network of causal paths from PTEN.

Literature-Based Validation
- RHOBOTB2, a recently discovered gene, small genomic alterations
- GBAS and CEBPA were not included by AceView or DAVID

(A) Dysregulated pathways from causal gene CDC2. Genes in this larger network were significantly enriched in the cell cycle, pathways in cancer, chronic myeloid leukemia, prostate, pancreatic, bladder, colorectal and lung cancers (P < 0.01).

(B) Dysregulated pathways from causal gene GBAS. Genes that appear in this network are enriched in bladder cancer and cancer pathways in general (P < 0.01). (C) Pooling all genes that appear in dys-regulated pathways as presented in the main text, we found that a small number of genes appeared in many causal paths. The hub genes were significantly enriched in cancer pathways, chronic and acute myeloid leukemia, prostate and pancreatic cancers, cell cycle, neurotrophin signaling pathway, renal cell carcinoma, TGF signaling and T-cell receptor signaling pathways (P < 0.001).

Discussion
- Each individual step can be used separately, depending on the specific application.
- Thus, the method can be applied to any disease system where genetic variations play a fundamental, causal role.