Inference of Transcriptional Regulation Network with Gene Expression Data Andrew Kwon.



Role of Proteins: Both functional and structural. Main agents of cellular functions. Each protein has a specific function. The amount of each protein in the cell must be controlled carefully, which requires an elaborate regulatory network.

Gene Regulatory Network: Fundamental mechanism by which protein production and cellular functions are controlled. Complex input-output system made of proteins and genes for controlling cellular functions. Important for understanding many problems, including medical ones.

Cell Cycle: After a certain amount of growth, a cell divides into two identical cells. It needs to duplicate its cellular components and divide them equally between the daughter cells. Different regulators act in concert at different parts and stages to control the cell cycle.

Types of Regulation: Activation: an increase in protein A leads to an increase in gene B’s transcription. Inhibition: an increase in protein A leads to a decrease in gene B’s transcription. Not a simple binary relationship: many genes could act on a particular gene at once (complexes); feedback and self-regulation.

Example of Regulatory Network: S phase control in yeast.

Microarray: Each spot contains a specific probe designed for a single cDNA. When more cDNA binds to a spot, the red intensity increases. Allows the study of gene expression on a large scale.

Which Genes Are Related? Goal: to find out which pairs of genes have a direct regulatory relationship.

Correlation Method: Standard correlation coefficient. Widely used method for sequence similarity comparisons. Tests for the degree of linear relationship between two variables. Cannot take into account the time delay involved in gene regulation. Strongly favours global over local similarities.
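The time-delay limitation can be illustrated with a small sketch. It assumes the "standard correlation coefficient" is the Pearson coefficient; the sine-wave profiles, the 16 time points, and the 4-point delay are hypothetical values chosen only for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical regulator profile: one period of a sine wave over 16 time points.
t = np.arange(16)
regulator = np.sin(2 * np.pi * t / 16)

# Hypothetical target: the same curve delayed by 4 time points (a quarter period).
target_delayed = np.sin(2 * np.pi * (t - 4) / 16)

r_same = pearson(regulator, regulator)          # identical curves
r_delayed = pearson(regulator, target_delayed)  # same shape, shifted in time
```

In this toy case the delayed copy has exactly the same shape as the regulator, yet the coefficient falls from 1 to approximately 0, because the comparison is made time point by time point.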

Edge Detection Method (1): By Filkov et al. Focuses on improving local similarity detection. Scan through the gene expression curves, determine where major edges occur, and remove spurious edges. Construct primary edges using local minima and maxima. Filter out those edges whose height does not meet the pre-determined threshold.
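A minimal sketch of the edge-construction step, under simplifying assumptions: an edge is taken to be the segment between two consecutive local extrema, and a "spurious" edge is simply one whose height is below the threshold. The function name `primary_edges`, the example profile, and the threshold value are hypothetical and are not taken from Filkov et al.

```python
def primary_edges(profile, min_height):
    """Return (start, end, direction) for each edge between consecutive
    local extrema whose height is at least min_height.
    direction is +1 for a rising edge and -1 for a falling edge."""
    n = len(profile)
    # Extrema: both endpoints plus interior points where the slope changes sign.
    # Flat plateaus (zero slope) are ignored in this simplified version.
    extrema = [0]
    for i in range(1, n - 1):
        left = profile[i] - profile[i - 1]
        right = profile[i + 1] - profile[i]
        if left * right < 0:
            extrema.append(i)
    extrema.append(n - 1)

    edges = []
    for start, end in zip(extrema, extrema[1:]):
        height = profile[end] - profile[start]
        if abs(height) >= min_height:  # drop edges that are too small
            edges.append((start, end, 1 if height > 0 else -1))
    return edges

# Hypothetical profile with small wiggles around one major rise and one major fall.
profile = [0.0, 0.1, 0.0, 1.0, 2.0, 1.9, 2.0, 0.5, 0.0]
edges = primary_edges(profile, min_height=0.5)
```

With this threshold only the major rise (index 2 to 4) and the major fall (index 6 to 8) are kept; the small wiggles are filtered out.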

Edge Detection Method (2): Group those edges with similar direction. This leaves edges depicting the major features only. Compare the edge profiles of two genes by summing up closely located edges from the two genes that have the same direction.

Edge Detection Method (3): Scoring Formula [formula not preserved in the transcript]. d = agreement of slopes of edges (-1 or 1). n = number of edges. a, b = two genes being compared. [symbol not preserved] = gap between edges. [symbol not preserved] max = maximum allowable time difference between two edges.

Edge Detection Method (4): Does not differentiate between the directions of regulation. Cannot be used to find inhibitory relationships. Allows negative time delays between two corresponding edges, on the basis that there is not enough data resolution. Detects strong local matches only.

Bayesian Networks: Consist of two parts: a directed acyclic graph (DAG), which is the structure of the GRN, and a set of parameters for the DAG (the statistical hypothesis). The DAG represents the causal relations among a set of random variables (gene expression levels). X causes Y if and only if there is a direct edge from X to Y.

Bayesian Networks (2): Must learn the network using observed data. Perform a series of conditional independence tests and construct the most likely set of DAGs based on the results. Assign a score to each DAG based on the sample data, and search for the highest-scoring one.

Bayesian Networks (3): Need a large sample size for accuracy. Representing time: the number of variables increases dramatically if time is to be represented in the Bayesian network. Dynamic Bayesian network: high complexity.

Event Method: Need a method that balances global and local similarity. Need to make use of temporal evidence. Need to account for the directionality of regulation. Need to be computationally efficient.

Hypotheses on Regulation: Hypothesis 1: A activates B. A rise in expression of A is followed by a rise in expression of B; a fall in expression of A is followed by a fall in expression of B. Hypothesis 2: A inhibits B. A rise in A is followed by a fall in B; a fall in A is followed by a rise in B. There is a time delay between 2 corresponding events.

Events: Directional changes in the expression profile. State of gene expression at an instant. 3 possible states: Rise, Constant, Fall (R, C, F). Event state/type is determined by the slope of the expression profile.

Event Conversion: Microarray data is quite noisy. Perform smoothing to reduce noise before calculating slopes. Select the ‘flat’ region around a slope of 0. Classify into R, C, F based on the slope values. Any value falling in the flat region → C. Result: 2 event strings.
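The conversion steps can be sketched as follows. The slides do not say which smoothing method or flat-region width was used, so the moving average, the `window` size, and the `flat` threshold here are all assumptions, as is the function name `to_events`.

```python
import numpy as np

def to_events(profile, flat=0.2, window=3):
    """Convert an expression profile into a string of R/C/F events.

    flat   -- half-width of the 'flat' region around slope 0 (assumed value)
    window -- odd moving-average window used for smoothing (assumed method)
    """
    x = np.asarray(profile, dtype=float)
    # Moving-average smoothing; edge padding keeps the profile length unchanged.
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")

    # Slope between consecutive time points (evenly spaced sampling assumed).
    slopes = np.diff(smooth)

    events = []
    for s in slopes:
        if s > flat:
            events.append("R")   # rise
        elif s < -flat:
            events.append("F")   # fall
        else:
            events.append("C")   # inside the flat region
    return "".join(events)

# Hypothetical profile: flat, then a rise, a short plateau, and a fall.
profile = [0, 0, 0, 3, 6, 9, 9, 9, 6, 3, 0, 0, 0]
events = to_events(profile, flat=0.5, window=3)
```

A profile with n time points yields n - 1 events, one per interval. Widening the flat region turns more of the small slopes into C events, which trades sensitivity for robustness to noise.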

Event String Alignment: Need to best match 2 event strings with noise and time delay in mind. Use the Needleman-Wunsch global sequence alignment algorithm. Handling of time delay: events that do not occur at the same time may still be related to each other. No negative time delay.
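A sketch of Needleman-Wunsch global alignment applied to two event strings. How the event method folds the time delay into the alignment is not spelled out on the slide, so this version makes several assumptions: the delay dT of a matched pair is taken to be the difference of the two string positions, a negative dT is forbidden by scoring it as minus infinity, and `toy_score` and the gap penalty of -0.5 are hypothetical stand-ins for the real scoring scheme.

```python
import math

def align_score(a, b, pair_score, gap=-0.5):
    """Needleman-Wunsch global alignment score of event strings a and b.

    pair_score(x, y, dT) scores matching event x of a against event y of b,
    where dT = (position in b) - (position in a) is used as the time delay.
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1], j - i)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

def toy_score(x, y, dT):
    """Hypothetical pair score: no negative delays, C is neutral."""
    if dT < 0:
        return -math.inf
    if x == "C" or y == "C":
        return 0.0
    return 1.0 if x == y else -1.0

regulator = "RRCFF"
target = "CRRCFF"   # same events, one step later
forward = align_score(regulator, target, toy_score)
backward = align_score(target, regulator, toy_score)
```

Because negative delays are excluded, the score is not symmetric: aligning the regulator against its delayed target scores higher than the reverse, which is what lets the approach suggest a direction of regulation.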

Scoring Matrix (1): Scoring Method for Event Method

        R            C    F
  R     S(dT)        0    -βS(dT)
  C     0            0    0
  F     [lost]       0    αS(dT)

0 < S(dT) ≤ 1; 0 ≤ α ≤ 1, 0 ≤ β ≤ 1. dT = time delay between two events. If dT < 0, match penalty = ∞.

Scoring Matrix (2): R-R matches are weighted more than F-F matches, because decreases in mRNA levels are less indicative. Any match with C is assigned a neutral score of 0: C is a region of uncertainty, which could be due to any number of reasons. Penalty for R-F matches. Scores are a function of the time delay dT.
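The scoring rules can be written as a small function. Several parts are assumptions: the form of S(dT) is not given on the slides, so `delay_weight` uses 1 / (1 + dT) purely as an example that satisfies 0 < S(dT) ≤ 1; an F-R mismatch is treated the same as an R-F mismatch; and the defaults α = 0.7 and β = 0.3 follow one reading of the parameter values quoted on the results slides.

```python
import math

def delay_weight(dT):
    """Hypothetical S(dT): 1 at zero delay, decaying as the delay grows."""
    return 1.0 / (1.0 + dT)

def event_pair_score(a_event, b_event, dT, alpha=0.7, beta=0.3):
    """Score for matching event a_event (gene A) with b_event (gene B)
    when B's event occurs dT time units after A's event."""
    if dT < 0:
        return -math.inf            # negative time delay: infinite penalty
    if "C" in (a_event, b_event):
        return 0.0                  # any match with C is neutral
    s = delay_weight(dT)
    if a_event == "R" and b_event == "R":
        return s                    # R-R: full weight
    if a_event == "F" and b_event == "F":
        return alpha * s            # F-F: weighted less than R-R
    return -beta * s                # R-F (and, by assumption, F-R): penalty
```

For example, `event_pair_score("R", "R", 1)` gives 0.5 under this choice of S(dT), while the same pair with dT = -1 is rejected outright.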

Example

Event vs. Correlation: Event scores high, but correlation scores low. Time delay lowers the correlation coefficient.

Event vs. Edge Detection: Event scores high, edge detection scores low. Bolded edges: what edge detection finds. Only edges A and B are close enough to be added to the score.

Spellman’s Data Sets: Snapshots of yeast cellular mRNA levels at regular time intervals using cDNA microarrays. 4 separate data sets based on the cell-arresting method used: α-arrest, elutriation, and CDC15 and CDC28 temperature-sensitive mutants. Yeast genome: ~6200 genes. Too many; need to reduce the search space.

Selecting Genes to Study: Want to restrict to genes related to cell-cycle regulation. Filkov et al. searched for known transcriptional regulation pairs in the Yeast Proteome Database: 888 transcriptional regulations, 486 genes, 647 activations, 241 inhibitions.

Pre-Processing Data: Microarray data by Spellman contains many missing points (experimental errors). Use linear interpolation to fill in the missing points. If the ratio of missing points to valid points is greater than the threshold, ignore the gene data in question.
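A minimal sketch of this pre-processing step, assuming missing points are stored as NaN and time points are evenly spaced. The function name `fill_missing` and the threshold value of 0.25 are hypothetical; the slides do not state the threshold that was used.

```python
import numpy as np

def fill_missing(profile, max_missing_ratio=0.25):
    """Fill NaN entries by linear interpolation.

    Returns None when missing/valid exceeds max_missing_ratio,
    meaning the gene should be ignored.
    """
    x = np.asarray(profile, dtype=float)
    missing = np.isnan(x)
    n_missing = int(missing.sum())
    n_valid = x.size - n_missing
    if n_valid == 0 or n_missing / n_valid > max_missing_ratio:
        return None

    idx = np.arange(x.size)
    filled = x.copy()
    # np.interp interpolates linearly between valid neighbours; missing points
    # before the first or after the last valid point take the nearest valid value.
    filled[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return filled

kept = fill_missing([1.0, np.nan, 3.0, 4.0, 5.0])     # 1 missing / 4 valid
dropped = fill_missing([1.0, np.nan, np.nan, 4.0])    # 2 missing / 2 valid
```

Here the first profile is repaired and the second is rejected because too much of it would have to be invented.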

Analysis of the Test Set (1): α and CDC28 data sets analyzed. [Table with columns Data Set, # ORFs, # Genes for the α and CDC28 data sets; values not preserved in the transcript.] Need to compare each gene with all the others: >120,000 comparisons for alpha, >200,000 comparisons for CDC28.

Analysis of the Test Set (2): Correlation and edge detection methods have no directionality of regulation, so they need only ½ as many comparisons as the event method. To make comparison possible, remove the directionality aspect from the event method as well.
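The factor of ½ follows from counting pairs: a directional method scores (A, B) and (B, A) separately, while a non-directional method scores each unordered pair once. The sketch below shows the arithmetic; the gene count of 400 is a hypothetical value, since the actual counts are not preserved in the transcript.

```python
def n_comparisons(n_genes, directed=True):
    """Number of gene pairs to score among n_genes genes."""
    ordered_pairs = n_genes * (n_genes - 1)   # (A, B) and (B, A) counted separately
    return ordered_pairs if directed else ordered_pairs // 2

n = 400  # hypothetical number of genes
directed = n_comparisons(n)                     # event method with directionality
undirected = n_comparisons(n, directed=False)   # correlation, edge detection
```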

Analysis Results (1): Overlapping results among the 3 methods (all results). [Table with columns Methods, Alpha, CDC28 and rows Event + Correlation, Event + Edge, Correlation + Edge; values not preserved in the transcript.] α=0.7, -β =0.3 used for the scoring matrix. Top-10,000 rankings.

Analysis Results (2): Overlapping results among the 3 methods (true positive results only).

  Methods               Alpha / CDC28
  Event + Correlation   119 [the two column values are fused in the transcript]
  Event + Edge          0 / 0
  Correlation + Edge    0 / 0

α=0.7, -β =0.3 used for the scoring matrix. Top-10,000 rankings.

Analysis Results (3): < 1/3 of the results of any 2 methods overlap. The event method finds significantly different pairs from the other methods. Very little overlap between true positives. Consistent with the fact that the 3 methods employ different search strategies: local vs. global similarity.
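The overlap between two methods can be computed as a set intersection of their top-k rankings. This is a sketch with hypothetical gene names and scores; the slides used k = 10,000, while the toy example uses k = 2.

```python
def top_k_pairs(scores, k):
    """Set of the k highest-scoring gene pairs in a {pair: score} mapping."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def overlap(scores_a, scores_b, k):
    """Gene pairs that appear in the top-k rankings of both methods."""
    return top_k_pairs(scores_a, k) & top_k_pairs(scores_b, k)

# Hypothetical scores for three gene pairs under two methods.
event_scores = {("g1", "g2"): 0.9, ("g1", "g3"): 0.8, ("g2", "g3"): 0.1}
corr_scores = {("g1", "g2"): 0.7, ("g1", "g3"): 0.2, ("g2", "g3"): 0.6}

shared = overlap(event_scores, corr_scores, k=2)
```

Restricting both sets to known regulatory pairs before intersecting would give the overlap among true positives only.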

True (+) distribution for top-k results, 0 < k < 10,000. [Plots: Alpha data set; CDC28 data set.]

Effects of Time Delay (1): Perform time-shifting experiments and see how the score changes. [Table with columns Gene 1, Gene 2, Correlation, Edge, Event; three rows each for the pairs YDR225W / YDR224C and YMR199W / YPL256C; score values not preserved in the transcript.]

Effects of Time Delay (2): Correlation coefficients drop rapidly as time delay is introduced. This supports the assertion that correlation cannot handle time delay gracefully. Unexpected drop in edge detection scores, probably due to a problem in finding significant edges to compare.

Effects of Scoring Matrix Parameters: True (+) for Event Method. [Table with columns α, -β, Alpha Act., Alpha Inh., CDC28 Act., CDC28 Inh.; values not preserved in the transcript.]

Problems with Results: Many genes shared identical expression curves, including unrelated genes (poor resolution of data). Edge detection method: too many scores of 0; it simply cannot find enough edges; significance of scores doubtful.

More Notes on Edge: Cumulative distribution function for Edge. Zero scores make up the vertical column.

Synthetic Data Sets (1): Spellman’s data sets are not enough to test the algorithms properly. 4 different data sets: constant time delay, irregular time delay, partial matching, differential weighting of events.

Synthetic Data Sets (2): Each data set consists of an equal number of gene profiles and random profiles. Gene profiles: gene i. Random profiles: random i. gene i and gene i+x are related; the match is better if x is smaller.
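One way such a data set could be generated for the constant-time-delay case is sketched below. The slides do not describe the actual generator, so everything here is an assumption: the sine-wave base curve, the period of 10, the uniform random profiles, and the function name `constant_delay_set`. In this construction gene i+1 is gene i delayed by a fixed number of time points, so gene i and gene i+x are separated by x delays.

```python
import numpy as np

def constant_delay_set(n_genes=5, n_points=20, delay=1, seed=0):
    """Return (genes, randoms): two arrays of shape (n_genes, n_points).

    genes[i + 1] is genes[i] delayed by `delay` time points;
    randoms are unrelated uniform-noise profiles.
    """
    rng = np.random.default_rng(seed)
    # Base curve long enough to cut n_genes shifted windows out of it.
    t = np.arange(n_points + delay * (n_genes - 1))
    base = np.sin(2 * np.pi * t / 10)

    genes = []
    for i in range(n_genes):
        offset = (n_genes - 1 - i) * delay   # later genes start earlier in base
        genes.append(base[offset:offset + n_points])
    randoms = rng.uniform(-1.0, 1.0, size=(n_genes, n_points))
    return np.array(genes), randoms

genes, randoms = constant_delay_set()
```

A method under test would then be expected to rank (gene i, gene i+1) above (gene i, gene i+2), and both above any pair involving a random profile.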

Synthetic Data Sets (3): Avg. No. of True (+). [Table with columns Data Set, Correlation, Event and rows Constant Time Delay, Irregular Time Delay, Partial Matching, Differential Weighting; values not preserved in the transcript.] Event method superior except in partial matching. Could not test the edge detection method: it could not produce non-zero scores.

Summary: Event Method: finds potential regulatory pairs from gene expression data. Based on key features of gene expression. Computationally efficient. Performs comparably to the correlation and edge detection methods in finding true (+) from Spellman’s data sets. Outperforms correlation on the synthetic data sets.

Future Work (1): Limitation of real-world data: obtain data with better resolution; integrate data with other a priori knowledge; narrow down the focus to transcription factors. More realistic synthetic data: realistic modeling of an artificial regulatory network.

Future Work (2): Transitive Closure: If E12 and E23 have higher scores than E13, Node 3 would be only conditionally dependent on Node 1. It would then make sense to remove E13 from the pair rankings in order to accommodate other potential pairs.
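The proposed pruning rule can be sketched as follows: an edge from node i to node k is dropped when some intermediate node j exists such that both the i-to-j and j-to-k edges score higher. The function name `prune_transitive` and the example scores are hypothetical, and this is only one possible reading of the rule, not an implementation from the presentation.

```python
import math

def prune_transitive(scores):
    """Remove edge (i, k) when edges (i, j) and (j, k) both score higher.

    scores maps a directed edge (source, target) to its pair score.
    Decisions are made against the original scores, not the pruned set.
    """
    nodes = {node for edge in scores for node in edge}
    kept = {}
    for (i, k), s_ik in scores.items():
        explained = any(
            scores.get((i, j), -math.inf) > s_ik
            and scores.get((j, k), -math.inf) > s_ik
            for j in nodes - {i, k}
        )
        if not explained:
            kept[(i, k)] = s_ik
    return kept

# E12 and E23 both outscore E13, so E13 is removed.
pruned = prune_transitive({(1, 2): 0.9, (2, 3): 0.8, (1, 3): 0.5})

# Here E23 scores lower than E13, so E13 stays.
unpruned = prune_transitive({(1, 2): 0.9, (2, 3): 0.4, (1, 3): 0.5})
```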

Future Work (3): Improvement of the event method: different number of event types. Global regulatory network: combine pairings by the event method to form potential networks. Other uses for the event method: different types of data, such as proteins; adaptation to other fields may be possible.