Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms

Presentation transcript:

1 Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
Critical Assessment of Microarray Data Analysis Conference, November 11, 2004
Earl F. Glynn, Stowers Institute
Arcady R. Mushegian, Stowers Institute & Univ. of Kansas Medical Center
Jie Chen, Stowers Institute & Univ. of Missouri Kansas City

2 Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
Periodic Patterns in Biology
Introduction to Lomb-Scargle Periodogram
Data Pipeline
Application to Bozdech’s Plasmodium dataset
Conclusions

3 Periodic Patterns in Biology
A vertebrate’s body plan: a segmented pattern. Segmentation is established during somitogenesis.
(Photograph taken at Reptile Gardens, Rapid City, SD.)

4 Periodic Patterns in Biology
Intraerythrocytic Developmental Cycle of Plasmodium falciparum. From Bozdech, et al., Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.
$\text{Expression Ratio} = \frac{\text{RNA from parasitized red blood cells}}{\text{RNA from all development cycles}} = \frac{\text{Cy5}}{\text{Cy3}}$
Values for $\log_2(\text{Expression Ratio})$ are approximately normally distributed.
Assume gene expression reflects observed biological periodicity.

5 Simple Periodic Gene Expression Model
(Figure: expression versus time, switching between “On” and “Off” with period T.)
frequency = 1 / period: $f = 1/T$
Gene Expression = Constant $\times \cos(2\pi f t)$
$\omega$ = angular frequency = $2\pi f$
“Periodic” if only observed over a single cycle?
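The cosine model on this slide can be written as a short function. This is a minimal sketch, not the authors' code: the function name `cosine_expression`, the default amplitude of 1, the 48-hour default period, and the 4-hour sampling step are all assumptions chosen for illustration.

```python
import math

def cosine_expression(t, amplitude=1.0, period=48.0):
    """Model expression at time t: amplitude * cos(2*pi*f*t), with f = 1/period.

    The amplitude and period defaults are illustrative assumptions.
    """
    frequency = 1.0 / period                  # f = 1/T
    omega = 2.0 * math.pi * frequency         # angular frequency
    return amplitude * math.cos(omega * t)

# Sample one full cycle every 4 hours (hypothetical sampling scheme).
times = list(range(0, 48, 4))
profile = [cosine_expression(t) for t in times]
```

With only one cycle observed, as in `profile`, the data cannot distinguish a truly periodic gene from one that simply rises and falls once, which is the question the slide raises.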

6 Introduction to Lomb-Scargle Periodogram
What is a Periodogram?
Why Lomb-Scargle Instead of Fourier?
Example Using Cosine Expression Model
Mathematical Details
Mathematical Experiments
- Single Dominant Frequency
- Multiple Frequencies
- Mixtures: Signal and Noise

7 What is a Periodogram?
A graph showing frequency “power” for a spectrum of frequencies.
A “peak” in the periodogram indicates a frequency with significant periodicity.
(Figure: a periodic signal, $\log_2$(Expression) versus time, is converted by computation into a periodogram, spectral “power” versus frequency.)

8 Why Lomb-Scargle Instead of Fourier?
Missing data handled naturally: no data imputation needed.
Any number of points can be used: no need for $2^N$ data points as with the FFT.
Lomb-Scargle periodogram has known statistical properties.
Note: The Lomb-Scargle algorithm is NOT equivalent to conventional periodogram analysis based on Fourier analysis.
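To illustrate how missing time points are handled without imputation, here is a minimal pure-Python sketch of the normalized Lomb-Scargle periodogram in the form given in Numerical Recipes. It is not the authors' implementation. The 48-hour cosine, the set of dropped time points, and the frequency grid are hypothetical choices made for this example.

```python
import math

def lomb_scargle(times, values, freqs):
    """Normalized Lomb-Scargle power at each frequency (cycles per time unit)."""
    n = len(times)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # tau makes the estimate independent of the time origin
        tau = math.atan2(sum(math.sin(2.0 * w * t) for t in times),
                         sum(math.cos(2.0 * w * t) for t in times)) / (2.0 * w)
        yc = ys = cc = ss = 0.0
        for t, v in zip(times, values):
            c = math.cos(w * (t - tau))
            s = math.sin(w * (t - tau))
            yc += (v - mean) * c
            ys += (v - mean) * s
            cc += c * c
            ss += s * s
        powers.append((yc * yc / cc + ys * ys / ss) / (2.0 * var))
    return powers

# Hourly sampling over 48 hours with five time points missing (hypothetical).
missing = {7, 8, 23, 29, 40}
times = [float(t) for t in range(48) if t not in missing]
values = [math.cos(2.0 * math.pi * t / 48.0) for t in times]

# Frequency grid in cycles per hour (an arbitrary illustrative choice).
freqs = [k / 192.0 for k in range(1, 41)]
powers = lomb_scargle(times, values, freqs)
best_freq = freqs[powers.index(max(powers))]
print("peak period (hours):", 1.0 / best_freq)
```

The function takes whatever time points are available, so the 43 remaining samples are used directly. An FFT-based analysis of the same profile would first need the five gaps filled in.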

9 Lomb-Scargle Periodogram Example Using Cosine Expression Model
Evenly spaced time points. $T = 1/f$.
A small value for the false-alarm probability indicates a highly significant periodic signal.

10 Lomb-Scargle Periodogram Example Using Noisy Cosine Expression Model Unevenly-spaced time points

11 Lomb-Scargle Periodogram Example Using Noise

12 Lomb-Scargle Periodogram Mathematical Details
Source: Numerical Recipes in C (2nd Ed), p. 577
$P_N(\omega)$ has an exponential probability distribution with unit mean.
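The equation shown on this slide is not preserved in the transcript. For reference, the normalized periodogram as commonly stated from the Numerical Recipes source cited on the slide is given below, where $h_j$ is the value measured at time $t_j$, and $\bar{h}$ and $\sigma^2$ are the sample mean and variance of the $N$ values. The symbol names follow that source and may differ from the slide.

```latex
P_N(\omega) = \frac{1}{2\sigma^2}
\left\{
  \frac{\left[\sum_j (h_j - \bar{h}) \cos\omega(t_j - \tau)\right]^2}
       {\sum_j \cos^2\omega(t_j - \tau)}
  +
  \frac{\left[\sum_j (h_j - \bar{h}) \sin\omega(t_j - \tau)\right]^2}
       {\sum_j \sin^2\omega(t_j - \tau)}
\right\},
\qquad
\tan(2\omega\tau) = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}
```

The offset $\tau$ makes $P_N(\omega)$ independent of the choice of time origin. Under the null hypothesis of independent Gaussian noise, the unit-mean exponential distribution means $\mathrm{Prob}(P_N(\omega) > z) = e^{-z}$ at any single frequency.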

13 Mathematical Experiment: Single Dominant Frequency
Expression = $\cos(2\pi t/24)$
Single “peak” in periodogram. Single “valley” in significance curve.
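A peak in the periodogram maps to a valley in the significance curve because the false-alarm probability falls as the power rises. The sketch below assumes the Numerical Recipes form $p = 1 - (1 - e^{-z})^M$, where $z$ is the periodogram power and $M$ is the number of independent frequencies. Setting $M = 48$ is an assumption for illustration only; the slides note separately that $M$ must be estimated.

```python
import math

def false_alarm_probability(power, n_independent):
    """p = 1 - (1 - exp(-power))**M, evaluated stably for very small p."""
    x = math.exp(-power)
    if x >= 1.0:
        # Zero (or negative) power carries no evidence of periodicity.
        return 1.0
    return -math.expm1(n_independent * math.log1p(-x))

M = 48  # assumed number of independent frequencies (illustrative)
for z in (2.0, 5.0, 10.0, 23.5):
    print(z, false_alarm_probability(z, M))
```

The `expm1`/`log1p` pair avoids the loss of precision that the direct formula suffers when the power is large and $p$ is very close to zero.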

14 Mathematical Experiment: Multiple Frequencies
Expression = $\cos(2\pi t/48) + \cos(2\pi t/24) + \cos(2\pi t/8)$
Multiple peaks in periodogram. Corresponding valleys in significance curve.

15 Mathematical Experiment: Multiple Frequencies
Expression = $3\cos(2\pi t/48) + \cos(2\pi t/24) + \cos(2\pi t/8)$
“Weaker” periodicities cannot always be resolved statistically.

16 Mathematical Experiment: Multiple Frequencies: “Duty Cycle”
50% duty cycle: one peak with symmetric “duty cycle”.
66.6% duty cycle (e.g., human sleep cycle): multiple peaks with asymmetric cycle.
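The duty-cycle effect can be reproduced with square waves. The following sketch is an illustration under assumed settings, not the experiment from the slide: a 24-hour on/off profile sampled hourly for 48 hours, compared at the first three harmonics of 1/24 cycles per hour. The helper names are hypothetical.

```python
import math

def lomb_scargle_power(times, values, freq):
    """Normalized Lomb-Scargle power at one frequency (cycles per time unit)."""
    n = len(times)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    w = 2.0 * math.pi * freq
    tau = math.atan2(sum(math.sin(2.0 * w * t) for t in times),
                     sum(math.cos(2.0 * w * t) for t in times)) / (2.0 * w)
    yc = ys = cc = ss = 0.0
    for t, v in zip(times, values):
        c = math.cos(w * (t - tau))
        s = math.sin(w * (t - tau))
        yc += (v - mean) * c
        ys += (v - mean) * s
        cc += c * c
        ss += s * s
    return (yc * yc / cc + ys * ys / ss) / (2.0 * var)

def square_wave(times, period, hours_on):
    """1.0 while 'on' for the first hours_on of each period, else 0.0."""
    return [1.0 if (t % period) < hours_on else 0.0 for t in times]

times = list(range(48))                  # hourly samples, two full cycles
symmetric = square_wave(times, 24, 12)   # 50% duty cycle
asymmetric = square_wave(times, 24, 16)  # 66.6% duty cycle

harmonics = [k / 24.0 for k in (1, 2, 3)]
sym_power = [lomb_scargle_power(times, symmetric, f) for f in harmonics]
asym_power = [lomb_scargle_power(times, asymmetric, f) for f in harmonics]
print("50% duty cycle:  ", sym_power)
print("66.6% duty cycle:", asym_power)
```

In this setup the symmetric wave has essentially no power at the second harmonic (a 12-hour period), while the asymmetric wave does, which is one way additional peaks can arise from a single underlying rhythm.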

17 Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise
(Figure: “p” histograms for three mixtures: 100% periodic genes; 50% periodic, 50% noise; 100% noise.)

18 Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise
50% periodic, 50% noise
Multiple-Hypothesis Testing
(Figure: p adjustment methods ordered from more false negatives to more false positives: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, None.)
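The four adjustment methods named on this slide can be sketched in a few lines each. These follow the standard textbook definitions of the procedures; the slides do not say which software was used, and the example p-values and the 0.05 threshold are made up for illustration.

```python
def bonferroni(p):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p)
    return [min(1.0, m * x) for x in p]

def holm(p):
    """Step-down: running maximum of (m - rank) * p over ascending p."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def hochberg(p):
    """Step-up: running minimum of (rank + 1) * p over descending p."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for rank, i in enumerate(order):
        running = min(running, (rank + 1) * p[i])
        adjusted[i] = running
    return adjusted

def benjamini_hochberg(p):
    """FDR control: running minimum of p * m / k, k = ascending rank."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for rank, i in enumerate(order):
        k = m - rank
        running = min(running, p[i] * m / k)
        adjusted[i] = running
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]   # hypothetical per-profile p-values
alpha = 0.05                           # hypothetical significance level
methods = [bonferroni, holm, hochberg, benjamini_hochberg, list]
counts = [sum(1 for x in method(p_values) if x <= alpha) for method in methods]
print(counts)
```

Here `list` stands in for "None" (no adjustment). The counts of significant profiles never decrease from Bonferroni to no adjustment, matching the ordering from more false negatives to more false positives.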

19 Data Pipeline to Apply to Bozdech’s Data
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all expression profiles
3. Apply multiple hypothesis testing to define “significant” genes
4. Analyze biological significance of significant genes

20 Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks Global views of experiment. Remove certain outliers.

21 Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks Many missing data points require imputation for Fourier analysis.

22 Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm A weak diurnal period is visible in “mean” data profile.

23 Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm Periodic Expression Patterns Examples of highly-significant periodic expression profiles.

24 Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm Aperiodic/Noise Expression Patterns

25 Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm
Small “N”: N=39, N=32

26 Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm Signal and Noise Mixture

27 Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing
Significance: $\alpha$ = 1E-4
(Figure: p adjustment methods ordered from more false negatives to more false positives: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, None.)

28 Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing
(Table: p adjustment method (Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, None) versus $\alpha$ significance level.)
A priori plan: Use Benjamini & Hochberg FDR level of
Observed number of periodic probes is consistent with the biological observation of ~60% of the Plasmodium genome being transcriptionally active during the intraerythrocytic developmental cycle.

29 Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, $\alpha$ = 1E-4 significance
Comparison with Bozdech’s Results
(Table columns: Dataset; N time series points; Probes; Lomb-Scargle Periodic; %. Rows include Bozdech Complete, the Bozdech Quality Control Dataset, and Total.)
While Lomb-Scargle identified 243 new low “N” periodic probes, the low percentage in that group may indicate some other problem.

30 Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, $\alpha$ = 1E-4 significance
Comparison with Bozdech’s Results
(Table columns: Dataset; Probes; Lomb-Scargle Periodic. Row: Bozdech Overview.)
It is unclear how to apply Bozdech’s ad hoc “Overview” criteria for use with the Lomb-Scargle method: “70% power in max frequency with top 75% of max frequency magnitude.”
The best 3711 Lomb-Scargle “p” values contained 3449 (92.9%) of the Overview probes.

31 Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
“Phaseograms” (axes: Time versus Probes Ordered by Phase)
Lomb-Scargle Results: 4358 probes
Bozdech “Overview” Dataset: 2714 genes, 3395 probes
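The slides do not say how the phase used to order the probes was computed. One simple possibility, sketched here as an assumption, is to project each profile onto a cosine and a sine at the dominant 48-hour period and sort by the implied time of peak expression. This shortcut assumes evenly spaced samples covering whole cycles; profiles with gaps would need a least-squares fit instead. The three synthetic profiles are hypothetical.

```python
import math

def peak_time(times, values, period):
    """Estimated time of peak expression for a profile with the given period."""
    w = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    c = sum((v - mean) * math.cos(w * t) for t, v in zip(times, values))
    s = sum((v - mean) * math.sin(w * t) for t, v in zip(times, values))
    phase = math.atan2(s, c) % (2.0 * math.pi)   # phase in [0, 2*pi)
    return phase / w

times = list(range(48))          # hourly samples over one 48-hour cycle
period = 48.0
true_peaks = [10.0, 30.0, 20.0]  # hypothetical peak times for three probes
profiles = [[math.cos(2.0 * math.pi * (t - p) / period) for t in times]
            for p in true_peaks]

peaks = [peak_time(times, profile, period) for profile in profiles]
order = sorted(range(len(profiles)), key=lambda i: peaks[i])
print(order)
```

Plotting the profiles as rows in `order` gives the diagonal band characteristic of a phaseogram.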

32 Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, $\alpha$ = 1E-4 significance
Periodogram Map (axes: Frequency or Period versus Probes Ordered by Peak Frequency)
Shows periodograms, not expression profiles.
Shows frequency space, not time.
Dominant frequency band corresponds to 48-hr period.
Are “weak” bands indicative of complex expression, perhaps a diurnal component, or an asymmetric “duty cycle”?

33 Summary
Lomb-Scargle Method | Fourier Method
Weights data points | Weights frequency intervals
No special requirement | Requires uniform spacing
No special processing | Missing data imputed
No special requirement | $2^N$ points for FFT; 0 padding
Known statistical properties | Permutation tests needed to assess statistical properties
Use “p” values | Ad hoc scoring rules
Need estimate of number of “independent frequencies” but explore using continuum | Usually only look at “independent” Fourier frequencies

34 Conclusions
Lomb-Scargle periodogram is an effective tool to identify periodic gene expression profiles.
Results are comparable with Fourier analysis.
Lomb-Scargle can help when data are missing or not evenly spaced.
We wanted to validate the Lomb-Scargle method before applying it to our somitogenesis problem, since the Fourier technique would be difficult to use.
Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples...”

35 Conclusions
Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure, False Discovery Rate (FDR), must be used to control the error rate.
Expression profiles may be more complex than simple cosine curves.
Power spectra of non-sinusoid rhythms are more difficult to interpret.

36 Supplementary Information

37 Acknowledgements
Stowers Institute for Medical Research
Pourquie Lab: Olivier Pourquie, Mary-Lee Dequeant