Proteomics Informatics – Data Analysis and Visualization (Week 13)

Statistics

Data Visualization -visualization-points-of-view.html

Protein Identification [Workflow diagram: Lysis, Fractionation, Digestion, LC-MS, MS/MS. Search loop against the sequence DB: pick protein, pick peptide, all fragment masses; compare, score, test significance; repeat for all peptides; repeat for all proteins]

Search Results

Significance Testing False protein identification is caused by random matching. An objective criterion for testing the significance of protein identification results is necessary. The significance of protein identifications can be tested once the distribution of scores for false results is known.

Distribution of Extreme Values [Figure: panels for Normal and Skewed distributions, each at n=3, n=10, n=100]

Significance Testing - Expectation Values The majority of sequences in a collection will give a score due to random matching.

Significance Testing - Expectation Values [Flow diagram: M/Z; Database Search; List of Candidates; Distribution of Scores for Random and False Identifications; Extrapolate and Calculate Expectation Values; List of Candidates With Expectation Values]
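One way to read the "extrapolate and calculate expectation values" step is sketched below. It assumes that the logarithm of the number of random matches scoring at or above a given score falls off linearly with the score, fits a straight line to that curve, and extrapolates it to the score of interest. The function names, the linear-tail assumption, and the toy score list are illustrative assumptions; the slides do not specify how the extrapolation is done.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def expectation_value(score, random_scores):
    """Expected number of random matches scoring >= score.

    Assumption: log10 of the survival count is linear in the score,
    so a fitted line can be extrapolated beyond the observed scores.
    """
    ordered = sorted(random_scores)
    n = len(ordered)
    xs, ys = [], []
    for i, s in enumerate(ordered):
        if i == 0 or s != ordered[i - 1]:
            xs.append(s)
            ys.append(math.log10(n - i))  # n - i scores are >= s
    intercept, slope = fit_line(xs, ys)
    return 10 ** (intercept + slope * score)


# Hypothetical scores for random matches: the number of matches
# scoring >= k halves with every unit increase in k.
random_scores = [k for k in range(10) for _ in range(2 ** (9 - k))] + [10]
e_value = expectation_value(20, random_scores)
print(f"Expectation value at score 20: {e_value:.2e}")
```

A small expectation value means that a score this high is unlikely to arise from random matching alone. In practice the fit would usually be restricted to the high-scoring tail of the distribution rather than to all scores.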

Application: Analytical Measurements [Plot axes: Theoretical Concentration, Measured Concentration]

A Few Characteristics of Analytical Measurements
Accuracy: Closeness of agreement between a test result and an accepted reference value.
Precision: Closeness of agreement between independent test results.
Robustness: Test precision given small, deliberate changes in test conditions (preanalytic delays, variations in storage temperature).
Lower limit of detection: The lowest amount of analyte that is statistically distinguishable from background or a negative control.
Limit of quantification: Lowest and highest concentrations of analyte that can be quantitatively determined with suitable precision and accuracy.
Linearity: The ability of the test to return values that are directly proportional to the concentration of the analyte in the sample.

Measuring Blanks

Coefficient of Variation [Equations: Sample Mean, Variance, Coefficient of Variation (CV)]
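The three quantities named on this slide can be computed as in the minimal sketch below. It assumes the usual definitions: the sample variance with an n-1 denominator, and the CV as the sample standard deviation divided by the sample mean. The slide's own formulas did not survive in the transcript, so the choice of denominator is an assumption, and the replicate values are made up.

```python
import statistics


def coefficient_of_variation(values):
    """CV = sample standard deviation / sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # square root of the n-1 sample variance
    return sd / mean


# Hypothetical replicate measurements of the same sample.
replicates = [9.0, 10.0, 11.0]
mean = statistics.mean(replicates)
variance = statistics.variance(replicates)
cv = coefficient_of_variation(replicates)
print(f"mean={mean}, variance={variance}, CV={cv:.0%}")
```

Because the CV is scaled by the mean, it allows the precision of measurements at very different concentrations to be compared directly.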

Lower Limit of Detection
The lowest amount of analyte that is statistically distinguishable from background or a negative control. Two methods to determine the lower limit of detection:
1. Lowest concentration of the analyte where the CV is less than, for example, 20%.
2. Determine the level of the blank by taking the 95th percentile of the blank measurements, and add a constant times the standard deviation of the lowest concentration.
K. Linnet and M. Kondratovich, Partly Nonparametric Approach for Determining the Limit of Detection, Clinical Chemistry 50 (2004) 732–740.
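Both methods can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure from the cited paper: the percentile uses simple linear interpolation, the constant defaults to 1.645 (the one-sided 95% normal quantile, chosen here only as a plausible example since the slide says just "a constant"), and all function names and data are hypothetical.

```python
import statistics


def percentile(values, fraction):
    """Linear-interpolation percentile; fraction is in [0, 1]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def lod_by_cv(measurements_by_conc, max_cv=0.20):
    """Method 1: lowest concentration whose replicate CV is below max_cv."""
    passing = [
        conc
        for conc, reps in measurements_by_conc.items()
        if statistics.stdev(reps) / statistics.mean(reps) < max_cv
    ]
    return min(passing) if passing else None


def lod_by_blank(blanks, lowest_conc_reps, k=1.645):
    """Method 2: 95th percentile of blanks + k * SD of the lowest sample.

    k = 1.645 is an assumed example value for the constant.
    """
    limit_of_blank = percentile(blanks, 0.95)
    return limit_of_blank + k * statistics.stdev(lowest_conc_reps)


# Hypothetical data: replicates per theoretical concentration, and blanks.
measurements = {
    1.0: [0.5, 1.5, 1.0],
    5.0: [4.5, 5.5, 5.0],
    10.0: [9.5, 10.5, 10.0],
}
blanks = [float(i) for i in range(21)]
lowest_sample = [20.0, 22.0, 24.0]

lod_cv = lod_by_cv(measurements)
lod_blank = lod_by_blank(blanks, lowest_sample)
print(lod_cv, round(lod_blank, 2))
```

Method 1 returns a concentration, while method 2 returns a value on the measured-signal scale, so the two results are not directly comparable without a calibration curve.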

Limit of Detection and Linearity [Plot axes: Theoretical Concentration, Measured Concentration]

Precision and Accuracy [Plot axes: Theoretical Concentration, Measured Concentration]

A Data Set with Two Samples

A proteomics example – no replicates

A proteomics example – three replicates [Figure: panels "no replicates" and "three replicates"; axes: Log 2 Spectrum Count Ratio vs. Log 2 Sum Spectrum Count, and Log 2 Standard Deviation vs. Log 2 Average Spectrum Count]

How Different are Two Measurements?

A Data Set with Seven Samples [Figure: 3 replicates; 3 replicates + one more replicate a few months later; normalized]

A Data Set with Seven Samples

Box Plot M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

Box Plots [Figure: panels for Complex, Normal, Skewed, and Long tails distributions, each at n=5, n=10, n=100]

Box Plots with All the Data Points [Figure: panels for Complex, Normal, Skewed, and Long tails distributions, each at n=5, n=10, n=100]

Box Plots, Scatter Plots and Bar Graphs: Normal Distribution [Figure: panels with error bars showing standard deviation, standard deviation, and standard error]

Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution [Figure: panels with error bars showing standard deviation, standard deviation, and standard error]

Box Plots, Scatter Plots and Bar Graphs: Distribution with Fat Tail [Figure: panels with error bars showing standard deviation, standard deviation, and standard error]
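The quantities behind these plots can be computed side by side, as in the sketch below: the mean with standard deviation and standard error (what bar graph error bars show) and the quartiles (what a box plot shows). It assumes the standard error is the sample standard deviation divided by the square root of n, and it uses the "inclusive" quartile method of Python's statistics module, which is only one of several quartile conventions. The data are made up.

```python
import math
import statistics


def summarize(values):
    """Summary statistics used for error bars and box plots."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": sd,                    # spread of the data
        "se": sd / math.sqrt(n),     # uncertainty of the mean
        "q1": q1,
        "median": median,
        "q3": q3,
    }


# Hypothetical measurements.
summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The standard error shrinks as n grows while the standard deviation does not, so the two kinds of error bar answer different questions. For skewed or fat-tailed data, the median and quartiles describe the sample better than the mean with a symmetric error bar.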

Venn Diagrams

TCGA – Unsupervised mRNA Expression Analysis The Cancer Genome Atlas Network, Comprehensive molecular portraits of human breast tumors. Nature. 490 (7418):61-70.

Correlations between mRNA and protein abundance in TCGA colon tumors B Zhang et al. Nature 000, 1-6 (2014) doi: /nature13438

The Effect of Copy Number Alterations B Zhang et al. Nature 000, 1-6 (2014) doi: /nature13438

The Effect of Copy Number Alterations

Testing multiple hypotheses
Is the concentration of calcium/calmodulin-dependent protein kinase type II different between the two samples ($p = 2 \times 10^{-6}$)? Which protein concentrations are different between the two samples? The p-value needs to be corrected to take into account that we perform many tests. Bonferroni correction: multiply the p-value by the number of tests performed (n): $p_{corr} = p_{uncorr} \times n$. In this case, 3685 proteins are identified, so the Bonferroni-corrected p-value for calcium/calmodulin-dependent protein kinase type II is $p_{corr} = 2 \times 10^{-6} \times 3685 \approx 0.007$.
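The arithmetic on this slide can be checked with a few lines of code. Capping the corrected value at 1 is a common convention added here as an assumption; the slide only gives the multiplication.

```python
def bonferroni(p_uncorrected, n_tests):
    """Bonferroni correction: p_corr = p_uncorr * n, capped at 1."""
    return min(1.0, p_uncorrected * n_tests)


# Values from the slide: p = 2e-6 and 3685 identified proteins.
p_corr = bonferroni(2e-6, 3685)
print(f"Bonferroni-corrected p-value: {p_corr:.3f}")
```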

Testing multiple hypotheses The p-value distribution is uniform when testing differences between samples from the same distribution. [Figure: histograms of # of tests vs. p-value (0 to 1); normal distribution, sample size = 10; panels: 10,000 tests, 1,000 tests, 100 tests]
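The uniformity of p-values under the null hypothesis can be checked with a small simulation. The sketch below draws both samples of each test from the same standard normal distribution and counts p-values in ten equal bins. As a simplifying assumption it uses a two-sided z-test with known σ = 1 instead of a t-test, because the standard library provides the normal but not the t distribution; the slides do not state which test was used. Function names and the seed are arbitrary.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test for a difference in means with known sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_null_p_values(n_tests, sample_size=10, seed=0):
    """p-values for tests where both samples come from N(0, 1)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(sample_size)]
        b = [rng.gauss(0, 1) for _ in range(sample_size)]
        p_values.append(z_test_p_value(a, b))
    return p_values


p_values = simulate_null_p_values(10_000)
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1  # ten bins of width 0.1
print(bins)
```

With 10,000 tests each bin should hold roughly 1,000 p-values. This means that about 5% of tests fall below p = 0.05 even though no real difference exists, which is why the correction for multiple testing is needed.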

Testing multiple hypotheses The p-value distribution is uniform when testing differences between samples from the same distribution. [Figure: histograms of # of tests vs. p-value (0 to 1); normal distribution, with some tests from a distribution with a different mean (μ1 − μ2 >> σ); panels: 10,000 tests, 1,000 tests, 100 tests]

Testing multiple hypotheses: Controlling for False Discovery Rate (FDR) [Figure: False Discovery Rate vs. p-value (0 to 1); normal distribution, with some tests from a distribution with a different mean (μ1 − μ2 >> σ); panels: 10,000 tests, 1,000 tests, 100 tests]
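The slide does not name a specific procedure for controlling the FDR. One widely used choice is the Benjamini-Hochberg step-up procedure, sketched below as an illustration: sort the p-values, find the largest rank k with p(k) <= k * alpha / m, and reject the k hypotheses with the smallest p-values. The function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            n_rejected = rank  # keep the largest rank that passes
    return sorted(order[:n_rejected])


# Hypothetical p-values from five tests.
example_p = [0.041, 0.001, 0.205, 0.008, 0.06]
rejected = benjamini_hochberg(example_p)
print(rejected)
```

Here the Bonferroni threshold would be 0.05 / 5 = 0.01 for every test, while Benjamini-Hochberg compares the second-smallest p-value against 0.02, so it is less conservative when many tests are performed.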

Testing multiple hypotheses: False Discovery Rate (FDR) and False Negative Rate (FNR) [Figure: False Discovery Rate and False Negative Rate vs. p-value (0 to 1); normal distribution, with 30 tests from a distribution with a different mean; panels: μ1 − μ2 = 2σ, μ1 − μ2 = σ, μ1 − μ2 = σ/2]
