1 Statistics. Achim Tresch, UoC / MPIPZ Cologne, treschgroup.de/OmicsModule1415.html, tresch@mpipz.mpg.de

3 Two ways of dealing with uncertainty: Franziskus, Pope; Andrej Kolmogoroff, Mathematician.

4 Topics: I. Descriptive Statistics, II. Testing, III. Clustering, IV. Regression

5 What is "data"? A collection of observations of a similar structure. Rows: cases (samples, observations). Columns: endpoints (variables). Cells: values (instances) of a variable. Only one item per column! Meaningful variable names! The sample (the sample population) is a subset of the population: sample ⊆ population.

6 Different Scales of a Variable. Categorical variables have only a finite number of instances, e.g. male/female; Mon/Tue/…/Sun. Nominal data: categorical variables without a given order, e.g. eye color [brown, blue, green, grey]. Special case: binary (= dichotomous) variables (yes/no, 0/1, …). Ordinal data: instances are ordered in a natural way, e.g. tumor grade [I, II, III, IV], rank in a contest (1, 2, 3, …). Continuous variables can take values in an interval of the real numbers, e.g. blood pressure [mmHg], costs [€].

7 I. Description. "85% shinier hair!" Problem: It is often difficult to map a variable to an appropriate scale, e.g. metabolic activity, evolutionary success, pain, social status, customer satisfaction, anger. -> Check whether your choice of scale is meaningful!

8 Description of a categorical variable: Tables. Example: Blood antigens (ABO), n = 188 samples.

Value                  A     B     AB    0     Σ
(absolute) frequency   83    20    10    75    188
relative frequency     44%   11%   5%    40%   100%

Always list absolute frequencies! Do not list relative frequencies in percent if the sample size is small (n < 20). Do not use decimal digits in percent numbers for n < 300. Rule of thumb: use ca. log10(n) - 2 decimal digits. "Side effects were observed in 14.2857% of all cases": nonsense, we conclude that n = 7!
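
The table above can be reproduced from the absolute frequencies alone. A minimal sketch in R; the variable names are chosen here for illustration and do not appear on the slides:

```r
# Absolute frequencies from the slide (blood antigens, n = 188)
abs_freq <- c(A = 83, B = 20, AB = 10, "0" = 75)
n <- sum(abs_freq)

# Rule of thumb from the slide: about log10(n) - 2 decimal digits
digits <- max(0, round(log10(n) - 2))

# Relative frequencies in percent, rounded according to the rule of thumb
rel_freq <- round(100 * abs_freq / n, digits)

freq_table <- rbind("absolute frequency" = abs_freq,
                    "relative frequency (%)" = rel_freq)
print(freq_table)
```

For n = 188 the rule of thumb gives zero decimal digits, which matches the whole-number percentages in the table.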

9 Description of a categorical variable: Barplot. [Figure: barplot with two axes, absolute frequency and relative frequency (%).]

10 Description of continuous data: Histogram

11 The size of the bins (= width of the bars) is a matter of choice and has to be determined sensibly! [Figure: histograms of the same data with 50 bins, 12 bins and 4 bins.]
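
The effect of the bin size can be tried out on simulated data. A minimal sketch, assuming normally distributed toy data (not the data shown on the slide) and the bin counts 4, 12 and 50 mentioned there:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(200)                  # simulated data, not the data from the slides

# Equally wide bins that cover the whole data range
lo <- floor(min(x))
hi <- ceiling(max(x))
bin_edges <- function(k) seq(lo, hi, length.out = k + 1)

# plot = FALSE returns the bin counts without drawing; drop it to see the plots
h4  <- hist(x, breaks = bin_edges(4),  plot = FALSE)
h12 <- hist(x, breaks = bin_edges(12), plot = FALSE)
h50 <- hist(x, breaks = bin_edges(50), plot = FALSE)

# Same data, very different level of detail
sapply(list(h4, h12, h50), function(h) length(h$counts))
```

Passing explicit bin edges is a deliberate choice here: a single number given as `breaks` is only treated as a suggestion by `hist`.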

12 Description of continuous data: Density plot. Caution: Data will be smoothed automatically. This is very suggestive and blurs discontinuities in a distribution.

13 Most important: The Gaussian (= normal) distribution. [Figure: Gaussian density, annotated with expectation value and standard deviation.] C. F. Gauss (1777-1855). Roughly speaking, continuous variables that are the (additive) result of a lot of other random variables follow a Gaussian distribution. -> It is often sensible to assume a Gaussian distribution for continuous variables.

14 Measures of Location, Scale and Scatter. Mean: sum of all observations / number of observations. Ex.: observations: 2, 3, 7, 9, 14; sum: 2+3+7+9+14 = 35; # observations: 5; mean: 35/5 = 7. Median: A number M such that 50% of all observations are less than or equal to M, and 50% are greater than or equal to M. (Q: What if # observations is even?)
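
The example above, checked in R. The second vector (an even number of observations) and the outlier example are added here for illustration; R's `median` resolves the even case by averaging the two middle values, which is one common convention among the values that satisfy the definition.

```r
x <- c(2, 3, 7, 9, 14)       # observations from the slide
mean(x)                      # 35 / 5 = 7
median(x)                    # middle value of the sorted data: 7

# Even number of observations: any M between 3 and 7 fits the definition;
# R reports the midpoint of the two middle values
x_even <- c(2, 3, 7, 9)
median(x_even)               # (3 + 7) / 2 = 5

# One extreme outlier moves the mean a lot, the median not at all
x_out <- c(2, 3, 7, 9, 1400)
mean(x_out)                  # 1421 / 5 = 284.2
median(x_out)                # still 7
```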

15 Description of Location, Scale and Scatter: Mode. Mode: A value for which the density of the variable reaches a local maximum. If there is only one such value, the distribution is called unimodal, otherwise multimodal (special case: bimodal). The mode usually is an unstable description of a sample. [Figure: density with mean and median marked.]

16 Distribution Shapes. Symmetric: Mean ≈ Median. Skewed to the right: Median << Mean. Skewed to the left: Mean << Median.

17 The median should be preferred to the mean if the distribution is very asymmetric, or if there are extreme outliers. The mean is more "precise" than the median if the distribution is approximately normal, or if the skewness g of the distribution ranges between -1 and +1, i.e. the distribution is approx. symmetric. Rule of thumb: Right skew: skewness g > 0. Left skew: skewness g < 0.

18 How would you describe this distribution?

19 Unexpected distributions have unexpected causes! "…it showed a giant boa swallowing an elephant. I painted the inside of the boa to make it visible to the adults. They always need explanations." Antoine de Saint-Exupéry, Le petit prince

20 More measures of location. Quantile: A q-quantile Q (0 ≤ q ≤ 1) splits the data into a fraction of q points below or equal to Q and a fraction of 1-q points above or equal to Q. 0-quantile = minimum; 1st quartile = 25%-quantile; median = 50%-quantile; 3rd quartile = 75%-quantile; 1-quantile = maximum.
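
Quantiles of a small sample can be computed with R's `quantile`. A minimal sketch, reusing the five observations from the mean/median example; note that `quantile` offers several interpolation rules (argument `type`), so results for other data may differ slightly between conventions.

```r
x <- c(2, 3, 7, 9, 14)

# 0-quantile, quartiles, median and 1-quantile in one call (default type = 7)
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# The extreme quantiles are the minimum and the maximum
c(min(x), max(x))
```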

21 The five-point Summary and the Boxplot
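
A minimal sketch of the five-point summary and the corresponding boxplot in R, again on the toy observations 2, 3, 7, 9, 14 (chosen here for illustration):

```r
x <- c(2, 3, 7, 9, 14)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# The boxplot draws exactly these five numbers when there are no outliers
# (by default, points further than 1.5 * box length from the box are
# drawn separately)
bp <- boxplot(x, horizontal = TRUE, plot = FALSE)
bp$stats
boxplot(x, horizontal = TRUE)
```

The hinges returned by `fivenum` can differ slightly from the quartiles returned by `quantile` for other sample sizes; both are accepted conventions.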

22 Measures of Variation. How far do the observations scatter around their "center" (= measure of location)? [Figure: two distributions around the same location measure, one with large and one with small variation.] E.g.: location = median, variation = 3rd quartile - 1st quartile = interquartile range (IQR).

23 Measures of Variation. E.g.: location = median, variation = mean deviation (MD) from the median = (1/n) Σ |x_i - median(x)|. E.g.: location = median, variation = median absolute deviation (MAD) from the median = median(|x_i - median(x)|).
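
Both measures are easy to compute by hand. A minimal sketch on the toy observations 2, 3, 7, 9, 14 (an example chosen here); it also shows that R's built-in `mad` multiplies by 1.4826 unless `constant = 1` is set.

```r
x <- c(2, 3, 7, 9, 14)
m <- median(x)                    # location: 7

abs_dev <- abs(x - m)             # 5, 4, 0, 2, 7

md      <- mean(abs_dev)          # mean deviation from the median: 18 / 5 = 3.6
mad_raw <- median(abs_dev)        # median absolute deviation: 4

# Built-in version; the default constant 1.4826 rescales the MAD so that it
# estimates the standard deviation for Gaussian data
mad(x, constant = 1)              # 4, same as mad_raw
mad(x)                            # 4 * 1.4826

# Interquartile range for comparison: 9 - 3 = 6
IQR(x)
```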

24 Measures of Variation. E.g.: location = mean, variation = mean squared deviation from the mean = variance. Or: variation = square root of the variance = standard deviation (s, std.dev). Numbers for Gaussian variables: mean ± s contains ~68% of the data, mean ± 2s contains ~95%, mean ± 3s contains ~99.7%.
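
A minimal sketch of these quantities in R, on toy observations chosen here. One caveat that is not on the slide: R's `var` and `sd` divide by n - 1 rather than n, so they differ slightly from the plain mean squared deviation. The Gaussian percentages follow from the normal distribution function `pnorm`.

```r
x <- c(2, 3, 7, 9, 14)
n <- length(x)

# Mean squared deviation from the mean (divides by n): 94 / 5 = 18.8
msd <- mean((x - mean(x))^2)

# Sample variance and standard deviation as computed by R (divide by n - 1)
v <- var(x)                       # 94 / 4 = 23.5
s <- sd(x)                        # sqrt(23.5)

# Fraction of a Gaussian variable within mean +/- k standard deviations
coverage <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(100 * coverage, 1)          # 68.3 95.4 99.7
```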

25 Histogram/Density Plot vs. Boxplot. Boxplot contains less information, but it is easier to interpret. [Figure: four panels, numbered 1 to 4.]

26 Multiple Boxplots. Sample: 2769 schoolchildren

27 Summary. Always report the sample size! a) numerical: median, Q1, Q3, min., max. (5-summary); for symmetric distributions alternatively: mean, standard deviation. b) graphical: boxplots, histograms, density plots. c) textual: e.g. "Blood pressure was reduced by 12 mmHg (interquartile range: 8 to 18 mmHg = 10 mmHg), whereas the reduction in the placebo group was only 3 mmHg (IQR: -2 to 4 mmHg = 6 mmHg)."

28 Two categorical variables: Cross tables. Data: Person A: Medication = Verum, Response = yes; Person B: Medication = Placebo, Response = no.

29 Two categorical variables: Cross tables. In the cross table, the rows are the values of variable 1 (potential causes) and the columns are the values of variable 2 (potential effects). Data: Person A: Medication = Verum, Response = yes; Person B: Medication = Placebo, Response = no.

30 Two categorical variables: Cross tables. Each case is one count in the table. Rows (values of variable 1, potential causes): Medication = Verum, Placebo. Columns (values of variable 2, potential effects): Response = yes, no. Data: Person A: Medication = Verum, Response = yes; Person B: Medication = Placebo, Response = no.

31 Two categorical variables: Cross tables. Each case is one count in the table. Data: Person A: Medication = Verum, Response = yes; Person B: Medication = Placebo, Response = no.

Cross table (rows: values of variable 1, potential causes; columns: values of variable 2, potential effects):

                      Response
                      yes   no
Medication  Verum     1     0
            Placebo   0     1

32 Two categorical variables: Cross tables. The most common question is: Are there differences between █ and █ (the two highlighted parts of the table)?

                      Response
                      yes   no
Medication  Verum     1     0
            Placebo   0     1

33 Two categorical variables: Cross tables. Cross table: n = 80 cases. Each cell lists the absolute number, followed by row percent and column percent.

                        Response yes     Response no      Total
Medication  Verum       20 (50%, 67%)    20 (50%, 40%)    40 (50%)
            Placebo     10 (25%, 33%)    30 (75%, 60%)    40 (50%)
Total                   30 (37%)         50 (63%)         80 (100%)
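
Row and column percentages of such a table can be derived from the four cell counts. A minimal sketch in R with the counts from the table above; the object names are chosen here for illustration:

```r
# Cell counts from the slide: rows = medication, columns = response
counts <- as.table(matrix(c(20, 20,
                            10, 30),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Medication = c("Verum", "Placebo"),
                                          Response   = c("yes", "no"))))

# Absolute numbers with row and column totals
addmargins(counts)

# Row percent: each row sums to 100
row_pct <- round(100 * prop.table(counts, margin = 1))

# Column percent: each column sums to 100
col_pct <- round(100 * prop.table(counts, margin = 2))

print(row_pct)
print(col_pct)
```

Row percent answers "what fraction of the Verum group responded?", column percent answers "what fraction of the responders received Verum?"; which one is meaningful depends on the question.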

34 Two categorical variables: Cross tables. What's bad about this table?

35 Cross tables: Independent vs. paired data. Independent data (columns: Person, Medication, Response): A, Verum, yes; B, Placebo, no. Paired data (columns: Person, Medic.: Verum, Medic.: Placebo): A yes; B no. Paired data: One object (or two closely related objects) serves for the measurement of two variables of the same kind. Exercise: The influence of diet on body height is assessed in 1) a study with 100 randomly picked subjects, 2) a study with 50 identical twins that grew up separately. Write down the cross tables. Which study is probably more informative?

36 Cross tables: Paired data. Paired data (columns: Person, Medic.: Verum, Medic.: Placebo): A yes; B no.

Cross table (rows: values of variable 1; columns: values of variable 2):

                      Medic.: Placebo
                      yes   no
Medic.: Verum  yes    1     1
               no     0     0

37 Cross tables: Paired data. A typical question is: Are the observations concordant or discordant? Is there a particularly large number in █ or █ ? Concordant observations: same response under both treatments (diagonal cells). Discordant observations: different responses (off-diagonal cells).

                      Medic.: Placebo
                      yes   no
Medic.: Verum  yes    1     1
               no     0     0
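
Counting concordant and discordant pairs is a matter of reading the diagonal and off-diagonal cells of the paired cross table. A minimal sketch in R; the six response pairs are invented for illustration and are not data from the slides:

```r
# Hypothetical paired data: each person is measured under both treatments
verum   <- c("yes", "yes", "no", "yes", "no",  "no")
placebo <- c("yes", "no",  "no", "no",  "yes", "no")

# Fix the level order so that the table is always laid out as yes/no
lv  <- c("yes", "no")
tab <- table(Verum = factor(verum, levels = lv),
             Placebo = factor(placebo, levels = lv))
print(tab)

# Concordant pairs sit on the diagonal, discordant pairs off the diagonal
concordant <- sum(diag(tab))
discordant <- sum(tab) - concordant
c(concordant = concordant, discordant = discordant)
```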

38 Two continuous variables: Scatter Plots. Comparison of two global gene expression measurements. [Figure: the same scatterplot on absolute scale and on double logarithmic scale, each with the lines y = ¼x, y = ½x, y = 2x, y = 4x.] Advantages of double log scale: Skewed distributions appear more evenly spread across the plot. Loci of fixed expression folds are lines parallel to the main diagonal.

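39 Scatterplot vs. M-A-plot. [Figure: scatterplot of log(y) against log(x); turned by 45°, it becomes the M-A-plot of log(fold ratio of y and x) against log(geometric mean of x and y).]

> # exprs: matrix of expression values, one column per channel
> x = log(exprs[,1])    # log intensities, channel 1
> y = log(exprs[,2])    # log intensities, channel 2
> plot(x,y)             # scatterplot on double log scale
> xMA = (x+y)/2         # log of the geometric mean of x and y
> yMA = y - x           # log of the fold ratio of y and x
> plot(xMA,yMA)         # M-A-plot

Advantages of the MA-plot: Lines of constant expression folds are parallel to the x-axis. Differences between channel 1 and channel 2 can easily be read off the plot. Intensity-dependent systematic errors can be detected. There is a mistake in these plots (compare left and right plot)!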
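
The transformation can be tried without real expression data. A minimal sketch with simulated intensities, assuming log-normal values and a channel 2 that is twice channel 1 up to noise (all of this is invented for illustration); base-2 logarithms are used so that a twofold change shows up as 1 on the vertical axis.

```r
set.seed(1)                                   # arbitrary seed
n <- 1000
ch1 <- rlnorm(n, meanlog = 7, sdlog = 1)      # simulated channel 1 intensities
ch2 <- 2 * ch1 * rlnorm(n, meanlog = 0, sdlog = 0.2)  # channel 2 = 2 x channel 1, with noise

x <- log2(ch1)
y <- log2(ch2)

A <- (x + y) / 2      # log of the geometric mean of both channels
M <- y - x            # log of the fold ratio of channel 2 over channel 1

plot(A, M, pch = ".", xlab = "A = (x + y) / 2", ylab = "M = y - x")
abline(h = 0, lty = 2)                # no difference between the channels
abline(h = median(M), col = "red")    # observed offset, here close to log2(2) = 1
```

A constant factor between the channels appears as a horizontal shift of the whole point cloud away from M = 0.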

40 M-A-plot. Two examples: no visible bias (= systematic error); channel 2 differs from channel 1 by a constant factor: multiplicative bias.

41 Dependence of two continuous variables: Correlation. Example: How to quantify such a relation between x and y?

42 Pearson correlation. The Pearson correlation coefficient r measures the degree of linear dependence of two variables. Properties: -1 ≤ r ≤ +1; r = ±1: perfect linear dependence; the sign of r indicates the direction of the dependence; r is symmetric, i.e., r_xy = r_yx. [Figure: examples with r = 1 and r = -1.]
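
The properties listed above can be checked numerically. A minimal sketch on simulated data (invented for illustration), comparing R's `cor` with the usual textbook formula for r, which the slide itself does not spell out:

```r
set.seed(1)                       # arbitrary seed
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)          # noisy linear dependence

# Built-in Pearson correlation
r <- cor(x, y)

# Textbook formula: covariation divided by the product of the spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Symmetry: r_xy = r_yx
r_yx <- cor(y, x)

# Perfect linear dependence gives r = +1 or r = -1, depending on the slope
r_pos <- cor(x,  3 * x + 2)
r_neg <- cor(x, -3 * x + 2)

c(r = r, r_manual = r_manual, r_yx = r_yx, r_pos = r_pos, r_neg = r_neg)
```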

43 Pearson correlation. The smaller the absolute value of r, the weaker the linear dependence.

49 Pearson correlation. Example: Relation between height and weight resp. arm length. [Figure: two scatterplots, with r_xy = 0.38 and r_xy = 0.84.] The closer the points scatter around a line, the larger the absolute value of r.

50 Pearson correlation. What is the value of r in these cases? r ≈ 0. Pearson correlation has difficulties in recognizing non-linear dependencies.
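
A minimal sketch of such a case in R: y is completely determined by x, yet r is zero because the dependence is not linear. The parabola is an example chosen here, not necessarily one of the cases shown on the slide.

```r
x <- seq(-1, 1, length.out = 101)   # symmetric around 0
y <- x^2                            # perfect, but non-linear, dependence

r_parabola <- cor(x, y)
r_parabola                          # 0 up to rounding error
```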

51 Spearman correlation. Spearman correlation measures monotonic dependencies. Idea: Calculate the Pearson correlation coefficient of the rank-transformed data -> Spearman correlation s. [Figure: Y against X with Pearson correlation r = 0.88; rank(Y) against rank(X) with Spearman correlation s = 0.95.]
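
The idea can be verified directly: rank-transform both variables, then apply the Pearson formula. A minimal sketch in R with an invented monotonic but non-linear example:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- exp(x)                         # monotonic, but far from linear

r <- cor(x, y)                      # Pearson on the raw data: below 1

# Spearman "by hand": Pearson correlation of the ranks
s_manual <- cor(rank(x), rank(y))

# Built-in Spearman correlation
s <- cor(x, y, method = "spearman")

c(pearson = r, spearman_manual = s_manual, spearman = s)
```

Because y increases whenever x increases, the ranks agree exactly and s = 1, while r stays below 1.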

52 Pearson vs. Spearman correlation. Raw data.

53 Pearson vs. Spearman correlation. Pearson correlation:

            NM_001767     NM_000734     NM_001049     NM_006205
NM_001767    1.00000000    0.94918522   -0.04559766    0.04341766
NM_000734    0.94918522    1.00000000   -0.02659545    0.01229839
NM_001049   -0.04559766   -0.02659545    1.00000000   -0.85043885
NM_006205    0.04341766    0.01229839   -0.85043885    1.00000000

54 Pearson vs. Spearman correlation. Rank transformed data.

55 Pearson vs. Spearman correlation. Spearman correlation:

            NM_001767     NM_000734     NM_001049     NM_006205
NM_001767    1.00000000    0.9529094    -0.10869080   -0.17821449
NM_000734    0.9529094     1.00000000   -0.11247013   -0.20515650
NM_001049   -0.10869080   -0.11247013    1.00000000    0.03386758
NM_006205   -0.17821449   -0.20515650    0.03386758    1.00000000

56 Pearson vs. Spearman correlation. [Figure: raw data and rank transformed data.] Conclusion: Spearman correlation is more robust against outliers. However, in case of linear dependence, it is less sensitive than Pearson correlation.
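
The robustness claim can be illustrated with a single outlier. A minimal sketch in R on simulated data (invented here, unrelated to the gene expression data above): two independent variables plus one extreme point.

```r
set.seed(1)                         # arbitrary seed
x <- rnorm(30)
y <- rnorm(30)                      # independent of x

# Add a single extreme observation to both variables
x_out <- c(x, 100)
y_out <- c(y, 100)

pearson_out  <- cor(x_out, y_out)                       # driven by the outlier
spearman_out <- cor(x_out, y_out, method = "spearman")  # outlier counts as one rank

c(pearson = pearson_out, spearman = spearman_out)
```

One point out of 31 is enough to push the Pearson correlation close to 1, whereas the rank transformation limits its influence on the Spearman correlation.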

57 Q(uantile)-Q(uantile) Plots. Quantile-Quantile plot (QQ-plot): For the comparison of two distributions (of x and y), plot the quantiles of the x distribution against the corresponding quantiles of the y distribution.
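
A QQ-plot can be built by hand from matching quantiles. A minimal sketch in R, assuming simulated data in which y has heavier tails than x (a normal sample versus a t-distributed sample with 3 degrees of freedom; both are choices made here for illustration):

```r
set.seed(1)                         # arbitrary seed
x <- rnorm(500)                     # light tails
y <- rt(500, df = 3)                # heavier tails

# Matching quantiles of both samples at 1%, 2%, ..., 99%
probs <- (1:99) / 100
qx <- quantile(x, probs)
qy <- quantile(y, probs)

plot(qx, qy, xlab = "quantiles of x", ylab = "quantiles of y")
abline(0, 1, lty = 2)               # reference line: identical distributions

# Built-in alternative; for samples of equal size it pairs the sorted values
qq <- qqplot(x, y, plot.it = FALSE)
```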

58 Q(uantile)-Q(uantile) Plots. Interpretation: Dissimilar distributions: the QQ-plot is not linear, in particular not in the center of the QQ-line. Similar distributions except for the tails: the tails of the y distribution are "heavier". Similar distributions except for the tails: the tails of the x distribution are "heavier".

