Canadian Bioinformatics Workshops www.bioinformatics.ca.


1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2 Module #: Title of Module

3 Module 5

4 Distributions & Significance

5 Univariate Statistics

6 Univariate Statistics: Univariate means a single variable. If you measure a population using a single measure such as height, weight, test score or IQ, you are measuring a single variable. If you plot that variable over the whole population, counting how often each value occurs, you will get the following:

7 A Bell Curve: Also called a Gaussian or Normal Distribution. [Figure: histogram of # of each versus Height]

8 Features of a Normal Distribution: Symmetric distribution. Has an average or mean value (μ) at the centre. Has a characteristic width called the standard deviation (σ). Most common type of distribution known. [Figure label: μ = mean]

9 Normal Distribution: Almost any set of biological or physical measurements will display some variation, and these will very often follow an approximately Normal distribution. The larger the set of measurements, the more “normal” the curve. As a rule of thumb, a minimum of 30-40 measurements is needed to see a normal distribution.

10 Gaussian Distribution

11 Some Equations: Mean $\mu = \frac{\sum x_i}{N}$. Variance $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$. Standard deviation $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$.
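
The three formulas above can be checked numerically. The following is a minimal sketch that uses the population (divide by N) form shown on the slide; the height values are made up for illustration.

```python
import math

def mean(xs):
    """Arithmetic mean: sum of the values divided by N."""
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))

heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical heights in cm
mu = mean(heights)
sigma = std_dev(heights)
print(mu, variance(heights), sigma)
```

Many packages divide by N - 1 instead (the sample variance), so results can differ slightly from this version on small data sets.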

12 Standard Deviations (Z-values)

13 Significance: Based on the Normal Distribution, the probability that something is more than 1 SD away (larger or smaller) from the mean is about 32%; more than 2 SD away, about 5%; more than 3 SD away, about 0.3%.
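
These tail probabilities can be reproduced from the standard normal distribution. A small sketch, assuming a perfectly normal variable; the helper name `two_sided_tail` is made up for this example.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"more than {k} SD from the mean: {two_sided_tail(k):.4f}")
```

The printed values are close to 0.317, 0.046 and 0.003, which the slide rounds to 32%, 5% and 0.3%.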

14 Significance: In a test with a class of 400 students, if you score the average you typically receive a “C”; if you score 1 SD above the average you typically receive a “B”; if you score 2 SD above the average you typically receive an “A”.

15 The P-value: The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed. One “rejects the null hypothesis” when the p-value is less than the significance level α, which is often 0.05 or 0.01. When the null hypothesis is rejected, the result is said to be statistically significant.

16 P-value: If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”? If you choose an α of 0.05, is a 6’ 11” individual a member of the human species? If you choose an α of 0.01, is a 6’ 11” individual a member of the human species?
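
The height question can be worked through numerically. This sketch assumes adult heights are exactly normally distributed, converts the heights to inches (5’ 7” = 67, 6’ 10” = 82, 6’ 11” = 83) and computes the one-sided upper tail; the function name `upper_tail` is made up here.

```python
import math

def upper_tail(x, mean, sd):
    """P(X > x) for a normal variable with the given mean and SD."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

MEAN_IN, SD_IN = 67.0, 5.0               # 5'7" mean and 5" SD, in inches
p_82 = upper_tail(82.0, MEAN_IN, SD_IN)  # taller than 6'10" (3 SD above the mean)
p_83 = upper_tail(83.0, MEAN_IN, SD_IN)  # taller than 6'11" (3.2 SD above the mean)
print(f"{p_82:.5f} {p_83:.5f}")
```

Both probabilities fall below 0.05 and below 0.01, so under either α the height is "significantly" unusual. That says the height is rare under this model, not that the person belongs to a different species.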

17 P-value: If you flip a coin 20 times and it turns up heads 14/20 times, the probability of a result at least this extreme (14 or more heads) from a fair coin is 60,460/1,048,576 ≈ 0.058. If you choose an α of 0.05, is this coin a fair coin? If you choose an α of 0.10, is this coin a fair coin?
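
The coin-flip figure of about 0.058 corresponds to the one-sided probability of 14 or more heads in 20 flips of a fair coin. A short sketch of that calculation, using exact binomial counts; the function name is made up for this example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k successes in n independent trials with success probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

favourable = sum(comb(20, i) for i in range(14, 21))  # outcomes with 14..20 heads
total = 2 ** 20                                       # all possible sequences
p_value = prob_at_least(14, 20)
print(favourable, total, round(p_value, 4))
```

The p-value lies between 0.05 and 0.10, so the answer to "is this a fair coin?" changes with the chosen α.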

18 Mean, Median & Mode [Figure: distribution with the mode, median and mean marked]

19 Mean, Median, Mode: In a Normal Distribution the mean, mode and median are all equal. In skewed distributions they are unequal. Mean: the average value, affected by extreme values in the distribution. Median: the “middlemost” value, which usually lies between the mode and the mean. Mode: the most common value.
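
A quick illustration of how the three measures separate in a skewed sample, using Python's standard `statistics` module. The data values are made up, with two large values forming a right-hand tail.

```python
import statistics

values = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]  # hypothetical right-skewed sample

mean_v = statistics.mean(values)      # pulled upwards by the extreme values 9 and 19
median_v = statistics.median(values)  # middle of the sorted values
mode_v = statistics.mode(values)      # most frequent value
print(mode_v, median_v, mean_v)
```

For this sample the order is mode < median < mean, which is the usual pattern for a right-skewed distribution.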

20 Different Distributions: Unimodal, Bimodal

21 Other Distributions: Binomial Distribution, Poisson Distribution, Extreme Value Distribution, Skewed or Exponential Distribution

22 Binomial Distribution: $P(x) = (p + q)^n$. [Figure: triangle of binomial coefficients with rows 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; 1 5 10 10 5 1]

23 Poisson Distribution [Figure: P(x) (proportion of samples) versus x, with curves for parameter values 0.1, 1, 2, 3 and 10]

24 Extreme Value Distribution: Arises from sampling the extreme end of a normal distribution. A distribution which is “skewed” due to its selective sampling. Skew can be either right or left. [Figure label: Gaussian Distribution]

25 Skewed Distribution: Resembles an exponential or Poisson-like distribution. Lots of extreme values far from the mean or mode. Hard to do useful statistical tests with this type of distribution. [Figure label: Outliers]

26 Fixing a Skewed Distribution: A skewed or exponentially decaying distribution can often be transformed into an approximately “normal” or Gaussian distribution by applying a log transformation. Because it rescales the x-variable, this brings the outliers closer to the mean and makes the distribution much more Gaussian.
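
The effect of a log transformation can be seen by comparing skewness before and after. This is a sketch on simulated data: the log-normal sample, the seed and the `skewness` helper are all assumptions made for illustration, not data from the workshop.

```python
import math
import random

def skewness(xs):
    """Third standardized moment: near 0 for symmetric data, positive for a right tail."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sd) ** 3 for x in xs) / n

rng = random.Random(0)                                     # fixed seed for repeatability
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # strongly right-skewed values
logged = [math.log(x) for x in raw]                        # log-transformed values

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(f"raw skewness {raw_skew:.2f}, log-transformed skewness {log_skew:.2f}")
```

Real measurements are rarely exactly log-normal, so the transformed data should still be checked (for example with a histogram) rather than assumed to be Gaussian.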

27 Log Transformation [Figure: experiment B on a linear scale (skewed distribution) and after log transformation (normal distribution)]

28 Log Transformation on Real Data

29 Distinguishing 2 Populations: Normals, Leprechauns

30 The Result [Figure: # of each versus Height] Are they different?

31 What about these 2 Populations?

32 The Result [Figure: # of each versus Height] Are they different?

33 Student’s t-Test: Also called the t-Test. Used to determine if 2 populations are different. Formally, it gives the probability of seeing a difference between 2 sample means at least as large as the one observed if the population means were the same. If the t-Test gives you p = 0.4 and α is 0.05, you cannot conclude that the 2 populations are different. If the t-Test gives you p = 0.04 and α is 0.05, the 2 populations are significantly different. Paired and unpaired t-Tests are available: paired is used for “before & after” experiments, while unpaired is for 2 randomly chosen samples.
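
To make the unpaired test concrete, here is a minimal sketch of the pooled-variance (Student's) t statistic for two independent samples. The data and the function name `student_t` are made up; turning t into a p-value needs the t distribution, which the Python standard library does not provide (a statistics package such as SciPy's `scipy.stats.ttest_ind` would normally be used for that step).

```python
import math
import statistics

def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances (N - 1)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, na + nb - 2

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical measurements
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical measurements
t_stat, dof = student_t(group_a, group_b)
print(round(t_stat, 4), dof)
```

The pooled form assumes both populations have similar variance; when that is doubtful, the unequal-variance (Welch) version is the usual alternative.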

34 Student’s t-Test: A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution. [Figure: scatter plot of Variable 2 versus Variable 1]

35 Distinguishing 3+ Populations: Normals, Leprechauns, Elves

36 The Result [Figure: # of each versus Height] Are they different?

37 Distinguishing 3+ Populations

38 The Result [Figure: # of each versus Height] Are they different?

39 ANOVA: Also called Analysis of Variance. Used to determine if 3 or more populations are different; it is a generalization of the t-Test. Formally, ANOVA provides a statistical test (by looking at group variance) of whether or not the means of several groups are all equal. Uses an F-statistic to test for significance. There are 1-way, 2-way, 3-way and n-way ANOVAs; the most common is 1-way, which tests only whether any of the 3+ populations differ, not which pair is different.
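
The F-statistic for a 1-way ANOVA is the ratio of between-group to within-group variance. A minimal sketch with made-up data; the function name is hypothetical, and converting F to a p-value needs the F distribution from a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

heights = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical groups
f_stat, df_b, df_w = one_way_anova_f(heights)
print(round(f_stat, 3), df_b, df_w)
```

A large F says that at least one group mean differs; identifying which pair differs needs a follow-up (post hoc) comparison.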

40 ANOVA: ANOVA can also be used to determine whether 3+ clusters are different if the clusters follow a normal distribution. [Figure: scatter plot of Variable 2 versus Variable 1]

41 Normalization

42 Normalization: What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result: [Figure: # of each versus Height]

43 Normalization: Normalization adjusts for systematic bias in the measurement tool. After normalization we would get: [Figure: # of each versus Height]

44 Data Comparisons & Dependencies

45 Data Comparisons: In many kinds of experiments we want to know what happened to a population “before” and “after” some treatment or intervention. In other situations we want to measure the dependency of one variable against another. In still others we want to assess how the observed property matches the predicted property. In all cases we will measure multiple samples or work with a population of subjects. The best way to view this kind of data is through a scatter plot.

46 A Scatter Plot

47 Scatter Plots: If there is some dependency between the two variables, or a relationship between the predicted and observed variable, or if the “before” and “after” treatments led to some effect, then it is possible to see some clear patterns in the scatter plot. This pattern or relationship is called correlation.

48 Correlation: “+” correlation, Uncorrelated, “-” correlation

49 Correlation: High correlation, Low correlation, Perfect correlation

50 Correlation Coefficient: $r = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum (x_i - \mu_x)^2 \sum (y_i - \mu_y)^2}}$. [Figure: example scatter plots with r = 0.85, r = 0.4 and r = 1.0]
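
The correlation coefficient can be computed directly from its definition. A minimal sketch of the Pearson formula with made-up data; the function name `pearson_r` is chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical measurements
y = [2.0, 1.0, 4.0, 3.0, 5.0]  # hypothetical measurements
r = pearson_r(x, y)
print(round(r, 3))
```

Values of r near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.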

51 Correlation Coefficient: Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient. A quantitative measure of how well a linear model (a straight line) fits a set of data. Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies.

52 Student’s t-Test (Again): The t-Test can also be used to assess the statistical significance of a correlation. It specifically determines whether the slope of the regression line is statistically different from 0.
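
For a simple linear regression, the test of whether the slope differs from 0 can be written in terms of the correlation coefficient r and the number of points n: t = r·sqrt(n - 2) / sqrt(1 - r²), with n - 2 degrees of freedom. A small sketch; the function name and the example values are made up.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing whether a correlation differs from 0."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return t, n - 2

t_stat, dof = correlation_t(0.8, 5)  # hypothetical: r = 0.8 from 5 data points
print(round(t_stat, 4), dof)
```

The same r gives a larger t as n grows, which is why a modest correlation can be significant in a large data set while a high correlation from a handful of points may not be.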

53 Correlation and Outliers: Experimental error or something important? A single “bad” point can destroy a good correlation.

54 Outliers: Can be both “good” and “bad”. When modeling data, you don’t like to see outliers (they suggest the model is bad); they are often a good indicator of experimental or measurement errors, but only you can know! When plotting metabolite concentration data, you do like to see outliers; they are a good indicator of something significant.

55 Detecting Clusters [Figure: scatter plot of Weight versus Height]

56 Is it Right to Calculate a Correlation Coefficient? [Figure: scatter plot of Weight versus Height, r = 0.73]

57 Or is There More to This? [Figure: scatter plot of Weight versus Height with points labelled female and male]

58 Clustering Applications in Bioinformatics: Metabolomics and Cheminformatics, Microarray or GeneChip Analysis, 2D Gel or ProteinChip Analysis, Protein Interaction Analysis, Phylogenetic and Evolutionary Analysis, Structural Classification of Proteins, Protein Sequence Families

59 Clustering: Definition: a process by which objects that are logically similar in characteristics are grouped together. Clustering is different from classification: in classification the objects are assigned to pre-defined classes, whereas in clustering the classes are yet to be defined. Clustering helps in classification.

60 Clustering Requires: a method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects; a threshold value with which to decide whether an object belongs with a cluster; a way of measuring the “distance” between two clusters; a cluster seed (an object to begin the clustering process).

61 Clustering Algorithms: K-means or partitioning methods divide a set of N objects into M clusters, with or without overlap. Hierarchical methods produce a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains. Self-organizing feature maps produce a cluster set through iterative “training”.

62 K-means or Partitioning Methods: (1) Make the first object the centroid for the first cluster. (2) For the next object, calculate the similarity to each existing centroid. (3) If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster. (4) Return to step 2 and repeat until done.
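
The steps listed on this slide can be sketched for one-dimensional data as follows. This follows the slide's threshold-based procedure rather than the textbook K-means with a fixed number of clusters, and it uses distance to the centroid (smaller is more similar) in place of a similarity score. The wavelength values and the function name are made up; the 50-unit threshold echoes the 50 nm rule on the next slide.

```python
def threshold_cluster(values, threshold):
    """Assign each value to the nearest centroid within `threshold`, else start a new cluster."""
    clusters = []  # each cluster is a list of its member values
    for v in values:
        nearest, nearest_dist = None, None
        for members in clusters:
            centroid = sum(members) / len(members)  # centroid recomputed from current members
            dist = abs(v - centroid)
            if nearest_dist is None or dist < nearest_dist:
                nearest, nearest_dist = members, dist
        if nearest is not None and nearest_dist <= threshold:
            nearest.append(v)     # join the existing cluster
        else:
            clusters.append([v])  # seed a new cluster with this object
    return clusters

wavelengths = [400, 420, 610, 430, 600, 900]  # hypothetical values in nm
clusters = threshold_cluster(wavelengths, threshold=50)
print(clusters)
```

Because objects are processed one at a time, the result can depend on their order, which is one reason iterative reassignment is often added in practice.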

63 K-means or Partitioning Methods [Figure: worked example with the rule T = centroid ± 50 nm; panels: initial cluster, choose 1, choose 2, test & join, with the centroid shown at each step]

64 Hierarchical Clustering: (1) Find the two closest objects and merge them into a cluster. (2) Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold. (3) If more than one cluster remains, return to step 2 until finished.
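
A minimal sketch of the merging loop for one-dimensional data. It assumes single linkage (the distance between two clusters is the distance between their closest members) and, for simplicity, keeps merging until one cluster remains instead of stopping at a threshold. The data and function name are made up.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [[p] for p in points]  # start with every object in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage([1.0, 2.0, 6.0, 7.5, 20.0])  # hypothetical measurements
for left, right, dist in merges:
    print(left, right, dist)
```

The recorded merge distances are what a dendrogram plots on its height axis; cutting the tree at a chosen distance plays the role of the predefined threshold.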

65 Hierarchical Clustering [Figure: worked example with the rule T = obs ± 50 nm; panels: initial cluster, pairwise compare, select closest, select next closest]

66 Hierarchical Clustering: Find the 2 most similar metabolite expression levels or curves. Find the next closest pair of levels or curves. Iterate. [Figure: items A to F merged step by step, shown with a heat map]

67 Multivariate Statistics

68 Multivariate Statistics: Multivariate means multiple variables. If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, eye colour, etc., you are performing multivariate statistics. Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods.

69 A Typical Metabolomics Experiment

70 A Metabolomics Experiment: Metabolomics experiments typically measure many metabolites at once. In other words, the instruments are measuring multiple variables, so metabolomic data are inherently multivariate data. Metabolomics requires multivariate statistics.

71 Multivariate Statistics, The Trick: The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data into univariate data. Once done, you can apply the same univariate concepts such as p-values, t-Tests and ANOVA tests to the data. The trick is dimensional reduction.

72 Dimension Reduction & PCA: PCA (Principal Component Analysis) is a process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. Reduces 1000’s of variables to 2-3 key features. [Figure: scores plot]

73 Principal Component Analysis: PCA captures what should be visually detectable. If you can’t see it, PCA probably won’t help. [Figure: scores plot reducing hundreds of peaks to 2 components, with groups labelled ANIT, PAP and Control]

74 Visualizing PCA: PCA of a “bagel”. One projection produces a wiener. Another projection produces an “O”. The “O” projection captures most of the variation and corresponds to the largest eigenvalue (PC1). The wiener projection is PC2 and gives depth info.

75 PCA - The Details: PCA involves the calculation of the eigenvalue (singular value) decomposition of a data covariance matrix. PCA is an orthogonal linear transformation. PCA transforms data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (1st PC), the second greatest variance on the 2nd PC, etc. [Figure: data matrix with samples s_1 … s_k as rows and variables x_1 … x_n as columns, decomposed into scores t_1 … t_m and loadings p.] Scores = t (eigenvectors; uncorrelated, orthogonal). Loadings = p. Scores = loadings × data: $t_1 = p_1 x_1 + p_2 x_2 + p_3 x_3 + \dots + p_n x_n$.
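
For two variables the whole calculation fits in a few lines, because the eigenvalues of a 2×2 covariance matrix have a closed form. The sketch below is an illustration only: the data points and the function name `pca_2d` are made up, and real metabolomics data with many variables would use a linear-algebra library instead.

```python
import math

def pca_2d(points):
    """First principal component of 2-D data: (loadings, scores, explained variance ratio)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector for the largest eigenvalue gives the PC1 loadings (p1, p2)
    if b != 0:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    loadings = (v[0] / norm, v[1] / norm)
    # Scores: t1 = p1*x1 + p2*x2 for each centred sample
    scores = [x * loadings[0] + y * loadings[1] for x, y in centred]
    explained = lam1 / (lam1 + lam2)  # assumes the data are not all identical
    return loadings, scores, explained

data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]  # hypothetical, nearly collinear
loadings, scores, explained = pca_2d(data)
print(loadings, round(explained, 4))
```

Here nearly all of the variance lies along one direction, so a single score per sample summarizes the two original variables with little loss.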

76 Visualizing PCA: Airport data from USA, 5000 “samples”. X1 = latitude, X2 = longitude, X3 = altitude. What should you expect? Data from Roy Goodacre (U of Manchester).

77 Visualizing PCA: PCA is closely related to (though not identical to) K-means clustering.

78 K-means Clustering [Figure: worked example with the rule T = centroid ± 50 nm; panels: initial cluster, choose 1, choose 2, test & join, with the centroid shown at each step]

79 PCA Clusters: Once dimensional reduction has been achieved, you obtain clusters of data that are mostly normally distributed, with means and variances (in PCA space). It is possible to use t-Tests and ANOVA tests to determine if these clusters or their means are significantly different or not.

80 PCA and ANOVA: ANOVA can also be used to determine whether 3+ clusters are different if the clusters follow a normal distribution. [Figure: scatter plot of PC 2 versus PC 1]

81 PCA Plot Nomenclature: PCA generates 2 kinds of plots, the scores plot and the loadings plot. The scores plot (on right) plots the data using the main principal components.

82 PCA Loadings Plot: The loadings plot shows how much each of the variables (metabolites) contributed to the different principal components. Variables at the extreme corners contribute most to the scores plot separation.

83 PCA Details/Advice: In some cases PCA will not succeed in identifying any clear clusters or obvious groupings, no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished. As a general rule, if a PCA fails to achieve even a modest separation of classes, then it is probably not worthwhile using other statistical techniques to try to separate them.

84 PCA Q² and R²: The performance of a PCA model can be quantitatively evaluated in terms of an R² and/or a Q² value. R² is the correlation index and refers to the goodness of fit or the explained variation (range = 0-1). Q² refers to the predicted variation or quality of prediction (range = 0-1). Typically Q² and R² track very closely together.

85 PCA R²: R² is a quantitative measure (with a maximum value of 1) that indicates how well the PCA model is able to mathematically reproduce the data in the data set. A poorly fit model will have an R² of 0.2 or 0.3, while a well-fit model will have an R² of 0.7 or 0.8.

86 PCA Q²: To guard against over-fitting, the value Q² is commonly determined. Q² is usually estimated by cross validation or permutation testing to assess the predictive ability of the model relative to the number of principal components used in the model. Generally a Q² > 0.5 is considered good, while a Q² of 0.9 is outstanding.

87 PCA vs. PLS-DA: PLS-DA stands for Partial Least Squares Discriminant Analysis. PLS-DA is a supervised classification technique, while PCA is an unsupervised clustering technique. PLS-DA uses “labeled” data, while PCA uses no prior knowledge. PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained.

88 Other Supervised Classification Methods: SIMCA (Soft Independent Modeling of Class Analogy), OPLS (Orthogonal Projections to Latent Structures), Support Vector Machines, Random Forest, Naïve Bayes Classifiers, Neural Networks

89 Breaching the Data Barrier: Unsupervised Methods: PCA, K-means clustering, Factor Analysis. Supervised Methods: PLS-DA, LDA, PLS-Regression. Machine Learning: Neural Networks, Support Vector Machines, Bayesian Belief Net.

90 Data Analysis Progression: Unsupervised Methods: PCA or cluster to see if natural clusters form or if data separates well; data is “unlabeled” (no prior knowledge). Supervised Methods/Machine Learning: data is labeled (prior knowledge); used to see if data can be classified; helps separate less obvious clusters or features. Statistical Significance: supervised methods always generate clusters, which can be very misleading; check if clusters are real by label permutation.
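
Label permutation can be sketched as follows: compute a separation score for the real labels, then shuffle the labels many times and count how often a shuffled labelling separates the groups at least as well. In this sketch the "separation score" is simply the absolute difference of group means, standing in for whatever score a supervised model would report; the data, seed and function name are made up.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Estimate how often random relabelling separates the groups as well as the real labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    def separation(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = separation(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling the values is equivalent to shuffling the labels
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        if separation(perm_a, perm_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids reporting p = 0

controls = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical scores
treated = [11.0, 12.0, 13.0, 14.0, 15.0]  # hypothetical scores
p_separated = permutation_p_value(controls, treated)
p_overlapping = permutation_p_value(controls, [1.5, 2.5, 3.5, 4.5, 5.5])
print(p_separated, p_overlapping)
```

A small p-value means shuffled labels rarely reproduce the observed separation; a large one suggests the apparent clusters could arise from the labelling alone.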

91 Testing Significance [Figure: workflow from PCA to PLS-DA/SVM on labelled data, with the separation score compared against permuted data]

92 Note of Caution: Supervised classification methods are powerful: they learn from experience, generalize from previous examples and perform pattern recognition. Too many people skip the PCA or clustering steps and jump straight to supervised methods. Some get great separation and think the job is done; this is where the errors begin. Too many don’t assess significance using permutation testing or n-fold cross validation. If separation isn’t partially obvious by eye-balling your data, you may be treading on thin ice.

