Canadian Bioinformatics Workshops www.bioinformatics.ca.

Module 5

Distributions & Significance

Univariate Statistics

Univariate Statistics
Univariate means a single variable.
If you measure a population using a single measure such as height, weight, test score or IQ, you are measuring a single variable.
If you plot that single variable over the whole population, showing how often each value occurs, you will get the following:

A Bell Curve
Also called a Gaussian or Normal Distribution.
[Figure: bell curve; x-axis: height, y-axis: # of each]

Features of a Normal Distribution
Symmetric distribution.
Has an average or mean value (μ) at the centre.
Has a characteristic width called the standard deviation (σ).
Most common type of distribution known.
μ = mean

Normal Distribution
Almost any set of biological or physical measurements will display some variation, and this variation will almost always follow a Normal distribution.
The larger the set of measurements, the more “normal” the curve.
As a rule of thumb, the minimum set of measurements needed to get a normal distribution is 30-40.

Gaussian Distribution

Some Equations
Mean: $\mu = \frac{\sum_i x_i}{N}$
Variance: $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$
Standard Deviation: $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
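As a minimal sketch, the three equations above translate directly into Python. The function names and the sample heights are illustrative assumptions, not part of the slides. Note that these use the population form (division by N), as written in the equations.

```python
import math


def mean(values):
    """Mean: sum of the values divided by N."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical heights in inches, chosen only for illustration.
heights = [64.0, 66.0, 67.0, 68.0, 70.0]
print(mean(heights), variance(heights), std_dev(heights))
```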

Standard Deviations (Z-values)

Significance
Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32%.
Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5%.
Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%.
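The rounded percentages above can be checked with the complementary error function from Python's standard library. This is only a sketch; the function name `two_sided_tail` is an assumption made for illustration.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


for k in (1, 2, 3):
    # Prints the probability of lying more than k SD from the mean.
    print(f">{k} SD from the mean: {two_sided_tail(k):.4f}")
```

The values come out close to 0.32, 0.05 and 0.003, matching the rounded figures on the slide.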

Significance
In a test with a class of 400 students, if you score the average you typically receive a “C”.
In a test with a class of 400 students, if you score 1 SD above the average you typically receive a “B”.
In a test with a class of 400 students, if you score 2 SD above the average you typically receive an “A”.

The P-value
The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed.
One “rejects the null hypothesis” when the p-value is less than the significance level α, which is often 0.05 or 0.01.
When the null hypothesis is rejected, the result is said to be statistically significant.

P-value
If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”?
If you choose an α of 0.05, is a 6’ 11” individual a member of the human species?
If you choose an α of 0.01, is a 6’ 11” individual a member of the human species?
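One way to work the height question is sketched below, assuming heights are normally distributed and that a one-sided (upper-tail) probability is what is wanted. The function name `upper_tail` and the constants are illustrative.

```python
import math


def upper_tail(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


MEAN_IN = 5 * 12 + 7   # 5' 7" expressed in inches
SD_IN = 5              # standard deviation in inches

p_6_10 = upper_tail(6 * 12 + 10, MEAN_IN, SD_IN)  # 6' 10" is 3 SD above the mean
p_6_11 = upper_tail(6 * 12 + 11, MEAN_IN, SD_IN)  # 6' 11" is 3.2 SD above the mean

for label, p in (("6' 10\"", p_6_10), ("6' 11\"", p_6_11)):
    print(f"{label}: p = {p:.5f}, below 0.05: {p < 0.05}, below 0.01: {p < 0.01}")
```

Under these assumptions both probabilities fall below 0.01, so the test would call such a height "significant" at either α even though very tall people are obviously still human.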

P-value
If you flip a coin 20 times and the coin turns up heads 14/20 times, the probability that a fair coin would give 14 or more heads is about 60,000/1,048,000 = 0.058 (exactly 60,460/1,048,576).
If you choose an α of 0.05, is this coin a fair coin?
If you choose an α of 0.10, is this coin a fair coin?
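The coin-flip probability can be reproduced by summing binomial terms. This is a sketch that assumes the quoted figure is the one-sided probability of 14 or more heads; the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(n, k, p=0.5):
    """P(at least k successes in n trials) when each trial succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_value = binomial_upper_tail(20, 14)  # 60460 / 1048576, about 0.058
print(f"p = {p_value:.4f}")
print("reject fairness at alpha = 0.05:", p_value < 0.05)
print("reject fairness at alpha = 0.10:", p_value < 0.10)
```

With these numbers the coin is not rejected as fair at α = 0.05 but is rejected at α = 0.10, which is the contrast the slide is asking about.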

Mean, Median & Mode
[Figure: distribution with the mode, median and mean marked]

Mean, Median, Mode
In a Normal Distribution the mean, mode and median are all equal.
In skewed distributions they are unequal.
Mean - average value, affected by extreme values in the distribution.
Median - the “middlemost” value, usually lying between the mode and the mean.
Mode - most common value.
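A small illustration using Python's `statistics` module, with made-up right-skewed data (the numbers are hypothetical), shows the three measures separating in the order described above.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is extreme.
scores = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean_value = statistics.mean(scores)      # pulled upward by the extreme value 21
median_value = statistics.median(scores)  # the middle of the sorted values
mode_value = statistics.mode(scores)      # the most common value

print(mode_value, median_value, mean_value)
```

For this sample the mode is 3, the median is 4 and the mean is 6, so the median sits between the mode and the mean.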

Different Distributions
[Figure: a unimodal distribution and a bimodal distribution]

Other Distributions
– Binomial Distribution
– Poisson Distribution
– Extreme Value Distribution
– Skewed or Exponential Distribution

Binomial Distribution
1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$P(x) = (p + q)^n$

Poisson Distribution
[Figure: P(x) versus x for parameter values 0.1, 1, 2, 3 and 10; y-axis: proportion of samples]

Extreme Value Distribution
Arises from sampling the extreme end of a normal distribution.
A distribution which is “skewed” due to its selective sampling.
Skew can be either right or left.
[Figure: extreme value distribution shown against a Gaussian distribution]

Skewed Distribution
Resembles an exponential or Poisson-like distribution.
Lots of extreme values far from the mean or mode.
Hard to do useful statistical tests with this type of distribution.
[Figure: skewed distribution with the outliers marked]

Fixing a Skewed Distribution
A skewed or exponentially decaying distribution can be transformed into a “normal” or Gaussian distribution by applying a log transformation.
Because it rescales the x-variable, the transformation brings the outliers closer to the mean and makes the distribution much more Gaussian.
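The effect of a log transformation can be sketched with simulated data. The example assumes log-normally distributed values (an assumption chosen because their logarithm is exactly normal) and uses a simple moment-based skewness; the names are illustrative.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: roughly 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # simulated skewed data
logged = [math.log(x) for x in raw]                         # log-transformed data

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The skewness is strongly positive before the transformation and close to zero afterwards. Real data will rarely behave this cleanly, and values of zero or below need separate handling before taking logs.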

Log Transformation
[Figure: exp’t B on a linear scale (skewed distribution) and exp’t B log transformed (normal distribution)]

Log Transformation on Real Data

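Distinguishing 2 Populations
[Figure: two populations, Normals and Leprechauns]

The Result
[Figure: height distributions of the two populations; x-axis: height, y-axis: # of each]
Are they different?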

What about these 2 Populations?

Student’s t-Test
Also called the t-Test.
Used to determine if 2 populations are different.
Formally, it allows you to calculate the probability of seeing a difference between 2 sample means at least as large as the one observed if the population means were actually the same.
If the t-Test gives you a p=0.4, and the α is 0.05, then the 2 populations cannot be distinguished (they are treated as the same).
If the t-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are different.
Paired and unpaired t-Tests are available: paired is used for “before & after” expts. while unpaired is for 2 randomly chosen samples.
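The unpaired t statistic can be sketched in a few lines. This assumes the classic pooled-variance (equal variance) form and uses the sample variance (division by N-1), which is what `statistics.variance` computes; the function name and data are hypothetical.

```python
import math
import statistics


def students_t(sample_a, sample_b):
    """Unpaired Student's t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (N - 1 in the denominator)
    var_b = statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = diff / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical measurements from two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
t_stat, dof = students_t(group_a, group_b)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Turning t into a p-value requires the t distribution, which is not in the standard library; a routine such as SciPy's `scipy.stats.ttest_ind` is one option for that step.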

Student’s t-Test
A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution.
[Figure: two clusters; x-axis: Variable 1, y-axis: Variable 2]

Distinguishing 3+ Populations
[Figure: three populations, Normals, Leprechauns and Elves]

Distinguishing 3+ Populations

ANOVA
Also called Analysis of Variance.
Used to determine if 3 or more populations are different; it is a generalization of the t-Test.
Formally, ANOVA provides a statistical test (by looking at group variance) of whether or not the means of several groups are all equal.
Uses an F-statistic to test for significance.
There are 1-way, 2-way, 3-way and n-way ANOVAs. The most common is 1-way, which is concerned only with whether any of the 3+ populations are different, not which pair is different.
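A 1-way ANOVA F statistic compares the variance between group means with the variance within groups. The sketch below uses hypothetical data and an illustrative function name.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a 1-way ANOVA on a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements for three populations.
populations = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(populations)
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F suggests that at least one group mean differs; as the slide notes, it does not say which pair differs.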

ANOVA
ANOVA can also be used to determine whether 3+ clusters are different if the clusters follow a normal distribution.
[Figure: three or more clusters; x-axis: Variable 1, y-axis: Variable 2]

Normalization

Normalization
What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)?
We would get the following result:
[Figure: height distributions before normalization; x-axis: height, y-axis: # of each]

Normalization
Normalization adjusts for systematic bias in the measurement tool.
After normalization we would get:
[Figure: height distributions after normalization; x-axis: height, y-axis: # of each]

Data Comparisons & Dependencies

Data Comparisons
In many kinds of experiments we want to know what happened to a population “before” and “after” some treatment or intervention.
In other situations we want to measure the dependency of one variable against another.
In still others we want to assess how the observed property matches the predicted property.
In all cases we will measure multiple samples or work with a population of subjects.
The best way to view this kind of data is through a scatter plot.

A Scatter Plot

Scatter Plots
If there is some dependency between the two variables, or a relationship between the predicted and observed variable, or if the “before” and “after” treatments led to some effect, then it is possible to see clear patterns in the scatter plot.
This pattern or relationship is called correlation.

Correlation
[Figure: scatter plots showing “+” correlation, no correlation (uncorrelated) and “-” correlation]

Correlation
[Figure: scatter plots showing high correlation, low correlation and perfect correlation]

Correlation Coefficient
[Figure: scatter plots with r = 0.85, r = 0.4 and r = 1.0]
$r = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2 \sum_i (y_i - \mu_y)^2}}$
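The correlation coefficient can be computed directly from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations). The following is a sketch with hypothetical data and an illustrative function name.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(spread_x * spread_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y_perfect = [2, 4, 6, 8, 10]  # lies exactly on a straight line
y_noisy = [1, 3, 2, 5, 4]     # same trend with some scatter

print(pearson_r(x, y_perfect), pearson_r(x, y_noisy))
```

The perfectly linear data give r = 1, while the noisy data give r = 0.8.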

Correlation Coefficient
Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient.
A quantitative way of determining what model (or equation or type of line) best fits a set of data.
Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies.

Student’s t-Test (Again)
The t-Test can also be used to assess the statistical significance of a correlation.
It specifically determines whether the slope of the regression line is statistically different from 0.
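For a Pearson correlation r measured on n pairs, the usual test statistic is t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. A minimal sketch follows; the function name is an assumption.

```python
import math


def correlation_t(r, n):
    """t statistic and degrees of freedom for testing whether a correlation differs from 0."""
    if abs(r) >= 1.0:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r * r)), n - 2


# Example: r = 0.8 observed on 5 pairs of points.
t_stat, dof = correlation_t(0.8, 5)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The same r becomes more significant as n grows, because t scales with the square root of n - 2.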

Correlation and Outliers
Experimental error or something important?
A single “bad” point can destroy a good correlation.

Outliers
Can be both “good” and “bad”.
When modeling data you don’t like to see outliers (they suggest the model is bad).
– Often a good indicator of experimental or measurement errors; only you can know!
When plotting metabolite concentration data you do like to see outliers.
– A good indicator of something significant

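Detecting Clusters
[Figure: scatter plot of weight against height]

Is it Right to Calculate a Correlation Coefficient?
[Figure: scatter plot of weight against height, r = 0.73]

Or is There More to This?
[Figure: the same weight against height plot with female and male points distinguished]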

Clustering Applications in Bioinformatics
– Metabolomics and Cheminformatics
– Microarray or GeneChip Analysis
– 2D Gel or ProteinChip Analysis
– Protein Interaction Analysis
– Phylogenetic and Evolutionary Analysis
– Structural Classification of Proteins
– Protein Sequence Families

Clustering
Definition - a process by which objects that are logically similar in characteristics are grouped together.
Clustering is different from classification.
In classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined.
Clustering helps in classification.

Clustering Requires...
– A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
– A threshold value with which to decide whether an object belongs with a cluster
– A way of measuring the “distance” between two clusters
– A cluster seed (an object to begin the clustering process)

Clustering Algorithms
K-means or Partitioning Methods - divides a set of N objects into M clusters, with or without overlap.
Hierarchical Methods - produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains.
Self-Organizing Feature Maps - produces a cluster set through iterative “training”.

K-means or Partitioning Methods
1. Make the first object the centroid for the first cluster.
2. For the next object calculate the similarity to each existing centroid.
3. If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid, else use the object to start a new cluster.
4. Return to step 2 and repeat until done.
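The four steps above can be sketched for one-dimensional data. This follows the sequential, threshold-based procedure described on the slide rather than the textbook K-means algorithm with a fixed number of clusters. "Similarity greater than a threshold" is interpreted here as "distance to the nearest centroid no larger than a threshold", and the function name and data are assumptions.

```python
def threshold_cluster(values, threshold):
    """Assign each value to the nearest centroid within `threshold`, else start a new cluster."""
    clusters = []   # list of member lists
    centroids = []  # running mean of each cluster
    for value in values:
        if centroids:
            # Step 2: find the closest existing centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            # Step 3: join that cluster if close enough and update its centroid.
            if abs(value - centroids[nearest]) <= threshold:
                clusters[nearest].append(value)
                centroids[nearest] = sum(clusters[nearest]) / len(clusters[nearest])
                continue
        # Step 1 (first object) or the "else" branch of step 3: start a new cluster.
        clusters.append([value])
        centroids.append(float(value))
    return clusters, centroids


# Hypothetical measurements (for example, wavelengths in nm) with a 50-unit threshold.
clusters, centroids = threshold_cluster([400, 420, 600, 610, 430], threshold=50)
print(clusters)
print(centroids)
```

Because objects are processed one at a time, the result can depend on the order in which they arrive, which is one reason the choice of seed and threshold matters.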

K-means or Partitioning Methods
[Figure: worked example with the rule T = centroid ± 50 nm; panels: initial cluster, choose 1, choose 2, test & join, with the centroid recalculated after each join]

Hierarchical Clustering
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold.
3. If more than one cluster remains, return to step 2 until finished.
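A compact sketch of these steps for one-dimensional data is given below. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible choices; the function name and data are hypothetical.

```python
def agglomerate(values, threshold=float("inf")):
    """Repeatedly merge the two closest clusters until one remains or the gap exceeds threshold."""
    clusters = [[v] for v in values]  # every object starts as its own cluster
    merges = []                       # record of (distance, merged members)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


final_clusters, merge_history = agglomerate([400, 420, 600, 610, 430], threshold=50)
print(final_clusters)
for distance, members in merge_history:
    print(distance, members)
```

With the default threshold of infinity the loop continues until a single cluster remains, and the merge history is the information a dendrogram would display.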

Hierarchical Clustering
[Figure: worked example with the rule T = obs ± 50 nm; panels: initial cluster, pairwise compare, select closest, select next closest]

Hierarchical Clustering
1. Find the 2 most similar metabolite expression levels or curves.
2. Find the next closest pair of levels or curves.
3. Iterate.
[Figure: clusters built up step by step from items A to F, shown with a heat map]

Multivariate Statistics

Multivariate Statistics
Multivariate means multiple variables.
If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, eye colour, etc., you are performing multivariate statistics.
Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods.

A Typical Metabolomics Experiment

A Metabolomics Experiment
Metabolomics experiments typically measure many metabolites at once. In other words, the instruments are measuring multiple variables, and so metabolomic data are inherently multivariate data.
Metabolomics requires multivariate statistics.

Multivariate Statistics – The Trick
The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data into univariate data.
Once done, you can apply the same univariate concepts such as p-values, t-Tests and ANOVA tests to the data.
The trick is dimensional reduction.

Dimension Reduction & PCA
PCA – Principal Component Analysis.
A process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.
Reduces 1000’s of variables to 2-3 key features.
[Figure: scores plot]

Principal Component Analysis
PCA captures what should be visually detectable.
If you can’t see it, PCA probably won’t help.
[Figure: hundreds of peaks for the ANIT, PAP and Control groups reduced to 2 components in a scores plot]

Visualizing PCA
PCA of a “bagel”.
One projection produces a wiener.
Another projection produces an “O”.
The “O” projection captures most of the variation and has the largest eigenvalue (PC1).
The wiener projection is PC2 and gives depth info.

PCA - The Details
PCA involves the calculation of the eigenvalue (singular value) decomposition of a data covariance matrix.
PCA is an orthogonal linear transformation.
PCA transforms data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (1st PC), the second greatest variance on the 2nd PC, etc.
[Figure: data matrix of variables x_1, x_2, x_3, ..., x_n by samples s_1, s_2, s_3, ..., s_k, decomposed into scores t_1, t_2, ..., t_m and loadings p_1, p_2, ..., p_k]
Scores = t (eigenvectors, uncorrelated, orthogonal)
Loadings = p
scores = loadings x data: $t_1 = p_1 x_1 + p_2 x_2 + p_3 x_3 + \dots + p_n x_n$
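To make the covariance, eigenvalue and scores relationship concrete, here is a sketch restricted to two variables, where the eigen-decomposition of the 2 x 2 covariance matrix has a closed form. Real metabolomics data have many more variables and would normally use a linear algebra library; the function name and data points are assumptions.

```python
import math


def pca_2d(points):
    """PCA for two variables: returns (eigenvalues, loadings, scores)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Covariance matrix [[sxx, sxy], [sxy, syy]] (population form, divided by N).
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Eigenvalues of a symmetric 2 x 2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy**2)
    lam1, lam2 = half_trace + root, half_trace - root

    # Eigenvector (loading) for the largest eigenvalue.
    if sxy != 0:
        vx, vy = lam1 - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    p1 = (vx / norm, vy / norm)
    p2 = (-p1[1], p1[0])  # orthogonal to p1

    # Scores: t = p_1 * x_1 + p_2 * x_2 for each centered sample.
    scores = [(x * p1[0] + y * p1[1], x * p2[0] + y * p2[1]) for x, y in centered]
    return (lam1, lam2), (p1, p2), scores


data = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0), (6.0, 6.0)]
eigenvalues, loadings, scores = pca_2d(data)
print("variance explained by PC1:", eigenvalues[0] / sum(eigenvalues))
```

The variance of the first score column equals the largest eigenvalue, and the two score columns are uncorrelated, which are the properties the slide summarises.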

Visualizing PCA
Airport data from USA: 5000 “samples”.
X1 - latitude
X2 - longitude
X3 - altitude
What should you expect?
Data from Roy Goodacre (U of Manchester)

Visualizing PCA
PCA is closely related to K-means clustering.

K-means Clustering
[Figure: worked example with the rule T = centroid ± 50 nm; panels: initial cluster, choose 1, choose 2, test & join, with the centroid recalculated after each join]

PCA Clusters
Once dimensional reduction has been achieved you obtain clusters of data that are mostly normally distributed, with means and variances (in PCA space).
It is possible to use t-Tests and ANOVA tests to determine if these clusters or their means are significantly different or not.

PCA and ANOVA
ANOVA can also be used to determine whether 3+ clusters are different if the clusters follow a normal distribution.
[Figure: clusters in a scores plot; x-axis: PC 1, y-axis: PC 2]

PCA Plot Nomenclature
PCA generates 2 kinds of plots, the scores plot and the loadings plot.
The scores plot (on right) plots the data using the main principal components.

PCA Loadings Plot
The loadings plot shows how much each of the variables (metabolites) contributed to the different principal components.
Variables at the extreme corners contribute most to the scores plot separation.

PCA Details/Advice
In some cases PCA will not succeed in identifying any clear clusters or obvious groupings no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished.
As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, then it is probably not worthwhile using other statistical techniques to try to separate them.

PCA Q² and R²
The performance of a PCA model can be quantitatively evaluated in terms of an R² and/or a Q² value.
R² is the correlation index and refers to the goodness of fit or the explained variation (range = 0-1).
Q² refers to the predicted variation or quality of prediction (range = 0-1).
Typically Q² and R² track very closely together.

PCA R²
R² is a quantitative measure (with a maximum value of 1) that indicates how well the PCA model is able to mathematically reproduce the data in the data set.
A poorly fit model will have an R² of 0.2 or 0.3, while a well-fit model will have an R² of 0.7 or 0.8.

PCA Q²
To guard against over-fitting, the value Q² is commonly determined.
Q² is usually estimated by cross validation or permutation testing to assess the predictive ability of the model relative to the number of principal components used in the model.
Generally a Q² > 0.5 is considered good, while a Q² of 0.9 is outstanding.

PCA vs. PLS-DA
PLS-DA = Partial Least Squares Discriminant Analysis.
PLS-DA is a supervised classification technique while PCA is an unsupervised clustering technique.
PLS-DA uses “labeled” data while PCA uses no prior knowledge.
PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained.

Other Supervised Classification Methods
– SIMCA – Soft Independent Modeling of Class Analogy
– OPLS – Orthogonal Projections to Latent Structures
– Support Vector Machines
– Random Forest
– Naïve Bayes Classifiers
– Neural Networks

Breaching the Data Barrier
Unsupervised Methods
– PCA
– K-means clustering
– Factor Analysis
Supervised Methods
– PLS-DA
– LDA
– PLS-Regression
Machine Learning
– Neural Networks
– Support Vector Machines
– Bayesian Belief Net

Data Analysis Progression
Unsupervised Methods
– PCA or cluster to see if natural clusters form or if data separates well
– Data is “unlabeled” (no prior knowledge)
Supervised Methods/Machine Learning
– Data is labeled (prior knowledge)
– Used to see if data can be classified
– Helps separate less obvious clusters or features
Statistical Significance
– Supervised methods always generate clusters; this can be very misleading
– Check if clusters are real by label permutation
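Label permutation can be sketched as follows. The "separation score" here is simply the absolute difference between two class means along one component, a stand-in (an assumption) for whatever score a supervised method would report; the labels, data and function names are hypothetical.

```python
import random


def separation(values, labels):
    """Absolute difference between the mean of class 'A' and the mean of class 'B'."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Fraction of label shufflings that separate the classes at least as well as the real labels."""
    rng = random.Random(seed)
    observed = separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between label and value
        if separation(values, shuffled) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores along one component for two labeled classes.
values = [1.0, 1.2, 0.9, 1.1, 1.3, 3.0, 3.2, 2.9, 3.1, 3.3]
labels = ["A"] * 5 + ["B"] * 5
print(f"permutation p-value: {permutation_p_value(values, labels):.4f}")
```

If randomly relabeled data separate as well as the real labels do, the apparent clusters are probably an artifact of the method rather than a property of the data.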

Testing Significance
[Figure: labelled data analysed by PCA and by PLS-DA/SVM; the separation score is compared with the scores obtained from permuted data]

Note of Caution
Supervised classification methods are powerful
– Learn from experience
– Generalize from previous examples
– Perform pattern recognition
Too many people skip the PCA or clustering steps and jump straight to supervised methods.
Some get great separation and think the job is done - this is where the errors begin…
Too many don’t assess significance using permutation testing or n-fold cross validation.
If separation isn’t partially obvious by eye-balling your data, you may be treading on thin ice.
