
Statistical Analysis of Microarray Data


1 Statistical Analysis of Microarray Data
Ka-Lok Ng Asia University

2 Statistical Analysis of Microarray Data
Ratios and reference samples. Compute the ratio of fluorescence intensities for two samples that are competitively hybridized to the same microarray. One sample acts as a control, or "reference," sample and is labeled with a dye (Cy3) whose fluorescence spectrum differs from that of the dye (Cy5) used to label the experimental sample. A convention emerged that twofold induction or repression of an experimental sample, relative to the reference sample, is indicative of a meaningful change in gene expression. This convention does not reflect a standard statistical definition of significance; it often has the effect of selecting the top 5% or so of the clones present on the microarray.

3 Statistical Analysis of Microarray Data
Reasons for adopting ratios as the standard for comparison of gene expression: Microarrays do not provide data on absolute expression levels, and formulating a ratio captures the central idea that it is a change in the relative level of expression that is biologically interesting. Ratios also remove variation among arrays from the analysis. Differences between microarrays include (1) the absolute amount of DNA spotted on the arrays, and (2) local variation introduced either during slide preparation and washing or during image capture.

4 Statistical Analysis of Microarray Data
All microarray experiments must be normalized to ensure that biases inherent in each hybridization are removed. This is true whether ratios or raw fluorescence intensities are adopted as the measure of transcript abundance. Simple normalization of microarray data: the difference between the raw fluorescence values is a meaningless number. Computing ratios allows immediate visualization of which genes are higher in the red channel than in the green channel, and logarithmic transformation of this measure on the base 2 scale results in a symmetric distribution of values. Finally, normalization by subtraction of the mean log ratio adjusts for the fact that the red channel was generally more intense than the green channel, and centers the data around zero.
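A minimal Python sketch of this normalization, assuming a hypothetical pair of background-subtracted intensity vectors:

```python
import numpy as np

# Hypothetical raw, background-subtracted intensities for five spots.
green = np.array([1200.0, 800.0, 450.0, 2300.0, 950.0])   # Cy3 channel
red   = np.array([1500.0, 420.0, 500.0, 2100.0, 2000.0])  # Cy5 channel

log_ratio = np.log2(red / green)           # symmetric measure of relative expression
normalized = log_ratio - log_ratio.mean()  # centering removes the overall dye bias

print(normalized)  # values near +1 / -1 suggest ~twofold up/down regulation
```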

5 Statistical Analysis of Microarray Data

6 Calculate which genes are differentially expressed.
Statistical Analysis of Microarray Data. Given the fluorescence intensity for the Cy3 and Cy5 channels after background subtraction, calculate which genes are at least twofold different in their abundance on this array using two different approaches: (a) by formulating the Cy3:Cy5 ratio, and (b) by calculating the difference in the log base 2 transformed values. In both cases, make sure that you adjust for any overall difference in intensity between the two dyes, and comment on whether this adjustment affects your conclusions.

7 Statistical Analysis of Microarray Data
Divide by 0.954

8 Statistical Analysis of Microarray Data
Using the ratio method, without adjustment for overall dye effects, genes 2 and 9 appear to have Cy3/Cy5 < 0.5, suggesting that they are differentially regulated. No genes have Cy3/Cy5 > 2. However, the average ratio is 0.95, indicating that overall fluorescence is generally 5% greater in the Cy5 (red) channel. One way to adjust for this is to divide the individual ratios by the average ratio, which yields the adjusted ratio column. This confirms that gene 2 is underexpressed in Cy3, but not gene 9, whereas gene 5 may be overexpressed.

9 Statistical Analysis of Microarray Data
Using the log transformation method, you get very similar results (−1 and +1). The adjusted columns give the difference between the log2 fluorescence intensity and the mean log2 intensity for the respective dye, and hence express the fluorescence intensity relative to the sample mean. The difference between these values gives the final column, indicating that genes 2 and 5 may be differentially expressed by twofold or more.

10 Statistical Analysis of Microarray Data
If you just subtract the raw log2 values, you will see that gene 9 appears to be underexpressed in Cy3, but gene 5 appears to be slightly less than twofold overexpressed.

11 Finding significant genes
After normalizing, filtering, and averaging the data, one can identify genes with log2 expression ratios significantly greater than +1 or less than −1. Some genes fluctuate a great deal more than others (Hughes et al. 2000a, b). In general, the genes whose expression is most variable are those whose expression is stress induced, modulated by the immune system, or hormonally regulated (Pritchard et al. 2001). Microarray data sets also suffer from the missing value problem, which can be addressed by interpolation. References: Hughes TR, et al. (2000a) Functional discovery via a compendium of expression profiles. Cell 102(1):109-26. Hughes TR, et al. (2000b) Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet 25(3):333-7. Pritchard CC, et al. (2001) Project normal: defining normal variance in mouse gene expression. PNAS 98.

12 Measure of similarity – definition of distance
A measure of similarity is a distance. The Euclidean distance between the expression vectors of two genes (for example, p53 and mdm2) is d(x, y) = sqrt(Σi (xi − yi)²).

13 Measure of similarity – definition of distance
Non-Euclidean metrics. Any distance dij between two vectors i and j must satisfy a number of rules: (1) the distance must be positive definite; (2) the distance must be symmetric, dij = dji; (3) an object is zero distance from itself, dii = 0; (4) the triangle inequality, dik ≤ dij + djk. Distance measures that obey rules 1 to 3 but not 4 are referred to as semi-metric. The Manhattan (or city block) distance is an example of a non-Euclidean distance metric. It is defined as the sum of the absolute differences between the components i of the two expression vectors x and y: dM(x, y) = Σi |xi − yi|. It measures the route one might have to travel between two points in a place such as Manhattan, where the streets and avenues are arranged at right angles to one another. It is known as the Hamming distance when applied to data expressed in binary form, e.g. when the expression levels of the genes have been discretized into 1s and 0s (see the sketch below).
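A minimal Python sketch of these two distances, using hypothetical expression vectors:

```python
import numpy as np

def manhattan(x, y):
    """City-block (L1) distance: the sum of absolute coordinate differences."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def hamming(x, y):
    """Hamming distance for binary (discretized) expression vectors:
    the number of positions at which the two vectors differ."""
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# hypothetical profiles over four conditions
print(manhattan([2.0, 0.5, 1.0, 3.0], [1.0, 1.5, 1.0, 2.0]))  # 3.0
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))                     # 2
```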

14 Measure of similarity – definition of distance
The Chebychev distance (the L∞, Chebychev, or maximum metric) between two n-dimensional vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is dCh(x, y) = maxi |xi − yi|. The Chebychev distance picks the one experiment in which the two genes are most different (the largest difference) and takes that value as the distance between the genes. Because it looks at only one dimension, it is in general resilient to small amounts of noise, even when the noise affects several coordinates: if any or all other coordinates change due to measurement error without changing the maximum difference, the Chebychev distance remains the same. However, if any one coordinate is affected enough to change the maximum difference, the Chebychev distance changes; it is therefore sensitive to a single large outlier.

15 Measure of similarity – definition of distance
The Minkowski distance is a generalization of the Euclidean distance (see the formulas below). The parameter p is called the order: the higher the value of p, the more significant the contribution of the largest components |ai − bi|. p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p = ∞ the Chebychev distance. The Mahalanobis metric is also defined below, where Cov(D) is the covariance matrix for dataset D. If Cov(D) is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. (Hermann Minkowski, 1864-1909.)
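In standard notation, consistent with the definitions above, the Minkowski distance of order p and the Mahalanobis distance are:

```latex
d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
\qquad
d_M(x, y) = \sqrt{(x - y)^{\mathsf{T}}\, \mathrm{Cov}(D)^{-1}\, (x - y)}
```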

16 Measure of similarity – definition of distance
[Figure: graphical illustration of the Manhattan and Euclidean distances in the X-Y plane; the example shown has Manhattan distance 3 and Hamming distance 3.]

17 Measure of similarity – definition of distance
The higher the value of p, the more significant the contribution of the largest components |ai − bi|. [Worked example from the slide: as p grows, one distance falls toward 3, just under the Euclidean value of sqrt(10) ≈ 3.162, while the other approaches 10.]

18 Measure of similarity – definition of distance
The Canberra metric is defined as dC(x, y) = Σi |xi − yi| / (|xi| + |yi|). The output ranges from 0 to the number of variables used, since when xi and yi have opposite signs the maximum of |xi − yi| is |xi| + |yi|, making that term equal to 1. The Canberra distance is very sensitive to small changes near zero, in particular when there is a change of sign near zero.

19 Measure of similarity – definition of distance
Euclidean distance is one of the most intuitive ways to measure the distance between points in space, but it is not always the most appropriate one for expression profiles. We need distance measures that score as similar those gene expression profiles that show a similar trend, rather than measures that depend on absolute levels. Two simple measures that can be used are the angular and chord distances (illustrated on the slide for two vectors A and B).

20 Measure of similarity – definition of distance
Let A = (ax, ay) and B = (bx, by). The cosine of the angle between the two vectors A and B is given by their normalized dot product and can be used as a similarity measure. In n-dimensional space, for vectors A = (a1, …, an) and B = (b1, …, bn), the cosine is cos θ = (Σi ai bi) / (|A| |B|). The chord distance is defined as the length of the chord between the vectors of unit length having the same directions as the original ones.
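A small Python sketch of the cosine similarity and chord distance, with hypothetical profiles that share a trend but differ in absolute level:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between expression vectors a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def chord_distance(a, b):
    """Length of the chord between the unit vectors along a and b;
    equal to sqrt(2 - 2*cos(theta))."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b))

# same trend, different absolute levels -> maximal similarity, zero distance
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0
print(chord_distance([1, 2, 3], [2, 4, 6]))     # 0.0
```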

21 Semimetric distance – Pearson correlation coefficient or Covariance
Statistics: the standard deviation and variance, var(X) = s², apply to one-dimensional data. What about higher-dimensional data? It is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other. Covariance is measured between two dimensions: given a 3-dimensional data set (X, Y, Z), one can calculate Cov(X,Y), Cov(X,Z), and Cov(Y,Z), where Cov(X,Y) = Σi (xi − x̄)(yi − ȳ)/(n − 1). To compare heterogeneous pairs of variables, define the correlation coefficient, or Pearson correlation coefficient, rXY = Cov(X,Y)/(sX sY), with −1 ≤ rXY ≤ 1: −1 indicates perfect anticorrelation, 0 independence, and +1 perfect correlation.

22 Semimetric distance – the squared Pearson correlation coefficient
The Pearson correlation coefficient is useful for examining correlations in the data. One may imagine an instance, however, in which the same TF can cause both enhancement and repression of expression. A better alternative then is the squared Pearson correlation coefficient (PCC), rsq = r², which takes values in the range 0 ≤ rsq ≤ 1: 0 for uncorrelated vectors, 1 for perfectly correlated or anti-correlated vectors. PCCs are measures of similarity, and similarity and distance have a reciprocal relationship: as similarity increases, distance decreases, so d = 1 − r is typically used as a measure of distance.

23 Semimetric distance – Pearson correlation coefficient or Covariance
The resulting rXY value will be greater than 0 if X and Y tend to increase together, less than 0 if one tends to increase when the other decreases, and 0 if they are independent. Remark: rXY only tests for a linear dependence, Y = aX + b. If two variables are independent, rXY is low; but a low rXY may or may not indicate independence, since the relation may be non-linear. A high rXY is thus a sufficient but not necessary condition for variable dependence.

24 Semimetric distance – the squared Pearson correlation coefficient
To test for a non-linear relation among the data, one can transform the data by variable substitution. Suppose one wants to test the relation u(v) = avⁿ. Taking the logarithm of both sides gives log u = log a + n log v. Setting Y = log u, b = log a, and X = log v yields a linear relation, Y = b + nX, so log u correlates (n > 0) or anti-correlates (n < 0) with log v.
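A short Python sketch of this idea, using synthetic data generated under the assumed power law u = a·vⁿ (here a = 3, n = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.linspace(1.0, 10.0, 50)
u = 3.0 * v**2 * rng.lognormal(0.0, 0.05, v.size)  # u = a*v^n with mild noise

r_loglog = np.corrcoef(np.log(v), np.log(u))[0, 1]   # Pearson r after log transform
slope = np.polyfit(np.log(v), np.log(u), 1)[0]       # estimate of the exponent n

print(round(r_loglog, 4), round(slope, 2))  # r ~ 1, slope ~ 2
```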

25 Semimetric distance – Pearson correlation coefficient or Covariance matrix
A covariance matrix is merely a collection of many pairwise covariances, arranged in the form of a d × d matrix: Cov(D) = [cij], with cij = Cov(Xi, Xj).

26 Spearman’s rank correlation
One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can make two genes appear to be correlated even when all the other data points suggest that they are not. Spearman's rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers; it ignores the magnitude of the changes. The idea of rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks. First order the values of genes A and B in ascending order and assign the lowest value rank 1. The SRC between A and B is then defined as the PCC between the ranked A and the ranked B. In case of ties, assign midranks: for example, if two values tie for ranks 5 and 6, assign each a rank of 5.5.

27 Spearman’s rank correlation
The SRC can be calculated by the formulas below, where xi and yi denote the ranks of the x and y values respectively; a shortcut formula, exact when there are no ties, is also shown.
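In standard notation, the SRC is the PCC of the ranks; without ties it reduces to the familiar shortcut:

```latex
r_s = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
           {\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
\qquad
r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, \quad d_i = x_i - y_i
```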

28 Distances in discretized space
Sometimes it is advantageous to use a discretized expression matrix as the starting point, e.g. assigning the values 0 (expression unchanged), 1 (expression increased), and −1 (expression decreased). The similarity between two discretized vectors can then be measured by the notion of Shannon entropy.

29 Entropy and the Second Law of Thermodynamics: Disorder and the Unavailability of energy
Entropy always increases. When ice melts, it becomes more disordered and less structured.

30 Statistical Interpretation of Entropy and the Second Law
S = k ln W, where S is the entropy, k the Boltzmann constant, and ln W the natural logarithm of the number of microstates W corresponding to the given macrostate. (L. Boltzmann, 1844-1906.)

31 Entropy and the Second Law of Thermodynamics:
Disorder and the Unavailability of energy

32 Concept of entropy
Toss 5 coins. The possible outcomes and their numbers of microstates are:

Outcome      5H0T  4H1T  3H2T  2H3T  1H4T  0H5T
Microstates  1     5     10    10    5     1

A total of 32 microstates. Propose that entropy is related to the number of microstates W, i.e. S ~ W. (The coin tosses can be generated with Excel.) The most probable macrostates are 3H2T and 2H3T.

33 Shannon entropy Shannon entropy is related to physical entropy
Shannon asked the question "What is information?" Energy is defined as the capacity to do work, not the work itself; work is a form of energy. Analogously, define information as the capacity to store and transmit meaning or knowledge, not the meaning or knowledge itself. For example, the WWW carries a lot of information, but that does not mean knowledge. Shannon suggested that entropy is the measure of this capacity. Summary: information is defined as the capacity to store and transmit knowledge, and entropy, the Shannon entropy, is its measure. Entropy ~ randomness ~ measure of the capacity to store and transmit knowledge. Reference: Gatlin L.L., Information theory and the living system, Columbia University Press, New York, 1972.

34 Shannon entropy
How do we relate randomness to a measure of this capacity? Microstates: 5H0T 1, 4H1T 5, 3H2T 10, 2H3T 10, 1H4T 5, 0H5T 1. Physical entropy: S = k ln W. For the Shannon entropy, assume equal probability pi of each individual microstate, pi = 1/W, so that S = −k ln pi. Information ~ 1/pi = W. If pi = 1, there is no information, because the outcome is certain; if pi << 1, there is more information. That is, information is a decrease in certainty.

35 Distances in discretized space
Sometimes it is advantageous to use a discretized expression matrix as the starting point, e.g. assigning the values 0 (expression unchanged), 1 (expression increased), and −1 (expression decreased). The similarity between two discretized vectors can be measured by the notion of Shannon entropy, H1, defined in terms of the probability pi of observing a particular symbol or event within a given sequence (see the formula below, which uses logarithms to base 2). Consider a binary system in which an element X has two states, 0 or 1. Claude Shannon is the father of information theory; H1 measures the "uncertainty" of a probability distribution and is the expectation (average) value of the information. Reference: plus.maths.org/issue23/features/data/
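In standard notation, the Shannon entropy of a distribution p1, …, pn is:

```latex
H_1 = -\sum_{i=1}^{n} p_i \log_2 p_i
```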

36 Shannon Entropy
pi = (1, 0): H1 = −1·log2(1) = 0 — no information; the outcome is certain.
pi = (1/2, 1/2): H1 = −2·(1/2)·log2(1/2) = 1.
2² states, pi = 1/4 each: H1 = −4·(1/4)·log2(1/4) = 2.
2^N states, pi = 1/2^N each: H1 = −2^N·(1/2^N)·log2(1/2^N) = N — uniform probability gives the maximal value.
DNA sequence: n = 4 states, so maximum H1 = −4·(1/4)·log2(1/4) = 2 bits. Protein sequence: n = 20 states, so maximum H1 = −20·(1/20)·log2(1/20) = log2 20 ≈ 4.32 bits, which is between 4 and 5 bits.

37 The Divergence from equi-probability
When all letters are equiprobable, pi = 1/n and H1 = log2(n), the maximum value H1 can take. Define Hmax1 = log2(n), and define the divergence from this equiprobable state as D1 = Hmax1 − H1. D1 tells us how much of the total divergence from the maximum entropy state is due to the divergence of the base composition from a uniform distribution. For example, the E. coli genome has no divergence from equiprobability because H1(Ec) = 2 bits, but for the M. lysodeikticus genome H1(Ml) = 1.87 bits, so D1 = 2.00 − 1.87 = 0.13 bit. Divergence from independence: single-letter events contain no information about how these letters are arranged in a linear sequence; that is treated next.

38 Divergence from independence – Conditional Entropy
Question: does the occurrence of any one base along a DNA sequence alter the probability of occurrence of the base next to it, and what are the numerical values of the conditional probabilities? Here p(X|Y) is the probability of event X conditional on event Y: p(A|A), p(T|A), p(C|A), p(G|A), etc. If the bases were independent, p(A|A) = p(A), p(T|A) = p(T), and so on. Consider an extreme ordering case, the equiprobable sequence AAAA…TTTT…CCCC…GGGG…: here p(A|A) is very high, p(T|A) is very low, and p(C|A) = p(G|A) = 0. In the other extreme case, ATCGATCGATCG…, p(T|A) = p(C|T) = p(G|C) = p(A|G) = 1 and all the others are 0. The equiprobable state therefore does not imply independent events.

39 Divergence from independence – Conditional Entropy
Consider the space of DNA dimers (nearest neighbors), S2 = {AA, AT, …, TT}. The entropy of S2 is H2 = −[p(AA) log p(AA) + p(AT) log p(AT) + … + p(TT) log p(TT)]. If the single-letter events are independent, p(X|Y) = p(X), and the dimer probabilities factorize: p(AA) = p(A)p(A), p(AT) = p(A)p(T), and so on. If the dimers are not independent, p(XY) = p(X)p(Y|X), e.g. p(AA) = p(A)p(A|A), p(AT) = p(A)p(T|A), etc. Let HIndp2 be the entropy computed under complete independence. The divergence from independence is then D2 = HIndp2 − H2, and D1 + D2 is the total divergence from the maximum entropy state.

40 Divergence from independence – Conditional Entropy
Calculate D1 and D2 for M. phlei DNA, where p(A) = 0.164, p(T) = 0.162, p(C) = 0.337, p(G) = 0.337. H1 = −(0.164 log2 0.164 + 0.162 log2 0.162 + 0.337 log2 0.337 + 0.337 log2 0.337) ≈ 1.91 bits, so D1 = 2.00 − 1.91 ≈ 0.09 bit. D2 = HIndp2 − H2 requires the dimer frequencies; see the accompanying Excel file for H2, D2, and the total divergence D1 + D2.
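A quick check of H1 and D1 in Python, using the base frequencies given above:

```python
import numpy as np

p = np.array([0.164, 0.162, 0.337, 0.337])  # p(A), p(T), p(C), p(G) for M. phlei

H1 = -np.sum(p * np.log2(p))   # single-letter Shannon entropy, in bits
D1 = 2.0 - H1                  # divergence from equiprobability (Hmax1 = log2 4 = 2)

print(round(H1, 2), round(D1, 2))  # ~1.91 bits and ~0.09 bit
```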

41 Divergence from independence – Conditional Entropy
We can compare different sequences using H to establish relationships. (Conversion between logarithm bases: if Y = 2^x, then log10 Y = x log10 2, so x = log10 Y / log10 2.) Given knowledge of one sequence, say X, can we estimate the uncertainty of Y relative to X? The conditional entropies H(X|Y) and H(Y|X) relate X and Y: conditional entropy is the uncertainty relative to known information. H(X,Y) = H(Y|X) + H(X), i.e. the uncertainty of Y given knowledge of X plus the uncertainty of X sums to the joint entropy of X and Y; likewise H(X,Y) = H(X|Y) + H(Y). For the example on the next slide, H(Y|X) = H(X,Y) − H(X) = 1.85 − 0.97 = 0.88 bit.

42 Shannon Entropy – Mutual Information
Joint entropy: H(X,Y) = −Σij pij log2 pij, where pij is the joint probability of finding X = xi and Y = yj. Example: p00 = 0.1, p01 = 0.3, p10 = 0.4, p11 = 0.2. The mutual information M(X,Y) is the information shared by X and Y, and can be used as a similarity measure between X and Y. As in set theory, where A∪B = A + B − (A∩B), we have H(X,Y) = H(X) + H(Y) − M(X,Y), so M(X,Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). For the example, M(X,Y) = 1.00 − 0.88 = 0.12 bit, or equivalently M(X,Y) = 0.97 + 1.00 − 1.85 = 0.12 bit. A small M(X,Y) indicates that X and Y are independent, p(X,Y) = p(X)p(Y); a large M(X,Y) indicates that X and Y are associated.
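A short Python sketch verifying these numbers from the joint distribution on the slide:

```python
import numpy as np

# Joint distribution from the slide: p00=0.1, p01=0.3, p10=0.4, p11=0.2
pxy = np.array([[0.1, 0.3],
                [0.4, 0.2]])

def H(p):
    """Shannon entropy in bits of a probability array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hx  = H(pxy.sum(axis=1))   # H(X): marginal over rows    -> ~0.97
Hy  = H(pxy.sum(axis=0))   # H(Y): marginal over columns -> ~1.00
Hxy = H(pxy)               # joint entropy H(X,Y)        -> ~1.85

M = Hx + Hy - Hxy          # mutual information          -> ~0.12 bit
print(round(Hx, 2), round(Hy, 2), round(Hxy, 2), round(M, 2))
```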

43 Shannon Entropy – Conditional Entropy
[Figure: the conditional entropy H(Y|X) is obtained by computing the entropy of p(Y|X = x) for each particular x and then averaging over all x's.]

44 Statistical Analysis of Microarray Data

45 Statistical Analysis of Microarray Data
Normalize each channel separately: Gn − <G> and Rn − <R>. Subtraction of the mean log fluorescence intensity for the channel from each value transforms the measurements so that the abundance of each transcript is represented as a fold increase or decrease relative to the sample mean, namely as a relative fluorescence intensity: log Gn − <log Gn> and log Rn − <log Rn>, where n = 1, 2, ….

46 Central Limit Theorem
Consider the following set of measurements for a given population: 55.20, 18.06, 28.16, 44.14, 61.61, 4.88, …, 97.47, 56.89, …, 9.98, …. The population mean is μ. Now consider two samples from this population. These two different samples could have means very different from each other, and also very different from the true population mean. What happens if we consider not only two samples, but all possible samples of the same size? The answer to this question is one of the most fascinating facts in statistics: the central limit theorem. It turns out that if we calculate the mean of each sample, those mean values tend to be distributed as a normal distribution, independently of the original distribution. The mean of this new distribution of the means is exactly the mean of the original population, and the variance of the new distribution is reduced by a factor equal to the sample size n.
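A minimal Python simulation of this fact, using a hypothetical skewed population:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=50.0, size=100_000)  # a skewed parent population

n = 25                                                   # sample size
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

print(population.mean(), sample_means.mean())    # nearly equal
print(population.var() / n, sample_means.var())  # variance shrinks by a factor of n
```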

47 Central Limit Theorem
When sampling from a population with mean μ and standard deviation σ, the distribution of the sample mean (the sampling distribution of X̄) has the following properties: (1) The distribution of X̄ will be approximately normal; the larger the sample, the more the sampling distribution resembles the normal distribution. (2) The mean of the distribution of X̄ will be equal to μ, the mean of the population from which the samples were drawn. (3) The variance of the distribution of X̄ will be equal to σ²/n, the variance of the original population divided by the sample size; the quantity σ/√n is called the standard error of the mean.

48 Statistical hypothesis testing
The expression level of a gene in a given condition is measured several times, and a mean x̄ of these measurements is calculated. From many previous experiments, it is known that the mean expression level of the given gene in normal conditions is μ. How can one decide which genes are significantly regulated in a microarray experiment? For instance, one can apply an arbitrary cutoff such as a threshold of at least twofold up- or down-regulation. More formally, one can formulate the following hypotheses: the gene is up-regulated in the condition under study, x̄ > μ; the gene is down-regulated, x̄ < μ; the gene is unchanged, x̄ = μ; or something has gone awry during the lab experiments and the gene's measurements are completely off, so that the mean of the measurements may be higher or lower than normal, x̄ ≠ μ.

49 Statistical hypothesis testing
When a hypothesis test is viewed as a decision procedure, two types of error are possible, depending on which hypothesis, H0 or H1, is actually true. If a test rejects H0 (and accepts H1) when H0 is true, it makes a type I error. If a test fails to reject H0 when H1 is true, it makes a type II error. The possible outcomes of the different decisions are:

             Accept H0          Reject H0
H0 is true   Correct decision   Type I error
H0 is false  Type II error      Correct decision

50 Statistical hypothesis testing
The next step is to generate two hypotheses. The two hypotheses must be mutually exclusive and all-inclusive: mutually exclusive means that the two hypotheses cannot both be true at the same time; all-inclusive means that their union must cover all possibilities. Expression ratios are converted into probability values to test the hypothesis that particular genes are significantly regulated. The null hypothesis H0 states that there is no difference in signal intensity across the conditions being tested. The other hypothesis (called the alternative or research hypothesis) is named H1. If we believe that the gene is up-regulated, the research hypothesis is H1: x̄ > μ. The null hypothesis has to be mutually exclusive with H1 and also has to include all other possibilities; therefore the null hypothesis is H0: x̄ ≤ μ. One assigns a p-value for testing the hypothesis: the p-value is the probability of a measurement more extreme than a certain threshold occurring just by chance. The probability of rejecting the null hypothesis when it is true is the significance level α, typically set at p < 0.05; in other words, we accept that in 1 in 20 cases our conclusion may be wrong.

51 Statistical hypothesis testing
One-tail testing: the alternative hypothesis specifies that the parameter is greater than the value specified under H0, e.g. H1: μ > 15; such a hypothesis is called an upper one-tail test. Example: the expression level of a gene is measured 4 times in a given condition, and the 4 measurements are used to calculate a mean expression level of x̄ = 90. It is known from the literature that the mean expression level of the gene, measured with the same technology in normal conditions, is μ = 100 with standard deviation σ = 10. We expect the gene to be down-regulated in the condition under study, and we would like to test whether the data support this assumption. The alternative hypothesis H1 is "the gene is down-regulated": H0: x̄ ≥ μ and H1: x̄ < μ. This is an example of a one-tail hypothesis in which we expect the values to fall in one particular tail of the distribution.

52 Statistical hypothesis testing
From the sampling theorem, the means of samples are distributed approximately normally. Sample size n = 4, mean x̄ = 90, standard deviation σ = 10, significance level 5%: the null hypothesis is rejected if the computed p-value is lower than the critical value (0.05). We calculate Z = (x̄ − μ)/(σ/√n) = (90 − 100)/(10/2) = −2. The probability of having such a value just by chance, i.e. the p-value, is p(Z < −2) ≈ 0.0228. The computed p-value is lower than our significance threshold of 0.05, therefore we reject the null hypothesis; in other words, we accept the alternative hypothesis and state that "the gene is down-regulated at the 5% significance level." This will be understood by the knowledgeable reader as a conclusion that is wrong in 5% of cases or fewer.
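The same one-tail z-test in a few lines of Python:

```python
from math import sqrt
from statistics import NormalDist

# Values from the example: n = 4, sample mean 90, mu = 100, sigma = 10.
n, xbar, mu, sigma = 4, 90.0, 100.0, 10.0

z = (xbar - mu) / (sigma / sqrt(n))   # = -2.0
p = NormalDist().cdf(z)               # P(Z < -2) ~ 0.0228

print(z, round(p, 4))  # reject H0 at the 5% level since 0.0228 < 0.05
```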

53 Normal distribution table

54 Normal distribution table
NORMDIST gives the area under the curve accumulated from the left-hand side. [Figure: standard normal curve with Z = 0 and Z = 2 marked.]

55 Statistical hypothesis testing
Two-tail testing. A novel gene has just been discovered. A large number of expression experiments measured the mean expression level of this gene as 100, with a standard deviation of 10. Subsequently, the same gene is measured 4 times in 4 cancer patients; the mean of these 4 measurements is 109. Can we conclude that this gene is differentially expressed in cancer? We do not know whether the gene will be up-regulated or down-regulated, so the null hypothesis is H0: μ = 100 and the alternative hypothesis is H1: μ ≠ 100. At a significance level of 5%, there is 2.5% for the left tail and 2.5% for the right tail. Z = (109 − 100)/(10/√4) = 9/5 = 1.8, and the p-value is p(Z ≥ 1.8) = 1 − p(Z ≤ 1.8) = 1 − 0.9641 = 0.0359 > 0.025. The p-value is higher than the significance level for the tail, so we cannot reject the null hypothesis.

56 Tests involving the mean – the t distribution
Hypothesis testing can be parametric, where the data are known or assumed to follow a certain probability distribution (e.g. the normal distribution), or non-parametric, where no a priori knowledge is available and no such assumptions are made. The t-distribution test, or Student's t-test, is a parametric test; it was developed by William S. Gossett, a 32-year-old research chemist employed by the famous Irish brewery Guinness.

57 Tests involving the mean – the t distribution
Tests involving a single sample may focus on the mean of the sample (t-test, where the variance of the population is not known) or on the variance (χ²-test). The following hypotheses may be formulated if the testing regards the mean of the sample: (1) H0: μ = c, H1: μ ≠ c; (2) H0: μ ≥ c, H1: μ < c; (3) H0: μ ≤ c, H1: μ > c. The first pair corresponds to a two-tail test in which no a priori knowledge is available, while the second and third correspond to one-tail tests in which the measured value c is expected to be higher and lower than the population mean μ, respectively.

58 Tests involving the mean – the t distribution
The expression level of a gene is known to have a mean expression level of 18 in the normal human population. The following expression values have been obtained in five measurements: 21, 18, 23, 20, 18. Are these data consistent with the published mean of 18 at a 5% significance level? The population s.d. σ is not known, so we use a t-test, calculating the sample s.d. s as an estimate of σ. H0: μ = 18, H1: μ ≠ 18 — a two-tail test. The t statistic is t = (x̄ − μ)/(s/√n) = (20 − 18)/(2.121/√5) ≈ 2.11. Remember to use n − 1 when calculating the sample standard deviation s.
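The same one-sample t-test in Python:

```python
from math import sqrt
from statistics import mean, stdev

# Two-tail one-sample t-test for the example: 21, 18, 23, 20, 18 against mu = 18.
x = [21, 18, 23, 20, 18]
mu = 18.0

xbar, s, n = mean(x), stdev(x), len(x)   # stdev uses the n-1 denominator
t = (xbar - mu) / (s / sqrt(n))

print(xbar, round(s, 3), round(t, 2))    # 20, 2.121, t ~ 2.11 < 2.776 -> keep H0
```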

59 Tests involving the mean – the t distribution
The t-distribution is symmetric. The degrees of freedom are ν = 5 − 1 = 4. Using a table of the t-distribution with four degrees of freedom, the p-value associated with this test statistic is found to be between 0.05 and 0.1. The 5% two-tail test corresponds to a critical value of 2.776. Since the p-value is greater than 0.05 (t-value 2.11 < critical value 2.776), the evidence is not strong enough to reject the null hypothesis of mean 18, so we accept H0.

60 The t-distribution table - cumulative probability starting from left hand side
Two-tail columns: α = 0.10, 0.05.

61 The t-distribution table – Excel – TINV gives the two-tail critical value

62 Evaluate the significance of the following gene expression differences – t test

63 Evaluate the significance of the following gene expression differences – t test
We expect an average ratio of 1, so H0: the measured mean ≤ 1 and H1: the measured mean > 1 — an upper one-tail test. The t-score is t = (x̄ − 1)/(s/√n). The p-values (for t-scores of … and 6.71) are less than 0.05 (the critical value is t0.05(4) = 2.132) for genes 1 and 3, so H0 is rejected for them, but not for gene 2. We conclude that the level of expression is increased only in genes 1 and 3.

64 Tests involving the mean – the t distribution
The expression level of a gene is known to have a mean expression level of 225 in the normal human population. Expression values have been obtained in sixteen measurements, in which the sample mean and s.d. are found to be … and … respectively. Are these data higher than the published mean at a 5% significance level? This is an upper one-tail test: H0: x̄ ≤ μ = 225, H1: x̄ > μ = 225. t-score = (x̄ − 225)/(s/√16), with 15 degrees of freedom. The 5% level corresponds to a critical value of t0.05(15) = 1.753. The computed t-score is less than the critical value, so based on the critical value we accept the null hypothesis: the gene expression data set is not higher than the published mean of 225 at the 5% significance level.

65 Tests involving the variance – the chi-square distribution
The expression level of a gene is known to have variance σ² = 5000 in the normal human population. The same gene is measured 26 times and found to have sample variance s² = …. Is there evidence that the new measurements differ from the population at a 2% significance level? The population mean is unknown, so we use a χ² test. Null hypothesis H0: s² = σ² = 5000, i.e. the newly measured variance is not different from the population variance; alternative hypothesis H1: s² ≠ σ² = 5000 (a two-tail test). The test statistic is χ² = (n − 1)s²/σ². This variable has the interesting property that if all possible samples of size n are drawn from a normal population with variance σ², and for each such sample the quantity (n − 1)s²/σ² is computed, these values always form the same distribution: a sampling distribution called the χ² (chi-square) distribution. The two-tail test rejects H0 when χ² falls below the p = 0.99 critical value or above the p = 0.01 critical value, and accepts H0 in between.

66 Tests involving the variance – the chi-square distribution
If the sample standard deviation s is close to the population standard deviation σ, the value of χ² will be close to n − 1 (the degrees of freedom); if s is very different from σ, χ² will be very different from n − 1. Using the χ² distribution to solve the above problem at a 2% significance level, the critical values are χ²0.01(25) = 44.314 and χ²0.99(25) = 11.524 (right-hand tail areas). The rejection regions are χ² ≤ 11.524 and χ² ≥ 44.314. Since the computed χ² = 46 > 44.314, we reject the null hypothesis: the measurement differs from the population at the 2% significance level.
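A Python sketch of this χ² test; note that the sample variance used here (9200) is back-calculated from the slide's χ² value of 46 and is an assumption, not a value stated on the slide:

```python
# Chi-square test of a sample variance against a known population variance.
n = 26
s2 = 9200.0        # assumed sample variance, implied by chi2 = 46 below
sigma2 = 5000.0    # known population variance

chi2 = (n - 1) * s2 / sigma2    # = 46.0
lower, upper = 11.524, 44.314   # chi2_0.99(25) and chi2_0.01(25) from tables

print(chi2, chi2 < lower or chi2 > upper)  # True -> reject H0 at the 2% level
```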

67 The chi-square distribution
Excel - CHIINV, uses right hand tail

68 Tests involving the variance – the chi-square distribution
The expression level of a gene is known to follow a normal distribution and to have a standard deviation (s.d.) of no more than 5 in the normal human population. The same gene is measured 9 times and found to have an s.d. of 7. Does this data set have a sample variance higher than the published variance at a 5% significance level? This is an upper one-tail test. Null hypothesis H0: σ² ≤ 25; alternative hypothesis H1: σ² > 25. χ² = (9 − 1)·49/25 = 15.68, with 8 degrees of freedom. The 5% level corresponds to a critical value of χ²0.05(8) = 15.507. The χ² value is larger than the critical value, so based on the critical value we reject the null hypothesis: the gene does have an s.d. higher than the published value of 5 at the 5% significance level.

69 Tests involving two samples drawing from the SAME population – comparing variances, F distribution
Goal: to determine whether a given gene is expressed differently between patients and healthy subjects. This involves comparing the means of the two samples, but to answer the question one must first know whether the two samples have the same variance. The method used to compare the variances of two samples is based on the F distribution (named after R.A. Fisher). We then use a t-test to decide whether the mean expression of the gene differs between patients and healthy subjects. In summary: F-test on samples A and B → is sA² = sB²? → t-test (using different formulas for the sA² = sB² and sA² ≠ sB² cases).

70 Tests involving two samples – comparing variances, F distribution
The values measured in controls are 10, 11, 11, 12, 15, 13, 12; the values measured in patients are 12, 13, 13, 15, 12, 18, 17, 16, 16, 12, 15, 10, 12. Is the variance different between the controls and the patients at a 5% significance level? H0: sA² = sB², H1: sA² ≠ sB² — a two-tail test on a new test statistic, the ratio of the sample variances. Notation: A = controls and B = patients in the following calculation. Controls sample A has 6 degrees of freedom and variance 2.66; patients sample B has 12 degrees of freedom and variance 5.74. Consider the ratio F = 2.66/5.74 = 0.4634. The significance level in each tail of the two-tail test is 5%/2 = 2.5%. From the F distribution (right tail), F0.025(6,12) = 3.7283 (from Excel) and F0.975(6,12) = 0.1864 (from Excel).

71 F distribution – right tail
[Figure: F distribution with the right-tail area of 0.025 shaded; see the next page for the critical values.]

72 Tests involving two samples – comparing variances, F distribution

73 Tests involving two samples – comparing variances, F-distribution
Usually we have F-distribution tables for 0.01, 0.025, and 0.05, but not for 0.975! Given F0.025(6,12) = 3.7283, how do we find F0.975(6,12)? The F distribution has the interesting property that the left-tail critical value for an F with ν1 and ν2 degrees of freedom is the reciprocal of the right-tail critical value with the degrees of freedom reversed: F(1−α)(ν1, ν2) = 1/Fα(ν2, ν1). Hence F0.975(6,12) = 1/F0.025(12,6) = 1/5.366 = 0.1864. Back to our null hypothesis test: since 0.1864 < 0.4634 < 3.7283, the F statistic lies between the two critical values, so we accept the null hypothesis — there is no difference in variance between controls and patients.

74 Tests involving two samples – comparing variances, F-distribution
Now let us consider the reciprocal ratio instead. The two choices must lead to the same conclusion, since the conclusion cannot depend on which variance is placed in the numerator or the denominator. Controls sample A has 6 degrees of freedom and variance 2.66; patients sample B has 12 degrees of freedom and variance 5.74. F = 5.74/2.66 = 2.158. From the F distribution (right tail), F0.025(12,6) = 5.366 (from Excel) and F0.975(12,6) = 1/F0.025(6,12) = 1/3.7283 = 0.2682 (from Excel). Since 0.2682 < 2.158 < 5.366, the F statistic lies between the critical values and we accept the null hypothesis: there is no difference between controls and patients. Remark: the two F-tests are reciprocal to each other. That is, 0.1864 < 0.4634 < 3.7283 becomes, on taking reciprocals, 1/0.1864 > 1/0.4634 > 1/3.7283, i.e. 5.366 > 2.158 > 0.2682.
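The full two-tail F-test in Python, using the control and patient values from the example (the table critical values are hard-coded):

```python
from statistics import variance

controls = [10, 11, 11, 12, 15, 13, 12]
patients = [12, 13, 13, 15, 12, 18, 17, 16, 16, 12, 15, 10, 12]

F = variance(controls) / variance(patients)   # ~2.66/5.74 ~ 0.46

# Critical values from tables: F_0.025(6,12) and F_0.975(6,12) = 1/F_0.025(12,6)
upper = 3.7283
lower = 1.0 / 5.366

print(round(F, 4), lower < F < upper)  # True -> no evidence the variances differ
```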

75 Tests involving two samples – comparing means
The expression levels of gene AC measured in the patients (P1-P6) and controls (C1-C6) are: geneID AC, patients P1-P6: …; controls C1-C6: …. H0: μP = μC, H1: μP ≠ μC. The mean expression level of the patients is X̄P = 0.755 and of the controls X̄C = 0.373, with sP² = 0.059 and sC² = 0.097. To test whether the two samples have the same variance, we first perform an F-test at the 5% level: F = 0.059/0.097 = 0.61, d.o.f. = 10. Since F lies between the table values F0.975(6,6) and F0.025(6,6), we accept the null hypothesis of equal variances: the patients and controls have the same variance.

76 Tests involving two samples – comparing means
For the t-statistic of two independent samples with equal variances, the pooled formula below is used. The resulting t-score yields a p-value, i.e. the probability of obtaining such a value by chance. Here the p-value is smaller than the significance level 0.05, and therefore we reject the null hypothesis: gene AC is expressed differently between cancer patients and healthy subjects.
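In standard notation, the pooled t statistic for equal variances is:

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}},
\qquad
s_p^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2},
\qquad
\nu = n_A + n_B - 2
```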

77 Tests involving two samples – comparing means
For the t-statistic of two independent samples with unequal variances, a modified t-score is used, and the degrees of freedom ν must be adjusted (see the formulas below). The adjusted value is generally not an integer and is rounded down.
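In standard notation, the unequal-variance (Welch) t statistic and the adjusted degrees of freedom are:

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad
\nu = \frac{\left( \dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B} \right)^{2}}
           {\dfrac{(s_A^2/n_A)^2}{n_A - 1} + \dfrac{(s_B^2/n_B)^2}{n_B - 1}}
```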

78 Tests involving two samples from two different populations
F-statistics

79 Analysis of variance (ANOVA)
How do we compare multiple samples at once? We could do pairwise t-tests to see which samples differ from one another. However, the α-level probability of making an error applies to each test, so the real chance of making an error increases when multiple tests are used. There are ways of dealing with this, but it is time-consuming to test many pairs. The analysis of variance procedure (ANOVA) is a way to make multiple comparisons: H0: μ1 = μ2 = μ3 = … = μn for n means; Ha: at least one mean is not equal to the others, i.e. μi ≠ μj, or μi ≠ μj ≠ μk, or more.

80 Analysis of variance (ANOVA)
Some necessary definitions and notation: xij = observation j in group i; I = the number of groups; ni = the sample size of group i. Dot notation: a dot that replaces an index stands for the mean over the observations the dot replaces, so xi• is the mean for group i (the j's have been averaged within the group). The total number of observations is n*, and the overall mean is x••.

81 Analysis of variance (ANOVA)
Now we define the terms that are important for this technique. The first is the "sum of squares" (abbreviated SS), of which there are three: total, within groups, and between groups. SS(total) is the sum of squares for all of the data, corrected for the overall mean of all observations. SS(within groups) is the sum of squares within each group, corrected for the mean of that group (i.e. over the j observations). SS(between groups) is the sum of squares of the group means about the overall mean, each weighted by its group size. The degrees of freedom associated with the total are n* − 1, with the within-groups term n* − I, and with the between-groups term I − 1.

82 Analysis of variance (ANOVA)
The relationship between the sums of squares we have just calculated is SS(total) = SS(between groups) + SS(within groups). Another way to write this, in terms of an experiment, is SS(total) = SS(treatments) + SS(random error). The mean squares are the sums of squares divided by their degrees of freedom, MS = SS/df. If SS(total) is fixed (cannot change), then as SS(treatments) increases, SS(random error) must decrease. A successful experiment has most of the sums of squares in the between-groups partition, so that the groups differ (treatments differ from control, etc.). An unsuccessful experiment has most of the sums of squares in the within-groups partition, so that it doesn't matter which group an observation is in (treatment means are the same as the control, etc.).

83 Analysis of variance (ANOVA)
A summary table for presenting the calculated values:

Source           df       SS                  MS
Between groups   I - 1    SS(between groups)  MS(between groups)
Within groups    n* - I   SS(within groups)   MS(within groups)
Total            n* - 1   SS(total)

84 Analysis of variance (ANOVA)
Example: see the hyperlink. Excel functions used: SUMXMY2, FINV, AVERAGE.
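For readers not using Excel, a minimal Python sketch of the same one-way ANOVA computation on hypothetical data:

```python
import numpy as np

# Three hypothetical groups of expression values.
groups = [np.array([10.0, 12.0, 11.0, 13.0]),
          np.array([14.0, 15.0, 13.0, 16.0]),
          np.array([10.0, 11.0, 12.0, 11.0])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
n_star, I = all_obs.size, len(groups)

ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within  = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (I - 1)     # MS = SS / degrees of freedom
ms_within  = ss_within / (n_star - I)
F = ms_between / ms_within            # compare with F_alpha(I-1, n*-I)

print(round(ss_between, 2), round(ss_within, 2), round(F, 2))
```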

85 Analysis of variance (ANOVA) – using Excel tool

86 Perform a cluster analysis on gene expression profiles

87 Perform a cluster analysis on gene expression profiles by computing the Pearson correlation coefficient

88 Hierarchical Clustering Method
We continue this process, clustering gene 1 with gene 4, then {2,3} with 5. The resulting hierarchy takes the form shown on the slide.
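A minimal Python sketch of such a hierarchical clustering, using 1 − Pearson correlation as the distance; the gene numbering mirrors the example above, but the profile values are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical expression profiles for genes 1-5 across four conditions.
profiles = np.array([[1.0, 2.0, 3.0, 4.0],   # gene 1: increasing
                     [4.0, 3.0, 2.0, 1.0],   # gene 2: decreasing
                     [4.2, 3.1, 2.2, 0.9],   # gene 3: similar to gene 2
                     [1.1, 2.2, 2.9, 4.1],   # gene 4: similar to gene 1
                     [3.0, 3.0, 2.5, 1.5]])  # gene 5: weakly decreasing

# 1 - Pearson correlation as the distance, average-linkage clustering.
dist = pdist(profiles, metric="correlation")
tree = linkage(dist, method="average")

print(tree)          # each row: the two clusters merged, their distance, and size
# dendrogram(tree)   # would draw the hierarchy with matplotlib
```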

89 References
Draghici S. (2003) Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC. Gibson G. and Muse S.V. (2004) A Primer of Genome Science, 2nd ed. Sinauer. Stekel D. (2003) Microarray Bioinformatics. Cambridge University Press. Tamhane A. and Dunlop D. (2000) Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall.

