
Generate Affy.dat file Hyb. cRNA Hybridize to Affy arrays Output as Affy.chp file Text Self Organized Maps (SOMs) Functional annotation Pathway assignment.


1 Flow chart of Affymetrix from sample to information
[Figure: flow chart. Box labels: Tissue; Hyb. cRNA; Hybridize to Affy arrays; Generate Affy.dat file; Output as Affy.chp file; Text; Self Organized Maps (SOMs); Functional annotation; Pathway assignment; Co-ordinate regulation; Promoter motif commonalities]

2 Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Machine learning approaches
- Unsupervised learning
- Clustering and pattern detection
- Gene regulatory region predictions based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

3 Data preprocessing
- Data preparation or pre-processing
- Normalization
- Feature selection
- Based on the quality of the signal intensity
- Based on the fold change
- T-test
- …

4 Normalization
Need to scale the red sample so that the overall intensities for each chip are equivalent.
[Figure: panels labeled Experiment1, Control, Experiment2, Control]

5 Normalization
- To ensure the data are comparable, normalization attempts to correct for the following variables:
- Number of cells in the sample
- Total RNA isolation efficiency
- Signal measurement sensitivity
- …
- Can use simple math
- Normalization by global scaling (bring each image to the same average brightness)
- Normalization by sectors
- Normalization to housekeeping genes
- …
- Active research area
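Global scaling, the simplest option listed above, can be sketched in a few lines. This is a minimal illustration, not the method used by any particular Affymetrix tool: the function name `global_scale`, the dictionary layout, and the choice of target (the average of the per-chip means) are all assumptions made for the example.

```python
from statistics import mean

def global_scale(chips, target=None):
    """Scale every chip so that its mean intensity equals a common target.

    chips:  dict mapping a chip name to its list of raw intensities
            (hypothetical layout, chosen for this sketch).
    target: mean intensity to scale to; defaults to the average of the
            per-chip means (an assumption, other targets are possible).
    """
    chip_means = {name: mean(values) for name, values in chips.items()}
    if target is None:
        target = mean(list(chip_means.values()))
    # Multiply each intensity by target / (mean of its own chip).
    return {
        name: [v * target / chip_means[name] for v in values]
        for name, values in chips.items()
    }

# Made-up intensities: the experiment chip is twice as bright as the control.
chips = {"experiment1": [100.0, 200.0, 300.0], "control": [50.0, 100.0, 150.0]}
scaled = global_scale(chips)
```

After scaling, both chips have the same average brightness, so remaining differences between them reflect individual probes rather than overall signal strength.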

6 Basic Data Analysis
- Fold change (relative change in intensity for each gene)
[Figure: examples labeled Mn-SOD, Annexin IV, Aminoacylase 1]
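As a small worked example of fold change, the sketch below divides the experiment intensity by the control intensity. The log2 ratio is an addition not shown on the slide; it is included only because it makes up- and down-regulation symmetric. The function names and the intensities are hypothetical.

```python
import math

def fold_change(experiment, control):
    """Relative change in intensity: experiment divided by control."""
    return experiment / control

def log2_ratio(experiment, control):
    """Log2 of the fold change: +2 means 4x up, -2 means 4x down."""
    return math.log2(experiment / control)

# Made-up intensities for one gene after normalization.
up = fold_change(400.0, 100.0)      # four-fold increase
down = log2_ratio(100.0, 400.0)     # four-fold decrease on the log2 scale
```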

7 Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Machine learning approaches
- Unsupervised learning
- Clustering and pattern detection
- Gene regulatory region predictions based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

8 Microarrays: An Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
- 72 examples (38 train, 34 test), about 7,000 probes
- Well-studied (CAMDA-2000), good test example
[Figure: ALL and AML samples. Visually similar, but genetically very different]

9 Feature selection
[Table residue: a column of per-probe p-values: …, 0.022, 0.236, 0.963, 0.022, 0.941, 0.626, 0.178, 0.260, 0.332, 0.0026, 0.487, 0.243]

10 Hypothesis Testing
- The null hypothesis is a hypothesis about a population parameter.
- Hypothesis testing tests the viability of the null hypothesis against a set of experimental data.
- Example: test whether the time to respond to a tone is affected by the consumption of alcohol.
- Hypothesis: µ1 - µ2 = 0
- µ1 is the mean time to respond after consuming alcohol
- µ2 is the mean time to respond otherwise

11 Z-test
- Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$.
- In particular, $\bar{x} = \sum_i x_i / n \sim N(\mu, \sigma^2 / n)$.
- Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$).
- What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters $\mu = 100$ and $\sigma = 8$? Use …
- $z = (104 - 100) / (8 / \sqrt{46}) \approx 3.39$, which exceeds the usual critical values, so: reject the null hypothesis.
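The example on this slide can be checked numerically. The sketch below computes the z statistic and a two-sided p-value with Python's standard library. The significance level is cut off in the transcript, so the two-sided test and the 0.05 level mentioned in the comment are assumptions; the function name `z_test` is made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """One-sample z test with known population standard deviation.

    Returns the z statistic and the two-sided p-value.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Values from the slide: N = 46, sample mean 104, mu = 100, sigma = 8.
z, p = z_test(sample_mean=104, mu0=100, sigma=8, n=46)
# z is about 3.39; p falls well below 0.05 (assumed level), so reject H.
```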

12 Histogram
[Figure: histograms of Set 1 and Set 2]

13 T-test

14 William Sealy Gosset (1876-1937)
(Guinness Brewing Company)

15 Project 3
- A training data set (38 samples, 7129 probes, 27 ALL, 11 AML)
- A testing data set (35 samples, 7129 probes, 22 ALL, 13 AML)
- Lab today: pick the top probes that can differentiate the two subtypes and process the testing data set
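One way to pick the top probes is to rank them by the absolute value of a t statistic between the ALL and AML training samples. The sketch below is only an illustration: the slides do not say which t-test variant the lab uses, so Welch's unequal-variance statistic is an assumption, and the data layout, function names, and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples.

    Assumes each group has at least two values and nonzero variance.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_probes(expression, labels, k):
    """Return the k probes with the largest |t| between ALL and AML.

    expression: dict mapping probe name -> list of values, one per sample
    labels:     list of "ALL"/"AML" labels in the same sample order
    """
    scores = {}
    for probe, values in expression.items():
        all_values = [v for v, lab in zip(values, labels) if lab == "ALL"]
        aml_values = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores[probe] = abs(welch_t(all_values, aml_values))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy training set: 3 ALL and 3 AML samples, 3 probes (made-up numbers).
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "probe_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],   # clearly different
    "probe_b": [2.0, 3.0, 2.5, 2.4, 2.9, 2.1],   # no real difference
    "probe_c": [1.0, 2.0, 1.5, 3.0, 2.5, 3.5],   # moderately different
}
best = top_probes(expression, labels, k=2)
```

The selected probe names can then be used to subset the testing data set before classification.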

16 K Nearest Neighbor Classification
[Figure: three scatter plots on the axes Feature 1 and Feature 2. Legend: M = AML, L = ALL, plus an unlabeled test sample]

17 Distance measures
- Euclidean distance
- Manhattan distance
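The two distance measures, and how they feed a k-nearest-neighbor classifier, can be sketched as follows. The formulas themselves did not survive in the transcript, so the standard definitions are used here. The function names, the default k = 3, and the toy training points are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_classify(train, test_point, k=3, distance=euclidean):
    """Label a test point by majority vote among its k nearest training points.

    train: list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy two-feature training set (made-up values).
train = [
    ((1.0, 1.0), "ALL"), ((1.2, 0.8), "ALL"), ((0.9, 1.1), "ALL"),
    ((4.0, 4.2), "AML"), ((4.1, 3.9), "AML"), ((3.8, 4.0), "AML"),
]
prediction = knn_classify(train, (1.1, 1.0), k=3)
```

With two classes, an odd k avoids tied votes.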

18 Jury Decisions
- Use one feature at a time for the classification
- Combine the results from the top 51 features
- Majority decision
[Figure: Feature0, Feature1, …, Feature50 each assign M or L to the test sample; the majority label is the decision]
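The jury idea can be sketched as one vote per feature followed by a majority count. The slide does not say how each single feature classifies the test sample, so the per-feature rule below (take the label of the closest training value on that feature) is an assumption, as are the function names and toy values.

```python
from collections import Counter

def single_feature_vote(train_values, train_labels, test_value):
    """Vote of one feature: the label of the closest training value.

    This 1-nearest-neighbour rule on a single feature is an assumption;
    any one-feature classifier could be used in its place.
    """
    nearest = min(range(len(train_values)),
                  key=lambda i: abs(train_values[i] - test_value))
    return train_labels[nearest]

def jury_decision(train_features, train_labels, test_features):
    """Classify with one feature at a time, then take the majority label.

    train_features: list of per-feature value lists (one value per sample)
    test_features:  one value per feature for the test sample
    """
    votes = [single_feature_vote(values, train_labels, x)
             for values, x in zip(train_features, test_features)]
    return Counter(votes).most_common(1)[0][0], votes

# Toy example with 3 features and 4 training samples (L = ALL, M = AML).
train_labels = ["L", "L", "M", "M"]
train_features = [
    [1.0, 1.2, 3.0, 3.1],
    [0.5, 0.7, 2.0, 2.2],
    [5.0, 5.1, 5.2, 4.9],
]
decision, votes = jury_decision(train_features, train_labels, [2.9, 1.9, 5.0])
```

Using an odd number of features, such as the 51 on the slide, guarantees a majority when there are only two classes.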

19 False Discovery
Two possible errors in making a decision about the null hypothesis:
1. We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance (Type I error).
2. We could fail to reject the null hypothesis when it is actually false, i.e., our experiment failed to detect the true difference that exists (Type II error).
We set $\alpha$, the probability of a Type I error, at a level that keeps the chances of both errors acceptably low.

20 False Discovery
- Type I error: false discovery
- The expected number of false discoveries (called the False Discovery Rate, FDR, here) equals the p-value cutoff of the t-test × the number of genes in the array
- For a p-value of 0.01 × 10,000 genes = 100 false "different" genes
- You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p = 0.001)
- The number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values
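The arithmetic on this slide can be written out as a short sketch. It assumes the worst case in which no gene is truly different, so every gene passing the cutoff is a false positive. The function names and the count of 50 called genes are hypothetical.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff by chance alone,
    assuming every gene is truly unchanged (all null hypotheses true)."""
    return p_cutoff * n_genes

def expected_false_share(p_cutoff, n_genes, n_called):
    """Rough share of the called genes expected to be false discoveries."""
    if n_called == 0:
        return 0.0
    return min(1.0, expected_false_positives(p_cutoff, n_genes) / n_called)

# Slide example: p = 0.01 on a 10,000-gene array gives about 100 false calls.
loose = expected_false_positives(0.01, 10_000)
# A stricter cutoff (p = 0.001) gives about 10.
strict = expected_false_positives(0.001, 10_000)
# If 50 genes pass the strict cutoff (made-up count), roughly 20% may be false.
share = expected_false_share(0.001, 10_000, n_called=50)
```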

21 RCC subtypes
- Clear Cell RCC (70-80%)
- Papillary (15-20%)
- Chromophobe (4-5%)
- Collecting duct
- Oncocytoma
- Sarcomatoid RCC
Goal: Identify a panel of discriminator genes?

22 Genetic Algorithm for Feature Selection
[Figure: Sample (Clear cell RCC, etc.); Raw measurement data; Feature vector = pattern (f1 f2 f3 f4 f5)]

23 Why Genetic Algorithm?
- Assuming 2,000 relevant genes, 20 important discriminator genes (features).
- Cost of an exhaustive search for the optimal set of features?
- $C(n, k) = \frac{n!}{k!\,(n-k)!}$
- $C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$
- If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
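The size of this search space is easy to verify with exact integer arithmetic. The sketch below checks the slide's lower bound and the resulting running time. The variable names and the 365-day year are choices made for the example.

```python
import math

# Exact number of ways to choose 20 genes out of 2,000.
n_subsets = math.comb(2000, 20)

# The slide's lower bound: C(n, k) >= (n / k)^k = 100^20 = 10^40.
lower_bound = (2000 // 20) ** 20

# Time for an exhaustive search at one femtosecond per evaluation,
# using the lower bound and a 365-day year.
seconds_per_year = 365 * 24 * 3600
years_lower_bound = lower_bound * 1e-15 / seconds_per_year
```

The exact count is larger than the bound, so the true exhaustive-search time is longer still.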

24 Evolutionary Methods
- Based on the mechanics of Darwinian evolution
- The evolution of a solution is loosely based on biological evolution
- Population of competing candidate solutions
- Chromosomes (a set of features)
- Genetic operators (mutation, recombination, etc.) generate new candidate solutions
- Selection pressure directs the search: those that do well survive (selection) to form the basis for the next set of solutions.

25 A Simple Evolutionary Algorithm
[Figure: cycle linking Selection, Genetic Operators, and Evaluation]

26 Genetic Algorithm
[Figure: numbered steps 1-5 acting on a population of chromosomes, each a string of gene labels (e.g. g2g1g6g3g21). If the result is good enough, stop; if not good enough, repeat.]
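The evaluate, select, and vary loop of the last two slides can be sketched as a small genetic algorithm over chromosomes made of gene indices. This is a toy under stated assumptions, not the algorithm from the slides: the function name `run_ga`, the population size, the keep-the-best-half selection rule, the crossover with duplicate repair, and the toy fitness function are all made up for illustration. In practice the fitness would be a classification score.

```python
import random

def run_ga(n_genes, chrom_len, fitness, pop_size=30, generations=200,
           mutation_rate=0.1, target=None, seed=0):
    """Evolve a list of `chrom_len` distinct gene indices that maximizes `fitness`."""
    rng = random.Random(seed)
    # Initial population: random chromosomes of distinct gene indices.
    population = [rng.sample(range(n_genes), chrom_len) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: rank the chromosomes by fitness.
        population.sort(key=fitness, reverse=True)
        if target is not None and fitness(population[0]) >= target:
            break  # good enough: stop
        # Selection: the better half survives unchanged.
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            # Crossover: head of one parent, then genes of the other parent
            # that are not already present (keeps the indices distinct).
            cut = rng.randrange(1, chrom_len)
            head = mum[:cut]
            child = (head + [g for g in dad if g not in head])[:chrom_len]
            # Mutation: occasionally swap one gene for an unused one.
            if rng.random() < mutation_rate:
                site = rng.randrange(chrom_len)
                unused = [g for g in range(n_genes) if g not in child]
                child[site] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness: how many of the chosen genes are in a hidden informative set.
informative = {0, 1, 2, 3, 4}

def toy_fitness(chromosome):
    return len(informative & set(chromosome))

best = run_ga(n_genes=50, chrom_len=5, fitness=toy_fitness, target=5)
```

Because the survivors are copied unchanged into the next generation, the best fitness in the population never decreases.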

27 Encoding
- Most difficult and important part of any GA
- Encode so that illegal solutions are not possible
- Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space
- Most GAs use a binary encoding of a solution, but other schemes are possible

28 GA Fitness
- At the core of any optimization approach is the function that measures the quality of a solution or optimization.
- Called:
- Objective function
- Fitness function
- Error function
- measure
- etc.

29 Genetic Operators
- Crossover: parents 10 30 50 70 and 20 40 60 80, split at a randomly selected crossover point into 10 30 | 50 70 and 20 40 | 60 80
- Mutation: 10 30 62 80, changed at a randomly selected mutation site
- Recombination is intended to produce promising individuals.
- Mutation maintains population diversity, preventing premature convergence.
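The two operators can be sketched with the numbers from this slide. The reading of the garbled figure is an assumption: the parents swap tails after the second position, and the mutated string 10 30 62 80 is taken to be the offspring 10 30 60 80 with its third value changed. The function names are hypothetical, and the crossover point and mutation site are passed in explicitly so the example is reproducible; in a real run they would be drawn at random, for instance with `random.randrange`.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two parents after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, site, new_value):
    """Return a copy of the chromosome with one position replaced."""
    mutated = list(chromosome)
    mutated[site] = new_value
    return mutated

# Parents from the slide, crossed after the second position.
child_a, child_b = one_point_crossover([10, 30, 50, 70], [20, 40, 60, 80], point=2)
# Mutating the third value of the first child gives the slide's 10 30 62 80.
mutant = mutate(child_a, site=2, new_value=62)
```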

30 Genetic Algorithm/K-Nearest Neighbor Algorithm
[Figure: Microarray Database, Feature Selection (GA), Classifier (kNN)]

