1
Differentially Expressed Genes, Class Discovery & Classification
2
Finding Differentially Expressed Genes
Two types of motivation:
• “Direct”:
  - Relate the genes to known biology: functions, pathways, etc. Infer their role, the mechanisms governing the process, etc.
• “Indirect”: use as a “pruning stage” for tools that perform learning tasks:
  - Infer regulatory mechanisms and relations
  - Classification (disease vs. normal, disease subtypes)
3
Example: Tumor vs. Normal tissues
• Identify differentially expressed genes:
  - Diagnostic markers
  - Therapeutic targets
  - Understanding the disease process
[Figure: normal samples vs. tumor samples, with over-expressed and under-expressed genes marked. Non-small cell lung carcinomas; Sheba Medical Center, U. of Colorado Medical Center.]
4
What We Need
• Score the genes, hopefully in a meaningful way.
• Attach a measure of statistical significance to the score, so we can:
  - Choose a subset of genes “wisely”
  - Have a measure of how strong our signal is
5
Simplest Score: Fold Change
[Figure: avg. expression in normal lung vs. avg. expression in tumors, with the 2-fold change boundaries marked.]
2-fold up: 761 genes
2-fold down: 272 genes
6
Fold Change: problems
• Not reliable at the low end of the scale
  - (“0/0” effects: large variance)
• Sensitive to outliers
• Variant: “pairwise fold change”
  - Compute the fold change over all possible sample pairs
  - If the change exceeds the chosen threshold in, e.g., 75% of the pairs => significant
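The outlier problem and the pairwise variant can be illustrated with a minimal sketch. The expression values, the function names and the default threshold of 2 are assumptions made for illustration, and "all possible sample pairs" is read here as every (normal, tumor) pair, checked in the up direction only.

```python
from itertools import product
from statistics import mean

def fold_change(normal, tumor):
    """Ratio of average tumor expression to average normal expression."""
    return mean(tumor) / mean(normal)

def pairwise_fold_fraction(normal, tumor, threshold=2.0):
    """Fraction of (normal, tumor) pairs whose fold change exceeds threshold."""
    pairs = list(product(normal, tumor))
    hits = sum(1 for n, t in pairs if t / n > threshold)
    return hits / len(pairs)

# Hypothetical values for one gene; the last tumor sample is an outlier.
normal = [100.0, 120.0, 110.0, 90.0]
tumor = [105.0, 115.0, 95.0, 2000.0]

print(fold_change(normal, tumor))             # well above 2, driven by one sample
print(pairwise_fold_fraction(normal, tumor))  # 0.25: only pairs with the outlier pass
```

With a 75% requirement this gene would not be called significant, even though its plain fold change is above 5.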
7
Relevance Scores - TNoM
Beyond “fold change”: both genes have >15 fold change.
TNoM (Total Number of Misclassifications) score: find the threshold that best separates tumors from normals, and count the number of errors committed there.
[Figure: tumor and normal expression values on a log axis (10 to 100000) for Gene 1 and Gene 2; one gene is uninformative (score 5), the other informative (score 0).]
8
Scoring Informative Genes
Expression pattern of a gene: a
Pathological diagnosis information (annotation): L
v(a,L): a vector of +s and -s, ordered by the a values

+  +  -  -  +  +  +  -  -  +   -   -   +   +   -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Informative genes:
+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc.

Non-informative genes:
+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc.
9
TNoM Score
Find the threshold that best separates tumors from normals, and count the number of errors committed there.
Ex 1: - + + - + - - + + - + + - - +
  # of errors = min(7,8) = 7
  [Figure: candidate thresholds annotated with error counts 6 and 7.]
Ex 2: + + + + + + + + - - - - - - -
  A perfect single-gene classifier gets a score of 0.
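The threshold search can be sketched in a few lines. This is one straightforward reading of the definition, not necessarily how the ScoreGene package implements it: every split point is tried with both class assignments, and ties between expression values are ignored. The function name is an assumption.

```python
def tnom(labels):
    """TNoM of a label vector that is already sorted by expression value.

    labels is a string or list of '+' and '-'. A threshold splits the vector
    into a left and a right part, each side is assigned one class, and the
    score is the smallest number of misclassified samples over all splits
    and both class assignments.
    """
    n = len(labels)
    total_plus = sum(1 for x in labels if x == '+')
    best = n
    plus_left = 0
    for t in range(n + 1):              # t = number of samples left of the threshold
        if t > 0 and labels[t - 1] == '+':
            plus_left += 1
        minus_left = t - plus_left
        plus_right = total_plus - plus_left
        minus_right = (n - t) - plus_right
        errors_a = plus_left + minus_right   # rule: left is '-', right is '+'
        errors_b = minus_left + plus_right   # rule: left is '+', right is '-'
        best = min(best, errors_a, errors_b)
    return best

print(tnom("++++++++-------"))   # 0: a perfect single-gene classifier
print(tnom("-++-+--++-++--+"))   # 6: best split found for the mixed vector
```

The trivial split with nothing on the left gives min(7,8) = 7 errors for the second vector; moving the threshold past the first sample brings it down to 6.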
10
TNoM vs. Fold Change
2-fold up: 761 genes
2-fold down: 272 genes
TNoM ≤ 3: 62 genes
[Figure: avg. expression in normal lung vs. avg. expression in tumors, with the 2-fold change boundaries; genes marked as TNoM ≤ 3 or TNoM > 3.]
11
TNoM
Cons:
• One-sided vs. two-sided errors
• Absolute values ignored
For any given level s, we can efficiently compute p-Val(s) = Prob(TNoM(V) ≤ s), where V is drawn uniformly over the appropriate space (H0: the gene expression values are independent of the labels).
Computed using DP.
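For small sample counts the same p-value can be checked by brute force. The sketch below is not the DP mentioned on the slide: it enumerates every label vector, and it assumes that "the appropriate space" means all vectors with the same number of + and - labels as the data. Function names are illustrative.

```python
from itertools import combinations

def tnom(labels):
    """Smallest number of errors over all thresholds and both class assignments."""
    n, k = len(labels), labels.count('+')
    best, plus_left = min(k, n - k), 0          # threshold with nothing on the left
    for t in range(1, n + 1):
        plus_left += labels[t - 1] == '+'
        errors = plus_left + (n - k) - (t - plus_left)  # left '-', right '+'
        best = min(best, errors, n - errors)            # n - errors: opposite rule
    return best

def tnom_p_value(s, n, k):
    """Exact Prob(TNoM(V) <= s) over all label vectors with k '+' among n samples."""
    hits = total = 0
    for plus_positions in combinations(range(n), k):
        chosen = set(plus_positions)
        v = ['+' if i in chosen else '-' for i in range(n)]
        total += 1
        if tnom(v) <= s:
            hits += 1
    return hits / total

# 15 samples with 8 '+' labels, as in the example vectors.
for s in range(0, 8):
    print(s, tnom_p_value(s, 15, 8))
```

Enumeration costs C(n, k) evaluations, so it is only practical for a few dozen samples; that is where a DP becomes necessary.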
12
Wilcoxon Rank Test
• Another gene score which, like TNoM:
  - Ignores absolute values
  - Takes into account only the order of the measurements
Sort the expression values of both groups:

+  +  -  -  +  +  +  -  -  +   -   -   +   +   -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

W(g) = sum of the ranks of the positive examples:
W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58
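The rank sum above can be reproduced directly. This is a minimal sketch with illustrative function names; ties between expression values are not handled.

```python
def rank_sum(sorted_labels):
    """Sum of the 1-based ranks of the '+' samples in a sorted label vector."""
    return sum(rank for rank, label in enumerate(sorted_labels, start=1)
               if label == '+')

def rank_sum_from_values(values, labels):
    """Sort samples by expression value first, then sum the ranks of the '+' ones."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return rank_sum([labels[i] for i in order])

labels = "+ + - - + + + - - + - - + + -".split()
print(rank_sum(labels))  # 58
```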
13
Wilcoxon Rank Test
• A common test in statistics
• Again, we can compute p-values given the null hypothesis H0:
  - P(W(g) > s | n, k) = the probability of getting a score > s given a total of n samples, out of which k are labeled as (+).
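One way to obtain P(W(g) > s | n, k) exactly is to count, for every possible sum, how many k-element subsets of the ranks 1..n produce it. The sketch below assumes that under H0 all C(n, k) placements of the + labels are equally likely; the function names are illustrative.

```python
from math import comb

def rank_sum_counts(n, k):
    """Return a list c where c[w] = number of k-element subsets of 1..n summing to w."""
    max_sum = n * (n + 1) // 2
    counts = [[0] * (max_sum + 1) for _ in range(k + 1)]
    counts[0][0] = 1
    for rank in range(1, n + 1):
        # Go through subset sizes in descending order so each rank is used at most once.
        for j in range(min(rank, k), 0, -1):
            for w in range(max_sum, rank - 1, -1):
                counts[j][w] += counts[j - 1][w - rank]
    return counts[k]

def wilcoxon_upper_p(s, n, k):
    """P(W > s | n, k) under the uniform null over label placements."""
    counts = rank_sum_counts(n, k)
    return sum(counts[s + 1:]) / comb(n, k)

# Upper-tail probability for the example gene: W = 58 with n = 15, k = 8.
print(wilcoxon_upper_p(58, 15, 8))
```

Under this null the expected rank sum is k(n + 1)/2 = 64, so W = 58 sits slightly below the mean and is unremarkable on its own.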
14
SAM (Tusher et al., PNAS 01)
where a = (1/n1 + 1/n2)/(n1 + n2 - 2)
d(i) is exactly the two-sample (pooled-variance) t-statistic
Tests the assumption: are the means of the two processes the same?
Underlying assumption: two normal distributions
A known p-value: the t-distribution
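For reference, the two-class statistic as published by Tusher et al. can be written as below. The group indices 1 and 2 are chosen to match n1 and n2 above, and the small positive constant s_0 comes from the paper rather than from this slide.

```latex
d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0},
\qquad
s(i) = \sqrt{\, a \left[ \sum_{m \in \text{group 1}} \bigl(x_m(i) - \bar{x}_1(i)\bigr)^2
      + \sum_{m \in \text{group 2}} \bigl(x_m(i) - \bar{x}_2(i)\bigr)^2 \right]},
\qquad
a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2}
```

With s_0 = 0 this reduces to the ordinary t-statistic for two groups with a pooled variance estimate.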
15
SAM: Alternative to P-Value
The p-value relies on the t-test assumptions, which is problematic.
Can we assess the significance of d(i) without parametric assumptions?
Define a “balanced” permutation: a division of the samples into 2 groups, where in each group the number of ‘+’ and ‘-’ is balanced.
Apply all possible “balanced” permutations p to the data and recompute d(i) under each one.
16
False Discovery Rate for SAM
Genes with a score above a given threshold are called significant.
FDR (false discovery rate) = the % of genes passing as “significant” which are expected to be false positives.
Each threshold on the score can be given an FDR value: compute the avg. number of false positives crossing this threshold in the permuted sets.
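The FDR estimate can be sketched as follows. Several simplifications are assumptions of this sketch and not part of SAM itself: random label permutations are sampled instead of enumerating all balanced ones, the threshold is applied to |d(i)| directly, and s0 defaults to 0. The data and all function names are hypothetical.

```python
import random
from statistics import mean

def d_stat(x1, x2, s0=0.0):
    """Pooled-variance two-sample statistic with an optional s0 added to the denominator."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    return (m1 - m2) / ((a * ss) ** 0.5 + s0)

def permutation_fdr(genes, n1, threshold, n_perm=200, seed=0):
    """genes: one list of values per gene; the first n1 values belong to group 1."""
    rng = random.Random(seed)
    n = len(genes[0])
    called = sum(1 for g in genes if abs(d_stat(g[:n1], g[n1:])) > threshold)
    false_counts = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)                      # one relabeling shared by all genes
        count = 0
        for g in genes:
            p = [g[i] for i in order]
            if abs(d_stat(p[:n1], p[n1:])) > threshold:
                count += 1
        false_counts.append(count)
    expected_false = mean(false_counts)         # avg. number of genes crossing by chance
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, expected_false, fdr

# Hypothetical data: 50 genes, 5 + 5 samples; the first 5 genes are shifted in group 1.
rng = random.Random(1)
genes = []
for g in range(50):
    shift = 10.0 if g < 5 else 0.0
    genes.append([rng.gauss(shift, 1.0) for _ in range(5)] +
                 [rng.gauss(0.0, 1.0) for _ in range(5)])

called, expected_false, fdr = permutation_fdr(genes, n1=5, threshold=4.0)
print(called, expected_false, fdr)
```

Raising the threshold lowers both the number of called genes and the expected number of false positives, so the FDR can be tabulated per threshold and the cutoff chosen from that table.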
17
Different Scores
  - TNoM
  - Info
  - Wilcoxon
  - t Test
  - Fold Change
Different scores and null hypotheses (parametric, non-parametric, etc.)
All can be found in the ScoreGene package: http://www.cs.huji.ac.il/labs/compbio/scoregenes/
Can we assess which scoring method is the best for our case?
18
Overabundance Analysis
Data on 30 samples from normal and tumor lung tissues. ~7000 genes.
Naftali Kaminski’s lab, Sheba Medical Center
19
Why Test Overabundance?
• Tests how informative a set of genes is w.r.t. a given classification of the data and a scoring method.
• Can be used to compare different:
  - gene scoring methods
  - normalization methods
20
Comparing Normalization Methods
21
Why Test Overabundance?
• But also: a method to discover new classes in the data
• Intuition: biologically meaningful partitions will have a high overabundance of informative genes
22
Overabundance Analysis in Class Discovery
Biologically meaningful partitions show an overabundance of informative genes.
Steps: score genes, count, compare to random.
Examples: AML/ALL, BRCA1/2, melanoma
23
Class Discovery Approach
Seek partitions with a statistically significant overabundance of informative genes.
Use local search techniques, e.g.:
  - Steepest ascent
  - Simulated annealing
24
Scoring a Partition
At a given score level s, set p = p-Val(s).
Suppose that in the data we observe n(s) genes with score ≤ s.
The number of genes with score ≤ s that we observe for uniformly and independently drawn labeling vectors is a random variable N(s), with N(s) ~ Binom(n, p), where n is the total number of genes.
The surprise rate at s is defined as
$\mathrm{surprise}(s) = \mathrm{Prob}(N(s) \ge n(s)) = \sum_{k=n(s)}^{n} \binom{n}{k} p^k (1-p)^{n-k}$
Finally, the max surprise score for the suggested partition is $\max_s \mathrm{surprise}(s)$.
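The binomial tail can be evaluated numerically as in the sketch below. The sum is taken in log space because the binomial coefficients overflow floating point for thousands of genes. The per-level inputs are hypothetical, and treating the level with the smallest tail probability as the most surprising one is this sketch's reading of "max surprise".

```python
from math import exp, lgamma, log

def log_binom_pmf(n, k, p):
    """log of C(n, k) * p^k * (1 - p)^(n - k), for 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def surprise_tail(n_genes, n_observed, p):
    """Prob(N >= n_observed) for N ~ Binom(n_genes, p)."""
    return sum(exp(log_binom_pmf(n_genes, k, p))
               for k in range(n_observed, n_genes + 1))

# Hypothetical inputs: for each score level s, (p-Val(s), observed count n(s)).
n_genes = 7000
levels = {2: (0.0002, 20), 3: (0.001, 62), 4: (0.01, 150)}

tails = {s: surprise_tail(n_genes, n_obs, p) for s, (p, n_obs) in levels.items()}
most_surprising = min(tails, key=tails.get)   # smallest tail = hardest to explain by chance
print(tails, most_surprising)
```

At p = 0.001 about 7 of 7000 genes are expected to reach the level by chance, so observing 62 gives a vanishingly small tail probability.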
25
Overabundance & Max-Surprise
26
Example: Survival Prediction
[Figure: two plots, All Patients and Good Prognosis Patients; x axis 0 to 12, y axis 0 to 1. First plot: 32 patients, 17 deaths vs. 8 patients, 5 deaths. Second plot: 19 patients, 6 deaths vs. 5 patients, 3 deaths.]
27
Class 2
[Figure: plots for All Patients and Good Prognosis Patients; x axis 0 to 12, y axis 0 to 1. Groups shown: 10 patients, 2 deaths vs. 14 patients, 7 deaths.]
28
Class 3
[Figure: plots for All Patients and Good Prognosis Patients; x axis 0 to 12, y axis 0 to 1. Groups shown: 12 patients, 6 deaths vs. 12 patients, 3 deaths.]
29
Tissue Classification
• Given a set of labeled samples, we can try to classify a new sample
  - Supervised methods: SVM, AdaBoost, Naïve Bayes
  - Semi-supervised methods: Clustering
• Issues:
  - Evaluating the methods
  - Feature selection
  - Sample contamination/composition
30
Evaluating Classification
LOOCV (leave-one-out cross-validation):
For all samples i = 1…M:
  - Take sample i out
  - Learn from the M-1 remaining samples
  - Test on sample i
[Figure: normal and tumor samples, with a mislabeled sample marked.]
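The loop above can be sketched with any classifier plugged in. A nearest-centroid rule is used here purely as a stand-in (the slides name SVM, AdaBoost and Naïve Bayes), and the two-gene data set, including its deliberately mislabeled last sample, is hypothetical.

```python
from statistics import mean

def nearest_centroid_fit(samples, labels):
    """Return {label: centroid}, where a centroid is the per-gene mean of that class."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, sample):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def loocv_errors(samples, labels):
    """Indices of samples that are misclassified when held out."""
    errors = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]      # take sample i out
        train_y = labels[:i] + labels[i + 1:]
        model = nearest_centroid_fit(train_x, train_y)   # learn from the other M-1
        if nearest_centroid_predict(model, samples[i]) != labels[i]:
            errors.append(i)                          # test on sample i
    return errors

# Hypothetical expression values for two genes; the last sample carries the wrong label.
samples = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1],
           [5.0, 1.0], [5.2, 0.9], [4.8, 1.1],
           [5.1, 1.0]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor", "normal"]

print(loocv_errors(samples, labels))  # [6]: only the mislabeled sample is flagged
```

A sample that is consistently misclassified when held out is a candidate for a labeling error, not necessarily a failure of the classifier.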
31
Feature Selection
• How many of the informative genes do we choose for our classifier?
• A question of choosing a cutoff
32
Tissue Composition
[Figure: tissue samples labeled small cell lung carcinoma, lung metastasis, serous carcinoma, lung adenocarcinoma.]
33
Tissue Composition
• The tissue is composed of many cell types (tumor, blood, muscle, …)
• The arrayed samples are not always pure!
• Major difference: differentially expressed genes which are:
  - Causes of the disease state
  - Outcomes of the disease state
34
Summary
• Many methods for choosing differentially expressed genes
• These can be compared, e.g. using overabundance tests
• Overabundance can also be used for new class discovery
• Expression patterns can be used to classify a tissue