Differentially Expressed Genes, Class Discovery & Classification

Finding Differentially Expressed Genes
Two types of motivation:
- “Direct”:
  - Relate the genes to known biology: functions, pathways, etc. Infer their role, the mechanisms governing the process, etc.
- “Indirect”: use as a “pruning stage” for tools that perform learning tasks:
  - Infer regulatory mechanisms and relations
  - Classification (disease vs. normal, disease subtypes)

Example: Tumor vs. Normal Tissues
- Identify differentially expressed genes
  - Diagnostic markers
  - Therapeutic targets
  - Understanding the disease process
[Figure: normal samples vs. tumor samples, with over-expressed and under-expressed genes. Non-small cell lung carcinomas; Sheba Medical Center, U. of Colorado Medical Center.]

What We Need
- Score the genes, hopefully in a meaningful way.
- Attach a measure of statistical significance to the score, so we can:
  - Choose a subset of genes “wisely”
  - Have a measure of how strong our signal is

Simplest Score: Fold Change
[Figure: avg. expression in normal lung vs. avg. expression in tumors, with the 2-fold change boundaries marked.]
- 2-fold up: 761 genes
- 2-fold down: 272 genes

Fold Change: Problems
- Not reliable at the low end of the scale
  - (“0/0” effects: large variance)
- Sensitive to outliers
- Variant: “pairwise fold change”
  - Compute the fold change over all possible sample pairs
  - If in, e.g., 75% of the pairs the change exceeds the threshold, the gene is called significant
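The two fold-change scores above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the function names, the 2.0 threshold, and the sample values are made up, "all possible sample pairs" is assumed to mean every (normal, tumor) pair, and expression values are assumed to be strictly positive.

```python
from itertools import product
from statistics import mean

def fold_change(normal, tumor):
    """Ratio of average tumor expression to average normal expression."""
    return mean(tumor) / mean(normal)

def pairwise_fold_fraction(normal, tumor, threshold=2.0):
    """Fraction of (normal, tumor) pairs whose ratio exceeds the threshold.

    Assumption: pairs are taken across the two groups only.
    """
    pairs = list(product(normal, tumor))
    hits = sum(1 for n, t in pairs if t / n > threshold)
    return hits / len(pairs)

# Hypothetical expression values for one gene; note the outlier-like 95.0.
normal = [100.0, 120.0, 90.0, 110.0]
tumor = [300.0, 280.0, 95.0, 350.0]

print(fold_change(normal, tumor))             # average-based fold change
print(pairwise_fold_fraction(normal, tumor))  # share of pairs above 2-fold
```

The pairwise variant is less sensitive to a single extreme sample, because one outlier can affect only the pairs it takes part in.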

Relevance Scores - TNoM
Beyond “fold change”: both genes in the figure have >15 fold change.
TNoM (Total Number of Misclassifications) score:
- Find the threshold that best separates tumors from normals,
- count the number of errors committed there.
[Figure: expression of Gene 1 and Gene 2 on a log scale (10 to 100000), tumor vs. normal samples: an uninformative gene (score 5) and an informative gene (score 0).]

Scoring Informative Genes
- Expression pattern of a gene: a
- Pathological diagnosis information (annotation): L
- v(a,L): a vector of +s and -s, ordered by the a values

    +  +  -  -  +  +  +  -  -  +   -   -   +   +   -
    a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Informative genes:
    + + + + + + + + - - - - - - -
    - - - - - - - + + + + + + + +
    - - - - + - - + + - + + + + +
    etc.

Non-informative genes:
    + - + - + + + + - - + + - - -
    - + + - + - - + + - + + - - +
    + - - - + + - + + - + + - + -
    etc.

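TNoM Score
Find the threshold that best separates tumors from normals, and count the number of errors committed there.

    - + + - + - - + + - + + - - +

- The vector has 8 +s and 7 -s, so a threshold at either end of the ordering gives # of errors = min(7,8) = 7.
- [Figure: two example thresholds, Ex 1 and Ex 2, with error counts 6 and 7.]

A perfect single-gene classifier gets a score of 0:

    + + + + + + + + - - - - - - -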
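A minimal sketch of the TNoM score as described above: sort the samples by expression, try every threshold position in both orientations, and keep the smallest error count. The function name and the handling of ties (broken by label order after sorting) are assumptions made for illustration.

```python
def tnom(values, labels):
    """Total Number of Misclassifications for one gene.

    values: expression level per sample; labels: '+' or '-' per sample.
    """
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    total_plus = ordered.count('+')
    total_minus = len(ordered) - total_plus
    # Threshold placed at either end: everything is assigned to one class.
    best = min(total_plus, total_minus)
    plus_left = minus_left = 0
    for lab in ordered:
        if lab == '+':
            plus_left += 1
        else:
            minus_left += 1
        # Orientation A: left of the threshold is called '-', right is '+'.
        errors_a = plus_left + (total_minus - minus_left)
        # Orientation B: left is called '+', right is '-'.
        errors_b = minus_left + (total_plus - plus_left)
        best = min(best, errors_a, errors_b)
    return best

# The label vector from the slide, already ordered by expression.
slide_labels = list("-++-+--++-++--+")
print(tnom(range(15), slide_labels))             # 6
print(tnom(range(15), list("++++++++-------")))  # 0, a perfect separator
```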

TNoM vs. Fold Change
- 2-fold up: 761 genes
- 2-fold down: 272 genes
- TNoM ≤ 3: 62 genes
[Figure: avg. expression in normal lung vs. avg. expression in tumors, with the 2-fold change boundaries; genes are marked as TNoM ≤ 3 or TNoM > 3.]

TNoM
- Cons:
  - One-sided vs. two-sided errors
  - Absolute values ignored
- For any given level s, we can efficiently compute p-Val(s) = Prob( TNoM(V) ≤ s ), where V is drawn uniformly over the appropriate space.
  - (H0: the gene expression values are independent of the labels)
  - Computed using dynamic programming (DP)
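To make the p-value concrete, here is a brute-force sketch that enumerates every label vector instead of using the dynamic program mentioned on the slide. It assumes the "appropriate space" is the set of all vectors with the same number of + and - labels as the data; that reading and the function names are assumptions. Exhaustive enumeration is only practical for small sample counts.

```python
from itertools import combinations
from math import comb

def tnom_of_ordered(ordered):
    """TNoM of a label sequence that is already sorted by expression."""
    total_plus = ordered.count('+')
    total_minus = len(ordered) - total_plus
    best = min(total_plus, total_minus)
    plus_left = minus_left = 0
    for lab in ordered:
        if lab == '+':
            plus_left += 1
        else:
            minus_left += 1
        best = min(best,
                   plus_left + (total_minus - minus_left),
                   minus_left + (total_plus - plus_left))
    return best

def tnom_pvalue(s, n, k):
    """Prob(TNoM(V) <= s) for V uniform over vectors with k '+' among n."""
    hits = 0
    for plus_positions in combinations(range(n), k):
        plus = set(plus_positions)
        ordered = ['+' if i in plus else '-' for i in range(n)]
        if tnom_of_ordered(ordered) <= s:
            hits += 1
    return hits / comb(n, k)

# Chance that a random gene separates 8 '+' from 7 '-' perfectly.
print(tnom_pvalue(0, 15, 8))
```

Only two of the 6435 possible arrangements are perfect separators, which is why a score of 0 on 15 samples is strong evidence against H0.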

Wilcoxon Rank Test
- Another gene score which, similarly to TNoM:
  - Ignores absolute values
  - Takes into account only the order of the measurements
- Sort the expression values of both groups:

    +  +  -  -  +  +  +  -  -  +   -   -   +   +   -
    a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

- W(g) = sum of the ranks of the positive examples:
    W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58

Wilcoxon Rank Test
- A common test in statistics
- Again, we can compute p-values given the null hypothesis H0:
  - P(W(g) > s | n, k) = the probability of getting a score > s given a total of n samples, of which k are labeled (+).
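A small sketch of the rank-sum score and its exact null probability. Under H0 the k positive samples occupy a uniformly random set of k ranks out of n, so P(W > s | n, k) can be obtained by enumerating rank subsets. The function names are illustrative, ties are assumed absent, and enumeration is feasible only for small n.

```python
from itertools import combinations
from math import comb

def rank_sum(ordered_labels):
    """W(g): sum of the 1-based ranks of the '+' samples."""
    return sum(rank for rank, lab in enumerate(ordered_labels, start=1)
               if lab == '+')

def rank_sum_pvalue(s, n, k):
    """Exact P(W > s | n, k) when every k-subset of ranks is equally likely."""
    hits = sum(1 for ranks in combinations(range(1, n + 1), k)
               if sum(ranks) > s)
    return hits / comb(n, k)

slide_labels = list("++--+++--+--++-")
w = rank_sum(slide_labels)
print(w)                          # 58, as on the slide
print(rank_sum_pvalue(w, 15, 8))  # P(W > 58 | n = 15, k = 8)
```

For large n the enumeration becomes too expensive, and a recursive count or a normal approximation is the usual replacement.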

SAM (Tusher et al., PNAS 01)
- The score d(i) of gene i is defined with a = (1/n1 + 1/n2)/(n1 + n2 - 2), where n1 and n2 are the two group sizes
- d(i) is exactly the two-sample t-statistic
- Tests the assumption: are the means of the two processes the same?
- Underlying assumption: two normal distributions
- A known p-value: the t-distribution
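A sketch of a d(i)-style statistic built from the constant a given on the slide. With s0 = 0 it reduces to the pooled-variance two-sample t-statistic. The optional s0 term stands for the small positive constant that, to my understanding, the published SAM statistic adds to the denominator; treat it, the function name, and the sign convention (group 1 minus group 2) as assumptions.

```python
from math import sqrt
from statistics import mean

def sam_d(group1, group2, s0=0.0):
    """Relative difference between two groups for one gene."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    # Sum of squared deviations of each sample from its own group mean.
    ss = (sum((x - m1) ** 2 for x in group1)
          + sum((x - m2) ** 2 for x in group2))
    s = sqrt(a * ss)  # standard error of the difference of means
    return (m1 - m2) / (s + s0)

# Hypothetical expression values for one gene in two groups.
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0))  # damped score
```

A positive s0 keeps genes with very small variance from receiving extreme scores.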

SAM – Alternative to P-Value
- The p-value relies on the t-test assumptions, which is problematic
- Can we assess the significance of d(i) without parametric assumptions?
- Define a “balanced” permutation: a division of the samples into 2 groups, where in each group the number of ‘+’ and ‘-’ is balanced
- Apply all possible “balanced” permutations p to the data and compute:

False Discovery Rate for SAM
- Genes with a score above a given threshold are called significant
- FDR (false discovery rate) = the % of genes passing as “significant” that are expected to be false positives
- Each threshold on the score can be given an FDR value: compute the avg. number of false positives (FP) crossing this threshold in the permuted sets
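The FDR estimate can be sketched as the average number of permuted-data scores that cross the threshold, divided by the number of observed scores that cross it. This is a simplified illustration: it thresholds |d(i)| directly, which is an assumption and not necessarily the exact SAM rule, and all scores below are made up.

```python
def estimate_fdr(observed, permuted_sets, threshold):
    """Estimated share of false positives among genes called significant.

    observed: scores d(i) on the real labels.
    permuted_sets: one list of scores per permutation of the labels.
    """
    called = sum(1 for d in observed if abs(d) > threshold)
    if called == 0:
        return 0.0
    false_counts = [sum(1 for d in perm if abs(d) > threshold)
                    for perm in permuted_sets]
    expected_false = sum(false_counts) / len(false_counts)
    return min(1.0, expected_false / called)

# Hypothetical scores for five genes and two permutations.
observed = [5.0, -4.2, 0.3, 1.1, 3.9]
permuted_sets = [[0.5, -3.5, 0.2, 1.0, -0.7],
                 [2.1, 0.4, -0.9, 0.1, 0.6]]
print(estimate_fdr(observed, permuted_sets, threshold=3.0))
```

Raising the threshold calls fewer genes significant and usually lowers the estimated FDR.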

Different Scores
- TNoM
- Info
- Wilcoxon
- t-test
- Fold change
Different scores and null hypotheses (parametric, non-parametric, etc.)
All can be found in the ScoreGene package: http://www.cs.huji.ac.il/labs/compbio/scoregenes/
Can we assess which scoring method is the best for our case?

Overabundance Analysis
Data on 30 samples from normal and tumor lung tissues, ~7000 genes. Naftali Kaminski’s lab, Sheba Medical Center.

Why Test Overabundance?
- Tests how informative a set of genes is w.r.t. a given classification of the data and a scoring method.
- Can be used to compare different:
  - gene scoring methods
  - normalization methods

Comparing Normalization Methods

Why Test Overabundance?
- But also: a method to discover new classes in the data
- Intuition: biologically meaningful partitions will have a high overabundance of informative genes

Overabundance Analysis in Class Discovery
- Biologically meaningful partitions show an overabundance of informative genes.
- Procedure: score genes, count, compare to random.
- Examples: AML/ALL, BRCA1/2, melanoma

Class Discovery Approach
- Seek partitions with statistically significant overabundance of informative genes
- Use local search techniques, e.g.:
  - Steepest ascent
  - Simulated annealing

Scoring a Partition
- At a given score level s, set p = p-Val(s).
- Suppose that in the data we observe n(s) genes with score ≤ s.
- The number of genes with score ≤ s that we observe for uniformly and independently drawn labeling vectors is a random variable N(s) with N(s) ~ Binom(n, p), where n is the total number of genes.
- The surprise rate at s is defined as
    $\sigma(s) = \mathrm{Prob}(N(s) \ge n(s)) = \sum_{k=n(s)}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}$
- Finally, the max surprise score for the suggested partition is $\max_s \sigma(s)$.
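The binomial tail above can be evaluated directly. The sketch below sums the terms in log space so that it stays numerically safe for thousands of genes. The function names and all numbers are hypothetical, and picking the level with the smallest tail probability as the "most surprising" one is an assumption about how the max-surprise score is meant to be read.

```python
from math import exp, lgamma, log

def binom_tail(n, p, n_obs):
    """Prob(N >= n_obs) for N ~ Binom(n, p), assuming 0 < p < 1."""
    total = 0.0
    for k in range(n_obs, n + 1):
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1 - p))
        total += exp(log_term)
    return min(1.0, total)

def most_surprising_level(n, levels):
    """levels maps a score level s to (p-Val(s), observed count n(s)).

    Assumption: the most surprising level is the one whose observed count
    is least likely under the null, i.e. has the smallest tail probability.
    """
    tails = {s: binom_tail(n, p, n_obs) for s, (p, n_obs) in levels.items()}
    best = min(tails, key=tails.get)
    return best, tails[best]

# Made-up example: 7000 genes, two score levels.
levels = {3: (0.001, 40), 5: (0.05, 400)}
print(most_surprising_level(7000, levels))
```

At level 3 about 7 genes are expected by chance, so observing 40 is far more surprising than observing 400 at level 5, where about 350 are expected.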

Overabundance & Max-Surprise

Example: Survival Prediction
[Figure: survival curves (survival fraction 0 to 1 over time 0 to 12). Panels: All Patients; Good Prognosis Patients. Curve labels: 32 patients, 17 deaths; 8 patients, 5 deaths; 19 patients, 6 deaths; 5 patients, 3 deaths.]

Class 2
[Figure: survival curves (survival fraction 0 to 1 over time 0 to 12). Panels: All Patients; Good Prognosis Patients. Curve labels: 10 patients, 2 deaths; 14 patients, 7 deaths.]

Class 3
[Figure: survival curves (survival fraction 0 to 1 over time 0 to 12). Panels: All Patients; Good Prognosis Patients. Curve labels: 12 patients, 6 deaths; 12 patients, 3 deaths.]

Tissue Classification
- Given a set of labeled samples, we can try to classify a new sample
  - Supervised methods: SVM, AdaBoost, Naïve Bayes
  - Semi-supervised methods: Clustering
- Issues:
  - Evaluating the methods
  - Feature selection
  - Sample contamination/composition

Evaluating Classification
LOOCV (leave-one-out cross-validation): for all samples i = 1…M:
- Take sample i out
- Learn from the M-1 remaining samples
- Test on sample i
[Figure: LOOCV results for normal and tumor samples, with a mislabeled sample marked.]
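The LOOCV loop above can be sketched as follows. The nearest-centroid classifier is only a stand-in chosen to keep the example short; it is not one of the methods named in the slides, and the data are invented. The last sample has tumor-like expression but a "normal" label, to mimic a mislabeled sample.

```python
from statistics import mean

def train_centroids(samples, labels):
    """Per-class mean expression vector (stand-in classifier)."""
    centroids = {}
    for cls in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        centroids[cls] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))

def loocv_errors(samples, labels):
    """Indices of samples that are misclassified when held out."""
    wrong = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # take sample i out
        train_y = labels[:i] + labels[i + 1:]
        model = train_centroids(train_x, train_y)  # learn from M-1 samples
        if predict(model, samples[i]) != labels[i]:  # test on sample i
            wrong.append(i)
    return wrong

# Hypothetical two-gene expression profiles.
samples = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.0],
           [3.0, 3.1], [3.2, 2.9], [2.9, 3.0],
           [3.1, 3.0]]
labels = ['normal'] * 3 + ['tumor'] * 3 + ['normal']
print(loocv_errors(samples, labels))  # [6]
```

A sample that is consistently misclassified under LOOCV is a candidate for a labeling error, not only a classifier failure.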

Feature Selection
- How many of the informative genes do we choose for our classifier?
- A question of choosing a cutoff

Tissue Composition
[Figure panels: small cell lung carcinoma; lung metastasis; serous carcinoma; lung adenocarcinoma.]

Tissue Composition
- The tissue is composed of many cell types (tumor, blood, muscle, …)
- The arrayed samples are not always pure!
- Major difference: differentially expressed genes which are:
  - Causes of the disease state
  - Outcome of the disease state

Summary
- Many methods for choosing differentially expressed genes
- These can be compared, e.g. using overabundance tests
- Overabundance can also be used for new class discovery
- Expression patterns can be used to classify a tissue
