Published by Jeremy Gardner. Modified over 9 years ago.
1 Global expression analysis
Monday 10/1: Intro (* 1-page Project Overview due); Intro to R lab
Wednesday 10/3: Stats & FDR (* read the paper!)
Monday 10/8: Calling differentially expressed genes with baySeq (* read the paper!); baySeq lab for RNA-seq data
Wednesday 10/10: Clustering analysis
Monday 10/15: Clustering analysis; Clustering lab
Wednesday 10/17: Motif analysis
Monday 10/22: Motif analysis; Motif lab
Wednesday 10/24: ChIP/RIP/Nuc/Ect-Seq
2 Global expression analysis
Goal: To measure transcript abundance of every gene in your organism at once … AND make sense of it
The power is in organizing genomic expression data to find meaningful patterns & groups of genes
3 Gasch et al. 2000, 2001
4 What kinds of information can we extract from genomic expression data?
1. Hypothetical functions for uncharacterized genes
-- genes encoding subunits of multi-subunit protein complexes are often highly coregulated
   example: ribosomal protein genes, proteasome genes in yeast
-- genes involved in the same cellular processes are often coregulated
2. New roles for characterized genes
3. Better understanding of the experimental conditions
-- based on expression patterns of characterized genes
4. Implications of gene regulation
-- WT vs. mutants can identify transcription factor targets
-- promoter analysis of coregulated genes = upstream elements
-- gene coregulation with known pathway targets can implicate pathway activity
5. Understanding developmental pathways
6. Defining samples based on expression profiles
   example: comparing tumor samples from patients
5 Technologies for Quantifying & Identifying Nucleic Acids
DNA microarrays
1. Collect RNA
2. Generate fluorescently-labeled cDNA
3. Hybridize to array
4. Detect fluorescence emission with scanning laser
Data: Continuous measurements of relative fluorescence
Deep sequencing
1. Collect RNA
2. Make strand-specific cDNA library
3. Deep sequence short reads
4. Relate sequences back to genome / transcriptome location (or de novo assembly)
Data: Number of sequencing reads at each base in the genome = Discrete ‘Counts’
6 Tiled-genome arrays cover the entire genome
[Figure labels: ORF, mRNA, Array Probes]
7 Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)
Tiled sequences across each gene / locus
To get relative differences in expression across two samples:
1. Need to normalize array signals across arrays
2. Need to compress measurements to a single score for each gene/transcript
8 Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)
‘Robust Multi-array Average’ (RMA; Irizarry et al. 2003)
PM = ‘perfect match’ oligo
MM = ‘mismatch’ oligo (central nucleotide is mutated)
1. On Affy: Throw out elements where MM signal > PM signal … but otherwise ignore MM
2. Local background subtraction from each probe intensity
3. Quantile normalization of arrays to be compared … sets the distribution of probe intensities to be the same
4. Convert intensity values to log2 scale
5. Use a linear model to fit a given probe set and compute one expression value per gene
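Steps 3 and 4 can be illustrated with a small sketch. This is not the RMA implementation itself, only a toy version of quantile normalization followed by a log2 transform; the function names and the intensity values are hypothetical, and ties are broken by sort order rather than averaged.

```python
import math

def quantile_normalize(arrays):
    """Give every array the same distribution of probe intensities.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Simplification (assumption): ties are broken by sort order, not averaged.
    """
    n_probes = len(arrays[0])
    # For each array, probe indices ordered from lowest to highest intensity.
    orders = [sorted(range(n_probes), key=lambda i: arr[i]) for arr in arrays]
    # Reference distribution: mean intensity at each rank across all arrays.
    reference = [
        sum(arr[order[rank]] for arr, order in zip(arrays, orders)) / len(arrays)
        for rank in range(n_probes)
    ]
    # Replace each probe's intensity with the reference value for its rank.
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

def to_log2(values):
    """Convert intensities to log2 scale."""
    return [math.log2(v) for v in values]

# Hypothetical probe intensities for two arrays.
array_a = [5.0, 2.0, 3.0, 4.0]
array_b = [4.0, 1.0, 6.0, 3.0]

normalized = quantile_normalize([array_a, array_b])
log2_values = [to_log2(arr) for arr in normalized]
print(normalized)
```

After normalization both arrays contain exactly the same set of values; only the assignment of values to probes differs, which is what "sets the distribution of probe intensities to be the same" means.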
9 Deep sequencing for gene expression analysis
mRNA
Old protocol: make ds cDNA
New protocols: 1st strand cDNA (2nd strand with dUTP)
Sequence
Number of sequencing reads per region ~= number of starting transcripts
10 Number of sequencing reads per region ~= number of starting transcripts *
* But sometimes one lane of sequencing works better than others
[Figure labels: total reads in lane, 40 x 10^6 vs. 32 x 10^6]
Simple normalization: Avg counts within gene length / Total counts in that lane
BUT … have to account for the length of the gene/transcript: Counts per base pair
RPKM: Reads Per Kb per Million mapped reads
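A minimal sketch of the RPKM calculation, where RPKM = reads mapped to the gene / (gene length in kb x total mapped reads in millions). The lane totals are the two values on the slide; the gene length, the read counts and the function name are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)

# Lane depths from the slide; gene length and counts are made up for illustration.
lane_totals = {"lane_1": 40_000_000, "lane_2": 32_000_000}
gene_counts = {"lane_1": 800, "lane_2": 640}
gene_length_bp = 2_000

rpkm_by_lane = {
    lane: rpkm(gene_counts[lane], gene_length_bp, lane_totals[lane])
    for lane in lane_totals
}
print(rpkm_by_lane)
```

The raw counts differ between the lanes (800 vs. 640) only because the lanes differ in depth; after normalization the gene has the same RPKM in both.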
11 Another challenge: mapping reads to the genome/transcriptome
[Figure labels: DNA, intron, spliced transcript]
Can map reads to genome or transcriptome sequence, or assemble de novo.
Should you restrict yourself to ORF annotations?
12 Comparing samples via fold-changes:
The ratio of a gene's RPKM across samples reflects differential expression
Usually work in log2 space
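One reason to work in log2 space is that induction and repression become symmetric around zero. A small sketch; the pseudocount is an assumption added here to avoid taking the log of zero (it is not from the slides), and the function name and RPKM values are hypothetical.

```python
import math

def log2_fold_change(rpkm_condition_1, rpkm_condition_2, pseudocount=1.0):
    """log2 of (condition 2 / condition 1).

    The pseudocount is an illustrative assumption that keeps genes with
    zero reads from producing log(0) or a division by zero.
    """
    ratio = (rpkm_condition_2 + pseudocount) / (rpkm_condition_1 + pseudocount)
    return math.log2(ratio)

induced = log2_fold_change(7.0, 31.0)    # 4-fold up after the pseudocount
repressed = log2_fold_change(31.0, 7.0)  # 4-fold down after the pseudocount
unchanged = log2_fold_change(12.0, 12.0)
print(induced, repressed, unchanged)
```

A 4-fold increase and a 4-fold decrease give values of equal magnitude and opposite sign, whereas the raw ratios (4 and 0.25) are not symmetric.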
13 Now each sample = list of normalized relative transcript values
[Figure labels: Array 1, Array 2]
14 Assessing replicates: how well do the data agree overall?
-- linear regression
Where does the noise come from?
-- can be biological variation
-- can be array artifacts
… should define both types of variation …
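As a rough sketch of the replicate check, the snippet below fits a least-squares line between two replicates and reports the Pearson correlation. The log2 values are invented for illustration and the function name is hypothetical; replicates that agree well should give a slope near 1, an intercept near 0 and a correlation near 1.

```python
import math

def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical log2 expression values for the same five genes in two replicates.
replicate_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
replicate_2 = [-1.8, -1.1, 0.2, 0.9, 2.1]

slope, intercept, r = linear_fit(replicate_1, replicate_2)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```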
15 Now you have your data, in the form of relative log2 expression differences
Now what?
16 Select differentially expressed genes to focus on
[Figure labels: Expression difference; Gene X expression under condition 1; Gene X expression under condition 2]
Methods of gene selection:
-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples
-- statistically significant change in expression
   requires replicates
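The fold-change cutoff can be sketched in a few lines. Because the data are in log2 space, a >3X change corresponds to an absolute log2 ratio above log2(3), about 1.58. The gene names and values are hypothetical, and treating induced and repressed genes symmetrically is an assumption of this sketch.

```python
import math

FOLD_CUTOFF = 3.0
LOG2_CUTOFF = math.log2(FOLD_CUTOFF)  # about 1.58

# Hypothetical log2 expression differences (condition 2 vs. condition 1).
log2_ratios = {"geneA": 2.1, "geneB": -0.4, "geneC": -1.9, "geneD": 1.2}

# Keep genes that change more than 3X in either direction.
selected = sorted(
    gene for gene, value in log2_ratios.items() if abs(value) > LOG2_CUTOFF
)
print(selected)
```

The cutoff is arbitrary, as the slide says: it takes no account of how variable each gene is, which is why the statistical approach requires replicates.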
18 Select differentially expressed genes to focus on
[Figure label: Expression difference]
Use statistics to compare the mean & variation of 2 (or more) populations
Methods of gene selection:
-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples
-- statistically significant change in expression
   requires replicates
19 Test if the means of 2 (or more) groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis
Choosing the right test:
-- parametric test if your data are normally distributed with equal variance
-- nonparametric test if neither of the above is true
Why do the data need to be normally distributed?
20 Test if the means of 2 groups are the same or statistically different
The ‘null hypothesis’ $H_0$ says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis
If your two samples are normally distributed with equal variance, use the t-test:
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$$
where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means.
If $T > T_c$, where $T_c$ is the critical value for the degrees of freedom & confidence level, then reject $H_0$.
Notice that if the data aren’t normally distributed, mean and standard deviation are not meaningful.
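A worked sketch of the equal-variance t-test for one gene. It assumes the pooled-variance form of SED, compares the absolute value of T to the critical value (a two-sided test), and hardcodes the critical value 2.776 from a standard t table for 4 degrees of freedom at a two-sided 0.05 level. The expression values and the function name are hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(group_1, group_2):
    """Student's t statistic assuming equal variances (pooled variance).

    Returns (T, degrees_of_freedom).
    """
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    # Pooled sample variance across both groups.
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / df
    # Standard error of the difference in the means (SED).
    sed = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group_1) - mean(group_2)) / sed
    return t, df

# Hypothetical log2 expression of Gene X in three replicates per condition.
condition_1 = [1.0, 2.0, 3.0]
condition_2 = [4.0, 5.0, 6.0]

t_stat, df = two_sample_t(condition_1, condition_2)

# Critical value from a standard t table: df = 4, two-sided alpha = 0.05.
T_CRITICAL = 2.776
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"T={t_stat:.3f} df={df} reject H0: {reject_h0}")
```

With only three replicates per condition the critical value is large, so a gene needs a big difference relative to its variance before the null hypothesis is rejected.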
21 Differential expression on DNA microarrays: Bioconductor package Limma (ref)
** See previous years’ limma lab for a walk-through example
1. Load your data
2. Provide a ‘target’ file that says which samples are on which arrays
3. Provide a ‘design’ file (and in some cases a ‘contrast matrix’) to specify which samples you want to compare
4. Limma will look at the entire dataset and model the error on the data, to try to overcome measurement error
5. Limma then does a modified t-test to identify genes with significant expression differences across the samples you specified.