Analysis of SAGE Data: An Introduction Kevin R. Coombes Section of Bioinformatics.


Outline
Description of SAGE method
Preliminary bioinformatics issues
Description of analysis methods introduced in early paper
Review of literature: statistics and SAGE

What is SAGE?
Serial Analysis of Gene Expression
Method to quantify gene expression levels in samples of cells
Open system
  –Can potentially reveal expression levels of all genes: “unbiased” and “comprehensive”
  –Microarrays are closed, since they only tell you about the genes spotted on the array
Ref: Velculescu et al., Science 1995; 270:

How does SAGE work?
1. Isolate mRNA.
2.(a) Add biotin-labeled dT primer.
2.(b) Synthesize ds cDNA.
3.(a) Bind to streptavidin-coated beads.
3.(b) Cleave with “anchoring enzyme”.
3.(c) Discard loose fragments.
4.(a) Divide into two pools and add linker sequences.
4.(b) Ligate.
5. Cleave with “tagging enzyme”.
6. Combine pools and ligate.
7. Amplify ditags, then cleave with anchoring enzyme.
8. Ligate ditags.
9. Sequence and record the tags and frequencies.

From ditags to counts
Locate the punctuation “CATG”
Extract ditags of length between the punctuation
Discard duplicate ditags, including in reverse direction (probably PCR artifacts)
Take extreme 10 bases as the two tags, reversing right-hand tag
Discard linker sequences
Count occurrences of each tag
SAGE software available at
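The steps above might be sketched as follows. This is a minimal illustration, not the SAGE software itself: the function names are hypothetical, "reversing" the right-hand tag is taken to mean reverse-complementing it, the minimum ditag length is assumed to be twice the tag length, and only pieces flanked by punctuation on both sides are kept. Linker-derived tags are passed in as an optional set, since the slide does not list them.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_tags(concatemer, tag_len=10, linker_tags=frozenset()):
    """Count SAGE tags in one sequenced concatemer (illustrative sketch)."""
    # Locate the punctuation and keep only pieces flanked on both sides.
    pieces = concatemer.split("CATG")[1:-1]
    seen = set()
    counts = Counter()
    for ditag in pieces:
        if len(ditag) < 2 * tag_len:  # assumed minimum ditag length
            continue
        # A ditag and its reverse complement count as the same ditag.
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue  # duplicate ditag, probably a PCR artifact
        seen.add(key)
        left = ditag[:tag_len]
        right = reverse_complement(ditag[-tag_len:])
        for tag in (left, right):
            if tag not in linker_tags:
                counts[tag] += 1
    return counts


if __name__ == "__main__":
    tag_a, tag_b = "AAAAAAAAAC", "CCCCCCCCCA"
    ditag = tag_a + reverse_complement(tag_b)
    print(count_tags("CATG" + ditag + "CATG" + ditag + "CATG"))
```

The repeated ditag in the toy concatemer is counted once, so each of the two tags ends up with a count of one.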

What does the data look like?

From tags to genes
Collect sequence records from GenBank that are represented in UniGene
Assign sequence orientation (by finding poly-A tail or poly-A signal or from annotations)
Extract the 10 bases 3’-adjacent to the 3’-most CATG
Assign UniGene identifier to each sequence with a SAGE tag
Record (for each tag-gene pair)
  –#sequences with this tag
  –#sequences in gene cluster with this tag
Maps available at
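As a rough illustration of the tag-extraction step, the sketch below pulls the 10 bases that follow the 3’-most CATG from a sequence already oriented 5’ to 3’, then tallies tag-gene pairs. The function names and the cluster identifiers are hypothetical; real mapping also needs the orientation step described above.

```python
from collections import Counter


def sage_tag(sequence, anchor="CATG", tag_len=10):
    """Return the tag 3'-adjacent to the 3'-most anchor site, or None."""
    pos = sequence.rfind(anchor)
    if pos == -1:
        return None  # no anchoring-enzyme site in this sequence
    start = pos + len(anchor)
    tag = sequence[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tag_gene_counts(records):
    """Count sequences per (tag, cluster) pair.

    `records` is an iterable of (cluster_id, oriented_sequence) pairs.
    """
    counts = Counter()
    for cluster_id, sequence in records:
        tag = sage_tag(sequence)
        if tag is not None:
            counts[(tag, cluster_id)] += 1
    return counts


if __name__ == "__main__":
    demo = [
        ("cluster_1", "GGCATGAAACATGTTTTTTTTTTCC"),
        ("cluster_1", "ACATGTTTTTTTTTTCCAA"),
        ("cluster_2", "AAAA"),
    ]
    print(tag_gene_counts(demo))
```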

From tags to genes
Ideal situation:
  –one gene = one tag
True situation:
  –one gene = many tags (alternative splicing; alternative polyadenylation)
  –one tag = many genes (conserved 3’ regions)

Sequencing Errors
Estimated sequencing error rate:
  –0.7% per base (range 0.2% - 1%)
Errors affect:
  –ditags in a SAGE experiment (can be improved by using phred scores and discarding ambiguous sequences)
  –tag-gene mappings from GenBank (RNA records are better than ESTs)

Reliable tag-gene assignments

SAGE and cancer
Ten SAGE libraries, two each from
  –normal colon
  –colon tumors
  –colon cancer cell lines
  –pancreatic tumors
  –pancreatic cell lines
Pooled each pair
Ref: Zhang et al., Science 1997; 276:

Variability in SAGE libraries

Distribution of tags
303,706 total tags
48,471 distinct tags
Distribution
  –85.9% seen up to 5 times (25% of mass)
  –12.7% between 5 and 50 times (30%)
  –1.3% between 50 and 500 times (26%)
  –0.1% more than 500 times (19%)
Ref: Zhang et al., Science 1997; 276:

How many tags were missed?
They simulated to find 92% chance of detecting tags at 3 copies/cell
Using binomial approximation
  –Get 95% chance for 3 copies/cell
  –Only get 63% chance for 1 copy/cell
Most of what they saw occurred at 1-5 copies per cell
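The binomial figures above can be reproduced with a short calculation. The sketch assumes roughly 300,000 mRNA molecules per cell; that value is not stated on the slide and is used here only because it gives numbers close to those quoted. The function name is hypothetical.

```python
def detection_probability(copies_per_cell, tags_sequenced,
                          transcripts_per_cell=300_000):
    """Chance of seeing a transcript at least once among the sequenced tags.

    Each sequenced tag is treated as an independent draw that hits the
    transcript with probability copies_per_cell / transcripts_per_cell.
    The default transcripts_per_cell is an assumption, not from the slides.
    """
    p = copies_per_cell / transcripts_per_cell
    return 1.0 - (1.0 - p) ** tags_sequenced


if __name__ == "__main__":
    for copies in (1, 3, 5):
        prob = detection_probability(copies, tags_sequenced=303_706)
        print(f"{copies} copies/cell: {prob:.3f}")
```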

Differential Expression
Found 289 tags differentially expressed between normal colon and colon cancer (181 decreased; 108 increased)
Method: Monte Carlo simulation.
  – sims per transcript for relative likelihood of seeing observed difference
  –Used observed distribution of transcripts to simulate 40 experiments.
Ref: Zhang et al., Science 1997; 276:

Sensitivity
Claim: 95% chance of detecting 6-fold difference
Method: Monte Carlo
  –200 simulations, assuming abundance of in first sample and in second sample
Ref: Zhang et al., Science 1997; 276:

Weaknesses in Analysis
Failed to account for intrinsic variability in samples (which changes depending on abundance) in assessing significance
Monte Carlo used the observed distribution, which is definitely not the true distribution.
Sensitivity only measured at one abundance level.

Alternative Analytic Methods
Audic and Claverie, Genome Res 1997; 7:
Chen et al., J Exp Med 1998; 9:
Kal et al., Mol Biol of Cell 1999; 10:
Michiels et al., Physiol Genomics 1999; 1:
Stollberg et al., Genome Res 2000; 10:
Man et al., Bioinformatics 2000; 16:

Audic and Claverie
Main goal: confidence limits for differential expression
Use Poisson approximation for the number of times x you see the same tag.
Put a uniform prior on the Poisson parameter; get the posterior probability of seeing the tag y times in a new experiment:
  p( y | x ) = ( x + y )! / [ x ! y ! 2^( x + y + 1 ) ]
Generalizes to unequal sample sizes
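A small sketch of this probability, computed in log space to avoid overflow for large counts. The unequal-size form used here, with r = N2/N1, multiplies the numerator by r^y and replaces 2 with (1 + r); it reduces to the formula above when the two libraries are the same size. Function and parameter names are hypothetical.

```python
import math


def ac_probability(x, y, n1=1, n2=1):
    """P(y | x): chance of y copies in library 2 given x copies in library 1.

    n1 and n2 are the library sizes; only their ratio matters.
    """
    r = n2 / n1
    log_p = (math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             + y * math.log(r) - (x + y + 1) * math.log(1 + r))
    return math.exp(log_p)


def ac_upper_tail(x, y, n1=1, n2=1):
    """P(Y >= y | x), usable as a one-sided significance level."""
    return 1.0 - sum(ac_probability(x, k, n1, n2) for k in range(y))


if __name__ == "__main__":
    # Tag seen 5 times in the first library and 20 times in the second.
    print(ac_probability(5, 20))
    print(ac_upper_tail(5, 20))
```

Summing `ac_probability(x, y)` over all y gives 1, which is a quick sanity check that the expression is a proper distribution.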

Chen et al.
Assume
  –equal sample sizes
  –tag has concentration X, Y in two samples
Look at W = X/(X+Y)
Use a symmetric Beta prior distribution with a peak near 0.5 (since most genes don’t change)
Use Bayes’ theorem to compute posterior probability of threefold difference in expression
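One way to sketch this calculation: if, given the total x + y, the count x is binomial with parameter W and the prior on W is Beta(a, a), then the posterior is Beta(x + a, y + a), and a threefold difference corresponds to W ≥ 3/4 or W ≤ 1/4. The prior parameter a = 3 and the function names are hypothetical choices for illustration, not values taken from Chen et al.

```python
import math


def beta_cdf(t, a, b):
    """CDF of Beta(a, b) at t, for positive integer a and b.

    Uses the binomial-sum identity for the regularized incomplete beta.
    """
    n = a + b - 1
    return sum(math.comb(n, j) * t**j * (1 - t)**(n - j)
               for j in range(a, n + 1))


def prob_threefold(x, y, a=3):
    """Posterior probability that the two concentrations differ at least 3-fold.

    x, y are the tag counts in two equal-size libraries; a is the
    (assumed) integer parameter of the symmetric Beta(a, a) prior.
    """
    post_a, post_b = x + a, y + a
    lower = beta_cdf(0.25, post_a, post_b)        # Y at least 3 times X
    upper = 1.0 - beta_cdf(0.75, post_a, post_b)  # X at least 3 times Y
    return lower + upper


if __name__ == "__main__":
    for counts in ((10, 10), (10, 2), (30, 3)):
        print(counts, round(prob_threefold(*counts), 4))
```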

Unequal sample sizes
This analysis generalizes easily to the case of unequal size SAGE libraries
  –Lal et al., Cancer Res 1999; 59:
This method is used at the NCBI SAGEmap web site for online differential expression queries

Kal et al.
Assume the proportion of times you see a tag has binomial distribution
Replace with a normal approximation to compute confidence limits
Used at
Equivalent to chi-square test on 2x2 table:
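The equivalence can be checked numerically. The sketch below assumes the pooled proportion is used for the standard error and no continuity correction is applied; under those assumptions the squared z statistic equals the chi-square statistic of the 2x2 table (tag vs. all other tags, library 1 vs. library 2). Function names and the example counts are hypothetical.

```python
import math


def kal_z(x1, n1, x2, n2):
    """Normal-approximation z statistic for two tag proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic (no continuity correction)."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    n = n1 + n2
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def two_sided_p(z):
    """Two-sided p-value from a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    z = kal_z(30, 50_000, 60, 55_000)
    print(z, z ** 2, chi_square_2x2(30, 50_000, 60, 55_000), two_sided_p(z))
```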

Michiels et al.
First perform overall chi-square test to decide if the two SAGE libraries being compared are different.
Get significance by Monte Carlo simulation
Perform gene-by-gene chi-square tests and use them to rank genes in order of “most likely to be different”

Stollberg et al.
Assume binomial distributions
Model the binomial parameters as a sum of two exponentials
  –fit to the Zhang step function data
Simulate from this model, adding
  –sequencing errors
  –nonuniqueness of tags
  –nonrandomness of DNA sequences

Stollberg et al.
Key finding:
  –Naively using observed data to fit model parameters cannot recover the observed data by simulation
  –Maximum likelihood estimates of the parameters that do recover the observed data look very different

Stollberg et al.

Man et al.
Compares specificity and sensitivity of different tests for differential expression
  –Audic and Claverie
  –Kal
  –Fisher’s exact test
Monte Carlo simulation of experiments
Findings
  –Similar power at high abundance
  –Kal has highest power at low abundance

Questions
Sample size computations:
  –How many tags should we sequence if we want to see tags of a given frequency?
  –How many tags should we sequence if we want to see a given percentage of tags?
How many tags are expressed in a sample?
Best method for identifying differential expression?
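For the first sample-size question, a simple starting point (under the assumption that each sequenced tag is an independent draw) is to solve 1 - (1 - f)^N ≥ P for N, where f is the tag frequency and P the desired chance of seeing the tag at least once. The function name and the example frequency are hypothetical, and this sketch does not address the other questions.

```python
import math


def tags_needed(frequency, detect_prob=0.95):
    """Smallest number of tags N with 1 - (1 - frequency)**N >= detect_prob."""
    if not 0.0 < frequency < 1.0 or not 0.0 < detect_prob < 1.0:
        raise ValueError("frequency and detect_prob must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - detect_prob) / math.log(1.0 - frequency))


if __name__ == "__main__":
    # Example: a tag making up 1 in 100,000 of all transcripts.
    for target in (0.63, 0.95, 0.99):
        print(target, tags_needed(1e-5, target))
```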

Additional SAGE references
Review
  –Madden et al., Drug Disc Today 2000; 5:
Online Tools
  –Lash et al., Genome Res 2000; 10:
  –van Kampen et al., Bioinformatics 2000; 16:
Comparison of SAGE and Affymetrix
  –Ishii et al., Genomics 2000; 68:
Combine SAGE and custom microarrays
  –Nacht et al., Cancer Res 1999; 59:
Mapping SAGE data onto genome
  –Caron et al., Science 2001; 291:
Data mining the public SAGE libraries
  –Argani et al., Cancer Res 2001; 61: