Li and Dewey BMC Bioinformatics 2011, 12:323


1 RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Bo Li and Colin N Dewey
Kim Dong-in (test)

2 Abstract + Background
- RNA-Seq: millions of reads, sequenced from one end (single-end) or both ends (paired-end) of cDNA derived from RNA fragments
- Transcript quantification: reads can map to multiple genes or isoforms
- Experimental design: number of reads, read length

3 Abstract + Background
Transcript quantification
- map reads to a genome or a transcript set
- estimate gene and isoform abundances
Major complication
- reads do not always map uniquely to a single gene or isoform

4 Abstract + Background
RSEM (RNA-Seq by Expectation Maximization)
- works from transcript sequences, not a reference genome → can be combined with a de novo transcriptome assembler
- extends the earlier methodology: paired-end reads, variable-length reads, fragment length distributions, quality scores
- reports a 95% credibility interval (CI) and a posterior mean estimate (PME) alongside the maximum likelihood (ML) estimate → for the abundance of each gene and isoform

5 Abstract + Background
RSEM (RNA-Seq by Expectation Maximization): findings from experiments
- for the same amount of sequencing, short SE reads give better gene-level quantification accuracy than PE reads
- the gain from using quality scores is not significant: for Illumina data, an error model based only on the read sequences gives similar quantification accuracy

6 Abstract + Background
Counting approach: number of reads and read length, using only reads that map uniquely to a gene → problems
- mappability is not taken into account: biased estimates
- alternatively spliced genes: incorrect estimates
- isoform abundances are not estimated
→ methods developed to address these problems
- "rescuing" reads that map to multiple genes
- modeling at the isoform level with EM (the expectation-maximization algorithm)
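The isoform-level EM idea on this slide can be illustrated with a toy example. The following is a minimal sketch under strong simplifying assumptions (every read is equally likely to start anywhere on a transcript; no sequencing errors, fragment lengths, or RSPD), so it is not RSEM's actual model. The function name `em_abundances` and its inputs are hypothetical.

```python
def em_abundances(read_hits, lengths, n_iter=200):
    """Toy EM for multiread assignment.

    read_hits: one list per read with the transcript ids it aligns to.
    lengths:   dict mapping transcript id -> (effective) length.
    Returns (theta, counts): theta[t] is the estimated fraction of reads
    from transcript t, counts[t] the expected read count from the last E-step.
    """
    transcripts = list(lengths)
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = dict.fromkeys(transcripts, 0.0)
    for _ in range(n_iter):
        counts = dict.fromkeys(transcripts, 0.0)
        # E-step: split each read among its candidate transcripts in
        # proportion to theta[t] / length[t].
        for hits in read_hits:
            weights = [theta[t] / lengths[t] for t in hits]
            total = sum(weights)
            for t, w in zip(hits, weights):
                counts[t] += w / total
        # M-step: the new read fractions are the normalized expected counts.
        n_reads = len(read_hits)
        theta = {t: counts[t] / n_reads for t in transcripts}
    return theta, counts


# Ten reads map uniquely to A, ten map to both A and B.
reads = [["A"]] * 10 + [["A", "B"]] * 10
theta, counts = em_abundances(reads, {"A": 1000.0, "B": 1000.0})
print(theta, counts)
```

In this toy data set no read supports B on its own, so the iterations shift the ambiguous reads toward A. A unique-read count would instead discard half of the data.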

7 Abstract + Background - Related work
Tools based on similar statistical methods → only RSEM and IsoEM handle reads that map ambiguously between both isoforms and genes
RSEM (RNA-Seq by Expectation Maximization)
- models RSPDs (read start position distributions)
- computes a posterior mean estimate (PME) and a 95% credibility interval (CI)
- designed to work without a whole genome sequence
IsoEM
- maximum likelihood (ML) estimate

8 Implementation
RSEM (RNA-Seq by Expectation Maximization)
1. generate a set of reference transcript sequences
2. align reads to the reference, then estimate abundances and credibility intervals
Scripts
- rsem-prepare-reference
- rsem-calculate-expression

9 Implementation

10 Implementation - Reference sequence preparation
Designed to work with transcript sequences, not a whole genome
1. alignment to a genome is complicated in eukaryotes → splicing and polyadenylation make it challenging at the genome level
2. transcript-level alignments are easier and faster

11 Implementation - Reference sequence preparation
rsem-prepare-reference: sources of reference transcripts
- genome database
- de novo transcriptome assembler
- EST database
- UCSC or Ensembl genome browser database
- set of preprocessed transcript sequences
Poly(A) tail sequences are appended to the reference (disabled with --no-polyA)

12 Implementation - Reference sequence preparation
rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie-path /sw/bowtie \
    /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_125
or: /mm9 in place of the comma-separated list of FASTA files

13 Implementation - Read mapping, abundance estimation
rsem-calculate-expression
- maps (aligns) reads to the reference
- calculates relative abundances
Mapping tools: Bowtie (default), or alignments from another aligner in SAM format
Mapping conditions
- alignments are not limited to a single best alignment
- mismatches are limited within the first 25 bases
- reads with more than 200 alignments are filtered out

14 Implementation - Read mapping, abundance estimation
rsem-calculate-expression
- FASTA input (position-dependent error model) or FASTQ input (quality scores); paired-end or single-end reads
- EM (the expectation-maximization algorithm)
Options
- --strand-specific: reads come from the sense or antisense direction
- --fragment-length (SE): fragment length information for single-end data; with PE data the fragment length distribution is learned

15 Implementation - Read mapping, abundance estimation
rsem-calculate-expression
- --estimate-rspd: estimates the read start position distribution, for data that are highly 5' or 3' biased
- --calc-ci: in addition to the maximum likelihood estimates, computes 95% credibility intervals (which capture uncertainty) and posterior mean estimates
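To make the posterior mean and 95% credibility interval concrete, here is a minimal sketch. It assumes that samples from the posterior distribution of one abundance value are already available (for example from a sampler) and summarizes them with the mean and the 2.5% and 97.5% order statistics. The function name `posterior_summary` is hypothetical, and this is not a description of how RSEM computes its intervals internally.

```python
import math
import statistics


def posterior_summary(samples, level=0.95):
    """Return (posterior mean, lower bound, upper bound) for a list of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    # Conservative order statistics: round the lower index down, the upper up.
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return statistics.fmean(ordered), lower, upper


# Hypothetical posterior samples of one transcript's abundance.
mean, low, high = posterior_summary(list(range(101)))
print(mean, low, high)
```

A wide interval signals that the data do not pin down the abundance well, which is common for isoforms that share most of their sequence.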

16 Implementation - Read mapping, abundance estimation
rsem-calculate-expression: output
- estimated quantity (count) at the isoform level and gene level: used by edgeR, DESeq
- estimated fraction of transcripts: TPM (transcripts per million)
- TPM is independent of the mean expressed transcript length, so TPM is preferred over RPKM and FPKM
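The difference between TPM and FPKM can be checked on toy numbers. The sketch below uses the usual textbook definitions (TPM normalizes per-length rates so that they sum to one million; FPKM divides by length in kilobases and by millions of fragments). The function names and the input values are illustrative assumptions, not RSEM output.

```python
def tpm(expected_counts, eff_lengths):
    """Transcripts per million from expected counts and effective lengths."""
    rates = {t: expected_counts[t] / eff_lengths[t] for t in expected_counts}
    total = sum(rates.values())
    return {t: 1e6 * r / total for t, r in rates.items()}


def fpkm(expected_counts, eff_lengths):
    """Fragments per kilobase of transcript per million fragments."""
    n_fragments = sum(expected_counts.values())
    return {
        t: 1e9 * expected_counts[t] / (eff_lengths[t] * n_fragments)
        for t in expected_counts
    }


counts = {"A": 100.0, "B": 100.0}      # hypothetical expected counts
lengths = {"A": 1000.0, "B": 2000.0}   # hypothetical effective lengths

tpm_values = tpm(counts, lengths)
fpkm_values = fpkm(counts, lengths)
print(sum(tpm_values.values()))   # TPM values sum to one million by construction
print(sum(fpkm_values.values()))  # the FPKM total depends on the expressed lengths
```

Because TPM values always sum to the same total, a transcript's TPM can be compared across samples even when the mix of long and short expressed transcripts differs.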

17 Implementation - Visualization
rsem-calculate-expression: output
- --out-bam: BAM file of alignments for viewing in a genome browser
rsem-bam2wig
- BAM → wig: the expected number of reads overlapping each genomic position
→ requires a GTF-formatted annotation

18 Implementation - Visualization

19 Implementation - Visualization
rsem-plot-model → plots the rsem-calculate-expression output to a PDF report
- learned fragment and read length distributions
- sequencing error parameters

20 Results and Discussion - Comparison to related tools
- IsoEM: reads aligned to transcript sequences (Bowtie)
- Cufflinks, quantification mode: reads aligned to the genome sequence (TopHat)
- rQuant: reads aligned to the genome sequence (TopHat)
- RSEM (v0.6): reads aligned to transcript sequences (Bowtie)

21 Results and Discussion - Comparison to related tools
20 million RNA-Seq fragments (non-strand-specific, mouse transcriptome)
- paired-end reads
- single-end reads: obtained by throwing out the second read of each pair
Reference transcript sets
- RefSeq: 20,852 genes and 1.2 isoforms per gene on average
- Ensembl: 22,329 genes and 3.4 isoforms per gene on average

22 Results and Discussion - Comparison to related tools
Accuracy of the tested methods was measured with
- median percent error (MPE)
- error fraction (EF), at a 10% threshold
- false positive (FP) statistics
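The three accuracy measures can be sketched as follows. The definitions here are assumptions made for illustration: percent error is taken relative to the true value, EF is the fraction of transcripts whose percent error exceeds the threshold, and FP is the fraction of transcripts with true abundance below a cutoff that are estimated at or above it. The function names, the cutoff of 1.0, and the example values are hypothetical; the paper's exact definitions may differ.

```python
import statistics


def percent_errors(estimates, truth):
    """Percent error for every transcript with a nonzero true abundance."""
    return [
        100.0 * abs(estimates[t] - truth[t]) / truth[t]
        for t in truth
        if truth[t] > 0
    ]


def median_percent_error(estimates, truth):
    return statistics.median(percent_errors(estimates, truth))


def error_fraction(estimates, truth, threshold=10.0):
    """Fraction of transcripts whose percent error exceeds the threshold."""
    errors = percent_errors(estimates, truth)
    return sum(e > threshold for e in errors) / len(errors)


def false_positive_fraction(estimates, truth, cutoff=1.0):
    """Among transcripts truly below the cutoff, the fraction estimated at or above it."""
    low = [t for t in truth if truth[t] < cutoff]
    return sum(estimates[t] >= cutoff for t in low) / len(low)


truth = {"a": 10.0, "b": 20.0, "c": 0.0, "d": 0.5}       # hypothetical true values
estimates = {"a": 11.0, "b": 10.0, "c": 2.0, "d": 0.4}   # hypothetical estimates
print(median_percent_error(estimates, truth))
print(error_fraction(estimates, truth))
print(false_positive_fraction(estimates, truth))
```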

23 Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
1. Cufflinks and rQuant do not fully handle reads that map to multiple genes
- Cufflinks: a "rescue"-like strategy, equivalent to one iteration of the EM algorithm
- rQuant: it is not clear how the method handles gene multireads

24 Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
2. Performance gap
- Cufflinks and rQuant align to the genome; RSEM and IsoEM align to a transcript set
- Cufflinks does not properly handle short transcripts: abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases)
- RSEM handles poly(A) tails, IsoEM does not

25 Results and Discussion - Comparison to related tools

26 Results and Discussion - Comparison to related tools
- MPE: median percent error
- EF: error fraction
- FP: false positive

27 Results and Discussion - Comparison to related tools
Microarray Quality Control (MAQC) samples
- HBR: human brain reference
- UHR: universal human reference
qRT-PCR
- 1,000 genes (5%) out of a total of 19,005
- 716 genes after filtering
- biased towards single-isoform genes

28 Results and Discussion - Comparison to related tools

29 Results and Discussion - Paired vs. single end reads
Single-end
- at the gene level, the number of reads matters more than read length
- optimal read length: around 25 bases in mouse and maize
Paired-end
- helps with isoforms of alternatively spliced genes

30 Results and Discussion - Comparison to related tools

31 Results and Discussion - Comparison to related tools
Mouse RefSeq
- Empirical: training data
- Profile: base-dependent
The authors stress that these results hold only for the task of quantification.

32 Results and Discussion - Availability and requirements
Project name: RSEM
Project home page:
Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)
Programming languages: C++, Perl
Other requirements: Pthreads; Bowtie, R

33 Conclusions
RSEM (RNA-Seq by Expectation Maximization)
- performs quantification at the gene and isoform level
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- produces visualization outputs
- produces credibility interval (CI) estimates
- user-friendly: two commands
- uses reference transcript files
- single-end reads for gene-level quantification
- paired-end reads for within-gene isoform quantification in mouse and human

