1
Li and Dewey BMC Bioinformatics 2011, 12:323
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Kim Dong-in (test)
Bo Li and Colin N Dewey
2
Abstract + Background
RNA-Seq
- millions of reads
- reads come from one or both ends of cDNA derived from RNA fragments (single-end, paired-end)
Transcript quantification
- reads that map to multiple genes or isoforms
- experiment design: read count, read length
3
Abstract + Background
Transcript quantification
- mapping reads to a genome or transcript set
- estimating gene and isoform abundances
Major complication
- reads do not always map uniquely to a single gene or isoform
4
Abstract + Background
RSEM (RNA-Seq by Expectation Maximization)
- works from transcript sequences, not a reference genome
- can be combined with a de novo transcriptome assembler
Extensions to the methodology
- paired-end reads, variable-length reads, fragment length distributions, quality scores
- 95% credibility intervals (CI) and posterior mean estimates (PME), in addition to maximum likelihood (ML) estimates, for the abundance of each gene and isoform
5
Abstract + Background
RSEM (RNA-Seq by Expectation Maximization)
In experiments
- for the same amount of sequencing, short SE reads give better quantification accuracy than PE reads at the gene level
- the contribution of quality scores is not significant: with Illumina error rates, read sequences alone are enough for quantification accuracy
6
Abstract + Background
Simple approach: count the reads mapped uniquely to each gene (number of reads, read length)
Problems
- biased when mappability is not taken into account
- incorrect estimates for alternatively spliced genes
- does not give isoform abundances
Methods developed to address these problems
- "rescuing" reads that map to multiple genes
- modeling at the isoform level
- EM (expectation-maximization algorithm)
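To make the EM idea concrete, here is a minimal sketch of how ambiguously mapped reads can be shared out between transcripts. It is a toy model, not RSEM's actual implementation: it ignores transcript length, sequencing errors, and start-position bias, and the function name `em_abundances` and the example reads are hypothetical.

```python
def em_abundances(read_compat, n_transcripts, n_iter=200):
    """Toy EM. read_compat[i] lists the transcript indices read i aligns to."""
    theta = [1.0 / n_transcripts] * n_transcripts  # start from uniform fractions
    n_reads = len(read_compat)
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts,
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new fractions are the expected read counts, normalized.
        theta = [c / n_reads for c in expected]
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1, 4 ambiguous.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundances(reads, 2)
print(theta)
```

In this toy case the ambiguous reads end up split 3:1, following the ratio of the uniquely mapped reads, so the estimated fractions settle at 0.75 and 0.25 instead of discarding a third of the data.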
7
Abstract + Background - Related work
Several tools use similar statistical methods
- only RSEM and IsoEM handle reads that map ambiguously to both isoforms and genes
RSEM (RNA-Seq by Expectation Maximization)
- models RSPDs (read start position distributions)
- computes posterior mean estimates (PME) and 95% credibility intervals (CI)
- designed to work without a whole genome sequence
IsoEM
- maximum likelihood (ML) estimates only
8
Implementation
RSEM (RNA-Seq by Expectation Maximization)
1. generate the reference transcript sequences
2. align reads to the reference, then estimate abundances and credibility intervals
Scripts
- rsem-prepare-reference
- rsem-calculate-expression
9
Implementation
10
Implementation - Reference sequence preparation
Designed to work with transcript sequences, not a whole genome
1. alignment to a genome is complicated (eukaryotes): splicing and polyadenylation make it challenging at the genome level
2. transcript-level alignments are easier and faster
11
Implementation - Reference sequence preparation
rsem-prepare-reference: sources of transcript sequences
- genome database
- de novo transcriptome assembler
- EST database
- UCSC, Ensembl genome browser databases
- set of preprocessed transcript sequences
Appends poly(A) tail sequences to the reference (disabled with --no-polyA)
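As an illustration of the poly(A) step, the sketch below appends a run of A bases to each reference transcript so that reads reaching into the tail can still align. The helper name `append_poly_a`, the toy sequences, and the tail length of 125 are assumptions made for this example, not values taken from the slides.

```python
def append_poly_a(transcripts, tail_length=125):
    """Return a copy of the transcript dict with a poly(A) tail on every sequence.

    tail_length=125 is an assumed value used only for illustration.
    """
    return {name: seq + "A" * tail_length for name, seq in transcripts.items()}


reference = {"tx1": "ATGGCC", "tx2": "ATGTTTGGA"}
with_tails = append_poly_a(reference)
print({name: len(seq) for name, seq in with_tails.items()})
```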
12
Implementation - Reference sequence preparation
rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie-path /sw/bowtie \
    /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_125
- or give the directory /mm9 in place of the comma-separated list of FASTA files
13
Implementation - Read mapping, abundance estimation
rsem-calculate-expression
- maps (aligns) reads to the reference
- calculates relative abundances
Mapping tools: Bowtie (default), or any aligner that produces SAM format
Mapping conditions
- all alignments are reported, not only the single best one
- at most two mismatches in the first 25 bases
- reads with more than 200 alignments are discarded
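The last mapping condition can be sketched as a simple filter, assuming that "reads > 200" means reads with more than 200 candidate alignments are dropped. The function name `filter_multireads` and the `(read_id, transcript_id)` input shape are hypothetical and are not part of RSEM or Bowtie.

```python
from collections import defaultdict


def filter_multireads(alignments, max_alignments=200):
    """Group (read_id, transcript_id) pairs by read and drop over-ambiguous reads."""
    by_read = defaultdict(list)
    for read_id, transcript_id in alignments:
        by_read[read_id].append(transcript_id)
    # Keep every alignment of a read unless the read has too many of them.
    return {r: ts for r, ts in by_read.items() if len(ts) <= max_alignments}


pairs = [("r1", "txA"), ("r2", "txA"), ("r2", "txB"),
         ("r3", "txA"), ("r3", "txB"), ("r3", "txC")]
kept = filter_multireads(pairs, max_alignments=2)  # small cap for the demo
print(kept)
```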
14
Implementation - Read mapping, abundance estimation
rsem-calculate-expression
Input
- FASTA: no quality scores, so a position-dependent error model is used
- FASTQ: paired-end or single-end, with quality scores
EM (expectation-maximization algorithm) options
- --strand-specific: by default reads may come from the sense or antisense direction
- --fragment-length options (SE): with PE data the fragment length distribution is learned from the data
15
Implementation - Read mapping, abundance estimation
rsem-calculate-expression
- --estimate-rspd: learns the read start position distribution, for data that are highly 5' or 3' biased
- --calc-ci: in addition to the maximum likelihood estimates, computes 95% credibility intervals, which capture the uncertainty, and posterior mean estimates
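A minimal sketch of what a posterior mean estimate and a 95% credibility interval summarize, assuming posterior samples of one transcript's abundance are already available (how they are drawn is outside this sketch). The function name `summarize_posterior` and the sample values are hypothetical.

```python
from statistics import mean


def summarize_posterior(samples, level=0.95):
    """Return (posterior mean, (lower, upper)) from a list of posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    # Cut (1 - level) / 2 of the samples from each end of the sorted list.
    lo_idx = max(0, round((1 - level) / 2 * n))
    hi_idx = min(n - 1, round((1 + level) / 2 * n) - 1)
    return mean(ordered), (ordered[lo_idx], ordered[hi_idx])


# Deterministic stand-in for sampler output: the values 0, 1, ..., 999.
samples = list(range(1000))
pme, ci = summarize_posterior(samples)
print(pme, ci)
```

A wide interval around the point estimate signals that the reads do not pin down that transcript's abundance well, which is typical when most of its reads are shared with other isoforms.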
16
Implementation - Read mapping, abundance estimation
rsem-calculate-expression - output
Estimated quantity of reads
- isoform-level and gene-level
- can be used by edgeR, DESeq
Estimated fraction of transcripts
- TPM (transcripts per million)
- independent of the mean expressed transcript length
- TPM is preferred over RPKM, FPKM
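TPM can be worked through with a small sketch: divide each transcript's estimated count by its length, then rescale so the values sum to one million. The function name `tpm` and the numbers are hypothetical, and plain transcript lengths stand in for whatever length correction a real tool applies.

```python
def tpm(expected_counts, lengths):
    """Convert per-transcript counts to transcripts per million."""
    # Reads per base: longer transcripts yield more reads per molecule.
    rates = [count / length for count, length in zip(expected_counts, lengths)]
    total = sum(rates)
    # Normalize so that all values sum to one million.
    return [1e6 * rate / total for rate in rates]


counts = [100.0, 300.0, 600.0]
lengths = [1000.0, 1500.0, 3000.0]
values = tpm(counts, lengths)
print(values)
```

Because the values always sum to one million, a given TPM means the same share of the transcript pool in every sample, which is the sense in which it is independent of the mean expressed transcript length.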
17
Implementation - Visualization
rsem-calculate-expression - output
- --out-bam: BAM file of alignments, viewable in a genome browser
rsem-bam2wig
- converts BAM to wig
- gives the expected number of reads overlapping each genomic position
Requires a GTF-formatted annotation
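The "expected number of reads overlapping each genomic position" can be illustrated with a short sketch in which each alignment carries a weight, assumed here to be the probability that the read truly comes from that location. The function name `expected_coverage`, the interval convention (0-based, half-open), and the example alignments are hypothetical.

```python
def expected_coverage(alignments, length):
    """Sum alignment weights per position. alignments: (start, end, weight)."""
    depth = [0.0] * length
    for start, end, weight in alignments:
        # Clip each interval to the region being plotted.
        for pos in range(max(0, start), min(length, end)):
            depth[pos] += weight
    return depth


alignments = [
    (0, 4, 1.0),    # uniquely mapped read
    (2, 6, 0.75),   # multiread, most likely location
    (6, 10, 0.25),  # same multiread, less likely location
]
depth = expected_coverage(alignments, 10)
print(depth)
```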
18
Implementation - Visualization
19
Implementation - Visualization
rsem-plot-model
- turns the rsem-calculate-expression output into a PDF report
- learned fragment and read length distributions
- sequencing error parameters
20
Results and Discussion - Comparison to related tools
- IsoEM: reads aligned to transcript sequences (Bowtie)
- Cufflinks (quantification mode): reads aligned to the genome sequence (TopHat)
- rQuant: reads aligned to the genome sequence (TopHat)
- RSEM (v0.6): reads aligned to transcript sequences (Bowtie)
21
Results and Discussion - Comparison to related tools
20 million RNA-Seq fragments (non-strand-specific, mouse transcriptome)
- paired-end reads
- single-end reads: obtained by throwing out the second read of each pair
Reference transcript sets
- RefSeq: 20,852 genes and 1.2 isoforms per gene on average
- Ensembl: 22,329 genes and 3.4 isoforms per gene on average
22
Results and Discussion - Comparison to related tools
Accuracy measures for the tested methods
- median percent error (MPE)
- error fraction (EF), with a 10% threshold
- false positive (FP) statistics
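The three accuracy measures can be sketched as follows, under assumed definitions that the slides do not spell out: percent error is taken relative to the true abundance of expressed transcripts, EF is the fraction of those with more than 10% error, and FP is the fraction of unexpressed transcripts estimated at or above 1 TPM. The function name `accuracy_metrics`, the 1 TPM cutoff, and the data are hypothetical.

```python
from statistics import median


def accuracy_metrics(true_tpm, est_tpm, ef_threshold=10.0, min_tpm=1.0):
    """Return (MPE, EF, FP) under the assumed definitions in the text above."""
    expressed = [(t, e) for t, e in zip(true_tpm, est_tpm) if t >= min_tpm]
    pct_err = [100.0 * abs(e - t) / t for t, e in expressed]
    mpe = median(pct_err)
    ef = sum(p > ef_threshold for p in pct_err) / len(pct_err)
    # False positives: truly unexpressed transcripts called as expressed.
    unexpressed = [e for t, e in zip(true_tpm, est_tpm) if t < min_tpm]
    fp = sum(e >= min_tpm for e in unexpressed) / len(unexpressed) if unexpressed else 0.0
    return mpe, ef, fp


true_tpm = [100.0, 50.0, 10.0, 0.0, 0.0]
est_tpm = [105.0, 40.0, 10.5, 2.0, 0.0]
mpe, ef, fp = accuracy_metrics(true_tpm, est_tpm)
print(mpe, ef, fp)
```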
23
Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
1. Cufflinks and rQuant do not fully handle reads that map to multiple genes
- Cufflinks: a "rescue"-like strategy, equivalent to one iteration of the EM algorithm
- rQuant: it is not clear how the method handles gene multireads
24
Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
2. Performance gap
- Cufflinks, rQuant: reads aligned to the genome
- RSEM, IsoEM: reads aligned to the transcript set
- Cufflinks does not properly handle short transcripts: abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases)
- RSEM handles poly(A) tails, IsoEM does not
25
Results and Discussion - Comparison to related tools
26
Results and Discussion - Comparison to related tools
- MPE: median percent error
- EF: error fraction
- FP: false positive
27
Results and Discussion - Comparison to related tools
Microarray Quality Control (MAQC) data
- HBR: human brain reference
- UHR: universal human reference
- qRT-PCR: 1,000 genes (5%) out of a total of 19,005
- 716 genes after filtering
- biased towards single-isoform genes
28
Results and Discussion - Comparison to related tools
29
Results and Discussion - Paired vs. single end reads
Single-end
- at the gene level, the number of reads matters more than read length
- optimal read length is around 25 bases in mouse and maize
Paired-end
- helps at the isoform level, for alternatively spliced genes
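The trade-off between read count and read length comes down to simple arithmetic when the total number of sequenced bases is fixed. The sketch below uses a hypothetical budget of 1.5 billion bases and a hypothetical helper `fragments_for_budget`; the numbers are illustrative and not taken from the paper.

```python
def fragments_for_budget(total_bases, read_length, paired=False):
    """Number of fragments that a fixed base budget can cover."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return total_bases // bases_per_fragment


budget = 1_500_000_000  # hypothetical sequencing budget in bases
se25 = fragments_for_budget(budget, 25)
se75 = fragments_for_budget(budget, 75)
pe75 = fragments_for_budget(budget, 75, paired=True)
print(se25, se75, pe75)
```

With this budget, 25-base single-end reads sample six times as many fragments as 75-base paired-end reads, which is why short single-end reads can win at the gene level, where counting fragments matters more than resolving splice forms.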
30
Results and Discussion - Comparison to related tools
31
Results and Discussion - Comparison to related tools
- reference: mouse RefSeq
- Empirical: training data
- Profile: base-dependent
- these results hold only for the task of quantification ("We stress…")
32
Results and Discussion - Availability and requirements
- Project name: RSEM
- Project home page:
- Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)
- Programming languages: C++, Perl
- Other requirements: Pthreads; Bowtie, R
33
Conclusions
RSEM (RNA-Seq by Expectation Maximization)
- performs quantification at the gene and isoform level
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- visualization outputs
- credibility interval (CI) estimates
- user-friendly: two commands
- takes reference transcript files
- single-end reads for gene-level quantification
- paired-end reads for within-gene isoform quantification in mouse, human