RNA-Seq Assembly (Transcriptome Assembly). Tang Haibao (唐海宝), Center for Genomics and Biotechnology. November 23, 2013.


1 RNA-Seq Assembly (Transcriptome Assembly). Tang Haibao (唐海宝), Center for Genomics and Biotechnology. November 23, 2013

2 Motivation: Why transcriptome sequencing (RNA-seq)?
- Gene expression / differential expression
- Reconstruct transcripts
- Exon-exon junction detection (genome annotation)
- Alternative splicing / isoforms
- SNP detection
- ...

3 RNA-seq with Illumina Wang L, Brutnell T et al. (2010) Brief Funct Genomic Proteomic 9:118-28.

4 Constructing transcripts from RNA-Seq data. Haas & Zody, 2010. http://www.nature.com/nbt/journal/v28/n5/fig_tab/nbt0510-421_F1.html

5 Why de novo assembly?
- No reference genome available for the species
- Genomic sequence may be incomplete (even reference genomes!), fragmented, or altered

6 Run FastQC first! Quality trimming: based on quality scores

7 Base quality
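
As a rough illustration of what a base quality score encodes, the sketch below decodes a FASTQ quality string into Phred scores and converts a score to an error probability. The function names are hypothetical, and the ASCII offset is an assumption: 33 is used by Sanger-style and recent Illumina files, while some older Illumina pipelines used 64, so check which one applies to your data.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII-encoded FASTQ quality string to integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding);
    some older Illumina files use offset=64.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Decode a short, made-up quality string and show the implied error rates.
for q in phred_scores("II5+#"):
    print(q, error_probability(q))
```

A score of 20 corresponds to an estimated 1-in-100 chance that the base call is wrong, which is why trimming thresholds are usually expressed in these units.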

8 Data Quality Assessment. Quality scores across reads: a good example and a bad example (filtering needed)

9 Data Quality Assessment - FastQC. GC distribution: a good example and a bad example

10 Adapter contamination

11 Data Quality Assessment: Recommendations
- Generate quality plots for all read libraries
- Trim and/or filter data if needed. Always trim and filter for de novo transcriptome assembly
- Regenerate quality plots after trimming and filtering to determine effectiveness

12 Trimmomatic example. The command below performs the following steps:
- Remove adapters
- Remove leading low-quality or N bases (below quality 3)
- Remove trailing low-quality or N bases (below quality 3)
- Scan the read with a 4-base-wide sliding window, cutting when the average quality per base drops below 15
- Drop reads shorter than 36 bases
- Read and write files in gzipped format

$ java -cp trimmomatic-0.15.jar org.usadellab.trimmomatic.TrimmomaticPE \
    s_1_1_sequence.txt.gz s_1_2_sequence.txt.gz \
    lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz \
    lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz \
    ILLUMINACLIP:adapters.fasta:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
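
To make the trimming steps concrete, here is a simplified Python sketch of the LEADING, TRAILING, SLIDINGWINDOW and MINLEN logic applied to one read's quality scores. It is not Trimmomatic's actual implementation (adapter clipping is omitted, and the exact cut position inside a failing window may differ); the function name `trim_read` and the example scores are made up for illustration.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Return (start, end) of the kept slice of a read, or None if it is dropped.

    quals is a list of integer Phred scores. Defaults mirror the parameters
    in the command above; the logic is a simplification, not Trimmomatic's code.
    """
    start, end = 0, len(quals)

    # LEADING: drop bases below the threshold from the 5' end.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: drop bases below the threshold from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan 5' -> 3' and cut at the first window whose
    # average quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break

    # MINLEN: discard reads that became too short.
    if end - start < min_len:
        return None
    return start, end


# A made-up read: 2 poor bases, 40 good bases, then a low-quality tail.
example = [2, 2] + [30] * 40 + [5] * 6 + [2]
print(trim_read(example))
```

In paired-end mode, a read whose mate is dropped ends up in the "unpaired" output file, which is why the command names four output files.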

13 Workflow

14-18 Assembly strategy - k-mer construction. Create all substrings of length k from the reads. Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.
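
The k-mer construction step can be sketched in a few lines of Python. The function name `kmers` and the toy read are illustrative only; real assemblers also count k-mers and handle reverse complements, which is left out here.

```python
def kmers(read, k):
    """Return all substrings of length k from a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# A toy 6-base read yields four overlapping 3-mers.
print(kmers("ACGTAC", 3))
```

A read of length L contributes L - k + 1 k-mers, so a read shorter than k contributes none.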

19-20 Assembly strategy - k-mer construction. Generate de Bruijn graph. Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.
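
A minimal sketch of building a de Bruijn graph from reads follows. It uses one common formulation in which nodes are (k-1)-mers and each distinct k-mer contributes an edge from its prefix to its suffix; the cited figure may instead draw k-mers as nodes joined by (k-1)-base overlaps, which carries the same information. The function name and toy reads are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: set of suffix (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its first k-1 bases to its last k-1 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads collapse into a single shared set of edges.
graph = de_bruijn_graph(["ACGTAC", "CGTACG"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because identical k-mers from different reads map to the same edge, overlapping reads merge without pairwise alignment. A node with more than one outgoing edge marks a branch, which in transcriptome data can reflect a sequencing error, a paralog, or an alternative splice form.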

21-22 Assembly strategy - de Bruijn graph. Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.

23-24 Assembly strategy - k-mer construction. Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.

25 Assembly strategy - de Bruijn-graph Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.

26 Assembly strategy - splice isoforms Martin, J. A., and Wang, Z. (2011). Nature Reviews Genetics 12, 671-682.

27 Software: an (incomplete) selection
- Trinity (single k-mer): Broad Institute and Hebrew University of Jerusalem
- Trans-ABySS (multiple k-mers): Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency
- Velvet-Oases (single & multiple k-mers): EMBL-EBI / MPI for Molecular Genetics
- SOAPdenovo-Trans: Beijing Genomics Institute (华大)
- CLC pipeline (not free): CLCbio

28 Example assembly Zhao et al. (2011). BMC Bioinformatics 12 Suppl 1, S2.

29 Runtime and memory usage Zhao et al. (2011). BMC Bioinformatics 12 Suppl 1, S2. 106.8M read pairs, on 20 CPUs

30 Assemblers
- De novo assemblers are prone to missing lowly expressed transcripts
- Multi-k-mer approaches can improve assembly results
- Pool RNA-seq reads from different samples

Assembler overview
Assembler       Running time    Memory requirements
Trinity         +++ (the two columns are run together in the transcript)
Velvet-Oases    ○               ++
Trans-ABySS     ○               -
SOAPdenovo      -               ○

31 Trinity

32-34 Trinity
- Inchworm: sequencing error correction; assemble candidate contigs
- Chrysalis: build de Bruijn transcript graphs from candidate contigs
- Butterfly: resolve alternatively spliced and paralogous transcripts
- k-mer length fixed to 25
- Can incorporate results from reference-based analysis

35 Assembly QC - contiguity
- Average length, min and max length, combined total length (N%)
- N50 captures how much of the assembly is covered by relatively large contigs: "Half of all the bases I've assembled are in contigs of at least {N50} bp ..."
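
The N50 statistic can be computed directly from a list of contig lengths. The sketch below uses the usual definition (the length of the shortest contig in the smallest set of longest contigs that together hold at least half of the assembled bases); the function name and example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contigs = [100, 200, 300, 400, 500]
print(n50(contigs))
```

For transcriptome assemblies N50 should be read with care: a higher value can reflect chimeric or over-extended contigs rather than better transcripts, which is why the accuracy and completeness checks also matter.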

36 Assembly QC. Ask these questions:
- Accuracy: How many of the assembled contigs map to the genome?
- Accuracy: What are the contigs that do not align? (BLAST against nr)
- Completeness: How many previously annotated genes are covered by the contigs? How many are full-length?
- Contiguity: Does a single contig cover each gene?
- Compare results from multiple programs
Martin et al., 2010

37 Exercise
- Assemble rice RNA-seq data
- Compare two different assemblers
- Compare to standard rice gene models (MSU7.0)

