RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.

Presentation transcript:

A historic perspective
- Traditional: sequence cDNA libraries by Sanger
  - Tens of thousands of pairs at most (20K genes in mammal)
  - Redundancy due to highly expressed genes
  - Not only coding genes are transcribed
  - Poor full-lengthness (read length about 800 bp)
  - Indels are the dominant error mode in Sanger (frameshifts)

A historic perspective
- Quantification: microarrays
  - Sequences have to be known
  - Annotations are often incomplete
  - No novel transcripts
  - Hybridization bias (SNPs)
  - Noise

Next-Gen Sequencing technologies
- 1 lane of HiSeq yields 30 GB of sequence
- Short reads (100 nt), but:
- Good depth, high dynamic range
- Full-length transcripts
- Novel transcripts
- Allow for expression quantification
- Error patterns are mostly substitutions
- Strand-specific libraries

Strategy: read mapping vs. de novo assembly
[Figure labels: Good reference; No genome]
Haas and Zody, Nature Biotechnology 28, 421–423 (2010)

Leveraging RNA-Seq for Genome-free Transcriptome Studies Brian Haas

A Paradigm for Genomic Research
[Diagram labels: WGS Sequencing; Assemble; Draft Genome Scaffolds; Align; Expression; Transcripts; SNPs; Methylation; Proteins; Tx-factor binding sites]

A Maturing Paradigm for Transcriptome Research
[Diagram labels: WGS Sequencing; Assemble; Draft Genome Scaffolds; Align; Methylation; Tx-factor binding sites; markers: $$$$$, $, $, +]

De novo transcriptome assembly
Brian Haas, Moran Yassour, Kerstin Lindblad-Toh, Aviv Regev, Nir Friedman, David Eccles, Alexie Papanicolaou, Michael Ott, …

The problem
[Diagram labels, in build order: Transcript; Reads; Transcript Assembly; Paralog A, Paralog B; Isoform A, Isoform B]

Transcriptome vs. Genome assembly
Genome:
- Large
- High coverage
- Long mate pairs (hard to make)
- Linear sequences
- Even coverage
Transcriptome:
- Smaller
- Standard paired-end Illumina (1 lane)
- Multiple solutions (alternative splicing)
- Uneven coverage (expression)
In common: k-mer based approach

The k-mer
- K consecutive nucleotides
[Figure labels: Reads; K-mers; Graph]
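The decomposition of reads into k-mers can be sketched in a few lines of Python. This is a minimal illustration only: the function name `count_kmers` and the toy reads are assumptions made for this example and do not come from the slides.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer (k consecutive nucleotides) in the reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads (hypothetical data, for illustration only).
reads = ["CTTGGAACAAT", "TTGGAACAATG"]
kmer_counts = count_kmers(reads, k=7)
```

K-mers that occur in both reads receive a count of 2, which is the kind of coverage signal an assembler can use later.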

The de Bruijn Graph
- Graph of overlapping sequences
- Intended for cryptology
- Fixed length element: k
Example (k = 7): CTTGGAA, TTGGAAC, TGGAACA, GGAACAA, GAACAAT

The de Bruijn Graph
- Graph has “nodes” and “edges”
[Figure: example graph with node sequences G, GGCAATTGACTTTT…, CTTGGAACAAT, TGAATT, A, GAAGGGAGTTCCACT…]
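A small Python sketch of how such a graph can be built from the five example 7-mers. It assumes the node-centric variant, in which each k-mer is a node and an edge links two k-mers that overlap by k-1 nucleotides. The function name `build_de_bruijn` is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(kmers):
    """Link each k-mer to the k-mers whose (k-1)-prefix equals its (k-1)-suffix."""
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    graph = {}
    for kmer in kmers:
        # Successors are all k-mers that start with this k-mer's last k-1 bases.
        graph[kmer] = sorted(by_prefix.get(kmer[1:], []))
    return graph


# The five 7-mers from the slide.
kmers = {"CTTGGAA", "TTGGAAC", "TGGAACA", "GGAACAA", "GAACAAT"}
graph = build_de_bruijn(kmers)
```

Following the edges from CTTGGAA visits the k-mers in order and spells out CTTGGAACAAT. A node with more than one successor marks a branch, for example at a sequencing error or an alternative splice junction.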

Iyer MK, Chinnaiyan AM (2011) Nature Biotechnology 29, 599–600

Inchworm Algorithm
- Decompose all reads into overlapping k-mers (25-mers).
- Identify seed k-mer as most abundant k-mer, ignoring low-complexity k-mers.
- Extend k-mer at 3’ end, guided by coverage.
- Remove assembled k-mers from catalog, then repeat the entire process.
[Animation: seed k-mer GATTACA (9) is extended base by base; the candidate bases G, A, T, C are shown with their coverage]
Report contig: ….AAGATTACAGA….
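The greedy procedure described for Inchworm can be sketched as follows. This is a simplified illustration, not Trinity's implementation: it extends only at the 3' end, skips the low-complexity filter, and uses a toy k of 4 instead of 25. The function names and the k-mer counts are hypothetical.

```python
from collections import Counter

BASES = "GATC"


def greedy_contig(kmer_counts, k):
    """Seed with the most abundant k-mer, then extend at the 3' end by coverage."""
    counts = dict(kmer_counts)
    seed = max(counts, key=counts.get)
    contig = seed
    used = {seed}
    while True:
        suffix = contig[-(k - 1):]
        # Candidate extensions: unused k-mers that start with the last k-1 bases.
        candidates = [suffix + b for b in BASES
                      if counts.get(suffix + b, 0) > 0 and suffix + b not in used]
        if not candidates:
            break
        best = max(candidates, key=counts.get)  # highest coverage wins
        used.add(best)
        contig += best[-1]
    return contig, used


def assemble(kmer_counts, k):
    """Report contigs until the k-mer catalog is empty (requires k >= 2)."""
    remaining = Counter(kmer_counts)
    contigs = []
    while remaining:
        contig, used = greedy_contig(remaining, k)
        contigs.append(contig)
        for kmer in used:
            del remaining[kmer]  # remove assembled k-mers, then repeat
    return contigs


# Hypothetical 4-mer counts; ATTC stands in for a low-coverage variant.
toy_counts = {"GATT": 9, "ATTA": 4, "ATTC": 1, "TTAC": 4,
              "TACA": 4, "ACAG": 4, "CAGA": 4}
contigs = assemble(toy_counts, k=4)
```

At the branch after GATT, the extension ATTA (count 4) is preferred over ATTC (count 1), so the low-coverage k-mer is left behind and reported as its own short contig in a later round.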

Inchworm Contigs from Alt-Spliced Transcripts => Minimal lossless representation of data

Chrysalis
- Integrate isoforms via k-1 overlaps
- Verify via “welds”
- Build de Bruijn Graphs (ideally, one per gene)
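The grouping step can be illustrated with a short union-find sketch in Python. It makes a strong simplifying assumption: two contigs are joined as soon as they share any (k-1)-mer, and the read-supported “weld” verification is omitted entirely. The function name `cluster_contigs` and the toy contigs are hypothetical.

```python
from collections import defaultdict


def cluster_contigs(contigs, k):
    """Group contigs that share at least one (k-1)-mer (no weld check)."""
    parent = list(range(len(contigs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (k-1)-mer -> index of the first contig containing it
    for idx, contig in enumerate(contigs):
        for i in range(len(contig) - (k - 1) + 1):
            sub = contig[i:i + k - 1]
            if sub in first_seen:
                parent[find(idx)] = find(first_seen[sub])
            else:
                first_seen[sub] = idx

    groups = defaultdict(list)
    for idx, contig in enumerate(contigs):
        groups[find(idx)].append(contig)
    return sorted(groups.values())


# Toy contigs (hypothetical): the first two share the 4-mers CCCG, CCGG, CGGG.
components = cluster_contigs(["AAACCCGGG", "CCCGGGTTT", "TTTTATATA"], k=5)
```

Each resulting group is a candidate component for which a separate de Bruijn graph would then be built, ideally one per gene.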

Result: linear sequences grouped in components, contigs and sequences
>comp1017_c1_seq1_FPKM_all:30.089_FPKM_rel:30.089_len:403_path:[5739,5784,5857,5863,353]
TTGGGAGCCTGCCCAGGTTTTTGCTGGTACCAGGCTAAGTAGCTGCTAACACTCTGACTGGCCCGGCAGGTGATGGTGAC
TTTTTCCTCCTGAGACAAGGAGAGGGAGGCTGGAGACTGTGTCATCACGATTTCTCCGGTGATATCTGGGAGCCAGAGTA
ACAGAAGGCAGAGAAGGCGAGCTGGGGCTTCCATGGCTCACTCTGTGTCCTAACTGAGGCAGATCTCCCCCAGAGCACTG
ACCCAGCACTGATATGGGCTCTGGAGAGAAGAGTTTGCTAGGAGGAACATGCAAAGCAGCTGGGGAGGGGCATCTGGGCT
TTCAGTTGCAGAGACCATTCACCTCCTCTTCTCTGCACTTGAGCAACCCATCCCCAGGTGGTCATGTCAGAAGACGCCTG
GAG
>comp1017_c1_seq2_FPKM_all:4.913_FPKM_rel:2.616_len:525_path:[2317,2791]
CTGGAGATGGTTGGAACAAATAGCCGGCTGGCTGGGCATCATTCCCTGCAGAAGGAAGCACACAGAATGGTCGTTAAGTA
ACAGGGAAGTTCTCCACTTGGGTGTACTGTTTGTGGGCAACCCCAGGGCCCGGAAAGGACAGACAGAGCAGCTTATTCTG
TGTGGCAATGAGGGAGGCCAAGAAACAGATTTATAATCTCCACAATCTTGAGTTTCTCTCGAGTTCCCACGTCTTAACAA
AGTTTTTGTTTCAATCTTTGCAGCCATTTAAAGGACTTTTTGCTCTTCTGACCTCACCTTACTGCCTCCTGCAGTAAACA
CAAGTGTTTCAGGCAAAGAAACAAAGGCCATTTCATCTGACCGCCCTCAGGATTTAGAATTAAGACTAGGTCTTGGACCC
CTTTACACAGATCATTTCCCCCATGCCTCTCCCAGAACTGTGCAGTGGTGGCAGGCCGCCTCTTCTTTCCTGGGGTTTCT
TTGAATGTATCAGGGCCCGCCCCACCCCATAATGTGGTTCTAAAC
>comp1017_c1_seq3_FPKM_all:3.322_FPKM_rel:2.91_len:2924_path:[2317,2842,2863,1856,1835]
CTGGAGATGGTTGGAACAAATAGCCGGCTGGCTGGGCATCATTCCCTGCAGAAGGAAGCACACAGAATGGTCGTTAAGTA
ACAGGGAAGTTCTCCACTTGGGTGTACTGTTTGTGGGCAACCCCAGGGCCCGGAAAGGACAGACAGAGCAGCTTATTCTG
TGTGGCAATGAGGGAGGCCAAGAAACAGATTTATAATCTCCACAATCTTGAGTTTCTCTCGAGTTCCCACGTCTTAACAA
AGTTTTTGTTTCAATCTTTGCAGCCATTTAAAGGACTTTTTGCTCTTCTGACCTCACCTTACTGCCTCCTGCAGTAAACA
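The FASTA headers in this output pack several fields into one identifier. The Python sketch below pulls them apart with a regular expression. The pattern is inferred only from the three headers shown on the slide, so it is an assumption about the format rather than a documented specification, and `parse_header` is a hypothetical helper name.

```python
import re

# Pattern inferred from the example headers only (an assumption).
HEADER_RE = re.compile(
    r"^>?(?P<component>comp\d+)_(?P<contig>c\d+)_(?P<seq>seq\d+)"
    r"_FPKM_all:(?P<fpkm_all>[\d.]+)_FPKM_rel:(?P<fpkm_rel>[\d.]+)"
    r"_len:(?P<length>\d+)_path:\[(?P<path>[\d,]*)\]$"
)


def parse_header(line):
    """Split one header line into component, contig, sequence, FPKM, length, path."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized header: {line!r}")
    return {
        "component": m["component"],
        "contig": m["contig"],
        "sequence": m["seq"],
        "fpkm_all": float(m["fpkm_all"]),
        "fpkm_rel": float(m["fpkm_rel"]),
        "length": int(m["length"]),
        # Node identifiers along the graph path, in order.
        "path": [int(n) for n in m["path"].split(",") if n],
    }


record = parse_header(
    ">comp1017_c1_seq1_FPKM_all:30.089_FPKM_rel:30.089_len:403"
    "_path:[5739,5784,5857,5863,353]"
)
```

Grouping parsed records by `component` and `contig` recovers the hierarchy named in the slide title: components, contigs and sequences.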

Result: linear sequences grouped in components, contigs and sequences
GTTCGAGGACCTGAATAAGCGCAAGGACACCAAGGAGATCTACACGCACTTCACGTGCGCCACCGACACCAAGAACGTGC
AGTTTGTGTTTGATGCCGTCACCGACGTCATCATCAAGAACAACCTGAAGGACTGCGGCCTCTTCTGAGGGGCAGCGGGG
CCTGGCAGGATGG
CCTGGCAGGATGGTGAGCCCGGGGTGGAGCGGAGCAGAGCTGTGGAGCCCAGAGAAGGGAGCGGTGGGGGCTGGGGTGGG
CCGTGGTGGGGGTATGGTGGTAGAGTGGTAGGTCGGTAGGACGACCTGAGGGGCATGGGCACACGGATAGGCCGGGCCGG
GGCCCAGATGGCAGAAGCATCCGGCCGTGCGCCGGGAGACAACGGAATGGCTGTCCTGACCACCCTTGGAGAAAGCTTAC
CGGCTCTGTGCTCAGCCCTGCAGTCTTTCCCTCAGACCTATCTGAGGGTTCTGGGCTGACACTGGCCTCACTGGCCGTGG
GGGAGATGGGCACGGTTCTGCCAGTACTGTAGATCCCCCTCCCTCACGTAACCCAGCAACACACACACTGGCTCTGGGGC
AGCCACTGGGTCCCTCATAACAGGTGGAGGAGAAAAAGGAGAGAGTCCTTGTCTAGGGAGGGGGGAGGAGAGACACACCC
GCCACCGCCGACTCTGCTTCCCCCAGTTCCTGAGGA
TGGCCACCTCCCGACCCATGCCCTGACTGTCCCCCACCTCCAGGGCCACCGCCGACTCTGCTTCCCCCAGTTCCTGAGGA
AGATGGGGGCAAGAGGACCACGCTCTCTGCCTGTCCGTACCCCCGCCCTGGCTGCTTTTCCCCTTTTCTTTGTTCTTGGC
TCCCCTGTTCCCTCCCTCAGTTCCAGAGACTCGTGGGAGGAGCTGCCACAGGCCTCCCTGTTTGAAGCCGGCCCTTGTCC

Expression Level Comparison
Near-Full-Length Assembled Transcripts Are Suitable Substrates for Expression Measurements (80-100% Length Agreement)
[Scatter plot: Reference transcript vs. Trinity Assembly, log2(FPKM); R² = 0.95]
*Abundance Estimation via RSEM.

Expression Level Comparison
Trinity Partially-reconstructed Transcripts Can Serve as a Proxy for Expression Measurements
[Scatter plots: Reference transcript vs. Trinity Assembly, log2(FPKM), one panel per length-agreement bin]
- R² values, in slide order: 0.95, 0.83, 0.72, 0.58, 0.40
- Panel labels, in slide order: 60-80% Length, (one label lost in extraction) % Length, 20-40% Length, 0-20% Length
- Only 13% of Trinity Assemblies (80-100% Length Agreement)
*Abundance Estimation via RSEM.
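A comparison of this kind can be reproduced with a short Python sketch that computes R² between two sets of expression values in log2 space. Several choices here are assumptions, not details from the slides: R² is taken as the squared Pearson correlation, a pseudocount of 1 is added before the logarithm to avoid log2(0), and the function name and FPKM values are hypothetical.

```python
import math


def r_squared_log2(fpkm_x, fpkm_y, pseudocount=1.0):
    """Squared Pearson correlation of log2(FPKM + pseudocount) values."""
    if len(fpkm_x) != len(fpkm_y) or len(fpkm_x) < 2:
        raise ValueError("need two equally long lists with at least two values")
    xs = [math.log2(v + pseudocount) for v in fpkm_x]
    ys = [math.log2(v + pseudocount) for v in fpkm_y]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical FPKM values for reference transcripts and their assemblies.
reference = [30.1, 4.9, 3.3, 120.0, 0.8]
assembled = [28.7, 5.6, 2.1, 131.5, 1.4]
r2 = r_squared_log2(reference, assembled)
```

Running the same calculation separately for each length-agreement bin would give one R² per panel.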

General design issues
Q: How many reads do I need?
A: Depends on your biological question (1 lane saturates the sample).
Q: How many tissues do I need?
A: Depends on your organism.
Q: Do I want strand-specific libraries?
A: Yes!
Q: polyA+ or duplex-specific nuclease (DSN)?
A: polyA+ specific to pol II transcripts, DSN also gets others.
Q: Can I assemble a mix of species?
A: With limited success, yes. More to come.

Questions?