Introduction to Next-Generation Sequencing


Introduction to Next-Generation Sequencing Kihoon Yoon, Ph.D. Dept of Epidemiology & Biostatistics School of Medicine University of Texas Health Science Center at San Antonio

Outline Sequencing technologies Applications Bioinformatics tools for short-read sequencing Examples of Applications: ChIP-Seq /RNA-Seq

Sequencing technologies
Next-next-...-generation: how many 'next's are there?
First generation: automated version of Sanger sequencing (DNA sequencing method invented by Fred Sanger in the 1970s)
  1000 bases per read / high cost, $0.50 per 1000 bases / takes 500 days to read one gigabase (Gb, one billion bases; about 1/3 of the human genome)
Second generation
  Roche/454 sequencing machine from 454 Life Sciences (2005): 450 bases per read / $0.02 per 1000 bases / 2 days per Gb
  Solexa from Illumina (2006): 75 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
  SOLiD from Applied Biosystems (2006): 50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
Next-next-gen: third generation?
  HiSeq 2000 from Illumina: 0.04 days per Gb
  Helicos HeliScope (www.helicosbio.com)
  Pacific Biosciences SMRT (www.pacificbiosciences.com)
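To put the per-platform figures on a common scale, the arithmetic can be sketched as follows. The costs and run times are the ones quoted on this slide; the 3 Gb genome size is an assumption derived from the slide's remark that 1 Gb is about 1/3 of the human genome, and the table layout and function names are purely illustrative.

```python
# Figures quoted on the slide: (cost in USD per 1000 bases, days per Gb).
PLATFORMS = {
    "Sanger (automated)": (0.50, 500.0),
    "Roche/454": (0.02, 2.0),
    "Illumina Solexa": (0.001, 0.5),
    "ABI SOLiD": (0.001, 0.5),
}

BASES_PER_GB = 1_000_000_000
HUMAN_GENOME_GB = 3.0  # assumption: 1 Gb is about 1/3 of the human genome


def cost_per_gb(cost_per_kb):
    """Convert a cost per 1000 bases into a cost per gigabase."""
    return cost_per_kb * (BASES_PER_GB / 1000)


def days_for_human_genome(days_per_gb):
    """Days needed to read one human-genome equivalent (1x) of sequence."""
    return days_per_gb * HUMAN_GENOME_GB


for name, (cost_kb, days_gb) in PLATFORMS.items():
    print(f"{name}: ${cost_per_gb(cost_kb):,.0f} per Gb, "
          f"{days_for_human_genome(days_gb):g} days per human-genome equivalent")
```

These are raw throughput numbers for a single pass; a real project needs many-fold coverage, so the totals scale up accordingly.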

First vs Second Generation Figure 1 from Shendure & Ji, 2008

Second Generation Sequencing 454, SOLiD Solexa Figure 2 from Shendure & Ji, 2008

NGS
A typical procedure:
  Sequencing: how deep?
  Alignment: to a reference, assembly, or both
  Experiment-specific analysis
A 'one-size-fits-all' program does not exist
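The "how deep?" question is usually answered with a back-of-the-envelope coverage estimate: average depth is the total number of sequenced bases divided by the genome size. The sketch below assumes uniform sampling and ignores unmappable reads and duplicates; the function names are illustrative and do not come from any particular tool.

```python
import math


def expected_depth(num_reads, read_length, genome_size):
    """Average fold coverage, assuming reads are spread uniformly."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Example: 75-base reads on a 3 Gb genome.
print(expected_depth(40_000_000, 75, 3_000_000_000))  # depth from 40 million reads
print(reads_needed(30, 75, 3_000_000_000))            # reads needed for 30x
```

For experiments such as ChIP-Seq or RNA-Seq the relevant denominator is not the whole genome, so this estimate is only a starting point.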

Applications
De novo sequence assembly
  Whole genome assembly
  Transcriptome assembly
Short sequence alignment
  Single read
  Paired read
Genomic variation detection
  Detection of single nucleotide polymorphisms (SNPs)
Detection of alternative splicing events
  Detection of major/minor transcript isoforms

Applications RNA-Seq Table 2 from Shendure & Ji, 2008

Bioinformatics Tools Table 3 from Shendure & Ji, 2008

File Format
Sequence reads: fastq, fasta
Alignment:
  Sequence Alignment Map (SAM): http://samtools.sourceforge.net/SAM1.pdf
  BAM: the binary, compressed form of SAM, described in the same specification
  Samtools: http://samtools.sourceforge.net/
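As a rough illustration of what a SAM alignment line contains, the sketch below splits one line into the eleven mandatory tab-separated fields defined by the SAM specification and checks two of the standard FLAG bits. The `SamRecord` class and `parse_sam_line` function are hypothetical names for this example, and the sample line is made up; real work would normally go through samtools or a library built on it.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields, in file order."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Made-up example line: a read mapped to the reverse strand of chr1.
example = "read1\t16\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.is_reverse, rec.is_unmapped)
```

Optional tags after the eleventh field are ignored here.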

Data: Sequence Reads
Size of raw data: a challenge that calls for new compression algorithms

Data: Sequence Reads
Example from an Illumina sequencing read file (fastq)
Line 1: @EAS042_0001:1:1:1061:20798#0/1
Line 2: TNTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCAATGG
Line 3: +EAS042_0001:1:1:1061:20798#0/1
Line 4: aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__
Line 1: '@' followed by the sequence identifier
Line 2: raw sequence
Line 3: '+', optionally followed by the same identifier
Line 4: sequence quality scores from -5 to 62, encoded as ASCII 59 to 126
Fields of the identifier:
  EAS042_0001  the unique instrument name
  1            flowcell lane
  1            tile number within the flowcell lane
  1061         'x'-coordinate of the cluster within the tile
  20798        'y'-coordinate of the cluster within the tile
  #0           index number for a multiplexed sample (0 for no indexing)
  /1           the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
Will lossy compression work?
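The record layout described on this slide can be read with a few lines of code. The sketch below assumes the older Illumina identifier layout shown above and the quality encoding stated on the slide (score = ASCII code minus 64, so ASCII 59 to 126 covers -5 to 62); newer files use a different header format and an offset of 33. All function names are illustrative.

```python
def parse_read_id(header):
    """Split an identifier such as '@EAS042_0001:1:1:1061:20798#0/1'."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    body, _, pair = header[1:].partition("/")
    coords, _, index = body.partition("#")
    instrument, lane, tile, x, y = coords.split(":")
    return {
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "index": index,
        "pair": pair,
    }


def decode_quality(qual, offset=64):
    """Convert a quality string to integer scores (offset 64 assumed)."""
    return [ord(c) - offset for c in qual]


def read_fastq(lines):
    """Yield (header, sequence, quality) from lines grouped four at a time."""
    stripped = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header, seq, qual


# Shortened, made-up record in the same layout as the slide's example.
record = ["@EAS042_0001:1:1:1061:20798#0/1", "TNTC", "+", "aB^Y"]
for header, seq, qual in read_fastq(record):
    print(parse_read_id(header)["tile"], seq, decode_quality(qual))
```

Because each base carries one quality character, the quality line is as long as the sequence itself, which is part of why the compression question on this slide matters.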

Example of Applications
ChIP-Seq
  Assays the location and amount of binding of a protein to DNA, such as a transcription factor bound to the start site of a gene, or histones of a certain type.
RNA-Seq
  Transcriptome sequencing
  Substantial challenges exist for annotation
  Should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome

ChIP-Seq Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing Figure 1 from Mardis, 2007

ChIP-Seq
ChIP-chip: ChIP is coupled to DNA hybridization array (chip) technology
This is the closest methodology to ChIP-seq, but its mapping precision is lower, and the dynamic range of the readout is significantly less.
Comparison of ChIP-seq and ChIP-chip. Representative signals from ChIP-seq (solid line) and ChIP-chip (dashed line) show both greater dynamic range and higher resolution with ChIP-seq. Whereas three binding peaks are identified using ChIP-seq, only one broad peak is detected using ChIP-chip.
Liu et al. BMC Biology 2010 8:56 doi:10.1186/1741-7007-8-56

ChIP-Seq
Three key steps:
  Antibody selection: the most crucial step
  The actual sequencing, which is subject to several possible biases
  Algorithmic analysis, including mapping and peak calling
Short tags (around 25 to 35 bp) can be ambiguous in regions of high homology or in repeat regions
Alignment and peak calling are used to detect active binding sites
  Alignment tools: BWA, MAQ, SOAP, ...
  A large number of free and commercial peak-calling software packages: MACS, SICER, PeakSeq, SISSR, F-seq
Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009, 6:S22-S32.
Barski A, Zhao K: Genomic location analysis by ChIP-Seq. J Cell Biochem 2009, 107:11-18.
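To make the idea of peak calling concrete, here is a deliberately naive sketch: count aligned tag positions in fixed windows and report windows that are enriched over a control sample. This is not how MACS, SICER, PeakSeq or the other packages listed above work; the window size, thresholds, pseudocount and function names are all assumptions chosen for illustration, and the positions are made up.

```python
from collections import Counter


def window_counts(positions, window):
    """Count tag start positions per fixed-size window on one chromosome."""
    return Counter(p // window for p in positions)


def call_peaks(chip, control, window=200, min_tags=5, min_fold=4.0,
               pseudocount=1.0):
    """Return (start, end, tags, fold) for windows enriched over control."""
    chip_counts = window_counts(chip, window)
    ctrl_counts = window_counts(control, window)
    # Scale the control to the ChIP library size before comparing counts.
    scale = len(chip) / len(control) if control else 1.0
    peaks = []
    for w in sorted(chip_counts):
        tags = chip_counts[w]
        background = ctrl_counts.get(w, 0) * scale
        fold = (tags + pseudocount) / (background + pseudocount)
        if tags >= min_tags and fold >= min_fold:
            peaks.append((w * window, (w + 1) * window, tags, fold))
    return peaks


# Made-up positions: a cluster of 20 tags near 1000 plus scattered tags.
chip_tags = [1000 + i for i in range(20)] + [5000, 9000]
control_tags = [300, 5010, 9020, 12000]
print(call_peaks(chip_tags, control_tags))
```

Even this toy version exposes the sensitivity versus specificity trade-off: loosening `min_tags` or `min_fold` reports more windows, including more false positives.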

ChIP-Seq Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371

ChIP-Seq: Wilbanks et al. Wilbanks EG, Facciotti MT (2010) Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 5(7): e11471. doi:10.1371/journal.pone.0011471 Figure 1

ChIP-Seq: Wilbanks et al.

ChIP-Seq: Wilbanks et al. Figure 7. Positional accuracy and precision. The distance between the predicted binding site and high confidence motif occurrences within 250 bp was calculated for different peak calling programs in the (A) NRSF….

ChIP-Seq: Wilbanks et al.
Conclusion: it is a hard problem!
  A balance between sensitivity and specificity is desired when compiling the final candidate peak list
  High false-positive rates!
“We suggest that rather than focus solely on algorithmic development, equal or better gains could be made through careful consideration of experimental design and further development of sample preparations to reduce noise in the datasets.”
New methods do not always give us a clear idea about the outcome.
Biologists often do not think about the analysis in advance, and quantitative scientists often have no idea what to recommend for the experiments.
And the results of experiments are likely to be inconclusive!

RNA-Seq Transcriptome Analysis
Figure 5 | Overview of RNA-Seq. An RNA fraction of interest is selected, fragmented and reverse transcribed. The resulting cDNA can then be sequenced using any of the current ultra-high-throughput technologies to obtain ten to a hundred million reads, which are then mapped back onto the genome. The reads are then analyzed to calculate expression levels.
Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371

RNA-Seq: Strategies Figure 1 from Haas & Zody, 2010

RNA-Seq: Strategies
Alignment strategy
  Align to transcriptome: no new transcript discovery
  Align to genome and exon-exon junction sequences: extremely large search space due to all possible exon combinations
De novo assembly
Cufflinks
Scripture
Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371

RNA-Seq
Two major objectives of RNA-Seq experiments:
  Identification of novel transcripts from the locations of regions covered in the mapping.
  Estimation of the abundance of the transcripts from their depth of coverage in the mapping.

TopHat/Cufflinks
Cole Trapnell, Lior Pachter, and Steven L. Salzberg, TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009) 25(9): 1105-1111 doi:10.1093/bioinformatics/btp120
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, Vol: 28, 511–515 (2010)

TopHat/Cufflinks Trapnell et al., 2010 Trapnell et al., 2009

Scripture
Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander & Aviv Regev, Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology, Vol: 28, 503–510 (2010)

Scripture Figure 1 Figure 2 Guttman et al., 2010

RNA-Seq Software Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371

Quantitation Metric for RNA-Seq Expression
RPKM: reads per kilobase per million reads
  1. Count the number of reads which map to constitutive exon bodies. The set of constitutive exons was derived from Ensembl genes (hg18, UCSC genome browser), where an exon was defined to be constitutive if present in all transcripts for a given gene.
  2. Determine the number of uniquely mappable positions in the same set of constitutive exons. "Uniquely mappable" was defined as being a unique 32-mer in the genome and our junction database.
  3. Count the total number of uniquely mapping reads in each tissue or sample.
  4. Compute RPKM as the number of reads which map per kilobase of exon model per million mapped reads for each gene, for each tissue or sample.
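The final step of the procedure above reduces to one formula: RPKM = (reads mapped to the gene x 10^9) / (exon model length in bp x total mapped reads). A minimal sketch follows; the function name and the example numbers are made up for illustration, and the exon length is assumed to be the count of uniquely mappable positions described above.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Made-up example: two genes in a library of 10 million mapped reads.
# The longer gene collects twice the reads but has the same RPKM.
print(rpkm(500, 2000, 10_000_000))
print(rpkm(1000, 4000, 10_000_000))
```

Dividing by length and library size is what makes values comparable between genes of different sizes and between samples sequenced to different depths.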

RNA-Seq De novo assembly algorithms Post-transcriptional regulation

References
Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev Genet, 11, 31-46.
Mardis, E.R. (2008) Next-generation DNA sequencing methods. Annu Rev Genom Hum G, 9, 387-402.
Shendure, J. and Ji, H.L. (2008) Next-generation DNA sequencing. Nat Biotechnol, 26, 1135-1145.
Mardis, E.R. (2007) ChIP-seq: welcome to the new frontier. Nat Methods, 4, 613-614.
Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10, 57-63.
Haas, B.J. and Zody, M.C. (2010) Advancing RNA-Seq analysis. Nat Biotechnol, 28, 421-423.

Questions?