Introduction to Next-Generation Sequencing


1 Introduction to Next-Generation Sequencing
Kihoon Yoon, Ph.D., Dept. of Epidemiology & Biostatistics, School of Medicine, University of Texas Health Science Center at San Antonio

2 Outline
Sequencing technologies
Applications
Bioinformatics tools for short-read sequencing
Examples of applications: ChIP-Seq / RNA-Seq

3 Sequencing technologies
Next-next-...-generation: how many 'next's are there?
First generation: automated version of Sanger sequencing (DNA-sequencing method invented by Fred Sanger in the 1970s)
  Takes 500 days to read one gigabase (Gb, one billion bases; 1/3 of the human genome)
  1000 bases per read / cost is high: $0.50 per 1000 bases
Second generation
  Roche/454 sequencing machine from 454 Life Sciences (2005): 450 bases per read / $0.02 per 1000 bases / 2 days per Gb
  Solexa from Illumina (2006): 75 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
  SOLiD from Applied Biosystems (2006): 50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
Next-next-gen (third generation?)
  HiSeq 2000 from Illumina: 0.04 days per Gb
  Helicos HeliScope
  Pacific Biosciences SMRT
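The per-base figures on this slide are easier to compare once scaled to a whole genome. The sketch below converts the quoted cost per 1000 bases and days per Gb into totals for a single pass over 3 Gb (an assumption taken from the slide's "1 Gb is 1/3 of the human genome"; real projects sequence to many-fold coverage). The `Platform` class and its field names are illustrative only.

```python
from dataclasses import dataclass

# Assumption: human genome of about 3 Gb, read once (1x coverage).
GENOME_GB = 3.0


@dataclass(frozen=True)
class Platform:
    name: str
    usd_per_kb: float   # cost per 1000 bases, as quoted on the slide
    days_per_gb: float  # time to read one gigabase, as quoted on the slide

    def cost_usd(self, gigabases: float = GENOME_GB) -> float:
        # One gigabase is one million kilobases.
        return self.usd_per_kb * 1e6 * gigabases

    def days(self, gigabases: float = GENOME_GB) -> float:
        return self.days_per_gb * gigabases


PLATFORMS = [
    Platform("Sanger (automated)", 0.50, 500.0),
    Platform("Roche/454", 0.02, 2.0),
    Platform("Illumina Solexa", 0.001, 0.5),
    Platform("ABI SOLiD", 0.001, 0.5),
]

if __name__ == "__main__":
    for p in PLATFORMS:
        print(f"{p.name:20s} ${p.cost_usd():>12,.0f} {p.days():>8.1f} days")
```

Under these assumptions, automated Sanger sequencing comes to about $1.5 million and 1500 days for 3 Gb, against about $3,000 and 1.5 days for Solexa or SOLiD.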

4 First vs Second Generation
Figure 1 from Shendure & Ji, 2008

5 Second Generation Sequencing
454, SOLiD
Solexa
Figure 2 from Shendure & Ji, 2008

6 NGS
A typical procedure:
Sequencing: how deep?
Alignment: to references, assembly, or both
Experiment-specific analysis: a 'one-size-fits-all' program does not exist

7 Applications
De novo sequence assembly
  Whole genome assembly
  Transcriptome assembly
Short sequence alignment
  Single read
  Paired read
Genomic variation detection
  Detection of single nucleotide polymorphisms (SNPs)
Detection of alternative splicing events
Detection of major/minor transcript isoforms

8 Applications RNA-Seq Table 2 from Shendure & Ji, 2008

9 Bioinformatics Tools Table 3 from Shendure & Ji, 2008

10 File Format
Sequence reads: fastq, fasta
Alignment: Sequence Alignment Map (SAM), BAM
Samtools

11 Data: Sequence Reads
Size of raw data
A challenge that calls for a new compression algorithm

12 Data: Sequence Reads
Example from an Illumina sequencing read file (fastq)
Line 1: @EAS042_0001:1:1:1061:20798#0/1
Line 2: TNTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCAATGG
Line 3: +EAS042_0001:1:1:1061:20798#0/1
Line 4: aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__
Line 1: '@' followed by the read identifier (fields listed below)
Line 2: raw sequence
Line 3: '+', optionally followed by the same identifier
Line 4: sequence quality scores from -5 to 62, encoded as ASCII 59 to 126
Identifier fields:
EAS042_0001: the unique instrument name
1: flowcell lane
1: tile number within the flowcell lane
1061: 'x'-coordinate of the cluster within the tile
20798: 'y'-coordinate of the cluster within the tile
#0: index number for a multiplexed sample (0 for no indexing)
/1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
Will lossy compression work?
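The record layout on this slide can be made concrete with a small parser. This is a minimal sketch, assuming the older Illumina identifier layout shown on the slide and a quality offset of 64 (which is what "ASCII 59 to 126 for scores -5 to 62" implies; other fastq variants use offset 33). The function names and the `RECORD` constant are illustrative, not part of any library.

```python
import re
from typing import Dict, List, Tuple

# Assumption: offset 64, so that ASCII 59..126 decodes to -5..62 as on the slide.
QUALITY_OFFSET = 64

# instrument:lane:tile:x:y#index/member
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_header(line: str) -> Dict[str, str]:
    """Split the '@' line into its named fields (all kept as strings)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized read identifier: {line!r}")
    return match.groupdict()


def decode_quality(quality: str, offset: int = QUALITY_OFFSET) -> List[int]:
    """Convert each quality character to its integer score."""
    return [ord(ch) - offset for ch in quality]


def parse_record(lines: List[str]) -> Tuple[Dict[str, str], str, List[int]]:
    """Parse one four-line fastq record into (fields, sequence, scores)."""
    header, sequence, plus, quality = (ln.rstrip("\n") for ln in lines)
    if not plus.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return parse_header(header), sequence, decode_quality(quality)


# The record from the slide; the raw string keeps backslashes literal.
RECORD = [
    "@EAS042_0001:1:1:1061:20798#0/1",
    "TNTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCAATGG",
    "+EAS042_0001:1:1:1061:20798#0/1",
    r"aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__",
]

if __name__ == "__main__":
    fields, seq, scores = parse_record(RECORD)
    print(fields)
    print(len(seq), min(scores), max(scores))
```

Since every base carries its own quality character, the quality line is as long as the sequence line, which is why the slide's question about lossy compression mostly concerns the quality scores.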

13 Examples of Applications
ChIP-Seq
  Assays the location and amount of a protein's binding to DNA, such as a transcription factor bound to the start site of a gene, or histones of a certain type.
RNA-Seq
  Transcriptome sequencing
  Substantial challenges exist for annotation
  Should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome

14 ChIP-Seq Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing Figure 1 from Mardis, 2007

15 ChIP-Seq
ChIP-chip: ChIP coupled to DNA hybridization array (chip) technology
This is the closest methodology to ChIP-seq, but its mapping precision is lower and the dynamic range of its readout is significantly narrower.
Figure: Comparison of ChIP-seq and ChIP-chip. Representative signals from ChIP-seq (solid line) and ChIP-chip (dashed line) show both greater dynamic range and higher resolution with ChIP-seq. Whereas three binding peaks are identified using ChIP-seq, only one broad peak is detected using ChIP-chip.
Liu et al. BMC Biology :56 doi: /

16 ChIP-Seq
Three key steps:
  Antibody selection (the most crucial)
  The actual sequencing, which is subject to several possible biases
  Algorithmic analysis, including mapping and peak-calling
Short tags (around 25 to 35 bp) can be ambiguous in regions of high homology or in repeat regions
Alignment and peak-calling to detect active binding sites
  Alignment tools: BWA, MAQ, SOAP, ...
  A large number of free and commercial peak-calling software packages: MACS, SICER, PeakSeq, SISSR, F-seq
Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009, 6:S22-S32.
Barski A, Zhao K: Genomic location analysis by ChIP-Seq. J Cell Biochem 2009, 107:11-18.
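The idea behind peak-calling can be illustrated with a toy example. The sketch below is not how MACS, SICER, PeakSeq, SISSR or F-seq work (they model background, strand shift and significance); it only counts aligned tag start positions in fixed-width bins on one chromosome and reports runs of adjacent bins that reach a threshold. The function name, `window` and `min_tags` are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def call_peaks(
    tag_starts: Iterable[int], window: int = 200, min_tags: int = 5
) -> List[Tuple[int, int, int]]:
    """Return (start, end, tag_count) for merged runs of enriched bins.

    A bin is 'enriched' when it holds at least `min_tags` tag starts;
    enriched bins that touch each other are merged into one peak.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    counts = Counter(pos // window for pos in tag_starts)
    enriched = sorted(b for b, n in counts.items() if n >= min_tags)

    peaks: List[Tuple[int, int, int]] = []
    for b in enriched:
        start, end, n = b * window, (b + 1) * window, counts[b]
        if peaks and peaks[-1][1] == start:
            # Adjacent to the previous peak: extend it and add the tags.
            prev_start, _, prev_n = peaks[-1]
            peaks[-1] = (prev_start, end, prev_n + n)
        else:
            peaks.append((start, end, n))
    return peaks


if __name__ == "__main__":
    tags = [100, 110, 120, 130, 150, 210, 220, 230, 240, 250, 260, 1000, 5000, 5001]
    print(call_peaks(tags))
```

A hard count threshold like this has no notion of background, which is one reason real peak callers disagree with each other and why false positives are a recurring concern.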

17 ChIP-Seq Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi: /nmeth.1371

18 ChIP-Seq: Wilbanks et al.
Wilbanks EG, Facciotti MT (2010) Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 5(7): e doi: /journal.pone Figure 1

19 ChIP-Seq: Wilbanks et al.

20 ChIP-Seq: Wilbanks et al.
Figure 7. Positional accuracy and precision. The distance between the predicted binding site and high confidence motif occurrences within 250 bp was calculated for different peak calling programs in the (A) NRSF….

21 ChIP-Seq: Wilbanks et al.
Conclusion: It is a hard problem!
A balance between sensitivity and specificity is desired when compiling the final candidate peak list
High false positives!
“We suggest that rather than focus solely on algorithmic development, equal or better gains could be made through careful consideration of experimental design and further development of sample preparations to reduce noise in the datasets.”
New methods do not always give us a clear idea of the outcome.
Biologists do not think about the analysis in advance, and quantitative scientists have no idea what to recommend for the experiments.
And the results of the experiments are likely to be inconclusive!

22 RNA-Seq Transcriptome Analysis
Figure 5 | Overview of RNA-Seq. An RNA fraction of interest is selected, fragmented and reverse transcribed. The resulting cDNA can then be sequenced using any of the current ultra-high-throughput technologies to obtain ten to a hundred million reads, which are then mapped back onto the genome. The reads are then analyzed to calculate expression levels.
Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi: /nmeth.1371

23 RNA-Seq: Strategies Figure 1 from Haas & Zody, 2010

24 RNA-Seq: Strategies
Alignment strategy
  Align to transcriptome: no new transcript discovery
  Align to genome and exon-exon junction sequences: extremely large search space due to all possible exon combinations
De novo assembly
  Cufflinks
  Scripture
Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi: /nmeth.1371
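To see why the junction search space grows so quickly, consider building a junction library for a single gene. This is a minimal sketch, assuming exons are given as plain sequences in genomic order and that any upstream exon may splice to any downstream exon; `junction_library` and `flank` are hypothetical names, and real tools restrict the candidates using annotation or read evidence.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def junction_library(exons: List[str], flank: int) -> Dict[Tuple[int, int], str]:
    """Build one sequence per candidate exon-exon junction.

    For every ordered pair i < j, join the last `flank` bases of exon i
    to the first `flank` bases of exon j.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    return {
        (i, j): exons[i][-flank:] + exons[j][:flank]
        for i, j in combinations(range(len(exons)), 2)
    }


if __name__ == "__main__":
    exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
    for pair, seq in junction_library(exons, flank=4).items():
        print(pair, seq)
```

A gene with n exons yields n(n-1)/2 candidate junctions under this assumption, so 10 exons already give 45 sequences, and the total across a genome is large.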

25 RNA-Seq
Two major objectives of RNA-Seq experiments:
  Identification of novel transcripts from the locations of regions covered in the mapping.
  Estimation of the abundance of the transcripts from their depth of coverage in the mapping.

26 TopHat/Cufflinks
Cole Trapnell, Lior Pachter, and Steven L. Salzberg, TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009) 25(9): doi: /bioinformatics/btp120
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, Vol: 28, 511–515 (2010)

27 TopHat/Cufflinks Trapnell et al., 2010 Trapnell et al., 2009

28 Scripture
Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander & Aviv Regev, Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology, Vol: 28, 503–510 (2010)

29 Scripture Figure 1 Figure 2 Guttman et al., 2010

30 RNA-Seq Software Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi: /nmeth.1371

31 Quantitation Metric for RNA-Seq Expression
RPKM: reads per kilobase per million reads
Count the number of reads which map to constitutive exon bodies. The set of constitutive exons was derived from Ensembl genes (hg18, UCSC genome browser), where an exon was defined to be constitutive if present in all transcripts for a given gene.
Determine the number of uniquely mappable positions in the same set of constitutive exons. "Uniquely mappable" was defined as being a unique 32-mer in the genome and our junction database.
Count the total number of uniquely mapping reads in each tissue or sample.
Compute RPKM as the number of reads which map per kilobase of exon model per million mapped reads for each gene, for each tissue or sample.
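The final RPKM step is a simple normalization once the three counts are in hand. The sketch below assumes those counts (reads in constitutive exons, uniquely mappable positions in those exons, and total uniquely mapped reads in the sample) have already been computed upstream; the function and parameter names are illustrative.

```python
def rpkm(exon_reads: int, mappable_bases: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    exon_reads: reads mapping to the gene's constitutive exons
    mappable_bases: uniquely mappable positions in those exons
    total_mapped_reads: uniquely mapping reads in the whole sample
    """
    if mappable_bases <= 0 or total_mapped_reads <= 0:
        raise ValueError("mappable_bases and total_mapped_reads must be positive")
    kilobases = mappable_bases / 1e3
    millions_of_reads = total_mapped_reads / 1e6
    return exon_reads / kilobases / millions_of_reads


if __name__ == "__main__":
    # 500 reads over 2 kb of mappable exon in a sample of 10 million mapped reads.
    print(rpkm(500, 2_000, 10_000_000))
```

Dividing by exon length lets genes of different sizes be compared within a sample, and dividing by the total read count lets the same gene be compared across samples sequenced to different depths. In the example, 500 reads over 2 kb in a 10-million-read sample gives an RPKM of 25.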

32 RNA-Seq
De novo assembly algorithms
Post-transcriptional regulation

33 References
Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev Genet, 11,
Mardis, E.R. (2008) Next-generation DNA sequencing methods. Annu Rev Genom Hum G, 9,
Shendure, J. and Ji, H.L. (2008) Next-generation DNA sequencing. Nat Biotechnol, 26,
Mardis, E.R. (2007) ChIP-seq: welcome to the new frontier. Nat Methods, 4,
Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10,
Haas, B.J. and Zody, M.C. (2010) Advancing RNA-Seq analysis. Nature Biotechnology 28, 421–423.

34 Question?

