High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.


1 High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

2 First Generation: Sanger sequencing. Sequencing and detection are two separate steps. Throughput: 384 * 1 kb / 3 hours.

3 Second Generation: massively parallel sequencing by synthesis. Many different technologies: Illumina, 454, SOLiD, Helicos, etc. Illumina: HiSeq, MiSeq, NextSeq; 1-16 samples, 25M-4B reads, 30-300 bp, 1-8 days, 15 GB-1 TB output. Moving targets.

4 Illumina Cluster Generation: Amplify the fragments to be sequenced in place on the flow cell. Can sequence from both the pink and purple adapters (paired-end sequencing). Can multiplex many samples per lane.

5 Illumina Sequencing

6 Third Generation: single-molecule sequencing, no amplification. Fewer but much longer reads. Good for applications that need long reads, but not for read-count applications; the technology is still in development. http://www.youtube.com/watch?v=v8p4ph2MAvI https://www.nanoporetech.com/news/movies#movie-28-minion

7 High Throughput Sequencing: big (data), fast (speed), cheap (cost), flexible (applications). Bioinformatic analysis becomes the bottleneck.

8 High Throughput Sequencing Data Analysis

9 FASTQ File Format
–Sequence ID, sequence
–Quality ID, quality score
Quality score using ASCII (higher -> better)
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
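The four-line record layout can be read with a few lines of Python. This is a minimal sketch, not a production parser: the function names are hypothetical, and the quality offset of 64 is an assumption (the characters in the example, B through [, are consistent with the older Illumina Phred+64 encoding; current files normally use offset 33).

```python
from typing import Iterable, Iterator, List, Tuple


def decode_quality(qual: str, offset: int = 64) -> List[int]:
    """Convert ASCII quality characters to integer scores (higher is better).

    The offset is an assumption: 64 for older Illumina files, 33 for Sanger-style files.
    """
    return [ord(ch) - offset for ch in qual]


def parse_fastq(lines: Iterable[str], offset: int = 64) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, decode_quality(qual, offset)


if __name__ == "__main__":
    example = """@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
"""
    for read_id, seq, scores in parse_fastq(example.splitlines()):
        print(read_id, len(seq), min(scores), max(scores))
```

Under the offset-64 assumption, the run of B characters at the end of each read decodes to a score of 2, which is the low-quality tail visible in the example.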

10 FASTQC: Sequencing Quality

11 Read Mapping: Mapping hundreds of millions of reads back to the reference genome is CPU- and RAM-intensive and slow. Read quality decreases along the length of the read (leading to single-nucleotide mismatches or small indels). Most mappers allow ~2 mismatches within the first 30 bp (the 4^28 possible sequences at the remaining 28 positions can still uniquely identify most 30 bp sequences in a 3 Gb genome); mapping is slower when indels are allowed. Mapping output: SAM (BAM) or BED.
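The 4^28 argument can be checked with simple arithmetic. The sketch below assumes a uniformly random genome, which real genomes are not (repeats make many positions ambiguous), so it only shows the order of magnitude behind the claim. The constant names are illustrative.

```python
GENOME_SIZE = 3_000_000_000  # ~3 Gb genome, as on the slide
SEED_LENGTH = 30             # first 30 bp of the read
MAX_MISMATCHES = 2           # mismatches tolerated within the seed

# Positions that must still match exactly after allowing the mismatches.
exact_positions = SEED_LENGTH - MAX_MISMATCHES

# Number of distinct DNA sequences of that length (4 bases per position).
sequence_space = 4 ** exact_positions

# How many times larger the sequence space is than the number of genome positions.
ratio = sequence_space / GENOME_SIZE

print(f"4^{exact_positions} = {sequence_space:.3e}")
print(f"sequence space / genome size = {ratio:.3e}")
```

Because the sequence space is tens of millions of times larger than the number of genomic positions, a random 28-base pattern is very unlikely to occur twice by chance. Repeats are the main reason reads still map to multiple locations.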

12 Spaced seed alignment: Tags and tag-sized pieces of the reference are cut into small "seeds." Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each "hit," confirm the remaining positions. Report results to the user.
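The steps above can be sketched as a toy index. This is an illustration under stated assumptions, not the implementation of any particular mapper: each tag is cut into four seeds, every pair of seeds is indexed, and at most two mismatches are allowed, so at least two seeds of a true hit are error-free and one seed pair will find it. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_spans(length, n_seeds=4):
    """Split positions 0..length into n_seeds consecutive (start, end) spans."""
    size = length // n_seeds
    return [(i * size, (i + 1) * size if i < n_seeds - 1 else length)
            for i in range(n_seeds)]


def build_index(ref, tag_len, n_seeds=4):
    """Index every pair of seeds for every tag-sized window of the reference."""
    spans = seed_spans(tag_len, n_seeds)
    index = defaultdict(list)
    for pos in range(len(ref) - tag_len + 1):
        window = ref[pos:pos + tag_len]
        for a, b in combinations(range(n_seeds), 2):
            key = (a, b, window[spans[a][0]:spans[a][1]], window[spans[b][0]:spans[b][1]])
            index[key].append(pos)
    return index


def lookup(tag, ref, index, max_mismatches=2, n_seeds=4):
    """Look up each seed pair of the tag, then confirm the remaining positions."""
    spans = seed_spans(len(tag), n_seeds)
    hits = set()
    for a, b in combinations(range(n_seeds), 2):
        key = (a, b, tag[spans[a][0]:spans[a][1]], tag[spans[b][0]:spans[b][1]])
        for pos in index.get(key, ()):
            window = ref[pos:pos + len(tag)]
            mismatches = sum(x != y for x, y in zip(tag, window))
            if mismatches <= max_mismatches:
                hits.add(pos)
    return sorted(hits)


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    index = build_index(ref, tag_len=16)
    tag = "TTGCCGATAGGCTGAA"  # ref[8:24] with two substitutions
    print(lookup(tag, ref, index))
```

Storing all six seed pairs multiplies the index size, which is the memory cost that spaced-seed mappers pay for mismatch tolerance.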

13 Burrows-Wheeler: Store the entire reference genome. Align the tag base by base from the end. When the tag is traversed, all active locations are reported. If no match is found, back up and try a substitution. Trapnell & Salzberg, Nat Biotech 2009

14 Burrows-Wheeler Transform: Reversible permutation originally used in compression. Once BWT(T) is built, all else shown here is discarded (the matrix is shown for illustration only). [Figure: T, its Burrows-Wheeler matrix, and the last column BWT(T); encoding for compression: gc$ac, 1111001] Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA; 1994. Slides from Ben Langmead.
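A small sketch of the transform: build the matrix of sorted rotations and read off its last column. The text acaacg is an assumption chosen for illustration; its run-length strings reproduce the gc$ac and 1111001 strings that survive from the slide. Sorting all rotations is only practical for toy inputs. Real indexes are built from suffix arrays.

```python
def bw_matrix(text):
    """Return the sorted rotations of text + '$' (the Burrows-Wheeler matrix).

    Assumes '$' does not occur in text and sorts before every other character.
    """
    t = text + "$"
    return sorted(t[i:] + t[:i] for i in range(len(t)))


def bwt(text):
    """Last column of the Burrows-Wheeler matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


def run_starts(s):
    """Describe s by its run characters and a bit string marking where each run starts."""
    chars, bits = [], []
    previous = None
    for ch in s:
        if ch != previous:
            chars.append(ch)
            bits.append("1")
        else:
            bits.append("0")
        previous = ch
    return "".join(chars), "".join(bits)


if __name__ == "__main__":
    for row in bw_matrix("acaacg"):
        print(row)
    transformed = bwt("acaacg")
    print(transformed, run_starts(transformed))
```

The transform tends to group identical characters into runs, which is why it was useful for compression before it was used for read mapping.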

15 Burrows-Wheeler Transform: The property that makes BWT(T) reversible is "LF Mapping": the i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence of that character in the First column. [Figure: T, BWT(T), Burrows-Wheeler matrix; Rank: 2] Slides from Ben Langmead.

16 Burrows-Wheeler Transform: To recreate T from BWT(T), repeatedly apply the rule T = BWT[ LF(i) ] + T; i = LF(i), where LF(i) maps row i to the row whose first character corresponds to i’s last character per the LF Mapping. [Figure: final T] Slides from Ben Langmead.
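The reconstruction rule can be sketched as follows. The loop reads BWT[i] before stepping to LF(i), a slight reordering of the rule as written on the slide that yields the same walk through the text. The input gc$aaac is the BWT of the assumed example text acaacg$, and the function names are hypothetical.

```python
def lf_mapping(bwt):
    """For each row i, return the row whose first character is the same text
    occurrence as row i's last character."""
    counts = {}
    ranks = []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))   # rank of this occurrence within the Last column
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[ch]: first row of the First column that starts with ch
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    return [first_row[ch] + rank for ch, rank in zip(bwt, ranks)]


def inverse_bwt(bwt):
    """Rebuild T (ending in '$') by walking the LF mapping from row 0."""
    lf = lf_mapping(bwt)
    i = 0            # row 0 is the rotation that starts with '$'
    text = "$"
    while bwt[i] != "$":
        text = bwt[i] + text   # prepend the character that precedes the current suffix
        i = lf[i]
    return text


if __name__ == "__main__":
    print(lf_mapping("gc$aaac"))
    print(inverse_bwt("gc$aaac"))
```

Only the last column is needed to run this loop, which is why the rest of the matrix can be discarded once BWT(T) is built.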

17 Exact Matching with FM Index: To match Q in T using BWT(T), repeatedly apply the rule top = LF(top, qc); bot = LF(bot, qc), where qc is the next character in Q (right to left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc. Slides from Ben Langmead.

18 Exact Matching with FM Index: In progressive rounds, top and bot delimit the range of rows beginning with progressively longer suffixes of Q (from right to left). If the range becomes empty, the query suffix (and therefore the query) does not occur in the text. If there is no match, instead of giving up, "backtrack" to a previous position and try a different base (a mismatch; much slower). Slides from Ben Langmead.
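The top/bot update can be sketched as a backward search that counts exact occurrences. This is a minimal illustration: it stores a full occurrence table, which real FM indexes sample to save memory, and it omits backtracking for mismatches. The names are hypothetical, and gc$aaac is the BWT of the assumed example text acaacg$.

```python
def build_fm_tables(bwt):
    """Return (first_row, occ) for a BWT string.

    first_row[ch]: first row of the First column starting with ch.
    occ[ch][i]:    number of occurrences of ch in bwt[:i].
    """
    alphabet = sorted(set(bwt))
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, current in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == current else 0)
    first_row = {}
    total = 0
    for ch in alphabet:
        first_row[ch] = total
        total += occ[ch][len(bwt)]
    return first_row, occ


def count_exact_matches(query, bwt):
    """Count occurrences of query in T by narrowing the row range [top, bot)."""
    first_row, occ = build_fm_tables(bwt)
    top, bot = 0, len(bwt)
    for qc in reversed(query):          # consume the query right to left
        if qc not in first_row:
            return 0
        top = first_row[qc] + occ[qc][top]
        bot = first_row[qc] + occ[qc][bot]
        if top >= bot:                  # empty range: this suffix of the query is absent
            return 0
    return bot - top


if __name__ == "__main__":
    for q in ("ac", "aac", "gg"):
        print(q, count_exact_matches(q, "gc$aaac"))
```

Each query character costs two table lookups regardless of genome size, so exact matching time depends on the read length rather than on the reference length.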

19 STAR Alignment: Suffix tree. Very fast and accurate for mapping PE-seq and high read counts. O(n) time to build; O(m log n) time to search.

20 Suffix tree (example): Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$: { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$]
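The example can be sketched with an uncompressed suffix trie built from nested dictionaries. Note the swap: this is a plain trie, not the compressed suffix tree of the slide, so it takes quadratic space and does not achieve the O(n) construction bound. It does show why a substring query only has to walk the pattern from the root. The function names are hypothetical.

```python
def suffixes(s):
    """All suffixes of s + '$', shortest first."""
    text = s + "$"
    return [text[i:] for i in range(len(text) - 1, -1, -1)]


def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a trie of nested dicts."""
    root = {}
    for suffix in suffixes(s):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern is a substring of the indexed text: follow it from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    print(suffixes("abab"))
    trie = build_suffix_trie("abab")
    print(contains(trie, "bab"), contains(trie, "aa"))
```

A compressed suffix tree merges each chain of single-child nodes into one edge, which brings the node count down to a linear number.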

21 Mapped Seq Files
Mapped SAM
–Map: 0 OK, 4 unmapped, 16 mapped reverse strand
–Sequence, quality score
–XA (mapper-specific)
–MD: mismatch info: 3 match, then C ref, 30 match, then T ref, 3 match
–NM: number of mismatches
BAM: binary SAM format
Mapped BED
–Chr, start, end, strand
SAM examples:
HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0
BED examples:
chr1 123450 123500 +
chr5 28374615 28374615 -
http://samtools.github.io/hts-specs/SAMv1.pdf
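The fields listed above can be pulled out of a SAM line with a short parser. This is a sketch for illustration only: it assumes tab-delimited fields as in the SAM specification, handles only the flag bits named on the slide (4 and 16), and uses hypothetical function names. For real data, a library such as pysam or the samtools command line is the usual choice.

```python
import re


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:
        name, value_type, value = item.split(":", 2)   # e.g. NM:i:2
        tags[name] = int(value) if value_type == "i" else value
    return {
        "read": fields[0],
        "unmapped": bool(flag & 4),    # flag bit 4: read is unmapped
        "reverse": bool(flag & 16),    # flag bit 16: mapped to the reverse strand
        "chrom": fields[2],
        "pos": int(fields[3]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
    }


def parse_md(md):
    """Split an MD string such as 3C30T3 into match lengths and reference bases."""
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)
    return [int(token) if token.isdigit() else token for token in tokens]


if __name__ == "__main__":
    line = "\t".join([
        "HWUSI-EAS366_0112:6:1:1298:18828#0/1", "16", "chr9", "98116600", "255",
        "38M", "*", "0", "0",
        "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG",
        r"Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^",
        "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
    ])
    record = parse_sam_line(line)
    print(record["chrom"], record["pos"], record["reverse"], parse_md(record["tags"]["MD"]))
```

For the first example record, the MD string splits into 3 matches, reference base C, 30 matches, reference base T, and 3 matches, which agrees with NM:i:2.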

22 Mapping Statistics Terms
Mappable locations: reads that can find a match to a location in the genome
Uniquely mapped reads: reads that can find a match to a SINGLE location in the genome
–Repeat sequences in the genome, length-dependent
Uniquely mapped locations: number of unique locations hit by uniquely mapped reads
–Redundancy: potential PCR amplification bias
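These terms can be made concrete by counting over a list of alignments. The sketch below is illustrative: the input tuple layout and function name are hypothetical, and the redundancy formula (one minus unique locations divided by uniquely mapped reads) is an assumption, since the slide names the concept but gives no formula.

```python
from collections import Counter


def mapping_stats(alignments):
    """Summarize alignments given as (read_id, chrom, start, strand, n_hits).

    n_hits is the number of genomic locations the read matched (0 = unmapped).
    """
    total = 0
    mappable = 0
    locations = Counter()
    for read_id, chrom, start, strand, n_hits in alignments:
        total += 1
        if n_hits >= 1:
            mappable += 1                         # matched at least one location
        if n_hits == 1:
            locations[(chrom, start, strand)] += 1   # matched exactly one location
    uniquely_mapped_reads = sum(locations.values())
    uniquely_mapped_locations = len(locations)
    # Assumed definition: fraction of uniquely mapped reads that duplicate a location.
    redundancy = (1 - uniquely_mapped_locations / uniquely_mapped_reads
                  if uniquely_mapped_reads else 0.0)
    return {
        "total_reads": total,
        "mappable_reads": mappable,
        "uniquely_mapped_reads": uniquely_mapped_reads,
        "uniquely_mapped_locations": uniquely_mapped_locations,
        "redundancy": redundancy,
    }


if __name__ == "__main__":
    reads = [
        ("r1", "chr1", 100, "+", 1),
        ("r2", "chr1", 100, "+", 1),   # same location as r1: possible PCR duplicate
        ("r3", "chr2", 50, "-", 1),
        ("r4", "chr3", 10, "+", 3),    # maps to a repeat: not uniquely mapped
        ("r5", None, 0, None, 0),      # unmapped
    ]
    print(mapping_stats(reads))
```

A high redundancy value means many uniquely mapped reads pile up at the same start position and strand, which is the pattern expected from PCR amplification bias.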

23 Summary
Sequencing technologies
–1st, 2nd, 3rd generation
Sequence quality assessment
–FASTQC
Read mapping
–Spaced seed
–BWA: Burrows-Wheeler transform, LF mapping
–STAR: Suffix tree, fast
SAM / BAM format

