High Throughput Sequencing

Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520
Guest lecturer: Wei Li

About me
Wei Li, research fellow at DFCI
Studied high-throughput sequencing algorithms shortly after HTS came out (2009)
Transcript reconstruction algorithms from high-throughput RNA sequencing data (RNA-seq): IsoInfer/IsoLasso/CEM
CRISPR/Cas9 screening algorithms: design, analysis (MAGeCK/MAGeCK-VISPR)

Why high-throughput sequencing?
Also called high-throughput sequencing (HTS) or next-generation sequencing (NGS)
2-3 orders of magnitude faster, cheaper, and higher in data throughput than “first generation” sequencing
Huge range of applications in academia and industry

First generation: Sanger sequencing
Frederick Sanger: the 4th person overall to win two Nobel Prizes (both in Chemistry, 1958 and 1980)

First Generation Sanger Sequencing: 384 * 1kb / 3 hours

Sanger sequencing materials
Sanger sequencing uses DNA elongation to “read” sequences
dNTPs: required for the normal elongation process
ddNTPs: lack the 3'-hydroxyl (OH) group, so no further nucleotide can be attached and synthesis stops
dideoxyNTP: di = two, deoxy = oxygen removed

Sanger sequencing setup
4 test tubes, each containing all four dNTPs (dATP, dGTP, dCTP, dTTP)
In addition, each tube also contains ONE of the 4 ddNTPs

What happens if you have both dATP and ddATP?
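Synthesis can stop at any position where the template has a “T”: whenever ddATP is incorporated there instead of dATP, elongation terminates. The tube ends up with a mix of fragments, one length for each “T” in the template.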
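The chain-termination idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the template string is hypothetical, the new strand is modeled as the base-by-base complement of the template, and every possible termination point is enumerated rather than sampled as in a real reaction.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def terminated_fragments(template, dd_base):
    """Fragments produced in the tube that contains dd<dd_base>TP.

    Simplification: the synthesized strand is the complement of the
    template, read base by base. A fragment can end at every position
    where dd_base would be incorporated.
    """
    synthesized = "".join(COMPLEMENT[b] for b in template)
    return [synthesized[: i + 1] for i, b in enumerate(synthesized) if b == dd_base]

def read_sequence(template):
    """Combine the fragment lengths of all four tubes, shortest first."""
    base_at_length = {}
    for dd_base in "ACGT":
        for fragment in terminated_fragments(template, dd_base):
            base_at_length[len(fragment)] = dd_base
    return "".join(base_at_length[n] for n in sorted(base_at_length))

template = "TGCATTCA"  # hypothetical template, for illustration only
a_tube_lengths = [len(f) for f in terminated_fragments(template, "A")]
called = read_sequence(template)
print(a_tube_lengths, called)
```

In the ddATP tube the fragment lengths mark every “T” in the template. Sorting the fragments from all four tubes by length, as a gel does, spells out the synthesized strand.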

Sequencing in 2001

Sequencing in 2007

Second Generation: massively parallel sequencing by synthesis
Many different technologies: Illumina, 454, SOLiD, Helicos, etc.
Illumina: HiSeq, MiSeq, NextSeq
1-16 samples, 25M-4B reads, 30-300bp, 1-8 days, 15GB-1TB output
Moving targets

Illumina Cluster Generation
Amplify sequenced fragments in place on the flow cell Can sequence from both the pink and purple adapters (Paired-end seq) Can multiplex many samples / lane

Illumina Sequencing process
1. Incorporate all 4 nucleotides, each labeled with a different dye
2. Wash, 4-color imaging
3. Cleave dye and terminating groups, wash
4. Repeat cycles

Illumina Sequencing (figure: cycles 1 to 6)

Third Generation: single-molecule sequencing, no amplification
Fewer but much longer reads
Good for sequencing long reads, but not for read-count applications; technology still in development

High Throughput Sequencing
Big (data), fast (speed), cheap (cost), flexible (applications)
Cost is falling faster than Moore’s law: bioinformatic analysis becomes the bottleneck!

High Throughput Sequencing Data Analysis

FASTQ File Format
Each record has four lines: sequence ID, sequence, quality ID, quality score
Quality score using ASCII (higher -> better)
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
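A minimal sketch of reading such records and turning the ASCII quality string into numbers. The offset of 64 is an assumption: it matches older Illumina encodings, which the quality characters in this example suggest, while current files normally use an offset of 33. The function names are illustrative, not from any library.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # the '+' line repeats the ID and is skipped
        quality = handle.readline().strip()
        yield header[1:], sequence, quality

def quality_scores(quality, offset=64):
    """Convert ASCII characters to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]

FASTQ_TEXT = """@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
"""

records = list(parse_fastq(io.StringIO(FASTQ_TEXT)))
read_id, sequence, quality = records[0]
scores = quality_scores(quality)
print(read_id, len(sequence), scores[:5], scores[-5:])
```

The trailing run of “B” characters decodes to the lowest scores in the read, which fits the usual pattern of quality dropping toward the end of a read.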

FASTQC: Sequencing Quality
Good quality! Poor quality!

Read Mapping
Mapping hundreds of millions of reads back to the reference genome is CPU- and RAM-intensive, and slow
Read quality decreases with length, so mappers must tolerate small differences (single-nucleotide mismatches or indels)
Most mappers allow ~2 mismatches within the first 30bp (the 4^28 possible sequences of the remaining 28 matched bases can still uniquely identify most 30bp sequences in a 3Gb genome); mapping is slower when allowing indels
Mapping output: SAM (BAM) or BED
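The 4^28 argument is easy to check with a little arithmetic. The genome size below is the round 3 billion bp figure from the slide, not an exact value.

```python
GENOME_SIZE = 3_000_000_000  # approximate genome length in bp (round figure)
READ_LENGTH = 30
MAX_MISMATCHES = 2

# With 2 of 30 positions allowed to mismatch, 28 positions still have to agree.
matched_positions = READ_LENGTH - MAX_MISMATCHES
distinct_sequences = 4 ** matched_positions

# How many distinct 28-mers exist for every position in the genome.
sequences_per_position = distinct_sequences / GENOME_SIZE

print(f"4^{matched_positions} = {distinct_sequences:.2e}")
print(f"about {sequences_per_position:.1e} possible sequences per genome position")
```

There are vastly more possible 28-mers than genome positions, which is why a random 28-base match is unlikely to occur by chance. Repeats in real genomes are the main exception.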

Read mapping algorithms
Spaced seed alignment
Burrows-Wheeler
Suffix tree

Spaced seed alignment
Tags and tag-sized pieces of the reference are cut into small “seeds.”
Pairs of spaced seeds are stored in an index.
Look up spaced seeds for each tag.
For each “hit,” confirm the remaining positions.
Report results to the user.
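The steps above can be sketched as follows. This is a toy version under stated assumptions: each tag is cut into 4 equal seeds, so a tag with at most 2 mismatches must still have at least 2 exact seeds, and every pair of seeds is used as an index key. The reference string and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cut_seeds(window, n_seeds=4):
    """Cut a tag-sized window into n_seeds equal pieces."""
    k = len(window) // n_seeds
    return [window[i * k:(i + 1) * k] for i in range(n_seeds)]

def build_index(reference, tag_length, n_seeds=4):
    """Index every tag-sized window of the reference by each pair of its seeds."""
    index = defaultdict(set)
    for pos in range(len(reference) - tag_length + 1):
        parts = cut_seeds(reference[pos:pos + tag_length], n_seeds)
        for a, b in combinations(range(n_seeds), 2):
            index[(a, b, parts[a], parts[b])].add(pos)
    return index

def hamming(x, y):
    return sum(c1 != c2 for c1, c2 in zip(x, y))

def map_tag(tag, reference, index, max_mismatches=2, n_seeds=4):
    """Look up seed pairs, then confirm the remaining positions of each hit."""
    parts = cut_seeds(tag, n_seeds)
    candidates = set()
    for a, b in combinations(range(n_seeds), 2):
        candidates |= index.get((a, b, parts[a], parts[b]), set())
    return sorted(p for p in candidates
                  if hamming(tag, reference[p:p + len(tag)]) <= max_mismatches)

reference = "ACGTTGCAAGGCTTAACCGGATCGATTACAGGCT"  # hypothetical reference
tag = list(reference[8:24])
tag[1], tag[9] = "T", "A"  # introduce two mismatches
tag = "".join(tag)

index = build_index(reference, tag_length=len(tag))
hits = map_tag(tag, reference, index)
print(hits)
```

The two mismatches fall in the first and third seeds, so the pair made of the second and fourth seeds still matches exactly and leads to the true position.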

BW alignment

Burrows-Wheeler
Store the entire reference genome.
Align tag base by base from the end.
When the tag is traversed, all active locations are reported.
If no match is found, then back up and try a substitution.
Trapnell & Salzberg, Nat Biotech 2009

Burrows-Wheeler Transform
Reversible permutation, used originally in compression
Once BWT(T) is built, all else shown here is discarded
The first column can be derived by sorting the last column
[Figure: T (query sequence), its Burrows Wheeler Matrix, the last column BWT(T), and its encoding for compression]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
Slides from Ben Langmead
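A small sketch of building the Burrows Wheeler Matrix and reading off BWT(T). The text T = "acaacg$" is an assumption: the original figure is not reproduced here, and this string was chosen because it is consistent with the row numbers used in the walkthrough slides. Sorting all rotations is fine for a toy example but far too slow for a genome.

```python
def burrows_wheeler_matrix(text):
    """All rotations of text, sorted; text must end with the sentinel '$'."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)

def bwt(text):
    """BWT(T) is the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in burrows_wheeler_matrix(text))

T = "acaacg$"  # assumed example text, see the note above
matrix = burrows_wheeler_matrix(T)
last_column = bwt(T)
first_column = "".join(row[0] for row in matrix)

for row in matrix:
    print(row)
print("BWT(T) =", last_column)
```

Sorting the last column reproduces the first column, which is why only BWT(T) needs to be kept.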

Burrows-Wheeler Transform
The property that makes BWT(T) reversible is “LF Mapping”
The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence of that character in the First column
Example: rank 2 (the 2nd ‘a’ in the Last column) corresponds to rank 2 (the 2nd ‘a’ in the First column)
[Figure: T, BWT(T), and the Burrows Wheeler Matrix]
Slides modified from Ben Langmead

BWT: How to reconstruct T from BWT(T)?
To recreate T from BWT(T), repeatedly apply the rule:
T = BWT[ LF(i) ] + T; i = LF(i)
where LF(i) maps row i to the row whose first character corresponds to i’s last character, per the LF Mapping
Slides from Ben Langmead

First step (the first and last columns are known):
i=1; the last character of this row is the last character of T
LF(i)=7, the first ‘g’
BWT[LF(i)]=‘c’, so the second-to-last character is ‘c’
i=LF(i)=7
Slides from Ben Langmead

Second step:
i=7; the last character of this row is the second-to-last character of T
LF(i)=6, the second ‘c’
BWT[LF(i)]=‘a’, so the third-to-last character is ‘a’
i=LF(i)=6
Slides from Ben Langmead
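The walk can be written out as a short function. This is a sketch, not the slides' own code: rows are 0-based here (the slides count from 1), and the input "gc$aaac" is assumed to be the BWT of the example text, which is consistent with the row numbers in the steps above.

```python
def last_column_ranks(bwt_string):
    """For each row, how many equal characters appear above it in the last column."""
    seen = {}
    ranks = []
    for ch in bwt_string:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return ranks

def first_column_starts(bwt_string):
    """Row where each character's block begins in the (sorted) first column."""
    starts = {}
    total = 0
    for ch in sorted(set(bwt_string)):
        starts[ch] = total
        total += bwt_string.count(ch)
    return starts

def inverse_bwt(bwt_string):
    """Rebuild T by following the LF mapping from the row that starts with '$'."""
    ranks = last_column_ranks(bwt_string)
    starts = first_column_starts(bwt_string)
    row = 0          # row 0 begins with '$', so its last character ends T
    text = "$"
    visited = [row]
    while bwt_string[row] != "$":
        ch = bwt_string[row]
        text = ch + text                # prepend the character just before
        row = starts[ch] + ranks[row]   # LF mapping: same occurrence, first column
        visited.append(row)
    return text, visited

text, visited = inverse_bwt("gc$aaac")  # assumed BWT of the example text
print(text, visited)
```

The first rows visited are 0, 6, 5 in 0-based numbering, which correspond to rows 1, 7, 6 in the steps above.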


BWT: How To Do Exact Matching?
To match Q in T using BWT(T), repeatedly apply the rule:
top = LF(top, qc); bot = LF(bot, qc)
where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc
Slides from Ben Langmead

qc=‘c’: top=5, bot=6
The last character of rows 5 and 6 is ‘a’
qc=‘a’: top=LF(5,‘a’)=3, bot=LF(6,‘a’)=4
Slides from Ben Langmead

qc=‘a’: the last characters of rows 3 and 4 are ‘$’ and ‘a’
top=LF(3,‘a’)=2, bot=LF(4,‘a’)=2
Slides from Ben Langmead

Exact Matching with FM Index
In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q (from right to left)
If the range becomes empty, the query suffix (and therefore the query) does not occur in the text
If there is no match, instead of giving up, try to “backtrack” to a previous position and try a different base (mismatch, much slower)
Slides from Ben Langmead
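Exact matching by backward search can be sketched as below. Assumptions to note: rows are 0-based and the range is half-open [top, bot), unlike the 1-based inclusive rows on the slides; occurrences are counted by scanning the BWT string, where a real FM index would use precomputed rank tables; and "gc$aaac" is assumed to be the BWT of the example text.

```python
def backward_search(bwt_string, query):
    """Return the half-open row range [top, bot) of rows that begin with query."""
    # Row where each character's block begins in the sorted first column.
    starts = {}
    total = 0
    for ch in sorted(set(bwt_string)):
        starts[ch] = total
        total += bwt_string.count(ch)

    top, bot = 0, len(bwt_string)  # start with every row
    for qc in reversed(query):     # consume the query right to left
        if qc not in starts:
            return 0, 0
        # LF(row, qc): block start of qc plus the number of qc above that row
        top = starts[qc] + bwt_string[:top].count(qc)
        bot = starts[qc] + bwt_string[:bot].count(qc)
        if top >= bot:             # empty range: query does not occur
            return top, top
    return top, bot

BWT_T = "gc$aaac"  # assumed BWT of the example text "acaacg$"
top, bot = backward_search(BWT_T, "aac")
occurrences_aac = bot - top
occurrences_ac = backward_search(BWT_T, "ac")[1] - backward_search(BWT_T, "ac")[0]
occurrences_gg = backward_search(BWT_T, "gg")[1] - backward_search(BWT_T, "gg")[0]
print((top, bot), occurrences_aac, occurrences_ac, occurrences_gg)
```

For the query "aac" the range shrinks to a single row, matching the final top = bot = 2 in the 1-based walkthrough. Backtracking for mismatches is not included in this sketch.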

STAR Alignment Suffix Tree
Very fast and accurate for mapping paired-end (PE) reads and high read counts
O(n) time to build
O(m log n) time to search

Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: the tree, with edges from the root labeled a, b and $]
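To make the example concrete, here is a sketch that builds an uncompressed suffix trie for s = abab using nested dictionaries. It is deliberately simpler than a true suffix tree: chains of single-child nodes are not merged into one edge, so construction takes quadratic rather than O(n) time. The function names are illustrative.

```python
def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a trie of nested dicts."""
    text = s + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """A pattern occurs in s exactly when it is a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def list_suffixes(s):
    text = s + "$"
    return [text[i:] for i in range(len(text))]

trie = build_suffix_trie("abab")
root_edges = sorted(trie)
suffixes = list_suffixes("abab")
print(root_edges, suffixes)
print(contains(trie, "bab"), contains(trie, "aa"))
```

The root has exactly three outgoing edges, and any substring of s can be found by walking down from the root one character at a time.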

Mapped Seq Files
Mapped SAM
HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr9 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr9 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0
Map flag: 0 = OK, 4 = unmapped, 16 = mapped to the reverse strand
Sequence, quality score
XA: mapper-specific tag
MD: mismatch info, e.g. 3C30T3 = 3 matches, then C in the reference, 30 matches, then T in the reference, 3 matches
NM: number of mismatches
BAM: binary SAM format
Mapped BED
Chr, start, end, strand
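The flag and MD fields can be decoded with a few lines of Python. This is a sketch with illustrative helper names, not a replacement for a SAM library. It only looks at the two flag bits mentioned above (4 and 16) and treats MD as alternating match counts and reference bases.

```python
import re

def describe_flag(flag):
    """Interpret the two flag bits discussed here: 4 (unmapped), 16 (reverse)."""
    if flag & 4:
        return "unmapped"
    if flag & 16:
        return "mapped, reverse strand"
    return "mapped, forward strand"

def parse_md(md):
    """Split an MD string such as '3C30T3' into [3, 'C', 30, 'T', 3]."""
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)
    return [int(tok) if tok.isdigit() else tok for tok in tokens]

def md_summary(md):
    """Return (matching bases, mismatching bases); '^' deletions are ignored."""
    tokens = parse_md(md)
    matches = sum(tok for tok in tokens if isinstance(tok, int))
    mismatches = sum(1 for tok in tokens
                     if isinstance(tok, str) and not tok.startswith("^"))
    return matches, mismatches

flags = [describe_flag(f) for f in (16, 4, 0)]
tokens = parse_md("3C30T3")
matches, mismatches = md_summary("3C30T3")
print(flags, tokens, matches, mismatches)
```

For the first record, 36 matches plus 2 mismatches add up to the 38 bases of the 38M alignment, and the mismatch count agrees with NM:i:2.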

Mapping Statistics Terms
Mappable reads: reads that match at least one location in the genome
Uniquely mapped reads: reads that match A SINGLE location in the genome
Uniqueness depends on repeat sequences in the genome and on read length
Uniquely mapped locations: number of unique locations hit by uniquely mapped reads
Redundancy: many reads piling up at the same location, a sign of potential PCR amplification bias
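These terms can be made concrete with a toy calculation. Everything here is illustrative: the input structure (read ID mapped to a list of hits) is hypothetical, and redundancy is taken to be uniquely mapped reads divided by uniquely mapped locations, which is one simple way to quantify it and not necessarily the definition used by any particular tool.

```python
def mapping_stats(alignments):
    """alignments: dict of read_id -> list of (chrom, position, strand) hits."""
    mappable = [r for r, hits in alignments.items() if len(hits) >= 1]
    unique_reads = [r for r, hits in alignments.items() if len(hits) == 1]
    unique_locations = {alignments[r][0] for r in unique_reads}
    # Assumed definition: average number of unique reads per unique location.
    redundancy = (len(unique_reads) / len(unique_locations)
                  if unique_locations else 0.0)
    return {
        "total_reads": len(alignments),
        "mappable_reads": len(mappable),
        "uniquely_mapped_reads": len(unique_reads),
        "uniquely_mapped_locations": len(unique_locations),
        "redundancy": redundancy,
    }

# Hypothetical alignments: r1 and r2 pile up on the same location,
# r4 maps to two places (a repeat), r5 is unmapped.
alignments = {
    "r1": [("chr9", 100, "-")],
    "r2": [("chr9", 100, "-")],
    "r3": [("chr9", 500, "+")],
    "r4": [("chr1", 40, "+"), ("chr2", 77, "+")],
    "r5": [],
}
stats = mapping_stats(alignments)
print(stats)
```

A redundancy close to 1 means nearly every uniquely mapped read lands on its own location; larger values mean more duplicate reads per location.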

Summary
Sequencing technologies: 1st, 2nd, 3rd generation
Sequence quality assessment: FASTQC
Read mapping
Spaced seed
BWA: Burrows-Wheeler transform, LF mapping
STAR: suffix tree, fast
SAM / BAM format
