Presentation is loading. Please wait.

Presentation is loading. Please wait.

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.

Similar presentations


Presentation on theme: "Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences."— Presentation transcript:

1 Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences

2 High-throughput sequencing  HTS platforms  Roche 454 platforms  Illumina/Solexa platforms (most widely used)  Applied Biosystem (ABI) SOLiD  Helicos HeliSope TM sequencer(single molecular sequencing)  Life Technologies platforms  The throughput is increasing and the price is dropping  Short read but high throughput

3 High-throughput sequencing  HTS platforms  Roche 454 platforms  Illumina/Solexa platforms (most widely used)  Applied Biosystem (ABI) SOLiD  Helicos HeliSope TM sequencer(single molecular sequencing)  Life Technologies platforms  The throughput is increasing and the price is dropping  Short read but high throughput

4 What sequencing data can do  Detection of genomic variations (Whole genome sequencing or targeted sequencing)  Single nucleotide polymorphisms (SNP)  Copy number variations (CNV)  Structural variations (SV)  Analyze protein interactions with DNA (ChIP-seq)  Whole Transcriptome study (RNA-seq)  DNA methylation study  and many more …..

5 Sequencing (Illunima) 5 Mardis Nature Reivew Genetics (2010)

6 What the data look like? Fastq Format detail: 1 st line: the name of a short read 2 nd line: the read itself (a short sequence of A,C,G,T) 3 rd line: the name of the short read or plus (+) sign 4 th line: the quality score 6

7 General strategy for analyzing HTS data Alignment-based Assembly-based

8 Comparing two DNA sequences

9 How can we evaluate an alignment

10 Scores: – Mutation (mismatch): 0 – Match: 1 – Gap: -1

11 Find a best alignment Exhaustively search all possible alignments – Computationally too expensive!!! Observation: for a pair (i,j)

12 Find a best alignment

13 Find a best alignment—Solution I

14 Find a best alignment—Solution II

15 Dynamical programming

16

17

18

19

20

21

22

23 Semiglobal alignment

24

25

26

27 Local Alignment

28

29

30

31 BLAST and BLAT BLAST: Basic Local Alignment Search Tool BLAT: BLAST Like Alignment Tool

32 BLAST BLAST: – A search algorithm for finding local alignments of two sequences S and T – An associated theory for evaluating the statistical significant Terminology and notation – S(a,b): scoring system – High-scoring Segment Pair (HSP): Cannot be extended or shortened without dropping the score

33 BLAST The number of HSPs with a score ≥ S approximately follows a Poisson Distribution (under the null hypothesis) with parameter – Assumptions Some probability to take positive score – Based on extreme value theory – E-value – Bit score

34 BLAST Algorithm: seed and extend – Build an index for k-mers of the query sequence – Find the hits of the k-mers in the database sequence in the query sequence – Extend the seeds with a score ≥ a threshold and find the HSPs with a score ≥ S – Evaluate the statistical significance

35 BLAT Strategy: seed and extend – In the seed stage, detects regions of two sequences that are likely to be homologous – In the extend stage, those regions are examined in detail and alignments are produced. Index the non-overlapping K-mers of the database sequences instead of the query sequence

36 BLAT Strategy: seed and extend – In the seed stage, detects regions of two sequences that are likely to be homologous – In the extend stage, those regions are examined in detail and alignments are produced. Index the non-overlapping K-mers of the database sequences instead of the query sequence

37 BLAT Seeding strategy – Single perfect K-mer matches – Single near perfect K-mer matches – Multiple perfect K-mer matches

38 BLAT Some definitions

39 BLAT Single perfect match – the probability that a specific K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query – Sensitivity: the probability of a hit (at least one non- overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) – Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

40 BLAT Single perfect match – the probability that a specific K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query – Sensitivity: the probability of a hit (at least one non- overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) – Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

41 BLAT Single imperfect match – The probability – The sensitivity – The specificity

42 BLAT Single imperfect match – The probability – The sensitivity – The specificity

43 BLAT Multiple perfect Matches – Probability – Sensitivity – Specificity

44 BLAT Multiple perfect Matches – Probability – Sensitivity – Specificity

45 BLAT Clumping the hits – The hit list L is sorted by database coordinate. – The list L is split into buckets of size 64 kb each, based on the database coordinate. – Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position. – Hits that are within the gap limit are grouped together into proto-clumps. – Hits within proto-clumps are then sorted by their database coordinate and put into real clumps – Clumps within 300 bp or 100 amino acids of each other in the database are merged

46 BLAT Nucleotide alignment – A hit list is generated between the query sequence q and the homologous region h in the database, looking for smaller, perfect K-mers. – If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one until the match is unique or exceeds a certain size. – The hits are extended as far as possible, without mismatches – Overlapping hits are merged. – Then extensions using indels followed by matches are considered.


Download ppt "Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences."

Similar presentations


Ads by Google