Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
BLAST Sequence alignment, E-value & Extreme value distribution.
A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Last lecture summary.
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Searching Sequence Databases
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence similarity (II). Schedule Mar 23midterm assignedalignment Mar 30midterm dueprot struct/drugs April 6teams assignedprot struct/drugs April 13RNA.
Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Index-based search of single sequences Omkar Mate CS 374 Stanford University.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Heuristic Approaches for Sequence Alignments
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Protein Sequence Comparison Patrice Koehl
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Outline. 1. what is BLAT & why we need it 2
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
BLAT – The B LAST- L ike A lignment T ool Kent, W.J. Genome Res : Presenter: 巨彥霖 田知本.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
An Introduction to Bioinformatics
Protein Sequence Alignment and Database Searching.
BLAST Workshop Maya Schushan June 2009.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
1 Data structure:Lookup Table Application:BLAST. 2 The Look-up Table Data Structure A k-mer is a string of length k. A lookup table is a table of size.
Comp. Genomics Recitation 3 The statistics of database searching.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Constructing Probability Matrices Redux Suppose we live in a world with only 3 amino acids: Alanine Leucine Serine Furthermore suppose: Alanine Leucine.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
2016/1/27Summer Course1 Pattern Search Problems Part I: Fundament Concept.
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.
Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
Heuristic Alignment Algorithms Hongchao Li Jan
CS 6293 AT: Current Bioinformatics HW2 Papers 1
What is sequencing? Video: WlxM (Illumina video) WlxM.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
BLAST BNFO 236 Usman Roshan. BLAST Local pairwise alignment heuristic Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Fast Sequence Alignments
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
BIOINFORMATICS Fast Alignment
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Searching Sequence Databases
Presentation transcript:

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences

High-throughput sequencing  HTS platforms  Roche 454 platforms  Illumina/Solexa platforms (most widely used)  Applied Biosystem (ABI) SOLiD  Helicos HeliSope TM sequencer(single molecular sequencing)  Life Technologies platforms  The throughput is increasing and the price is dropping  Short read but high throughput

High-throughput sequencing  HTS platforms  Roche 454 platforms  Illumina/Solexa platforms (most widely used)  Applied Biosystem (ABI) SOLiD  Helicos HeliSope TM sequencer(single molecular sequencing)  Life Technologies platforms  The throughput is increasing and the price is dropping  Short read but high throughput

What sequencing data can do  Detection of genomic variations (Whole genome sequencing or targeted sequencing)  Single nucleotide polymorphisms (SNP)  Copy number variations (CNV)  Structural variations (SV)  Analyze protein interactions with DNA (ChIP-seq)  Whole Transcriptome study (RNA-seq)  DNA methylation study  and many more …..

Sequencing (Illunima) 5 Mardis Nature Reivew Genetics (2010)

What the data look like? Fastq Format detail: 1 st line: the name of a short read 2 nd line: the read itself (a short sequence of A,C,G,T) 3 rd line: the name of the short read or plus (+) sign 4 th line: the quality score 6

General strategy for analyzing HTS data Alignment-based Assembly-based

Comparing two DNA sequences

How can we evaluate an alignment

Scores: – Mutation (mismatch): 0 – Match: 1 – Gap: -1

Find a best alignment Exhaustively search all possible alignments – Computationally too expensive!!! Observation: for a pair (i,j)

Find a best alignment

Find a best alignment—Solution I

Find a best alignment—Solution II

Dynamical programming

Semiglobal alignment

Local Alignment

BLAST and BLAT BLAST: Basic Local Alignment Search Tool BLAT: BLAST Like Alignment Tool

BLAST BLAST: – A search algorithm for finding local alignments of two sequences S and T – An associated theory for evaluating the statistical significant Terminology and notation – S(a,b): scoring system – High-scoring Segment Pair (HSP): Cannot be extended or shortened without dropping the score

BLAST The number of HSPs with a score ≥ S approximately follows a Poisson Distribution (under the null hypothesis) with parameter – Assumptions Some probability to take positive score – Based on extreme value theory – E-value – Bit score

BLAST Algorithm: seed and extend – Build an index for k-mers of the query sequence – Find the hits of the k-mers in the database sequence in the query sequence – Extend the seeds with a score ≥ a threshold and find the HSPs with a score ≥ S – Evaluate the statistical significance

BLAT Strategy: seed and extend – In the seed stage, detects regions of two sequences that are likely to be homologous – In the extend stage, those regions are examined in detail and alignments are produced. Index the non-overlapping K-mers of the database sequences instead of the query sequence

BLAT Strategy: seed and extend – In the seed stage, detects regions of two sequences that are likely to be homologous – In the extend stage, those regions are examined in detail and alignments are produced. Index the non-overlapping K-mers of the database sequences instead of the query sequence

BLAT Seeding strategy – Single perfect K-mer matches – Single near perfect K-mer matches – Multiple perfect K-mer matches

BLAT Some definitions

BLAT Single perfect match – the probability that a specific K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query – Sensitivity: the probability of a hit (at least one non- overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) – Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

BLAT Single perfect match – the probability that a specific K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query – Sensitivity: the probability of a hit (at least one non- overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) – Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

BLAT Single imperfect match – The probability – The sensitivity – The specificity

BLAT Single imperfect match – The probability – The sensitivity – The specificity

BLAT Multiple perfect Matches – Probability – Sensitivity – Specificity

BLAT Multiple perfect Matches – Probability – Sensitivity – Specificity

BLAT Clumping the hits – The hit list L is sorted by database coordinate. – The list L is split into buckets of size 64 kb each, based on the database coordinate. – Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position. – Hits that are within the gap limit are grouped together into proto-clumps. – Hits within proto-clumps are then sorted by their database coordinate and put into real clumps – Clumps within 300 bp or 100 amino acids of each other in the database are merged

BLAT Nucleotide alignment – A hit list is generated between the query sequence q and the homologous region h in the database, looking for smaller, perfect K-mers. – If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one until the match is unique or exceeds a certain size. – The hits are extended as far as possible, without mismatches – Overlapping hits are merged. – Then extensions using indels followed by matches are considered.