Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Heuristic alignment algorithms and cost matrices
Overview of sequence database searching techniques and multiple alignment May 1, 2001 Quiz on May 3-Dynamic programming- Needleman-Wunsch method Learning.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Sequence alignment, E-value & Extreme value distribution
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 17 th, 2013.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.
Multiple testing correction
An Introduction to Bioinformatics
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Rationale for searching sequence databases June 25, 2003 Writing projects due July 11 Learning objectives- FASTA and BLAST programs. Psi-Blast Workshop-Use.
Significance in protein analysis
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window,
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Sequence comparison: Significance of similarity scores
Sequence comparison: Multiple testing correction
BLAST.
Sequence alignment, Part 2
Sequence comparison: Multiple testing correction
Sequence comparison: Significance of similarity scores
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Problem set 3 will be posted by tonight. The Python part may take you a LONG time!

Unscaled EVD equation peak centered on 0 characteristic width (FYI this is 1 minus the cumulative density function or CDF)

Scaling the EVD An EVD derived from, e.g., the Smith-Waterman algorithm with BLOSUM62 matrix and a given gap penalty has a characteristic mode parameter μ and scale parameter λ. scaled: and  depend on the size of the query, the size of the target database, the substitution matrix and the gap penalties.

Similar to scaling the standard normal standard normal (  moves peak and v adjusts peak width)

Summary score significance A distribution plots the frequencies of types of observation. The area under the distribution curve is 1. Most statistical tests compare observed data to the expected result according to a null hypothesis. Sequence similarity scores follow an extreme value distribution, which is characterized by a long tail. The p-value associated with a score is the area under the curve to the right of that score. Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that a given score would appear in a randomized database.

Whole genome alignments genome-wide alignment data (efficient) inference of shared (orthologous) genes across species genome evolution Why?

known gap in assembly averaged conservation for 17 genomes individual genome alignments, darker = higher scoring alignment discontinuity (e.g. translocation break point) questionable alignment segment sequence present but unalignable UCSC Browser track

GQSQVGQGPPCPHHRCTTCCPDGCHFEPQVCMCDWESCCEEG GQSEVRQGPQCPYHKCIKCQPDGCHYEPTVCICREKPCDEKG

How are genome-wide alignments made? mouse and human genomes are each about 3x10 9 nucleotides. how many calculations would a dynamic programming alignment have to make? at a minimum - 3 integer additions and 3 inequality tests for each DP matrix position DP matrix size is 3x10 9 by 3x10 9 about 6 x (3x3x10 18 ) = 5.4x10 19 calculations! Age of the universe is about 4.3x10 17 seconds (by the way, there are other problems too, including assuming colinearity)

Most common method is the BLAST search (Basic Local Alignment Search Tool). Only the initial step is different from dynamic programming alignment. Search sequence is broken into small words (usually 3 residues long for proteins). 20 * 20 * 20 = 8,000 protein words. These act as seeds for searches. The target dataset is pre-indexed for all the positions in the database sequences that match each search word above some score threshold (using a score matrix such as BLOSUM62). Making large searches faster

...VFEWVHLLP... WIY Target sequences around each indexed word hit are retrieved and the initial match is extended in both directions: your sequence database (many sites) For example, the search sequence word “WVH” might score above threshold with these indexed sequences: Indexed wordScore WVH 23 WIH 22 WVY 17 WIY 16 BLAST searches (cont.)

Schematic of indexed matches Result – instead of aligning these 3 amino acids to everything, they are aligned only with the tiny fraction of sequence regions that are good candidates for a valid alignment. (note- blast actually looks for two such matches close to each other)

Extension and scoring...QSVFEWVHLLPGA.....WIY.....QSVFEWVHLLPGA.....WIYQ.....QSVFEWVHLLPGA.....WIYQK.....QSVFEWVHLLPGA.....WIYQKA.. Total Score: Match Score: [mention gap variant]

Extension termination Extension is continued until the cumulative score drops below some threshold (usually 0). This permits the match to cross a region of marginal similarity or frank mismatching (e.g. a small intron in tblastn) if it flanks a region of high similarity. Extensions whose maximal cumulative score is above some threshold are kept for reporting to user. For web interfaces, various formatting, links, and overviews are added. It is also easy to set up blast on your local computer; useful for custom databases and automation.

Key to speed: word matching and prior indexing Though gapped blast local alignment is slow, only a very small part of total search space is analyzed. Because the positions of all database word matches are indexed prior to the search, the relevant parts of search space are reached quickly. Tradeoff is in sensitivity – occasionally matches will be missed (when they are distant enough and dispersed enough that no local word pairs match well enough).

genome A genome B DP alignment region M x N manageable BLAST matches Dynamic programming after BLAST matching Anchored DP alignment: if two blast matches are nearby and in the same orientation, DP align everything between them.