Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.


Review
What a score matrix is and how to calculate and use one.
Why an affine gap penalty is desirable.
How to align sequences using dynamic programming.
How to calculate and interpret p-values and E-values for pair alignments and database searches.

Whole genome alignments: why?

[Figure: UCSC Browser track. Annotated features: known gap in assembly; averaged conservation for 17 genomes; individual genome alignments (darker = higher scoring); alignment discontinuity (e.g. translocation break point); questionable alignment segment; sequence present but unalignable.]

GQSQVGQGPPCPHHRCTTCCPDGCHFEPQVCMCDWESCCEEG
GQSEVRQGPQCPYHKCIKCQPDGCHYEPTVCICREKPCDEKG

How are genome-wide alignments made?
The mouse and human genomes are each about 3 x 10^9 nucleotides long. How many calculations would a dynamic programming alignment have to make? At a minimum, 3 integer additions and 3 inequality tests for each DP matrix position. (By the way, there are other problems too, including the assumption of colinearity.)
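The question above can be answered with a few lines of arithmetic. This is a rough sketch: the 6 operations per cell come from the slide, while the machine throughput of 10^9 operations per second is purely an assumed figure for illustration.

```python
# Back-of-the-envelope cost of a full DP alignment of two genomes.
GENOME_LENGTH = 3 * 10**9          # nucleotides, as stated above
OPS_PER_CELL = 3 + 3               # 3 additions + 3 inequality tests per cell
ASSUMED_OPS_PER_SECOND = 10**9     # assumption: not from the slides
SECONDS_PER_YEAR = 3600 * 24 * 365


def dp_cost(len_a, len_b, ops_per_cell=OPS_PER_CELL):
    """Return (matrix cells, total operations) for a full DP matrix."""
    cells = len_a * len_b
    return cells, cells * ops_per_cell


cells, ops = dp_cost(GENOME_LENGTH, GENOME_LENGTH)
years = ops / ASSUMED_OPS_PER_SECOND / SECONDS_PER_YEAR
print(f"{cells:.1e} cells, {ops:.1e} operations, roughly {years:,.0f} years")
```

Even before memory is considered, the quadratic cell count makes a direct DP alignment of two whole genomes impractical.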

Making large searches faster
The most common method is the BLAST search (Basic Local Alignment Search Tool). Only the initial step is substantially different from dynamic programming alignment. The search sequence is broken into small words (usually 3 residues long for proteins), so there are 20 x 20 x 20 = 8,000 possible words. These act as seeds for searches. The target dataset is pre-indexed to indicate the positions in the database sequences that match each search word above some score threshold (using a global score matrix such as BLOSUM62).
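The word-splitting and pre-indexing steps can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it indexes exact word matches only (no score threshold), and the function names and the two database sequences are made up for the example.

```python
from collections import defaultdict

WORD_SIZE = 3  # typical word length for proteins, as noted above


def words(seq, k=WORD_SIZE):
    """Return every overlapping k-residue word with its start position."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(database, k=WORD_SIZE):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, k):
            index[word].append((name, pos))
    return index


# Hypothetical toy database; built once, before any search is run.
database = {"seqA": "QSWIYQKA", "seqB": "MWVHLL"}
index = build_index(database)

# Each query word is looked up directly instead of scanning the database.
query = "VFEWVHLLP"
hits = [(qpos, w, index[w]) for qpos, w in words(query) if w in index]
for qpos, w, places in hits:
    print(qpos, w, places)
```

Because the index is built ahead of time, the cost of a lookup does not depend on how large the database is, only on how many places the word occurs.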

BLAST searches (cont.)
For example, the search sequence word "WVH" might score above threshold with these indexed sequences:

Indexed word    Score
WVH             23
WIH             22
WVY             17
WIY             16

Target sequences around each indexed word hit are retrieved and the initial match is extended in both directions:

your sequence:           ...VFEWVHLLP...
database (many sites):         WIY
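The word scores in this example are the sums of per-residue substitution scores. The sketch below reproduces them from a small hand-entered subset of BLOSUM62 that covers only the residue pairs needed here; the helper names are illustrative, and a real search would load the full matrix.

```python
# Hand-entered subset of BLOSUM62: only the pairs used in this example.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("V", "V"): 4,
    ("H", "H"): 8,
    ("V", "I"): 3,
    ("H", "Y"): 2,
}


def pair_score(a, b):
    """Look up a symmetric substitution score (KeyError if pair is absent)."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word1, word2):
    """Score two equal-length words position by position, without gaps."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


query_word = "WVH"
scores = {w: word_score(query_word, w) for w in ["WVH", "WIH", "WVY", "WIY"]}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, score)
```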

Schematic of indexed matches
Result: instead of aligning these 3 amino acids to everything, they are aligned only with the tiny fraction of sequence regions that are good candidates for a valid alignment. (Note: BLAST actually looks for two such matches close to each other.)

Extension and scoring

...QSVFEWVHLLPGA...
        WIY

...QSVFEWVHLLPGA...
        WIYQ

...QSVFEWVHLLPGA...
        WIYQK

...QSVFEWVHLLPGA...
        WIYQKA

Total Score:
Match Score:
[mention gap variant]

Extension termination
Extension is continued until the cumulative score drops below some threshold (usually 0). This permits the match to cross a region of marginal similarity or frank mismatching (e.g. a small intron in tblastn) if it flanks a region of high similarity. Extensions whose maximal cumulative score is above some threshold are kept for reporting to the user. For web interfaces, various formatting, links, and overviews are added and reported according to user settings (it is also fairly easy to download and run your own BLAST).
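The termination rule described above can be sketched for the ungapped, rightward case. This is a minimal illustration under stated assumptions: the +5/-4 match/mismatch scoring, the function names, and the toy sequences are all invented for the example, and real BLAST implementations differ in detail.

```python
def simple_score(a, b):
    """Assumed toy scoring: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_right(query, target, q_start, t_start, seed_len, seed_score,
                 score_fn=simple_score, floor=0):
    """Extend a seed rightward until the cumulative score drops below floor.

    Returns (best cumulative score, alignment length at that best score).
    """
    running = best = seed_score
    best_len = seed_len
    i = seed_len
    while q_start + i < len(query) and t_start + i < len(target):
        running += score_fn(query[q_start + i], target[t_start + i])
        if running < floor:
            break  # cumulative score fell below the threshold: stop
        i += 1
        if running > best:
            best, best_len = running, i  # remember the highest-scoring extent
    return best, best_len


# The seed "ACG" (score 15) is followed by one mismatch and then more matches.
# The extension crosses the mismatch because the running score stays >= 0.
best, length = extend_right("ACGAACGT", "ACGTACGT", 0, 0, 3, 15)
print(best, length)
```

Tracking the maximum separately from the running score is what lets the extension pass through a weak region and still report the best-scoring extent.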

Key to speed: word matching and prior indexing
Though gapped BLAST local alignment is slow (like dynamic programming), only a very small part of the total search space is analyzed. Because the positions of all database word matches are indexed and stored prior to the BLAST search, the relevant parts of the search space are reached quickly. The tradeoff is in accuracy and certainty: occasionally matches will be missed (when they are distant enough and dispersed enough that no local word pairs match well enough).

Dynamic programming after BLAST matching
[Figure: genome A plotted against genome B. BLAST matches bound the DP alignment regions, each of a manageable size M x N.]
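One way to picture why the remaining DP regions are manageable is to compare their combined area with the full matrix. The sketch below assumes the BLAST matches are already sorted and colinear; the anchor coordinates, sequence lengths, and function name are hypothetical.

```python
def dp_regions(anchors, len_a, len_b):
    """Return the (M, N) sizes of the gaps between consecutive anchors.

    anchors: (a_start, a_end, b_start, b_end) tuples with half-open ends,
    assumed sorted and colinear in both genomes.
    """
    regions = []
    prev_a, prev_b = 0, 0
    for a_start, a_end, b_start, b_end in anchors:
        regions.append((a_start - prev_a, b_start - prev_b))
        prev_a, prev_b = a_end, b_end
    regions.append((len_a - prev_a, len_b - prev_b))  # tail after last anchor
    return regions


# Hypothetical anchors on two 1,000-residue sequences.
anchors = [(100, 200, 120, 220), (500, 650, 530, 680)]
regions = dp_regions(anchors, 1000, 1000)
anchored_cells = sum(m * n for m, n in regions)
full_cells = 1000 * 1000
print(regions, anchored_cells, full_cells)
```

In this toy case the DP work falls to about a fifth of the full matrix, and the saving grows as anchors become denser.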

Defining what a "tree" means
Rooted tree (all real trees are rooted): divergence time is the sum of (horizontal) branch lengths.
Unrooted tree (used when the root isn't known): time vaguely radiates out from somewhere near the center.
[Figure: rooted and unrooted tree diagrams, labeled with time, ancestral sequence, root, branch points, branches, and sequences (leaves or tips).]

A tree has topology and distances
Are these different trees?

The number of tree topologies grows extremely fast

Leaves  Branches  Internal nodes  Topologies  Insertion points for the next leaf
3       3         1               1           3
4       5         2               3 (x3)      5
5       7         3               15 (x5)     7

In general, an unrooted tree with N leaves has:
2N – 3 branches
N – 2 internal nodes
3 x 5 x ... x (2N – 5) topologies, which grows factorially with N

There are many rooted trees for each unrooted tree
For each unrooted tree there are 2N – 3 rooted trees, where N is the number of leaves, because the root can be placed on any of the 2N – 3 branches. With 20 leaves there are 221,643,095,476,699,771,875 unrooted topologies and 8,200,794,532,637,891,559,375 rooted topologies.
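The counts can be reproduced from the insertion argument: each new leaf can attach to any existing branch, so the number of unrooted topologies is multiplied by 3, then 5, then 7, and so on, and rooting multiplies the result by 2N – 3. The following sketch assumes strictly bifurcating trees with labeled leaves; the function names are illustrative.

```python
def odd_product(k):
    """Product of the odd numbers 3, 5, ..., k (1 if k < 3)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def unrooted_topologies(n_leaves):
    """Unrooted bifurcating topologies for n_leaves >= 3 labeled leaves."""
    return odd_product(2 * n_leaves - 5)


def rooted_topologies(n_leaves):
    """The root can sit on any of the 2N - 3 branches of an unrooted tree."""
    return (2 * n_leaves - 3) * unrooted_topologies(n_leaves)


for n in (3, 4, 5, 10, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

Python integers have arbitrary precision, so the counts stay exact even when they run to dozens of digits.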