

“Multiple indexes and multiple alignments”
Presenting: Siddharth Jonathan
Scribing: Susan Tang
DFLW: Neda Nategh
Upcoming:
– 10/24: “Evolution of Multidomain Proteins” (Wissam Kazan); “Human Migrations” (Anjalee Sujanani)
– 10/26: “Comparison of Networks Across Species” (Chuan Sheng Foo); “Repetitive DNA Detection and Classification” (Vijay Krishnan)
10/19

CS374 Presentation - Searching Biological Sequence Databases
CS374 Algorithms in Biology
Searching Biological Sequence Databases
Siddharth Jonathan

Outline
– Background
– Problem
– Typhon Overview
– Typhon Components
– Results

Background
– Sequence Alignment
– Multiple Alignment Databases
– Probabilistic Profile
– Phylogenetic Tree

Sequence Alignment
– Identifying regions of similarity in genomes, proteins, etc.
– Types
  – Global
  – Local
    – Seeded
    – Non-seeded
– Why is it important?
  – Comparative analysis of genomes
  – Producing phylogenetic trees
  – Understanding newly sequenced genomes

Seeds – A Review
– A seed P is an ordered list of w positions: P = {x_1, x_2, …, x_w}
– w = weight of P = |P|
– s = span of P = x_w – x_1 + 1
– Example: P = {0, 1, 4, 5}, so w = 4 and s = 5 – 0 + 1 = 6
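The weight and span definitions can be checked with a few lines of Python. This is only an illustrative sketch: the function names seed_weight, seed_span and apply_seed are hypothetical, and the span is assumed to be x_w – x_1 + 1, which reproduces the slide's result of 6.

```python
def seed_weight(seed):
    """Weight w: the number of positions the seed samples."""
    return len(seed)


def seed_span(seed):
    """Span s: distance from the first to the last sampled position, inclusive."""
    return max(seed) - min(seed) + 1


def apply_seed(sequence, start, seed):
    """Characters sampled by the seed placed at `start`; None if it runs off the end."""
    if start + max(seed) >= len(sequence):
        return None
    return "".join(sequence[start + x] for x in seed)


P = (0, 1, 4, 5)
print(seed_weight(P), seed_span(P))   # 4 6
print(apply_seed("GATTACCA", 0, P))   # GAAC
```

Positions 2 and 3 of the span are skipped, which is what makes this a spaced (non-contiguous) seed.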

Indexing in Seeded Local Alignment Algorithms
– Gene sequence S: …G A T T A C C A G A T T A C C A G A T T A…
– Seed A = {0, 1, 2, 3}
– At position 0 of S the seed reads GATT, so the index stores GATT → (S, 0)
– At position 1 of S the seed reads ATTA, so the index stores ATTA → (S, 1)
– The same idea holds for non-contiguous seeds as well!
– The average number of seeds indexed per position is called the Budget
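A minimal sketch of this indexing step, assuming the index is a plain dictionary from sampled words to (sequence name, position) pairs. The function name build_index and the dictionary layout are illustrative choices, not the data structure used by any of the tools named in this talk.

```python
from collections import defaultdict


def build_index(name, sequence, seeds):
    """Map (seed id, sampled word) to every (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for seed_id, seed in enumerate(seeds):
        span = max(seed) + 1  # assumes each seed starts at offset 0
        for pos in range(len(sequence) - span + 1):
            word = "".join(sequence[pos + x] for x in seed)
            index[(seed_id, word)].append((name, pos))
    return index


S = "GATTACCAGATTACCAGATTA"
index = build_index("S", S, [(0, 1, 2, 3)])
print(index[(0, "GATT")])   # [('S', 0), ('S', 8), ('S', 16)]
print(index[(0, "ATTA")])   # [('S', 1), ('S', 9), ('S', 17)]
```

With a single seed indexed at every position the budget is 1; passing two seeds would double both the index size and the budget.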

Seeded Local Alignment Algorithms
– BLAST
– BLAT
– BLASTZ
– Exonerate
– Usage of multiple seeds, spaced seeds
– What do they have in common? Indexing!

Multiple alignment
[Figure: multiple alignment of Species 1 and Species 2]

Phylogenetic Tree

Probabilistic Profile
– Each cell corresponds to one position in the alignment…
– We’ll learn what information it carries very shortly!

Regions

The Problem
– Say we have a database of multiple alignments
– So what’s the challenge? Find local alignments for the query
– Candidate seeds

The Problem Statement
– Budget
– Can we do better?
– Make use of the information implicit in the multiple alignment to select which seeds to index at a given position

The Problem Statement - Typhon
– Given: Probabilistic Profile, Candidate Seeds, Budget
– Find: an indexing scheme that indexes only a subset of the candidate seeds at each position

Overall Architecture of Typhon

Step 1: Probabilistic Profile Construction
– A 6-tuple for each position in the multiple alignment: (P_present, P_A, P_C, P_T, P_G, P_id)
– P_present: existence probability
– P_A, P_C, P_T, P_G: conditional probability that the homologous position contains A, C, T, G, given that a homologous position exists
– The nucleotide with the highest such value is called the consensus character
– P_id: probability that the corresponding query position has the consensus character

Calculation of Probabilistic Profile
– Example alignment (Human, Chimp, Rat, Pig): A A A C / T _ T T / C C C C
– P_present = 100%, P_A = 75%, P_C = 25%, P_G = 0%, P_T = 0%
– Propagation of values up the tree to the root is a tricky problem!
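One way to hold a profile position in code is a small record type. The sketch below is an assumption about representation only: the class name ProfileColumn and its field names are hypothetical, and P_id is taken to be the largest of the four nucleotide probabilities, as the next slide states.

```python
from dataclasses import dataclass


@dataclass
class ProfileColumn:
    """One position of the probabilistic profile (probabilities in [0, 1])."""
    p_present: float
    p_a: float
    p_c: float
    p_t: float
    p_g: float

    def nucleotide_probs(self):
        return {"A": self.p_a, "C": self.p_c, "T": self.p_t, "G": self.p_g}

    def consensus(self):
        """Nucleotide with the highest conditional probability."""
        probs = self.nucleotide_probs()
        return max(probs, key=probs.get)

    @property
    def p_id(self):
        """Probability of the consensus character, taken as max(P_N)."""
        return max(self.nucleotide_probs().values())


# Values from the slide's example: P_A = 75%, P_C = 25%.
column = ProfileColumn(p_present=1.0, p_a=0.75, p_c=0.25, p_t=0.0, p_g=0.0)
print(column.consensus(), column.p_id)   # A 0.75
```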

Calculating the Probabilistic Profile
– P_present and P_N are calculated independently
– P_present: weighted average of the children’s P_present values, with weights proportional to the inverse of the branch length
– P_N: calculated through Felsenstein’s algorithm with a Kimura matrix
– P_id = max(P_N), calculated at the root
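The P_present rule can be sketched as a recursive weighted average. Everything concrete here is assumed for illustration: the dictionary-based tree, the branch lengths, and the choice of 1.0 for a leaf that has a character and 0.0 for a leaf with a gap. The P_N computation (Felsenstein's algorithm with a Kimura matrix) is not shown.

```python
def p_present(node):
    """P_present of a node: inverse-branch-length weighted average of its children."""
    if "children" not in node:
        # Assumed leaf rule: 1.0 if the species has a character here, 0.0 for a gap.
        return 0.0 if node["char"] in ("-", "_") else 1.0
    weights = [1.0 / length for _, length in node["children"]]
    values = [p_present(child) for child, _ in node["children"]]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical tree and branch lengths for one alignment column (T, gap, T, T).
human, chimp = {"char": "T"}, {"char": "_"}
rat, pig = {"char": "T"}, {"char": "T"}
left = {"children": [(human, 1.0), (chimp, 3.0)]}    # the closer child counts more
right = {"children": [(rat, 2.0), (pig, 2.0)]}
root = {"children": [(left, 1.0), (right, 1.0)]}

print(p_present(left))   # about 0.75
print(p_present(root))   # about 0.875
```

A gap on a long branch lowers P_present less than a gap on a short branch, because the short branch carries the larger weight.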

Overall Architecture of Typhon

Region Decomposition
ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T G----C-----AT--G------ATGCCCATAAAAT
ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT G----T-----A-GGG------ATGCCCAAAAAAT
ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
– Each region is characterized by a P_present and a P_id
– How do we come up with these regions?

Hidden Markov Models (HMM)
– Given an observation sequence
– Predict the sequence of hidden states

Region Decomposition – Simple Method
– Come up with a set of region classes (states)
– Construct an HMM
– Looking at the observation sequence, try to determine the most likely parse
  – Viterbi algorithm
– Problem: need to determine the classes at the beginning
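The most-likely-parse step can be illustrated with a textbook Viterbi decoder. The two region classes ("conserved", "variable"), the discretized observations ("hi" or "lo" for a high or low P_id column) and all probabilities below are made-up values for the sketch; they are not Typhon's actual states or parameters.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ["conserved", "variable"]
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "variable": 0.1},
           "variable": {"conserved": 0.1, "variable": 0.9}}
emit_p = {"conserved": {"hi": 0.9, "lo": 0.1},
          "variable": {"hi": 0.2, "lo": 0.8}}
obs = ["hi", "hi", "hi", "lo", "lo", "lo", "lo", "hi", "hi"]
print(viterbi(obs, states, start_p, trans_p, emit_p))
```

Consecutive positions that receive the same state form one region, so the decoded path splits this toy profile into a conserved region, a variable region and a second conserved region.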

Alternative
– Split the profile into 2 classes at a time
– Use a 2-stage HMM
– Stop when the bound on the number of region classes is reached

Region Decomposition with HMM

Overall Architecture of Typhon

Step 3: Seed Indexing
– What are we trying to do?
[Figure: candidate seeds A, B, C, D, E and the subsets of them assigned to individual regions]

The Goal
– Maximize the expected number of regions matched to a homologue

Seed Assignment
– 2 approaches:
  – General Method
  – Greedy Approximation

General Method - Terminology
– Rows (i): region classes
– Columns (j): size of the candidate set
– Object[i][j]: the table cell for region class i and candidate set size j

Calculation of number of matching regions (done for each cell in the previous table)
Expected number of regions matched = P_present × P_hit × |C|
– P_present: probability that a region matches a homologue
– P_hit: conditional probability that the seeds match the region and its homologue, given that it exists
– |C|: number of regions

General Method - Explained
Number of candidate seeds | 1 | 2 | 3 | 4 | 5
Region Class 1 | P_present * P^1_hit * |C| | P_present * P^2_hit * |C| | P_present * P^3_hit * |C| | P_present * P^4_hit * |C| | P_present * P^5_hit * |C|
Region Class 2 | P_present * P^1_hit * |C| | P_present * P^2_hit * |C| | P_present * P^3_hit * |C| | P_present * P^4_hit * |C| | P_present * P^5_hit * |C|
Region Class 3 | P_present * P^1_hit * |C| | P_present * P^2_hit * |C| | P_present * P^3_hit * |C| | P_present * P^4_hit * |C| | P_present * P^5_hit * |C|
Region Class 4 | P_present * P^1_hit * |C| | P_present * P^2_hit * |C| | P_present * P^3_hit * |C| | P_present * P^4_hit * |C| | P_present * P^5_hit * |C|

Some Terminology
– Weight
  – Total length of all regions in a region class × number of seeds indexed at each position
  – Sort of like the Budget for a region
– Value
  – Expected number of regions matched (the previous calculation)
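The two definitions can be combined into a helper that fills the Object[i][j] table with (weight, value) pairs. This is a sketch under assumptions: the function name build_objects, the dictionary keys, and the sample numbers are all hypothetical, and p_hit[j-1] stands for the hit probability when j seeds are indexed.

```python
def build_objects(region_classes, max_seeds):
    """Return one row per region class; entry j-1 is (weight, value) for j seeds."""
    objects = []
    for rc in region_classes:
        row = []
        for j in range(1, max_seeds + 1):
            # Weight: total length of the class's regions times seeds per position.
            weight = rc["total_length"] * j
            # Value: expected number of regions matched = P_present * P_hit * |C|.
            value = rc["p_present"] * rc["p_hit"][j - 1] * rc["n_regions"]
            row.append((weight, value))
        objects.append(row)
    return objects


# Made-up region class: 20 regions covering 10 positions in total.
example = [{"total_length": 10, "n_regions": 20,
            "p_present": 0.5, "p_hit": [0.5, 0.75, 1.0]}]
print(build_objects(example, 3))   # [[(10, 5.0), (20, 7.5), (30, 10.0)]]
```

Weight grows linearly with the number of seeds while value grows only as fast as P_hit does, which is why spending the whole budget on one region class is rarely the best choice.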


Solving the Seed Assignment Problem
Each cell is (Weight, Value).
Number of candidate seeds | 1 | 2 | 3 | 4 | 5
Region Class 1 | 10, 5 | 20, 30 | 30, 31 | 40, 34 | 50, 40
Region Class 2 | 15, 8 | 30, 20 | 45, 22 | 60, 24 | 75, 30
Region Class 3 | 12, 7 | 24, 10 | 36, 32 | 48, 36 | 60, 40
Region Class 4 | 9, 9 | 18, 10 | 27, 25 | 36, 27 | 5, 30

Solving the Seed Assignment Problem
Budget = 112, with the same (Weight, Value) table as on the previous slide.

Looks Familiar?
– Closely related to the Knapsack Problem, a well-studied problem in Computer Science
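Read as a knapsack variant in which at most one object may be taken per row, the example table with Budget = 112 can be solved exactly by dynamic programming. The sketch below is one standard formulation (multiple-choice knapsack), not necessarily the general method used in Typhon. The name assign_seeds is hypothetical, and the last Region Class 4 entry, which appears as weight 5 in the transcript, is assumed here to be 45 so that it follows the 9-per-seed pattern of its row.

```python
# (weight, value) per region class (rows) and number of seeds 1..5 (columns).
TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (45, 30)],   # 45 is an assumption
]


def assign_seeds(objects, budget):
    """Pick at most one object per row to maximize value within the budget."""
    # best[b]: (total value, seeds chosen per row so far) using at most b budget
    best = [(0, [])] * (budget + 1)
    for row in objects:
        new_best = []
        for b in range(budget + 1):
            value, picks = best[b]
            candidate = (value, picks + [0])   # option: index no seed for this row
            for j, (w, v) in enumerate(row, start=1):
                if w <= b and best[b - w][0] + v > candidate[0]:
                    candidate = (best[b - w][0] + v, best[b - w][1] + [j])
            new_best.append(candidate)
        best = new_best
    return best[budget]


print(assign_seeds(TABLE, 112))   # (99, [2, 1, 4, 3])
```

Under these assumptions the best assignment indexes 2, 1, 4 and 3 seeds for region classes 1 to 4, using 110 of the 112 budget units for an expected 99 matched regions. The table has one entry per budget unit per row, which is the time and space cost the approximate solution tries to avoid.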

Approximate Solution
– Faster
– Space efficient
– New terminology:
  – Density of an object = Value / Weight

Approximate Solution – General Intuition
– Select objects in order of decreasing density
– Disallow more than one object per row

Approximate Method in Action
– Candidate set:
  – Object[1,1]: Density = V/W = 3
  – Object[2,1]: Density = V/W = 2
  – Object[3,1]: Density = V/W = 5
  – Object[4,1]: Density = V/W = 4
– Object[3,2]: Density = V/W = 6
– What are the new values of Weight, Value and Density?
  – Value = additional number of regions matched
  – Weight = amount of budget used by this one seed
– And keep track of the Budget!
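One possible reading of the greedy approximation: keep one candidate per region class (its next seed), score it by incremental value over incremental weight, take the densest candidate that still fits, and replace it with the next object in the same row. The sketch below follows that reading. The name greedy_assign is hypothetical, the table is the earlier (Weight, Value) example, and the last Region Class 4 weight is again assumed to be 45 rather than the 5 shown in the transcript.

```python
TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (45, 30)],   # 45 is an assumption
]


def greedy_assign(objects, budget):
    """Repeatedly add the seed with the best incremental value per unit of weight."""
    levels = [0] * len(objects)   # seeds currently assigned to each region class
    used = 0
    total_value = 0
    while True:
        best = None
        for i, row in enumerate(objects):
            k = levels[i]
            if k == len(row):
                continue   # this row has no further seed to offer
            prev_w, prev_v = row[k - 1] if k > 0 else (0, 0)
            dw = row[k][0] - prev_w   # budget used by this one extra seed
            dv = row[k][1] - prev_v   # additional regions matched
            if used + dw > budget:
                continue
            density = dv / dw
            if best is None or density > best[0]:
                best = (density, i, dw, dv)
        if best is None:
            break   # nothing fits in the remaining budget
        _, i, dw, dv = best
        levels[i] += 1
        used += dw
        total_value += dv
    return total_value, levels, used


print(greedy_assign(TABLE, 112))   # (95, [2, 2, 4, 1], 107)
```

On this table the greedy pass reaches a value of 95 with 107 budget units. It only ever tracks one candidate per row, which is where the speed and space savings come from, at the cost of possibly missing a better combination.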

Results
– Considerations
  – Sensitivity
  – Speed
  – Space

Sensitivity Results
– Experimental setup
– Detection of Hypothetical Homologous Alignments (HHA)
– Typhon vs. Standard

Sensitivity Comparison

Effect of Multiple Alignment on Sensitivity

Running Time Comparison
– Time spent building the index
  – Typhon takes longer
– Time spent scanning the index
  – Typhon is 3-4 times slower at run time, which is reasonable

Scanning time

Conclusion
– Information implicit in multiple alignments helps search sensitivity
– Variable allocation of seeds by region class helps (Typhon)
– Space and time complexities of Typhon are comparable to STANDARD
– Most effective for queries far from each species in the alignment

Questions?

Acknowledgements
– Serafim Batzoglou
– George Asimenos
– Jason Flannick