Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.

Presentation transcript:

Sequence Alignment
Motivation:
* Storing, retrieving and comparing DNA sequences in databases.
* Comparing two or more sequences for similarities.
* Searching databases for related sequences and subsequences.
* Exploring frequently occurring patterns of nucleotides.
* Finding informative elements in protein and DNA sequences.
* Various experimental applications (reconstruction of DNA, etc.).

Seq. Align.: Protein Function
Similar sequences produce similar proteins.
[Diagram: Gene1 and Gene2, annotated with "More than 25% sequence identity?", "Similar 3D structure?" and "Similar function".]

Alignment - inexact matching
* Substitution - replacing a sequence base by another.
* Insertion - inserting one or more bases (letters) into the sequence.
* Deletion - deleting one or more bases from the sequence.
(Insertion and deletion are the reverse of one another.)

Seq. Align. Score
Commonly used matrices: PAM250, BLOSUM62

Global Alignment
INPUT: Two sequences S and T of roughly the same length.
QUESTION: What is the maximum similarity between them? Find one of the best alignments.

The IDEA
Sequences: s[1…n] and t[1…m].
To align s[1…i] with t[1…j] we have three choices:
* align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
* align s[1…i] with t[1…j-1] and match a space with t[j]
* align s[1…i-1] with t[1…j] and match s[i] with a space
The last column of the alignment in each of the three cases:
    s[1…i-1] i        s[1…i]   -        s[1…i-1] i
    t[1…j-1] j        t[1…j-1] j        t[1…j]   -

Global Alignment Alignment Score: V(n,m) Needleman-Wunsch 1970
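The three choices from "The IDEA" slide translate directly into a dynamic-programming table V, with V(n,m) as the alignment score. The following is a minimal Python sketch of that scheme, not the lecture's own code. The scoring values (match 2, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; only match and mismatch appear on a later slide.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment sketch: returns (score, aligned_s, aligned_t).

    The scoring values are illustrative assumptions, not taken from the slide.
    """
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap          # s[:i] against spaces only
    for j in range(1, m + 1):
        V[0][j] = j * gap          # t[:j] against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # match s[i] with t[j]
                          V[i][j - 1] + gap,       # space against t[j]
                          V[i - 1][j] + gap)       # s[i] against space

    # Trace back from V[n][m] to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif j > 0 and V[i][j] == V[i][j - 1] + gap:
            a.append("-"); b.append(t[j - 1]); j -= 1
        else:
            a.append(s[i - 1]); b.append("-"); i -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("cde", "cxde")
print(score)
print(top)
print(bottom)
```

With these assumed scores, aligning "cde" with "cxde" uses three matches and one space, so the score is 3*2 - 1 = 5.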

Local Alignment
INPUT: Two sequences S and T.
QUESTION: What is the maximum similarity between a subsequence of S and a subsequence of T? Find the most similar subsequences.

Local Alignment Smith-Waterman 1981 *Penalties should be negative*

Local alignment
match 2, mismatch -1
s: xxxcde
t: abcxdex
Best local alignment:
    cxde
    c-de
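The example above can be reproduced with a short Smith-Waterman sketch in Python. It differs from the global scheme in two ways: every cell is floored at 0, and the traceback starts at the best cell and stops at the first 0. The gap score of -1 is an assumption (the slide gives only match and mismatch), and the function name is hypothetical.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Local alignment sketch: returns (score, aligned_s_part, aligned_t_part).

    gap=-1 is an assumed value; penalties must be negative for local alignment.
    """
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                       # start a fresh alignment here
                          V[i - 1][j - 1] + sub,
                          V[i][j - 1] + gap,
                          V[i - 1][j] + gap)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i][j - 1] + gap:
            a.append("-"); b.append(t[j - 1]); j -= 1
        else:
            a.append(s[i - 1]); b.append("-"); i -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("xxxcde", "abcxdex"))
```

For s = xxxcde and t = abcxdex the sketch pairs "c-de" (from s) with "cxde" (from t): three matches and one space under the assumed gap score.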

Sequence Alignment
Complexity: time O(n*m), space O(n*m) (algorithms exist that need only O(min(n,m)) space).
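The space bound can be illustrated with a score-only version of global alignment: each row of the table depends only on the previous row, so two rows of length min(n,m)+1 are enough. This Python sketch computes the score only; recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's, which is not shown. Scoring values and the function name are assumptions.

```python
def global_score_linear_space(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment score using O(min(n, m)) space (score only)."""
    if len(t) > len(s):
        s, t = t, s                      # scoring is symmetric, keep t the shorter
    prev = [j * gap for j in range(len(t) + 1)]      # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)              # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # diagonal
                          curr[j - 1] + gap,   # left
                          prev[j] + gap)       # up
        prev = curr                      # the older row is no longer needed
    return prev[len(t)]


print(global_score_linear_space("cde", "cxde"))
```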

Ends free alignment
INPUT: Two sequences S and T (possibly of different length).
QUESTION: Find one of the best alignments between subsequences of S and T when at least one of these subsequences is a prefix of the original sequence and one (not necessarily the other) is a suffix.
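One common way to realize ends-free alignment, sketched here in Python under assumed scoring values, is to leave row 0 and column 0 of the table at 0 (leading spaces cost nothing) and to take the best value over the last row and last column (trailing spaces cost nothing). This is an illustration of the idea rather than the lecture's own formulation, and the function name is hypothetical.

```python
def ends_free_score(s, t, match=2, mismatch=-1, gap=-1):
    """Ends-free (semi-global) alignment score: end gaps are not penalized."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay 0, so leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i][j - 1] + gap,
                          V[i - 1][j] + gap)
    # Trailing spaces are free: best cell in the last row or last column.
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


# A suffix of the first sequence overlaps a prefix of the second.
print(ends_free_score("xxxcde", "cdeyyy"))
```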

Ends free alignment
[Figure: the dynamic-programming matrix, with sides labelled m and n.]

Gap Alignment
Definition: A gap is a maximal contiguous run of spaces in a single sequence within a given alignment. The length of a gap is the number of indel operations in it. A gap penalty function measures the cost of a gap as a (nonlinear) function of its length.
Gap penalty
INPUT: Two sequences S and T (possibly of different length).
QUESTION: Find one of the best alignments between the two sequences using the gap penalty function.
Affine gap: \( W_{total} = W_g + q W_s \), where
W_g: weight to open the gap
W_s: weight to extend the gap
q: length of the gap
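A small worked example of the affine formula, in Python. The weights 12 (open) and 2 (extend) are illustrative assumptions; the point is that one long gap is cheaper than several short ones of the same total length.

```python
def affine_gap_penalty(q, w_open=12, w_extend=2):
    """W_total = W_g + q * W_s for a gap of length q (0 if there is no gap)."""
    if q <= 0:
        return 0
    return w_open + q * w_extend


one_gap_of_three = affine_gap_penalty(3)          # 12 + 3*2 = 18
three_gaps_of_one = 3 * affine_gap_penalty(1)     # 3 * (12 + 2) = 42
print(one_gap_of_three, three_gaps_of_one)
```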

What Kind of Alignment to Use?
* The same protein from different organisms.
* Two different proteins sharing the same function.
* A protein domain against a database of complete proteins.
...

M. Dayhoff Scoring Matrices
Point Accepted Mutations, or PAM matrices.
Proteins with 85% identity were used -> the function is not significantly changed -> the mutations are "accepted".
PAM units: a measure of the evolutionary distance between two amino acid sequences.
One PAM unit: S1 has converted (mutated) to S2 with an average of one accepted point-mutation event per 100 amino acids.

PAM250

\( p_a = N_a / N \): probability of occurrence of amino acid 'a' over a large, sufficiently varied data set; \( \sum_a p_a = 1 \).
\( f_{ab} \): the number of times the mutation \( a \leftrightarrow b \) was observed to occur.
\( f_a = \sum_{b \neq a} f_{ab} \): the total number of mutations in which 'a' was involved.
\( f = \sum_a f_a \): the total number of amino acid occurrences involved in mutations.
1) Probability matrix
2) Scoring matrix

M: 20x20 probability matrix. \( M_{ab} \) is the probability of amino acid 'a' changing into 'b'.
\( m_a = \frac{f_a}{f} \cdot \frac{1}{100\, p_a} \): relative mutability of amino acid 'a'. It is the probability that the given amino acid will change in the evolutionary period of interest.
Assumptions:
(a) 1 in 100 amino acids on average is changed.
(b) Mutations are position independent.
(c) Mutations are independent of their past.

\( M_{aa} = 1 - m_a \): the probability of 'a' remaining unchanged.
\[ M_{ab} = \Pr(a \to b) = \Pr(a \to b \mid a \text{ changed}) \, \Pr(a \text{ changed}) = \frac{f_{ab}}{f_a} m_a \]
Easy to see:
\[ \sum_b M_{ab} = M_{aa} + \sum_{b \neq a} \frac{f_{ab}}{f_a} m_a = 1 - m_a + \frac{m_a}{f_a} \sum_{b \neq a} f_{ab} = 1 \]
\[ \sum_a p_a M_{aa} = 0.99 \]
-> on average 1 mutation every 100 positions.
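The construction of M can be checked numerically on a toy case. The Python sketch below uses a hypothetical three-letter alphabet with made-up frequencies p and symmetric mutation counts f (none of these numbers come from Dayhoff's data), builds M from the definitions above, and confirms that each row sums to 1 and that \( \sum_a p_a M_{aa} = 0.99 \).

```python
def pam1_matrix(p, f):
    """Build the 1-PAM probability matrix M from frequencies p and counts f."""
    letters = sorted(p)
    # f_a: total number of mutations in which 'a' was involved
    f_a = {a: sum(f[a][b] for b in letters if b != a) for a in letters}
    f_total = sum(f_a.values())
    # m_a: relative mutability of 'a'
    m = {a: f_a[a] / (100 * f_total * p[a]) for a in letters}
    M = {a: {} for a in letters}
    for a in letters:
        for b in letters:
            if a == b:
                M[a][b] = 1 - m[a]                      # 'a' stays unchanged
            else:
                M[a][b] = (f[a][b] / f_a[a]) * m[a]     # 'a' changes into 'b'
    return M


# Toy data (assumed for illustration only): frequencies and symmetric counts.
p = {"A": 0.5, "B": 0.3, "C": 0.2}
f = {"A": {"B": 30, "C": 10},
     "B": {"A": 30, "C": 20},
     "C": {"A": 10, "B": 20}}

M = pam1_matrix(p, f)
for a in sorted(M):
    print(a, [round(M[a][b], 6) for b in sorted(M)], "row sum:", round(sum(M[a].values()), 12))
print("unchanged on average:", round(sum(p[a] * M[a][a] for a in p), 12))
```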

What is the probability that 'a' mutates into 'b' in two PAM units of evolution?
a -> c -> b or a -> d -> …
\[ \sum_c M_{ac} M_{cb} = M^2_{ab} \]
-> \( M^2, M^3, M^4, \dots, M^{250}, \dots \)
As \( k \to \infty \), \( M^k \) converges to a matrix with identical rows: \( M^k_{ac} \to p_c \). No matter what amino acid you start with, after a long period of evolution the resulting amino acid will be 'c' with probability \( p_c \).
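The convergence of \( M^k \) can be seen on a toy two-letter matrix. In the Python sketch below the matrix and the frequencies (0.8, 0.2) are assumptions picked so that the matrix is a valid 1-PAM step for those frequencies; it is not a real PAM matrix. Plain repeated multiplication is used to keep the example short.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """M raised to the k-th power by repeated multiplication (k >= 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, M)
    return result


# Toy 1-PAM matrix for two letters with frequencies p = (0.8, 0.2).
p = [0.8, 0.2]
M1 = [[0.99375, 0.00625],
      [0.025,   0.975]]

M2 = mat_pow(M1, 2)          # two PAM units: sum over intermediate letters
M250 = mat_pow(M1, 250)      # the analogue of PAM250
M_limit = mat_pow(M1, 2000)  # rows approach p

print(M2)
print(M250)
print(M_limit)
```

In this toy case the rows of the 2000th power agree with p to many decimal places, while the 250th power is already close but still depends slightly on the starting letter.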

PAM-k matrix
\( \text{PAM-}k_{ab} = M^k_{ab} / p_b \): the likelihood (odds) ratio that a pair 'ab' is a mutation as opposed to being a random occurrence.
\[ \frac{M_{ab}}{p_b} = \frac{(f_{ab}/f_a)\, m_a}{p_b} = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{100\, f\, p_a\, p_b} = \frac{f_{ab}}{100\, f\, p_a\, p_b} = \frac{M_{ba}}{p_a} \]
The total alignment score is the product of the \( \text{PAM-}k_{ab} \) values.
To avoid accuracy problems: \( \text{PAM-}k_{ab} = 10 \log \left( M^k_{ab} / p_b \right) \) -> the total alignment score is the sum of the \( \text{PAM-}k_{ab} \) values.
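A short Python sketch of the scoring step, using a toy two-letter matrix and frequencies that are assumptions for illustration. Base-10 logarithms are assumed for the "10 log" on the slide. The sketch checks that the odds ratio is symmetric and that summing log-odds scores equals taking the log of the product of odds ratios.

```python
import math

# Toy 1-PAM matrix and frequencies (assumed values, not real PAM data).
p = [0.8, 0.2]
M = [[0.99375, 0.00625],
     [0.025,   0.975]]


def odds(a, b):
    """Odds ratio M_ab / p_b for the aligned pair (a, b)."""
    return M[a][b] / p[b]


def log_odds(a, b):
    """Score entry 10 * log10(M_ab / p_b); base 10 is an assumption."""
    return 10 * math.log10(odds(a, b))


def alignment_score(x, y):
    """Sum of log-odds scores over the aligned columns of x and y."""
    return sum(log_odds(a, b) for a, b in zip(x, y))


x = [0, 0, 1, 1]
y = [0, 1, 1, 0]
product_of_odds = math.prod(odds(a, b) for a, b in zip(x, y))
print(alignment_score(x, y), 10 * math.log10(product_of_odds))
```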

BioPerl “Bioperl is a collection of perl modules that facilitate the development of perl scripts for bio-informatics applications.” Bioperl is open source software that is still under active development. Tutorial Documentation

BioPerl
* Accessing sequence data from local and remote databases
* Transforming formats of database/file records
* Manipulating individual sequences
* Searching for "similar" sequences
* Creating and manipulating sequence alignments
* Searching for genes and other structures on genomic DNA
* Developing machine-readable sequence annotations

BioPerl library at TAU BioPerl is NOT yet installed globally on CS network. In each script you should add the following line: use lib "/home/silly6/mol/lib/bioperl";

Sequence Object Seq – stores sequence, identification labels (id, accession number, molecule type = DNA, RNA, Protein, …), multiple annotations and associated “sequence features”.

Seq Objects
* PrimarySeq - light-weight Seq. Stores only sequence and id labels.
* LocatableSeq - Seq object with "start" and "stop" positions. Used by the SimpleAlign object. Created by pSW, Clustalw, Tcoffee or bl2seq.
* LargeSeq - special type of Seq for very long sequences (more than 100 MB).
* LiveSeq - handles locations on a sequence that change over time.
* SeqI - interface for Seq objects.

Sequence Object
    $seq = Bio::Seq->new('-seq'              => 'actgtggcgtcaact',
                         '-desc'             => 'Sample Bio::Seq object',
                         '-display_id'       => 'something',
                         '-accession_number' => 'accnum',
                         '-moltype'          => 'dna' );
Usually Seq is not created this way.

Sequence Object
    $seqobj->display_id();        # the human-readable id of the sequence
    $seqobj->seq();               # string of sequence
    $seqobj->subseq(5,10);        # part of the sequence as a string
    $seqobj->accession_number();  # when there, the accession number
    $seqobj->moltype();           # one of 'dna','rna','protein'
    $seqobj->primary_id();        # a unique id for this sequence regardless
                                  # of its display_id or accession number

Accessing a Database
    use Bio::DB::GenBank;
    $gb = Bio::DB::GenBank->new();
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    $seq2 = $gb->get_Seq_by_acc('AF303112');
    $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF ) ]);
Databases: genbank, genpept, swissprot and gdb.

Seq module
    use Bio::DB::GenBank;
    $gb = new Bio::DB::GenBank();
    $seq1 = $gb->get_Seq_by_acc('AF303112');
    $seq2 = $seq1->trunc(1,90);      # first 90 bases
    print $seq2->seq(), "\n";
    $seq3 = $seq2->translate;        # translate to protein
    print $seq3->seq(), "\n";
Output:
    ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAAGCTAGCATCTAAGGATGTGGGCGGAGGGTCAGATCCATAC
    MEPKQGYLLVKLIEARKLASKDVGGGSDPY

SeqIO object
    $seq = $gb->get_Seq_by_acc('AF303112');
    $out = Bio::SeqIO->new('-file' => ">f.fasta", '-format' => 'Fasta');
    $out->write_seq($seq);
SeqIO can read/write/transform data in the following formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, ACE, BSML

Transforming Sequence Files
    $in  = Bio::SeqIO->new('-file' => "f.fasta", '-format' => 'Fasta');
    $out = Bio::SeqIO->new('-file' => ">f.embl", '-format' => 'EMBL');
    $out->write_seq($in->next_seq());
    # for several sequences
    while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); }
    # better: tied filehandles created with newFh
    $in  = Bio::SeqIO->newFh('-file' => "f.fasta", '-format' => 'Fasta');
    $out = Bio::SeqIO->newFh('-file' => ">f.embl", '-format' => 'EMBL');
    # Fasta<->EMBL format converter:
    print $out $_ while <$in>;

BioPerl: Pairwise Sequence Alignment
    use Bio::Tools::pSW;
    $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
                                    '-gap'    => 12,
                                    '-ext'    => 2, );
    $factory->align_and_show($seq1, $seq2, STDOUT);
Smith-Waterman algorithm. Currently works only on protein sequences.

Alignment Objects
SimpleAlign handles multiple alignments of sequences.
    # pSW module
    $aln = $factory->pairwise_alignment($seq1, $seq2);
    foreach $seq ( $aln->eachSeq() ) {
        print $seq->seq(), "\n";
    }
    $alnout = Bio::AlignIO->new(-format => 'fasta', -fh => \*STDOUT);
    $alnout->write_aln($aln);

Homework
Write a CGI script (using Perl) that performs pairwise local/global alignment for DNA sequences. All I/O is via HTML only.
Input:
1. Choice of local/global alignment.
2. Two sequences: text boxes or two accession numbers.
3. Values for match, mismatch, ins/dels.
4. Number of iterations for computing random scores.
Output:
1. Alignment score.
2. z-score value (z = (score - average) / standard deviation).
Remarks:
1) You are allowed to use only linear space.
2) To compute the z-score, perform random shuffling:
    srand(time | $$);   # init; $$ is the process id
    int(rand($i));      # returns a random integer in [0, $i-1]
3) Shuffling is done in non-overlapping windows of 10 bases. The number of shuffles for each window is random in [0,10].
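The z-score computation in the remarks can be sketched as follows. This is a simplified Python illustration, not a solution to the assignment: each window is shuffled once per iteration (the random number of shuffles per window is left out), and `identity_count` is a hypothetical stand-in for a real alignment score. All function names are assumptions.

```python
import random
import statistics


def shuffle_in_windows(seq, window=10, rng=random):
    """Shuffle the bases inside each non-overlapping window of `window` bases."""
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


def z_score(score_fn, s, t, iterations=100, window=10, seed=0):
    """z = (score - average) / standard deviation over shuffled versions of t."""
    rng = random.Random(seed)
    observed = score_fn(s, t)
    random_scores = [score_fn(s, shuffle_in_windows(t, window, rng))
                     for _ in range(iterations)]
    average = statistics.mean(random_scores)
    deviation = statistics.pstdev(random_scores)
    if deviation == 0:
        return 0.0   # every shuffle scored the same; z is undefined, report 0
    return (observed - average) / deviation


def identity_count(s, t):
    """Hypothetical stand-in scorer: number of positions with equal bases."""
    return sum(1 for a, b in zip(s, t) if a == b)


seq = "ACGTACGTAC" * 3
z = z_score(identity_count, seq, seq)
print(z)
```

Shuffling inside windows keeps the local base composition of the sequence, so the random scores reflect composition rather than order. A large positive z then suggests that the observed score is not explained by composition alone.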

BioPerl: Multiple Sequence Alignment
    … = ('ktuple' => 2, 'matrix' => 'BLOSUM');
    $factory = …
    $aln = …
    foreach $seq ( $aln->eachSeq() ) {
        print $seq->seq(), "\n";
    }