Aligning sequences and searching databases

Slides:

Advertisements

Similar presentations

1 Lesson 2 Aligning sequences and searching databases.

Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf

BLAST Sequence alignment, E-value & Extreme value distribution.

Measuring the degree of similarity: PAM and blosum Matrix

DNA sequences alignment measurement

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

Lecture 8 Alignment of pairs of sequence Local and global alignment

Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.

1 Exercise: BIOINFORMATIC DATABASES and BLAST. 2 Outline  NCBI and Entrez  Pubmed  Google scholar  RefSeq  Swissprot  Fasta format  PDB: Protein.

Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.

Heuristic alignment algorithms and cost matrices

Introduction to Bioinformatics Algorithms Sequence Alignment.

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

We continue where we stopped last week: FASTA – BLAST

Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.

|| || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG.

Introduction to bioinformatics

1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT.

Sequence similarity.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Sequence homology and alignment

Similar Sequence Similar Function Charles Yan Spring 2006.

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

1 Lesson 3 Aligning sequences and searching databases.

Sequence alignment, E-value & Extreme value distribution

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

An Introduction to Bioinformatics

. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.

CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

BLAST Workshop Maya Schushan June 2009.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.

Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.

Comp. Genomics Recitation 3 The statistics of database searching.

Construction of Substitution Matrices

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Chapter 3 Computational Molecular Biology Michael Smith

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.

Applied Bioinformatics Week 3. Theory I Similarity Dot plot.

Part 2- OUTLINE Introduction and motivation How does BLAST work?

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.

Sequence Alignment.

Construction of Substitution matrices

Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University

Step 3: Tools Database Searching

The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.

Pairwise Sequence Alignment Exercise 2. || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG.

1 Homology and sequence alignment.. Homology Homology = Similarity between objects due to a common ancestry Hund = Dog, Schwein = Pig.

Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

Pairwise Sequence Alignment and Database Searching

Sequence similarity, BLAST alignments & multiple sequence alignments

Basic Local Alignment Search Tool (BLAST)

Bioinformatics Lecture 2 By: Dr. Mehdi Mansouri

Basic Local Alignment Search Tool

BLAST Slides adapted & edited from a set by

Sequence alignment, E-value & Extreme value distribution

BLAST Slides adapted & edited from a set by

Presentation transcript:

Aligning sequences and searching databases Lesson 3 Aligning sequences and searching databases

Homology Similarity between objects due to a common ancestry

Sequence homology Similarity between sequences that results from a common ancestor VLSPAVKWAKVGAHAAGHG VLSEAVLWAKVEADVAGHG Basic assumption: Sequence homology → similar structure/function

Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

Homology Ortholog – homolog with similar function (via speciation) Paralog – homolog which arose from gene duplication Orthologs – 2 homologs from different species Paralogs – 2 homologs within the same species G G G1,G2

How close? Rule of thumb: Proteins are homologous if over 25% identical (length >100) DNA sequences are homologous if over 70% identical

Twilight zone < 20% identity in proteins – may be homologous and may not be…. (Note that 5% identity will be obtained completely by chance!)

Why sequence alignment? Predict characteristics of a protein – use the structure/function of known proteins for predicting the structure/function of an unknown proteins

Sequence modifications Sequences change in the course of evolution due to random mutations Three types of mutations: Insertion - an insertion of a nucleotide or several nucleotides to the sequence. AAGA AAGTA Deletion – a deletion of a nucleotide (or more) from the sequence. AAGA AGA Substitution – a replacement of a nucleotide by another. AAGA AACA Insertion or Deletion ? -> Indel

Local vs. Global Global alignment: forces alignment in regions which differ Global alignment – finds the best alignment across the entire two sequences. Local alignment – finds regions of similarity in parts of the sequences. ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ Local alignment will return only regions of good alignment ADLG CDRYFQ |||| |||| | ADLG CDRYYQ

When global and when local?

Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

Protein tyrosine kinase domain

Protein tyrosine kinase domain Human PTK2 and leukocyte tyrosine kinase Both function as tyrosine kinases, in completely different contexts Ancient duplication

Global alignment of PTK and LTK X

Local alignment of PTK and LTK

Pairwise alignment AAGCTGAATTCGAA AGGCTCATTTCTGA One possible alignment: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-

AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2 mismatches 4 indels (gap) 10 perfect matches

Choosing an alignment: Many different alignments are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- Which alignment is better?

Alignment scoring - scoring of sequence similarity: Assumes independence between positions Each position is considered separately Scores each position Positive if identical (match) Negative if different (mismatch) or gap (indel) Total score = sum of position scores Can be positive or negative

Example - naïve scoring system: Perfect match: +1 Mismatch: -2 Indel (gap): -1 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Score: = (+1)x10 + (-2)x2 + (-1)x4 = 2 Score: = (+1)x9 + (-2)x2 + (-1)x6 = -1 Higher score  Better alignment

Scoring system: The choice of +1,-2, and -1 scores is quite arbitrary Different scoring systems  different alignments Scoring systems implicitly represent a particular theory of evolution Some mismatches are more plausible Transition vs. Transversion LysArg ≠ LysCys Gap extension ≠ Gap opening

Scoring matrix Representing the scoring system as a table or matrix n  n (n is the number of letters the alphabet contains. n=4 for nucleotides, n=20 for amino acids) symmetric T C G A 2 -6

DNA scoring matrices Uniform substitutions between all nucleotides: T From To 2 -6 Match Mismatch

DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion

Amino-acid scoring matrices Take into account physico-chemical properties

Amino-acid substitutions matrices Actual substitutions: Based on empirical data Commonly used by many bioinformatics programs PAM & BLOSUM

Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences we can score the relation between amino acids based on how frequently they substitute each other M G Y D E M G Y E E M G Y Q E In the fourth column E and D are found in 7 / 8

PAM Matrix - Point Accepted Mutations Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity) Alignment was easy Counted the number of the substitutions per amino-acid pair (20 x 20) Found that common substitutions occurred between chemically similar amino acids

PAM Matrices Family of matrices PAM 80, PAM 120, PAM 250 The number on the PAM matrix represents evolutionary distance Larger numbers are for larger distances

Example: PAM 250 Similar amino acids have greater score

PAM - limitations Based only on a single, and limited dataset Examines proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased

BLOSUM Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

BLOSUM: Blocks Substitution Matrix Based on BLOCKS database ~2000 blocks from 500 families of related proteins Families of proteins with identical function Blocks are short conserved patterns of 3-60 aa without gaps AABCDA----BBCDA DABCDA----BBCBB BBBCDA-AA-BCCAA AAACDA-A--CBCDB CCBADA---DBBDCC AAACAA----BBCCC

BLOSUM Each block represents a sequence alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

BLOSUM Matrices BLOSUMn is based on sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45

Example : Blosum62 derived from block where the sequences share at least 62% identity

More distant sequences PAM vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences

Scoring system = substitution matrix + gap penalty

Gap penalty We penalize gaps Scoring for gap opening & gap extension: Gap-extension penalty < gap-open penalty

Optimal alignment algorithms Needleman-Wunsch (global) Smith-Waterman (local)

Alignment Search Space The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n). If n=100, there are 100100 = 10200 = 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 different alignments! Average protein length is about n=250!

Searching databases

Searching a sequence database Using a sequence as a query to find homologous sequences in a sequence database

Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?

Protein is better! Selection (and hence conservation) works (mostly) on the protein level: CTTTCA = Leu-Ser TTGAGT = Leu-Ser

Query type Nucleotides: a four letter alphabet Amino acids: a twenty letter alphabet Two random DNA sequences will, on average, have 25% identity Two random protein sequences will, on average, have 5% identity

Conclusions Using the amino-acid sequence is preferable for homology search Why use a nucleotide sequence after all? No ORF found, e.g. newly sequenced genome No similar protein sequences were found Specific DNA databases are available (EST)

Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous

How do we search a database? Assume we perform pairwise alignment of the query against all the sequences in the database Exact pairwise alignment is O(mn) ≈ O(n2) (m – length of sequence 1, n – length of sequence 2)

How much time will it take? O(n2) computations per search. Assume n=200, so we have 40,000 computations per search Size of database - ~60 million entries 2.4 x 1012 computations for each sequence search we perform! Assume each computation takes 10-6 seconds  24,000 seconds ≈ 6.66 hours for each sequence search 150,000 searches (at least!!) are performed per day

Conclusion Using the exact comparison pairwise alignment algorithm between query and all DB entries – too slow

Heuristic Definition: a heuristic is a design to solve a problem that does not provide an exact solution (but is not too bad) but reduces the time complexity of the exact solution

BLAST BLAST - Basic Local Alignment and Search Tool A heuristic for searching a database for similar sequences

DNA or Protein All types of searches are possible Query: DNA Protein Database: DNA Protein blastn – nuc vs. nuc blastp – prot vs. prot blastx – translated query vs. protein database tblastn – protein vs. translated nuc. DB tblastx – translated query vs. translated database Translated databases: trEMBL genPept

BLAST - underlying hypothesis The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them The heuristic: Discard irrelevant sequences Perform exact local alignment with remaining sequences

How do we discard irrelevant sequences quickly? Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA) Save the words in a look-up table that can be searched quickly WTD TDF DFG FGY GYP … WTDFGYPAILKGGTAC

BLAST: discarding sequences When the user gives a query sequence, divide it also into words Search the database for consecutive neighbor words

Neighbour words neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level GFC (20) GFB GPC (11) WAC (5)

Search for consecutive words Neighbor word Look for a seed: hits on the same diagonal which can be connected A At least 2 hits on the same diagonal with distance which is smaller than a predetermined cutoff Database record This is the filtering stage – many unrelated hits are filtered, saving lots of time! Query

Perform local alignment Database record Query

Try to extend the alignment Stop extending when the score of the alignment drops X beneath the maximal score obtained so far Discard segments with score < S ASKIOPLLWLAASFLHNEQAPALSDAN JWQEOPLWPLAASOIHLFACNSIFYAS Score=15 Score=17 Score=14 X=4

The result – local alignment The result of BLAST will be a series of local alignments between the query and the different hits found

Theoretically, we could trust any result with an E-value ≤ 1 The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database Theoretically, we could trust any result with an E-value ≤ 1 In practice – BLAST uses estimations. E-values of 10-4 and lower indicate a significant homology. E-values between 10-4 and 10-2 should be checked (similar domains, maybe non-homologous). E-values between 10-2 and 1 do not indicate a good homology

Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high score of alignment, BUT – this does not indicate homology

Solution In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in query)