Alignment of biological sequences


Bioinformatics: Alignment of biological sequences. UL, 2017, Juris Viksna

Topics. Short review of sequence comparison: biological motivation for comparing sequences; sequence similarity criteria; the basic DP algorithm for computing the distance between sequences. Global and local sequence comparisons: similarity matrices and gap penalties; modified algorithms that use gap penalties; local sequence comparison. Similarity matrices: how to obtain them; relations between similarity matrices and sequence evolution; suitability of matrices for specific sequences.

Comparison of biological sequences. Comparison of two sequences (pairwise alignment): formulation of the problem; DP algorithm (match = 1, mismatch = 1, gap = 2); global and local comparisons; affine gap penalties; similarity matrices. Multiple alignment: formulation of the problem (SOP); star alignment; relation to phylogenetic trees, progressive alignment. Sequence classification: profiles and motifs; profile matrices; HMMs (Hidden Markov Models).

Why do we need to compare sequences? The genome is already sequenced (assume...). There are methods that predict DNA coding regions (genes). What are the biological functions of these genes? We can find out which protein (sequence) a gene encodes, but we still do not know what this protein does... However, we can search for known proteins with similar sequences whose functions are known. We want to find out something about proteins in humans. The best approach is “experimental”, but tricky with humans... But we can try to use a similar protein (e.g. in mice) and start our experiments with it.

Basic assumptions. We will consider proteins and RNA/DNA simply as sequences over 20-letter and 4-letter alphabets, respectively. Aims of comparison: to find out how similar the sequences are (some similarity measure); to find a “common motif” of the sequences (alignment). Regarding algorithmic complexity there are two distinct cases: comparison of two sequences (relatively easy); simultaneous comparison of n sequences (complexity grows exponentially with n). In this lecture we will consider the problem of comparing two sequences.

Nucleotides and DNA [Watson, Crick 1953] For us DNA is a sequence in 4 letter alphabet [Adapted from Y.Guo]

Proteins For our purposes we will treat proteins as sequences in 20 symbol alphabet [Adapted from R.Shamir]

From DNA to proteins. Each codon consists of 3 nucleotides. Mutations: substitution (changes a single aa); insertion / deletion: “frame shift” (changes all subsequent aa). NB! An insertion / deletion might be a multiple of 3... “Silent mutation”: DNA changed, but not the aa. “Nonsense mutation”: creates a “stop” codon.

Genetic code. Completely worked out in 1962

Evolution of sequences. Mutations are a natural process of DNA evolution. DNA replication errors: substitutions, insertions and deletions (insertions and deletions together are called indels). Similarity between sequences indicates their common ancestral origin and indicates similarity of biological functions. Well, this is of course a simplification: the change of protein function will determine whether the organism will have offspring and the changed gene will survive. Protein sequence similarity is closely associated with similarity of DNA coding regions.

Sequence evolution. Each codon consists of 3 nucleotides. Mutations: substitution (changes a single aa); insertion / deletion: “frame shift” (changes all subsequent aa). NB! An insertion / deletion might be a multiple of 3... “Silent mutation”: DNA changed, but not the aa. “Nonsense mutation”: creates a “stop” codon.

Sequence evolution ggcatt agcatt agcata agcatg agccta aggatt gacatt

Sequence homology Homologs - evolved from the common ancestor Orthologs - the same function in different organisms Paralogs - similar function in the same organism

Orthologs vs paralogs [Adapted from R.Shamir]

How to compare sequences? Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to assess their similarity?

Sequence alignment - BLAST

Sequence alignment - BLAST

Sequence alignment – the results we expect

Sequence alignment - SSEARCH

Sequence alignment - SSEARCH

Sequence alignment - scores. Sequence similarity/identity (%): this is well-defined for aligned sequence parts. “Score”: usually very method-specific in absolute value. p-value: the probability that an alignment with the given score or higher is found by chance; normally the given values are only approximations. Expect (E)-value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the better, but short similar sequences might have comparatively high values (the E-value decreases exponentially with the score). Z-score: the number of standard deviations from the mean value.

Z-score

Z-score

How to align two sequences - BLAST. Find two regions of exact similarity (usually 4 aa each). Try to join and extend these matches until the score falls below a threshold. Anyway, how should we do this “correctly”?

The “Manhattan Tourist” problem Visit as many sights as possible starting from top-left corner and moving just down or right

Longest common subsequence. Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such a C need not be unique). Example: A = GGATATCGGGCGAT, B = ATTCCCCCGCCCTA, C = ATTCGCA or ATTCGCT. How can we find it?

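LCS – dynamic programming solution. Let $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$, and let $c(i,k)$ be the length of an LCS of $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$. Then
\[
c(i,k) =
\begin{cases}
0, & \text{if } i = 0 \text{ or } k = 0,\\
c(i-1,k-1) + 1, & \text{if } i, k > 0 \text{ and } a_i = b_k,\\
\max\{c(i,k-1),\, c(i-1,k)\}, & \text{if } i, k > 0 \text{ and } a_i \ne b_k.
\end{cases}
\]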
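The recurrence above can be sketched directly in Python. This is a minimal illustration, not the lecture's own code: the function name `lcs` and the tie-breaking rule in the traceback are assumptions, and when several longest subsequences exist it returns only one of them.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # c[i][k] = length of an LCS of a[:i] and b[:k]; row 0 and column 0 stay 0.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])

    # Trace back from the bottom-right cell to recover one LCS.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i, k = i - 1, k - 1
        elif c[i - 1][k] >= c[i][k - 1]:  # arbitrary tie-break (assumption)
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


# The two example pairs from the slides both have an LCS of length 7.
print(len(lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA")))
print(len(lcs("GADTAMAWGRAMMA", "GAGAWKIAMM")))
```

Filling the table takes time and memory proportional to nm; the traceback adds at most n + m steps.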

LCS – example A = GADTAMAWGRAMMA B = GAGAWKIAMM

LCS - example [Figure: DP table for A = GADTAMAWGRAMMA and B = GAGAWKIAMM, filled in step by step until the LCS length reaches 7]
LCS: GAAWAMM
Alignment:
GA-DTAMAW--GRAMMA
GAG----AWKI--AMM-

Edit distance (Levenshtein, 1966). The minimal number of operations that transforms one sequence into another; the operations are insert, delete and substitute (1 symbol each). The edit distance is 0 (the sequences are identical) or positive. For example “AIMS” & “AMOS” (distance = 2 for all three solutions): AIMS / AMOS; AIM-S / A-MOS; AIMS / AMOS [Adapted from D.Gilbert]

Edit distance. Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A to B. Example: A = GGATATCGGGCGAT, B = ATTCCCCCGCCCTA: [G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A]. ED = 12? How can we find it?

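Edit distance. Let $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$, and let $e(i,k)$ be the edit distance between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$. Then
\[
e(i,k) =
\begin{cases}
i, & \text{if } k = 0,\\
k, & \text{if } i = 0,\\
e(i-1,k-1), & \text{if } i, k > 0 \text{ and } a_i = b_k,\\
\min\{e(i-1,k-1),\, e(i,k-1),\, e(i-1,k)\} + 1, & \text{if } i, k > 0 \text{ and } a_i \ne b_k.
\end{cases}
\]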
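A minimal Python sketch of the edit distance recurrence follows. The function name `edit_distance` is illustrative, and the code fills the full table, so it uses memory proportional to nm.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (unit-cost insert/delete/substitute)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i          # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k          # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1], e[i][k - 1], e[i - 1][k])
    return e[n][m]


print(edit_distance("AIMS", "AMOS"))  # the slide example, distance 2
```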

ED - modifications. A more general recurrence uses a gap cost $t$ and a substitution cost $t(a_i, b_j)$:
\[
e(i,0) = i, \qquad e(0,j) = j, \qquad
e(i,j) = \min
\begin{cases}
e(i-1,j) + t,\\
e(i,j-1) + t,\\
e(i-1,j-1) + t(a_i, b_j).
\end{cases}
\]
For ED: $t = 1$, $t(a_i,b_j) = 0$ if $a_i = b_j$ and $t(a_i,b_j) = 1$ if $a_i \ne b_j$. For «inverse» LCS: $t = 0$, $t(a_i,b_j) = 1$ if $a_i = b_j$ and $t(a_i,b_j) = 0$ if $a_i \ne b_j$. If you are interested in the result «up to a sign», it does not matter whether min or max is used: min is more natural for ED, max for LCS, and max is also the usual choice for sequence comparison. $t_{ij}$: the probability that aa $a_i$ changes to aa $b_j$.
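The generalized recurrence can be sketched as one function that is parameterized by the gap cost, the substitution cost and the choice of min or max. The name `dp_score` and its parameters are illustrative assumptions; the boundary values are taken as `i * gap`, which coincides with e(i,0) = i when the gap cost is 1.

```python
from typing import Callable


def dp_score(a: str, b: str, gap: int,
             subst: Callable[[str, str], int],
             best: Callable[..., int] = min) -> int:
    """Generic pairwise DP: `best` is min (distance) or max (similarity)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * gap
    for j in range(1, m + 1):
        e[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = best(
                e[i - 1][j] + gap,                              # gap in b
                e[i][j - 1] + gap,                              # gap in a
                e[i - 1][j - 1] + subst(a[i - 1], b[j - 1]),    # (mis)match
            )
    return e[n][m]


# Edit distance: unit gap cost, unit mismatch cost, minimize.
ed = dp_score("AIMS", "AMOS", gap=1,
              subst=lambda x, y: 0 if x == y else 1, best=min)

# LCS length: free gaps, one point per match, maximize.
lcs_len = dp_score("GADTAMAWGRAMMA", "GAGAWKIAMM", gap=0,
                   subst=lambda x, y: 1 if x == y else 0, best=max)
print(ed, lcs_len)
```

Passing a substitution matrix lookup as `subst` and `best=max` turns the same loop into a global similarity computation.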

Substitution (similarity) matrices
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C    0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H   -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Similarity matrix. Most popular: PAM, BLOSUM, Gonnet. The one shown is BLOSUM62 (almost :) "Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than random ones, and < 0 for those less frequent than random ones.

Sequence similarity as the longest path problem. We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest/smallest weight in a directed acyclic graph.

Complexity of similarity computation. Size of matrix: nm. Computing the value for each cell: constant time. Total time: Θ(nm). Total memory: Θ(nm). Notice that if we want just the score, only two rows are needed; in this case the required memory is Θ(min(n, m)). However, what if we also need the alignment (and we usually do)?
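The two-row observation can be illustrated with a score-only edit distance. This is a sketch under the assumption of unit costs; the name `edit_distance_two_rows` is hypothetical. Recovering the alignment itself in linear space needs an extra idea (commonly attributed to Hirschberg's divide-and-conquer scheme), which this sketch does not implement.

```python
def edit_distance_two_rows(a: str, b: str) -> int:
    """Edit distance using only two rows of the DP table."""
    if len(b) > len(a):
        a, b = b, a                      # keep the shorter sequence along the row
    prev = list(range(len(b) + 1))       # row 0: e(0, k) = k
    for i, x in enumerate(a, start=1):
        cur = [i] + [0] * len(b)         # e(i, 0) = i
        for k, y in enumerate(b, start=1):
            if x == y:
                cur[k] = prev[k - 1]
            else:
                cur[k] = 1 + min(prev[k - 1], cur[k - 1], prev[k])
        prev = cur                       # the older row is no longer needed
    return prev[len(b)]


print(edit_distance_two_rows("AIMS", "AMOS"))
```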

Edit distance in linear space?

Interpretation of comparison results Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).

Interpretation of comparison results

Needleman-Wunsch algorithm

Global and local alignments

Global and local alignments. Using LCS, the best local alignment will have the same score as the best global alignment (however, the alignment might be «better»). Using ED, the best local alignments are likely to be for sequences of length 1. Local comparison/alignment «starts to work» if the scoring is somewhere between the two above: there are extra points for each match and penalty points for each mismatch.

Computing local alignments. Just allow a «free ride» from each node to the top-left vertex

Computing local alignments

Computing local alignments. In this case the global alignment has a better score, but misses the «conserved domain»

Global and local comparisons. GLOBAL: the best alignment of the entirety of both sequences. For the optimum global alignment, we want the best score in the final row or final column. Are these sequences generally the same? Needleman-Wunsch: find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity. LOCAL: the best alignment of segments, without regard to the rest of the sequence. For the optimum local alignment, we want the best score anywhere in the matrix. Do these two sequences contain high-scoring subsequences? Smith-Waterman: find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score [Adapted from R Altman]
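The difference between the two modes can be sketched in a few lines of Python. The scoring values (match = +1, mismatch = -1, gap = -2) and the function names are assumptions chosen for illustration; the global variant here reads the bottom-right cell, and the local variant clamps every cell at 0 (the «free ride») and keeps the best value seen anywhere in the matrix.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Needleman-Wunsch style score with a linear gap penalty."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    return s[n][m]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Smith-Waterman style score: restart at 0, take the best cell."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0,
                          s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
            best = max(best, s[i][j])
    return best


# A shared "GGG" core inside otherwise unrelated sequences.
print(global_score("AAAGGGTTT", "CCCGGGCCC"), local_score("AAAGGGTTT", "CCCGGGCCC"))
```

With these scores the local result is never below the global one, because the empty alignment (score 0) and every full alignment are both candidates for the local maximum.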

Alignment with gap penalties

Alignment with gap penalties

Finding gapped alignments. Add yet another «gap» edge between each vertex and each of its «gap predecessors». However, there are Θ(nm(m+n)) of them.

Finding gapped alignments

Finding gapped alignments

Finding gapped alignments

Finding local gapped alignments [Adapted from M.Craven]

Gap penalties in the general case. The computation requires O(n^3) time, though... [Adapted from M.Craven]
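A sketch of the general case: every cell looks back along its whole row and column, one candidate per possible gap length, which gives the cubic running time. The names `global_score_general_gaps`, `subst` and `gap_penalty` are hypothetical, and the penalty function is an arbitrary user-supplied cost for a gap of length k.

```python
from typing import Callable


def global_score_general_gaps(a: str, b: str,
                              subst: Callable[[str, str], float],
                              gap_penalty: Callable[[int], float]) -> float:
    """Global alignment score where a gap of length k costs gap_penalty(k)."""
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                best = s[i - 1][j - 1] + subst(a[i - 1], b[j - 1])
            for k in range(1, i + 1):        # gap of length k aligned to a[i-k:i]
                best = max(best, s[i - k][j] - gap_penalty(k))
            for k in range(1, j + 1):        # gap of length k aligned to b[j-k:j]
                best = max(best, s[i][j - k] - gap_penalty(k))
            s[i][j] = best
    return s[n][m]


def simple(x, y):
    return 1 if x == y else -1


# Linear penalty 2k versus an affine-style penalty 2 + k on the same input.
print(global_score_general_gaps("AAAATTTT", "AAAA", simple, lambda k: 2 * k))
print(global_score_general_gaps("AAAATTTT", "AAAA", simple, lambda k: 2 + k))
```

For affine penalties specifically, the extra inner loops can be avoided with additional DP tables, bringing the time back to the order of nm; that refinement is not shown here.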

Substitution matrices. Margaret Dayhoff (1925-1983). First woman in the field of Bioinformatics

Substitution matrices. Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.

Frequencies (probabilities) of amino acids
aa   1978   1991
L    0.085  0.091
A    0.087  0.077
G    0.089  0.074
S    0.070  0.069
V    0.065  0.066
E    0.050  0.062
T    0.058  0.059
K    0.081  0.059
I    0.037  0.053
D    0.047  0.052
R    0.041  0.051
P    0.051  0.051
N    0.040  0.043
Q    0.038  0.041
F    0.040  0.040
Y    0.030  0.032
M    0.015  0.024
H    0.034  0.023
C    0.033  0.020
W    0.010  0.014

Substitution matrices as mutation probabilities Therefore: score  1 (or log score  0) the better score, the better alignment

Substitution matrices as mutation probabilities Currently we assume that there are no gaps in alignments [Adapted from M.Craven]

Substitution matrices as mutation probabilities [Adapted from M.Craven]

Substitution matrices as mutation probabilities This gives extra ”scoring points” for each matched symbol. [Adapted from M.Craven]

PAM matrices [Adapted from M.Craven]

PAM matrices [Adapted from M.Craven]

PAM matrices [Adapted from M.Craven]

PAM matrices. A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues. The most frequently used matrix is PAM250, obtained from PAM1 by matrix multiplication...
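The "matrix multiplication" step can be illustrated on a toy example. The 3-letter alphabet and the probabilities in `toy_pam1` below are invented for illustration only (they are not Dayhoff's values); the point is that raising a one-step mutation probability matrix to the 250th power gives the 250-step probabilities, while each row still sums to 1.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical one-step matrix: each letter stays unchanged with probability 0.99.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(v, 3) for v in row])
```

In the real construction the probability matrix is afterwards converted to log-odds scores against the background amino acid frequencies; that step is omitted in this sketch.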

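PAM matrices - problems. The “common ancestor” is actually unknown. [Figure: an ancestor sequence and two descendant sequences Qh and Qm, separated by T years]

PAM matrices - problems. Evolutionary distances will be different for different pairs... [Figure: ancestor and descendant sequences, with distances measured in PAMs]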

PAM-250

BLOSUM matrices. BLOSUM62 is the BLAST default [Adapted from M.Craven]

How to align two sequences - BLAST. Find two regions of exact similarity (usually 4 aa each). Try to join and extend these matches until the score falls below a threshold.

BLOSUM matrices

Relationship between PAM & BLOSUM

Substitution matrices. The best matrix depends on the evolutionary distance. In general, the similarity score should be proportional to the logarithm of the probability of having a common ancestor. The exact «right procedure» for computing the matrices is not trivial.
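One common way to make the "logarithm of probability" statement concrete is the log-odds form below. The symbols are assumptions introduced here, not notation from the slides: $q_{ab}$ is the probability of seeing residues $a$ and $b$ aligned in related sequences, $p_a$ and $p_b$ are the background frequencies, and $\lambda$ is a scaling constant.

```latex
\[
s(a,b) \;=\; \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b}
\]
```

Under this form a pair that occurs more often in related sequences than expected by chance ($q_{ab} > p_a p_b$) gets a positive score, and summing scores along an ungapped alignment corresponds to multiplying the per-position likelihood ratios.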

Protein evolution rates. Mutation frequencies are fairly stable, but can still differ between groups of proteins: fibrinopeptides > hemoglobin > cytochrome > histones. For longer proteins, mutation rates might differ between sequence regions.

Protein evolution rates

Substitution matrices - problems. If we observe a substitution a → b between two sequences, this could actually mean: a → b; a → x → b; a → x → y → b; ... As a result, the computed probabilities will not be “exactly right"...

Nucleotide substitution matrices. These probably describe the mutation processes better, but not the mutations that could survive during evolution. They tend to be much simpler, since they cannot reflect the role of a specific nucleotide position.

What about alignment scores? p-value: the probability that an alignment with the given score or higher is found by chance. E-value: the average number of alignments with the given score or higher. Assuming all match probabilities to be equal to $p$, the p-value can be computed using the fact that the probability of having $k$ matches out of $n$ is equal to $\binom{n}{k} p^k (1-p)^{n-k}$. Such probabilities correspond to the binomial distribution. The E-value can be derived from the p-value and the database size.
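Under the simplifying assumption stated above (independent positions, equal match probability p), the p-value is a binomial tail sum. The sketch below is illustrative: the function names are hypothetical, and `e_value` uses the rough approximation "p-value times number of database entries", which is an assumption rather than the formula used by any particular search tool.

```python
from math import comb


def binomial_p_value(n: int, k: int, p: float) -> float:
    """Probability of at least k matches among n positions, match probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def e_value(p_value: float, database_size: int) -> float:
    """Rough expected number of chance hits in a database (assumed approximation)."""
    return p_value * database_size


# Example: 30 identical positions out of 100 when a random match has p = 0.05.
pv = binomial_p_value(100, 30, 0.05)
print(pv, e_value(pv, 1_000_000))
```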

Probability distributions. For larger n the binomial distribution can also be well approximated by the normal distribution. Both of them are easy to use to compute p-values. However...

Computing of p-values. However, the situation is made much more complex by: the use of local, not global, alignments; the use of similarity matrices with different probabilities for matches/mismatches; the use of gap penalties. Computing the exact distributions becomes unrealistic, yet good approximations exist that deal with all these additional requirements. Still, these rely on the assumption that the probabilities of protein sequences are determined just by the probabilities of amino acids. This may result in p-values being considerably lower than the «real probabilities» of reaching a specific alignment score.

P-values. P(s > S) = .01: a P-value of .01 occurs at the score threshold S (392 below) where the score s from a random comparison is greater than this threshold 1% of the time. Likewise for P = .001 and so on. [Adapted from M.Gerstein]

ROC (Receiver Operating Characteristic) curves. We will consider proteins to be homologs if their similarity exceeds some threshold t. True positives (tp): s(a,b) ≥ t and a, b are homologs. False positives (fp): s(a,b) ≥ t and a, b are not homologs. True negatives (tn): s(a,b) < t and a, b are not homologs. False negatives (fn): s(a,b) < t and a, b are homologs. Sensitivity = tp/n = tp/(tp+fn), where n is the number of homologous pairs. Specificity = tn/n = tn/(tn+fp), where n is the number of non-homologous pairs.
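The four counts and the two ratios can be computed as follows. The scores and homology labels in the example are made up for illustration, and the function names are hypothetical; sweeping the threshold t over a range of values and plotting the resulting pairs would trace out a ROC curve.

```python
def confusion_counts(scores, is_homolog, t):
    """Count tp, fp, tn, fn for the rule: predict 'homolog' when score >= t."""
    tp = fp = tn = fn = 0
    for s, h in zip(scores, is_homolog):
        if s >= t:
            if h:
                tp += 1
            else:
                fp += 1
        else:
            if h:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Hypothetical similarity scores and known homology labels.
scores = [50, 40, 30, 20, 10]
labels = [True, True, False, True, False]
tp, fp, tn, fn = confusion_counts(scores, labels, t=25)
print(sensitivity(tp, fn), specificity(tn, fp))
```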

ROC curves

ROC curves. [Figure: coverage (roughly, the fraction of sequences that one confidently “says something” about; sensitivity = tp/n = tp/(tp+fn), up to 100%) plotted against error rate (the fraction of the “statements” that are false positives; error rate = 1 - specificity = fp/n, with specificity = tn/n = tn/(tn+fp)). Points along each curve correspond to different score thresholds (Thresh = 10, 20, 30). Two “methods” are shown; red is more effective.] [Adapted from M.Gerstein]

Homology (>10%) clusters of CATH 2

Multiple sequence alignments. Very similar DP recursions work, but in time Θ(n^N), where N is the number of sequences. This is not tractable for the usual requirements on N. In practice many heuristic methods are used that «work well» but do not guarantee the optimal result.

Time required for sequence comparisons. Smith-Waterman algorithm (1981): local comparisons; linear gap penalties; use of substitution matrices. Is Θ(n^2) time practically acceptable? Protein length: around 300 aa. Comparison of 2 proteins: 100000 op. (0.0001 sec at 1 GHz). Protein database: 1000000 entries. Database search: 10^11 op., 100 sec. Comparison of two databases: 10^17 op., about 3 years :(
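The back-of-the-envelope estimate can be reproduced with a few lines of arithmetic. The assumption of one DP cell per clock cycle at 1 GHz is the same idealization the slide uses; the variable names are illustrative.

```python
protein_len = 300            # residues per protein (slide assumption)
db_entries = 1_000_000       # proteins per database (slide assumption)
ops_per_second = 1e9         # one DP cell per cycle at 1 GHz (idealized)
seconds_per_year = 365 * 24 * 3600

pair_ops = protein_len ** 2                  # one pairwise comparison
search_ops = pair_ops * db_entries           # one query against the database
all_vs_all_ops = search_ops * db_entries     # database against database

pair_seconds = pair_ops / ops_per_second
search_seconds = search_ops / ops_per_second
years = all_vs_all_ops / ops_per_second / seconds_per_year

print(pair_ops, pair_seconds)        # roughly 1e5 operations
print(search_ops, search_seconds)    # roughly 1e11 operations
print(all_vs_all_ops, years)         # roughly 1e17 operations, a few years
```

Under these assumptions the all-against-all comparison comes to roughly three years of single-core time, which is the motivation for the heuristic methods that follow.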

Heuristic methods - FASTA

Heuristic methods - FASTA. Hash table of short words in the query sequence. Go through the DB and look for matches in the query hash (linear in the size of the DB). K-tuple determines the word size (k-tup 1 is a single aa). Lipman & Pearson 1985. [Figure: VLICT = _ ; VLICTAVLMVLICTAAAVLICTMSDFFD] [Adapted from M.Gerstein]

The FASTA Algorithm. 4 steps:
1. use a lookup table to find all identities at least ktup long, and find regions of identities;
2. rescan the 10 regions (diagonals) with the highest density of identities using PAM250;
3. join regions if possible without decreasing the score below a threshold;
4. rescore à la Smith-Waterman within 32 residues around the initial region.
(Note: doesn’t save alignment)

FASTA parameters. ktup = 2 for proteins, 6 for DNA. init1: score after rescanning with PAM250 (or other). initn: score after joining regions. opt: score after Smith-Waterman.
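The first step (k-tuple lookup and diagonal counting) can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the function names are hypothetical, and the later rescoring, joining and Smith-Waterman steps are not shown. The example strings are taken from the figure text on the FASTA slide.

```python
from collections import Counter, defaultdict


def ktuple_hits(query: str, target: str, ktup: int = 2):
    """Return (query_pos, target_pos) pairs where a k-tuple matches exactly."""
    table = defaultdict(list)                 # k-tuple -> positions in the query
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    hits = []
    for j in range(len(target) - ktup + 1):   # single linear scan of the target
        for i in table.get(target[j:j + ktup], ()):
            hits.append((i, j))
    return hits


def diagonal_counts(hits):
    """Count hits per diagonal (target_pos - query_pos)."""
    return Counter(j - i for i, j in hits)


hits = ktuple_hits("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD", ktup=2)
print(diagonal_counts(hits).most_common())
```

Diagonals that collect many hits correspond to regions where the query and the target share a run of identities, which is what the second step then rescans with a substitution matrix.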

FASTA algorithm [Fig.1 of Pearson and Lipman 1988]

FASTA algorithm [Adapted from D Brutlag]

Heuristic methods - BLAST. Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410. Indexes the query and the DB. Starts with all overlapping words from the query. Calculates the “neighborhood” of each word using a PAM matrix and a probability threshold. Looks up all words and neighbors from the query in the database index. Extends High Scoring Pairs (HSPs) left and right to maximal length. Finds Maximal Segment Pairs (MSPs) between the query and the database. BLAST 1 does not permit gaps in alignments [Adapted from M.Gerstein]

Heuristic methods - BLAST

BLAST algorithm. Keyword search of all words of length w from the query of length n in a database of length m, with score above a threshold; w = 11 for nucleotide queries, 3 for proteins. Do local alignment extension for each keyword found. Extend the result until the longest match above the threshold is achieved. Running time O(nm) [Adapted from S.Daudenarde]

BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (neighborhood score threshold T = 13): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Extension, giving a High-scoring Pair (HSP):
Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
[Adapted from S.Daudenarde]

Original BLAST. Dictionary: all words of length w. Alignment: ungapped extensions until the score falls below some threshold. Output: all local alignments with score > statistical threshold [Adapted from S.Daudenarde]

Original BLAST: Example. w = 4. Exact keyword match of GGTC. Extend diagonals with mismatches until the score is under 50%. Output result:
GTAAGGTCC
GTTAGGTCC
[Figure: dot matrix of the sequences ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA] From lectures by Serafim Batzoglou (Stanford)
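The seed-and-extend idea can be sketched as below. This is a simplified illustration and not the BLAST implementation: the function names, the +1/-1 scoring and the stopping rule (stop once the running score drops more than `xdrop` below the best seen, instead of the "under 50%" rule on the slide) are all assumptions.

```python
def find_seeds(q: str, s: str, w: int):
    """Return (q_pos, s_pos) for every exact word match of length w."""
    index = {}
    for i in range(len(q) - w + 1):
        index.setdefault(q[i:i + w], []).append(i)
    return [(i, j) for j in range(len(s) - w + 1) for i in index.get(s[j:j + w], [])]


def extend_seed(q, s, qi, si, w, match=1, mismatch=-1, xdrop=2):
    """Extend the seed q[qi:qi+w] == s[si:si+w] in both directions without gaps."""
    def extend(step, qpos, spos):
        best = score = 0
        best_len = length = 0
        while 0 <= qpos < len(q) and 0 <= spos < len(s):
            score += match if q[qpos] == s[spos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best extension so far
            elif best - score > xdrop:
                break                            # fell too far below the best: stop
            qpos += step
            spos += step
        return best, best_len

    right_score, right_len = extend(+1, qi + w, si + w)
    left_score, left_len = extend(-1, qi - 1, si - 1)
    length = left_len + w + right_len
    score = w * match + left_score + right_score
    q_start, s_start = qi - left_len, si - left_len
    return score, q[q_start:q_start + length], s[s_start:s_start + length]


seeds = find_seeds("GTAAGGTCC", "GTTAGGTCC", 4)
print(seeds)
print(extend_seed("GTAAGGTCC", "GTTAGGTCC", 4, 4, 4))  # seed GGTC at position 4 in both
```

Starting from the GGTC seed, the extension passes over the single A/T mismatch and recovers the full pair GTAAGGTCC / GTTAGGTCC with 8 matches and 1 mismatch.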

Gapped BLAST

Gapped BLAST

Gapped BLAST: Example. Original BLAST exact keyword search, THEN: extend with gaps in a zone around the ends of the exact match until score < threshold, then merge nearby alignments. Output result:
GTAAGGTCC-AGT
GTTAGGTCCTAGT
[Figure: dot matrix of the sequences ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA] From lectures by Serafim Batzoglou (Stanford)

BLAST - Programs
blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and CPU-intensive.

PSI-BLAST – position-specific matrices and transitive sequence comparison

PSSM – position-specific scoring matrix

Iterated PSI-BLAST

Iterated PSI-BLAST

General Protein Search Principles
Choose between local or global search algorithms.
Use the most sensitive search algorithm available: original BLAST for no gaps; Smith-Waterman for most sensitivity; FASTA with k-tuple 1 is a good compromise; gapped BLAST for well delimited regions; PSI-BLAST for families.
Initially use BLOSUM62 and default gap penalties. If there are no significant results, use BLOSUM30 and lower gap penalties.
FASTA cutoff of .01; BLAST cutoff of .0001.
Examine results between exp. 0.05 and 10 for biological significance.
Ensure the expected score is negative.
Beware of hits on long sequences or hits with unusual aa composition.
Reevaluate results of borderline significance using a limited query region.
Segment long queries (≥ 300 amino acids).
Segment around known motifs.
(some text adapted from D Brutlag)

Links to databases and search tools
UniProt/Swiss-Prot, the “main” protein sequence database:
http://www.uniprot.org/ (text queries, BLAST search, etc.)
https://www.ebi.ac.uk/uniprot (EBI site)
http://pir.georgetown.edu/ (PIR site; text queries, SSearch, ClustalW)
Search tools:
http://www.ebi.ac.uk/Tools/sss/ (pair alignments: FASTA, BLAST, SSearch etc.)
http://www.ebi.ac.uk/Tools/msa/ (multiple alignments: ClustalW, T-Coffee etc.)
http://fasta.bioch.virginia.edu (FASTA, SSearch: searches and software downloads)

Links to databases and search tools
Protein structure database (PDB): http://www.rcsb.org/pdb
Ensembl genome browser: www.ensembl.org/