Pairwise Sequence Alignment
Three modifications for local alignment
The scoring system uses negative scores for mismatches.
The minimum score for any cell [i,j] is zero.
The best score is sought anywhere in the matrix, not just in the last column or row.
These changes cause the method to seek high-scoring subsequences, which are not penalized for poorly matching regions elsewhere in the sequences and which can occur anywhere.
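The three modifications can be sketched in a few lines of Python. This is a minimal illustration of the dynamic-programming recurrence commonly known as Smith-Waterman, not a production aligner. The function name, the score values (match=2, mismatch=-1, gap=-2) and the linear gap penalty are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, (i, j) cell where it ends).

    Score values are illustrative assumptions, with a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay at zero: a local alignment can start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Modification 1: mismatches receive a negative score.
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Modification 2: a cell score never drops below zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            # Modification 3: track the best score anywhere in the matrix.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


score, end = local_alignment_score("TTTACGTTT", "GGGACGGGG")
print(score, end)
```

In this toy input the only shared region is ACG, so the poorly matching flanks do not lower the score. A traceback from the returned cell until a zero is reached would recover the aligned subsequences.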
Other methods for Alignment
O(N^2) is too slow for large databases.
Heuristic methods are based on the frequency of shared subsequences.
They usually look for short ungapped matches (see, for example, FASTA, BLAST, BLAZE).
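The idea of counting shared short subsequences can be sketched as follows. This is a simplified illustration of word-based seeding, not the actual FASTA or BLAST algorithm. The function name and the word length k=3 are assumptions for the example.

```python
from collections import defaultdict


def shared_word_diagonals(query, target, k=3):
    """Count exact shared k-letter words per diagonal (target pos - query pos).

    Many hits on one diagonal suggest an ungapped region of similarity.
    """
    # Index every k-letter word of the target by its start position.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)

    # Look up each query word and tally hits by diagonal.
    diagonals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = shared_word_diagonals("ACGTAC", "TTACGTACTT")
print(hits)
```

Each lookup is a dictionary access, so the cost grows with the number of word hits rather than with the full N x N matrix. A real tool would then rescore and extend only the most promising diagonals.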
Multiple Sequence Alignment
What is multiple sequence alignment?
Simple extension of pairwise sequence alignment:
Given:
o Set of sequences
o Scoring match table
o Gap penalties
Find:
o Alignment of sequences such that the optimal score is achieved.
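The slide does not say how a multiple alignment is scored. One common choice, assumed here purely for illustration, is the sum-of-pairs score: every column is scored by summing the pairwise scores of all residue pairs in it. The function name and the score values (match=1, mismatch=-1, gap=-2) are assumptions, and a real scoring table would replace the simple match/mismatch test.

```python
from itertools import combinations


def sum_of_pairs_score(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (list of equal-length gapped strings) column by column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap: no contribution
            elif x == "-" or y == "-":
                total += gap        # residue against gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs_score(["ACGT", "ACGT", "ACGA"]))
```

Finding the alignment that maximizes such a score is the optimization problem stated above. This sketch only evaluates a given alignment.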
Two major applications of multiple sequence alignment
Aligning protein families
– Takes advantage of richer alphabet
– Establishes evolutionary relationships among proteins, starting point for trees
– Can identify important functional regions
– Can yield structural clues
– Gold standard is clear
Aligning non-coding DNA sequences
– Conserved signals in DNA for control of expression
– Can infer evolutionary relationships
– Can identify important functional regions
– Gold standard difficult to identify…
Why do we care about protein MSA?
Useful way to summarize the sequences of related proteins. What do globin sequences look like?
Useful way to find important functional amino acids by assessing conservation over many sequences. What is conserved?
Globin sequences
4mbn VLSEGEWQLVLHVWAKVE--ADVAGH
1myt ADFDAVLKCWGPVE--ADYTTM
2hhb A VLSPADKTNVKAAWGKVG--AHAGEY
2mhb A VLSAADKTNVKAAWSKVG--GHAGEY
1pbx A SLSDKDKAAVRALWSKIG--KSADAI
2hhb B VHLTPEEKSAVTALWGKV----NVDEV
2mhb B VQLSGEEKAAVLALWDKV----NEEEV
2lhb. -PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETS
1mba SLSAAEADLAGKSWAPVFA--NKNAN
1sdh A --PSVYDAAAQLTADVKKDLRDSWKVIGS--DKKGN
1lh GALTESQAALVKSSWEEFN--ANIPKH
1hlb. GGTLAIQAQGDLTLAQKKIVRKTWHQLMRN--KTSF
1ith A GLTAAQIKAIQDHWFLNI-KGCLQAA
1ecd LSADQISTVQASFDKVK------GD
2hbg GLSAAQRQVIAATWKDIAGADNGAGV
Conserved subsequences
DRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG
PKFAGI-AQADIAGNAAISAHGATVLKKLGELLKAKG
PHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVD
PHF-DLSH-----GSAQVKAHGKKVGDALTLAVGHLD
SHWPDVTP-----GSPHIKAHGKKVMGGIALAVSKID
ESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD
DSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLD
PKFKGLTTADELKKSADVRWHAERIINAVDDAVASMD
ADFKGKSVAD-IKASPKLRDVSSRIFTRLNEFVNNAA
KRLGNVS---QGMANDKLRGHSITLMYALQNFIDQLD
SFLKGT--SEVPQNNPELQAHAGKVFKLVYEAAIQLE
PQMAGM-SASQLRSSRQMQAHAIRVSSIMSEYVEELD
HKFS-SVPLYGLRSNPAYKAQTLTVINYLDKVVDALG
TQFAG-KDLESIKGTAPFETHANRIVGFFSKIIGELP
GFSGA SDPGVAALGAKVLAQIGVAVSHLG
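Conservation can be assessed column by column. The sketch below reports, for each column, the most common residue and the fraction of rows that carry it. The function name and this simple frequency measure are assumptions for illustration; other conservation measures exist. The three example rows are copied from the alignment above.

```python
from collections import Counter


def column_conservation(alignment):
    """Return a (most common residue, fraction of rows with it) pair per column.

    Gap characters ('-') are not counted as residues.
    """
    n_rows = len(alignment)
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(("-", 0.0))   # column contains only gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / n_rows))
    return result


# Three rows taken from the alignment shown on the slide.
globin_rows = [
    "PHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVD",
    "PHF-DLSH-----GSAQVKAHGKKVGDALTLAVGHLD",
    "SHWPDVTP-----GSPHIKAHGKKVMGGIALAVSKID",
]
conservation = column_conservation(globin_rows)
fully_conserved = [i for i, (_, frac) in enumerate(conservation) if frac == 1.0]
print(fully_conserved)
```

Columns where every row shares the same residue are candidates for functionally important positions, although with only three rows the signal is weak. More sequences make the measure more informative.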
Why do we care about protein MSA?
Establish evolutionary relationships between sequences. What was the sequence of events leading to the current species?
Understand more precisely how to model 3D structures. What other amino acids are acceptable in this structure?
The close relationship between MSA and the evolutionary tree
What is the protein MSA gold standard?
Structural alignment!
If sequences can be aligned, the alignment should reflect structural similarities.
Thus, superimposing the structures according to the alignment should give a low RMS deviation (in general) and certainly a “match” of common structural and functional elements.
But remember: optimal computation is not the same as optimal biology...
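The RMS deviation mentioned above can be computed directly once aligned positions are paired. The sketch below assumes that the two structures have already been superimposed and that each is given as a list of (x, y, z) coordinates for the aligned positions. The function name and input format are assumptions for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

A sequence alignment that pairs structurally equivalent residues should give a lower value than one that pairs unrelated positions.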
What about DNA MSA?
Non-coding sequences:
May be conserved within species (to control expression in a concerted fashion).
May be conserved across species (using similar control mechanisms).
May diverge within and across species for a special purpose or through evolutionary drift.
What about DNA MSA?
Much harder problem (4 letters only).
What is being tested with multiple alignment of non-coding sequences?
o Common evolutionary descent
o Common mode of binding proteins
o Common overall function
No structure to use as a gold standard: one would need to assess the ability of the aligned sequences to bind proteins or affect function.
In general, statistical/probabilistic methods (EM, Gibbs sampling) are more effective.
What else do you need to know about all MSA methods?
Almost all programs will align whatever sequences the user gives as input. They will always return an alignment, even if the sequences are completely unrelated. The biological thinking should be done by you.
Most programs will insert gaps. However, once inserted, gaps are there to stay: later steps do not remove them.
You need to check how the program treats end gaps.