Pairwise and Multiple Sequence Alignment Lesson 2

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Measuring the degree of similarity: PAM and blosum Matrix
Aligning sequences and searching databases
1 Exercise: BIOINFORMATIC DATABASES and BLAST. 2 Outline  NCBI and Entrez  Pubmed  Google scholar  RefSeq  Swissprot  Fasta format  PDB: Protein.
Heuristic alignment algorithms and cost matrices
1 Multiple sequence alignment Lesson 4. 2 VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
|| || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG.
Introduction to bioinformatics
1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT.
Sequence similarity.
1 Exercise 1 Bioinformatics Databases. 2 What’s in a database?  Sequences – genes, proteins, etc.  Full genomes  Annotation – information about the.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
1 Exercise: BIOINFORMATIC DATABASES and BLAST. 2 Outline  NCBI and Entrez  Pubmed  Google scholar  RefSeq  Swissprot  Fasta format  PDB: Protein.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
1 Lesson 3 Aligning sequences and searching databases.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
BLAST Workshop Maya Schushan June 2009.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Multiple sequence alignment and their reliability The Bioinformatics Unit G.S. Wise Faculty of Life Science Tel Aviv University, Israel January 2013 By.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Protein Sequence Alignment Multiple Sequence Alignment
Pairwise Sequence Alignment Exercise 2. || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG.
1 Homology and sequence alignment.. Homology Homology = Similarity between objects due to a common ancestry Hund = Dog, Schwein = Pig.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Pairwise Sequence Alignment. Three modifications for local alignment The scoring system uses negative scores for mismatches The minimum score for.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence similarity, BLAST alignments & multiple sequence alignments
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
Pairwise Sequence Alignment
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Pairwise and Multiple Sequence Alignment Lesson 2

Motivation MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. MVNLTSDEKTAVLALWNKVDVEDCGGE || || ||||| ||| || || || MVHLTPEEKTAVNALWGKVNVDAVGGE

Why perform a pairwise sequence alignment? Finding homology between two sequences e.g., predicting characteristics of a protein – premised on: similar sequence (or structure) similar function

Local vs. Global Local alignment – finds regions of high similarity in parts of the sequences Global alignment – finds the best alignment across the entire two sequences ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN CDRYYQ ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ

Evolutionary changes in sequences Three types of nucleotide changes: Substitution – a replacement of one (or more) sequence characters by another: Insertion - an insertion of one (or more) sequence characters: Deletion – a deletion of one (or more) sequence characters: AAGA  AACA AAG A T A A GA Insertion + Deletion  Indel

Choosing an alignment: Many different alignments between two sequences are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- . . . How do we determine which is the best alignment?

Gap penalty (opening = extending) Toy exercise Compute the scores of each of the following alignments using this naïve scoring scheme Scoring scheme: A C G T Match: +1 Mismatch: -2 Indel: -1 -2 1 Substitution matrix Gap penalty (opening = extending) AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-

Substitution matrices: accounting for biological context Which best reflects the biological reality regarding nucleotide mismatch penalty? Tr > Tv > 0 Tv > Tr > 0 0 > Tr > Tv 0 > Tv > Tr Tr = Transition Tv = Transversion

Scoring schemes: accounting for biological context Which best reflects the biological reality regarding these mismatch penalties? Arg->Lys > Ala->Phe Arg->Lys > Thr->Asp Asp->Val > Asp->Glu

PAM matrices Family of matrices PAM 80, PAM 120, PAM 250, … The number with a PAM matrix (the n in PAMn) represents the evolutionary distance between the sequences on which the matrix is based The (ith,jth) cell in a PAMn matrix denotes the probability that amino-acid i will be replaced by amino-acid j in time n: Pi→j,n Greater n numbers denote greater distances

PAM - limitations Based on only one original dataset Examines proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased

BLOSUM matrices Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped, manually created local alignments) BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity The (ith,jth) cell in a BLOSUM matrix denotes the log of odds of the observed frequency and expected frequency of amino acids i and j in the same position in the data: log(Pij/qi*qj) Higher n numbers denote higher identity between the sequences on which the matrix is based

More distant sequences PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences BLOSUM62 for general use BLOSUM80 for close relations BLOSUM45 for distant relations PAM120 for general use PAM60 for close relations PAM250 for distant relations

Substitution matrices exercise Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment: Human – chimp Human - yeast Human – fish PAM options: PAM60 PAM120 PAM250 BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80

Substitution matrices Nucleic acids: Transition-transversion Amino acids: Evolutionary (empirical data) based: (PAM, BLOSUM) Physico-chemical properties based (Grantham, McLachlan)

Gap penalty Which alignment has a higher score? AAGCGAAATTCGAAC A-G-GAA-CTCGAAC AAGCGAAATTCGAAC AGG---AACTCGAAC Which alignment has a higher score? Which alignment is more likely?

Pairwise alignment algorithm matrix representation: formulation 2 sequences: S1 and S2 and a Scoring scheme: match = 1, mismatch = -1, gap = -2 V[i,j] = value of the optimal alignment between S1[1…i] and S2[1…j] V[i,j] + S(S1[i+1],S2[j+1]) V[i+1,j+1] = max V[i+1,j] + S(gap) V[i,j+1] + S(gap) V[i,j] V[i+1,j] V[i,j+1] V[i+1,j+1]

Pairwise alignment algorithm matrix representation: initialization Match = 1 Mismatch = -1 Indel (gap) = -2 Scoring scheme:

Pairwise alignment algorithm matrix representation: filling the matrix Match = 1 Mismatch = -1 Indel (gap) = -2 Scoring scheme:

Pairwise alignment algorithm matrix representation: trace back

Pairwise alignment algorithm matrix representation: trace back AAAC AG-C

Assessing the significance of an alignment score True AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATTC-GAA AGGCTCATTTCTGA- 28.0 Random AGATCAGTAGACTA GAGTAGCTATCTCT AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT 26.0 . CGATAGATAGCATA GCATGTCATGATTC CGATAGATAGCATA--------- ---------GCATGTCATGATTC 16.0

Web servers for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2Seq - query blastn – nucleotide blastp – protein

Bl2seq results

Bl2seq results Similarity Dissimilarity Match Gaps Low complexity

lower chance of random homology for AA Query type: AA or DNA? For coding sequences, AA (protein) data are better Selection operates most strongly at the protein level → the homology is more evident AA – 20 char’ alphabet DNA - 4 char’ alphabet lower chance of random homology for AA ↓

BLAST – programs Query: DNA Protein Database: DNA Protein

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blast scores: Bits score – A score for the alignment according to the number of similarities, identities, etc. It has a standard set of units and is thus independent of the scoring scheme Expected-score (E-value) –The number of alignments with the same or higher score one can “expect” to see by chance when searching a random database with a random sequence of particular sizes. The closer the e-value is to zero, the greater the confidence that the hit is really a homolog

Multiple Sequence Alignment (MSA)

Multiple sequence alignment Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Similar to pairwise alignment BUT n sequences are aligned instead of just 2 Each row represents an individual sequence Each column represents the ‘same’ position

Why perform an MSA? MSAs are at the heart of comparative genomics studies which seek to study evolutionary histories, functional and structural aspects of sequences, and to understand phenotypic differences between species

Multiple sequence alignment Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Seq7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD--IAVEWWSNG-- variable conserved

Alignment methods There is no available optimal solution for MSA – all methods are heuristics: Progressive/hierarchical alignment (ClustalX) Iterative alignment (MAFFT, MUSCLE)

Progressive alignment B C D E First step: compute pairwise distances Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table E D C B A 8 17 15 10 14 16 32 31

Second step: build a guide tree 8 17 15 10 14 16 32 31 Second step: build a guide tree Cluster the sequences to create a tree (guide tree): represents the order in which pairs of sequences are to be aligned similar sequences are neighbors in the tree distant sequences are distant from each other in the tree The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! A D C B E

Third step: align sequences in a bottom up order Sequence A Sequence B Sequence C Sequence D Sequence E Align the most similar (neighboring) pairs Align pairs of pairs Align sequences clustered to pairs of pairs deeper in the tree

Main disadvantages of progressive alignments C B E Sequence A Sequence B Sequence C Sequence D Sequence E Guide-tree topology may be considerably wrong Globally aligning pairs of sequences may create errors that will propagate through to the final result

Iterative alignment A B C DE Pairwise distance table Iterate until the MSA does not change (convergence) Guide tree A D C B E MSA

Blastp – acquiring sequences

blastp – acquiring sequences

blastp – acquiring sequences

MSA input: multiple sequence Fasta file >gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN >gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI >gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens] SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS >gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens] MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK YR

MSA using ClustalX

Step1: Load the sequences

A little unclear…

Edit Fasta headers… > delta globin >gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens] MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN ALAHKYH >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN >gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens] MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI >gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens] MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS ALSSRYH >gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens] SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS >gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens] MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK YR > beta globin > epsilon globin > G-gamma globin > A-gamma globin > hemoglobin zeta

Step2: Perform alignment

MSA and conservation view

Messing-up alignment of HIV-1 env The left alignment is of CLUSTALW, which clusters nearby alignment gaps in tight blocks. The bottom expanded box suggests that the V2 evolves with a high rate if point substitutions and is shortened over evolutionary time by numerous overlapping deletions. E.g., the green box highlights a region that requires 8 independent deletions to occur in the lineages marked in green dots on the tree. The right alignment is of PRANK, which suggests a markedly different evolutionary process dominated by short insertions and deletions. Distinct insertions that happened at the same positions are separated (red and blue boxes) while multiple deletions in homologous sites are permitted if the data support this (green box) Both alignments used the same guide tree generated by CLUSTAL W.

MSA tools Progressive: Iterative: CLUSTALX/CLUSTALW MUSCLE, MAFFT, PRANK