Sequence Alignments. Wednesday 6th October 2010. Aidan Budd, Structural and Computational Biology Unit, EMBL Heidelberg, Germany.

Why focus on sequence alignment? Required for the development of almost all bioinformatics tools. Enables sequence similarity to be measured in a way that provides confident prediction of structural similarity/evolutionary relatedness between sequences. Similarity to sequences with known function provides a KEY source of novel hypotheses for the function of novel sequences. Usually the first analysis of any 'novel' sequence involves aligning it to many sequences. Have any of you ever used a tool to align sequences?

Sequence Alignment: arrangement of two or more sequences in a matrix/grid.

Biological Sequence Alignment - Rows: Residues in the same row are from the same biological macromolecule (protein or nucleic acid). Residues are arranged in the order they occur in the macromolecule: N to C terminal in proteins, 5' to 3' in nucleic acids.

Biological Sequence Alignment - Columns: Elements/cells in the matrix only ever contain a single character (either a residue or a "blank"/"gap" character). Residues in the same column share a special 1:1 relationship that they don't share with residues in any other column.

Biological Sequence Alignment - Gaps: Gaps specify pairs of residues in the same sequence, i.e. the residues flanking the gap within that sequence. Between the flanking residues, other sequences in the alignment have residues for which there is no 1:1 relationship in the gapped sequence.

Biological Sequence Alignment - Gaps: Gap-only columns provide no information about relationships between residues in the alignment. Removing sequences sometimes leaves all-gap columns. We usually remove these "empty" columns; this has no effect on the alignment, as it doesn't change the 1:1 relationships described by the rest of the alignment.
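
The removal of "empty" columns described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function name `drop_gap_only_columns` and the toy rows are hypothetical, and the alignment is assumed to be a list of equal-length strings with "-" as the gap character.

```python
def drop_gap_only_columns(rows, gap="-"):
    """Return the alignment rows with every all-gap column removed."""
    if not rows:
        return []
    # Keep a column only if at least one row has a residue in it.
    keep = [i for i in range(len(rows[0])) if any(r[i] != gap for r in rows)]
    return ["".join(r[i] for i in keep) for r in rows]


# Hypothetical toy alignment: dropping the third row leaves an all-gap column.
alignment = ["AG-WYTI", "AG-WWTI", "AGQWYTI"]
subset = alignment[:2]
cleaned = drop_gap_only_columns(subset)
print(cleaned)  # ['AGWYTI', 'AGWWTI']
```

The residues that shared a column before the clean-up still share a column afterwards, which is why the operation is harmless.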

Pairwise Sequence Alignments

Pairwise Alignment: alignment of two sequences

>EMBLCDS:BAB82422 BAB Ephydatia fluviatilis protein tyrosine kinaseBAB Ephyd
Length = 1488
Score = 89.7 bits (98), Expect = 3e-15
Identities = 104/138 (75%), Gaps = 2/138 (1%)
Strand = Plus / Plus

Query: 85  aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag 143
           ||||| ||| | |||| | ||||| || || |||  |||    ||||| || ||||||||
Sbjct: 703 aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag 761

Query: 144 ggtggccatcaaaaccctgaagcctggcacgatgtctccagaggccttcctgcaggaggc 203
            |||||| | || ||||| || || ||||| ||||||   ||||  ||||||||||||||
Sbjct: 762 tgtggccgttaagaccctcaaaccaggcacaatgtctgtcgaggagttcctgcaggaggc 821

Query: 204 ccaggtcatgaagaagct 221
                |||||||||||||
Sbjct: 822 aagcatcatgaagaagct 839

Build your own sequence alignment: Sequences are usually aligned automatically. Pairwise: BLAST, FASTA, Smith-Waterman etc. Multiple: MUSCLE, PRANK, CLUSTAL etc. Also possible 'manually' using tools such as JalView. Trying some manual alignment can help understand how/why automatic methods are used.

JalView Demo and Exercises: Loading sequences; changing the way the sequences are displayed; manual editing of alignments; adding/removing sequences to/from an alignment; exporting sequences/alignments from JalView for use in another application.

Multiple Sequence Alignment (MSA): alignment of 3+ sequences.

Multiple Sequence Alignment: MSAs describe a set of pairwise alignments. An MSA of n sequences implies n(n-1)/2 pairwise alignments. For example, an MSA of 3 sequences implies 3 pairwise alignments.
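
As a rough sketch of how an MSA "contains" its pairwise alignments, the snippet below takes every pair of rows and drops the columns in which both rows have a gap. The names (`implied_pairwise_alignments`, the row labels) and the toy MSA are assumptions made for illustration only.

```python
from itertools import combinations


def implied_pairwise_alignments(msa, gap="-"):
    """Map each pair of row names to the pairwise alignment implied by the MSA."""
    pairs = {}
    for (name_a, row_a), (name_b, row_b) in combinations(msa.items(), 2):
        # Columns where both rows are gaps carry no information for this pair.
        cols = [(x, y) for x, y in zip(row_a, row_b) if not (x == gap and y == gap)]
        pairs[(name_a, name_b)] = (
            "".join(x for x, _ in cols),
            "".join(y for _, y in cols),
        )
    return pairs


# Hypothetical three-row MSA.
msa = {"a": "AG---WWTI", "b": "AG---WYTI", "c": "AAQQQWYTI"}
pairs = implied_pairwise_alignments(msa)
n = len(msa)
print(len(pairs), n * (n - 1) // 2)  # 3 3
print(pairs[("a", "b")])             # ('AGWWTI', 'AGWYTI')
```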

“Equivalence”/similarity of residues: Residues in the same column are either structurally equivalent/similar, or evolutionarily equivalent/related/homologous. Different applications assume different types of equivalence. Different types of similarity are not necessarily equivalent.

Structural equivalence: Demonstration: heidelberg.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html#DemonstratingStructuralEquivalence Bacterial toxins 1ji6 and 1i5p

Structural equivalence

1i5p: 1 YVAPVVGTVSSFLLKKVGSLIGKR
1ji6: 1 DAVGTGISVVGQILGVVGVPFAGA

Residues in the same alignment column are "structurally equivalent", i.e. they should be the residues whose locations with respect to the rest of the structure are most similar in the two structures. Such residues will have similar structural/functional "roles" in the two proteins, e.g. form similar side-chain interactions.

Structural equivalence: Some regions of the structures do not have structurally equivalent residues in the other structure.

1i5p: DNFLNPTQN----PVPLSITSSVN
1ji6: NSWKKTPLSLRSKRSQDRIRELFS

Alignment gaps are a sure sign of such residues. Placing such residues in the same column as residues from other sequences is a misalignment, to be avoided!

“Evolutionary Equivalence”: Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence, with any changes due only to point substitutions.

Example history:
AGWYTI (ancestral sequence; two copies of the gene generated)
AGWYTI / AGWWTI (mutation/substitution Y-W)
AAWYTI / AGWWTI (substitution G-A)
AAQQQWYTI / AGWWTI (QQQ insertion)

Resulting alignment:
AG---WWTI
AAQQQWYTI
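
The idea that aligned residues trace back to the same ancestral residue can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the author's method: each residue carries the index of the ancestral position it descends from (or `None` if it was inserted), and all function names are hypothetical. Replaying the history from the slide recovers the alignment shown there.

```python
def from_ancestor(seq):
    """Tag each residue with the ancestral position it descends from."""
    return [(res, pos) for pos, res in enumerate(seq)]


def substitute(lineage, index, new_res):
    """Point substitution: the residue changes, its ancestry does not."""
    out = list(lineage)
    out[index] = (new_res, lineage[index][1])
    return out


def insert(lineage, index, residues):
    """Insertion: new residues have no ancestral position (None)."""
    return lineage[:index] + [(r, None) for r in residues] + lineage[index:]


def align_by_ancestry(a, b, gap="-"):
    """Put residues in the same column only if they share an ancestral position."""
    row_a, row_b = [], []
    i = j = 0
    while i < len(a) or j < len(b):
        a_left, b_left = i < len(a), j < len(b)
        if a_left and a[i][1] is None:          # insertion in a
            row_a.append(a[i][0]); row_b.append(gap); i += 1
        elif b_left and b[j][1] is None:        # insertion in b
            row_a.append(gap); row_b.append(b[j][0]); j += 1
        elif a_left and b_left and a[i][1] == b[j][1]:
            row_a.append(a[i][0]); row_b.append(b[j][0]); i += 1; j += 1
        elif a_left and (not b_left or a[i][1] < b[j][1]):  # lost in b
            row_a.append(a[i][0]); row_b.append(gap); i += 1
        else:                                   # lost in a
            row_a.append(gap); row_b.append(b[j][0]); j += 1
    return "".join(row_a), "".join(row_b)


ancestor = from_ancestor("AGWYTI")
copy1 = substitute(ancestor, 3, "W")                       # Y-W: AGWWTI
copy2 = insert(substitute(ancestor, 1, "A"), 2, "QQQ")     # G-A, then QQQ insertion
print(align_by_ancestry(copy1, copy2))  # ('AG---WWTI', 'AAQQQWYTI')
```

Real alignment tools never see the ancestry tags; they have to infer this arrangement from the sequences alone, which is why their output can be wrong.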

Quiz - Evolutionary Interpretation of Alignments: Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?

X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

Y:
KGEPG------IGL------PG
KGIPG DPAFGDPG
RGIPGEVLGAQ PG

Z:
KGE PGIGL------PG
KGIPG DPAFGDPG
RGIPGEVLGAQ PG

Quiz - Evolutionary Interpretation of Alignments: Different automatic MSA software gives different results.

PRANK:
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG

MAFFT:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

CLUSTALX:
K---GEPGIGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

They're all "wrong", because their models of the evolutionary process are very different from the (very strange...) one under which I told you the sequences evolved. Alignments X, Y and Z are those shown in the previous quiz slide.

Interpreting Alignments: Special 1:1 relationship between residues in the same column. Structural: very similar structural environment. Evolutionary: related by point mutations/no mutations from the same residue in the ancestral sequence. Structural and evolutionary equivalence need not be the same. Not all residues have 1:1 equivalents in other structures.

Non-Equivalence of Evolutionary and Structural Alignments. Demonstration 1: structural equivalence without evolutionary equivalence. Structural alignment of SH3-interaction motifs from nef and ncf1: nef/fyn1 (PDB:1efn), ncf1 (PDB:1w70), aligned ncf1/nef1 SH3 interaction motifs.

Non-Equivalence of Evolutionary and Structural Alignments. Demonstration 2: evolutionary equivalence without structural equivalence. Human lymphotactin adopts different folds depending on the conditions (PDB:2jp1, PDB:1j8i).

Mis-alignment: “Gold-standard” structural alignment (with CLUSTALX default colouring). Mis-alignment of the same region: residues that are known to be functionally equivalent are NOT in the same columns.

Indicating Difficult-To-Align/Non-Equivalent Alignment Regions (structures 1st3, 1sbc, 1sbt, 2prk): Code such regions differently, e.g. with upper-case characters. Create the alignment to minimise the number of gaps (typical solution, particularly for MSA software). Only include in the same column characters known/believed to be structurally equivalent.

Quiz - Numbers of Insertions: The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is? (a) 2 (b) 1 (c) 0 (d) 3

Quiz - Numbers of Insertions: If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions. If an alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion.

Quiz - Numbers of Insertions: We can ALWAYS explain observed sequence length diversity with: 0 insertions (all length variation due to deletion); 0 deletions (all length variation due to insertion); or a combination of insertions and deletions. Perhaps we should instead focus on inferring the most likely scenario? (Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)

Sequence Similarity Searching

Overview: We want to distinguish between alignments of 'related' sequences and of 'unrelated' sequences. This provides an INVALUABLE source of novel functional hypotheses for the 'query' sequences. We do this by scoring alignments so that score distributions are different for these different classes of alignments. By knowing/modeling the distribution of scores of 'unrelated' alignments, we can estimate how 'likely' an alignment is to belong to this class.

How can we best align two related sequences? Choose the alignment that 'looks' most like good alignments between related sequences. Do this by assigning scores to alignments so that these alignments get high scores compared to alignments looking less like these 'good' alignments. Example (sections of two leghemoglobin proteins):

Lower score:
ACU16467 V-VADAAL
ACU16479 VVIDAAL-

Higher score:
ACU16467 VVADAAL
ACU16479 VVIDAAL

How can we best align two related sequences?

ACU16467 V-VADAAL
ACU16479 VVIDAAL-

Alignment scores are calculated by: assigning a value to each aligned pair of residues; any column containing a gap contributes a low/negative value; the sum of these values for all columns in the alignment is the alignment score. 'Pairwise' values come from a 'substitution matrix'.
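
The scoring recipe above can be written out as a short Python sketch. Two things here are assumptions and should be checked against a published matrix: the substitution values are Gonnet PAM250 entries for A, D, I, L and V as recalled from memory (only V/V = 3.4 is legible on a later slide), and the gap value of -1 per gapped column is the one quoted on that later slide. The function names are hypothetical.

```python
# Assumed Gonnet PAM250 values for residues A, D, I, L, V (keys sorted alphabetically).
GONNET_SUBSET = {
    ("A", "A"): 2.4, ("D", "D"): 4.7, ("I", "I"): 4.0, ("L", "L"): 4.0, ("V", "V"): 3.4,
    ("A", "D"): -0.3, ("A", "I"): -0.8, ("A", "L"): -1.2, ("A", "V"): 0.1,
    ("D", "I"): -3.8, ("D", "L"): -4.0, ("D", "V"): -2.9,
    ("I", "L"): 2.8, ("I", "V"): 3.1, ("L", "V"): 1.8,
}


def alignment_score(row_a, row_b, matrix=GONNET_SUBSET, gap_score=-1.0, gap="-"):
    """Sum substitution scores over columns; gapped columns score gap_score."""
    total = 0.0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            total += gap_score
        else:
            total += matrix[tuple(sorted((x, y)))]
    return total


ungapped = alignment_score("VVADAAL", "VVIDAAL")
gapped = alignment_score("V-VADAAL", "VVIDAAL-")
print(round(ungapped, 1), round(gapped, 1))  # 19.5 5.1
```

With these values the ungapped alignment scores 19.5, which agrees with the total quoted on the later slide, and the gapped alternative scores much lower.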

How do we c

>ACU16467 VVADAAL
>ACU16479 VVIDAAL

Examine all possible alignments. Calculate a score for each alignment. Choose the alignment that 'looks' most like good alignments between related sequences. The scoring scheme gives higher scores to alignments that 'look' more like good alignments between related sequences. To build it: analyse good alignments between related sequences; count the frequency of amino acid pairs in these alignments; find a way of scoring alignments where such alignments get higher scores than alignments that look less like these good alignments.

local alignment

Calculating/Building/Finding a Good Alignment: Sum the score for each position in the alignment, taking scores from a substitution matrix, with a penalty for gaps. The best/optimal alignment is the one with the highest score. Calculate which of these alignments (sections of two leghemoglobin proteins) has the highest score.

[Gonnet PAM250 substitution matrix for residues A, D, I, L, V; V/V = 3.4]
gap = -1

ACU16467 VVA-DAAL
ACU VVIDAAL

ACU16467 V-VADAAL
ACU16479 VVIDAAL-

ACU16467 VVADAAL
ACU16479 VVIDAAL
= 19.5
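
Scoring every possible alignment one at a time quickly becomes impractical, so in practice the highest-scoring alignment is found by dynamic programming. The sketch below uses the Needleman-Wunsch global alignment algorithm, which the slides do not spell out. To stay short it uses a hypothetical scoring scheme (match +1, mismatch -1, gap -1) rather than a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("VVADAAL", "VVIDAAL")
print(best, row_a, row_b)  # 5 VVADAAL VVIDAAL
```

Swapping the match/mismatch rule for a substitution-matrix lookup gives the matrix-based scoring described on the slide.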

Calculating Sequence Similarity: Substitution Matrices. Analyse alignments between sequences you are certain are related/structurally equivalent (i.e. have lots of identical columns), e.g. haemoglobins. Measure the frequency with which each pair of residues is observed in the same column, and how often each residue is observed.
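
One common way to turn such counts into scores is a log-odds ratio: how often a pair is seen aligned, compared with how often it would be expected by chance given the residue frequencies. The sketch below illustrates that idea only; it is not the procedure used to derive the Gonnet or any other published matrix, the function name is hypothetical, and a single seven-column alignment is far too little data for meaningful values.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log2 odds of observed vs expected frequency for each observed residue pair."""
    pair_counts = Counter(tuple(sorted(p)) for p in aligned_pairs)
    residue_counts = Counter(res for pair in aligned_pairs for res in pair)
    n_pairs = sum(pair_counts.values())
    n_res = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / n_pairs
        expected = (residue_counts[a] / n_res) * (residue_counts[b] / n_res)
        if a != b:
            expected *= 2  # an unordered pair can arise as (a, b) or (b, a)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy data: the columns of one gap-free alignment.
columns = list(zip("VVADAAL", "VVIDAAL"))
scores = log_odds_scores(columns)
print(round(scores[("V", "V")], 2))  # 1.81
```

Pairs that are never observed get no score here; real matrices are built from many alignments so that every pair is represented.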

Building an Alignment
1. Formulate the question
2. Collect initial sequence set
3. Automatically align sequences
4. Examine alignment
5. Discard sequences
6. Include additional sequences
7. Manually adjust/edit alignment

Formulate the Question

Collect Initial Sequences: 1. Search databases for appropriate sequences (keyword-based or sequence-similarity-based). 2. Get a pre-aligned set of sequences: PFAM (pfam.sanger.ac.uk), ENSEMBL, HOMSTRAD, TREEFAM.

Pre-Calculated Alignments Demonstration and Exercise: Alignments associated with SRC_HUMAN in Ensembl, TreeFam, PFAM, HOMSTRAD.

Automatic Alignment: Many different tools are available. The choice depends on: availability; size of dataset; specific application.

Examine Alignments: CLUSTALX Colouring Scheme. Only one of many possible colouring schemes. Highlights well the differing patterns of conservation in different columns. Designed for red/green colour-blindness.

Examine Alignment: CLUSTALX Colouring Scheme. Same colour for amino acids with similar properties. Residues are only coloured if a given % of residues in the column share the property. EXCEPTION: P and G are always coloured.

Examine Alignment: CLUSTALX Colouring Scheme
Hydrophobic: L V I M F W A C
Polar: N T S Q
Acidic: D E
Basic: K R
Secondary-structure breaking: G P
Large aromatic polar: H Y
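
The colouring rule can be approximated in code as follows. This is a simplified sketch: the residue groups are the ones listed above, but the single 60% threshold and the function name are hypothetical, and the real CLUSTALX scheme applies more detailed per-colour rules.

```python
GROUPS = {
    "hydrophobic": "LVIMFWAC",
    "polar": "NTSQ",
    "acidic": "DE",
    "basic": "KR",
    "secondary-structure breaking": "GP",
    "large aromatic polar": "HY",
}
RESIDUE_GROUP = {res: name for name, members in GROUPS.items() for res in members}


def colour_column(column, threshold=0.6):
    """Return the property label used to colour each residue, or None if uncoloured."""
    labels = []
    for res in column:
        group = RESIDUE_GROUP.get(res)
        if group is None:            # gap or unknown character
            labels.append(None)
        elif res in "GP":            # exception: G and P are always coloured
            labels.append(group)
        else:
            share = sum(RESIDUE_GROUP.get(r) == group for r in column) / len(column)
            labels.append(group if share >= threshold else None)
    return labels


print(colour_column("LLVID"))
# ['hydrophobic', 'hydrophobic', 'hydrophobic', 'hydrophobic', None]
print(colour_column("LDKNG"))
# [None, None, None, None, 'secondary-structure breaking']
```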

Demo and Exercise: Comparing Automatic Alignment Results. Download reference alignments from BaliBase. Align them using several different tools. Compare the automatic alignments with the reference alignment. Different tools give different results. Tools often make mistakes. Gain experience using several different tools. Practise looking at/examining alignments.

Alignment Tool Selection
Many (50+) protein sequences: MUSCLE (fast and fairly accurate)
Few (<50) protein sequences: PROBCONS (most accurate available)
Disordered protein regions: MAFFT
Few DNA sequences: PRANK (very accurate)
Unhappy with initial alignment: change parameters, or use a different (more accurate) tool
Default, first choice tool
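
The selection chart can be encoded as a small helper, shown here purely as an illustration. The function name and argument conventions are hypothetical, and cases the chart does not cover (for example many DNA sequences) return `None` rather than a guess.

```python
def suggest_alignment_tool(molecule, n_sequences, disordered=False):
    """Suggest a tool following the chart above; None if the chart has no entry."""
    if molecule == "protein":
        if disordered:
            return "MAFFT"
        return "MUSCLE" if n_sequences >= 50 else "PROBCONS"
    if molecule == "dna" and n_sequences < 50:
        return "PRANK"
    return None


print(suggest_alignment_tool("protein", 200))                 # MUSCLE
print(suggest_alignment_tool("protein", 12))                  # PROBCONS
print(suggest_alignment_tool("protein", 12, disordered=True)) # MAFFT
print(suggest_alignment_tool("dna", 8))                       # PRANK
```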

Examine Alignment: One of the most important stages in the analysis. Involves deciding whether or not the MSA is ready for use in down-stream analyses.

Adjusting the Initial Alignment

Exercise/Demo MSA analysis from start to finish

Examine Alignment: What to look out for - Mis-aligned regions. Sequences/regions important for the analysis? YES: edit the alignment manually to fix the mis-alignment. NO: discard sequences/ignore region.

Common Patterns - buried beta-strand. Response regulator receiver domain (bin/homstrad/showpage.cgi?family=response_reg&disp=str)

Common Patterns - amphiphilic partially-buried alpha-helices. Response regulator receiver domain (bin/homstrad/showpage.cgi?family=response_reg&disp=str)

Common Patterns - amphiphilic partially-buried alpha-helices. Subtilases

Common Patterns - amphiphilic beta strands ubiquitin conjugating enzyme

Common Patterns - non-globular sequence. Different, more strongly biased (away from equal representation of each of the 20 amino acids) sequence composition. Low sequence complexity. More variable sequence (more substitutions, more gaps than globular/structured regions).

Increasing complexity:
TTT TTTTT TTTTTTT
THE THEHT THETHET
THE HAPPY PROTEIN
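
One simple way to put a number on "sequence complexity" is the Shannon entropy of the residue composition. This is only one of several possible measures and is offered here as an assumption, not as the measure the slide had in mind; the function name is hypothetical. Applied to the three example strings it reproduces their ordering.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the letter composition, in bits per residue."""
    letters = seq.replace(" ", "")
    counts = Counter(letters)
    total = len(letters)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


examples = ["TTT TTTTT TTTTTTT", "THE THEHT THETHET", "THE HAPPY PROTEIN"]
entropies = [shannon_entropy(s) for s in examples]
for text, h in zip(examples, entropies):
    print(f"{text}: {h:.2f} bits")
```

Composition-based entropy ignores the order of residues, so a perfectly regular repeat such as "THETHETHE" still scores as moderately complex.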

Common Patterns - short linear motifs. [RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1. Mostly occur in disordered protein regions. Often show greater conservation than neighbouring sequence. Caution: Absence of the patterns that characterise different structures (e.g. helices, motifs, etc.) doesn’t mean the structures are not present! It only means that either the structure is showing different evolutionary dynamics from those shown here, or insufficient/too much diversity is present in the alignment, obscuring the tell-tale patterns of conservation.

Clustalx helps spotting misaligned regions: BaliBase kinase2_ref5 as misaligned by clustalx. Quality->“Show Low-Scoring Segments”. NB other software is available to help in this way. Maybe add “Low-scoring elements” bit to previous slide...

Demo - edit/correct misalignments: View alignment with ClustalX. Quality -> Show low-scoring segments. Edit using JalView.

An aside - different tools give different alignments clustalx probcons

An aside - different tools give different alignments BaliBase kinase2_ref5 mafft prank

Examine Alignment: What to look out for - Missing regions. Sequences important/vital for the analysis? YES: try to determine why the sequence is missing; if due to error, attempt to correct it. NO: discard sequences.

Examine Alignment: What to look out for - Apparently nonsense/unrelated regions (shown with CLUSTALX “Quality”->“Show Low-Scoring Segments” switched on). Sequences important/vital for the analysis? YES: try to determine why the sequence is different; if due to error, attempt to correct it. NO: discard sequences.

Examine Alignment: What to look out for - Lots of very similar sequences. They slow down calculations. If there are no more divergent sequences in the alignment, there may not be enough information for successful analysis. Remove most of them, leaving the most “interesting” sequences, e.g. keep human, remove chimp, guinea-pig. Find alternative data sources to supplement the initial sequence set, e.g. ENSEMBL for access to predicted peptides not yet released elsewhere.
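
Thinning out near-identical sequences can be sketched as a greedy filter on pairwise identity. The 90% cut-off, the function names and the toy rows below are all hypothetical. Rows are visited in dictionary order, so sequences you want to keep (e.g. human) should come first.

```python
def identity(row_a, row_b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    cols = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def thin_alignment(msa, max_identity=0.9):
    """Keep a row only if it is not too similar to any row already kept."""
    kept = {}
    for name, row in msa.items():
        if all(identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


# Hypothetical aligned rows.
msa = {"human": "AGWYTIKL", "chimp": "AGWYTIKL", "fish": "AAWWTIRL"}
print(list(thin_alignment(msa)))  # ['human', 'fish']
```

Removing rows this way can leave gap-only columns behind, which should then be removed as described earlier.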

Demo: Comparing Automatic Alignment Tools 1aboA_ref1 from BaliBase strasbg.fr/fr/Products/Databases/BAliBASE2/align_index.html

Difficult Alignments - low similarity between sequences (BaliBase: 1aboA_ref1). Typical problems: difficult to align throughout the sequence, with more problems in local regions of lower conservation. Possible strategies: collect more sequences, so that every sequence has a more similar neighbour in the alignment; use structural information if possible.

Difficult Alignments - “orphan” sequences (BaliBase: 1aboA_ref2; reference alignment vs ClustalX alignment). Typical problems: the orphan is difficult to align correctly; typically it is forced to have a similar conserved block structure to the other sequences. Possible strategies: collect more sequences, so that the orphan has a more similar neighbour in the alignment; use structural information if possible.

Difficult Alignments - Several Subfamilies. Possible strategies: align each subfamily on its own initially; then align the subfamily alignments directly to each other; add additional “bridging” sequences if you can find them.

Difficult Alignments - Insertions/Extensions (clustalx alignment vs BaliBase reference alignment). Possible strategies: initially align the sequences without the insertion; then align (perhaps manually) the sequence(s) with the insertion.

Difficult Alignments - Repeats (BaliBase reference alignment). Typical problems: repeats are aligned “out of order”. Possible strategies: determine the repeat structure of your sequences; decide which order

Difficult Alignments - Rearrangement/Shuffling (BaliBase: cellulase_1_ref8). Typical problems: most software prefers to minimise gaps, and so completely misaligns domains, as seen left. Possible strategies: determine the repeat/domain structure of your sequences before aligning them; extract the sequences of individual domains, and only include sequences of the same domain in the same alignment.

Difficult Alignments - Short Linear Motifs. [DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD] LIG_AP_GAE_1; [RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1. Typical problems: most software finds it difficult to identify short regions of conservation in a region of low conservation. Possible strategies: align regions suspected of containing motifs separately from the rest of the proteins; use MAFFT (works best for regions of this kind).

Summary: It’s not “Is my alignment correct?” but rather “Where is my alignment wrong?” The most important steps in building an alignment are: specifying an appropriate question; critically examining the alignments you build. The quality of the alignment influences the quality of down-stream analyses. Mis-alignment introduces systematic errors in downstream analyses. There is no “best” alignment tool: which one you use depends on your question. It can cost lots of time to make a good alignment!!
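
The two motif patterns quoted above are written as regular expressions, so candidate regions to align separately can be located with any regex engine. The sketch below uses Python's `re` module; the toy sequence and the function name are hypothetical, and a regex hit is only a candidate site, not evidence of a functional motif.

```python
import re

MOTIFS = {
    "LIG_AP_GAE_1": r"[DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD]",
    "LIG_CYCLIN_1": r"[RK].L.{0,1}[FYLIVMP]",
}


def find_motifs(sequence):
    """Return (motif name, 0-based start, matched text) for each non-overlapping hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        for m in re.finditer(pattern, sequence):
            hits.append((name, m.start(), m.group()))
    return hits


toy_sequence = "MAAKRRLFGSTDDDFSDLQ"  # made-up sequence for illustration
hits = find_motifs(toy_sequence)
print(hits)
# [('LIG_AP_GAE_1', 11, 'DDDFSDL'), ('LIG_CYCLIN_1', 4, 'RRLF')]
```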