Sequence Alignments Wednesday 6th October 2010 Aidan Budd Structural and Computational Biology Unit EMBL Heidelberg, Germany.

Why focus on sequence alignment? Required for the development of almost all bioinformatics tools. Enables sequence similarity to be measured in a way that provides confident prediction of structural similarity/evolutionary relatedness between sequences. Similarity to sequences with known function provides a KEY source of novel hypotheses for the function of novel sequences. Usually the first analysis of any 'novel' sequence involves aligning it to many sequences. Have any of you ever used a tool to align sequences?

Sequence Alignment: Arrangement of two or more sequences in a matrix/grid

Biological Sequence Alignment - Rows: Residues in the same row are from the same biological macromolecule (protein or nucleic acid). Residues are arranged in the order they occur in the macromolecule: N to C terminal in proteins, 5' to 3' in nucleic acids.

Biological Sequence Alignment - Columns: Elements/cells in the matrix only ever contain a single character (either a residue or a "blank"/"gap" character). Residues in the same column share a special 1:1 relationship that they don't share with residues in any other column.

Biological Sequence Alignment - Gaps: Gaps specify pairs of residues in the same sequence, i.e. the residues flanking the gap within the same sequence. Between the flanking residues, other sequences in the alignment have residues for which there is no 1:1 relationship in the gapped sequence.

Biological Sequence Alignment - Gaps: Gap-only columns provide no information about relationships between residues in the alignment. Removing sequences sometimes leaves all-gap columns. We usually remove these "empty" columns, as this does not affect the 1:1 relationships described by the rest of the alignment.
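Removing gap-only columns can be sketched in a few lines of Python. This is a minimal illustration, not something from the original slides: the function name `remove_gap_only_columns`, the representation of an alignment as a list of equal-length strings, and `-` as the gap character are all assumptions made here. The third row of the toy alignment is invented for the example.

```python
def remove_gap_only_columns(rows, gap="-"):
    """Return the alignment with every all-gap column dropped."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all alignment rows must have the same length")
    # Keep a column if at least one row has a residue there.
    keep = [i for i in range(width) if any(row[i] != gap for row in rows)]
    return ["".join(row[i] for i in keep) for row in rows]


# Toy alignment (third row is hypothetical).
alignment = ["AG---WWTI", "AAQQQWYTI", "AG---WYTI"]

# Dropping the only sequence with the QQQ insertion leaves three empty columns.
without_insertion = [alignment[0], alignment[2]]
print(remove_gap_only_columns(without_insertion))
```

The remaining columns keep their relative order, so the 1:1 relationships between the residues that are left are unchanged.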

Pairwise Sequence Alignments
Pairwise Alignment: Alignment of two sequences
>EMBLCDS:BAB82422 BAB82422.1 Ephydatia fluviatilis protein tyrosine kinase
Length = 1488
Score = 89.7 bits (98), Expect = 3e-15
Identities = 104/138 (75%), Gaps = 2/138 (1%)
Strand = Plus / Plus
Query: 85  aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag 143
           ||||| ||| | |||| | ||||| || || |||  |||    ||||| || ||||||||
Sbjct: 703 aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag 761
Query: 144 ggtggccatcaaaaccctgaagcctggcacgatgtctccagaggccttcctgcaggaggc 203
            |||||| | || ||||| || || ||||| ||||||   ||||  ||||||||||||||
Sbjct: 762 tgtggccgttaagaccctcaaaccaggcacaatgtctgtcgaggagttcctgcaggaggc 821
Query: 204 ccaggtcatgaagaagct 221
                |||||||||||||
Sbjct: 822 aagcatcatgaagaagct 839
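The "Identities" and "Gaps" figures in a report like this can be recomputed from the aligned rows. The sketch below is an illustration only: the function name `alignment_stats` and the convention that `-` marks a gap are assumptions, and it counts columns the simple way (it is not BLAST's own code). It is applied to the last block of the alignment above (Query 204-221 against Sbjct 822-839).

```python
def alignment_stats(query_row, subject_row, gap="-"):
    """Count identical columns and gap-containing columns in a pairwise alignment."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == gap or s == gap:
            gaps += 1          # a column with a gap is never an identity
        elif q == s:
            identities += 1
    return identities, gaps, len(query_row)


identities, gaps, columns = alignment_stats("ccaggtcatgaagaagct",
                                            "aagcatcatgaagaagct")
print(f"Identities = {identities}/{columns}, Gaps = {gaps}/{columns}")
```

Summing these counts over all three blocks is how a whole-alignment figure such as 104/138 is obtained.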

Build your own sequence alignment: Sequences are usually aligned automatically. Pairwise: BLAST, FASTA, Smith-Waterman etc. Multiple: MUSCLE, PRANK, CLUSTAL etc. Also possible 'manually' using tools such as JalView. Trying some manual alignment can help understand how/why automatic methods are used.

JalView Demo and Exercises: Loading sequences; changing the way the sequences are displayed; manual editing of alignments; adding/removing sequences to an alignment; exporting sequences/alignments from JalView for use in another application.

Multiple Sequence Alignment (MSA): Alignment of 3+ sequences

Multiple Sequence Alignment: MSAs describe a set of pairwise alignments. An MSA of n sequences implies n(n-1)/2 pairwise alignments. For example: [figure not reproduced in transcript]
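One way to see that an MSA describes a set of pairwise alignments is to extract them: take two rows at a time and drop the columns in which both rows have a gap. The sketch below assumes an MSA stored as a dict of name to aligned string; the names `seq1` to `seq3`, the function name, and the third sequence are hypothetical.

```python
from itertools import combinations


def implied_pairwise_alignments(msa, gap="-"):
    """Return the pairwise alignment implied by the MSA for every pair of rows."""
    pairs = {}
    for (name_a, row_a), (name_b, row_b) in combinations(msa.items(), 2):
        # Columns that are gaps in both rows carry no information for this pair.
        columns = [(a, b) for a, b in zip(row_a, row_b)
                   if not (a == gap and b == gap)]
        pairs[(name_a, name_b)] = ("".join(a for a, _ in columns),
                                   "".join(b for _, b in columns))
    return pairs


msa = {"seq1": "AG---WWTI", "seq2": "AAQQQWYTI", "seq3": "AG---WYTI"}
pairs = implied_pairwise_alignments(msa)
n = len(msa)
print(len(pairs), n * (n - 1) // 2)   # number of pairs, and n(n-1)/2
print(pairs[("seq1", "seq3")])
```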

“Equivalence”/similarity of residues: Residues in the same column are either structurally equivalent/similar, or evolutionarily equivalent/related/homologous. Different applications assume different types of equivalence. Different types of similarity are not necessarily equivalent.

Structural equivalence: Demonstration: http://www.embl-heidelberg.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html#DemonstratingStructuralEquivalence Bacterial toxins 1ji6 and 1i5p

Structural equivalence
1i5p: 1 YVAPVVGTVSSFLLKKVGSLIGKR
        111111111111111111111111
1ji6: 1 DAVGTGISVVGQILGVVGVPFAGA
Residues in the same alignment column are "structurally equivalent", i.e. they should be the residues in the two structures whose locations with respect to the rest of the structure are most similar. Such residues will have similar structural/functional "roles" in the two proteins, e.g. form similar side-chain interactions.

Structural equivalence: Some regions of the structures do not have structurally equivalent residues in the other structure.
1i5p: DNFLNPTQN----PVPLSITSSVN
      111111       11111111111
1ji6: NSWKKTPLSLRSKRSQDRIRELFS
Alignment gaps are a sure sign of such residues. Placing such residues in the same column as residues from other sequences is a misalignment, to be avoided!

“Evolutionary Equivalence”
[figure: history of two gene copies]
AGWYTI: two copies of gene generated (AGWYTI, AGWYTI)
Mutation/substitution Y-W in one copy: AGWYTI, AGWWTI
Substitution G-A in the other copy: AAWYTI, AGWWTI
QQQ insertion: AAQQQWYTI, AGWWTI
Resulting alignment:
AG---WWTI
AAQQQWYTI
Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence, with any changes due only to point substitutions.

Quiz - Evolutionary Interpretation of Alignments: Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?
X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG
Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG

Quiz - Evolutionary Interpretation of Alignments: Different automatic MSA software gives different results.
PRANK:
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG
MAFFT:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
CLUSTALX:
K---GEPGIGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
They're all "wrong", because their model of the evolutionary process is very divergent from the (very strange...) one under which I told you they evolved.
X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG
Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG

Interpreting Alignments: Special 1:1 relationship between residues in the same column. Structural: very similar structural environment. Evolutionary: related by point mutations/no mutations from the same residue in the ancestral sequence. Structural and evolutionary equivalence need not necessarily be the same. Not all residues have 1:1 equivalents in other structures.

Non-Equivalence of Evolutionary and Structural Alignments: Demonstration 1: Structural equivalence without evolutionary equivalence. Structural alignment of SH3-interaction motifs from nef and ncf1. [figure panels: nef/fyn1 PDB:1efn; ncf1 PDB:1w70; aligned ncf1/nef1 SH3 interaction motifs]

Non-Equivalence of Evolutionary and Structural Alignments: Demonstration 2: Evolutionary equivalence without structural equivalence. Human Lymphotactin adopts different folds depending on the conditions. [figure panels: PDB:2jp1; PDB:1j8i]

Mis-alignment: “Gold-standard” structural alignment (with CLUSTALX default colouring). Mis-alignment of same region: residues that are known to be functionally equivalent are NOT in the same columns.

Indicating Difficult-To-Align/Non-Equivalent Alignment Regions (example sequences: 1st3, 1sbc, 1sbt, 2prk): Code regions differently, e.g. with upper-case characters. Create alignment to minimise the number of gaps in the alignment (typical solution, particularly for MSA software). Only include in the same column characters known/believed to be structurally equivalent.

Quiz - Numbers of Insertions: The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is? (a) 2 (b) 1 (c) 0 (d) 3

Quiz - Numbers of Insertions: The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is? If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions. If an alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion.

Quiz - Numbers of Insertions: The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is? We can ALWAYS explain observed sequence length diversity with: 0 insertions (all length variation due to deletion); 0 deletions (all length variation due to insertion); a combination of insertions and deletions. Perhaps we should instead focus on inferring the most likely scenario? (Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)

Sequence Similarity Searching

Overview: We want to distinguish between alignments of 'related' sequences and of 'unrelated' sequences. This provides an INVALUABLE source of novel functional hypotheses for the 'query' sequences. We do this by scoring alignments so that score distributions are different for these different classes of alignments. By knowing/modelling the distribution of scores of 'unrelated' alignments we can estimate how 'likely' an alignment is to belong to this class.

How can we best align two related sequences? Choose the alignment that 'looks' most like good alignments between related sequences. Do this by assigning scores to alignments so that these alignments get high scores compared to alignments looking less like these 'good' alignments. Example (sections of two leghemoglobin proteins):
Higher score:
ACU16467 VVADAAL
ACU16479 VVIDAAL
Lower score:
ACU16467 V-VADAAL
ACU16479 VVIDAAL-

How can we best align two related sequences?
ACU16467 V-VADAAL
ACU16479 VVIDAAL-
Alignment scores are calculated by: assigning a value for each aligned pair of residues; any column containing a gap contributes a low/negative value; the sum of these values for all pairs in the alignment is the alignment score. 'Pairwise' values come from a 'substitution matrix'.

How do we c
>ACU16467 VVADAAL
>ACU16479 VVIDAAL
Choose the alignment that 'looks' most like good alignments between related sequences: Examine all possible alignments. Calculate a score for each alignment. The scoring scheme gives higher scores to alignments that 'look' more like good alignments between related sequences. To obtain the scoring scheme: Analyse good alignments between related sequences. Count the frequency of amino acid pairs in these alignments. Find a way of scoring alignments where such alignments get higher scores than alignments that look less like these good alignments.
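Literally examining all possible alignments quickly becomes impractical, so in practice the best-scoring alignment is usually found by dynamic programming. The sketch below uses the Needleman-Wunsch global alignment recurrence, a technique named here for illustration and not taken from the slides. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen to keep the example short; a real tool would use a substitution matrix.

```python
def global_align(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Needleman-Wunsch: best global alignment score plus one optimal alignment."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: a aligned to gaps only
        score[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):            # first row: b aligned to gaps only
        score[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + pair, "diag"),  # align a[i-1] with b[j-1]
                (score[i - 1][j] + gap, "up"),         # gap in b
                (score[i][j - 1] + gap, "left"),       # gap in a
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif step == "up":
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best_score, row1, row2 = global_align("VVADAAL", "VVIDAAL")
print(best_score)
print(row1)
print(row2)
```

The table has (n+1) x (m+1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.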

local alignment

Calculating/Building/Finding a Good Alignment
Sum the score for each position in the alignment, taking scores from a substitution matrix, with a penalty for gaps. Best/Optimal alignment is the one with the highest score.
Gonnet PAM250 (values for A, D, I, L, V); gap = -1
      A     D     I     L     V
A   2.4  -0.3  -0.8  -1.2   0.1
D         4.7  -3.8  -4.0  -2.9
I               4.0   2.8   3.1
L                     4.0   1.8
V                           3.4
Calculate which of these alignments (sections of two leghemoglobin proteins) has the highest score:
1.
ACU16467 VVA-DAAL
ACU16479 -VVIDAAL
-1 + 3.4 + 0.1 + -1 + 4.7 + 2.4 + 2.4 + 4.0 = 15.0
2.
ACU16467 V-VADAAL
ACU16479 VVIDAAL-
3.4 + -1 + 3.1 + -0.3 + -0.3 + 2.4 + -1.2 + -1 = 5.1
3.
ACU16467 VVADAAL
ACU16479 VVIDAAL
3.4 + 3.4 + -0.8 + 4.7 + 2.4 + 2.4 + 4.0 = 19.5
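The three sums can be checked with a short script. The matrix values and the gap penalty of -1 are the ones given on the slide; the dictionary layout, the function name `alignment_score`, and the rounding to one decimal place (to hide floating-point noise) are choices made for this sketch.

```python
# Gonnet PAM250 values for A, D, I, L, V, keyed by alphabetically sorted pair.
GONNET_SUBSET = {
    ("A", "A"): 2.4, ("A", "D"): -0.3, ("A", "I"): -0.8, ("A", "L"): -1.2, ("A", "V"): 0.1,
    ("D", "D"): 4.7, ("D", "I"): -3.8, ("D", "L"): -4.0, ("D", "V"): -2.9,
    ("I", "I"): 4.0, ("I", "L"): 2.8, ("I", "V"): 3.1,
    ("L", "L"): 4.0, ("L", "V"): 1.8,
    ("V", "V"): 3.4,
}


def alignment_score(row_a, row_b, gap_penalty=-1.0, gap="-"):
    """Sum substitution scores over columns, charging gap_penalty per gap column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0.0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            total += gap_penalty
        else:
            total += GONNET_SUBSET[tuple(sorted((x, y)))]
    return round(total, 1)


candidates = [
    ("VVA-DAAL", "-VVIDAAL"),   # alignment 1
    ("V-VADAAL", "VVIDAAL-"),   # alignment 2
    ("VVADAAL", "VVIDAAL"),     # alignment 3
]
scores = [alignment_score(a, b) for a, b in candidates]
print(scores)
```

With these values the ungapped alignment 3 scores highest, because every gap costs 1 and the only mismatched pair (A with I) costs just 0.8.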

Calculating Sequence Similarity: Substitution Matrices: Analyse alignments between sequences you are certain are related/structurally equivalent (i.e. have lots of identical columns), e.g. haemoglobins. Measure the frequency with which each residue is observed.
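The counting step behind a substitution matrix can be sketched as follows. This only tallies how often each residue, and each pair of residues sharing a column, is observed in a trusted alignment; turning those counts into matrix scores is a further step that is not shown. The function names and the tiny two-row input are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def count_residues(msa, gap="-"):
    """How often each residue occurs anywhere in the alignment."""
    return Counter(r for row in msa for r in row if r != gap)


def count_aligned_pairs(msa, gap="-"):
    """How often each unordered residue pair shares an alignment column."""
    counts = Counter()
    for column in zip(*msa):
        residues = [r for r in column if r != gap]
        for x, y in combinations(residues, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


trusted = ["VVADAAL", "VVIDAAL"]
residue_counts = count_residues(trusted)
pair_counts = count_aligned_pairs(trusted)
print(residue_counts)
print(pair_counts)
```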

Building an Alignment

Building an Alignment: 1. Formulate the question 2. Collect initial sequence set 3. Automatically align sequences 4. Examine alignment 5. Discard sequences 6. Include additional sequences 7. Manually adjust/edit alignment

Formulate the Question

Collect Initial Sequences: 1. Search databases for appropriate sequences: keyword-based; sequence-similarity-based. 2. Get pre-aligned set of sequences: PFAM pfam.sanger.ac.uk; ENSEMBL www.ensembl.org; HOMSTRAD http://tardis.nibio.go.jp/homstrad/; TREEFAM http://www.treefam.org/

Pre-Calculated Alignments Demonstration and Exercise: Alignments associated with SRC_HUMAN: Ensembl, TreeFam, PFAM, HOMSTRAD

Automatic Alignment: Many different tools available. Choice depends on: availability; size of dataset; specific application.

Examine Alignments: CLUSTALX Colouring Scheme: Only one of many possible colouring schemes. Highlights well the differing patterns of conservation in different columns. Designed for red/green colour-blindness.

Examine Alignment: CLUSTALX Colouring Scheme: Same colour for amino acids with similar properties. Residues only coloured if a given % of residues in the column share the property. EXCEPTION: P and G are always coloured.

Examine Alignment: CLUSTALX Colouring Scheme: Hydrophobic: L V I M F W A C. Polar: N T S Q. Acidic: D E. Basic: K R. Secondary-structure breaking: G P. Large Aromatic Polar: H Y.
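The colouring rule can be mimicked in code to make its logic concrete. This is a simplified sketch, not the actual ClustalX implementation: the residue groups are the ones listed on the slide, but the single 60% threshold, the group labels and the function name `colour_column` are assumptions (the slide only says "a given %").

```python
GROUPS = {
    "hydrophobic": "LVIMFWAC",
    "polar": "NTSQ",
    "acidic": "DE",
    "basic": "KR",
    "secondary-structure breaking": "GP",
    "large aromatic polar": "HY",
}
RESIDUE_GROUP = {res: name for name, residues in GROUPS.items() for res in residues}
ALWAYS_COLOURED = {"G", "P"}


def colour_column(column, threshold=0.6, gap="-"):
    """Return, per residue in the column, its property group or None (uncoloured)."""
    residues = [r for r in column if r != gap]
    labels = []
    for r in column:
        group = RESIDUE_GROUP.get(r)
        if r == gap or group is None:
            labels.append(None)
        elif r in ALWAYS_COLOURED:
            labels.append(group)          # G and P are coloured regardless
        else:
            sharing = sum(1 for x in residues if RESIDUE_GROUP.get(x) == group)
            labels.append(group if sharing / len(residues) >= threshold else None)
    return labels


print(colour_column("LLIVD"))   # mostly hydrophobic column
print(colour_column("GKDES"))   # mixed column: only G is coloured
```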

Demo and Exercise: Comparing Automatic Alignment Results: Download reference alignments from BaliBase. Align them using several different tools. Compare automatic alignments with the reference alignment. Different tools give different results. Tools often make mistakes. Experience using several different tools. Practise looking at/examining alignments.

Alignment Tool Selection: Many (50+) protein sequences: MUSCLE (fast and fairly accurate). Few (<50) protein sequences: PROBCONS (most accurate available). Disordered protein regions: MAFFT. Few DNA sequences: PRANK (very accurate). Unhappy with initial alignment: change parameters; use different (more accurate) tool. Slide label: Default, first choice tool.

Examine Alignment: One of the most important stages in the analysis. Involves deciding whether or not the MSA is ready for use in down-stream analyses.

Adjusting the Initial Alignment

Exercise/Demo MSA analysis from start to finish

Examine Alignment: What to look out for: Mis-aligned regions. Sequences/regions important for the analysis? YES - Edit alignment manually to fix misalignment. NO - Discard sequences/ignore region.

Common Patterns - buried beta-strand: Response regulator receiver domain http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Common Patterns - amphiphilic partially-buried alpha-helices: Response regulator receiver domain http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Common Patterns - amphiphilic partially-buried alpha-helices: subtilases http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=subt&disp=str

Common Patterns - amphiphilic beta strands: ubiquitin conjugating enzyme

Common Patterns - non-globular sequence: Different, more strongly biased (from equal representation of each of the 20 amino acids) sequence composition. Low sequence complexity. Increasing complexity: TTT, TTTTT, TTTTTTT; THE, THEHT, THETHET; THE HAPPY PROTEIN. More variable sequence (more substitutions, more gaps than globular/structured regions).
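One simple way to put a number on compositional bias is the Shannon entropy of the residue composition: 0 bits when a single letter is repeated, rising as more letters are used more evenly. The slides do not name a specific measure, so the use of entropy and the function name `composition_entropy` are assumptions made for this sketch. Dedicated low-complexity filters work on sliding windows and are more elaborate.

```python
from collections import Counter
from math import log2


def composition_entropy(sequence):
    """Shannon entropy (bits per residue) of the letter composition."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    fractions = [count / total for count in Counter(sequence).values()]
    return sum(-p * log2(p) for p in fractions)


for example in ["TTTTTTT", "THETHET", "THEHAPPYPROTEIN"]:
    print(example, round(composition_entropy(example), 2))
```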

Common Patterns - short linear motifs: [RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1. Mostly occur in disordered protein regions. Often show greater conservation than neighbouring sequence. Caution: Absence of patterns that characterise different structures (e.g. helices, motifs, etc.) doesn’t mean the structures are not present! It only means that either the structure is showing different evolutionary dynamics from those shown here, or insufficient/too much diversity is present in the alignment, obscuring tell-tale patterns of conservation.

Clustalx helps spotting misaligned regions: BaliBase kinase2_ref5 as misaligned by clustalx. Quality -> “Show Low-Scoring Segments”. NB other software is available to help in this way. (Maybe add “Low-scoring segments” bit to previous slide...)

Demo - edit/correct misalignments: View alignment with ClustalX: Quality -> Show low-scoring segments. Edit using JalView: http://www.jalview.org/examples/editing.html

An aside - different tools give different alignments: clustalx, probcons

An aside - different tools give different alignments: BaliBase kinase2_ref5: mafft, prank

Examine Alignment: What to look out for: Missing regions. Sequences important/vital for the analysis? YES - Try to determine why the sequence is missing; if due to error, attempt to correct it. NO - Discard sequences.

Examine Alignment: What to look out for: Apparently nonsense/unrelated regions (shown with CLUSTALX “Quality” -> “Show Low-Scoring Segments” switched on). Sequences important/vital for the analysis? YES - Try to determine why the sequence is different; if due to error, attempt to correct it. NO - Discard sequences.

Examine Alignment: What to look out for: Lots of very similar sequences. Slow down calculations. If there are no more divergent sequences in the alignment, there may not be enough information for successful analysis. Remove most of them, leaving the most “interesting” sequences, e.g. keep human, remove chimp, guinea-pig. Find alternative data-sources to supplement the initial sequence set, e.g. ENSEMBL for access to predicted peptides not yet released elsewhere.
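Thinning out near-identical sequences can be automated once an alignment exists. The sketch below is one possible greedy approach under stated assumptions: the 80% identity cut-off, the function names, and the toy sequences and species labels are all invented for illustration, and the order of the input decides which representative is kept (so put the sequences you care most about first).

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def thin_similar(msa, max_identity=80.0):
    """Keep a sequence only if it is not too similar to any already-kept one."""
    kept = {}
    for name, row in msa.items():
        if all(percent_identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


# Hypothetical aligned sequences; "human" is listed first so that it is retained.
msa = {
    "human": "VVADAALKGE",
    "chimp": "VVADAALKGE",
    "guinea_pig": "VVADAALKGD",
    "sponge": "VVIDSALRGE",
}
print(list(thin_similar(msa)))
```

Rows of the thinned alignment may then contain gap-only columns, which can be removed without changing the remaining 1:1 relationships.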

Demo: Comparing Automatic Alignment Tools: 1aboA_ref1 from BaliBase http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/align_index.html Tools: http://www.ebi.ac.uk/mafft/ http://www.ebi.ac.uk/Tools/muscle/ http://www.ebi.ac.uk/t-coffee/

Difficult Alignments - low similarity between sequences (BaliBase: 1aboA_ref1). Typical Problems: Difficult to align throughout the sequence, with more problems in local regions of lower conservation. Possible Strategies: Collect more sequences to reduce lowest pairwise similarity. Use structural information if possible.

Difficult Alignments - “orphan” sequences (BaliBase: 1aboA_ref2; reference alignment compared with ClustalX alignment). Typical Problems: Orphan is difficult to align correctly; typically it is forced to have a similar conserved block structure to the other sequences. Possible Strategies: Collect more sequences to reduce lowest pairwise similarity. Use structural information if possible.

Difficult Alignments - Several Subfamilies. Possible Strategies: Align each subfamily on its own initially. Then align the subfamily alignments directly to each other. Add additional “bridging” sequences if you can find them.

Difficult Alignments - Insertions/Extensions (clustalx alignment compared with BaliBase reference alignment). Possible Strategies: Initially align sequences without the insertion. Then align (perhaps manually) the sequence(s) with the insertion.

Difficult Alignments - Repeats (BaliBase reference alignment). Typical Problems: Repeats are aligned “out of order”. Possible Strategies: Determine repeat structure of your sequences. Decide which order.

Difficult Alignments - Rearrangement/Shuffling (BaliBase: cellulase_1_ref8). Typical Problems: Most software prefers to minimise gaps, so completely misaligns domains, as seen in the slide figure. Possible Strategies: Determine repeat/domain structure of your sequences before aligning them. Extract sequences of individual domains, and only include sequences of the same domain in the same alignment.

Difficult Alignments - Short Linear Motifs: [DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD] LIG_AP_GAE_1; [RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1. Typical Problems: Most software finds it difficult to identify short regions of conservation in a region of low conservation. Possible Strategies: Align regions suspected of containing motifs separately from the rest of the proteins. Use MAFFT (works best for regions of this kind).
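The two motif definitions are ordinary regular expressions, so candidate motif positions can be located before deciding which regions to align separately. The sketch below uses Python's standard `re` module; the function name `find_motifs` and the two test sequences are made up for illustration, and a pattern match on its own does not show that a motif is functional.

```python
import re

MOTIFS = {
    "LIG_CYCLIN_1": r"[RK].L.{0,1}[FYLIVMP]",
    "LIG_AP_GAE_1": r"[DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD]",
}


def find_motifs(sequence, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead consumes no characters, so matches starting at
    # neighbouring positions are all reported.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical ungapped sequences.
print(find_motifs("AAKRLFGAA", MOTIFS["LIG_CYCLIN_1"]))
print(find_motifs("GGDDDFSDLGG", MOTIFS["LIG_AP_GAE_1"]))
```

Start positions are 0-based offsets into the ungapped sequence; they need converting to alignment columns before cutting regions out of an MSA.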

Summary: It’s not “Is my alignment correct?” but rather “Where is my alignment wrong?”! The most important steps in building an alignment are: specifying an appropriate question; critically examining the alignments you build. Quality of the alignment influences quality of down-stream analyses. Mis-alignment introduces systematic errors in downstream analyses. There is no “best” alignment tool: which one you use depends on your question. It can cost lots of time to make a good alignment!!
