Sequence Searching and Alignments

Tools & techniques for finding and aligning homologous sequences Andrew Cowley Web Production Team

Materials

About me MA Biochemistry, Cambridge University
MRes Bioinformatics, University of York PhD CASE Studentship, Structural Bioinformatics, University of York

About me Training bioinformatics since 2005 Joined EMBL-EBI in 2010
From Derbyshire, UK Many hobbies!

Derbyshire Beautiful countryside
1871: South Derbyshire Football Association As usual – Britain invents something only for other countries to become way better. Hopefully the same with bioinformatics! 1884: Derby County F.C. “Hold the record for the lowest ever points finish in the Premier League”

Contents Sequence databases Text searching
Sequence similarity searching Alignment basics Similarity searching tools Improving algorithms Guidelines Problem sequences Multiple sequence alignments

Sequence Databases

Primary vs Secondary Primary data comes from experiments/submitters
Derived (or secondary) data is generated with additional work (by curators etc.) from the primary data

Nucleotide primary data
Individual scientists Large-scale sequencing projects ACTGCTGCTAGCTAGCTGATCTATGCTAGC TGTAGCTGAG Patent Offices Primary sequence data Primary sequence database Original sequence data Experimental data Patent data Submitter-defined

INSDC: International Nucleotide Sequence Database Collaboration
ENA (Europe) GenBank (U.S.A.) DDBJ (Japan)
Daily exchange of data
Submissions can be made to any INSDC database. The only exceptions are for patent data, where EPO submits to EBI, USPTO submits to GenBank and JPO submits to DDBJ

Ensembl/genomes Large-scale sequencing projects ENA EMBL-Bank (ENA Annotation) Sequence Read Archive (SRA) Raw data Individual scientists Assembled sequences Annotated sequence Patent Offices IMGT/HLA EMBL-Coding etc.

Nucleotide sequence resources at EMBL-EBI
European Nucleotide Archive (ENA) ENA sequence – Annotated sequence entries Sequence Read Archive (SRA) – sequence read data Sequence Version Archive (SVA) – historical entry version ENA Coding/Non-coding Ensembl Assembled genomes and annotations for Vertebrates Ensembl Genomes Extending Ensembl to other species

When is the data updated?
ENA updates: data is updated every night, but main releases are quarterly
Normal release: quarterly release of all EMBL-Bank, e.g. Rel. 116
Updates: all updates since the last normal release, rolled into the next quarterly release

Protein sequence data Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL PIR-PSD

UniProtKB UniProtKB/TrEMBL UniProtKB/Swiss-Prot
1 entry per nucleotide submission UniProtKB/Swiss-Prot 1 entry per protein Redundant, automatically annotated - unreviewed Non-redundant, high-quality manual annotation - reviewed

UniProtKB/TrEMBL Computationally annotated UniProtKB/Swiss-Prot Manually annotated

Data sources of UniProtKB
UniProt/TrEMBL ENA (EMBL) DNA database PDB Sub/Peptide Data Ensembl FlyBase WormBase VEGA (Sanger) mRNA Data Patent Data
We’ll look at this in more detail after the break, but briefly: UniProt is the gold-standard resource for information on proteins. Every entry initially receives automatic annotation, so it’s not just a bare sequence, and there’s a team of curators that undertake manual curation using the literature and sequence analysis. We also use in-house bioinformatics tools for protein classification and domain prediction. Data from other databases is imported and cross-referenced. UniProt comprises three different databases, but I haven’t shown all three here for the sake of simplicity. UniProtKB is the central database of protein sequences with accurate, consistent, and rich sequence and functional annotation. It comprises the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section. The UniProt archive (UniParc) is an archive of all the protein sequences in the public domain, and the UniRef databases are a series of three databases that store sequences of 100%, 90% and 50% identity in the same records to speed up searching without losing information. UniProtKB contains more than 29 million cross references to over 100 other data resources; a few key ones are shown here.

Automatic annotation UniProtKB employs two prediction systems, UniRule and SAAS. SAAS (Statistical Automatic Annotation System) generates a new set of decision trees with every UniProtKB release using data mining. UniRule is a set of manually established and maintained annotation rules. In other words, automatic annotation is produced by two methods: SAAS uses a computer program to generate rules, while UniRule is a collection of rules created by scientists to propagate a specific set of data based on defined criteria. Both of these systems use Swiss-Prot and InterPro as training sets. Swiss-Prot InterPro

Curation of a UniProtKB/Swiss-Prot entry
UniProtKB/TrEMBL References Sequence variants Literature Annotations Nomenclature Sequence features Ontologies UniProtKB/Swiss-Prot

UniProtKB

UniProt databases UniProtKB/Swiss-Prot UniProtKB/TrEMBL UniRef UniParc
Manually curated UniProtKB/TrEMBL Automatically curated UniRef Sequences clustered by %identity UniParc Sequence archive – keeps track of historical sequences & identifiers Proteomes

Data Simplistically, much of the data held at EMBL-EBI can be thought of as a container Part of it is the raw data itself (e.g. protein sequence) Another part is meta-information or annotation about this data

Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
DE   Sabella spallanzanii mRNA for globin 3
KW   globin; globin 3; globin gene.
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
RN   [1]
RP
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, Brazil.
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC "
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
//

Formats Different databases store this data in different formats
EMBL
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
DE   Sabella spallanzanii mRNA for globin 3
KW   globin; globin 3; globin gene.
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
RN   [1]
RP
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
//
GENBANK
LOCUS       AJ bp mRNA linear INV 20-JUL-2001
DEFINITION  Sabella spallanzanii mRNA for globin 3.
ACCESSION   AJ131285
VERSION     AJ GI:
KEYWORDS    globin; globin 3; globin gene.
SOURCE      Sabella spallanzanii
  ORGANISM  Sabella spallanzanii
            Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Palpata;
            Canalipalpata; Sabellida; Sabellidae; Sabella.
REFERENCE   1
ORIGIN
        1 caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
       61 cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
      121 tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
      181 gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
      241 tctgtggaca agttcttcaa gcgtgtcaat ggcaaggaca tcagctcccc agccttccag
      301 gctcacatcc agcgtgtgtt cggtggcttt gacatgtgca tctccatgct tgatgacagt
      361 gatgtgctcg cctctcagct ggctcacctc cacgcccagc acgtcgagag aggaatctct
//

Formats FASTA format
>gi| |emb|AJ | Sabella spallanzanii mRNA for globin 3
CAAACAGTCARTTAATTCACAGAGCCCTGAGGTCTCTCGCTCCTTTCTGCGTCACTCTCTCTTACCGTCA
TCATGTACAAGTGGTTGCTTTGCCTGGCTCTGATTGGCTGCGTCAGCGGCTGCAACATCCTCCAGAGGCT
GAAGGTCAAGAACCAGTGGCAGGAGGCTTTCGGCTATGCTGACGACAGGACATCCCYCGGTACCGCATTG
TGGAGATCCATCATCATGCAGAAGCCCGAGTCTGTGGACAAGTTCTTCAAGCGTGTCAATGGCAAGGACA
TCAGCTCCCCAGCCTTCCAGGCTCACATCCAGCGTGTGTTCGGTGGCTTTGACATGTGCATCTCCATGCT
TGATGACAGTGATGTGCTCGCCTCTCAGCTGGCTCACCTCCACGCCCAGCACGTCGAGAGAGGAATCTCT

Format conversion tools

Meta-information Contains information important for:
Identifying/referencing a piece of data or entry Classifying an entry Determining the source of the data And can also contain annotation that adds value: Identification of sequence features Keywords, GO terms etc. Cross-references to other entries that share some property Etc.

Searching using the meta-data
When looking for a sequence we can perform text searches against the meta-data Accession look-up Keyword search eg: function, species Protein family classification Accession changes? Cross reference services PICR UniProt ID Mapping

Text search tools Each database has its own search engine
Interface tailored to their specific data use There are also EBI-wide search tools EBI Search

EBI Search First approach/entry point to data resources at EBI

EBI Search Just type, with auto-complete

EBI Search One stop search across many resources, grouped into categories

Categories

Domains Multi-domain facet

Domain-specific facet

EBI Search First approach/entry point to data resources at EBI
One stop search across many resources Non-expert friendly summaries Advanced search available (via direct URL) Allows domain/field specification Boolean etc.

Sequence searching

Sequence searching tools
Central to modern techniques Genome annotation Characterising protein families Exploring evolutionary relationships

How? Search by comparing sequence data rather than meta-data
Find sequences/entries when meta-data is missing or inaccurate More than just an exact look-up Allow for sequence variability: look for ‘similar’ sequences Sequence variation is important information for bioinformaticians Infer homology (shared ancestry) IF homologous, then can transfer information

Homology vs. Similarity
Homology (inferred): Presence of similar features because of common descent Cannot be observed, since the ancestors no longer exist Is inferred as a conclusion based on ‘similarity’ Homology is like pregnancy: either one is or one isn’t! (Gribskov, 1999)
Similarity (measurable): Quantifies a ‘likeness’ Uses statistics to determine ‘significance’ of a similarity Statistically significant similar sequences are considered ‘homologous’

Sequence alignment Query: ACATAGGT
Sequence 1: TCATAGAT Score: 6/8 identity
Sequence 2: AAATTCTG Score: 3/8 identity
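The identity figures on this slide can be reproduced with a few lines of Python. This is only a minimal sketch for ungapped, equal-length sequences; the function name `identity` is made up for illustration.

```python
def identity(query: str, subject: str) -> tuple[int, int]:
    """Count identical positions in an ungapped, equal-length comparison."""
    if len(query) != len(subject):
        raise ValueError("this sketch only handles equal-length, ungapped sequences")
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return matches, len(query)


query = "ACATAGGT"
for name, seq in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    matches, length = identity(query, seq)
    print(f"{name}: {matches}/{length} identity")
```

Counting identities like this only works when we already know how the sequences line up, which is the problem the following slides address.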

Sequence alignment Query: 1 2
atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaag atgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttct ttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaagg cacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatct caagggcacctttgcccagcttgagt Query: 1 atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggc catggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccacc aagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggc aagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgcc ctgtccactctgagcgacctgc cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggct cctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgc ctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttcctt gggagatgccataaagcacctggatgatctcaagggca 2

Dot plot Maybe a dot plot will help GATACT Sequence 1 A C A T A G
Query

Dot plot Query vs Sequence 1 Query vs Sequence 2 1 2 Query Query
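A dot plot is simple to sketch in code: put one sequence along each axis and mark every cell where the two residues are identical. The example below assumes the query on the slide is ACATAG and Sequence 1 is GATACT; the function name `dot_plot` is illustrative.

```python
def dot_plot(query: str, subject: str) -> list[str]:
    """Return one text row per subject residue; '*' marks an identical residue."""
    rows = []
    for s in subject:
        rows.append("".join("*" if q == s else "." for q in query))
    return rows


query = "ACATAG"
subject = "GATACT"
print("  " + query)
for residue, row in zip(subject, dot_plot(query, subject)):
    print(residue + " " + row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match without gaps.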

Algorithms To get a computer to solve a problem, the first step is to create a way for the computer to know what is relatively ‘good’ and what is relatively ‘bad’ I.e. a score. Computer can then assess solutions and choose best.

Simple algorithm – penalise movement away from diagonal – gap penalty
-10 -10 -10 -10

Why gap open and extension?
Adjacent gap positions are likely to have been created by the same in/del event, rather than multiple independent events Use a smaller gap extension compared to opening penalty to account for this G---ATTA G-A-T-TA

Gap extend To encourage this we apply a high penalty just to open a gap, and a low penalty for each position the gap extends over. Gap open = 10 Gap extend = 0.5 -11 -10.5 -0.5 -10.5 -10.5 -10.5 -10.5 -10 -0.5 -10.5 -11 -10 -0.5 -0.5
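As a rough sketch of why separate open and extend penalties favour one long gap over several short ones, the code below scores the two gapped strings from the earlier slide. It assumes the convention suggested by the slide values (a gap of one position costs open + extend = 10.5); real tools differ in whether the first gap position also pays the extension penalty. The function names are illustrative.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for a single gap run: one opening cost plus one extension cost per position."""
    if length <= 0:
        return 0.0
    return -(gap_open + gap_extend * length)


def total_gap_penalty(aligned: str) -> float:
    """Sum the penalties of every run of '-' in one aligned sequence."""
    total, run = 0.0, 0
    for ch in aligned + "X":  # the sentinel closes a trailing gap run
        if ch == "-":
            run += 1
        else:
            total += gap_penalty(run)
            run = 0
    return total


print(total_gap_penalty("G---ATTA"))  # one gap of length 3
print(total_gap_penalty("G-A-T-TA"))  # three gaps of length 1
```

Both strings contain three gap positions, but the single run is penalised far less than the three separate openings.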

Match/mismatch Of course, we need to tell the algorithm that matching letters are better than mismatches too This is done via a scoring matrix A C G T A C G T

Putting the two together gives us a scoring mechanism
-14 -13.5 6 C -18.5 1 -13.5 T -4 -18.5 -23 A C A

To pick the optimal alignment, start at the end and trace back the highest scoring route.
-14 -13.5 6 C -18.5 1 -13.5 Where there are 2 paths, pick best scoring T -4 -18.5 -23 A C A

Needleman-Wunsch Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm! An example of dynamic programming Comparing the full length of both sequences is called a global-global or just global alignment
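The fill-and-trace-back procedure can be sketched as follows. This is a simplified illustration, not the EMBOSS implementation: it uses a single linear gap penalty instead of separate open/extend penalties, and a plain match/mismatch score instead of a substitution matrix. All parameter values are assumptions chosen for the example.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap  # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align the two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner, following the moves that produced each score.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

When two paths give the same score, this sketch simply prefers the diagonal move; other implementations may break ties differently.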

Global vs Local But global-global might not be suitable for sequences of very different lengths A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm. It sets negative scores in the matrix to 0, so the trace back can start at the highest-scoring cell and stop wherever the score reaches 0
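A minimal sketch of the local variant is shown below. As with any toy example it makes simplifying assumptions: a linear gap penalty, a plain match/mismatch score, and illustrative parameter values. It is not the code behind Water or SSEARCH.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                           # never go negative: restart here
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Only the shared ACG core is reported; the unrelated flanking residues are left out of the alignment rather than being forced to pair up.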

Global vs Local A T G T A T A C G C A - T G T A T A C G C
A G T A T A - G C A G T A T A G C

Scoring Parameters so far: Can we improve it? Match/mismatch
Gap opening Gap extending Can we improve it?

Substitutions Some substitutions are more likely than others

Protein substitution matrices
Can look at closely related proteins to determine substitution rates Two most commonly used models: PAM BLOSUM

PAM PAM 250 Point Accepted Mutation
Observed mutations in a set of closely related proteins Markov chain model created to describe substitutions Normalised so that PAM1 = 1 mutation per 100 amino acids Extrapolate matrices from model Higher PAM number = less closely related PAM 250

BLOSUM BLOcks SUbstitution Matrix
Align conserved regions of evolutionarily divergent sequences clustered at a given % identity Count relative frequencies of amino acids and substitution probability Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely. Higher BLOSUM number = more closely related
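The "more positive = more likely" idea comes from expressing each matrix entry as a scaled log-odds ratio: how often a pair is observed in the aligned blocks, divided by how often it would be expected from the background frequencies alone. The sketch below uses made-up frequencies purely for illustration; the scale factor of 2 (half-bit units) is the convention commonly quoted for BLOSUM62.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 of observed pair frequency over the frequency expected by chance."""
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# Hypothetical numbers: both residues have a background frequency of 0.1.
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen 4x more often than chance
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance
```

A pair observed exactly as often as chance predicts scores 0.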

More divergent: BLOSUM 45, PAM 250 Less divergent: BLOSUM 62, PAM 160

Scoring Parameters: Match/mismatch Gap opening Gap extending
Substitution matrix

Dynamic programming alignments at the EBI
EMBOSS Pairwise Alignment algorithms European Molecular Biology Open Software Suite Suite of useful tools for molecular biology Command line based Designed to be used as part of scripts/chained programs We implement selected tools to provide web and Web Services access Database alignments via FASTA suite of programs

Where to find at the EBI?

Pairwise alignment tools
Global alignment Local alignment Genomic DNA alignment Needle Stretcher Big sequences Water Matcher Big sequences LALIGN WISE tools

Change to nucleotide Sequence input Parameters Submit!

Key Gap : Positive match Negative match | Identity

Example sequences www.ebi.ac.uk/~apc/Courses/Brazil
Pairwise_align1.fsa Pairwise_align2.fsa

Searching a database Multiple pairwise alignments between query sequence and database sequence

Dynamic programming sequence search methods at the EBI
Global alignment Local alignment Global query vs local database Profile-iterative search GGSEARCH SSEARCH GLSEARCH PSI-SEARCH

Database selection Sequence input Parameters Submit!

Dynamic programming methods are rigorous and guarantee an optimal result
But take up a lot of memory And evaluate each position of the matrix Predictably, this makes them slow and demanding when you are aligning large sequences

Heuristics Therefore we need methods of estimating alignments
Estimation methods are called heuristics Try and take short cuts in an intelligent manner Speed up the search At the possible expense of accuracy Accuracy in sequence searches is important for: Aligning the right bits Scoring the alignment correctly Identifying similar sequences - sensitivity

Going back to our dot plot

Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.

This is the method used by FASTA
Of course, we have to identify likely regions: not all alignments will be as nice as that one! W.R. Pearson & D.J. Lipman PNAS (1988) 85:

FASTA – step 1 Identify runs of identical sequence and pick regions with highest density of runs Ktup parameter: How small are ‘words’ considered before they are ignored Increase Ktup = faster, but less sensitive
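The word look-up in step 1 can be sketched as below: index every ktup-length word in the query, scan the library sequence for the same words, and count hits per diagonal (library position minus query position). This is an illustrative simplification, not the FASTA source code; the sequences are the toy examples from the earlier alignment slide.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical ktup-length words per diagonal (subject offset minus query offset)."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals


print(diagonal_hits("ACATAGGT", "TCATAGAT", ktup=2))
print(diagonal_hits("ACATAGGT", "TCATAGAT", ktup=3))
```

With ktup = 2 the main diagonal collects four hits plus one stray hit elsewhere; with ktup = 3 only the three hits on the main diagonal remain, which shows how a larger ktup reduces the work at the cost of ignoring shorter matches.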

FASTA – step 2 Weight scoring of runs using matrix, trim back regions to those contributing to highest scores Parameter: Substitution matrix

FASTA – step 3 Discard regions too far from the highest scoring region
Joining threshold: Internally determined

FASTA – step 4 Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions Parameters: Gap open Gap extend Substitution matrix

FASTA Repeat against all sequences in the database

FASTA – programs available at EBI
FASTA: “a fast approximation to Smith & Waterman” FASTA: scan a protein or DNA sequence library for similar sequences. FASTX/Y: compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames. TFASTX/Y: compare a protein sequence to a translated DNA data bank. FASTF: compares ordered peptides (Edman degradation) to a protein databank. FASTS: compares unordered peptides (Mass Spec.) to a protein databank.

Database selection Sequence input Parameters Submit!

FASTA - results

FASTA - results Key Gap : Identity Similarity X Filtered

Example sequence test_prot.fasta

BLAST – Basic Local Alignment Search Tool
Instead of narrowing the dynamic programming search space, BLAST works a slightly different way Firstly, it creates a word list both of the exact sequence and high scoring substitutions Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:

BLAST – step 1 w=3 SEWRFKHIYRGQPRRHLLTTGWSTFVT SEW EWR WRF Parameter:
Word length (w) WRF Increase = faster, but less sensitive

BLAST – step 1(cont.d) w=3 T=13 SEWRFKHIYRGQPRRHLLTTGWSTFVT GQP 18
GEP 15 GRP 14 GKP 14 GNP 13 GDP 13 Parameters: Neighbourhood threshold (T) Substitution matrix AQP 12 NQP 12
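The word list and its neighbourhood can be sketched like this. To keep the example short it embeds only the handful of BLOSUM62 scores needed to reproduce the GQP example on the slide; `PARTIAL_BLOSUM62` is therefore a deliberately incomplete, illustrative table, and a real search would use the full matrix over all 20 amino acids.

```python
from itertools import product

# Partial BLOSUM62 scores: only the substitutions shown on the slide.
PARTIAL_BLOSUM62 = {
    "G": {"G": 6, "A": 0, "N": 0},
    "Q": {"Q": 5, "E": 2, "R": 1, "K": 1, "N": 0, "D": 0},
    "P": {"P": 7},
}


def words(sequence: str, w: int = 3) -> list[str]:
    """All overlapping words of length w in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def neighbourhood(word: str, threshold: int, scores=PARTIAL_BLOSUM62) -> list[tuple[str, int]]:
    """Words whose summed substitution score against `word` is at least the threshold T."""
    choices = [scores[residue].items() for residue in word]
    hits = []
    for combo in product(*choices):
        candidate = "".join(res for res, _ in combo)
        total = sum(s for _, s in combo)
        if total >= threshold:
            hits.append((candidate, total))
    return sorted(hits, key=lambda h: -h[1])


query = "SEWRFKHIYRGQPRRHLLTTGWSTFVT"
print(words(query)[:3])
print(neighbourhood("GQP", threshold=13))
```

With T = 13, the words AQP and NQP (both scoring 12) fall below the threshold and are left out of the list.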

BLAST – step 2 Then it scans database sequences for exact matches with these words

BLAST – step 3 If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount This results in a High-scoring Segment Pair (HSP) Parameters: Drop off Substitution matrix
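The drop-off rule can be illustrated with a one-directional, ungapped extension. This is a simplified sketch: it extends to the right only, uses a plain match/mismatch score rather than a substitution matrix, and the `drop_off` value is an arbitrary choice for the example. BLAST itself extends in both directions.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 match: int = 1, mismatch: int = -1, drop_off: int = 2) -> tuple[int, int]:
    """Extend an ungapped hit to the right; return (best score, length at best score)."""
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > drop_off:
            break  # score has fallen too far below the best seen so far
    return best, best_len


print(extend_right("ACATAGGT", "ACATTTTT", 0, 0))
print(extend_right("ACATAGGT", "TCATAGAT", 1, 1))
```

The extension is trimmed back to the point where the score was highest, which is what defines the segment pair.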

BLAST – step 4 If the total HSP score is above another threshold then a gapped extension is initiated Parameters: Extension threshold (Sg) Substitution matrix

BLAST The steps rule out many database sequences early on
Large increase in speed

BLAST – programs available at the EBI
Basic Local Alignment Search Tool NCBI BLAST programs: BLASTP – protein sequence vs. protein sequence library BLASTN – nucleotide query vs. nucleotide database BLASTX – translated DNA vs. protein sequence library

Key Gap [residue] Identity Similarity X Filtered

When to use what? BLAST Query length FASTA Database size GGSEARCH
SSEARCH PSI-SEARCH GGSEARCH Database size

When to use what? BLAST time to search FASTA GGSEARCH
SSEARCH PSI-SEARCH GGSEARCH PDB Swiss-Prot UniRef50 UniRef 90 UniRef100 UniProtKB UniParc

Homology and Similarity

Similarity

Homology

So far, we’ve talked about scoring alignments
Direct function of the algorithm But what we want is to assign some kind of quality to that score

Score vs significance A A A A C A T A A G G C T A A A
A T A C A A G C C T High score Higher significance?

“Lies, damn lies, and statistics”

“Lies, damn lies, and statistics”
Not just interested in score... ...But how likely we are to get that alignment by chance alone It is this ‘non-random’ alignment that infers homology Statistics are used to estimate this chance

E-value ‘Expect’ value (really ‘expectation’)
Number of alignments with this score or better that you would expect to find by chance in the given database, or “how many times you might be wrong” Best measure of how biologically significant an alignment is Used for ranking results by default Most people use 10^-3: “Happy to be wrong one time in a thousand” This is fine for a one-off search. If you are doing lots of searches this might be too relaxed.

Calculated in slightly different ways for BLAST and FASTA
Short alignments are more likely to be found by chance so have higher E-values Affected by database size BLAST and FASTA both optimised for distant relationships By default there is an E-value cut-off, so results worse than this won’t be displayed. If you have something with poor statistics, e.g. short alignments, you may need to raise the E-value threshold to see more (or any) alignments. If you have a large database your chances of obtaining hits randomly also increase, so the value might need to be adjusted (better to use a smaller database).
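The effect of database size can be seen in the bit-score form of the Karlin-Altschul statistics used by BLAST, E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is only meant to show the scaling; real programs use effective lengths and other corrections, and FASTA estimates its statistics differently. The function and parameter names are illustrative.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2^(-S'): chance alignments expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


small_db = expected_hits(40, query_length=100, database_length=1_000_000)
large_db = expected_hits(40, query_length=100, database_length=2_000_000)
print(small_db, large_db)
```

Doubling the database doubles the E-value for the same alignment, and each extra bit of score halves it.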

FASTA statistics Compares query sequence with every sequence in database As most of these sequences are unrelated it is possible to use the distribution of scores (sampled) to assign statistical significance As distribution is taken from a random sample, exact E- Value can vary slightly from search to search

FASTA - histogram Key = Observed distribution of scores * Predicted distribution of scores
A sudden peak of high scores at the bottom of the chart (more than expected from a random distribution) indicates that something other than chance must have occurred, i.e. they score well because of homology. Use FASTA to demonstrate the histogram for any normal input sequence: the histogram needs to be turned on in ‘more options’, then it is found at the top of the tool output tab in results. High scoring region

BLAST statistics
Main reason for speed is that it doesn’t compare query with lots of other sequences Therefore it pre-estimates statistical values using a random sequence model “Appears to yield fairly accurate results”

Improving algorithms

Sensitivity, Selectivity & Speed
Sensitivity is how distantly you can determine a homologous sequence (avoid false negative) Selectivity is how accurately you can determine whether a sequence is homologous or not (avoid false positive) Speed is obviously how long it takes!

In general, the more information we can add to an alignment, the better the result
Conserved regions Structural information Motifs [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H

Conserved regions We can add a new ‘position’ parameter to the substitution matrix We can even modify a normal search to generate a position specific scoring matrix, or PSSM

PSI-BLAST Position Specific Iterative – BLAST:
Takes the result of a normal BLAST Aligns them and generates profile of conserved positions Uses this to weight scoring on next iteration
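The profile-building step can be sketched as a position-specific scoring matrix built from column counts. This toy version assumes an ungapped alignment, a uniform background frequency and a simple pseudocount; PSI-BLAST itself uses sequence weighting and more careful pseudocounts. The aligned peptides are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], alphabet: str = AMINO_ACIDS, pseudocount: float = 1.0):
    """One dict per column: log2 of (smoothed column frequency / uniform background)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({r: math.log2(((counts[r] + pseudocount) / total) / background)
                     for r in alphabet})
    return pssm


def pssm_score(pssm, sequence: str) -> float:
    """Score a sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Hypothetical aligned hits: variable first three positions, conserved A-T-H at the end.
hits = ["RDFATH", "TAEATH", "DQAATH"]
profile = build_pssm(hits)
print(round(pssm_score(profile, "RDFATH"), 2), round(pssm_score(profile, "WWWWWW"), 2))
```

Fully conserved columns get the strongest positive scores, so on the next iteration a match at those positions counts for more than a match at a variable position.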

PSI-BLAST

PHI-BLAST Pattern Hit Initiated-BLAST
User provides a pattern alongside a protein Database hits have to contain this pattern, and similarity to rest of sequence Results can initiate a PSI-BLAST search as well

PSI-BLAST By adding importance to conserved residues we might be able to find more distant sequences But iterate too far and we might be assigning importance where there is none Problem of Homologous Over-Extension (HOE) More sensitive Less selective

Homologous Over-Extension (HOE)
3rd 2nd Alignment region Extends over subsequent iterations

Contaminated PSSM

Which can cause (significant) alignment with unrelated protein

PSI-BLAST initial search Expect score: 9.0x10^-5 Note the score of this alignment region (correctly aligned with homologous domain)

PSI-BLAST 2nd Iteration

PSI-BLAST 3rd Iteration Expect score: 1.0x10^-4

PSI-BLAST 5th Iteration Expect score: 1.0x10^-19 Note the expect scores again: we have even more significant alignments than before, but with non-homologous domains. Expect score: 7.0x10^-4

Reducing HOE Look for domains in results and manually select sequences that form part of PSSM Mask boundaries according to initial alignment Results in improvement of false-positives (selectivity)

PSI-SEARCH Smith-Waterman implementation (SSEARCH)
With iterative position specific scoring Optional boundary masking to reduce HOE

Reducing HOE errors Sequence boundary masking procedure
First time a significant alignment occurs for a library sequence, store co-ordinates

Reducing HOE errors Mask regions outside so can’t contribute to PSSM
Masking tells the algorithm to ignore a portion of sequence – when using filters (next section) you are saying to treat that portion of sequence as aligning equally to any other sequence, here we are masking largely for the benefit of the PSSM generation (ie these regions won’t influence the way the PSSM is generated).

Reducing HOE errors PSI-Search 2nd Iteration PSI-Search 5th Iteration

PSI-Search

So what does that do to sensitivity/selectivity?
PSI-Search = Very sensitive + Much more selective Sensitivity Ie PSI Search is already much more sensitive and selective than PSI-BLAST, but adding the HOE-reduction masking gives a great improvement on selectivity (even for PSI-BLAST). jackHMMER (a HMM based aligner) is very sensitive, but still not as selective as PSI Search. Selectivity

Coming soon PSI-Search 2!
Use domain annotations/predictions to inform alignment

Low complexity regions
Biologically irrelevant, but likely to skew alignment scoring E.g. CA repeats, poly-A tails and Proline rich regions

Good Statistics: The inset shows good correlation
between the observed and expected numbers of scores. This is the region of the histogram to look out for first when evaluating results.

Bad Statistics: The inset shows bad correlation
between the observed and expected scores in this search. The spaces between the = and * symbols indicate this poor correlation. One reason for this can be low complexity regions.

Low complexity regions
Compensate by filtering/masking the sequence so these regions don’t contribute to scoring Filters: seg, xnu, dust, CENSOR But check what you are filtering!
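To give a feel for what a low-complexity filter does, the sketch below masks any window whose Shannon entropy falls under a threshold. This is not the seg, xnu or dust algorithm, just a simplified stand-in; the window size, entropy threshold and test sequence are arbitrary choices for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(sequence: str, window: int = 12, min_entropy: float = 1.5,
                        mask_char: str = "X") -> str:
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Hypothetical protein with a proline-rich stretch in the middle.
protein = "MKTAYIAKQRQISFVK" + "P" * 14 + "SHFSRQLEERLGLIEV"
print(mask_low_complexity(protein))
```

Printing the masked sequence before searching is an easy way to check what is being filtered, as the slide advises.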

Filtered: Inset showing the effect of using a low
complexity filter (seg) and searching the database using the segment with highest complexity. Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.

Example sequence Filtertest_seq.fsa

Database composition Statistics rely on the database containing wide coverage Assumption: query is not homologous to most of the data Specialist databases might cause problems E.g. immuno-databases, made up of relatively few genes A lot of the database IS homologous Skews statistics

Database composition Can’t make same assumptions about coverage
So don’t use BLAST FASTA based tools sample the score so provide accurate statistics Use the histogram to check Use shuffled versions of database to create additional coverage

Search Guidelines

Search guidelines 1 Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc.) Then with translated DNA query sequences (fastx, blastx) Search with DNA vs. DNA as the next resort And then against translated DNA database sequences (tfastx, tblastx) as the VERY LAST RESORT!

Search guidelines 2 Search the smallest database that is likely to contain the sequence(s) of interest Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology

Search guidelines 3 Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence Examine the histograms Use programs such as prss3 to confirm the expectation values. Searching with shuffled sequences (use MLE/Shuffle in fasta) which should have an E() ~1.0 Perform reverse search
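The shuffling idea is easy to sketch: a shuffled sequence keeps the same composition and length as the original but loses its order, so any hits it produces reflect chance rather than homology. The helper below is an illustrative stand-in, not the shuffle used inside the FASTA programs.

```python
import random


def shuffled(sequence: str, seed=None) -> str:
    """Return the residues of `sequence` in random order (same composition and length)."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


query = "SEWRFKHIYRGQPRRHLLTTGWSTFVT"
print(shuffled(query, seed=1))
```

Searching with a few such shuffled queries and checking that the best E() values stay around 1 is a simple sanity check on the statistics.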

Search guidelines 4 Default parameters are set up for most common queries Consider searches with different gap penalties and other scoring matrices, especially for short queries/domains Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250) Remember to change the gap penalty defaults (if the tool doesn’t change them for you) MATRIX open ext. BLOSUM BLOSUM BLOSUM PAM PAM As the scale of numbers varies with different matrices, the gap penalties need to scale with them. We change them automatically on our website.

Search guidelines 5 Homology can be reliably inferred from statistically significant similarity But remember: Orthologous sequences have similar functions Paralogous sequences can acquire very different functional roles So further work might be needed to tease out details

Search guidelines 6 Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues However, motif identity in the absence of significant sequence similarity usually occurs by chance alone Try to produce multiple sequence alignments in order to examine the relatedness of your sequence data Clustal Omega MUSCLE T-Coffee Kalign MAFFT

Problem Sequences

Short sequences What about short sequences? Depends on their nature:
Protein: Use shallow matrices. Reduce word length and/or increase the E() value cut-off.
DNA: Reduce the word length. Ignore gap penalties (force local alignments only). Use rigorous methods.
But ask what you are trying to do!

Vector contamination
You think you know what your sequence is... but the results are really confusing!
Maybe you have vector contamination
Search against known vectors to check

Vector contamination

Example sequences www.ebi.ac.uk/~apc/Courses/Brazil
vectortest_seq1.fsa vectortest_seq2.fsa

Multiple Sequence Alignments

Uses of Multiple Sequence Alignment (MSA)
Alignment of three or more sequences Functional prediction Structural prediction Conservation analysis Classification Phylogeny To help distinguish between orthology and paralogy

We have a (computational) problem…
Pairwise alignments are simple enough to find the optimal (highest scoring) solution in a reasonable timeframe Multiple sequence alignment is in a class of problems that is ‘NP-hard’

Polynomial time ("easy") vs NP-hard
Polynomial time: problems that are solvable in polynomial time, e.g. operations to solve = n^2
NP-hard: problems that are hard to solve; the known exact methods take exponential time, e.g. operations to solve = 2^n

n^2 vs 2^n
Imagine a computer running 10^9 operations a second

n     n^2    time      2^n      time
10    100    < 1 sec   1024     < 1 sec
30    900    < 1 sec   ~10^9    ~1 sec
50    2500   < 1 sec   ~10^15   ~13 days
70    4900   < 1 sec   ~10^21   ~37 thousand years
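The running times on this slide can be checked with a few lines of arithmetic. The sketch below assumes the same machine speed as the slide (10^9 operations a second); the constant and function names are illustrative.

```python
OPS_PER_SECOND = 10**9
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def runtime_seconds(operations):
    """Seconds needed to perform the given number of operations."""
    return operations / OPS_PER_SECOND


for n in (10, 30, 50, 70):
    poly = runtime_seconds(n**2)
    expo = runtime_seconds(2**n)
    print(f"n={n:2d}  n^2: {poly:.2e} s   2^n: {expo:.2e} s")

days_at_50 = runtime_seconds(2**50) / SECONDS_PER_DAY
years_at_70 = runtime_seconds(2**70) / SECONDS_PER_YEAR
```

With these assumptions 2^50 operations take about 13 days and 2^70 operations take roughly 37 thousand years, while n^2 stays far below a second throughout.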

What to do about NP-hard problems?
Give up (do you really need MSA?) Use approximations and heuristics

Weighted Sums of Pairs: WSP
Human beta       VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

Weighted Sums of Pairs: WSP
Time: O(L^N) for N sequences of length L (from Des Higgins talk)

Sequences   Time
2           1 second
3           … seconds
4           … hours
5           39 days
6           16 years
7           … years
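The quantity being optimised can be sketched as follows: for every pair of rows in the alignment, score each column, multiply by the pair's weight, and add everything up. This is a minimal sketch under simplifying assumptions: a match/mismatch/gap scheme stands in for a real substitution matrix and affine gap penalties, and the function names and toy rows are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of a pairwise projection of the alignment."""
    if a == "-" and b == "-":
        return 0  # gap against gap carries no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(rows, weights=None):
    """Sum pairwise column scores over all row pairs, each scaled by its weight."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same aligned length")
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * sum(pair_score(a, b) for a, b in zip(rows[i], rows[j]))
    return total


toy = ["VHLT-PEE", "VQLSGEEK", "V-LSPADK"]
print(weighted_sum_of_pairs(toy))
```

Scoring a given alignment like this is cheap; the expensive part is searching over all possible alignments for the one with the best score, which is where the O(L^N) cost comes from.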

Progressive Alignment:
Hogeweg and Hesper, 1984
Barton and Sternberg, 1987
Feng and Doolittle, 1987
Willie Taylor, 1987, 1988
Florence Corpet, 1988
Higgins and Sharp, 1988
Jotun Hein, 1989
From Des Higgins talk

Sequences: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin

Guide Tree
Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin
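A guide tree of this kind can be built by repeatedly joining the two closest clusters. The sketch below uses UPGMA (average linkage), which is one common choice and not necessarily what any particular Clustal version does. The distances are invented for illustration, not computed from the globin alignment, and the function name is hypothetical.

```python
def upgma(distances):
    """Build a guide tree by average linkage.

    distances: {(name_a, name_b): distance} for every pair of sequences.
    Returns the tree topology as a Newick-like string (no branch lengths).
    """
    d = {frozenset(pair): float(v) for pair, v in distances.items()}
    size = {name: 1 for pair in distances for name in pair}
    while len(size) > 1:
        a, b = sorted(min(d, key=d.get))  # closest pair of clusters
        merged = f"({a},{b})"
        for other in size:
            if other in (a, b):
                continue
            # Distance to the new cluster is the size-weighted average.
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
        d = {p: v for p, v in d.items() if a not in p and b not in p}
    return next(iter(size))


# Hypothetical distances (fraction of differing positions).
toy_distances = {
    ("human_beta", "horse_beta"): 0.17,
    ("human_alpha", "horse_alpha"): 0.13,
    ("human_beta", "human_alpha"): 0.6,
    ("human_beta", "horse_alpha"): 0.6,
    ("horse_beta", "human_alpha"): 0.6,
    ("horse_beta", "horse_alpha"): 0.6,
    ("whale_myoglobin", "human_beta"): 0.8,
    ("whale_myoglobin", "horse_beta"): 0.8,
    ("whale_myoglobin", "human_alpha"): 0.8,
    ("whale_myoglobin", "horse_alpha"): 0.8,
}
tree = upgma(toy_distances)
print(tree)
```

A progressive aligner then follows the tree from the leaves inwards: the closest pairs are aligned first, and the resulting alignments are aligned to each other.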

Clustal www.clustal.org >85,000 citations
Clustal1-Clustal4: 1988, Paul Sharp, Dublin
Clustal V: 1992, EMBL Heidelberg, Rainer Fuchs, Alan Bleasby
Clustal W, Clustal X: Toby Gibson, EMBL, Heidelberg; Julie Thompson, ICGEB, Strasbourg
Clustal W and Clustal X: University College Dublin

ClustalW2 at the EBI

ClustalW2 Sequence input Parameters Submit!

ClustalW2

Jalview

ClustalW2
Advantages: Quite fast for low numbers. Not too demanding. Widely used.
Disadvantages: Fixing of early alignments (propagates errors). Doesn’t search far (local minima). Compresses gaps.

Example sequences Prot_MSA.fsa

Other progressive aligners
MUSCLE: Optimised progressive aligner. Good alternative to ClustalW.
[Table: BaliBase % correct and time (s), Clustal W vs Muscle]

Other progressive aligners
KAlign: Local regions progressive aligner. Extremely fast! Good for large alignments/input.
[Table: BaliBase % correct and time (s), Clustal W vs Kalign]

Consistency based alignment
Maximise similarity to a library of residue pairs

COFFEE Consistency based Objective Function For alignmEnt Evaluation
Maximum Weight Trace (John Kececioglu) Maximise similarity to a LIBRARY of residue pairs Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14:

COFFEE
Library of reference pairwise alignments: built for your given set of sequences; weighted depending on quality of alignment
Objective Function: evaluates consistency between the multiple alignment and the library of pairwise alignments
Use SAGA to optimise this function
SAGA is another alignment method, using genetic algorithms
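The consistency idea can be sketched as the fraction of residue pairs aligned in the multiple alignment that are also aligned in the library of pairwise alignments. This is a simplified, unweighted illustration: COFFEE itself weights library entries by alignment quality, and all names and toy sequences below are assumptions.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Residue pairs sharing a column, as ((seq_i, pos_a), (seq_j, pos_b))."""
    counters = [0] * len(rows)  # next residue index in each ungapped sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for i, ch in enumerate(column):
            if ch != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def library_from_pairwise(pairwise):
    """pairwise: {(i, j): (row_i, row_j)} with i < j, one alignment per pair."""
    library = set()
    for (i, j), rows in pairwise.items():
        for (_, a), (_, b) in aligned_pairs(rows):
            library.add(((i, a), (j, b)))
    return library


def consistency(msa_rows, library):
    """Fraction of residue pairs in the MSA that the library supports."""
    pairs = aligned_pairs(msa_rows)
    return len(pairs & library) / len(pairs) if pairs else 0.0


msa = ["AC-G", "A-TG", "ACTG"]
agreeing = library_from_pairwise({
    (0, 1): ("AC-G", "A-TG"),
    (0, 2): ("AC-G", "ACTG"),
    (1, 2): ("A-TG", "ACTG"),
})
conflicting = library_from_pairwise({
    (0, 1): ("ACG-", "-ATG"),  # shifted pairwise alignment
    (0, 2): ("AC-G", "ACTG"),
    (1, 2): ("A-TG", "ACTG"),
})
print(consistency(msa, agreeing), consistency(msa, conflicting))
```

An optimiser such as SAGA would search for the multiple alignment that maximises this kind of score.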

COFFEE More accurate than ClustalW
Much less prone to problems in early alignment stages VERY slow!

T-Coffee Tree-based COFFEE Heuristic approach to COFFEE
Gets rid of genetic algorithm portion Uses progressive alignments Changes algorithm based on number of sequences

T-Coffee Much faster than COFFEE Avoids some of ClustalW’s pitfalls
Can take information from several data sources Still not that fast Can be very demanding of memory etc.

Other Tools MAFFT Iterative based Fast Fourier Transform
Different modes – can operate in both progressive and consistency type alignments

NEW!: Clustal Omega Completely different way of doing things from ClustalW Two major areas of improvement: 1) Guide tree generation 2) Profile-profile alignments

Clustal Omega – Guide Tree improvements
Guide tree generation is one of the slowest steps, especially with large numbers of sequences
Clustal Omega uses the mBed method: it samples a range of sequences and represents every sequence as a vector of distances to these samples
Results in better scaling with more sequences
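The embedding idea can be sketched as follows: instead of comparing every sequence with every other sequence (N x N comparisons), each sequence is compared only with a small set of sampled reference sequences (N x t comparisons), and the resulting vectors are clustered. This is a toy illustration of the principle, not Clustal Omega's implementation; the k-mer distance, function names and sequences are assumptions.

```python
def kmers(seq, k=2):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Crude distance: 1 minus the fraction of shared k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def embed(sequences, seeds, k=2):
    """Represent each sequence as its vector of distances to the seed sequences."""
    return [[kmer_distance(s, seed, k) for seed in seeds] for s in sequences]


sequences = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG", "TTTAGGGG"]
seeds = [sequences[0], sequences[2]]  # hypothetical sample of reference sequences
vectors = embed(sequences, seeds)
for seq, vec in zip(sequences, vectors):
    print(seq, vec)
```

Similar sequences end up with similar vectors, so clustering the vectors can stand in for clustering on the full distance matrix.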

Clustal Omega – Profile-profile alignments
Like sequence searching, profiles can be used to increase sensitivity HMMs are a form of profile Clustal Omega aligns HMMs to HMMs

Clustal Omega
Better scaling for many sequences (speed, accuracy)
Better scaling for many computers
More accurate alignments
Nucleic acid alignments are still a work in progress

Which tool should I use?

Input data                                    Recommendation
2-100 sequences of typical protein length     MUSCLE, T-Coffee, MAFFT, ClustalW2/Omega
More sequences (up to 500)                    Clustal Omega, MUSCLE, MAFFT
>500 sequences                                Clustal Omega, KALIGN
Small number of unusually long sequences      ClustalW, KALIGN

How to evaluate? Use a benchmark BaliBASE

BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
ICGEB Strasbourg
141 manual alignments using structures
Core alignment regions marked
5 sections:
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)

[Table: BaliBase % correct and time (s) for Clustal Omega, Clustal W, Mafft (default), Muscle, Kalign, T-Coffee, Probcons, Mafft (auto/consistency), MsaProbs]

Benchmark pitfalls Benchmark dataset may not be representative
Danger of over-training towards benchmark Goldman: Most MSAs have unrealistic gaps Tend towards multiple, independent deletions Insertions are rare Sequences shrink in length over evolution No supporting evidence that this is the case

Solutions Use phylogenetic data to guide alignment
Keep track of changes to ancestor sequences Don’t change them again so easily in descendants

Phylogeny Multiple Sequence Alignment tries to find best alignment of three or more sequences Used to identify groups of similar sequences Conserved regions etc. But if we want to examine evolutionary relationships we need more than just current sequence similarity Phylogeny is an estimate of evolutionary history between sequences Model substitutions from theoretical ancestor sequences

Neighbour Joining
Simple phylogenetic tree method
Bottom up (starts from an alignment of current-day sequences)
Iterates to form a tree, at each step joining the pair of taxa that minimises the distance criterion
Fast
Dependent on accuracy of input
Can sometimes give negative branch lengths
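One iteration of neighbour joining can be sketched as below. Given a distance matrix, it computes the standard criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the row sum, joins the pair with the smallest Q, and assigns the two branch lengths to the new node. The function name and the five-taxon matrix are illustrative assumptions.

```python
def nj_join_step(names, d):
    """Pick the pair of taxa to join and the branch lengths to the new node.

    names: list of taxon labels; d: symmetric distance matrix (list of lists).
    """
    n = len(names)
    r = [sum(row) for row in d]  # total distance from each taxon to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the new internal node.
    len_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return names[i], names[j], len_i, len_j


names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
step = nj_join_step(names, matrix)
print(step)
```

The branch length formula subtracts a correction term from d(i, j) / 2, so with noisy or non-additive distances one of the two lengths can come out negative, which is the behaviour noted on the slide.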

ClustalW2 - Phylogeny Neighbour joining (and UPGMA) phylogenetic tree algorithm from the ClustalW2 package

ClustalW2 - Phylogeny: requires ALIGNED sequence input

ClustalW2 - Phylogeny

PRANK Probabilistic Alignment Kit webPRANK
Better suited for closely related sequences
Tied solutions are chosen between at random: this avoids incorrect confidence in the result, but means alignments might not be reproducible
Alignments look quite different. They might look worse! But the gap patterns make sense
Gaps are good!

Common problems with MSA
Input format: Try using FASTA format. Unique sequence identifiers. Include sequence! Usually limit of 500 sequences/1MB.
Job can’t be found/other error: Results are deleted after 7 days.
Some sequence/program combinations run out of memory: Use a different program.
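The input checks listed above are easy to run before submitting a job. The sketch below looks for duplicate identifiers, records with no sequence, and too many sequences. The function name and messages are assumptions, and the limit of 500 is taken from the slide, not from any particular service's current documentation.

```python
def check_fasta(text, max_sequences=500):
    """Return a list of problems found in FASTA-formatted text (empty if none)."""
    problems = []
    lengths = {}  # identifier -> number of sequence characters seen
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            if not current:
                problems.append("header without identifier")
            elif current in lengths:
                problems.append(f"duplicate identifier: {current}")
            lengths.setdefault(current, 0)
        elif current is None:
            problems.append("sequence data before first '>' header")
        else:
            lengths[current] += len(line)
    for ident, length in lengths.items():
        if length == 0:
            problems.append(f"no sequence for: {ident}")
    if len(lengths) > max_sequences:
        problems.append(f"more than {max_sequences} sequences")
    return problems


good = ">seq1\nMKV\n>seq2\nMKL\n"
bad = ">seq1\nMKV\n>seq2\n>seq1\nMKL\n"
print(check_fasta(good))
print(check_fasta(bad))
```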

Example sequences www.ebi.ac.uk/~apc/Courses/Brazil Problem_MSA1.fsa

Common mis-uses of MSA Performing a sequence assembly
Specialist type of MSA Use other tools (Staden etc.) Aligning ESTs to a reference genome Use EST2Genome Designing primers Use primer tools (primer3 etc.) Aligning two sequences Use a pairwise alignment tool!

Putting it all together
EBI Search Sequence retrieval Sequence search Sequences retrieval Multiple sequence alignment Phylogeny Analysis

Final remarks Don’t assume a single tool will cater for all your needs
Change the parameters of the tools Remember where the tool excels and what its limitations are A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!) Crazy input will always give crazy results!

Getting Help

Getting Help Database documentation EBI Support EBI training programme EBI online training

Thank you! Facebook: EMBLEBI

With thanks to our funders
EMBL-EBI is primarily funded by EMBL member states Other major funders: European Commission National Institutes of Health Research Councils UK Wellcome Trust Industry Programme

Appendix

ENA database structure: how is the data organised?
Data in EMBL-Bank is divided in 2 ways:
1) Data classes: type of data or methodology used to obtain data. Each entry belongs to one data class.
2) Taxonomic divisions: each entry belongs to one taxonomic division.

1) Data Classes
CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
SRA  Sequence Read Archive (both databank and data class)
STS  Sequence Tagged Site (short unique genomic sequences)
STD  Standard (high quality annotated sequence)
TPA  Third Party Annotation (re-annotated and re-assembled)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun

Notes:
EST contains sequence and mapping data on ‘single-pass’ cDNAs and ESTs from various organisms. Single-pass reads: variable quality.
HTG sequences: in order to make them available as soon as possible, they include unfinished genome project data; much of the annotation is computer generated.
SRA is a separate databank from EMBL-Bank. SRA can also be searched as a data class within EMBL-Bank.
Bulk of entries. Highest level of tracked information.

2) Taxonomy
Which taxonomy database does ENA use? All INSDC databases use NCBI Taxonomy.
Divisions: HUM Human, MUS Mouse, ROD Rodent, MAM Mammal, VRT Vertebrate, FUN Fungi, INV Invertebrate, PLN Plant, PHG Phage, PRO Prokaryote, VIR Viral, ENV Environmental, SYN Synthetic, TGN Transgenic, UNC Unclassified
Other: UNC is primarily used by GenBank for PAT (patent) sequences

2) Taxonomy: exclusion
Some species are EXCLUDED from certain taxonomic ranges:
ROD Rodent: excludes mouse
MAM Mammal: excludes human, mouse, rodent
VRT Vertebrate: excludes human, mouse, rodent, mammal
Applies to: FTP files and sequence search tools
But not: ENA Browser

Database structure: how does data organization differ from GenBank?

GenBank: Divisions (con est gss htc htg pat sts std ... hum mus rod mam vrt fun inv pln), mixing data classes and taxonomy in one list
Data split into parallel slices
Large search sets
Classes incomplete for taxonomy
Taxonomy incomplete for classes

EMBL-Bank: Data classes (con est gss htc htg pat sts std ...) crossed with Taxonomic Divisions (hum mus rod mam vrt fun ...)
Data split into intersecting slices
Reduces search set
Ensures complete result set
‘Mouse’ set: large data set, includes all mouse entries
‘EST’ set: includes all EST entries
‘Mouse’ + ‘EST’ intersection: small data set, ensured complete set of mouse ESTs
