Published by Damian Weaver. Modified over 9 years ago.
1
EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk
2
Sequence Searching and Alignments
Andrew Cowley, External Services, EMBL-EBI
3
External Services
Sequence searching and alignments - Andrew Cowley, 11/09/2015
Andrew Cowley, Bioinformatics Trainer
Hamish McWilliam, Software engineer
Rodrigo Lopez, Head of External Services
+ many others!
4
Contents
Sequence databases
Database browsing tools
Similarity searching and alignments
Alignment basics
Similarity searching tools
More advanced tools
Alignment tools
Guidelines
(slightly) More advanced tools
Problem sequences
5
Materials
Presentations and tutorials can be found on the roadshow course page at the EBI.
Data files for exercises can be found at: www.ebi.ac.uk/~watson/africa
6
Data
Simplistically, much of the data at the EBI can be thought of as a container:
One part is the raw data (e.g. sequence)
Another part is the annotation on this data
7
Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
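The "container" idea (annotation wrapped around raw sequence) can be made concrete by reading the two-letter line codes of the entry. The following is a minimal sketch, not an official parser: it assumes each line starts with a two-letter code followed by three spaces, and it uses only a handful of lines copied from the AJ131285 example. The function name `parse_line_codes` is made up for illustration.

```python
from collections import defaultdict

# A few lines copied from the AJ131285 example entry.
ENTRY = """\
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
//
"""

def parse_line_codes(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code in ("XX", "//") or not content:
            continue  # spacer and terminator lines carry no data
        fields[code].append(content)
    return dict(fields)

fields = parse_line_codes(ENTRY)
# The ID line is a ';'-separated summary of the entry.
id_parts = [part.strip() for part in fields["ID"][0].rstrip(".").split(";")]
print("data class:", id_parts[4], "| taxonomic division:", id_parts[5])
print("lineage:", " ".join(fields["OC"]))
```

In this entry the fifth and sixth ID fields (STD, INV) are the data class and taxonomic division discussed later in the talk.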
8
Data - Nucleotide
ENA/EMBL-Bank:
  Release and updates
  Divided into classes and divisions
Supplementary sets: EMBL-CDS, EMBL-MGA
Specialist data sets, e.g.:
  Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  Alternative splicing: ASD, ASTD, etc.
  Completed genomes: Ensembl, Integr8, etc.
  Variation: HGVBase, dbSNP, etc.
9
What sequence data is submitted?
Individual sequencing: individual scientists sequence an individual gene, add annotation, and submit.
[Figure: example sequence ACTGCTGCTAGCTGGCTGACTATTCTAGCTTTAGCTGAGTGACTATTATCAGCTATTACAGCATCCG]
10
High throughput sequencing
[Figure: chromosome -> fragment -> sequencing library -> sequence reads -> assemble sequence -> annotation (labels: cyp30cyp309insv cg343)]
11
High throughput sequencing
Large-scale sequencing projects follow the same steps and then submit, e.g. as a whole genome shotgun submission.
12
What are primary sequence databases?
Submitters: individual scientists, large-scale sequencing projects, patent offices
Primary sequence data -> primary sequence database
Primary sequence data is:
  Original sequence data
  Experimental data
  Patent data
  Submitter-defined
13
How do primary and derived databases differ?
Primary sequence data (from individual scientists, large-scale sequencing projects and patent offices) -> primary sequence database
Derived data (e.g. protein sequence) -> derived database
14
Primary v. derived data
Submitted DNA sequence (primary): ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT
Transcribe -> derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
Translate -> derived protein sequence: MRSDECCCAMSC
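The transcribe and translate steps can be sketched in a few lines. This is an illustrative sketch using the standard genetic code, not a tool from the talk; the function names are made up. Translating the mRNA shown on the slide gives MRSDECCCAMSC, because GAU encodes aspartate (D).

```python
# Standard genetic code, codons ordered by base U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate whole codons starting at position 0; '*' marks a stop codon."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))

mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"  # derived mRNA from the slide
protein = translate(mrna)
print(protein)
```

Because the protein is computed from the nucleotide sequence, it can be regenerated at any time, which is what makes it derived rather than primary data.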
15
How do primary and derived databases differ?
Primary sequence database: redundant. If anything in a submission varies (e.g. source / submitter / sequence), a new entry is generated.
Derived database (derived data, e.g. protein sequence): may be non-redundant.
16
How do primary and derived databases differ?
Derived database: data lost -> regenerate data (derived data, e.g. protein sequence, can be regenerated from the primary sequence data).
17
Primary nucleotide sequence databases
INSDC: International Nucleotide Sequence Database Collaboration
  GenBank (U.S.A.)
  DDBJ (Japan)
  ENA (Europe)
Daily exchange of data between ENA, DDBJ and GenBank
Submission can be made to any INSDC database
18
How is sequence data processed?
Reads: sequence machine output (reads), quality scores
Assembly: fragmented sequence reads assembled into contigs and mapped onto chromosomes
Annotation: functional information assigned to assembled regions
19
What type of sequence data is submitted?
Raw data:
  Input information: sample, set-up, machine configuration
  Output machine data: sequence traces, reads, quality scores
  Metagenomic data: where originated
Assembled sequences and annotated sequence:
  Interpreted information: assembly, mapping, functional annotation, sample information
20
How does ENA store the data?
ENA: European Nucleotide Archive
Submitters: large-scale sequencing projects, individual scientists, patent offices
  ENA-Annotation (formerly EMBL-Bank): assembled and annotated sequences
  Sequence Read Archive (SRA): raw data
  Trace Archive: raw data
21
How does ENA store the data?
  Trace Archive: trace sequence reads from capillary sequencing instruments
  Sequence Read Archive (SRA): intensity reads from next-generation sequencing instruments
22
INSDC Sequencing Projects
Can data be traced to an Institute?
Institute / Consortium -> database records (genomic, ESTs, shotgun, ...) -> assembly & annotation -> complete genome / metagenome (single organism / metagenomic study)
A project pulls this information together:
  Track projects
  Comparative analysis
23
Nucleotides: European Nucleotide Archive (ENA)
The ENA has a three-tiered data architecture. It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).
Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).
24
Data Quality
Is the data cleaned up?
Validation of submitted data:
  Automatic quality checks
  Some manual inspection and curation
Errors can still exist in sequence and annotation
25
Database Structure
How is the data organized?
Data in ENA-Annotation is divided in 2 ways:
1) Data classes: type of data or methodology used to obtain data. Each entry belongs to one data class.
2) Taxonomic divisions: each entry belongs to one taxonomic division.
26
Data Classes
CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
STS  Sequence Tagged Site (short unique genomic sequences)
TPA  Third Party Annotation (re-annotated and re-assembled)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun
STD  Standard (high quality annotated sequence)
27
Data Classes: EST
Single pass reads, variable quality
Need to search both EST and RNA data
28
Data Classes: PAT
Often copies of existing entries
Records not clean, even for taxonomy
29
Data Classes: STD
Bulk of entries
Highest level of tracked information
30
Data Classes: TPA
Derived data entries, e.g. patch genomic and RNA data to construct complete coverage
Must have publication
Must show which entries data is derived from
31
Data Classes: TSA
Also derived data entries
ESTs assembled to construct RNA
Must show which EST/HTC entries data is derived from
32
Data Classes: WGS
Entries change over time (completely replaced)
Raw WGS entries are assembled into contigs (CON entries)
33
Data Classes
How stable is the data?
Data is always changing:
  Assembly of sequences into larger fragments
  Deletion of obsolete entries (i.e. once assembled)
  Sequence modifications
  Daily updates
  Identifier changes
  Corrections (databases can contain errors)
  etc.
34
Data Classes
How does assembly affect entries?
Example:
  WGS (shotgun): fragments in separate entries
  CON (constructed): WGS fragments joined to make new CON entries; old WGS entries archived
  STD (standard): CON entries joined into a large STD entry (e.g. completed genome); annotation added; old CON entries archived
35
Taxonomy
HUM  Human
MUS  Mouse
MAM  Mammal
VRT  Vertebrate
FUN  Fungi
INV  Invertebrate
PLN  Plant
PHG  Phage
PRO  Prokaryote
ROD  Rodent
VIR  Viral
Other:
ENV  Environmental
SYN  Synthetic
TGN  Transgenic
UNC  Unclassified
36
Taxonomy: ENV Environmental
CAUTION: organism never isolated
May blast sequence to assign putative organism
37
Taxonomy: SYN Synthetic / TGN Transgenic
CAUTION: not consistently handled, variable quality
Transgenics may be from multiple organisms
38
Taxonomy
Division primarily used by GenBank for PAT (patent) sequences
39
Taxonomy exclusion
Some species are excluded from certain taxonomic ranges:
  VRT Vertebrate excludes human, mouse, rodent, mammal
  MAM Mammal excludes human, mouse, rodent
  ROD Rodent excludes mouse
Applies to: ftp files and sequence search tools
But not: ENA Browser
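The exclusion rules amount to "most specific division wins". The sketch below illustrates that idea only: the lineage lists are simplified, the function name is made up, and the assumption that "mouse" means Mus musculus and "human" means Homo sapiens is mine, not stated on the slide.

```python
# Checked in order, most specific first, so each entry gets exactly one division.
DIVISION_RULES = [
    ("Homo sapiens", "HUM"),
    ("Mus musculus", "MUS"),   # assumption: the mouse division is Mus musculus
    ("Rodentia", "ROD"),       # rodents other than mouse
    ("Mammalia", "MAM"),       # mammals other than human, mouse, rodent
    ("Vertebrata", "VRT"),     # vertebrates other than mammals
]

def assign_division(lineage):
    """Return the first matching division code, or None if no rule applies."""
    for taxon, division in DIVISION_RULES:
        if taxon in lineage:
            return division
    return None

rat = ["Vertebrata", "Mammalia", "Rodentia", "Rattus norvegicus"]
mouse = ["Vertebrata", "Mammalia", "Rodentia", "Mus musculus"]
cow = ["Vertebrata", "Mammalia", "Bos taurus"]
print(assign_division(rat), assign_division(mouse), assign_division(cow))
```

A practical consequence: searching only the VRT division in the ftp files or search tools would miss every human, mouse, rodent and other mammal entry.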
40
Taxonomy Database
Which taxonomy database does ENA use?
All INSDC databases use the NCBI Taxonomy Browser
Only organisms with sequence are represented
EBI Taxonomy Portal: EBI-wide service maps resources into taxonomy service
  Culture collection: physical data, e.g. sample or stored version
  Biomaterial
  Specimen voucher: representation, e.g. picture
41
Database Structure
How does data organization differ from GenBank?
ENA-Annotation: data classes and taxonomic divisions
  Data split into intersecting slices
  Reduces search set
  Ensures complete result set
GenBank: taxonomic divisions, data classes
  Data split into parallel slices
  Large search sets
  Classes incomplete for taxonomy
  Taxonomy incomplete for classes
Data classes: con est gss htc htg pat sts std ...
Taxonomic divisions: hum mus rod mam vrt fun inv pln ...
42
Database Structure
How does data organization differ from GenBank?
ENA-Annotation: 'Mouse' + 'EST' intersection is a small data set, ensured to be the complete set of mouse ESTs
GenBank: 'Mouse' set is a large data set that includes all mouse entries; 'EST' set is a large data set that includes all EST entries
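The "intersecting slices" idea can be illustrated with a toy index keyed on (data class, taxonomic division). This is only a sketch: the accessions E1..E5 are invented and the real databases are not organised as Python dictionaries.

```python
from collections import defaultdict

# Invented toy entries; only the class/division codes come from the slides.
entries = [
    {"acc": "E1", "data_class": "EST", "division": "MUS"},
    {"acc": "E2", "data_class": "EST", "division": "HUM"},
    {"acc": "E3", "data_class": "STD", "division": "MUS"},
    {"acc": "E4", "data_class": "EST", "division": "MUS"},
    {"acc": "E5", "data_class": "WGS", "division": "PLN"},
]

# Intersecting slices: one bucket per (class, division) pair.
slices = defaultdict(list)
for entry in entries:
    slices[(entry["data_class"], entry["division"])].append(entry["acc"])

mouse_ests = slices[("EST", "MUS")]
all_mouse = [e["acc"] for e in entries if e["division"] == "MUS"]
all_ests = [e["acc"] for e in entries if e["data_class"] == "EST"]
print(mouse_ests, len(all_mouse), len(all_ests))
```

The (EST, MUS) bucket is smaller than either the full mouse set or the full EST set, yet it still contains every mouse EST.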
43
Data - Protein Sequence
UniProt databases:
  UniProtKB: human curated and automatic translation sections
  UniRef: non-redundant sequence clusters
  UniParc: non-identical sequence archive
Sequence from structures: PDB, SGT
Specialist data sets, e.g.:
  Immunoglobulins: IMGT/HLA
  Alternative splicing: ASD, ASTD
  Completed proteomes: Ensembl, Integr8
  Protein Interactions: IntAct
  Patent Proteins: EPO, JPO, KIPO and USPTO
44
Sequence Databases
GenBank (Nucleotide) | Nucleotide/Protein | Protein
Not curated | Curated | Automatically annotated/curated
Author submits | NCBI creates from existing data | Entries created from large scale genomics, small scale cloning and peptide submissions
Only author can revise | NCBI revises as new data emerge | UniProt regularly revises entries
Multiple records for same loci common | Single records for each molecule of major organisms | An entry per protein including any isoforms (when curated)
Records can contradict each other | ? |
No limit to species included | Limited to model organisms | All species
Data exchanged among INSDC members | Exclusive NCBI database | International consortium, inc. leading figures in bioinformatics
Akin to primary literature | Akin to review articles | Akin to review articles, plus extensive cross-references to other databases; almost a portal
Proteins identified and linked | Proteins and transcripts identified and linked | Links to supporting nucleotide data added to entries
Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases | UniProt.org: modern, multi-functional website with incorporated bioinformatics tools
45
Protein sequence: UniProt
Manual curation: literature-based annotation, sequence analysis
Automated annotation
Predictions and classification: transmembrane prediction, InterPro classification, signal prediction, other predictions, protein classification
Some data sources for annotation:
  GO: functional info
  PRIDE: protein identification data
  InterPro: protein families and domains
  IntAct: molecular interactions
  IntEnz: enzymes
  HAMAP: microbial protein families
  RESID: post-translational modifications
46
UniProt data sources and data flow
Database sources: INSDC (incl. WGS, Env.), RefSeq, Ensembl, VEGA, PDB, FlyBase, WormBase, patent data, Sub/Peptide data, proteome sets, IPI
UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium's activities. It provides a richly curated protein database.
UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins
UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences.
UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)
UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
47
The Two Sides of UniProtKB
UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)
48
Databases
There are many databases, and they are getting bigger
Efficient searching requires knowing what is stored in them
Don't assume that everything in the databases is correct
Nothing is constant: deletions, sequence modifications, daily updates, identifier changes, etc.
49
Searching databases
50
? What methods of searching databases do you know of?
? What is the difference between a primary and secondary database?
? What is the best protein sequence database to search (specific part)?
51
Searching
Many ways of searching databases:
  Annotation/title: know something about your sequence (gene name, function, accession)
52
New search service
Data organised according to: gene expression protein structure literature
Species selector allows for easy comparison
Explore data, return easily to your results
Access from the EBI's homepage
53
Database webpages
54
Database searching
55
Searching
Many ways of searching databases:
  Annotation/title: know something about your sequence (gene name, function, accession)
  Raw data: don't know! Or want to check... Infer extra information (homology? annotation? function?)
56
Sequence alignment
Relatively easy if we have an exact match...
But sequence is variable: between individuals, species, location etc.
That variability is useful data too!
Need a search method that allows for some variability
And even better, helps us assess that variability
57
Sequence alignment
Query: ACATAGGT
1:     TCATAGAT
2:     AAATTCTG
58
Sequence alignment
The query ACATAGGT is laid alongside sequence 1 (TCATAGAT) and sequence 2 (AAATTCTG).
59
Sequence alignment
Score against sequence 1: 6/8
Score against sequence 2: 3/8
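The 6/8 and 3/8 scores are simply counts of identical positions. A minimal sketch (the function name is made up) that reproduces them for the ungapped, equal-length case:

```python
def identity_count(a, b):
    """Count positions where two equal-length sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("this simple score only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b))

query = "ACATAGGT"
for name, seq in (("1", "TCATAGAT"), ("2", "AAATTCTG")):
    print(f"Sequence {name}: {identity_count(query, seq)}/{len(query)}")
```

This only works when no insertions or deletions are involved, which is why the following slides introduce gaps and scoring matrices.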
60
Sequence alignment
Query: 1 2
atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaag
atgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttct
ttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaagg
cacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatct
caagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggc
catggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccacc
aagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggc
aagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgcc
ctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggct
cctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgc
ctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttcctt
gggagatgccataaagcacctggatgatctcaagggca
61
Dot plot
Maybe a dot plot will help
[Figure: dot plot, Query vs Sequence 1]
62
Dot plot
[Figure: dot plots, Query vs Sequence 1 and Query vs Sequence 2]
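A dot plot is easy to generate: put one sequence on each axis and mark every cell where the two letters are identical. The sketch below uses the short example sequences ACATAGGT and TCATAGAT rather than the longer ones in the figures; the function name is made up.

```python
def dot_plot(query, subject):
    """One text row per query letter; '*' marks cells where the letters match."""
    return ["".join("*" if q == s else "." for s in subject) for q in query]

query, seq1 = "ACATAGGT", "TCATAGAT"
rows = dot_plot(query, seq1)
print("  " + seq1)
for letter, row in zip(query, rows):
    print(letter + " " + row)

# Matches on the main diagonal correspond to the ungapped alignment.
diagonal_matches = sum(rows[i][i] == "*" for i in range(len(query)))
print("matches on main diagonal:", diagonal_matches)
```

Similar sequences show up as a long run of marks along a diagonal; scattered marks elsewhere are chance matches.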
63
We can see the difference, but how to turn that into something a computer can evaluate?
Computers rely on algorithms which give them a score
They can then compare scores
64
Simple algorithm: penalise movement away from the diagonal (gap penalty)
[Figure: matrix with 0 along the diagonal and -10 for a step off the diagonal]
65
Having opened a gap, we should assign a lesser penalty to extending it (gap extend)
[Figure: matrix with gap open -10 and gap extend -0.5; cell values 0, -10, -10.5, -0.5]
Actual implementation is usually to apply the gap extension penalty to every gap position, including the first
66
Why a lesser gap extend penalty?
Single block of insertions/deletions is more likely than multiple in/del events
Sequences: NVELKAET and NVDEATNFELKAET
Multiple separate gaps: NV-ELKAET NVDE--A-TNFELKAET
Single gap block:
NV------ELKAET
NVDEATNFELKAET
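The effect of separate open and extend penalties can be checked with simple arithmetic. This sketch uses the values from the slides (open -10, extend -0.5) and applies the extension penalty to every gapped position, including the first; the function name and the particular split into gaps of length 1, 2 and 3 are illustrative choices.

```python
GAP_OPEN = -10.0
GAP_EXTEND = -0.5

def gap_score(gap_lengths):
    """Total penalty: open once per gap, extend once per gapped position."""
    return sum(GAP_OPEN + GAP_EXTEND * length for length in gap_lengths)

one_block = gap_score([6])        # a single 6-residue gap, as in NV------ELKAET
scattered = gap_score([1, 2, 3])  # the same 6 gap positions split into three gaps
print(one_block, scattered)
```

The single block costs -13 while the three separate gaps cost -33, so the algorithm prefers the biologically more plausible single insertion/deletion event.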
67
Match/mismatch
Of course, we need to tell the algorithm that matching letters are better than mismatches too
This is done via a scoring matrix:
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
68
Putting the two together gives us a scoring mechanism
[Figure: worked scoring matrix for short sequences, combining the match/mismatch matrix (5 / -4) with the gap penalties (open -10, extend -0.5); cell values shown include -4, -18, 1, -13.5, -13, -22.5 and 6]
69
To pick the optimal alignment, start at the end and trace back the highest scoring route.
[Figure: trace back through the scoring matrix, ending score 6]
70
Needleman-Wunsch
Congratulations! You've just reconstructed the Needleman-Wunsch algorithm!
An example of dynamic programming
Comparing the full length of both sequences is called a global-global or just global alignment
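A compact sketch of global alignment by dynamic programming follows. To keep it short it assumes a single linear gap penalty of -10 per gapped position instead of separate open and extend penalties, so it is a simplification of what was built up on the slides; match 5 and mismatch -4 follow the DNA scoring matrix. It is not the EMBOSS implementation.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty. Returns (score, row_a, row_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):          # leading gaps in a
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # diagonal
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to the top-left corner.
    i, j, top, bottom = len(a), len(b), [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

result = needleman_wunsch("ACGT", "ACT")
print(result)
```

Every cell of the matrix is filled in, which is what guarantees the optimal score and also what makes the method expensive for long sequences.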
71
Global vs Local
But global-global might not be suitable for sequences of very different lengths
A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.
It sets negative scores in the matrix to 0, so an alignment can restart anywhere; the trace back starts at the highest-scoring cell and stops when it reaches a 0
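The local variant needs only a small change to the global recurrence: no cell may drop below zero, and the answer is the best cell anywhere in the matrix. The sketch below computes the score only (no trace back) and again assumes a linear gap penalty of -10 for brevity; it is an illustration, not the SSEARCH or EMBOSS water code.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score, keeping only two matrix rows in memory."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at any cell.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# Two otherwise unrelated sequences sharing the short region ACG.
local_score = smith_waterman_score("TTACGTT", "GGACGGG")
print(local_score)
```

The shared ACG region scores 15 (three matches at 5 each) regardless of how poorly the flanking letters match, which is the point of a local alignment.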
72
QUESTION: Global vs Local - which is which?
ATGTATACGC
-AGTATA-GC

A-TGTATACGC
AGTATA---GC
LOCAL, GLOBAL
73
Scoring
Parameters so far:
  Match/mismatch
  Gap opening
  Gap extending
Can we improve it?
74
Substitutions
Some substitutions are more likely than others
DNA:
  Purines (A, G): dual ring
  Pyrimidines (C, T): single ring
Substitutions within the same type (purine to purine, or pyrimidine to pyrimidine) are called transitions, whereas exchanging a purine for a pyrimidine or vice versa is called a transversion
Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
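A scoring matrix that distinguishes transitions from transversions can be generated rather than typed in. In this sketch the match (5) and transversion (-4) values follow the earlier DNA matrix, while the transition score of -1 is an arbitrary illustrative value, not one taken from the talk.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(x, y, match=5, transition=-1, transversion=-4):
    """Score a pair of bases; transitions are penalised less than transversions."""
    if x == y:
        return match
    same_ring_type = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_ring_type else transversion

matrix = {x: {y: substitution_score(x, y) for y in "ACGT"} for x in "ACGT"}

print("    " + "   ".join("ACGT"))
for x in "ACGT":
    print(x + " " + " ".join(f"{matrix[x][y]:3d}" for y in "ACGT"))
```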
76
Proteins
What about proteins?
77
Protein substitution matrices
Can look at closely related proteins to determine substitution rates
Two most commonly used models:
  BLOSUM
  PAM
78
BLOSUM
Blocks of Amino Acid Substitution Matrix
Align conserved regions of evolutionarily divergent sequences clustered at a given % identity
Count relative frequencies of amino acids and substitution probability
Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely
Higher BLOSUM number = more closely related
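The "more positive = more likely" scores are log-odds values: the observed frequency of a pair is compared with the frequency expected by chance. The sketch below assumes the commonly used half-bit scaling (2 x log2, rounded to an integer); the frequencies in the example are invented for illustration, not taken from real alignment blocks.

```python
import math

def log_odds_score(observed, expected):
    """Half-bit log-odds score: positive if the pair is seen more often than chance."""
    return round(2 * math.log2(observed / expected))

# Invented frequencies for illustration only.
favoured = log_odds_score(observed=0.02, expected=0.005)   # seen 4x more often than chance
disfavoured = log_odds_score(observed=0.005, expected=0.02) # seen 4x less often than chance
neutral = log_odds_score(observed=0.01, expected=0.01)
print(favoured, disfavoured, neutral)
```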
79
PAM
Point Accepted Mutation
Observed mutations in a set of closely related proteins
Markov chain model created to describe substitutions
Normalised so that PAM1 = 1 mutation per 100 amino acids
Extrapolate matrices from model
Higher PAM number = less closely related
[Figure: PAM 250 matrix]
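"Extrapolate matrices from model" means raising the PAM1 transition matrix to a power: PAM250 is PAM1 multiplied by itself 250 times. The sketch below shows the mechanics on a toy two-letter alphabet with a 1% mutation chance per step; a real PAM matrix has 20 x 20 entries estimated from data and is then converted to log-odds scores, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    result = [[float(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Toy model: each letter stays the same with probability 0.99 per PAM unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print([round(p, 4) for p in pam250[0]])
```

After 250 steps the probability of still seeing the original letter has dropped from 0.99 to roughly 0.5, which is why higher PAM numbers suit more divergent sequences.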
80
Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence
[Figure panels: PAM 10, 100, 200, 300, 400, 500]
81
More divergent:  BLOSUM 45 / PAM 250
                 BLOSUM 62 / PAM 160
Less divergent:  BLOSUM 90 / PAM 100
82
Scoring
Parameters:
  Match/mismatch
  Gap opening
  Gap extending
  Substitution matrix
83
Dynamic programming alignments at the EBI
EMBOSS Pairwise Alignment Algorithms
EMBOSS: European Molecular Biology Open Software Suite
  Suite of useful tools for molecular biology
  Command line based
  Designed to be used as part of scripts/chained programs
We implement selected tools to provide web-based access
84
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/sequence.html
Or...
85
Where to find at the EBI?
86
EMBOSS align tools
Needle: global alignment
Water: local alignment
87
[Figure: tool submission form: program selection, sequence input, parameters, Submit!]
89
Key:
  -  Gap
  :  Positive match
  .  Negative match
  |  Identity
90
Pairwise Alignments - Example sequences
Pairwise_align1.fsa
Pairwise_align2.fsa
Pages 25-30 in full booklet: Questions 7-10
www.ebi.ac.uk/~watson/africa
91
Dynamic programming sequence search methods at the EBI
GGSEARCH: global alignment
SSEARCH: local alignment
GLSEARCH: global query vs local database
92
Where to find at the EBI? Sequence searching and alignments - Andrew Cowley11/09/201592 www.ebi.ac.uk/Tools/sss/ Or...
93
Similarity search Sequence searching and alignments - Andrew Cowley11/09/201593 Database selection Sequence input Parameters Submit!
94
Dynamic programming methods are rigorous and guarantee an optimal result. But they have to store the matrix for both sequences in memory and evaluate each position of the matrix. Predictably, this makes them slow and demanding when you are aligning large sequences.
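A minimal sketch of why this is costly: a Smith-Waterman style local alignment that fills every cell of the matrix. It assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with gap opening and extending penalties, and it returns only the best score plus the number of cells held in memory.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of matrix cells stored)."""
    rows, cols = len(a) + 1, len(b) + 1
    # The whole (len(a)+1) x (len(b)+1) matrix is kept in memory.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Every cell is evaluated from its three neighbours.
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best, rows * cols


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGGG"))
```

Both the time and the memory grow with the product of the two sequence lengths, which is what the heuristic methods try to avoid.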
95
Heuristics: Therefore we need methods of estimating alignments. Estimation methods are called heuristics. They try to take short cuts in an intelligent manner to speed up the search, at the possible expense of accuracy. Accuracy in sequence searches is important for: aligning the right bits; scoring the alignment correctly; identifying similar sequences (sensitivity).
96
Going back to our dot plot
97
Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.
98
Of course, we have to identify likely regions – not all alignments will be as nice as that one! This is the method used by FASTA (W.R. Pearson & D.J. Lipman, PNAS (1988) 85:2444-2448).
99
FASTA – step 1: Identify runs of identical sequence and pick regions with the highest density of runs. Ktup parameter: how many consecutive identities are needed before they are considered a ‘run’; also called ‘word size’. Increase Ktup = faster, but less sensitive.
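The idea behind step 1 can be sketched as counting shared words per diagonal of the dot plot. This is a simplified illustration, not the FASTA implementation: the function name is hypothetical, and real FASTA goes on to rescore and join the densest regions in the later steps.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count identical words of length ktup on each diagonal (query pos - subject pos)."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)  # where each query word occurs
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1  # hits on the same diagonal belong to one region
    return diagonals


if __name__ == "__main__":
    print(diagonal_hits("ACDEFG", "XXACDEFG", ktup=2).most_common(1))
```

A larger ktup produces fewer hits to process, which is why it is faster but less sensitive.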
100
FASTA – step 2: Weight scoring of runs using the matrix, and trim back regions to those contributing to the highest scores. Parameter: substitution matrix.
101
FASTA – step 3: Discard regions too far from the highest scoring region. Joining threshold: internally determined.
102
FASTA – step 4: Use dynamic programming to optimise the alignment in a narrow band encompassing the top scoring regions. Parameters: gap open; gap extend; substitution matrix.
103
FASTA: Repeat against all sequences in the database.
104
FASTA – programs available at EBI. FASTA: “a fast approximation to Smith & Waterman”.
FASTA – scans a protein or DNA sequence library for similar sequences.
FASTX/Y – compares a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.
TFASTX/Y – compares a protein sequence to a translated DNA data bank.
FASTF – compares ordered peptides (Edman degradation) to a protein databank.
FASTS – compares unordered peptides (Mass Spec.) to a protein databank.
SSEARCH – rigorous scan of a protein or DNA sequence library (S&W algorithm).
105
Where to find at the EBI? www.ebi.ac.uk/Tools/sss/ Or...
106
Similarity search: Database selection; Sequence input; Parameters; Submit!
107
FASTA - results
108
FASTA - results
109
FASTA - results
110
FASTA - results. Key: "-" gap; ":" identity; "." similarity; "X" filtered.
111
Using FASTA - Example sequence: test_prot.fasta. Pages 37-46 in full booklet: Questions 11-14. www.ebi.ac.uk/~watson/africa
112
BLAST – Basic Local Alignment Search Tool: Instead of narrowing the dynamic programming search space, BLAST works in a different way. Firstly, it creates a word list both of the exact sequence and of high scoring substitutions. (Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.)
113
BLAST – step 1 (w=3): SEWRFKHIYRGQPRRHLLTTGWSTFVT -> SEW, EWR, WRF, ... Parameter: word length (w). Increase = faster, but less sensitive.
114
BLAST – step 1 (cont'd) (w=3, T=13): SEWRFKHIYRGQPRRHLLTTGWSTFVT. Scores against the word GQP: GQP 18; GEP 15; GRP 14; GKP 14; GNP 13; GDP 13; AQP 12; NQP 12. Parameters: neighbourhood threshold (T); substitution matrix.
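Step 1 can be sketched in a few lines: cut the query into overlapping words of length w, then keep the substituted words that score at least T against a query word. The score table below is only a small excerpt of substitution scores, chosen so that it reproduces the word scores shown on the slide; it is an assumption for illustration, not a full matrix, and the function names are hypothetical.

```python
# Excerpt of substitution scores (illustrative; only the pairs needed here).
SCORES = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("G", "A"): 0, ("G", "N"): 0,
}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the excerpt."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def word_list(seq, w=3):
    """All overlapping words of length w in the query."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_score(word, candidate):
    return sum(pair_score(x, y) for x, y in zip(word, candidate))


def neighbourhood(word, candidates, threshold):
    """Keep candidate words scoring at least the threshold T against the query word."""
    scored = [(c, word_score(word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


if __name__ == "__main__":
    print(word_list("SEWRFKHIYRGQPRRHLLTTGWSTFVT")[:3])
    print(neighbourhood("GQP", ["GQP", "GEP", "GRP", "GKP", "GNP", "GDP", "AQP", "NQP"], 13))
```

With T=13 the words AQP and NQP (score 12) fall out of the list, so raising T shrinks the word list and speeds up the search at the cost of sensitivity.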
115
BLAST – step 2: Then it scans database sequences for exact matches with these words.
116
BLAST – step 3: If two hits are found on the same diagonal, the alignment is extended until the score drops by a certain amount. This results in a High-scoring Segment Pair (HSP). Parameters: drop off; substitution matrix.
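The drop-off rule can be illustrated with an ungapped extension to the right of a seed word. This sketch assumes a simple match/mismatch score instead of a substitution matrix, assumes the seed itself is an exact match, and extends in one direction only; the names and the default drop-off are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, word_len,
                 match=1, mismatch=-1, x_drop=3):
    """Extend an exact seed to the right without gaps.

    Stops once the running score has fallen x_drop below the best score
    seen so far. Returns (best score, length of the best segment).
    """
    score = best = word_len * match  # the seed is assumed to match exactly
    length = best_len = word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # score dropped too far: give up extending
        i += 1
        j += 1
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, 3))
```

The segment reported is the one ending at the best score, not at the point where the extension stopped.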
117
BLAST – step 4: If the total HSP score is above another threshold then a gapped extension is initiated. Parameters: extension threshold (Sg); substitution matrix.
118
BLAST: The steps rule out many database sequences early on, giving a large increase in speed.
119
BLAST (Basic Local Alignment Search Tool) – programs available at the EBI.
NCBI-BLAST programs:
BLASTP – protein sequence vs. protein sequence library
BLASTN – nucleotide query vs. nucleotide database
BLASTX – translated DNA vs. protein sequence library
WU-BLAST programs:
BLASTP – protein query vs. protein database
BLASTN – nucleotide query vs. nucleotide database
BLASTX – translated nucleotide query vs. protein database
TBLASTN – protein query vs. translated nucleotide database
TBLASTX – translated nucleotide query vs. translated nucleotide database
Combines several parameters into a ‘sensitivity’ option.
121
Using BLAST - Example sequence: test_prot.fasta. Pages 47-50 in full booklet: Questions 15-17. www.ebi.ac.uk/~watson/africa
122
Key: "-" gap; [residue] identity; "+" similarity; "X" filtered.
123
Differences between BLAST and FASTA.
BLAST: fast; good with proteins; produces good local alignments + short global alignments; produces HSPs (reports internal matches in long sequences); might miss a potential alignment due to ruling out sequences early on in the process; good at finding siblings.
FASTA: not as fast as BLAST; much better with DNA than BLASTN; produces S&W alignments; checks each possible alignment with database sequences; good at finding cousins.
124
When to use what? [Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH placed by database size and query length]
125
When to use what? [Figure: time to search with FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
126
Homology and Similarity
127
Similarity
128
Homology
129
Unrelated!
130
Homology vs. Similarity.
Homology: presence of similar features because of common descent. Cannot be observed, since the ancestors no longer exist. Is inferred as a conclusion based on ‘similarity’. Homology is like pregnancy: either one is or one isn’t! (Gribskov – 1999)
Similarity: quantifies a ‘likeness’. Uses statistics to determine the ‘significance’ of a similarity. Statistically significant similar sequences are considered ‘homologous’.
131
So far, we’ve talked about scoring alignments, which is a direct function of the algorithm. But what we want is to assign some kind of quality to that score.
132
Score vs significance. [Figure: example nucleotide alignments contrasting a high score with high significance]
133
“Lies, damn lies, and statistics”
134
“Lies, damn lies, and statistics”: We are not just interested in the score, but in how likely we are to get that alignment by chance alone. It is this ‘non-random’ alignment that implies homology. Statistics are used to estimate this chance.
135
E-value (‘Expect’ value): the number of alignments scoring at least this well that you would expect to find by chance in a search of this size. The lower the E-value, the more significant the match. Best measure of how good an alignment is. Often used for ranking results by default.
136
E-value: Calculated in different ways for BLAST and FASTA. Short query sequences are more likely to be matched by chance, so their hits have higher E-values. Affected by parameter values like gap penalties and substitution matrices.
137
FASTA statistics: Compares the query sequence with every sequence in the database. As most of these sequences are unrelated, it is possible to use the distribution of scores to assign statistical significance.
138
FASTA - histogram. Key: "=" observed distribution of scores; "*" predicted distribution of scores. [Figure: histogram with the high scoring region marked]
139
BLAST statistics: Main reason for speed is that it doesn’t compare the query with lots of other sequences. Therefore it pre-estimates statistical values using a random sequence model, which “appears to yield fairly accurate results”.
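A common way of writing such a pre-estimated statistic is the Karlin-Altschul form, E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below assumes that form; the default lambda and K are illustrative placeholders, since the real values depend on the substitution matrix and gap penalties.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are illustrative defaults."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalised score, so that E = m * n * 2 ** (-bits)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    for score in (30, 60, 90):
        print(score, expect_value(score, 250, 5_000_000))
```

The formula shows why the same score is less significant in a larger database, and why changing the matrix or gap penalties (which changes lambda and K) changes the E-value.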
140
Search Guidelines
141
Search guidelines 1: Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc.). Then with translated DNA sequences (fastx, blastx). Search with DNA vs. DNA as the next resort. And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!
142
Search guidelines 2: Search the smallest database that is likely to contain the sequence(s) of interest. Use sequence statistics (E()-values), rather than % identity or % similarity, as your primary criterion for sequence homology.
143
Search guidelines 3: Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence. Examine the histograms. Use programs such as prss3 to confirm the expectation values. Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0.
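The shuffling idea can be sketched as follows: score the real pair, then score the query against many shuffled copies of the other sequence and see how often chance alone does as well. This is an empirical estimate written for illustration, not what prss3 or FASTA compute internally; the scoring scheme (match/mismatch with a linear gap penalty) and all names are assumptions.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only one matrix row at a time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def shuffle_p_value(query, subject, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, order destroyed
        if local_score(query, "".join(letters)) >= real:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    seq = "MKVLAAGIVGLLLAQWERTYHDNPS"
    print(shuffle_p_value(seq, seq, n_shuffles=100))
```

Shuffling keeps the composition of the sequence, so a score that survives this test is not just an effect of biased residue content.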
145
Search guidelines 4: Consider searches with different gap penalties and other scoring matrices. Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences: use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250). Remember to change the gap penalty defaults!
MATRIX     open  ext.
BLOSUM50   -10   -2
BLOSUM62   -7    -1
BLOSUM80   -16   -4
PAM250     -10   -2
PAM120     -16   -4
146
Search guidelines 5: Homology can be reliably inferred from statistically significant similarity. But remember: orthologous sequences usually have similar functions, whereas paralogous sequences can acquire very different functional roles. So further work might be needed to tease out details.
148
Search guidelines 6: Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues. However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology! Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data: ClustalW; MUSCLE; T-Coffee; Kalign; MAFFT; Mview (available from EBI FASTA & BLAST services); DBCLUSTAL (available from EBI BLAST services).
149
Advanced
150
In general, the more information we can add to an alignment, the better the result: conserved regions; structural information; motifs, e.g. [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
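A motif of this kind maps directly onto a regular expression: each bracketed set of alternatives becomes a character class. The sketch below encodes the example motif from the slide; the test sequence and function name are made up for illustration.

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H written as a regular expression.
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")


def find_motif(seq):
    """Return (start position, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq)]


if __name__ == "__main__":
    print(find_motif("MKRDFATHLLTQEATHG"))
```

Positions are zero-based offsets into the sequence.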
151
Conserved regions: We can add a new ‘position’ parameter to the substitution matrix. We can even modify a normal search to generate a position specific scoring matrix, or PSSM.
152
PSI-BLAST (Position Specific Iterative BLAST): 1. Takes the results of a normal BLAST. 2. Aligns them and generates a profile of conserved positions. 3. Uses this to weight scoring on the next iteration.
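A minimal sketch of the profile idea, assuming the hits are already aligned to the same length: count residues per column, add a pseudocount, and convert to log-odds against a uniform background. PSI-BLAST itself uses sequence weighting and realistic background frequencies, so this is only an illustration with hypothetical names.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: an assumption
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Score a gap-free sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    profile = build_pssm(["GQP", "GEP", "GQP", "GKP"])
    print(round(score_with_pssm(profile, "GQP"), 2), round(score_with_pssm(profile, "AQP"), 2))
```

Unlike a substitution matrix, the score for a residue now depends on the column it falls in, so conserved positions carry more weight on the next iteration.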
153
PSI-BLAST: By adding importance to conserved residues we might be able to find more distant sequences (more sensitive). But iterate too far and we might be assigning importance where there is none.
154
PSI-BLAST
155
PSI-BLAST
156
PSI-BLAST
157
PHI-BLAST (Pattern Hit Initiated BLAST): User provides a pattern alongside a protein. Database hits have to contain this pattern, and similarity to the rest of the sequence. Results can initiate a PSI-BLAST search as well.
158
PSI-SEARCH: Smith-Waterman implementation (SSEARCH), but with iterative position specific scoring.
159
Using PSI-BLAST - Example sequence: test_prot.fasta. Pages 52-55 in full booklet: Questions 18-20. www.ebi.ac.uk/~watson/africa
160
Problem Sequences
161
Short sequences: What about short sequences? It depends on their nature. Protein: reduce the word length and/or increase the E() value cut-off; use shallow matrices. DNA: reduce the word length; ignore gap penalties (force local alignments only). Use rigorous methods. But ask what you are trying to do!
162
Low complexity regions: Sometimes biologically relevant, but always likely to skew alignment scoring. E.g. CA repeats, poly-A tails and proline rich regions.
163
Good statistics: The inset shows good correlation between the observed and expected numbers of scores. This is the region of the histogram to look at first when evaluating results.
164
Bad statistics: The inset shows bad correlation between the observed and expected scores in this search. The spaces between the = and * symbols indicate this poor correlation. One reason for this can be low complexity regions.
165
Low complexity regions: Sometimes biologically relevant, but always likely to skew alignment scoring. E.g. CA repeats, poly-A tails and proline rich regions. Compensate by filtering the sequence so these regions don’t contribute to scoring. Filters: seg, xnu, dust, CENSOR. But check what you are filtering!
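The principle of filtering can be sketched with a sliding window and Shannon entropy: windows with few distinct residues have low entropy and are masked with X. This is a simplified stand-in and not the algorithm used by seg, xnu, dust or CENSOR; the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("MKVLAGHTREWQ" + "P" * 12 + "DNSYCFIKLMTV"))
```

Windows that overlap the repeat can pull a few neighbouring residues into the mask as well, which is one reason to check what you are filtering.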
166
Filtered: Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity. Note that there is now good agreement between the observed and expected high scores in the search, and that the distance between = and * has been significantly reduced.
167
Using Filters - Example sequence: Filtertest_seq.fsa. Pages 56-57 in full booklet: Questions 21-22. www.ebi.ac.uk/~watson/africa
168
Vector contamination: You think you know what your sequence is... but the results are really confusing! Maybe you have vector contamination. Search against known vectors to check.
169
Vector contamination
170
Vector Contamination - Example sequences: vectortest_seq1.fsa, vectortest_seq2.fsa. Page 57 in full booklet: Question 23. www.ebi.ac.uk/~watson/africa
171
Multiple Sequence Alignments
172
Uses of MSA: functional prediction; phylogeny; structural prediction; homology detection; protein analysis; to distinguish between orthology and paralogy.
173
Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores). But this is too computationally intensive.
174
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

Weighted Sums of Pairs (WSP): time O(L^N)
Sequences  Time
2          1 second
3          150 seconds
4          6.25 hours
5          39 days
6          16 years
7          2404 years
175
Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores). But this is too computationally intensive. Therefore we have to use heuristics and progressive alignment methods.
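For reference, a sum-of-pairs score for an existing alignment can be sketched as below: every pair of residues in every column is scored and the results are added up. This version is unweighted (all sequence pairs count equally) and uses a simple match/mismatch/gap scheme, both of which are simplifying assumptions.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of all residue pairs in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps aligned with each other score nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["A-", "AC", "AC"]))
```

Scoring a given alignment is cheap; the expense comes from searching all possible alignments for the one that maximises this score, which is why heuristics are needed.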
176
Clustal (>60,000 citations). Clustal1-Clustal4: 1988, Paul Sharp, Dublin. Clustal V: 1992, EMBL Heidelberg, Rainer Fuchs, Alan Bleasby. Clustal W, Clustal X: 1994-2005, Toby Gibson (EMBL, Heidelberg), Julie Thompson (ICGEB, Strasbourg). Clustal W and Clustal X 2.0: 2006, University College Dublin. www.clustal.org
177
CLUSTAL: Quick, pairwise alignment of all sequences. Line up pairs, with the most similar first.
178
CLUSTAL: Fix the alignment between pairs and treat it as one sequence.
179
CLUSTAL: Align your fixed pairs with each other.
180
Note, this is not a phylogram! Only a guide tree for the alignment.
181
ClustalW at the EBI
182
ClustalW: Parameters; Sequence input; Submit!; Help!
183
ClustalW: Interactive – results in browser, deleted after 24 hours. Email – receive URL to results page, deleted after 7 days.
184
ClustalW
185
ClustalW
186
Jalview
187
ClustalW. Advantages: fast; not too demanding; widely used; fine for most uses. Disadvantages: fixing of early alignments (propagates errors); doesn’t search far (local minima); compresses gaps.
188
Use of Clustal/JalView - Example sequences: prot_MSA.fasta, Problem_MSA1.fsa, Problem_MSA2.fsa, Problem_MSA3.fsa, Problem_MSA4.fsa. Pages 59-66 in full booklet: Questions 24-28. www.ebi.ac.uk/~watson/africa
189
Other Tools
190
COFFEE (Consistency based Objective Function For alignmEnt Evaluation): Maximum Weight Trace (John Kececioglu). Maximise similarity to a LIBRARY of residue pairs. Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
191
COFFEE: Library of reference pairwise alignments for your given set of sequences. Objective function: evaluates consistency between the multiple alignment and the library of pairwise alignments. Weigh depending on quality of alignment. Use SAGA to optimise this function (SAGA is another alignment method, using genetic algorithms).
192
COFFEE: More accurate than ClustalW. Much less prone to problems in early alignment stages. VERY slow!
193
T-Coffee (Tree-based COFFEE): Heuristic approach to COFFEE. Gets rid of the genetic algorithm portion. Uses progressive alignments. Changes algorithm based on number of sequences.
194
T-Coffee: Much faster than COFFEE. Avoids some of ClustalW’s pitfalls. Can take information from several data sources. Still not that fast. Can be very demanding of memory etc.
195
Others.
MUSCLE (Bob Edgar): iterative/progressive alignment; fast; good for big alignments, proteins.
MAFFT: iterative based; Fast Fourier Transform; fast and accurate; good for huge alignments.
Kalign: very fast, local-regions aligning; good for very large numbers of alignments!
196
Which tool should I use?
Input data -> Recommendation
2-100 sequences of typical protein length -> MUSCLE, T-Coffee, MAFFT, ClustalW
100-500 sequences -> MUSCLE, MAFFT
>500 sequences -> MUSCLE, KALIGN
Small number of unusually long sequences -> ClustalW
197
How to evaluate? Use a benchmark: BaliBASE.
198
BaliBASE: Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics; ICGEB Strasbourg. 141 manual alignments using structures; 5 sections; core alignment regions marked. 1. Equidistant (82) 2. Orphan (23) 3. Two groups (12) 4. Long internal gaps (13) 5. Long terminal gaps (11)
199
Benchmark pitfalls: The benchmark dataset may not be representative. Danger of over-training towards the benchmark. Goldman: most MSAs have unrealistic gaps; they tend towards multiple, independent deletions; insertions are rare; sequences shrink in length over evolution. No supporting evidence that this is the case.
200
Solutions: Use phylogenetic data to guide the alignment. Keep track of changes to ancestor sequences. Don’t change them again so easily in descendants.
201
PRANK (Probabilistic Alignment Kit; webPRANK): Better suited for closely related sequences. Tied solutions are chosen between at random: this avoids incorrect confidence in the result, but means alignments might not be reproducible. Alignments look quite different. Might look worse! But gap patterns make sense. Gaps are good!
204
Comparing Alignments - Example sequences: prot_MSA.fasta. Pages 67-74 in full booklet: Questions 29-30. www.ebi.ac.uk/~watson/africa
205
Common problems with MSA. Input format: FASTA format; unique sequence identifiers; include sequence! Job can’t be found: interactive results are deleted after 24hrs; use email. Consider other tool.
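The input checks listed above are easy to run before submitting a job. The sketch below is a hypothetical pre-flight check, not part of any EBI tool: it treats the first word after ">" as the identifier and reports duplicate identifiers and records with no sequence.

```python
def check_fasta(text):
    """Return a list of problems found in FASTA-formatted text (empty if none)."""
    records = []   # each entry: [identifier, number of sequence characters]
    problems = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", 0])
        elif not records:
            problems.append("sequence data before the first '>' header")
        else:
            records[-1][1] += len(line)
    seen = set()
    for identifier, length in records:
        if not identifier:
            problems.append("header without an identifier")
        elif identifier in seen:
            problems.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if length == 0:
            problems.append(f"no sequence for: {identifier or '(unnamed)'}")
    if not records:
        problems.append("no FASTA records found")
    return problems


if __name__ == "__main__":
    print(check_fasta(">seq1\nACGT\n>seq1\n"))
```

An empty list means the input passed these particular checks; it does not guarantee that a given tool will accept it.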
206
Common mis-uses of MSA. Performing a sequence assembly: a specialist type of MSA; use other tools (Staden etc.). Aligning ESTs to a reference genome: use EST2Genome. Designing primers: use primer tools (primer3 etc.). Aligning two sequences: use a pairwise alignment tool!
207
Putting it all together: EB-Eye search -> Sequence retrieval -> Sequence search -> Sequence retrieval -> Multiple sequence alignment -> Analysis
208
Final remarks: Don’t assume a single tool will cater for all your needs. DO change the parameters of the tools. Remember where the tool excels and what its limitations are. A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!). Crazy input will always give crazy results!
209
Getting Help
210
Database documentation. Frequently Asked Questions: http://www.ebi.ac.uk/help/faq.html. 2can Support Portal: http://www.ebi.ac.uk/2can/. EBI Support: http://www.ebi.ac.uk/support/. Hands-on training programme: http://www.ebi.ac.uk/training/handson/
211
Thanks! Hamish McWilliam and Andrew Cowley; Vicky Schneider; Rodrigo Lopez; EMBL-EBI; SLING; You!