Published by Berenice Stevenson. Modified over 9 years ago.
1
Sequence Search
Abhishek Niroula
Department of Experimental Medical Science, Lund University
2015-12-10
2
Sequence search
Sequences
– Nucleotide and amino acid sequences
– Known sequences are stored in different databases
    NCBI, Ensembl, and others
– The number of organisms being sequenced is increasing
    Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
    Genome 10K (genomic zoo) project
– Use of sequences is expanding rapidly in biomedical research
– 1000 Genomes Project, 100K Genomes Project
Sequence search
– Search for an appropriate sequence
– Search for similar sequences in a database
3
Sequence search
Identify a new sequence
Functional and structural annotation of sequences
Finding homologous sequences for
– Genomic, phylogenetic, structural studies, etc.
Haemophilus influenzae
– The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
– 1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
– 1,007 of them matched, so a biochemical function could be deduced for each of them
Multiple sclerosis (source: Martin Tompa)
– Autoimmune disease in which the T-cells recognise the nerves' myelin sheaths as foreign bodies and attack them
– Hypothesis: the nerves' myelin sheath proteins were similar to bacterial sheath proteins from an earlier infection
– Methodology
    Myelin sheath proteins were sequenced
    A database was searched for similar bacterial and viral sequences
    Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins
4
Similarity vs homology
Similarity
– Similarity is the degree of likeness of two sequences
– It is a quantitative measure
Homology
– Homology is an evolutionary relationship between two sequences
– It cannot be measured: two sequences are either homologous or they are not
– Distant and close homology refer to the evolutionary distance between the sequences and their common ancestor
Correct usage: "Two sequences are 80% similar."
Incorrect usage: "Two sequences are 80% homologous."
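Because similarity is a number, it can be computed directly from an alignment. The sketch below uses percent identity, which is only one possible similarity measure and is not named on the slide; the two aligned sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns that hold identical residues.

    Both inputs must already be aligned (same length, '-' for gaps).
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("alignment has no usable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Invented example alignment: 7 identical columns out of 10.
print(percent_identity("MDLSALTRQ-", "MDVSALTKQA"))
```

A value such as 70% describes similarity only; whether the two sequences are homologous is a separate inference.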
5
Orthologs and paralogs
[Figure: tree example]
Source: http://www.ensembl.org/info/genome/compara/tree_example1.png
6
Sequence search: Problem
Given
– Query sequence
– Database
Goal
– To find statistically significant similarity that can be used to infer homology
Sensitivity: Are all related sequences identified?
Specificity: Are all unrelated sequences rejected?
[Figure: a query is searched against a database of sequences 1-7; the search result contains sequences 1, 2, 4 and 7]
– TP (true positives): 1, 2, 4
– FP (false positives): 7
– FN (false negatives): 3, 5
– TN (true negatives): 6
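The counts in the figure can be turned into sensitivity and specificity values. This is a small sketch using the slide's numbers; the variable names are chosen here for illustration.

```python
# Sequences in the database, the truly related ones, and the reported hits,
# taken from the TP/FP/FN/TN labels on the slide.
database = set(range(1, 8))
related = {1, 2, 3, 4, 5}
reported = {1, 2, 4, 7}

true_pos = reported & related              # related and found
false_pos = reported - related             # unrelated but reported
false_neg = related - reported             # related but missed
true_neg = database - related - reported   # unrelated and rejected

sensitivity = len(true_pos) / (len(true_pos) + len(false_neg))
specificity = len(true_neg) / (len(true_neg) + len(false_pos))

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

With these numbers the search finds 3 of the 5 related sequences and rejects 1 of the 2 unrelated ones.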
7
Heuristic database searching
Sequence search: problem
– Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
– With available technology, exact alignment of a query sequence against an entire database is not feasible
Solution
– Heuristic methods: fast scanning for similar sequences
– Sequences similar to the query are first picked from the database using heuristic methods, and exact alignment scores are computed only for those
– Tools
    BLAST
    FASTA
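To see why the exact computation is expensive, here is a minimal sketch of a local-alignment score computed by dynamic programming in the Smith-Waterman style. The slide does not name a specific algorithm, and the match, mismatch and linear gap values are assumptions for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (score only, no traceback).

    Every cell of a len(a) x len(b) table is filled, so the cost grows with
    the product of the two lengths. Repeating this for every database
    sequence is what makes an exact search slow.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

A heuristic tool avoids filling this table for most database sequences and reserves the exact computation for the few candidates that pass the fast scan.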
8
BLAST
Basic Local Alignment Search Tool
Developed by Altschul et al., 1990
Determines local alignments between a query and the sequences in a database
BLAST consists of two steps:
– Searching for matches
– Computing the statistical significance of the matches
9
BLAST
Given a query sequence, split it into words (k-mers) of k residues
– k = 3 for amino acid sequences
– k = 11-12 for nucleotide sequences
Generate all other possible words of k residues
Score each of these words against the query words using a substitution matrix
– Words with scores higher than a threshold are considered in the next step
[Figure: query sequence and its k-mers]
Query sequence: M D L S A L T R Q
k-mers: MDL, DLS, LSA, SAL, ALT, ...
Similar words for MDL: MDV, MDM, MRL, MQL, ...
Similar words for DLS: DLR, QLS, DVS, DKS, ...
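The word-generation step can be sketched as follows. The k-mer split follows the slide directly; the scoring function is a deliberately simplified stand-in (+2 for identity, -1 otherwise) for a real substitution matrix such as BLOSUM62, and the threshold value is arbitrary.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(sequence: str, k: int) -> list[str]:
    """All overlapping words of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_score(word_a: str, word_b: str) -> int:
    """Placeholder scoring: +2 per identical position, -1 otherwise.

    A real search would look up each residue pair in a substitution matrix.
    """
    return sum(2 if x == y else -1 for x, y in zip(word_a, word_b))


def neighbourhood(word: str, threshold: int) -> list[str]:
    """All words of the same length scoring above the threshold against word."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if toy_score(word, c) > threshold]


print(kmers("MDLSALTRQ", 3))
print(len(neighbourhood("MDL", threshold=2)))
```

With this toy scoring, a threshold of 2 keeps the word itself and every word that differs from it at exactly one position. A real substitution matrix would instead favour conservative substitutions and reject others.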
10
BLAST
Match each of the high-scoring words against the database sequences
The matches are extended in both directions to form ungapped local alignments, giving high-scoring pairs (HSPs)
HSPs with a score above the cutoff are kept
The significance of each ungapped HSP is calculated
[Figure: high-scoring words are matched in the database and extended without gaps into HSPs]
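A rough sketch of the ungapped extension step is shown below. The stopping rule (give up once the running score falls more than `x_drop` below the best score seen) and the +1/-1 scoring are assumptions made here for illustration; the slide does not specify either.

```python
def extend_ungapped(query: str, subject: str, q_start: int, s_start: int, k: int,
                    match: int = 1, mismatch: int = -1, x_drop: int = 2):
    """Extend a word hit of length k in both directions without gaps.

    Returns (query_start, subject_start, length, score) of the resulting HSP.
    """
    def pair_score(a: str, b: str) -> int:
        return match if a == b else mismatch

    def extend(qi: int, si: int, step: int):
        running = best_gain = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += pair_score(query[qi], subject[si])
            steps += 1
            if running > best_gain:
                best_gain, best_len = running, steps
            elif best_gain - running > x_drop:
                break  # score dropped too far below the best: stop extending
            qi += step
            si += step
        return best_gain, best_len

    seed_score = sum(pair_score(query[q_start + i], subject[s_start + i])
                     for i in range(k))
    right_gain, right_len = extend(q_start + k, s_start + k, +1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)
    return (q_start - left_len, s_start - left_len,
            k + left_len + right_len, seed_score + left_gain + right_gain)


# The word GTA matches at query position 2 and subject position 4.
print(extend_ungapped("ACGTACGT", "TTACGTACGTTT", q_start=2, s_start=4, k=3))
```

The extension only keeps the part of each direction that actually raised the score, so flanking mismatches are not included in the HSP.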
11
Gapped BLAST
Altschul et al., 1997
Extension of matches requires two non-overlapping matches on the same diagonal within a distance A
Fewer extensions make the search faster
Gapped alignments are performed around the hits that score higher than a pre-defined score
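The two-hit requirement can be sketched as a filter over word hits. A hit is given here as a (query position, subject position) pair, and two hits lie on the same diagonal when the difference between those positions is equal. The values of `k` and `A` in the example are illustrative, and only neighbouring hits on a diagonal are compared, which is a simplification.

```python
from collections import defaultdict


def two_hit_pairs(hits, k: int, max_distance: int):
    """Pairs of non-overlapping word hits on one diagonal within max_distance.

    hits: iterable of (query_pos, subject_pos) start positions of word matches.
    Returns a sorted list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            # distance >= k means the two words do not overlap.
            if k <= distance <= max_distance:
                pairs.append((diagonal, first, second))
    return sorted(pairs)


example_hits = [(0, 10), (5, 15), (2, 30), (50, 60)]
print(two_hit_pairs(example_hits, k=3, max_distance=40))
```

Only the surviving pairs would trigger an extension, which is why the number of extensions drops compared with extending every single hit.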
12
FASTA
13
FASTA
14
Variants of BLAST and FASTA
Query       Database    Program                  Comment
Protein     Protein     blastp; fasta
Nucleotide  Nucleotide  blastn; fasta
Nucleotide  Protein     blastx; fastx, fasty     Translate query to a protein
Protein     Nucleotide  tblastn; tfastx, tfasty  Translate database
Nucleotide  Nucleotide  tblastx                  Translate both query and database
15
Using BLAST and FASTA
Web application
– BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
– FASTA: http://www.ebi.ac.uk/Tools/sss/fasta/
Standalone
– Local installation
– The database should also be downloaded
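For the standalone route, a typical session with the NCBI BLAST+ command-line programs builds a database from a FASTA file and then searches it. The sketch below only assembles and prints the two commands; it assumes BLAST+ is installed, and all file and database names are hypothetical.

```python
import shlex


def makeblastdb_command(fasta_path: str, db_name: str, dbtype: str = "prot") -> list[str]:
    """Command that formats a FASTA file as a local BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query_path: str, db_name: str, out_path: str,
                   evalue: float = 10.0) -> list[str]:
    """Command that runs a protein-protein search with tabular output."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


# Hypothetical file names; pass each list to subprocess.run(..., check=True) to execute.
print(shlex.join(makeblastdb_command("proteins.fasta", "my_proteins")))
print(shlex.join(blastp_command("query.fasta", "my_proteins", "hits.tsv")))
```

Keeping the commands as argument lists avoids shell quoting problems when paths contain spaces.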
16
BLASTP
17
FASTA
18
Input formats
FASTA format files
– Widely used in bioinformatics
Other file formats
– GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
Identifiers
– Supported in BLAST
– Accession
– Gene identifier
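A FASTA file is plain text: each record starts with a header line beginning with `>` followed by one or more lines of sequence. A minimal reader might look like the sketch below; the two records are invented, and real projects usually rely on an established parser instead.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Invented example records.
example = """>seq1 example protein
MDLSALTRQ
MKV
>seq2 another example
ACDEFGHIK
"""
print(parse_fasta(example))
```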
19
Database
Generic databases
– UniProt or RefSeq databases
– UniRef and the non-redundant database: databases of unique sequence entries
– Genome, chromosome
Structure databases
– Database of sequences for which 3D structures are available in PDB
– Used especially for finding template sequences for homology modelling
Specialized database
– A local database can be created from the sequences that are relevant for your purpose
20
Other parameters
Expect
– Statistical significance parameter
– Default = 10, i.e. 10 matches are expected by chance
Filter
– Masks regions of low complexity and short repeats
Alignment options
– Substitution table and gap function
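The Expect value comes from Karlin-Altschul statistics, where the number of chance hits with raw score at least S is E = K·m·n·e^(−λS) for a query of length m and a database of total length n. The slide does not give this formula; the sketch below adds it for orientation, and the K and λ defaults are illustrative only, since the real values depend on the scoring system.

```python
import math


def expected_hits(raw_score: float, query_len: int, db_len: int,
                  k: float = 0.041, lam: float = 0.267) -> float:
    """Expected number of chance hits scoring at least raw_score.

    k and lam are illustrative defaults; in practice they are determined
    by the substitution matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Same alignment score, two database sizes: the larger database gives a larger E.
print(expected_hits(60, query_len=300, db_len=10**6))
print(expected_hits(60, query_len=300, db_len=10**9))
```

This is why the same alignment can look significant against a small database and insignificant against a very large one.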
21
Output
There are three major sections in BLAST output
– Header
    Information about the query sequence and the database searched
    Graphical overview of matches (only in the web version)
– Description
    Description of the sequences (hits)
    Scores: generated from the alignment; higher is better
    E-value: number of hits expected by chance; lower is better
    Sequence identifier in NCBI databases
– Alignment
    Pairwise alignment
    Details of the alignment (score, E-value, similarity, etc.)
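For scripted analysis, standalone BLAST+ can also write a tab-separated table (`-outfmt 6`) instead of the three-section report described above. The sketch below reads that table using its default column order and keeps the hits below an E-value cutoff; the two rows are invented, not real search results.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_tabular_hits(text: str, max_evalue: float = 1e-3) -> list[dict]:
    """Return hits with E-value <= max_evalue, best (lowest E-value) first."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] <= max_evalue:
            hits.append(row)
    return sorted(hits, key=lambda hit: hit["evalue"])


# Invented rows for illustration only.
example = (
    "query1\tsubjA\t35.00\t40\t26\t0\t1\t40\t5\t44\t4.2\t28.1\n"
    "query1\tsubjB\t98.50\t200\t3\t0\t1\t200\t1\t200\t2e-50\t410.0\n"
)
for hit in parse_tabular_hits(example):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

The short, 35% identical match has an E-value above the cutoff and is dropped, which mirrors the "lower is better" reading of the E-value.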
22
PSI-BLAST
Position-Specific Iterated BLAST
More sensitive to distantly related sequences
Algorithm
– In the first iteration, standard BLAST is run
    A PSSM (position-specific scoring matrix) is generated from the significant alignments
– In the next iteration, the new PSSM is used to score the alignments
    A new PSSM is generated from the significant alignments
– The above step is repeated until a stop criterion is met. Stop criteria may be:
    No new sequences are identified in two consecutive iterations
    The desired number of iterations is reached
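The control flow of the iteration can be sketched independently of the actual search. In the sketch below, `search` and `build_profile` are hypothetical callables standing in for the BLAST run and the PSSM construction, and the toy versions at the bottom exist only to make the example runnable. Convergence is interpreted here as an iteration that adds no new sequences.

```python
def iterate_search(search, build_profile, query, max_iterations: int = 5):
    """Repeat search/profile-building until no new hits or the limit is reached.

    search(query, profile) -> iterable of hit identifiers (profile is None at first)
    build_profile(query, hits) -> profile used in the next iteration
    Returns (all hits found, number of iterations run).
    """
    profile = None
    found = set()
    for iteration in range(1, max_iterations + 1):
        hits = set(search(query, profile))
        if iteration > 1 and not (hits - found):
            return found, iteration  # converged: nothing new this round
        found |= hits
        profile = build_profile(query, found)
    return found, max_iterations


# Toy stand-ins: "c" resembles "a" but not the query, so it is only
# reachable once "a" has been folded into the profile.
TOY_SIMILAR = {"q": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}


def toy_search(query, profile):
    seeds = {query} | (profile or set())
    return set().union(*(TOY_SIMILAR[s] for s in seeds))


def toy_build_profile(query, hits):
    return set(hits)


print(iterate_search(toy_search, toy_build_profile, "q"))
```

The toy run shows the reason for the added sensitivity: the distant sequence "c" is found in the second iteration through its similarity to an intermediate hit.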
23
Sequence search: Challenges
Self hits are uninteresting
Size of the target database
– Use a database no bigger than required
Paralogs have similar sequences but often have different functions
Low-complexity regions reduce the quality of alignments
Short repeats give false hits
Results for very short queries may be less reliable
– Matches that are 50% identical over a length of 20-40 amino acids occur frequently by chance
Distant homologs may have very low similarity
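One simple way to spot low-complexity regions is to measure the Shannon entropy of a sliding window and mask windows that fall below a cutoff. This is only a sketch of the idea, not the filtering algorithm used by BLAST, and the window size and entropy cutoff are arbitrary choices.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence: str, window: int = 12,
                        min_entropy: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


print(mask_low_complexity("QQQQQQQQQQQQ"))
print(mask_low_complexity("MDLSALTRQWEK"))
```

Masked positions are excluded from seeding, so a run of one repeated residue no longer produces hits against every other sequence that contains a similar run.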
24
Sequence search
Exercise
– BLAST
– FASTA