Sequence Search
Abhishek Niroula
Department of Experimental Medical Science
Lund University
2015-12-10

Sequence search
Sequences
–Nucleotide and amino acid sequences
–Known sequences are stored in different databases
  NCBI, Ensembl, and others
–Number of organisms being sequenced is increasing
  Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
  Genome 10K (genomic zoo) project
–Use of sequences is expanding rapidly in biomedical research
–1000 Genomes Project, 100,000 Genomes Project
Sequence search
–Search for an appropriate sequence
–Search for similar sequences in a database

Sequence search
Identify a new sequence
Functional and structural annotation of sequences
Finding homologous sequences for
–Genomic, phylogenetic, structural studies, etc.
Haemophilus influenzae
–The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
–1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
–1,007 of them matched, so a biochemical function could be deduced for each of them
Multiple sclerosis (source: Martin Tompa)
–Autoimmune disease in which the T-cells recognise the nerves' myelin sheaths as foreign bodies and attack them
–Hypothesis: the nerves' myelin sheath proteins are similar to bacterial sheath proteins from an earlier infection
–Methodology
  Myelin sheath proteins were sequenced
  A database was searched for similar bacterial and viral sequences
  Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins

Similarity vs homology
Similarity
–Similarity is the degree of likeness of two sequences
–It is a quantitative measure
Homology
–Homology is an evolutionary relationship between two sequences
–It cannot be measured: two sequences either are homologous or are not
–Distant and close homology refer to the distance between the sequences and their common ancestor
"Two sequences are 80% similar." is a meaningful statement.
"Two sequences are 80% homologous." is not.

Orthologs and paralogs
Source: http://www.ensembl.org/info/genome/compara/tree_example1.png

Sequence search: Problem
Given
–Query sequence
–Database
Goal
–To find statistically significant similarity that can be used to infer homology
Sensitivity: Are all related sequences identified?
Specificity: Are all unrelated sequences rejected?
[Figure: a query is searched against a database of sequences 1-7; the search result contains sequences 1, 2, 4 and 7]
TP: 1, 2, 4
FP: 7
FN: 3, 5
TN: 6
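The counts in the example above can be turned into numbers. This is a minimal sketch, assuming that sequences 1-5 are the truly related ones (the TP and FN sets together) and that the search returned 1, 2, 4 and 7. The function name is illustrative only.

```python
def search_metrics(database, related, reported):
    """Return (sensitivity, specificity) for one database search."""
    database, related, reported = set(database), set(related), set(reported)
    tp = reported & related               # related and reported
    fp = reported - related               # unrelated but reported
    fn = related - reported               # related but missed
    tn = database - related - reported    # unrelated and rejected
    sensitivity = len(tp) / (len(tp) + len(fn))
    specificity = len(tn) / (len(tn) + len(fp))
    return sensitivity, specificity


# Database of sequences 1..7, as in the slide example.
sens, spec = search_metrics(range(1, 8), {1, 2, 3, 4, 5}, {1, 2, 4, 7})
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these sets, sensitivity is 3/5 and specificity is 1/2.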

Heuristic database searching
Sequence search: problem
–Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
–Aligning a query sequence against every sequence in a large database in this way is too slow for routine searches
Solution
–Heuristic methods: fast scanning for similar sequences
–Sequences similar to the query sequence are first picked out of the database using heuristic methods; exact alignment scores are computed only for these
–Tools
  BLAST
  FASTA
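To see why the exact computation is costly, here is a minimal sketch of a Smith-Waterman style local alignment score. The match, mismatch and gap values are assumptions chosen for illustration, not values from the slides. The two nested loops fill one table cell per pair of residues, so the work grows with (query length) x (sequence length) for every sequence in the database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("MDLSALTRQ", "AAMDLSALKK"))
```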

BLAST
Basic Local Alignment Search Tool
Developed by Altschul et al., 1990
Determines the local alignment between a query and a database
BLAST consists of two steps:
–Searching for matches
–Computing the statistical significance of the matches

BLAST
Given a query sequence, split the query sequence into words with k residues
–k = 3 for amino acid sequences
–k = 11-12 for nucleotide sequences
Generate all other combinations of words with k residues
Score each of the words using a substitution matrix
–Words with scores higher than the threshold are considered in the next step
[Figure: query sequence and k-mers]
Query sequence: M D L S A L T R Q
k-mers:         MDL  DLS  LSA  SAL  ALT  ---
Similar words:  MDV  DLR
                MDM  QLS
                MRL  DVS
                MQL  DKS
                ---
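The word-generation step can be sketched as follows. The scoring function here is a toy stand-in (identical residues +5, residues from the same rough physico-chemical group +2, otherwise -2) and is an assumption made only to keep the example short. A real search would use a substitution matrix such as BLOSUM62, and the threshold of 11 applies to this toy scale only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups: an assumption for illustration, not a real matrix.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: group for group in GROUPS for aa in group}


def toy_score(x, y):
    """Stand-in for a substitution matrix lookup."""
    if x == y:
        return 5
    return 2 if GROUP_OF[x] == GROUP_OF[y] else -2


def query_words(query, k=3):
    """All overlapping k-residue words of the query, with their offsets."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]


def neighbourhood(word, threshold):
    """All k-residue words that score higher than `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score > threshold:
            hits.append("".join(candidate))
    return hits


words = query_words("MDLSALTRQ")
print([w for _, w in words])
print(neighbourhood("MDL", threshold=11))
```

Enumerating all 20^k candidates is fine for k = 3; it is shown this way for clarity rather than speed.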

BLAST
Match each of the high-scoring words against the database sequences
The matches are extended in both directions to form ungapped local alignments and find high-scoring segment pairs (HSPs)
HSPs with a score above the cutoff threshold are kept
Significance of the ungapped HSPs is calculated
[Figure: high-scoring words matched in the database, ungapped extension, HSP]
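A minimal sketch of the ungapped extension step. It assumes a toy scoring of +2 for a match and -1 for a mismatch, and it stops extending once the running score has fallen more than `x_drop` below the best score seen so far. Both the scoring and the drop-off value are illustrative assumptions, and the function names are hypothetical.

```python
def simple_score(a, b):
    """Toy residue score (assumption): +2 match, -1 mismatch."""
    return 2 if a == b else -1


def _extend(pair_scores, x_drop):
    """Walk along residue-pair scores; return (best gain, pairs kept)."""
    best = run = kept = 0
    for n, s in enumerate(pair_scores, start=1):
        run += s
        if run > best:
            best, kept = run, n
        elif best - run > x_drop:
            break                      # score dropped too far: stop extending
    return best, kept


def extend_ungapped(query, subject, q_pos, s_pos, k, x_drop=3, score=simple_score):
    """Extend a word hit of length k in both directions without gaps.

    Returns (hsp_score, query_start, query_end), with query_end exclusive.
    """
    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right = (score(a, b) for a, b in zip(query[q_pos + k:], subject[s_pos + k:]))
    n_left = min(q_pos, s_pos)
    left = (score(query[q_pos - i], subject[s_pos - i]) for i in range(1, n_left + 1))
    gain_r, len_r = _extend(right, x_drop)
    gain_l, len_l = _extend(left, x_drop)
    return seed + gain_r + gain_l, q_pos - len_l, q_pos + k + len_r


# The word "LSA" occurs at offset 2 in the query and offset 4 in the subject.
print(extend_ungapped("MDLSALTRQ", "GGMDLSAVTRQKK", q_pos=2, s_pos=4, k=3))
```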

Gapped BLAST
Altschul et al., 1997
Extension of matches requires two non-overlapping matches on the same diagonal within a distance "A"
Fewer extensions make the search faster
Gapped alignments are performed around the hits that score higher than a pre-defined score
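The two-hit requirement can be sketched as below. A hit is a (query offset, subject offset) pair, and hits lie on the same diagonal when subject offset minus query offset is equal. The window value of 40 and the function name are assumptions for illustration.

```python
from collections import defaultdict


def two_hit_seeds(hits, k=3, window=40):
    """Return hits preceded by a non-overlapping hit on the same diagonal.

    hits: iterable of (query_pos, subject_pos) word matches of length k.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for q_pos in sorted(positions):
            if last is not None and q_pos - last < k:
                continue               # overlaps the previous hit: ignore it
            if last is not None and q_pos - last <= window:
                seeds.append((q_pos, q_pos + diagonal))  # second hit: extend
            last = q_pos
    return seeds


hits = [(0, 10), (1, 11), (5, 15), (20, 3), (60, 70)]
print(two_hit_seeds(hits))
```

In this example only (5, 15) triggers an extension: (1, 11) overlaps the first hit, (20, 3) is alone on its diagonal, and (60, 70) is further than the window from the previous hit.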

FASTA

Variants of BLAST and FASTA
Query        Database     Program                   Comment
Protein      Protein      blastp, fasta
Nucleotide   Nucleotide   blastn, fasta
Nucleotide   Protein      blastx, fastx, fasty      Translate query to a protein
Protein      Nucleotide   tblastn, tfastx, tfasty   Translate database
Nucleotide   Nucleotide   tblastx                   Translate both query and database
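The BLAST column of the table can be written as a small lookup. This is only a convenience sketch; the function name and the `translate_both` flag are hypothetical.

```python
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",     # query is translated
    ("protein", "nucleotide"): "tblastn",    # database is translated
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the query and database types."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_blast_program("nucleotide", "protein"))
```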

Using BLAST and FASTA
Web application
–BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi
–FASTA http://www.ebi.ac.uk/Tools/sss/fasta/
Standalone
–Local installation
–Database should also be downloaded
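For the standalone route, a sketch of how the two usual NCBI BLAST+ commands might be assembled from Python. It assumes BLAST+ is installed and on the PATH; all file and database names are hypothetical placeholders. The script only prints the commands, it does not run them.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, db_type="prot"):
    """Command that builds a local BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", db_type, "-out", db_name]


def blastp_command(query_path, db_name, out_path, evalue=10):
    """Command that searches a protein query against a protein database."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


# Hypothetical file names, for illustration only.
build = makeblastdb_command("my_proteins.fasta", "my_proteins")
search = blastp_command("query.fasta", "my_proteins", "hits.tsv")
print(shlex.join(build))
print(shlex.join(search))
# To execute: subprocess.run(build, check=True), then subprocess.run(search, check=True)
```

Output format 6 is the tab-separated table, which is convenient for further processing.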

BLASTP

FASTA

Input formats
FASTA format files
–Widely used in bioinformatics
Other file formats
–GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
Identifiers
–Supported in BLAST
–Accession
–GI (GenInfo identifier)
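A FASTA file is a ">" header line followed by one or more sequence lines per record. A minimal parser sketch is shown below; the example records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith(">"):
            if header is not None:      # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)         # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical protein
MDLSALTRQ
MKV
>seq2
ACGT
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```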

Database
Generic databases
–UniProt or RefSeq databases
–UniRef and non-redundant databases: databases of unique sequence entries
–Genome, chromosome
Structure databases
–Database of sequences for which 3D structures are available in PDB
–Used especially for finding template sequences for homology modelling
Specialized database
–A local database can be created from the sequences that are relevant for your purpose

Other parameters
Expect
–Statistical significance parameter
–Default = 10, i.e. up to 10 of the reported matches are expected by chance
Filter
–Mask regions of low complexity and short repeats
Alignment options
–Substitution table and gap function
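The Expect value follows the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw alignment score. A minimal sketch follows; the lambda and K defaults are illustrative assumptions, since the real values depend on the scoring system in use.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are assumed constants."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


threshold = 10                      # the default Expect cutoff from the slide
for score in (30, 50, 80):
    e = expect_value(score, query_len=100, db_len=1_000_000)
    status = "reported" if e <= threshold else "dropped"
    print(f"S = {score}: bits = {bit_score(score):.1f}, E = {e:.3g} ({status})")
```

The E-value falls exponentially with the score and grows linearly with the database size, which is one reason to avoid searching a larger database than needed.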

Output
There are three major sections in BLAST output
–Header
  Information about the query sequence and the database searched
  Graphical overview of matches (only in the web version)
–Description
  Description of the sequences (hits)
  Scores: generated from the alignment, higher is better
  E-value: number of hits expected by chance, lower is better
  Sequence identifier in NCBI databases
–Alignment
  Pairwise alignment
  Details of the alignment (score, E-value, similarity, etc.)

PSI-BLAST
Position-Specific Iterated BLAST
More sensitive to distantly related sequences
Algorithm
–In the first iteration, standard BLAST is run
  A PSSM (position-specific scoring matrix) is generated based on the significant alignments
–In the next iteration, the new PSSM is used to score the alignments
  A new PSSM is generated based on the significant alignments
–The above step is repeated until a stop criterion is met. Stop criteria may be:
  No new sequences are identified in two consecutive iterations
  The desired number of iterations is reached
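A rough sketch of what a PSSM is: per-column log-odds scores computed from the aligned hits. This simplified version assumes ungapped hits of equal length, a uniform background frequency and a fixed pseudocount. PSI-BLAST itself is more elaborate (for example, it weights sequences), so treat this as an illustration of the idea only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds PSSM from equal-length, ungapped aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)        # assumed uniform background
    pssm = []
    for column in zip(*aligned):               # one alignment column at a time
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Made-up aligned hits from a first iteration.
hits = ["MDLSA", "MDVSA", "MELSA", "MDLTA"]
pssm = build_pssm(hits)
print(round(score_with_pssm(pssm, "MDLSA"), 2))
print(round(score_with_pssm(pssm, "WWWWW"), 2))
```

Residues seen often in a column get positive scores at that position and unseen residues get negative ones, which is what lets later iterations pick up more distant relatives.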

Sequence search: Challenges
Self hits are uninteresting
Size of target database
–Use a database no bigger than required
Paralogs have similar sequences but often have different functions
Low-complexity regions reduce the quality of alignments
Short repeats give false hits
Results for very short queries may be less reliable
–Matches that are 50% identical over a length of 20-40 amino acids occur frequently by chance
Distant homologs may have very low similarity
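One simple way to picture low-complexity masking is to replace windows with low Shannon entropy by "X". This sketch is not the filter used by BLAST; the window size and entropy cutoff are arbitrary assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every window whose entropy is below min_entropy with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Made-up sequence with a serine-rich stretch in the middle.
print(mask_low_complexity("MKVLAGHWTRQDE" + "S" * 12 + "MDLSALTRQGHW"))
```

Windows that overlap the repeat are masked too, so the masked region is somewhat wider than the repeat itself.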

Sequence search
Exercise
–BLAST
–FASTA
