Fasta and Blast Heuristic algorithm for database search.

Why search databases? To find out whether a new DNA sequence is already deposited in the databanks. To find proteins homologous to a putative coding ORF. To find similar non-coding DNA stretches in the database (for example, repeat elements or regulatory sequences). To locate false priming sites for a set of PCR oligonucleotides.

What databases are available? DNA (nucleotide sequences): The big databases: GenBank, EMBL, DDBJ and their weekly updates. These databases exchange information routinely. Genomic databases such as human, mouse, yeast, etc. Special databases: ESTs (expressed sequence tags), STSs (sequence-tagged sites), EPD (eukaryotic promoter database), REPBASE (repetitive sequence database) and many others.

What databases are available? Protein (amino acid sequences): The big databases: Swiss-Prot (high level of annotation) and PIR (Protein Identification Resource). Translated databases: SPTREMBL (translated EMBL) and GenPept (translation of coding regions in GenBank). Special databases: PDB (sequences derived from the 3D structures in the Brookhaven PDB).

Main algorithms for database searching: FastA (program for rapid alignment of pairs of protein and DNA sequences) – Better for nucleotides than for proteins. BLAST (Basic Local Alignment Search Tool) – Better for proteins than for nucleotides. Smith-Waterman – More sensitive than FastA or BLAST.

The FastA software package: FastA uses the method of Pearson and Lipman (PNAS 85: 2444-2448, 1988). FastA compares a DNA sequence to a DNA database or a protein sequence to a protein database. FastA is a family of programs that includes FastA, TFastA, Ssearch, etc.

General view of how the FastA program works: FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches, that is, runs of identical words of length ktup.
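The word-matching step can be sketched in a few lines of Python. This is a minimal illustration, not FastA's actual implementation: the function name `ktup_diagonal_hits` and the toy sequences are assumptions made for the example. It counts exact ktup-word matches per diagonal (query position minus subject position), so a diagonal with many hits marks a high-density region.

```python
from collections import Counter, defaultdict


def ktup_diagonal_hits(query, subject, ktup=2):
    """Count exact ktup-word matches on each diagonal (i - j)."""
    # Index every word of length ktup in the query by its start positions.
    word_positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        word_positions[query[i:i + ktup]].append(i)

    # Scan the subject; each shared word adds one hit to diagonal i - j.
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in word_positions.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


hits = ktup_diagonal_hits("ACGTACGT", "TTACGTAC", ktup=4)
print(hits.most_common(1))  # [(-2, 3)]: three word hits on diagonal -2
```

A larger ktup gives fewer, more specific hits and a faster search; a smaller ktup finds more hits at the cost of speed.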

The ten highest-scoring regions are re-scored using a scoring matrix (such as the PAM-250 matrix). The score of the highest-scoring initial region is saved as the init1 score.
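The re-scoring step might look like the sketch below. It assumes that a region is re-scored by finding the best-scoring ungapped run along its diagonal, and it uses a toy scoring function (+5 match, -4 mismatch) as a stand-in for a real matrix such as PAM-250. The name `best_diagonal_segment` and the candidate diagonals are hypothetical.

```python
def toy_score(a, b):
    """Stand-in for a substitution matrix lookup (assumed values)."""
    return 5 if a == b else -4


def best_diagonal_segment(query, subject, diagonal, score=toy_score):
    """Best-scoring ungapped run along one diagonal (i - j == diagonal)."""
    i = max(diagonal, 0)      # start position in the query
    j = i - diagonal          # matching start position in the subject
    best = running = 0
    while i < len(query) and j < len(subject):
        # Drop the running total back to zero when it would go negative.
        running = max(0, running + score(query[i], subject[j]))
        best = max(best, running)
        i += 1
        j += 1
    return best


query, subject = "ACGTACGT", "TTACGTAC"
candidate_diagonals = [-2, 2]  # e.g. the diagonals with the most word hits
init1 = max(best_diagonal_segment(query, subject, d) for d in candidate_diagonals)
print(init1)  # 30
```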

Next: FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest-scoring region at the end of this step is saved as the initn score.
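The joining step can be pictured as chaining regions so that the summed score, minus one penalty per join, is as large as possible. The sketch below is an assumption-laden illustration: the region tuples, the function name `initn_score` and the penalty value of 20 are all hypothetical, and the real program's bookkeeping may differ.

```python
def initn_score(regions, join_penalty=20):
    """Best chain score over non-overlapping regions.

    regions: list of (q_start, q_end, s_start, s_end, score), ends inclusive.
    """
    regions = sorted(regions)  # order by query start
    best = []                  # best[k] = best chain score ending with region k
    for k, (qs, qe, ss, se, score) in enumerate(regions):
        chain = score
        for p in range(k):
            _, pqe, _, pse, _ = regions[p]
            # A predecessor must end before this region starts in both sequences.
            if pqe < qs and pse < ss:
                chain = max(chain, best[p] + score - join_penalty)
        best.append(chain)
    return max(best, default=0)


regions = [
    (0, 9, 0, 9, 50),      # region A
    (15, 24, 18, 27, 40),  # region B, downstream of A in both sequences
    (5, 12, 30, 37, 30),   # region C, overlaps A in the query
]
print(initn_score(regions))  # 70: A joined to B, 50 + 40 - 20
```

With a penalty larger than the gain from joining, the best single region wins and initn equals init1.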

After computing the initial scores, FastA determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. The score for this alignment is the opt score.
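For reference, a plain Smith-Waterman local alignment score can be computed as in the sketch below. FastA uses a variation of the algorithm, so this is the textbook form rather than the program's own code; the match, mismatch and linear gap values are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            current[j] = max(0, diag, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6: the shared "ACG"
```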

Last: FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.
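The length normalization can be sketched as an ordinary least-squares fit of score against ln(length), followed by scaling each residual by the residual standard deviation. The function name and the sample numbers are hypothetical, and the final conversion from z-score to E() is deliberately left out because it depends on the fitted score distribution.

```python
import math


def length_normalized_z_scores(lengths, scores):
    """Regress score on ln(length); return residuals in standard-deviation units."""
    xs = [math.log(n) for n in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed score minus the score expected for that length.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sd for r in residuals]


lengths = [100, 200, 400, 800, 1600]   # made-up library sequence lengths
scores = [30, 35, 40, 45, 120]         # made-up opt scores
z = length_normalized_z_scores(lengths, scores)
print(max(range(len(z)), key=z.__getitem__))  # 4: the last sequence stands out
```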

Where to find the FastA programs? FastA searches can be done on the WWW FastA server at EBI (http://www.ebi.ac.uk/Tools/sss/fasta/), on a stand-alone computer such as dapsas1 at the Weizmann Institute, or from the GCG software package.

Comparison programs in the FastA3 package: Fasta3 - Compare a protein sequence to a protein database, or a DNA sequence to a DNA database, using the fasta algorithm. Search speed and selectivity are controlled with the “ktup” (word size) parameter. – Tips for ktup: for proteins, the default is ktup=2; ktup=1 is more sensitive but slower. – For DNA, the default is ktup=6; ktup=3 or ktup=4 gives more sensitivity; use ktup=1 for oligonucleotides (length <20).

Comparison programs in the FastA3 package: Ssearch3 - Compare a protein sequence to a protein database, or a DNA sequence to a DNA database, using the Smith-Waterman algorithm. It is very slow but much more sensitive for full-length protein comparisons. Fastx3 - Compare a DNA sequence to a protein database by translating the DNA sequence in three frames and allowing gaps and frameshifts.
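Three-frame translation, the first thing a translated search has to do with a DNA query, can be sketched as follows. The code uses the standard genetic code and translates only the forward strand; the function name `translate_three_frames` is an assumption for this example, and frameshift handling is not shown.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_three_frames(dna):
    """Translate the forward strand in frames 0, 1 and 2 ('X' for unknown codons)."""
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


print(translate_three_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```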

Which program when? Identify an unknown protein: fasta3, ssearch3, tfastx3. Identify a structural DNA sequence (repeated DNA, structural RNA): fasta3 (first with ktup=6, then ktup=3). Identify an EST sequence: fastx3 (check first if the EST encodes a protein homologous to a known protein).

Blast: Basic Local Alignment Search Tool (http://www.ncbi.nih.gov/BLAST/)

What is BLAST? BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships.

Contd… The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity.

Selecting the BLAST programme
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
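The choice of program reduces to a lookup on the query type and the database type. The helper below is purely illustrative: the function name `choose_blast_program` and the type labels are assumptions made for this sketch, not part of any BLAST interface.

```python
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database molecule types."""
    if translate_both:
        # tblastx translates both the query and the database in six frames.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))  # blastx
```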

Selecting the Database (protein)
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR.
month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days.
swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents: Protein sequences derived from the Patent division of GenBank.
yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all yeast protein sequences; it is a database of the protein translations of the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic CDS translations.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.

Nucleotide Databases
nr: All non-redundant GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
dbest: Non-redundant database of GenBank + EMBL + DDBJ EST divisions.
dbsts: Non-redundant database of GenBank + EMBL + DDBJ STS divisions.
mouse ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions limited to the organism mouse.
human ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions limited to the organism human.
other ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions for all organisms except mouse and human.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all yeast nucleotide sequences, but the sequence fragments from the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.

Entering your Sequence: The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GIs. The FASTA sequence format consists of a sequence name and description on a single line starting with the greater-than symbol '>', followed by the sequence:
> SequenceName description here
ATGTCGTTACCGTCGTCGGGACCGACCATG
AGAGCGA
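Reading this format programmatically is straightforward. The parser below is a minimal sketch with hypothetical function names; it tolerates a space after the '>' and joins sequence lines that were wrapped across several lines.

```python
def parse_fasta(text):
    """Yield (name, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    """Split the header into name and description; join wrapped sequence lines."""
    name, _, description = header.partition(" ")
    return name, description, "".join(chunks)


example = """> SequenceName description here
ATGTCGTTACCGTCGTCGGGACCGACCATG
AGAGCGA
"""
for name, description, sequence in parse_fasta(example):
    print(name, len(sequence))  # SequenceName 37
```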

Setting Up a Query: Is the query sequence represented in the database? Choose a current nucleic acid database. Select from among organism-specific (e.g., yeast), inclusive (e.g., nonredundant), or specialized (e.g., dbEST, dbSTS, GSS, HTG) databases, and search with blastn.

Are there homologs or evolutionary relatives of the query sequence in the database? Are there proteins whose function is related to the query sequence? Choose a protein database if the query is protein, or DNA expected to encode a protein, because amino acid searches are more sensitive. Use blastp for amino acid queries and blastx for translated nucleic acid queries. Use tblastn or tblastx for comparisons of an amino acid or translated nucleic acid query versus a translated nucleic acid database.

Search Parameters
Columns: Default | Special Cases: Short Query | Large Sequence Family | Ungapped BLAST
Filter: on | off | on
Scoring Matrix: BLOSUM62 | PAM30 for 35 and under | BLOSUM62
Word Size: 3 | 3, or reduce to 2 | 3 | 3
E value: 10 | 1000 or more | 10
Gap costs: 11,1 | 4
Alignments: 50 | 2000 | 50

Filter: The default setting will filter repetitive or low-complexity sequences from the query using the SEG (protein) or DUST (nucleic acid) programs. If a low-complexity region in the query is of interest, filtering will need to be turned off. If the number of hits returned is small when searching with a short query, it may help to re-search with filtering turned off. The Human repeat filter option masks human repeats such as LINEs and SINEs and is especially useful for human sequences that may contain these repeats.

Low-complexity region, for example: HHHHHHHHKMAY, HHHHHHHHSRHD. How to remove a low-complexity region: Given a segment of length L, with each amino acid occurring n1, n2, …, n20 times, slide a window of ~12 residues along the query sequence and use a threshold on the compositional complexity of each window.
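The sliding-window idea can be sketched with Shannon entropy as the complexity measure. This is a simplified stand-in, not the SEG program: the entropy measure, the 2.2-bit threshold and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of one window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in Counter(window).values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if window_entropy(sequence[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


print(mask_low_complexity("HHHHHHHHKMAY"))  # XXXXXXXXXXXX
print(mask_low_complexity("ACDEFGHIKLMN"))  # ACDEFGHIKLMN (unchanged)
```

The first window is dominated by histidine (8 of 12 residues, entropy about 1.58 bits) and is masked; the second has twelve distinct residues (about 3.58 bits) and passes.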

Scoring Matrices: BLOSUM62 (BLOcks SUbstitution Matrix) is the default matrix. The BLOSUM matrix assigns a probability score for each position in an alignment that is based on the frequency with which that substitution is known to occur within conserved blocks of related proteins. The matrix values are based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). These blocks were found in a database of ~500 families of related proteins. BLOSUM62 has been empirically shown to be among the best for detecting weak protein similarities.

Scoring Matrices Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.

Gap opening and gap extension penalties
BLAST program | Default Gap Penalty (G) | Default Gap Extension Penalty (E) | Other supported (G) values
blastp | -11 | -1 | -10, -1; -10, -2; -11, -1; -8, -2; -9, -2
blastn | -5 | -2 | none
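To see how the two penalties combine, the sketch below assumes the common affine convention in which a gap of length k costs G + k·E, so that with the blastp defaults a one-residue gap costs 12. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Affine cost of one gap: an opening charge plus a per-residue extension charge."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


for k in (1, 3, 10):
    print(k, gap_cost(k))  # 1 12, 3 14, 10 21
```

Because the opening charge dominates, one long gap is much cheaper than several short ones of the same total length.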

E value threshold The E value for an alignment score "S" represents the number of hits with a score equal to or better than "S" that would be "expected" by chance (the background noise) when searching a database of a particular size. The default E value for blastn, blastp, blastx and tblastn is 10. At this setting, 10 hits with scores equal to or better than the defined alignment score, S, are expected to occur by chance (in a search of the database using a random query with similar length). Increase the E value to 1000 or more when searching with a short query, since it is likely to be found many times by chance in a given database.
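The dependence of E on score and search-space size can be illustrated with the Karlin-Altschul form E = K·m·n·e^(−λS), where m is the query length and n the database length. In the sketch below, the λ and K defaults are placeholder values chosen for illustration; real values depend on the scoring matrix and gap costs, and length corrections are ignored.

```python
import math


def expected_hits(raw_score, query_length, database_length, lam=0.267, k=0.041):
    """E value for a raw score S in a search space of size m * n (lam, k assumed)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


db_length = 1_000_000  # made-up database size in residues
for score in (30, 50, 70):
    print(score, expected_hits(score, 250, db_length))
```

E falls exponentially as the score rises and grows linearly with database size, which is why the same alignment is less significant in a larger database.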

Alignments: If the number of alignments requested (x) is smaller than the number exceeding the significance threshold, only the top (x) hits will be reported. To detect low-similarity matches, the number of alignments to be shown should be increased when searching with a member of a large sequence family.

Analyzing the output. Step 1: Examine the alignment scores and statistics. Scores for each position of an alignment are derived from a substitution matrix. The raw score "S" of the alignment is usually calculated by summing the scores for each letter-to-letter and letter-to-null position in the alignment. The bit score (on a base-2 logarithmic scale) is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Therefore, bit scores from different alignments, even those employing different scoring matrices, can be compared.
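The normalization is usually written as S' = (λS − ln K) / ln 2, where λ and K are the statistical parameters of the scoring system. The sketch below uses made-up parameter values and a hypothetical function name; it checks that the E value computed from the bit score, m·n·2^(−S'), agrees with K·m·n·e^(−λS).

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to bits using the scoring system's lam and k."""
    return (lam * raw_score - math.log(k)) / math.log(2)


lam, k = 0.267, 0.041        # assumed parameters for one scoring system
m, n = 250, 1_000_000        # made-up query and database lengths
raw = 60

bits = bit_score(raw, lam, k)
e_from_bits = m * n * 2 ** (-bits)
e_from_raw = k * m * n * math.exp(-lam * raw)
print(round(bits, 1), math.isclose(e_from_bits, e_from_raw))
```

Once scores are in bits, λ and K no longer appear in the E value formula, which is what allows comparison across scoring systems.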

The higher the score, the better the alignment. There is no widely accepted theory for selecting gap costs. It is rarely necessary to change gap opening or extension values from the default.

Graphic Representation

At the top is a linear map of the query. Each bar drawn below the map represents a protein (or protein fragment) that matches the query sequence. The position of each bar relative to the linear map of the query allows the user to see instantly the extent to which the database matches align with a single or multiple regions of the query. The most similar hits are shown at the top in red. Pink, green, blue and black bars follow, representing proteins in decreasing order of similarity.

Phylogeny

Cow-to-Pig

Cow-to-Pig cDNA

DNA similarity reflects polypeptide similarity (Chuck Staben, 11/21/2016)

Coding vs Non-coding Regions

Conservation in Non-coding Conserved regulatory elements Genetic process (gene conversion…)

Third Base of Codon Hypervariable

Cow-to-Fish Protein
