Module 2 Sequence DBs and Similarity Searches Learning objectives Understand how information is stored in GenBank. Learn how to read a GenBank flat file.


Module 2: Sequence DBs and Similarity Searches. Learning objectives: Understand how information is stored in GenBank. Learn how to read a GenBank flat file. Learn how to search GenBank for information. Understand the difference between header, features and sequence. Learn the difference between a primary database and a secondary database. Understand the principle of similarity searches using the BLAST program.

What is GenBank? A gene sequence database. Its annotated records represent single contiguous stretches of DNA or RNA; a record may have more than one coding region (limit 350 kb). Records are generated from direct submissions by the authors to the DNA sequence databases. GenBank is part of the International Nucleotide Sequence Database Collaboration.

International Nucleotide Sequence Database Collaboration: GenBank (NCBI), EMBL (EBI, United Kingdom) and DDBJ (Japan) exchange information on a daily basis.

History of GenBank: Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965). In 1986 it began collaborating with EMBL and in 1987 with DDBJ. It is a primary database (i.e., experimental data is placed into it). Examples of secondary databases derived from GenBank/EMBL/DDBJ: Swiss-Prot, PIR. The GenBank Flat File is a human-readable form of the records.

General Comments on GBFF. A record has three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence.
Each record ends with // on its last line. The format is DNA-centered: the translated sequence is only a feature.
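The three-section layout can be illustrated with a short Python sketch. The record below is a made-up toy (the accession TOY0001 and its contents are hypothetical, not a real GenBank entry), and the function name split_genbank_record is an assumption of this example. It relies only on the top-level keywords FEATURES, ORIGIN and the // terminator described above.

```python
def split_genbank_record(text):
    """Split one flat-file record into header, feature and sequence lines."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):          # end of record
            break
        if line.startswith("FEATURES"):    # feature table starts here
            section = features
        elif line.startswith("BASE COUNT"):
            header.append(line)            # summary line, treated as header info
            continue
        elif line.startswith("ORIGIN"):    # sequence starts on the next line
            section = sequence
            continue
        section.append(line)
    return header, features, sequence


# Hypothetical toy record, invented for illustration only.
RECORD = """\
LOCUS       TOY0001      24 bp    DNA     PLN 01-JAN-2000
DEFINITION  Hypothetical toy record.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     CDS             1..24
                     /product="toy protein"
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

header, features, sequence = split_genbank_record(RECORD)
# Keep only letters: this drops the position numbers and spaces.
bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
print(len(header), "header lines,", len(features), "feature lines,", len(bases), "bases")
```

A production parser would handle many more cases (multiple records per file, CONTIG lines, wrapped qualifiers); this sketch only shows where the section boundaries fall.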

Feature Keys. Purpose: 1) Indicates the biological nature of the sequence. 2) Supplies information about changes to sequences.

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"

Interpretation: The feature CDS is a coding sequence beginning at base 23 and ending at base 400. It has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

Feature Keys-Terminology (Cont.)

Feature Key     Location/Qualifiers
CDS             join( , )
                /product="T-cell recep. B-ch."
                /partial

Interpretation: The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
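Reading a location amounts to recognizing a few operators around base ranges. The following Python sketch handles only the simple forms mentioned in these slides: a plain range, join(...), complement(...), and the < and > partial markers. The function name parse_location is hypothetical, and the join and complement coordinates in the usage lines are invented for illustration; only 23..400 comes from the example above.

```python
import re


def parse_location(loc):
    """Return (strand, partial, spans) for a simple feature location string."""
    loc = loc.replace(" ", "")
    strand = "+"
    if loc.startswith("complement(") and loc.endswith(")"):
        strand = "-"                            # feature lies on the opposite strand
        loc = loc[len("complement("):-1]
    if loc.startswith("join(") and loc.endswith(")"):
        loc = loc[len("join("):-1]              # several elements form one feature
    partial = "<" in loc or ">" in loc          # feature extends past the record
    spans = [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", loc)]
    return strand, partial, spans


def feature_length(spans):
    """Total number of bases covered by all joined elements."""
    return sum(end - start + 1 for start, end in spans)


strand, partial, spans = parse_location("23..400")
print(strand, partial, spans, feature_length(spans))   # 378 bases = 126 codons

# Invented coordinates, for illustration only.
print(parse_location("join(10..50,80..120)"))
print(parse_location("complement(<5..90)"))
```

The full location syntax has further operators that this sketch ignores.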

Record from GenBank

LOCUS       SCU bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Annotations:
LOCUS (date): Modification date.
LOCUS (PLN): GenBank division (plant, fungal and algal).
DEFINITION (cds): Coding region.
ACCESSION: Unique identifier (never changes).
VERSION: Nucleotide sequence identifier, in the form accession.version (changes when there is a change in sequence).
GI: GeneInfo identifier (changes whenever there is a change).
KEYWORDS: Word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
SOURCE: Common name for organism.
ORGANISM: Formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.

Record from GenBank (cont.1)

REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), (1996)
  MEDLINE
REFERENCE   3 (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

Annotations:
REFERENCE: Oldest reference first.
MEDLINE: Medline UID.
REFERENCE 3: Submitter of sequence (always the last reference).

Record from GenBank (cont.2)

FEATURES             Location/Qualifiers
     source
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature), each with a value.

Annotations:
CDS location (<): Partial sequence on the 5' end. The 3' end is complete.
/codon_start: Start of open reading frame.
/product: Descriptive free text must be in quotation marks.
/protein_id: Protein sequence ID #.
/db_xref: Database cross-refs.
/translation: Note: only a partial sequence.

Record from GenBank (cont.3)

     gene
                     /gene="AXL2"
     CDS
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN..."
     gene            complement( )
                     /gene="REV7"
     CDS             complement( )
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ..."

Annotations: Cutoff (the translations are shown truncated). New location.

Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct...
//
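The BASE COUNT line is simply a tally of the letters in the ORIGIN block. As a small Python sketch (the helper name base_count is an assumption of this example), the first sequence line shown above can be tallied, and the reported counts can be checked against the record length of 5028 bases given in the REFERENCE lines.

```python
from collections import Counter


def base_count(origin_lines):
    """Tally a, c, g and t in sequence lines, ignoring position numbers and spaces."""
    counts = Counter()
    for line in origin_lines:
        counts.update(ch for ch in line.lower() if ch in "acgt")
    return counts


# First ORIGIN line of the record, as shown on the slide.
first_line = "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg"
counts = base_count([first_line])
print(dict(counts), "total:", sum(counts.values()))

# Counts reported on the BASE COUNT line of the record.
reported = {"a": 1510, "c": 1074, "g": 835, "t": 1609}
print("reported total:", sum(reported.values()))   # equals the 5028 bases of the record
```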

Primary databases contain experimental biological information:
GenBank/EMBL/DDBJ
Alu: Alu repeats in human DNA
dbEST: expressed sequence tags; single-pass cDNA sequences (high error frequency). It is non-redundant.
HTGS: high-throughput genomic sequence database (errors!)
PDB: three-dimensional structure coordinates of biological molecules
PROSITE: database of protein domain/function relationships

Types of secondary databases that contain biological information:
dbSTS: non-redundant db of sequence-tagged sites (useful for physical mapping)
Genome databases (there are over 20 genome databases that can be searched)
EPD: eukaryotic promoter database
NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
Vector: a subset of GenBank containing vector DNA
ProDom
PRINTS
BLOCKS

Workshop 2A: Look up a GenBank record. Use the annotations to determine the first open reading frame.

Similarity Searching. It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar. [Figure: chemical structures of leucine and isoleucine] Should they get a 0 (non-identical), a 1 (identical), or something in between?

Purpose of finding differences and similarities of amino acids: Infer structural information. Infer functional information. Infer evolutionary relationships.

Evolutionary Basis of Sequence Alignment 1. Similarity: Quantity that relates to how alike two sequences are. 2. Identity: Quantity that describes how alike two sequences are in the strictest terms. 3. Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.

Evolutionary Basis of Sequence Alignment (Cont. 1) 1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity. 2. Underlined residues are identical. Asterisks and diamond represent those residues that participate in catalysis. Five gaps are placed to optimize the alignment.

Evolutionary Basis of Sequence Alignment (Cont. 2) Why are there regions of identity? 1) Conserved function-residues participate in reaction. 2) Structural-residues participate in maintaining structure of protein. (For example, conserved cysteine residues that form a disulfide linkage) 3) Historical-Residues that are conserved solely due to a common ancestor gene.

Evolutionary Basis of Sequence Alignment (Cont. 3) Note: It is possible that two proteins share a high degree of similarity but have two different functions. For example, human gamma-crystallin is a lens protein that has no known enzymatic activity. It shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor but their functions diverged. This is analogous to a railroad car that now functions as a diner: the structure is retained but the function has changed.

Modular nature of proteins The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity. Proteins can be thought of as assemblies of modular domains. It is thought that this may, in some cases, be due to a process known as exon shuffling.

Modular nature of proteins (cont. 1) [Figure: exon shuffling. Gene A (Exon 1a, Exon 2a) and Gene B (Exon 1b, Exon 2b) undergo duplication and exchange. Afterwards Gene A contains Exon 1a, Exon 2a and Exon 3 (Exon 2b from Gene B), and Gene B contains Exon 1b, Exon 2b and Exon 3 (Exon 2a from Gene A).]

Dot Plots. [Figure: dot plot of the sequence ATGCCTAG against ATGCCTAG with window = 1; 16 cells are marked with *.] Note that 25% of the table will be filled due to random chance: there is a 1 in 4 chance of a match at each position.

Dot Plots with window = 2. [Figure: dot plot of the sequence ATGCCTAG against ATGCCTAG with window = 2; 7 cells are marked with *.] The larger the window, the more noise can be filtered. What is the percent chance that you will receive a match randomly? 1/16 * 100 = 6.25%
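The sliding-window comparison behind both dot plots can be sketched in a few lines of Python. The function names dot_plot and render are assumptions of this example; a hit is recorded wherever the window-length words starting at two positions are identical, which is the simplest possible scoring rule.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) positions where window-length words are identical."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: seq_b across the top, seq_a down the side."""
    rows = ["  " + " ".join(seq_b)]
    for i, letter in enumerate(seq_a):
        cells = ("*" if (i, j) in hits else "." for j in range(len(seq_b)))
        rows.append(letter + " " + " ".join(cells))
    return "\n".join(rows)


seq = "ATGCCTAG"
for window in (1, 2):
    hits = dot_plot(seq, seq, window)
    chance = 0.25 ** window   # probability that two random DNA words of this length match
    print(f"window={window}: {len(hits)} dots, random match chance {chance:.2%}")
    print(render(seq, seq, hits))
```

With window = 1 the off-diagonal dots come from repeated letters; with window = 2 only the main diagonal survives for this sequence, which is the noise filtering the slide describes.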

Identity Matrix. Simplest type of scoring matrix.

    L  I  C  A
L   1  0  0  0
I      1  0  0
C         1  0
A            1

Similarity. It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar. [Figure: chemical structures of leucine and isoleucine] Should they get a 0 (non-identical), a 1 (identical), or something in between?

Scoring Matrices Importance of scoring matrices Scoring matrices appear in all analyses involving sequence comparisons. The choice of matrix can strongly influence the outcome of the analysis. Scoring matrices implicitly represent a particular theory of sequence alignment. Understanding theories underlying a given scoring matrix can aid in making the proper choice when performing sequence alignments.

Scoring Matrices. When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, M_11 refers to the entry at the first row and the first column. In general, M_ij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric index with each letter in the alphabet of the sequence. For example, if the alphabet is {A,C,T,G}, then the A-A score is found at 1,1; the A-C score at 1,2; etc.
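As a minimal Python sketch of this indexing convention, the {A,C,T,G} alphabet can be mapped to row and column numbers and used to look up scores in an identity matrix. The names INDEX, pair_score and ungapped_score are assumptions of this example, and Python counts from 0, so the slide's M_11 is matrix[0][0] here.

```python
ALPHABET = "ACTG"
# Letter -> 0-based index; the slide's 1-based M_11 is MATRIX[0][0] in Python.
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}

# Identity matrix: 1 on the diagonal (identical letters), 0 elsewhere.
IDENTITY = [[1 if i == j else 0 for j in range(len(ALPHABET))]
            for i in range(len(ALPHABET))]


def pair_score(matrix, a, b):
    """Look up the score for aligning letter a with letter b."""
    return matrix[INDEX[a]][INDEX[b]]


def ungapped_score(matrix, seq_a, seq_b):
    """Sum the pair scores of two equal-length sequences, position by position."""
    return sum(pair_score(matrix, a, b) for a, b in zip(seq_a, seq_b))


print(pair_score(IDENTITY, "A", "A"))            # entry at row 1, column 1
print(pair_score(IDENTITY, "A", "C"))            # entry at row 1, column 2
print(ungapped_score(IDENTITY, "ACTG", "ACGG"))  # number of identical positions
```

Swapping IDENTITY for any other table of the same shape changes the scoring scheme without touching the lookup code.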

Steps to building the first PAM (Point Accepted Mutation) 1. Dayhoff aligned sequences that were at least 85% identical. 2. Reconstructed phylogenetic trees and inferred ancestral sequences. 71 trees containing 1,572 aa exchanges were used. 3. Tallied aa replacements "accepted" by natural selection, in all pair-wise comparisons.

Steps to building PAM (cont. 1) 4. Computed amino acid mutability, m_j (the propensity of a given amino acid, j, to be replaced). 5. Combined data from 3 & 4 to produce a Mutation Probability Matrix (MPM) for one PAM of evolutionary distance. The off-diagonal entries give the probabilities of replacements; the diagonal entry, the MPM entry of aa j for aa j, is M_jj = 1 - m_j.

Steps to building PAM (cont. 2) 6. Took the log-odds ratio to obtain each score: S_ij = log(M_ij / f_i), where f_i is the normalized frequency of aa i in the sequences used. (Note: this is what you see in the matrix.) 7. Note: the M_ij / f_i values must be multiplied by factors of 10 to avoid fractions.
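Step 6 can be sketched numerically in Python. The probabilities and frequency below are made-up toy values, not Dayhoff's data, and the function name log_odds is hypothetical. One assumption to flag: this sketch uses a base-10 logarithm, applies the factor of 10 to the logarithm, and rounds to an integer, which is one common way of turning the ratios into whole-number scores.

```python
import math


def log_odds(m_ij, f_i, scale=10):
    """Integer log-odds score from a mutation probability and a background frequency.

    Assumption of this sketch: base-10 log, scaled by `scale`, then rounded.
    """
    return round(scale * math.log10(m_ij / f_i))


F_I = 0.05  # toy background frequency of amino acid i (hypothetical)

# Toy mutation probabilities (hypothetical values).
print(log_odds(0.30, F_I))   # replacement seen more often than chance: positive score
print(log_odds(0.05, F_I))   # seen exactly as often as chance: score of zero
print(log_odds(0.02, F_I))   # seen less often than chance: negative score
```

The sign of the score carries the interpretation: positive means the pair is aligned more often by descent than by chance, negative means less often.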

Assumptions in the PAM model 1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model). 2. Sequences that are being compared have average amino acid composition.

The bottom line on PAM: each score reflects the ratio (frequencies of alignment) / (frequencies of occurrence), i.e., the probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.

Sources of error in PAM model 1. Many sequences depart from average aa composition. 2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 aa pairs, out of approximately 400 aa pairs, no replacements were observed!). 3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM (the 1 PAM matrix raised to the kth power, M^k, gives the k PAM matrix). 4. This process (Markov) is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.
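The extrapolation from 1 PAM to 250 PAM in point 3 is repeated matrix multiplication. The Python sketch below uses a hypothetical two-letter alphabet with invented probabilities (a real matrix is 20 x 20), and the helper names mat_mul and mat_power are assumptions of this example. Each column holds the probabilities of what one letter becomes, so every column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Hypothetical 1 PAM matrix for a two-letter alphabet (invented numbers).
# Column j lists the fate of letter j: stay the same or be replaced.
PAM1_TOY = [[0.99, 0.02],
            [0.01, 0.98]]

pam250 = mat_power(PAM1_TOY, 250)
for row in pam250:
    print([round(x, 3) for x in row])
```

Because the same matrix is multiplied by itself 250 times, any error in a 1 PAM entry is carried through every step, which is why the slide warns that errors are magnified.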

BLOSUM Matrices. BLOSUM is built from distantly related sequences, whereas PAM is built from closely related sequences. BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (remember that the BLOCKS database is a secondary database that depends on the PROSITE families).

Gap Penalties. Gap penalties take insertions and deletions into account. An alignment cannot have too many gaps, or it becomes meaningless. Typically, there is a fixed deduction for introducing a gap plus an additional deduction for the length of the gap. Gap penalty = G + Ln, where G = gap opening penalty, L = gap extension penalty and n = gap length. Typical values: G = 2 to 12, L = 2.
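The formula Gap penalty = G + Ln translates directly into Python. In this sketch the function name gap_penalty is hypothetical and G = 10 is one value picked from the 2 to 12 range quoted above, with L = 2.

```python
def gap_penalty(length, gap_open=10, gap_extend=2):
    """Penalty for one gap of `length` positions: G + L * n."""
    return gap_open + gap_extend * length


# One long gap versus two short gaps covering the same number of positions.
one_gap_of_four = gap_penalty(4)
two_gaps_of_two = 2 * gap_penalty(2)
print(one_gap_of_four, two_gaps_of_two)
```

Because the opening cost G is paid once per gap, a single gap of four positions costs less than two separate gaps of two, so the scheme favors a few long gaps over many scattered ones.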

Global Alignment vs. Local Alignment. Global alignment is used when the overall gene sequence is similar to another sequence; it is often used in multiple sequence alignment. Example: Clustal W algorithm (Needleman-Wunsch). Local alignment is used when only a small portion of one gene is similar to a small portion of another gene. Examples: BLAST, FASTA, Smith-Waterman algorithm.
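The difference between the two approaches shows up clearly in a small dynamic-programming sketch in Python. This is a simplified illustration, not the implementation used by any of the programs named above: it returns only the score, uses a linear gap cost instead of the G + Ln scheme, and the match, mismatch and gap values are assumptions chosen for the example. The two sequences are invented so that they share only the short region ATGC.

```python
def align_score(s, t, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score: Needleman-Wunsch style (global) or Smith-Waterman style (local)."""
    rows, cols = len(s) + 1, len(t) + 1
    h = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized, so the borders count down.
        for i in range(1, rows):
            h[i][0] = i * gap
        for j in range(1, cols):
            h[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local: a poor prefix can be dropped
                best = max(best, cell)   # local: the alignment may end anywhere
            h[i][j] = cell
    return best if local else h[-1][-1]


# Invented sequences that share only the region ATGC.
seq_a = "CCCCATGCAAAA"
seq_b = "GGATGCTT"
global_score = align_score(seq_a, seq_b)
local_score = align_score(seq_a, seq_b, local=True)
print("global:", global_score, "local:", local_score)
```

The global score is dragged down by the unrelated flanking regions and the length difference, while the local score reflects only the shared segment.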

Two proteins that are similar in certain regions Tissue plasminogen activator (PLAT) Coagulation factor 12 (F12).

The Dotter Program. The program consists of three components: 1) a sliding window; 2) a scoring matrix that gives a score for each amino acid; 3) a graph that converts the score to a dot of a certain pixel density.

Region of similarity Single region on F12 is similar to two regions on PLAT

BLAST: Basic Local Alignment Search Tool. Speed is achieved by: 1) pre-indexing the database before the search; 2) parallel processing; 3) using a hash table that contains neighborhood words rather than just identical words.

Neighborhood words. The program declares a hit if the word taken from the query sequence has a score >= T when a substitution matrix is used. This allows the word size W (similar to the ktup value in FASTA) to be kept high (for speed) without sacrificing sensitivity. If T is increased by the user, the number of background hits is reduced and the program will run faster.
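The effect of the threshold T can be seen by enumerating neighborhood words directly. The Python sketch below is an illustration under stated assumptions, not BLAST's actual code: it uses a three-letter alphabet (L, I, V) with illustrative substitution scores assumed for this example, a word size W = 3, and a hypothetical function name neighborhood. Every candidate word is scored against the query word, and those scoring at least T are kept.

```python
from itertools import product

ALPHABET = "LIV"
# Illustrative substitution scores for a toy three-letter alphabet (assumed values).
MISMATCH = {frozenset("LI"): 2, frozenset("IV"): 3, frozenset("LV"): 1}
MATCH = 4


def pair_score(a, b):
    """Score for aligning residue a with residue b."""
    return MATCH if a == b else MISMATCH[frozenset((a, b))]


def neighborhood(word, threshold):
    """All words of the same length whose score against `word` is >= threshold."""
    neighbors = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return sorted(neighbors)


query_word = "LIV"   # W = 3
for t in (12, 11, 10):
    print("T =", t, neighborhood(query_word, t))
```

Raising T shrinks the list toward the query word alone, which means fewer hits to extend and a faster but less sensitive search; lowering T admits more similar words.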

Workshop for module 2: Use the Dotter program to determine the optimal alignment between two sequences. Perform a Blast search on a protein sequence.