Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008.

Slides:



Advertisements
Similar presentations
Pairwise sequence alignment.
Advertisements

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Bioinformatics Tutorial I BLAST and Sequence Alignment.
Dayhoff Model: Accepted Point Mutation (PAM) Arthur W. Chou Fall 2005 Tunghai University.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Heuristic alignment algorithms and cost matrices
Sequence Alignment.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Similar Sequence Similar Function Charles Yan Spring 2006.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
COMP 571 Bioinformatics: Sequence Analysis Pairwise Sequence Alignment (II)
Pairwise sequence alignment.
Pairwise sequence Alignment.
Pairwise Sequence Alignment (PSA)
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Biology 224 Instructor: Tom Peavy Sept 13 &15
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Pairwise Sequence Alignment. Pairwise alignments in the 1950s  -corticotropin (sheep) Corticotropin A (pig) ala gly glu asp asp glu asp gly ala glu.
8/27/20151 Pairwise sequence Alignment. 8/27/20152 Many of the images in this power point presentation are from Bioinformatics and Functional Genomics.
Alineamiento Múltiple de secuencias. It is used to decide if two proteins (or genes) are related structurally or functionally It is used to identify.
Pairwise & Multiple sequence alignments
Pairwise Alignments Part 1 Biology 224 Instructor: Tom Peavy Sept 8
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Content of the previous class Introduction The evolutionary basis of sequence alignment The Modular Nature of proteins.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
November 22, 2010 Pairwise sequence alignment Jonathan Pevsner, Ph.D. Bioinformatics Johns Hopkins School Med.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Biology 4900 Biocomputing.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.
Genomics and Personalized Care in Health Systems Lecture 1: Introduction Leming Zhou, PhD Department of Health Information management School of Health.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
September 10, 2003 Pairwise sequence alignment. Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by.
COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
In-Class Assignment #1: Research CD2
Pairwise sequence Alignment. Learning objectives Define homologs, paralogs, orthologs Perform pairwise alignments (NCBI BLAST) Understand how scores are.
Pairwise sequence Alignment Homology, Score Matrix.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of Health and Rehabilitation Sciences Department of Health.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1)From which.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence similarity, BLAST alignments & multiple sequence alignments
Introduction to sequence alignment Mike Hallett (David Walsh)
Pairwise sequence alignment November 24, 2008 Jonathan Pevsner, Ph.D.
Pairwise sequence Alignment.
Pairwise Sequence Alignment
Basic Local Alignment Search Tool
Presentation transcript:

Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775GenBank genomic DNA sequence NT_030059Genomic contig Rs dbSNP (single nucleotide polymorphism) N An expressed sequence tag (1 of 170) NM_006744RefSeq DNA sequence (from a transcript) NP_007635RefSeq protein AAC02945GenBank protein Q28369SwissProt protein 1KT7Protein Data Bank structure record protein DNA RNA

The RefSeq Accession number format and molecule types. (NCBI Handbook) Accession prefixMolecule type NC_Complete genomic molecule NG_Genomic region NM_mRNA NP_Protein NR_RNA NT_ a Genomic contig NW_ a Genomic contig (WGS b ) XM_ a mRNA XP_ a Protein XR_ a RNA NZ_ c Genomic (WGS) ZP_ a Protein, on NZ_ a Computed. b Assembly of Whole Genome Shotgun (WGS) sequence data. c An ordered collection of WGS for a genome.

Six ways to access DNA and protein sequences 1) RefSeq database (NCBI) 2) Entrez 3) UniGene 4) Nucleotide or Protein databases (NCBI) 5) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) 6) ExPASy Sequence Retrieval System (separate from NCBI)

Pairwise Alignments Biology 224 Instructor: Tom Peavy February 5 & 7, 2008 <PowerPoint slides based on Bioinformatics and Functional Genomics by Jonathan Pevsner>

Pairwise alignments in the 1950s  -corticotropin (sheep) Corticotropin A (pig) ala gly glu asp asp glu asp gly ala glu asp glu Oxytocin Vasopressin CYIQNCPLG CYFQNCPRG Early alignments revealed --differences in amino acid sequences between species --differences in amino acids responsible for distinct functions

It is used to decide if two proteins (or genes) are related structurally or functionally It is used to identify domains or motifs that are shared between proteins It is the basis of BLAST searching (next week) It is used in the analysis of genomes Pairwise sequence alignment is the most fundamental operation of bioinformatics

Pairwise alignment: protein sequences can be more informative than DNA protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties codons are degenerate: changes in the third position often do not alter the amino acid that is specified protein sequences offer a longer “look-back” time (relatedness over millions or billions of years) (note: issue of convergent evolution) DNA sequences can be translated into protein, and then used in pairwise alignments

DNA can be translated into six potential proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG Pairwise alignment: protein sequences can be more informative than DNA 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240 |||||||| |||| |||||| ||||| | ||||||||||||||||||||||||||||||| Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247 Pairwise alignment: protein sequences can be more informative than DNA Many times, DNA alignments are appropriate --to confirm the identity of a cDNA --to study noncoding regions of DNA --to study DNA polymorphisms --to study molecular evolution (syn. vs nonsyn) --example: Neanderthal vs modern human DNA

Pairwise alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Definitions

Homology Similarity attributed to descent from a common ancestor. Definitions Identity The extent to which two (nucleotide or amino acid) sequences are invariant. RBP 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 +K GTW++MA + L + A V T + +L+ W+ glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Conservation Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico- chemical properties of the original residue. Similarity The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. Definitions

Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Definitions: two types of homology

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin Pairwise GLOBAL alignment of retinol-binding protein and  -lactoglobulin 25% identity; 32% similarity

retinol-binding protein (NP_006735)  -lactoglobulin (P02754) RBP and  -lactoglobulin are homologous proteins that share related three-dimensional structures

Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. In BLAST, it is rarely necessary to change gap values from the default. Gaps

Should distantly related species have more gaps than closely related species (or genes)? What about their relationship in regards to sequence identity?

There are 3 Principal Methods of Pair-wise Sequence Alignment 1)Dot Matrix Analysis (e.g. Dotlet, Dotter, Dottup) 2)Dynamic Programming (DP) algorithm 3)Word or k-tuple methods (e.g. FASTA & BLAST)

Exon and Introns

General approach to pairwise alignment Choose two sequences Select an algorithm that generates a score Allow gaps (insertions, deletions) Score reflects degree of similarity Alignments can be global or local Estimate probability that the alignment occurred by chance

Calculation of an alignment score

Scoring Matrices take into account: Conservation – acceptable substitutions while not changing function of protein (charge, size, hydrophobicity) Frequency – reflect how often particular residues occur among entire collection of proteins (rare residues given more weight) Evolution – different scoring matrices are designed to either detect closely related or more distantly related proteins

global alignment algorithm --Needleman and Wunsch (1970) local alignment algorithm --Smith and Waterman (1981) Two kinds of sequence alignment: global and local

Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other Local alignment finds optimally matching regions within two sequences (“subsequences”) Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough. Global alignment versus local alignment

Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn Find the optimal subpaths, and add them up to achieve the best score. This involves --adding gaps when needed --allowing for conservative substitutions --choosing a scoring system (simple or complicated) N-W is guaranteed to find optimal alignment(s) Global alignment with the algorithm of Needleman and Wunsch (1970)

[1] set up a matrix [2] score the matrix [3] identify the optimal alignment(s) Three steps to global alignment with the Needleman-Wunsch algorithm

[1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!) Four possible outcomes in aligning two sequences 1 2

How the Smith-Waterman algorithm works Set up a matrix between two proteins (size m+1, n+1) No values in the scoring matrix can be negative! S > 0 The score in each cell is the maximum of four values: [1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch) [2] s(i,j-1) – gap penalty [3] s(i-1,j) – gap penalty [4] zero

Smith-Waterman local alignment algorithm

Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the two sequences being aligned (or the product of a query against an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W

How FASTA works [1] A “lookup table” is created. It consists of short stretches of amino acids (e.g. k=3 for a protein search). The length of a stretch is called a k-tuple. The FASTA algorithm finds the ten highest scoring segments that align to the query. [2] These ten aligned regions are re-scored with a PAM or BLOSUM matrix. [3] High-scoring segments are joined. [4] Needleman-Wunsch or Smith-Waterman is then performed.

Go to Choose BLAST 2 sequences In the program, [1] choose blastp or blastn [2] paste in your accession numbers (or use FASTA format) [3] select optional parameters --3 BLOSUM and 3 PAM matrices --gap creation and extension penalties --filtering --word size [4] click “align” Pairwise alignment: BLAST 2 sequences

True positivesFalse positives False negatives Sequences reported as related Sequences reported as unrelated True negatives homologous sequences non-homologous sequences Sensitivity: ability to find true positives Specificity: ability to minimize false positives

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM. Substitution Matrix

Normalized frequencies of amino acids: variations in frequency of occurrence Gly8.9%Arg4.1% Ala8.7%Asn4.0% Leu8.5%Phe4.0% Lys8.1%Gln3.8% Ser7.0%Ile3.7% Val6.5%His3.4% Thr5.8%Cys3.3% Pro5.1%Tyr3.0% Glu5.0%Met1.5% Asp4.7%Trp1.0% blue=6 codons; red=1 codon; note: should be 5% for each if equally distributed

fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA Dayhoff et al. examined multiple sequence alignments (e.g. glyceraldehyde 3-phosphate dehydrogenases) to generate tables of accepted point mutations Examined 1572 changes in 71 groups of closely related proteins

Dayhoff’s PAM1 mutation probability matrix Original amino acid Each element of the matrix shows the probability that an amino acid (top) will be replaced by another residue (side) (n=10,000)

All the PAM data come from alignments of closely related proteins (>85% amino acid identity) PAM matrices are based on global sequence alignments. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence (all other PAM matrices are extrapolated from PAM1). For the PAM1 matrix, that interval is 1% amino acid Divergence; note that the interval is not in units of time. Dayhoff’s PAM1 mutation probability matrix (Point-Accepted Mutations)

Consider a PAM0 matrix. No amino acids have changed, so the values on the diagonal are 100%. Consider a PAM2000 (nearly infinite) matrix. The values approach the background frequencies of the amino acids (given in Table 3-2). PAM0 and PAM 2000 mutation probability matrices

The PAM250 matrix is of particular interest because it corresponds to an evolutionary distance of about 20% amino acid identity (the approximate limit of detection for the comparison of most proteins). Note the loss of information content along the main diagonal, relative to the PAM1 matrix. The PAM250 mutation probability matrix

PAM250 mutation probability matrix Top: original amino acid Side: replacement amino acid

Why do we go from a mutation probability matrix to a log odds matrix? We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix? The cells in a log odds matrix consist of an “odds ratio”: the probability that an alignment is authentic the probability that the alignment was random The score S for an alignment of residues a,b is given by: S(a,b) = 10 log 10 (M ab /p b ) As an example, for tryptophan, S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4

PAM250 log odds scoring matrix

What do the numbers mean in a log odds matrix? S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4 A score of +17 for tryptophan means that this alignment is 55 times more likely than a chance alignment of two Trp residues. S(a,b) = 17 Probability of replacement (M ab /p b ) = x Then 17.4 = 10 log 10 x 1.74 = log 10 x = x = 55

PAM10 log odds scoring matrix Note that penalties for mismatches are far more severe than for PAM250; e.g. W  T –19 vs. –5.

BLOSUM matrices are based on local alignments Created in 1992 (rather than 1978 for PAM; global aln) BLOSUM stands for blocks substitution matrix. Used 2000 blocks of conserved regions representing more than 500 groups of proteins examined BLOSUM62 is a matrix calculated from comparisons of sequences with no less than 62% identity BLOSUM Matrices

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. BLOSUM Matrices

Percent identity Differences per 100 residues At PAM1, two proteins are 99% identical At PAM10.7, there are 10 differences per 100 residues At PAM80, there are 50 differences per 100 residues At PAM250, there are 80 differences per 100 residues WHY? “twilight zone”

Ancestral sequence Sequence 1 ACCGATC Sequence 2 AATAATC A no changeA C single substitutionC --> A C multiple substitutionsC --> A --> T C --> G coincidental substitutionsC --> A T --> A parallel substitutionsT --> A A --> C --> T convergent substitutionsA --> T C back substitutionC --> T --> C ACCCTAC Li (1997) p.70

Summary of alignment scoring system global versus local alignment positive and negative values assigned gap creation and extension penalties positive score for identities some partial positive score for conservative substitutions use of a substitution matrix