Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008.

Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775GenBank genomic DNA sequence NT_030059Genomic contig Rs7079946dbSNP (single nucleotide polymorphism) N91759.1An expressed sequence tag (1 of 170) NM_006744RefSeq DNA sequence (from a transcript) NP_007635RefSeq protein AAC02945GenBank protein Q28369SwissProt protein 1KT7Protein Data Bank structure record protein DNA RNA

The RefSeq Accession number format and molecule types. (NCBI Handbook) Accession prefixMolecule type NC_Complete genomic molecule NG_Genomic region NM_mRNA NP_Protein NR_RNA NT_ a Genomic contig NW_ a Genomic contig (WGS b ) XM_ a mRNA XP_ a Protein XR_ a RNA NZ_ c Genomic (WGS) ZP_ a Protein, on NZ_ a Computed. b Assembly of Whole Genome Shotgun (WGS) sequence data. c An ordered collection of WGS for a genome.

Six ways to access DNA and protein sequences 1) RefSeq database (NCBI) 2) Entrez 3) UniGene 4) Nucleotide or Protein databases (NCBI) 5) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) 6) ExPASy Sequence Retrieval System (separate from NCBI)

Pairwise Alignments Biology 224 Instructor: Tom Peavy February 5 & 7, 2008 <PowerPoint slides based on Bioinformatics and Functional Genomics by Jonathan Pevsner>

Pairwise alignments in the 1950s  -corticotropin (sheep) Corticotropin A (pig) ala gly glu asp asp glu asp gly ala glu asp glu Oxytocin Vasopressin CYIQNCPLG CYFQNCPRG Early alignments revealed --differences in amino acid sequences between species --differences in amino acids responsible for distinct functions

It is used to decide if two proteins (or genes) are related structurally or functionally It is used to identify domains or motifs that are shared between proteins It is the basis of BLAST searching (next week) It is used in the analysis of genomes Pairwise sequence alignment is the most fundamental operation of bioinformatics

Pairwise alignment: protein sequences can be more informative than DNA protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties codons are degenerate: changes in the third position often do not alter the amino acid that is specified protein sequences offer a longer “look-back” time (relatedness over millions or billions of years) (note: issue of convergent evolution) DNA sequences can be translated into protein, and then used in pairwise alignments

DNA can be translated into six potential proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG Pairwise alignment: protein sequences can be more informative than DNA 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240 |||||||| |||| |||||| ||||| | ||||||||||||||||||||||||||||||| Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247 Pairwise alignment: protein sequences can be more informative than DNA Many times, DNA alignments are appropriate --to confirm the identity of a cDNA --to study noncoding regions of DNA --to study DNA polymorphisms --to study molecular evolution (syn. vs nonsyn) --example: Neanderthal vs modern human DNA

Pairwise alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Definitions

Homology Similarity attributed to descent from a common ancestor. Definitions Identity The extent to which two (nucleotide or amino acid) sequences are invariant. RBP 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 +K++ +++ GTW++MA + L + A V T + +L+ W+ glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Conservation Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico- chemical properties of the original residue. Similarity The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. Definitions

Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Definitions: two types of homology

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise GLOBAL alignment of retinol-binding protein and  -lactoglobulin 25% identity; 32% similarity

retinol-binding protein (NP_006735)  -lactoglobulin (P02754) RBP and  -lactoglobulin are homologous proteins that share related three-dimensional structures

Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. In BLAST, it is rarely necessary to change gap values from the default. Gaps

Should distantly related species have more gaps than closely related species (or genes)? What about their relationship in regards to sequence identity?

There are 3 Principal Methods of Pair-wise Sequence Alignment 1)Dot Matrix Analysis (e.g. Dotlet, Dotter, Dottup) 2)Dynamic Programming (DP) algorithm 3)Word or k-tuple methods (e.g. FASTA & BLAST)

Exon and Introns

General approach to pairwise alignment Choose two sequences Select an algorithm that generates a score Allow gaps (insertions, deletions) Score reflects degree of similarity Alignments can be global or local Estimate probability that the alignment occurred by chance

Calculation of an alignment score http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html

Scoring Matrices take into account: Conservation – acceptable substitutions while not changing function of protein (charge, size, hydrophobicity) Frequency – reflect how often particular residues occur among entire collection of proteins (rare residues given more weight) Evolution – different scoring matrices are designed to either detect closely related or more distantly related proteins

global alignment algorithm --Needleman and Wunsch (1970) local alignment algorithm --Smith and Waterman (1981) Two kinds of sequence alignment: global and local

Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other Local alignment finds optimally matching regions within two sequences (“subsequences”) Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough. Global alignment versus local alignment

Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn Find the optimal subpaths, and add them up to achieve the best score. This involves --adding gaps when needed --allowing for conservative substitutions --choosing a scoring system (simple or complicated) N-W is guaranteed to find optimal alignment(s) Global alignment with the algorithm of Needleman and Wunsch (1970)

[1] set up a matrix [2] score the matrix [3] identify the optimal alignment(s) Three steps to global alignment with the Needleman-Wunsch algorithm

[1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!) Four possible outcomes in aligning two sequences 1 2

How the Smith-Waterman algorithm works Set up a matrix between two proteins (size m+1, n+1) No values in the scoring matrix can be negative! S > 0 The score in each cell is the maximum of four values: [1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch) [2] s(i,j-1) – gap penalty [3] s(i-1,j) – gap penalty [4] zero

Smith-Waterman local alignment algorithm

Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the two sequences being aligned (or the product of a query against an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W

How FASTA works [1] A “lookup table” is created. It consists of short stretches of amino acids (e.g. k=3 for a protein search). The length of a stretch is called a k-tuple. The FASTA algorithm finds the ten highest scoring segments that align to the query. [2] These ten aligned regions are re-scored with a PAM or BLOSUM matrix. [3] High-scoring segments are joined. [4] Needleman-Wunsch or Smith-Waterman is then performed.

Go to http://www.ncbi.nlm.nih.gov/BLAST Choose BLAST 2 sequences In the program, [1] choose blastp or blastn [2] paste in your accession numbers (or use FASTA format) [3] select optional parameters --3 BLOSUM and 3 PAM matrices --gap creation and extension penalties --filtering --word size [4] click “align” Pairwise alignment: BLAST 2 sequences

True positivesFalse positives False negatives Sequences reported as related Sequences reported as unrelated True negatives homologous sequences non-homologous sequences Sensitivity: ability to find true positives Specificity: ability to minimize false positives

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM. Substitution Matrix

Normalized frequencies of amino acids: variations in frequency of occurrence Gly8.9%Arg4.1% Ala8.7%Asn4.0% Leu8.5%Phe4.0% Lys8.1%Gln3.8% Ser7.0%Ile3.7% Val6.5%His3.4% Thr5.8%Cys3.3% Pro5.1%Tyr3.0% Glu5.0%Met1.5% Asp4.7%Trp1.0% blue=6 codons; red=1 codon; note: should be 5% for each if equally distributed

fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA Dayhoff et al. examined multiple sequence alignments (e.g. glyceraldehyde 3-phosphate dehydrogenases) to generate tables of accepted point mutations Examined 1572 changes in 71 groups of closely related proteins

Dayhoff’s PAM1 mutation probability matrix Original amino acid Each element of the matrix shows the probability that an amino acid (top) will be replaced by another residue (side) (n=10,000)

All the PAM data come from alignments of closely related proteins (>85% amino acid identity) PAM matrices are based on global sequence alignments. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence (all other PAM matrices are extrapolated from PAM1). For the PAM1 matrix, that interval is 1% amino acid Divergence; note that the interval is not in units of time. Dayhoff’s PAM1 mutation probability matrix (Point-Accepted Mutations)

Consider a PAM0 matrix. No amino acids have changed, so the values on the diagonal are 100%. Consider a PAM2000 (nearly infinite) matrix. The values approach the background frequencies of the amino acids (given in Table 3-2). PAM0 and PAM 2000 mutation probability matrices

The PAM250 matrix is of particular interest because it corresponds to an evolutionary distance of about 20% amino acid identity (the approximate limit of detection for the comparison of most proteins). Note the loss of information content along the main diagonal, relative to the PAM1 matrix. The PAM250 mutation probability matrix

PAM250 mutation probability matrix Top: original amino acid Side: replacement amino acid

Why do we go from a mutation probability matrix to a log odds matrix? We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix? The cells in a log odds matrix consist of an “odds ratio”: the probability that an alignment is authentic the probability that the alignment was random The score S for an alignment of residues a,b is given by: S(a,b) = 10 log 10 (M ab /p b ) As an example, for tryptophan, S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4

PAM250 log odds scoring matrix

What do the numbers mean in a log odds matrix? S(a,tryptophan) = 10 log 10 (0.55/0.010) = 17.4 A score of +17 for tryptophan means that this alignment is 55 times more likely than a chance alignment of two Trp residues. S(a,b) = 17 Probability of replacement (M ab /p b ) = x Then 17.4 = 10 log 10 x 1.74 = log 10 x 10 1.74 = x = 55

PAM10 log odds scoring matrix Note that penalties for mismatches are far more severe than for PAM250; e.g. W  T –19 vs. –5.

BLOSUM matrices are based on local alignments Created in 1992 (rather than 1978 for PAM; global aln) BLOSUM stands for blocks substitution matrix. Used 2000 blocks of conserved regions representing more than 500 groups of proteins examined BLOSUM62 is a matrix calculated from comparisons of sequences with no less than 62% identity BLOSUM Matrices

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. BLOSUM Matrices

Percent identity Differences per 100 residues At PAM1, two proteins are 99% identical At PAM10.7, there are 10 differences per 100 residues At PAM80, there are 50 differences per 100 residues At PAM250, there are 80 differences per 100 residues WHY? “twilight zone”

Ancestral sequence Sequence 1 ACCGATC Sequence 2 AATAATC A no changeA C single substitutionC --> A C multiple substitutionsC --> A --> T C --> G coincidental substitutionsC --> A T --> A parallel substitutionsT --> A A --> C --> T convergent substitutionsA --> T C back substitutionC --> T --> C ACCCTAC Li (1997) p.70

Summary of alignment scoring system global versus local alignment positive and negative values assigned gap creation and extension penalties positive score for identities some partial positive score for conservative substitutions use of a substitution matrix

Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008.

Similar presentations

Presentation on theme: "Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008.

Similar presentations

Presentation on theme: "Accessing information on molecular sequences Bio 224 Dr. Tom Peavy February 5, 2008."— Presentation transcript:

Similar presentations

About project

Feedback