Last lecture summary. New generation sequencing (NGS) The completion of human genome was just a start of modern DNA sequencing era – “high-throughput.

Presentation transcript:

Last lecture summary

New generation sequencing (NGS)
The completion of the human genome was just the start of the modern DNA sequencing era – “high-throughput next generation sequencing” (NGS).
New approaches reduce time and cost.
Holy Grail of sequencing – a complete human genome below $…
1st generation – Sanger dideoxy method
2nd generation – sequencing by synthesis (pyrosequencing)
3rd generation – single molecule sequencing

cDNA, EST libraries
cDNA – DNA synthesized from mRNA by reverse transcriptase; contains only expressed genes (no introns)
cDNA library – a collection of different DNA sequences that have been incorporated into a vector
EST – Expressed Sequence Tag
short, unedited (single-pass read), randomly selected subsequence (… bps) of a cDNA sequence
generated either from the 5’ or from the 3’ end
higher quality in the middle
cDNA/EST – direct evidence of the transcriptome

What is sequence alignment?
Fragment overlaps
Exact overlap:
CTTTTCAAGGCTTA
        GGCTTATTATTGC
Second fragment one base shorter:
CTTTTCAAGGCTTA
        GGCTATTATTGC
Overlap recovered with a gap:
CTTTTCAAGGCTTA
        GGCT-ATTATTGC

What is sequence alignment?
“EST clustering”
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG
consensus: TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG

Sequence alignment
Procedure of comparing sequences
Point mutations – easy (gapless alignment):
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
More difficult example:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
However, gaps can be inserted to get something like this (gapped alignment):
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
insertion × deletion – indel

Why align sequences – continuation
The draft human genome is available.
Automated gene finding is possible.
Gene: AGTACGTATCGTATAGCGTAA
What does it do?
One approach: is there a similar gene in another species?
Align sequences with known genes.
Find the gene with the “best” match.

Flavors of sequence alignment
gapped × gapless
pairwise × multiple
global × local

Evolution of sequences
The sequences are the products of molecular evolution.
When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions.
Similar sequences produce similar proteins.
[Diagram labels: DNA1, DNA2; Protein1, Protein2; sequence similarity; similar 3D structure; similar function]
However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5).

Homology
Sequences diverge over time.
Common ancestor – homologous sequences
The variation between sequences – changes that occurred during evolution in the form of substitutions (mutations) and/or indels.
Traces of evolution may still remain in certain portions of the sequences, allowing the common ancestry to be identified.
Residues performing key roles are conserved (preserved) by natural selection.
Orthology vs paralogy

New stuff

Scoring systems I
DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized.
A T T G       T
A - - G A C A T
Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment.
This procedure can be formalized using a substitution matrix.
Identity matrix
    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1

Scoring systems II
Identity matrix: adequate for nucleic acids, not sufficient for proteins.
Amino acids are not exchanged with equal probability, as one might assume in theory.
For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.
[Figure: D, E, W]

Scoring systems II
Why is that?
1. Triplet-based genetic code: GAT (D) → GAA (E) needs a single nucleotide change, GAT (D) → TGG (W) needs three.
2. Both D and E have similar properties, but D and W differ considerably. D is hydrophilic, W is hydrophobic. A D → W mutation can greatly alter the 3D structure and consequently the function.
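The first reason can be checked by counting the positions at which the codons quoted on the slide differ. A minimal Python sketch; the helper name `codon_distance` is ours, not from the lecture.

```python
def codon_distance(codon_a: str, codon_b: str) -> int:
    """Number of positions at which two codons differ (Hamming distance)."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have the same length")
    return sum(1 for a, b in zip(codon_a, codon_b) if a != b)


# Codons taken from the slide: GAT (D), GAA (E), TGG (W).
print(codon_distance("GAT", "GAA"))  # D -> E
print(codon_distance("GAT", "TGG"))  # D -> W
```

D → E needs one nucleotide change, D → W needs all three positions to change, which is one reason the second substitution is seen so rarely.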

Genetic code

Gaps or no gaps

Scoring DNA sequence alignment (1)
Match score: +1
Mismatch score: +0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
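The arithmetic on this slide can be reproduced column by column. A minimal Python sketch, assuming the two aligned rows have equal length and that `-` marks a gap; the function name `score_alignment` is illustrative only.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score a fixed alignment with a constant penalty per gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # one penalty for every gap position
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gap positions
```

This only scores an alignment that is already given; finding the best alignment is a separate problem.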

Length penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
or
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can express this preference by penalizing the opening of a new gap more than the extension of an existing gap.

Scoring DNA sequence alignment (2)
Match/mismatch score: +1/+0
Origination/length penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Score = +7
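The origination/length scheme can be sketched in the same way. The Python below is an illustration under stated assumptions: `-` marks a gap, every run of consecutive gap columns counts as one gap (even if the gap switches rows), and the name `affine_score` is ours. It scores the slide's alignment and, for comparison, the two alignments from the "Length penalties" slide, the first written with dashes for its single long gap.

```python
def affine_score(row_a: str, row_b: str, match: int = 1, mismatch: int = 0,
                 origination: int = -2, length: int = -1) -> int:
    """Score a fixed alignment; each gap run pays origination once plus length per position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            if not in_gap:
                score += origination   # a new gap starts here
            score += length            # every gap position pays the length penalty
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(top, "----CTGATTCGC---ATCGTCTATCT"))  # alignment from this slide
print(affine_score(top, "ACGTCTGAT-------ATAGTCTATCT"))  # one long gap
print(affine_score(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # many short gaps
```

The last two alignments both contain 20 matches and 7 gap positions, so a constant per-position penalty cannot tell them apart; with a separate origination penalty the single long gap scores clearly higher.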

Substitution matrices
Substitution (score) matrices give scores for amino acid substitutions. A higher score means a more probable substitution.
Conservative substitutions – conserve the physical and chemical properties of the amino acids, limit structural/functional disruption
Substitution matrices should reflect:
Physicochemical properties of amino acids.
Different frequencies of individual amino acids occurring in proteins.
Interchangeability of the genetic code.

PAM matrices I
How to assign scores? Let’s get nature – evolution – involved!
If you choose a set of proteins with very similar sequences, you can do the alignment manually.
Also, if the sequences in your set are similar, there is a high probability that each amino acid difference is due to a single mutation.
From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived.
This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (Accepted Point Mutation) matrices.
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.

PAM matrices II
Alignments of 71 groups of very similar (at least 85% identity) protein sequences were examined and the substitutions found in them were counted.
These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection).
Probabilities that any one amino acid would mutate into any other were calculated.
If I know the probabilities for individual amino acids, what is the probability for the given sequence? Their product.
But to calculate the score, we would like to sum, not multiply. How to achieve this? Take the logarithm.
Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990, 183.
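The step from product to sum can be illustrated numerically. In the Python sketch below, the per-position probabilities are made-up values chosen only for the demonstration, not numbers from Dayhoff's data.

```python
import math

# Hypothetical per-position probabilities (illustrative values only).
position_probabilities = [0.9, 0.8, 0.95, 0.7]

# Probability of the whole sequence: product of the individual probabilities.
product = math.prod(position_probabilities)

# Score of the whole sequence: sum of the individual logarithms.
log_sum = sum(math.log10(p) for p in position_probabilities)

print(product)        # probability obtained by multiplying
print(10 ** log_sum)  # the same value recovered from the summed logarithms
```

Because log(a · b) = log(a) + log(b), per-position scores stored as logarithms can simply be added along an alignment.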

PAM matrices III
Dayhoff’s definition of an accepted mutation was thus based on empirically observed amino acid substitutions.
The unit used is the PAM. Two sequences are 1 PAM apart if they have 99% identical residues.
The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.
The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time.
In Drosophila, 1 PAM corresponds to ~2.62 million years.
In human, 1 PAM corresponds to ~4.58 million years.

PAM1 matrix numbers are multiplied by

Higher PAM matrices
What to do if I want to get probabilities over a much longer evolutionary time?
Dayhoff proposed a model of evolution that is a Markov process.
A Markov process of this kind can be treated as a linear dynamical system.
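Under the Markov assumption, probabilities for n steps are obtained by multiplying the one-step matrix by itself n times. The Python sketch below uses a made-up three-letter alphabet with invented probabilities standing in for the real 20 × 20 PAM1 matrix; the row convention (row = original residue, column = replacement) is also an assumption of this sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, steps):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: each row sums to 1 and 99% of residues stay unchanged.
one_step = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

many_steps = mat_pow(one_step, 250)
for row in many_steps:
    print([round(x, 3) for x in row])
```

Each row of the result still sums to 1, while the diagonal (the probability of seeing the original residue) falls well below 0.99.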

Linear dynamical system I

Linear dynamical system II

Higher PAM matrices
Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site.
This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine, so several mutations at one site can leave no visible difference. These are called silent substitutions.

PAM 120
[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.]
Positive score – frequency of substitutions is greater than would have occurred by random chance.
Zero score – frequency is equal to that expected by chance.
Negative score – frequency is less than would have occurred by random chance.
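The sign rule follows from taking the logarithm of the ratio between the observed and the chance frequency. A minimal Python sketch: the frequencies are invented, and the scaling (10 × log10, rounded to an integer) is an assumption of this illustration rather than something stated on the slide.

```python
import math


def log_odds_score(observed: float, expected: float) -> int:
    """Rounded 10 * log10 of observed frequency over the frequency expected by chance."""
    return round(10 * math.log10(observed / expected))


# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds_score(0.020, 0.005))  # observed more often than chance: positive
print(log_odds_score(0.005, 0.005))  # observed exactly as often as chance: zero
print(log_odds_score(0.001, 0.005))  # observed less often than chance: negative
```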

PAM matrices assumptions
Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).
Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on some model).
Each amino acid position is equally mutable.
Mutations are assumed to be independent of surrounding residues.
Forces responsible for sequence evolution over short times are the same as those over longer times.
PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins).
New generation of Dayhoff-type matrices – e.g. PET91

How to calculate score?
[Figure: substitution matrix. Source: Selzer, Applied bioinformatics.]