Last lecture summary. Sequence alignment What is sequence alignment Three flavors of sequence alignment Point mutations, indels.

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf

Last lecture summary

Sequence alignment What is sequence alignment Three flavors of sequence alignment Point mutations, indels

Sequences. 'Central dogma of bioinformatics'. Sequences diverge. Conserved residues. The variation between sequences reflects changes that occurred during evolution in the form of substitutions (mutations) and/or indels.

New stuff

Homology. Over time, molecular sequences undergo random changes, some of which are selected during the process of evolution. As selected sequences accumulate mutations, they diverge. Two sequences are homologous when they are descended from a common ancestor sequence. Traces of evolution may still remain in certain portions of the sequences, allowing identification of the common ancestry. Residues performing key roles are preserved by natural selection; less crucial residues mutate more frequently.

Orthology, paralogy I. Orthologs – homologous proteins from different species that possess the same function (e.g. corresponding kinases in a signal transduction pathway in humans and mice). Paralogs – homologous proteins that have different functions in the same species (e.g. two kinases in different signal transduction pathways of humans). However, these terms are controversially discussed: Jensen RA. Orthologs and paralogs - we need to get it right. Genome Biol. 2001;2(8), PMID: and references therein

Orthology, paralogy II. Orthologs – genes separated by the event of speciation. The sequences are direct descendants of a common ancestor and most likely have similar domain structure, 3D structure and biological function. Paralogs – genes separated by the event of gene duplication. Gene duplication produces an extra copy of a gene and is a key mechanism in evolution. Once a gene is duplicated, the identical copies can undergo changes and diverge to create two different genes.

Gene duplication. 1. Unequal cross-over. 2. Entire chromosome is replicated twice: this error results in one of the daughter cells having an extra copy of the chromosome. If this cell fuses with another cell during reproduction, it may or may not result in a viable zygote. 3. Retrotransposition: sequences of DNA are copied to RNA and then back to DNA instead of being translated into proteins, resulting in extra copies of DNA being present within the cell.

Unequal cross-over. Homologous chromosomes are misaligned during meiosis. The probability of misalignment is a function of the degree to which the chromosomes share repetitive elements.

Comparing sequences through alignment: patterns of conservation and variation can be identified. The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences. The variation between sequences reflects the changes that have occurred during evolution in the form of substitutions and/or indels. Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences. Protein sequence comparison can identify homologous sequences descended from a common ancestor 1 billion years ago (BYA). DNA sequence comparison typically reaches back only 600 MYA.

The outline of sequence alignment. 1. How to recognize which sequence alignment is better: the scoring system. Scoring DNA alignment; scoring protein alignment with substitution matrices (PAM, BLOSUM). 2. How to perform sequence alignment: the algorithm. Dot plot, dynamic programming, heuristic algorithms (BLAST).

Scoring sequence alignment

Scoring DNA alignment

Identity matrix, substitution matrix. DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized. Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment. This procedure can be formalized using a substitution matrix.
A T T G T A – - G A C A T
Identity matrix for DNA (lower triangle shown):
    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1
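
As a minimal sketch of how a substitution matrix formalizes match counting, the snippet below builds the DNA identity matrix as a lookup table and sums the entries over the columns of an alignment. The two toy sequences and the names `IDENTITY` and `identity_score` are illustrative assumptions, not taken from the slides; columns containing a gap simply contribute 0 here.

```python
BASES = "ATCG"

# Identity matrix as a dictionary: 1 on the diagonal, 0 elsewhere.
IDENTITY = {(x, y): int(x == y) for x in BASES for y in BASES}


def identity_score(row_a, row_b, matrix=IDENTITY):
    """Sum matrix entries column by column; unknown pairs (e.g. gaps) score 0."""
    return sum(matrix.get((a, b), 0) for a, b in zip(row_a, row_b))


# Hypothetical toy alignment with three identical columns (A, T, G).
toy_score = identity_score("ATTGT", "ATCGA")
print(toy_score)
```

Swapping `IDENTITY` for a different table changes the scoring scheme without touching the scoring loop, which is the point of expressing scores as a matrix.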

Gaps or no gaps

Gapped DNA alignment (1)
Match score: +1
Mismatch score: +0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
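
The arithmetic on this slide can be checked with a short sketch. It assumes the alignment is already given as two equal-length rows with `-` marking gaps; the function name `score_linear` is an illustrative choice.

```python
def score_linear(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score a finished pairwise alignment with a flat per-position gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # every gap position costs the same
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
linear_score = score_linear(top, bottom)
print(linear_score)  # 18 matches, 2 mismatches, 7 gap positions
```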

Length penalties. We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
or
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a new gap than for extending an existing gap.

Gapped DNA alignment (2)
Match/mismatch score: +1/+0
Origination/length penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Score = +7
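
A sketch of the origination/length scheme follows, again assuming pre-aligned rows with `-` for gaps. Every gap position pays the length penalty, and the first position of each gap run additionally pays the origination penalty. The name `score_affine` is an illustrative choice, and the 7-position gap in `one_gap` is written with dashes as an assumption about the length-penalty example.

```python
def score_affine(row_a, row_b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a finished alignment with separate origination and length penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_side = None  # which row the current gap run is in, if any
    for a, b in zip(row_a, row_b):
        side = "a" if a == "-" else ("b" if b == "-" else None)
        if side is None:
            score += match if a == b else mismatch
        else:
            if side != gap_side:
                score += gap_open  # a new gap starts here
            score += gap_extend    # every gap position pays the length penalty
        gap_side = side
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
slide_score = score_affine(top, "----CTGATTCGC---ATCGTCTATCT")

# The two candidate alignments from the length-penalty example:
one_gap = score_affine(top, "ACGTCTGAT-------ATAGTCTATCT")
many_gaps = score_affine(top, "AC-T-TGA--CG-CGT-TA-TCTATCT")
print(slide_score, one_gap, many_gaps)
```

Both candidate alignments have 20 matches and 7 gap positions, so a flat gap penalty cannot tell them apart; with an origination penalty the single long gap scores higher than the six scattered ones.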

Typical DNA alignment scoring. Frequencies of mutations are assumed equal for all bases. Match score: +5. Mismatch score: -4. Gap penalty (usually a parameter): opening -10, extending -2.

Scoring protein alignment

Identity matrix: adequate for nucleic acids, not sufficient for proteins. Amino acids are not exchanged with equal probability, as a purely theoretical view might suggest. For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.

Scoring protein alignment. Why is that? 1. Triplet-based genetic code: GAT (D) → GAA (E) requires a single nucleotide change, whereas GAT (D) → TGG (W) requires all three positions to change. 2. D and E have similar properties, but D and W differ considerably. D is hydrophilic, W is hydrophobic; a D → W mutation can greatly alter the 3D structure and consequently the function.
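
The genetic-code argument can be made concrete by counting how many nucleotide positions differ between two codons. This is a minimal sketch using the codons named on the slide; the helper name `nucleotide_changes` is an illustrative choice.

```python
def nucleotide_changes(codon_a, codon_b):
    """Count positions at which two equal-length codons differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have equal length")
    return sum(x != y for x, y in zip(codon_a, codon_b))


d_to_e = nucleotide_changes("GAT", "GAA")  # aspartic acid -> glutamic acid
d_to_w = nucleotide_changes("GAT", "TGG")  # aspartic acid -> tryptophan
print(d_to_e, d_to_w)
```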

Genetic code

Substitution matrices for proteins. Substitution (score) matrices give scores for amino acid substitutions. A higher score means that the substitution is more probable. Conservative substitutions conserve the physical and chemical properties of the amino acids and limit structural/functional disruption. Substitution matrices should reflect: physicochemical properties of amino acids; different frequencies of individual amino acids occurring in proteins; interchangeability of the genetic code.

Protein substitution matrices – PAM

PAM matrices I. How to assign scores? Let’s get nature – evolution – involved! If you choose a set of proteins with very similar sequences, you can do the alignment manually. Also, if the sequences in your set are similar, then there is a high probability that each amino acid difference is due to a single mutation. From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived. This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (Point Accepted Mutation) matrices. Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.

PAM matrices II. Substitutions were collected from 71 gapless alignments of sequences with at least 85% identity. These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection). Probabilities that any one amino acid would mutate into any other were calculated. From these probabilities, scores were derived. Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990,183: PMID:
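
The slide does not spell out how scores follow from probabilities. A common formulation, sketched here as an assumption rather than as the exact procedure used in the lecture, is a log-odds score: the probability of seeing one amino acid replaced by another is divided by the background frequency of the replacement, and the base-10 logarithm is scaled by 10 and rounded. The function name and the numbers below are hypothetical.

```python
import math


def log_odds_score(p_substitution, background_freq):
    """Scaled, rounded log-odds score: 10 * log10(observed / expected by chance)."""
    return round(10 * math.log10(p_substitution / background_freq))


# Hypothetical values, chosen only to show the sign of the score.
favoured = log_odds_score(0.10, 0.01)     # seen 10x more often than chance
neutral = log_odds_score(0.05, 0.05)      # seen exactly as often as chance
disfavoured = log_odds_score(0.001, 0.01) # seen 10x less often than chance
print(favoured, neutral, disfavoured)
```

Positive scores mark substitutions observed more often than chance, negative scores mark substitutions observed less often, and adding log-odds scores along an alignment corresponds to multiplying the underlying odds.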

PAM matrices III. Dayhoff’s definition of an accepted mutation was thus based on empirically observed amino acid substitutions. The unit used is the PAM. Two sequences are 1 PAM apart if they have 99% identical residues, i.e. out of 100 residues, one is mutated. The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time: in Drosophila 1 PAM corresponds to ~2.62 million years, in human 1 PAM corresponds to ~4.58 million years.

Higher PAM matrices. What should I do if I want to get probabilities over a much longer evolutionary time? Dayhoff proposed a model of evolution that is a Markov process. A Markov process is a special case of a linear dynamical system.
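
Under the Markov assumption, each step depends only on the current state, so the transition probabilities for n PAM units are obtained by multiplying the 1-PAM transition matrix by itself n times. The sketch below illustrates this with a hypothetical two-state matrix; a real PAM1 matrix is 20 × 20, and `TOY_PAM1`, `mat_mul` and `mat_pow` are illustrative names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 2-state transition matrix; row i holds the probabilities of
# state i staying the same or changing during one step, so each row sums to 1.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.02, 0.98],
]

toy_pam2 = mat_pow(TOY_PAM1, 2)
toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row of the result still sums to 1, and the diagonal entries shrink as n grows because more time means a higher chance that a residue has changed.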

Linear dynamical system I

Linear dynamical system II