

Roadmap
The topics:
- basic concepts of molecular biology
- more on Perl
- overview of the field
- biological databases and database searching
- sequence alignments
- phylogenetics
- structure prediction
- microarray data analysis

Sequence alignments
- Introduction
- What is an alignment?
- Why do alignments?
- A bit of history
- Dot matrix comparison
- Scoring alignments
- Alignment methods
- Significance of alignments

What is sequence alignment?
Sequence alignment is an arrangement of two or more sequences that highlights their similarity.

Why do alignments?
Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

Over time, genes accumulate mutations
- Environmental factors: radiation, oxidation
- Mistakes in replication/repair
- Deletions, duplications
- Insertions
- Inversions
- Point mutations

Comparing two sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Insertions/deletions, must align:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
After alignment:
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
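For the point-mutation case, the comparison amounts to walking the two sequences position by position. The following is a minimal Python sketch of that idea; the function name `point_differences` is hypothetical and not part of the course material.

```python
def point_differences(seq_a: str, seq_b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply insertions/deletions, which need an alignment first.
        raise ValueError("sequences must have equal length; align them first")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(point_differences(reference, mutated))  # positions 10, 15 and 19 differ
```

The second pair on the slide has different lengths (27 and 20), so this direct comparison is not applicable there; that is the case where gaps must be introduced.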

Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221: ,
The sequence of platelet-derived growth factor (PDGF) from mammalian cells was found to be virtually identical to the sequence of the retrovirus-encoded oncogene known as v-sis (a gene causing cancer in animals). The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event and had then produced a mutant that could alter the function of the normal protein when it infected another animal.
[Photo: Russell F. Doolittle]

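Dot Matrix Comparison
A: T C A G A G G T C T G
B: T C A G A G C T G
An X marks each position where the residue of A (columns) equals the residue of B (rows):
    T C A G A G G T C T G
T   X . . . . . . X . X .
C   . X . . . . . . X . .
A   . . X . X . . . . . .
G   . . . X . X X . . . X
A   . . X . X . . . . . .
G   . . . X . X X . . . X
C   . X . . . . . . X . .
T   X . . . . . . X . X .
G   . . . X . X X . . . X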
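A dot matrix of this kind can be computed directly from the two sequences. The sketch below is one straightforward way to do it in Python; the function name `dot_matrix` and the use of "." for empty cells are choices made for this illustration.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[str]:
    """Build one text row per residue of seq_b; 'X' where it equals the residue of seq_a."""
    return ["".join("X" if a == b else "." for a in seq_a) for b in seq_b]


seq_a = "TCAGAGGTCTG"  # sequence A from the slide, plotted along the columns
seq_b = "TCAGAGCTG"    # sequence B from the slide, plotted along the rows

print("  " + seq_a)
for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
    print(residue + " " + row)
```

The run of X marks along the main diagonal for the first six rows corresponds to the shared prefix TCAGAG.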

Interpretation of dot matrix
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to the main diagonal) indicate inversions
- Separate diagonals can be linked ("joined") to form an alignment with gaps

More on Dot Matrix
- Detection of matching regions can be improved by filtering, using a sliding window to compare the two sequences.
- For example, print a dot at a matrix position only if
  - 7 out of the next 11 positions in the sequence are identical, or
  - the similarity score of the next 11 positions in the sequence is greater than 5.
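The window filter can be sketched as follows, using the "7 out of the next 11 positions are identical" rule from the slide. The function name `windowed_dot_matrix` and its parameter names are hypothetical, and this version counts identities only; it does not use a similarity score.

```python
def windowed_dot_matrix(seq_a: str, seq_b: str, window: int = 11, min_matches: int = 7) -> list[str]:
    """Mark (i, j) only if at least min_matches of the next `window` positions are identical."""
    rows = []
    for j in range(len(seq_b) - window + 1):
        row = []
        for i in range(len(seq_a) - window + 1):
            # Count identical residues in the window starting at i in seq_a and j in seq_b.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("X" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
variant = "ACGTCTGATTCGCCCTATCGTCTATCT"
for row in windowed_dot_matrix(reference, variant):
    print(row)
```

Because the two example sequences differ at only three positions, every window along the main diagonal still passes the threshold, while most off-diagonal noise is suppressed.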

Sequence repeats
- Many sequences contain repetitive regions.
[Figure: dot matrix of a retrovirus vector sequence against itself, using a window size of 9 and a mismatch limit of 2]

More on Dot Matrix
- A dot matrix graphically presents regions of identity or similarity between two sequences
- The use of windows and thresholds can reduce "noise" in a dot matrix
- Inversions and duplications have unique "signatures" in a dot matrix

Software
- Dotlet (Java applet)
- Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
- Dotter
- Dottup

How to measure the similarity
Basically, three kinds of changes can occur at any given position within a sequence:
- Mutation
- Insertion
- Deletion
Insertion and deletion have been found to occur in nature at a significantly lower frequency than mutations.

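Scoring Matrices for Aligning DNA Sequences
- Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
- Transversions: substitutions between a purine (A/G) and a pyrimidine (C/T).

Identity matrix:
     G   C   T   A
G    1   0   0   0
C    0   1   0   0
T    0   0   1   0
A    0   0   0   1

BLAST matrix: [values not preserved in this transcript]

Transition-transversion matrix (the transition cells, G/A and C/T, are missing from this transcript and shown as "?"):
     G   C   T   A
G    1  -5  -5   ?
C   -5   1   ?  -5
T   -5   ?   1  -5
A    ?  -5  -5   1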
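A transition-transversion scoring scheme can be written as a small function instead of a lookup table. In the sketch below the match score (+1) and transversion score (-5) are the values visible on the slide; the transition score of -1 is an assumption made for illustration, since that entry is not legible in this transcript. The function name `substitution_score` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x: str, y: str, match: int = 1, transition: int = -1, transversion: int = -5) -> int:
    """Score one pair of nucleotides; the default transition=-1 is an assumed value."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within one chemical class (purine or pyrimidine).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


bases = "GCTA"
print("    " + "   ".join(bases))
for x in bases:
    print(x + " " + " ".join(f"{substitution_score(x, y):3d}" for y in bases))
```

Setting `transition=0, transversion=0` reproduces the identity matrix.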

Scoring a sequence alignment
- Match score: +1
- Mismatch score: 0
- Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Gaps: 7 × (-1)
Score = +11
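The arithmetic above can be checked with a short column-by-column scorer. This is a minimal sketch assuming "-" is the gap character; the function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a: str, row_b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score two aligned rows column by column with a linear (per-position) gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # every gap position costs the same
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gap positions: 18 + 0 - 7 = 11
```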

Gap opening and extension penalties
- We want to find alignments that are evolutionarily likely.
- Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

- We can achieve this by penalizing a new gap more than the extension of an existing gap.

Scoring a sequence alignment
- Match/mismatch score: +1/0
- Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Open: 2 × (-2)
- Extension: 5 × (-1)
Score = +9
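The open/extension scheme can be sketched in the same style as a linear scorer, with the first position of each gap charged the opening penalty and every further position of the same gap charged the extension penalty (the convention implied by "Open: 2, Extension: 5" for 7 gap positions). The function name `score_alignment_affine` is hypothetical, and writing the single long gap as seven "-" characters is this sketch's rendering of the gap-penalty example.

```python
def score_alignment_affine(row_a: str, row_b: str, match: int = 1, mismatch: int = 0,
                           gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score two aligned rows; a gap's first position costs gap_open, later ones gap_extend."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous = None  # row that held a gap in the previous column: "a", "b", or None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        current = "a" if a == "-" else "b" if b == "-" else None
        if current is None:
            score += match if a == b else mismatch
        elif current == previous:
            score += gap_extend  # continuing a gap in the same row
        else:
            score += gap_open    # starting a new gap
        previous = current
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_alignment_affine(top, "----CTGATTCGC---ATCGTCTATCT"))  # 18 - 2*2 - 5*1 = 9

# The two candidate alignments from the gap-penalty discussion, each with 7 gap positions.
single_gap = "ACGTCTGAT-------ATAGTCTATCT"
scattered = "AC-T-TGA--CG-CGT-TA-TCTATCT"
print(score_alignment_affine(top, single_gap))  # one gap run:  20 - 2 - 6  = 12
print(score_alignment_affine(top, scattered))   # six gap runs: 20 - 12 - 1 = 7
```

With a flat penalty of -1 per gap position both candidates would score 13; the affine scheme is what separates the single long gap from the scattered ones.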

Amino Acid Substitution Matrices
- PAM (point accepted mutation): based on global alignment [evolutionary model]
- BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix
      C    S    T    P    A    G
C    12
S     0    2
T    -2    1    3
P    -3    1    0    6
A    [row not preserved in this transcript]
G    [row not preserved in this transcript]

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )
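The log-odds formula can be illustrated numerically. The slide only says "log"; the sketch below assumes base-10 logarithms scaled by 10 and rounded to an integer, which is a common convention for PAM-style matrices. The function name `log_odds_score` and the frequencies used are hypothetical, and the chance term is taken as the product of the two background frequencies.

```python
import math


def log_odds_score(p_homologous: float, freq_a: float, freq_b: float, scale: float = 10.0) -> int:
    """Rounded scale * log10(observed pair probability / probability expected by chance)."""
    p_chance = freq_a * freq_b  # independent residues in unrelated proteins
    return round(scale * math.log10(p_homologous / p_chance))


# Hypothetical numbers: both residues have background frequency 0.05.
print(log_odds_score(0.02, 0.05, 0.05))     # pair seen 8x more often than chance: positive score
print(log_odds_score(0.00125, 0.05, 0.05))  # pair seen half as often as chance: negative score
```

A positive entry therefore means the pair is seen more often in homologous proteins than chance predicts, a negative entry means less often, and zero means no preference.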

PAM matrices
The PAM 1 matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
[Table: probability of each amino acid change (Phe to Ala, Phe to Arg, Phe to Asn, Phe to Asp, …, Phe to Cys) by PAM unit of evolution; apart from "0.01", the numeric values are not preserved in this transcript]
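The slide does not spell out how matrices for larger evolutionary distances are obtained. In the standard Dayhoff construction, the PAM 1 probability matrix is multiplied by itself n times to model n units of evolution. The sketch below illustrates this with a hypothetical two-letter alphabet in which each residue changes with probability 0.01 per unit; the matrix and helper names are made up for the example and are not real PAM data.

```python
def mat_mul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(m: list[list[float]], exponent: int) -> list[list[float]]:
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Hypothetical two-letter "PAM 1": each row sums to 1, with a 1% chance of change.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])  # after 250 units the residue is close to equally likely to be either letter
```

Multiple changes at the same site are accounted for by the repeated multiplication, which is why 250 units of evolution do not mean that every residue has changed.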

Limitations of PAM Matrices
- Constructed based on phylogenetic relationships established prior to scoring mutations;
- Difficulty of determining ancestral relationships among sequences;
- Based on a small set of closely related proteins;
- …

BLOSUM Matrices
- Based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). The blocks are found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families.
- The matrices are measured on the multiple alignment of the blocks.
- The entries of the matrices are computed based on the same principle used in PAM: log(odds ratio).

Part of BLOSUM 62 Matrix
      C    S    T    P    A    G
C     9
S    -1    4
T    -1    1    5
P    -3   -1   -1    7
A     0    1    0   -1    4
G    -3    0   -2   -2    0    6

- BLOSUM62 was derived from blocks in which sequences at least 62% identical were clustered together, so the counted substitutions come from sequences sharing less than 62% identity.

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )

PAM vs. BLOSUM
PAM
- Based on a mutational model of evolution (Markov process)
- PAM1 is based on sequences with at least 85% identity
- Designed to track evolutionary origins
BLOSUM
- Based on the multiple alignment of blocks
- Well suited to comparing distant sequences
- Designed to find proteins' conserved domains

Gap Penalty
- Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
- When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results.
- When comparing closely related sequences, gaps should be penalized on both gap opening and gap extension.