Sequencing and Sequence Alignment

Sequencing and Sequence Alignment CIS 667 Bioinformatics Spring 2004

Protein Sequencing
Before DNA sequencing, protein sequencing was common.
Sanger won a Nobel Prize for determining the amino acid sequence of insulin.
Protein sequences are much shorter than today’s DNA fragments.
One amino acid (aa) at a time can be removed from the protein.
The removed aa can then be identified.

Protein Sequencing
Unfortunately, this works only for a few aa’s from the end.
So insulin was broken up into fragments:
Gly Ile Val Glu
Ile Val Glu Gln
Gln Cys Cys Ala

Protein Sequencing
Then the fragments are sequenced.
Afterwards they are assembled by finding the overlapping regions:
Gly Ile Val Glu
    Ile Val Glu Gln
                Gln Cys Cys Ala
Gly Ile Val Glu Gln Cys Cys Ala
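The overlap step can be sketched in a few lines of Python. This is only an illustration, not Sanger's actual procedure: it assumes the residues on the slide group into three four-residue fragments, and it assumes the fragments are already listed in the right order (real assembly also has to discover that order). The function names `overlap` and `assemble` are made up for this sketch.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def assemble(fragments):
    """Merge fragments left to right, joining each one on its overlap with the chain."""
    chain = list(fragments[0])
    for frag in fragments[1:]:
        k = overlap(chain, frag)
        chain.extend(frag[k:])  # append only the part not already covered
    return chain


# Assumed grouping of the residues shown on the slide.
fragments = [
    ["Gly", "Ile", "Val", "Glu"],
    ["Ile", "Val", "Glu", "Gln"],
    ["Gln", "Cys", "Cys", "Ala"],
]
assembled = assemble(fragments)
print(" ".join(assembled))  # Gly Ile Val Glu Gln Cys Cys Ala
```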

Protein Sequencing
By the late 1960s, protein sequencing machines were on the market.
RNA sequencing followed the same basic methodology by 1965.

DNA Sequencing
DNA was first sequenced by transcribing DNA to RNA.
This was slow: it took years to sequence tens of base pairs.
By the mid-1970s, Maxam and Gilbert had learned how to cleave DNA selectively at A, C, G, or T.
This led to the development of the Maxam-Gilbert sequencing method.

Maxam-Gilbert Sequencing
Single-stranded DNA is labeled with a radioactive tag at the 5’ end.
The sample is quartered and digested in four base-specific reactions.
Reaction concentrations are chosen so that each strand of DNA in each sample is cut once, at a random location.
Gel electrophoresis is used to find the lengths of the tagged fragments.

Sanger Sequencing
Today, an alternative method called Sanger sequencing is generally used.
A primer bonds to single-stranded DNA near the 3’ end of the target to be sequenced.
DNA polymerase extends the primer along the target DNA.
This extension is done separately for each of the 4 bases.

Sanger Sequencing
A small amount of extension-ending (chain-terminating) nucleotides is introduced.
This causes the extension to end at a random occurrence of a specific base.
Gel electrophoresis is then used, and the sequence is read as the complement of the bases.
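As a rough sketch of the read-out step, the four reactions can be modelled as four lists of fragment lengths, one per terminating base. Sorting all fragments by length gives the newly synthesized strand, and complementing it gives the bases of the target. The lane data, `read_gel`, and `template_from_synthesized` are hypothetical and exist only for this illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def read_gel(lanes):
    """Order fragments from all four lanes by length and read the terminating bases."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


def template_from_synthesized(synth):
    """Complement each base; the result reads along the target in the 3' to 5' direction."""
    return "".join(COMPLEMENT[b] for b in synth)


# Hypothetical lanes: fragment lengths (counted past the primer) per terminating base.
lanes = {"A": [2, 5], "C": [3], "G": [1, 6], "T": [4]}
synth = read_gel(lanes)
template = template_from_synthesized(synth)
print(synth, template)  # GACTAG CTGATC
```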

Sanger Sequencing

Sequence Alignment
Given two strings, find the optimal alignment of the strings.
Strings may be of different lengths, and the optimal alignment may include gaps.
An alignment score is produced.
Example: aligning SHALL WEAR with ALL WE
SHALL WEAR
--ALL WE--

Sequence Alignment
The alignment score is produced by looking at each column in the alignment.
Match: +1
Mismatch: -1
Space: -2
HELLO THERE
JELLO TEAR-
Score: 7*(+1) + 3*(-1) + 1*(-2) = 2
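The column-by-column scoring can be written as a short Python function. This is a minimal sketch using the scores from the slide. The name `score_alignment` and the use of `-` as the space character are assumptions, and the blank between the two words is treated as an ordinary character.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2, gap_char="-"):
    """Sum the per-column scores of two already aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            total += space      # a character aligned with a space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("HELLO THERE", "JELLO TEAR-"))  # 2
```

Applied to the alignment of SHALL WEAR with --ALL WE--, the same function gives 6 matches and 4 spaces, for a score of -2.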

Sequence Alignment
In biology, the sequences to be aligned consist of nucleotides or amino acids.
Sufficiently similar sequences allow us to infer homology, that is, a common evolutionary history.
We can also infer the function of a protein or gene from its similarity to one with known function.

Sequence Alignment
Since homologous sequences share a common evolutionary history, the alignment score should reflect evolutionary processes.
DNA changes over time due to mutations.
Most mutations are harmful.
Mutations may be due to environmental factors, e.g. radiation.

Mutation
Mutations may also be due to errors in the DNA replication process.
Kinds of mutation:
Substitution of one nucleotide for another
Deletion of a nucleotide
Duplication
Insertions
Inversions

Mutation

Mutation
Deletions have different effects depending on the number of nucleotides deleted.
Deletion of 3 nucleotides in an open reading frame (ORF) results in the deletion of a codon, so one amino acid is not produced. This is usually damaging, sometimes lethal.
Deletion of 1 nucleotide causes a frame shift, which changes all downstream amino acids. This is almost always lethal.

Codon Deletion
ATGATACCGACGTACGGCATTTAA
Original: START IPTYGI STOP
After deleting one codon: START IPTYI STOP

Frame Shift
ATGATACCGACGTACGGCATTTAA
Original: START IPTYGI STOP
After deleting one nucleotide: START IPT STOP
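Both deletion examples can be checked with a small translation routine. The sketch below builds the standard genetic code table and translates the slide's sequence before and after each deletion. The slides do not say which bases were removed, so the positions are assumptions: removing the codon GGC reproduces the codon-deletion result, and removing the C of TAC (which creates the stop codon TAG) is one choice that reproduces the frame-shift result. The start codon ATG appears as M in the output.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate codon by codon from position 0 until a stop codon is reached."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


dna = "ATGATACCGACGTACGGCATTTAA"
codon_deleted = dna[:15] + dna[18:]  # assumed: drop the codon GGC
frame_shifted = dna[:14] + dna[15:]  # assumed: drop the C of TAC

print(translate(dna))            # MIPTYGI
print(translate(codon_deleted))  # MIPTYI
print(translate(frame_shifted))  # MIPT
```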

Mutations
Some notes:
A single base substitution may even produce the same amino acid (especially if it is the last base in a codon).
It may also produce a similar amino acid.
It is impossible to tell whether a gap in an alignment results from an insertion in one sequence or a deletion from the other.
After mutation, an organism may be more or less likely to survive natural selection.

Alignment Scores
Based on what we have said about mutations, how should we modify the alignment scores?
Note that a single long gap is more likely than several shorter ones, so it should have a smaller penalty.
Say:
Match: +1
Mismatch: 0
Gap origination: -2
Gap extension: -1
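A sketch of how such a scheme might score a given alignment is shown below. The slide does not say whether the first space of a gap pays only the origination penalty or origination plus extension. This sketch assumes the former, so a gap of length k costs -2 - (k - 1). The function name `affine_score` and the example sequences are hypothetical.

```python
def affine_score(top, bottom, match=1, mismatch=0,
                 gap_open=-2, gap_extend=-1, gap_char="-"):
    """Score an alignment where the first space of a gap costs gap_open
    and every further space in the same gap costs gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    prev_gap_row = None  # row that held a space in the previous column, if any
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            gap_row = "top" if x == gap_char else "bottom"
            total += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


# Same number of spaces, arranged as one long gap or as three short gaps.
print(affine_score("ACGTTTACG", "ACG---ACG"))  # 2
print(affine_score("ACGTTTACG", "AC-T-T-CG"))  # 0
```

Under this assumption the single long gap scores higher than the three separate gaps, which is the behaviour the slide asks for.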

Alignment
We can have sequences of different sizes.
An alignment is defined as the insertion of spaces at arbitrary locations along the sequences so that they end up being the same size.
No space in one sequence can be aligned with a space in the other.
GA-CGGATTAG
GATCGGAATAG
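The definition translates directly into a validity check. The sketch below tests three conditions: both rows have the same length, no column holds two spaces, and removing the spaces gives back the original sequences. The name `is_valid_alignment` and the `-` space character are assumptions made for illustration.

```python
def is_valid_alignment(top, bottom, s, t, gap_char="-"):
    """Check that (top, bottom) is an alignment of s and t as defined above."""
    same_length = len(top) == len(bottom)
    no_double_space = all(
        not (x == gap_char and y == gap_char) for x, y in zip(top, bottom)
    )
    spells_originals = (
        top.replace(gap_char, "") == s and bottom.replace(gap_char, "") == t
    )
    return same_length and no_double_space and spells_originals


s, t = "GACGGATTAG", "GATCGGAATAG"
print(is_valid_alignment("GA-CGGATTAG", "GATCGGAATAG", s, t))    # True
print(is_valid_alignment("GA--CGGATTAG", "GAT-CGGAATAG", s, t))  # False
```

The second call fails because its fourth column aligns a space with a space.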

Alignment
Let’s use the following scores for similarity: match: +1; mismatch: -1; space: -2.
Let sim(s, t) denote the similarity of two sequences s and t, that is, the maximum score over all of their alignments.
We want to develop an algorithm to compute sim(s, t) given s and t.

Dynamic Programming
We will use a technique known as dynamic programming.
It solves an instance of a problem by using already solved smaller instances of the same problem.
In our case, we build up the solution by determining the similarities between arbitrary prefixes of the two sequences.
We start with shorter prefixes and work towards longer ones.

Dynamic Programming
Let m be the size of s and n the size of t.
Then there are m + 1 prefixes of s and n + 1 prefixes of t, including the empty string.
We store the similarities of the prefixes in an (m + 1) × (n + 1) array.
Entry (i, j) contains the similarity between s[1..i] and t[1..j].

Dynamic Programming
Let s = AAAC and t = AGC.
We need to initialize part of the array to get started.
If one of the sequences is empty, we just add as many spaces as there are characters in the other sequence.
Correspondingly, we fill in the first row and column with multiples of the space penalty (-2).
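For s = AAAC and t = AGC, the initialization might look like the sketch below. The array is a list of lists with rows indexed by prefixes of s and columns by prefixes of t. Entries that are not yet computed are left as `None`. The name `init_table` is hypothetical.

```python
def init_table(s, t, space=-2):
    """Create the (m + 1) x (n + 1) array and fill its first row and column."""
    m, n = len(s), len(t)
    a = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * space  # s[1..i] aligned against the empty string
    for j in range(n + 1):
        a[0][j] = j * space  # the empty string aligned against t[1..j]
    return a


a = init_table("AAAC", "AGC")
print(a[0])                   # [0, -2, -4, -6]
print([row[0] for row in a])  # [0, -2, -4, -6, -8]
```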

Dynamic Programming
We can compute the value of entry (i, j) by looking at just three previous entries: (i, j - 1), (i - 1, j - 1), and (i - 1, j).
These correspond, in the same order, to the following choices:
Align s[1..i] with t[1..j - 1] and match a space with t[j]
Align s[1..i - 1] with t[1..j - 1] and match s[i] with t[j]
Align s[1..i - 1] with t[1..j] and match s[i] with a space

Dynamic Programming
If we compute entries in a smart way, scores for the best alignments between smaller prefixes have already been stored in the array, so
sim(s[1..i], t[1..j]) = max {
    sim(s[1..i], t[1..j - 1]) - 2,
    sim(s[1..i - 1], t[1..j - 1]) + p(i, j),
    sim(s[1..i - 1], t[1..j]) - 2 }
where p(i, j) = +1 if s[i] = t[j], and -1 otherwise.
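The recurrence can be transcribed almost literally as a memoized recursive function. This is a sketch for short sequences only, since deep recursion is impractical for long inputs. `functools.lru_cache` stands in for the array, so each pair of prefixes is solved once. The inner name `best` is an assumption, and Python's 0-based indexing means s[i] in the slide is `s[i - 1]` in the code.

```python
from functools import lru_cache


def sim(s, t, match=1, mismatch=-1, space=-2):
    """Similarity of s and t, computed top-down from the prefix recurrence."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j) is the similarity of the prefixes s[1..i] and t[1..j].
        if i == 0:
            return j * space
        if j == 0:
            return i * space
        p = match if s[i - 1] == t[j - 1] else mismatch
        return max(
            best(i, j - 1) + space,  # match a space with t[j]
            best(i - 1, j - 1) + p,  # match s[i] with t[j]
            best(i - 1, j) + space,  # match s[i] with a space
        )

    return best(len(s), len(t))


print(sim("AAAC", "AGC"))  # -1
```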

Dynamic Programming
We should fill in the array row by row, left to right.
If we denote the array by a, then we have
a[i, j] = max { a[i, j - 1] - 2, a[i - 1, j - 1] + p(i, j), a[i - 1, j] - 2 }
where p(i, j) = +1 if s[i] = t[j], and -1 otherwise.

Dynamic Programming
Algorithm Similarity
input: sequences s and t
output: similarity of s and t
m ← |s|
n ← |t|
for i ← 0 to m do
    a[i, 0] ← i × g
for j ← 0 to n do
    a[0, j] ← j × g
for i ← 1 to m do
    for j ← 1 to n do
        a[i, j] ← max(a[i - 1, j] + g, a[i - 1, j - 1] + p(i, j), a[i, j - 1] + g)
return a[m, n]
Here g is the space penalty (-2 in our scoring).
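A direct Python rendering of Algorithm Similarity is sketched below. It assumes g = -2 and p(i, j) = +1 or -1 as in the earlier slides, and it returns the filled array together with the score so that the array can be inspected. Returning the array is an addition for illustration, not part of the pseudocode.

```python
def similarity(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m + 1) x (n + 1) array row by row; return (a[m][n], a)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Python strings are 0-based, so s[i] in the pseudocode is s[i - 1].
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)
    return a[m][n], a


score, table = similarity("AAAC", "AGC")
for row in table:
    print(row)
print(score)  # -1
```

Each entry is computed once from three neighbours, so both the running time and the memory use grow in proportion to m × n.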

Optimal Alignments
So now we know the maximum similarity, but we still need to compute the optimal alignment.
We will use the array a of similarities previously computed.
To be continued …