Sequence Alignments
Chi-Cheng Lin, Ph.D.
Associate Professor
Department of Computer Science
Winona State University – Rochester Center
clin@winona.edu

Sequence Alignments
- Cornerstone of bioinformatics
- What is a sequence?
  - Nucleotide sequence
  - Amino acid sequence
- Pairwise and multiple sequence alignments
- What alignments can help with
  - Determining the function of a newly discovered gene sequence
  - Determining evolutionary relationships among genes, proteins, and species
  - Predicting the structure and function of proteins
Intro to Bioinformatics – Sequence Alignment
Acknowledgement: These notes are adapted from the lecture notes of Wright State University’s Bioinformatics Program.

DNA Replication
- Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set.

Over time, genes accumulate mutations
- Environmental factors
  - Radiation
  - Oxidation
- Mistakes in replication or repair
- Deletions, duplications
- Insertions, inversions
- Translocations
- Point mutations

Deletions
- Codon deletion:
  ACG ATA GCG TAT GTA TAG CCG…
  - Effect depends on the protein, position, etc.
  - Almost always deleterious
  - Sometimes lethal
- Frame shift mutation:
  ACG ATA GCG TAT GTA TAG CCG…
  ACG ATA GCG ATG TAT AGC CG?…
  - Almost always lethal

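The difference between the two kinds of deletion can be sketched in a few lines of Python. This is only an illustration: the helper name `codons` is hypothetical, and the choice of which codon to delete is an assumption (the slide does not say which one was removed). The single-base deletion reproduces the shifted reading frame shown above.

```python
def codons(seq):
    """Split a sequence into consecutive triplets; a trailing partial codon is kept."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

original = "ACGATAGCGTATGTATAGCCG"

# Codon deletion: remove one whole triplet (here the 4th codon, TAT;
# this choice is an assumption made for illustration).
codon_deleted = original[:9] + original[12:]

# Frame shift: remove a single base (the first T of TAT, index 9),
# so every downstream codon is read in a different frame.
frame_shifted = original[:9] + original[10:]

print(" ".join(codons(original)))       # original reading frame
print(" ".join(codons(codon_deleted)))  # downstream codons unchanged
print(" ".join(codons(frame_shifted)))  # downstream codons all altered
```

After a codon deletion the downstream codons are still read as before, while after the single-base deletion every later codon changes, which is why frame shifts tend to be far more damaging.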
Indels
- When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless the ancestry is known:
  ACGTCTGATACGCCGTATCGTCTATCT
  ACGTCTGAT---CCGTATCGTCTATCT

The Genetic Code
- Substitutions are mutations accepted by natural selection.
- Synonymous: CGC → CGA (Arg → Arg)
- Non-synonymous: GAU → GAA (Asp → Glu)

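Whether a substitution is synonymous can be checked by translating both codons and comparing the amino acids. The sketch below builds the standard genetic code from the usual compact 64-letter string (bases in U, C, A, G order); the names `CODON_TABLE` and `is_synonymous` are hypothetical and chosen only for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acids, '*' = stop codon.
# Codons are ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def is_synonymous(before, after):
    """True if both mRNA codons encode the same amino acid."""
    return CODON_TABLE[before] == CODON_TABLE[after]

print(is_synonymous("CGC", "CGA"))  # both codons encode R (Arg)
print(is_synonymous("GAU", "GAA"))  # D (Asp) versus E (Glu)
```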
Point Mutation Example: Sickle-cell Disease

Wild-type hemoglobin DNA:  3’----CTT----5’
mRNA:                      5’----GAA----3’
Normal hemoglobin:         ------[Glu]------

Mutant hemoglobin DNA:     3’----CAT----5’
mRNA:                      5’----GUA----3’
Mutant hemoglobin:         ------[Val]------

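The DNA-to-mRNA-to-amino-acid steps on this slide can be traced with a short sketch. It assumes the DNA strand shown is the template strand written 3’ to 5’, so complementing it base by base yields the mRNA 5’ to 3’. The function name `transcribe` is hypothetical, and the lookup table deliberately contains only the two codons from the slide.

```python
# DNA template base -> complementary mRNA base
COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial lookup: only the two codons that appear on the slide.
CODON_TO_AMINO_ACID = {"GAA": "Glu", "GUA": "Val"}

def transcribe(template_3_to_5):
    """Return the mRNA (5' to 3') for a DNA template strand given 3' to 5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)

for label, dna in [("wild-type", "CTT"), ("mutant", "CAT")]:
    mrna = transcribe(dna)
    print(label, dna, mrna, CODON_TO_AMINO_ACID[mrna])
```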
Image credit: U.S. Department of Energy Human Genome Program, http://www.ornl.gov/hgmis.

Comparing Two Sequences
- Point mutations, easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT
- Indels are difficult, must align sequences:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT

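When two sequences have the same length and differ only by point mutations, a position-by-position comparison is enough. The sketch below (the function name `mismatch_positions` is hypothetical) lists the 0-based positions where the first pair of sequences above differs. It does not work for the indel case, where the sequences must be aligned first.

```python
def mismatch_positions(a, b):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length; align them first")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

seq1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(mismatch_positions(seq1, seq2))  # → [9, 14, 18]
```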
Why Align Sequences?
- The draft human genome is available
- Automated gene finding is possible
- Gene: AGTACGTATCGTATAGCGTAA
- What does it do?
- One approach: Is there a similar gene in another species?
  - Align sequences with known genes
  - Find the gene with the “best” match

Scoring a Sequence Alignment
- Match score: +1
- Mismatch score: 0
- Gap penalty: -1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Gaps: 7 × (-1)
- Score = +11

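The scoring scheme above can be written as a small function. This is a minimal sketch: the name `score_alignment` and its return format are assumptions made for illustration, and each gap character is penalized independently, as on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned sequences column by column.

    Returns (score, matches, mismatches, gaps). '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return score, matches, mismatches, gaps

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # → (11, 18, 2, 7)
```

Changing the `match`, `mismatch`, and `gap` arguments shows how the same alignment scores under a different scheme.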
How can we find an optimal alignment?
- Finding the optimal alignment is computationally hard:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
- There are ~888,000 possible ways to align the two sequences given above.
- Algorithms based on a technique called “dynamic programming” are used; they are beyond the scope of this workshop.

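One way to arrive at a figure close to 888,000 is to count the ways of placing 7 gap characters among the 27 columns of the shorter sequence (20 bases plus 7 gaps). This interpretation is an assumption; the slide does not state how its number was derived. The count is a binomial coefficient:

```python
import math

long_length = 27   # length of the longer sequence
short_length = 20  # length of the shorter sequence
gaps = long_length - short_length  # 7 gap characters to place

# Choose which 7 of the 27 columns hold a gap in the shorter sequence.
placements = math.comb(long_length, gaps)
print(placements)  # → 888030
```

Even for sequences this short, scoring every placement one at a time is wasteful, and the count grows rapidly with sequence length.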
Global and Local Alignments
- Global alignment: score the entire alignment
- Local alignment: find the best matching subsequence
- Why local sequence alignment?
  - Subsequence comparison between a DNA sequence and a genome
  - Protein function domains
  - Exon matching

Example
- Compare the two sequences:
  TTGACACCCTCCCAATT
  ACCCCAGGCTTTACACAG
- Global alignment (does it look good?)
  TTGACACCCTCC-CAATT
      ||  ||   ||
  ACCCCAGGCTTTACACAG
- Local alignment (does it look good?)
  ---------TTGACACCCTCCCAATT
           || ||||
  ACCCCAGGCTTTACACAG--------

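The local alignment on this slide contains no internal gaps: one sequence is simply shifted relative to the other. A gapless sliding comparison is therefore enough to reproduce it. This is a simplification, not a full local alignment algorithm (those allow gaps and use dynamic programming), and the names `matches_at_shift` and `best_shift` are hypothetical.

```python
def matches_at_shift(top, bottom, shift):
    """Count identical bases when bottom[j] is placed under top[j + shift]."""
    count = 0
    for j, base in enumerate(bottom):
        i = j + shift
        if 0 <= i < len(top) and top[i] == base:
            count += 1
    return count

def best_shift(top, bottom):
    """Return the shift with the most matching columns (first one on ties)."""
    shifts = range(-(len(bottom) - 1), len(top))
    return max(shifts, key=lambda s: matches_at_shift(top, bottom, s))

top = "ACCCCAGGCTTTACACAG"
bottom = "TTGACACCCTCCCAATT"

# The slide's local alignment places 'bottom' 9 columns to the right.
print(matches_at_shift(top, bottom, 9))  # → 6
shift = best_shift(top, bottom)
print(shift, matches_at_shift(top, bottom, shift))
```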
Dot Plots
- One of the simplest and oldest methods for sequence alignment
- Visualization of regions of similarity
  - Assign one sequence to the horizontal axis
  - Assign the other to the vertical axis
  - Place a dot in each cell where the two residues match
  - Diagonal lines indicate adjacent regions of identity

A Simple Example
- Construct a simple dot plot for
  TAGTCGATG
  TGGTCATC
- The alignment is
  TAGTCGATG
  TGGTC-ATC
[Dot plot grid: one sequence along each axis, with * marking matching positions]

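A dot plot for the two example sequences can be generated with a nested comparison. The sketch below is illustrative only: the function names `dot_plot` and `render` are hypothetical, and placing the first sequence on the horizontal axis is an arbitrary choice.

```python
def dot_plot(horizontal, vertical):
    """Return a grid with '*' wherever the row base equals the column base."""
    return [
        ["*" if h == v else " " for h in horizontal]
        for v in vertical
    ]

def render(horizontal, vertical):
    """Format the grid as text, with the sequences as axis labels."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dot_plot(horizontal, vertical)):
        lines.append(v + " " + " ".join(row))
    return "\n".join(lines)

print(render("TAGTCGATG", "TGGTCATC"))
```

In the printed grid, runs of `*` along a diagonal correspond to stretches where the two sequences are identical, and a jump from one diagonal to a neighboring one marks a gap.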
What else can a dot plot do (and how)?
- Gaps
- Inverse substring
- Repeat
- Palindrome
- Gene conservation and order study