Summer Bioinformatics Workshop 2008
Sequence Alignments
Chi-Cheng Lin, Ph.D.
Associate Professor
Department of Computer Science
Winona State University – Rochester Center

Sequence Alignments
Cornerstone of bioinformatics
What is a sequence?
  Nucleotide sequence
  Amino acid sequence
Pairwise and multiple sequence alignments
What alignments can help us do
  Determine the function of a newly discovered gene sequence
  Determine evolutionary relationships among genes, proteins, and species
  Predict the structure and function of a protein

Why Align Sequences?
The draft human genome is available
Automated gene finding is possible
Gene: AGTACGTATCGTATAGCGTAA
  What does it do?
One approach: Is there a similar gene in another species?
  Align sequences with known genes
  Find the gene with the “best” match

Visualization of Sequence Alignment: Dot Plot
One of the simplest and oldest methods for sequence alignment
Visualization of regions of similarity
  Assign one sequence to the horizontal axis
  Assign the other to the vertical axis
  Place a dot at each position where the two residues match
Diagonal lines mean adjacent regions of identity

A Simple Example
Construct a simple dot plot for
  TAGTCGATG
  TGGTCATC
The alignment is
  TAGTCGATG
  TGGTC-ATC
Dot plot (first sequence on the horizontal axis, second on the vertical axis):
    T A G T C G A T G
  T *     *       *
  G     *     *     *
  G     *     *     *
  T *     *       *
  C         *
  A   *         *
  T *     *       *
  C         *
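
The dot plot on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `dot_plot` and the text layout (one `*` per matching pair, one space between columns) are choices made here for illustration, not part of the workshop material.

```python
def dot_plot(horizontal: str, vertical: str) -> list:
    """Return one text row per residue of `vertical`.

    A '*' is placed in every column where the residue of `vertical`
    equals the residue of `horizontal`; other cells are left blank.
    """
    rows = []
    for v in vertical:
        cells = ["*" if v == h else " " for h in horizontal]
        # Prefix each row with its residue and drop trailing blanks.
        rows.append(f"{v} " + " ".join(cells).rstrip())
    return rows


if __name__ == "__main__":
    top = "TAGTCGATG"    # sequence on the horizontal axis
    side = "TGGTCATC"    # sequence on the vertical axis
    print("  " + " ".join(top))
    for row in dot_plot(top, side):
        print(row)
```

Runs of stars along a diagonal correspond to stretches where the two sequences are identical; a shift from one diagonal to a neighboring one suggests a gap.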

Genes Accumulate Mutations over Time
Mistakes in gene replication or repair
  Deletions, duplications
  Insertions, inversions
  Translocations
  Point mutations
Environmental factors
  Radiation
  Oxidation

Deletions
Codon deletion:
  ACG ATA GCG TAT GTA TAG CCG…
  Effect depends on the protein, position, etc.
  Almost always deleterious
  Sometimes lethal
Frame shift mutation:
  ACG ATA GCG TAT GTA TAG CCG…
  ACG ATA GCG ATG TAT AGC CG?…
  Almost always lethal
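
The difference between the two kinds of deletion can be seen by regrouping the bases into codons. The sketch below is illustrative only: the helper `codons` is a name made up here, and the choice of which codon to delete (the fourth, TAT) is an assumption, since the slide does not say which one is removed. The single-base deletion reproduces the frame-shifted sequence shown on the slide.

```python
def codons(seq: str) -> list:
    """Split a sequence into consecutive triplets (the last may be shorter)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


original = "ACGATAGCGTATGTATAGCCG"

# Codon deletion (assumed here: the fourth codon, bases 9..11).
# The reading frame of everything downstream is preserved.
codon_deleted = original[:9] + original[12:]

# Frame shift: delete a single base (base 9, the first T of TAT).
# Every downstream codon changes.
frame_shifted = original[:9] + original[10:]

if __name__ == "__main__":
    print(" ".join(codons(original)))
    print(" ".join(codons(codon_deleted)))
    print(" ".join(codons(frame_shifted)))
```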

Indels
When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
  ACGTCTGATACGCCGTATCGTCTATCT
  ACGTCTGAT---CCGTATCGTCTATCT

Substitutions
(Figure: The Genetic Code)
Substitutions are mutations accepted by natural selection.
  Synonymous (amino acid unchanged): CGC → CGA (Arg → Arg)
  Non-synonymous (amino acid changed): GAU → GAA (Asp → Glu)
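
Whether a substitution is synonymous can be checked by translating both codons and comparing the amino acids. The sketch below assumes a tiny excerpt of the standard genetic code covering only the four codons on this slide; `CODON_TABLE` and `is_synonymous` are illustrative names, and a real analysis would use the full 64-codon table.

```python
# Excerpt of the standard genetic code (RNA codons), limited to the
# codons used in the slide's examples.
CODON_TABLE = {
    "CGC": "Arg",
    "CGA": "Arg",
    "GAU": "Asp",
    "GAA": "Glu",
}


def is_synonymous(before: str, after: str) -> bool:
    """True if both codons translate to the same amino acid."""
    return CODON_TABLE[before] == CODON_TABLE[after]


if __name__ == "__main__":
    for before, after in [("CGC", "CGA"), ("GAU", "GAA")]:
        kind = "synonymous" if is_synonymous(before, after) else "non-synonymous"
        print(f"{before} -> {after}: "
              f"{CODON_TABLE[before]} -> {CODON_TABLE[after]} ({kind})")
```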

Point Mutation Example: Sickle-cell Disease
Wild-type hemoglobin
  DNA   3’----CTT----5’
  mRNA  5’----GAA----3’
  Normal hemoglobin: [Glu]
Mutant hemoglobin
  DNA   3’----CAT----5’
  mRNA  5’----GUA----3’
  Mutant hemoglobin: [Val]
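
The step from DNA to mRNA to amino acid on this slide can be traced in code. This is a rough sketch under two assumptions: the DNA strand shown is the template strand written 3’ to 5’ (as labeled on the slide), and only the two codons involved are included in the lookup table. The names `transcribe` and `AMINO_ACID` are made up for this example.

```python
# Base pairing used during transcription: template DNA base -> mRNA base.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Only the two codons from the slide; not a full genetic code.
AMINO_ACID = {"GAA": "Glu", "GUA": "Val"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5' to 3') for a template strand written 3' to 5'."""
    return "".join(PAIRING[base] for base in template_3_to_5)


if __name__ == "__main__":
    for label, dna in [("wild-type", "CTT"), ("mutant", "CAT")]:
        mrna = transcribe(dna)
        print(f"{label}: DNA 3'-{dna}-5' -> mRNA 5'-{mrna}-3' -> {AMINO_ACID[mrna]}")
```

A single changed base in the DNA (T to A in the middle position) is enough to replace glutamic acid with valine.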

Image credit: U.S. Department of Energy Human Genome Program

Comparing Two Sequences
Point mutations are easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult; the sequences must be aligned first:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT
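
Point mutations are "easy" because two sequences of equal length can be compared position by position. A minimal sketch of that comparison follows; `point_differences` is a name chosen here, and positions are reported 1-based.

```python
def point_differences(a: str, b: str) -> list:
    """List (position, base_in_a, base_in_b) for every mismatching position.

    Positions are 1-based. The sequences must have the same length,
    which is exactly why this approach fails once indels are present.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    first = "ACGTCTGATACGCCGTATAGTCTATCT"
    second = "ACGTCTGATTCGCCCTATCGTCTATCT"
    for pos, x, y in point_differences(first, second):
        print(f"position {pos}: {x} -> {y}")
```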

Scoring a Sequence Alignment
Example
  Match score: +1
  Mismatch score: 0
  Gap penalty: -1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

  Matches: 18 × (+1)
  Mismatches: 2 × 0
  Gaps: 7 × (-1)
  Score = 18 + 0 + (-7) = +11
Various scoring schemes exist.
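
The score on this slide can be checked by walking the two aligned rows column by column. The sketch below uses the slide's scoring values as defaults; the function name `score_alignment` and the convention that `-` marks a gap are the only assumptions.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = 0, gap: int = -1):
    """Score two aligned rows of equal length.

    Returns (score, matches, mismatches, gaps). A column containing
    '-' in either row counts as a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return score, matches, mismatches, gaps


if __name__ == "__main__":
    result = score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                             "----CTGATTCGC---ATCGTCTATCT")
    print(result)  # (11, 18, 2, 7)
```

Changing the `mismatch` or `gap` arguments shows how a different scoring scheme changes the score of the same alignment.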

How can we find an optimal alignment?
Finding the optimal alignment is computationally hard:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
There are ~888,000 possible ways to align the two sequences given above.
Algorithms based on a technique called “dynamic programming” are used; they are outside the scope of this workshop.
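
For readers curious about what dynamic programming looks like here, the following is a minimal sketch of one well-known method, the Needleman-Wunsch recurrence for global alignment. The slides do not name a specific algorithm, so this choice is an assumption, as is the function name. It computes only the best achievable score (not the alignment itself), using the match, mismatch, and gap values from the scoring example as defaults.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = 0,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, score only).

    Works row by row: prev[j] holds the best score for aligning the
    first i-1 characters of a with the first j characters of b.
    """
    prev = [j * gap for j in range(len(b) + 1)]      # a is empty: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # b is empty: all gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,      # align a[i-1] with b[j-1]
                            prev[j] + gap,           # gap in b
                            curr[j - 1] + gap))      # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    long_seq = "ACGTCTGATACGCCGTATAGTCTATCT"
    short_seq = "CTGATTCGCATCGTCTATCT"
    print(global_alignment_score(long_seq, short_seq))
```

Instead of enumerating every possible alignment, the table is filled in time proportional to the product of the two sequence lengths, which is what makes the problem tractable.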

Global and Local Alignments
Global alignment: score the entire alignment
Local alignment: find the best matching subsequence
Why local sequence alignment?
  Global alignment is useful only if the sequences to be aligned are very similar
  Subsequence comparison between a DNA sequence and a genome
  Identify
    Conserved regions
    Protein function domains

Example
Compare the two sequences:
  TTGACACCCTCCCAATT
  ACCCCAGGCTTTACACAG
Global alignment (does it look good?)
  TTGACACCCTCC-CAATT
      ||  ||   ||
  ACCCCAGGCTTTACACAG
Local alignment (does it look good?)
           TTGACACCCTCCCAATT
           || ||||
  ACCCCAGGCTTTACACAG
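
A local alignment score for the two sequences in this example can be computed with a small variation of dynamic programming. The sketch below uses the Smith-Waterman recurrence, which the slides do not name, so treat the choice as an assumption. The scoring values (match +1, mismatch -1, gap -1) are also assumed here for illustration: local alignment needs a negative mismatch score so that poorly matching regions are not extended. The function name is made up.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1,
                          gap: int = -1) -> int:
    """Best local alignment score of a and b (Smith-Waterman, score only).

    Same table as global alignment, except that a cell is never allowed
    to drop below 0, so an alignment may start and end anywhere.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                      # start a fresh local alignment
                       prev[j - 1] + pair,     # extend diagonally
                       prev[j] + gap,          # gap in b
                       curr[j - 1] + gap)      # gap in a
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTGACACCCTCCCAATT", "ACCCCAGGCTTTACACAG"))
```

Under these assumed scores, the region TTGACAC against TTTACAC from the slide (six matches, one mismatch) contributes a score of 5, so the result is at least that.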

Where do we get sequences to work with?
Biological databases
  NCBI Entrez
Wet labs
Simulations
Other people’s results
On-line education resources
  BEDROCK
BLAST results