Published by Naomi Walters. Modified over 9 years ago.
1
BCB 444/544 F07 ISU Dobbs #4 - Sequence Alignment
8/27/07
Finish: Lecture 2 - Biological Databases
Lecture 4 - Sequence Alignment
BCB 444/544 Fall 07 Dobbs
2
Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
  Chp 3 - pp Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
  Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
  Chp 3 - pp 41-49
3
HW#2:
4
Back to: Chp 2- Biological Databases
Xiong: Chp 2 Introduction to Biological Databases
  What is a Database?
  Types of Databases
  Biological Databases
  Pitfalls of Biological Databases
  Information Retrieval from Biological Databases
5
What is a Database?
Duh!! OK: we'll skip that!
6
Types of Databases
3 major types of electronic databases:
  Flat files - simple text files; no organization to facilitate retrieval
  Relational - data organized as tables ("relations"); shared features among tables allow rapid search
  Object-oriented - data organized as "objects"; objects associated hierarchically
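To make the relational idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the three gene records are illustrative assumptions, not taken from the lecture; the point is only that a shared key (organism_id) lets one table be searched through another.

```python
import sqlite3

# In-memory toy database; schema and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (
    id INTEGER PRIMARY KEY,
    symbol TEXT,
    organism_id INTEGER REFERENCES organism(id)
);
INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO gene VALUES (1, 'HBB', 1), (2, 'HBA1', 1), (3, 'Hbb-b1', 2);
""")

def genes_for(organism_name):
    """Return gene symbols for one organism by joining on the shared key."""
    rows = conn.execute(
        "SELECT gene.symbol FROM gene "
        "JOIN organism ON gene.organism_id = organism.id "
        "WHERE organism.name = ? ORDER BY gene.symbol",
        (organism_name,),
    )
    return [row[0] for row in rows]

print(genes_for("Homo sapiens"))
```

A flat file would hold the same records as lines of text, and answering the same question would mean scanning and parsing every line.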
7
Biological Databases
Currently - all 3 types, but MANY flat files
What are the goals of biological databases?
  Information retrieval
  Knowledge discovery
Important issue: Interconnectivity
8
Types of Biological Databases
1- Primary
  "simple" archives of sequences, structures, images, etc.
  raw data, minimal annotations, not always well curated!
2- Secondary
  enhanced with more complete annotation of sequences, structures, images, etc.
  usually curated!
3- Specialized
  focused on a particular research interest or organism
  usually - not always - highly curated
9
Examples of Biological Databases
1- Primary
  DNA sequences
    GenBank - US
    European Molecular Biology Lab - EMBL
    DNA Data Bank of Japan - DDBJ
  Structures (Protein, DNA, RNA)
    PDB - Protein Data Bank
    NDB - Nucleic Acid Database
10
Examples of Biological Databases
2- Secondary
  Protein sequences
    Swiss-Prot, TrEMBL, PIR
    these recently combined into UniProt
3- Specialized
  Species-specific (or "taxonomic" specific)
    FlyBase, WormBase, AceDB, PlantDB
  Molecule-specific, disease-specific
11
Pitfalls of Biological Databases
Errors! & lack of documentation re: quality or reliability of data
Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
Redundancy
Inconsistency
Incompatibility (format, terminology, data types, etc.)
12
Information Retrieval from Biological Databases
2 most popular retrieval systems:
  ENTREZ - NCBI
    will use a LOT - was introduced in Lab 1
  SRS - Sequence Retrieval System - EBI
    will use less, similar to ENTREZ
Both:
  Provide access to multiple databases
  Allow complex queries
13
Web Resources: Bioinformatics & Computational Biology
NCBI - National Center for Biotechnology Information
ISCB - International Society for Computational Biology
JCB - Jena Center for Bioinformatics
Pitt - OBRC Online Bioinformatics Resources Collection
UBC - Bioinformatics Links Directory
UWash - BioMolecules
ISU - Bioinformatics Resources - Andrea Dinkelman
ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
Wikipedia: Bioinformatics
14
ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
  LH Baker Center - Bioinformatics & Biological Statistics
  BCB - Bioinformatics & Computational Biology
  BCB Lab - (Student-Led Consulting & Resources)
  CIAG - Center for Integrated Animal Genomics
  CCILD - Computational Intelligence, Learning & Discovery
  IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
  Biotechnology - Instrumentation Facilities
  PSI - Plant Sciences Institute
  PSI Centers
15
SUMMARY: #2- Biological Databases
BEWARE!
16
Chp 3- Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
  Evolutionary Basis
  Sequence Homology versus Sequence Similarity
  Sequence Similarity versus Sequence Identity
  Methods
  Scoring Matrices
  Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
17
Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics analysis." Jin Xiong
Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
  Search for common patterns of characters
  Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is the basis for:
  Database searching (e.g., BLAST)
  Multiple sequence alignment (MSA)
18
Why Align Sequences?
Databases contain many sequences with known functions & many sequences with unknown functions.
Genes (or proteins) with similar sequences may have similar structures and/or functions.
Sequence alignment can provide important clues to the function of a novel gene or protein.
19
Examples of Bioinformatics Tasks that Rely on Sequence Alignment
Genomic sequencing (> 500 complete genomes sequenced!)
Assembling multiple sequence reads into contigs, scaffolds
Aligning sequences with chromosomes
Finding genes and regulatory regions
Identifying gene products
Identifying function of gene products
Studying the structural organization of genomes
Comparative genomics
20
Evolutionary Basis
DNA, RNA and proteins are "molecular fossils": they encode the history of millions of years of evolution
During evolution, molecular sequences accumulate random changes (mutations/variants), some of which provide a selective advantage or disadvantage, and some of which are neutral
Sequences that are structurally and/or functionally important tend to be conserved (e.g., chromosomal telomeric sequences; enzyme active sites)
Significant sequence conservation allows inference of evolutionary relatedness
21
Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us: Homology = similarity due to descent from a common evolutionary ancestor
But, HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)
22
Orthologs vs Paralogs
2 types of homologous sequences:
  Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions (e.g., human -globin & mouse -globin)
  Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions (e.g., human -globin & human -globin)
[Figure: gene tree of genes A, B, C and C', with branch points labeled Speciation and Duplication]
  A is the parent gene
  Speciation leads to B & C
  Duplication leads to C'
  B and C are Orthologous
  C and C' are Paralogous
23
Sequence Homology vs Similarity
Homologous sequences - sequences that share a common evolutionary ancestry
Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
  Sequence homology: An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
    Homology is qualitative
  Sequence similarity: The direct result of observation from a sequence alignment
    Similarity is quantitative; can be described using percentages
24
Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
  Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing
  Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
  Identity = % of exact matches between two aligned sequences
  Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
  Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)
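The difference between identity and similarity for proteins can be sketched in a few lines of Python. Two things here are assumptions rather than part of the lecture: the residue groupings in SIMILAR_GROUPS are a rough, hypothetical classification by physicochemical type, and percentages are taken over the full alignment length (other conventions divide by the shorter sequence or by the aligned, ungapped columns).

```python
# Hypothetical residue classes, for illustration only.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("STNQ"),  # polar, uncharged
    set("AG"),    # small
]

def percent_identity_and_similarity(a, b):
    """Return (% identity, % similarity) for two gapped, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    n = len(a)  # assumption: alignment length is the denominator
    return 100.0 * identical / n, 100.0 * similar / n

# Protein example from the "Goal of Sequence Alignment" slide.
print(percent_identity_and_similarity("RKVA-GMA", "RKIAVAMA"))  # (62.5, 87.5)
```

With these groupings, V/I and G/A count as similar but not identical, which is why similarity is higher than identity for the same alignment.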
25
What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.
Align:
  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A SHORT SENTENCE.

  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A ######SHORT## SENTENCE##############.
  OR
  2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"? Which is better?
26
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
DNA: 4 letter alphabet (+ gap)
  TTGACAC
  TTTACAC
Proteins: 20 letter alphabet (+ gap)
  RKVA-GMA
  RKIAVAMA
27
Statement of Problem
Given:
  2 sequences
  Scoring system for evaluating match (or mismatch) of two characters
  Penalty function for gaps in sequences
Find:
  Optimal pairing of sequences that
    Retains the order of characters
    Introduces gaps where needed
    Maximizes total score
28
Types of Sequence Variation
Sequences can diverge from a common ancestor through various types of mutations:
  Substitutions  ACGA -> AGGA
  Insertions     ACGA -> ACCGA
  Deletions      ACGA -> AGA
Insertions or deletions ("indels") result in gaps in alignments
Substitutions result in mismatches
No change results in a match
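How each mutation type shows up in an alignment can be sketched by labeling every column as a match, mismatch, or gap. The function name classify_columns and the gapped strings are illustrative assumptions; the gap placements are just one reasonable way to align the slide's examples.

```python
def classify_columns(a, b):
    """Label each column of a gapped pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            labels.append("gap")       # insertion or deletion (indel)
        elif x == y:
            labels.append("match")     # no change
        else:
            labels.append("mismatch")  # substitution
    return labels

print(classify_columns("ACGA", "AGGA"))    # substitution: one mismatch
print(classify_columns("AC-GA", "ACCGA"))  # insertion: gap in the first row
print(classify_columns("ACGA", "A-GA"))    # deletion: gap in the second row
```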
29
Gaps
Indels of various sizes can occur in one sequence relative to the other
  e.g., corresponding to a shortening of the polypeptide chain in a protein
30
Avoiding Random Alignments with a Scoring Function
Introducing too many gaps generates nonsense alignments:
  s--e-----qu---en--ce
  sometimesquipsentice
Need to distinguish between alignments that occur due to homology and those that occur by chance
Define a scoring function that accounts for mismatches and gaps
Scoring Function (F), with m, s and d as signed scores:
  Match:    m   e.g., +1
  Mismatch: s   e.g., -1
  Gap:      d   e.g., -2
  F = m(#matches) + s(#mismatches) + d(#gaps)
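The scoring function F can be sketched directly in Python. This version assumes m, s and d are signed scores (+1, -1, -2, the example values on the slide) and charges d once per gap character; the function name alignment_score is an illustrative choice.

```python
def alignment_score(a, b, m=1, s=-1, d=-2):
    """F = m*(#matches) + s*(#mismatches) + d*(#gaps) for a gapped alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1        # every gap character is penalized
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return m * matches + s * mismatches + d * gaps

# DNA pair from the "Goal of Sequence Alignment" slide: 6 matches, 1 mismatch.
print(alignment_score("TTGACAC", "TTTACAC"))  # 5

# The nonsense alignment: 8 matches, 12 gap characters.
print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))  # -16
```

The heavily gapped alignment scores far below zero even though every aligned letter matches, which is how the gap penalty discourages alignments that arise by chance.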
31
Not All Mismatches are the Same
Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)
32
Substitution Matrix
s(a,b) corresponds to the score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
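A substitution matrix is simply a lookup table for s(a,b). The sketch below uses a toy matrix over four amino acids (S, T, W, A); the numbers are made up for illustration and are not PAM or BLOSUM values. The gap score of -2 is likewise an assumed parameter.

```python
# Hypothetical scores; each unordered pair is stored once.
TOY_MATRIX = {
    ("S", "S"): 2, ("T", "T"): 2, ("W", "W"): 5, ("A", "A"): 2,
    ("S", "T"): 1,                   # Ser/Thr: an "exchangeable" pair
    ("S", "A"): 0, ("T", "A"): 0,
    ("W", "S"): -3, ("W", "T"): -3, ("W", "A"): -3,  # Trp: poorly exchangeable
}

def s(a, b):
    """Score of aligning residue a with residue b (symmetric lookup)."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]

def score_alignment(x, y, gap=-2):
    """Sum s(a,b) over aligned columns, charging `gap` for each gap character."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(x, y):
        total += gap if "-" in (a, b) else s(a, b)
    return total

print(s("S", "T"), s("W", "A"))          # 1 -3
print(score_alignment("STW-", "SSWA"))   # 2 + 1 + 5 - 2 = 6
```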
33
Methods
Global and Local Alignment
Alignment Algorithms
  Dot Matrix Method
  Dynamic Programming Method
    Gap penalties
    DP for Global Alignment
    DP for Local Alignment
Scoring Matrices
  Amino acid scoring matrices
  PAM
  BLOSUM
  Comparisons between PAM & BLOSUM
Statistical Significance of Sequence Alignment
34
Global vs Local Alignment
Global alignment
  Finds best possible alignment across entire length of 2 sequences
  Aligned sequences assumed to be generally similar over entire length
Local alignment
  Finds local regions with highest similarity between 2 sequences
  Aligns these without regard for rest of sequence
  Sequences are not assumed to be similar over entire length
35
Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG

CTGTCG-CTGCACG
-TGCCG--TG----
Global alignment
-TGC-CG-TG----
CTGTCGCTGCACG--
TGC-CGTG
Local alignment

Which is better?
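As a preview of the dynamic programming methods listed in the Methods outline, the sketch below computes the optimal global score and the optimal local score for two sequences. It is a minimal illustration, not the course's implementation: it assumes the simple scores used earlier (match +1, mismatch -1, gap -2 per character), returns only the score (no traceback), and the function names are illustrative.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over alignments spanning both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]                          # column 0: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over alignments of any pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

S = "CTGTCGCTGCACG"
T = "TGCCGTG"
print(global_score(S, T), local_score(S, T))
```

Because the local version may ignore poorly matching ends, its score is never lower than the global score for the same pair, which is one way to see why the two methods suit different tasks.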
36
Global vs Local Alignment: When to Use Which?
Both are important, but it is critical to use the right method for a given task!
Global alignment:
  Good for: aligning closely related sequences of approx. same length
  Not good for: divergent sequences or sequences with different lengths
Local alignment:
  Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
  Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues
37
Alignment Algorithms
3 major methods for alignment:
  Dot matrix analysis
  Dynamic Programming
  Word or k-tuple methods (later, in Chp 4)
38
Dot Matrix Method (Dot Plots)
Place 1 sequence along top row of matrix
Place 2nd sequence along left column of matrix
Plot a dot each time there is a match between an element of row sequence and an element of column sequence
For proteins, usually use more sophisticated scoring schemes than "identical match"
Diagonal lines indicate areas of match
Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: dot plot with sequence A C G along the top row and A C G along the left column]
Exploring Dot Plots
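The dot matrix procedure can be sketched as follows. This is a minimal version that assumes the simplest "identical match" rule, with no window or threshold filtering; the function names dot_matrix and render are illustrative.

```python
def dot_matrix(top, left):
    """matrix[i][j] is True when left[i] matches top[j]."""
    return [[top_ch == left_ch for top_ch in top] for left_ch in left]

def render(top, left):
    """Draw the matrix as text: '*' marks a match, '.' marks no match."""
    rows = ["  " + " ".join(top)]  # header row: the top sequence
    for left_ch, row in zip(left, dot_matrix(top, left)):
        rows.append(left_ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)

print(render("ACG", "ACG"))
#   A C G
# A * . .
# C . * .
# G . . *
```

Comparing a sequence with itself always gives an unbroken main diagonal; with two different sequences, shorter diagonal runs mark the regions that match.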