Speaker: Sean D. Mooney Date: August 27, 2015 Bioinformatics I

Presentation transcript:

Learning Objectives
To have an understanding of:
– Sequence analysis
– Genome assembly and annotation
– Publicly available molecular biology and genetic databases
– Identification of sequence similarity
– Defining function of biological sequences
– Using sequence information to hypothesize function

Bioinformatics Introduction
Bioinformatics is the intersection of computer science, statistics, and one of the following life science disciplines: biology, biochemistry, or medicine.
[Diagram labels: Computer Science, Statistics, Bioinformatics]

What is bioinformatics?
Clinical Informatics
– Systems used to deal with patient data
– Clinical trial management systems, electronic health records, etc.
Laboratory Informatics
– Systems to deal with scientific instruments and data management
– Connecting instruments together, managing laboratory flow, etc.
Bioinformatics
– Systems to deal with basic research data
– DNA, proteins, ‘molecular’ things

Sequence Analysis
DNA sequencing is a common activity, so analysis of biological sequences has become critical to modern molecular biology and genetics.
What kind of sequences?
– DNA
– Protein
– RNA
Today we will learn how these sequences are managed and used by biologists.

Storing A Sequence On A Computer
FASTA Format:
>TITLE SEQUENCE 1
SEQUENCE1VSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQA
SAFCGLGFLIVLALFQAGL
>TITLE SEQUENCE 2
SEQUENCE2ASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQM
RIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVW
IAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAG
>TITLE SEQUENCE 3
AINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEEGFGELFEKAKQNN
NNRKTSNGDDSLFFSNFSLLGTPVLKDINFKIERGQL

Actual FASTA File
>gi| |ref|NM_ | Homo sapiens cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7) (CFTR), mRNA
AATTGGAAGCAAATGACATCACAGCAGGTCAGAGAAAAAGGGTTGAGCGGCAGGCACCCAGAGTAGTAGG
TCTTTGGCATTAGGAGCTTGAGCCCAGACGGCCCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAG
GTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGCTGGACCAGACCAATTTTGAGGAAA
GGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTAT
CTGAAAAATTGGAAAGAGAATGGGATAGAGAGCTGGCTTCAAAGAAAAATCCTAAACTCATTAATGCCCT
TCGGCGATGTTTTTTCTGGAGATTTATGTTCTATGGAATCTTTTTATATTTAGGGGAAGTCACCAAAGCA
GTACAGCCTCTCTTACTGGGAAGAATCATAGCTTCCTATGACCCGGATAACAAGGAGGAACGCTCTATCG
CGATTTATCTAGGCATAGGCTTATGCCTTCTCTTTATTGTGAGGACACTGCTCCTACACCCAGCCATTTT
TGGCCTTCATCACATTGGAATGCAGATGAGAATAGCTATGTTTAGTTTGATTTATAAGAAGACTTTAAAG
CTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACA
AATTTGATGAAGGACTTGCATTGGCACATTTCGTGTGGATCGCTCCTTTGCAAGTGGCACTCCTCATGGG
GCTAATCTGGGAGTTGTTACAGGCGTCTGCCTTCTGTGGACTTGGTTTCCTGATAGTCCTTGCCCTTTTT
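As a rough illustration of how FASTA text can be read by a program, here is a minimal Python sketch. The function name `parse_fasta` and the toy input are assumptions made for this example, not part of the lecture. It only relies on the two conventions shown on the slides: a title line starts with ">" and the sequence may wrap over several lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are concatenated; wrapping carries no meaning.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Toy input (hypothetical titles, short made-up sequences).
example = """>TITLE SEQUENCE 1
TTGACAC
TTTACAC
>TITLE SEQUENCE 2
RKVAGMAKPNM
"""

records = parse_fasta(example)
for title, sequence in records:
    print(title, len(sequence))
```

For real work an established parser (for example the one in Biopython) is usually a safer choice than a hand-written one.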

GenBank
GenBank is the public domain database of biological (RNA, DNA, protein) sequences. If you sequence a novel sequence, you can submit it!
Started at Los Alamos National Laboratory
– Moved to the National Center for Biotechnology Information in 1993
– Stores almost every type of sequence possible

Using GenBank
You can access GenBank at

GenBank
Now bigger than 100,000,000,000 bases
240,000 named organisms
More than 60 million records

The need for quality: RefSeq
GenBank is an uncurated mess…
RefSeq is the reference sequence project. This is a subset of GenBank, and a curated set of sequences and annotations.
RefSeq IDs have the form “Letter-Letter-Underscore-Number.” The letter-letter-underscore prefixes are:
– “NM_” – mRNA
– “NP_” – protein
– “NT_” – genomic contig (automated)
– “XP_” – protein predicted from genomic sequence (automated), etc.
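The prefix convention can be checked mechanically. The sketch below is only an illustration: the dictionary holds just the four prefixes named on the slide (RefSeq defines others), the function name `refseq_type` is made up for this example, and `NM_000000` is a placeholder accession rather than a real record.

```python
# Prefix meanings as listed on the slide; RefSeq defines more prefixes than these.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NT_": "genomic (automated)",
    "XP_": "protein (automated)",
}


def refseq_type(accession):
    """Return the slide's description for a RefSeq-style ID, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(refseq_type("NM_000000"))  # placeholder accession, not a real record
```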

When you look for a sequence…
It will come back to you in something like the GenBank file format.

What are the important elements of GenBank format?
LOCUS       NM_ bp mRNA linear PRI 07-OCT-2007
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7) (CFTR), mRNA.
ACCESSION   NM_
VERSION     NM_ GI:
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 6132)
  AUTHORS   Pall,H., Zielenski,J., Jonas,M.M., DaSilva,D.A., Potvin,K.M., Yuan,X.W., Huang,Q. and Freedman,S.D.
  TITLE     Primary sclerosing cholangitis in childhood is associated with abnormalities in cystic fibrosis-mediated chloride channel function
  JOURNAL   J. Pediatr. 151 (3), (2007)
  PUBMED
  REMARK    GeneRIF: There is a high prevalence of CFTR-mediated ion transport dysfunction in subjects with childhood primary sclerosing cholangitis

What are the important elements of GenBank format?
LOCUS       NM_ bp mRNA linear PRI 07-OCT-2007
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7) (CFTR), mRNA.
ACCESSION   NM_
VERSION     NM_ GI:
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 6132)
  AUTHORS   Pall,H., Zielenski,J., Jonas,M.M., DaSilva,D.A., Potvin,K.M., Yuan,X.W., Huang,Q. and Freedman,S.D.
  TITLE     Primary sclerosing cholangitis in childhood is associated with abnormalities in cystic fibrosis-mediated chloride channel function
  JOURNAL   J. Pediatr. 151 (3), (2007)
  PUBMED
  REMARK    GeneRIF: There is a high prevalence of CFTR-mediated ion transport dysfunction in subjects with childhood primary sclerosing cholangitis
Highlighted elements: Accession: NM_ Version: 3 GenBank ID: Symbol: CFTR

What do you need to know?
The gene ID or symbol is not a citable sequence!
Without the version, the RefSeq ID is not unique!
The GI is always unique: if a change occurs to the entry, a new GI is issued.
When in doubt, use the GI!
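Versioned RefSeq IDs are conventionally written as the accession, a dot, and the version number. A small sketch of separating the two parts follows; the helper name `split_versioned_accession` and the placeholder ID `NM_000000.3` are assumptions for illustration only.

```python
def split_versioned_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as None when the identifier carries no
    numeric version, which is the ambiguous case the slide warns about.
    """
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


print(split_versioned_accession("NM_000000.3"))  # placeholder ID, not a real record
print(split_versioned_accession("NM_000000"))
```

A check like this can be used to flag unversioned IDs in a manuscript or a spreadsheet before they are cited.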

GenBank files contain annotation
FEATURES             Location/Qualifiers
     CDS
                     /gene="CFTR"
                     /protein_id="NP_ "
                     /db_xref="GI: "
                     /db_xref="CCDS:CCDS5773.1"
                     /db_xref="GeneID:1080"
                     /db_xref="HGNC:1884"
                     /db_xref="HPRD:03883"
                     /db_xref="MIM:602421"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLI
                     YKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWEL
                     LQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYC
                     WEEAMEKMIENLRQTELKL

FEATURES             Location/Qualifiers
     CDS
                     /gene="CFTR"
                     /protein_id="NP_ "
                     /db_xref="GI: "
                     /db_xref="CCDS:CCDS5773.1"
                     /db_xref="GeneID:1080"
                     /db_xref="HGNC:1884"
                     /db_xref="HPRD:03883"
                     /db_xref="MIM:602421"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     GRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLI
                     YKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWEL
                     LQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYC
                     WEEAMEKMIENLRQTELKL
Note that the protein translation is present! There is also another protein record for this entry (NP_ )

Why so many IDs? Biology!
Gene Symbol (CFTR)
  GeneID – Unique to organism (1080)
    Transcript ID 1
      Transcript 1 version 1
      Transcript 1 version 2
      Transcript 1 version 3
    Transcript ID 2, etc.
      Transcript 2 version 1
      Transcript 2 version 2

The hierarchy of names
Gene symbol/name is the official alphanumeric name of a gene
– It is defined by the Human Genome Organization (HUGO)
– It generally does not reliably tell you the species or describe a transcript
– Example: CFTR, ER, AR, BRCA1
Gene Identifier
– A gene identifier, generally a number, allows you to uniquely link to the gene and species
– Common identifiers:
   Locus Link or Gene – NIH identifiers
   Mouse Genome Informatics IDs (MGI) – Mouse identification
   Ensembl IDs – European gene identifiers
Transcript (the mRNA product of a gene)
– RefSeq
– GenBank
Protein
– Next topic!

Protein databases
The most popular protein database is called UniProt.
UniProt is a superset of Swiss-Prot, at
Name is CFTR_HUMAN, Accession is Letter+Number (P13569)

Some simple rules for publication
When publishing genes or proteins:
– When publishing information about a gene, use the HUGO-approved symbol. If publishing information specifically about the gene, include the gene ID (preferably the Locus Link or Entrez Gene ID).
– When publishing information about a specific transcript of a gene, use the RefSeq ID. If including specifics about the sequence, include the version.
– UniProt is a great way to reference proteins.

What next?
We now know how to find and cite a sequence from a database.
We know that genomes, genes, transcripts and proteins are treated differently.
How do we use sequences to do things like predict function?
Sequence analysis!

Sequence analysis is based on the sequence alignment!
Given two sequences of letters, and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence.
(The sequence analysis slides were adapted from Russ Altman’s bioinformatics course, Stanford University)
Align:
THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A SHORT SENTENCE.
THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A SHORT-- SENTENCE
OR
THIS IS A SHORT SENTENCE

Aligning biological sequences
DNA (4 letter alphabet + gap)
TTGACAC
TTTACAC
Proteins (20 letter alphabet + gap)
RKVA--GMAKPNM
RKIAVAAASKPAV

Statement of Problem
Given:
– 2 sequences
– scoring system for evaluating match (or mismatch) of two characters
– penalty function for gaps in sequences
Produce:
– Optimal pairing of sequences that retains the order of characters, introduces gaps where needed, and maximizes total score
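To make the "total score" concrete, the following sketch scores an alignment that has already been written out with gap characters. The scoring values (match +1, mismatch -1, gap -2) and the function name `score_alignment` are assumptions chosen for illustration; the lecture does not fix particular numbers. The two inputs are the DNA and protein pairs from the "Aligning biological sequences" slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum column scores of two already-aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a letter paired with a gap
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


dna_score = score_alignment("TTGACAC", "TTTACAC")
protein_score = score_alignment("RKVA--GMAKPNM", "RKIAVAAASKPAV")
print(dna_score, protein_score)
```

Under these assumed values the DNA pair has six matches and one mismatch, while the protein pair has five matches, six mismatches and two gap columns. The alignment problem is to find, among all possible gap placements, the one that makes this number as large as possible.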

Why align sequences?
Lots of sequences with unknown structure and function. A few sequences with known structure and function.
If they align, they are similar, maybe due to common descent.
If they are similar, then they might have the same structure or function.
If one of them has known structure/function, then alignment to the other yields insight about how the structure or function works.

Alignments have parameters
Exact Matches OK, Inexact Costly, Gaps cheap.
This is a rather longer sentence than the next.
This is a sentence
OR
This is a *rather longer sentence than the next.
This is a s---h----o---rtsentence
Exact Matches OK, Inexact Moderate, Gaps cheap.
This is a rather longer sentence than the next.
This is a --short sentence
Exact Matches cheap, Inexact cheap, Gaps expensive.
This is a rather longer sentence than the next.
This is a short sentence

Multiple alignment
Pairwise alignment (two at a time) is much easier than multiple alignment (N at a time). To be discussed later.
This is a rather longer sentence than the next.
This is a short sentence.
This is the next sentence.
Rather long is the next concept.
Rather longer than what is the next concept.

Multiple Alignment
This is a rather longer sentence than the next
This is a short sentence
This is the next sentence
Rather long is the next concept
Rather longer than what is the next concept-.

Dot Plots To Visualize Sequence Similarity
Put one sequence along the top row of a matrix.
Put the other sequence along the left column of the matrix.
Plot a dot every time there is a match between an element of the row sequence and an element of the column sequence.
Diagonal lines indicate areas of match.
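The recipe above can be sketched in a few lines of Python. This is a text-only illustration: the function name `dot_plot` and the choice of "*" for a match and "." for no match are assumptions for this example, and the input is the DNA pair from the earlier slide.

```python
def dot_plot(top, left):
    """Return one string per row: '*' where the letters match, '.' elsewhere.

    Columns follow the `top` sequence, rows follow the `left` sequence.
    """
    return ["".join("*" if r == c else "." for c in top) for r in left]


plot = dot_plot("TTGACAC", "TTTACAC")
print("  TTGACAC")
for letter, row in zip("TTTACAC", plot):
    print(letter, row)
```

Runs of "*" along a diagonal correspond to stretches where the two sequences agree letter for letter.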

Problems with dot matrices
Rely on visual analysis
Difficult to find optimal alignments
Need scoring schemes more sophisticated than “identical match”
Difficult to estimate significance of alignments

The Dynamic Programming Algorithm
Sequence alignment generally uses an algorithm called dynamic programming.
Dynamic programming allows for fast, optimal alignment of sequences.
Very informally, dynamic programming finds the best path through a dot plot.
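A minimal sketch of dynamic programming for global alignment, in the style of Needleman-Wunsch, is shown below. It is an illustration under stated assumptions rather than the lecture's own implementation: the scores (match +1, mismatch -1) and the linear gap penalty of -2 are arbitrary choices, and the function name is made up for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (best score, aligned a, aligned b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # pair two letters
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("TTGACAC", "TTTACAC")
print(result)
```

The table has one cell per pair of prefixes, so the work grows with the product of the two sequence lengths. The traceback is the "best path" through the matrix mentioned on the slide.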

Substitution Matrices
The degree of match between two letters can be represented in a matrix.
Area of active research
Changing matrix changes alignments
– context-specific matching
– information theoretic interpretation of scores
– modeling evolution with different matrices

A Sample Match Matrix for the amino acids (PAM-250).

Where do matrices come from?
1. Manually align protein structures (or, more risky, sequences)
2. Look for frequency of amino acid substitutions at structurally nearly constant sites.
3. Entry ~ log( freq(observed) / freq(expected) )
   Positive (+): more likely than random
   Zero (0): at random base rate
   Negative (-): less likely than random
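The sign behavior of a log-odds entry can be seen with a few made-up frequencies. In this sketch the base-2 logarithm and the function name `log_odds` are assumptions; published matrices apply their own scaling and rounding, which the slide does not specify.

```python
import math


def log_odds(observed_freq, expected_freq):
    """Log-odds of a substitution: log2(observed / expected)."""
    return math.log2(observed_freq / expected_freq)


# Hypothetical frequencies chosen only to show the three cases.
print(log_odds(0.5, 0.25))    # observed more often than expected: positive
print(log_odds(0.25, 0.25))   # observed at the random base rate: zero
print(log_odds(0.125, 0.5))   # observed less often than expected: negative
```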

Global vs. Local Alignment
Global alignment: find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity.
Local alignment: find the alignment in which the highest scoring subsequences are identified, at the expense of the overall score.

Sequence alignment
Two types of alignment approaches for DNA and protein
– Global alignments
   Originally described with the Needleman-Wunsch algorithm [1970] using dynamic programming
   Optimally finds an alignment between two sequences
   ClustalW is software for global alignment
– Local alignments
   Originally described with the Smith-Waterman algorithm [1981], again using dynamic programming
   Optimally finds subsequence alignments

Uses of sequence alignment
Searching databases
Assembling genome sequences
Annotation transfer: similar sequences imply similar function
Constructing multiple sequence alignments of groups of sequences
Understanding evolutionary relationships between sequences
DNA/RNA hybridization probe design
Peptide and protein identification in proteomic experiments
For more detail on sequence alignment algorithms and their statistics see: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin, Eddy, Krogh and Mitchison (Cambridge University Press)

Global vs. Local Alignment
TTGACACCCTCCCAATTGTA
ACCCCAGGCTTTACACAT
TTGACACCCTCCCAATTGTA---
|||| || |
-----ACCCCAGGCTTTACACAT
TTGACACCCTCCCAATTGTA
|| ||||
ACCCCAGGCTTTACACAT
(from Gribskov/Devereux page 133)

Local alignment asks a different question than global alignment.
NOT: Are these two sequences generally the same? (Global question)
BUT: Do these two sequences contain high scoring subsequences? (Local question)
Local similarities may occur in sequences with different structure or function that share common substructure/subfunction.
MOTIFS…
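The local question can be sketched with a Smith-Waterman style variant of dynamic programming, where cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell. As before, the scoring values and the function name are assumptions for illustration, and the demo sequences are the two from the "Global vs. Local Alignment" slide.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming with a linear gap penalty.

    Returns (best local score, aligned piece of a, aligned piece of b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences from the slide; the result depends on the assumed scores.
print(smith_waterman("TTGACACCCTCCCAATTGTA", "ACCCCAGGCTTTACACAT"))
```

Unlike the global version, the flanking regions that do not match are simply left out of the reported alignment.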

Obtaining a Global Alignment
ClustalW, Enter sequences in FASTA format!
Parameters: penalties for gaps, substitution matrix, etc.

Local Alignment?
We will learn about local alignment tomorrow.

Summary
Biological sequences are stored in databases with annotations relevant to their function.
The primary sequence database is GenBank, hosted at the National Institutes of Health.
The providers of these databases have come up with database IDs that we should use when writing papers about genes, transcripts and proteins.
Sequence analysis allows us to compare two sequences through sequence alignments, which come in two flavors: local and global.