Bioinformatics The application of computer science to biological data Tony C Smith Department of Computer Science University of Waikato


The essence is prediction …
My dog is very littl_ ?
– We know that letters do not occur in English at random (e.g. ‘t’ is more common than ‘x’)
– We know that context changes the probability of a letter (e.g. ‘x’ is more common than ‘t’ after the sequence “I eat Weet-Bi_”)
Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression)
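The idea that context changes the probability of the next letter can be sketched with a small character-level model. This is only an illustration, not the method used in the talk: the function names, the context length `order` and the tiny training text are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(text, order=4):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, context, order=4):
    """Return the most frequent follower of the last `order` characters, or None."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text, just large enough to contain the word "little".
corpus = "my dog is very little. the little dog is very old. a little cat sat."
model = train(corpus)
print(predict(model, "My dog is very littl"))
```

A longer context makes predictions sharper but needs far more training text before each context has been seen often enough to count.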

Prediction in bioinformatics
– Predicting the location of genes in DNA
– Predicting gene roles in an organism
– Predicting errors in a genetic transcription
– Predicting the function of proteins
– Predicting diseases from molecular samples
Anything that involves “making a judgment”: a yes/no decision about whether some sample datum ‘does’ or ‘does not’ have some property.

Representation
W e e t – B i x …
… to the computer, everything is binary!

A A C G T C A T T C G A T G A T T C G A
Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.
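Both kinds of sequence end up as bits, which a short sketch can make concrete. The ASCII rendering is standard; the 2-bit nucleotide code is a hypothetical assignment chosen for illustration, since the slides do not specify one.

```python
def text_to_bits(text):
    """Render each character as the 8 bits of its ASCII code."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

# Hypothetical 2-bit code for nucleotides; any fixed assignment would do.
NUCLEOTIDE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def dna_to_bits(sequence):
    """Render each nucleotide as its 2-bit code."""
    return " ".join(NUCLEOTIDE_BITS[base] for base in sequence.upper())

print(text_to_bits("Weet-Bix"))
print(dna_to_bits("AACGTCATTCGATGATTCGA"))
```

With only four symbols, two bits per nucleotide suffice, whereas each English letter here takes eight.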

A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagc

A genetic prediction problem
[The same sequence, repeated many times over to fill the slide]

A genetic prediction problem
– A gene encodes a protein
– It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism

A genetic prediction problem
[Diagram labels: encoding region, untranslated region, transcription factor, RNA]

A genetic prediction problem
Untranslated region: ttgcaatcggcgctacgcttcaaaatttattatattcccggc

A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
– What transcription factors bind to this gene?
– Where is the transcription factor binding site?

A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: A binding site is often a short general pattern, e.g. CCGATNATCGG (where N stands for any nucleotide)

A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: The patterns are often reverse complements, e.g.
CCGATNATCGG
GGCTANTAGCC
(the second strand is the complement of the first; read right to left, it spells the same pattern)
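The two clues so far, a short pattern with a wildcard and reverse-complement symmetry, are straightforward to check in code. The following is a minimal sketch, assuming N matches any nucleotide; the function names are made up for the example and it is not the search method used in the talk.

```python
# Complement table for the four nucleotides plus the wildcard N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def matches_at(seq, pattern, start):
    """True if `pattern` matches `seq` at `start`, with N matching any base."""
    window = seq[start:start + len(pattern)]
    return len(window) == len(pattern) and all(
        p == "N" or p == s for p, s in zip(pattern, window)
    )

def find_sites(seq, pattern):
    """Return every start index at which the pattern occurs."""
    seq, pattern = seq.upper(), pattern.upper()
    return [i for i in range(len(seq) - len(pattern) + 1)
            if matches_at(seq, pattern, i)]

pattern = "CCGATNATCGG"
print(pattern.translate(COMPLEMENT))   # the complementary strand
print(reverse_complement(pattern))     # the pattern read from the other strand
print(find_sites("ttccgatcatcggaa", pattern))
```

Because the example pattern is its own reverse complement, a site found on one strand is automatically a site on the other, so only one strand needs scanning.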

A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: Where there is one binding site, there is often another nearby.
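The "another nearby" clue can be turned into a simple filter over candidate site positions. This sketch assumes the positions have already been found by some pattern search; the name `nearby_pairs` and the `max_gap` threshold are hypothetical, since the slides do not say how near "nearby" is.

```python
def nearby_pairs(positions, max_gap):
    """Return all pairs of site positions that lie within `max_gap` of each other."""
    ordered = sorted(positions)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            if right - left > max_gap:
                break  # positions are sorted, so later ones are even farther away
            pairs.append((left, right))
    return pairs

# Made-up candidate positions, for illustration only.
print(nearby_pairs([5, 40, 12, 300], max_gap=10))
```

Candidates that appear in no pair could then be ranked lower than those with a neighbour.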

A genetic prediction problem
All of these properties are the kinds of things that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.

Proteomics
Three consecutive nucleotides in the coding region form a ‘codon’, i.e. encode an amino acid. A string of amino acids makes a protein.
3 nucleotides, 4 possibilities each: 4^3 = 64 possible codons
But there are only 20 amino acids!
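The count of 64 can be confirmed by enumerating every three-letter combination of the four nucleotides. A minimal sketch; the variable names are chosen for the example.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triple of nucleotides is one possible codon.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]

print(len(codons))
print(codons[:4])
```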

Proteomics
Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG
There is quite a bit of redundancy in codons.
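The redundancy shows up as a many-to-one lookup from codon to amino acid. The sketch below uses only the three amino acids listed on the slide, so it is deliberately incomplete: any other codon is reported as "?". The names `translate` and `AMINO_ACID_BY_CODON` are assumptions for the example.

```python
# Partial codon table: only the entries given on the slide.
CODONS_BY_AMINO_ACID = {
    "Glycine": ["GGA", "GGC", "GGG", "GGT"],
    "Tyrosine": ["TAT", "TAC"],
    "Methionine": ["ATG"],
}

# Invert the table so that each codon points at its amino acid.
AMINO_ACID_BY_CODON = {
    codon: amino_acid
    for amino_acid, codon_list in CODONS_BY_AMINO_ACID.items()
    for codon in codon_list
}

def translate(dna):
    """Read the sequence three bases at a time; unknown codons become '?'."""
    dna = dna.upper()
    return [AMINO_ACID_BY_CODON.get(dna[i:i + 3], "?")
            for i in range(0, len(dna) - 2, 3)]

print(translate("ATGGGATACGGT"))
```

Two different DNA strings can therefore encode the same protein, which matters when predicting function from sequence.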

Amino Acid
[Diagram labels: amino group, carboxyl group, R group]

Amino Acid
[Structures shown: glycine, tyrosine]

Signal peptide
– A relatively short sequence of amino acid residues at the N-terminus of the nascent protein (typically … residues)
MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL …
– Cleaved off as the protein passes through the membrane (operates like a pass key)
– Knowing the signal peptide helps determine protein function in the organism

Local biases in residues around the cleavage site
Sequence regularities can be exploited by statistical and pattern-based models

Existing solutions
– Partial alignments (Altschul & Gish, 1996)
– Neural networks (Nielsen et al., 1997)
– Hidden Markov models (Nielsen et al., 1999)
– Polypeptide probabilities (Chou, 2001)
– Maximum entropy (Clote, 2002)

SignalP (Nielsen et al., )
HMMs (or NNs) used to predict cleavage point

– Existing methods all perform reasonably well and with about the same accuracy (90% eukaryotes, 87% Gram-negative, 85% Gram-positive)
– They do not offer a transparent explanatory framework as to the underlying biology
– Many other learning algorithms do! (WEKA data mining tools, Waikato University)

From sequences to text
Primary sequence data has many similarities with text
– Amino residues (letters)
– Polypeptides (words)
– Secondary structures (phrases/sentences)

From sequences to text
Primary sequence data looks like text
– Amino residues (letters)
– Polypeptides (words)
– Secondary structures (phrases/sentences)
– Tertiary structure (whole documents)
Approach: transform a sequence into a set of pseudo-text documents

Approach
– Problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not
– Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desirable)
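The two-class framing means every residue in a protein becomes one labelled example, with exactly one positive per sequence. A minimal sketch of that labelling step follows; the function name, the toy sequence and the chosen index are all hypothetical and do not describe any real protein's cleavage site.

```python
def label_residues(sequence, first_mature_index):
    """Pair each residue with True if it is the first residue of the mature protein."""
    return [(index, residue, index == first_mature_index)
            for index, residue in enumerate(sequence)]

# Toy sequence and cleavage position, invented for illustration.
examples = label_residues("MKLLAARQ", first_mature_index=6)
for index, residue, is_first in examples:
    print(index, residue, is_first)
```

One consequence of this framing is heavy class imbalance: a sequence of n residues yields one positive and n - 1 negatives.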

Properties of amino acids

Free facts about amino acids

Residue as a document
E.g. Cysteine (Cys, C): aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes], etc. (whatever else the experimenter wants to include)

Sample document
PRNUM:1. AANUM:21.
AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.
MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
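A document of this shape can be generated mechanically from a sequence and a position. The sketch below emits only the AMINO tokens and the four n-gram tokens; the per-residue property tokens (ALIPH, HPHOB and so on) would need a property table that is not reproduced here. The reading of 2GRAM/3GRAM as the residues just before position 0 and GRAM2/GRAM3 as those just after it is inferred from the sample, and the function name and `window` parameter are assumptions.

```python
def residue_document(sequence, position, window=8):
    """Build pseudo-text tokens describing the residue at `position` and its context."""
    tokens = []
    # One AMINO token per offset in the window that falls inside the sequence.
    for offset in range(-window, window + 1):
        index = position + offset
        if 0 <= index < len(sequence):
            tokens.append(f"AMINO[{offset}]:{sequence[index]}.")
    # N-gram tokens: residues immediately before and immediately after position 0.
    for size in (2, 3):
        before = sequence[max(0, position - size):position]
        after = sequence[position + 1:position + 1 + size]
        if len(before) == size:
            tokens.append(f"{size}GRAM:{before}.")
        if len(after) == size:
            tokens.append(f"GRAM{size}:{after}.")
    return tokens

# The 17 residues of the sample document, offsets -8 to +8.
sample_window = "LLFATCIARHQQRQQQQ"
print(" ".join(residue_document(sample_window, position=8)))
```

Because each fact is just a token, adding a new kind of evidence means appending more tokens rather than redesigning a feature vector.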

Demo

Concluding remarks
– A [pseudo] text classification approach to sequence prediction problems can perform as well as the state-of-the-art stochastic methods
– Allows miscellaneous facts (i.e. any textual description of relevant information) to be included
– A ranked list of features from the text classifier provides insights into the underlying biology
– Features could be used for text generation
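One simple way a ranked list of features could be produced is to score each token by how much more often it appears in positive documents than in negative ones. The slides do not say which classifier or ranking was used, so the smoothed log-odds score below is a swapped-in illustration, and the four toy documents are invented.

```python
import math
from collections import Counter

def rank_features(positive_docs, negative_docs):
    """Rank tokens by smoothed log-odds of appearing in positive vs negative documents."""
    pos = Counter(token for doc in positive_docs for token in set(doc))
    neg = Counter(token for doc in negative_docs for token in set(doc))
    scores = {}
    for token in set(pos) | set(neg):
        # Add-one smoothing keeps the ratio finite when a token is absent from one class.
        p = (pos[token] + 1) / (len(positive_docs) + 2)
        q = (neg[token] + 1) / (len(negative_docs) + 2)
        scores[token] = math.log(p / q)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented toy documents: two cleavage-site residues and two non-sites.
positive = [["AMINO[-1]:A.", "AMINO[-3]:A."], ["AMINO[-1]:A.", "AMINO[-3]:V."]]
negative = [["AMINO[-1]:L.", "AMINO[-3]:A."], ["AMINO[-1]:L.", "AMINO[-3]:L."]]

for token, score in rank_features(positive, negative):
    print(f"{score:+.2f}  {token}")
```

Tokens at the top of such a list are readable statements about residues and positions, which is what makes the ranking interpretable.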

Biotechnology
– Biologists know proteins, computer scientists know machine learning
– Together, they can find out a lot of hidden information about genes and proteins
– Biotechnology is a multi-billion dollar industry
– Biotechnology is one of the best funded areas of scientific research

The University of Waikato
– Waikato University is the centre of the universe for machine learning
– The Machine Learning Group is a large, globally active, well-funded research group
– The WEKA workbench of ML tools is used around the world
– Professors at Waikato University literally wrote the book on sequence modeling

The University of Waikato
If you’re seriously interested in machine learning, in getting involved in bioinformatics research, or indeed any other area along the leading edge of computer science, then university is the only place to be, and Waikato wants You!