What is bioinformatics?
What are bioinformaticians up to, actually?
Manage molecular biological data
–Store in databases, organise, formalise, describe...
Compare molecular biological data
Find patterns in molecular biological data
–phylogenies
–correlations (sequence / structure / expression / function / disease)
Goals:
characterise biological patterns & processes
predict biological properties
–low level data ⇒ high level properties (e.g., sequence ⇒ function)
Bioinformatics: neighbour disciplines
Computational biology
–Broader concept: includes computational ecology, physiology, neurology etc...
-omics:
–Genomics
–Transcriptomics
–Proteomics
Systems biology
–Putting it all together...
–Building models, identifying control & regulation
Bioinformatics: prerequisites
Bio- side:
–Molecular biology
–Cell biology
–Genetics
–Evolutionary theory
-informatics side:
–Computer science
–Statistics
–Theoretical physics
Molecular biology data...
DNA sequences
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
Molecular biology data...
Amino acid sequences
Protein structure:
–X-ray crystallography
–NMR
Cell biology & proteomics data... Subcellular localization
Cell biology & proteomics data... protein-protein interactions
Transcriptomics: DNA microarray technology
Proteomics & transcriptomics data
Proteins encoded by periodically expressed genes: a functionally diverse protein category
Cell cycle regulation
Multiprotein complexes in yeast (Saccharomyces cerevisiae).
Coloured components are periodically expressed (dynamic).
White components are always expressed (static).
Complexes contain both kinds of components (just-in-time assembly).
Reference: de Lichtenberg U, Jensen LJ, Brunak S, Bork P. Dynamic complex formation during the yeast cell cycle. Science, Feb 4; 307(5710).
Cell cycle regulation 2
Transcriptional regulation is poorly conserved!
Dynamic components overlap (figure).
Reference: Lars Juhl Jensen, Thomas Skøt Jensen, Ulrik de Lichtenberg, Søren Brunak and Peer Bork, Nature, Oct 5, 2006.
Phenotype data: human diseases
Prediction methods
Homology / Alignment
Simple pattern (“word”) recognition
Statistical methods
–Weight matrices: calculate amino acid probabilities
–Other examples: Regression, variance analysis, clustering
Machine learning
–Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
–Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
Combinations
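Simple pattern (“word”) recognition
Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence (“KDEL signal”).
Pattern: [KRHQSA]-[DENQ]-E-L>
–Square brackets list the residues allowed at a position; “>” anchors the pattern to the C-terminus
NB: only yes/no answers!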
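As a minimal sketch, the PS00014 pattern can be written as a regular expression. The function name and the example sequences are made up for illustration; only the pattern itself comes from the slide.

```python
import re

# PROSITE pattern [KRHQSA]-[DENQ]-E-L> as a regular expression.
# "\Z" plays the role of PROSITE's ">" (the match must end at the C-terminus).
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL\Z")

def has_er_signal(sequence):
    """Return True if the sequence ends with a match to the pattern."""
    return ER_TARGET.search(sequence.upper()) is not None

# Hypothetical sequences, not real proteins.
print(has_er_signal("MKTLLVAAKDEL"))  # motif at the C-terminus
print(has_er_signal("MKDELTLLVAAG"))  # motif present, but not at the C-terminus
```

The result is a plain True/False: the pattern cannot say how good or how close a match is, which is the limitation noted on the slide.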
Statistical methods
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Estimate probabilities for nucleotides / amino acids
Information content in sequences; logos; Position-Weight Matrices.
Quantitative answers.
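The step from aligned peptides to a weight matrix can be sketched as follows. This is an illustration under stated assumptions, not the method used in the course: the pseudocount of 1, the uniform background of 1/20 and all function names are choices made here, and no sequence weighting or small-sample correction is applied.

```python
import math

PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(peptides, pseudocount=1.0):
    """Estimate amino acid probabilities per position (pseudocount is an assumption)."""
    freqs = []
    for pos in range(len(peptides[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for peptide in peptides:
            counts[peptide[pos]] += 1
        total = sum(counts.values())
        freqs.append({aa: c / total for aa, c in counts.items()})
    return freqs

def information_content(column):
    """Shannon information in bits: log2(20) minus the entropy of the column."""
    return math.log2(len(ALPHABET)) + sum(p * math.log2(p) for p in column.values())

def weight_matrix(freqs, background=1 / 20):
    """Log-odds scores against a uniform background (an assumption)."""
    return [{aa: math.log2(p / background) for aa, p in col.items()} for col in freqs]

def score(peptide, matrix):
    """Sum the position-specific weights: a quantitative answer, not yes/no."""
    return sum(col[aa] for col, aa in zip(matrix, peptide))

freqs = position_frequencies(PEPTIDES)
matrix = weight_matrix(freqs)
for pos, col in enumerate(freqs, start=1):
    print(f"position {pos}: {information_content(col):.2f} bits")
print(f"ALAKAAAAV scores {score('ALAKAAAAV', matrix):.2f}")
print(f"WWWWWWWWW scores {score('WWWWWWWWW', matrix):.2f}")
```

The per-position information values are what a sequence logo draws as column heights.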
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Annotations on the alignment:
–Conserved column: matching anything but G => large negative score
–Non-conserved column: anything can match
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Unlike Position-Weight Matrices, sequence profiles are able to handle gaps
HMM (Hidden Markov Models)
TMHMM: predicting the topology of α-helical transmembrane proteins
Which parts of the sequence are
–in the cytoplasm?
–embedded in the membrane (TM-helices)?
–outside?
Rule-of-thumb says amino acids in the TM helices are hydrophobic – but there are exceptions
TMHMM predicts topology from sequence using a statistical model
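The hydrophobicity rule-of-thumb alone can be sketched as a sliding-window average. This is not TMHMM and not a Hidden Markov Model; it only illustrates the simple rule that TMHMM improves on. The scale is the Kyte-Doolittle hydropathy index; the window of 19 residues and the threshold of 1.6 are commonly quoted values assumed here, and the example sequences are made up.

```python
# Kyte-Doolittle hydropathy index (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows hydrophobic enough to be TM candidates."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, average in enumerate(profile) if average > threshold]

# Hypothetical sequence: polar ends around a 19-residue hydrophobic stretch.
sequence = "MKKSDE" + "LLIVALLAVILGALIVFLL" + "RKDEESK"
print(candidate_tm_windows(sequence))
```

A rule like this flags hydrophobic stretches but says nothing about which side of the membrane the loops are on, and it misses helices that break the rule.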
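Neural Networks: “Architecture”
[Figure: a window (IKEEHVIIQAE) slides along the sequence IKEEHVIIQAE FYLNPDQSGEF….. and feeds the input layer, which is connected through weights to a hidden layer and an output layer; output labels shown: H, E, C]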
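How a sequence window becomes the input layer can be sketched with a sparse (one-hot) encoding. The window length of 11 matches the window shown on the slide; the encoding scheme and function names are assumptions for illustration, not necessarily what any particular predictor uses.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(residue):
    """Sparse encoding: one input neuron per amino acid, exactly one switched on."""
    return [1.0 if residue == aa else 0.0 for aa in ALPHABET]

def encode_windows(sequence, window=11):
    """Return (window_text, input_vector) for every full window in the sequence."""
    encoded = []
    for start in range(len(sequence) - window + 1):
        text = sequence[start:start + window]
        vector = [value for residue in text for value in one_hot(residue)]
        encoded.append((text, vector))
    return encoded

windows = encode_windows("IKEEHVIIQAEFYLNPDQSGEF")
first_text, first_vector = windows[0]
print(first_text, len(first_vector), int(sum(first_vector)))
# prints: IKEEHVIIQAE 220 11
```

Each window therefore gives 11 × 20 = 220 input values, and the network is asked for one prediction per window.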
Neural Networks: advantages
Neural networks can learn higher order correlations!
–What does this mean?
A A => 0
A C => 1
C A => 1
C C => 0
No linear function can learn this pattern
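The pattern above can be checked directly in a small sketch. It assumes the encoding A = 0, C = 1, a step activation and hand-picked weights, all chosen here for illustration. A search over a grid of weights finds no single linear unit that reproduces the pattern, while a network with two hidden units does.

```python
from itertools import product

PATTERN = {("A", "A"): 0, ("A", "C"): 1, ("C", "A"): 1, ("C", "C"): 0}
ENCODE = {"A": 0, "C": 1}  # assumed numeric encoding

def step(z):
    return 1 if z > 0 else 0

def linear_unit(x1, x2, w1, w2, b):
    """A single linear threshold unit."""
    return step(w1 * x1 + w2 * x2 + b)

def two_layer_network(x1, x2):
    """Two hidden units (OR and AND) combined as 'OR but not AND'."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

def solves_pattern(predict):
    return all(predict(ENCODE[a], ENCODE[b]) == target
               for (a, b), target in PATTERN.items())

# Try every weight combination on a grid from -2 to 2 in steps of 0.25.
grid = [i / 4 for i in range(-8, 9)]
linear_solutions = [
    (w1, w2, b) for w1, w2, b in product(grid, repeat=3)
    if solves_pattern(lambda x1, x2: linear_unit(x1, x2, w1, w2, b))
]

print("linear units that solve it:", len(linear_solutions))
print("two-layer network solves it:", solves_pattern(two_layer_network))
```

The grid search is only a demonstration; the general argument is that the four conditions on w1, w2 and b contradict each other, so no choice of weights works for a single unit.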
How to estimate parameters for prediction?
Model selection
Linear Regression
Quadratic Regression
Join-the-dots
The test set method
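A minimal sketch of the test set method, applied to three candidate models: linear regression, quadratic regression and join-the-dots. The synthetic data, the split (every third point held out) and all function names are assumptions made for illustration.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the basis 1, x, x^2, ...
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    coeffs = [rows[i][n] / rows[i][i] for i in range(n)]
    return lambda x: sum(c * x ** k for k, c in enumerate(coeffs))

def fit_join_the_dots(xs, ys):
    """Piecewise-linear interpolation through every training point."""
    points = sorted(zip(xs, ys))
    def predict(x):
        if x <= points[0][0]:
            return points[0][1]
        if x >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= x <= x1:
                t = (x - x0) / (x1 - x0)
                return y0 * (1 - t) + y1 * t
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: a quadratic trend plus noise (made up for this sketch).
rng = random.Random(0)
xs = [i / 2 for i in range(30)]
ys = [1 + 2 * x - 0.3 * x * x + rng.gauss(0, 1) for x in xs]

# Hold out every third point as the test set; train on the rest.
train_x = [x for i, x in enumerate(xs) if i % 3 != 0]
train_y = [y for i, y in enumerate(ys) if i % 3 != 0]
test_x = [x for i, x in enumerate(xs) if i % 3 == 0]
test_y = [y for i, y in enumerate(ys) if i % 3 == 0]

models = {
    "linear": fit_polynomial(train_x, train_y, 1),
    "quadratic": fit_polynomial(train_x, train_y, 2),
    "join-the-dots": fit_join_the_dots(train_x, train_y),
}
results = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
           for name, m in models.items()}
for name, (train_error, test_error) in results.items():
    print(f"{name:14s} train MSE {train_error:.3f}   test MSE {test_error:.3f}")
```

Join-the-dots always reaches zero error on the training points, so training error cannot be used to choose between the models; the comparison that matters is the error on the held-out points.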
Problem: sequences are related
If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
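One way to guard against this is to drop test sequences that are too similar to anything in the training set. The sketch below assumes equal-length peptides, position-by-position identity and an arbitrary cut-off of 80%; real homology reduction normally works on alignments, and the split of the peptides into training and candidate sets is made up here.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares peptides of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def reduce_homology(train, candidates, max_identity=0.8):
    """Keep only candidates that are not too similar to any training peptide."""
    return [c for c in candidates
            if all(identity(c, t) < max_identity for t in train)]

train = ["ALAKAAAAM", "GMNERPILT", "GILGFVFTM"]
candidates = ["ALAKAAAAN", "ALAKAAAAR", "TLNAWVKVV", "KLNEPVLLL"]
test_set = reduce_homology(train, candidates)
print(test_set)  # ['TLNAWVKVV', 'KLNEPVLLL']
```

ALAKAAAAN and ALAKAAAAR differ from the training peptide ALAKAAAAM at a single position, so scoring well on them would say little about performance on genuinely new sequences.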
Protein sorting in eukaryotes
Proteins belong in different organelles of the cell, and some even have their function outside the cell
In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Protein sorting: secretory pathway / ER
Secretory proteins have a signal peptide
Initially, they are transported across the ER membrane
Signal peptides
A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.
Signal peptides differ between proteins, and can be hard to recognize.
SignalP
Method for prediction of signal peptides from the amino acid sequence
Based on Neural Networks
The first SignalP paper (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997) was cited 3144 times in a 10-year period