Biological sequence analysis and information processing by artificial neural networks
Søren Brunak
Center for Biological Sequence Analysis, Technical University of Denmark
Pairwise alignment
>carp Cyprinus carpio growth hormone 210 aa vs.
>chicken Gallus gallus growth hormone 216 aa
scoring matrix: BLOSUM50, gap penalties: -12/
% identity; Global alignment score:

carp    MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
        ::. :...:.:. : :.. :: :::.:.:::: :::...::..::..:.:.:: :.
chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE

carp    YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
        : ::.:::..:..:..:::.:. ::.:: : : ::..:.:. :.... ::: ::. ::..:.. :.:.
chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G

carp    DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
        .: :.. :...:. :... ::.:::::.:::::::.:.:::.::::.
chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
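As a rough illustration of how a global alignment score like the one on this slide can be computed, here is a minimal Needleman-Wunsch sketch. It is not the program that produced the slide. It assumes a simple match/mismatch score instead of BLOSUM50 and a single linear gap penalty, and the names `global_align` and `percent_identity` are hypothetical. Percent identity is taken here as identical columns over alignment length, which is only one of several definitions in use.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Assumption: simple match/mismatch scores stand in for a substitution
    matrix such as BLOSUM50. Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical (non-gap) columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

Calling `global_align` on the two growth hormone sequences and passing the aligned strings to `percent_identity` would give figures of the same kind as on the slide, though not the same values, since the scoring scheme differs.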
Biological neuron
Diversity of interactions in a network enables complex calculations
Similar in biological and artificial systems
Excitatory (+) and inhibitory (-) relations between compute units
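A small sketch of how mixing excitatory and inhibitory connections lets a network compute something a single unit cannot. The hand-set weights below are an illustrative assumption, not taken from the slides. Two hidden threshold units (OR and AND) feed an output unit with one positive and one negative weight, which yields XOR.

```python
def step(x):
    """Threshold activation: the unit fires (1) when its net input is positive."""
    return 1 if x > 0 else 0


def unit(inputs, weights, bias):
    """One compute unit: weighted sum of the inputs followed by a threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    """Two-layer network with hand-set (hypothetical) weights computing XOR."""
    h_or = unit([x1, x2], [1.0, 1.0], -0.5)   # fires if at least one input is on
    h_and = unit([x1, x2], [1.0, 1.0], -1.5)  # fires only if both inputs are on
    # Excitatory (+1) connection from OR, inhibitory (-1) connection from AND.
    return unit([h_or, h_and], [1.0, -1.0], -0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_network(a, b))
```

The inhibitory weight is what switches the output off again when both inputs are active.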
Transfer of biological principles to neural network algorithms
Non-linear relation between input and output
Massively parallel information processing
Data-driven construction of algorithms
Ability to generalize to new data items
Simplest non-trivial classification problem
CNHSYYP, HIETRRA, NWQSADY, NQYSEPR, WHITRCA, DYHSANY,...
Two categories: positives and negatives
Data described by two features, e.g. charge, sidechain volume, molecular weight, number of atoms,...
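To make the two-feature description concrete, the sketch below maps each peptide to a (charge, molecular weight) pair. Several things here are assumptions: the charge rule (K and R count +1, D and E count -1, histidine treated as neutral), the use of approximate average residue masses, and the function names.

```python
# Approximate average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02


def net_charge(peptide):
    """Crude net charge: +1 for K/R, -1 for D/E (assumed rule, His ignored)."""
    return sum(1 for r in peptide if r in "KR") - sum(1 for r in peptide if r in "DE")


def molecular_weight(peptide):
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER_MASS


def features(peptide):
    """Describe a peptide by two numbers, as on the slide."""
    return (net_charge(peptide), molecular_weight(peptide))


if __name__ == "__main__":
    for p in ["CNHSYYP", "HIETRRA", "NWQSADY", "NQYSEPR", "WHITRCA", "DYHSANY"]:
        print(p, features(p))
```

Each peptide then becomes a point in a two-dimensional feature space, which is the input a two-category classifier would work with.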
Features of phosphorylation sites
PKG: cGMP-dep. kinase
PKC
CaM-II: Ca++/calmodulin-dep. kinase
cdc2: Cyclin-dep. kinase 2
CK-II: Casein kinase 2
Homotypical cerebral cortex (from primate): 6 layers
DEMO
Training and error reduction [figure labels: negative, positive]
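One way to picture training as error reduction is a single sigmoid unit whose weights are adjusted by gradient descent on the squared error. This is a minimal sketch, not the procedure used in the talk. The toy data points, the learning rate, the number of epochs and the names `train` and `predict` are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Smooth non-linear activation mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(points, labels, epochs=200, rate=0.5):
    """Online gradient descent for one sigmoid unit with two inputs.

    Returns the final (w1, w2, bias) and the summed squared error per epoch.
    """
    w1 = w2 = b = 0.0
    errors = []
    for _ in range(epochs):
        total = 0.0
        for (x1, x2), target in zip(points, labels):
            y = sigmoid(w1 * x1 + w2 * x2 + b)
            total += 0.5 * (y - target) ** 2
            # Derivative of the squared error with respect to the net input.
            delta = (y - target) * y * (1.0 - y)
            w1 -= rate * delta * x1
            w2 -= rate * delta * x2
            b -= rate * delta
        errors.append(total)
    return (w1, w2, b), errors


def predict(weights, point):
    """Classify as positive (1) when the unit's output exceeds 0.5."""
    w1, w2, b = weights
    return 1 if sigmoid(w1 * point[0] + w2 * point[1] + b) > 0.5 else 0


# Hypothetical two-feature data: four positives and four negatives.
POINTS = [(2.0, 1.5), (1.5, 2.0), (2.5, 2.5), (1.8, 2.2),
          (-1.0, -0.5), (-1.5, -2.0), (-0.5, -1.5), (-2.0, -1.0)]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    weights, errors = train(POINTS, LABELS)
    print("first epoch error:", errors[0])
    print("last epoch error:", errors[-1])
```

Plotting `errors` against the epoch number gives the kind of error curve such a training run produces.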
Transfer of biological principles to neural network algorithms
Non-linear relation between input and output
Massively parallel information processing
Data-driven construction of algorithms
Sparse encoding of amino acid sequence windows
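A minimal sketch of sparse (one-hot) encoding for an amino acid window: each residue becomes 20 binary inputs with exactly one set to 1. The alphabet ordering and the function name `sparse_encode` are assumptions. Any fixed ordering works as long as it is used consistently.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(window, alphabet=AMINO_ACIDS):
    """Concatenate one binary vector per residue; one bit is set in each."""
    vector = []
    for residue in window:
        bits = [0] * len(alphabet)
        bits[alphabet.index(residue)] = 1  # raises ValueError for unknown letters
        vector.extend(bits)
    return vector


if __name__ == "__main__":
    encoded = sparse_encode("CNHSYYP")
    print(len(encoded), sum(encoded))  # 7 residues x 20 letters, 7 bits set
```

A window of 7 residues therefore becomes 140 network inputs, of which only 7 are non-zero, hence the term sparse.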
Sparse encoding of nucleotide sequence windows
Nucleotides: 4 letter alphabet
Normally no need for a fifth letter
ACGTAGGCAATCTCAGACGTTTATC
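The same idea for nucleotides can be sketched as follows: slide a fixed-size window along the sequence and encode each base with 4 binary inputs. The window size of 5 and the function names are assumptions chosen for illustration.

```python
NUCLEOTIDES = "ACGT"


def sliding_windows(sequence, size):
    """All overlapping windows of the given size, moving one base at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def encode_nucleotides(window):
    """Sparse encoding: 4 binary inputs per base, exactly one of them set."""
    vector = []
    for base in window:
        vector.extend(1 if base == letter else 0 for letter in NUCLEOTIDES)
    return vector


if __name__ == "__main__":
    sequence = "ACGTAGGCAATCTCAGACGTTTATC"
    for window in sliding_windows(sequence, 5):
        print(window, encode_nucleotides(window))
```

With a window of 5 bases, each example presented to the network has 20 inputs.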