Genetic Algorithms Select Protein Features Most Predictive of Enzyme Function
Andrew Kernytsky, Burkhard Rost
Columbia University

2 Enzyme function prediction
Given a protein sequence, predict its Enzyme Commission (EC) number.
Top-level EC classes: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, Ligases.
Reference: NC-IUBMB (1992) Recommendations of the International Union of Biochemistry on the Nomenclature and Classification of Enzymes. In: Enzyme Nomenclature. Academic Press, New York.
EC wheel figure: Porter CT, Bartlett GJ, Thornton JM. Nucleic Acids Res. 2004 January 1; 32: D129–D133.
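To make the prediction target concrete, here is a minimal sketch of how a four-level EC number could be represented and truncated to a given level. The function names (`parse_ec`, `top_level_class`, `ec_prefix`) are hypothetical helpers for illustration, not part of the presented method.

```python
# Top-level EC classes in their standard numbering (1 to 6).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(text):
    """Parse an EC number such as '3.4.21.4' into a tuple of four levels.

    A '-' marks an unspecified level and is stored as None.
    """
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {text!r}")
    return tuple(None if p == "-" else int(p) for p in parts)


def top_level_class(text):
    """Return the name of the top-level class for an EC number."""
    first = parse_ec(text)[0]
    return EC_CLASSES[first]


def ec_prefix(text, level):
    """Truncate an EC number to its first `level` levels (1 to 4)."""
    if not 1 <= level <= 4:
        raise ValueError("level must be between 1 and 4")
    return ".".join(text.strip().split(".")[:level])


if __name__ == "__main__":
    print(parse_ec("3.4.21.4"))
    print(top_level_class("3.4.21.4"))
    print(ec_prefix("3.4.21.4", 2))
```

Truncating to a prefix is one simple way to pose the prediction task separately at each EC level.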

3 Intersection properties capture local information
Per-residue property tracks:
  AA      TAGHCVNYDYGAGCQSGSPV
  Acc     bbbbbieeeiibbieeeeee
  Cons    ..|....|......||....
  Feat 4  HHHEEEEELLEEEEELLLLL
  Feat 5  iiibbbbbbboooobbbbbb
  Feat 6  36788842100000000123
[Figure labels: 20%, 10%, 5%, 1%, 0.1%, 0.01%; "All Global", "All Intersection"]
All global: limited local information.
All intersection: significant risk of overfitting during training (10^3+ features > 10^2 positive samples).
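The difference between a global feature and an intersection feature can be sketched with the AA and Acc tracks from this slide. This is an illustration under assumptions: it treats a feature as the fraction of residues in each state (global) or in each combination of states at the same position (intersection). The slides do not specify the exact encoding, and the function names are hypothetical.

```python
from collections import Counter

# Per-residue tracks copied from the slide.
AA = "TAGHCVNYDYGAGCQSGSPV"
ACC = "bbbbbieeeiibbieeeeee"


def global_feature(track):
    """Fraction of residues in each state of a single property track."""
    n = len(track)
    return {state: count / n for state, count in Counter(track).items()}


def intersection_feature(*tracks):
    """Fraction of residues in each combination of states across tracks.

    Combining states position by position keeps local information that
    separate global compositions lose, at the cost of many more features.
    """
    if len({len(t) for t in tracks}) != 1:
        raise ValueError("all tracks must have the same length")
    n = len(tracks[0])
    return {combo: count / n for combo, count in Counter(zip(*tracks)).items()}


if __name__ == "__main__":
    print(global_feature(AA)["G"])                     # share of glycine
    print(intersection_feature(AA, ACC)[("G", "b")])   # share of buried glycine
    print(len(intersection_feature(AA, ACC)))          # observed combinations
```

The number of possible combinations grows multiplicatively with each added track, which is why fully intersected feature sets can outnumber the positive training samples.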

4 Algorithm overview
[Figure: genetic algorithm selecting feature classes]
Input: protein sequence (example: MSNLLKDFEVAQC), encoded by feature classes such as AA, sec, and AA×sec.
Genome: one combination of feature classes, drawn from all intersection and global feature classes.
Fitness: each genome is scored by an inner learning algorithm (SVM or neural network); example fitness values shown: 0.635, 0.688, 0.677.
GA evolution: fitness assessed, then selection, crossover, and mutation produce the next generation's genome population (1st, 2nd, 3rd, 4th generation populations).
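The loop in this overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: the genome is assumed to be a bit mask over feature classes, `toy_fitness` is a hypothetical stand-in for the score of the inner learner (SVM or neural network), and tournament selection, one-point crossover, and elitism are assumed choices the slides do not specify.

```python
import random

# Hypothetical feature classes; a genome switches each one on (1) or off (0).
FEATURE_CLASSES = ["AA", "sec", "acc", "AA×sec", "acc×sec"]


def toy_fitness(genome):
    """Stand-in for the inner learner's score (assumption, not real data).

    Rewards some classes more than others and penalises genome size,
    mimicking the overfitting cost of carrying too many features.
    """
    weights = [0.30, 0.20, 0.10, 0.25, 0.05]
    gain = sum(w for w, bit in zip(weights, genome) if bit)
    return gain - 0.04 * sum(genome)


def select(population, scores, rng, k=2):
    """Tournament selection: best of k randomly chosen genomes."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def crossover(a, b, rng):
    """One-point crossover of two parent genomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(genome, rng, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in genome)


def evolve(fitness, n_classes, pop_size=12, generations=10, seed=0):
    """Run the GA; return the best genome and the best score per generation."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in range(n_classes))
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        scores = [fitness(g) for g in population]          # fitness assessed
        best = population[max(range(pop_size), key=lambda i: scores[i])]
        history.append(fitness(best))
        children = [best]                                  # elitism (assumed)
        while len(children) < pop_size:
            parent_a = select(population, scores, rng)     # selection
            parent_b = select(population, scores, rng)
            child = crossover(parent_a, parent_b, rng)     # crossover
            children.append(mutate(child, rng))            # mutation
        population = children
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve(toy_fitness, len(FEATURE_CLASSES))
    chosen = [name for name, bit in zip(FEATURE_CLASSES, best) if bit]
    print(chosen, history)
```

In a real setting the fitness call would train and validate the inner learner on the features a genome selects, which dominates the run time; that is one reason to keep populations small.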

5 GA improves performance
[Figure; axis label: EC level]

6 Balance between intersection and global features gives best performance
Feature sets shown (one per line; × marks an intersection of properties):
  AA, acc, sec, htm, cons-95
  AA, acc, sec, cons-95
  AA, acc, acc×sec, htm, cons-95
  AA, sec, cons-97
  AA, acc×sec, sec, cons-95
  AA, acc, acc×sec×cons-94, sec
  AA, AA×acc×sec×cons-95, sec, cons-95
  AA, sec
  AA, acc, sec×cons-94, cons-83×cons-94
  AA, acc×sec×cons-89, cons-95
  AA, acc×sec×cons-84×cons-94, sec
  AA×acc×htm×cons-84×cons-95, acc, cons-94
  AA
  AA, acc×cons-96, sec×cons-91
  AA, acc×sec×cons-94, acc×cons-94
  AA, acc×sec×cons-84×cons-94
  AA, acc×sec×cons-88×cons-91×cons-95
  AA×cons-94, acc×cons-94
  AA×cons-82, acc×sec×cons-94
  AA×cons-82, acc×sec×cons-94×cons-96
  AA×sec×htm×cons-95×cons-96
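One way to quantify the balance between global and intersection classes in a feature set is sketched below. It assumes the notation used on this slide (classes separated by commas, intersected properties joined by ×); the function names and the "fraction of intersection classes" measure are illustrative assumptions, not a metric taken from the slides.

```python
def parse_feature_set(text):
    """Split 'AA, acc×sec, cons-95' into classes; each class is a tuple of properties."""
    return [
        tuple(prop.strip() for prop in cls.split("×"))
        for cls in text.split(",")
        if cls.strip()
    ]


def intersection_fraction(feature_set):
    """Share of classes that intersect two or more properties."""
    if not feature_set:
        return 0.0
    n_intersections = sum(1 for cls in feature_set if len(cls) > 1)
    return n_intersections / len(feature_set)


if __name__ == "__main__":
    for line in ["AA, acc, sec, htm, cons-95",
                 "AA, acc×sec×cons-94, sec",
                 "AA×sec×htm×cons-95×cons-96"]:
        print(line, "->", intersection_fraction(parse_feature_set(line)))
```

A value of 0 corresponds to an all-global set and a value of 1 to an all-intersection set; the sets listed on this slide span that range.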

