Predict Protein Sequence by Fuzzy-Association Rules. Student Name: Shasha Luo. Instructor: Dr. Zhang. Course Name: CI (csc8710). Georgia State University, Fall 2003.
Introduction; Bioinformatics; Fuzzy-association rules system; How to predict the sequence? (Algorithm); Prediction results; Conclusion and future work.
Bioinformatics: Create and maintain databases of biological information. Analyze and interpret various types of data, including amino acid sequences, protein domains, and protein structures… Develop and implement tools to efficiently access and manage different types of information.
Protein Sequence Matching: Develop methods to predict the structure and discover proteins and structural RNA sequences. Cluster protein sequences into families of related sequences and develop protein models. Align similar proteins and generate phylogenetic trees to examine evolutionary relationships. Find the genes in DNA sequences of various organisms.
Fuzzy-Association Rules System. Fuzzy logic: Mamdani fuzzy control model; if …, then … rules; define membership functions, rule bases, and linguistic values; defuzzification. Association rules: min-support value; min-confidence value; Apriori algorithm.
How to Predict the Type of Protein Sequences? Given a large protein sequence file with 7 types of protein, divide it into two databases in which each sequence has five amino elements and a type element. Protein sequence: ABCCEFG. Sequence type: BCEGHST. A new fuzzy-association system: apply association rule mining (Apriori algorithm) to one of the databases to generate the rules for prediction. Rule example: 3 <- A B D (10.0%, 5.0%). Use the other database as an input file, passing it through the fuzzy system to predict which type each sequence is.
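A minimal Python sketch of the rule generation step, under assumptions the slides do not state: support is taken as the fraction of all records that contain the itemset and have the given type, and confidence as the fraction of records containing the itemset that have that type. The function name `mine_rules` and the record layout are hypothetical. For brevity the sketch counts itemsets by brute force rather than using Apriori's level-wise pruning, and like Apriori it ignores element order and repetition.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, min_support, min_confidence):
    """Mine rules of the form `type <- itemset`.

    records: list of (sequence, type_label) pairs; each sequence is a
    string of five amino elements. Thresholds are fractions in [0, 1].
    Returns a list of (type_label, itemset, support, confidence).
    """
    n = len(records)
    itemset_counts = Counter()  # records containing the itemset
    joint_counts = Counter()    # records containing the itemset with this type
    for sequence, label in records:
        items = sorted(set(sequence))  # order and repetition are ignored
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                itemset_counts[itemset] += 1
                joint_counts[(label, itemset)] += 1

    rules = []
    for (label, itemset), joint in joint_counts.items():
        support = joint / n                          # assumed definition
        confidence = joint / itemset_counts[itemset]  # assumed definition
        if support >= min_support and confidence >= min_confidence:
            rules.append((label, itemset, support, confidence))
    return rules


# Toy data, not taken from the protein file described in the slides.
records = [("ABCDE", 3), ("ABDFG", 3), ("ABDCC", 1), ("FGHIK", 2)]
rules = mine_rules(records, min_support=0.5, min_confidence=0.6)
```

On this toy data every surviving rule predicts type 3 from a subset of A, B, D, which mirrors the shape of the rule example 3 <- A B D.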
System Structure [Diagram. Labels: association rules; protein sequence to be predicted; three inputs; fuzzy systems 1 to 7; output; compare outputs; which type of protein sequence?]
Algorithm (Cont.) After generating the rules, input the sequences to be predicted and compare them with the rules to see if there are matches. How to compare? Compare with the rules of one type at a time (one pass, t). Repeat seven times, once per type (total, T). Within each pass t there are various rules. We divide them into at most five categories, rules with 1, 2, 3, 4, or 5 elements, denoted type1, type2, type3, type4, type5.
Algorithm (Cont.) First, compare with the type1 rules; denote this comparison C1. Denote the other comparisons C2, C3…C7. If there are several matches, choose the one with the highest support and confidence values and save it as an input to the sub-fuzzy system. Repeat this step three times (in our case). Example: ABCDE => 3 <- ABD (10.0%, 5.0%). Repeat the last step seven times to get all the inputs for the fuzzy systems. Now we have seven different outputs from the seven fuzzy systems. Compare them and choose the largest: the type with the largest output is the predicted type, because it has the largest probability.
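The matching step can be sketched in Python as follows. This is an illustration under assumptions: a rule matches when all of its elements occur in the sequence regardless of position (consistent with ABCDE matching 3 <- ABD), "highest support and confidence" is read as ranking by support first and confidence second, and a category with no match contributes (0.0, 0.0). The function `best_matches` and the rule tuple layout are hypothetical.

```python
def best_matches(sequence, rules, rule_type, max_size=3):
    """Pick the best matching rule per category for one sequence type.

    rules: list of (type_label, itemset, support, confidence), with
    support and confidence in percent as on the slides.
    Returns {rule size: (support, confidence)} for sizes 1..max_size.
    """
    elements = set(sequence)
    # Assumption: an unmatched category yields (0.0, 0.0).
    best = {size: (0.0, 0.0) for size in range(1, max_size + 1)}
    for label, itemset, support, confidence in rules:
        size = len(itemset)
        if label != rule_type or size > max_size:
            continue
        if set(itemset) <= elements:  # every rule element is in the sequence
            # Assumption: rank by support, then by confidence.
            best[size] = max(best[size], (support, confidence))
    return best


# Toy rules; only the first one comes from the slide's example.
toy_rules = [
    (3, "ABD", 10.0, 5.0),
    (3, "AB", 8.0, 4.0),
    (3, "AF", 20.0, 9.0),   # does not match ABCDE (no F)
    (2, "CE", 6.0, 30.0),
]
inputs_type3 = best_matches("ABCDE", toy_rules, rule_type=3)
inputs_type2 = best_matches("ABCDE", toy_rules, rule_type=2)
```

Calling `best_matches` once per type gives the three (support, confidence) pairs that feed that type's fuzzy subsystem.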
How to Define the Fuzzy Logic System? Each subsystem has three inputs and one output. Each input, in [0, 40], = 25% * Supp + 75% * Conf. Output in [0, 45]. Rule bases: all subsystems use the same rule base. The input of C1 has 15% priority. The input of C2 has 25% priority. The input of C3 has 60% priority. 27 rules for each fuzzy system (three inputs with three linguistic values each). Output and inputs take the linguistic values low, mid, high.
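A rough Python sketch of one such subsystem, to make the numbers concrete. Only the input formula, the ranges [0, 40] and [0, 45], the three linguistic values, the 27-rule count, and the 15/25/60 priorities come from the slide. The triangular membership functions, the way priorities set each rule's consequent (rounded weighted average of input levels), and the centroid defuzzification are assumptions made for illustration, and all function names are hypothetical.

```python
from itertools import product

LEVELS = ("low", "mid", "high")
WEIGHTS = (0.15, 0.25, 0.60)  # priorities of C1, C2, C3 from the slide


def input_score(support, confidence):
    """25% * Supp + 75% * Conf, clamped to the input range [0, 40]."""
    return min(40.0, max(0.0, 0.25 * support + 0.75 * confidence))


def tri(x, left, peak, right):
    """Triangular membership; a shoulder when peak equals left or right."""
    if x < left or x > right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def memberships(x, upper):
    """Assumed low/mid/high sets spread evenly over [0, upper]."""
    half = upper / 2
    return {
        "low": tri(x, 0.0, 0.0, half),
        "mid": tri(x, 0.0, half, upper),
        "high": tri(x, half, upper, upper),
    }


def build_rule_base():
    """27 rules, one per combination of input levels (3 ** 3).

    Assumption: the consequent is the priority-weighted average of the
    input level indices, rounded to the nearest level.
    """
    rules = {}
    for combo in product(range(3), repeat=3):
        weighted = sum(w * level for w, level in zip(WEIGHTS, combo))
        rules[tuple(LEVELS[i] for i in combo)] = LEVELS[int(weighted + 0.5)]
    return rules


def infer(inputs, rules, steps=450):
    """Mamdani inference: min for AND, max aggregation, centroid on [0, 45]."""
    degrees = [memberships(x, 40.0) for x in inputs]
    strength = {level: 0.0 for level in LEVELS}
    for antecedent, consequent in rules.items():
        firing = min(d[level] for d, level in zip(degrees, antecedent))
        strength[consequent] = max(strength[consequent], firing)
    numerator = denominator = 0.0
    for i in range(steps + 1):
        y = 45.0 * i / steps
        out = memberships(y, 45.0)
        mu = max(min(strength[level], out[level]) for level in LEVELS)
        numerator += y * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0


rule_base = build_rule_base()
# Hypothetical (support %, confidence %) pairs for C1, C2, C3 of two types.
candidates = {
    2: [(5.0, 25.0), (6.0, 26.0), (4.0, 33.0)],
    3: [(10.0, 5.0), (8.0, 4.0), (10.0, 5.0)],
}
outputs = {
    t: infer([input_score(s, c) for s, c in pairs], rule_base)
    for t, pairs in candidates.items()
}
predicted = max(outputs, key=outputs.get)  # type with the largest output
```

With seven types, `candidates` would hold seven entries and the final `max` would implement the "choose the largest output" step.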
Prediction Results: Use 11259 protein sequences to generate the association rules. Use 11259 protein sequences as the input to be predicted. For sequence types 2, 3, 5, 6, 7: Min-support = 5% and Min-conf = 5%. For sequence types 1, 4: Min-support = 1% and Min-conf = 1%.
Prediction Results (Cont.)
Association rules    Support and confidence values
2 <- C S             (4.5%, 25.5%)
2 <- C G             (5.7%, 26.0%)
2 <- P V G           (4.1%, 33.3%)
2 <- P L G           (4.1%, 31.4%)
2 <- N L G           (4.3%, 22.8%)
Prediction Results (Cont.)
Conclusion and Future Work. Conclusion: the prediction results are not very accurate. Future work: because the Apriori algorithm does not consider the order or repetition of the data, our association rules are not very accurate. Design a compatible variant of the Apriori algorithm for the rule search.