Protein Secondary Structure Prediction Using BLAST and Relaxed Threshold Rule Induction from Coverings. Leong Lee, Missouri University of Science and Technology, USA.



RT-RICO - Rule Induction From Coverings

RT-RICO is based on a previously implemented method called RICO (Rule Induction from Coverings) (Maglia, Leopold and Ghatti, 2004). RICO uses some of the concepts introduced by Pawlak (1984) for rough sets, a classification scheme based on partitions of entities in a data set (Grzymala-Busse, 1991). In this approach, if S is a set of attributes and R is a set of decision attributes, then a covering P of R in S can be found if the following three conditions are satisfied:
i. P is a subset of S.
ii. R depends on P (i.e., P determines R). That is, if a pair of entities x and y cannot be distinguished by means of attributes from P, then x and y also cannot be distinguished by means of attributes from R. Such entities x and y are said to be indiscernible by P (and, hence, by R), denoted x ~P y. The indiscernibility relation ~P induces a partition over all entities in the data set.
iii. P is minimal.
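The partition machinery above can be sketched in a few lines of Python. The table values below are hypothetical, chosen only so that the resulting partitions match those quoted later for Table II (the actual Table II is not reproduced in this excerpt):

```python
from collections import defaultdict

def partition(table, attrs):
    """Compute the partition generated by a set of attributes: entities
    that agree on every attribute in `attrs` fall into the same block
    (the indiscernibility relation ~P)."""
    blocks = defaultdict(set)
    for entity, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(entity)
    return list(blocks.values())

# Hypothetical table: columns 0, 1 are attributes 1, 2; column 2 is
# decision attribute 3. Rows are entities x1..x6 (0-indexed below).
table = [
    ("L", "D", "H"),   # x1
    ("L", "D", "H"),   # x2
    ("M", "E", "C"),   # x3
    ("N", "E", "S"),   # x4
    ("M", "F", "C"),   # x5
    ("M", "F", "C"),   # x6
]
print(partition(table, [1]))   # {2}* -> [{0, 1}, {2, 3}, {4, 5}]
print(partition(table, [2]))   # {3}* -> [{0, 1}, {2, 4, 5}, {3}]
```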

RT-RICO - Rule Induction From Coverings

Condition (ii) is true if and only if an equivalent condition, known as the attribute dependency inequality, holds for P* and R*, the partitions of all attributes and decisions generated by P and R, respectively, where, for a set of attributes A:

A* = ∧_{a ∈ A} [a]*

(the product of the partitions [a]* generated by the individual attributes a in A). The inequality P* ≤ R* holds if and only if for each block B of P*, there exists a block B′ of R* such that B is a subset of B′. Once a covering is found, it is a straightforward process to induce rules from it. For example, if a set of attributes P = {a1, a2} is found to determine a set of attributes R = {a3} (i.e., P is a covering for R), then rules of the form (a1, v1) ∧ (a2, v2) → (a3, v3) can be generated, where v1, v2, and v3 are actual values of attributes a1, a2, and a3, respectively.
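A minimal sketch of this standard attribute dependency inequality check (every block of P* must be contained in some block of R*); the partitions below are hypothetical examples, not taken from the paper's tables:

```python
def leq(p_star, r_star):
    """Attribute dependency inequality P* <= R*: every block of P*
    is a subset of some block of R*."""
    return all(any(B <= Bp for Bp in r_star) for B in p_star)

# Hypothetical partitions over entities 1..6.
p_star = [{1, 2}, {3}, {4}, {5, 6}]
r_star = [{1, 2}, {3, 5, 6}, {4}]
print(leq(p_star, r_star))   # True: every finer block fits in a coarser one
print(leq(r_star, p_star))   # False: {3, 5, 6} fits inside no block of p_star
```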

RT-RICO - Relaxed Attribute Dependency Inequality

All rules generated from coverings in this manner are perfect in the sense that there is no instance in the data set for which the rule is not true. In order to relax this restriction, the attribute dependency inequality is redefined as follows.

Definition 1: Relaxed Attribute Dependency Inequality
The inequality P* ≤r R* holds if and only if there exists a block B of P* and there exists a block B′ of R* such that B is a subset of B′.

As an example, for the data set of Table II, let P = {2} and R = {3}. Then
{2}* = {{x1, x2}, {x3, x4}, {x5, x6}}
{3}* = {{x1, x2}, {x3, x5, x6}, {x4}}
There exists a block B = {x1, x2} in {2}* and a block B′ = {x1, x2} in {3}* such that B ⊆ B′. Thus, {2}* ≤r {3}*, which means that {3} depends on {2} (i.e., {2} →r {3}) for at least some values of {2}. More specific rules can then be deduced from this relationship, such as (2, D) → (3, H).
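The relaxed inequality weakens "every block" to "some block": a single contained block suffices. A sketch using the {2}* and {3}* partitions quoted above (entities written as integers 1–6):

```python
def leq_relaxed(p_star, r_star):
    """Relaxed inequality P* <=_r R*: SOME block of P* is a subset of
    some block of R* (contrast with requiring it for every block)."""
    return any(B <= Bp for B in p_star for Bp in r_star)

two_star = [{1, 2}, {3, 4}, {5, 6}]      # {2}* from the slide
three_star = [{1, 2}, {3, 5, 6}, {4}]    # {3}* from the slide
print(leq_relaxed(two_star, three_star))  # True: B = {1, 2} works
# The standard (non-relaxed) inequality fails here, since {3, 4} is
# contained in no block of {3}*; only the relaxed dependency holds.
```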

RT-RICO - Relaxed Coverings

Similarly, we can relax the definition of a covering in order to induce rules that depend on as small a number of attributes as possible.

Definition 2: Relaxed Coverings
A subset P of the set S is called a relaxed covering of R in S if and only if P →r R and P is minimal in S (no proper subset P′ of P exists such that P′ →r R).

For the data set of Table II, suppose we want to induce rules for R = {3}. The covering {1, 2} can be used; that is, for any assignment of values for the covering {1, 2}, each entity in Table II will induce a rule for {3}. But instead of inducing a rule from combinations of values for {1, 2}, such as (1, L) ∧ (2, D) → (3, H), we will induce rules based on values for only {1} or {2}. Thus, (2, D) → (3, H) will be generated as a rule, since {2} →r {3} and {2} is minimal in {1, 2}. In this manner, {2} is a relaxed covering of {3}.

RT-RICO - Checking Attribute Dependency

The concept of checking attribute dependency was introduced by Grzymala-Busse (1991). In order for P to be a relaxed covering of R in S, the following conditions must be true:
i. P must be a subset of S,
ii. R must depend on the set P (for some values of P), and
iii. P must be minimal.
For our specific application, generating rules for protein secondary structure prediction, rules involving more attributes are preferred over rules involving fewer attributes, because they normally yield higher confidence values. In addition, we need all the possible attribute position combinations. As a result, condition (iii) is not enforced for rule generation in our implementation.

RT-RICO - Checking Attribute Dependency

Condition (ii) is true if and only if the relaxed attribute dependency inequality, P* ≤r R*, is satisfied. For each set P, a new partition of the universe U, generated by P, must be determined. For partitions π1 and π2 of U, the product π1 · π2 is a partition of U such that two entities, x and y, are in the same block of π1 · π2 if and only if x and y are in the same block for both partitions π1 and π2 of U. For example, referring to Table III,
{1}* = {{x1, x2, x5, x6}, {x3, x4}}
{2}* = {{x1, x2, x4, x5}, {x3, x6}}
{1}* · {2}* = {{x1, x2, x5}, {x3}, {x4}, {x6}}
That is, two entities x1 and x2 are in the same block of {1}* · {2}* if and only if x1 and x2 are in the same block of {1}* and in the same block of {2}*. Further, the relaxed covering of {3} is {1, 2}, because {1}* · {2}* ≤r {3}*, and {1, 2} is minimal since neither {1}* ≤r {3}* nor {2}* ≤r {3}* is true.
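The partition product can be computed as the set of nonempty pairwise intersections of blocks. A sketch using the Table III partitions quoted above:

```python
def product(pi1, pi2):
    """Product of two partitions of U: the nonempty pairwise
    intersections of their blocks. Entities x and y share a block of
    the product iff they share a block in both pi1 and pi2."""
    return [b1 & b2 for b1 in pi1 for b2 in pi2 if b1 & b2]

pi1 = [{1, 2, 5, 6}, {3, 4}]    # {1}* from Table III
pi2 = [{1, 2, 4, 5}, {3, 6}]    # {2}* from Table III
print(product(pi1, pi2))        # blocks {1, 2, 5}, {6}, {4}, {3}
```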

RT-RICO - Finding the Set of All Relaxed Coverings

The algorithm R-RICO (Relaxed Rule Induction from Coverings) can be used to find the set C of all relaxed coverings of R in S (and the rules). Let S be the set of all attributes, and let R be the set of all decision attributes. Let k be a positive integer. The set of all subsets of cardinality k of the set S is denoted Pk = {{x_i1, x_i2, …, x_ik} | x_i1, x_i2, …, x_ik ∈ S}. Condition (iii) for a relaxed covering is not enforced. The time complexity is exponential in |S|, the number of attributes.

Algorithm 1: R-RICO
begin
  for each attribute x in S do compute [x]*;
  compute partition R*;
  k := 1;
  while k ≤ |S| do
    for each set P in Pk do
      if (∧_{x ∈ P} [x]* ≤r R*) then
        begin
          find the attribute values from the first block B of P* and from the first block B′ of R* such that B ⊆ B′;
          add rule to output file;
        end
    k := k + 1;
  end-while
end-algorithm.
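The R-RICO loop can be sketched in Python. This is an illustrative reimplementation, not the authors' code; the table is hypothetical data consistent with the partitions quoted for Table II, and, as in the slides, minimality (condition iii) is not enforced, so rules over supersets of a covering are also emitted:

```python
from collections import defaultdict
from itertools import combinations

def partition(table, attrs):
    """Partition entity indices by their values on the given attributes."""
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def r_rico(table, cond_attrs, dec_attr):
    """Sketch of the R-RICO loop: for each subset P of condition
    attributes (of growing cardinality k), emit a 100%-correct rule
    for every block of P* wholly contained in a block of R*."""
    r_star = partition(table, [dec_attr])
    rules = []
    for k in range(1, len(cond_attrs) + 1):
        for P in combinations(cond_attrs, k):
            for B in partition(table, P):
                for Bp in r_star:
                    if B <= Bp:                 # relaxed inequality holds
                        e = next(iter(B))       # any entity of the block
                        lhs = tuple((a, table[e][a]) for a in P)
                        rules.append((lhs, (dec_attr, table[e][dec_attr])))
    return rules

# Hypothetical data consistent with the Table II partitions quoted earlier.
table = [
    ("L", "D", "H"), ("L", "D", "H"), ("M", "E", "C"),
    ("N", "E", "S"), ("M", "F", "C"), ("M", "F", "C"),
]
rules = r_rico(table, [0, 1], 2)
print((((1, "D"),), (2, "H")) in rules)   # True: the rule (2, D) -> (3, H)
```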

RT-RICO - RT-RICO Algorithm

The R-RICO algorithm produces rules that are 100% correct. However, unlike decision tree induction, R-RICO produces a more comprehensive rule set. The algorithm can be further modified to satisfy some particular level of uncertainty in the rules (e.g., a rule that is 50% true). To accommodate this information in the rules, the definition of the attribute dependency inequality must be further modified, as in Definition 3.

RT-RICO - Relaxed Attribute Dependency Inequality with Threshold

Definition 3: Relaxed Attribute Dependency Inequality with Threshold
Set R depends on a set P with threshold probability t (0 < t ≤ 1), denoted P →r,t R, if and only if P* ≤r,t R*, that is, there exists a block B of P* and there exists a block B′ of R* such that (|B ∩ B′| / |B|) ≥ t.

It can be observed that, when t = 1, Definitions 1 and 3 represent the same mathematical relation. As an example, for the data set of Table IV, let P = {1, 2}, R = {3}, and t = 0.6. Then we have the following partitions:
{1}* = {{x1, x6}, {x2, x3, x4, x5}}
{2}* = {{x1, x2, x3, x4, x5, x6}}
P* = {1, 2}* = {1}* · {2}* = {{x1, x6}, {x2, x3, x4, x5}}
R* = {3}* = {{x1, x5}, {x2, x3, x4, x6}}
There exists a block B = {x2, x3, x4, x5} in {1, 2}* and a block B′ = {x2, x3, x4, x6} in {3}* such that (|B ∩ B′| / |B|) = |{x2, x3, x4}| / |{x2, x3, x4, x5}| = 0.75 ≥ 0.6. Thus, P* = {1, 2}* ≤r,t R* = {3}*, and {3} depends on {1, 2} (i.e., {1, 2} →r,t {3}) with threshold probability 0.6. Since B ∩ B′ = {x2, x3, x4}, the rule (1, C) ∧ (2, A) → (3, H) is generated with a probability (confidence) of 75%.
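The threshold test of Definition 3 is a short computation. The partitions below are the {1, 2}* and {3}* quoted above for Table IV (entities written as integers 1–6):

```python
def leq_relaxed_t(p_star, r_star, t):
    """Definition 3: P* <=_{r,t} R* iff some block B of P* and some
    block B' of R* satisfy |B ∩ B'| / |B| >= t."""
    return any(len(B & Bp) / len(B) >= t
               for B in p_star for Bp in r_star)

P_star = [{1, 6}, {2, 3, 4, 5}]    # {1,2}* from Table IV
R_star = [{1, 5}, {2, 3, 4, 6}]    # {3}* from Table IV
best = max(len(B & Bp) / len(B) for B in P_star for Bp in R_star)
print(leq_relaxed_t(P_star, R_star, 0.6), best)   # True 0.75
print(leq_relaxed_t(P_star, R_star, 0.8))         # False: 0.75 < 0.8
```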

RT-RICO - Relaxed Coverings with Threshold Probability

The definition of relaxed coverings must also be modified to incorporate the notion of the threshold probability, as in Definition 4.

Definition 4: Relaxed Coverings with Threshold Probability
Let S be a nonempty subset of the set of all attributes, and let R be a nonempty subset of the decision attributes, where S and R are disjoint. A subset P of the set S is called a relaxed covering of R in S with threshold probability t (0 < t ≤ 1) if and only if P →r,t R and P is minimal in S.

RT-RICO - Algorithm RT-RICO

Algorithm RT-RICO (Relaxed Threshold Rule Induction From Coverings) finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes and R is the set of all decision attributes. The set of all subsets of cardinality k of the set S is denoted Pk = {{x_i1, x_i2, …, x_ik} | x_i1, x_i2, …, x_ik ∈ S}.

Algorithm 2: RT-RICO
begin
  for each attribute x in S do compute [x]*;
  compute partition R*;
  k := 1;
  while k ≤ |S| do
    for each set P in Pk do
      if (∧_{x ∈ P} [x]* ≤r,t R*) then
        begin
          find the values of the attributes from the entities in the region (B ∩ B′) such that (|B ∩ B′| / |B|) ≥ t;
          add rule to output file;
        end
    k := k + 1;
  end-while
end-algorithm.
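Putting the pieces together, the RT-RICO loop can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code; the table is hypothetical data consistent with the Table IV partitions quoted earlier, and minimality is not enforced, matching the slides:

```python
from collections import defaultdict
from itertools import combinations

def partition(table, attrs):
    """Partition entity indices by their values on the given attributes."""
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def rt_rico(table, cond_attrs, dec_attr, t):
    """Sketch of the RT-RICO loop: for every subset P of condition
    attributes, emit a rule whenever a block B of P* overlaps a block
    B' of the decision partition with |B ∩ B'| / |B| >= t; that ratio
    is the rule's confidence."""
    r_star = partition(table, [dec_attr])
    rules = []
    for k in range(1, len(cond_attrs) + 1):
        for P in combinations(cond_attrs, k):
            for B in partition(table, P):
                for Bp in r_star:
                    conf = len(B & Bp) / len(B)
                    if conf >= t:
                        e = next(iter(B & Bp))  # entity in the region B ∩ B'
                        lhs = tuple((a, table[e][a]) for a in P)
                        rules.append((lhs, (dec_attr, table[e][dec_attr]), conf))
    return rules

# Hypothetical data consistent with the Table IV partitions quoted earlier.
table = [
    ("G", "A", "E"),   # x1
    ("C", "A", "H"),   # x2
    ("C", "A", "H"),   # x3
    ("C", "A", "H"),   # x4
    ("C", "A", "E"),   # x5
    ("G", "A", "H"),   # x6
]
rules = rt_rico(table, [0, 1], 2, 0.6)
# Includes (1, C) and (2, A) -> (3, H) with confidence 0.75.
print((((0, "C"), (1, "A")), (2, "H"), 0.75) in rules)   # True
```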

RT-RICO

Note that the condition "P is minimal in S" of a relaxed covering with threshold probability is not enforced in the RT-RICO algorithm. The reason is the same as for the R-RICO algorithm: to generate rules for protein secondary structure prediction, rules involving more attributes are preferred over rules involving fewer attributes, because they normally yield higher confidence values. Also, we need all the possible attribute position combinations.

RT-RICO

The time complexity of the RT-RICO algorithm is again exponential in |S|, the number of attributes in the data set (for our training data sets, |S| = 5). Because 2^|S| is smaller than the number of 5-residue segments in our training data set, the time complexity of the RT-RICO algorithm for our training data set is polynomial in the number of 5-residue segments. The rules generated by the RT-RICO algorithm are then compared with the proteins in the test data set to predict the secondary structure elements.