Discovering Robust Knowledge from Databases that Change. Chun-Nan Hsu (Arizona State University), Craig A. Knoblock (University of Southern California). Journal.

Presentation transcript:

1 Discovering Robust Knowledge from Databases that Change. Chun-Nan Hsu (Arizona State University), Craig A. Knoblock (University of Southern California). Journal of Data Mining and Knowledge Discovery, Volume 2, 1998. Presenter: Tri Tran, CS331, Spring 2006.

2 Outline: Introduction; Preliminaries; Robustness of Knowledge; Applying Robustness in Knowledge Discovery; Experimental Results; Conclusions.

3 Introduction: Motivation. Many applications of knowledge discovery and data mining require the discovered knowledge to be consistent with the data. Databases usually change over time, which can make machine-discovered knowledge inconsistent. Useful knowledge should be robust against database changes, so that it is unlikely to become inconsistent after the database changes. This work defines a notion of robustness in the context of relational databases and describes how robustness can be estimated and applied in knowledge discovery.

4 Introduction: Definition of Robustness. Why do we need robustness? It is important to know whether discovered knowledge is robust against database changes. Robustness is a measure of the quality of discovered knowledge. General definition: the robustness of discovered knowledge is the probability that the knowledge is consistent with a database state.

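5 Introduction: Applications of Robustness Estimation. Robustness estimation helps minimize the maintenance cost of inconsistent rules in the presence of database changes. When an inconsistent rule is detected, a system may either remove the rule, which means invoking the discovery system repeatedly, or repair the rule, which means a non-robust rule gets repaired frequently as the data changes. Robustness estimation also applies to rule discovery for semantic query optimization (SQO). Rule: all Maltese seaports have railroad access. Query: find all Maltese seaports with railroad access and 2,000 sq. ft of storage. Given the rule, the railroad-access condition in the query is redundant and can be dropped, but only as long as the rule remains consistent with the database.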

6 Applications of Robustness Estimation (cont.). Robustness estimation can guide the data mining and evaluation of semantic rules in two stages. Data mining stage: a rule pruning approach prunes antecedents of a discovered rule to increase its robustness. Evaluation stage: rules whose estimated robustness falls below a given threshold are eliminated. Robustness estimation also applies to rule maintenance, in a way similar to the rule pruning approach: the estimated robustness of the partially repaired rules is used to search for the best sequence of repair operators, so that the repaired rule is more robust than the original one.

7 Introduction: Robustness Comparison. Comparison with support count: the support count of an association rule expresses the probability that a data instance satisfies the rule, whereas robustness expresses the probability that an entire database state is consistent with the rule. Comparison with predictive accuracy: the predictive accuracy of a classification rule measures the probability that the knowledge is consistent with randomly selected unseen data, rather than with database states. The difference is significant in databases that are interpreted using the closed-world assumption (CWA).

8 Introduction: Closed-world Assumption. Information that is not explicitly present in the database is taken to be false. Closed-world databases are widely used because of the characteristics of application domains and the limitations of representation systems. Databases change through deletions, insertions, and updates, and in a closed-world database any of these may affect the validity of a rule. Closed-world data tends to be dynamic, so it is important for knowledge discovery systems to handle dynamic, closed-world data. An instance of closed-world data usually represents a dynamic state of the world, e.g., an instance of employee information in a personnel database.

9 Example of a relational database
Schema:
  ship_class(class_name, ship_type, max_draft, length, container_cap)
  ship(ship_name, ship_class, status, fleet, year, built)
  geoloc(name, glc_cd, country, latitude, longitude)
  seaport(name, glc_code, storage, rail, road, anch_offshore)
  wharf(wharf_id, glc_code, depth, length, crane_qty)
Rules:
  R1: The latitude of a Maltese geographic location is greater than or equal to 35.89.
    geoloc(_,_,?country,?latitude,_) ^ ?country = "Malta" ⇒ ?latitude >= 35.89
  R2: All Maltese geographic locations are seaports.
    geoloc(_,?glc_cd,?country,_,_) ^ ?country = "Malta" ⇒ seaport(_,?glc_cd,_,_,_,_)

10 Outline: Introduction; Preliminaries; Robustness of Knowledge; Applying Robustness in Knowledge Discovery; Experimental Results; Conclusions.

11 Preliminaries. Relational database: a set of relations. A relation contains a set of instances (tuples) of attribute-value vectors. Two types of literals: a database literal is defined on database relations (e.g., seaport(_,?glc_cd,?storage,_,_,_)); a built-in literal is defined on built-in relations (e.g., ?latitude > 35.89). Two types of rules: a range rule has a positive built-in literal as its consequent (e.g., R1); a relational rule has a database literal as its consequent (e.g., R2).

12 Preliminaries (cont.). Database state at time t: the collection of the instances present in the database at time t. Consistent rule: a rule is consistent with a given database state if all variable instantiations that satisfy the antecedents of the rule also satisfy its consequent.
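The consistency check can be made concrete with a small sketch. The code below is only an illustration, not the authors' implementation: the tuples, the helper names (`r1_consistent`, `r2_consistent`, `delete_seaport`) and the dictionary-based representation of a database state are all assumptions introduced here. It tests rules R1 and R2 from slide 9 against a toy state and shows how a single deletion in a closed-world database invalidates R2.

```python
# Hypothetical toy database state: each relation is a list of dicts.
# The tuples are made up for illustration; only the attribute names
# follow the schema on slide 9.
state = {
    "geoloc": [
        {"name": "Valletta", "glc_cd": 1, "country": "Malta", "latitude": 35.90},
        {"name": "Marsaxlokk", "glc_cd": 2, "country": "Malta", "latitude": 35.89},
        {"name": "Tunis", "glc_cd": 3, "country": "Tunisia", "latitude": 36.80},
    ],
    "seaport": [
        {"name": "Valletta", "glc_code": 1},
        {"name": "Marsaxlokk", "glc_code": 2},
    ],
}


def r1_consistent(state, bound=35.89):
    """Range rule R1: every Maltese geoloc tuple has latitude >= bound."""
    return all(
        g["latitude"] >= bound
        for g in state["geoloc"]
        if g["country"] == "Malta"
    )


def r2_consistent(state):
    """Relational rule R2: every Maltese geoloc has a matching seaport tuple."""
    seaport_codes = {s["glc_code"] for s in state["seaport"]}
    return all(
        g["glc_cd"] in seaport_codes
        for g in state["geoloc"]
        if g["country"] == "Malta"
    )


def delete_seaport(state, glc_code):
    """Return a new state with the seaport tuple for glc_code deleted."""
    new_state = dict(state)
    new_state["seaport"] = [
        s for s in state["seaport"] if s["glc_code"] != glc_code
    ]
    return new_state


print(r1_consistent(state), r2_consistent(state))
print(r2_consistent(delete_seaport(state, 2)))
```

Under the closed-world assumption, the deleted seaport tuple is taken to be false, so the second Maltese location no longer satisfies the consequent of R2 and the rule becomes inconsistent.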

13 Outline: Introduction; Preliminaries; Robustness of Knowledge; Applying Robustness in Knowledge Discovery; Experimental Results; Conclusions.

14 Definitions of Robustness. Robust rule: a rule that does not become inconsistent (invalid) after database changes. Definition 1 (robustness for all states): given a rule r, let D be the event that a database is in a state that is consistent with r. The robustness of r is Pr(D).

15 Definitions of Robustness (cont.). Intuitively, a rule is robust if it is unlikely that transactions will invalidate it. Definition 2 (robustness for accessible states): given a rule r and a database in a state d in which r is consistent, let t denote the event of performing a transaction on d that results in a new database state inconsistent with r. The robustness of r in the states accessible from the current state d is defined as Robust(r|d) = Pr(¬t|d) = 1 - Pr(t|d).

16 Definitions of Robustness (cont.). Observation: if all transactions are equally probable, then (equation missing from the transcript). The robustness of a rule can differ between database states. E.g., suppose a database has two states d1 and d2. To reach a state inconsistent with r, we need to delete 10 tuples in d1 but only 1 tuple in d2. Then Robust(r|d1) > Robust(r|d2).

17 Estimate Robustness. Key idea: estimate the probabilities of data changes rather than count the possible database states. The robustness of a rule is estimated from the probability of the transactions that may invalidate it. Data-changing transactions are decomposed, and their probabilities are estimated using the Laplace law of succession.

18 Estimate Robustness (cont.). Laplace law of succession: consider a repeatable experiment whose outcome falls into one of k classes. Suppose we have conducted this experiment n times, r of which resulted in some outcome C in which we are interested. The probability that the outcome of the next experiment will be C can be estimated as (r + 1) / (n + k).
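A minimal sketch of the Laplace estimate follows. The function name `laplace` is an assumption made for this illustration; exact fractions are used so that the results can be compared with the hand calculations on the later slides.

```python
from fractions import Fraction


def laplace(r, n, k):
    """Laplace law of succession: (r + 1) / (n + k).

    r: number of past trials with the outcome of interest
    n: total number of past trials
    k: number of possible outcome classes
    """
    return Fraction(r + 1, n + k)


# With no history at all (n = 0), the estimate falls back to 1/k.
print(laplace(0, 0, 3))
# With 3 successes in 10 trials over 2 classes: (3 + 1) / (10 + 2).
print(laplace(3, 10, 2))
```

With no observations the estimate is uniform over the k classes, which is why the method can still produce an estimate when only the database schema is available.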

19 Examples to Estimate Robustness
R1: The latitude of a Maltese geographic location is greater than or equal to 35.89.
  geoloc(_,_,?country,?latitude,_) ^ ?country = "Malta" ⇒ ?latitude >= 35.89
What is the robustness of R1? Three classes of transactions can invalidate it:
  T1 (update a satisfied tuple): one of the existing tuples of geoloc with ?country = "Malta" is updated such that its ?latitude < 35.89.
  T2 (insert a new tuple): a new tuple of geoloc with ?country = "Malta" and ?latitude < 35.89 is inserted into the database.
  T3 (update an unsatisfied tuple): one of the existing tuples of geoloc with ?latitude < 35.89 and ?country != "Malta" is updated such that its ?country = "Malta".

20 Examples to Estimate Robustness (cont.). T1, T2, and T3 are the transactions that invalidate R1. Since T1, T2, and T3 are mutually exclusive, we have Pr(T1 v T2 v T3) = Pr(T1) + Pr(T2) + Pr(T3), so the robustness of R1 can be estimated from the probabilities of T1, T2, and T3 as 1 - (Pr(T1) + Pr(T2) + Pr(T3)). Requirements for transaction classes: they must be mutually exclusive, and they must be minimal, i.e., no redundant conditions are specified.

21 Decomposition of Transactions. How do we estimate Pr(T1), Pr(T2), and Pr(T3)? Decompose each transaction into more primitive statements and estimate their local probabilities first. A Bayesian network model is used to decompose the transaction (next slide).

22 Bayesian network model of transaction. Nodes in the network represent the random variables involved in the transaction. An arc from node x_i to node x_j indicates that x_j depends on x_i. The variables are: x1, what type of transaction; x2, on which relation; x3, on which tuples; x4, on which attributes; x5, what new attribute value.

23 Decomposition of Transaction. The probability of a transaction can be estimated as the joint probability of all variables, Pr(x1 ^ x2 ^ x3 ^ x4 ^ x5), which factors into local conditional probabilities along the arcs of the network. For transaction T1, the variables have the following semantics:
  x1: a tuple is updated
  x2: a tuple of geoloc is updated
  x3: a tuple of geoloc whose ?country = "Malta" is updated
  x4: a tuple of geoloc whose ?latitude is updated
  x5: a tuple of geoloc whose ?latitude is updated to a new value less than 35.89

24 Estimation of Probability of T1. Apply the Laplace law to estimate each local conditional probability (the values shown are those used in the product on slide 25):
  x1, a tuple is updated: Pr(x1) = 1/3
  x2|x1, a tuple of geoloc is updated, given that a tuple is updated: Pr(x2|x1) = 1/5
  x3|x2^x1, a tuple of geoloc whose ?country = "Malta" is updated, given that a tuple of geoloc is updated: Pr(x3|x2^x1) = 10/616

25 Estimation of Probability of T1. Assume the relation geoloc has 616 tuples, ten of which have ?country = "Malta". Then Pr(T1) = (1/3) · (1/5) · (10/616) · (1/5) · (1/2) = 1/9240 ≈ 0.000108. Pr(T2) and Pr(T3) can be estimated similarly (their numeric values are missing from this transcript). The robustness of the rule can then be estimated as 1 - (Pr(T1) + Pr(T2) + Pr(T3)).
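The arithmetic on this slide can be checked with a short sketch. The five local probabilities are the ones given on the slide; the names `local_probs`, `pr_t1` and `robustness` are assumptions introduced for the illustration. Because the values of Pr(T2) and Pr(T3) are not available here, the usage example only computes the bound obtained from T1 alone and does not reproduce the final robustness figure.

```python
from fractions import Fraction
from math import prod

# Local conditional probabilities for T1, in the order x1 .. x5,
# as given on the slide (616 geoloc tuples, 10 of them Maltese).
local_probs = [
    Fraction(1, 3),     # x1: the transaction is an update
    Fraction(1, 5),     # x2: the update is on geoloc
    Fraction(10, 616),  # x3: the updated tuple has ?country = "Malta"
    Fraction(1, 5),     # x4: the updated attribute is ?latitude
    Fraction(1, 2),     # x5: the new latitude is below 35.89
]

# Joint probability of T1 as the product of the local probabilities.
pr_t1 = prod(local_probs)


def robustness(invalidating_probs):
    """1 minus the total probability of mutually exclusive invalidating transactions."""
    return 1 - sum(invalidating_probs)


print(pr_t1, float(pr_t1))
# Considering T1 alone gives an upper bound on the robustness of R1;
# adding Pr(T2) and Pr(T3) to the list would lower it further.
print(robustness([pr_t1]))
```

Summing the probabilities is valid only because the transaction classes are required to be mutually exclusive.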

26 Estimate Robustness. Estimation accuracy depends on the available information. However, even given only the database schema, the method can still come up with an estimate. How do we derive the transactions that invalidate an arbitrary logic statement? This is not a trivial problem. Most knowledge discovery systems, however, place strong restrictions on the syntax of discovered knowledge. Hence the invalidating transactions can be manually generalized into a small set of transaction templates, together with templates of probability estimates for robustness estimation.

27 Outline: Introduction; Preliminaries; Robustness of Knowledge; Applying Robustness in Knowledge Discovery; Experimental Results; Conclusions.

28 Applying Robustness in Knowledge Discovery. A rule pruning algorithm can increase the robustness and applicability of machine-discovered rules by pruning their antecedent literals. Specification of the rule pruning problem: take as input a machine-generated rule that is consistent with a database but potentially overly specific, and remove antecedent literals so that the rule remains consistent but is short and robust. Basic idea of the pruning algorithm: search for a subset of antecedent literals to remove, until any further removal would yield an inconsistent rule. However, the search space can be exponentially large in the number of literals in the rule, so a beam-search algorithm is used to trim it.

29 Pruning Algorithm. The algorithm is a beam search. It applies the robustness estimation approach to estimate the robustness of each partially pruned rule and uses that estimate to guide the pruning search. Two properties are optimized: robustness and length. For each set of equally short rules, the algorithm searches for the rule that is as robust as possible while still being consistent.

30 Antecedent pruning algorithm
Pruning rule antecedents
  1. INPUT R = rules (initially the rule to be pruned), B = beam size;
  2. LET O = results (initially empty);
  3. WHILE (R is not empty) DO
  4.   Move the first rule r in R to O;
  5.   Prune r, LET R' = resulting rules;
  6.   Remove visited, dangling, or inconsistent rules in R';
  7.   Estimate and sort on the robustness of rules in R';
  8.   Retain top B rules in R' and remove the rest;
  9.   Merge sorted R' into R in sorted order of the robustness;
  10. RETURN O;
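The ten steps above can be sketched in Python as follows. This is a rough illustration under several assumptions, not the authors' implementation: a rule is represented only by the frozenset of its antecedent literals, "pruning" a rule means dropping one literal at a time, the check for dangling literals is omitted, and `is_consistent` and `estimate_robustness` are hypothetical callables supplied by the caller.

```python
def prune_antecedents(rule, beam_size, is_consistent, estimate_robustness):
    """Beam search over pruned versions of `rule` (a frozenset of literals)."""
    frontier = [rule]   # R: kept sorted, most robust first
    results = []        # O
    visited = {rule}
    while frontier:                                   # step 3
        r = frontier.pop(0)                           # step 4
        results.append(r)
        # Step 5: each candidate drops exactly one antecedent literal.
        candidates = {r - {lit} for lit in r}
        # Step 6: discard already visited or inconsistent candidates
        # (the dangling-literal check is not modelled in this sketch).
        fresh = [c for c in candidates if c not in visited]
        visited.update(fresh)
        kept = [c for c in fresh if is_consistent(c)]
        # Steps 7-8: sort by estimated robustness and keep the top B.
        kept.sort(key=estimate_robustness, reverse=True)
        kept = kept[:beam_size]
        # Step 9: merge into the frontier, most robust first.
        frontier = sorted(frontier + kept, key=estimate_robustness, reverse=True)
    return results                                    # step 10


# Toy usage: literal "a" is essential for consistency, and shorter
# rules are assumed (for illustration only) to be more robust.
pruned = prune_antecedents(
    frozenset({"a", "b", "c"}),
    beam_size=2,
    is_consistent=lambda r: "a" in r,
    estimate_robustness=lambda r: 1 / (1 + len(r)),
)
print(pruned)
```

The returned list contains every consistent rule the search visited, so a caller can pick, for each rule length, the most robust one, which matches the two properties (robustness and length) the algorithm is meant to optimize.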

31 Empirical Demonstration of Rule Pruning. A detailed empirical study has been conducted on rule R3:
  wharf(_,?code,?depth,?length,?crane) ^ seaport(?name,?code,_,_,_,_) ^ geoloc(?name,_,?country,_,_) ^ ?country = "Malta" ^ ?depth <= 50 ^ ?crane > 0 ⇒ ?length >= 1200
The relationship between the length and the robustness of the rules is shown on the next slide. One of the pruned rules is R7:
  wharf(_,?code,?depth,?length,?crane) ^ seaport(?name,?code,_,_,_,_) ^ geoloc(?name,_,?country,_,_) ^ ?crane > 0 ⇒ ?length >= 1200

32 Pruned rules and their estimated robustness

33 Experimental Results. Experiment setup: the rule discovery system BASIL is used to derive rules from two large Oracle relational databases, and 123 sample transactions are synthesized to represent possible transactions on the experimental databases (27 updates, 29 deletions, and 67 insertions). Steps of the experiment: run BASIL to discover a set of rules and estimate their robustness, which yields 355 rules; use another set of 202 sample transactions to assist the robustness estimation; apply the set of 123 transactions to the two relational databases and check the consistency of all 355 rules.

34 Experimental results

35 Conclusions. The work formalizes the notion of robustness of a rule r in a given database state d. It estimates the probabilities of rule-invalidating transactions by decomposing the probability of a transaction into local probabilities that can be estimated using the Laplace law, with no need to provide additional information for the estimation. It applies the estimate in a rule discovery system, pruning the antecedents of a discovered rule with a beam-search algorithm so that the rule is highly robust and widely applicable. Empirical experiments demonstrate the algorithm.

36 Questions for final exam prep. Define robustness. (Slides 4, 14, 15) What is the closed-world assumption (CWA), and what is its significance? Where is it used in data mining? (Slide 8) Compare and contrast robustness, support count, and predictive accuracy. (Slide 7)

37 Thanks and questions