Discovering Robust Knowledge from Databases that Change
Chun-Nan Hsu (Arizona State University), Craig A. Knoblock (University of Southern California)
Journal of Data Mining and Knowledge Discovery


1 Discovering Robust Knowledge from Databases that Change
Chun-Nan Hsu (Arizona State University), Craig A. Knoblock (University of Southern California)
Journal of Data Mining and Knowledge Discovery, Volume 2, 1998
Presented & Edited by: Danielle Steimke, CS332/295 Data Mining – Spring 2014

2 Outline Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge Discovery Conclusions

3 Introduction: Importance
Real world databases are dynamic.
Changing information (updates, deletions) may make current rules incorrect.
How can we detect, check and update these inconsistent rules without high maintenance costs?
Use robustness as a measure of how likely knowledge is to remain consistent after database changes.

4 Introduction: BASIL
BASIL rule discovery system – discovers semantic rules that optimize performance.
Many rules may become invalid after database changes.
Use robustness estimation to guide data mining via a rule-pruning approach.
Search for the pruning pattern that yields highly robust rules.

5 Introduction: Applications of Robustness Estimation
Minimize the maintenance cost of inconsistent rules in the presence of database changes. When an inconsistent rule is detected, a system may:
remove the rule: repeatedly invoke the discovery system
repair the rule: repair it frequently after data changes
Apply robustness to guide the data mining and evaluation of semantic rules:
In the data mining stage, prune the antecedents of a discovered rule to increase its robustness.
In the evaluation stage, eliminate rules with robustness values below a threshold.

6 Introduction: Robustness Comparison
Robustness vs. Support:
Support - the probability that a data instance satisfies a rule.
Robustness - the probability that an entire database state is consistent with a rule.
Robustness vs. Predictive Accuracy:
Predictive accuracy - the probability that knowledge is consistent with randomly selected unseen data.
The difference is significant in databases that are interpreted under the closed-world assumption (CWA).

7 Introduction: Closed-world Assumption
Information that is not explicitly present in the database is taken to be false.
Closed-world databases are widely used in relational databases, deductive databases and rule-based information systems because of:
The characteristics of application domains.
The limitations of the representation systems.
Databases change by deletions, insertions and updates, and in a closed-world database these changes may affect the validity of a rule.
Closed-world data tends to be dynamic, so it is important for knowledge discovery systems to handle dynamic, closed-world data.
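To make the CWA concrete, here is a toy illustration (the relation and rule are invented for this example, not from the paper): absent facts count as false, so a single deletion can invalidate a rule that was consistent.

```python
# Under the CWA, a fact absent from the database is false. A deletion can
# therefore flip the truth of a rule even though nothing new was asserted.
db = {("emp", "ann"), ("emp", "bob"), ("manager", "ann"), ("manager", "bob")}

def rule_holds(db):
    """Toy rule: emp(x) -> manager(x); consistent if every emp is a manager."""
    emps = [x for (rel, x) in db if rel == "emp"]
    return all(("manager", x) in db for x in emps)

assert rule_holds(db)           # consistent in this state
db.discard(("manager", "bob"))  # one deletion...
assert not rule_holds(db)       # ...invalidates the rule under the CWA
```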

8 Important Definitions
Database state at time t: the collection of instances present in the database at time t.
Database Literals vs. Built-In Literals
Range Rule vs. Relational Rule
Consistent rule: given a database state, a rule is consistent if all variable instantiations that satisfy the antecedents of the rule also satisfy its consequent.

9 Outline Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge Discovery Conclusions

10 First Definition of Robustness
Robust rule: does not become inconsistent (invalid) after database changes
Definition 1 (Robustness for all states): Given a rule r, let D be the event that the database is in a state that is consistent with r. The robustness of r is defined as Robust(r) = Pr(D).

11 First Definition Continued
Two problems with this estimate:
Treats all database states as if they are equally probable
The number of possible database states is enormous, even for small databases

12 Second Definition of Robustness
Robustness for accessible states: Given a rule r and a database in a state d in which r is consistent, let t denote the event of performing a transaction on d that results in a new database state inconsistent with r. The robustness of r in the states accessible from the current state d is defined as Robust(r|d) = Pr(¬t|d) = 1 - Pr(t|d).

13 Definitions of Robustness (cont.)
If all transactions are equally probable, then Robust(r|d) = 1 - (number of transactions that invalidate r) / (number of possible transactions on d).
The robustness of a rule can differ across database states. E.g.: suppose d1 and d2 are two states of a given database. To reach a state inconsistent with r, we need to delete 10 tuples in d1 but only 1 tuple in d2. Then Robust(r|d1) > Robust(r|d2).
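Under the equal-probability assumption this is just a counting ratio. A minimal sketch (the counts are illustrative, not taken from the paper):

```python
def robustness(num_invalidating, num_possible):
    """Robust(r|d) = 1 - Pr(t|d), with Pr(t|d) taken as the fraction of
    equally probable transactions on d that would invalidate r."""
    return 1.0 - num_invalidating / num_possible

# Fewer invalidating transactions means a more robust rule, e.g.:
# robustness(1, 1000) > robustness(10, 1000)
```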

14 Estimate Robustness Estimate the probabilities of data changes, rather than the number of possible database states. Estimate the robustness of a rule based on the probability of transactions that may invalidate the rule. Decompose data changing transactions and estimate their probabilities using the Laplace law of succession.

15 Estimate Robustness (cont.)
Laplace law of succession: Given a repeatable experiment whose outcome falls into one of k classes, suppose we have conducted the experiment n times, r of which resulted in some outcome C in which we are interested. The probability that the outcome of the next experiment will be C can be estimated as (r + 1) / (n + k).
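The estimator is a one-liner; this sketch just restates the formula above as code:

```python
def laplace(r, n, k):
    """Laplace law of succession: Pr(next outcome is C) after observing
    C in r of n trials, with k possible outcome classes."""
    return (r + 1) / (n + k)

# With no observations at all (n = 0), every one of the k classes gets 1/k:
# laplace(0, 0, 2) == 0.5
```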

16 Decomposition of Transactions
To estimate Pr(t|d), use a Bayesian network model to decompose the transaction and estimate the local probabilities first:
X1: type of transaction?
X2: on which relation?
X3: on which tuples?
X4: on which attributes?
X5: what new attribute value?
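The Bayesian-network decomposition means Pr(t|d) becomes a product of locally estimated conditional probabilities, one per node X1..X5. A sketch under that assumption (the factor values are illustrative, not from the paper):

```python
def transaction_probability(factors):
    """Pr(t|d) ~= Pr(X1) * Pr(X2|X1) * Pr(X3|X1,X2) * ... where each
    factor is a locally estimated conditional probability (e.g. via the
    Laplace law) read off the Bayesian network."""
    p = 1.0
    for f in factors:
        p *= f
    return p

# e.g. Pr(deletion) * Pr(relation R | deletion) * Pr(tuple set | R, deletion)
# transaction_probability([0.5, 0.2, 0.1]) is approximately 0.01
```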

17 Estimating Robustness
Estimation accuracy depends on the available information; however, even given only database schemas, the method can still come up with some estimates.
How can we derive the transactions that invalidate an arbitrary logic statement? Most knowledge discovery systems place strong restrictions on the syntax of discovered knowledge, so the invalidating transactions can be manually generalized into a small set of transaction templates, along with templates of probability estimates for robustness estimation.

18 Outline Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge Discovery Conclusions

19 Applying Robustness in Knowledge Discovery
A rule pruning algorithm can increase the robustness and applicability of machine-discovered rules by pruning their antecedent literals.
Specification of the rule pruning problem: Take as input a machine-generated rule that is consistent with a database but potentially overly specific, and remove antecedent literals so that the rule remains consistent but is short and robust.
Basic idea of the pruning algorithm: Search for a subset of antecedent literals to remove, until any further removal would yield an inconsistent rule. However, the search space can be exponentially large with respect to the number of literals in the rule, so a beam-search algorithm is used to trim the search space.

20 Antecedent pruning algorithm
Pruning rule antecedents
1. INPUT R = rules (initially the rule to be pruned), B = beam size;
2. LET O = results (initially empty);
3. WHILE R is not empty DO
4.   Move the first rule r in R to O;
5.   Prune r; LET R' = resulting rules;
6.   Remove visited, dangling, or inconsistent rules from R';
7.   Estimate and sort on the robustness of the rules in R';
8.   Retain the top B rules in R' and remove the rest;
9.   Merge sorted R' into R in sorted order of robustness;
10. RETURN O;
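A minimal Python sketch of this beam search. The `is_consistent` and `robustness` callables are hypothetical stand-ins for BASIL's database consistency check and robustness estimator, not the paper's API; a rule is modeled as a (antecedent-tuple, consequent) pair.

```python
def prune_antecedents(rule, beam_size, is_consistent, robustness):
    """Beam search over antecedent subsets, mirroring the algorithm above."""
    queue = [rule]   # R: rules to expand, kept sorted by estimated robustness
    output = []      # O: every consistent rule visited
    seen = {rule}
    while queue:
        r = queue.pop(0)                 # move the first rule r in R to O
        output.append(r)
        antecedents, consequent = r
        candidates = []
        for i in range(len(antecedents)):          # prune one literal at a time
            pruned = (antecedents[:i] + antecedents[i + 1:], consequent)
            if pruned in seen:
                continue                           # drop visited rules
            seen.add(pruned)
            if not is_consistent(pruned):
                continue                           # drop inconsistent rules
            candidates.append(pruned)
        candidates.sort(key=robustness, reverse=True)
        queue.extend(candidates[:beam_size])       # retain the top B rules
        queue.sort(key=robustness, reverse=True)   # merge in sorted order
    return output

# Toy demo with invented oracles:
is_cons = lambda r: 'a' in r[0]   # consistent iff literal 'a' survives
rob = lambda r: -len(r[0])        # shorter rules deemed more robust
rules = prune_antecedents((('a', 'b', 'c'), 'q'), 2, is_cons, rob)
# The maximally pruned consistent rule (('a',), 'q') appears in the results.
```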

21 Pruning Algorithm Beam-search algorithm Apply the robustness estimation approach to estimate the robustness of a partially pruned rule and guide the pruning search. Two properties to optimize: robustness and length. For each set of equally short rules, the algorithm searches for the rule that is as robust as possible while still being consistent.

22 Experimental Results
Experiment setup:
Use the rule discovery system BASIL to derive rules from two large Oracle relational databases. BASIL is part of the SIMS information mediator developed by the authors.
Synthesize 123 sample transactions to represent possible transactions on the experimental databases (27 updates, 29 deletions and 67 insertions).
Steps of the experiments:
Train BASIL to discover a set of rules and estimate their robustness; this generates 355 rules.
Use another set of 202 sample transactions to assist the robustness estimation.
Apply the set of 123 transactions to the two relational databases and check the consistency of all 355 rules.

23 Conclusions
Formalized the notion of the robustness of a rule r in a given database state d.
Estimated the probabilities of rule-invalidating transactions: decompose the probability of a transaction into local probabilities that can be estimated using the Laplace law; no additional information is needed for the estimation.
Applied this in a rule discovery system: prune the antecedents of a discovered rule so that the rule is highly robust and widely applicable (beam-search algorithm).
Conducted empirical experiments to demonstrate the algorithm.

Real-world Applications
Face recognition software: the environment is the "data set" and is constantly changing; rules must resist variability in lighting and pose.
System prognostics: the data set is continuously updated with diagnostic or calibration information; predictive failure rules are used to dictate system resource planning. 24

First Exam Question Why is finding robust rules significant? 25

First Question - Answer Real world databases tend to be dynamic Changing information – updates and deletions, rather than just additions – could invalidate current rules Continually checking and updating rules may incur high maintenance costs, especially for large databases Robustness measures how likely the knowledge found will be consistent after changes to the database 26

Second Exam Question Compare and contrast robustness estimation with support and predictive accuracy. 27

Second Question - Answer Robustness is the probability that an entire database state is consistent with a rule, while support is the probability that a specific data instance satisfies a rule. Predictive accuracy is the probability that knowledge is consistent with randomly selected unseen data. The difference is significant in closed-world databases, where data tends to be dynamic. 28

Third Exam Question Why are transaction classes mutually exclusive when calculating robustness? 29

Third Question – Answer Transaction classes must be mutually exclusive so that no class covers another: for any two classes of transactions t_a and t_b, if t_a covers t_b (every transaction in t_b is also in t_a), then Pr(t_a ⋀ t_b) = Pr(t_b), and it is redundant to consider t_b. 30