Slide 1: Data Mining for Information Retrieval
Chun-Nan Hsu
Institute of Information Science, Academia Sinica, Taipei, TAIWAN
Copyright © 1998 Chun-Nan Hsu. All rights reserved.
Slide 2: The formation of the field "data mining"
- Statistics (~1800?)
- Pattern Recognition (~1970)
- Rule induction / Machine Learning (~1980)
- Expert Systems (~1970)
- Relational Databases, Triggers (~1980)
- Knowledge Discovery in Databases (KDD) (~1990)
- MIS decision support (~1990)
- Data Mining (~1995)
Slide 3: Taxonomies of data mining
- Based on the underlying technology: decision trees, rule-based, example-based, nonlinear regression, neural networks, Bayesian networks, rough sets, ...
- Based on the task at hand (due to Fayyad et al. 1997): classification, regression, clustering, summarization, dependency modeling, change and deviation detection
- Based on the data? Formalizing this idea:
  - collection of similarities
  - time series
  - image: a snapshot of a state
  - collection of images
Slide 4: Collection of similarities
- Characterize classes by generating classifiers (supervised learning)
- Cluster objects into classes (clustering, unsupervised learning)
- Many techniques are available; most are well understood
Slide 5: Time series
- Forecasting: predicting the next (few) states
- Characterizing the "trend" to detect changes and deviations
- Can usually be reformulated as a supervised learning problem
Slide 6: Collection of images
- Extracting dependencies and correlations
- Example: a collection of shopping lists of supermarket customers
- Example: a collection of symptom lists of patients taking a new medicine
- Techniques:
  - Association rules
  - Bayesian networks and other probabilistic graphical models
Slide 7: Image
- Summarization
- Key feature extraction
- Not much is known
- Example: a snapshot of an inventory database
Slide 8: Issue: Consistency of Machine-Generated Rules
[Diagram: transactions (insert/delete/update) take database state (t) to database state (t+1); are the rules obtained from state (t) by learning, data mining, or discovery still consistent with state (t+1)?]
Slide 9: Dealing with Inconsistent Rules
- Delete them? Simple, but the system might be left with no rule to use
- Modify them? Smart, but the system might be kept busy modifying rules
- Learn rules that are unlikely to become inconsistent? Yes, but how does the system know which rules to learn?
- We need a way to measure the "likelihood of not becoming inconsistent": the robustness of knowledge
Slide 10: Robustness vs. Predictive Accuracy
Given a rule A → C:
- Under the closed-world assumption on databases, both insertions and deletions can make a rule inconsistent
- The robustness of a rule is measured with respect to entire database states D: Pr(A → C | D)
- The predictive accuracy of a rule is measured with respect to data tuples d: Pr(C | A, d)
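To make the contrast concrete, here is a minimal sketch of the tuple-level quantity. The helper name and the example predicates are illustrative assumptions, not notation from the talk; robustness, by contrast, is estimated in the next slides from the probabilities of invalidating transactions.

# Hypothetical sketch: predictive accuracy Pr(C | A, d) estimated as the
# fraction of tuples in the current state d that satisfy the antecedent A
# and also satisfy the consequent C.
def predictive_accuracy(tuples, antecedent, consequent):
    covered = [t for t in tuples if antecedent(t)]
    if not covered:
        return None  # the rule never applies in this state
    return sum(1 for t in covered if consequent(t)) / len(covered)

# Example predicates for the Malta rule used later in the talk:
# antecedent = lambda t: t["country"] == "Malta"
# consequent = lambda t: t["latitude"] >= 35.89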
Slide 11: Definition of Robustness of Knowledge (1)
- A rule is robust if it is unlikely that the rule becomes inconsistent with a database state
- Intuitively, this probability could be estimated as:
  (# of database states consistent with the rule) / (# of possible database states)
- However:
  - database states are not equally probable
  - the number of possible database states is intractably large
Slide 12: Definition of Robustness of Knowledge (2)
- A rule is robust, given the current database state, if the transactions that would invalidate it are unlikely to be performed
- The likelihood of future database states depends on:
  - the current database state
  - the probability of the transactions performed on that state
- The new definition of robustness is 1 - Pr(t | d), where
  - t: a transaction that invalidates the rule is performed
  - d: the current database state
Slide 13: Robustness Estimation
- Step 1: Find the transactions that would invalidate the input rule
- Step 2: Decompose the probabilities of the invalidating transactions into local probabilities
- Step 3: Estimate the local probabilities
Slide 14: Step 1: Find Transactions that Invalidate the Input Rule
- R1: The latitude of a Maltese geographic location is greater than or equal to 35.89.
  geoloc(_, _, ?country, ?latitude, _) & (?country = "Malta") → ?latitude >= 35.89
- Transactions that invalidate R1:
  - T1: an existing geoloc tuple with country = "Malta" is updated so that its latitude < 35.89
  - T2: an inconsistent tuple is inserted (one with country = "Malta" and latitude < 35.89)
  - T3: a tuple whose latitude < 35.89 has its country updated to "Malta"
- Robust(R1) = 1 - Pr(t | d) = 1 - (Pr(T1 | d) + Pr(T2 | d) + Pr(T3 | d))
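Combining these probabilities is just the complement of their sum. A minimal sketch (the transaction probabilities themselves come from Steps 2 and 3; the function name is mine):

def rule_robustness(invalidating_probs):
    # Robust(R | d) = 1 - Pr(t | d), where t is the event that any of the
    # (mutually exclusive) invalidating transactions T1, T2, T3, ... occurs.
    return 1.0 - sum(invalidating_probs)

# e.g. rule_robustness([pr_t1, pr_t2, pr_t3])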
Slide 15: Step 2: Decompose the Probabilities of Invalidating Transactions
An invalidating transaction is described by five random variables:
- x1: what type of transaction?
- x2: on which relation?
- x3: on which tuple?
- x4: on which attribute?
- x5: what new attribute value?
Pr(t | d) = Pr(x1, x2, x3, x4, x5 | d)
          = Pr(x1 | d) Pr(x2 | x1, d) Pr(x3 | x2, x1, d) Pr(x4 | x2, x1, d) Pr(x5 | x4, x2, x1, d)
          = p1 * p2 * p3 * p4 * p5
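As code, the decomposition is a straight product once the five local probabilities have been estimated (Step 3). A minimal sketch, with a function name of my own choosing:

def invalidating_transaction_prob(p1, p2, p3, p4, p5):
    # Chain-rule decomposition of Pr(t | d):
    #   p1 = Pr(transaction type | d)
    #   p2 = Pr(relation | type, d)
    #   p3 = Pr(tuple | relation, type, d)
    #   p4 = Pr(attribute | relation, type, d)
    #   p5 = Pr(new value violates the rule | attribute, relation, type, d)
    return p1 * p2 * p3 * p4 * p5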
Slide 16: Step 3: Estimate Local Probabilities
- Estimate local probabilities using the Laplace law of succession (Laplace 1820):
  Pr ≈ (r + 1) / (n + k)
  where r is the number of previously observed favorable outcomes, n is the total number of previous observations, and k is the number of possible outcomes
- Useful information for robustness estimation:
  - transaction logs
  - expected sizes of tables
  - information about attribute ranges and value distributions
- When no such information is available, fall back on database schema information
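A minimal sketch of this estimator (the function name is mine, not the talk's):

def laplace_estimate(r, n, k):
    # Laplace law of succession: after observing the favorable outcome
    # r times in n trials, with k possible outcomes, estimate its
    # probability as (r + 1) / (n + k).
    return (r + 1) / (n + k)

# With no observations at all (r = n = 0) this falls back to the uniform
# prior 1/k, e.g. 1/3 for a transaction type in {insert, delete, update}.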
Slide 17: Example of Robustness Estimation
R1: geoloc(_, _, ?country, ?latitude, _) & (?country = "Malta") → ?latitude >= 35.89
- T1: an existing geoloc tuple with country = "Malta" is updated so that its latitude < 35.89
  - p1: is the transaction an update? 1/3 ≈ 0.33
  - p2: is it on geoloc? 1/2 = 0.50
  - p3: is it on a geoloc tuple with country = "Malta"? 4/80 = 0.05
  - p4: is the latitude attribute updated? 1/5 = 0.20
  - p5: is the latitude updated to a value < 35.89? 1/2 = 0.50
- Pr(T1 | d) = p1 * p2 * p3 * p4 * p5 ≈ 0.0008
- Pr(T2 | d) and Pr(T3 | d) can be estimated similarly
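Checking the arithmetic with a quick script (values copied from the slide):

# Naive local probability estimates for T1
p1, p2, p3, p4, p5 = 1/3, 1/2, 4/80, 1/5, 1/2
pr_t1 = p1 * p2 * p3 * p4 * p5
print(f"Pr(T1|d) ≈ {pr_t1:.6f}")   # ≈ 0.000833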
18
Lab name TBA18NTUST talk Example (cont.): When additional information is available l Naive »p1: update?1/3 = 0.33 l Laplace »p1: update?# of previous updates + 1 # of previous transactions + 3 l m-Probability (Cestnik & Bratko 1991) »p1: update? # of previous updates + m * Pr(U) # of previous transactions + m »m is an expected number of future transactions »Pr(U) is a prior probability of updates
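A minimal sketch of the m-probability estimator (names are mine); note that the Laplace estimate above is the special case m = 3 with Pr(U) = 1/3:

def m_estimate(r, n, m, prior):
    # m-probability (Cestnik & Bratko 1991): blend the observed frequency
    # r/n with a prior probability, weighted by m "virtual" observations.
    # Here r = # of previous updates, n = # of previous transactions,
    # prior = Pr(U), m = expected number of future transactions.
    return (r + m * prior) / (n + m)

# m_estimate(r, n, 3, 1/3) == (r + 1) / (n + 3), the Laplace estimate.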
Slide 19: Applying Robustness Estimation
- Robustness may not be the only desirable property of target rules
- We need to combine robustness with other utility measures to guide learning; note that tautologies are the most robust rules of all
- Using many measures to guide rule generation can be difficult
Slide 20: Pruning Rule Literals with Robustness Estimation
- Use existing algorithms to generate rules
- Prune the antecedent literals of an output rule based on its applicability and estimated robustness (one possible pruning loop is sketched below)
- Example:
  if a wharf is in Malta, its depth is < 50 ft, and it has one or more cranes, then its length is > 1200 ft
  - shortest rule consistent with the database:
    if a wharf is in Malta, then its length is > 1200 ft
  - the most robust:
    if a wharf is in Malta and it has one or more cranes, then its length is > 1200 ft
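The talk does not spell out the pruning procedure; as an illustration only, here is one plausible greedy sketch. The helper names and the search strategy (drop one literal at a time while the rule stays consistent with the database, preferring the candidate with the highest estimated robustness) are assumptions, not the author's algorithm:

def prune_antecedent(literals, consequent, is_consistent, robustness):
    # Greedily drop antecedent literals; keep a shorter rule only if it is
    # still consistent with the database and at least as robust.
    best = list(literals)
    improved = True
    while improved:
        improved = False
        for lit in list(best):
            candidate = [l for l in best if l is not lit]
            if candidate and is_consistent(candidate, consequent) \
                    and robustness(candidate, consequent) >= robustness(best, consequent):
                best, improved = candidate, True
                break
    return best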
Slide 21: Applications
- Learning rules for semantic query optimization (Hsu & Knoblock, ML 1994; Siegel, Boston U. thesis, 1989; Shekhar et al., IEEE TKDE 1994)
- Learning functional dependencies (Mannila & Räihä, KDD 1994)
- Discovering models to reconcile/integrate heterogeneous databases (Dao, Son et al., KDD 1995)
- Learning to answer intensional queries (Chu et al., 1991)
- Discovering knowledge for decision support
Slide 22: Summary
- Data mining from an "image" (a snapshot of a state) needs to estimate the robustness of the extracted knowledge
- Robustness can be defined in terms of the probability of invalidating transactions
- Robustness can be estimated efficiently
- Rule pruning guided by robustness and other utility measures may yield robust and interesting rules
- The goal: discovering robust knowledge to enhance database functionality
Slide 23: Data Mining for IR?
- Different tasks need different ways of collecting and preparing data
- Data preparation and cleaning are important
Slide 24: Data Mining for IR? Issues
- Potential applications:
  - text categorization (a.k.a. classification, routing, filtering)
  - fact extraction (a.k.a. template filling)
  - clustering
  - text summarization (a.k.a. abstracting, gisting)
  - user profiling and modeling
  - interactive query formulation
- Issues:
  - scaling up to large volumes of data
  - feature selection (a.k.a. dimensionality reduction)
Slide 25: Projects
- Recent projects:
  - Template filling: inducing information extractors from labeled semi-structured documents (J. of Info. Systems, 1999)
  - Feature selection: feature selection for backpropagation neural networks (IEEE Tools with AI, 1998)
- (To-be-proposed) projects:
  - Alias mining for digital libraries (NSC)
  - Classifying natural-language diagnosis records into ICD-9-CM codes (NHI)
- More projects... plans for collaboration are most welcome!