Published by Joseph Ray. Modified over 9 years ago.
1 A Theoretical Framework for Association Mining Based on the Boolean Retrieval Model. Peter Bollmann-Sdorra
2 Contents: Introduction; Background; Boolean Association Mining; Expressing item-sets as queries; Conclusions; Future Work
3 Introduction. Researchers focus on discovering rules in the form of implications between itemsets that have adequate support. Using frequent itemsets as both the antecedent and the consequent of a rule represents only the simplest form of predicate. This simplicity is due in part to the lack of a theoretical framework that includes more expressive predicates.
4 Motivation. In information retrieval systems, a strong theoretical background gives the user the power to ask more sophisticated and pertinent questions. Information retrieval and association mining are two complementary processes on the same data records or transactions. In information retrieval, given a query, we need to find the subset of records that matches the query. In contrast, in data mining, we need to find the queries (rules) that have an adequate number of records supporting them.
5 Proposed Solution. We introduce a theory of association mining based on a model of retrieval known as the Boolean Retrieval Model, in which (1) a Boolean query that uses only the AND operator is analogous to an itemset; (2) a general Boolean query (using AND, OR, or NOT) is interpreted as a generalized itemset; (3) the notions of support of itemsets and confidence of rules can be dealt with uniformly; and (4) an event algebra can be defined, involving all possible transaction subsets, to formally obtain a probability space.
6 Background. Deriving association rules from data: given a set of items $I = \{i_1, i_2, \ldots, i_n\}$ and a set of transactions $T = \{t_1, t_2, \ldots, t_m\}$, where each transaction $t_i \in T$ satisfies $t_i \subseteq I$, an association rule is defined as $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. The rule describes the existence of a relationship between the two itemsets $X$ and $Y$.
7 Measure for Significance (support): the percentage of transactions in the database that contain both $X$ and $Y$.
8 Measure for Importance (confidence): the percentage of transactions that contain $Y$ among those transactions containing $X$.
9 Measure for Importance (interest): represents a test of statistical independence between $X$ and $Y$.
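The three measures on slides 7 to 9 can be sketched for the classical, unweighted case as follows. This is a minimal Python sketch, not the presentation's own code: the function names and the toy transactions are hypothetical, and the interest formula assumes the usual ratio sup(X ∪ Y) / (sup(X) · sup(Y)), which equals 1 when X and Y are statistically independent.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    required = frozenset(itemset)
    hits = sum(1 for t in transactions if required <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Share of the transactions containing X that also contain Y.

    Raises ZeroDivisionError if no transaction contains X.
    """
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


def interest(x: Iterable[str], y: Iterable[str],
             transactions: List[Transaction]) -> float:
    """Assumed form: sup(X u Y) / (sup(X) * sup(Y)); 1.0 means independence."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / (
        support(x, transactions) * support(y, transactions)
    )


# Hypothetical toy database (not taken from the slides).
transactions = [
    frozenset({"beer", "milk"}),
    frozenset({"beer", "bread"}),
    frozenset({"beer", "milk", "bread"}),
    frozenset({"milk"}),
]

sup_beer_milk = support({"beer", "milk"}, transactions)
conf_beer_milk = confidence({"beer"}, {"milk"}, transactions)
int_beer_milk = interest({"beer"}, {"milk"}, transactions)
```

In this toy database an interest value below 1 for beer and milk would indicate that the two items co-occur less often than independence predicts.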
10 Boolean Association Mining. Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$, a transaction $t$ is defined as a subset of items, that is, $t \in 2^I$, where $2^I = \{\emptyset, \{i_1\}, \{i_2\}, \ldots, \{i_n\}, \{i_1, i_2\}, \ldots, \{i_1, i_2, \ldots, i_n\}\}$. Let $T \subseteq 2^I$ be a given set of transactions $\{t_1, t_2, \ldots, t_m\}$. Every transaction $t \in T$ has an assigned weight $w'(t)$.
11 Possible Weights
12 The weights $w'$ are normalized to $w(t) = w'(t) / \sum_{t' \in T} w'(t')$, so that $\sum_{t \in T} w(t) = 1$.
13 Example. Let $I = \{\text{beer}, \text{milk}, \text{bread}\}$ be the set of all items, where price(beer) = 5, price(milk) = 3, and price(bread) = 2. The set of transactions $T$ is given in a table that is not preserved in this transcript; $f(t)$ is the frequency of transaction $t$.
14 Case 1: $w'(t) = 1$.
15 Case 2: $w'(t) = f(t)$.
16 Case 3: $w'(t) = |t| \cdot g(t)$, with $g(t) = f(t)$.
17 Case 4: $w'(t) = v(t) \cdot g(t)$, with $g(t) = f(t)$ and $v(t) = \mathrm{Price}(t)$.
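The four weighting cases can be compared side by side with a short Python sketch. The item prices come from the example slide; everything else is an assumption made for illustration: the transactions and their frequencies are hypothetical (the slide's table is not available), Price(t) is taken to be the sum of the item prices in t, and normalization is assumed to divide each raw weight by the total so that the weights sum to 1.

```python
from typing import Callable, Dict, FrozenSet

Transaction = FrozenSet[str]

# Item prices as stated on the example slide.
PRICE: Dict[str, int] = {"beer": 5, "milk": 3, "bread": 2}

# Hypothetical frequencies f(t); the original transaction table is missing.
FREQ: Dict[Transaction, int] = {
    frozenset({"beer"}): 2,
    frozenset({"beer", "milk"}): 1,
    frozenset({"milk", "bread"}): 3,
    frozenset({"beer", "milk", "bread"}): 4,
}


def price(t: Transaction) -> int:
    """Assumed reading of Price(t): total price of the items in t."""
    return sum(PRICE[item] for item in t)


# Raw weights w'(t) for the four cases, with g(t) = f(t) and v(t) = Price(t).
WEIGHT_SCHEMES: Dict[str, Callable[[Transaction], float]] = {
    "case1": lambda t: 1,
    "case2": lambda t: FREQ[t],
    "case3": lambda t: len(t) * FREQ[t],
    "case4": lambda t: price(t) * FREQ[t],
}


def normalize(raw: Callable[[Transaction], float]) -> Dict[Transaction, float]:
    """Assumed normalization: w(t) = w'(t) / sum of w' over all transactions."""
    total = sum(raw(t) for t in FREQ)
    return {t: raw(t) / total for t in FREQ}


weights = {name: normalize(raw) for name, raw in WEIGHT_SCHEMES.items()}

for name, w in weights.items():
    for t, value in w.items():
        print(name, sorted(t), round(value, 3))
```

Under case 1 every transaction counts equally, while cases 2 to 4 shift weight toward transactions that are frequent, long, or expensive.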
18 Expressing item-sets as queries (logical expressions). Definition 1: For a given set of items $I$, the set $Q$ of all possible queries associated with item-sets created from $I$ is defined as follows. (a) $i \in I \Rightarrow i \in Q$. (b) $q, q' \in Q \Rightarrow q \wedge q' \in Q$. (c) These are all the elements of $Q$.
19 Definition 2: For any query $q \in Q$, the response set of $q$, $RS(q)$, is defined as follows. For all atomic $i \in Q$, $RS(i) = \{t \in T \mid i \in t\}$. For compound queries, $RS(q \wedge q') = RS(q) \cap RS(q')$.
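20 Definition 3: Let $q = (i_1 \wedge i_2 \wedge \ldots \wedge i_k)$ and let $A_q$ denote the item-set associated with $q$; that is, $A_q = \{i_1, i_2, \ldots, i_k\}$. The support of $A_q$ is defined as $\mathrm{sup}(A_q) = \sum_{t \in RS(q)} w(t)$, where $q = (i_1 \wedge i_2 \wedge \ldots \wedge i_k)$.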
21 Lemma 1: The support set of $A_q$, $SS(A_q)$, equals $RS(q)$. Lemma 2: For queries $q$, $q_1$, $q_2$ and $q_3$, the following identities hold: $RS(q \wedge q) = RS(q)$; $RS((q_1 \wedge q_2) \wedge q_3) = RS(q_1 \wedge (q_2 \wedge q_3))$; $RS(q_1 \wedge q_2) = RS(q_2 \wedge q_1)$.
22 Example: $RS((x_1 \wedge x_2) \wedge (x_3 \wedge x_2)) = RS(x_1 \wedge x_2 \wedge x_3)$
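The response set of a conjunctive query, Lemma 1, and the example identity can be checked on a small database. The Python sketch below is illustrative only: the transactions over x1 to x4 are hypothetical, and a conjunctive query is represented simply as a list of items, which is an encoding chosen here rather than one given in the slides.

```python
from typing import FrozenSet, Iterable, Set

Transaction = FrozenSet[str]

# Hypothetical transactions (not from the slides).
T: Set[Transaction] = {
    frozenset({"x1", "x2", "x3"}),
    frozenset({"x1", "x2"}),
    frozenset({"x2", "x3", "x4"}),
    frozenset({"x1", "x2", "x3", "x4"}),
}


def rs_atom(item: str, transactions: Set[Transaction]) -> Set[Transaction]:
    """RS(i) = {t in T | i in t} for an atomic query i."""
    return {t for t in transactions if item in t}


def rs_and(items: Iterable[str],
           transactions: Set[Transaction]) -> Set[Transaction]:
    """RS of a conjunction, built by repeated use of RS(q ^ q') = RS(q) n RS(q')."""
    result = set(transactions)
    for item in items:
        result &= rs_atom(item, transactions)
    return result


def support_set(itemset: Iterable[str],
                transactions: Set[Transaction]) -> Set[Transaction]:
    """SS(A) = transactions that contain every item of the item-set A."""
    required = frozenset(itemset)
    return {t for t in transactions if required <= t}


# Example identity: the repeated x2 and the grouping do not matter.
lhs = rs_and(["x1", "x2", "x3", "x2"], T)
rhs = rs_and(["x1", "x2", "x3"], T)

# Lemma 1: the support set of A_q equals RS(q).
lemma1_holds = support_set({"x1", "x2", "x3"}, T) == rhs
```

Because set intersection is idempotent, associative, and commutative, the three identities of Lemma 2 follow directly from the definition of RS.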
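23 Definition 4: For a given set of items $I$, the set $Q^*$ of all possible queries is defined as follows. (a) $i \in I \Rightarrow i \in Q^*$. (b) $q, q' \in Q^* \Rightarrow q \wedge q' \in Q^*$. (c) $q, q' \in Q^* \Rightarrow q \vee q' \in Q^*$. (d) $q \in Q^* \Rightarrow \neg q \in Q^*$.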
24 Definition 5: For any query $q \in Q^*$, the response set of transactions, $RS(q)$, is defined as follows. For all atomic $i \in Q^*$, $RS(i) = \{t \in T \mid i \in t\}$; $RS(q \wedge q') = RS(q) \cap RS(q')$; $RS(q \vee q') = RS(q) \cup RS(q')$; $RS(\neg q) = T - RS(q)$.
25 Theorem: If $q$ is a transformation of $q'$ obtained by applying the rules of Boolean algebra, then $RS(q) = RS(q')$. Each $q \in Q^*$ can be considered a generalized itemset. The itemsets investigated in earlier works only consider $q \in Q$.
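Definition 5 translates naturally into a recursive evaluator, and the theorem can then be spot-checked on one Boolean transformation. The sketch below is a minimal Python illustration under stated assumptions: queries are encoded as nested tuples such as ("and", q, q'), which is a hypothetical encoding, and the four transactions are made up for the example.

```python
from typing import FrozenSet, Tuple, Union

Transaction = FrozenSet[str]
# A query is an item name, or a tuple ("and", q, q'), ("or", q, q'), ("not", q).
Query = Union[str, Tuple]

# Hypothetical set of transactions (not from the slides).
T: FrozenSet[Transaction] = frozenset({
    frozenset({"beer", "milk"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread"}),
})


def rs(q: Query, transactions: FrozenSet[Transaction]) -> FrozenSet[Transaction]:
    """Response set of a generalized query, following Definition 5."""
    if isinstance(q, str):  # atomic query: a single item
        return frozenset(t for t in transactions if q in t)
    op = q[0]
    if op == "and":
        return rs(q[1], transactions) & rs(q[2], transactions)
    if op == "or":
        return rs(q[1], transactions) | rs(q[2], transactions)
    if op == "not":
        return transactions - rs(q[1], transactions)
    raise ValueError(f"unknown operator: {op!r}")


# A generalized itemset: transactions with beer but without milk.
beer_no_milk = rs(("and", "beer", ("not", "milk")), T)

# De Morgan's law as one instance of the theorem:
# not (beer and milk)  is equivalent to  (not beer) or (not milk).
q1 = ("not", ("and", "beer", "milk"))
q2 = ("or", ("not", "beer"), ("not", "milk"))
de_morgan_holds = rs(q1, T) == rs(q2, T)
```

A single check like this does not prove the theorem; it only shows that the evaluator agrees with one rule of Boolean algebra on this data.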
26 Lemma 3: $\{RS(q) \mid q \in Q^*\} = 2^T$. Theorem: $(T, 2^T, P)$ is a probability space.
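Lemma 3 can be checked by brute force on a small example: for every subset of T one can build a query whose response set is exactly that subset. The Python sketch below is one way to do it and makes several assumptions not stated in the slides: queries are nested tuples, each transaction gets a "minterm" query (its items ANDed with the negations of all other items), the items and transactions are hypothetical, and the probability measure is assumed to be P(E) = sum of the normalized weights w(t) for t in E, here with uniform weights.

```python
from functools import reduce
from itertools import combinations

ITEMS = ("beer", "milk", "bread")

# Hypothetical set of distinct transactions (not from the slides).
T = frozenset({
    frozenset({"beer"}),
    frozenset({"beer", "milk"}),
    frozenset({"milk", "bread"}),
})


def rs(q, transactions):
    """Response set of a query given as an item or ("and"/"or"/"not", ...)."""
    if isinstance(q, str):
        return frozenset(t for t in transactions if q in t)
    if q[0] == "and":
        return rs(q[1], transactions) & rs(q[2], transactions)
    if q[0] == "or":
        return rs(q[1], transactions) | rs(q[2], transactions)
    if q[0] == "not":
        return transactions - rs(q[1], transactions)
    raise ValueError(f"unknown operator: {q[0]!r}")


def minterm(t, items):
    """Query matched only by transaction t: its items, negations of the rest."""
    literals = [i if i in t else ("not", i) for i in items]
    return reduce(lambda a, b: ("and", a, b), literals)


def query_for(subset, items):
    """A query whose response set is `subset` (a subset of T)."""
    if not subset:  # the empty set is the response set of a contradiction
        return ("and", items[0], ("not", items[0]))
    return reduce(lambda a, b: ("or", a, b),
                  [minterm(t, items) for t in subset])


def all_subsets(transactions):
    elems = list(transactions)
    for r in range(len(elems) + 1):
        for combo in combinations(elems, r):
            yield frozenset(combo)


lemma3_holds = all(rs(query_for(E, ITEMS), T) == E for E in all_subsets(T))

# Assumed probability measure with uniform normalized weights.
w = {t: 1 / len(T) for t in T}


def P(event):
    return sum(w[t] for t in event)
```

With this assumed measure, P(T) is 1, P of the empty set is 0, and P is additive over disjoint events, which are the properties a probability space on $2^T$ requires.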
27 Rules and Their Response Strengths. Definitions 6, 7, and 8 define, respectively, the confidence, the interest, and the support of a rule $A_q \Rightarrow A_{q'}$; the defining formulas are not preserved in this transcript.
28 Lemma 4 and Lemma 5 each state a property of a rule $A_q \Rightarrow A_{q'}$; the formulas are not preserved in this transcript.
29 Conclusions The theory of association mining that is based on a model of retrieval known as the Boolean Retrieval Model has been introduced. The framework we develop derives from the observation that information retrieval and association mining are two complementary processes on the same data records or transactions. Based on the theory of Boolean retrieval, we generalize the itemset structure by using all Boolean operators.
30 Conclusions (cont.) By introducing the notion of support of generalized itemsets, a uniform measure for both itemsets and rules (generalized itemsets) has been developed. Support of a generalized itemset is extended to allow transactions to be weighted so that they can contribute to support unequally.
31 Future Work. In order to generate only understandable queries, new restrictions or measures, such as compactness and simplicity, should be introduced. (These restrictions or measures could eliminate a large number of frequent generalized itemsets, many of which could have complex structures.)