Association Rule Mining in Peer-to-Peer Systems
Ran Wolff, Assaf Schuster
Department of Computer Science, Technion I.I.T., Haifa 32000, Israel
Difficulties of Distributed DB
- Impracticality of global communications and global synchronization
- Dynamic topology changes of the network
- On-the-fly data updates
- Resource sharing with other applications
- Frequent failure and recovery of resources
The Algorithm Requirements
- Entirely asynchronous
- Imposes very little communication overhead
- Transparently tolerates network topology changes and node failures
- Quickly adjusts to changes in the data as they occur
Problems in LSD-ARM
- There can be no global synchronization
- Nodes must act independently
- There is no point in time at which the algorithm is known to have finished
- Nodes have no way of knowing that the information they possess is final and accurate
Solution
- Each node maintains an assumption of the correct result
- Each node updates the result whenever new data arrives
- Nodes compute the result through local negotiation with their immediate neighbors
Dynamic nature of LSD system
- If the mean time between failures of a single node is 20,000 hours, a system consisting of 100,000 nodes can expect about five node failures per hour
- Whenever a node departs, the global DB and the result of the computation change
- A similar problem occurs when new nodes join
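The five-failures-per-hour figure follows from dividing the number of nodes by the per-node mean time between failures. A minimal sketch of that arithmetic, assuming nodes fail independently at a constant rate (the function name is illustrative, not from the slides):

```python
def expected_failures_per_hour(num_nodes: int, mtbf_hours: float) -> float:
    """Expected node failures per hour across the whole system.

    Assumes independent nodes, each failing on average once per mtbf_hours.
    """
    return num_nodes / mtbf_hours


# Figures from the slide: 100,000 nodes, 20,000 hours MTBF per node.
print(expected_failures_per_hour(100_000, 20_000))  # 5.0
```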
The majority voting protocol
- Requires no synchronization between the computing nodes
- Each node communicates only with its immediate neighbors
- Locality implies that the algorithm is scalable to very large networks
Notation
- The database at time t
- The partition of node u at time t
- The group of machines reachable from u at time t
- The solution of the LSD-ARM problem for node u at time t, which is a set of rules
LSD-Majority
- LSD-Majority: an entirely different majority voting protocol
- Its purpose is to ensure that each node converges toward the correct majority
- The ad-hoc solution of node u is:
  1: when the majority of the bits held by the machines reachable from u are set
  0: when the majority of those bits are unset
- The nodes communicate by sending messages containing two integers:
  count: the number of bits this message reports
  sum: the number of those bits that are equal to one
- Cu is, for now, one
- Δu measures the number of excess set bits u has been informed of
- Δuv measures the number of excess set bits u and v have last reported to one another
- Δu is recalculated each time Su changes, a message is received, or a neighbor connects to or disconnects from u
- Δuv is recalculated each time a message is sent to or received from v
- As long as Δu ≥ Δuv ≥ 0 and Δv ≥ Δvu ≥ 0, there is no need to exchange data: both u and v already conclude that the majority is of set bits
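To make the no-exchange condition concrete, here is a minimal sketch. It assumes that "excess set bits" means the number of set bits minus λ times the number of bits counted, so that Δu is computed from Su, Cu and every (sum, count) pair u has received, and Δuv from the last pair u sent to v together with the last pair it received from v. These formulas and all function names are illustrative assumptions, not taken from the slides.

```python
from fractions import Fraction


def delta_node(s_u, c_u, received, lam):
    """Assumed Δu: excess set bits over everything u knows.

    received is a list of (sum, count) pairs, one per neighbor.
    """
    total_sum = s_u + sum(s for s, _ in received)
    total_count = c_u + sum(c for _, c in received)
    return total_sum - lam * total_count


def delta_edge(sent, received, lam):
    """Assumed Δuv: excess set bits in the last messages exchanged on one edge."""
    return (sent[0] + received[0]) - lam * (sent[1] + received[1])


def no_exchange_needed(d_u, d_uv, d_v, d_vu):
    """The condition from the slide: Δu ≥ Δuv ≥ 0 and Δv ≥ Δvu ≥ 0."""
    return d_u >= d_uv >= 0 and d_v >= d_vu >= 0


lam = Fraction(1, 2)  # simple majority
# u holds a set bit, last received (2, 3) from v and last sent (1, 1) to v.
d_u = delta_node(1, 1, [(2, 3)], lam)
d_uv = delta_edge((1, 1), (2, 3), lam)
# v holds a set bit, received (1, 1) from u and (1, 2) from another neighbor.
d_v = delta_node(1, 1, [(1, 1), (1, 2)], lam)
d_vu = delta_edge((2, 3), (1, 1), lam)
print(no_exchange_needed(d_u, d_uv, d_v, d_vu))  # True
```

In this example all four values equal 1, so neither node has anything new to tell the other.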
Algorithm 1: LSD-Majority
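The following is a rough simulation sketch of a protocol in this style, not the authors' listing. It makes several assumptions: nodes sit on a path (a tree with no cycles), each holds one bit (Cu = 1), Δu and Δuv are computed as set bits minus λ times bits counted, and a node sends to a neighbor exactly when the no-exchange condition, or its mirror image for a negative Δuv, is violated on its side. The class, method, and function names are hypothetical.

```python
from collections import deque
from fractions import Fraction


class Node:
    """One peer holding a single bit (Cu = 1); all names are illustrative."""

    def __init__(self, bit, lam):
        self.s, self.c, self.lam = bit, 1, lam
        self.sent = {}  # neighbor id -> (sum, count) last sent to it
        self.recv = {}  # neighbor id -> (sum, count) last received from it

    def connect(self, v):
        self.sent[v] = (0, 0)
        self.recv[v] = (0, 0)

    def delta(self):
        # Assumed Δu: excess set bits over everything this node knows.
        total_sum = self.s + sum(s for s, _ in self.recv.values())
        total_count = self.c + sum(c for _, c in self.recv.values())
        return total_sum - self.lam * total_count

    def delta_edge(self, v):
        # Assumed Δuv: excess set bits in the last messages exchanged with v.
        (ss, sc), (rs, rc) = self.sent[v], self.recv[v]
        return (ss + rs) - self.lam * (sc + rc)

    def output(self):
        # Ad-hoc solution: 1 if the node currently believes the majority is set.
        return 1 if self.delta() >= 0 else 0

    def updates(self):
        """Return [(neighbor, message)] for edges whose condition is violated."""
        d_u, out = self.delta(), []
        for v in self.sent:
            d_uv = self.delta_edge(v)
            if (d_uv >= 0 and d_u < d_uv) or (d_uv < 0 and d_u > d_uv):
                # Report everything known except what v itself reported.
                msg = (
                    self.s + sum(s for w, (s, _) in self.recv.items() if w != v),
                    self.c + sum(c for w, (_, c) in self.recv.items() if w != v),
                )
                self.sent[v] = msg
                out.append((v, msg))
        return out


def run_lsd_majority(bits, lam=Fraction(1, 2)):
    """Simulate message passing on a path until no messages remain in flight."""
    nodes = [Node(b, lam) for b in bits]
    for i in range(len(nodes) - 1):
        nodes[i].connect(i + 1)
        nodes[i + 1].connect(i)
    queue = deque()
    for u, node in enumerate(nodes):
        queue.extend((u, v, msg) for v, msg in node.updates())
    while queue:
        u, v, msg = queue.popleft()
        nodes[v].recv[u] = msg  # deliver the message, then let v react
        queue.extend((v, w, m) for w, m in nodes[v].updates())
    return [node.output() for node in nodes]


print(run_lsd_majority([1, 0, 0, 1, 0]))  # [0, 0, 0, 0, 0]
print(run_lsd_majority([1, 1, 0]))        # [1, 1, 1]
```

In the second run only two messages are exchanged, both between the last two nodes, which illustrates the locality the slides emphasize: the first node never needs to hear from anyone.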
Generalize LSD-Majority for frequency counts
- Cu: size of the local database
- Su: local support of an itemset
- λ: MinFreq
- The resulting protocol decides whether an itemset is frequent or not in the database
For the confidence of a rule X → Y:
- Cu: the number of transactions in the local database that include X
- Su: the number of these transactions that include both X and Y
- λ: MinConf
- The resulting protocol decides whether a rule X → Y is confident or not
- Deciding whether a rule is correct or false requires each node to run two instances of the protocol, one for frequency and one for confidence
- This way LSD-Majority efficiently decides whether a candidate rule is correct or false
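The two protocol instances differ only in how Su, Cu and λ are derived from the local database. A minimal sketch of those local inputs, assuming transactions are sets of items, that the frequency instance votes on the itemset X ∪ Y, and that a vote passes when Su − λ·Cu ≥ 0 (the function names and the sample data are illustrative, not from the slides):

```python
from fractions import Fraction


def frequency_vote(transactions, itemset):
    """(Su, Cu) for frequency: Cu = local DB size, Su = local support of itemset."""
    itemset = set(itemset)
    c_u = len(transactions)
    s_u = sum(1 for t in transactions if itemset <= set(t))
    return s_u, c_u


def confidence_vote(transactions, lhs, rhs):
    """(Su, Cu) for confidence of lhs -> rhs: Cu counts lhs, Su counts lhs and rhs."""
    lhs = set(lhs)
    both = lhs | set(rhs)
    c_u = sum(1 for t in transactions if lhs <= set(t))
    s_u = sum(1 for t in transactions if both <= set(t))
    return s_u, c_u


def passes(s_u, c_u, lam):
    """Local view of the vote: is Su at least λ times Cu?"""
    return s_u - lam * c_u >= 0


local_db = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"c"}]
freq = frequency_vote(local_db, {"a", "b"})
conf = confidence_vote(local_db, {"a"}, {"b"})
print(freq, passes(*freq, Fraction(1, 2)))  # (2, 4) True
print(conf, passes(*conf, Fraction(3, 4)))  # (2, 3) False
```

On this local data alone the rule a → b would be frequent at MinFreq = 1/2 but not confident at MinConf = 3/4. In the distributed setting each pair would instead feed its own protocol instance, and the decision would reflect the data of all reachable nodes.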
Majority-Rule
- Each node must take into account not only the local data, but also data brought to it by LSD-Majority
- An algorithm which never really finishes discovering all itemsets must generate rules on the fly
Majority-Rule
Conclusion
- LSD-Majority: a distributed majority vote protocol, used as part of the algorithm
- Majority-Rule: an algorithm that mines association rules on distributed systems of unlimited size
- Its key quality is its locality
- It also offers fast convergence of the result and low communication demands