Data Mining (© S. Matwin, 2002)


1 Data Mining
- What is data mining? Motivating example
- Why now?
- Technological foundations
- Tasks
- Architectures and processes: data warehouse, data mart, middleware, OLAP
- Conclusion
http://www.site.uottawa.ca/~stan/csi5387/dm3517~1.pdf

2 Definition
Technology that finds implicit, unexpected relationships in the data. Motivating example: the K-mart example.

3 Why now? Bar codes; networks/connectivity; IT maturity of management.

4 Technological foundations: databases, machine learning, visualization, statistics.

5 Tasks: associations/MBA (market basket analysis), estimation, classification, clustering, ...

6 Associations
Given:
- $I = \{i_1, \dots, i_m\}$, a set of items
- $D$, a set of transactions (a database); each transaction is a set of items $T \subseteq I$
Association rule: $X \rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, $X \cap Y = \emptyset$
- confidence $c$: ratio of the number of transactions that contain both $X$ and $Y$ to the number of transactions that contain $X$
- support $s$: ratio of the number of transactions that contain both $X$ and $Y$ to the number of transactions in $D$

7 An association rule $A \rightarrow B$ is a conditional implication among itemsets $A$ and $B$, where $A \subseteq I$, $B \subseteq I$ and $A \cap B = \emptyset$. The confidence of an association rule $r: A \rightarrow B$ is the conditional probability that a transaction contains $B$, given that it contains $A$. The support of rule $r$ is defined as $\mathrm{sup}(r) = \mathrm{sup}(A \cup B)$. The confidence of rule $r$ can be expressed as $\mathrm{conf}(r) = \mathrm{sup}(A \cup B) / \mathrm{sup}(A)$.
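The two definitions can be checked on a small example. The sketch below is only an illustration: the toy database `D` and the helper names `support` and `confidence` are hypothetical and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical toy database D: each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = sup(A u B) / sup(A)."""
    a = set(antecedent)
    b = set(consequent)
    return support(a | b, transactions) / support(a, transactions)

# Rule {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, D)
rule_confidence = confidence({"diapers"}, {"beer"}, D)
print(rule_support, rule_confidence)
```

Exact fractions are used so that the ratio sup(A ∪ B)/sup(A) is not blurred by floating-point rounding.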

8 Associations - mining
Given $D$, generate all association rules with $c > \mathit{min}_c$ and $s > \mathit{min}_s$ (items are ordered, e.g. by barcode).
Idea: find all itemsets whose transaction support exceeds $\mathit{min}_s$; these are the large itemsets.

9 Associations - mining
To do that:
- start with the individual items that have large support
- in each next step $k$, use the large itemsets from step $k-1$ to generate a new set of candidate itemsets $C_k$
- count the support of $C_k$ (for each transaction $t$, count the candidates that are contained in $t$)
- prune the candidates that are not large

10 Associations - mining
Only keep those that are contained in some transaction.
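The level-wise procedure on the mining slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `apriori`, the toy database and the use of an absolute support count (an itemset is treated as large when its count is at least `min_support_count`) are all choices made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset (frozenset): support count} for all large itemsets."""
    # Step 1: count the individual items and keep the large ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_support_count}
    result = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Candidate generation: unions of two large (k-1)-itemsets that have
        # size k and whose (k-1)-subsets are all large.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one pass over the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Prune the candidates that are not large.
        large = {s: c for s, c in counts.items() if c >= min_support_count}
        result.update(large)
        k += 1
    return result

# Hypothetical toy database of four transactions.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
large_itemsets = apriori(D, min_support_count=2)
for itemset, count in sorted(large_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The inner loop tests every candidate against every transaction, which keeps the sketch short; a hash-based subset check would replace it for larger databases.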

11 Candidate generation: $C_k = \text{apriori-gen}(L_{k-1})$

12 Subset function
Subset($C_k$, $t$) finds the candidate itemsets in $C_k$ that are contained in a transaction $t$. It is done via a tree structure through a series of hashing steps:
- at the root, hash on every item in $t$; itemsets not containing anything from $t$ are ignored
- at an interior node reached by hashing item $i$ of $t$, hash on all items of $t$ that follow $i$
- at a leaf (a set of itemsets), check which itemsets are contained in $t$

13 Example
$L_3$ = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
After the join step, $C_4$ = {{1 2 3 4}, {1 3 4 5}}; pruning deletes {1 3 4 5} because {1 4 5} is not in $L_3$, leaving $C_4$ = {{1 2 3 4}}.
See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details.
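The example can be reproduced with a short sketch of candidate generation as a join step followed by a prune step. Representing itemsets as sorted tuples and the function name `apriori_gen` are assumptions made for this illustration.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate candidate k-itemsets from large (k-1)-itemsets (sorted tuples)."""
    prev = sorted(large_prev)
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    # Join step: combine two itemsets that agree on everything but the last item.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Prune step: drop candidates that have a (k-1)-subset which is not large.
    return [
        c for c in joined
        if all(sub in prev_set for sub in combinations(c, k - 1))
    ]

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)
print(C4)
```

The join relies on the items being ordered, which is why the slides note that items are ordered, e.g. by barcode.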

14 Lift chart
Population of 100 with a 5% response rate, i.e. 5 responders. Contacting the 10 best prospects, we reach 20% of the 5 who respond, so 1 person. Without a model, contacting 10 people yields 0.5 persons on average. The lift is 2. Oftentimes, cost has to be taken into account for samples of small and large size.
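The arithmetic behind the lift figure can be written out as a small calculation. The variable names are hypothetical; the numbers are the ones on the slide.

```python
from fractions import Fraction

population = 100
response_rate = Fraction(5, 100)     # 5% of the population responds
contacted = 10                       # size of the contacted sample
captured_share = Fraction(20, 100)   # share of all responders among the 10 best prospects

responders = population * response_rate      # 5 people
with_model = responders * captured_share     # responders reached using the model
without_model = contacted * response_rate    # expected responders in a random sample of 10
lift = with_model / without_model

print(responders, with_model, without_model, lift)
```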

15 Architectures: data warehouse, metadata, middleware, data mart, data cube.

16 Architecture - defs
- data warehouse: several heterogeneous databases that contain data relevant to a given problem (e.g. transactions, customer info, …)
- metadata: data about the data. Describes the hierarchy of attributes and the logical organization of the data (e.g. customer data consists of the number, name, accounts, …; accounts is …)
- the database schema is an example of metadata
- metadata describes the data in the data warehouse from the business perspective

17 Architecture - defs
- middleware: software protocol providing a single interface to a distributed DW (data warehouse), e.g. standard APIs such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC)
- problems (efficiency) when querying
- multi-tiered approach: data marts, each holding the part of the data warehouse needed by a given department


19 Processes: OLAP (On-Line Analytical Processing)
- from de-normalized data (source system, e.g. transactions) to a star topology
- analyzing the reports that are likely to be needed
- the star and the reports define the dimensions
- the dimensions define the cube

20 Example
Moviegoers database (de-normalized):

name   sex  age  source          movie name
Amy    f    27   Oberlin         Independence day
Andy   m    34   Oberlin         The Birdcage
Bob    m    51   Pinewoods       Schindler’s list
Cathy  f    39   124 Mt. Auburn  The Birdcage
Curt   m    30   MRJ             Judgement day
David  m    40   MRJ             Independence day
Erica  f    23   124 Mt. Auburn  Trainspotting

21 [Figure: a central fact table and dimension tables]

22 Typical reports
- number of times each movie was seen, for movies seen more than 5 times
- for which movies is the average age of viewers above 30?
- the number of people and their ages, by source
- the number of people from each source, by gender
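Reports of this kind amount to group-by aggregations over the de-normalized table. The sketch below runs three of them on the seven moviegoer records from the example slide, as read from the flattened table; the variable names are hypothetical, and with so few records no movie passes the "more than 5 times" filter, so the view counts are left unfiltered.

```python
from collections import Counter, defaultdict

# Rows of the de-normalized moviegoers table: (name, sex, age, source, movie).
rows = [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]

# Report: number of times each movie was seen.
views = Counter(movie for _, _, _, _, movie in rows)

# Report: movies whose viewers have an average age above 30.
ages = defaultdict(list)
for _, _, age, _, movie in rows:
    ages[movie].append(age)
avg_age_over_30 = sorted(m for m, a in ages.items() if sum(a) / len(a) > 30)

# Report: number of people from each source, by gender.
by_source_sex = Counter((source, sex) for _, sex, _, source, _ in rows)

print(views)
print(avg_age_over_30)
print(by_source_sex)
```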

23 Cube
- is formed by representing the whole (de-normalized) database by the dimensions
- the size of the cube does not depend on the number of people
- the cube has subcubes, each containing the “key” info that identifies it plus the summary aggregate info
- cube = MDD (multi-dimensional database)
- real cubes have more than three dimensions
- each record belongs to exactly one subcube
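A cube can be sketched as a mapping from dimension values to summary aggregates. The choice of dimensions (sex, source, movie), the aggregates kept per subcube and the function name `build_cube` are assumptions made for this illustration; the records are the moviegoers from the example slide.

```python
from collections import defaultdict

# Rows of the de-normalized moviegoers table: (name, sex, age, source, movie).
rows = [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]

def build_cube(records):
    """Map each (sex, source, movie) key to its summary aggregates."""
    cube = defaultdict(lambda: {"count": 0, "age_sum": 0})
    for _name, sex, age, source, movie in records:
        cell = cube[(sex, source, movie)]  # the "key" info identifies the subcube
        cell["count"] += 1                 # summary aggregate: number of people
        cell["age_sum"] += age             # summary aggregate: total age
    return dict(cube)

cube = build_cube(rows)
# Doubling the number of people changes the aggregates, not the number of subcubes.
doubled = build_cube(rows + rows)

print(len(cube), len(doubled))
print(sum(cell["count"] for cell in cube.values()))
```

Because every record contributes to exactly one key, the counts over all subcubes add up to the number of records, while the number of subcubes is bounded by the product of the dimension sizes.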

24 Tasks
- drilling: looking inside a subcube at the records (of the original database) that are represented in that subcube
- churning/attrition: losing customers. Can be cast as a classification problem on historical data (two classes: customers who have churned and those who have not); a classification system (e.g. decision tree induction) can then induce the classifiers
- fraud detection: learning regular patterns, watching for discrepancies


27 Data mining - tools
- IBM Intelligent Miner
- SAS Enterprise Miner
- SGI MineSet
- RuleQuest (you can download it for a trial!)

28 Data mining - conclusion
- treats historical data as an organizational asset, rather than a burden
- tries to find out the unknown and predict the unknown
- applies to marketing, internet mining, e-commerce, ...

