Information Retrieval from Data Bases for Decisions
Dr. Gábor SZŰCS, Ph.D.
Assistant professor, BUTE, Department of Information and Knowledge Management


Contents: Aims, General steps in the procedure, Market basket analysis, Frequent itemsets, Conclusion

Aims: to search for hidden relationships in existing databases (DB) and to help make well-grounded decisions. Data mining techniques are able to find such relationships. They provide the ability to optimize decision-making, and they are the most powerful tools for retrieving important information.

Steps of data mining
1. Declaration of the key and predictor variables to be analysed (sampling from a large amount of data).
2. Modification of variables: we should examine whether some variables should be integrated, correct the data (large DBs always contain some mistakes), and execute any transformations that are needed.

Additional steps of data mining
3. Modelling with data mining techniques: neural networks, decision trees, regression procedures, cluster analysis, factor analysis, discriminant analysis, etc.
4. Comparison of the data mining models built on the same DB, so that the best model can be selected.
The procedure can be repeated cyclically. After the whole procedure, the hidden relationships between different aspects can be shown.

Market Basket Analysis
Market basket analysis is used for finding groups of items that tend to occur together. The models give the likelihood of different products being purchased together. Market basket analysis is useful for finding:
1. items that occur together
2. items that occur in a particular sequence

Table of Co-Occurrence of Products
            Product 1   Product 2   Product 3   Product 4   Product 5
Product 1         234          12           0         125          54
Product 2          12         175          65          23          75
Product 3           0          65         229          67          62
Product 4         125          23          67         315          55
Product 5          54          75          62          55         292
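A table of this kind can be built by counting, for every basket, each item and each pair of items it contains. The following is a minimal sketch; the basket contents and the function name co_occurrence are invented for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count baskets per item (diagonal) and per pair of items (off-diagonal)."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            counts[(item, item)] += 1  # diagonal cell: baskets containing the item
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1  # store both orders so the table is symmetric
            counts[(b, a)] += 1
    return counts


# Hypothetical baskets, only to exercise the function.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
table = co_occurrence(baskets)
print(table[("bread", "milk")])
```

The diagonal cells hold the number of baskets containing a single product, which is why the diagonal values in the table above are the largest in each row.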

Procedure of market basket analysis
1. Choose the right level of the product hierarchy for the items.
2. Calculate the probabilities and joint probabilities of the items.
3. Determine the association rules.

Example
Items                                                  Count
Bicycle (A)                                              140
Hand tools for bicycle (B)                               100
Tool rack (C)                                             61
Bicycle and hand tool (A & B)                             50
Bicycle and tool rack (A & C)                              7
Hand tool and tool rack (B & C)                           45
Bicycle and hand tool and tool rack (A & B & C)            5

Table of probabilities and joint probabilities of items
Items        Probability
A            14 %
B            10 %
C            6.1 %
A & B        5 %
A & C        0.7 %
B & C        4.5 %
A & B & C    0.5 %

Association rules
A rule (A → B) consists of two parts:
1. condition (A)
2. consequence (B)
A confidence can be defined for each rule: P(A → B) = P(A & B) / P(A)

Example
P(A → B) = 5 / 14 = 0.357
P((A&B) → C) = 0.5 / 5 = 0.1
P((A&C) → B) = 0.5 / 0.7 = 0.714
P((B&C) → A) = 0.5 / 4.5 = 0.111
Can this association rule help us? If we offer product A to everybody, 14 % of the people will purchase it. If we offer A only to those who bought B and C, 11 % of them will purchase it.
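The confidence values in this example can be recomputed from the probability table. The sketch below assumes that confidence is the joint probability of condition and consequence divided by the probability of the condition, which is what the 5 / 14 calculation indicates. The dictionary P and the function name confidence are illustrative names, and items are written as single letters so that a string such as "AB" stands for the set {A, B}.

```python
# Probabilities from the table of probabilities (as fractions, not percent).
P = {
    frozenset("A"): 0.14,
    frozenset("B"): 0.10,
    frozenset("C"): 0.061,
    frozenset("AB"): 0.05,
    frozenset("AC"): 0.007,
    frozenset("BC"): 0.045,
    frozenset("ABC"): 0.005,
}


def confidence(condition, consequence):
    """P(condition and consequence) / P(condition)."""
    cond = frozenset(condition)
    both = cond | frozenset(consequence)
    return P[both] / P[cond]


for condition, consequence in [("A", "B"), ("AB", "C"), ("AC", "B"), ("BC", "A")]:
    print(f"{condition} -> {consequence}: {confidence(condition, consequence):.3f}")
```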

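Improvement
Improvement helps us to decide whether an association rule is useful or not. It compares the confidence of the rule with the probability of the consequence: Improvement (condition → consequence) = P(condition → consequence) / P(consequence)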

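In our example
Improvement ((B&C) → A) = 0.111 / 0.14 = 0.794
Improvement ((A&B) → C) = 0.1 / 0.061 = 1.639
The value of improvement shows the usefulness of the rule:
a) improvement > 1: the rule predicts the consequence better than offering the product to everybody
b) improvement < 1: the rule predicts the consequence worse than offering the product to everybody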
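The two improvement values can also be checked directly from the purchase counts. This sketch takes improvement to be the confidence of the rule divided by the probability of the consequence, which is what the two calculations above compute. The total of 1000 transactions is an assumption implied by 140 purchases corresponding to 14 %; the names COUNTS, TOTAL and improvement are illustrative.

```python
# Purchase counts from the bicycle example (A: bicycle, B: hand tools, C: tool rack).
COUNTS = {
    frozenset("A"): 140,
    frozenset("B"): 100,
    frozenset("C"): 61,
    frozenset("AB"): 50,
    frozenset("AC"): 7,
    frozenset("BC"): 45,
    frozenset("ABC"): 5,
}
TOTAL = 1000  # assumption: implied by 140 purchases = 14 %


def improvement(condition, consequence):
    """Confidence of the rule divided by the probability of the consequence."""
    cond = frozenset(condition)
    cons = frozenset(consequence)
    rule_confidence = COUNTS[cond | cons] / COUNTS[cond]
    return rule_confidence / (COUNTS[cons] / TOTAL)


for condition, consequence in [("BC", "A"), ("AB", "C")]:
    value = improvement(condition, consequence)
    verdict = "better than chance" if value > 1 else "not better than chance"
    print(f"{condition} -> {consequence}: {value:.3f} ({verdict})")
```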

Dissociation rules
Dissociation rules are similar to association rules, but they also count the inverse of the original items. To obtain them, modify each transaction: a transaction includes an inverse item if, and only if, it does not contain the original item.
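The transaction modification described here can be sketched as follows. The "not X" naming of inverse items and the function name add_inverse_items are assumptions made for illustration; the slides do not prescribe a notation.

```python
def add_inverse_items(transactions, items):
    """Return copies of the transactions extended with inverse items.

    A transaction receives "not X" if, and only if, it does not contain X.
    """
    return [
        set(transaction) | {f"not {item}" for item in items if item not in transaction}
        for transaction in transactions
    ]


# Hypothetical transactions over the items A, B and C.
extended = add_inverse_items([{"A", "B"}, {"B"}], ["A", "B", "C"])
print(extended)
```

After this step the usual association rule analysis can be run on the extended transactions, so rules such as (A & not B) → C become visible. Note that every inverse item adds to the number of items that must be counted.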

Time series
For time series analysis the transactions must have two additional features: time information (e.g. time sequence or time stamp) and identifying information (e.g. customer id, account number in a bank).
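With these two features, the transactions can be arranged into one ordered purchase sequence per customer. A minimal sketch, assuming each record is a (customer id, time stamp, item) triple; the record layout, the sample data and the name build_sequences are hypothetical.

```python
from collections import defaultdict


def build_sequences(records):
    """Group (customer_id, timestamp, item) records into time-ordered item lists."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, item in records:
        by_customer[customer_id].append((timestamp, item))
    # Sorting the (timestamp, item) tuples orders each customer's purchases by time.
    return {
        customer_id: [item for _, item in sorted(events)]
        for customer_id, events in by_customer.items()
    }


# Hypothetical records; the time stamps are plain integers here.
records = [
    ("c1", 2, "tool rack"),
    ("c2", 1, "bicycle"),
    ("c1", 1, "bicycle"),
]
sequences = build_sequences(records)
print(sequences)
```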

Frequent itemsets
Problem: find the itemsets that appear in at least a fixed ratio of the transactions. A-priori trick: if a set of items S is frequent, then every subset of S is also frequent. The procedure is built from the lower level to the upper level (frequent items, frequent pairs, etc.).

A-Priori Algorithm
1. Define a threshold for relative frequency. All items are examined; the set of frequent items is L1.
2. Pairs of items in L1 become the candidate pairs (C2). The relative frequency of each candidate is compared with the threshold; L2 contains the frequent pairs.

A-Priori Algorithm (cont.)
3. The candidate triples (C3) are those sets {A,B,C} all of whose two-element subsets are in L2. L3 will contain the frequent triples.
4. In general, Li is the set of frequent itemsets of size i, and C(i+1) is the set of candidates of size i+1. The steps are repeated until the sets become empty.
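The level-by-level procedure can be sketched in a few lines. This is a plain illustration rather than an optimized implementation: it rescans all baskets for every candidate, and the function name apriori, the sample baskets and the threshold 0.5 are chosen only for the example.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frequent itemset: relative frequency}, built level by level."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def frequent(candidates):
        # Keep the candidates whose relative frequency reaches the threshold.
        result = {}
        for candidate in candidates:
            support = sum(1 for b in baskets if candidate <= b) / n
            if support >= min_support:
                result[candidate] = support
        return result

    level = frequent({frozenset([item]) for b in baskets for item in b})  # L1
    all_frequent = dict(level)
    size = 1
    while level:
        size += 1
        previous = set(level)
        # Candidates of the next size: unions of two frequent sets of the
        # previous size, kept only if every smaller subset is frequent.
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
    return all_frequent


# Hypothetical baskets: every pair is frequent at 0.5, the triple is not.
baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent_sets = apriori(baskets, 0.5)
for itemset, support in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The subset check is the a-priori trick in use: a candidate is counted only if all of its smaller subsets were already found frequent, which keeps the number of counted candidates down.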

Criticism of the A-Priori Algorithm
The algorithm is good if we would like to know only the frequent pairs. When searching for maximal frequent itemsets, too many steps may be needed, and the physical capacity of computers becomes a limit.

Market Basket Mining with High Correlation Analysis
The data are organised in a matrix. The cells contain Boolean values (1: yes, 0: no). This matrix is very sparse. We want to find the highly correlated pairs of columns.
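Because the matrix is sparse, each column can be stored as the set of row indices that hold a 1, and pairs of columns can then be compared directly. The sketch below uses Jaccard similarity (shared rows divided by rows in either column) as the measure of how strongly two columns go together. That choice, the threshold value and the name similar_column_pairs are assumptions for illustration; the slides do not name a specific measure.

```python
from itertools import combinations


def similar_column_pairs(columns, threshold):
    """columns maps a column name to the set of row indices that contain a 1."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        union = columns[a] | columns[b]
        if not union:
            continue  # two all-zero columns carry no information
        similarity = len(columns[a] & columns[b]) / len(union)
        if similarity >= threshold:
            pairs.append((a, b, similarity))
    return pairs


# Hypothetical sparse Boolean matrix with three columns.
columns = {"x": {0, 1, 2}, "y": {0, 1, 2, 3}, "z": {7}}
pairs = similar_column_pairs(columns, 0.7)
print(pairs)
```

Comparing every pair of columns is quadratic in the number of columns, so this direct approach is only practical for small matrices.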

Applications of High Correlation Mining
1. Rows are the documents, columns are the words. The highly correlated pairs of columns give the words that almost always appear together.
2. Rows and columns are Web pages. A cell contains 1 if the page of the row links to the page of the column. Result: pages about the same topic.
3. The page of the column links to the page of the row. Result: the mirror pages.

Conclusion: planning store layout, bundling products, offering coupons.

Future
Further development: hierarchical association rules, association rules maintenance, sequential pattern mining, functional dependency mining.

Thank you! The floor is open for discussion. E-mail: szucs@itm.bme.hu

