1 Data Mining
2 Agenda Examples What is data mining? The Industry comments Techniques
3 Examples “On Friday evenings, shoppers who buy diapers also buy beer.” –Supermarket transaction database “People with good credit ratings have fewer accidents.” –Insurance database “A one-dollar gas station credit-card transaction followed by a large transaction is likely to be indicative of fraud.” –Credit card transactions database
4 More Examples Marketing –Targeted marketing using decision trees Stock selection / Fraud detection –Using neural networks Telecommunications –Churn modeling, identifying valuable customers
5 Even More Examples Healthcare –Fish oil and Raynaud’s disease Finding communities on the Web –Abortion example Personalization –Recommender systems
6 Even More More Examples Games (e.g. Hollywood Stock Exchange) Viral Marketing –Social networks and network mining Sports –NBA Scout
7 Agenda Examples What is data mining? The Industry comments Techniques
8 What is Data Mining?
9 Querying large databases? Learning patterns from data? Building models from data?
10 What is Data Mining? Learning “structure” from large data –“reverse engineering” –“structure” could be patterns or models How is this different from statistics?
11 Data mining techniques Lots of them exist! How to categorize these? –Two approaches Description vs prediction RES framework
12 Classification of the main engines/techniques
13 Representation, Evaluation & Search: Linear Model Example Representation –Risk = 0.93*prior_default*num_cards – 1.3*employed – 0.734 Evaluation –R-squared/degree of fit Search –How did the technique find the coefficients?
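The representation/evaluation/search split can be made concrete with a one-feature linear model. The sketch below is an illustration only: the data are made up, the names `fit_line` and `r_squared` are hypothetical, and closed-form ordinary least squares is assumed as the search procedure (the slide does not say which procedure produced its coefficients).

```python
# Representation: risk = slope * x + intercept (a one-feature linear model).
# Evaluation:     R-squared (fraction of variance explained).
# Search:         closed-form ordinary least squares (an assumption for this sketch).

def fit_line(xs, ys):
    """Search step: least-squares estimates of slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Evaluation step: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: number of cards vs. a made-up risk score.
num_cards = [1, 2, 3, 4, 5]
risk = [1.1, 1.9, 3.2, 3.9, 5.1]

slope, intercept = fit_line(num_cards, risk)
fit = r_squared(num_cards, risk, slope, intercept)
print(f"risk = {slope:.2f} * num_cards + {intercept:.2f}, R-squared = {fit:.3f}")
```

Other techniques swap out one or more of the three pieces, for example a tree instead of a line, or a heuristic search instead of a closed-form solution.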
14 Representation, evaluation and search Different techniques represent, evaluate and search for patterns differently. –Methods can be characterized based on how they do these things. Data mining methods use very different representation schemes, use predictive accuracy as the main evaluation measure, and use heuristic search procedures. Strengths: Can build very accurate models and learn interesting patterns in a bottom-up manner. Weaknesses: Can find false patterns and may “overfit” the learning data. –How to mitigate these? This is one way to think about the difference between DM methods and traditional statistical methods.
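One common way to detect overfitting is to measure accuracy on data the model never saw during learning. The sketch below is a hypothetical illustration, not a method taken from the slides: it uses a 1-nearest-neighbor classifier (which memorizes its training data) on made-up points, two of which carry deliberately noisy labels.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of (x, label) pairs in data that the 1-NN model gets right."""
    hits = sum(1 for x, label in data if predict_1nn(train, x) == label)
    return hits / len(data)

# Hypothetical data: the true rule is label = 1 when x > 5.
# The points at x = 3 and x = 7 are mislabeled (noise).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
holdout = [(2.9, 0), (3.2, 0), (6.8, 1), (7.1, 1), (1.5, 0), (8.5, 1)]

train_acc = accuracy(train, train)      # memorized, noise included
holdout_acc = accuracy(train, holdout)  # the noise now causes errors
print(train_acc, holdout_acc)
```

The gap between the two numbers is the warning sign: the model has learned the noise as if it were a pattern.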
15 Agenda Examples What is data mining? The Industry comments Techniques
16 The Industry Space Data gathering and management –External data sources –Integrating databases to design unified views For realtime support For historical warehouse driven apps Firms –Data vendors, consulting services
17 Customer Centric Architecture [Figure labels: channels (web, phone, golf course); Action; Database; Other Data Sources]
18 The Industry Space Broad Data Analytics –Traditional statistical tools –Data mining tools Firms –SPSS, SAS, Trajecta, IBM, SGI, Gainsmarts, HNC Software Other common sources –In-house analytics development and academia
19 The Industry Space Niche Market Analytics and Services –Fraud detection –Customer Segmentation –Direct Marketing –Bioinformatics –Internet Advertising –Personalization Firms –Examples: Doubletwist, Celera, HNC Software, Knowledge Stream Partners, Adknowledge (acquired by Engage), Epiphany.
20 The Industry Space Broad CRM Technologies and Services –General features Some data collection and integration tools Some analytics and profitability analyses Some features to streamline operations Often customizable based on client needs Boils down to client needs Firms –E.g. Siebel.
21 Data Mining Revisited Smart techniques –Data mining Not a problem. Engineering –Integrating this into an overall data management architecture The more difficult problem When and how to use –The hard part is figuring out which problem to solve, what data to use etc –The importance of thinking “bottom up” for solving problems
22 The Chief Data Officer
24 Agenda Examples What is data mining? The Industry comments Techniques
25 Example DM Models: Neural Networks Attempts to mimic the way neurons work in translating input data into an output (dependent variable)
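As a rough illustration of how inputs are translated into an output, here is a minimal forward pass through a tiny feed-forward network. Everything specific in it is an assumption for the sketch: the sigmoid activation, the two-input/two-hidden-unit shape, and the weights, which are made up rather than learned.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One unit: weighted sum of inputs plus bias, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_unit):
    """Feed the inputs through one hidden layer, then a single output unit."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return neuron(hidden, out_weights, out_bias)

# Hypothetical weights; in practice a training procedure would search for these.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_unit = ([1.2, -0.7], 0.05)

prediction = forward([1.0, 0.0], hidden_layer, output_unit)
print(prediction)
```

Training a network amounts to searching for the weights and biases that make outputs like `prediction` match the dependent variable as closely as possible.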
26 Structure of a Neural Network
27 Surface-fitters or Function Approximators
28 Example DM Models: OLAP (On Line Analytical Processing) Provides visual tools to slice and dice the data
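Underneath the visual tools, slicing and dicing comes down to filtering on one dimension and aggregating a measure over others. The sketch below uses made-up sales facts; the helper names `slice_cube` and `roll_up` are hypothetical and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (region, product, quarter), one measure.
sales = [
    {"region": "East", "product": "beer", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "diapers", "quarter": "Q1", "amount": 80},
    {"region": "West", "product": "beer", "quarter": "Q1", "amount": 60},
    {"region": "East", "product": "beer", "quarter": "Q2", "amount": 120},
    {"region": "West", "product": "diapers", "quarter": "Q2", "amount": 90},
]

def slice_cube(facts, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    return [row for row in facts if row[dim] == value]

def roll_up(facts, dims, measure="amount"):
    """Aggregate the measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

by_region = roll_up(sales, ["region"])
q1_by_product = roll_up(slice_cube(sales, "quarter", "Q1"), ["product"])
print(by_region)
print(q1_by_product)
```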
29 Browsing a Data Cube
30 Example: Clustering Identify homogeneous and separable groups (“clusters”) that have: –maximum similarity between points within a group –maximum difference between groups Applications –Group customers into categories useful for targeted marketing –Identify clusters in image data
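The slide does not name a clustering algorithm. As one possible illustration, the sketch below uses k-means (Lloyd's algorithm) on made-up 2-D points, with the starting centroids fixed by hand so that the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by two numeric features.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Minimizing the distance from each point to its own centroid addresses within-group similarity; separation between groups is a by-product here rather than something the algorithm optimizes directly.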
31 What clusters can look like
32 Example: Classification
33 Example: Nearest neighbor methods Read “Amazon.com recommendations” paper
34 Online Recommender Systems Opportunities –Customized stores and all the associated benefits –Easy measurement –Permits experimentation Challenges –Scale (tens of millions of users, and millions of items) –Need for real-time results –Amount of info on customers varies, but often sparse data
35 Simple collaborative filtering [Figure: customer-by-item matrix with customers C_1 … C_n and items I_1 … I_m] 1. Let C_1 be the vector of zeros and ones corresponding to customer 1. 2. Define similarity between customers A and B as $\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$. 3. In traditional collaborative filtering, for a given customer, find the closest customer and then recommend the other products purchased by this closest customer. Advantages and disadvantages?
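The three steps on this slide can be sketched directly. The purchase vectors and the function name `recommend` are hypothetical, and the sketch assumes ties between equally close customers are broken by dictionary order.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(purchases, customer):
    """Find the most similar other customer and return the indices of the
    items that customer bought which the target customer has not."""
    target = purchases[customer]
    others = [c for c in purchases if c != customer]
    closest = max(others, key=lambda c: cosine(target, purchases[c]))
    return [
        i for i, (mine, theirs) in enumerate(zip(target, purchases[closest]))
        if theirs and not mine
    ]

# Hypothetical 0/1 purchase vectors over four items (I1..I4).
purchases = {
    "C1": [1, 1, 0, 0],
    "C2": [1, 1, 1, 0],
    "C3": [0, 0, 1, 1],
}

print(recommend(purchases, "C1"))  # item indices to recommend to C1
```

Note that every recommendation request scans all other customers, which is where the scale and real-time challenges from the previous slide start to bite.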
36 Content based recommendations Treat recommendations as search for related items. E.g. if you liked “Men In Black” you may get recommendations for comedy films. Advantages and disadvantages?
37 Item-to-Item Collaborative Filtering [Figure: item-by-customer matrix with items I_1 … I_n and customers C_1 … C_m] 1. For each item, find all similar items in an offline computation. 2. Create a similar-items table in which, for each item, the set of all related items is stored.
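A minimal sketch of the offline step follows. It assumes cosine similarity between item vectors (one 0/1 entry per customer) as the similarity measure, which the slide does not specify; the data and the name `build_similar_items` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_similar_items(item_vectors, top_n=2):
    """Offline step: for each item, store its most similar other items."""
    table = {}
    for item, vec in item_vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in item_vectors.items()
            if other != item
        ]
        # Drop unrelated items; sort by similarity, then by name for stable ties.
        scored = [(score, other) for score, other in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[item] = [other for _, other in scored[:top_n]]
    return table

# Hypothetical item vectors: one entry per customer (C1, C2, C3).
item_vectors = {
    "I1": [1, 1, 0],
    "I2": [1, 1, 0],
    "I3": [0, 1, 1],
    "I4": [0, 0, 1],
}

similar_items = build_similar_items(item_vectors)
print(similar_items["I1"])  # online step: a table lookup, no similarity computation
```

Because the expensive pairwise comparison happens offline, serving a recommendation reduces to looking up the items a customer already bought in `similar_items`.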
38 Example : Rule discovery methods Read: “On the discovery of statistical quantitative rules”
39 On Evaluation Apparently I would like watching movies on gang violence in New York theaters. –Why? Because… Hamburger grills product recommendation On evaluation –absolutely critical in a world in which more interactions are being structured automatically –‘evaluation’ has multiple aspects, not just how “accurate” a model may seem to be.
40 Agenda Examples What is data mining? The Industry comments Techniques