DAT204 Introduction to Data Mining with SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation.


1 DAT204 Introduction to Data Mining with SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

2 Agenda
  What is Data Mining
  The Data Mining Market
  OLE DB for Data Mining
  Overview of the Data Mining Features in SQL Server 2000
  Demo
  Q&A

3 What Is Data Mining?

4

5 What is DM?
  A process of data exploration and analysis using automatic or semi-automatic means
    – Techniques originate from machine learning, statistics, and databases
    – “Exploring data”: scanning samples of known facts about “cases”
    – “Knowledge”: clusters, rules, decision trees, equations, association rules…
  Once the “knowledge” is extracted it:
    – Can be browsed
        Provides very useful insight into the cases' behavior
    – Can be used to predict values of other cases
        Can serve as a key element in closed-loop analysis

6 What drives high school students to attend college?

7 The deciding factors for high school students to attend college are…
  All Students: Attend College 55% Yes, 45% No
  Split on IQ?
    – IQ = High: Attend College 79% Yes, 21% No
    – IQ = Low: Attend College 45% Yes, 55% No
  Split on Wealth
    – Wealth = True: Attend College 94% Yes, 6% No
    – Wealth = False: Attend College 69% Yes, 21% No
  Split on Parents Encourage? (branches: Parents Encourage = Yes, Parents Encourage = No)
    – Attend College 70% Yes, 30% No
    – Attend College 31% Yes, 69% No

8 Business Oriented DM Problems
  Targeted ads
    – “What banner should I display to this visitor?”
  Cross sells
    – “What other products is this customer likely to buy?”
  Fraud detection
    – “Is this insurance claim a fraud?”
  Churn analysis
    – “Which customers are likely to churn?”
  Risk Management
    – “Should I approve the loan to this customer?”
  …

9

10 Mining Process - Illustrated
  [Diagram: Training Data → DM Engine → Mining Model; Data To Predict + Mining Model → DM Engine → Predicted Data]

11 The Data Mining Market

12 The $$$: Market Size
  DM Tools Market (* IDC):
    – 1999: $341.3M
    – 2000: $455.1M
    – 2001: $449.5M

13 The Players
  Leading vendors
    – SAS
    – SPSS
    – IBM
    – Angoss
    – Hundreds of smaller vendors offering DM algorithms…
  Oracle – Thinking Machines acquisition

14 The Products
  End-to-end horizontal DM tools
    – Extraction, cleansing, loading, modeling, algorithms (dozens), analyst workbench, reporting, charting…
  The customer is the power analyst
    – PhD in statistics is usually required…
  Closed tools, no standard API
    – Total vendor lock-in
    – Limited integration with applications
  DM is an “outsider” in the Data Warehouse
  Extensive consulting required
  Skyrocketing prices
    – $60K+ for a single user license

15 What the analysts say…
  “Stand-alone Data Mining Is Dead” – Forrester
  “The demise of [stand alone] data mining” – Gartner

16 The Microsoft Approach

17 DataPro Users Survey 1999-2001
  “Data mining will be the fastest-growing BI technology…”

18 Market Size of BI * IDC

19 SQL Server 2000 - The Analysis Platform
  SQL 2000 provides a complete Analysis Platform
    – Not an isolated, stand-alone DM product
  Platform means:
    – Standards-based DM APIs (OLE DB for DM) for application development
    – Integrated vision for all technologies and tools
    – Extensible
    – Scalable

20 Data Flow
  [Diagram labels: OLTP, DW, OLAP, DM, DM Apps, Reports & Analysis]

21 Analysis Services 2000 – Components
  [Diagram labels: Manager UI; DSO; Analysis Server (OLAP Engine, DM Engine); Client (OLE DB, OLAP Engine (local), DM Engine (local)); DM DMM; DM Wizards; DM DTS Task; Tree View Control; Cluster View Control; Lift Chart Control; Sample Query Tool]

22 OLE DB for Data Mining…

23 Why OLE DB for DM?
  Make DM a mass market technology by:
  Leveraging existing technologies and knowledge
    – SQL and OLE DB
  Establishing common industry-wide concepts and data presentation
  Changing DM market perception from “proprietary” to “open”
  Increasing the number of players:
    – Reduce the cost and risk of becoming a consumer: one tool works with multiple providers
    – Reduce the cost and risk of becoming a provider: focus on expertise and find many partners to complement the offering

24 Integration With RDBMS
  Customers would like to
    – Build DM models from within their RDBMS
    – Train the models directly off their relational tables
    – Perform predictions as relational queries (tables in, tables out)
    – Feel that DM is a native part of their database
  Therefore…
    – Data mining models are relational objects
    – All operations on the models are relational
    – The language used is SQL (w/Extensions)
  The effect: every DBA and VB developer can become a DM developer

25 Creating a Data Mining Model (DMM)

26 Identifying the “Cases”
  DM algorithms analyze “cases”
  The “case” is the entity being categorized and classified
  Examples
    – Customer credit risk analysis: Case = Customer
    – Product profitability analysis: Case = Product
    – Promotion success analysis: Case = Promotion
  Each case encapsulates all we know about the entity

27 A Simple Set of Cases
  StudentID  Gender  ParentIncome  IQ   Encouragement   CollegePlans
  1          Male    23400         120  Not Encouraged  No
  2          Female  79200         90   Encouraged      Yes
  3          Male    42000         105                  Yes

28 More Complicated Cases
  Cust ID  Age  Marital Status  IQ  Favorite Movies (Title Score)
  1        35   M               2   Star Wars 8; Toy Story 9; Terminator 7
  2        20   S               3   Star Wars 7; Braveheart 7; The Matrix 10
  3        57   M               2   Sixth Sense 9; Casablanca 10
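A minimal sketch, in Python rather than DMX, of how a case with a nested table might be held in memory. The class names `Case` and `MovieRating` and the helper `flatten` are illustrative assumptions, not part of OLE DB for DM. The rows follow the slide's table, and the slide's fourth scalar column is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MovieRating:
    """One row of the nested 'Favorite Movies' table (hypothetical type)."""
    title: str
    score: int


@dataclass
class Case:
    """One case: scalar attributes plus a nested table of ratings."""
    cust_id: int
    age: int
    marital_status: str
    favorite_movies: List[MovieRating] = field(default_factory=list)


# Rows copied from the slide; the in-memory layout is an assumption.
cases = [
    Case(1, 35, "M", [MovieRating("Star Wars", 8),
                      MovieRating("Toy Story", 9),
                      MovieRating("Terminator", 7)]),
    Case(2, 20, "S", [MovieRating("Star Wars", 7),
                      MovieRating("Braveheart", 7),
                      MovieRating("The Matrix", 10)]),
    Case(3, 57, "M", [MovieRating("Sixth Sense", 9),
                      MovieRating("Casablanca", 10)]),
]


def flatten(case_list):
    """Expand nested rows into flat (cust_id, title, score) tuples,
    the shape a single flat relational table would require."""
    return [(c.cust_id, m.title, m.score)
            for c in case_list
            for m in c.favorite_movies]
```

Flattening repeats the customer key once per movie, which is the redundancy a nested table avoids: each case stays one logical row however many movies it lists.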

29 A DMM is a Table!
  A DMM structure is defined as a table
    – Training a DMM means inserting data (patterns) into the table
    – Predicting from a DMM means querying the table
  All information describing the case is contained in columns

30 Creating a Mining Model
  CREATE MINING MODEL [Plans Prediction] (
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG CONTINUOUS,
    IQ DOUBLE CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE PREDICT
  ) USING Microsoft_Decision_Trees

31 Creating a mining model with nested table
  CREATE MINING MODEL MoviePrediction (
    CustomerId LONG KEY,
    Age LONG CONTINUOUS,
    Gender TEXT DISCRETE,
    Education TEXT DISCRETE,
    MovieList TABLE PREDICT (
      MovieName TEXT KEY
    )
  ) USING Microsoft_Decision_Trees

32 Training a DMM

33 Training a DMM means passing it data for which the attributes to be predicted are known
    – Multiple passes are handled internally by the provider!
  Use an INSERT INTO statement
  The DMM will not persist the inserted data
  Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)
  INSERT [INTO] <mining model name> [(columns list)] <source data query>

34 INSERT INTO Plans Prediction
  INSERT INTO [Plans Prediction] (
    StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans
  )
  SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
  FROM [Students]
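The point that training consumes cases without storing them can be sketched in Python. This is a toy frequency counter under stated assumptions, not the Microsoft_Decision_Trees algorithm: the class `ToyMiningModel`, its methods, and the four training rows are all hypothetical.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a mining model: it keeps counts, not rows."""

    def __init__(self, input_column, predict_column):
        self.input_column = input_column
        self.predict_column = predict_column
        # Learned content: input value -> Counter of predicted values.
        self.content = defaultdict(Counter)

    def insert_into(self, rows):
        """Consume training cases; only aggregate counts are retained."""
        for row in rows:
            self.content[row[self.input_column]][row[self.predict_column]] += 1

    def probabilities(self, input_value):
        """Return P(predicted value | input value) from the learned counts."""
        counts = self.content[input_value]
        total = sum(counts.values())
        return {value: n / total for value, n in counts.items()}


# Made-up training cases (not taken from the slides).
students = [
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "No"},
    {"Encouragement": "Not Encouraged", "CollegePlans": "No"},
]

model = ToyMiningModel("Encouragement", "CollegePlans")
model.insert_into(students)
```

After `insert_into` returns, the model holds only the counts in `model.content`. The original rows could be discarded and `model.probabilities("Encouraged")` would still work, which mirrors the slide's statement that the DMM does not persist the inserted data.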

35 When Insert Into Is Done…
  The DMM is trained
    – The model can be retrained
    – Content (rules, trees, formulas) can be explored
    – OLE DB Schema rowset
    – SELECT * FROM <mining model name>.CONTENT
    – XML string (PMML)
  Prediction queries can be executed

36 Predictions

37 What are Predictions?
  Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
  Predictions = queries
    – The syntax is SQL-like
    – The output is a rowset
  In order to predict you need:
    – Input data set
    – A trained DMM
    – Binding (mapping) information between the input data and the DMM
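The third requirement, binding, can be illustrated with a small Python sketch. It assumes the input data uses different column names than the model. The function `bind_input`, the input names `Sex` and `IQScore`, and the choice to pass unbound model columns as `None` are all assumptions of this sketch, not behavior documented in the slides.

```python
def bind_input(row, mapping, model_columns):
    """Rename input columns to model column names.

    row           -- one input case, keyed by input column name
    mapping       -- input column name -> model column name
    model_columns -- the model's input columns
    Model columns with no bound input are passed as None (treated as missing).
    """
    bound = {column: None for column in model_columns}
    for input_name, value in row.items():
        model_name = mapping.get(input_name)
        if model_name in bound:
            bound[model_name] = value
    return bound


# Hypothetical input row whose column names differ from the model's.
new_student = {"Sex": "Male", "IQScore": 105, "Notes": "transfer"}
mapping = {"Sex": "Gender", "IQScore": "IQ"}
model_columns = ["Gender", "IQ", "Encouragement"]

bound_case = bind_input(new_student, mapping, model_columns)
```

In a prediction query, the ON clause of PREDICTION JOIN plays the role of `mapping` here: it states which input column feeds which model column.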

38 The Truth Table Concept
  Gender  ParentIncome  IQ  Encouragement   College Plans  Probability
  Male    20000         85  Not Encouraged  No             85%
  Male    20000         85                  Yes            15%
  Male    20000         85  Encouraged      No             60%
  Male    20000         85  Encouraged      Yes            40%
  Male    20000         90                  No             80%
  Male    20000         90                  Yes            20%
  Male    20000         90  Encouraged      No             58%
  …

39 Prediction
  Truth table:
  Gender  ParentIncome  IQ  Encouragement   College Plans  Probability
  Male    20000         85  Not Encouraged  No             85%
  Male    20000         85                  Yes            15%
  Male    20000         85  Encouraged      No             60%
  Male    20000         85  Encouraged      Yes            40%
  Male    20000         90                  No             80%
  Male    20000         90                  Yes            20%
  Male    20000         90  Encouraged      No             58%
  Male    20000         90  Encouraged      Yes            42%
  Male    20000         95                  No             78%
  Male    20000         95                  Yes            22%
  Male    20000         95  Encouraged      No             45%

  It’s a JOIN!

  New cases:
  StudentID  Gender  ParentIncome  IQ   Encouragement
  1          Male    43000         85   Not Encouraged
  2          Male    20000         135
  3          Female  25000         105  Encouraged
  4          Male    96000         100  Encouraged
  5          Female  56000         125
  6          Female  46000         90
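The "It's a JOIN!" idea can be sketched in Python as an equality join between new cases and truth-table rows, keeping the most probable outcome per case. The four truth-table rows are the complete "Encouraged" rows from the slide. The two input students (IDs 101 and 102) and the function `prediction_join` are hypothetical, and a real provider evaluates the model rather than materializing such a table.

```python
KEYS = ("Gender", "ParentIncome", "IQ", "Encouragement")

# Complete "Encouraged" rows from the slide's truth table.
truth_table = [
    {"Gender": "Male", "ParentIncome": 20000, "IQ": 85,
     "Encouragement": "Encouraged", "CollegePlans": "No", "Probability": 0.60},
    {"Gender": "Male", "ParentIncome": 20000, "IQ": 85,
     "Encouragement": "Encouraged", "CollegePlans": "Yes", "Probability": 0.40},
    {"Gender": "Male", "ParentIncome": 20000, "IQ": 90,
     "Encouragement": "Encouraged", "CollegePlans": "No", "Probability": 0.58},
    {"Gender": "Male", "ParentIncome": 20000, "IQ": 90,
     "Encouragement": "Encouraged", "CollegePlans": "Yes", "Probability": 0.42},
]

# Hypothetical new cases chosen so that they match rows above.
new_students = [
    {"StudentID": 101, "Gender": "Male", "ParentIncome": 20000, "IQ": 85,
     "Encouragement": "Encouraged"},
    {"StudentID": 102, "Gender": "Male", "ParentIncome": 20000, "IQ": 90,
     "Encouragement": "Encouraged"},
]


def prediction_join(truth_rows, input_rows, keys=KEYS):
    """For each input row, return (id, most probable outcome, probability)."""
    results = []
    for case in input_rows:
        matches = [t for t in truth_rows
                   if all(t[k] == case[k] for k in keys)]
        if not matches:
            continue  # this sketch skips cases with no matching combination
        best = max(matches, key=lambda t: t["Probability"])
        results.append((case["StudentID"], best["CollegePlans"],
                        best["Probability"]))
    return results


predictions = prediction_join(truth_table, new_students)
```

Each result tuple corresponds to what the prediction query returns per input case: the case key, the predicted `CollegePlans` value, and its probability.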

40 The Prediction Query Syntax
  SELECT <columns to return or predict>
  FROM <DMM> PREDICTION JOIN <input data set>
  ON <DMM column> = <input data set column> …

41 Example
  SELECT [New Students].[StudentID],
         [Plans Prediction].[CollegePlans],
         PredictProbability([CollegePlans])
  FROM [Plans Prediction] PREDICTION JOIN [New Students]
  ON [Plans Prediction].[Gender] = [New Students].[Gender] AND
     [Plans Prediction].[IQ] = [New Students].[IQ] AND ...

42 Demo

43 OLE DB for Data Mining Defines API
  [Diagram labels: Consumer, Consumer, …; OLE DB for DM (API); Provider, Provider, …; OLE DB; RDBMS, Cube, Misc. Data Source]

44 OLEDB for DM Configuration Options Demo
  [Diagram: Consumers (MS Analysis Manager, ANGOSS Controls) connect through OLEDB for DM to Providers (MS DM Provider, ANGOSS DM Provider); the four connections are numbered 1 to 4]

45 Demo on OLE DB for DM API using Angoss Controls and Provider

46 For more info…
  DM URL
    – www.microsoft.com/data/oledb
    – www.microsoft.com/data/oledb/DMResKit.htm
  News Group:
    – Microsoft.public.SQLserver.datamining
    – Communities.msn.com/AnalysisServicesDataMining
  White papers:
    – Performance paper: www.unisys.com/windows2000/default-07.asp
    – www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

47 Questions ?

