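Welcome to the Microsoft SQL Data Mining / Data Warehouse Technical Seminar. Today's agenda: Overview of Microsoft SQL Data Mining Technology (左洪, Microsoft); Data Warehousing in Telecommunications (贝志城, 明天高科); Data Mining in CRM (王立军, 中圣公司); 灵通 IT Service Maintenance Management Service System (邹雄文, 广州灵通).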


1 Welcome to the Microsoft SQL Data Mining / Data Warehouse Technical Seminar

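2 Today's Agenda
- Overview of Microsoft SQL Data Mining Technology: 左洪, Microsoft
- Data Warehousing in Telecommunications: 贝志城, 明天高科
- Data Mining in CRM: 王立军, 中圣公司
- 灵通 IT Service Maintenance Management Service System: 邹雄文, 广州灵通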

3 Introduction to Data Mining with SQL Server 2000
左洪, Senior Product Marketing Manager, Microsoft (China) Co., Ltd.

4 Agenda
- What is Data Mining
- The Data Mining Market
- OLE DB for Data Mining
- Overview of the Data Mining Features in SQL Server 2000
- Q&A

5 What Is Data Mining?

6 What is DM?
- A process of data exploration and analysis using automatic or semi-automatic means
  - "Exploring data": scanning samples of known facts about "cases"
  - "Knowledge": clusters, rules, decision trees, equations, association rules…
- Once the "knowledge" is extracted, it:
  - Can be browsed
  - Provides very useful insight into the cases' behavior
  - Can be used to predict values of other cases
  - Can serve as a key element in closed-loop analysis

7 What drives high school students to attend college?

8 The deciding factors for high school students to attend college are…
All Students: Attend College 55% Yes, 45% No
  IQ?
    IQ = High: Attend College 79% Yes, 11% No
      Wealth?
        Wealth = True: Attend College 94% Yes, 6% No
        Wealth = False: Attend College 69% Yes, 21% No
    IQ = Low: Attend College 45% Yes, 55% No
      Parents Encourage?
        Parents Encourage = Yes: Attend College 70% Yes, 30% No
        Parents Encourage = No: Attend College 31% Yes, 69% No
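
The tree on this slide can be read as a lookup structure: follow the branch that matches each attribute of a case and report the statistic of the node you end on. The sketch below is illustrative Python, not SQL Server code. The percentages are copied from the slide; the `Node` class, the `predict_p_yes` helper, and the placement of the Wealth split under IQ = High and the Parents Encourage split under IQ = Low are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    p_yes: float                     # share of cases at this node that attend college
    split_on: Optional[str] = None   # attribute tested below this node (None for a leaf)
    children: Dict[str, "Node"] = field(default_factory=dict)


# Percentages copied from the slide; branch placement is an assumption.
tree = Node(0.55, "IQ", {
    "High": Node(0.79, "Wealth", {"True": Node(0.94), "False": Node(0.69)}),
    "Low": Node(0.45, "ParentsEncourage", {"Yes": Node(0.70), "No": Node(0.31)}),
})


def predict_p_yes(node: Node, case: Dict[str, str]) -> float:
    """Walk down to the deepest node that matches the case and return its P(Yes)."""
    while node.split_on is not None:
        value = case.get(node.split_on)
        if value not in node.children:
            break  # attribute missing or unseen: fall back to the current node
        node = node.children[value]
    return node.p_yes


print(predict_p_yes(tree, {"IQ": "High", "Wealth": "True"}))
print(predict_p_yes(tree, {"IQ": "Low"}))
```

A case with missing attributes simply stops at an inner node, which is why every node, not only the leaves, carries its own statistic.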

9 Business Oriented DM Problems
- Targeted ads
  - "What banner should I display to this visitor?"
- Cross sells
  - "What other products is this customer likely to buy?"
- Fraud detection
  - "Is this insurance claim a fraud?"
- Churn analysis
  - "Which customers are likely to churn?"
- Risk management
  - "Should I approve the loan to this customer?"
- …

10

11 Http://www.tunes.com

12 Mining Process - Illustrated
Training: Training Data -> DM Engine -> Mining Model
Prediction: Data To Predict + Mining Model -> DM Engine -> Predicted Data

13 The Data Mining Market

14 The $$$: Y2000 Market Size
- DM Tools Market: $250M
  - 40% license fees
  - 60% consulting
* Gartner

15 The Players
- Leading vendors
  - SAS
  - SPSS
  - IBM
  - Hundreds of smaller vendors offering DM algorithms…
- Oracle: Thinking Machines acquisition

16 The Products
- End-to-end Data Mining tools
  - Extraction, cleansing, loading, modeling, algorithms (dozens), analyst's workbench, reporting, charting…
- The customer is the power analyst
  - A PhD in statistics is usually required…
- Closed tools, no standard API
  - Total vendor lock-in
  - Limited integration with applications
- DM is an "outsider" in the Data Warehouse
- Extensive consulting required
- Skyrocketing prices
  - $60K+ for a single-user license

17 What the analysts say…
- "Stand-alone Data Mining Is Dead" (Forrester)
- "The demise of [stand alone] data mining" (Gartner)

18 The Microsoft Approach

19 DataPro Users Survey 1999-2001
"Data mining will be the fastest-growing BI technology…"

20 The $$$: 2000 Market Size
- DM Applications Market Size: $1.5B
* IDC

21 SQL Server 2000 - The Analysis Platform
- SQL 2000 provides a complete Analysis Platform
  - Not an isolated, stand-alone DM product
- Platform means:
  - The infrastructure for applications
  - Not an application by itself
  - Integrated vision for all technologies and tools
  - Standards-based APIs (OLE DB for DM)
  - Extensible
  - Scalable

22 Data Flow
Diagram components: DW, OLTP, OLAP, DM, Apps, Reports & Analysis, DM

23 Analysis Services 2000 - Architecture
Diagram components: Manager UI, DSO, Analysis Server, Client, OLE DB, OLAP Engine (local), OLAP Engine, DM Engine, DM Engine (local), DM DMM, DM Wizards, DM DTS Task, Ext.

24 OLE DB for Data Mining…

25 Why OLE DB for DM?
Make DM a mass-market technology by:
- Leveraging existing technologies and knowledge
  - SQL and OLE DB
- Establishing common industry-wide concepts and data presentation
- Changing DM market perception from "proprietary" to "open"
- Increasing the number of players:
  - Reduce the cost and risk of becoming a consumer: one tool works with multiple providers
  - Reduce the cost and risk of becoming a provider: focus on expertise and find many partners to complement the offering
  - Dramatically increase the number of DM developers

26 Integration With RDBMS
- Customers would like to
  - Build DM models from within their RDBMS
  - Train the models directly off their relational tables
  - Perform predictions as relational queries (tables in, tables out)
  - Feel that DM is a native part of their database
- Therefore…
  - Data mining models are relational objects
  - All operations on the models are relational
  - The language used is SQL (with extensions)
- The effect: every DBA and VB developer can become a DM developer

27 Creating a Data Mining Model (DMM)

28 Identifying the "Cases"
- DM algorithms analyze "cases"
- The "case" is the entity being categorized and classified
- Examples
  - Customer credit risk analysis: Case = Customer
  - Product profitability analysis: Case = Product
  - Promotion success analysis: Case = Promotion
- Each case encapsulates all we know about the entity

29 A Simple Set of Cases
StudentID  Gender  ParentIncome  IQ   Encouragement   CollegePlans
1          Male    23400         120  Not Encouraged  No
2          Female  79200         90   Encouraged      Yes
3          Male    42000         105                  Yes

30 More Complicated Cases
Cust ID  Age  Marital Status  IQ  Favorite Movies (Title, Score)
1        35   M               2   Star Wars, 8
                                  Toy Story, 9
                                  Terminator, 7
2        20   S               3   Star Wars, 7
                                  Braveheart, 7
                                  The Matrix, 10
3        57   M               2   Sixth Sense, 9
                                  Casablanca, 10
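
One way to picture these cases is as records that each own a small table of their own. The sketch below is illustrative Python with values taken from the slide; the class names, field names, and the `flatten` helper are assumptions for the example. It also shows what happens if the nested table is flattened instead: the case-level columns get repeated once per movie.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MovieRating:
    title: str
    score: int


@dataclass
class CustomerCase:
    cust_id: int
    age: int
    marital_status: str
    iq: int
    favorite_movies: List[MovieRating]  # nested table: one row per movie


cases = [
    CustomerCase(1, 35, "M", 2, [MovieRating("Star Wars", 8),
                                 MovieRating("Toy Story", 9),
                                 MovieRating("Terminator", 7)]),
    CustomerCase(2, 20, "S", 3, [MovieRating("Star Wars", 7),
                                 MovieRating("Braveheart", 7),
                                 MovieRating("The Matrix", 10)]),
    CustomerCase(3, 57, "M", 2, [MovieRating("Sixth Sense", 9),
                                 MovieRating("Casablanca", 10)]),
]


def flatten(case: CustomerCase) -> List[Tuple[int, int, str, int, str, int]]:
    """Return one flat row per movie, repeating the case-level columns."""
    return [(case.cust_id, case.age, case.marital_status, case.iq, m.title, m.score)
            for m in case.favorite_movies]


flat_rows = [row for case in cases for row in flatten(case)]
print(len(cases), "cases ->", len(flat_rows), "flat rows")
```

Keeping the movies nested preserves the idea that there are three cases, not eight.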

31 A DMM is a Table!
- A DMM structure is defined as a table
  - Training a DMM means inserting data into the table
  - Predicting from a DMM means querying the table
- All information describing the case is contained in columns

32 Creating a Mining Model
    CREATE MINING MODEL [Plans Prediction]
    (
        StudentID     LONG   KEY,
        Gender        TEXT   DISCRETE,
        ParentIncome  LONG   CONTINUOUS,
        IQ            DOUBLE CONTINUOUS,
        Encouragement TEXT   DISCRETE,
        CollegePlans  TEXT   DISCRETE PREDICT
    )
    USING Microsoft_Decision_Trees

33 Creating a mining model with nested table
    CREATE MINING MODEL MoviePrediction
    (
        CustomerId LONG KEY,
        Age        LONG CONTINUOUS,
        Gender     TEXT DISCRETE,
        Education  TEXT DISCRETE,
        MovieList  TABLE PREDICT
        (
            MovieName TEXT KEY
        )
    )
    USING Microsoft_Decision_Trees

34 Training a DMM

35
- Training a DMM means passing it data for which the attributes to be predicted are known
  - Multiple passes are handled internally by the provider!
- Use an INSERT INTO statement
- The DMM will not persist the inserted data
- Instead, it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

    INSERT [INTO] <mining model name> [(columns list)] <source data query>

36 INSERT INTO Plans Prediction
    INSERT INTO [Plans Prediction]
    (
        StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans
    )
    SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
    FROM [CollegePlans]
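
The point that a model analyzes the inserted cases without persisting them can be illustrated with a toy stand-in. The sketch below is plain Python, not how any real provider is implemented: `CountingModel`, its `insert_into` and `content` methods, and the training rows are all hypothetical. The model keeps only summary counts, which play the role of the "content".

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


class CountingModel:
    """Toy stand-in for a mining model: it keeps summary counts, never the rows."""

    def __init__(self, input_col: str, predict_col: str) -> None:
        self.input_col = input_col
        self.predict_col = predict_col
        self.counts: Dict[str, Counter] = defaultdict(Counter)
        self.cases_seen = 0

    def insert_into(self, rows: Iterable[Dict[str, str]]) -> None:
        """Consume training rows one by one and update the counts."""
        for row in rows:
            self.counts[row[self.input_col]][row[self.predict_col]] += 1
            self.cases_seen += 1

    def content(self) -> Dict[str, Dict[str, int]]:
        """Return the learned content: predicted-value counts per input value."""
        return {value: dict(counter) for value, counter in self.counts.items()}


# Hypothetical training rows (not taken from the slides).
training_rows = [
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "No"},
    {"Encouragement": "Not Encouraged", "CollegePlans": "No"},
]

model = CountingModel("Encouragement", "CollegePlans")
model.insert_into(iter(training_rows))  # an iterator: rows are seen once, then gone
print(model.cases_seen, model.content())
```

Calling `insert_into` again with more rows updates the same counts, which loosely mirrors the idea that a trained model can be retrained.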

37 When Insert Into Is Done…
- The DMM is trained
  - The model can be retrained
  - Content (rules, trees, formulas) can be explored
  - OLE DB schema rowset
  - SELECT * FROM <mining model name>.CONTENT
  - XML string (PMML)
- Prediction queries can be executed

38 Predictions

39 What are Predictions?
- Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
- Predictions = queries
  - The syntax is SQL-like
  - The output is a rowset
- In order to predict you need:
  - Input data set
  - A trained DMM
  - Binding (mapping) information between the input data and the DMM
  - Specification of what to predict

40 The Truth Table Concept
Gender  ParentIncome  IQ  Encouragement   CollegePlans  Probability
Male    20000         85  Not Encouraged  No            85%
Male    20000         85  Not Encouraged  Yes           15%
Male    20000         85  Encouraged      No            60%
Male    20000         85  Encouraged      Yes           40%
Male    20000         90  Not Encouraged  No            80%
Male    20000         90  Not Encouraged  Yes           20%
Male    20000         90  Encouraged      No            58%
…

41 Prediction
Gender  ParentIncome  IQ  Encouragement   CollegePlans  Probability
Male    20000         85  Not Encouraged  No            85%
Male    20000         85  Not Encouraged  Yes           15%
Male    20000         85  Encouraged      No            60%
Male    20000         85  Encouraged      Yes           40%
Male    20000         90  Not Encouraged  No            80%
Male    20000         90  Not Encouraged  Yes           20%
Male    20000         90  Encouraged      No            58%
Male    20000         90  Encouraged      Yes           42%
Male    20000         95  Not Encouraged  No            78%
Male    20000         95  Not Encouraged  Yes           22%
Male    20000         95  Encouraged      No            45%

It's a JOIN!

StudentID  Gender  ParentIncome  IQ   Encouragement
1          Male    43000         85   Not Encouraged
2          Male    20000         135
3          Female  25000         105  Encouraged
4          Male    96000         100  Encouraged
5          Female  56000         125
6          Female  46000         90
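
The "It's a JOIN!" idea can be sketched as a lookup of each new case against the truth table. This is illustrative Python only: the two truth-table rows are copied from the slide, while the new cases (IDs 101 to 103), the `prediction_join` function, and the exact-match lookup are assumptions. A real model generalizes to unseen combinations rather than requiring an exact match.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int, int, str]  # (Gender, ParentIncome, IQ, Encouragement)

# Two truth-table entries from the slide: input combination -> P(CollegePlans).
truth_table: Dict[Key, Dict[str, float]] = {
    ("Male", 20000, 85, "Encouraged"): {"No": 0.60, "Yes": 0.40},
    ("Male", 20000, 90, "Encouraged"): {"No": 0.58, "Yes": 0.42},
}

# Hypothetical new cases, chosen so that two of them match a truth-table entry.
new_students = [
    {"StudentID": 101, "Gender": "Male", "ParentIncome": 20000, "IQ": 85,
     "Encouragement": "Encouraged"},
    {"StudentID": 102, "Gender": "Male", "ParentIncome": 20000, "IQ": 90,
     "Encouragement": "Encouraged"},
    {"StudentID": 103, "Gender": "Female", "ParentIncome": 25000, "IQ": 105,
     "Encouragement": "Encouraged"},
]


def prediction_join(cases: List[dict]) -> List[Tuple[int, Optional[str], Optional[float]]]:
    """Join each case to the truth table and return (id, most likely plans, probability)."""
    results = []
    for case in cases:
        key = (case["Gender"], case["ParentIncome"], case["IQ"], case["Encouragement"])
        distribution = truth_table.get(key)
        if distribution is None:
            results.append((case["StudentID"], None, None))  # no matching row
            continue
        plans, probability = max(distribution.items(), key=lambda item: item[1])
        results.append((case["StudentID"], plans, probability))
    return results


for row in prediction_join(new_students):
    print(row)
```

The output has one row per input case, which is the "tables in, tables out" shape of a prediction query.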

42 The Prediction Query Syntax
    SELECT <columns to return or predict>
    FROM <DMM> PREDICTION JOIN <input data set>
    ON <DMM column> = <input data column> …

43 Example
    SELECT [New Students].[StudentID],
           [Plans Prediction].[CollegePlans],
           PredictProbability([CollegePlans])
    FROM [Plans Prediction] PREDICTION JOIN [New Students]
      ON [Plans Prediction].[Gender] = [New Students].[Gender] AND
         [Plans Prediction].[IQ] = [New Students].[IQ] AND ...

44 OLE DB DM Sample Provider with Source
- All required OLE DB objects, such as session, command, and rowset
- The OLE DB for Data Mining syntax parser
- Tokenization of input data
- Query processing engine
- A sample Naïve Bayes algorithm
- Model persistence in XML and binary formats
- Available at www.microsoft.com/data/oledb/DMResKit.htm
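
For readers unfamiliar with Naïve Bayes, the technique itself fits in a few lines. The sketch below is a generic categorical Naïve Bayes with add-one (Laplace) smoothing written in Python; it is not the code shipped with the sample provider, and the `train` and `predict` functions and the training cases are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Case = Dict[str, str]


def train(cases: List[Case], target: str):
    """Count classes and, per (attribute, class), how often each value occurs."""
    class_counts = Counter(case[target] for case in cases)
    value_counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    values: Dict[str, Set[str]] = defaultdict(set)
    for case in cases:
        for attribute, value in case.items():
            if attribute != target:
                value_counts[(attribute, case[target])][value] += 1
                values[attribute].add(value)
    return class_counts, value_counts, values


def predict(model, case: Case) -> Dict[str, float]:
    """Return normalized class probabilities: P(class) times the product of P(value | class)."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total
        for attribute, value in case.items():
            # Add-one smoothing so an unseen value does not zero out the product.
            score *= (value_counts[(attribute, cls)][value] + 1) / (n + len(values[attribute]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: score / norm for cls, score in scores.items()}


# Hypothetical training cases (not taken from the slides).
training_cases = [
    {"Gender": "Male", "Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Gender": "Female", "Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Gender": "Male", "Encouragement": "Not Encouraged", "CollegePlans": "No"},
    {"Gender": "Female", "Encouragement": "Not Encouraged", "CollegePlans": "No"},
]

model = train(training_cases, target="CollegePlans")
probabilities = predict(model, {"Gender": "Male", "Encouragement": "Encouraged"})
print(probabilities)
```

Like the decision tree model, the trained result is a set of counts rather than a copy of the training rows.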

45 Integrated OLAP and DM Analysis

46 Why Use DM with OLAP
- Relational DM is designed for:
  - Reports of patterns
  - Batch predictions fed into an OLTP system
  - Real-time singleton prediction in an operational environment
- OLAP is designed for:
  - Interactive analysis by a knowledge worker
  - Consistent and convenient navigational model
  - Pre-aggregations of OLAP allow faster performance

47 Understanding DM Content – Decision Trees
All Customers: Credit Risk 65% Good, 35% Bad
  Debt?
    Debt = Low: Credit Risk 89% Good, 11% Bad
      Employment Type?
        ET = Salaried: Credit Risk 94% Good, 6% Bad
        ET = Self Employed: Credit Risk 79% Good, 21% Bad
    Debt = High: Credit Risk 45% Good, 55% Bad
      Education?
        Education = College: Credit Risk 70% Good, 30% Bad
        Education = High School: Credit Risk 31% Good, 69% Bad

Customers having high debt and college education:
    Filter([Individual Customers].Members,
        Customers.CurrentMember.Properties("Debt") = "High" And
        Customers.CurrentMember.Properties("Education") = "College")

Customers having low debt and are self employed:
    Filter([Individual Customers].Members,
        Customers.CurrentMember.Properties("Debt") = "Low" And
        Customers.CurrentMember.Properties("Employment Type") = "Self Employed")
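
Each MDX Filter expression on this slide selects the members whose properties match the path to one tree node. A rough Python analogue is shown below as a sketch: the customer records and the `filter_members` helper are hypothetical and only mirror the two filters above.

```python
from typing import Dict, List

Member = Dict[str, str]

# Hypothetical customers with their member properties.
customers: List[Member] = [
    {"Name": "C1", "Debt": "High", "Education": "College", "Employment Type": "Salaried"},
    {"Name": "C2", "Debt": "High", "Education": "High School", "Employment Type": "Salaried"},
    {"Name": "C3", "Debt": "Low", "Education": "College", "Employment Type": "Self Employed"},
    {"Name": "C4", "Debt": "Low", "Education": "College", "Employment Type": "Salaried"},
]


def filter_members(members: List[Member], required: Dict[str, str]) -> List[str]:
    """Return the names of members whose properties equal every required value."""
    return [member["Name"] for member in members
            if all(member.get(prop) == value for prop, value in required.items())]


high_debt_college = filter_members(customers, {"Debt": "High", "Education": "College"})
low_debt_self_employed = filter_members(
    customers, {"Debt": "Low", "Employment Type": "Self Employed"})
print(high_debt_college)
print(low_debt_self_employed)
```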

48 …Equivalent DM Dimension
Member                                                  Custom Roll-up        Credit Risk
All Customers                                           -                     Good = 65%, Bad = 35%
  Customers with low debt                               Aggregate(Filter( …   Good = 89%, Bad = 11%
    Customers with low debt and self employed           Aggregate(Filter( …   Good = 79%, Bad = 21%
    Customers with low debt and salaried                Aggregate(Filter( …   Good = 94%, Bad = 6%
  Customers with high debt                              Aggregate(Filter( …   Good = 45%, Bad = 55%
    Customers with high debt and college education      Aggregate(Filter( …   Good = 70%, Bad = 30%
    Customers with high debt and high school education  Aggregate(Filter( …   Good = 31%, Bad = 69%

49 Tree = Dimension
- Every node on the tree is a dimension member
- The node statistics are the member properties
- All members are calculated
  - Formula aggregates the case dimension members that apply to this node
  - The MDX is generated by the DM algorithm
- Analysis Services will automatically generate the calculated dimension based on the DM content, and also a virtual cube
- Applies to
  - Classification (decision trees)
  - Segmentation (clusters)
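
The "filter, then aggregate" pattern behind each calculated member can be sketched in a few lines. In the Python below, each tree node is a named rule, and its statistic is computed over the cases the rule selects. The cases, the `nodes` mapping, and the `good_share` helper are hypothetical; the real MDX is generated by the DM algorithm, as the slide says.

```python
from typing import Callable, Dict, List

Case = Dict[str, str]

# Hypothetical cases (not taken from the slides).
cases: List[Case] = [
    {"Debt": "High", "Education": "College", "CreditRisk": "Good"},
    {"Debt": "High", "Education": "College", "CreditRisk": "Bad"},
    {"Debt": "High", "Education": "High School", "CreditRisk": "Bad"},
    {"Debt": "Low", "Education": "College", "CreditRisk": "Good"},
]

# Each tree node becomes a "member" defined by the rule selecting its cases.
nodes: Dict[str, Callable[[Case], bool]] = {
    "All Customers": lambda c: True,
    "Customers with high debt": lambda c: c["Debt"] == "High",
    "Customers with high debt and college education":
        lambda c: c["Debt"] == "High" and c["Education"] == "College",
}


def good_share(all_cases: List[Case], applies: Callable[[Case], bool]) -> float:
    """Filter the cases that reach a node, then aggregate them into one statistic."""
    selected = [c for c in all_cases if applies(c)]
    good = sum(1 for c in selected if c["CreditRisk"] == "Good")
    return good / len(selected)  # assumes at least one case reaches the node


member_properties = {name: good_share(cases, rule) for name, rule in nodes.items()}
for name, share in member_properties.items():
    print(f"{name}: {share:.0%} Good")
```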

50 Browsing the Virtual Cube
Pivot the DM dimension:
                               WA    OR    CA
All Customers                  3200  2500  8000
  Customers with low debt      2320  1503  4300
  Customers with high debt     880   997   4700
    Customers … college        320   450   2310
    Customers … high school    560   547   2390
Credit Risk: 70% Good, 30% Bad

51 Predictions
- You might want to view predictions for each case
- For example:
  - What is the expected profitability of a product?
  - What is the credit risk of a specific customer?
  - What are the products this customer is likely to buy?
- All of those predictions are available through MDX calculated members
- Singleton query is created automatically

52 Prediction Calculated Member
    Measures.[Probability of High Credit Risk]:
        PREDICT(Customers.CurrentMember, "Credit Risk Model",
            "PredictionProbability(PredictionHistogram("Credit Risk"), 'High')")

53 Predictions Example
Customer          Probability of High Credit Risk  Probability of Low Credit Risk
Joe Smith         73%                              27%
John Dow          68%                              32%
William Clington  45%                              55%
Robert Maxwell    98%                              2%
Denis Rodman      81%                              19%
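
Conceptually, each cell in this table is one entry read out of a per-customer prediction histogram. The Python sketch below assumes a histogram is simply a mapping from state to probability. The two histograms use values from the table, while `prediction_probability` is a hypothetical helper, not the MDX or OLE DB for DM function.

```python
from typing import Dict

# Per-customer histograms for "Credit Risk" (state -> probability), values from the table.
histograms: Dict[str, Dict[str, float]] = {
    "Joe Smith": {"High": 0.73, "Low": 0.27},
    "Robert Maxwell": {"High": 0.98, "Low": 0.02},
}


def prediction_probability(histogram: Dict[str, float], state: str) -> float:
    """Return the probability the histogram assigns to one state (0.0 if absent)."""
    return histogram.get(state, 0.0)


for name, histogram in histograms.items():
    high = prediction_probability(histogram, "High")
    low = prediction_probability(histogram, "Low")
    print(f"{name}: high {high:.0%}, low {low:.0%}")
```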

54 Questions ? E-Mail: billzuo@microsoft.com http://www.microsoft.com/china/sql

