Published by Joy Higgins. Modified over 9 years ago.
1
Welcome to the Microsoft SQL Data Mining / Data Warehouse Technical Seminar
2
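Today's Agenda
  Overview of Microsoft SQL data mining technology - Zuo Hong (左洪), Microsoft
  Data warehousing applications in telecom - Bei Zhicheng (贝志城), 明天高科
  Data mining applications in CRM - Wang Lijun (王立军), 中圣公司
  灵通 IT Service maintenance management service system - Zou Xiongwen (邹雄文), 广州灵通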
3
Introduction to Data Mining with SQL Server 2000
Zuo Hong (左洪), Senior Product Marketing Manager, Microsoft (China) Co., Ltd.
4
Agenda
  What is Data Mining
  The Data Mining Market
  OLE DB for Data Mining
  Overview of the Data Mining Features in SQL Server 2000
  Q&A
5
What Is Data Mining?
6
What is DM?
  A process of data exploration and analysis using automatic or semi-automatic means
    “Exploring data”: scanning samples of known facts about “cases”
    “Knowledge”: clusters, rules, decision trees, equations, association rules…
  Once the “knowledge” is extracted, it:
    Can be browsed
    Provides very useful insight into the cases' behavior
    Can be used to predict values of other cases
    Can serve as a key element in closed-loop analysis
7
What drives high school students to attend college?
8
The deciding factors for high school students to attend college are…
  All Students: Attend College 55% Yes, 45% No
    Split: IQ?
      IQ = High: Attend College 79% Yes, 11% No
        Split: Wealth?
          Wealth = True: Attend College 94% Yes, 6% No
          Wealth = False: Attend College 69% Yes, 21% No
      IQ = Low: Attend College 45% Yes, 55% No
        Split: Parents Encourage? (branches: Parents Encourage = Yes, Parents Encourage = No)
          Attend College 70% Yes, 30% No
          Attend College 31% Yes, 69% No
9
Business Oriented DM Problems
  Targeted ads
    “What banner should I display to this visitor?”
  Cross sells
    “What other products is this customer likely to buy?”
  Fraud detection
    “Is this insurance claim a fraud?”
  Churn analysis
    “Which customers are likely to churn?”
  Risk Management
    “Should I approve the loan to this customer?”
  …
11
Http://www.tunes.com
12
Mining Process - Illustrated
[Diagram: Training Data -> DM Engine -> Mining Model; Data To Predict + Mining Model -> DM Engine -> Predicted Data]
13
The Data Mining Market
14
The $$$: Y2000 Market Size
  DM Tools Market: $250M
    40% license fees
    60% consulting
  * Gartner
15
The Players
  Leading vendors
    SAS
    SPSS
    IBM
    Hundreds of smaller vendors offering DM algorithms…
  Oracle - Thinking Machines acquisition
16
The Products
  End-to-end Data Mining tools
    Extraction, cleansing, loading, modeling, algorithms (dozens), analyst's workbench, reporting, charting…
  The customer is the power-analyst
    PhD in statistics is usually required…
  Closed tools, no standard API
    Total vendor lock-in
    Limited integration with applications
  DM is an “outsider” in the Data Warehouse
  Extensive consulting required
  Skyrocketing prices
    $60K+ for a single user license
17
What the analysts say…
  “Stand-alone Data Mining Is Dead” - Forrester
  “The demise of [stand alone] data mining” - Gartner
18
The Microsoft Approach
19
DataPro Users Survey 1999-2001
  “Data mining will be the fastest-growing BI technology…”
20
The $$$: 2000 Market Size
  DM Applications Market Size: $1.5B
  * IDC
21
SQL Server 2000 - The Analysis Platform
  SQL 2000 provides a complete Analysis Platform
    Not an isolated, stand-alone DM product
  Platform means:
    The infrastructure for applications
    Not an application by itself
    Integrated vision for all technologies and tools
    Standards-based APIs (OLE DB for DM)
    Extensible
    Scalable
22
Data Flow
[Diagram components: OLTP, DW, OLAP, DM, DM Apps, Reports & Analysis]
23
Analysis Services 2000 - Architecture
[Diagram components: Manager UI, DSO, Analysis Server, Client, OLE DB, OLAP Engine, OLAP Engine (local), DM Engine, DM Engine (local), DMM, DM Wizards, DM DTS Task, Ext.]
24
OLE DB for Data Mining…
25
Why OLE DB for DM?
Make DM a mass market technology by:
  Leveraging existing technologies and knowledge
    SQL and OLE DB
  Using common industry-wide concepts and data presentation
  Changing DM market perception from “proprietary” to “open”
  Increasing the number of players:
    Reduce the cost and risk of becoming a consumer: one tool works with multiple providers
    Reduce the cost and risk of becoming a provider: focus on expertise and find many partners to complement the offering
    Dramatically increase the number of DM developers
26
Integration With RDBMS
  Customers would like to:
    Build DM models from within their RDBMS
    Train the models directly off their relational tables
    Perform predictions as relational queries (tables in, tables out)
    Feel that DM is a native part of their database
  Therefore…
    Data mining models are relational objects
    All operations on the models are relational
    The language used is SQL (with extensions)
  The effect: every DBA and VB developer can become a DM developer
27
Creating a Data Mining Model (DMM)
28
Identifying the “Cases”
  DM algorithms analyze “cases”
  The “case” is the entity being categorized and classified
  Examples
    Customer credit risk analysis: Case = Customer
    Product profitability analysis: Case = Product
    Promotion success analysis: Case = Promotion
  Each case encapsulates all we know about the entity
29
A Simple Set of Cases
StudentID  Gender  ParentIncome  IQ   Encouragement   CollegePlans
1          Male    23400         120  Not Encouraged  No
2          Female  79200         90   Encouraged      Yes
3          Male    42000         105                  Yes
30
More Complicated Cases
Cust ID  Age  Marital Status  IQ  Favorite Movies (Title, Score)
1        35   M               2   Star Wars, 8
                                  Toy Story, 9
                                  Terminator, 7
2        20   S               3   Star Wars, 7
                                  Braveheart, 7
                                  The Matrix, 10
3        57   M               2   Sixth Sense, 9
                                  Casablanca, 10
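One way to picture a nested case like the ones above is as one record per customer that carries its own list of movie rows. The following is a minimal Python sketch, not anything defined by OLE DB for DM: the class names `Case` and `MovieRating` and the `flatten` helper are illustrative assumptions, only a subset of the scalar columns is kept, and the values are the first two customers from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MovieRating:
    # One row of the nested "Favorite Movies" table.
    title: str
    score: int


@dataclass
class Case:
    # One case = one customer: scalar columns plus a nested table.
    cust_id: int
    age: int
    marital_status: str
    favorite_movies: List[MovieRating] = field(default_factory=list)


cases = [
    Case(1, 35, "M", [MovieRating("Star Wars", 8),
                      MovieRating("Toy Story", 9),
                      MovieRating("Terminator", 7)]),
    Case(2, 20, "S", [MovieRating("Star Wars", 7),
                      MovieRating("Braveheart", 7),
                      MovieRating("The Matrix", 10)]),
]


def flatten(case_list):
    # Flatten to (cust_id, title, score) rows: the shape a plain
    # relational join of customers to ratings would produce.
    return [(c.cust_id, m.title, m.score)
            for c in case_list for m in c.favorite_movies]


rows = flatten(cases)
```

Keeping the ratings nested keeps "one case = one customer"; the flattened form repeats the customer key on every rating row.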
31
A DMM is a Table!
  A DMM structure is defined as a table
    Training a DMM means inserting data into the table
    Predicting from a DMM means querying the table
  All information describing the case is contained in columns
32
Creating a Mining Model
  CREATE MINING MODEL [Plans Prediction]
  (
    StudentID     LONG   KEY,
    Gender        TEXT   DISCRETE,
    ParentIncome  LONG   CONTINUOUS,
    IQ            DOUBLE CONTINUOUS,
    Encouragement TEXT   DISCRETE,
    CollegePlans  TEXT   DISCRETE PREDICT
  )
  USING Microsoft_Decision_Trees
33
Creating a mining model with nested table
  CREATE MINING MODEL MoviePrediction
  (
    CustomerId LONG KEY,
    Age        LONG CONTINUOUS,
    Gender     TEXT DISCRETE,
    Education  TEXT DISCRETE,
    MovieList  TABLE PREDICT
    (
      MovieName TEXT KEY
    )
  )
  USING Microsoft_Decision_Trees
34
Training a DMM
35
  Training a DMM means passing it data for which the attributes to be predicted are known
    Multiple passes are handled internally by the provider!
  Use an INSERT INTO statement
  The DMM will not persist the inserted data
  Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)
  INSERT [INTO] <mining model name> [(columns list)] <source data query>
36
INSERT INTO Plans Prediction
  INSERT INTO [Plans Prediction]
  (
    StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans
  )
  SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
  FROM [CollegePlans]
37
When Insert Into Is Done…
  The DMM is trained
    The model can be retrained
    Content (rules, trees, formulas) can be explored
      OLE DB schema rowset
      SELECT * FROM <mining model name>.CONTENT
      XML string (PMML)
  Prediction queries can be executed
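The idea that training builds content and does not store the inserted rows can be illustrated outside SQL Server. The following is a minimal Python sketch under loud assumptions: `TinyModel` and its methods are hypothetical names, the "algorithm" only counts CollegePlans outcomes per Encouragement value (it is not Microsoft_Decision_Trees), and the training pairs are illustrative rather than taken from the deck.

```python
from collections import Counter, defaultdict


class TinyModel:
    """Toy stand-in for a mining model: stores counts, never the cases."""

    def __init__(self):
        # Encouragement value -> Counter of CollegePlans outcomes.
        self.content = defaultdict(Counter)

    def insert_into(self, cases):
        # "Training": consume each case, update statistics, drop the row.
        for encouragement, college_plans in cases:
            self.content[encouragement][college_plans] += 1

    def predict_probability(self, encouragement, outcome):
        # Relative frequency of the outcome among matching training cases.
        counts = self.content[encouragement]
        total = sum(counts.values())
        return counts[outcome] / total if total else 0.0


# (Encouragement, CollegePlans) pairs; illustrative training data.
training = [
    ("Not Encouraged", "No"),
    ("Encouraged", "Yes"),
    ("Encouraged", "Yes"),
    ("Encouraged", "No"),
    ("Not Encouraged", "No"),
]

model = TinyModel()
model.insert_into(training)
p_yes = model.predict_probability("Encouraged", "Yes")
```

After `insert_into` returns, the model holds only the per-value counts, which is the toy analogue of browsing the model content instead of the source rows.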
38
Predictions
39
What are Predictions?
  Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
  Predictions = queries
    The syntax is SQL-like
    The output is a rowset
  In order to predict you need:
    Input data set
    A trained DMM
    Binding (mapping) information between the input data and the DMM
    Specification of what to predict
40
The Truth Table Concept
Gender  ParentIncome  IQ  Encouragement   CollegePlans  Probability
Male    20000         85  Not Encouraged  No            85%
Male    20000         85  Not Encouraged  Yes           15%
Male    20000         85  Encouraged      No            60%
Male    20000         85  Encouraged      Yes           40%
Male    20000         90  Not Encouraged  No            80%
Male    20000         90  Not Encouraged  Yes           20%
Male    20000         90  Encouraged      No            58%
…
41
Prediction
Gender  ParentIncome  IQ  Encouragement   CollegePlans  Probability
Male    20000         85  Not Encouraged  No            85%
Male    20000         85  Not Encouraged  Yes           15%
Male    20000         85  Encouraged      No            60%
Male    20000         85  Encouraged      Yes           40%
Male    20000         90  Not Encouraged  No            80%
Male    20000         90  Not Encouraged  Yes           20%
Male    20000         90  Encouraged      No            58%
Male    20000         90  Encouraged      Yes           42%
Male    20000         95  Not Encouraged  No            78%
Male    20000         95  Not Encouraged  Yes           22%
Male    20000         95  Encouraged      No            45%

It’s a JOIN!

StudentID  Gender  ParentIncome  IQ   Encouragement
1          Male    43000         85   Not Encouraged
2          Male    20000         135
3          Female  25000         105  Encouraged
4          Male    96000         100  Encouraged
5          Female  56000         125
6          Female  46000         90
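The "It's a JOIN" idea can be sketched in a few lines of Python: treat the trained model as a lookup table keyed by the input columns and join new cases against it. This is only an illustration under assumptions: `prediction_join` and `JOIN_COLUMNS` are hypothetical names, the two truth-table entries reuse the "Encouraged" rows for IQ 85 and 90 from the slide, the new students are made-up rows chosen so that some of them match, and a real provider generalizes to unseen combinations instead of requiring an exact key match.

```python
# Learned "truth table": (Gender, ParentIncome, IQ, Encouragement) ->
# {CollegePlans: probability}. Values taken from the slide's rows.
truth_table = {
    ("Male", 20000, 85, "Encouraged"): {"No": 0.60, "Yes": 0.40},
    ("Male", 20000, 90, "Encouraged"): {"No": 0.58, "Yes": 0.42},
}

# New cases to score; hypothetical rows, not the slide's student list.
new_students = [
    {"StudentID": 101, "Gender": "Male", "ParentIncome": 20000,
     "IQ": 85, "Encouragement": "Encouraged"},
    {"StudentID": 102, "Gender": "Male", "ParentIncome": 20000,
     "IQ": 90, "Encouragement": "Encouraged"},
    {"StudentID": 103, "Gender": "Female", "ParentIncome": 25000,
     "IQ": 105, "Encouragement": "Encouraged"},
]

JOIN_COLUMNS = ("Gender", "ParentIncome", "IQ", "Encouragement")


def prediction_join(table, cases):
    # Inner join on the input columns; for each matching case emit the
    # most likely outcome together with its probability.
    result = []
    for case in cases:
        key = tuple(case[c] for c in JOIN_COLUMNS)
        distribution = table.get(key)
        if distribution is None:
            continue  # no matching truth-table row in this toy example
        outcome = max(distribution, key=distribution.get)
        result.append((case["StudentID"], outcome, distribution[outcome]))
    return result


predictions = prediction_join(truth_table, new_students)
```

The returned tuples play the role of the output rowset: one row per matched input case, carrying the predicted column and its probability.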
42
The Prediction Query Syntax
  SELECT <columns to be returned or predicted>
  FROM <mining model name>
  PREDICTION JOIN <input data set>
  ON <mining model column> = <input data set column> …
43
Example
  SELECT [New Students].[StudentID],
         [Plans Prediction].[CollegePlans],
         PredictProbability([CollegePlans])
  FROM [Plans Prediction] PREDICTION JOIN [New Students]
  ON [Plans Prediction].[Gender] = [New Students].[Gender] AND
     [Plans Prediction].[IQ] = [New Students].[IQ] AND ...
44
OLE DB DM Sample Provider with Source
  All required OLE DB objects, such as session, command, and rowset
  The OLE DB for Data Mining syntax parser
  Tokenization of input data
  Query processing engine
  A sample Naïve Bayes algorithm
  Model persistence in XML and binary formats
  Available at www.microsoft.com/data/oledb/DMResKit.htm
45
Integrated OLAP and DM Analysis
46
Why Use DM with OLAP
  Relational DM is designed for:
    Reports of patterns
    Batch predictions fed into an OLTP system
    Real-time singleton prediction in an operational environment
  OLAP is designed for:
    Interactive analysis by a knowledge worker
    Consistent and convenient navigational model
    Pre-aggregations of OLAP allow faster performance
47
Understanding DM Content – Decision Trees
  All Customers: Credit Risk 65% Good, 35% Bad
    Split: Debt?
      Debt = Low: Credit Risk 89% Good, 11% Bad
        Split: Employment Type (ET)?
          ET = Salaried: Credit Risk 94% Good, 6% Bad
          ET = Self Employed: Credit Risk 79% Good, 21% Bad
      Debt = High: Credit Risk 45% Good, 55% Bad
        Split: Education?
          Education = College: Credit Risk 70% Good, 30% Bad
          Education = High School: Credit Risk 31% Good, 69% Bad
Customers having high debt and college education:
  Filter([Individual Customers].Members,
    Customers.CurrentMember.Properties("Debt") = "High" And
    Customers.CurrentMember.Properties("Education") = "College")
Customers having low debt who are self employed:
  Filter([Individual Customers].Members,
    Customers.CurrentMember.Properties("Debt") = "Low" And
    Customers.CurrentMember.Properties("Employment Type") = "Self Employed")
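To make the link between a tree node and its filter condition concrete, here is a minimal Python sketch that encodes the credit-risk tree as read from the slide and walks it for one customer. The nested-dict layout and the `classify` function are illustrative assumptions, not how Analysis Services stores model content; the path it returns is the list of (attribute, value) conditions that the MDX `Filter` expressions above spell out by hand.

```python
# The credit-risk tree from the slide as nested dicts (percentages).
tree = {
    "stats": {"Good": 65, "Bad": 35},
    "split": "Debt",
    "children": {
        "Low": {
            "stats": {"Good": 89, "Bad": 11},
            "split": "Employment Type",
            "children": {
                "Salaried": {"stats": {"Good": 94, "Bad": 6}},
                "Self Employed": {"stats": {"Good": 79, "Bad": 21}},
            },
        },
        "High": {
            "stats": {"Good": 45, "Bad": 55},
            "split": "Education",
            "children": {
                "College": {"stats": {"Good": 70, "Bad": 30}},
                "High School": {"stats": {"Good": 31, "Bad": 69}},
            },
        },
    },
}


def classify(node, customer):
    # Follow split attributes until a leaf (or an unknown value) is hit.
    # Returns the path of conditions and the statistics of the last node.
    path = []
    while "split" in node:
        attribute = node["split"]
        value = customer.get(attribute)
        if value not in node["children"]:
            break
        path.append((attribute, value))
        node = node["children"][value]
    return path, node["stats"]


path, stats = classify(tree, {"Debt": "High", "Education": "College"})
```

For a customer whose attribute needed by a deeper split is missing, the walk stops early and reports the statistics of the deepest node it could reach.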
48
…Equivalent DM Dimension
Member                                                     Custom Roll-up         Credit Risk
All Customers                                              -                      Good = 65%, Bad = 35%
  Customers with low debt                                  Aggregate(Filter( …    Good = 89%, Bad = 11%
    Customers with low debt and self employed              Aggregate(Filter( …    Good = 79%, Bad = 21%
    Customers with low debt and salaried                   Aggregate(Filter( …    Good = 94%, Bad = 6%
  Customers with high debt                                 Aggregate(Filter( …    Good = 45%, Bad = 55%
    Customers with high debt and college education         Aggregate(Filter( …    Good = 70%, Bad = 30%
    Customers with high debt and high school education     Aggregate(Filter( …    Good = 31%, Bad = 69%
49
Tree = Dimension
  Every node on the tree is a dimension member
  The node statistics are the member properties
  All members are calculated
    The formula aggregates the case dimension members that apply to this node
    The MDX is generated by the DM algorithm
  Analysis Services automatically generates the calculated dimension, and also a virtual cube, based on the DM content
  Applies to:
    Classification (decision trees)
    Segmentation (clusters)
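The statement that each calculated member aggregates the cases that apply to its node can be sketched in Python as a filter followed by a sum. Everything here is an assumption made for illustration: the customer rows and the `Sales` measure are invented, and `node_member` and `aggregate` are hypothetical helpers standing in for the generated `Aggregate(Filter(...))` MDX.

```python
# Hypothetical cases with a hypothetical Sales measure.
customers = [
    {"Name": "A", "Debt": "High", "Education": "College", "Sales": 120},
    {"Name": "B", "Debt": "High", "Education": "High School", "Sales": 80},
    {"Name": "C", "Debt": "Low", "Education": "College", "Sales": 200},
    {"Name": "D", "Debt": "High", "Education": "College", "Sales": 50},
]


def node_member(conditions):
    # Build the filter predicate for one tree node from its path,
    # e.g. {"Debt": "High", "Education": "College"}.
    def applies(case):
        return all(case.get(k) == v for k, v in conditions.items())
    return applies


def aggregate(cases, predicate, measure):
    # Analogue of Aggregate(Filter(cases, predicate)) for a summed measure.
    return sum(c[measure] for c in cases if predicate(c))


high_debt = node_member({"Debt": "High"})
high_debt_college = node_member({"Debt": "High", "Education": "College"})

sales_high_debt = aggregate(customers, high_debt, "Sales")
sales_high_debt_college = aggregate(customers, high_debt_college, "Sales")
```

Because a child node adds one more condition to its parent's path, its aggregate can never exceed the parent's for a non-negative summed measure, which is what makes the tree browsable as a hierarchy.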
50
Browsing the Virtual Cube
Pivot the DM dimension:
                                  WA    OR    CA
All Customers                     3200  2500  8000
  Customers with low debt         2320  1503  4300
  Customers with high debt        880   997   4700
    Customers … college           320   450   2310
    Customers … high school       560   547   2390
Credit Risk: 70% Good, 30% Bad
51
Predictions
  You might want to view predictions for each case
  For example:
    What is the expected profitability of a product?
    What is the credit risk of a specific customer?
    What are the products this customer is likely to buy?
  All of those predictions are available through MDX calculated members
  Singleton query is created automatically
52
Prediction Calculated Member
  Measures.[Probability of High Credit Risk]:
    PREDICT(Customers.CurrentMember, “Credit Risk Model”,
      “PredictionProbability(PredictionHistogram(“Credit Risk”), ‘High’)” )
53
Predictions Example
                  Probability of High Credit Risk  Probability of Low Credit Risk
Joe Smith         73%                              27%
John Dow          68%                              32%
William Clington  45%                              55%
Robert Maxwell    98%                              2%
Denis Rodman      81%                              19%
54
Questions ? E-Mail: billzuo@microsoft.com http://www.microsoft.com/china/sql