Published by Beverley Norton. Modified over 9 years ago.
1
DAT205: Advanced Data Mining Using SQL Server 2000
ZhaoHui Tang, Program Manager, SQL Server Analysis Services, Microsoft Corporation
2
Agenda
Microsoft Data Mining Algorithms
OLE DB for DM Data mining query
Data Mining Case Study: Click Stream Analysis
  – Customer Segmentation
  – Site affiliation
  – Target ads in banner
Performance of Microsoft Data Mining Algorithms
Q&A
3
Data Mining Algorithms in SQL Server 2000
4
Decision Tree
Popular technique for classification and prediction tasks
  – Churn analysis
  – Credit risk analysis
  – …
Easy to understand
  – any path from the root node to a leaf forms a rule
Fast to build
Prediction based on leaf node stats
Variations: C4.5, C5, CART, CHAID
[Figure: decision tree for "Attend College"]
  All Students: 55% Yes, 45% No
  IQ = High: 79% Yes, 21% No
  IQ ≠ High: 35% Yes, 65% No
  Parent Income = High: 94% Yes, 6% No
  Parent Income = Low: 69% Yes, 31% No
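The two points "any path to a leaf forms a rule" and "prediction based on leaf node stats" can be sketched in a few lines of Python. This is only an illustration built from the percentages on the slide. Hanging the Parent Income split under the IQ = High branch is an assumption (it is the only placement consistent with the percentages), and the `TREE` layout, the `"*"` catch-all key and the `predict` function are hypothetical names, not part of any Microsoft API.

```python
# Node statistics copied from the slide. The nesting of the Parent Income
# split under IQ = High is an assumption made for this sketch.
TREE = {
    "stats": {"Yes": 0.55, "No": 0.45},  # All Students
    "split": "IQ",
    "children": {
        "High": {
            "stats": {"Yes": 0.79, "No": 0.21},
            "split": "ParentIncome",
            "children": {
                "High": {"stats": {"Yes": 0.94, "No": 0.06}},
                "Low": {"stats": {"Yes": 0.69, "No": 0.31}},
            },
        },
        "*": {"stats": {"Yes": 0.35, "No": 0.65}},  # every other IQ value
    },
}


def predict(case, node=TREE):
    """Walk from the root towards a leaf; return (label, probability, rule)."""
    rule = []
    while "split" in node:
        attribute = node["split"]
        children = node["children"]
        value = case.get(attribute)
        branch = value if value in children else "*"
        if branch not in children:
            break  # unknown value: answer from the current node's statistics
        rule.append(f"{attribute}={branch}")
        node = children[branch]
    label = max(node["stats"], key=node["stats"].get)
    return label, node["stats"][label], " and ".join(rule)


print(predict({"IQ": "High", "ParentIncome": "High"}))
print(predict({"IQ": "Low"}))
```

The returned rule string is simply the path that was followed, which is why a tree is easy to read as a set of rules.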
5
How tree works
CollegePlan counts (Yes / No) by attribute value:
Attribute             Value    Yes     No
IQ                    High     300    100
IQ                    Medium   500   1000
IQ                    Low      200    900
Parent Encouragement  True     700    400
Parent Encouragement  False    300   1600
Parent Income         High     400    400
Parent Income         False    600   1600
Gender                Male     500   1100
Gender                Female          900
[Figure: four bar charts of Yes/No counts by IQ (High, Medium, Low), Parent Income (PI), Parent Encouragement (PE) and Gender (Male, Female)]
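One way to read these counts is to score each candidate attribute by how much it reduces the entropy of CollegePlan. The sketch below uses plain information gain as a stand-in. The deck names "Bayesian score and Entropy" as splitting methods but does not give the exact formula, so this is not necessarily what the product computes. Gender is left out because one of its counts did not survive in this transcript, and the names `COUNTS`, `entropy` and `information_gain` are illustrative.

```python
from math import log2

# (Yes, No) counts of CollegePlan per attribute value, taken from the slide.
# Gender is omitted: one of its counts is missing from the transcript.
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "Parent Encouragement": {"True": (700, 400), "False": (300, 1600)},
    "Parent Income": {"High": (400, 400), "False": (600, 1600)},
}


def entropy(counts):
    """Shannon entropy (in bits) of a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(groups):
    """Entropy of the pooled counts minus the weighted entropy of each group."""
    groups = list(groups)
    pooled = [sum(column) for column in zip(*groups)]
    n = sum(pooled)
    weighted = sum(sum(group) / n * entropy(group) for group in groups)
    return entropy(pooled) - weighted


gains = {name: information_gain(values.values()) for name, values in COUNTS.items()}
best = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain:.3f} bits")
print("best split:", best)
```

Under this score Parent Encouragement comes out ahead of IQ and Parent Income.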
6
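Split recursively
[Figure: first split of the tree, on Parent Encouragement]
  All Students: College Plan 33% Yes, 67% No
  Parent Encouragement = True: College Plan 63% Yes, 37% No
  Parent Encouragement = False: College Plan 16% Yes, 84% No
CollegePlan counts (Yes / No) within the Parent Encouragement = True branch:
Attribute             Value    Yes     No
IQ                    High     200     50
IQ                    Medium   400    250
IQ                    Low      100    100
Parent Encouragement  True     700    400
Parent Encouragement  False      0      0
Parent Income         High     300    100
Parent Income         False    400    300
Gender                Male            250
Gender                Female   250    150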
7
Microsoft Decision Trees
Probabilistic Classification Tree
Splitting methods: Bayesian score and Entropy
Forward pruning
Tree shape: Binary and N-ary tree
Scalable framework
8
Clustering Algorithm (EM)
A popular method for customer segmentation, mailing lists, profiling…
Algorithm process
  – Assign a set of initial points
  – Assign an initial cluster to each point
  – Assign data points to each cluster with a probability
  – Compute a new central point based on weighted computation
  – Cycle until convergence
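The steps above can be sketched as a small EM loop for a one-dimensional mixture of Gaussians. This is a textbook version written for illustration, not the Microsoft Clustering implementation. The function names, the starting variance of 1.0, the variance floor and the convergence tolerance are all assumptions.

```python
import math


def gaussian(x, mean, variance):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def em_1d(points, initial_means, max_iter=100, tol=1e-6):
    """Fit one Gaussian cluster per initial mean; return (means, variances, weights)."""
    means = list(initial_means)
    k = len(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: assign every point to every cluster with a probability.
        memberships = []
        for x in points:
            scores = [w * gaussian(x, mu, var) for w, mu, var in zip(weights, means, variances)]
            total = sum(scores)
            if total == 0.0:
                memberships.append([1.0 / k] * k)  # point is far from every cluster
            else:
                memberships.append([s / total for s in scores])
        # M-step: recompute each cluster centre as a membership-weighted mean.
        new_means = []
        for j in range(k):
            mass = sum(row[j] for row in memberships)
            if mass == 0.0:
                new_means.append(means[j])  # empty cluster: keep the old centre
                continue
            mean_j = sum(row[j] * x for row, x in zip(memberships, points)) / mass
            var_j = sum(row[j] * (x - mean_j) ** 2 for row, x in zip(memberships, points)) / mass
            new_means.append(mean_j)
            variances[j] = max(var_j, 1e-6)  # floor keeps the density finite
            weights[j] = mass / len(points)
        # Cycle until the centres stop moving.
        shift = max(abs(new - old) for new, old in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights


points = [0.8, 1.0, 1.2, 0.9, 1.1, 9.8, 10.0, 10.2, 9.9, 10.1]
means, variances, weights = em_1d(points, initial_means=[0.0, 5.0])
print(means, weights)
```

Unlike k-means, each point contributes to every cluster in proportion to its membership probability, which is the "weighted computation" the slide refers to.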
9
EM Illustration
10
Microsoft Clustering Algorithm (Scalable EM)
[Flow diagram boxes: Data; Fill Buffer; Build/Update Model; Compressed data, Sufficient stats; Identify Data to be Compressed; Stop?; Final Model]
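The deck does not spell out what the "sufficient stats" are. For a Gaussian cluster a common choice, assumed here, is the triple (count, sum, sum of squares): it can be accumulated buffer by buffer and still yields the exact mean and variance, so the raw points need not be kept in memory. The class and function names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum and sum of squares of the points summarised so far."""
    n: float = 0.0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x, weight=1.0):
        self.n += weight
        self.total += weight * x
        self.total_sq += weight * x * x

    def merge(self, other):
        """Combine two summaries without revisiting the original points."""
        return SufficientStats(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        return self.total_sq / self.n - self.mean ** 2


def compress(buffer):
    """Replace a buffer of raw points by its summary."""
    stats = SufficientStats()
    for x in buffer:
        stats.add(x)
    return stats


# Read the data two points at a time and fold each buffer into the model.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
model = SufficientStats()
for start in range(0, len(data), 2):
    model = model.merge(compress(data[start:start + 2]))

print(model.n, model.mean, model.variance)
```

Deciding which points are safe to compress is the harder part of a scalable EM and is not covered by this sketch.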
11
OLE DB for Data Mining
12
OLE DB for DM
Industry standard for data mining
Based on existing technologies
  – SQL
  – OLE DB
Define common concepts for DM
  – Case, Nested Case
  – Mining Model
  – Model Creation
  – Model Training
  – Prediction
Language based API
13
Customer Table
Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …
14
DM Query Language

Create Mining Model CreditRisk
(
    CustomerID long key,
    Gender text discrete,
    Income long continuous,
    Profession text discrete,
    Risk text discrete predict
)
Using Microsoft_Decision_Trees

Insert into CreditRisk (CustomerID, Gender, Income, Profession, Risk)
Select CustomerID, Gender, Income, Profession, Risk
From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
From CreditRisk Prediction Join NewCustomers
On CreditRisk.Gender = NewCustomers.Gender
And CreditRisk.Income = NewCustomers.Income
And CreditRisk.Profession = NewCustomers.Profession
15
Schema Rowsets
Tabular data to provide metadata information
List of Schema Rowsets in OLE DB for DM
  – Mining_Services
  – Mining_Service_Parameters
  – Mining_Models
  – Mining_Columns
  – Mining_Model_Contents
  – Model_Content_PMML
16
Mining Model Contents Schema Rowsets
17
Schema Rowsets & Thin Client Browser
19
Case Study: Click Stream Analysis
20
Schema
Customer: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, EmailTime, ChatTime, GeoLocation
WebClick: CustomerGuid, URLCategory, Time, Duration, ReferPage
21
Web Customer Segmentation
22
Web Visitors Segmentation
23
Segmentation based on Customer table

Create Mining Model CustomerClustering
(
    CustomerID text key,
    DayTimeOnline long continuous,
    NightTimeOnline long continuous,
    BrowserType text discrete,
    ChatTime long continuous,
    EmailTime long continuous,
    GeoLocation text discrete
)
Using Microsoft_Clustering
24
Segmentation based on Customer and WebClick

Create Mining Model CustomerClustering
(
    CustomerID text key,
    DayTimeOnline long continuous,
    NightTimeOnline long continuous,
    BrowserType text discrete,
    ChatTime long continuous,
    EmailTime long continuous,
    GeoLocation text discrete,
    WebClick table
    (
        UrlCategory text key
    )
)
Using Microsoft_Clustering
25
MSFTies Segmentation
26
Web Site Affiliation
27
Association analysis using Microsoft Decision Trees
[Figure: decision trees with paired branches: Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping]
29
Site Affiliation
30
Create Mining Model SiteAffiliation
(
    CustomerID text key,
    WebClick table predict
    (
        UrlCategory text key
    )
)
Using Microsoft_Decision_Trees

Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
OpenRowset('MSDataShape',
    'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
    'Shape {Select CustomerID from Customer} Append ({Select CustomerID, URLCategory from WebClick} relate CustomerID to CustomerID) as WebClick')
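What the Shape ... Append ... relate query hands to the model is one case per customer, with that customer's WebClick rows attached as a nested table. The Python sketch below only mimics that grouping to make the "nested case" idea concrete. The sample rows and the `shape` helper are made up for illustration and have no counterpart in the MSDataShape provider.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the Customer and WebClick tables.
customers = [{"CustomerID": "c1"}, {"CustomerID": "c2"}]
web_clicks = [
    {"CustomerID": "c1", "URLCategory": "Business"},
    {"CustomerID": "c1", "URLCategory": "Stock"},
]


def shape(parents, children, key, nested_name):
    """Attach to each parent row the child rows that share its key value."""
    by_key = defaultdict(list)
    for child in children:
        # Drop the relating key from the nested rows, much as the 'skip'
        # placeholder in the Insert column list ignores that column.
        nested_row = {name: value for name, value in child.items() if name != key}
        by_key[child[key]].append(nested_row)
    return [{**parent, nested_name: by_key.get(parent[key], [])} for parent in parents]


cases = shape(customers, web_clicks, key="CustomerID", nested_name="WebClick")
for case in cases:
    print(case)
```

A customer with no clicks still forms a case; its nested table is simply empty.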
32
Path Prediction
34
Singleton Prediction

Select Flattened
    TopCount((Select URLCategory, $AdjustedProbability as prob
              From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From WebLog Prediction Join
    (Select (Select 'Business' as URLCategory
             Union
             Select 'Telecom' as URLCategory) as WebClick) as input
On WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
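The query combines two ideas: EXCLUSIVE drops the categories the visitor has already seen (here Business and Telecom), and TopCount keeps the five highest-ranked remaining rows. A rough Python sketch of that post-processing is shown below. The probabilities and the extra categories Sports and Travel are invented purely for illustration and do not come from any trained model, and both function names are hypothetical.

```python
def exclusive(candidates, input_items):
    """Keep only candidates that are not already part of the input case."""
    return [row for row in candidates if row["URLCategory"] not in input_items]


def top_count(rows, rank_key, n):
    """Return the n rows with the highest value of rank_key, best first."""
    return sorted(rows, key=lambda row: row[rank_key], reverse=True)[:n]


# Hypothetical per-category scores; a real model would supply these.
candidates = [
    {"URLCategory": "Business", "prob": 0.90},
    {"URLCategory": "Telecom", "prob": 0.85},
    {"URLCategory": "Stock", "prob": 0.40},
    {"URLCategory": "Loan", "prob": 0.30},
    {"URLCategory": "Insurance", "prob": 0.25},
    {"URLCategory": "Sports", "prob": 0.20},
    {"URLCategory": "Shopping", "prob": 0.10},
    {"URLCategory": "Travel", "prob": 0.05},
]

visited = {"Business", "Telecom"}
recommendations = top_count(exclusive(candidates, visited), "prob", 5)
for row in recommendations:
    print(row["URLCategory"], row["prob"])
```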
35
Architecture
[Diagram components: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM; caption: Real Time Prediction]
36
Performance of DM Algorithms
37
DM Performance Study
Joint effort between Unisys & Microsoft
Two parts of the white paper:
  – First part: Use AS2k to build DM Models for a banking business scenario
  – Second part: Performance results of DM algorithms study
Some results in this session…
Details in the paper and SQL Server magazine articles…
38
Data Source for DMMs
39
Training Performance Results…
40
Sample Business Question for Non Nested MDT
1. Identify those customers that are most likely to churn (leave) based on customer demographic information.
41
Non Nested: Training Times for varying Number of Input attributes
Assumptions: 1 million cases, 25 states, 1 predictable attribute
Input Attributes   Training Time
10                 4.08
20                 7.27
50                 31.54
100                40.55
200                129.35
42
Non Nested: Training Times for varying Number of Cases
Assumptions: 20 attributes, 25 states, 1 predictable attribute
Cases        Training Time
10,000       0.38
1,000,000    11.32
5,000,000    34.19
10,000,000   100.53
43
Sample Business Question for Nested MDT
2. Find the list of other products that the customer may be interested in based on the products the customer has purchased.
44
Nested Cases: Training Times for varying Sample size of Case Table
Assumptions: Avg. customer purchases = 25, States in nested = 200, Nested key predictable
Master Cases   Training Time
10,000         15.09
50,000         67.79
100,000        120.88
200,000        240.62
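A quick way to check how these timings scale is to normalise each one to a common batch of 10,000 master cases. The figures below are copied from the slide. The time unit is not stated in this transcript, so the result is only a relative measure, and the function name is illustrative.

```python
# (master cases, training time) pairs from the slide; the time unit is not
# given in the transcript, so only ratios are meaningful here.
RESULTS = [(10_000, 15.09), (50_000, 67.79), (100_000, 120.88), (200_000, 240.62)]


def time_per_10k(results):
    """Training time divided by the number of 10,000-case batches."""
    return [(cases, time / (cases / 10_000)) for cases, time in results]


per_batch = time_per_10k(RESULTS)
for cases, unit_time in per_batch:
    print(f"{cases:>7,} cases: {unit_time:.2f} per 10,000 cases")
```

The cost per batch stays within a narrow band across the four runs, which suggests that training time grows roughly linearly with the number of master cases over this range.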
45
Nested Cases: Training Times for varying Number of Products purchased per customer
Assumptions: 200,000 cases, 1,000 products in nested
Nested Cases   Training Time
10             85.26
25             120.82
50             172.96
100            281.65
46
For more info…
DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
White papers:
  – Performance paper: www.unisys.com/windows2000/default-07.asp
  – www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
47
Don’t forget to complete the on-line Session Feedback form on the Attendee Web site https://web.mseventseurope.com/teched/