ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation DAT205 Advanced Data Mining Using SQL Server 2000.

1 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation DAT205 Advanced Data Mining Using SQL Server 2000

2 Agenda
Microsoft Data Mining Algorithms
OLE DB for DM Data mining query
Data Mining Case Study: Click Stream Analysis
  – Customer Segmentation
  – Site affiliation
  – Target ads in banner
Performance of Microsoft Data Mining Algorithm
Q&A

3 Data Mining Algorithms in SQL Server 2000

4 Decision Tree
Popular technique for classification and prediction tasks
  – Churn analysis
  – Credit risk analysis
  – …
Easy to understand
  – Any path from the root node to a leaf forms a rule
Fast to build
Prediction based on leaf node stats
Variations: C4.5, C5, CART, CHAID
Example tree (Attend College):
  All Students: 55% Yes, 45% No
    IQ = High: 79% Yes, 21% No
      Parent Income = High: 94% Yes, 6% No
      Parent Income = Low: 69% Yes, 31% No
    IQ ≠ High: 35% Yes, 65% No
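The two claims on this slide, that a path to a leaf forms a rule and that prediction uses the leaf node statistics, can be illustrated with a minimal Python sketch. The dictionary layout, the `"*"` catch-all branch, and the `predict` helper are hypothetical names chosen for illustration. Placing the Parent Income nodes under IQ = High is an assumption inferred from the percentages, not something the slide states.

```python
# "Attend College" percentages from the slide. The nesting of the Parent
# Income nodes under IQ = High is an assumption inferred from the numbers.
TREE = {
    "stats": {"Yes": 0.55, "No": 0.45},
    "split": "IQ",
    "children": {
        "High": {
            "stats": {"Yes": 0.79, "No": 0.21},
            "split": "ParentIncome",
            "children": {
                "High": {"stats": {"Yes": 0.94, "No": 0.06}},
                "Low": {"stats": {"Yes": 0.69, "No": 0.31}},
            },
        },
        "*": {"stats": {"Yes": 0.35, "No": 0.65}},  # any IQ value other than High
    },
}


def predict(tree, case):
    """Walk from the root to the deepest matching node; return (label, probability, rule)."""
    node, rule = tree, []
    while "split" in node:
        attr = node["split"]
        value = case.get(attr)
        children = node["children"]
        if value is None:
            break  # attribute missing: stop and use this node's stats
        if value in children:
            rule.append(f"{attr} = {value}")
            node = children[value]
        elif "*" in children:
            rule.append(f"{attr} = other")
            node = children["*"]
        else:
            break  # unseen value with no catch-all branch
    label = max(node["stats"], key=node["stats"].get)
    return label, node["stats"][label], rule


print(predict(TREE, {"IQ": "High", "ParentIncome": "High"}))
```

The returned rule list is the path that was followed, and the probability is read directly from the statistics stored at the node where the walk stopped.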

5 How tree works
College Plan counts by attribute value:
Attribute             Value    Yes    No
IQ                    High     300    100
IQ                    Medium   500    1000
IQ                    Low      200    900
Parent Encouragement  True     700    400
Parent Encouragement  False    300    1600
Parent Income         High     400    400
Parent Income         False    600    1600
Gender                Male     500    1100
Gender                Female          900
[Bar charts of Yes/No counts: IQ (High, Medium, Low); PI (High, FALSE); PE (TRUE, FALSE); Gender (Male, Female)]
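One way to read these counts is to score each candidate attribute by how much it reduces uncertainty about College Plan. The sketch below uses entropy-based information gain, one of the two splitting methods the deck names for Microsoft Decision Trees. It is an illustrative calculation over the slide's counts, not the product's implementation. Gender is left out because its counts are not fully recoverable from the transcript.

```python
import math

# (Yes, No) College Plan counts per attribute value, copied from the slide.
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "Parent Encouragement": {"True": (700, 400), "False": (300, 1600)},
    "Parent Income": {"High": (400, 400), "False": (600, 1600)},
}


def entropy(yes, no):
    """Shannon entropy (in bits) of a two-class count pair."""
    total = yes + no
    h = 0.0
    for n in (yes, no):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(groups):
    """Entropy of the whole population minus the size-weighted entropy of its groups."""
    yes = sum(y for y, _ in groups.values())
    no = sum(n for _, n in groups.values())
    total = yes + no
    weighted = sum((y + n) / total * entropy(y, n) for y, n in groups.values())
    return entropy(yes, no) - weighted


gains = {attr: information_gain(groups) for attr, groups in COUNTS.items()}
best = max(gains, key=gains.get)
for attr, gain in gains.items():
    print(f"{attr}: {gain:.3f}")
print("best split:", best)
```

On these counts the gains come out at roughly 0.17 for Parent Encouragement, 0.10 for IQ, and 0.03 for Parent Income, so Parent Encouragement scores highest under this criterion.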

6 Split recursively
College Plan:
  All Students: 33% Yes, 67% No
    Parent Encouragement = True: 63% Yes, 37% No
    Parent Encouragement = False: 16% Yes, 84% No
College Plan counts by attribute value:
Attribute             Value    Yes    No
IQ                    High     200    50
IQ                    Medium   400    250
IQ                    Low      100    100
Parent Encouragement  True     700    400
Parent Encouragement  False    0      0
Parent Income         High     300    100
Parent Income         False    400    300
Gender                Male     250    250
Gender                Female          150

7 Microsoft Decision Trees
Probabilistic Classification Tree
Splitting methods: Bayesian score and Entropy
Forward pruning
Tree shape: Binary and N-ary tree
Scalable framework

8 Clustering Algorithm (EM)
A popular method for customer segmentation, mailing list, profiling…
Algorithm process
  – Assign a set of initial points
  – Assign an initial cluster to each point
  – Assign data points to each cluster with a probability
  – Compute a new central point based on weighted computation
  – Cycle until convergence
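The steps listed on this slide can be sketched as a small expectation-maximization loop. This is a minimal illustration assuming one-dimensional data and Gaussian clusters. The function names, the starting variance of 1.0, the equal starting weights, and the `min_var` floor are all assumptions made for the example and do not describe Microsoft's implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_1d(points, initial_means, max_iter=100, tol=1e-6, min_var=1e-3):
    k = len(initial_means)
    means = list(initial_means)   # initial points
    variances = [1.0] * k         # assumed starting spread
    weights = [1.0 / k] * k       # assumed equal cluster sizes
    for _ in range(max_iter):
        # E-step: assign every point to every cluster with a probability.
        resp = []
        for x in points:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: recompute each cluster centre as a probability-weighted mean.
        new_means = []
        for j in range(k):
            mass = sum(r[j] for r in resp)
            mean = sum(r[j] * x for r, x in zip(resp, points)) / mass
            var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, points)) / mass
            new_means.append(mean)
            variances[j] = max(var, min_var)  # floor keeps a cluster from collapsing
            weights[j] = mass / len(points)
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:           # cycle until the centres stop moving
            break
    return means, variances, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_1d(data, initial_means=[0.0, 6.0])
print(means, weights)
```

Unlike a hard assignment, each point contributes to every cluster in proportion to its membership probability, which is what "weighted computation" refers to in the fourth step.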

9 EM Illustration

10 Microsoft Clustering Algorithm (Scalable EM)
[Flow diagram: Data; Fill Buffer; Build/Update Model; Compressed data → Sufficient stats; Identify Data to be Compressed; Stop?; Final Model]
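The idea of compressing data into sufficient statistics can be sketched as follows: instead of keeping every row, keep a count, a sum, and a sum of squares, from which a mean and variance can still be recovered. The class and function names below are hypothetical. For simplicity the sketch compresses every buffered row, whereas the diagram shows a separate step that identifies which data to compress, and the slide does not give that selection rule.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum, and sum of squares: enough to recover mean and variance."""
    n: float = 0.0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x, weight=1.0):
        self.n += weight
        self.total += weight * x
        self.total_sq += weight * x * x

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        return self.total_sq / self.n - self.mean ** 2


def summarize_in_chunks(stream, buffer_size):
    """Fill a fixed-size buffer, fold it into the statistics, then discard the rows."""
    stats = SufficientStats()
    buffer = []
    for x in stream:
        buffer.append(x)                  # fill buffer
        if len(buffer) == buffer_size:
            for value in buffer:          # compress buffered rows into stats
                stats.add(value)
            buffer.clear()
    for value in buffer:                  # flush the final partial buffer
        stats.add(value)
    return stats


stats = summarize_in_chunks([1, 2, 3, 4, 5, 6, 7], buffer_size=3)
print(stats.n, stats.mean, stats.variance)
```

Memory use is bounded by the buffer size rather than the data size, which is the property that lets this style of clustering scale to tables that do not fit in memory.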

11 OLE DB for Data Mining

12 OLE DB for DM
Industry standard for data mining
Based on existing technologies
  – SQL
  – OLE DB
Define common concepts for DM
  – Case, Nested Case
  – Mining Model
  – Model Creation
  – Model Training
  – Prediction
Language based API

13 Customer Table
Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …

14 DM Query Language
Create Mining Model CreditRisk
  (CustomerID long key,
   Gender text discrete,
   Income long continuous,
   Profession text discrete,
   Risk text discrete predict)
Using Microsoft_Decision_Trees
Insert into CreditRisk (CustomerID, Gender, Income, Profession, Risk)
  Select CustomerID, Gender, Income, Profession, Risk From Customers
Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
  From CreditRisk Prediction Join NewCustomers
  On CreditRisk.Gender = NewCustomers.Gender
  And CreditRisk.Income = NewCustomers.Income
  And CreditRisk.Profession = NewCustomers.Profession

15 Schema Rowsets
Tabular data to provide metadata information
List of Schema Rowsets in OLE DB for DM
  – Mining_Services
  – Mining_Service_Parameters
  – Mining_Models
  – Mining_Columns
  – Mining_Model_Contents
  – Model_Content_PMML

16 Mining Model Contents Schema Rowsets

17 Schema Rowsets & Thin Client Browser

18

19 Case Study: Click Stream Analysis

20 Schema
Customer: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, EmailTime, ChatTime, GeoLocation
WebClick: CustomerGuid, URLCategory, Time, Duration, ReferPage

21 Web Customer Segmentation

22 Web Visitors Segmentation

23 Segmentation based on Customer table
Create Mining Model CustomerClustering
  (CustomerID text key,
   DayTimeOnline long continuous,
   NightTimeOnline long continuous,
   BrowserType text discrete,
   ChatTime long continuous,
   EmailTime long continuous,
   GeoLocation text discrete)
Using Microsoft_Clustering

24 Segmentation based on Customer and WebClick
Create Mining Model CustomerClustering
  (CustomerID text key,
   DayTimeOnline long continuous,
   NightTimeOnline long continuous,
   BrowserType text discrete,
   ChatTime long continuous,
   EmailTime long continuous,
   GeoLocation text discrete,
   WebClick table
     (UrlCategory text key))
Using Microsoft_Clustering

25 MSFTies Segmentation

26 Web Site Affiliation

27 Association analysis using Microsoft Decision Trees
[Diagram: decision trees with nodes such as Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping]

28 Association analysis using Microsoft Decision Trees
[Diagram: decision trees with nodes such as Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping]

29 Site Affiliation

30
Create Mining Model SiteAffiliation
  (CustomerID text key,
   WebClick table predict
     (UrlCategory text key))
Using Microsoft_Decision_Trees
Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
  OpenRowset('MSDataShape',
    'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
    'Shape {Select CustomerID from Customer}
     Append ({Select CustomerID, URLCategory from WebClick}
       relate CustomerID to CustomerID) as WebClick')

31

32 Path Prediction

33

34 Singleton Prediction
Select Flattened
  Topcount((select URLCategory, $adjustedProbability as prob
            From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From WebLog PREDICTION JOIN
  (select (select 'Business' as URLCategory
           union
           select 'Telecom' as URLCategory) as WebClick) as input
On WebLog.[Web Click].URLCategory = input.WebClick.URLCategory

35 Architecture
[Diagram: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM; Real Time Prediction]

36 Performance of DM Algorithms

37 DM Performance Study
Joint effort between Unisys & Microsoft
Two parts of the white paper:
  – First part: Use AS2k to build DM Models for a banking business scenario
  – Second part: Performance results of DM algorithms study
Some results in this session…
Details in the paper and SQL Server magazine articles…

38 Data Source for DMMs

39 Training Performance Results…

40 Sample Business Question for Non Nested MDT 1 Identify those customers that are most likely to churn (leave) based on customer demographical information.

41 Non Nested: Training Times for varying Number of Input attributes
Assumptions: 1 mm cases, 25 states, 1 predictable attribute
I/P Attributes  Training Time
10              4.08
20              7.27
50              31.54
100             40.55
200             129.35

42 Non Nested: Training Times for varying Number of Cases
Assumptions: 20 attributes, 25 states, 1 predictable attribute
Cases       Training Time
10,000      0.38
1,000,000   11.32
5,000,000   34.19
10,000,000  100.53

43 Sample Business Question for Nested MDT 2 Find the list of other products that the customer may be interested in based on the products the customer has purchased.

44 Nested Cases: Training Times for varying Sample size of Case Table
Assumptions: Avg. customer purchases = 25, States in nested = 200, Nested key predictable
Master Cases  Training Time
10,000        15.09
50,000        67.79
100,000       120.88
200,000       240.62
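A quick way to read the figures on this slide is to normalize each training time by the number of master cases. The short calculation below does that. The slide gives no time unit, so none is assumed, and the helper name is hypothetical.

```python
# (master cases, training time) pairs copied from the slide.
# The time unit is not stated on the slide, so none is assumed here.
RESULTS = [(10_000, 15.09), (50_000, 67.79), (100_000, 120.88), (200_000, 240.62)]


def time_per_thousand(results):
    """Return (cases, training time per 1,000 master cases) for each measurement."""
    return [(cases, time / (cases / 1000)) for cases, time in results]


for cases, rate in time_per_thousand(RESULTS):
    print(f"{cases:>7,} cases: {rate:.3f} per 1,000 cases")
```

The normalized values stay between about 1.2 and 1.5 per 1,000 cases across a 20-fold increase in sample size, which suggests training time grows close to linearly with the number of master cases over this range.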

45 Nested Cases: Training Times for varying Number of Products purchased per customer
Assumptions: 200,000 cases, 1,000 products in nested
Nested Cases  Training Time
10            85.26
25            120.82
50            172.96
100           281.65

46 For more info…
DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
White papers:
  – Performance paper: www.unisys.com/windows2000/default-07.asp, www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

47 Don’t forget to complete the on-line Session Feedback form on the Attendee Web site https://web.mseventseurope.com/teched/

48

