ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation DAT205 Advanced Data Mining Using SQL Server 2000.

Agenda
– Microsoft Data Mining Algorithms
– OLE DB for DM Data mining query
– Data Mining Case Study: Click Stream Analysis
    – Customer Segmentation
    – Site affiliation
    – Target ads in banner
– Performance of Microsoft Data Mining Algorithm
– Q&A

Data Mining Algorithms in SQL Server 2000

Decision Tree
– Popular technique for classification and prediction tasks
    – Churn analysis
    – Credit risk analysis
    – …
– Easy to understand
    – Any path from the root to a leaf forms a rule
– Fast to build
– Prediction based on leaf node stats
– Variations: C4.5, C5, CART, CHAID
Example tree (Attend College):
– All Students: 55% Yes, 45% No
    – IQ = High: 79% Yes, 21% No
        – Parent Income = High: 94% Yes, 6% No
        – Parent Income = Low: 69% Yes, 31% No
    – IQ not High: 35% Yes, 65% No
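
The two points "any path to a leaf forms a rule" and "prediction based on leaf node stats" can be sketched in a few lines of Python. This is only an illustration built from the percentages on the slide; the nested-dict layout, the `predict` helper, and the reading of the second IQ branch as "not High" are assumptions, not part of the product.

```python
# Tree from the slide as nested dicts. Leaf values are P(Attend College = Yes).
# Assumption: the branch printed as "IQ High" in the transcript means "IQ not High".
TREE = {
    "attribute": "IQ",
    "branches": {
        "High": {
            "attribute": "ParentIncome",
            "branches": {"High": {"p_yes": 0.94}, "Low": {"p_yes": 0.69}},
        },
    },
    "default": {"p_yes": 0.35},  # taken when the value matches no listed branch
}


def predict(node, case, path=()):
    """Walk from the root to a leaf; return (P(yes), list of rule conditions)."""
    if "p_yes" in node:
        return node["p_yes"], list(path)
    attr = node["attribute"]
    value = case[attr]
    if value in node["branches"]:
        condition = f"{attr} = {value}"
        return predict(node["branches"][value], case, path + (condition,))
    # Fall back to the catch-all branch (only the root has one in this sketch).
    listed = ", ".join(node["branches"])
    return predict(node["default"], case, path + (f"{attr} not in ({listed})",))


probability, rule = predict(TREE, {"IQ": "High", "ParentIncome": "High"})
print(" AND ".join(rule), "->", probability)
```

The returned rule is simply the list of conditions met on the way down, and the probability is the statistic stored at the leaf that the case reaches.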

How tree works
Counts of College Plan (Yes / No) by attribute value:
Attribute             Value    Yes     No
IQ                    High     300    100
IQ                    Medium   500   1000
IQ                    Low      200    900
Parent Encouragement  True     700    400
Parent Encouragement  False    300   1600
Parent Income         High     400    400
Parent Income         False    600   1600
Gender                Male     500   1100
Gender                Female   n/a    900
[Bar charts: Yes/No counts by IQ, Parent Income (PI), Parent Encouragement (PE), and Gender]
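
One way to read these counts is to score each attribute by how much it reduces entropy in College Plan, since entropy is one of the splitting methods named for Microsoft Decision Trees. The sketch below is an assumption-laden illustration, not the product's scoring code: the counts are my reading of the flattened table, Gender is omitted because its Yes counts are incomplete in the transcript, and the function names are hypothetical.

```python
from math import log2

# (yes, no) counts per attribute value, as read from the slide's table.
# Gender is left out: one of its "Yes" counts did not survive extraction.
# The second Parent Income value is labeled "False" on the slide.
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "ParentEncouragement": {"True": (700, 400), "False": (300, 1600)},
    "ParentIncome": {"High": (400, 400), "False": (600, 1600)},
}


def entropy(yes, no):
    """Binary entropy (in bits) of a yes/no count pair."""
    total = yes + no
    h = 0.0
    for n in (yes, no):
        if n:
            p = n / total
            h -= p * log2(p)
    return h


def information_gain(groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    yes = sum(y for y, _ in groups.values())
    no = sum(n for _, n in groups.values())
    total = yes + no
    after = sum((y + n) / total * entropy(y, n) for y, n in groups.values())
    return entropy(yes, no) - after


gains = {attr: information_gain(groups) for attr, groups in COUNTS.items()}
best = max(gains, key=gains.get)
for attr, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{attr}: {gain:.3f}")
```

Under this reading, Parent Encouragement gives the largest gain among the three scored attributes, so it would be the first split.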

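Split recursively
College Plan tree:
– All Students: 33% Yes, 67% No
    – Parent Encouragement = True: 63% Yes, 37% No
    – Parent Encouragement = False: 16% Yes, 84% No
Counts of College Plan (Yes / No) within the Parent Encouragement = True branch:
Attribute             Value    Yes     No
IQ                    High     200     50
IQ                    Medium   400    250
IQ                    Low      100    100
Parent Encouragement  True     700    400
Parent Encouragement  False      0      0
Parent Income         High     300    100
Parent Income         False    400    300
Gender                Male     n/a    250
Gender                Female   n/a    150
Only one Gender Yes count (250) is legible in the transcript, and its column is unclear.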

Microsoft Decision Trees
– Probabilistic Classification Tree
– Splitting methods: Bayesian score and Entropy
– Forward pruning
– Tree shape: Binary and N-ary tree
– Scalable framework

Clustering Algorithm (EM)
– A popular method for customer segmentation, mailing lists, profiling…
– Algorithm process
    – Assign a set of initial points
    – Assign an initial cluster to each point
    – Assign data points to each cluster with a probability
    – Compute a new central point based on weighted computation
    – Cycle until convergence
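
The process above can be sketched for the simplest case. The following is a minimal illustration that assumes one-dimensional data and Gaussian clusters (the slide does not say which distribution is used); `em_1d`, its parameters, and the sample points are all hypothetical.

```python
import math


def em_1d(points, means, iterations=100, tol=1e-6):
    """Soft-assign 1-D points to clusters and re-estimate until the means settle."""
    k = len(means)
    means = list(means)          # initial points, one per cluster
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M step: new central points from the probability-weighted data.
        new_means = []
        for j in range(k):
            rj = max(sum(r[j] for r in resp), 1e-12)
            mj = sum(r[j] * x for r, x in zip(resp, points)) / rj
            vj = sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, points)) / rj
            new_means.append(mj)
            variances[j] = max(vj, 1e-6)  # keep the variance strictly positive
            weights[j] = rj / len(points)
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:  # converged: the central points stopped moving
            break
    return means, variances, weights


data = [0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1, 10.3]
centers, _, mix = em_1d(data, means=[0.0, 5.0])
print(sorted(centers), mix)
```

Unlike hard assignment, every point contributes to every cluster's new center in proportion to its membership probability, which is what "weighted computation" refers to in the list above.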

EM Illustration [figure]

Microsoft Clustering Algorithm (Scalable EM)
Flowchart steps: Data; Fill Buffer; Build/Update Model; Compressed data → Sufficient stats; Identify Data to be Compressed; Stop?; Final Model
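
The idea of replacing buffered rows with sufficient statistics can be illustrated as follows. This is a sketch under strong assumptions: it tracks a single numeric attribute, compresses every buffer in full, and uses count, sum, and sum of squares as the statistics. The slide does not say which rows the product compresses or which statistics it keeps, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum, and sum of squares: enough to recover mean and variance."""
    n: float = 0.0
    s: float = 0.0
    ss: float = 0.0

    def add(self, x, weight=1.0):
        self.n += weight
        self.s += weight * x
        self.ss += weight * x * x

    def merge(self, other):
        """Fold another summary into this one without revisiting raw rows."""
        self.n += other.n
        self.s += other.s
        self.ss += other.ss

    @property
    def mean(self):
        return self.s / self.n

    @property
    def variance(self):
        return self.ss / self.n - self.mean ** 2


def compress(buffer):
    """Summarize one buffer of raw values; the buffer can then be discarded."""
    stats = SufficientStats()
    for x in buffer:
        stats.add(x)
    return stats


model = SufficientStats()
for buffer in ([1.0, 2.0, 3.0], [4.0, 5.0]):  # two buffer fills
    model.merge(compress(buffer))
print(model.mean, model.variance)
```

Because the summaries merge by simple addition, the model can be updated one buffer at a time while memory use stays constant.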

OLE DB for Data Mining

OLE DB for DM
– Industry standard for data mining
– Based on existing technologies
    – SQL
    – OLE DB
– Define common concepts for DM
    – Case, Nested Case
    – Mining Model
    – Model Creation
    – Model Training
    – Prediction
– Language based API

Customer Table
Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …

DM Query Language
Model creation:
    Create Mining Model CreditRisk
    (CustomerID long key,
     Gender text discrete,
     Income long continuous,
     Profession text discrete,
     Risk text discrete predict)
    Using Microsoft_Decision_Trees
Model training:
    Insert into CreditRisk (CustomerID, Gender, Income, Profession, Risk)
    Select CustomerID, Gender, Income, Profession, Risk
    From Customers
Prediction:
    Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
    From CreditRisk Prediction Join NewCustomers
    On CreditRisk.Gender = NewCustomers.Gender
    And CreditRisk.Income = NewCustomers.Income
    And CreditRisk.Profession = NewCustomers.Profession
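
Because the API is language based, a client can assemble these statements as plain text before sending them to the provider. The helper below is a hypothetical sketch of that idea in Python; `create_mining_model` is not part of any Microsoft library, and it only builds the string. It does not connect to a server.

```python
def create_mining_model(name, columns, algorithm):
    """Compose a Create Mining Model statement from per-column keyword tuples."""
    definitions = ", ".join(" ".join(spec) for spec in columns)
    return f"Create Mining Model {name} ({definitions}) Using {algorithm}"


statement = create_mining_model(
    "CreditRisk",
    [
        ("CustomerID", "long", "key"),
        ("Gender", "text", "discrete"),
        ("Income", "long", "continuous"),
        ("Profession", "text", "discrete"),
        ("Risk", "text", "discrete", "predict"),
    ],
    "Microsoft_Decision_Trees",
)
print(statement)
```

Keeping the column specifications as data makes it straightforward to generate variants of the same model, for example with a different algorithm name.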

Schema Rowsets
– Tabular data to provide metadata information
– List of Schema Rowsets in OLE DB for DM
    – Mining_Services
    – Mining_Service_Parameters
    – Mining_Models
    – Mining_Columns
    – Mining_Model_Contents
    – Model_Content_PMML

Mining Model Contents Schema Rowsets

Schema Rowsets & Thin Client Browser

Case Study: Click Stream Analysis

Schema
Customer: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, EmailTime, ChatTime, GeoLocation
WebClick: CustomerGuid, URLCategory, Time, Duration, ReferPage

Web Customer Segmentation

Web Visitors Segmentation

Segmentation based on Customer table
    Create Mining Model CustomerClustering
    (CustomerID text key,
     DayTimeOnline long continuous,
     NightTimeOnline long continuous,
     BrowserType text discrete,
     ChatTime long continuous,
     EmailTime long continuous,
     GeoLocation text discrete)
    Using Microsoft_Clustering

Segmentation based on Customer and WebClick
    Create Mining Model CustomerClustering
    (CustomerID text key,
     DayTimeOnline long continuous,
     NightTimeOnline long continuous,
     BrowserType text discrete,
     ChatTime long continuous,
     EmailTime long continuous,
     GeoLocation text discrete,
     WebClick table (UrlCategory text key))
    Using Microsoft_Clustering

MSFTies Segmentation

Web Site Affiliation

Association analysis using Microsoft Decision Trees
[Figure: decision trees over site categories, with splits such as Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping]

Site Affiliation

Model creation:
    Create Mining Model SiteAffiliation
    (CustomerID text key,
     WebClick table predict (UrlCategory text key))
    Using Microsoft_Decision_Trees
Model training:
    Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
    OpenRowset('MSDataShape',
      'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
      'Shape {Select CustomerID from Customer}
       Append ({Select CustomerID, URLCategory from WebClick}
               relate CustomerID to CustomerID) as WebClick')
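
The Shape ... Append ... relate clause turns two flat tables into one case per customer, each carrying a nested WebClick table. The Python sketch below mimics that grouping in memory to show the resulting structure; the function name and the sample rows are hypothetical, and it does not call the MSDataShape provider.

```python
from collections import defaultdict


def shape_cases(customers, web_clicks):
    """Attach each customer's WebClick rows as a nested table."""
    clicks_by_customer = defaultdict(list)
    for customer_id, url_category in web_clicks:
        clicks_by_customer[customer_id].append({"UrlCategory": url_category})
    # One case per parent row; customers with no clicks get an empty nested table.
    return [
        {"CustomerID": cid, "WebClick": clicks_by_customer.get(cid, [])}
        for cid in customers
    ]


# Illustrative rows only, not data from the presentation.
customers = ["c1", "c2", "c3"]
web_clicks = [("c1", "Business"), ("c2", "Stock"), ("c1", "Insurance")]
for case in shape_cases(customers, web_clicks):
    print(case)
```

Each resulting dictionary corresponds to one training case: the parent key plus the list of categories that customer visited.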

Path Prediction

Singleton Prediction
    Select Flattened
      TopCount((Select URLCategory, $adjustedProbability as prob
                From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
    From WebLog
    PREDICTION JOIN
      (Select ((Select 'Business' as URLCategory)
               union
               (Select 'Telecom' as URLCategory)) as WebClick) as input
    On WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
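
Conceptually, the query ranks candidate categories by probability, drops the ones the visitor has already seen, and keeps the top five. The sketch below shows that selection step in Python, assuming that EXCLUSIVE means "exclude the input items from the result". The function name and the probabilities are made up for illustration and are not model output.

```python
def top_count_exclusive(scores, input_items, n):
    """Return the n highest-scoring items that are not among the inputs."""
    candidates = [(item, p) for item, p in scores.items() if item not in input_items]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]


# Hypothetical adjusted probabilities for each URL category.
scores = {
    "Business": 0.90,
    "Telecom": 0.80,
    "Stock": 0.55,
    "Insurance": 0.40,
    "Loan": 0.35,
    "Shopping": 0.20,
}
print(top_count_exclusive(scores, {"Business", "Telecom"}, 3))
```

With the two input categories removed, the remaining categories are returned in descending order of probability.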

Architecture
[Diagram components: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM; labeled "Real Time Prediction"]

Performance of DM Algorithms

DM Performance Study
– Joint effort between Unisys & Microsoft
– Two parts of the white paper:
    – First part: Use AS2k to build DM Models for a banking business scenario
    – Second part: Performance results of DM algorithms study
– Some results in this session…
– Details in the paper and SQL Server magazine articles…

Data Source for DMMs

Training Performance Results…

Sample Business Question for Non Nested MDT 1: Identify the customers who are most likely to churn (leave), based on customer demographic information.

Non Nested: Training Times for Varying Number of Input Attributes
Assumptions: 1 million cases, 25 states, 1 predictable attribute
Input Attributes  Training Time
10                4.08
20                7.27
50                31.54
100               40.55
200               129.35

Non Nested: Training Times for Varying Number of Cases
Assumptions: 20 attributes, 25 states, 1 predictable attribute
Cases       Training Time
10,000      0.38
1,000,000   11.32
5,000,000   34.19
10,000,000  100.53

Sample Business Question for Nested MDT 2: Find the list of other products that the customer may be interested in, based on the products the customer has purchased.

Nested Cases: Training Times for Varying Sample Size of Case Table
Assumptions: average customer purchases = 25, states in nested = 200, nested key predictable
Master Cases  Training Time
10,000        15.09
50,000        67.79
100,000       120.88
200,000       240.62
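
A quick way to read this table is to normalize each training time by the number of master cases. The snippet below does that arithmetic, using the figures as I read them from the flattened slide text; the slide does not state the time unit, so the results are in "slide time units per 1,000 cases".

```python
# (master cases, training time) pairs as read from the slide; unit not stated.
rows = [(10_000, 15.09), (50_000, 67.79), (100_000, 120.88), (200_000, 240.62)]

# Training time per 1,000 master cases, rounded for display.
per_thousand = [(cases, round(time / cases * 1000, 3)) for cases, time in rows]

for cases, cost in per_thousand:
    print(f"{cases:>7,} cases: {cost} per 1,000 cases")
```

A per-case cost that stays roughly flat as the table grows would indicate close to linear scaling over this range.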

Nested Cases: Training Times for Varying Number of Products Purchased per Customer
Assumptions: 200,000 cases, 1,000 products in nested
Nested Cases  Training Time
10            85.26
25            120.82
50            172.96
100           281.65

For more info…
– DM URL
    – www.microsoft.com/data/oledb
    – www.microsoft.com/data/oledb/DMResKit.htm
– News Group:
    – Microsoft.public.SQLserver.datamining
    – Communities.msn.com/AnalysisServicesDataMining
– White papers:
    – Performance paper: www.unisys.com/windows2000/default-07.asp
    – www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

Don’t forget to complete the on-line Session Feedback form on the Attendee Web site: https://web.mseventseurope.com/teched/
