DAT205 Advanced Data Mining Using SQL Server 2000
ZhaoHui Tang
Program Manager, SQL Server Analysis Services
Microsoft Corporation
Agenda
Microsoft Data Mining Algorithms
OLE DB for DM: data mining query
Data Mining Case Study: Click Stream Analysis
– Customer Segmentation
– Site affiliation
– Target ads in banner
Performance of Microsoft Data Mining Algorithms
Q&A
Data Mining Algorithms in SQL Server 2000
Decision Tree
Popular technique for classification and prediction tasks
– Churn analysis
– Credit risk analysis
– …
Easy to understand
– Any path from the root node to a leaf forms a rule
Fast to build
Prediction based on leaf node stats
Variations: C4.5, C5, CART, CHAID
Example tree (Attend College):
– All Students: 55% Yes, 45% No
– IQ = High: 79% Yes, 21% No
– IQ ≠ High: 35% Yes, 65% No
– Parent Income = High: 94% Yes, 6% No
– Parent Income = Low: 69% Yes, 31% No
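How the tree works
Input attributes and their values:
– IQ: High, Medium, Low
– Parent Encouragement (PE): True, False
– Parent Income (PI): High, False
– Gender: Male, Female
Predicted attribute: CollegePlan (Yes, No)
Candidate splits: IQ=High / IQ=Medium / IQ=Low; PI=High / PI=FALSE; PE=TRUE / PE=FALSE; Male / Female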
Split recursively
College Plan:
– All Students: 33% Yes, 67% No
– Parent Encouragement = True: 63% Yes, 37% No
– Parent Encouragement = False: 16% Yes, 84% No
Remaining attributes are evaluated again within each branch: IQ, Parent Encouragement, Parent Income, Gender
Microsoft Decision Trees
Probabilistic classification tree
Splitting methods: Bayesian score and entropy
Forward pruning
Tree shape: binary and n-ary tree
Scalable framework
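To make the entropy splitting method concrete, here is a minimal sketch of choosing a split attribute by information gain. It is not the Microsoft Decision Trees implementation (the Bayesian score and forward pruning are not shown); the function names and the four toy cases are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

def best_split(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy cases, not data from the talk.
students = [
    {"ParentEncouragement": True, "Gender": "Male", "CollegePlan": "Yes"},
    {"ParentEncouragement": True, "Gender": "Female", "CollegePlan": "Yes"},
    {"ParentEncouragement": False, "Gender": "Male", "CollegePlan": "No"},
    {"ParentEncouragement": False, "Gender": "Female", "CollegePlan": "No"},
]
chosen = best_split(students, ["ParentEncouragement", "Gender"], "CollegePlan")
```

In this toy data Parent Encouragement separates the classes perfectly while Gender carries no information, so the first split is on Parent Encouragement. Applying the same selection inside each branch gives the recursive splitting described on the slides.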
Clustering Algorithm (EM)
A popular method for customer segmentation, mailing lists, profiling…
Algorithm process
– Assign a set of initial points
– Assign an initial cluster to each point
– Assign data points to each cluster with a probability
– Compute a new central point based on weighted computation
– Cycle until convergence
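The steps above can be sketched for one-dimensional data with Gaussian clusters. This is an illustrative assumption, not the algorithm as shipped: the names `em_1d` and `gaussian_pdf`, the starting variances, the tolerance, and the sample points are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(points, initial_means, max_iter=100, tol=1e-6, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: assign each point to each cluster with a probability.
        resp = []
        for x in points:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: recompute each center as a probability-weighted mean.
        new_means = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                new_means.append(means[j])
                continue
            mean_j = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, points)) / nj
            new_means.append(mean_j)
            variances[j] = max(var_j, min_var)  # guard against zero variance
            weights[j] = nj / len(points)
        # Cycle until the centers stop moving.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights

# Hypothetical sample: two well-separated groups.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights = em_1d(points, initial_means=[0.0, 6.0])
```

Unlike k-means, each point contributes to every cluster in proportion to its membership probability, which is the "weighted computation" in the fourth step.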
EM Illustration (figure)
Microsoft Clustering Algorithm (Scalable EM)
Flowchart elements:
– Data
– Fill Buffer
– Build/Update Model
– Identify Data to be Compressed
– Compressed data, sufficient stats
– Stop?
– Final Model
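One way to read "compressed data" and "sufficient stats" is that a buffer of points can be replaced by a few running totals from which cluster parameters are still recoverable. The sketch below shows that idea only, for a single one-dimensional cluster. It is an assumption about the general technique, not the Microsoft Clustering implementation, and `SufficientStats` and `summarize_in_batches` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class SufficientStats:
    """Count, sum, and sum of squares: enough to recover mean and variance."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x):
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        self.count += other.count
        self.total += other.total
        self.total_sq += other.total_sq

    @property
    def mean(self):
        return self.total / self.count

    @property
    def variance(self):
        return self.total_sq / self.count - self.mean ** 2

def summarize_in_batches(points, buffer_size):
    """Read the data one buffer at a time, keeping only the summary."""
    stats = SufficientStats()
    for start in range(0, len(points), buffer_size):
        batch = SufficientStats()
        for x in points[start:start + buffer_size]:
            batch.add(x)
        stats.merge(batch)  # the raw buffer can now be discarded
    return stats

summary = summarize_in_batches([1.0, 2.0, 3.0, 4.0], buffer_size=2)
```

Because the totals merge by simple addition, memory use depends on the buffer size rather than on the number of cases.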
OLE DB for Data Mining
OLE DB for DM
Industry standard for data mining
Based on existing technologies
– SQL
– OLE DB
Defines common concepts for DM
– Case, Nested Case
– Mining Model
– Model Creation
– Model Training
– Prediction
Language-based API
Customer Table
Customer ID   Profession   Income   Gender   Risk
1             Engineer     85       Male     No
2             Worker       40       Male     Yes
3             Doctor       90       Female   No
4             Teacher      50       Female   No
5             Worker       45       Male     No
…             …            …        …        …
DM Query Language

Create Mining Model CreditRisk
  (CustomerID long key,
   Gender text discrete,
   Income long continuous,
   Profession text discrete,
   Risk text discrete predict)
Using Microsoft_Decision_Trees

Insert into CreditRisk (CustomerID, Gender, Income, Profession, Risk)
Select CustomerID, Gender, Income, Profession, Risk
From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
From CreditRisk Prediction Join NewCustomers
On CreditRisk.Gender = NewCustomers.Gender
  And CreditRisk.Income = NewCustomers.Income
  And CreditRisk.Profession = NewCustomers.Profession
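Because the API is language based, a client only has to produce statement text and send it through an OLE DB connection. As a rough sketch of the first half of that, the hypothetical helper below composes a Create Mining Model statement from a column list. The connection and execution step is deliberately left out, and `create_mining_model_statement` is not part of any Microsoft library.

```python
def create_mining_model_statement(model, columns, algorithm):
    """Build statement text from (name, data_type, content_flags) tuples."""
    column_lines = ",\n".join(
        f"    {name} {data_type} {flags}" for name, data_type, flags in columns
    )
    return f"CREATE MINING MODEL {model} (\n{column_lines}\n) USING {algorithm}"

# Column definitions taken from the CreditRisk example on the slide.
credit_risk_columns = [
    ("CustomerID", "LONG", "KEY"),
    ("Gender", "TEXT", "DISCRETE"),
    ("Income", "LONG", "CONTINUOUS"),
    ("Profession", "TEXT", "DISCRETE"),
    ("Risk", "TEXT", "DISCRETE PREDICT"),
]
statement = create_mining_model_statement(
    "CreditRisk", credit_risk_columns, "Microsoft_Decision_Trees"
)
```

Keeping the column list as data makes it easy to create the same model with a different algorithm by changing only the last argument.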
Schema Rowsets
Tabular data that provides metadata information
List of schema rowsets in OLE DB for DM
– Mining_Services
– Mining_Service_Parameters
– Mining_Models
– Mining_Columns
– Mining_Model_Contents
– Model_Content_PMML
Mining Model Contents Schema Rowsets
Schema Rowsets & Thin Client Browser
Case Study: Click Stream Analysis
Schema
Customer: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, Time, ChatTime, GeoLocation
WebClick: CustomerGuid, URLCategory, Time, Duration, ReferPage
Web Customer Segmentation
Web Visitors Segmentation
Segmentation based on Customer table

Create Mining Model CustomerClustering
  (CustomerID text key,
   DayTimeOnline long continuous,
   NightTimeOnline long continuous,
   BrowserType text discrete,
   ChatTime long continuous,
   Time long continuous,
   GeoLocation text discrete)
Using Microsoft_Clustering
Segmentation based on Customer and WebClick

Create Mining Model CustomerClustering
  (CustomerID text key,
   DayTimeOnline long continuous,
   NightTimeOnline long continuous,
   BrowserType text discrete,
   ChatTime long continuous,
   Time long continuous,
   GeoLocation text discrete,
   WebClick table
     (UrlCategory text key))
Using Microsoft_Clustering
MSFTies Segmentation
Web Site Affiliation
Association analysis using Microsoft Decision Trees
(Figure: decision trees whose nodes split on URL categories: Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping.)
Site Affiliation
Create Mining Model SiteAffiliation
  (CustomerID text key,
   WebClick table predict
     (UrlCategory text key))
Using Microsoft_Decision_Trees

Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
OpenRowset('MSDataShape',
  'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
  'Shape {Select CustomerID from Customer}
   Append ({Select CustomerID, URLCategory from WebClick}
           relate CustomerID to CustomerID) as WebClick')
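The Shape ... Append ... relate clause turns two flat tables into nested cases: one case per customer, each carrying its own table of web clicks. The sketch below mimics that reshaping in memory to show what a nested case looks like. The function name `shape_cases` and the sample rows are hypothetical, and this is not how the MSDataShape provider is implemented.

```python
from collections import defaultdict

def shape_cases(customers, web_clicks):
    """Attach each customer's click rows as a nested WebClick table."""
    clicks_by_customer = defaultdict(list)
    for click in web_clicks:
        clicks_by_customer[click["CustomerID"]].append(
            {"URLCategory": click["URLCategory"]}
        )
    return [
        {
            "CustomerID": customer["CustomerID"],
            # Customers without clicks get an empty nested table.
            "WebClick": clicks_by_customer.get(customer["CustomerID"], []),
        }
        for customer in customers
    ]

# Hypothetical rows, not data from the talk.
customers = [{"CustomerID": "c1"}, {"CustomerID": "c2"}, {"CustomerID": "c3"}]
web_clicks = [
    {"CustomerID": "c1", "URLCategory": "Business"},
    {"CustomerID": "c1", "URLCategory": "Stock"},
    {"CustomerID": "c2", "URLCategory": "Insurance"},
]
cases = shape_cases(customers, web_clicks)
```

Each element of `cases` corresponds to one training case for the SiteAffiliation model, with the nested rows keyed by URL category.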
Path Prediction
Singleton Prediction

Select Flattened
  Topcount((select URLCategory, $adjustedProbability as prob
            From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)),
           prob, 5)
From WebLog PREDICTION JOIN
  (select (select 'Business' as URLCategory)
          union
          (select 'Telecom' as URLCategory) as WebClick) as input
On WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
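The query ranks predicted URL categories by probability and keeps the top five, leaving out categories the visitor supplied as input (the apparent purpose of the EXCLUSIVE flag). The sketch below shows that ranking logic on plain dictionaries; the probabilities are made up and `top_count` and `recommend` are hypothetical helpers, not provider functions.

```python
def top_count(rows, rank_key, count):
    """Return the `count` rows with the largest value of `rank_key`."""
    return sorted(rows, key=lambda row: row[rank_key], reverse=True)[:count]

def recommend(predictions, visited, count=5):
    """Drop categories already visited, then keep the most probable ones."""
    unseen = [p for p in predictions if p["URLCategory"] not in visited]
    return top_count(unseen, "prob", count)

# Hypothetical model output for one visitor.
predictions = [
    {"URLCategory": "Business", "prob": 0.90},
    {"URLCategory": "Stock", "prob": 0.40},
    {"URLCategory": "Insurance", "prob": 0.25},
    {"URLCategory": "Telecom", "prob": 0.80},
    {"URLCategory": "Loan", "prob": 0.55},
]
result = recommend(predictions, visited={"Business", "Telecom"}, count=2)
```

The resulting list could drive which banner ads to show for the visitor's next page.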
Architecture
(Figure: components shown are Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, and DMM, labeled "Real Time Prediction".)
Performance of DM Algorithms
DM Performance Study
Joint effort between Unisys & Microsoft
Two parts of the white paper:
– First part: Use AS2k to build DM models for a banking business scenario
– Second part: Performance results of the DM algorithms study
Some results in this session…
Details in the paper and SQL Server Magazine articles…
Data Source for DMMs
Training Performance Results…
Sample Business Question for Non-Nested MDT
1. Identify those customers that are most likely to churn (leave) based on customer demographic information.
Non Nested: Training Times for varying Number of Input attributes
Assumptions: 1 million cases, 25 states, 1 predictable attribute
(Chart: training time versus number of input attributes; data points and observations not preserved.)
Non Nested: Training Times for varying Number of Cases
Assumptions: 20 attributes, 25 states, 1 predictable attribute
(Chart: training time versus number of cases; data points and observations not preserved.)
Sample Business Question for Nested MDT
2. Find the list of other products that the customer may be interested in based on the products the customer has purchased.
Nested Cases: Training Times for varying Sample size of Case Table
Assumptions: average customer purchases = 25, states in nested table = 200, nested key predictable
(Chart: training time versus number of master cases; data points and observations not preserved.)
Nested Cases: Training Times for varying Number of Products purchased per customer
Assumptions: cases 1000 products in nested
(Chart: training time versus number of nested cases; data points and observations not preserved.)
For more info…
DM URL:
News Group:
– Microsoft.public.SQLserver.datamining
– Communities.msn.com/AnalysisServicesDataMining
White papers:
– Performance paper:
Don’t forget to complete the on-line Session Feedback form on the Attendee Web site