ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation DAT205 Advanced Data Mining Using SQL Server 2000.

Agenda
Microsoft Data Mining Algorithms
OLE DB for DM Data mining query
Data Mining Case Study: Click Stream Analysis
  – Customer Segmentation
  – Site affiliation
  – Target ads in banner
Performance of Microsoft Data Mining Algorithm
Q&A

Data Mining Algorithms in SQL Server 2000

Decision Tree
Popular technique for classification, prediction task
  – Churn analysis
  – Credit risk analysis
  – …
Easy to understand
  – any path from root to leaf forms a rule
Fast to build
Prediction based on leaf node stats
Variations: C4.5, C5, CART, CHAID
Example tree nodes (Attend College):
  – All Students: 55% Yes, 45% No
  – IQ = High: 79% Yes, 21% No
  – IQ ≠ High: 35% Yes, 65% No
  – Parent Income = High: 94% Yes, 6% No
  – Parent Income = Low: 69% Yes, 31% No
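
The two points above (a path to a leaf reads as a rule, and the prediction comes from the leaf's statistics) can be sketched in a few lines of Python. This is only an illustration, not the Microsoft implementation: the dictionary layout, the `predict` helper and the nesting of the Parent Income split under IQ = High are assumptions; only the percentages come from the slide.

```python
# Toy tree built from the percentages on the slide.
# Assumption: the Parent Income split sits under the IQ = High branch.
TREE = {
    "split_on": "IQ",
    "children": {
        "High": {
            "split_on": "ParentIncome",
            "children": {
                "High": {"stats": {"Yes": 0.94, "No": 0.06}},
                "Low": {"stats": {"Yes": 0.69, "No": 0.31}},
            },
        },
        # "*" is a hypothetical catch-all branch for every other IQ value.
        "*": {"stats": {"Yes": 0.35, "No": 0.65}},
    },
}


def predict(tree, case):
    """Walk from the root to a leaf; return (label, probability, rule)."""
    node, rule = tree, []
    while "children" in node:
        attr = node["split_on"]
        value = case[attr]
        if value in node["children"]:
            rule.append(f"{attr} = {value}")
            node = node["children"][value]
        else:
            rule.append(f"{attr} = other")
            node = node["children"]["*"]
    # The prediction is simply the most frequent class at the leaf.
    label = max(node["stats"], key=node["stats"].get)
    return label, node["stats"][label], " AND ".join(rule)


high = predict(TREE, {"IQ": "High", "ParentIncome": "High"})
low = predict(TREE, {"IQ": "Low", "ParentIncome": "High"})
```

Here `high` carries the rule text for the path taken, which is what makes a tree easy to read back as business rules.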

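How tree works
Attributes and their states:
  – IQ: High, Medium, Low
  – Parent Encouragement: True, False
  – Parent Income: High, False
  – Gender: Male, Female
  – College Plan: Yes, No
Candidate splits shown: IQ = High / Medium / Low; PI = High / FALSE; PE = TRUE / FALSE; Male / Female
Outcome counted per split: College Plan Yes / No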

Split recursively
All Students: College Plan 33% Yes, 67% No
  – Parent Encouragement = True: College Plan 63% Yes, 37% No
  – Parent Encouragement = False: College Plan 16% Yes, 84% No
Attributes: IQ, Parent Encouragement, Parent Income, Gender; predicted attribute: College Plan (Yes / No)

Microsoft Decision Trees
Probabilistic Classification Tree
Splitting methods: Bayesian score and Entropy
Forward pruning
Tree shape: Binary and N-ary tree
Scalable framework
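
As a rough illustration of the entropy splitting method named above, the sketch below scores two candidate splits by information gain and keeps the better one. The counts are hypothetical (chosen to sit near the College Plan percentages on the earlier slide), the function names are made up for this example, and the Bayesian score is not covered.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class-count list such as [yes, no]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical [yes, no] counts for 100 students.
parent = [33, 67]
candidates = {
    "ParentEncouragement": [[23, 13], [10, 54]],  # True branch, False branch
    "Gender": [[17, 33], [16, 34]],               # Male branch, Female branch
}

# The attribute with the largest gain becomes the next split.
best = max(candidates, key=lambda name: information_gain(parent, candidates[name]))
```

With these numbers Parent Encouragement separates Yes from No far better than Gender, so it is picked first; the same scoring is then repeated inside each branch.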

Clustering Algorithm (EM)
A popular method for customer segmentation, mailing list, profiling…
Algorithm process
  – Assign a set of initial points
  – Assign an initial cluster to each point
  – Assign data points to each cluster with a probability
  – Compute new central point based on weighted computation
  – Cycle until convergence
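
The process above can be sketched for the simplest case, one-dimensional points and Gaussian clusters. This is a minimal sketch under those assumptions, not the Microsoft clustering code; `em_1d`, its parameters and the sample points are all hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(points, initial_means, max_iter=100, tol=1e-6, min_var=1e-6):
    """Fit len(initial_means) Gaussian clusters to 1-D points with EM."""
    means = list(initial_means)          # initial points
    k = len(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        old_means = list(means)
        # E-step: assign each point to each cluster with a probability.
        resp = []
        for x in points:
            dens = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: recompute each center from the probability-weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj
            variances[j] = max(var, min_var)
            weights[j] = nj / len(points)
        # Cycle until the centers stop moving.
        if max(abs(a - b) for a, b in zip(old_means, means)) < tol:
            break
    return means, variances, weights


POINTS = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_1d(POINTS, [0.0, 6.0])
```

Unlike k-means, every point contributes to every cluster in proportion to its membership probability, which is the "weighted computation" in the step list.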

EM Illustration

Microsoft Clustering Algorithm (Scalable EM)
Flowchart steps:
  – Data: Fill Buffer
  – Build/Update Model
  – Identify Data to be Compressed
  – Compressed data → Sufficient stats
  – Stop? → Final Model
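
The idea of replacing buffered rows with sufficient statistics can be sketched as follows. The slide does not say which statistics are kept; count, sum and sum of squares are an assumption that fits a 1-D Gaussian cluster. The sketch also compresses every row into a single cluster, whereas the flowchart compresses only the rows it identifies, so treat `SufficientStats` and `compress_stream` as hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum and sum of squares stand in for the raw values."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        return self.total_sq / self.n - self.mean ** 2


def compress_stream(stream, buffer_size):
    """Fill a fixed-size buffer, fold it into the running stats, then reuse it."""
    stats = SufficientStats()
    buffer = []
    for x in stream:
        buffer.append(x)
        if len(buffer) == buffer_size:
            for value in buffer:
                stats.add(value)
            buffer.clear()  # raw rows are discarded once summarized
    for value in buffer:    # fold in the final, partially filled buffer
        stats.add(value)
    return stats


stats = compress_stream([1.0, 2.0, 3.0, 4.0, 5.0], buffer_size=2)
```

Memory use depends on the buffer size rather than on the number of rows, which is what lets the model be updated over data that does not fit in memory.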

OLE DB for Data Mining

OLE DB for DM
Industry standard for data mining
Based on existing technologies
  – SQL
  – OLE DB
Define common concepts for DM
  – Case, Nested Case
  – Mining Model
  – Model Creation
  – Model Training
  – Prediction
Language based API

Customer Table
Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …

DM Query Language
Model creation:
    Create Mining Model CreditRisk
    (CustomerID long key,
     Gender text discrete,
     Income long continuous,
     Profession text discrete,
     Risk text discrete predict)
    Using Microsoft_Decision_Trees
Model training:
    Insert into CreditRisk (CustomerID, Gender, Income, Profession, Risk)
    Select CustomerID, Gender, Income, Profession, Risk
    From Customers
Prediction:
    Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
    From CreditRisk Prediction Join NewCustomers
    On CreditRisk.Gender = NewCustomers.Gender
    And CreditRisk.Income = NewCustomers.Income
    And CreditRisk.Profession = NewCustomers.Profession

Schema Rowsets
Tabular data to provide meta data information
List of Schema Rowsets in OLE DB for DM
  – Mining_Services
  – Mining_Service_Parameters
  – Mining_Models
  – Mining_Columns
  – Mining_Model_Contents
  – Model_Content_PMML

Mining Model Contents Schema Rowsets

Schema Rowsets & Thin Client Browser

Case Study: Click Stream Analysis

Schema
Customer: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, Time, ChatTime, GeoLocation
WebClick: CustomerGuid, URLCategory, Time, Duration, ReferPage

Web Customer Segmentation

Web Visitors Segmentation

Segmentation based on Customer table
    Create Mining Model CustomerClustering
    (CustomerID text key,
     DayTimeOnline long continuous,
     NightTimeOnline long continuous,
     BrowserType text discrete,
     ChatTime long continuous,
     Time long continuous,
     GeoLocation text discrete)
    Using Microsoft_Clustering

Segmentation based on Customer and WebClick
    Create Mining Model CustomerClustering
    (CustomerID text key,
     DayTimeOnline long continuous,
     NightTimeOnline long continuous,
     BrowserType text discrete,
     ChatTime long continuous,
     Time long continuous,
     GeoLocation text discrete,
     WebClick table
     (UrlCategory text key))
    Using Microsoft_Clustering

MSFTies Segmentation

Web Site Affiliation

Association analysis using Microsoft Decision Trees
Figure: decision trees whose nodes split on category visits: Insurance / No Insurance, Loan / No Loan, Business / No Business, Stock / No Stock, Shopping / No Shopping

Site Affiliation

Model creation:
    Create Mining Model SiteAffiliation
    (CustomerID text key,
     WebClick table predict
     (UrlCategory text key))
    Using Microsoft_Decision_Trees
Model training:
    Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
    OpenRowset('MSDataShape',
      'data provider=SQLOLEDB;Server=myserver;UID=me; PWD=mypass',
      'Shape {Select CustomerID from Customer}
       Append ({Select CustomerID, URLCategory from WebClick}
               relate CustomerID to CustomerID) as WebClick')

Path Prediction

Singleton Prediction
    Select Flattened
      Topcount((select URLCategory, $adjustedProbability as prob
                From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)),
               prob, 5)
    From WebLog
    PREDICTION JOIN
      (select (select 'Business' as URLCategory
               union
               select 'Telecom' as URLCategory) as WebClick) as input
    On WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
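
What the query asks for, as read here, is the five most probable URL categories that the visitor has not already clicked (the EXCLUSIVE flag is taken to mean "leave out items present in the input case"). A small Python sketch of that selection step follows; the probabilities and the `top_recommendations` helper are hypothetical and do not come from a trained model.

```python
def top_recommendations(probabilities, already_visited, n=5):
    """Drop categories present in the input case, keep the n most probable."""
    candidates = [
        (category, prob)
        for category, prob in probabilities.items()
        if category not in already_visited
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n]


# Hypothetical adjusted probabilities for one visitor.
scores = {
    "Business": 0.9, "Telecom": 0.8, "Stock": 0.6, "Insurance": 0.5,
    "Loan": 0.4, "Shopping": 0.3, "Sports": 0.2, "News": 0.1,
}
result = top_recommendations(scores, {"Business", "Telecom"})
names = [category for category, _ in result]
```

In the real-time scenario the visited set is the current session's clicks, so the list changes as the visitor moves through the site.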

Architecture
Diagram components: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM
Diagram label: Real Time Prediction

Performance of DM Algorithms

DM Performance Study
Joint effort between Unisys & Microsoft
Two parts of the white paper:
  – First part: Use AS2k to build DM Models for a banking business scenario
  – Second part: Performance results of DM algorithms study
Some results in this session…
Details in the paper and SQL Server magazine articles…

Data Source for DMMs

Training Performance Results…

Sample Business Question for Non Nested MDT 1
Identify those customers that are most likely to churn (leave) based on customer demographical information.

Non Nested: Training Times for varying Number of Input attributes
Assumptions: 1 million cases; 25 states; 1 predictable attribute
Chart: training time vs. number of input attributes (data points and observations not preserved)

Non Nested: Training Times for varying Number of Cases
Assumptions: 20 attributes; 25 states; 1 predictable attribute
Chart: training time vs. number of cases (data points and observations not preserved)

Sample Business Question for Nested MDT 2
Find the list of other products that the customer may be interested in based on the products the customer has purchased.

Nested Cases: Training Times for varying Sample size of Case Table
Assumptions: avg. customer purchases = 25; states in nested = 200; nested key predictable
Chart: training time vs. number of master cases (data points and observations not preserved)

Nested Cases: Training Times for varying Number of Products purchased per customer
Assumptions: cases (count not preserved); 1000 products in nested
Chart: training time vs. number of nested cases (data points and observations not preserved)

For more info…
DM URL
News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
White papers:
  – Performance paper:

Don’t forget to complete the on-line Session Feedback form on the Attendee Web site