DBMS support of the Data Mining
Advisor: S.-Y. Hwang, Ph.D.
D954020005 Tsung-Hsien Yang
D954020006 Shi-Hwao Wang
1/22/2008

Agenda
- Introduction to Data Mining
- The Promise of Data Mining
- KDD Process
- Data Mining Algorithms
- Data Mining Modeling and Language
- Conclusion

Introduction to Data Mining
- The explosive growth of data: from terabytes to petabytes
- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, YouTube
- Data collection and data availability
  - Automated data collection tools, database systems, Web, computerized society

What Is Data Mining?
- Data mining: discovering interesting patterns from large amounts of data
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"? These are not:
  - Simple search and query processing
  - (Deductive) expert systems

The Promise of Data Mining
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis

Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
- Flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

[Diagram: Data Mining Management System (DMMS) holding a mining model]
- Data preprocessing
- Define a model
- Train the model (input: training data)
- Test the model (input: test data)
- Prediction using the model (input: prediction input data)

Data Mining Algorithms
- Decision Trees
- Naïve Bayesian
- Clustering
- Sequence Clustering
- Association Rules
- Neural Network
- Time Series
- Support Vector Machines
- …

Data Mining Function
- Classification (attribute)
- Estimation (regression)
- Prediction (time series)
- Association (cross selling)
- Clustering (segmentation)

Data Mining Algorithms
[Table: algorithms (Decision Trees, Naïve Bayes, Clustering, Sequence Clustering, Time Series, Association Rules, Neural Network) against tasks (Classification, Regression, Segmentation, Association Analysis, Anomaly Detection, Sequence Analysis, Time Series), each cell marked as first choice or second choice]

Data Mining Language
- New challenges in data mining APIs
  - Large spectrum of applications: embedded to interactive BI
  - Interoperability between different DM providers (engines) and DM consumers (tools)
  - Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)
- Requirements
  - Algorithm-neutral
  - Task-oriented (specification of what we need rather than how to get it)
  - Vendor-neutral
  - Flexible, extensible, declarative/self-contained
- Sound familiar?
  - Yes: SQL

DMX Approach
- Data Mining Extensions (DMX) to SQL
- Table vs. mining model

                 TABLE                        MINING MODEL
    schema       Column definition            Attribute (variable) definition
    contains     Rows                         Patterns, knowledge, cases
    operations   DDL (create, drop, alter)    Create/drop/alter a model
                 DML (insert, delete)         Train (populate) a model
                 Query (select)               Prediction/browsing a model

Typical DM Process Using DMX
[Diagram: Data Mining Management System (DMMS) holding a mining model]
- Define a model: CREATE MINING MODEL …
- Train a model: INSERT INTO dmm … (input: training data)
- Prediction using a model: SELECT … FROM dmm PREDICTION JOIN … (input: prediction input data)
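The define / train / predict life cycle can be sketched outside a DBMS. The Python below is only an illustration: the MiningModel class, its majority-vote lookup, and the sample attribute values are assumptions made for this sketch, not how a DMMS or any Microsoft algorithm actually works.

```python
from collections import Counter, defaultdict


class MiningModel:
    """Toy stand-in for a mining model: a lookup table of majority outcomes."""

    def __init__(self, inputs, target):
        # "Define a model": declare the input attributes and the predicted one.
        self.inputs = inputs
        self.target = target
        self.by_case = defaultdict(Counter)
        self.overall = Counter()

    def train(self, rows):
        # "Train the model": consume training cases; the rows are not stored,
        # only the outcome counts derived from them.
        for row in rows:
            key = tuple(row[a] for a in self.inputs)
            self.by_case[key][row[self.target]] += 1
            self.overall[row[self.target]] += 1

    def predict(self, row):
        # "Prediction using the model": majority outcome among matching cases,
        # falling back to the overall majority for unseen combinations.
        key = tuple(row[a] for a in self.inputs)
        counts = self.by_case.get(key) or self.overall
        return counts.most_common(1)[0][0]


# Hypothetical training data, invented for this sketch.
training = [
    {"Gender": "F", "Encouragement": "Yes", "CollegePlans": "Yes"},
    {"Gender": "F", "Encouragement": "Yes", "CollegePlans": "Yes"},
    {"Gender": "M", "Encouragement": "No", "CollegePlans": "No"},
    {"Gender": "M", "Encouragement": "Yes", "CollegePlans": "Yes"},
]

model = MiningModel(inputs=["Gender", "Encouragement"], target="CollegePlans")
model.train(training)
seen_case = model.predict({"Gender": "M", "Encouragement": "No"})
unseen_case = model.predict({"Gender": "F", "Encouragement": "No"})
```

The point of the sketch is the separation of the three steps: the model keeps patterns (here, counts) rather than the training rows themselves.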

Defining a DM Model
- Defines
  - Shape of "training cases" (top-level entity being modeled)
  - Input/output attributes (variables): type, distribution
  - Algorithms and parameters
- Example

    CREATE MINING MODEL CollegePlanModel (
        StudentID     LONG KEY,
        Gender        TEXT DISCRETE,
        ParentIncome  LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans  TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees (complexity_penalty = 0.5)

Training a DM Model: Simple

    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET(' ', ' ',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Prediction Using a DM Model
- PREDICTION JOIN

    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

[Diagram: NewStudents (ID, Gender, IQ) joined with CPModel (ID, Gender, IQ, Plan)]
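One way to read the prediction join is as "apply the model to every input row". The Python sketch below illustrates that reading only; prediction_join, cp_model, and the IQ threshold are hypothetical names and values invented here, not part of DMX.

```python
def prediction_join(predict, rows, on, key="ID"):
    """Pair every input row's key with the model's prediction for that row.

    Unlike an equi-join, no input row is dropped: the model is asked for a
    prediction even for attribute combinations it never saw in training.
    """
    for row in rows:
        case = {attr: row[attr] for attr in on}  # plays the role of the ON clause
        yield row[key], predict(case)


def cp_model(case):
    # Hypothetical stand-in for a trained CPModel; the threshold is invented.
    return "Yes" if case["IQ"] >= 110 else "No"


new_students = [
    {"ID": 1, "Gender": "F", "IQ": 120},
    {"ID": 2, "Gender": "M", "IQ": 95},
]

predictions = list(prediction_join(cp_model, new_students, on=["Gender", "IQ"]))
```

The ON clause maps input columns to model attributes; it does not compare stored values the way a relational join does.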

Classification
- Model definition

    CREATE MINING MODEL CPClass (
        StudentID     LONG KEY,
        Gender        TEXT DISCRETE,
        ParentIncome  LONG CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans  TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees

Classification (cont)
- Find the new students whose predicted class (CollegePlans) is 'Yes' with confidence > 0.8

    SELECT t.StudentID, PredictProbability(CPClass.CollegePlans)
    FROM CPClass PREDICTION JOIN
        OPENROWSET(' ', ' ', 'SELECT * FROM NewStudents') AS t
    ON t.Gender = CPClass.Gender AND
       t.ParentIncome = CPClass.ParentIncome AND
       t.Encouragement = CPClass.Encouragement
    WHERE CPClass.CollegePlans = 'Yes'
      AND PredictProbability(CPClass.CollegePlans) > 0.8
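The WHERE clause above combines a predicted label with a probability threshold. A minimal Python sketch of that filter follows, assuming the probability is estimated as the share of matching training cases; the counts and student identifiers are invented for illustration.

```python
from collections import Counter


def predict_with_probability(class_counts):
    """Return the majority class and its relative frequency."""
    label, count = class_counts.most_common(1)[0]
    return label, count / sum(class_counts.values())


# Hypothetical class counts of the training cases that match each new student.
matching_counts = {
    "s1": Counter(Yes=9, No=1),
    "s2": Counter(Yes=6, No=4),
    "s3": Counter(Yes=1, No=9),
}

confident_yes = []
for student_id, counts in matching_counts.items():
    label, probability = predict_with_probability(counts)
    # Same two conditions as the query: predicted 'Yes' AND probability > 0.8.
    if label == "Yes" and probability > 0.8:
        confident_yes.append(student_id)
```

Student s2 is predicted 'Yes' but only with probability 0.6, so the threshold removes it; s3 is confidently predicted, but as 'No'.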

Regression
- Model definition

    CREATE MINING MODEL CustCredit (
        CustID  LONG KEY,
        Gender  TEXT DISCRETE,
        Age     LONG CONTINUOUS REGRESSOR,
        Income  LONG CONTINUOUS REGRESSOR,
        Credit  DOUBLE CONTINUOUS PREDICT
    ) USING Microsoft_Decision_Trees

Regression (cont)
- Predict the credit score (and its standard deviation) for new customer data entered from a web form.

    SELECT CustCredit.Credit, PredictStdev(CustCredit.Credit)
    FROM CustCredit PREDICTION JOIN
        (SELECT 'Female' AS Gender, 30 AS Age, AS Income) AS t
    ON t.Gender = CustCredit.Gender AND
       t.Age = CustCredit.Age AND
       t.Income = CustCredit.Income

Segmentation
- Model definition

    CREATE MINING MODEL CPCluster (
        StudentID     LONG KEY,
        Gender        TEXT DISCRETE,
        ParentIncome  LONG CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans  TEXT DISCRETE
    ) USING Microsoft_Clustering

Segmentation (cont.)
- Find the cluster and its probability for each student

    SELECT t.StudentID, $Cluster, ClusterProbability()
    FROM CPCluster PREDICTION JOIN
        OPENROWSET(' ', ' ', 'SELECT * FROM NewStudents') AS t
    ON t.Gender = CPCluster.Gender AND
       t.ParentIncome = CPCluster.ParentIncome AND
       t.Encouragement = CPCluster.Encouragement AND
       t.CollegePlans = CPCluster.CollegePlans
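To make "cluster and its probability" concrete, here is a small Python sketch of soft cluster assignment on a single numeric attribute. It assumes equal-weight Gaussian clusters with a shared spread; the centroids, the spread, and the function name are invented, and this is not a description of the Microsoft_Clustering algorithm.

```python
import math


def cluster_probabilities(x, centroids, sigma):
    """Probability of each cluster for value x under equal-weight Gaussians."""
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical ParentIncome centroids for three student segments.
centroids = [30000, 60000, 120000]
probs = cluster_probabilities(50000, centroids, sigma=20000)

# The analogue of $Cluster is the most probable cluster; the analogue of
# ClusterProbability() is that cluster's probability.
best_cluster = probs.index(max(probs))
best_probability = probs[best_cluster]
```

An income of 50000 lies closest to the 60000 centroid, so that cluster gets the largest probability, while the nearby 30000 cluster still receives a substantial share.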

Association Prediction
- Model definition

    CREATE MINING MODEL FavMovieModel (
        ID            LONG KEY,
        MaritalStatus TEXT DISCRETE,
        FavMovies     TABLE PREDICT (
            Title     TEXT KEY
        )
    ) USING Microsoft_Decision_Trees

Association Prediction (cont)
- As a web application, find the 5 best recommendations for a customer whose shopping cart contains 'Star Wars' and 'Matrix'.

    SELECT FLATTENED
        PredictAssociation(FavMovieModel.FavMovies, INCLUDE_STATISTICS, 5)
    FROM FavMovieModel NATURAL PREDICTION JOIN
        (SELECT 'Single' AS MaritalStatus,
                (SELECT 'Star Wars' AS Title
                 UNION
                 SELECT 'Matrix' AS Title) AS FavMovies) AS t
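The recommendation idea can be illustrated with plain co-occurrence counting. The Python below is a rough sketch under that assumption; the recommend function, the scoring rule, and every title other than 'Star Wars' and 'Matrix' are hypothetical, and the sketch does not reproduce what PredictAssociation computes.

```python
from collections import Counter


def recommend(baskets, cart, n):
    """Rank items outside the cart by how often they co-occur with cart items."""
    cart = set(cart)
    scores = Counter()
    for basket in baskets:
        items = set(basket)
        overlap = len(items & cart)  # how many cart items this basket shares
        if overlap:
            for item in items - cart:
                scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


# Hypothetical past baskets, invented for this sketch.
baskets = [
    ["Star Wars", "Matrix", "Alien"],
    ["Star Wars", "Alien"],
    ["Matrix", "Blade Runner"],
    ["Titanic", "Notting Hill"],
    ["Star Wars", "Matrix", "Blade Runner", "Alien"],
]

top5 = recommend(baskets, cart=["Star Wars", "Matrix"], n=5)
```

Fewer than five titles come back here because only two other titles ever appear together with the cart items.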

Sequence Prediction
- Model definition

    CREATE MINING MODEL WebSeqModel (
        Session   LONG KEY,
        PageSeq   TABLE PREDICT (
            SeqID LONG KEY SEQUENCE,
            Page  TEXT DISCRETE
        )
    ) USING Microsoft_Sequence_Clustering

Sequence Prediction (cont)
- Show the next 2 steps that a web visitor who visited 'home' -> 'news' is going to take. For each step, show the top 5 candidate pages with the highest probability.

    SELECT FLATTENED
        (SELECT $Sequence,
                TopCount(PredictHistogram(Page), $Probability, 5)
         FROM PredictSequence(WebSeqModel.PageSeq, 2))
    FROM WebSeqModel NATURAL PREDICTION JOIN
        (SELECT (SELECT 1 AS SeqID, 'home' AS Page
                 UNION
                 SELECT 2 AS SeqID, 'news' AS Page) AS PageSeq) AS t
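The "next steps with top candidates" output can be illustrated with a plain first-order Markov chain over page transitions. This Python sketch is an assumption-laden simplification: the session data, the function names, and the greedy choice of following the most likely page are all invented here, and it is not the Microsoft_Sequence_Clustering algorithm.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count page-to-next-page transitions across all sessions."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            transitions[current][nxt] += 1
    return transitions


def top_next(transitions, page, k):
    """Top-k candidate next pages with their estimated probabilities."""
    counts = transitions.get(page, Counter())
    total = sum(counts.values())
    ranked = sorted(counts, key=lambda p: (-counts[p], p))[:k]
    return [(p, counts[p] / total) for p in ranked]


def predict_sequence(transitions, history, steps, k):
    """For each future step, list top-k candidates, then follow the best one."""
    result = []
    current = history[-1]
    for _ in range(steps):
        candidates = top_next(transitions, current, k)
        if not candidates:
            break  # no observed transition out of this page
        result.append(candidates)
        current = candidates[0][0]
    return result


# Hypothetical click-stream sessions, invented for this sketch.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "news", "sports", "scores"],
    ["home", "shop"],
]

transitions = fit_transitions(sessions)
next_steps = predict_sequence(transitions, ["home", "news"], steps=2, k=5)
```

Each element of next_steps is one future step with its ranked candidates, which mirrors the nested shape of the query result.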

Time-Series Prediction
- Model definition

    CREATE MINING MODEL StockModel (
        Symbol       TEXT KEY,
        DateRecorded DATE KEY TIME,
        OpeningQuote DOUBLE CONTINUOUS,
        ClosingQuote DOUBLE CONTINUOUS
    ) USING Microsoft_Time_Series

Time-Series Prediction (cont)
- Predict the next five days of MSFT stock closing quotes.

    SELECT FLATTENED PredictTimeSeries(StockModel.ClosingQuote, 5)
    FROM StockModel
    WHERE StockModel.Symbol = 'MSFT'

Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed, and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data security, integrity, and privacy

Data Mining Vendors
- SAS (Enterprise Miner)
- IBM (DB2 Intelligent Miner)
- Oracle (ODM option to Oracle 10g)
- SPSS (Clementine)
- Insightful (Insightful Miner)
- KXEN (Analytic Framework)
- Prudsys (Discoverer and its family)
- Microsoft (SQL Server 2005)
- Angoss (KnowledgeServer and its family)
- DBMiner (DBMiner)
- Many others

Data Mining and Business Intelligence
- Layers, from top (increasing potential to support business decisions) to bottom:
  - Making Decisions
  - Data Presentation: visualization techniques
  - Data Mining: information discovery
  - Data Exploration: OLAP, MDA, statistical analysis, querying and reporting
  - Data Warehouses / Data Marts
  - Data Sources: paper, files, information providers, database systems, OLTP
- Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Data Mining Modeling and Language
- Problem description: two powerful tools
  - Database management systems
  - Efficient and effective data mining algorithms and frameworks
- Generally, this work asks:
  - "How can we merge the two?"
  - "How can we integrate data mining more closely with traditional database systems, particularly querying?"

Three Different Answers
- MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
- DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
- Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)

MSQL
- Focuses on association rules
- Seeks to provide a language both to selectively generate rules and, separately, to query the rule base
- Expressive rule generation language, and techniques for optimizing some commands

MSQL
- Get-Rules and Select-Rules queries
  - The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the "where" clause

    [Project Body, Consequent, confidence, support]
    GetRules(C) [as R1]
    [into ]
    [where ]
    [sql-group-by clause]
    [using-clause]

MSQL
- The "where" clause may contain a number of conditions, including:
  - Restrictions on the attributes in the body or consequent
    - "rule.body HAS {(Job = 'Doctor')}"
    - "rule1.consequent IN rule2.body"
    - "rule.consequent IS {Age = *}"
  - Pruning conditions (restrict by support, confidence, or size)
  - Stratified or correlated subqueries
- in, has, and is are rule subset, superset, and equality, respectively
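The pruning conditions refer to support and confidence, which are simple ratios over the data. The Python sketch below computes them for rules whose body and consequent are sets of (attribute, value) descriptors; the patient rows and helper names are invented for illustration and are not part of MSQL.

```python
def support(rows, descriptors):
    """Fraction of rows that satisfy every (attribute, value) descriptor."""
    hits = sum(all(row.get(a) == v for a, v in descriptors) for row in rows)
    return hits / len(rows)


def confidence(rows, body, consequent):
    """Support of body and consequent together, relative to the body alone."""
    body_support = support(rows, body)
    if body_support == 0:
        return 0.0
    return support(rows, body | consequent) / body_support


def body_has_attribute(body, attribute):
    """Rough analogue of 'Body has {Age = *}': any descriptor on the attribute."""
    return any(a == attribute for a, _ in body)


# Hypothetical patient records, invented for this sketch.
patients = [
    {"Age": "60+", "Smoker": "Yes", "HeartDisease": "Yes"},
    {"Age": "60+", "Smoker": "Yes", "HeartDisease": "Yes"},
    {"Age": "60+", "Smoker": "No", "HeartDisease": "No"},
    {"Age": "<40", "Smoker": "Yes", "HeartDisease": "No"},
]

body = frozenset({("Age", "60+"), ("Smoker", "Yes")})
age_only = frozenset({("Age", "60+")})
consequent = frozenset({("HeartDisease", "Yes")})

rule_support = support(patients, body | consequent)
rule_confidence = confidence(patients, body, consequent)
weaker_confidence = confidence(patients, age_only, consequent)
```

With thresholds such as Support > .05 and Confidence > .7, the two-descriptor rule would be kept while the rule with only the Age descriptor would be pruned on confidence.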

MSQL

    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
    and not exists ( GetRules(Patients) R2
                     where Support > .05 and Confidence > .7
                     and R2.Body HAS R1.Body )

- Retrieve all rules with descriptors of the form "Age = *" in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors

MSQL
- Correlated

    GetRules(C) R1
    where and not exists ( GetRules(C) R2
                           where and R2.Body HAS R1.Body )

- Stratified

    GetRules(C) R1
    where and consequent is {(X=*)}
    and consequent in ( SelectRules(R2)
                        where consequent is {(X=*)} )

MSQL
- Nested Get-Rules queries and their optimization
  - Stratified (non-correlated) queries are evaluated "bottom-up": the subquery is evaluated first and replaced with its results in the outer query
  - Correlated queries are evaluated either top-down or bottom-up (like "loop unfolding"), and there are rules for choosing between the two options

MSQL
- Top-down evaluation: for each rule produced by the outer query, evaluate the inner query

    Outer:
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7

    Inner:
    not exists ( GetRules(Patients) R2
                 where Support > .05 and Confidence > .7
                 and R2.Body HAS R1.Body )

MSQL
- Bottom-up evaluation: for each rule produced by the inner query, evaluate the outer query

    Inner:
    not exists ( GetRules(Patients) R2
                 where Support > .05 and Confidence > .7
                 and R2.Body HAS R1.Body )

    Outer:
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
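The two evaluation orders can be compared on a small in-memory rule base. In the Python sketch below, the rule list, the function names, and the reading of HAS as a strict superset (so that a rule does not eliminate itself) are assumptions made for illustration; both strategies are expected to return the same rules and differ only in loop order.

```python
def passes(rule):
    """Pruning conditions shared by the outer and inner queries."""
    return rule["support"] > 0.05 and rule["confidence"] > 0.7


def has_age(rule):
    """Analogue of 'Body has {Age = *}'."""
    return any(attr == "Age" for attr, _ in rule["body"])


def top_down(rules):
    # For each rule produced by the outer query, evaluate the inner query.
    result = []
    for r1 in rules:
        if has_age(r1) and passes(r1):
            if not any(passes(r2) and r2["body"] > r1["body"] for r2 in rules):
                result.append(r1["name"])
    return result


def bottom_up(rules):
    # For each rule produced by the inner query, strike out outer candidates.
    survivors = {r["name"]: r for r in rules if has_age(r) and passes(r)}
    for r2 in rules:
        if passes(r2):
            covered = [n for n, r1 in survivors.items() if r2["body"] > r1["body"]]
            for name in covered:
                del survivors[name]
    return list(survivors)


# Hypothetical rule base; support and confidence values are invented.
rules = [
    {"name": "A", "body": frozenset({("Age", "60+")}),
     "support": 0.30, "confidence": 0.80},
    {"name": "B", "body": frozenset({("Age", "60+"), ("Smoker", "Yes")}),
     "support": 0.20, "confidence": 0.90},
    {"name": "C", "body": frozenset({("Age", "<40")}),
     "support": 0.10, "confidence": 0.75},
    {"name": "D", "body": frozenset({("Age", "<40"), ("Smoker", "Yes")}),
     "support": 0.04, "confidence": 0.90},
    {"name": "E", "body": frozenset({("Smoker", "Yes")}),
     "support": 0.40, "confidence": 0.80},
]

top_down_result = top_down(rules)
bottom_up_result = bottom_up(rules)
```

Rule A is dropped because B has a superset body and meets the thresholds; C survives because its only superset, D, fails the support threshold.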

DMQL
- Commands specify the following:
  - The set of data relevant to the data mining task (the training set)
  - The kinds of knowledge to be discovered
    - Generalized relation
    - Characteristic rules
    - Discriminant rules
    - Classification rules
    - Association rules

DMQL
- Commands specify the following:
  - Background knowledge
    - Concept hierarchies based on attribute relationships, etc.
  - Various thresholds
    - Minimum support, confidence, etc.

DMQL
- Syntax

    use database
    {use hierarchy for }
    related to
    from [where ] [order by ]
    {with [ ] threshold = [for ]}

- use hierarchy: specify background knowledge
- rule specification: specify rules to be discovered
- related to: relevant attributes or aggregations
- from / where / order by: collect the set of relevant data to mine
- with … threshold: specify threshold parameters

DMQL

    use database Hospital
    find association rules as Heart_Health
    related to Salary, Age, Smoker, Heart_Disease
    from Patient_Financial f, Patient_Medical m
    where f.ID = m.ID and m.age >= 18
    with support threshold = .05
    with confidence threshold = .7

DMQL
- DMQL provides a "display in" command to view the resulting rules, but no advanced way to query them
- Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)

OLE DB for DM
- An extension to the OLE DB interface for Microsoft SQL Server
- Seeks to support the following ideas:
  - Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
  - Populate the model using the training data
  - Predict attributes for new data using the populated model
  - Browse the mining model (not fully addressed because it varies a lot by model type)

OLE DB for DM
- Defining a mining model
  - Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
- Populating the model
  - Pull the information into a single rowset using views, and train the model using the data and algorithm specified

OLE DB for DM
- Using the mining model to predict
  - Defines a new operator, the prediction join
  - A model may be used to make predictions on a data set by taking the prediction join of the mining model and the data set

OLE DB for DM

    CREATE MINING MODEL [Heart_Health Prediction] (
        ID          Int Key,
        Age         Int,
        Smoker      Int,
        Salary      Double discretized,
        HeartAttack Int PREDICT    -- prediction column
    ) USING Microsoft_Decision_Trees

- Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm

OLE DB for DM

    INSERT INTO [Heart_Health Prediction]
        (Age, Smoker, Salary, HeartAttack)
    OPENROWSET(' ', ' ',
        'SELECT Age, Smoker, Salary, HeartAttack
         FROM Patient_Medical M, Patient_Financial F
         WHERE M.ID = F.ID')

- The INSERT uses each tuple to train the model; the tuple is not actually inserted into a rowset

OLE DB for DM

    SELECT T.ID, H.HeartAttack
    FROM [Heart_Health Prediction] H
    PREDICTION JOIN (
        OPENROWSET(' ', ' ',
            'SELECT ID, Age, Smoker, Salary
             FROM Patient_Medical M, Patient_Financial F
             WHERE M.ID = F.ID')
    ) AS T
    ON H.Age = T.Age AND
       H.Smoker = T.Smoker AND
       H.Salary = T.Salary

- The prediction join connects the model and an actual data table to make predictions

Key Ideas
- It is important to have an API for creating and manipulating data mining models
- The data is already in the DBMS, so it makes sense to do the data mining where the data is
- Applications already use SQL, so a SQL extension seems logical

Key Ideas
- Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
- Need a method of querying the models (MSQL)
- Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)