Data mining algorithms


Data mining algorithms: extensions to SQL for data mining (CSD305 Advanced Databases)

Data Mining Algorithms
A data mining algorithm is a set of heuristics and calculations that creates a data mining model from data. The algorithm first analyzes the data you provide, looking for specific types of patterns. It then uses the results of this analysis to define the best parameters for creating the mining model (training). These parameters are then applied across the entire data set to extract patterns and statistics.

The mining model could be any of these:
- A set of clusters that describe how the cases in a dataset are related.
- A decision tree that predicts an outcome.
- A mathematical model that forecasts sales.
- A set of rules that describe how products are grouped together in a transaction.
Notes: Clustering identifies natural groupings based on a set of attributes. For example, a customer data set with two attributes, age and income, might be grouped into three segments: cluster 1, younger customers with low income; cluster 2, middle-aged customers with higher income; cluster 3, seniors with relatively low income.
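The age/income segmentation described in the notes can be sketched with a plain k-means loop. This is only an illustration: the customer values, the choice of k-means, and the hand-picked starting centroids are all assumptions, and it is not the Microsoft Clustering Algorithm.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers as (age, income in thousands).
customers = [
    (22, 18), (25, 22), (27, 20),   # younger, low income
    (42, 65), (45, 70), (48, 72),   # middle-aged, higher income
    (68, 25), (70, 22), (73, 28),   # senior, relatively low income
]
# Starting centroids are hand-picked so the run is deterministic.
centroids, clusters = kmeans(customers, [(22, 18), (42, 65), (68, 25)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Age and income happen to be on similar scales here; with real data the attributes would normally be normalised first so that one of them does not dominate the distance.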

How do you choose the right algorithm to use?
- By type of algorithm, or
- By type of problem

Choosing an Algorithm by Type
- Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
- Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
- Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
- Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in a market basket analysis.
- Sequence analysis algorithms summarize frequent sequences in data, such as a Web path flow.

Choosing an Algorithm by Task: predicting a discrete attribute
Examples:
- Flag the customers in a prospective buyers list as good or poor prospects.
- Calculate the probability that a server will fail within the next 6 months.
- Categorize patient outcomes and explore related factors.
Microsoft algorithms to use:
- Microsoft Decision Trees Algorithm
- Microsoft Naive Bayes Algorithm
- Microsoft Clustering Algorithm
- Microsoft Neural Network Algorithm

Choosing an Algorithm by Task: predicting a continuous attribute
Examples:
- Forecast next year's sales.
- Predict site visitors given past historical and seasonal trends.
- Generate a risk score given demographics.
Microsoft algorithms to use:
- Microsoft Decision Trees Algorithm
- Microsoft Time Series Algorithm
- Microsoft Linear Regression Algorithm
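To make "predicting a continuous attribute" concrete, the sketch below fits a straight line to past sales by ordinary least squares and extrapolates one year ahead. The sales figures are invented, and this is textbook least squares, not the internals of the Microsoft Linear Regression or Time Series algorithms.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: year index and sales for that year.
years = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_line(years, sales)
forecast = slope * 5 + intercept  # prediction for year 5
print(slope, intercept, forecast)
```

The predicted value is a number on a continuous scale, which is what separates this task from the discrete-attribute predictions on the previous slide.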

Choosing an Algorithm by Task: predicting a sequence
Examples:
- Perform clickstream analysis of a company's Web site.
- Analyze the factors leading to server failure.
- Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Microsoft algorithms to use:
- Microsoft Sequence Clustering Algorithm

Choosing an Algorithm by Task: finding groups of common items in transactions
Examples:
- Use market basket analysis to determine product placement.
- Suggest additional products to a customer for purchase.
- Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Microsoft algorithms to use:
- Microsoft Association Algorithm
- Microsoft Decision Trees Algorithm
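Market basket analysis is usually described in terms of support and confidence, two standard measures that the slides do not name. The sketch below computes both for one rule over a handful of invented baskets; it illustrates the idea only and does not reproduce the Microsoft Association Algorithm.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical baskets from a bicycle shop.
baskets = [
    {"helmet", "gloves", "bottle"},
    {"helmet", "gloves"},
    {"helmet", "bottle"},
    {"gloves"},
]

rule_support = support(baskets, {"helmet", "gloves"})
rule_confidence = confidence(baskets, {"helmet"}, {"gloves"})
print(rule_support, rule_confidence)
```

A rule such as "helmet implies gloves" with high support and confidence is the kind of pattern that could drive product placement or a suggestion to the customer.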

Choosing an Algorithm by Task: finding groups of similar items
Examples:
- Create patient risk profile groups based on attributes such as demographics and behaviours.
- Analyze users by browsing and buying patterns.
- Identify servers that have similar usage characteristics.
Microsoft algorithms to use:
- Microsoft Clustering Algorithm
- Microsoft Sequence Clustering Algorithm

Accuracy of predictions
- We need to compare the accuracy of predictions from a number of different algorithms to help choose which is best.
- The example in the Classification spreadsheet shows accuracy and error calculations for a binary classification.
- An Association mining structure example is also illustrated.

Descriptive Modelling: training data
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Notes:
- In the vertebrate example, the attribute set contains properties of a vertebrate: body temperature, skin cover, method of reproduction. Most attributes are discrete, but the set can also contain continuous features. The class label, however, must be a discrete attribute.
- This is the key characteristic that distinguishes classification from regression: regression is a predictive modelling task in which the target attribute y is continuous.
- Discrete data can only take particular values. There may potentially be an infinite number of those values, but each is distinct and there is no grey area in between. Discrete data can be numeric (like numbers of apples) or categorical (like red or blue, male or female, good or bad).
- Continuous data are not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Continuous data are always essentially numeric.
- Classification is used for both descriptive and predictive modelling. Descriptive modelling summarises the data and explains which features define a vertebrate as a mammal, reptile, bird or fish.

Predictive Modelling
A classification model can be used to predict the class label of unknown records. It can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.
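The black-box view can be sketched as a function that takes labelled records and returns a predictor f mapping an attribute set x to a class label y. The learner here is a 1-nearest-neighbour rule over discrete attributes, chosen only because it is short. The three training records are illustrative, and the slides' fourth class, fish, is omitted because body temperature, skin cover and reproduction alone would not separate it from reptiles in this toy data.

```python
def hamming(a, b):
    """Number of attribute positions where two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def train(records):
    """'Training' for 1-nearest-neighbour just stores the labelled records
    and returns the learned function f: attribute set -> class label."""
    def predict(attributes):
        _, label = min(records, key=lambda rec: hamming(rec[0], attributes))
        return label
    return predict

# Illustrative training data: (body temperature, skin cover, reproduction).
training = [
    (("warm", "hair", "gives birth"), "mammal"),
    (("cold", "scales", "lays eggs"), "reptile"),
    (("warm", "feathers", "lays eggs"), "bird"),
]

f = train(training)
# An unknown record that matches no training row exactly.
unknown = ("warm", "fur", "gives birth")
print(f(unknown))
```

The caller only sees `f`; how the label is chosen stays hidden, which is the sense in which the model is a black box.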

Confusion Matrix for a 2-class problem

                    Predicted class = 1    Predicted class = 0
Actual class = 1    f11 (correct)          f10 (wrong)
Actual class = 0    f01 (wrong)            f00 (correct)

Notes:
- Most algorithms seek models that attain the highest accuracy, or equivalently the lowest error rate, when applied to the test set.
- The confusion matrix, also known as an error matrix, is used in machine learning to show the performance of an algorithm. Each row is an actual class and each column is a predicted class, so the correct predictions lie on the diagonal of the table. https://en.wikipedia.org/wiki/Confusion_matrix
- The matrix tabulates the counts of test records that the model classified correctly and incorrectly. The total number of correct predictions is f11 + f00; the total number of incorrect predictions is f10 + f01.
- The matrix gives the information needed to judge how well a classification model performs, but summarising it as a single number makes it easier to compare different models. That number is either the accuracy or the error rate:
  Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
  Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
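The counts and the two summary numbers can be computed directly from paired lists of actual and predicted labels. A minimal sketch follows; the label lists are made up, and the fij naming follows the slide (first index actual, second index predicted).

```python
def confusion_counts(actual, predicted):
    """Tally f11, f10, f01, f00 for a 2-class problem with labels 0 and 1."""
    counts = {"f11": 0, "f10": 0, "f01": 0, "f00": 0}
    for a, p in zip(actual, predicted):
        counts[f"f{a}{p}"] += 1
    return counts

def accuracy(c):
    """Share of test records predicted correctly (the diagonal)."""
    return (c["f11"] + c["f00"]) / sum(c.values())

def error_rate(c):
    """Share of test records predicted incorrectly (off the diagonal)."""
    return (c["f10"] + c["f01"]) / sum(c.values())

# Hypothetical test set of eight records.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts), error_rate(counts))
```

Accuracy and error rate always sum to 1, so reporting either one is enough when comparing models on the same test set.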

Data Mining Modeling and Language

Data Mining Language
New challenges in data mining APIs:
- A large spectrum of applications, from embedded to interactive BI
- Interoperability between different DM providers (engines) and DM consumers (tools)
- Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)
Requirements:
- Algorithm-neutral
- Task-oriented (a specification of what we need, rather than how to get it)
- Vendor-neutral
- Flexible, extensible, declarative/self-contained
Sound familiar? Yes: SQL.
Notes:
- Embedded: integration of self-service BI tools into common business applications. For example, CRM applications may have DM features to group customers into segments, ERP (enterprise resource planning) systems may have features to forecast production, and an online bookstore can give customers real-time recommendations on books.
- Interactive: BI software that uses OLAP and visualisation tools.
- Interoperability: the majority of packages include a few algorithms, a graphical interface for model building, some data extraction and transformation functions, and a reporting tool. Some also include their own storage engines with special formats. Because there are so many components, it is hard to find a product with satisfactory features across all areas; most are strong in data mining algorithms but weak in the other components. The biggest issue is that the products are proprietary systems. There is no dominant standard API, so it is hard to integrate the results of DM with standard reporting tools or to use model prediction functions in applications.

SQL Revolution (1970s)
Architecture
  Before: file system, hierarchical/network DB
  After: relational DB
API
  Before: proprietary ISAM, X/OPEN CLI, etc.
  After: SQL
Data independence
  Before: physical model tied to the logical model (application logic), so a change to the physical model requires redeveloping the applications
  After: clear separation between physical and logical model, so no more application changes due to physical model updates
Application development tools
  Before: not many; custom development with consulting services
  After: commodity; product services rather than consulting services
SQL (with relational databases) is the biggest contributor to the maturity of the DB industry.

DMX Approach
Data Mining Extensions (DMX) to SQL
Table vs. mining model:
Schema
  Table: column definition
  Mining model: attribute (variable) definition
Contains
  Table: rows
  Mining model: patterns, knowledge, cases
Operations
  DDL (create, drop, alter): create/drop/alter a model
  DML (insert, delete): train (populate) a model
  Query (select): prediction on, or browsing of, a model

Typical DM Process Using DMX
1. Define a model: CREATE MINING MODEL ….
2. Train a model with the training data: INSERT INTO dmm ….
3. Predict using a model on the prediction input data: SELECT … FROM dmm PREDICTION JOIN …
The mining model is held by the Data Mining Management System (DMMS).

Defining a DM Model
Defines:
- Shape of the "training cases" (the top-level entity being modeled)
- Input/output attributes (variables): type, distribution
- Algorithms and parameters
Example:
    CREATE MINING MODEL CollegePlanModel
    (
        StudentID     LONG KEY,                 -- case key
        Gender        TEXT DISCRETE,
        ParentIncome  LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans  TEXT DISCRETE PREDICT     -- attribute to predict
    )
    USING Microsoft_Decision_Trees (complexity_penalty = 0.5)
Notes:
- A mining model is a container similar to a relational table and is created with a CREATE command.
- This model uses gender, parent income and encouragement to predict college plans.
- For each column, the statement specifies the data type and the content type (continuous or discrete). The content type tells the algorithm the right way to model the column.
- The algorithm applied is Microsoft Decision Trees.
- The complexity penalty inhibits the growth of the decision tree. Decreasing this value increases the likelihood of a split, while increasing it decreases the likelihood. This parameter is available only in Enterprise Edition.

Training (processing) a DM Model
Simply issue an INSERT with the training data. The DMMS (here, data mining in Microsoft SQL Server) takes care of everything:
- Accessing the training data, possibly outside the system
- Transformation (e.g., discretization, normalization)
- Tokenization, numeric conversion, feature selection, etc.
- Learning the model with the chosen algorithm
- Persisting the patterns discovered
There are multiple ways to specify training data: SELECT, OPENROWSET, SHAPE, etc.
Notes:
- Discretization is the process of converting continuous functions, models and variables into discrete counterparts, to make them suitable for numerical evaluation.
- Tokenization is the separation of text into sentences, words, etc. Tokens are separated by whitespace, punctuation marks or line breaks, and the tokens become the input.
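The two transformations described in the notes can be sketched in a few lines. Equal-width binning is just one possible discretization method and is an assumption here, as are the sample incomes and the punctuation set; the DMMS applies its own methods internally.

```python
import re

def discretize(values, bins):
    """Equal-width binning: map each value to a bucket index 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: everything falls into the first bucket.
        return [0] * len(values)
    # The maximum value would land in bucket `bins`, so clamp it to the last one.
    return [min(int((v - lo) / width), bins - 1) for v in values]

def tokenize(text):
    """Split text on whitespace (including line breaks) and common punctuation."""
    return [t for t in re.split(r"[\s.,;:!?]+", text) if t]

# Hypothetical parent incomes, in thousands.
incomes = [20, 35, 50, 65, 80]
buckets = discretize(incomes, 3)
tokens = tokenize("Data mining, in SQL Server.")
print(buckets, tokens)
```

After discretization a continuous column such as ParentIncome can be handled like any other discrete attribute, for example by an algorithm that accepts only discrete inputs.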

Training a DM Model: Simple
    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Prediction Using a DM Model: PREDICTION JOIN
    SELECT t.ID, CPModel.Plan
    FROM CPModel
    PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender
       AND CPModel.IQ = t.IQ
Diagram: the model CPModel (ID, Gender, IQ, Plan) is joined to the input table NewStudents (ID, Gender, IQ).

Your data mining exercises
- In the tutorial you will explore the data mining that is possible in SQL Server 2017 Analysis Services.
- We will be using AdventureWorksDW.

Adventure Works
AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington, with 500 employees, and several regional sales teams are located throughout their market base.