Data mining algorithms

CSD305 Advanced Databases: Extensions to SQL for data mining

Data Mining Algorithms
A data mining algorithm is a set of heuristics and calculations that creates a data mining model from data. To create a model, the algorithm first analyzes the data you provide, looking for specific types of patterns. It then uses the results of this analysis to define the best parameters for creating the mining model (training). These parameters are then applied across the entire data set to extract patterns and statistics.

The mining model could be any of these:
A set of clusters that describe how the cases in a dataset are related.
A decision tree that predicts an outcome.
A mathematical model that forecasts sales.
A set of rules that describe how products are grouped together in a transaction.

Clustering identifies natural groupings based on a set of attributes. For example, take a customer data set with two attributes, age and income. Clustering might group the customers into three segments: cluster 1, younger customers with low income; cluster 2, middle-aged customers with higher income; cluster 3, senior customers with relatively low income.
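The age and income example above can be sketched in Python. This is a plain k-means illustration, not necessarily the method used by the Microsoft Clustering Algorithm. The customer records, the starting centroids and the function name `kmeans` are all hypothetical, and income is given in thousands so that both attributes are on a comparable scale.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Hypothetical customers as (age, income in thousands); not course data.
customers = [(22, 20), (25, 24), (42, 70), (45, 80), (68, 25), (72, 22)]
# Assumed starting centroids: one customer from each expected segment.
start = [customers[0], customers[2], customers[4]]

labels, centres = kmeans(customers, start)
print(labels)   # cluster index for each customer
print(centres)  # mean (age, income) of each cluster
```

With this data the three clusters correspond to the younger/low income, middle-aged/higher income and senior/low income segments described above. On real data the attributes would normally be normalised first, because an attribute with a much larger range would otherwise dominate the distance.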

How do you choose the right algorithm to use?
By type of algorithm, or by type of problem.

Choosing an Algorithm by Type
Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
Sequence analysis algorithms summarize frequent sequences in data, such as a Web path flow.

Choosing an Algorithm by Task
Predicting a discrete attribute
Examples:
  Flag the customers in a prospective buyers list as good or poor prospects.
  Calculate the probability that a server will fail within the next 6 months.
  Categorize patient outcomes and explore related factors.
Microsoft algorithms to use:
  Microsoft Decision Trees Algorithm
  Microsoft Naive Bayes Algorithm
  Microsoft Clustering Algorithm
  Microsoft Neural Network Algorithm

Predicting a continuous attribute
Examples:
  Forecast next year's sales.
  Predict site visitors given past historical and seasonal trends.
  Generate a risk score given demographics.
Microsoft algorithms to use:
  Microsoft Decision Trees Algorithm
  Microsoft Time Series Algorithm
  Microsoft Linear Regression Algorithm

Predicting a sequence
Examples:
  Perform clickstream analysis of a company's Web site.
  Analyze the factors leading to server failure.
  Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Microsoft algorithms to use:
  Microsoft Sequence Clustering Algorithm

Finding groups of common items in transactions
Examples:
  Use market basket analysis to determine product placement.
  Suggest additional products to a customer for purchase.
  Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.
Microsoft algorithms to use:
  Microsoft Association Algorithm
  Microsoft Decision Trees Algorithm
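The slides do not say how an association rule is scored. As an illustration only, the Python sketch below uses the two standard measures for a rule "A implies B": support (how often A and B appear together) and confidence (how often B appears when A does). The transactions and function names are hypothetical, and this is not a description of how the Microsoft Association Algorithm works internally.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets; not taken from AdventureWorksDW.
baskets = [
    {"bike", "helmet"},
    {"bike", "helmet", "gloves"},
    {"bike", "bottle"},
    {"helmet"},
]

rule_support = support(baskets, {"bike", "helmet"})
rule_confidence = confidence(baskets, {"bike"}, {"helmet"})
print(rule_support, rule_confidence)
```

A rule with high support and high confidence is a candidate for product placement or for suggesting additional products, as in the examples above.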

Finding groups of similar items
Examples:
  Create patient risk profile groups based on attributes such as demographics and behaviours.
  Analyze users by browsing and buying patterns.
  Identify servers that have similar usage characteristics.
Microsoft algorithms to use:
  Microsoft Clustering Algorithm
  Microsoft Sequence Clustering Algorithm

Accuracy of predictions
We need to be able to compare the accuracy of predictions from a number of different algorithms to help choose which is best.
The example in the Classification spreadsheet shows accuracy and error calculations for a binary classification.
An Association mining structure example is also illustrated.

Descriptive Modelling – Training data
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

In the example, the attribute set contains properties of a vertebrate: body temperature, skin cover and method of reproduction. Most of the attributes are discrete, but the attribute set can also contain continuous features. The class label, however, must be a discrete attribute. This is the key characteristic that distinguishes classification from regression, a predictive modelling task in which y is a continuous attribute.

Discrete and continuous data
Discrete data can only take particular values. There may potentially be an infinite number of those values, but each is distinct and there is no grey area in between. Discrete data can be numeric (like numbers of apples), but it can also be categorical (like red or blue, male or female, good or bad).
Continuous data are not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Continuous data are always essentially numeric.

A classification model is used for descriptive modelling and predictive modelling.
Descriptive modelling: summarise the data and explain which features define a vertebrate as a mammal, reptile, bird or fish.

Predictive Modelling
A classification model can be used to predict the class label of unknown records. It can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.

Confusion Matrix for a 2-class problem
                    Predicted class = 1    Predicted class = 0
Actual class = 1    f11 (correct)          f10 (wrong)
Actual class = 0    f01 (wrong)            f00 (correct)

A confusion matrix, also known as an error matrix, is used in machine learning to show the performance of an algorithm. Each row is an actual class and each column is a predicted class, so the correct predictions lie on the diagonal of the table.
The matrix is based on the counts of test records that the model classified correctly and incorrectly. The total number of correct predictions is f11 + f00, and the total number of incorrect predictions is f10 + f01.
Most algorithms seek models that attain the highest accuracy, or equivalently the lowest error rate, when applied to the test set.
Although the confusion matrix provides the information needed to determine how well a classification model performs, summarising it as a single number makes it easier to compare the performance of different models. Two such numbers are accuracy (the fraction of predictions that are correct) and error rate (the fraction that are incorrect).
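The accuracy and error rate calculations can be sketched in Python from the four counts defined above. The function names and the sample labels are hypothetical and are not the figures from the course spreadsheet.

```python
def confusion_counts(actual, predicted):
    """Count f11, f10, f01 and f00 for a 2-class problem with labels 1 and 0.
    The first digit is the actual class, the second the predicted class."""
    counts = {"f11": 0, "f10": 0, "f01": 0, "f00": 0}
    for a, p in zip(actual, predicted):
        counts[f"f{a}{p}"] += 1
    return counts


def accuracy(c):
    """Correct predictions (f11 + f00) divided by all predictions."""
    total = c["f11"] + c["f10"] + c["f01"] + c["f00"]
    return (c["f11"] + c["f00"]) / total


def error_rate(c):
    """Incorrect predictions (f10 + f01) divided by all predictions."""
    total = c["f11"] + c["f10"] + c["f01"] + c["f00"]
    return (c["f10"] + c["f01"]) / total


# Hypothetical test-set labels, for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts)
print(accuracy(counts), error_rate(counts))
```

Accuracy and error rate always sum to 1, so either number is enough to rank competing models on the same test set.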

Data Mining Modeling and Language

Data Mining Language
New challenges in a data mining API:
  Large spectrum of applications: embedded to interactive BI
  Interoperability between different DM providers (engines) and DM consumers (tools)
  Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)
Requirements:
  Algorithm-neutral
  Task-oriented (specification of what we need, rather than how to do it)
  Vendor-neutral
  Flexible, extensible, declarative/self-contained
Sound familiar? Yes, SQL.

Embedded: integration of self-service BI tools into common business applications. For example, CRM applications may have data mining features to group customers into segments, ERP (enterprise resource planning) systems may have features to forecast production, and an online bookstore can give customers real-time recommendations on books.
Interactive BI: software that uses OLAP and visualisation tools for BI.
Interoperability: the majority of packages include a few algorithms, a graphical interface for model building, some data extraction and transformation functions, and a reporting tool. Some also include their own storage engines with special formats. Because there are so many components, it is hard to find a good product with satisfactory features across all areas. Most are strong in data mining algorithms but weak in the other components. The biggest issue is that the products are proprietary systems. There is no dominant standard API, so it is hard to integrate the results of data mining with standard reporting tools, or to use model prediction functions in applications.

SQL Revolution (1970s)
Architecture
  Before: File system, hierarchical/network DB
  After: Relational DB
API
  Before: Proprietary ISAM, X/OPEN CLI, etc.
  After: SQL
Data independence
  Before: Physical model tied to logical model (application logic), so a physical model change requires redeveloping the apps
  After: Clear separation between physical and logical model, so no more app changes due to a physical model update
Application development tools
  Before: Not many; custom development with consulting services
  After: Commodity; product services rather than consulting services
SQL (with relational databases) is the biggest contributor to the maturity of the DB industry.

DMX Approach: Data Mining Extensions (DMX) to SQL
Table vs. Mining Model
                 TABLE                       MINING MODEL
Schema           Column definition           Attribute (variable) definition
Contains         Rows                        Patterns, knowledge, cases
Operations       DDL (create, drop, alter)   Create/drop/alter a model
                 DML (insert, delete)        Train (populate) a model
                 Query (select)              Prediction/browsing a model

Typical DM Process Using DMX
Define a model: CREATE MINING MODEL ….
Train a model: INSERT INTO dmm ….
Prediction using a model: SELECT … FROM dmm PREDICTION JOIN …
Diagram labels: Data Mining Management System (DMMS), Training Data, Prediction Input Data, Mining Model.

Defining a DM Model
Defines:
  Shape of "training cases" (top-level entity being modeled)
  Input/output attributes (variables): type, distribution
  Algorithms and parameters
Example:
  CREATE MINING MODEL CollegePlanModel
  (
      StudentID     LONG KEY,
      Gender        TEXT DISCRETE,
      ParentIncome  LONG NORMAL CONTINUOUS,
      Encouragement TEXT DISCRETE,
      CollegePlans  TEXT DISCRETE PREDICT
  )
  USING Microsoft_Decision_Trees (complexity_penalty = 0.5)

A mining model is a container similar to a relational table, and it is created with a CREATE command. This model uses gender, parent income and encouragement to predict college plans.
For each column, the statement specifies the data type and the content type (for example, continuous or discrete). These tell the algorithm the right way to model the column.
The algorithm applied is Microsoft Decision Trees. The complexity penalty parameter inhibits the growth of the decision tree: decreasing this value increases the likelihood of a split, while increasing this value decreases the likelihood. This parameter is available only in Enterprise Edition.

Training (processing) a DM Model
Simply issue INSERT with training data.
The DMMS (here, data mining in Microsoft SQL Server) takes care of everything:
  Accessing the training data, possibly outside the system
  Transformation (e.g., discretization, normalization)
  Tokenization, numeric conversion, feature selection, etc.
  Running the learning algorithm
  Persistency of the patterns discovered
There are multiple ways to specify training data: SELECT, OPENROWSET, SHAPE, etc.

Discretization: the process of transferring continuous functions, models and variables into discrete counterparts, to make them suitable for numerical evaluation.
Tokenization: the separation of text into sentences, words, etc. Tokens are separated by whitespace, punctuation marks or line breaks, and the tokens become the input.
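The two transformations defined above can be illustrated with a short Python sketch. It assumes equal-width binning as the discretization method and a simple word-character pattern for tokenization; SQL Server performs these steps internally and may use different methods. The function names and sample values are hypothetical.

```python
import re


def discretize_equal_width(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 using
    equal-width bins between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def tokenize(text):
    """Split text into tokens on whitespace, punctuation marks and line breaks."""
    return re.findall(r"\w+", text)


# Hypothetical parent incomes turned into three discrete bands.
incomes = [20000, 35000, 50000, 65000, 80000]
bands = discretize_equal_width(incomes, 3)
print(bands)

tokens = tokenize("Tokens become input, line by line.\nDone!")
print(tokens)
```

Here bin 0 can be read as "low", bin 1 as "medium" and bin 2 as "high", which turns a continuous column into one that a discrete-only algorithm can use.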

Training a DM Model: Simple
INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
     FROM CollegePlansTrainData')

Prediction Using a DM Model
PREDICTION JOIN
SELECT t.ID, CPModel.Plan
FROM CPModel
PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender
   AND CPModel.IQ = t.IQ
Diagram: CPModel (ID, Gender, IQ, Plan) is joined with NewStudents (ID, Gender, IQ).

Your data mining exercises
In the tutorial you will explore the data mining that is possible in SQL Server 2017 Analysis Services.
We will be using AdventureWorksDW.

Adventure Works AdventureWorksDW
Based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.
