Data Mining Techniques

Slides:



Advertisements
Similar presentations
Week 9 Data Mining System (Knowledge Data Discovery)
Advertisements

© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Data Mining.
Classical Techniques: Statistics, Neighborhoods, and Clustering.
Data Mining By Archana Ketkar.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
Data Mining: A Closer Look
Chapter 5 Data mining : A Closer Look.
Computer Science Universiteit Maastricht Institute for Knowledge and Agent Technology Data mining and the knowledge discovery process Summer Course 2005.
TURKISH STATISTICAL INSTITUTE INFORMATION TECHNOLOGIES DEPARTMENT (Muscat, Oman) DATA MINING.
Enterprise systems infrastructure and architecture DT211 4
Data Mining By Andrie Suherman. Agenda Introduction Major Elements Steps/ Processes Tools used for data mining Advantages and Disadvantages.
Data Mining: Concepts & Techniques. Motivation: Necessity is the Mother of Invention Data explosion problem –Automated data collection tools and mature.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
University of Toronto 8/30/20151 Data Mining The Art and Science of Obtaining Knowledge from Data Dr. Saed Sayad.
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
Data Mining. 2 Models Created by Data Mining Linear Equations Rules Clusters Graphs Tree Structures Recurrent Patterns.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
DATA MINING Team #1 Kristen Durst Mark Gillespie Banan Mandura University of DaytonMBA APR 09.
Data Mining Chun-Hung Chou
Understanding Data Analytics and Data Mining Introduction.
1 Data Mining Books: 1.Data Mining, 1996 Pieter Adriaans and Dolf Zantinge Addison-Wesley 2.Discovering Data Mining, 1997 From Concept to Implementation.
3 Objects (Views Synonyms Sequences) 4 PL/SQL blocks 5 Procedures Triggers 6 Enhanced SQL programming 7 SQL &.NET applications 8 OEM DB structure 9 DB.
Data Mining and Application Part 1: Data Mining Fundamentals Part 2: Tools for Knowledge Discovery Part 3: Advanced Data Mining Techniques Part 4: Intelligent.
COMP3503 Intro to Inductive Modeling
Introduction to Data Mining Group Members: Karim C. El-Khazen Pascal Suria Lin Gui Philsou Lee Xiaoting Niu.
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Introduction, or what is data mining? Introduction, or what is data mining? Data warehouse and query tools Data warehouse and query tools Decision trees.
Some working definitions…. ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably Data mining = –the discovery of interesting,
 Fundamentally, data mining is about processing data and identifying patterns and trends in that information so that you can decide or judge.  Data.
Fox MIS Spring 2011 Data Mining Week 9 Introduction to Data Mining.
Data Warehousing.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Descriptive Statistics vs. Factor Analysis Descriptive statistics will inform on the prevalence of a phenomenon, among a given population, captured by.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
MIS2502: Data Analytics Advanced Analytics - Introduction.
Data Mining and Decision Support
Copyright © 2001, SAS Institute Inc. All rights reserved. Data Mining Methods: Applications, Problems and Opportunities in the Public Sector John Stultz,
Data Mining. Overview the extraction of hidden predictive information from large databases Data mining tools predict future trends and behaviors, allowing.
Applications and Trends in Data Mining Pertemuan 13 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
1 Data Mining Chapter 34 in textbook + Chapter 4 in DATA MINING by P. Adriaans and D. Zantinge.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Nearest Neighbour and Clustering. Nearest Neighbour and clustering Clustering and nearest neighbour prediction technique was one of the oldest techniques.
Why Intelligent Data Analysis? Joost N. Kok Leiden Institute of Advanced Computer Science Universiteit Leiden.
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining – Intro.
What Is Cluster Analysis?
Data Transformation: Normalization
MIS2502: Data Analytics Advanced Analytics - Introduction
DATA MINING © Prentice Hall.
Data Warehouse.
Descriptive Statistics vs. Factor Analysis
Supporting End-User Access
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
CHAPTER 7: Information Visualization
Presentation transcript:

Data Mining Techniques DM in not so much a single technique, as the idea that there is more knowledge hidden in the data than shows itself on the surface. Any data that helps extract more out of data is useful. So DM techniques form quite a heterogeneous group Babu Ram Dawadi

DM Techniques.. Query tools Statistical Techniques Visualization OLAP Case-Based Learning (K- Nearest Neighbor) Decision Trees Association rules Neural Network Genetic Algorithms & many more… Babu Ram Dawadi

Query Tools The first step in data mining should always be a rough analysis of the data set using traditional query tools. Applying simple SQL can achieve wealth of information Almost 80% of the interesting information can be abstracted from a database using SQL, The remaining 20% requires more advanced techniques where 20% hidden Information might have vital importance. Babu Ram Dawadi

Statistical Methods (Statistical DM) There are many well-established statistical techniques for data analysis, particularly for numeric data applied extensively to data from scientific experiments and data from economics and the social sciences Regression predict the value of a response (dependent) variable from one or more predictor (independent) variables where the variables are numeric Babu Ram Dawadi

Statistical Methods (Statistical DM) Regression trees Binary trees are used for classification and prediction Similar to decision trees:Tests are performed at the internal nodes In a regression tree the mean of the objective attribute is computed and used as the predicted value Analysis of variance Analyze experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors) Mixed-effects models provide a powerful and flexible tool for the analysis of balanced and unbalanced grouped data. These data arise in several areas of investigation and are characterized by the presence of correlation between observations within the same group. Some examples are repeated measures data, longitudinal studies, and nested designs. Classical modeling techniques which assume independence of the observations are not appropriate for grouped data. Babu Ram Dawadi

Statistical Methods (Statistical DM) http://www.spss.com/datamine/factor.htm Factor analysis determine which variables are combined to generate a given factor e.g., for many psychiatric data, one can indirectly measure other quantities (such as test scores) that reflect the factor of interest Discriminant analysis predict a categorical response variable, commonly used in social science Attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable Babu Ram Dawadi

Visualization Techniques (Visual DM) Visualization: use of computer graphics to create visual images which aid in the understanding of complex, often massive representations of data Visual Data Mining: the process of discovering implicit but useful knowledge from large data sets using visualization techniques Human Computer Interfaces Computer Graphics Multimedia Systems High Performance Computing Pattern Recognition Babu Ram Dawadi

Visualization Purpose of Visualization Gain insight into an information space by mapping data onto graphical primitives Provide qualitative overview of large data sets Search for patterns, trends, structure, irregularities, relationships among data. Help to find interesting regions and suitable parameters for further quantitative analysis. Provide a visual proof of computer representations derived Babu Ram Dawadi

Data Mining Result Visualization Presentation of the results or knowledge obtained from data mining in visual forms Examples Scatter plots Decision trees Association rules Clusters Outliers Babu Ram Dawadi

Visualization of Data Mining Results: Scatter Plots Babu Ram Dawadi

Visualization of a Decision Tree : MineSet 3.0 Babu Ram Dawadi

Visualization of Cluster Grouping in IBM Intelligent Miner Babu Ram Dawadi

Data Mining Process Visualization Presentation of the various processes of data mining in visual forms so that users can see Data extraction process Where the data is extracted How the data is cleaned, integrated, preprocessed, and mined Method selected for data mining Where the results are stored How they may be viewed Babu Ram Dawadi

Visualization of Data Mining Processes by Clementine E.g.: Knowledge Flow in WEKA See your solution discovery process clearly How the interactive Clementine knowledge discovery process works See your solution discovery process clearly The interactive stream approach to data mining is the key to Clementine's power. Using icons that represent steps in the data mining process, you mine your data by building a stream - a visual map of the process your data flows through. Start by simply dragging a source icon from the object palette onto the Clementine desktop to access your data flow. Then, explore your data visually with graphs. Apply several types of algorithms to build your model by simply placing the appropriate icons onto the desktop to form a stream. Discover knowledge interactively Data mining with Clementine is a "discovery-driven" process. Work toward a solution by applying your business expertise to select the next step in your stream, based on the discoveries made in the previous step. You can continually adapt or extend initial streams as you work through the solution to your business problem. Easily build and test models All of Clementine's advanced techniques work together to quickly give you the best answer to your business problems. You can build and test numerous models to immediately see which model produces the best result. Or you can even combine models by using the results of one model as input into another model. These "meta-models" consider the initial model's decisions and can improve results substantially. Understand variations in your business with visualized data Powerful data visualization techniques help you understand key relationships in your data and guide the way to the best results. Spot characteristics and patterns at a glance with Clementine's interactive graphs. Then "query by mouse" to explore these patterns by selecting subsets of data or deriving new variables on the fly from discoveries made within the graph. How Clementine scales to the size of the challenge The Clementine approach to scaling is unique in the way it aims to scale the complete data mining process to the size of large, challenging datasets. Clementine executes common operations used throughout the data mining process in the database through SQL queries. This process leverages the power of the database for faster processing, enabling you to get better results with large datasets. Understand variations with visualized data Babu Ram Dawadi

Likelihood & Distance Records that are close to each other are very alike & records that are vary far removed from each other represents individuals that have little in common. Examples: 3 field values (age, income & Credit): these three attributes form a three dimensional data space and we can analyze the distances between records in this space. Assume: Age range: 1 to 100 yrs, Income range: 0 to 100,000$ & Credit range: 0 to 50,000$ If it is used without correction, then income/credit is more distinctive than age. So for better analysis, we need coding. Coding: divide income & age separation by 1000. Babu Ram Dawadi

Likelihood & Distance Age Income Credit Customer 1 32 40,000 10,000 24 30,000 2,000 Differences 8 8,000 Coding 10 distance SquareRoot(8*8+10*10+15*8)=15 Records Become points in Multidimensional data space Customer 1 15 Customer 2 Babu Ram Dawadi

Likelihood & Distance Some times it is possible to identify a visual cluster of potential Customers that are very likely to buy certain product Babu Ram Dawadi

OLAP Tools Idea of dimensionality (data are managed into multidimensional cube) Eg: how much is sold?? (Zero Dimensional) What type of magazines are sold in a designated area per month and to what age group? (a 4-dimensional question: product, area, purchase date and age) OLAP Operations (Slicing,Dicing,Rolling,Drilling) Babu Ram Dawadi

K-Nearest Neighbor When we interpret records as points in a data space, we can define the concepts of neighborhood. “Records that are close to each other live in each others neighborhood.” “Records of the same type will be close to each other in the data space.” i.e. customers of same type will show same behavior. Babu Ram Dawadi

K-Nearest Neighbor Basic principle of K-nearest neighbor is “Do as neighbor Do” If we want to predict the behavior of certain individual, we first to look at the behavior of, for example: ten individuals that are close to him/her in the data space. We calculate the average of 10 neighbors and this average value will be the prediction for our individual. K=> No. of Neighbors. 5-Nearest neighbor => 5 neighbors to be taken for calculation Babu Ram Dawadi

K-Nearest Neighbor Draw Back If we want to make a prediction for each element in the data set containing n-records, then we have to compare each record with every other records which leads to complexity. For million records, we may need billion comparisons. So not desirable for large data sets. Doesn't work for high attributes or high dimensional data. Babu Ram Dawadi