Lecture 4 TIES445 Data mining Nov-Dec 2007 Sami Äyrämö.



Definitions for data mining
- “Data mining is a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produces a particular enumeration of patterns Ej over database F.”
- “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
- Enumeration of patterns involves some form of search in the (often infinite) space of patterns
  - Note that global models are also searched
- The computational efficiency constraints place several limits on the subspace that can be explored by the algorithm

Definition of Knowledge Discovery in Databases
- “KDD Process is the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using database F along with any required preprocessing, subsampling, and transformation of F.”
- “The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
- Goals (e.g., Fayyad et al. 1996):
  - Verification of the user’s hypothesis (this goes against the EDA principle…)
  - Autonomous discovery of new patterns and models
  - Prediction of the future behavior of some entities
  - Description of interesting patterns and models

KDD Process
- A multistep process in which many decisions are made by the user (domain expert)
- Iterative and interactive: loops between any two steps are possible
- Usually most of the focus is on the DM step, but the other steps are of considerable importance for the successful application of KDD in practice

KDD versus DM
- DM is a component of the KDD process that is mainly concerned with the means by which patterns and models are extracted and enumerated from the data
  - DM is quite technical
- Knowledge discovery involves evaluation and interpretation of the patterns and models to decide what constitutes knowledge and what does not
  - KDD requires a lot of domain understanding
  - It also includes, e.g., the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step
- The terms DM and KDD are often used interchangeably
  - Perhaps DM is the more common term in the business world, and KDD in the academic world

The main steps of the KDD process

Refined steps of KDD Process
- Domain understanding and goal setting
- Creating a target data set
- Data cleaning and preprocessing
- Data reduction and projection
- Data mining
  - Choosing the data mining task
  - Choosing the data mining algorithm(s)
  - Use of data mining algorithms
- Interpretation of mined patterns
- Utilization of discovered knowledge

1. Domain analysis
- Development of domain understanding
- Discovery of relevant prior knowledge
- Definition of the goal of the knowledge discovery
- In the applied research projects at JYU this step has been supported by so-called genre-based domain analysis
  - Helps to identify the most important information sources and their current owners, including related metadata such as data amounts, formats, and users
  - Examines the information communicated by capturing all information flows, including
    - Verbal communication
    - IT systems
    - Paper and electronic documentation
  - Maps the different data sources
  - As a result, perhaps the most interesting non-digital information can be digitized prior to the actual KDD activities
- Public defence of PhD thesis: Turo Kilpeläinen, December 2007!!

2. Data selection
- Selection and integration of the target data from possibly many different and heterogeneous sources
  - Interesting data may exist, e.g., in relational databases, document collections, e-mails, photographs, video clips, process databases, customer transaction databases, web logs, etc.
- Focus on the correct subset of variables and data samples
  - E.g., customer behavior in a certain country, or the relationship between items purchased and customer income and age
- Possibly interesting non-electronic sources (“indirectly mineable or non-mineable” data) should also be considered
  - For example, faxes, letters, and video tapes can be of interest, and digitizing them can be considered
  - cf. the genre-based analysis of the application domain
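As a rough illustration of focusing on a subset of samples and variables, the sketch below filters a small in-memory table by country and keeps only the attributes of interest. The records, the field names, and the helper `select_target` are hypothetical and are not part of the lecture.

```python
# Hypothetical customer records; all field names and values are made up.
records = [
    {"customer": "a", "country": "FI", "age": 34, "income": 42000, "items": 3},
    {"customer": "b", "country": "SE", "age": 51, "income": 58000, "items": 7},
    {"customer": "c", "country": "FI", "age": 27, "income": 31000, "items": 1},
]


def select_target(rows, country, variables):
    """Keep only the rows for one country and only the listed variables."""
    return [
        {name: row[name] for name in variables}
        for row in rows
        if row["country"] == country
    ]


# Target data set: age, income and purchased items of customers in one country.
target = select_target(records, "FI", ["age", "income", "items"])
print(target)
```

In a real project the same selection would typically be expressed as a database query over several integrated sources rather than over a Python list.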

3. Data cleaning and preprocessing
- Today’s datasets are incomplete (missing attribute values), noisy (errors and outliers), and inconsistent (discrepancies in the collected data)
- Dirty data can confuse the mining procedures and lead to unreliable and invalid outputs
- Complex analysis and mining on a huge amount of data may take a very long time
- Preprocessing and cleaning should improve the quality of the data and of the mining results by enhancing the actual mining process
- The actions to be taken include
  - Removal of noise or outliers
  - Collecting the information necessary to model or account for noise
  - Using prior domain knowledge to remove inconsistencies and duplicates from the data
  - Choosing and applying strategies for handling missing data fields
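Two of the actions listed above, handling missing fields and removing outliers, can be sketched for a single numeric attribute as follows. Median imputation and the MAD-based cut-off of 3.5 are assumptions chosen for illustration; the lecture does not prescribe a particular strategy, and the function name `clean` is hypothetical.

```python
from statistics import median


def clean(values, threshold=3.5):
    """Impute missing values (None) with the median, then drop outliers.

    Outliers are detected with a score based on the median absolute
    deviation (MAD); the threshold of 3.5 is an illustrative assumption.
    """
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    filled = [med if v is None else v for v in values]
    if mad == 0:
        # All observed values are (nearly) identical; nothing to score.
        return filled
    # 0.6745 rescales the MAD so that the score is comparable to a
    # z-score when the data are normally distributed.
    return [v for v in filled if 0.6745 * abs(v - med) / mad <= threshold]


raw = [21.0, 22.5, None, 23.0, 21.5, 250.0, 22.0]
cleaned = clean(raw)
print(cleaned)
```

The median and the MAD are used here instead of the mean and the standard deviation because they are themselves barely affected by the outliers that are to be removed.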

4. Data reduction and projection
- Finding useful features to represent the data, depending on the goal of the task
  - The data becomes more appropriate for mining
  - For example, in high-dimensional spaces (a large number of attributes) the distances between objects may become meaningless
- Dimensionality reduction and transformation methods reduce the effective number of variables under consideration or find invariant representations for the data
- Data transformation techniques
  - Smoothing (binning, clustering, regression, etc.)
  - Aggregation (use of summary operations, e.g., averaging, on the data)
  - Generalization (primitive data objects can be replaced by higher-level concepts)
  - Normalization (min-max scaling, z-score)
  - Feature construction from the existing attributes (PCA, MDS)
- Data reduction techniques are applied to produce a reduced representation of the data (a smaller volume that closely maintains the integrity of the original data)
  - Aggregation
  - Dimension reduction (attribute subset selection, PCA, MDS, …)
  - Compression (e.g., wavelets, PCA, clustering, …)
  - Numerosity reduction
    - parametric models: regression and log-linear models
    - non-parametric models: histograms, clustering, sampling, …
  - Discretization (e.g., binning, histograms, cluster analysis, …)
  - Concept hierarchy generation (numeric value of “age” to a higher-level concept “young, middle-aged, senior”)
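The normalization and concept hierarchy items above can be made concrete with a short sketch. The formulas for min-max scaling and the z-score are standard; the age cut points (30 and 60) and all function names are assumptions made for this example only.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max scaling: map the observed range linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant attribute")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Z-score normalization: subtract the mean, divide by the std. deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def age_concept(age, young_below=30, senior_from=60):
    """Replace a numeric age by a higher-level concept (cut points assumed)."""
    if age < young_below:
        return "young"
    if age < senior_from:
        return "middle-aged"
    return "senior"


ages = [20, 30, 40, 50, 60]
print(min_max(ages))                   # values rescaled to [0, 1]
print(z_score(ages))                   # zero mean, unit standard deviation
print([age_concept(a) for a in ages])  # discretized into three concepts
```

Min-max scaling is sensitive to extreme values because they define the range, whereas the z-score depends on the mean and standard deviation of the whole attribute.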

5. Choice of data mining task
- Define the task for data mining
- Exploration/summarization
  - Summarizing statistics (mean, median, mode, std, …)
  - Class/concept description
  - Exploratory data analysis
    - Graphical techniques, low-dimensional plots, …
- Predictive
  - Classification or regression
- Descriptive
  - Cluster analysis, dependency modelling, change and outlier detection
  - Mining of associations, rules and sequential patterns
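For the exploration/summarization task, the summarizing statistics named above are available in the Python standard library. The sketch below is only an illustration; the helper `summarize` and the sample values are made up.

```python
from statistics import mean, median, mode, stdev


def summarize(values):
    """Return the basic summarizing statistics of one numeric attribute."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "std": stdev(values),  # sample standard deviation (n - 1 denominator)
    }


summary = summarize([2, 3, 3, 5, 7, 10])
print(summary)
```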

6. Choosing the DM algorithm(s)
- Select the most appropriate methods to be used for the model and pattern search
- Also includes the decisions about the appropriate models, patterns, parameters, and score functions (aka evaluation criteria)
  - A cluster model or a probabilistic mixture model?
  - Prototype or dendrogram representation of the cluster patterns?
  - K-means (fast) or K-medoid (robust) algorithm?
  - Parameters of the chosen algorithm (e.g., the number of clusters)?
- Matching the chosen method with the overall goal of the KDD process (necessitates communication between the end user and the method specialists)
- Note that this step requires understanding of many fields, such as computer science, statistics, machine learning, optimization, etc.
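The trade-off between K-means and K-medoid can be seen already in how each computes the prototype of a single cluster. The sketch below shows only that prototype step, not a full clustering algorithm; the data points and function names are hypothetical.

```python
import math


def centroid(points):
    """K-means prototype: the coordinate-wise mean, computed in one pass."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """K-medoid prototype: the data point with the smallest total distance
    to all other points. This needs all pairwise distances, so it is slower
    than the mean, but it is always an actual observation."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Four nearby points and one gross outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (50.0, 50.0)]
print(centroid(cluster))  # pulled far away from the bulk of the data
print(medoid(cluster))    # stays among the four nearby points
```

The mean is cheap to compute but can be dragged arbitrarily far by a single outlier, while the medoid costs a quadratic number of distance evaluations per cluster and remains within the data.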

7. Use of data mining algorithms
- Application of the chosen DM algorithms to the target data set
- Search for the patterns and models of interest in a particular representational form or a set of such representations
  - Classification rules or trees, regression models, clusters, mixture models, …
- Should be relatively automatic
- Generally DM involves:
  - Establishing the structural form (model/pattern) one is interested in
  - Estimating the parameters from the available data
  - Interpreting the fitted models
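The three generic activities above can be illustrated with the simplest possible case: the structural form is a straight line y = a*x + b, the parameters are estimated by ordinary least squares, and the fitted slope is then interpreted. The data and the function name `fit_line` are made up for this sketch.

```python
def fit_line(xs, ys):
    """Estimate a and b of the model y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# 1) Structural form: a line. 2) Parameter estimation from the data:
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)

# 3) Interpretation: each unit increase in x changes y by about a units.
print(f"y = {a:.2f} * x + {b:.2f}")
```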

8. Interpretation/evaluation
- The mined patterns and models are interpreted
- Patterns are local structures that make statements only about restricted regions of the space spanned by the variables, e.g., P(Y>y1|X>x1)=p1
  - Anomaly detection applications: fault detection in industrial processes or fraud detection in banking
- Models are global structures that make statements about any point in the measurement space, e.g., Y = aX+b (linear model)
  - Models can assign a point to a cluster or predict the value of some other variable
- The results should be presented in an understandable form
  - Visualization techniques are important for making the results useful: mathematical models or text-type descriptions may be difficult for domain experts
- Possible return to any of the previous steps
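A local pattern of the form P(Y>y1|X>x1)=p1 can be estimated from data as a relative frequency within the region X>x1. The sketch below assumes a small list of (x, y) observations; the data, thresholds, and function name are hypothetical.

```python
def conditional_exceedance(pairs, x1, y1):
    """Estimate P(Y > y1 | X > x1) as a relative frequency.

    Returns None when no observation falls in the region X > x1, because
    the pattern then makes no statement about the data.
    """
    region = [(x, y) for x, y in pairs if x > x1]
    if not region:
        return None
    return sum(1 for _, y in region if y > y1) / len(region)


observations = [(1, 2), (2, 1), (6, 9), (7, 8), (8, 3), (9, 10)]
p1 = conditional_exceedance(observations, x1=5, y1=7)
print(p1)
```

Unlike a global model such as Y = aX+b, this estimate says nothing about observations with X at or below x1, which is what makes it a local structure.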

Knowledge Mining (KM) process