Published by Jerome Gilbert. Modified over 9 years ago.
1
Computer Science, Universiteit Maastricht
Institute for Knowledge and Agent Technology
Data mining and the knowledge discovery process
Summer Course 2005
H.H.L.M. Donkers
2
Content
• Opening / acquaintance
• What is data mining
• Data mining methodology
• Course perspective
• Course contents
3
Data - Information - Knowledge -
• Data: symbols
• Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
• Knowledge: application of data and information; answers "how" questions
• Understanding: appreciation of "why"
• Wisdom: evaluated understanding
(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)
4
Data - Information - Knowledge - http://www.outsights.com/systems/dikw/dikw.htm
5
What is Data Mining – Traditionally
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
Witten & Frank (2000). Data Mining.
6
What is Data Mining – Traditionally
“The application of specific algorithms for extracting patterns from data, it is a part of knowledge discovery from databases”
Fayyad (1997). From data mining to knowledge discovery in databases.
7
What is Data Mining – Traditionally
“Data mining is a process, not just a series of statistical analyses.”
SAS Institute (2003). Finding the solution to data mining.
8
What is Data Mining – Traditionally
• Computer Science
  – (Semi-)automated application of algorithms for pattern discovery
  – Algorithms developed in the field of Artificial Intelligence (machine learning)
  – Part of the process of knowledge discovery
• Statistics
  – Process of discovering patterns in data
  – (Manual) application of a series of statistical techniques (among which machine learning)
  – Incorporates
    – Exploration
    – Sampling
    – Modeling
    – Validation
Data mining = Statistics + Marketing
9
What is Data Mining – A Fusion
“An analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction.”
Statsoft (2003). Data Mining Techniques.
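The validation step in this definition (applying detected patterns to new subsets of data) can be illustrated with a hold-out split. The following is only a minimal sketch: the toy data, the function names, and the "majority label per attribute value" pattern are assumptions made for illustration and are not taken from Statsoft or from the course material.

```python
import random

def split_holdout(rows, test_fraction=0.3, seed=0):
    """Shuffle rows and split them into a discovery set and a held-out set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def learn_majority_rule(rows):
    """For each attribute value, remember its most frequent label (a toy 'pattern')."""
    counts = {}
    for value, label in rows:
        per_value = counts.setdefault(value, {})
        per_value[label] = per_value.get(label, 0) + 1
    return {value: max(labels, key=labels.get) for value, labels in counts.items()}

def accuracy(rule, rows, default=None):
    """Fraction of rows whose label the rule predicts correctly."""
    if not rows:
        return 0.0
    hits = sum(1 for value, label in rows if rule.get(value, default) == label)
    return hits / len(rows)

# Hypothetical toy data: (outlook, play) pairs invented for this sketch.
data = (
    [("sunny", "no")] * 6
    + [("sunny", "yes")] * 2
    + [("rainy", "yes")] * 7
    + [("rainy", "no")] * 1
)

discovery, holdout = split_holdout(data)
rule = learn_majority_rule(discovery)          # search for the pattern
print("accuracy on discovery set:", accuracy(rule, discovery))
print("accuracy on held-out set:", accuracy(rule, holdout))  # validate on new data
```

The accuracy on the held-out set, not on the discovery set, is the figure that says something about prediction.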
10
What is Data Mining – A Fusion
“An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.”
Rudjer Boskovic Institute (2001). DMS Tutorial.
11
Data Mining In This Course
• We use the book of Witten & Frank
  – Computer science (machine learning) approach
• Emphasis on algorithms for pattern discovery and rule extraction
  – What are the underlying models
  – What are the properties of the algorithms
  – When to use (for which tasks)
  – How to apply and to tune
  – How to interpret and assess the results
12
Data Mining Process
• These algorithms are only part of a process that computer scientists call Knowledge Discovery and the statisticians call Data Mining
• The process starts with the recognition of a problem and ends with the control of a deployed solution
• The whole process needs to be supported for a successful application
13
Methodologies for Data Mining
• As Data Mining is coming of age, several methodologies have been developed, each with its own perspective. We will discuss three of them:
  – Fayyad et al. (Computer science)
    – E.g., WEKA
  – SEMMA (SAS) (Statistics)
    – SAS Enterprise Miner
  – CRISP-DM (SPSS, OHRA, among others) (Business)
    – SPSS Clementine
14
Fayyad’s KDD Methodology
data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
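The chain of steps on this slide can be read as a pipeline in which each step turns one data product into the next. The sketch below is an illustration under stated assumptions, not Fayyad's own formulation: the records, the field names (`age`, `buys`), the age threshold, and the support threshold are all hypothetical, and simple pattern counting stands in for a real data mining algorithm.

```python
def select(records, fields):
    """Selection: keep only the fields relevant to the task (target data)."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Preprocessing & cleaning: drop records with missing values (processed data)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: discretise the numeric 'age' field into two bins (transformed data)."""
    return [
        {"age_group": "young" if r["age"] < 40 else "old", "buys": r["buys"]}
        for r in records
    ]

def mine(records):
    """Data mining: count how often each (age_group, buys) pair occurs (patterns)."""
    patterns = {}
    for r in records:
        key = (r["age_group"], r["buys"])
        patterns[key] = patterns.get(key, 0) + 1
    return patterns

def interpret(patterns, min_support):
    """Interpretation / evaluation: keep patterns seen at least min_support times (knowledge)."""
    return {k: n for k, n in patterns.items() if n >= min_support}

# Hypothetical raw data invented for this sketch.
raw = [
    {"id": 1, "age": 25, "buys": "yes"},
    {"id": 2, "age": 31, "buys": "yes"},
    {"id": 3, "age": 58, "buys": "no"},
    {"id": 4, "age": None, "buys": "no"},
    {"id": 5, "age": 62, "buys": "no"},
    {"id": 6, "age": 45, "buys": "yes"},
]

target = select(raw, ["age", "buys"])
processed = clean(target)
transformed = transform(processed)
patterns = mine(transformed)
knowledge = interpret(patterns, min_support=2)
print(knowledge)
```

In practice the process is iterative: a disappointing result in the interpretation step usually sends the analyst back to an earlier step.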
15
SEMMA Methodology
Supported by the SAS Enterprise Miner environment
• SAMPLE: Input data, Sampling, Data partition
• EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection
• MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen
• MODEL: Regression, Tree, Neural Network, Ensemble
• ASSESS: Assessment, Score, Report
16
CRISP-DM Methodology
• Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission
• Tool-independent / industry-independent
• Hierarchical process model
  1. Generic phases
  2. Generic tasks
  3. Specific tasks
  4. Task instances
• Supported by the SPSS Clementine environment
17
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Business understanding): Business objective, Assess situation, Data mining goals, Project plan
18
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Data understanding): Collect data, Describe data, Explore data, Verify data quality
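The "describe data" and "verify data quality" tasks can be made concrete with a small per-field summary. This is a hedged sketch only: CRISP-DM does not prescribe any code, and the function names, the sample records, and the 20% missing-value threshold are assumptions chosen for illustration.

```python
def describe(records):
    """Describe data: per field, count rows, missing values, and distinct values."""
    fields = sorted({f for r in records for f in r})
    summary = {}
    for f in fields:
        values = [r.get(f) for r in records]
        present = [v for v in values if v is not None]
        summary[f] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary

def quality_issues(summary, max_missing_fraction=0.2):
    """Verify data quality: flag fields with too many missing values.

    The threshold is a hypothetical choice, not part of CRISP-DM.
    """
    return [
        f for f, s in summary.items()
        if s["rows"] and s["missing"] / s["rows"] > max_missing_fraction
    ]

# Hypothetical customer records invented for this sketch.
customers = [
    {"age": 25, "income": 30000},
    {"age": 31, "income": None},
    {"age": 58, "income": 52000},
    {"age": 31, "income": None},
]

summary = describe(customers)
for field, stats in summary.items():
    print(field, stats)
print("fields needing attention:", quality_issues(summary))
```

A field flagged here would be a candidate for the "clean data" task of the next phase.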
19
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Data Preparation): Select data, Clean data, Construct data, Integrate data, Format data
20
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Modeling): Select modeling techniques, Design the test, Build model, Assess model
21
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Evaluation): Evaluate results, Review process, Determine next steps
22
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks (Deployment): Plan deployment, Plan monitoring and maintenance, Final report, Review project
23
A Comparison
Fayyad: data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
SEMMA:
• SAMPLE: Input data, Sampling, Data partition
• EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection
• MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen
• MODEL: Regression, Tree, Neural Network, Ensemble
• ASSESS: Assessment, Score, Report
CRISP-DM: Business understanding → Data understanding → Data Preparation → Modeling → Evaluation → Deployment
24
A Small Poll (July 2002) Source: http://www.kdnuggets.com/polls/2002/methodology.htm
25
Course perspective and goal
• The perspective is from computer science (machine learning): Fayyad’s approach
• The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the Model phase of SEMMA and the Modeling phase of CRISP-DM)
• The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice
26
Course contents
• Data preparation (Wednesday)
  – Selection, preprocessing, transformation
• Techniques, algorithms and models
  – Decision trees (Monday)
  – Instance based and Bayesian learning (Tuesday)
  – Neural networks (Tuesday)
  – Association rules (Thursday)
  – Clustering (Thursday)
  – Support Vector Machines (Friday)
• Evaluation of learned models (Wednesday)
27
Course contents
• For each technique you learn
  – For which tasks it is suitable
    – Classification, rules, prediction, …
    – Restrictions on input data (numerical, symbolic, etc.)
  – What algorithms are available
  – What parameters should be tuned
  – How to interpret the results
  – How to evaluate the model