Why Intelligent Data Analysis?
Joost N. Kok
Leiden Institute of Advanced Computer Science
Universiteit Leiden
Overview Data Analysis Data Mining Applications Outlook
Data Analysis
Data Mining
"Data Mining is one of the five key note technologies that will have a major impact across a wide range of industries within the next three to five years" (Gartner)
"Data Mining is one of the top ten new technologies in which companies will invest during the next five years" (Gartner)
"Data Mining is an overhyped concept" (OTR)
Data Analysis
Data analysis = Processing data
Exploratory vs. Confirmatory
– are there interesting structures?
– can we predict the value?
Descriptive vs. Inferential
– statement about data set
– draw more general conclusions
Data analysis = process of computing various summaries and derived values from the given collection of data
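The "summaries and derived values" on this slide can be made concrete with a small descriptive example. This is only a sketch: the function name `summarize` and the sample numbers are made up for illustration and do not come from the slides.

```python
import statistics


def summarize(values):
    """Compute a few descriptive summaries of a list of numbers.

    Hypothetical helper for illustration; it describes the given data set
    only and makes no inferential claim about a wider population.
    """
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Made-up measurements, purely illustrative.
data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(data)
print(summary)
```

These are descriptive statements about the data set at hand. Drawing more general conclusions (the inferential side) would additionally need assumptions about how the data were sampled.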
Tools
Cookbook fallacy: Data analysis = picking and applying the right tool.
– Tools are not independent.
– Matching is an iterative process (which needs intelligence).
Stat vs. ML
Statistics
– Mathematics
Machine Learning
– Experimental Computer Science
"Statistics is difficult"
"Algorithms are not exact"
Models Models vs. Algorithms Empirical vs. Mechanistic Models Understanding vs. Prediction Models vs. Patterns Overfitting Constraints
Algorithms Enabling data analysis Too many: often no foundations, no applications In practice only a restricted set of algorithms is used
The nature of data
Different kinds of data
– Numerical Data
– Text
– Images
– Sound
Raw data has
– missing values
– distortions
– misrecording
– inadequate sampling
– etc.
The nature of data
Data sets can be large
– horizontal
– vertical
Curse of dimensionality
Experiments
Sampling
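One common way to illustrate the curse of dimensionality named on this slide: if points are spread uniformly over a unit hypercube, a sub-cube that captures a fixed fraction of them needs an edge of length fraction^(1/d), which approaches 1 as the dimension d grows. The sketch below assumes that uniform setting; the function name `edge_length` is hypothetical.

```python
def edge_length(fraction, dims):
    """Edge length of a sub-cube holding `fraction` of the volume of a
    unit hypercube in `dims` dimensions (assumes uniformly spread data)."""
    return fraction ** (1.0 / dims)


# How wide must a "local" neighbourhood be to hold 1% of the data?
for dims in (1, 2, 10, 100):
    print(dims, round(edge_length(0.01, dims), 3))
```

In 10 dimensions the edge is already roughly 0.63 of the full range per axis, so a neighbourhood holding just 1% of the data is no longer local in any useful sense.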
The nature of data
Too little
– Example: storm situations
Too much
– Example: image segmentation
Static vs. dynamic
Off-line vs. On-line
Infoglut
What is collected?
Overview Statistical methods and concepts Bayesian methods Time series Rule induction Neural networks Fuzzy logic Stochastic search methods Applications
Overview Why Intelligent Data Analysis Fundamental Concepts of Statistics Intelligent Data Analysis: Issues and Challenges Artificial Neural Networks Fuzzy Logic Industrial Applications of Neuro-Fuzzy Networks Statistical Methods for Data Analysis Time Series Analysis
Overview Chaos and Reality Bayesian Networks ANN Visualization Tools Rule Induction Evolutionary Systems Data Analysis in Real-World Applications
Enrichment
Data Fusion
– combine data sets
Example:
– customer database
– survey information
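The customer database and survey example could look like the following sketch, which enriches each customer record with matching survey answers. It assumes both sources share a key; the field names (`customer_id`, `region`, `owns_car`) and the records are invented for illustration.

```python
def fuse(primary, secondary, key):
    """Enrich each record in `primary` with fields from the record in
    `secondary` that has the same `key` value (a simple left join)."""
    lookup = {row[key]: row for row in secondary}
    fused = []
    for row in primary:
        extra = lookup.get(row[key], {})  # no match: keep the record as-is
        fused.append({**row, **extra})
    return fused


# Hypothetical customer database and survey results.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
    {"customer_id": 3, "region": "West"},
]
survey = [
    {"customer_id": 1, "owns_car": True},
    {"customer_id": 2, "owns_car": False},
]

enriched = fuse(customers, survey, "customer_id")
print(enriched)
```

Customers without a survey record are kept unchanged, which leaves missing values in the enriched table to be handled in a later step.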
Data Mining
Database technology
Data visualization
Data warehouse vs. Operational database
– time-dependent
– non-volatile
– subject-oriented
– integrated
Target: decision making
Data Mining
Selection Cleaning Enrichment Coding Data Mining Reporting
Cleaning
Remove duplicates
Check domain consistency
Remove data
Project data
Combine data in one table
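Two of the cleaning steps listed here, removing duplicates and checking domain consistency, can be sketched as follows. The records, the field names and the age range of 0 to 120 are assumptions made up for this example.

```python
def remove_duplicates(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def domain_violations(records, domains):
    """Return (record index, field) pairs whose value fails its domain check.

    `domains` maps a field name to a predicate returning True for valid values.
    """
    bad = []
    for i, rec in enumerate(records):
        for field, is_valid in domains.items():
            if not is_valid(rec.get(field)):
                bad.append((i, field))
    return bad


# Hypothetical raw records: one exact duplicate, one impossible age.
raw = [
    {"name": "Jansen", "age": 34},
    {"name": "Jansen", "age": 34},
    {"name": "De Vries", "age": -5},
]
unique = remove_duplicates(raw)
violations = domain_violations(
    unique, {"age": lambda a: isinstance(a, int) and 0 <= a <= 120}
)
print(unique, violations)
```

Only exact duplicates are caught this way. Near-duplicates, such as the same customer with a misspelled name, need fuzzier matching that this sketch does not attempt.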
Coding
Address - Region
Date of birth - Age
Scaling of numerical data
Date - Number of months
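Three of these coding steps are simple derived values and can be sketched directly. The function names are hypothetical, and min-max scaling is just one possible choice for "scaling of numerical data"; the slide does not say which scaling is meant.

```python
from datetime import date


def age_in_years(born, today):
    """Date of birth -> age in whole years on `today`."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def months_since(start, today):
    """Date -> number of calendar months elapsed up to `today`."""
    return (today.year - start.year) * 12 + (today.month - start.month)


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1] (one assumed scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative dates and values, not taken from the slides.
reference = date(2000, 6, 14)
print(age_in_years(date(1970, 6, 15), reference))
print(months_since(date(1999, 11, 1), reference))
print(min_max_scale([10, 20, 30]))
```

Mapping an address to a region is left out because it depends on a lookup table (for example by postal code) that the slides do not specify.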
Data Mining SQL queries Clustering Pattern Recognition ES ML Statistics Visual DB KDD
Nearest Neighbor Search k nearest points
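A brute-force version of "k nearest points" is short enough to show in full. This sketch assumes Euclidean distance and points given as coordinate tuples; the slide does not fix either choice, and the function name `k_nearest` is hypothetical.

```python
import heapq
import math


def k_nearest(points, query, k):
    """Return the k points closest to `query` by Euclidean distance,
    nearest first. Brute force: every point is compared with the query."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


# Made-up 2-D points for illustration.
points = [(0, 0), (1, 1), (5, 5), (2, 2)]
neighbours = k_nearest(points, (1.2, 1.2), k=2)
print(neighbours)
```

The scan costs time proportional to the number of points for each query. Large data sets usually call for an index structure such as a k-d tree, although such structures lose much of their advantage in high dimensions.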
Oil Search Shell research South-East Asia measurements kinds of stone coring
Applications
Outlook
Positive
– Moore’s Law
– New kinds of computers
– Data collection
– More data is more easily reachable
Negative
– Collective memory gets lost
– Infoglut
Data battle
Outlook
Merge of Machine Learning and Statistics
Algorithms
– Adaptive parameters
– Black Box data mining
From suites to tailored tools
Intelligent Data Analysis
– User Interaction
– also uses tools from Machine Learning
NetTalk
[Figure labels: speech-synthesis expert system, INTELLI, NetTalk, neural network, sound generator]