Lecture 2 Themes in this session
Knowledge discovery in databases
Data mining
Multidimensional analysis and OLAP

3 What is Knowledge?
The hierarchy: Data, Information, Knowledge, Understanding, Wisdom.
Data: symbols representing properties of events and their environments.
Information: contained in descriptions; provides the answers to a number of basic questions.
Knowledge: basic know-how that facilitates and allows action.
Understanding: achieved through diagnosis and prescription.
Wisdom: judgement of what is efficient and effective.

4 Characteristics of discovered knowledge
Discovered knowledge should be non-trivial, valid, novel, potentially useful and understandable.
An aggregated measure is “interestingness”, which combines validity, novelty, usefulness and simplicity.

5 A more formal definition of knowledge
Pattern: a pattern is an expression $E$ in a language $L$ describing facts in a subset $F_E$ of $F$. $E$ is called a pattern if it is simpler than the enumeration of all the facts in $F_E$.
Knowledge: a pattern $E \in L$ is called knowledge if, for some user-specified threshold $i \in M_I$, $I(E, F, C, N, U, S) > i$, where $C$ = validity, $N$ = novelty, $U$ = usefulness and $S$ = simplicity.
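
As a rough illustration of the threshold test above, the sketch below assumes that the interestingness measure is a weighted average of the four component scores, each scaled to the range 0 to 1. The definition does not fix the form of I, and it also takes the pattern E and the facts F as arguments, which are left out here. The weights, the scores and both function names are hypothetical.

```python
def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Hypothetical I: weighted average of four scores, each in [0, 1]."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def is_knowledge(pattern_scores, threshold):
    """A pattern counts as knowledge when its interestingness exceeds the threshold i."""
    return interestingness(*pattern_scores) > threshold


# Scores are (validity, novelty, usefulness, simplicity); values are made up.
strong_pattern = (0.9, 0.8, 0.7, 0.6)
weak_pattern = (0.2, 0.2, 0.2, 0.2)
print(is_knowledge(strong_pattern, threshold=0.5))
print(is_knowledge(weak_pattern, threshold=0.5))
```

The threshold stays a user-supplied argument, in line with the "user-specified threshold" in the definition.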

6 What is KDD?
Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data.
KDD is a process: it involves the extraction, organisation and presentation of discovered information.
KDD is effected by a human-centred system.
KDD is in itself a knowledge-intensive task consisting of complex interactions between a human and a (large) database.

7 Overview of the analyst’s tasks
[Figure: the analyst’s task cycle. Nodes: Goals, Insight, Queries, Analyses, DB, Dataset, Output. Arrow labels: formulates, gains, enriches, generates.]

8 Characteristics of the KDD process
Highly iterative
Protracted over time
Numerous sub-tasks
Highly complex
Numerous input systems

9 A description of the KDD process
[Figure: phases of the KDD process. Goal formulation, Data discovery, Task discovery, Data cleaning, Model development, Data analysis, Output generation.]

10 Goal formulation
Based on a means-ends chain extending into the workings of the organisation:
Formulate a goal for improving the operations of the business.
Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner.
On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process.
Revise all of the goals above, if need be, on the basis of iterative discovery.

11 Data discovery
Try to understand the domain in order to determine which entities are relevant to the discovery process.
Check the coverage and content of the data: sift through the source data to see what is available and what is not available.
Determine the quality of the data.
Determine the structure of the data.

12 Task discovery
Find the means stipulated by the ends contained in the knowledge discovery goals.
Find out what the real requirements on the tasks, and on the performance of these tasks, are.
Refine the requirements and the choice of tasks until you are sure you are setting about answering the correct questions.

13 Data cleaning
Ensure the quality of the data that will be used in the KDD process.
Eliminate data quality problems such as:
  inconsistencies due to differences between various data sources
  missing data
  different forms of data representation
  data incompatibility
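
Two of the problems listed above, different forms of data representation and missing data, can be sketched in a few lines. This is a minimal sketch under assumed conditions: the records, the field names, the two date formats and the set of missing-value markers are all hypothetical.

```python
from datetime import datetime

# Hypothetical records merged from two sources that format dates differently
# and mark missing values in different ways.
RAW = [
    {"customer": "A17", "joined": "2001-03-15", "region": "North"},
    {"customer": "B02", "joined": "15/04/2001", "region": "N/A"},
    {"customer": "C40", "joined": "", "region": "south"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")
MISSING = {"", "N/A", "n/a", "NULL"}


def clean_date(text):
    """Return an ISO date string, or None if the value is missing or unparseable."""
    if text in MISSING:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(record):
    """Bring one record into a single representation, with None for missing values."""
    region = record["region"]
    return {
        "customer": record["customer"],
        "joined": clean_date(record["joined"]),
        "region": None if region in MISSING else region.capitalize(),
    }


cleaned = [clean_record(r) for r in RAW]
for row in cleaned:
    print(row)
```

Missing values are kept as None rather than guessed, so that later phases can decide how to treat them.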

14 Model development
Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals.
Select the parameters for the model:
  formulate measures that can be used to quantify achievement of the goal (the outcome variable or dependent variable)
  select a set of independent variables which are deemed to have relevance to the outcome variables
Segment the data: find possibly relevant subsets in the population.
Choose an analysis model which fits the problem domain.
NOTE: This whole phase demands background knowledge of the domain.

15 Data analysis
Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal.
Specify the chosen model: use some form of formal expression.
Fit the model to the data: perform initial adjustments to some of the parameters.
Evaluate the model: check the soundness of the model against the data.
Refine the model: modify the model on the basis of its discrepancies with the evidence presented by the data.

16 Output generation
Reports of findings in the analysis.
Action suggestions on the basis of the findings.
Models for use in similar analysis scenarios.
Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
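
A monitoring mechanism of the kind described above could look roughly like the following. This is only a sketch: the variable name `stock_level`, the condition and the use of a list as the notification channel are hypothetical choices made for illustration.

```python
def make_monitor(variable, condition, notify):
    """Return a function that checks one record and notifies when the condition holds."""
    def check(record):
        value = record[variable]
        if condition(value):
            notify(f"{variable}={value} triggered the monitor")
            return True
        return False
    return check


# Hypothetical use: collect a notification whenever stock drops below 10.
alerts = []
low_stock = make_monitor("stock_level", lambda v: v < 10, alerts.append)
for row in [{"stock_level": 42}, {"stock_level": 7}]:
    low_stock(row)
print(alerts)
```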

17 Developing KDD applications
Purpose: an application to answer a key business question.
A labour-intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed.
Encoding of the discovered knowledge within a specific problem-solving architecture.
Application of the knowledge in the context of a real-world task by a well-understood class of end-users.
Installation of analysis, monitoring and reporting mechanisms as a base for continual evaluation of data.

19 What is data mining?
Rather formal definition: data mining involves fitting models to, and observing patterns from, observed data through the application of specific algorithms.
Less formally: data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification.

20 Goals for data mining
Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.

21 Rationale for data mining
Dramatic increase in the amount of data available (the data explosion)
Increasing competition in the world’s market
The low relative value of easily discovered information
Increasing cleverness
Emergence of new enabling technology

22 Enabling factors for data mining
Increased data storage ability
Increased data gathering ability
Increased processing power
The introduction of new computationally intensive methods of machine learning

23 Background to data mining
Inductive learning: supervised learning and unsupervised learning
Statistics
Machine learning
Differences between DM and ML:
  DM finds understandable knowledge; ML improves the performance of an agent.
  DM is concerned with large, real-world databases; ML with smaller data sets.
  ML is a broader field, not only learning by example.

24 Data mining algorithms
Specific mix of three components:
The model: its function, its representational form, and its parameters, which are determined from the data.
The model evaluation (preference) criterion: preference of one model or set of parameters over another, based on a goodness-of-fit function.
The search method: a method for finding particular models and parameters, given the data, a family of models and a preference criterion.
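
The three components can be made concrete with a deliberately tiny example. The sketch below assumes a model family of straight lines through the origin, a sum-of-squared-errors criterion and an exhaustive grid search; none of these choices comes from the slides, and the data values are made up.

```python
# Hypothetical data: y is roughly 2 * x.
DATA = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(slope, x):
    """Model family: straight lines through the origin, y = slope * x."""
    return slope * x


def squared_error(slope, data):
    """Evaluation criterion: sum of squared residuals (lower is better)."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)


def grid_search(data, candidates):
    """Search method: try every candidate slope and keep the one with the lowest error."""
    return min(candidates, key=lambda s: squared_error(s, data))


candidates = [i / 10 for i in range(0, 51)]  # slopes 0.0, 0.1, ..., 5.0
best_slope = grid_search(DATA, candidates)
print(best_slope)
```

Swapping any one component (another model family, another criterion, a smarter search) gives a different algorithm, which is the point of describing algorithms as a mix of the three.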

25 Primary operations in data mining
A number of basic operations can be used for prediction and description:
Classification
Regression
Clustering
Summarisation
Dependency modelling
Change and deviation detection

26 Classification
Learning a function that maps (classifies) a data item into one of several predefined classes.
In supervised learning it is the user who defines the classes.
The classification is applied in the form of one or more attributes that denote the class of the data item. These classifying attributes are known as predicted attributes.
A combination of values for the predicted attributes defines a class.
Other attributes of the data item are known as predicting attributes.
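
To illustrate the split between predicting and predicted attributes, here is a minimal sketch that uses a nearest-centroid classifier. The slides do not name a particular classification method, so this technique is chosen only because it is short; the loan data, the attribute meanings (income and debt in thousands) and the class labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical training data: ((income, debt), decision).
# The tuple holds the predicting attributes; the label is the predicted attribute.
TRAINING = [
    ((20, 15), "declined"),
    ((25, 18), "declined"),
    ((60, 5), "approved"),
    ((75, 10), "approved"),
]


def train(examples):
    """Learn one centroid (mean of the predicting attributes) per class."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Map a data item to the class whose centroid is nearest."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


centroids = train(TRAINING)
print(classify(centroids, (70, 8)))
print(classify(centroids, (22, 14)))
```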

27 Regression
A common statistical technique for modelling the relationship between two or more variables.
Learning a function which maps a data item to a real-valued prediction variable.
Simple linear regression uses the straight-line model $Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y$ is the prediction variable (dependent variable) and $X$ is the predictive variable (independent variable).
Multiple regression involves more than two variables and uses the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where $Y$ is the prediction variable and $X_1, \dots, X_n$ are the predictive variables.
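
The coefficients of the simple linear model above are usually estimated by ordinary least squares. The sketch below assumes that estimation method (the slide does not state one) and uses made-up data that lie exactly on a line, so the fitted values are easy to check by hand.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares estimates (b0, b1) for Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly Y = 1 + 2X, so every residual is zero
b0, b1 = fit_simple_linear(xs, ys)
print(b0, b1)
```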

28 Clustering
A common descriptive task for determining a finite set of categories or clusters to describe the data.
Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories.
A cluster is a group of objects grouped together because of their similarity or proximity.
Data units in a cluster are homogeneous and differ significantly from those in other clusters.
Correlations and functions of distance between elements are used in defining the clusters.
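
As one example of a distance-based method, the sketch below runs a very small k-means on one-dimensional data. k-means is not named in the slides; it is used here only as a familiar instance of clustering by proximity, and the data values and starting centres are made up.

```python
def k_means_1d(values, centres, iterations=10):
    """Very small k-means for one-dimensional data with given starting centres."""
    centres = list(centres)
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            # Assign each value to the nearest centre (distance is the absolute difference).
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centres, clusters = k_means_1d(values, centres=[0.0, 5.0])
print(centres)
print(clusters)
```

The result is a set of mutually exclusive and exhaustive clusters; hierarchical or overlapping categories would need a different method.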

29 Summarisation
Methods for finding a compact description for a subset of data.
Often relies on statistical methods such as the calculation of means and standard deviations.
Summarisation methods are often applied to interactive exploratory data analysis and automated report generation.
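
A compact description of this kind can be produced with Python's standard `statistics` module. The sketch below is one possible summary; the choice of fields in the result and the sample values are assumptions made for illustration.

```python
import statistics


def summarise(values):
    """Compact description of a subset of data: size, centre, spread and range."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


order_values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical subset of data
summary = summarise(order_values)
print(summary)
```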

30 Dependency modelling
Consists of finding a model which describes significant dependencies between variables.
There are two levels of dependency in dependency models:
  The structural level specifies which variables are locally dependent on each other.
  The quantitative level specifies the strengths of the dependencies using some numerical scale.
Often in the form: x% of all records containing items A and B also contain items D and E.
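
The "x% of all records" form above can be computed directly by counting. In the sketch below the records are modelled as sets of items; the basket contents and the function name are hypothetical.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Percentage of records containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        return 0.0
    both = [t for t in matching if consequent <= set(t)]
    return 100.0 * len(both) / len(matching)


# Hypothetical records, each a set of items.
BASKETS = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E", "F"},
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"C", "D", "E"},
]
confidence = rule_confidence(BASKETS, {"A", "B"}, {"D", "E"})
print(f"{confidence}% of records containing A and B also contain D and E")
```

Four records contain A and B, and two of those also contain both D and E, so x is 50 for this data.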

31 Change and deviation detection
Focuses on discovering the most significant changes in the data from previously measured or normative values.
Often used on a long time series of records in order to discover trends.
Often used to discover sequential patterns occurring over extended time periods.
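
One simple way to flag deviations from previously measured values is to compare new observations with the mean and standard deviation of a baseline. The three-standard-deviation threshold and the data below are assumptions for illustration, not something prescribed in the slides.

```python
import statistics


def deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard deviations
    away from the mean of the previously measured baseline."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) > threshold * spread]


baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # previously measured values
flagged = deviations(baseline, [101, 96, 120, 80])
print(flagged)
```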

32 Problems and issues in data mining
Limited information
Noise and missing values
Uncertainty
Size of databases
Irrelevance of certain fields
Updates to databases

33 Multidimensional analysis and OLAP

34 OLAP vs OLTP
OLTP servers:
  handle mission-critical production data accessed through simple queries
  usually handle queries of an automated nature
  run applications that consist of a large number of relatively simple transactions
  most often contain data organised on the basis of logical relations between normalised tables
OLAP servers:
  handle management-critical data accessed through an iterative analytical investigation
  usually handle queries of an ad-hoc nature
  support more complex and demanding transactions
  contain logically organised data in multiple dimensions

35 What is OLAP?
Definition: the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data.
Flexible information synthesis
Multiple data dimensions/consolidation paths
Dynamic data analysis

36 Codd’s four data models for data analysis
Categorical data models
Exegetical data models
Contemplative data models
Formulaic data models

37 Dimensionality revisited

38 OLAP Tool evaluation criteria (1-6)
1. Multidimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Generic dimensionality

39 OLAP Tool evaluation criteria (7-12)
7. Dynamic sparse matrix handling
8. Multi-user support
9. Unrestricted cross-dimensional analysis
10. Intuitive data manipulation
11. Flexible reporting
12. Unlimited dimensions and aggregation levels

40 Functionality of OLAP tools
Drill-down
Drill-up
Roll-up or consolidation
“Slicing and dicing” by pivoting
Drill-through
Drill-across
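
Two of the operations listed above, roll-up (consolidation) and slicing, can be sketched on a tiny in-memory cube. This is an illustration only: the cube is a plain dictionary, and the dimensions, members, sales figures and function names are all hypothetical rather than taken from any OLAP tool.

```python
from collections import defaultdict

# Hypothetical cube: (year, region, product) -> sales.
DIMENSIONS = ("year", "region", "product")
CUBE = {
    (2001, "North", "Bikes"): 10,
    (2001, "North", "Helmets"): 4,
    (2001, "South", "Bikes"): 7,
    (2002, "North", "Bikes"): 12,
    (2002, "South", "Helmets"): 5,
}


def roll_up(cube, keep):
    """Consolidate: sum the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)


def slice_cube(cube, dimension, member):
    """Slice: keep only the cells where one dimension equals a single member."""
    i = DIMENSIONS.index(dimension)
    return {k: v for k, v in cube.items() if k[i] == member}


by_year = roll_up(CUBE, ["year"])
north_by_product = roll_up(slice_cube(CUBE, "region", "North"), ["product"])
print(by_year)
print(north_by_product)
```

Drill-down is the reverse of roll-up in this picture: calling `roll_up` with more dimensions in `keep` returns the finer-grained cells again.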

41 An OLAP “answer set”

42 Different forms of OLAP
True OLAP
ROLAP (relational OLAP)
MOLAP (multidimensional OLAP)

