1
Strategies and Tactics for Data Mining
Data Mining is part of Knowledge Discovery in Databases (KDD). There are various KDD paradigms. The CRISP KDD paradigm is a management-cycle approach to business KDD. Moh will consider this later. In this lecture we focus on strategies and tactics for Exploratory Data Analysis (EDA), making use of Data Mining tools.
2
Exploration and Discovery
Data Mining is defined as the discovery of previously unknown patterns and relationships in datasets (usually large ones) using automatic machine learning techniques. Even so, Data Exploration and Mining needs to be driven by human beings who have goals and preconceptions and who wish to make use of the discovered patterns. Hence KDD as a whole is NOT automatic; it relies on human skill and judgement. EDA and KDD are as much ‘art’ as ‘science’.
3
The Nature of Exploration and Discovery
The explorer may set out with the general aim of exploring unknown territory, or data, in order to discover new facts, patterns and relationships. However, the explorer will not generally know where he or she is going at the start of an exploratory expedition. An explorer chooses a route, direction or path, often arbitrarily, and observes where it leads.
4
Observation: Looking at the data and results
“Just” looking at the RAW data in a flat file can often be illuminating. Errors, outliers and missing values are often obvious. Looking at the summary results of a preliminary analysis can suggest interesting facts, which may lead on to the next step in the exploration. “Looking” means “really looking”: looking with expectations and preconceptions, with perception and open-mindedness, and incorporating each new fact into the exploration plan.
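The "look at the raw data" step can be partly mechanised. The following is a minimal sketch, not a prescribed method: the flat file contents, the column names and the cut-off of 3.5 are all invented for illustration, and the median-based outlier rule is just one common choice.

```python
import csv
import io
import statistics

# Hypothetical raw flat file; column names and values are invented for illustration.
RAW = """age,cholesterol
63,233
41,204
,250
56,2360
57,
44,263
"""

def scan_column(rows, name, z_cut=3.5):
    """Return (missing_row_numbers, suspect_row_numbers) for one numeric column."""
    missing, values = [], {}
    for i, row in enumerate(rows, start=1):
        cell = (row.get(name) or "").strip()
        if cell == "":
            missing.append(i)
        else:
            values[i] = float(cell)
    med = statistics.median(values.values())
    # Median absolute deviation: a spread estimate that the outliers themselves barely affect.
    mad = statistics.median(abs(v - med) for v in values.values())
    suspects = [i for i, v in values.items()
                if mad > 0 and 0.6745 * abs(v - med) / mad > z_cut]
    return missing, suspects

rows = list(csv.DictReader(io.StringIO(RAW)))
for col in ("age", "cholesterol"):
    missing, suspects = scan_column(rows, col)
    print(col, "missing rows:", missing, "suspect rows:", suspects)
```

On this toy file the scan reports the blank cells and the cholesterol value of 2360. Whether that value is a typing error or a genuine reading is still a question for a human who knows the metadata.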
5
The importance of metadata in Data Mining
Metadata is data about (raw) data. A column of data values, corresponding to the values of an attribute over instances, will usually be given a meaningful name. This attribute name, with its implicit and explicit connotations, will influence how we regard the attribute and what we expect of its relationships with other attributes. E.g. if two attributes have the names “LabourCost ($)” and “NetProfit ($)”, we have some expectation of their meanings and of the relationship between them (e.g. NetProfit = (Total Revenues) – (Total Costs), with LabourCost included in the last term). The dollar sign suggests the business probably involves the US. However, if a date attribute associated with the NetProfit were earlier than the date associated with the LabourCost, then we might expect the NetProfit to be a Target NetProfit. These are SEMANTICS which follow from the metadata. They CANNOT BE OBTAINED BY AUTOMATIC MACHINE LEARNING.
6
The importance of preconceptions, prior information and research
Prior information, together with the expectations and preconceptions obtained from metadata and its semantics, forms the framework and context in which raw data and preliminary results are “looked at”. If you do not have any prior expectations, then do some research (e.g. a Google search). Anything unexpected is a discovery! But you may have just discovered an error in the data, the analysis, or your preconceptions.
7
Tools and Resources for initial EDA and DM
Metadata; semantics
Raw data visual scans
Raw data visual display (matrix scatter plot; histograms)
Summaries by attribute (means and SDs)
Unsupervised clustering of instances
Principal component analysis of attributes, which in effect groups correlated attributes (it depends on the correlation matrix)
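Several of the tools listed above reduce to a few lines of arithmetic. The sketch below computes per-attribute means and standard deviations, the correlation matrix, and an approximation to the first principal component of that matrix by power iteration. The data and the attribute names "a", "b" and "c" are invented, and power iteration is used here only to keep the example free of third-party libraries; in practice a statistics package would normally do this.

```python
import math
import statistics

# Invented toy data: three attributes (columns) over six instances (rows).
DATA = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1],
    "c": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0],
}

def summarise(data):
    """Mean and sample standard deviation for each attribute."""
    return {k: (statistics.mean(v), statistics.stdev(v)) for k, v in data.items()}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(data):
    names = list(data)
    return names, [[pearson(data[p], data[q]) for q in names] for p in names]

def first_component(matrix, iterations=200):
    """Power iteration: approximates the dominant eigenvector and its eigenvalue."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    return vec, eigenvalue

names, corr = correlation_matrix(DATA)
loadings, eigenvalue = first_component(corr)
print(summarise(DATA))
print(dict(zip(names, loadings)), "share of variance:", eigenvalue / len(names))
```

In this toy data "a" and "b" are almost perfectly correlated, so they load heavily on the first component while "c" does not. This is the sense in which the component groups attributes.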
8
The importance of goals
A completely objective uncovering of facts, patterns and relations might take forever, and might confuse rather than aid understanding. Hence goals, or priorities, are needed to select the direction in which the exploration and mining should proceed.
9
Finally, interpretation and understanding
The discovered facts, patterns and relationships need to be interpreted in the context of prior expectations and the goals of the EDA/DM. With that context, an UNDERSTANDING of these facts, patterns and relationships hopefully becomes possible, so that they can be used and applied appropriately.
10
Case Study: The Cardiology Dataset
METADATA: Age, sex, chest pain type, blood pressure, cholesterol, fasting blood sugar <120, resting ECG, maximum heart rate, angina, peak, slope, number of colored vessels, Thal, class. 303 instances. Not too large. The attributes are all relevant to cardiovascular condition, including some ECG attributes. Class is a Healthy/UnHealthy classification.
SO WHAT SHOULD WE DO? Class might have been obtained either (i) from a clinician, or (ii) by a previous data mining classification exercise. If (i), then supervised classification might be appropriate. If (ii), then we may wish to do our own unsupervised classification analysis and see if our results agree with the class given. Even if (i), we might wish to check the validity of the clinical diagnosis.
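Checking an unsupervised grouping against the given class can be sketched as follows. This is an illustration under stated assumptions, not the analysis of the actual cardiology data: the six two-attribute instances and their labels are invented, k-means with two clusters stands in for whatever clustering tool is used, and the real dataset's mixed attribute types would need encoding and scaling first.

```python
import math

# Invented toy instances (two numeric attributes each) with a given class label.
POINTS = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (4.0, 4.2), (4.3, 3.9), (3.8, 4.1)]
GIVEN = ["Healthy", "Healthy", "Healthy", "UnHealthy", "UnHealthy", "Healthy"]

def two_means(points, iterations=20):
    """Plain k-means with k=2, seeded deterministically with the first and last points."""
    centres = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        labels = [min((0, 1), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Move each centre to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def agreement(labels, given):
    """Cross-tabulate cluster id against given class; assumes exactly two classes."""
    table = {}
    for lab, g in zip(labels, given):
        table[(lab, g)] = table.get((lab, g), 0) + 1
    a, b = sorted(set(given))
    # Cluster ids are arbitrary, so try both ways of matching clusters to classes.
    straight = table.get((0, a), 0) + table.get((1, b), 0)
    swapped = table.get((0, b), 0) + table.get((1, a), 0)
    return table, max(straight, swapped) / len(given)

labels = two_means(POINTS)
table, rate = agreement(labels, GIVEN)
print("cross-tabulation:", table)
print("agreement with given class:", rate)
```

An instance that falls in one cluster but carries the other class label, like the last toy instance here, is exactly the kind of case worth a second look: it may point to an error in the given class, in the data, or in the clustering.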
11
EDX Unsupervised classification Look at the working spreadsheet considerations.