Strategies and Tactics for Data Mining

Data Mining is part of Knowledge Discovery in Databases (KDD). There are various KDD paradigms. The CRISP-DM paradigm (written "CRISP KDD" in the original slides) is a management-cycle approach to business KDD. Moh will consider this later. In this lecture we focus on strategies and tactics for Exploratory Data Analysis (EDA), making use of Data Mining tools.

Exploration and Discovery
Data Mining is defined as the process of discovering previously unknown patterns and relationships in datasets (usually large datasets) using automatic machine learning techniques. Even so, data exploration and mining need to be driven by human beings who have goals and preconceptions and who wish to make use of the discovered patterns. Hence KDD as a whole is NOT automatic, and needs to make use of human skills and judgement. EDA and KDD are as much ‘art’ as ‘science’.

The Nature of Exploration and Discovery
The explorer may have the general aim of exploring unknown territory, or data, in order to discover new facts, patterns and relationships. However, the explorer will not generally know where they are going at the start of an exploratory expedition. An explorer chooses a route, direction or path, often arbitrarily, and observes where it takes them.

Observation: Looking at the data and results
“Just” looking at the RAW data in a flat file can often be illuminating. Errors, outliers and missing values are often obvious. Looking at the summary results of a preliminary analysis can suggest interesting facts which may lead on to the next step in the exploration. “Looking” means “really looking”. That is, looking with expectations and preconceptions, with perception and open-mindedness, and incorporating all new facts into the exploration plan.
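A first look of this kind can be partly mechanised. The following is a minimal sketch, not a prescribed method: it scans a small flat file for missing values and crude outlier candidates. The column names, the data values and the cutoff of 5 median absolute deviations are all assumptions made for illustration.

```python
import csv
import io
import statistics

# Hypothetical flat file: column names and values are invented for illustration.
RAW = """age,cholesterol,max_heart_rate
63,233,150
37,250,187
41,,172
56,236,178
57,2360,148
44,263,173
52,199,162
"""


def scan(text, cutoff=5.0):
    """Report missing values, mean, sd and outlier candidates for each column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for name in rows[0].keys():
        cells = [row[name] for row in rows]
        values = [float(c) for c in cells if c != ""]  # empty cell = missing
        median = statistics.median(values)
        # Median absolute deviation: a spread measure that one wild value cannot inflate.
        mad = statistics.median(abs(v - median) for v in values)
        # The cutoff is an arbitrary assumption; flagged values still need a human look.
        outliers = [v for v in values if mad > 0 and abs(v - median) > cutoff * mad]
        report[name] = {
            "missing": len(cells) - len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
            "outliers": outliers,
        }
    return report


report = scan(RAW)
for name, info in report.items():
    print(name, info)
```

A flagged value such as the cholesterol reading of 2360 is only a candidate: whether it is a typing error or a genuine extreme case is a judgement that needs the metadata and a human reader.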

The importance of metadata in Data Mining
Metadata is data about (raw) data. A column of data values, corresponding to the values of an attribute over instances, will usually be given a meaningful name. This attribute name, with its implicit and explicit connotations, will influence how we regard the attribute and what we expect of its relationships to other attributes. E.g. if two attributes have the names “LabourCost ($)” and “NetProfit ($)”, we have some expectation of their meanings and of the relationship between them (e.g. NetProfit = (Total Revenues) - (Total Costs), with LabourCost included in the last term). The dollar sign suggests that the business probably involves the US. However, if a date attribute associated with the NetProfit were earlier than the date associated with the LabourCost, then we might expect the NetProfit to be a Target NetProfit. These are SEMANTICS which follow from the metadata. They CANNOT BE OBTAINED BY AUTOMATIC MACHINE LEARNING.
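Once a human has read such semantics off the metadata, the resulting expectations can be written down as explicit checks. The sketch below illustrates this with the profit example above. The field names, the figures and the tolerance are hypothetical, and the rules encode the analyst's reading of the metadata rather than anything learned from the data.

```python
from datetime import date

# Hypothetical records; field names and figures are invented for illustration.
records = [
    {"labour_cost": 40.0, "total_costs": 90.0, "total_revenues": 120.0,
     "net_profit": 30.0,
     "cost_date": date(2020, 3, 31), "profit_date": date(2020, 4, 30)},
    {"labour_cost": 55.0, "total_costs": 50.0, "total_revenues": 100.0,
     "net_profit": 50.0,
     "cost_date": date(2020, 6, 30), "profit_date": date(2020, 7, 31)},
    {"labour_cost": 20.0, "total_costs": 60.0, "total_revenues": 100.0,
     "net_profit": 45.0,
     "cost_date": date(2020, 9, 30), "profit_date": date(2020, 1, 1)},
]


def check(record, tolerance=0.01):
    """Return notes on where a record departs from the expected semantics."""
    notes = []
    # Labour cost is expected to be one component of total costs.
    if record["labour_cost"] > record["total_costs"]:
        notes.append("labour cost exceeds total costs")
    # Expected identity: NetProfit = Total Revenues - Total Costs.
    expected = record["total_revenues"] - record["total_costs"]
    if abs(record["net_profit"] - expected) > tolerance:
        notes.append("net profit differs from revenues minus costs")
    # A profit figure dated before the costs may be a target, not an outcome.
    if record["profit_date"] < record["cost_date"]:
        notes.append("net profit predates costs: possibly a target figure")
    return notes


results = [check(r) for r in records]
for i, notes in enumerate(results):
    print(i, notes or "consistent with expectations")
```

A note is not a verdict: it marks a place where either the data or the preconception needs a second look.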

The importance of preconceptions, prior information, and research
Prior information, together with the expectations and preconceptions obtained from metadata and its semantics, forms the framework and context in which raw data and preliminary results are “looked at”. If you do not have any prior expectations, then do some research (e.g. Google). Anything unexpected is a discovery! But you may have just discovered an error in the data, the analysis, or your preconceptions.

Tools and Resources for initial EDA and DM
Metadata and its semantics
Visual scans of the raw data
Visual displays of the raw data (matrix scatter plot; histograms)
Summaries by attribute (means and standard deviations)
Unsupervised clustering of instances
Principal component analysis of the attributes, which in effect groups correlated attributes together (it depends on the correlation matrix)
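Two of the tools listed above, per-attribute summaries and principal component analysis, can be sketched in a few lines. This is a minimal illustration assuming NumPy is available. The data are synthetic, and the PCA is computed directly as an eigen-decomposition of the correlation matrix to make that dependence visible.

```python
import numpy as np

# Synthetic data: 200 instances, 4 attributes (invented for illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),  # attributes 0 and 1 share a common factor
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),               # attributes 2 and 3 are independent noise
    rng.normal(size=(200, 1)),
])

# Summaries by attribute.
means = X.mean(axis=0)
sds = X.std(axis=0, ddof=1)

# PCA from the correlation matrix of the attributes.
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]          # reorder to descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
explained = eigenvalues / eigenvalues.sum()    # share of total variance per component

print("means:", np.round(means, 2))
print("sds:", np.round(sds, 2))
print("explained variance ratios:", np.round(explained, 2))
print("loadings of first component:", np.round(eigenvectors[:, 0], 2))
```

In this synthetic case the first component loads mainly on the two correlated attributes, which is the sense in which PCA groups attributes. On real data the loadings still have to be interpreted against the metadata.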

The importance of goals
A completely objective uncovering of all facts, patterns and relations might take forever, and would confuse rather than aid understanding. Hence certain goals, or priorities, are needed to select the direction in which the exploration and mining should proceed.

Finally, interpretation and understanding
The discovered facts, patterns and relationships need to be interpreted in the context of prior expectations and the goals of the EDA/DM. An UNDERSTANDING of these facts, patterns and relationships then hopefully becomes possible, so that they may be used and applied appropriately.

Case Study: The Cardiology Dataset
METADATA
Attributes: age, sex, chest pain type, blood pressure, cholesterol, fasting blood sugar <120, resting ECG, maximum heart rate, angina, peak, slope, number of colored vessels, thal, class.
303 instances. Not too large.
The attributes are all relevant to cardiovascular condition, including some ECG attributes.
Class is a Healthy/Unhealthy classification.
SO WHAT SHOULD WE DO?
Class might have been obtained either (i) from a clinician, or (ii) by a previous data mining classification exercise. If (i), then supervised classification might be appropriate. If (ii), then we may wish to do our own unsupervised classification analysis and see whether our results agree with the class given. Even if (i), we might wish to check the validity of the clinical diagnosis.
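The comparison suggested above, clustering the instances without using the class and then checking agreement with the class given, can be sketched as follows. This is an illustration only. It uses synthetic stand-in data rather than the cardiology dataset, and a basic k-means with two clusters, which is one possible clustering method and not necessarily the one used in the lecture. NumPy is assumed.

```python
import numpy as np


def kmeans(X, k, iterations=50, seed=0):
    """Basic k-means: returns a cluster label per instance and the cluster centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every instance to every centre, then nearest-centre assignment.
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


# Synthetic stand-in for the dataset: 303 instances, 3 numeric attributes.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(150, 3))
group_b = rng.normal(loc=4.0, scale=1.0, size=(153, 3))
X = np.vstack([group_a, group_b])
given_class = np.array([0] * 150 + [1] * 153)  # hypothetical class attribute

# Standardise so that no attribute dominates the distance through its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
labels, centres = kmeans(Z, k=2)

# Cross-tabulate the given class (rows) against the cluster labels (columns).
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (given_class, labels), 1)

# Cluster numbering is arbitrary, so take the better of the two possible matchings.
agreement = max(np.mean(labels == given_class), np.mean(labels != given_class))
print(table)
print(f"agreement with given class: {agreement:.2f}")
```

High agreement would suggest that the class could have come from a similar clustering exercise. Low agreement is itself a finding to interpret, since the class may encode clinical judgement that the attributes do not capture.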

EDX Unsupervised classification
Look at the working spreadsheet considerations.
