1 Tang: Introduction to Data Mining (with modification by Ch. Eick)
I: Introduction to Data Mining
A. Short Preview
   1. Initial Definition of Data Mining
   2. Motivation for Data Mining
   3. Examples of Data Mining Tasks
B. More Detailed Survey on Data Mining
C. Course Information

2 Teaching Plan for the Next 5 Weeks
1. Introduction to Data Mining and Course Information
2. Preprocessing (Han Chapter 3)
3. Concept Characterization (Han Chapter 5)
4. Classification Techniques (multiple sources)

3 Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
• Frequently, the term data mining is used to refer to KDD.
• Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html).
• The field is dominated more by industry than by research institutions.

4 Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
• Computers have become cheaper and more powerful (so machine learning techniques become applicable)
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

5 Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  – remote sensors on a satellite
  – telescopes scanning the skies
  – microarrays generating gene expression data
  – scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  – in classifying and segmenting data
  – in hypothesis formation

6 Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

7 Data Mining Tasks
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables.
• Description Methods
  – Find human-interpretable patterns that describe the data.

8 Classification Example
[Figure: table with attribute columns labeled categorical, continuous, and class; diagram labels: Training Set, Learn Classifier, Model, Test Set.]
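The figure on this slide did not survive in the transcript, so the following is only a minimal Python sketch of the workflow its labels suggest: learn a model from a training set, then apply it to a test set. The records, the attribute names (`status`, `income`), and the one-attribute majority rule are all hypothetical stand-ins, not the classifier or the data used in the lecture.

```python
from collections import Counter, defaultdict

def learn_classifier(training_set, attribute):
    """Learn a one-attribute rule: each attribute value predicts its majority class."""
    counts = defaultdict(Counter)
    for record in training_set:
        counts[record[attribute]][record["class"]] += 1
    # Fallback for attribute values never seen during training.
    default = Counter(r["class"] for r in training_set).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda record: rules.get(record[attribute], default)

# Hypothetical records: one categorical attribute, one continuous attribute, one class label.
training_set = [
    {"status": "single", "income": 70, "class": "yes"},
    {"status": "single", "income": 85, "class": "yes"},
    {"status": "single", "income": 125, "class": "no"},
    {"status": "married", "income": 100, "class": "no"},
    {"status": "married", "income": 60, "class": "no"},
]
test_set = [
    {"status": "single", "income": 90, "class": "yes"},
    {"status": "married", "income": 80, "class": "no"},
    {"status": "divorced", "income": 95, "class": "yes"},
]

model = learn_classifier(training_set, "status")
# Fraction of test records whose predicted class matches the true class.
accuracy = sum(model(r) == r["class"] for r in test_set) / len(test_set)
print(accuracy)
```

The sketch ignores the continuous attribute entirely; a practical classifier would use all attributes, and accuracy is always measured on the test set rather than on the records the model was learned from.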

9 Classifying Galaxies
Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, characteristics of light waves received, etc.
Data Size: 72 million stars, 20 million galaxies
Object Catalog: 9 GB
Image Database: 150 GB
Courtesy: http://aps.umn.edu

10 What is Clustering?
• Given a set of objects, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Objects in one cluster are more similar to one another.
  – Objects in separate clusters are less similar to one another.
• Similarity Measures:
  – Euclidean distance if attributes are continuous.
  – Other problem-specific measures.
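To make the definition concrete, here is a small Python sketch that groups points using Euclidean distance. The slide does not name a clustering algorithm; k-means is used here only as a familiar stand-in, and the points and starting centroids are made up for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

With these inputs the first three points end up in one cluster and the last three in the other, which matches the stated goal: small distances within a cluster, large distances between clusters.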

11 Clustering of S&P 500 Stock Data
• Observe stock movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity measure: Two points are more similar if the events described by them frequently happen together on the same day.
  – We used association rules to quantify a similarity measure.
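The slide does not say how the association rules were turned into a similarity value, so the Python sketch below swaps in a simpler stand-in: the fraction of days on which two events occur together, out of the days on which either occurs (Jaccard similarity). The ticker names and daily observations are hypothetical.

```python
def cooccurrence_similarity(days, event_a, event_b):
    """Jaccard similarity of two Stock-{UP/DOWN} events over a list of daily event sets."""
    both = sum(1 for events in days if event_a in events and event_b in events)
    either = sum(1 for events in days if event_a in events or event_b in events)
    return both / either if either else 0.0

# Hypothetical observations: one set of Stock-{UP/DOWN} events per trading day.
days = [
    {"AAA-UP", "BBB-UP"},
    {"AAA-UP", "BBB-UP", "CCC-DOWN"},
    {"AAA-UP", "CCC-DOWN"},
    {"BBB-UP"},
]

print(cooccurrence_similarity(days, "AAA-UP", "BBB-UP"))
print(cooccurrence_similarity(days, "AAA-UP", "CCC-DOWN"))
```

Any clustering method that accepts a pairwise similarity could then group events whose scores are high.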

12 Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
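The transaction table behind these rules is not part of the transcript. The Python sketch below uses an assumed five-record table and the two standard rule measures, support and confidence, which the slide itself does not name, to show how a rule such as {Milk} --> {Coke} can be scored.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return count(transactions, set(antecedent) | set(consequent)) / base

# Assumed market-basket records (not taken from the transcript).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

print(confidence(transactions, {"Milk"}, {"Coke"}))
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))
```

Rule discovery then amounts to searching for rules whose support and confidence exceed chosen thresholds; the search itself is not shown here.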

13 Sequential Pattern Discovery: Definition
• Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: pattern (A B) (C) (D E) annotated with the timing constraints <= ms, <= xg, > ng, <= ws]
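As a rough illustration of a pattern governed by timing constraints, the Python sketch below checks whether one object's timeline contains a pattern such as (A B) (C) (D E). Reading `xg` and `ng` in the figure as a maximum and a minimum gap between consecutive pattern elements is an assumption; the `ms` and `ws` constraints are not modeled, and the function name and timelines are hypothetical.

```python
def contains_pattern(timeline, pattern, max_gap=None, min_gap=0):
    """Return True if the pattern occurs in order within the timeline.

    timeline: list of (time, set_of_events) pairs sorted by time.
    pattern:  list of event sets; each set must be contained in the events of one time point.
    The gap between consecutive matched time points must satisfy min_gap < gap <= max_gap.
    """
    def search(start, k, prev_time):
        if k == len(pattern):
            return True
        for i in range(start, len(timeline)):
            time, events = timeline[i]
            if prev_time is not None:
                gap = time - prev_time
                if gap <= min_gap:
                    continue
                if max_gap is not None and gap > max_gap:
                    break  # timeline is sorted, so later entries are even further away
            # Try this time point for element k; backtrack if the rest cannot be matched.
            if pattern[k] <= events and search(i + 1, k + 1, time):
                return True
        return False

    return search(0, 0, None)

pattern = [{"A", "B"}, {"C"}, {"D", "E"}]
close_timeline = [(1, {"A", "B"}), (2, {"C"}), (3, {"D", "E", "F"})]
spread_timeline = [(1, {"A", "B"}), (5, {"C"}), (6, {"D", "E"})]

print(contains_pattern(close_timeline, pattern, max_gap=2))   # True
print(contains_pattern(spread_timeline, pattern, max_gap=2))  # False
```

Counting how many objects' timelines contain a pattern is one way to decide whether the pattern is frequent enough to form a rule.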

