EECS 647: Introduction to Database Systems


EECS 647: Introduction to Database Systems Instructor: Luke Huan Spring 2007

Administrative
- The final project is due May 9th. I will not accept late hand-ins for the final project.
- Start early. You need to take quick action if you haven’t divided the tasks yet.
- For those in the implementation phase (the phase you should be in by now), think about testing plans.
- Class review is April 30th. Provide your feedback.
- If you like the course, recommend it to other students.
4/10/2019 Luke Huan Univ. of Kansas

Review
- What is data mining?
  - Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
- What is classification?
  - Predicting the value of unseen data.
- What is clustering?
  - Grouping similar objects together.

Today’s Topic
- Continue the introduction to data mining
- Sequential patterns
- Regression
- Knowing the nature of your data
- Discover association in your data

Sequential Pattern Discovery: Definition
- Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the sequence (A B) (C) (D E), annotated with the timing constraints <= ms, <= xg, > ng, and <= ws]

Sequential Pattern Discovery: Examples
- In telecommunications alarm logs:
  - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences:
  - Computer bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  - Athletic apparel store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Regression
- Predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in statistics and in the neural network field.
- Examples:
  - Predicting sales of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
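
To make the linear case concrete, here is a minimal sketch of ordinary least squares with a single predictor. The function name `fit_line` and the advertising/sales numbers are made up for illustration; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales of a new product.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen spend of 5.0
```

A nonlinear dependency would need a different model; the same fit-then-predict pattern still applies.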

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior.
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
- Typical network traffic at the university level may reach over 100 million connections per day.
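
One simple way to flag a deviation from normal behavior is a z-score threshold. The slide does not name a method, so this is only an illustrative sketch; the function name `z_score_outliers`, the threshold of 2.0, and the traffic counts are assumptions.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical connection counts per minute; the last value is the anomaly.
connections_per_minute = [10, 11, 9, 10, 12, 10, 11, 95]
flagged = z_score_outliers(connections_per_minute)
```

A single extreme value inflates the standard deviation, so on real traffic a more robust statistic or a learned model would likely be preferable.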

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

Knowing the Nature of Your Data
- Data types: nominal, ordinal, interval, ratio
- Data quality
- Data preprocessing

What is Data?
- A collection of data objects and their attributes.
- An attribute is a property or characteristic of an object.
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object.
  - An object is also known as a record, point, case, sample, entity, or instance.
[Figure: data table labeled with Attributes and Objects]

Attribute Values
- Attribute values are numbers or symbols assigned to an attribute.
- Distinction between attributes and attribute values:
  - The same attribute can be mapped to different attribute values.
    - Example: height can be measured in feet or meters.
  - Different attributes can be mapped to the same set of values.
    - Example: attribute values for ID and age are integers.
    - But the properties of the attribute values can be different: ID has no limit, but age has a maximum and minimum value.

Types of Attributes
There are different types of attributes:
- Nominal
  - Examples: ID numbers, eye color, zip codes
- Ordinal
  - Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval
  - Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio
  - Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /
Each attribute type possesses the following properties:
- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties

Properties of Attribute Values
- Nominal
  - Description: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  - Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
- Ordinal
  - Description: the values of an ordinal attribute provide enough information to order objects. (<, >)
  - Examples: hardness of minerals, {good, better, best}, grades, street numbers
- Interval
  - Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  - Examples: calendar dates, temperature in Celsius or Fahrenheit
- Ratio
  - Description: for ratio variables, both differences and ratios are meaningful. (*, /)
  - Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current

Discrete and Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values.
  - Examples: zip codes, counts, or the set of words in a collection of documents.
  - Often represented as integer variables.
  - Note: binary attributes are a special case of discrete attributes.
- Continuous attribute
  - Has real numbers as attribute values.
  - Examples: temperature, height, or weight.
  - In practice, real values can only be measured and represented using a finite number of digits.
  - Continuous attributes are typically represented as floating-point variables.

Structured vs Unstructured Data
- Structured data
  - Data in a relational database
- Semi-structured data
  - Graphs, trees, sequences
- Unstructured data
  - Images, text

Important Characteristics
- Data dimensionality
  - Curse of dimensionality
- Sparsity
  - Only presence counts
- Resolution
  - Patterns depend on the scale

Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
- Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Document Data
- Each document becomes a "term" vector.
  - Each term is a component (attribute) of the vector.
  - The value of each component is the number of times the corresponding term occurs in the document.
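
The term-vector representation can be sketched in a few lines. The whitespace tokenization, the function name `term_vectors`, and the three toy documents are assumptions for illustration only.

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # One vector component (attribute) per distinct term, in sorted order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical documents.
docs = ["team play team win", "play game", "win game game"]
vocabulary, vectors = term_vectors(docs)
# vocabulary is ['game', 'play', 'team', 'win'];
# the first document maps to [0, 1, 2, 1].
```

Stacking the vectors row by row gives a data matrix with one row per document and one column per term.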

Transaction Data
- A special type of record data, where each record (transaction) involves a set of items.
- For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Data Quality
- What kinds of data quality problems are there?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
  - Noise and outliers
  - Missing and duplicated data

Noise
- Noise refers to modification of original values.
  - Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen.
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

Mapping Data to a New Space
- Fourier transform
- Wavelet transform
[Figure: three panels, "Two Sine Waves", "Two Sine Waves + Noise", and "Frequency"]

Outliers
- Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
- One person’s outlier can be another’s treasure!

Missing Values
- Reasons for missing values:
  - Information is not collected (e.g., people decline to give their age and weight).
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
- Handling missing values:
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
  - Replace with all possible values (weighted by their probabilities)
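
Two of the handling strategies above, eliminating data objects and estimating missing values, can be sketched as follows. Representing a missing value as `None`, estimating with the attribute mean, and the sample records are all assumptions made for this illustration.

```python
def drop_incomplete(records):
    """Eliminate data objects that have any missing (None) attribute value."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, attribute):
    """Estimate missing values of one attribute with the mean of the known values."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    mean = sum(known) / len(known)
    return [
        {**r, attribute: mean if r[attribute] is None else r[attribute]}
        for r in records
    ]

# Hypothetical records; the second person declined to give an age.
people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 82.0},
    {"age": 50, "weight": 64.0},
]

complete_only = drop_incomplete(people)  # keeps 2 of the 3 records
imputed = impute_mean(people, "age")     # the missing age becomes 40.0
```

Eliminating objects is simple but discards the known weight of the second person; estimating keeps the record at the cost of introducing a guessed value.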

Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another.
  - This is a major issue when merging data from heterogeneous sources.
- Example: the same person with multiple email addresses.
- Data cleaning: the process of dealing with duplicate data issues.

EDA: Exploratory Data Analysis
- Histogram
- Box plot
- Scatter plot
- Correlation

Visualization Techniques: Histograms
- A histogram usually shows the distribution of values of a single variable.
- Divide the values into bins and show a bar plot of the number of objects in each bin.
- The height of each bar indicates the number of objects.
- The shape of the histogram depends on the number of bins.
- Example: petal width (10 and 20 bins, respectively)
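
The binning step can be sketched as below, assuming equal-width bins between the minimum and maximum value. The function name `histogram` and the sample values are hypothetical and are not the petal-width data from the slide.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
two_bins = histogram(values, 2)   # [8, 2]
four_bins = histogram(values, 4)  # [3, 5, 1, 1]
```

The two results describe the same data, yet the four-bin version reveals a peak that the two-bin version hides, which is the dependence on the number of bins noted above.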

Two-Dimensional Histograms
- Show the joint distribution of the values of two attributes.
- Example: petal width and petal length.
  - What does this tell us?

Visualization Techniques: Box Plots
- Invented by J. Tukey.
- Another way of displaying the distribution of data.
- The following figure shows the basic parts of a box plot.
[Figure: box plot with labels "outlier", "10th percentile", "25th percentile", "50th percentile", and "75th percentile"]
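
The percentiles that a box plot displays can be computed directly. This sketch uses linear interpolation between neighboring sorted values, which is one of several common percentile conventions; the function name `percentile` and the sample data are assumptions.

```python
def percentile(sorted_values, p):
    """Percentile p (0 to 100) of pre-sorted data, by linear interpolation."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

# Hypothetical data, already sorted.
data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
summary = {p: percentile(data, p) for p in (10, 25, 50, 75)}
# summary is {10: 2.0, 25: 3.5, 50: 6.0, 75: 8.5}
```

The box spans the 25th to 75th percentiles with a line at the 50th; points far outside the whiskers are drawn individually as outliers.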

Example of Box Plots
- Box plots can be used to compare attributes.

Scatter Plot Array of Iris Attributes

Correlation
- Correlation measures the linear relationship between objects.
- To compute correlation, we standardize data objects, p and q, and then take their dot product.
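
A minimal sketch of the standardize-then-dot-product recipe follows. The slide's formulas did not survive extraction, so two details here are assumptions: the sample standard deviation (dividing by n - 1) is used, and the dot product is divided by n - 1 so the result falls between -1 and 1.

```python
import math

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

def correlation(p, q):
    """Pearson correlation: scaled dot product of the standardized objects."""
    p_std, q_std = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(p_std, q_std)) / (len(p) - 1)

# Hypothetical objects with three attributes each.
perfectly_positive = correlation([1, 2, 3], [2, 4, 6])  # close to 1.0
perfectly_negative = correlation([1, 2, 3], [3, 2, 1])  # close to -1.0
```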

Visually Evaluating Correlation
- Scatter plots showing the similarity from –1 to 1.

Discover Association Rules
- Apriori algorithm

Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
[Table: market-basket transactions]
- Examples of association rules:
  - {Diaper} → {Beer}
  - {Milk, Bread} → {Eggs, Coke}
  - {Beer, Bread} → {Milk}
- Implication means co-occurrence, not causality!

Definition: Frequent Itemset
- Itemset
  - A collection of one or more items. Example: {Milk, Bread, Diaper}
  - k-itemset: an itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g., σ({Milk, Bread, Diaper}) = 2
- Support
  - Fraction of transactions that contain an itemset
  - E.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
- Association rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X
- Example: for {Milk, Diaper} → {Beer}, s = 0.4 and c = 0.67

Mining Association Rules
- Examples of rules:
  - {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  - {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  - {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  - {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  - {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  - {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- Observations:
  - All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
  - Rules originating from the same itemset have identical support but can have different confidence.
  - Thus, we may decouple the support and confidence requirements.

An Exercise
Transaction database TDB:
  Transaction-id   Items bought
  100              f, a, c, d, g, i, m, p
  200              a, b, c, f, l, m, o
  300              b, f, h, j, o
  400              b, c, k, s, p
  500              a, f, c, e, l, p, m, n
- The support of pattern {acm} is Sup(acm) = 3.
- The support of pattern {ac} is Sup(ac) = 3.
- Given min_sup = 3, acm is frequent.
- The confidence of the rule {ac} => {m} is 100%.
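
The numbers in this exercise can be checked with a short script. The transactions are copied from the TDB above (the item printed as "I" in transaction 100 is assumed to be a lowercase "i"); the function names `support_count` and `confidence` are illustrative.

```python
# Transaction database TDB from the exercise; each item is a single letter.
tdb = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def confidence(lhs, rhs, transactions):
    """Confidence of lhs => rhs: Sup(lhs and rhs together) / Sup(lhs)."""
    both = set(lhs) | set(rhs)
    return support_count(both, transactions) / support_count(lhs, transactions)

sup_acm = support_count("acm", tdb)  # 3 (transactions 100, 200, 500)
sup_ac = support_count("ac", tdb)    # 3
conf = confidence("ac", "m", tdb)    # 1.0, i.e. 100%
```

Because Sup(acm) equals Sup(ac), every transaction containing a and c also contains m, which is why the confidence is 100%.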

Mining Association Rules
- Two-step approach:
  - Frequent itemset generation: generate all itemsets whose support ≥ minsup.
  - Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
- Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation
- Given d items, there are 2^d possible candidate itemsets.
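
The exponential growth is easy to see by enumerating every subset of a small item collection. The five item names below are hypothetical; the count of 2^d includes the empty itemset.

```python
from itertools import combinations

def all_itemsets(items):
    """Enumerate every subset of `items`, from the empty set up to the full set."""
    items = sorted(items)
    return [
        set(combo)
        for k in range(len(items) + 1)
        for combo in combinations(items, k)
    ]

# Hypothetical items, d = 5.
candidates = all_itemsets({"a", "b", "c", "d", "e"})
num_candidates = len(candidates)  # 2**5 = 32
```

Each additional item doubles the count, which is why brute-force counting of every candidate quickly becomes impractical.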

Apriori Algorithm
- A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)
- Example with Min_sup = 2:
  Database D
    TID   Items
    10    a, c, d
    20    b, c, e
    30    a, b, c, e
    40    b, e
  Scan D: 1-candidates
    Itemset   Sup
    a         2
    b         3
    c         3
    d         1
    e         3
  Freq 1-itemsets
    Itemset   Sup
    a         2
    b         3
    c         3
    e         3
  2-candidates
    ab, ac, ae, bc, be, ce
  Scan D: counting
    Itemset   Sup
    ab        1
    ac        2
    ae        1
    bc        2
    be        3
    ce        2
  Freq 2-itemsets
    Itemset   Sup
    ac        2
    bc        2
    be        3
    ce        2
  3-candidates
    bce
  Scan D: Freq 3-itemsets
    Itemset   Sup
    bce       2
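
The level-wise procedure can be sketched compactly and run on the four-transaction database D from this slide with Min_sup = 2. This is a simplified illustration, not the original implementation from Agrawal & Srikant; in particular, the join step below unions any pair of frequent itemsets instead of using a prefix-based join.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once to count each candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Database D from the slide (TIDs 10, 20, 30, 40).
database = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(database, min_sup=2)
# The only frequent 3-itemset is {b, c, e}, with support count 2.
```

The prune step is what removes abc and ace before the third scan: each has an infrequent 2-subset (ab and ae respectively), so only bce is counted.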

Summary
- Nature of the data
  - Data types:
    - SSN: nominal
    - Grade: ordinal
    - Temperature (degree): interval
    - Length: ratio
  - Data quality:
    - Noise
    - Outliers
    - Missing/duplicated data

Summary
- Common tools for exploratory data analysis
  - Histogram
  - Box plot
  - Scatter plot
  - Correlation
- Association
  - Each rule L => R has two parts: L, the left-hand itemset, and R, the right-hand itemset.
  - Each rule is measured by two parameters:
    - Support
    - Confidence