Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 405G: Introduction to Database Systems. 9/29/20162 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting.

Similar presentations


Presentation on theme: "CS 405G: Introduction to Database Systems. 9/29/20162 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting."— Presentation transcript:

1 CS 405G: Introduction to Database Systems

2 9/29/20162 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data What is classification? Predict the value of unseen data What is clustering Grouping similar objects into groups

3 9/29/20163 Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data

4 9/29/20164 Knowing the Nature of Your Data Data types: nominal, ordinal, interval, ratio. Data quality Data preprocessing

5 9/29/20165 What is Data? Collection of data objects and their attributes An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. Attribute is also known as variable, field, characteristic, or feature A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance Attributes Objects

6 9/29/20166 Attribute Values Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers But properties of attribute values can be different  ID has no limit but age has a maximum and minimum value

7 9/29/20167 Types of Attributes There are different types of attributes Nominal Examples: ID numbers, eye color, zip codes Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. Ratio Examples: temperature in Kelvin, length, time, counts

8 9/29/20168 Properties of Attribute Values The type of an attribute depends on which of the following properties it possesses: Distinctness: =  Order: Addition: + - Multiplication: * / Nominal attribute: distinctness Ordinal attribute: distinctness & order Interval attribute: distinctness, order & addition Ratio attribute: all 4 properties

9 9/29/20169 Attribute TypeDescriptionExamples NominalThe values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=,  ) zip codes, employee ID numbers, eye color, sex: {male, female} OrdinalThe values of an ordinal attribute provide enough information to order objects. ( ) hardness of minerals, {good, better, best}, grades, street numbers IntervalFor interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit RatioFor ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current Properties of Attribute Values

10 9/29/201610 Discrete and Continuous Attributes Discrete Attribute Has only a finite or countablely infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents Often represented as integer variables. Note: binary attributes are a special case of discrete attributes Continuous Attribute Has real numbers as attribute values Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

11 9/29/201611 Structured vs Unstructured Data Structured Data Data in a relational database Semi-structured data Graphs, trees, sequencs Un-structured data Image, text

12 9/29/201612 Important Characteristics Data Dimensionality Curse of Dimensionality Sparsity Only presence counts Resolution Patterns depend on the scale

13 9/29/201613 Record Data Data that consists of a collection of records, each of which consists of a fixed set of attributes

14 9/29/201614 Data Matrix If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

15 9/29/201615 Document Data Each document becomes a `term' vector, each term is a component (attribute) of the vector, the value of each component is the number of times the corresponding term occurs in the document.

16 9/29/201616 Transaction Data A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

17 9/29/201617 Data Quality What kinds of data quality problems? How can we detect problems with the data? What can we do about these problems? Examples of data quality problems: Noise and outliers missing and duplicated data

18 9/29/201618 Noise Noise refers to modification of original values Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen Two Sine WavesTwo Sine Waves + Noise

19 9/29/201619 Mapping Data to a New Space Two Sine Waves Two Sine Waves + NoiseFrequency Fourier transform Wavelet transform

20 9/29/201620 Outliers Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set One person’s outlier can be another one’s treasure!!

21 9/29/201621 Missing Values Reasons for missing values Information is not collected (e.g., people decline to give their age and weight) Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Handling missing values Eliminate Data Objects Estimate Missing Values Ignore the Missing Value During Analysis Replace with all possible values (weighted by their probabilities)

22 9/29/201622 Duplicate Data Data set may include data objects that are duplicates, or almost duplicates of one another Major issue when merging data from heterogeous sources Examples: Same person with multiple email addresses Data cleaning Process of dealing with duplicate data issues

23 9/29/201623 EDA: Exploratory Data Analysis Histogram Box plot Scatter plot Correlation

24 9/29/201624 Visualization Techniques: Histograms Histogram Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin. The height of each bar indicates the number of objects Shape of histogram depends on the number of bins Example: Petal Width (10 and 20 bins, respectively)

25 9/29/201625 Two-Dimensional Histograms Show the joint distribution of the values of two attributes Example: petal width and petal length What does this tell us?

26 9/29/201626 Visualization Techniques: Box Plots Box Plots Invented by J. Tukey Another way of displaying the distribution of data Following figure shows the basic part of a box plot outlier 10 th percentile 25 th percentile 75 th percentile 50 th percentile 10 th percentile

27 9/29/201627 Example of Box Plots Box plots can be used to compare attributes

28 9/29/201628 Scatter Plot Array of Iris Attributes

29 9/29/201629 Correlation Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, p and q, and then take their dot product

30 9/29/201630 Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1.

31 9/29/201631 Discover Association Rules Apriori Algorithm

32 9/29/201632 Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper}  {Beer}, {Milk, Bread}  {Eggs,Coke}, {Beer, Bread}  {Milk}, Implication means co-occurrence, not causality!

33 9/29/201633 Definition: Frequent Itemset Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count (  ) Frequency of occurrence of an itemset E.g.  ({Milk, Bread,Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold

34 9/29/201634 Definition: Association Rule Example: Association Rule An implication expression of the form X  Y, where X and Y are itemsets Example: {Milk, Diaper}  {Beer} Rule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions that contain X

35 9/29/201635 Mining Association Rules Example of Rules: {Milk,Diaper}  {Beer} (s=0.4, c=0.67) {Milk,Beer}  {Diaper} (s=0.4, c=1.0) {Diaper,Beer}  {Milk} (s=0.4, c=0.67) {Beer}  {Milk,Diaper} (s=0.4, c=0.67) {Diaper}  {Milk,Beer} (s=0.4, c=0.5) {Milk}  {Diaper,Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements

36 9/29/201636 An Exercise The support value of pattern {acm} is Sup(acm)=3 The support of pattern {ac} is Sup(ac)=3 Given min_sup=3, acm is Frequent The confidence of the rule: {ac} => {m} is 100% Transaction database TDB Transaction-idItems bought 100f, a, c, d, g, I, m, p 200a, b, c, f, l,m, o 300b, f, h, j, o 400b, c, k, s, p 500a, f, c, e, l, p, m, n

37 9/29/201637 Mining Association Rules Two-step approach: 1. Frequent Itemset Generation – Generate all itemsets whose support  minsup 2. Rule Generation – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive

38 9/29/201638 Frequent Itemset Generation Given d items, there are 2 d possible candidate itemsets

39 9/29/201639 Apriori Algorithm A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994) TIDItems 10a, c, d 20b, c, e 30a, b, c, e 40b, e Min_sup=2 ItemsetSup a2 b3 c3 d1 e3 Data base D 1-candidates Scan D ItemsetSup a2 b3 c3 e3 Freq 1-itemsets Itemset ab ac ae bc be ce 2-candidates ItemsetSup ab1 ac2 ae1 bc2 be3 ce2 Counting Scan D ItemsetSup ac2 bc2 be3 ce2 Freq 2-itemsets Itemset bce 3-candidates ItemsetSup bce2 Freq 3-itemsets Scan D

40 9/29/201640 Summary Nature of the data Data types: SSNNominal Grade Ordinal Temperature (degree) Interval Length Ratio Data Quality Noise Outlier Missing/duplicated data

41 9/29/201641 Summary Common tools for exploratory data analysis Histogram Box plot Scatter plot Correlation Association Each rule: L => R has two parts: L, the left hand item set and R the right hand item set Each rule is measured by two parameters: Support Confidence


Download ppt "CS 405G: Introduction to Database Systems. 9/29/20162 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting."

Similar presentations


Ads by Google