1 Data Mining CSCI 307, Spring 2019 Lecture 21
Exam Topics; Instance-Based Learning

2 Test 1
Chapter 1: Overview material and introduction to the styles of learning, i.e. classification (and numeric prediction), association, and clustering.
Chapter 2: Terminology, e.g. instance/tuple/example, attribute, outcome; why measuring success is hard; missing data.
Chapter 3: Knowledge representation: tables, linear models, trees, rules, instance-based, clusters.
Chapter 4 up through section 4.5: by far the heaviest coverage is from Chapter 4.

3 Chapter 4.1-4.5
What are the methods covered?
ZeroR and OneR (section 4.1)
Statistical modeling (section 4.2): e.g. Naïve Bayes.
Divide-and-conquer trees (section 4.3): e.g. ID3 (a simplified version of J48). Recall that the split chosen is the one producing the purest nodes, so splitting can stop sooner. Uses the notion of measuring information gain; the gain ratio is also described (a small worked sketch follows this slide).
Covering algorithms (section 4.4): e.g. PRISM. Each iteration covers one class by creating rules that "cover" some of its instances. Look at the instances, create possible tests, then evaluate the resulting rule by the ratio p/t, where p is the number of positive examples covered and t is the total number of instances for which the antecedent is true; keep the test with the maximum ratio and continue refining to create the rule.
Mining association rules (section 4.5): find association rules with high coverage/support (the number of instances the rule predicts correctly) and high accuracy/confidence (the proportion of the instances to which the rule applies that it predicts correctly). Use item sets to create potential rules, then select rules based on coverage and accuracy.
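A small worked sketch of the split-evaluation quantities mentioned for the tree learner: entropy-based information gain and the gain ratio. The helper names and the use of the book's weather data ('outlook' split, 9 yes / 5 no) are my own illustration, not code from the slides.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution, e.g. [9, 5] for 9 yes / 5 no."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """Gain = entropy(parent) - weighted average entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in child_counts_list)
    return entropy(parent_counts) - remainder

def gain_ratio(parent_counts, child_counts_list):
    """Gain divided by the 'intrinsic information' of the split itself."""
    split_info = entropy([sum(child) for child in child_counts_list])
    return info_gain(parent_counts, child_counts_list) / split_info

# Assumed example: the weather data's 'outlook' split, 9 yes / 5 no overall,
# with branches sunny = [2, 3], overcast = [4, 0], rainy = [3, 2].
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247 bits
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.157
```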

4 Types of Questions
Short-answer questions, more fact-based.
Longer descriptive questions: might ask you to summarize an algorithm, or give pros and cons for a particular data type, etc.
"Work it out" questions.
Note: bring a calculator.

5 Instance-Based Learning
The distance function defines what is learned.
Most instance-based schemes use Euclidean distance between two instances a(1) and a(2) with k attributes (a sketch follows below).
Taking the square root is not required when comparing distances.
Other popular metric: the city-block (Manhattan) metric, which adds the absolute values of the differences without squaring them.
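A minimal sketch of the two distance functions named above, for two instances with k numeric attributes; the function names are mine and purely illustrative.

```python
import math

def euclidean(a1, a2):
    # Square root of the sum of squared attribute differences; the root can
    # be skipped when only comparing distances, since it preserves the order.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def city_block(a1, a2):
    # City-block (Manhattan) metric: sum of absolute differences, no squaring.
    return sum(abs(x - y) for x, y in zip(a1, a2))

print(euclidean((7, 4), (3, 8)), city_block((7, 4), (3, 8)))  # ~5.66 and 8
```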

6 Normalization and Other Issues
Different attributes are measured on different scales, so they need to be normalized: a_i = (v_i - min v_i) / (max v_i - min v_i), where v_i is the actual value of attribute i.
Nominal attributes: distance is either 0 or 1.
Common policy for missing values: assume they are maximally distant (given normalized attributes).
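A minimal sketch of the preprocessing just described: min-max normalization to [0, 1], 0/1 distance for nominal attributes, and missing values treated as maximally distant. The helper names are mine, and treating any missing value as distance 1 is a simplified reading of the policy.

```python
def normalize(value, lo, hi):
    """Scale a numeric attribute value to [0, 1] using its min and max."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def attribute_distance(x, y, nominal=False):
    if x is None or y is None:
        return 1.0                      # missing value: assume maximal distance
    if nominal:
        return 0.0 if x == y else 1.0   # nominal: distance is either 0 or 1
    return abs(x - y)                   # numeric (already normalized)

# e.g. a numeric attribute ranging from 64 to 83, comparing the values 70 and 83:
print(abs(normalize(70, 64, 83) - normalize(83, 64, 83)))   # ~0.684
```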

7 Finding Nearest Neighbors Efficiently
Simplest way of finding the nearest neighbor: a linear scan of the data, but maybe not the most computationally efficient (a sketch follows below).
Classification then takes time proportional to the product of the number of instances in the training and test sets.
Nearest-neighbor search can be done more efficiently using appropriate data structures.
Two methods that represent the training data in a tree structure: kD-trees and ball trees.
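A minimal sketch of the linear-scan approach: compare the test instance against every training instance and return the class of the closest one. The toy data and names are illustrative, not from the slides.

```python
import math

def euclidean(a1, a2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def classify_1nn(training, test_instance):
    """training: list of (attribute_tuple, class_label) pairs. O(n) per query."""
    _, label = min(training, key=lambda ex: euclidean(ex[0], test_instance))
    return label

training = [((2, 2), "no"), ((3, 8), "yes"), ((6, 7), "yes"), ((7, 4), "no")]
print(classify_1nn(training, (5, 6)))   # nearest instance is (6, 7) -> "yes"
```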

8 kD-trees
The idea: split the point set, alternating by x-coordinate and by y-coordinate.
Split by x-coordinate: split on a vertical line that has half the points to its left (or on it) and half to its right.
Split by y-coordinate: split on a horizontal line that has half the points below (or on it) and half above.
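A minimal sketch, under my own assumptions about representation, of building a 2-D kD-tree exactly this way: alternate the split axis with depth and split at the median point.

```python
def build_kdtree(points, depth=0):
    """points: list of (x, y) tuples; returns a nested dict or None."""
    if not points:
        return None
    axis = depth % 2                          # 0: split by x, 1: split by y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                    # the median point becomes the split
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),       # points before the median
        "right": build_kdtree(points[mid + 1:], depth + 1),  # points after the median
    }

# e.g. the four instances used on the example slide below
tree = build_kdtree([(7, 4), (6, 7), (3, 8), (2, 2)])
```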

9 kD-tree Example (4 instances)
[Figure: a kD-tree built from the four instances (7,4), (6,7), (3,8), (2,2), with splits on attributes a1 (left/right branches) and a2 (below/above branches).]

10 Using kD-trees
The target (not an instance in the tree) is marked by a star.
[Figure: a kD-tree over the instances (2,4), (4,1), (6,7), (1,2), (8,2), (3,8), (7,5); (2,4) at the root splits horizontally, and (4,1) and (6,7) split vertically.]
The black node is a "good first approximation" to the target's nearest neighbor.
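A sketch of the search this slide illustrates, on the same seven instances: descend to the leaf whose region contains the target (the "good first approximation"), then backtrack, crossing a split only if the splitting line is closer than the best point found so far. The node class and the target coordinates (7, 4) are my own assumptions; the slide does not give the star's position.

```python
import math

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

class Node:
    def __init__(self, point, axis=None, low=None, high=None):
        self.point = point                 # the instance stored at this node
        self.axis = axis                   # 0: split on a1, 1: split on a2; None: leaf
        self.low, self.high = low, high    # children on either side of the split

# The tree from the slide: (2,4) splits horizontally, (4,1) and (6,7) vertically.
tree = Node((2, 4), axis=1,
            low=Node((4, 1), axis=0, low=Node((1, 2)), high=Node((8, 2))),
            high=Node((6, 7), axis=0, low=Node((3, 8)), high=Node((7, 5))))

def nearest(node, target, best=None):
    if node is None:
        return best
    if best is None or dist(target, node.point) < dist(target, best):
        best = node.point                  # this node's own point may be closer
    if node.axis is None:
        return best                        # leaf: nothing more below
    # Descend first into the side of the split that contains the target.
    near, far = ((node.low, node.high)
                 if target[node.axis] <= node.point[node.axis]
                 else (node.high, node.low))
    best = nearest(near, target, best)
    # Backtrack: cross the split only if the splitting line is closer than
    # the best distance found so far.
    if abs(target[node.axis] - node.point[node.axis]) < dist(target, best):
        best = nearest(far, target, best)
    return best

print(nearest(tree, (7, 4)))   # hypothetical target; prints (7, 5)
```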

11 More on kD-trees
Complexity depends on the depth of the tree, which is given by the logarithm of the number of nodes.
The amount of backtracking required depends on the quality (balance) of the tree ("square" vs. "skinny" nodes).
How to build a good tree? Need to find a good split point and split direction (a sketch follows below).
Split direction: the direction with the greatest variance.
Split point: the median value along that direction.
Using the value closest to the mean (rather than the median) can be better if the data is skewed.
Apply this recursively.
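A small sketch of the split-selection heuristic just listed: pick the direction with the greatest variance and the median value along it. The helper name is mine.

```python
def choose_split(points):
    """points: list of equal-length numeric tuples; returns (axis, split value)."""
    k = len(points[0])
    def variance(axis):
        vals = [p[axis] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)
    axis = max(range(k), key=variance)        # direction with the greatest spread
    vals = sorted(p[axis] for p in points)
    return axis, vals[len(vals) // 2]         # median value along that direction

print(choose_split([(7, 4), (6, 7), (3, 8), (2, 2)]))   # (1, 7): split on a2 at 7
```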

12 Building Trees Incrementally
Big advantage of instance-based learning: the classifier can be updated incrementally. Just add the new training instance!
Can we do the same with kD-trees? Heuristic strategy (a sketch follows below):
Find the leaf node containing the new instance.
Place the instance in that leaf if the leaf is empty.
Otherwise, split the leaf along its longest dimension (to preserve squareness).
The tree should be rebuilt occasionally (e.g. if its depth grows to twice the optimum depth).
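A minimal sketch of this heuristic, under my own assumptions: leaves hold at most one instance, internal nodes store a split axis and value, and the initial attribute ranges are made up. Insertion descends to the leaf whose region contains the new instance; an empty leaf simply stores it, and an occupied leaf is split along the longest dimension of its region.

```python
class Leaf:
    def __init__(self, point=None):
        self.point = point                      # None means the leaf is empty

class Split:
    def __init__(self, axis, value, below, above):
        self.axis, self.value = axis, value
        self.below, self.above = below, above

def insert(node, point, region):
    """region is a list of (lo, hi) bounds describing this node's cell."""
    if isinstance(node, Split):
        lo, hi = region[node.axis]
        if point[node.axis] <= node.value:      # descend, shrinking the cell
            region[node.axis] = (lo, node.value)
            node.below = insert(node.below, point, region)
        else:
            region[node.axis] = (node.value, hi)
            node.above = insert(node.above, point, region)
        return node
    if node.point is None:                      # empty leaf: just store the instance
        node.point = point
        return node
    if tuple(node.point) == tuple(point):
        return node                             # duplicate instance; nothing to add
    # Occupied leaf: split its region along the longest dimension to keep cells
    # roughly square, falling back to the axis where the two points differ most.
    axis = max(range(len(region)), key=lambda a: region[a][1] - region[a][0])
    if node.point[axis] == point[axis]:
        axis = max(range(len(point)), key=lambda a: abs(node.point[a] - point[a]))
    value = (node.point[axis] + point[axis]) / 2.0
    below, above = Leaf(), Leaf()
    for p in (node.point, point):
        (below if p[axis] <= value else above).point = p
    return Split(axis, value, below, above)

# Usage: start from a single empty leaf covering assumed attribute ranges.
root = Leaf()
for p in [(7, 4), (6, 7), (3, 8), (2, 2)]:
    root = insert(root, p, [(0.0, 10.0), (0.0, 10.0)])
# In practice the tree would be rebuilt from scratch once its depth grows to
# about twice the optimum (roughly log2 of the number of instances).
```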

