1 Converting Categories to Numbers for Approximate Nearest Neighbor Search  嘉義大學資工系 (Department of Computer Science and Information Engineering, National Chiayi University)  郭煌政 (Huang-Cheng Kuo)  2004/10/20

2 Outline: Introduction, Motivation, Measurement, Algorithms, Experiments, Conclusion

3 Introduction  Memory-Based Reasoning – also called Case-Based Reasoning or Instance-Based Learning. Given a training dataset and a new object, predict the class (target value) of the new object. The focus is on tabular data.

4 Introduction  K Nearest Neighbor Search – compute the similarity between the new object and each object in the training dataset; this takes linear time in the size of the dataset. Similarity measure: Euclidean distance. Multi-dimensional index – a spatial data structure, such as the R-tree – handles numeric data.
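A minimal sketch of the linear-scan K nearest neighbor search described above, assuming numeric feature vectors and Euclidean distance (the function name and data layout are illustrative, not from the talk):

```python
import math

def knn_search(training, query, k):
    """Brute-force KNN: scan every training object, compute its Euclidean
    distance to the query, and keep the k closest; linear in the dataset size.
    `training` is a list of (feature_vector, label) pairs."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(training, key=lambda item: euclidean(item[0], query))[:k]
```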

5 Introduction  Indexing on categorical data? – Impose a linear order on the categories – Is there an existing correct ordering? – What is the best ordering?  Store the mapped data in a multi-dimensional data structure as a filtering mechanism.

6 Measurement for Ordering  Ordering Problem: Given an undirected weighted complete graph, a simple path is an ordering of the vertices, and the edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, of maximal value according to a given scoring function.

7 Measurement for Ordering  Relationship Scoring: Reasonable Ordering Score. In an ordering path, a 3-tuple (v_{i-1}, v_i, v_{i+1}) is reasonable if and only if dist(v_{i-1}, v_{i+1}) ≥ dist(v_{i-1}, v_i) and dist(v_{i-1}, v_{i+1}) ≥ dist(v_i, v_{i+1}).
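A sketch of the reasonable ordering score, assuming it is the fraction of consecutive 3-tuples on the path that satisfy the condition above; the "no ordering => 1/3" baseline on slide 27 is consistent with this reading, but the normalization is an assumption:

```python
def reasonable_ordering_score(path, dist):
    """Fraction of consecutive 3-tuples (v[i-1], v[i], v[i+1]) on the path
    whose outer distance is at least as large as both inner distances.
    `dist` is a symmetric lookup: dist[a][b] = distance between a and b."""
    triples = [(path[i - 1], path[i], path[i + 1]) for i in range(1, len(path) - 1)]
    reasonable = sum(
        1 for a, b, c in triples
        if dist[a][c] >= dist[a][b] and dist[a][c] >= dist[b][c]
    )
    return reasonable / len(triples) if triples else 1.0
```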

8 Measurement for Mapping  Pairwise Difference Scoring – normalized distance matrix – mapping values of categories – Dist_m(v_i, v_j) = |mapping(v_i) − mapping(v_j)|
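A sketch of the pairwise-difference scoring as a root mean squared error (the measurement named in the conclusion), assuming the distance matrix is normalized by its largest entry; that normalization is an assumption, not stated on the slide:

```python
import itertools
import math

def pairwise_difference_rmse(categories, dist, mapping):
    """RMSE between normalized category distances and the absolute
    differences of the categories' mapped values."""
    pairs = list(itertools.combinations(categories, 2))
    max_d = max(dist[a][b] for a, b in pairs) or 1.0   # normalize distances to [0, 1]
    errors = [
        (dist[a][b] / max_d - abs(mapping[a] - mapping[b])) ** 2
        for a, b in pairs
    ]
    return math.sqrt(sum(errors) / len(errors))
```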

9 Algorithms  Prim-like Ordering  Kruskal-like Ordering  Divisive Ordering  GA-based Ordering  In all of them, a vertex is a category and the graph represents the distance matrix.

10 Prim-like Ordering Algorithm  Prim's Minimum Spanning Tree –Initially, choose a least-weight edge (u, v) –Add the edge to the tree; S = {u, v} –Choose a least-weight edge connecting a vertex in S and a vertex, w, not in S –Add the edge to the tree; add w to S –Repeat until all vertices are in S

11 Prim-like Ordering Algorithm  Prim-like Ordering –Choose a least-weight edge (u, v) –Add the edge to the ordering path; S = {u, v} –Choose a least-weight edge connecting a vertex in S and a vertex, w, not in S –If the edge cannot extend the path (it would attach to an interior vertex or create a cycle), discard it and choose again –Else, add the edge to the ordering path; add w to S –Repeat until all vertices are in S
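A sketch of the Prim-like ordering under the reading above: the discard-and-retry rule amounts to always taking the cheapest edge that joins one of the two current path ends to an unused vertex. The O(n^2) scans are for clarity only; the names are illustrative:

```python
def prim_like_ordering(categories, dist):
    """Grow a single path: start from the cheapest edge, then repeatedly
    attach the unused vertex that is cheapest to reach from either path end."""
    u, v = min(
        ((a, b) for a in categories for b in categories if a != b),
        key=lambda e: dist[e[0]][e[1]],
    )
    path = [u, v]
    remaining = set(categories) - {u, v}
    while remaining:
        # Cheapest edge from either end of the current path to an unused vertex.
        end, w = min(
            ((e, x) for e in (path[0], path[-1]) for x in remaining),
            key=lambda p: dist[p[0]][p[1]],
        )
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        remaining.remove(w)
    return path
```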

12 Kruskal-like Ordering Algorithm  Kruskal's Minimum Spanning Tree –Initially, choose a least-weight edge (u, v) –Add the edge to the tree; S = {u, v} –Choose a least-weight edge as long as it does not create a cycle in the tree –Add the edge to the tree; add the two vertices to S –Repeat until all vertices are in S

13 Kruskal-like Ordering Algorithm  Kruskal-like Ordering –Initially, choose a least-weight edge (u, v) and add it to the ordering path; S = {u, v} –Choose a least-weight edge as long as it does not create a cycle and the degree of each of its endpoints on the path stays ≤ 2 –Add the edge to the ordering path; add the two vertices to S –Repeat until all vertices are in S  A heap can be used to speed up selection of the least-weight edge
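A sketch of the Kruskal-like ordering: edges are taken in increasing weight, an edge is rejected if it would give a vertex degree 3 or close a cycle, and the surviving edges form one simple path that is then read off from an endpoint. Sorting replaces the heap mentioned above; names are illustrative:

```python
import itertools

class DisjointSet:
    def __init__(self, items):
        self.parent = {x: x for x in items}
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal_like_ordering(categories, dist):
    """Greedily build a Hamiltonian path out of cheap edges."""
    edges = sorted(itertools.combinations(categories, 2),
                   key=lambda e: dist[e[0]][e[1]])
    degree = {c: 0 for c in categories}
    adj = {c: [] for c in categories}
    ds = DisjointSet(categories)
    added = 0
    for a, b in edges:
        if degree[a] >= 2 or degree[b] >= 2:
            continue                      # would give a vertex degree 3
        if not ds.union(a, b):
            continue                      # would close a cycle
        adj[a].append(b); adj[b].append(a)
        degree[a] += 1; degree[b] += 1
        added += 1
        if added == len(categories) - 1:
            break
    # Walk the finished path starting from a degree-1 endpoint.
    start = next(c for c in categories if degree[c] == 1)
    path, prev = [start], None
    while len(path) < len(categories):
        nxt = next(x for x in adj[path[-1]] if x != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```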

14 Divisive Ordering Algorithm  Idea: –Pick a central vertex, and split the remaining vertices –Build a binary tree: the vertices are the leaves  Central Vertex: …

15 Divisive Ordering Algorithm  A_R is closer to P than A_L is; B_L is closer to P than B_R is. (Figure labels: P, A, A_L, A_R, B, B_L, B_R.)

16 Clustering  Splitting a Set of Vertices into Two Groups –Each group has at least one vertex –Close (similar) vertices go in the same group; distant vertices go in different groups  Clustering algorithms are used to form the two clusters

17 Clustering  Clustering – grouping a set of objects into classes of similar objects  Agglomerative Hierarchical Clustering Algorithm – start with singleton clusters – repeatedly merge the most similar clusters

18 Clustering  Clustering Algorithm: Cluster Similarity
–Single link: dist(C_i, C_j) = min dist(p, q), p in C_i, q in C_j
–Complete link: dist(C_i, C_j) = max dist(p, q), p in C_i, q in C_j
–Average link (adopted in our study): dist(C_i, C_j) = avg dist(p, q), p in C_i, q in C_j
–Others
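A small sketch of the average-link cluster distance used when splitting a vertex set into two clusters (names are illustrative):

```python
def average_link(cluster_i, cluster_j, dist):
    """Average-link distance: mean pairwise distance between two clusters."""
    pairs = [(p, q) for p in cluster_i for q in cluster_j]
    return sum(dist[p][q] for p, q in pairs) / len(pairs)
```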

19 Clustering  Clustering Implementation Issues –Which pair of clusters to merge: keep a cluster-to-cluster similarity for each pair –Recursively partition the sets of vertices while building the binary tree: a non-recursive version uses an explicit stack

20 GA Ordering Algorithm  Genetic Algorithm for optimization problems  Chromosome: a solution  Population: a pool of solutions  Genetic Operations –Crossover –Mutation

21 GA Ordering Algorithm  Encoding a Solution – a binary string, or an ordered list of categories (used in our ordering problem)  Fitness Function – the reasonable ordering score  Selecting Chromosomes for Crossover – higher fitness value => higher selection probability

22 GA Ordering Algorithm  Crossover – single point, multiple points, or mask  Example: single-point crossover of AB | CDE and BD | AEC results in ABAEC and BDCDE => illegal (categories duplicated or missing); see the repair sketch after the next slide

23 GA Ordering Algorithm  Repair Illegal Chromosome ABAEC –AB*EC => fill D into the * position  Repair Illegal Chromosome ABABC –AB**C –D and E are missing –Whichever is closer to B is filled into the first * position
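A sketch of the repair step described above: later duplicates are blanked out, then each blank is filled with the missing category closest to the gene immediately before it. Tie-breaking, and using the preceding gene as the reference point for every blank, are assumptions:

```python
def repair_chromosome(chromosome, categories, dist):
    """Repair an illegal ordering chromosome produced by crossover."""
    seen, repaired = set(), []
    for gene in chromosome:
        if gene in seen:
            repaired.append(None)          # duplicate -> '*' placeholder
        else:
            repaired.append(gene)
            seen.add(gene)
    missing = [c for c in categories if c not in seen]
    for i, gene in enumerate(repaired):
        if gene is None:
            prev = next(g for g in reversed(repaired[:i]) if g is not None)
            best = min(missing, key=lambda c: dist[prev][c])   # closest missing category
            repaired[i] = best
            missing.remove(best)
    return repaired
```

For example, repair_chromosome(list("ABABC"), list("ABCDE"), dist) blanks the repeated A and B and fills the first blank with whichever of D and E is closer to B, matching the slide's description.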

24 Mapping Function  Ordering Path  Mapping(v_i) = …
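The transcript does not preserve the mapping formula. A plausible sketch, assuming each category is mapped to its cumulative distance along the ordering path, normalized to [0, 1]; both the cumulative-distance choice and the normalization are assumptions for illustration only:

```python
def mapping_from_path(path, dist):
    """Hypothetical mapping: cumulative distance along the ordering path,
    scaled so the first category maps to 0 and the last to 1."""
    values = [0.0]
    for a, b in zip(path, path[1:]):
        values.append(values[-1] + dist[a][b])
    total = values[-1] or 1.0
    return {v: x / total for v, x in zip(path, values)}
```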

25 Experiments  Synthetic Data (width/length = 5)

26 Experiments  Synthetic Data (width/length = 10)

27 Experiments  Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm –width/length = 5 => 0.82 –width/length = 10 => 0.9 –No ordering => 1/3  The divisive algorithm is better than the Prim-like algorithm when the number of categories is > 100

28 Experiments  Synthetic Data (width/length = 5)

29 Experiments  Synthetic Data (width/length = 10)

30 Experiments  Divisive ordering is the best among the three ordering algorithms  For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5 and around 0.05 when width/length = 10.  Prim-like ordering algorithm: 0.12 and 0.1, respectively.

31 Experiments  “Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive  33 nominal attributes, 7 continuous attributes  5000 records sampled for the training dataset  2000 records sampled for the approximate KNN search experiment

32 Experiments  Distance Matrix: distance between two categories  V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS - Clustering Categorical Data Using Summaries,” ACM KDD, 1999  D = {d_1, d_2, …, d_n} is a set of n tuples  D ⊆ D_1 × D_2 × … × D_k, where D_i is a categorical domain, for 1 ≤ i ≤ k  d_i = …

33 Experiments

34 Experiments  Approximate KNN – nominal attributes

35 Experiments  Approximate KNN – nominal attributes

36 Experiments  Approximate KNN – nominal attributes

37 Experiments  Approximate KNN – all attributes

38 Experiments  Approximate KNN – all attributes

39 Experiments  Approximate KNN – all attributes

40 Conclusion  Developed Ordering Algorithms –Prim-like –Kruskal-like –Divisive –GA-based  Devised Measurements –Reasonable ordering score –Root mean squared error

41 Conclusion  What next? –New categories, new mapping function –A new index structure? –Training a mapping function for a given ordering path

42 Thank you.