Learning Dissimilarities for Categorical Symbols
Jierui Xie, Boleslaw Szymanski, Mohammed J. Zaki
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
{xiej2, szymansk,

Presentation Outline
– Introduction
– Related Work
– Learning Dissimilarity (LD) Algorithm
– Experimental Results
– Conclusion

Introduction
– Distance plays an important role in many data mining tasks.
– Distance is rarely defined precisely for categorical data (nominal and ordinal), e.g., the rating of a movie: {very bad, bad, fair, good, very good}.
– Goal: derive dissimilarities between categorical symbols
  – to enable the full power of distance-based methods
  – hopefully easier to interpret as well

Notation
– A dataset $X = \{x_1, x_2, \dots, x_t\}$ of $t$ data points.
– Each point $x_i$ has $m$ attribute values: $x_i = (x_i^1, \dots, x_i^m)$.
– Each attribute $A^i$ takes one of $n_i$ discrete values $\{a^i_1, \dots, a^i_{n_i}\}$; each $a^i_j$ is also called a symbol.
– The similarity between symbols $a^i_k$ and $a^i_l$ is denoted $S(a^i_k, a^i_l)$; the dissimilarity is $d(a^i_k, a^i_l) = 1 - S(a^i_k, a^i_l)$.
– The distance between two data points $x_i$ and $x_j$ is defined in terms of the distances between their symbols: $D(x_i, x_j) = \sum_{p=1}^{m} d(x_i^p, x_j^p)$.
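To make the notation concrete, a minimal sketch (the attribute names, symbols, and dissimilarity values below are illustrative assumptions, not from the paper): per-attribute lookup tables hold the symbol dissimilarities, and the point distance is their sum over attributes.

```python
# Per-attribute lookup tables of symbol dissimilarities d(a_k, a_l);
# the values and attributes below are made up for illustration.
d = [
    {("good", "fair"): 0.3, ("fair", "good"): 0.3},  # attribute A^1: movie rating
    {("red", "blue"): 0.9, ("blue", "red"): 0.9},    # attribute A^2: color
]

def point_distance(xi, xj):
    """D(x_i, x_j) = sum over attributes p of d(x_i^p, x_j^p)."""
    return sum(0.0 if a == b else d[p][(a, b)]
               for p, (a, b) in enumerate(zip(xi, xj)))

print(point_distance(("good", "red"), ("fair", "blue")))  # 0.3 + 0.9 = 1.2
```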

Notation (cont.)
– Let the frequency of symbol $a^i_j$ in the dataset be $f(a^i_j)$; then its probability is $p(a^i_j) = f(a^i_j)/t$.
– Class label of point $x_i$: $y_i$.
– Output of the classifier on point $x_i$: $\hat{y}_i$.
– The error of misclassifying point $x_i$: $E(x_i)$.
– Total classification error: $E = \sum_{i=1}^{t} E(x_i)$.

Related Work
– Unsupervised methods: assign similarities based on frequency; emphasize matches or mismatches on frequent or rare symbols from probabilistic or information-theoretic points of view.
  – Lin, Burnaby, Smirnov, Goodall, Gambaryan, Eskin, Occurrence Frequency (OF), Inverse Occurrence Frequency (IOF)
– Supervised methods: take the class information into account.
  – Value Difference Metric (VDM), Cheng et al.

Unsupervised Method Examples
– Goodall: on a match, less frequent attribute values contribute more to the overall similarity than frequent ones:
  $S(a_k, a_l) = 1 - p^2(a_k)$ if $a_k = a_l$; otherwise 0.
– Inverse Occurrence Frequency (IOF): assigns higher weight to mismatches on less frequent symbols:
  $S(a_k, a_l) = \frac{1}{1 + \log f(a_k)\,\log f(a_l)}$ if $a_k \neq a_l$; otherwise 1.
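Both measures are simple enough to code directly. Here is a minimal sketch following the formulas as reconstructed above; the helper names and the toy column are my own, not from the slides.

```python
from collections import Counter
from math import log

def goodall_similarity(a, b, counts, t):
    """Goodall-style similarity: a match on a rare symbol scores higher.
    counts maps symbol -> frequency; t is the number of data points."""
    if a != b:
        return 0.0
    p = counts[a] / t          # empirical probability of the symbol
    return 1.0 - p ** 2        # rare symbol: p small, similarity near 1

def iof_similarity(a, b, counts):
    """IOF: mismatches on less frequent symbols receive higher similarity."""
    if a == b:
        return 1.0
    return 1.0 / (1.0 + log(counts[a]) * log(counts[b]))

column = ["good", "good", "good", "fair", "fair", "very bad"]  # toy attribute column
counts = Counter(column)
print(goodall_similarity("very bad", "very bad", counts, len(column)))  # ~0.972
print(iof_similarity("good", "fair", counts))  # mismatch on two frequent symbols
```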

Supervised Method Examples
– VDM: symbols are similar if they occur with similar relative frequencies across all classes:
  $d(a_k, a_l) = \sum_{c} \left| \frac{C_{a_k,c}}{C_{a_k}} - \frac{C_{a_l,c}}{C_{a_l}} \right|^h$
  where $C_{a_k,c}$ is the number of times symbol $a_k$ occurs in class $c$, $C_{a_k}$ is the total number of times $a_k$ occurs in the whole dataset, and $h$ is a constant.
– Cheng et al.: based on an RBF classifier; evaluates all pairwise distances between symbols and optimizes the error function by gradient descent.
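As a quick illustration of the VDM formula, a self-contained sketch (the toy data and function names are mine, not the authors'):

```python
from collections import Counter

def vdm_distance(a, b, class_counts, totals, classes, h=2):
    """Value Difference Metric between two symbols of one attribute.
    class_counts[(symbol, c)]: occurrences of the symbol within class c;
    totals[symbol]: occurrences of the symbol overall; h: the VDM exponent."""
    return sum(
        abs(class_counts[(a, c)] / totals[a] - class_counts[(b, c)] / totals[b]) ** h
        for c in classes
    )

# toy data: (attribute value, class label)
data = [("red", 0), ("red", 0), ("red", 1), ("blue", 1), ("blue", 1), ("green", 0)]
class_counts = Counter(data)           # (symbol, class) -> count
totals = Counter(s for s, _ in data)   # symbol -> count
classes = {c for _, c in data}
print(vdm_distance("red", "blue", class_counts, totals, classes))  # |2/3-0|^2 + |1/3-1|^2 = 8/9
```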

Learning Dissimilarity Algorithm
– Motivation: learning a mapping function from each categorical attribute $A^i$ onto a real-valued interval, based on the class information, is possible and may facilitate the classification task.

Learning Dissimilarity Algorithm (cont.)
– Based on the nearest-neighbor classifier and the difference between the distances to the nearest neighbors of the two classes
– Iterative learning
– Guided by gradient descent to minimize the total classification error

Learning Dissimilarity Algorithm (cont.)
– Objective function and update equation (shown as equations on the slide)

Learning Dissimilarity Algorithm (cont.)
– The derivative of Δd and the full update equation (shown on the slide)

Learning Dissimilarity Algorithm (cont.)
– Intuitive meaning of the assignment update (illustrated on the slide); a rough sketch of the whole procedure follows below.
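The objective function, derivative, and update equation on these slides were shown as images and cannot be recovered from the transcript. As a rough sketch of the approach the slides describe (per-symbol real values, nearest-neighbor distance differences, a smoothed error, gradient-descent updates), here is one plausible rendition; the sigmoid error, learning rate, beta, and random initialization are my assumptions, not the authors' exact formulation.

```python
import numpy as np

def learn_symbol_values(X, y, epochs=50, lr=0.1, beta=5.0, seed=0):
    """LD-style sketch: map each symbol of each attribute to a real value so
    that a sigmoid-smoothed 1-NN classification error decreases.
    X: list of tuples of symbols; y: list of class labels."""
    rng = np.random.default_rng(seed)
    m = len(X[0])
    # one real value per (attribute, symbol), randomly initialized in [0, 1)
    v = [{s: rng.random() for s in {row[p] for row in X}} for p in range(m)]

    def dist(i, j):  # D(x_i, x_j) = sum_p |v_p(x_i^p) - v_p(x_j^p)|
        return sum(abs(v[p][X[i][p]] - v[p][X[j][p]]) for p in range(m))

    for _ in range(epochs):
        for i in range(len(X)):
            same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
            diff = [j for j in range(len(X)) if y[j] != y[i]]
            if not same or not diff:
                continue
            js = min(same, key=lambda j: dist(i, j))  # nearest same-class neighbor
            jd = min(diff, key=lambda j: dist(i, j))  # nearest other-class neighbor
            delta = dist(i, js) - dist(i, jd)          # > 0 means 1-NN misclassifies x_i
            err = 1.0 / (1.0 + np.exp(-beta * delta))  # smoothed 0/1 error
            g = beta * err * (1.0 - err)               # d(err)/d(delta)
            for p in range(m):  # chain rule through the |v_a - v_b| terms
                a, bs, bd = X[i][p], X[js][p], X[jd][p]
                v[p][a]  -= lr * g * (np.sign(v[p][a] - v[p][bs]) - np.sign(v[p][a] - v[p][bd]))
                v[p][bs] += lr * g * np.sign(v[p][a] - v[p][bs])
                v[p][bd] -= lr * g * np.sign(v[p][a] - v[p][bd])
    return v
```

After training, the learned values can be plugged into a point-distance function like the one sketched earlier to drive a 1-NN classifier, which is the setting in which the experiments below compare LD against the other dissimilarity measures.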

Experimental Results
– Datasets (summarized in a table on the slide)

Experimental Results (cont.)
– Redundancy among symbols

Experimental Results (cont.)
– Comparison with various data-driven methods: on average, LD and VDM achieve the best accuracy, indicating that supervised dissimilarities attain better results than unsupervised ones. Among the unsupervised measures, IOF and Lin are slightly superior to the others.

Experimental Results (cont.)
Analysis with confidence intervals (accuracy +/- standard deviation):
– LD performed statistically worse than Lin on the Splice and Tic-tac-toe datasets but better than Lin on Connect-4, Hayes, and Balance Scale.
– LD performed statistically worse than VDM on only one dataset (Splice) but better on two datasets (Connect-4 and Tic-tac-toe).
– Finally, LD performed statistically at least as well as the remaining methods, and better on some datasets, e.g., Connect-4.

Experimental Results (cont.)
– Comparison with various classifiers: LD performed statistically worse than the other methods on only one dataset (Splice) but performed better than each of the other methods on at least three other datasets.

Conclusion
– A task-oriented (supervised) iterative learning approach that learns a distance function for categorical data.
– Explores the relationships between categorical symbols by using the classification error as guidance.
– The real-valued mappings found by the algorithm provide discriminative information, which can be used to refine features and improve classification accuracy.

Thank you!