CLUSTERING FOR TAXONOMY EVOLUTION
By
- Anindya Das
- Sneha Bankar
PROBLEM STATEMENT
Problem
- For lack of a correct category, products are often placed in the wrong category.
- This could be an indication of taxonomy evolution.
Solution
- Cluster products based on their product descriptions.
TAXONOMY EVOLUTION
Camera & Photo
  Lenses
  Flashes
  Digital Cameras
    Compact System Camera / Digital SLR Cameras
    Point & Shoot Cameras / Digital SLR Cameras
TAXONOMY EVOLUTION
Camera & Photo
  Lenses
  Flashes
  Digital Cameras
    Compact System Camera
    Digital SLR Camera
    Point & Shoot Cameras
FEATURE EXTRACTION
- Use product descriptions as features
- Brand removal
- Stemming
- Use of unigrams and bigrams
- Feature weighting based on term frequency (TF)
- Feature weighting based on TF-IDF
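The feature extraction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `brands` set of lowercase single-word brand names, and the toy suffix stripper (standing in for a real stemmer such as Porter's) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def crude_stem(token):
    # Toy suffix stripper (assumption); a real stemmer would replace this.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def extract_terms(description, brands):
    # Lowercase, tokenize, drop brand names, then stem what remains.
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    tokens = [crude_stem(t) for t in tokens if t not in brands]
    # Bigrams are pairs of adjacent tokens left after brand removal.
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tf_vectors(descriptions, brands):
    # Term-frequency weighting: raw count of each unigram and bigram.
    return [Counter(extract_terms(d, brands)) for d in descriptions]


def tfidf_vectors(descriptions, brands):
    # TF-IDF weighting: count * log(N / document frequency).
    tfs = tf_vectors(descriptions, brands)
    n = len(tfs)
    df = Counter(term for tf in tfs for term in tf)
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()} for tf in tfs]
```

With this plain IDF formula, a term that appears in every description gets weight zero, which is one reason TF-IDF can separate categories better than raw term frequency.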
HIERARCHICAL AGGLOMERATIVE CLUSTERING
- Initially, each item is considered a cluster.
- The closest pair of clusters is chosen.
- Those two clusters are merged.
- Each iteration reduces the number of clusters by one.
- Merging continues until a terminating condition is satisfied:
  - Number of clusters
  - Inter-cluster distance
- UPGMA is used for measuring the distance between clusters.
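The merge loop described above can be sketched as a naive pure-Python routine. It assumes sparse feature vectors stored as dicts and uses cosine distance between items, which is an assumption since the slides do not say which distance was used. UPGMA is implemented as the average of all pairwise item distances between two clusters. The names `hac`, `num_clusters`, and `max_distance` are hypothetical.

```python
import math


def cosine_distance(a, b):
    # Cosine distance between two sparse vectors (dict: term -> weight).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def upgma_distance(c1, c2, vectors):
    # UPGMA: unweighted average of all item-to-item distances.
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(vectors, num_clusters=1, max_distance=float("inf")):
    # Start with every item in its own cluster (lists of item indices).
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > max(num_clusters, 1):
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = upgma_distance(clusters[x], clusters[y], vectors)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        # Second terminating condition: closest pair is too far apart.
        if d > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

This version recomputes every inter-cluster distance on each iteration, which is easy to read but slow; a practical implementation would cache and update the distance matrix.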
DISTANCE MEASURES
K-MEANS
- Select K initial centroids.
- Assign each data point (ASIN feature vector) to its nearest centroid, based on distance.
- Update each centroid to the mean of its assigned points.
- Re-assign points and update centroids until no data point changes its cluster.
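A minimal sketch of these K-Means steps is shown below. It assumes dense feature vectors (lists of floats), squared Euclidean distance, and random selection of the initial centroids from the data points. None of these choices is stated in the slides, and the `seed` and `max_iter` parameters are hypothetical additions for reproducibility and safety.

```python
import random


def squared_distance(p, q):
    # Squared Euclidean distance between two dense vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    # Component-wise mean of a non-empty list of vectors.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, seed=0, max_iter=100):
    # Select K initial centroids by sampling distinct data points.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids
```

A cluster that loses all of its members keeps its previous centroid here; other strategies, such as re-seeding the empty cluster, are equally valid.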
EXECUTION PIPELINE
1. Data Preprocessor
2. Feature Extraction Engine
3. Clustering Engine
4. Cluster Evaluation Engine
CLUSTER EVALUATION
- How many items in a cluster mention the cluster's most frequent features?
- Precision = true positives / (true positives + false positives)
- Recall = true positives / (true positives + false negatives)
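One possible reading of this evaluation is sketched below. The slides do not define how true positives, false positives, and false negatives are counted, so the definitions here are assumptions: an item "matches" a cluster if it contains at least one of the cluster's top `n` most frequent features; true positives are matching items inside the cluster, false positives are non-matching items inside it, and false negatives are matching items placed elsewhere. The function names are hypothetical.

```python
from collections import Counter


def top_features(cluster_items, n=3):
    # Count in how many items of the cluster each feature appears.
    counts = Counter(f for item in cluster_items for f in set(item))
    # Sort by descending count, then alphabetically, for stable ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {f for f, _ in ranked[:n]}


def evaluate_cluster(cluster_ids, items, n=3):
    # items: list of feature lists, one per product.
    # cluster_ids: indices of the products assigned to this cluster.
    cluster_ids = set(cluster_ids)
    top = top_features([items[i] for i in cluster_ids], n)
    # Items anywhere in the dataset that mention a top feature (assumption).
    matches = {i for i, feats in enumerate(items) if top & set(feats)}
    tp = len(matches & cluster_ids)
    fp = len(cluster_ids - matches)
    fn = len(matches - cluster_ids)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, with five products where the cluster holds items 0 to 3 and only the top feature is used:

```python
items = [["slr", "camera"], ["slr", "lens"], ["slr", "camera"], ["tripod"], ["slr", "bag"]]
precision, recall = evaluate_cluster([0, 1, 2, 3], items, n=1)
# precision == 0.75 and recall == 0.75
```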
RESULTS
Precision values

            HAC    K-Means
Dataset 1   95%    92%
Dataset 2   92%    96%
Dataset 3   93%    90%

Recall values for all cases lie between 20% and 30%.
FUTURE WORK
- Mine topics from product descriptions and use them as features
- Develop an approach to detect outliers and merge them to form a new category
- Use association rule mining for evaluation instead of the most frequent words