Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing
Chong Sun, Narasimhan Rampalli, Frank Yang, AnHai Doan

Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Mining customer ratings for product recommendation using the support vector machine and the latent class model William K. Cheung, James T. Kwok, Martin.
Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
CrowdER - Crowdsourcing Entity Resolution
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
Data Mining Classification: Alternative Techniques
Support Vector Machines and Margins
Dual-domain Hierarchical Classification of Phonetic Time Series Hossein Hamooni, Abdullah Mueen University of New Mexico Department of Computer Science.
Learning on Probabilistic Labels Peng Peng, Raymond Chi-wing Wong, Philip S. Yu CSE, HKUST 1.
Face detection Many slides adapted from P. Viola.
Data Mining Sangeeta Devadiga CS 157B, Spring 2007.
Classification and Decision Boundaries
Lesson learnt from the UCSD datamining contest Richard Sia 2008/10/10.
Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar.
Learning syntactic patterns for automatic hypernym discovery Rion Snow, Daniel Jurafsky and Andrew Y. Ng Prepared by Ang Sun
Part I: Classification and Bayesian Learning
Robust Bayesian Classifier Presented by Chandrasekhar Jakkampudi.
New Challenges in Cloud Datacenter Monitoring and Management
Selective Sampling on Probabilistic Labels Peng Peng, Raymond Chi-Wing Wong CSE, HKUST 1.
A Machine Learning-based Approach for Estimating Available Bandwidth Ling-Jyh Chen 1, Cheng-Fu Chou 2 and Bo-Chun Wang 2 1 Academia Sinica 2 National Taiwan.
SVMLight SVMLight is an implementation of Support Vector Machine (SVM) in C. Download source from :
Active Learning for Class Imbalance Problem
Chaitanya Gokhale University of Wisconsin-Madison Joint work with AnHai Doan, Sanjib Das, Jeffrey Naughton, Ram Rampalli, Jude Shavlik, and Jerry Zhu Corleone:
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
Predicting performance of applications and infrastructures Tania Lorido 27th May 2011.
1 Pengjie Ren, Zhumin Chen and Jun Ma Information Retrieval Lab. Shandong University Presenter: Pengjie Ren, November 18, 2013. Understanding Temporal Intent of User Query.
Department of Computer Science, University of Waikato, New Zealand Geoffrey Holmes, Bernhard Pfahringer and Richard Kirkby Traditional machine learning.
DATA MINING LECTURE 10 Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models Kai-Wei Chang and Dan Roth Experiment Settings Block Minimization.
Biswanath Panda, Mirek Riedewald, Daniel Fink ICDE Conference 2010 The Model-Summary Problem and a Solution for Trees 1.
Classification Techniques: Bayesian Classification
Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
JOB EVALUATION MAGNETIC CONTACTORS.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
CSSE463: Image Recognition Day 11 Lab 4 (shape) tomorrow: feel free to start in advance Lab 4 (shape) tomorrow: feel free to start in advance Test Monday.
A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.
USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Machine Learning Tutorial-2. Recall, Precision, F-measure, Accuracy Ch. 5.
Class Imbalance in Text Classification
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
COMP24111: Machine Learning Ensemble Models Gavin Brown
A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Eick: kNN kNN: A Non-parametric Classification and Prediction Technique Goals of this set of transparencies: 1.Introduce kNN---a popular non-parameric.
Cell Segmentation in Microscopy Imagery Using a Bag of Local Bayesian Classifiers Zhaozheng Yin RI/CMU, Fall 2009.
Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.
BAYESIAN LEARNING. 2 Bayesian Classifiers Bayesian classifiers are statistical classifiers, and are based on Bayes theorem They can calculate the probability.
DOWeR Detecting Outliers in Web Service Requests Master’s Presentation of Christian Blass.
Book web site:
Learning to Detect and Classify Malicious Executables in the Wild by J
A Smart Tool to Predict Salary Trends of H1-B Holders
Chapter 3: Maximum-Likelihood Parameter Estimation
Efficient Image Classification on Vertically Decomposed Data
CSSE463: Image Recognition Day 11
COMP61011 : Machine Learning Ensemble Models
Graphical Techniques.
A Fast and Scalable Nearest Neighbor Based Classification
iSRD Spam Review Detection with Imbalanced Data Distributions
Information Retrieval
NAÏVE BAYES CLASSIFICATION
Presentation transcript:

Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing
Chong Sun, Narasimhan Rampalli, Frank Yang, AnHai Doan
@WalmartLabs & UW-Madison
Presenter: Jun Xie, @WalmartLabs

Problem Definition
Classify tens of millions of products into 5000+ types.
Each product is a record of attribute-value pairs:
- title: Gerber folding knife 0 KN-Knives
- description: most versatile knife in its category ...
- manufacturer, color, etc.
- many products have just the title attribute
Product types: laptop computers, area rugs, laptop bags & cases, dining chairs, etc.

ID       | Title                                                           | SCC            | PT (product type)
EASW1876 | Eastern Weavers Rugs EYEBALLWH-8x10 Shag Eyeball White 8x10 Rug | Shag Rugs      | Area Rugs
EMLCO655 | Royce Leather 643-RED-4 Ladies Laptop Brief - Red               | Notebook Cases | Laptop Bags and Cases
14968347 | International Concepts Stacking Dining Arm Chair (Set of 2)     |                | Dining Chairs
12490924 | South Carolina Gamecocks Rectangle Toothfairy Pillow            |                | Decorative Pillows
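Such an item is essentially an ID plus a bag of attribute-value pairs in which only the title is reliably present. A minimal sketch of this record structure (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    """A product item: an ID plus free-form attribute-value pairs.
    Only `title` is reliably present; everything else is optional."""
    item_id: str
    attributes: dict[str, str] = field(default_factory=dict)

    @property
    def title(self) -> str:
        return self.attributes.get("title", "")

# Example built from the slide's knife item (the ID here is hypothetical):
knife = Product(
    item_id="KN-0001",
    attributes={
        "title": "Gerber folding knife 0 KN-Knives",
        "description": "most versatile knife in its category ...",
    },
)
print(knife.title)
```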

Challenges
- Very large number of product types (5000+): we started out with very little training data, and creating training data for 5000+ types is very difficult.
- Very limited human resources: 1 developer and 1-2 analysts (who can't write code).
- Products often arrive in bursts: e.g., a batch of 300K items arrives and must be classified fast, which makes it hard to provision analysts and outsourcing.
- Need very high precision (>92%): we can tolerate lower recall, but want to increase recall over time.
Current approaches can't handle these scales and challenges.
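Precision carries its standard meaning here; judging from the evaluation slide later in the deck, "recall" is effectively coverage, i.e., the fraction of incoming items that receive a label at all (this reading is our inference, not stated on the slide):

$$\text{precision} = \frac{|\{\text{labeled items whose label is correct}\}|}{|\{\text{labeled items}\}|}, \qquad \text{recall} \approx \frac{|\{\text{labeled items}\}|}{|\{\text{incoming items}\}|}$$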

Manually Classifying the Items
Using analysts:
- An analyst can accurately classify about 100 items per day: they must understand the item, navigate a large space of possible types, and decide on the most appropriate one.
- e.g., Misses' Jacket, Pants and Blouse – 14-16-18-20 Pattern → sewing patterns
- e.g., Gerber folding knife 0 KN-Knives → utility knives? pocket knives? tactical knives? multitools?
- At 100 items per analyst per day, 5 analysts would need 200 days to classify 100K items (5 × 100 × 200 = 100,000).
Using outsourcing:
- very expensive: $770K for 1M items
- outsourcing is not "elastic"
Using crowdsourcing:
- crowd workers can't navigate a complex and large taxonomy of types

Learning-Based Solutions
- Difficult to generate training data: there are too many product types (5000+); to label just 200 items per product type, we must label 1M items.
- Difficult to generate representative samples: random sampling would severely under-sample certain types (see the sketch below); analysts and outsourced workers don't know how to obtain a random sample, e.g., for handbags or computer cables; new product types appear all the time, so the universe of items keeps changing.
- Difficult to handle "corner" cases: items coming from special sources need to be handled specially; it is hard to "go the last mile", e.g., increasing precision from 90% to 95%.
- Concept drift and changing distributions, e.g., "smart phone".
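To see why uniform random sampling fails here, a small simulation under an assumed Zipf-like type distribution (the skew is our assumption; the slide only says certain types get severely under-sampled):

```python
import random
from collections import Counter

random.seed(0)
num_types = 5000
# Assumed Zipf(1) popularity: the type at rank r has weight 1/r.
weights = [1.0 / rank for rank in range(1, num_types + 1)]

# Label a uniform random sample of 100K items.
sample = random.choices(range(num_types), weights=weights, k=100_000)
counts = Counter(sample)

covered = sum(1 for t in range(num_types) if counts[t] >= 200)
print(f"types reaching 200 labeled examples: {covered} / {num_types}")
print(f"types with zero labeled examples:    {num_types - len(counts)}")
```

Under these assumptions only a few dozen head types reach a 200-example target, while tail types get few or no labels at all.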

Rule-Based Solutions
- Analysts and outsourcing workers write rules to classify items.
- Writing rules to cover 5000+ product types is very slow; it doesn't scale.

Our Chimera solution combines the above approaches:
- uses learning and hand-crafted rules
- uses developers, analysts, and crowdsourcing
- continuously improves over time
- keeps precision high while trying to improve recall

Our Chimera Solution
[Architecture diagram. Labels: Items to Classify, Attribute-Based Filter, Classification Rules (Whitelist Rules, Blacklist Rules), learned classifiers (K-NN, Naïve Bayes, Perceptron, Regression, SVM), Voting Master, Gatekeeper Rules, Filter, Result (Classified / Unclassified), Sample Analysis, Crowd Evaluation, Reports, Training Data.]
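A minimal sketch of how such a rules-plus-ensemble combination might be wired up, assuming each component exposes a simple predict function (all names and the voting threshold are illustrative; the paper's actual combination and gatekeeping logic is more involved):

```python
from collections import Counter
from typing import Callable, Optional

# A classifier maps a product title to a product type, or None to abstain.
Classifier = Callable[[str], Optional[str]]

def chimera_combine(
    title: str,
    whitelist_rules: Classifier,
    blacklist_rules: Callable[[str, str], bool],  # (title, label) -> veto?
    learners: list[Classifier],
    min_votes: int = 3,
) -> Optional[str]:
    """Combine rules and learners; return a label or None (unclassified)."""
    # High-precision whitelist rules win outright.
    label = whitelist_rules(title)
    if label is not None:
        return label
    # Otherwise let the learned classifiers (K-NN, Naive Bayes, ...) vote.
    votes = Counter(l for c in learners if (l := c(title)) is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # Gatekeeping: require agreement and no blacklist veto; otherwise leave
    # the item unclassified rather than risk a wrong label.
    if count < min_votes or blacklist_rules(title, label):
        return None
    return label
```

The key design point, visible in the diagram, is that abstaining (the Unclassified output) is always an option: precision stays high because uncertain items are held back for analyst review instead of being forced into a type.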

Example Rules
- rings? → rings
- wedding bands? → rings
- diamond.*trio sets? → rings
- macbook → ! fruit (a blacklist rule)
Classification evaluation using crowdsourcing
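The rules on the slide read as regular expressions over the product title; a minimal sketch of how they might be evaluated (the matching semantics and function names are assumptions, not from the paper):

```python
import re
from typing import Optional

# Whitelist rules: pattern -> product type to assign.
WHITELIST = [
    (re.compile(r"rings?", re.I), "rings"),
    (re.compile(r"wedding bands?", re.I), "rings"),
    (re.compile(r"diamond.*trio sets?", re.I), "rings"),
]
# Blacklist rules: pattern -> product type the item must NOT be given.
BLACKLIST = [
    (re.compile(r"macbook", re.I), "fruit"),
]

def whitelist_label(title: str) -> Optional[str]:
    for pattern, label in WHITELIST:
        if pattern.search(title):
            return label
    return None

def blacklisted(title: str, label: str) -> bool:
    return any(p.search(title) and label == banned for p, banned in BLACKLIST)

print(whitelist_label("14k Diamond Trio Set"))         # -> rings
print(blacklisted("Apple MacBook Pro 13in", "fruit"))  # -> True
```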

Key Novelties of Our Solution
- Use both learning and rules extensively: rules are not "nice to have"; they are critical for high accuracy.
- Use both the crowd and analysts for evaluation and analysis: combining in-house analysts with crowdsourcing is critical at our scale to achieve an accurate, continuously improving, and cost-effective solution.
- Scalable in terms of human resources: taps into crowdsourcing (very elastic) as well as analysts.
- Treat humans and machines as first-class citizens: the solution carefully spells out which techniques are used where, who does what, and how to coordinate among them.

Evaluation
- Chimera has been developed and deployed for 2 years.
- Applied to 2.5M items from marketplace vendors: classified more than 90% of them with 92% precision.
- Applied to 14M items from walmart.com: classified 93% with 93% precision.
- As of March 2014: 852K items of training data covering 3,663 types; 20,459 rules covering 4,930 types.
- Crowdsourcing: evaluating 1,000 items takes 1 hour with 15-25 workers.
- Staffing: 1 developer + 1 dedicated analyst, plus 1 more analyst when needed.

Conclusion & Lessons Learned
- Chimera classifies millions of items into 5000+ types; at this scale, existing approaches do not work well.
- We have developed a highly scalable, accurate solution using learning, rules, crowdsourcing, and analysts.
Lessons learned:
- both learning and rules are critical
- crowdsourcing is critical but must be closely monitored
- crowdsourcing must be coupled with in-house analysts and developers
- outsourcing does not work at a very large scale
- hybrid human-machine systems are here to stay
More details in our paper.