Large-scale Recommendations in a Dynamic Marketplace
Jay Katukuri, Rajyashree Mukherjee, Tolga Konik, Chu-Cheng Hsieh
LSRS 2013

Meet John Doe
John is interested in an item: “iPhone 5 64gb white”. Should we recommend
– “iPhone 5 case”, or
– “iPhone 5s gold”?

Recommendation on an e-marketplace
Recommendation “before” purchase: Similar Item Recommendation (SIR)
– iPhone 5S gold
Recommendation “after” purchase: Related Item Recommendation (RIR)
– iPhone 5 case

SIR Example 1

SIR Example 2

Related Item Recommendation
Recommendations for Xbox 360 4GB on the checkout page

Main Idea
Similar Item Clustering (SIC)
– Titles
– Attributes (price, etc.)
– Images
Recommendation
– SIR: same cluster
– RIR: neighbor clusters

Models
Item clusters
Each cluster is represented by meaningful keywords
– “clarks women shoe pumps classics”
– “authentic handmade amish quilt”
Cluster-Cluster Relations
– “samsung galaxy s4”
– “samsung galaxy s4 screen protector”
– “wolfgang puck electric pressure cooker”
– “kitchenaid food processor”

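The two models above can be pictured as a pair of lookup tables. The following is a minimal sketch, not the deployed system: the class and method names, the item ids, and the pairing of “samsung galaxy s4” with “samsung galaxy s4 screen protector” as related clusters are all assumptions made for illustration.

```python
from typing import Dict, List


class ClusterModel:
    """Toy in-memory store for item clusters and cluster-cluster relations."""

    def __init__(self) -> None:
        self.keywords: Dict[str, str] = {}          # cluster id -> keyword label
        self.members: Dict[str, List[str]] = {}     # cluster id -> item ids
        self.item_cluster: Dict[str, str] = {}      # item id -> cluster id
        self.related: Dict[str, List[str]] = {}     # cluster id -> related cluster ids

    def add_cluster(self, cluster_id: str, keywords: str, item_ids: List[str]) -> None:
        self.keywords[cluster_id] = keywords
        self.members[cluster_id] = list(item_ids)
        for item_id in item_ids:
            self.item_cluster[item_id] = cluster_id

    def relate(self, source_id: str, target_id: str) -> None:
        # Directed relation: buyers of source items may want target items.
        self.related.setdefault(source_id, []).append(target_id)

    def similar_to(self, item_id: str) -> List[str]:
        # SIR: other items in the same cluster as the seed item.
        cluster_id = self.item_cluster[item_id]
        return [i for i in self.members[cluster_id] if i != item_id]

    def related_to(self, item_id: str) -> List[str]:
        # RIR: items from the clusters related to the seed item's cluster.
        cluster_id = self.item_cluster[item_id]
        return [i for c in self.related.get(cluster_id, []) for i in self.members[c]]


# Hypothetical data: item ids are made up for this example.
model = ClusterModel()
model.add_cluster("c1", "samsung galaxy s4", ["i1", "i2", "i3"])
model.add_cluster("c2", "samsung galaxy s4 screen protector", ["i4", "i5"])
model.relate("c1", "c2")
similar_items = model.similar_to("i1")
related_items = model.related_to("i1")
```

Here `similar_to` stays inside the seed item's cluster, while `related_to` follows the cluster-cluster relation outward, which mirrors the SIR and RIR split.
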
System Architecture - Overview
Offline Model Generation
– Inputs: Inventory, Clickstream, Transactions, Conceptual Knowledgebase
– Clusters Model Generation
– Related Clusters Model Generation
The Data Store
– Clusters
– Cluster-Cluster Relations
Real-time Performance System
– Similar Items Recommender (SIR): Lost Item -> ?similarTo(item) -> Similar Items
– Related Items Recommender (RIR): Bought Item -> ?relatedTo(item) -> Related Items

Cluster Generation (offline)

Data on eBay
Item-item co-occurrences in transaction logs
Large Data
– Much bigger data set, in both users and inventory, than other e-commerce sites
Scale
– More than 300M listings
– More than 10M new items every day

Challenges
Global clustering not feasible
Size bias across categories
Performance

Model Generation - Clusters
1. Select a few keywords to represent “big notions”, e.g. iPhone, Handbags, etc.
– How to select?
2. Cluster by K-means
– How to set K?

Model Generation - Clusters
[Diagram: Inventory, Clickstream and the Conceptual Knowledgebase feed Clusters Model Generation, which consists of Query-Recall Generation followed by Cluster Generation. Diagram labels: user queries, query-to-items, items, concepts, categories. New clusters are written to Clusters in the Data Store.]
Problem: Global clustering not feasible
Solution: Partition input data by user queries
– Parallel distributed K-Means in Hadoop MapReduce
– Dedupe and merge overlapping clusters (100X reduction in size over inventory with over 90% coverage)

Base Cluster Generation
Base Cluster ≡ Query
Find merge candidates based on query term overlap
– E.g. “nike airmax tennis shoes” -> “nike airmax”
Score candidates using cosine similarity
– Term weight: TF-IDF in the query space (document = query)
– TF: Query Demand
– IDF: Number of Queries

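The scoring step above can be sketched as follows. The slide does not spell out the exact weighting, so this sketch assumes one plausible reading: each query is a document, TF is the query's demand count, and IDF is log(N / df), where N is the number of queries and df is the number of queries containing the term. The demand figures and function names are hypothetical.

```python
import math
from typing import Dict


def idf_weights(query_demand: Dict[str, int]) -> Dict[str, float]:
    """IDF over the query space: every distinct query counts as one document."""
    n_queries = len(query_demand)
    doc_freq: Dict[str, int] = {}
    for query in query_demand:
        for term in set(query.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_queries / df) for term, df in doc_freq.items()}


def query_vector(query: str, query_demand: Dict[str, int],
                 idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; TF is assumed to be the query's demand."""
    tf = query_demand[query]
    return {term: tf * idf[term] for term in set(query.split())}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical demand counts (how often each query was issued).
demand = {
    "nike airmax tennis shoes": 40,
    "nike airmax": 120,
    "nike running shoes": 80,
    "adidas tennis shoes": 60,
}
idf = idf_weights(demand)
vec = {q: query_vector(q, demand, idf) for q in demand}

merge_score = cosine(vec["nike airmax tennis shoes"], vec["nike airmax"])
other_score = cosine(vec["nike airmax tennis shoes"], vec["nike running shoes"])
disjoint_score = cosine(vec["nike airmax"], vec["adidas tennis shoes"])
```

Under this reading the demand factor scales a whole query vector, so it cancels in a cosine between two single queries; it would only matter if vectors were aggregated across queries before scoring.
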
Step 1: base cluster candidates
Method for choosing the “base clusters” (initial states):
– Minimum frequency
– Supply threshold (enough inventory)
– Min and max token constraint (length of queries)
– Heuristic constraints, e.g. queries that have only numbers are not allowed: “10 5” …
– Merge similar clusters into one

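The selection rules above amount to a simple filter over queries. A minimal sketch follows; the threshold values and all names are hypothetical, since the slides do not give the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRules:
    # All thresholds are made-up placeholders, not values from the talk.
    min_frequency: int = 50   # minimum number of times the query was issued
    min_supply: int = 20      # minimum number of matching items in inventory
    min_tokens: int = 2       # shortest allowed query, in tokens
    max_tokens: int = 6       # longest allowed query, in tokens


def is_base_cluster_candidate(query: str, frequency: int, supply: int,
                              rules: CandidateRules = CandidateRules()) -> bool:
    tokens = query.split()
    if frequency < rules.min_frequency:
        return False
    if supply < rules.min_supply:
        return False
    if not (rules.min_tokens <= len(tokens) <= rules.max_tokens):
        return False
    # Heuristic constraint: reject queries made only of numbers, e.g. "10 5".
    if all(token.isdigit() for token in tokens):
        return False
    return True


accepted = is_base_cluster_candidate("nike airmax", frequency=120, supply=300)
rejected_numeric = is_base_cluster_candidate("10 5", frequency=120, supply=300)
```
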
Candidates merge
4.34M base clusters merged into 1.95M
Example
phrase(hand,made) phrase(king,s) queen quilt
phrase(hand,made) phrase(pink,s) quilt
phrase(hand,made) phrase(prae,owned) queen quilt
phrase(hand,made) queen quilt
phrase(hand,made) phrase(prae,owned) quilt
phrase(hand,made) quilt size twin
phrase(hand,made) quilt silk
phrase(hand,made) quilt twin
phrase(hand,made) phrase(patch,work) quilt
phrase(hand,made) quilt white
phrase(hand,made) phrase(king,size) quilt
phrase(hand,made) phrase(yo,yo,s) quilt
phrase(hand,made) quilt sale
phrase(hand,made) quilt red
phrase(hand,made) quilt

Step 2: K-Means Clustering
[Diagram labels: Transaction Logs, Inventory Logs, Query to Items Data, Base Cluster Generation, Generate Item Features, Scoring Models, K-Means Clustering of Base Clusters, Split Clusters]

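To illustrate the K-Means step, here is a minimal single-machine sketch of Lloyd's algorithm applied to the items of one base cluster. The talk runs K-Means as a parallel Hadoop MapReduce job; this sketch does not reproduce that. The two-dimensional feature vectors, the deterministic "first k points" initialization, and the function name are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical item features for one base cluster (e.g. two numeric signals).
item_features = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1)]
labels, centers = kmeans(item_features, k=2)
```

Running K-Means separately inside each query partition keeps every job small, which is how partitioning sidesteps the global clustering problem named earlier.
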
Clusters on Item Signature
apple ipod touch 4g clear film protector screen
Cluster
clarks women shoe pumps classics

Recommendation (online)

Performance System
Data Store: Clusters, Cluster-Cluster Relations, Inventory, Conceptual Knowledgebase
SIR: Lost Item -> ?similarTo(item) -> Similar Items recommendations
– Components: Cluster Assignment, SIR Query Formation, Item Search (query), Item Selection, SIR Ranking
RIR: Bought Item -> ?relatedTo(item) -> Related Items recommendations
– Components: Cluster Assignment, Cluster-Cluster Relations (clusters -> related clusters), RIR Query Formation, Item Search (queries), Item Selection, RIR Ranking

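One way to read the SIR path is: assign the seed item to a cluster, turn the cluster's keywords into a search query, search the live inventory, then rank the hits. The sketch below follows that assumed ordering. The keyword-overlap assignment, the in-memory search, the price-closeness ranking, and every name and data value are hypothetical stand-ins for components the slides only name.

```python
from typing import Dict, List, Optional

Item = Dict[str, object]


def assign_cluster(title: str, cluster_keywords: Dict[str, str]) -> Optional[str]:
    """Pick the cluster whose keywords are best covered by the item title."""
    tokens = set(title.lower().split())
    best_id, best_score = None, (0.0, 0)
    for cluster_id, keywords in cluster_keywords.items():
        kw_tokens = set(keywords.split())
        overlap = len(tokens & kw_tokens)
        # Prefer full keyword coverage, then the more specific cluster.
        score = (overlap / len(kw_tokens), overlap)
        if score > best_score:
            best_id, best_score = cluster_id, score
    return best_id


def search(query: str, inventory: List[Item]) -> List[Item]:
    """Stand-in for item search: titles containing every query token."""
    wanted = set(query.split())
    return [it for it in inventory if wanted <= set(str(it["title"]).lower().split())]


def recommend_similar(seed: Item, inventory: List[Item],
                      cluster_keywords: Dict[str, str], limit: int = 2) -> List[Item]:
    cluster_id = assign_cluster(str(seed["title"]), cluster_keywords)
    if cluster_id is None:
        return []
    query = cluster_keywords[cluster_id]          # query formation
    hits = [it for it in search(query, inventory) if it["id"] != seed["id"]]
    # Assumed ranking signal: closeness in price to the seed item.
    hits.sort(key=lambda it: abs(float(it["price"]) - float(seed["price"])))
    return hits[:limit]


clusters = {"c1": "samsung galaxy s4", "c2": "samsung galaxy s4 screen protector"}
inventory = [
    {"id": "a", "title": "Samsung Galaxy S4 16GB black", "price": 300.0},
    {"id": "b", "title": "Samsung Galaxy S4 32GB white", "price": 340.0},
    {"id": "c", "title": "Samsung Galaxy S4 screen protector clear", "price": 5.0},
    {"id": "d", "title": "New Samsung Galaxy S4 16GB blue", "price": 310.0},
]
recommended_ids = [it["id"] for it in recommend_similar(inventory[0], inventory, clusters)]
```

Because the query comes from the cluster's keywords and not from the seed title, the search avoids the overly specific queries produced by searching on the seed item's title directly.
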
Items in the same cluster

Similar Item Recommendations

Experimental Results
A/B tests comparing against legacy systems
– SIR legacy system
  Completely online
  Naïve approach of using the seed item title as a search query
– RIR legacy system
  Chen, Y. and J.F. Canny, Recommending ephemeral items at web scale, ACM SIGIR 2011
  Collaborative filtering on stable representations of items
– Significant improvements at the 90% confidence level
  SIR resulted in 38.18% higher user engagement (CTR)
  RIR resulted in 10.5% higher CTR
  Statistically significant improvement in site-wide business metrics from both SIR and RIR

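For readers unfamiliar with how a CTR lift and its significance are typically computed, here is a minimal sketch. The talk does not say which statistical test was used; the two-sided two-proportion z-test below is an assumption, and the click and impression counts are invented and unrelated to the reported results.

```python
import math
from typing import Tuple


def relative_lift(ctr_treatment: float, ctr_control: float) -> float:
    """Relative CTR improvement of treatment over control."""
    return (ctr_treatment - ctr_control) / ctr_control


def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference b - a in CTR."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Invented counts: control (legacy) vs treatment (new recommender).
control_clicks, control_views = 500, 20000
treatment_clicks, treatment_views = 600, 20000

lift = relative_lift(treatment_clicks / treatment_views,
                     control_clicks / control_views)
z_score, p_value = two_proportion_z_test(control_clicks, control_views,
                                         treatment_clicks, treatment_views)
significant_at_90 = p_value < 0.10
```
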
Conclusion
Balance between similarity and quality is crucial in driving user engagement and conversion
Clusters of similar items in the inventory
– Local clustering in the coverage set of user queries
Offline models built using MapReduce
– Huge input datasets including inventory, clickstream and transactional data
Efficient real-time performance system
Currently deployed on ebay.com

Acknowledgments
Current and past team members
– Kranthi Chalasani
– Santanu Kolay
– Riyaaz Shaik
– Venkat Sundaranatha

WE’RE HIRING
Chu-Cheng Hsieh