Large-Scale Real-Time Product Recommendation at Criteo
Simon Dollé, RecSys FR, December 1st, 2015
Catalog data: feed provided by the merchants
User behavior data: large-scale intent data, all visits to merchant websites (page views, basket, sales events)
Ad display data: displayed and clicked ads
We buy ad spaces
We sell clicks that convert, a lot
We take the risk
10 000 displays lead to 50 clicks, which lead to 1 sale
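The rates implied by this funnel can be worked out directly. The following is a minimal sketch of the arithmetic; the variable names are illustrative and do not come from the talk.

```python
# Funnel figures quoted on the slide.
displays = 10_000
clicks = 50
sales = 1

click_through_rate = clicks / displays   # clicks per display
conversion_rate = sales / clicks         # sales per click
sales_per_display = sales / displays     # sales per display

print(f"Click-through rate: {click_through_rate:.2%}")
print(f"Conversion rate:    {conversion_rate:.2%}")
print(f"Sales per display:  {sales_per_display:.4%}")
```

In other words, about 0.5% of displays are clicked and about 2% of clicks end in a sale, so any per-display decision has to be made on very sparse feedback.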
3 billion ads/day, 3 billion products
10ms to pick relevant products
7 data centers, 15 000 servers, 1200-node Hadoop cluster
Catalog data: 3B+ products
Browsing history: 2B events / day
Ad display data: 20B events / day
How do we do it?
Recommend products for a user
What we want: reco(user) = products
1B users x 3B products!
But we need to scale and keep it fresh
What we can do:
Pre-select products offline
Refine scoring online to get final candidates
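Scoring every user against every product would mean 1B x 3B = 3 x 10^18 pairs, which is why the work is split in two. The sketch below illustrates that split under stated assumptions: the `preselections` table stands for the offline output, `score` stands for the online model, and all names and toy values are invented for illustration rather than taken from Criteo's system.

```python
from typing import Callable, Dict, List

# Hypothetical offline output: for each product a user has seen,
# a short list of preselected candidate products.
Preselections = Dict[str, List[str]]


def recommend(
    seen_products: List[str],
    preselections: Preselections,
    score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Online step: gather preselected candidates, score them, keep the top k."""
    candidates: List[str] = []
    for product in seen_products:
        for candidate in preselections.get(product, []):
            if candidate not in candidates:  # keep the first occurrence only
                candidates.append(candidate)
    return sorted(candidates, key=score, reverse=True)[:k]


# Toy data, invented for illustration.
preselections = {"orange_shoes": ["red_shoes", "orange_sandals", "shoe_polish"]}
toy_scores = {"red_shoes": 0.12, "orange_sandals": 0.18, "shoe_polish": 0.02}

print(recommend(["orange_shoes"], preselections, toy_scores.__getitem__, k=2))
```

The online step only ever touches the handful of candidates looked up for the products the user has seen, which is what makes a budget of a few milliseconds plausible.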
Bob saw orange shoes
Some candidate products: historical, most viewed, similar, complementary
Recommendation Service: 20K qps
Hadoop: preselection computation (Map-Reduce jobs) over 50B browsing history
Preselections: 500M, pushed to the Recommendation Service every 12h
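The slides do not say how the preselections are computed. As one possible illustration of a Map-Reduce style job, the sketch below counts how often two products are viewed by the same user and keeps the most co-viewed products for each one. The co-view heuristic, the function names and the toy histories are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def map_coviews(history: List[str]) -> Iterator[Tuple[Pair, int]]:
    """Map step: emit ((product, other), 1) for each ordered pair seen by one user."""
    for a, b in permutations(sorted(set(history)), 2):
        yield (a, b), 1


def reduce_top_n(pairs: Iterable[Tuple[Pair, int]], n: int) -> Dict[str, List[str]]:
    """Reduce step: sum counts per pair, then keep the n most co-viewed products."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    by_product = defaultdict(list)
    for (a, b), count in counts.items():
        by_product[a].append((count, b))
    return {
        # Highest count first; ties are broken alphabetically for determinism.
        a: [b for _, b in sorted(others, key=lambda cb: (-cb[0], cb[1]))[:n]]
        for a, others in by_product.items()
    }


# Toy browsing histories, one list per user.
histories = [
    ["orange_shoes", "red_shoes"],
    ["orange_shoes", "red_shoes", "socks"],
    ["orange_shoes", "socks", "hat"],
]
emitted = (pair for history in histories for pair in map_coviews(history))
preselections = reduce_top_n(emitted, n=2)
print(preselections["orange_shoes"])
```

On a real cluster the map and reduce steps would run as distributed jobs over the browsing history; here they are plain Python functions so the data flow is easy to follow.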
Online: sources (similarities, most viewed, most bought)
Online: merge of products (similarities, most viewed, most bought)
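The merge step can be pictured as combining the candidate lists from each source into one deduplicated set. This is a minimal sketch with invented product names; keeping track of which sources proposed each product is an assumption of the example, not something the slides state.

```python
from typing import Dict, List


def merge_sources(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Merge candidate lists, deduplicating products and remembering their sources."""
    merged: Dict[str, List[str]] = {}
    for source_name, products in sources.items():
        for product in products:
            merged.setdefault(product, []).append(source_name)
    return merged


# Toy candidate lists, one per source named on the slide.
sources = {
    "similarities": ["red_shoes", "orange_sandals"],
    "most_viewed": ["white_sneakers", "red_shoes"],
    "most_bought": ["white_sneakers", "black_boots"],
}
merged = merge_sources(sources)
print(list(merged))
print(merged["red_shoes"])
```

Each product appears once in the merged result, so it is scored once even when several sources propose it.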
ML model
Logistic regression models, because:
They scale
They are fast
They can handle lots of features
Product-specific: price, category
User-specific: user segment, user last category
User-product interactions: time since last view, category match
Display-specific: desktop vs mobile
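To make the feature families above concrete, here is a minimal sketch of scoring one (user, product, display) triple with a logistic regression over sparse categorical features. The feature encoding, the price bucketing, the weights and the bias are all invented for illustration; the talk does not describe the actual encoding or model parameters.

```python
import math
from typing import Dict, List


def featurize(product: dict, user: dict, display: dict) -> List[str]:
    """Turn raw attributes into sparse string features, one per feature family."""
    return [
        f"category={product['category']}",                                 # product-specific
        f"price_bucket={min(int(product['price'] // 50), 4)}",             # product-specific
        f"user_segment={user['segment']}",                                 # user-specific
        f"user_last_category={user['last_category']}",                     # user-specific
        f"category_match={product['category'] == user['last_category']}",  # interaction
        f"device={display['device']}",                                     # display-specific
    ]


def predict(features: List[str], weights: Dict[str, float], bias: float) -> float:
    """Logistic regression: sigmoid of the bias plus the weights of active features."""
    z = bias + sum(weights.get(feature, 0.0) for feature in features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, as if learned offline from display, click and sale logs.
weights = {"category_match=True": 1.0, "device=mobile": -0.5}
bias = -3.0

user = {"segment": "frequent_buyer", "last_category": "shoes"}
display = {"device": "desktop"}
shoes = {"category": "shoes", "price": 79.0}
hat = {"category": "hats", "price": 25.0}

p_shoes = predict(featurize(shoes, user, display), weights, bias)
p_hat = predict(featurize(hat, user, display), weights, bias)
print(p_shoes > p_hat)
```

Because prediction is a sum over a few active features followed by a sigmoid, scoring a few hundred candidates per request stays cheap, which fits the reasons given for choosing this model family.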
Recommendation Service: 20K qps
Browsing history (50B) → Hadoop, preselection computation (Map-Reduce jobs) → Preselections (500M, every 12h) → Recommendation Service
Display, click, sale logs → Hadoop → Prediction models (every 6h) → Recommendation Service
Online: scoring (similarities, most viewed, most bought)
Scores as computed: 0.02, 0.12, 0.06, 0.18, 0.03, 0.05, 0.01, 0.005, 0.011, 0.013, 0.004, 0.007
Scores after sorting: 0.18, 0.12, 0.06, 0.05, 0.03, 0.02, 0.013, 0.011, 0.01, 0.007, 0.005, 0.004
Online: candidates (banner filled from the ranked products)
Ranked scores: 0.18, 0.12, 0.06, 0.05, 0.03, 0.02, 0.013, 0.011, 0.01, 0.007, 0.005, 0.004
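The scoring and candidate slides amount to sorting the merged products by predicted score and keeping the best ones for the banner. The sketch below replays that with the scores shown on the slides; the banner size of 4 is an assumption based on the four product slots in the banner illustration.

```python
# Scores in the order they appear on the scoring slide (one per merged product).
scores = [0.02, 0.12, 0.06, 0.18, 0.03, 0.05, 0.01, 0.005, 0.011, 0.013, 0.004, 0.007]

# Rank (product index, score) pairs from highest to lowest score.
ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)

banner_size = 4  # assumption: one product per slot in the banner illustration
banner = ranked[:banner_size]

print([score for _, score in ranked])
print([index for index, _ in banner])
```

The first printed list reproduces the sorted order shown on the slides, and the second gives the positions of the products that would fill the banner.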
What’s next?
What’s next for us: upcoming challenges
Long(er)-term user profiles
More and better product information (images, semantic, NLP)
Instant update of similarities
Joint product scoring (score the full banner rather than each product independently)
What’s next for you: fancy a try?
On your own: we published datasets for click prediction
4GB display-click data: Kaggle challenge in 2014 http://bit.ly/1vgw2XC
1TB display-click data (industry’s largest dataset): http://bit.ly/1PyH4Vq
4 billion observations, 156 billion feature-values
Available on Microsoft Azure
Used by edX (UC Berkeley)
With us! http://labs.criteo.com/jobs/
Questions?
Thank you!
s.dolle@criteo.com @simondolle @recsysfr
Credits: Creative Stall, Gilbert Bages