ITERATIVE FILE-BASED ITEM:ITEM SIMILARITY COMPUTATION
● Will Holcomb – Vanderbilt University
● Project Aura Intern

Sun Confidential: Internal Only

Presentation Overview
1. Recommender systems overview
2. The shape of the tail
3. The role of Project Aura
4. Item:item similarity
5. Programming for Project Caroline
6. Reworking item:item in terms of tuples
7. Parallelizability and computability
8. Computation in Project Aura

Recommender Systems
Exploiting The Long Tail
[Figure: The Theoretical Tail]

The Actual Tail (More Or Less)
Crawl of Last.fm Top 50 Artists for 11,985 Users
● 21,858 Total Artists
● 598,168 Artist/User Pairs
● 83,668,000 Listens

Collaborative Filtering in Project Aura
● More Aura details coming later
● Collaborative filtering is about adding the hybrid to the hybrid recommender system
● Main concerns for filtering algorithms:
> Stability – How much can a recommendation change?
> Computability – How long does it take to find the answer?

Item:Item Collaborative Filtering
● Users are dimensions
● Items are vectors
● Similarity is the cosine of the angle between two item vectors
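
As a minimal sketch of the item:item idea, the block below treats each item as a sparse vector whose dimensions are users and computes the cosine between two such vectors. The `listens` data and the `cosine_similarity` helper are hypothetical names chosen for illustration; they are not taken from Project Aura.

```python
import math

# Hypothetical listen counts: item (artist) -> {user: listens}.
# Each item is a sparse vector; each user is one dimension.
listens = {
    "artist_a": {"u1": 3, "u2": 4},
    "artist_b": {"u1": 3, "u2": 4},
    "artist_c": {"u3": 5},
    "artist_d": {"u1": 4, "u2": 3},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only users present in both vectors contribute to the dot product.
    dot = sum(count * b[user] for user, count in a.items() if user in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item with no listens is similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity(listens["artist_a"], listens["artist_b"])
disjoint = cosine_similarity(listens["artist_a"], listens["artist_c"])
partial = cosine_similarity(listens["artist_a"], listens["artist_d"])
```

Items with identical listener profiles score 1, items with no listeners in common score 0, and everything else falls in between.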

Project Caroline
● Designed for internet applications
● Utility style pricing – pay for what you use
● Multiple processes distributed across multiple machines
● Shared file storage
● No shared memory

The Aura Datastore
● Requests funneled through Data Store Head
● Subtrees distributed to Partition Clusters running in separate processes
> Process coordination using Jini

Cosine Generation Overview

Join Methods
● Composition
> For a single Record Set, perform an operation on a list of all records with a given key
> (Artist, )
● Cartesian Join
> For n input Record Sets, permute all pairs of records with matching keys
> (Artist A, Length A) × (Artist B, Length B) = (Artist A.Artist B, Length A * Length B)
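
The two join methods above can be sketched over in-memory lists of `(key, value)` records. This is an illustration under assumptions, not the Project Aura implementation: `composition` and `cartesian_join` are hypothetical names, the join is shown for two record sets only rather than n, and treating the user as the matching key is an assumption made for the example.

```python
from collections import defaultdict


def composition(records, operation):
    """Group one record set by key and apply `operation` to each group's values."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return [(key, operation(values)) for key, values in sorted(groups.items())]


def cartesian_join(left, right, combine):
    """Pair every left record with every right record that shares its key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    joined = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            joined.append((key, combine(left_value, right_value)))
    return joined


# Composition: total plays per artist (hypothetical data).
plays = [("artist_a", 3), ("artist_b", 1), ("artist_a", 4)]
totals = composition(plays, sum)

# Cartesian join keyed on user (an assumption): values are (artist, number).
left = [("u1", ("artist_a", 2)), ("u2", ("artist_a", 5))]
right = [("u1", ("artist_b", 3)), ("u3", ("artist_b", 7))]
joined = cartesian_join(
    left, right, lambda a, b: (a[0] + "." + b[0], a[1] * b[1])
)
```

The `combine` function mirrors the slide's pattern: the two artist names are concatenated with a dot and the two numbers are multiplied.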

Partitioning
● Collect all matching keys in a single file
● Run in m processes for m output files
● Each processor puts records in a set of shared files as determined by a common hashing scheme
● File locking necessary to prevent concurrent access
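
A rough sketch of the partitioning step, assuming a POSIX system where `fcntl.flock` is available: every writer hashes the key the same way to pick one of m shared files, then takes an exclusive lock while appending. The file naming, the MD5-based hash, and the helper names are all assumptions for illustration, and advisory locks may behave differently on some shared file systems.

```python
import fcntl
import hashlib
import os
import tempfile


def partition_index(key, num_partitions):
    """Map a key to a partition; stable across processes, unlike built-in hash()."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def append_record(directory, key, value, num_partitions):
    """Append one record to the shared file chosen by the common hash."""
    index = partition_index(key, num_partitions)
    path = os.path.join(directory, f"part-{index:04d}.tsv")
    with open(path, "a", encoding="utf-8") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # block until we hold the file
        try:
            handle.write(f"{key}\t{value}\n")
            handle.flush()  # make the write visible before releasing the lock
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    return path


# Two records with the same key land in the same output file.
with tempfile.TemporaryDirectory() as shared_dir:
    first = append_record(shared_dir, "artist_a", "3", 4)
    second = append_record(shared_dir, "artist_a", "4", 4)
    with open(first, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
```

A cryptographic digest is used here only because Python randomizes its built-in string hash per process; any hash that all writers agree on would serve.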

Cosine Generation As Tuples

Computational Complexity
Optimizations
● Exploit symmetry in output files to only do n! joins
● Exploit symmetry in records to only do (.5)n joins
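
One way symmetry can reduce work, sketched under the assumption that the score for (A, B) equals the score for (B, A): compute each unordered pair once and mirror the result afterwards. The helper names are hypothetical, and this illustrates the general idea rather than the exact join counts stated above.

```python
from itertools import combinations


def symmetric_pairs(keys):
    """Each unordered pair of distinct keys exactly once, in sorted order."""
    return list(combinations(sorted(keys), 2))


def mirror(pair_scores):
    """Expand scores computed for (a, b) to cover (b, a) as well."""
    full = {}
    for (a, b), score in pair_scores.items():
        full[(a, b)] = score
        full[(b, a)] = score
    return full


pairs = symmetric_pairs(["c", "a", "b", "d"])
# Placeholder scores stand in for a real similarity computation.
scores = mirror({pair: 0.5 for pair in pairs})
```

For k distinct keys this visits k(k-1)/2 pairs instead of the k(k-1) ordered pairs.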

Any Questions?
● Will Holcomb
● hoenir.himinbi.org