Download presentation
Presentation is loading. Please wait.
1
Recommender System with Hadoop and Spark
Wu, Yongkai Zhang, Fengli
2
Hadoop Basically parallel processing framework running on cluster of commodity machines Stateless functional programming because processing of each row of data does not depend upon any other row or state Data is replicated, partitioned and resides in Hadoop Distributed File System (HDFS) Each partitions get processed by a separated mapper or reducer task
3
Item-based Collaborative Filtering
the similarities between different items in the dataset are calculated according to the items’ properties these similarity values are used to recommend items to users One problems? If a lot of users like item A and B, but A and B have no similarity such as beer and baby diapers?
4
Item-Based Recommender using Co-occurrence similarity
Look at User 1 and 2 who have watched Movie 101, you will see that there is 2 cooccurrence of Movie 102
6
Implement of the recommender system
Built history matrix — contains the interactions between users and items as a user-by-item matrix Built co-occurrence matrix — transforms the history matrix into an item-by-item matrix, recording which items appear together in user histories Do recommendation
7
History matrix Map (key, value) key.set (User ID) Value.set (ItemID+
“:” + Rating) <1, (101: 5.0)> <1, (102:3.0)> Combine(key, list( value)) <1, ((101: 5.0), (102:3.0))> Reduce
8
Co-occurrence matrix transforms the history matrix into an item-by-item matrix, recording which items appear together in user histories
9
Indicator Matrix
10
Validation Select 20% users from the data set, and assume 20% movies have not been watched
11
Mahout Mahout is a machine-learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster Command “mahout recommenditembesed --input /movielens.txt --output recommendations --numRecommendations 10 -- outputPathForSimilarityMatrix similarity-matrix --similarityClassname SIMILARITY_COOCCURRENCE”
12
User-based CF RS with Spark
Task: to implement a recommendation system using Apache Spark framework Approach: user-based collaborative filtering Similarity metric Dataset: MovieLens
13
Spark
14
Spark operators
15
Word Count Example Combine tuples with the same key.
19
Very easy approach, but…
20
Memory-based Approach
Movielens : 1. 10 million ratings applied to 10,000 movies by 72,000 users 2. The size of rating matrix: 10 k * 72 k = 720 M
21
memory-based model in distributed platform
Sparse matrix Node 1 Node 2 Node n
22
Similarity metric loadFile.map().groupByKey()
23
map().groupByKey()
24
map(calcSim()).map().groupByKey()
25
Load history record loadFile.map().groupByKey()
26
Neighbor recommendation
Similarity model Historical data (broadcast)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.