Published by Horace Richard. Modified over 9 years ago.
1
Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla
2
Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla X
3
Applicability of Mahout for Large Data Sets Michael Sevilla
4
What is Mahout?
Distributed machine learning libraries
– “scalable to reasonably large data sets”
– Runs on Hadoop
http://heureka.blogetery.com/
5
The Data: Million Song Data Set
Large Data Set
– 1,019,318 users
– 384,546 MSD songs
– 48,373,586 (user, song, count) triples
Kaggle Competition: offline evaluation
– Predict songs a user will listen to, using
    Training: 1M user listening history
    Validation: 110K users
“Martin L” blogged his methodology + results
6
Motivations
Can Mahout easily be modified?
Can Mahout perform well for this workload?
Can Mahout produce accurate results?
Can Mahout work ‘out of box’?
Hypothesis: 22 machines + Mahout > 1 guy
7
What kind of Recommender?
Format: Users interacting with items
Users express preferences towards items
We can use Collaborative Filtering
8
Collaborative Filtering
Predicts preference of user towards an item
Constructs a Top-N-Recommendation
1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries
Mahout has item-based Collaborative Filtering jobs!
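The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's or Martin's implementation: the tab-separated input layout, the function names, and the choice of cosine similarity between item vectors are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt


def parse_triples(lines):
    """Step 1: parse 'user<TAB>song<TAB>count' lines (assumed layout)."""
    for line in lines:
        user, song, count = line.rstrip("\n").split("\t")
        yield user, song, int(count)


def build_user_item_matrix(triples):
    """Step 2: sparse user-item matrix stored as {user: {song: count}}."""
    matrix = defaultdict(dict)
    for user, song, count in triples:
        matrix[user][song] = matrix[user].get(song, 0) + count
    return matrix


def item_vectors(matrix):
    """Transpose to {song: {user: count}} so items can be compared."""
    items = defaultdict(dict)
    for user, songs in matrix.items():
        for song, count in songs.items():
            items[song][user] = count
    return items


def cosine(a, b):
    """Cosine similarity between two sparse vectors held as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_n(matrix, user, n=10):
    """Step 3: score songs the user has not heard and return the best n."""
    items = item_vectors(matrix)
    seen = matrix.get(user, {})
    scores = {}
    for candidate, vector in items.items():
        if candidate in seen:
            continue
        # Weight each similarity by how often the user played the known song.
        score = sum(cosine(vector, items[s]) * c for s, c in seen.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]
```

For example, `top_n(build_user_item_matrix(parse_triples(lines)), "u3", n=2)` returns the unheard songs that share listeners with the songs already in `u3`'s history.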
9
CAN MAHOUT EASILY BE MODIFIED?
10
Martin’s Code
Methodology: similarity vector of history
– Sparse matrix COLISTEN(i, j): number of listeners who listened to both i and j
– Sum similarities for each song user x listens to
The code: all Python
– Parse: 27 lines of code (l.o.c.)
– Create Matrix: 46 l.o.c.
– Predict: 45 l.o.c.
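The co-listen idea on this slide can be sketched as follows. This is an illustration under stated assumptions, not Martin's actual scripts: the function names are hypothetical, the input is assumed to be a dict mapping each user to a list of songs, and raw co-listen counts are summed without any normalization or weighting, which his code may well apply.

```python
from collections import defaultdict
from itertools import combinations


def colisten_matrix(histories):
    """Build COLISTEN(i, j): how many users listened to both song i and song j."""
    colisten = defaultdict(lambda: defaultdict(int))
    for songs in histories.values():
        # Count each unordered pair once per user, then store it symmetrically.
        for i, j in combinations(sorted(set(songs)), 2):
            colisten[i][j] += 1
            colisten[j][i] += 1
    return colisten


def predict(colisten, history, n=10):
    """Sum co-listen counts over the user's songs and return the top n unseen songs."""
    seen = set(history)
    scores = defaultdict(int)
    for song in seen:
        for other, count in colisten.get(song, {}).items():
            if other not in seen:
                scores[other] += count
    # Break ties alphabetically so the output is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]
```

The nested dict keeps the matrix sparse: only pairs that some user actually co-listened to take up memory, which matters with several hundred thousand songs.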
11
Mahout’s Code
Methodology:
– No Idea…
The code: all Java
– Poorly commented
– 14 *.java files
– Many directories: ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
– RecommenderJob.java: 284 lines of code (l.o.c.)
– SimilarityMatrixRowWrapperMapper.java: 47 l.o.c.
– UserVectorSplitterMapper.java: 138 l.o.c.
12
Mahout’s Code
13
CAN MAHOUT EASILY BE MODIFIED? NO
14
CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?
15
Martin’s Code
Performance on 86MB:
– Parse data: 10 minutes
– Make Matrix: 22 minutes
– Predict songs for 11000 users: 1 hour, 18 minutes
Did not test scalability
$/ python convertToNumbers.py
$/ python colisten.py
$/ python predict_colisten.py
16
Mahout’s Code
Performance on 86MB:
– Parse Time: 10 minutes
– Total Time: 25 minutes
Tested scalability
– 64MB, 128MB, 256MB, 1GB, 2GB, 3GB
17
Total Time ~ 12m, 43m, 1hr, 2hr, 4hr, >5hr …. 10 Nodes Failed
18
Mahout’s Code
Prepare Jobs (parse): seconds to minutes
19
Recommend Jobs (predict): seconds - minutes
20
Mahout’s Code Create Matrix Jobs: minutes - hours
21
CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD? NO
22
CAN MAHOUT PRODUCE ACCURATE RESULTS?
23
Training Set
Kaggle Million Song Subset: 110K users
– User 2: 16 entries, took out 8
– User 16: 32 entries, took out 8
– User 17: 25 entries, took out 8
24
Martin’s Code
User 2:
User 16:
User 17:
where Q is the number of queries
25
Mahout’s Code
User 2:
User 16:
User 17:
where Q is the number of queries
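The phrase “where Q is the number of queries” matches the usual definition of mean average precision, MAP = (1/Q) Σ AveP(q), summed over the Q queries. Assuming that is the metric used on these two slides (an assumption, since only the caption is given here), a sketch of the computation looks like this. The function names and the cutoff `k=500` are illustrative choices, and the held-out entries (for example, the 8 songs taken out per user) play the role of the relevant set.

```python
def average_precision(recommended, relevant, k=500):
    """AveP for one query: mean of precision@rank taken at each relevant hit."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Normalize by the best achievable number of hits within the cutoff.
    return total / min(len(relevant), k)


def mean_average_precision(queries, k=500):
    """MAP over Q queries, each a (recommended_list, relevant_items) pair."""
    if not queries:
        return 0.0
    return sum(average_precision(rec, rel, k) for rec, rel in queries) / len(queries)
```

With one user per query, a recommender that ranks a user's held-out songs near the top of its list scores close to 1, and one that never recommends them scores 0.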
26
CAN MAHOUT PRODUCE ACCURATE RESULTS? YES
27
CAN MAHOUT WORK ‘OUT OF BOX’? YES… but not well
28
Conclusion
Mahout did not scale well
Mahout was not easy to learn
Mahout was not easily modifiable
For performance and efficiency, it is better to
– Understand the data set
– Understand data mining
– Understand the methodology