1
Big Data is a Big Deal!
Capstone Project 2015-2016
Computer Science Department, Texas Christian University
2
Frog-B-Data Team
Sushant Ahuja: Project Lead
Cassio Cristovao: Technical Lead
Sameep Mohta: Testing Lead
3
Agenda
Project Overview and Goals
Milestones
Apache Hadoop
Apache Spark
Initial Testing
Recommender System
Clustering
Challenges
Questions
4
Project Background
Big Data Revolution
Sources: phones, tablets, laptops, computers, credit cards, transport systems
Only 0.5% of data stored is actually analyzed [1]
Smart Data: selection based on a pattern of behavior
Google: "I'm Feeling Lucky"
Recommendation systems: Netflix, Amazon, Facebook
[1] Source: Published in May 2013
Notes: Using databases and SQL to provide data that is actually valid. Find a good explanation for smart data: a selection based on a pattern of behavior (loan problem). Compare a normal Google search with the "I'm Feeling Lucky" example.
5
Goals
Performance Analysis: Hadoop vs Spark
Speed
Size of data
Efficiency
Validate feasibility
Predict data: recommendation systems
‘Big Data’, ‘Smart Data’
6
Project Technologies
Java Virtual Machine
Eclipse IDE
Apache Hadoop
Apache Spark
Maven for Hadoop and Spark
Mahout on Hadoop systems
MLlib on Spark systems
7
Milestones
Iteration 1: 15 December, 2015
Iteration 2: 2 February, 2016
Iteration 3: March, 2016
Iteration 4: April, 2016
SRS: April, 2016
NTASC Presentation: April, 2016
Final Presentation: April, 2016
Complete Documentation: 2 May, 2016
8
Fall 2015
Project Selection
Iteration 1:
Setting up Hadoop and Spark on 6 Linux machines
Initial Software Tests
Initial Documentation
Project Plan v1.0
Project Requirements v1.0
Project Design v1.0
9
Spring 2016
Iteration 2:
Cluster of nodes, Hadoop & Spark
Starting to understand basic recommender algorithms
Revision of all the documents to v2.0
Iteration 3:
Recommendation System on Hadoop
K-Means Clustering on Hadoop
Revision of all the documents to v3.0
10
Spring 2016
Iteration 4:
Recommendation System for Spark
K-Means Clustering on Spark
Improve Recommender’s Scalability and Reliability
Final versions of documents
Developer’s Manual
User’s Manual
11
Our Cluster
2 Clusters: 3 nodes each, Hadoop and Spark
8 GB RAM
500 GB HDD
Ubuntu 15.04
Manager-Worker Architecture: 1 Manager (M), 2 Workers (W)
12
Apache Hadoop
Framework for large-scale, data-intensive deployments
Open-source
MapReduce: stream I/O style of data processing, created by Google
Map: filtering input line by line
Reduce: collecting and processing filtered data
Write-once storage infrastructure
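The Map and Reduce steps above can be sketched with word count, the first test this project ran. This is a minimal plain-Java illustration, not Hadoop code: the class name WordCountSketch and its methods are hypothetical, and the Hadoop Mapper/Reducer API is deliberately left out so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map and reduce phases; it does not use the Hadoop API.
class WordCountSketch {

    // Map phase: read one input line and emit a (word, 1) pair for every word in it.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: collect the emitted pairs and sum the counts for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: apply map to each line, then reduce over all emitted pairs.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data is a big deal", "smart data")));
    }
}
```

On a real cluster the map calls would run in parallel on the nodes holding each block of input, and the framework would group the pairs by key before handing them to the reducers.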
13
Apache Hadoop
4 Dimensions: Volume, Velocity, Variety, Veracity
Both structured (converted) and unstructured data
HDFS: breaks up input data into blocks and stores them on compute nodes (parallel processing)
Notes: Explain structured and unstructured data. Text separated by commas with no field names (such as name, age, title) is unstructured: we have a list of words separated by commas and create our own structure (NoSQL).
14
HDFS Segmentation
16
Hadoop Map/Reduce
17
Apache Spark
Open-source
Supports MapReduce, a.k.a. transformations and actions
Lazy (delayed, on-demand) evaluation
In-memory storage and computing
Offers APIs in Scala, Java, Python, R, SQL
Built-in libraries
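Lazy evaluation can be illustrated with a small analogy. The sketch below uses java.util.stream in place of Spark's RDD API (an assumption made to keep the example self-contained); the class name LazyEvaluationSketch and its methods are hypothetical. Declaring the map step does no work, and the function only runs once a terminal operation, the counterpart of a Spark action, asks for a result.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: java.util.stream is used here in place of Spark's RDD API.
class LazyEvaluationSketch {

    // Counts how many times the "transformation" function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // Declares a transformation; nothing is computed at this point.
    static Stream<Integer> squared(List<Integer> input) {
        return input.stream().map(x -> {
            calls.incrementAndGet();
            return x * x;
        });
    }

    // The "action": a terminal operation that forces the pipeline to run.
    static int sum(Stream<Integer> pipeline) {
        return pipeline.mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = squared(List.of(1, 2, 3, 4));
        System.out.println("calls before action: " + calls.get());
        System.out.println("sum of squares: " + sum(pipeline));
        System.out.println("calls after action: " + calls.get());
    }
}
```

Deferring the work this way lets the engine see the whole chain of transformations before executing any of it, which is what allows Spark to plan the computation and keep intermediate data in memory.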
18
Apache Spark RDD – Resilient Distributed Dataset
19
Another example for spark – yet to add
20
Hadoop or Spark?
NOT mutually exclusive
Database: HDFS or others
Rate of processing data
Third-party machine-learning library
Non-commercial, open-source
21
Initial Tests, Recommender and Clustering
Word Count
Matrix Multiplication
Recommender Systems
K-Means Clustering
22
Word Count (Cluster)
[Chart: time (in minutes) against size of the text file, for Spark and Hadoop]
23
Matrix Multiplication (Cluster)
[Chart: time (in minutes) against size of the matrices, for Spark and Hadoop]
Notes: Feasibility is the word to use: Hadoop or Spark.
24
Movie Recommender-Hadoop
“I have watched the same movies as you. What other movies have you watched that I have not?”
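The question on this slide can be turned into a very small scoring rule. The sketch below is a simplification under stated assumptions, not the Mahout recommender used in the project: the class name OverlapRecommenderSketch, the recommend method, and the sample users and movies are all made up for illustration. Each movie the target user has not watched is scored by how many movies the target shares with the users who did watch it.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified illustration; this is not Mahout's implementation.
class OverlapRecommenderSketch {

    // Scores each movie the target user has not watched by summing, over all
    // other users who watched it, the number of movies they share with the target.
    static Map<String, Integer> recommend(String target, Map<String, Set<String>> watched) {
        Set<String> mine = watched.getOrDefault(target, Set.of());
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : watched.entrySet()) {
            if (other.getKey().equals(target)) {
                continue;
            }
            // "I have watched the same movies as you": the shared set.
            Set<String> shared = new HashSet<>(other.getValue());
            shared.retainAll(mine);
            if (shared.isEmpty()) {
                continue;
            }
            // "What other movies have you watched that I have not?"
            for (String movie : other.getValue()) {
                if (!mine.contains(movie)) {
                    scores.merge(movie, shared.size(), Integer::sum);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical viewing histories, invented for this example.
        Map<String, Set<String>> watched = Map.of(
            "ann", Set.of("Alien", "Brazil"),
            "bob", Set.of("Alien", "Brazil", "Casablanca"),
            "cat", Set.of("Alien", "Dune"),
            "dan", Set.of("Eraserhead"));
        System.out.println(recommend("ann", watched));
    }
}
```

A user with no overlap contributes nothing, so their movies never appear in the result; a higher score means the movie comes from users whose history is closer to the target's.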
25
Movie Recommender-Hadoop
26
Movie Recommender-Spark
Collaborative Filtering: “I like the same movies that you like. What other movies did you like that I haven’t seen?”
Alternating Least Squares (ALS): “Effective and Scalable method of implementing Collaborative Filtering.”
27
Movie Recommender-Spark
28
Recommender Comparison
[Chart: time (in minutes) against number of records in the input file, for Spark and Hadoop]
29
K-Means Clustering
[Figure: 2-dimensional plot with X and Y axes and labels k1, k2, k3. Source:]
Notes: This is only 2-dimensional, but in reality it is multi-dimensional. For example, for a loan, we have age, credit history, gender, number of children, marital status, previous loans, etc.
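The assign-and-update loop behind K-Means can be sketched in a few lines. This is a minimal single-machine illustration, assuming squared Euclidean distance, caller-supplied starting centroids, and a fixed iteration count; it is not the Mahout or MLlib implementation, and the class name KMeansSketch and the sample points are hypothetical. The same code works for any number of dimensions, such as the loan attributes listed above.

```java
import java.util.Arrays;

// Single-machine illustration; not the Mahout or MLlib implementation.
class KMeansSketch {

    // Index of the centroid closest (squared Euclidean distance) to the point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Repeats the assign/update steps for a fixed number of iterations and
    // returns the final centroids. Works for any number of dimensions.
    static double[][] fit(double[][] points, double[][] initial, int iterations) {
        int k = initial.length;
        int dims = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            // Assignment step: each point joins its nearest centroid's cluster.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its cluster.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep an empty cluster's centroid where it is
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional points forming two obvious groups.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(fit(points, initial, 5)));
    }
}
```

Passing the starting centroids in keeps the result reproducible; production libraries usually pick them randomly or with a seeding heuristic and stop when the centroids no longer move.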
30
Clustering Applications
Who decides whether or not you get the loan that you applied for?
Approved Loan Cluster
Rejected Loan Cluster
Doubtful Loan Cluster
Marketing:
Helps marketers discover distinct groups in their customer bases
Develop targeted marketing programs for each group
31
Clustering Comparison
[Charts: time (in minutes) against number of records in the input file, for Spark and Hadoop]
32
Hadoop or Spark? (Based on a cluster of 1 Manager and 2 Workers)
Hadoop: huge datasets
Spark: computational capability
Spark: degrades on I/O swapping
33
Technical Challenges
Multiple Reducer Problem
Amount of data
Co-occurrence algorithm
Dumping clusters: K-Means
Quality of Recommendations
Quality of the choice you get
Wisdom = f(knowledge) = f(choice)
Culture = f(information) = f(truth)
Notes: Java: efficiency, runs into memory problems. Hadoop/Spark: feasibility. Smart data/Mahout/MLlib: quality.
34
General Challenges
Punctuality
Updating Documents after each Iteration
Not able to work from home
Rapidly changing software versions
35
Presentations
TCU Student Research Symposium (SRS)
North Texas Area Student Conference (NTASC)
36
Acknowledgements
Project Client: Dr. Antonio Sanchez
Faculty Advisor: Dr. Donnell Payne
Systems Admin: Mr. Billy Farmer
Advice on Complex Algorithms: Dr. Eric Hanson
37
Recommended Movies
38
Thank you!
Questions?