Big Data is a Big Deal! Capstone Project

Big Data is a Big Deal! Capstone Project 2015-2016
Computer Science Department, Texas Christian University

Frog-B-Data Team
Sushant Ahuja: Project Lead
Cassio Cristovao: Technical Lead
Sameep Mohta: Testing Lead

Agenda
Project Overview and Goals
Milestones
Apache Hadoop
Apache Spark
Initial Testing
Recommender System
Clustering
Challenges
Questions

Project Background
Big Data Revolution
  Phones, Tablets, Laptops, Computers
  Credit Cards
  Transport Systems
0.5% of data stored is actually analyzed [1]
Smart Data: selection based on a pattern of behavior
  Google: "I'm Feeling Lucky"
Recommendation Systems
  Netflix, Amazon, Facebook
1. Source: Published in May 2013
Speaker notes: databases and SQL provide data that is actually valid. Smart data is a selection based on a pattern of behavior (for example, the loan problem). Compare a normal Google search with "I'm Feeling Lucky".

Goals
Performance Analysis: Hadoop vs Spark
  Speed
  Size of data
  Efficiency
Validate feasibility
  Predict data: recommendation systems
  'Big Data', 'Smart Data'

Project Technologies
Java Virtual Machine
Eclipse IDE
Apache Hadoop
Apache Spark
Maven for Hadoop and Spark
Mahout on Hadoop systems
MLlib on Spark systems

Milestones
Iteration 1: 15 December, 2015
Iteration 2: 2 February, 2016
Iteration 3: March, 2016
Iteration 4: April, 2016
SRS: April, 2016
NTASC Presentation: April, 2016
Final Presentation: April, 2016
Complete Documentation: 2 May, 2016

Fall 2015
Project Selection
Iteration 1:
  Setting up Hadoop and Spark on 6 Linux machines
  Initial Software Tests
  Initial Documentation
    Project Plan v1.0
    Project Requirements v1.0
    Project Design v1.0

Spring 2016
Iteration 2:
  Cluster of nodes, Hadoop & Spark
  Starting to understand basic recommender algorithms
  Revision of all the documents to v2.0
Iteration 3:
  Recommendation System on Hadoop
  K-Means Clustering on Hadoop
  Revision of all the documents to v3.0

Spring 2016
Iteration 4:
  Recommendation System for Spark
  K-Means Clustering on Spark
  Improve Recommender's Scalability and Reliability
  Final versions of documents
  Developer's Manual
  User's Manual

Our Cluster
2 Clusters: 3 nodes each (Hadoop and Spark)
8 GB RAM
500GB HDD
Ubuntu 15.04
Manager-Worker Architecture
  1 Manager (M)
  2 Workers (W)

Apache Hadoop
Framework for large-scale, data-intensive deployments
Open-source
MapReduce: stream I/O style of data processing, created by Google
  Map: filtering input line by line
  Reduce: collecting and processing filtered data
Write-once storage infrastructure
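The Map and Reduce roles described on this slide can be sketched in plain Python using the word count task that the team ran as an initial test. This is only an illustration of the idea, not the team's Hadoop job: the function names and the in-memory shuffle step are assumptions made for the example, and a real Hadoop job distributes these phases across worker nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word on one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values collected for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Run map over every line, group by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Made-up input lines for illustration.
counts = word_count(["big data is a big deal", "smart data"])
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel, which is what makes the model suit a cluster.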

Apache Hadoop
4 Dimensions: Volume, Velocity, Variety, Veracity
Both Structured (converted) and Unstructured
HDFS: breaks up input data into blocks
  Stores them on compute nodes (Parallel Processing)
Speaker notes: text separated by commas with no field names (such as name, age, title) is unstructured. We have a list of words separated by commas and create our own structure (NoSQL).

HDFS Segmentation

Hadoop Map/Reduce

Apache Spark
Open-source
Supports MapReduce-style processing through Transformations and Actions
Lazy (delayed, on demand) evaluation
In-memory storage and computing
Offers APIs in Scala, Java, Python, R, SQL
Built-in libraries
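Lazy evaluation means that a transformation is only recorded when it is declared, and nothing runs until an action asks for a result. The toy class below imitates that behavior in plain Python. It is a hypothetical stand-in written for this example, not the Spark API, and it ignores distribution and in-memory caching entirely.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step, in order, over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x

pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
work_before_action = len(calls)  # still 0: no step has run yet
result = pipeline.collect()      # the action triggers the whole pipeline
```

Deferring the work lets the engine see the whole chain of steps before executing any of them, which is one reason a lazy design can avoid materializing intermediate results.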

Apache Spark RDD – Resilient Distributed Dataset


Hadoop or Spark?
NOT mutually exclusive
Database: HDFS or others
Rate of processing data
Third-party machine-learning library
Non-commercial, open-source

Initial Tests, Recommender and Clustering
Word Count
Matrix Multiplication
Recommender Systems
K-Means Clustering

Word Count (Cluster)
[Chart: Spark vs Hadoop, time (in minutes) against size of the text file]

Matrix Multiplication (Cluster)
[Chart: Spark vs Hadoop, time (in minutes) against size of the matrices]
Speaker note: feasibility is the word to use when comparing Hadoop and Spark.

Movie Recommender-Hadoop
"I have watched the same movies as you. What other movies have you watched that I have not?"
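One way to turn the question on this slide into an algorithm is to count how often two movies show up in the same person's watch history, then score unseen movies by how strongly they co-occur with what the user has already watched. The sketch below is a minimal pure-Python version of that idea. It is not the team's Mahout job; the function names and the watch histories are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

def build_cooccurrence(history):
    """Count how often two movies appear in the same user's watch history."""
    cooc = defaultdict(lambda: defaultdict(int))
    for movies in history.values():
        for a, b in permutations(set(movies), 2):
            cooc[a][b] += 1
    return cooc

def recommend(user, history, top_n=3):
    """Score each unseen movie by its co-occurrence with the user's watched movies."""
    cooc = build_cooccurrence(history)
    seen = set(history[user])
    scores = defaultdict(int)
    for movie in seen:
        for other, count in cooc[movie].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [movie for movie, _ in ranked[:top_n]]

# Hypothetical watch histories, made up for illustration.
history = {
    "ann": ["Alien", "Brazil", "Casablanca"],
    "bob": ["Alien", "Brazil", "Dune"],
    "cat": ["Alien", "Dune"],
    "dan": ["Alien", "Brazil"],
}
suggestions = recommend("dan", history)
```

Here "dan" has seen Alien and Brazil, so the ranking puts Dune ahead of Casablanca because Dune appears alongside his movies more often.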

Movie Recommender-Hadoop

Movie Recommender-Spark
Collaborative Filtering: "I like the same movies that you like. What other movies did you like that I haven't seen?"
Alternating Least Squares (ALS): "Effective and Scalable method of implementing Collaborative Filtering."
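The core of Alternating Least Squares can be shown with a deliberately reduced sketch: every user and every movie gets a single number, a rating is approximated by their product, and the two sets of numbers are solved for in turn while the other is held fixed. This rank-1, pure-Python version is an assumption made for illustration. MLlib's ALS uses higher-rank factors and runs distributed, and the ratings below are invented.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Fit ratings[(u, i)] ~ user[u] * item[i] by alternating least squares."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user factor has a closed-form ridge solution.
        for u in range(n_users):
            num = sum(r * item[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            user[u] = num / (den + reg)
        # Fix user factors; solve for each item factor the same way.
        for i in range(n_items):
            num = sum(r * user[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            item[i] = num / (den + reg)
    return user, item

# Invented ratings keyed by (user index, movie index).
# User 1 rates twice as high as user 0 and has not rated movie 2.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5, (1, 0): 2.0, (1, 1): 4.0}
user, item = als_rank1(ratings, n_users=2, n_items=3)
predicted = user[1] * item[2]  # estimate for the missing (user 1, movie 2) rating
```

Since user 1 rates the shared movies at twice user 0's level, the estimate for the missing rating should come out close to twice 2.5. The regularization term `reg` pulls it slightly below that.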

Movie Recommender-Spark

Recommender Comparison
[Chart: Spark vs Hadoop, time (in minutes) against number of records in the input file]

K Means Clustering
[Figure: clusters k1, k2 and k3 on X-Y axes]
Speaker notes: the figure is only 2-dimensional, but in reality the data is multi-dimensional. For a loan, for example, we have age, credit history, gender, number of children, marital status, previous loans, etc.
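The clustering step works the same way in any number of dimensions: assign every point to its nearest center, move each center to the mean of its points, and repeat. The sketch below shows this loop (Lloyd's algorithm) in plain Python on three-dimensional points. It is an illustration only, not the Mahout or MLlib implementation, and the applicant data and starting centers are invented.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign points to the nearest center, then move centers."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters

# Invented applicants: (age, previous loans, number of children).
points = [(25, 1, 0), (27, 0, 1), (26, 1, 1), (55, 5, 3), (58, 6, 2), (60, 4, 3)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
```

The starting centers are passed in explicitly so the run is repeatable. Library implementations usually pick them with a randomized seeding strategy instead.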

Clustering Applications
Who decides whether or not you get the loan that you applied for?
  Approved Loan Cluster
  Rejected Loan Cluster
  Doubtful Loan Cluster
Marketing:
  Helps marketers discover distinct groups in their customer bases
  Develop targeted marketing programs for each group

Clustering Comparison
[Chart: Spark vs Hadoop, time (in minutes) against number of records in the input file]

Hadoop or Spark? (Based on a cluster of 1 Manager and 2 Workers)
Hadoop: huge datasets
Spark: computational capability
Spark: degrades on I/O swapping

Technical Challenges
Multiple Reducer Problem
Amount of data
Co-occurrence algorithm
Dumping clusters: K-Means
Quality of Recommendations
  Quality of the choice you get
  Wisdom = f(knowledge) = f(choice)
  Culture = f(information) = f(truth)
Speaker notes: Java: efficiency runs into memory problems. Hadoop/Spark: feasibility. Smart data/Mahout/MLlib: quality.

General Challenges
Punctuality
Updating Documents after each Iteration
Not able to work from home
Rapidly changing software versions

Presentations
TCU Student Research Symposium
North Texas Area Student Conference

Acknowledgements
Project Client: Dr. Antonio Sanchez
Faculty Advisor: Dr. Donnell Payne
Systems Admin: Mr. Billy Farmer
Advice on Complex Algorithms: Dr. Eric Hanson

Recommended Movies

Thank you!
Questions?
