Cameron Bell Luke Doolittle Amod Ghangurde

Presentation transcript:

Movie Match
Cameron Bell, Luke Doolittle, Amod Ghangurde
Cameron: Hi, we are Amod, Luke, and Cameron, and we will be showing you our application, Movie Match.

Overview
Problem statement
It takes far too long to choose a movie to watch
Even when you do choose, how do you know you’ll like it?
Movie Match hasn't been done successfully cross-platform(s)
Cameron: Imagine sitting down with two of your friends to watch a movie. I have spent as long as an hour choosing a movie. Imagine if you could have an application suggest the perfect movie nearly instantaneously! Movie Match does just that. Based on a user’s ratings, it suggests movies to watch using a machine-learning recommender system.

Solo and Group recommender
Cameron: In the future, it will suggest movies for groups, as described. It will also suggest movies for a variety of genres, so you can watch something you’re in the mood for. This could be implemented on a mobile platform as well, but right now it has only a web front end.

What we’ve done Cameron

Dataset
Initial Load
Source            GroupLens                        Netflix Prize
Ratings           24 million                       100 million
Movies and Users  40,000 movies by 260,000 users   17,000 movies by 480,000 users
Timeframe         Jan 1995 to Oct 2016             Oct 1998 to Dec 2005
Size              ~1 GB                            ~2 GB
Format            .csv files                       text files
Update (Near real time)
Updates made by users in the front-end web application
Cameron: Our data come from the Netflix Prize and from MovieLens, a dataset published by GroupLens, a research group with a similar goal. The 3 Vs are velocity, volume, and variety. Velocity: slow (right now), but in the future there will be incoming data from user ratings. Volume: we have about 3 GB of data initially. Variety: the incoming streaming data would be schema-on-read for easy processing, but the initial datasets required a fair amount of manipulation to use the machine learning model and to consolidate the two different forms.

Dataset Preprocessing
Used MovieLens dataset as base
Exact title and year match
After merging: ~70 million Netflix ratings
Luke
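
The exact title-and-year match described on this slide could look roughly like the sketch below. It is only an illustration: the field names (`movie_id`, `netflix_id`, `title`, `year`), the helper names, and the sample rows are all made up, and the team's actual preprocessing is not shown in the slides.

```python
# Hypothetical sample rows; titles, ids, and ratings are invented for illustration.
movielens_movies = [
    {"movie_id": 1, "title": "Movie A", "year": 1995},
    {"movie_id": 2, "title": "Movie B", "year": 1999},
]
netflix_movies = [
    {"netflix_id": 10, "title": "Movie A", "year": 1995},  # exact match
    {"netflix_id": 11, "title": "Movie B", "year": 2001},  # same title, other year
]
netflix_ratings = [("u1", 10, 4.0), ("u2", 11, 3.0)]  # (user, netflix_id, rating)


def build_id_map(base_movies, other_movies):
    """Map Netflix ids to MovieLens ids where title and year match exactly."""
    key_to_base = {(m["title"], m["year"]): m["movie_id"] for m in base_movies}
    id_map = {}
    for m in other_movies:
        key = (m["title"], m["year"])
        if key in key_to_base:
            id_map[m["netflix_id"]] = key_to_base[key]
    return id_map


def remap_ratings(ratings, id_map):
    """Keep only ratings whose movie matched, rewritten to the base movie id."""
    return [(user, id_map[nid], r) for user, nid, r in ratings if nid in id_map]


id_map = build_id_map(movielens_movies, netflix_movies)
merged = remap_ratings(netflix_ratings, id_map)
```

Ratings for movies without an exact match are dropped in this sketch, which is one plausible reason the merged Netflix set is smaller than the original 100 million ratings.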

Architecture
Hardware Options: Physical systems; Virtual environment (EC2, Azure, etc.)
Data Storage Options: Spark; Graph Database (Neo4j)
Machine Learning Options: Collaborative Filtering; Nearest Neighbors
Luke
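
To make the nearest-neighbors option concrete, here is a minimal user-based sketch in plain Python. It assumes cosine similarity over co-rated movies and a similarity-weighted average of the k most similar users; the slides do not say which similarity measure or weighting was considered, and the ratings below are invented.

```python
import math

# Invented ratings: user -> {movie: rating}
ratings = {
    "ann": {"m1": 5.0, "m2": 4.0, "m3": 1.0},
    "bob": {"m1": 5.0, "m2": 5.0, "m4": 4.0},
    "cat": {"m1": 1.0, "m3": 5.0, "m4": 2.0},
}


def cosine(a, b):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie, k=2):
    """Predict a rating as the similarity-weighted mean of the k nearest raters."""
    neighbors = [
        (cosine(ratings[user], ratings[other]), ratings[other][movie])
        for other in ratings
        if other != user and movie in ratings[other]
    ]
    neighbors.sort(reverse=True)  # most similar users first
    top = neighbors[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no usable neighbors
    return sum(sim * r for sim, r in top) / total


prediction = predict("ann", "m4")
```

In this toy data "ann" agrees with "bob" far more than with "cat", so the prediction for "m4" lands closer to bob's rating of 4 than to cat's rating of 2.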

Architecture (Proposed) Luke

Architecture (Prototype) Luke

Requirements (for the given datasets)
Storage: > 10 GB storage
Memory: within Spark, requires > 8 GB RAM
CPU: within Spark, requires > 2 virtual cores
Luke

Website Luke

Population view - Most often rated movies and average rating
Amod: The current Movie Match prototype uses Tableau to generate user analytics. For a production release, the analytics can be ported to the Movie Match application. There are two components to these analytics: admin and user analytics. While we have focused on user analytics, this slide and the next give a glimpse of admin-level analytics. The above dashboard shows a tree map of the movies most often rated by users and their average ratings. The tree map provides the movie id, name, average rating, and number of times the movie was rated. Forrest Gump / Pretty Woman - no surprises there.
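
The aggregation behind such a tree map (rating count and average rating per movie) can be sketched as follows. The prototype does this in Tableau; the function name and sample tuples here are hypothetical.

```python
from collections import defaultdict


def movie_stats(ratings):
    """Return (movie, rating_count, average_rating) rows, most-rated first.

    `ratings` is an iterable of (user, movie, rating) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])  # movie -> [count, sum of ratings]
    for _user, movie, rating in ratings:
        totals[movie][0] += 1
        totals[movie][1] += rating
    stats = [(movie, n, total / n) for movie, (n, total) in totals.items()]
    stats.sort(key=lambda row: row[1], reverse=True)
    return stats


# Invented sample ratings
sample = [("u1", "m1", 5.0), ("u2", "m1", 3.0), ("u1", "m2", 4.0)]
stats = movie_stats(sample)
```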

Population view - Count of ratings and Genre distribution
Amod: Continuing with the admin dashboards, the above charts show a population view of users and the count of movies that they have rated, along with the genre distribution. Let's take a sample user (user id 186590) for a deeper dive into user analytics.

Distribution of my ratings (user id = 186590)
Amod: We envision making the user analytics available to each user who has an account for the Movie Match application. Users will get a view of their own ratings and genre distribution. The above charts are for the sample user id. The solo recommender and the future group recommender that we discussed previously are based on a user profile. User ratings and genre distribution are key components of that user profile.

My unusual likes and dislikes
Amod: Another interesting aspect that builds this user profile is showing the user their unusual likes and dislikes. The unusual dislikes are the movies to the left of the spectrum: movies that the user rated much lower than others did. To the right of the spectrum are the unusual likes.
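
One way to place movies on such a spectrum is to subtract the average rating given by everyone else from the user's own rating. The sketch below assumes that definition, which the slides do not spell out; the function name and data are illustrative.

```python
def rating_deviations(ratings, user):
    """Return (movie, deviation) pairs sorted from unusual dislikes to unusual likes.

    deviation = the user's rating minus the mean rating from all other users.
    `ratings` is an iterable of (user, movie, rating) tuples.
    """
    mine, sums, counts = {}, {}, {}
    for u, movie, rating in ratings:
        if u == user:
            mine[movie] = rating
        else:
            sums[movie] = sums.get(movie, 0.0) + rating
            counts[movie] = counts.get(movie, 0) + 1
    deviations = [
        (movie, mine[movie] - sums[movie] / counts[movie])
        for movie in mine
        if movie in counts  # skip movies nobody else rated
    ]
    deviations.sort(key=lambda pair: pair[1])
    return deviations


# Invented sample ratings
sample = [
    ("me", "m1", 1.0), ("a", "m1", 5.0), ("b", "m1", 4.0),
    ("me", "m2", 5.0), ("a", "m2", 2.0),
    ("me", "m3", 3.0),
]
spectrum = rating_deviations(sample, "me")
```

Negative values sit at the left of the spectrum (unusual dislikes) and positive values at the right (unusual likes).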

Rare movies rated by me
Amod: Another aspect of the user profile that may have an impact is rarely rated movies. What we see above are some of the movies this user rated 5 stars, each of which was rated by only one other user out of a unique user base of over 2 million.

Data Load Performance
Real-time updates should be possible
Cameron: We tried visualizing our data with Neo4j and Python’s NetworkX. We also tried to perform principal component analysis (PCA) on it to reduce the dimensionality and visualize it, but ran into issues with the requirement to load everything into memory. It has been done before, though (https://flowingdata.com/2007/12/11/netflix-prize-dataset-visualization/), so it is a possibility in the future. Feeding data back to the web server is complicated, so we haven’t done it yet. But purely from a data storage/processing point of view, it should be feasible.

Machine Learning Performance
Hyperparameter Tuning
Grid search on small dataset
Hyperparameters: rank, iterations, lambda
Luke
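
A grid search over rank, iterations, and lambda can be sketched as below. The `evaluate` callable is a placeholder: in practice it would train the recommender on the small dataset with the given settings and return a validation error, but the slides do not show that step, so `toy_error` here is a stand-in and the candidate values are assumptions.

```python
from itertools import product


def grid_search(evaluate, ranks, iterations, lambdas):
    """Try every (rank, iterations, lambda) combination and keep the lowest error.

    `evaluate(rank, n_iter, lam)` must return a validation error (lower is better).
    """
    best_error, best_params = None, None
    for rank, n_iter, lam in product(ranks, iterations, lambdas):
        error = evaluate(rank, n_iter, lam)
        if best_error is None or error < best_error:
            best_error = error
            best_params = {"rank": rank, "iterations": n_iter, "lambda": lam}
    return best_error, best_params


def toy_error(rank, n_iter, lam):
    """Stand-in for training plus validation; NOT a real model."""
    return abs(rank - 8) + 0.1 * abs(n_iter - 10) + abs(lam - 0.1)


best_error, best_params = grid_search(
    toy_error,
    ranks=[4, 8, 12],
    iterations=[5, 10],
    lambdas=[0.01, 0.1, 1.0],
)
```

The cost grows with the product of the grid sizes (18 evaluations here), which is one reason to tune on a small dataset first.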

Machine Learning Performance Luke