USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION: Tools to build your big data application. Ameya Kanitkar

Presentation transcript:

USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION: Tools to build your big data application. Ameya Kanitkar

Ameya Kanitkar – that's me! Big Data Infrastructure, Groupon, Palo Alto, USA (working on Deal Relevance & Personalization Systems)

Agenda
- Basics of Hadoop & HBase
- How you can use Hadoop & HBase for a big data application
- Case study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase

Big Data Application Examples
- Recommendation systems
- Ad targeting
- Personalization systems
- BI / DW
- Log analysis
- Natural language processing

So what is Hadoop?
- General-purpose framework for processing huge amounts of data
- Open source
- Batch / offline oriented

Hadoop – HDFS
- Open-source distributed file system
- Stores large files
- Can easily be accessed via applications built on top of HDFS
- Data is distributed and replicated over multiple machines
- Linux-style commands, e.g. ls, cp, mv, touchz

Hadoop – HDFS example: hadoop fs -dus /data/ returns … bytes =~ 168 TB (one of the folders from one of our Hadoop clusters)
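
Not part of the original deck: a minimal sketch of reading the same directory size programmatically with the HDFS Java API, assuming the cluster configuration files are on the classpath and that /data/ exists; the class name is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirSize {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Programmatic equivalent of `hadoop fs -dus /data/`:
        // total length of all files under the path
        ContentSummary summary = fs.getContentSummary(new Path("/data/"));
        System.out.println(summary.getLength() + " bytes");
      }
    }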

Hadoop – MapReduce
- Application framework built on top of HDFS to process your big data
- Operates on key-value pairs
- Mappers filter and transform input data
- Reducers aggregate mapper output

Example: given web logs, calculate the landing-page conversion rate for each product. So basically we need to see how many impressions each product received and then calculate the conversion rate for each product.

MapReduce Example
Map phase:
- Map 1: process a log file. Output: key (product ID), value (impression count)
- Map 2: process a log file. Output: key (product ID), value (impression count)
- Map N: process a log file. Output: key (product ID), value (impression count)
Reduce phase:
- Reducer: receives all data for a given product; a simple loop calculates the conversion rate. Output: (product ID, conversion rate)
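
Not part of the original slides: a minimal Java sketch of such a job against the Hadoop MapReduce API. The deck only sketches the mapper output, so this assumes each log line carries a tab-separated product ID and an event type ("impression" or "purchase"); the class names and input format are illustrative, not Groupon's actual code.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ConversionRateJob {

      // Mapper: one log line in, one (productId, eventType) pair out
      public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          if (fields.length >= 2) {
            // fields[0] = product id, fields[1] = "impression" or "purchase" (assumed format)
            context.write(new Text(fields[0]), new Text(fields[1]));
          }
        }
      }

      // Reducer: all events for one product arrive together; compute purchases / impressions
      public static class ConversionReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text productId, Iterable<Text> events, Context context)
            throws IOException, InterruptedException {
          long impressions = 0, purchases = 0;
          for (Text event : events) {
            if ("impression".equals(event.toString())) impressions++;
            else if ("purchase".equals(event.toString())) purchases++;
          }
          if (impressions > 0) {
            context.write(productId, new DoubleWritable((double) purchases / impressions));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "conversion-rate");
        job.setJarByClass(ConversionRateJob.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(ConversionReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // web logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // per-product conversion rates
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }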

Recap: we just processed terabytes of data and calculated conversion rates across millions of products. Note: this is a batch process only; it takes time. You cannot start this process after someone visits your website. How about we generate recommendations in a batch process and serve them in real time?

HBase
- Provides real-time random read/write access over HDFS
- Built on Google's Bigtable design
- Open sourced
- This is not an RDBMS, so no joins; access patterns are generally simple, e.g. get(key), put(key, value)
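
Not from the original deck: a minimal sketch of the get/put access pattern with the HBase Java client, assuming a table named user_profiles with a column family cf1; all names and values are illustrative.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseBasics {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user_profiles"))) {
          // put(key, value): write one cell under row "user1", column cf1:purchases
          Put put = new Put(Bytes.toBytes("user1"));
          put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("purchases"),
                        Bytes.toBytes("{...json...}"));
          table.put(put);

          // get(key): read the row back
          Result result = table.get(new Get(Bytes.toBytes("user1")));
          byte[] purchases = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("purchases"));
          System.out.println(Bytes.toString(purchases));
        }
      }
    }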

An example HBase table (row keys and column qualifiers):

    Row       Cf: ...        Cf: ...
    Row 1     Cf1:qual1      Cf1:qual2
    Row 11    Cf1:qual2      Cf1:qual22     Cf1:qual3
    Row 2     Cf2:qual1
    Row N

- Dynamic column names: no need to define columns upfront
- Both rows and columns are (lexicographically) sorted
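
Not in the original slides: a small sketch of how the sorted row keys are typically exploited with a range scan, assuming the HBase 2.x Java client and an illustrative table name.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("example_table"))) {
          // Because rows are stored in lexicographic order, "Row 1" and "Row 11"
          // come back before "Row 2" in a range scan.
          Scan scan = new Scan().withStartRow(Bytes.toBytes("Row 1"))
                                .withStopRow(Bytes.toBytes("Row 3"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
              System.out.println(Bytes.toString(row.getRow()));
            }
          }
        }
      }
    }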

An example user table:

    Row       Cf: ...
    user1     Cf1:click_history:{actual_clicks_data}                Cf1:purchases:{actual_purchases}
    user11    Cf1:purchases:{actual_purchases}
    user20    Cf1:mobile_impressions:{actual_mobile_impressions}    Cf1:purchases:{actual_purchases}

Note: each row has different columns, so think about this as a hash map rather than a table with rows and columns.

Putting it all together
- Analyze data (MapReduce)
- Generate recommendations (MapReduce)
- Store data in HDFS
- Serve real-time requests (HBase) to web and mobile clients
Do offline analysis in Hadoop, and serve real-time requests with HBase.
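
Not part of the original deck, and not necessarily how Groupon wires this up: a minimal sketch of one common way to hand batch output to the serving layer, a MapReduce job that loads pre-computed recommendations into an HBase table. It assumes tab-separated userId/dealId pairs in HDFS and a table user_recommendations with column family cf1; all names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class RecommendationLoader {

      // Mapper: reads pre-computed "userId<TAB>dealId" pairs from HDFS
      public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          if (fields.length >= 2) {
            context.write(new Text(fields[0]), new Text(fields[1]));
          }
        }
      }

      // Reducer: collects all recommended deals for one user and writes a single HBase row
      public static class HBaseWriter extends TableReducer<Text, Text, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text userId, Iterable<Text> dealIds, Context context)
            throws IOException, InterruptedException {
          StringBuilder joined = new StringBuilder();
          for (Text dealId : dealIds) {
            if (joined.length() > 0) joined.append(',');
            joined.append(dealId.toString());
          }
          Put put = new Put(Bytes.toBytes(userId.toString()));
          put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("recommendations"),
                        Bytes.toBytes(joined.toString()));
          context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "load-recommendations");
        job.setJarByClass(RecommendationLoader.class);
        job.setMapperClass(PairMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Wires the reducer to write Puts into the "user_recommendations" HBase table
        TableMapReduceUtil.initTableReducerJob("user_recommendations", HBaseWriter.class, job);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }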

Use Case: Deal Relevance at Groupon

What are Groupon Deals?

Our Relevance Scenario: Users

Our Relevance Scenario – Users: how do we surface relevant deals?
- Deals are perishable (deals expire or are sold out)
- No direct user intent (as there is in traditional search advertising)
- Relatively limited user information
- Deals are highly local

Two Sides to the Relevance Problem
- Algorithmic issues: how to find relevant deals for individual users given a set of optimization criteria
- Scaling issues: how to handle relevance for all users across multiple delivery platforms

Developing Deal Ranking Algorithms
- Exploring data: understanding signals, finding patterns
- Building models/heuristics: employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
- Conducting experiments: try out ideas on real users and evaluate their effect

Data Infrastructure
- Growing deals, growing users: 100 million+ subscribers
- We need to store data such as user click history, records, service logs, etc.
- This amounts to billions of data points and TBs of data

Deal Personalization Infrastructure Use Cases
- Offline system: deliver personalized emails; personalize billions of emails for hundreds of millions of users
- Online system: deliver a personalized website & mobile experience; personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views

Architecture
- A data pipeline feeds an HBase cluster for the offline system (relevance MapReduce jobs), which is replicated to an HBase cluster for the online system (real-time relevance)
- We can now maintain different SLAs on the online and offline systems
- We can tune each HBase cluster differently for online and offline use

HBase Schema Design

    Row key: User ID (unique identifier for users)
    Column family 1: user history and profile information
    Column family 2: history for users

- Most of our data access patterns are via the "user key"; this makes it easy to design the HBase schema
- The actual data is kept in JSON
- Overwrite user history and profile info; append history for each day as a separate column (on average each row has over 200 columns)
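
Not part of the original slides, and not necessarily the production schema: a minimal sketch of what the overwrite-profile and append-per-day writes could look like with the HBase Java client, assuming a table user_profiles with column families profile and history and JSON-encoded values; all names and payloads are illustrative.

    import java.time.LocalDate;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserHistoryWriter {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user_profiles"))) {
          Put put = new Put(Bytes.toBytes("user1"));
          // Profile info lives under a fixed qualifier, so each write overwrites the previous value
          put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("info"),
                        Bytes.toBytes("{\"city\":\"Palo Alto\",\"categories\":[\"food\"]}"));
          // History is appended as one new column per day: the qualifier carries the date,
          // which is how a row accumulates hundreds of columns over time
          String dayQualifier = "history_" + LocalDate.now();  // e.g. history_2013-06-01
          put.addColumn(Bytes.toBytes("history"), Bytes.toBytes(dayQualifier),
                        Bytes.toBytes("{\"clicks\":[\"deal123\",\"deal456\"]}"));
          table.put(put);
        }
      }
    }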

Cluster Sizing
- Hadoop + HBase cluster: 100+ machine Hadoop cluster that runs heavy MapReduce jobs; the same cluster also hosts a 15-node HBase cluster
- Online HBase cluster (fed via HBase replication): 10-machine dedicated HBase cluster to serve real-time SLAs
- Machine profile: 96 GB RAM (25 GB for HBase), 24 virtual CPU cores, 8 x 2 TB disks
- Data profile: 100 million+ records, 2 TB+ data, over 4.2 billion data points

Questions? Thank You! (We are hiring!)