PARALLELIZING LARGE-SCALE DATA-PROCESSING APPLICATIONS WITH DATA SKEW: A CASE STUDY IN PRODUCT-OFFER MATCHING. Ekaterina Gonina (UC Berkeley), Anitha Kannan, John Shafer, Mihai Budiu (Microsoft Research)

Similar presentations
Experiences with Hadoop and MapReduce

SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans Qifa Ke, Michael Isard, Yuan Yu Microsoft Research Silicon Valley EuroSys 2013.
Clydesdale: Structured Data Processing on MapReduce Jackie.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Monitoring and Debugging Dryad(LINQ) Applications with Daphne Vilas Jagannath, Zuoning Yin, Mihai Budiu University of Illinois, Microsoft Research SVC.
Copyright© 2003 Avaya Inc. All rights reserved Avaya Interactive Dashboard (AID): An Interactive Tool for Mining Avaya Problem Ticket Database Ziyang Wang.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.
Parallel Sorted Neighborhood Blocking with MapReduce. Lars Kolb, Andreas Thor, Erhard Rahm, Database Group Leipzig. Kaiserslautern,
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Large-Scale Content-Based Image Retrieval Project Presentation CMPT 880: Large Scale Multimedia Systems and Cloud Computing Under supervision of Dr. Mohamed.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Face Detection And Recognition For Distributed Systems Meng Lin and Ermin Hodžić 1.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Hadoop Basics -Venkat Cherukupalli. What is Hadoop? Open Source Distributed processing Large data sets across clusters Commodity, shared-nothing servers.
Introduction to Hadoop and HDFS
Training Kinect Mihai Budiu Microsoft Research, Silicon Valley UCSD CNS 2012 RESEARCH REVIEW February 8, 2012.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
An Introduction to HDInsight. June 27th,
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
FREERIDE: System Support for High Performance Data Mining Ruoming Jin Leo Glimcher Xuan Zhang Ge Yang Gagan Agrawal Department of Computer and Information.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Dryad and DryadLINQ. Dryad provides automatic distributed execution; DryadLINQ provides automatic query plan generation.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉 教授 : 許毅然 作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
A Simple Approach for Author Profiling in MapReduce
Hadoop.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
MapReduce: Data Distribution for Reduce
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
Overview of big data tools
Presentation transcript:

PARALLELIZING LARGE-SCALE DATA-PROCESSING APPLICATIONS WITH DATA SKEW: A CASE STUDY IN PRODUCT-OFFER MATCHING Ekaterina Gonina UC Berkeley Anitha Kannan, John Shafer, Mihai Budiu Microsoft Research 1 MapReduce’11 Workshop. June 8, San Jose, CA

Motivation 2
 Enabling factors: web-scale data; cluster computing platforms
 Web-scale data is large and skewed
 MapReduce frameworks do not handle skew well
 Build data-dependent optimizations: adapt the computation to the nature of the data
 Product-offer matching as an example of an application with large (factors of 10^5) data skew

Outline 3  Product-Offer Matching: Application in E-Commerce  Data Skew  Parallel Implementation with DryadLINQ  Algorithm  Three distributed join strategies  Results  Conclusion

Product-Offer Matching 4  Automatic matching  Over a week for 30M offers

Product-Offer Matching 5  Search engines provide e-commerce search services that allow users to sift through catalogs of product information  Merchants’ offers of a product need to be matched to the catalog entry for that product -> product-offer matching

Product-Offer Matching Algorithm 6 1. Offline training phase  Learning per-category matching functions 2. Online matching phase – Match products to offers for each category

Need for Efficient Parallelization 7
 A new offer database is published on a daily basis
 Matching only a subset of offers (1M) takes 3 hours on 3 machines (9 hours sequentially)
 The full set of offers (>30M) would take over a week
 Anticipating growth in the set of offers from merchants and in the product selection in the catalog
Focus: scalable parallelization of the matching algorithm

Offline Training Phase 8 Learn per-category matching functions

Product-Offer Matching – Online Matching Phase 9
 For each offer string with category c:
 Annotate the offer by extracting attributes
 Select a set of products P from c as potential matches
 For each product p in P: compute a matching score based on the attributes, using the learned model
 Return the product(s) p with the maximum score
Example – Offer title: “Panasonic Lumix DMC-FX07B digital camera silver”, category: digital camera.
Product1: Category: Digital Cameras, Brand: Panasonic, Product line: Lumix, Model: DMC-FX07B, Color: black
Product2: Category: Digital Cameras, Brand: Canon, Product line: Powershot, Model: 30M, Color: silver
Annotated offer: Category: Digital Cameras, Brand: Panasonic, Product line: Lumix, Model: DMC-FX07B, Color: silver
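The matching loop above can be sketched in a few lines of Python. This is illustrative only: the real system uses learned per-category matching functions, whereas the `weights` dictionary and the equality-based scoring rule here are simplified stand-ins.

```python
# Sketch of the online matching phase (simplified: score = sum of weights of
# attributes on which the annotated offer and the product agree).

def match_offer(offer_attrs, catalog, weights):
    """Return the catalog product id with the highest weighted
    attribute-agreement score for one annotated offer."""
    def score(product_attrs):
        return sum(w for attr, w in weights.items()
                   if offer_attrs.get(attr) is not None
                   and offer_attrs.get(attr) == product_attrs.get(attr))
    return max(catalog, key=lambda pid: score(catalog[pid]))

# Data mirroring the slide's example (weights are an assumption).
catalog = {
    "product1": {"brand": "Panasonic", "line": "Lumix",
                 "model": "DMC-FX07B", "color": "black"},
    "product2": {"brand": "Canon", "line": "Powershot",
                 "model": "30M", "color": "silver"},
}
weights = {"brand": 3.0, "line": 2.0, "model": 4.0, "color": 1.0}
offer = {"brand": "Panasonic", "line": "Lumix",
         "model": "DMC-FX07B", "color": "silver"}
best = match_offer(offer, catalog, weights)  # -> "product1"
```

Despite the color mismatch, the Panasonic product wins because brand, product line, and model carry most of the weight.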

Matching Set Size Reduction 10 [Chart: matching-job cost = num offers * num products]

Outline 11  Product-Offer Matching: Application in E-Commerce  Data Skew  Parallel Implementation with DryadLINQ  Algorithm  Three distributed join strategies  Results  Conclusion

Parallel Computation Frameworks 12
Layers: Application / Language / Runtime Environment / Storage System
 Parallel databases — language: SQL
 Google stack — language: Sawzall, FlumeJava; runtime: Map-Reduce; storage: GFS, BigTable
 Microsoft stack — language: DryadLINQ, Scope; runtime: Dryad; storage: Cosmos, Azure, SQL Server
 Hadoop stack — language: Pig, Hive; runtime: Hadoop; storage: HDFS, S3
 Offer matching is built on the Dryad/DryadLINQ stack

Nested Parallelization 13  DryadLINQ across machines  Multi-threading across cores [Diagram: node Ni with cores C1-C4, threads T1|T2, shared L3 cache and DRAM] 4 cores/node, 2 threads/core, 8MB L3 cache/node, 16GB DRAM/node

Algorithm outline 14


Algorithm outline 18  Three join strategies: Hash-Partition Join · Broadcast Join · Skew-Adaptive Join

Skew & Scheduling [Chart: schedule of offer matching with no skew adaptation — parallel jobs per machine over time (min)]

Skew-Adaptive Join 20
 Group Join: build lists of offers and products with matching values for the top attributes
 Join groups: distributed join using a deterministic hash-partition
 Dynamically re-partition large groups
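The group-join step above can be sketched as follows, under the assumption that offers and products arrive as (key, record) pairs whose key combines category id and top-attribute values; a deterministic hash (here CRC32, an illustrative choice) routes each joined group to a partition.

```python
# Sketch of the group join: per-key offer and product lists, placed on a
# partition chosen by a deterministic hash of the grouping key.
import zlib
from collections import defaultdict

def group_join(offers, products, num_partitions):
    """Return num_partitions lists of (key, offer_list, product_list)
    groups; only keys present on both sides produce a group."""
    offer_groups, product_groups = defaultdict(list), defaultdict(list)
    for key, rec in offers:
        offer_groups[key].append(rec)
    for key, rec in products:
        product_groups[key].append(rec)
    partitions = [[] for _ in range(num_partitions)]
    for key in offer_groups.keys() & product_groups.keys():
        # deterministic: the same key always lands on the same partition
        dest = zlib.crc32(key.encode()) % num_partitions
        partitions[dest].append((key, offer_groups[key], product_groups[key]))
    return partitions

offers = [("4362_Canon_30", "offer-a"), ("4362_Canon_30", "offer-b"),
          ("4458_Nike_Cotton", "offer-c")]
products = [("4362_Canon_30", "prod-1"), ("9999_Sony_5", "prod-2")]
parts = group_join(offers, products, num_partitions=4)
```

Only keys that appear on both sides form a group, so the unmatched offer and product keys are dropped before any matching work is scheduled.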

Dynamic Repartitioning 21
 After creating the joined lists of offers and products, estimate the skew and redistribute
 Dynamically repartition a group if its work is greater than a threshold
 Split the offer list into several smaller lists
 Replicate the product list
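A minimal sketch of this repartitioning rule, assuming the per-group work is estimated as |offer list| x |product list| (the number of score computations) and that the threshold value is supplied by the caller:

```python
# Sketch of dynamic repartitioning: oversized groups are split into
# sub-groups, each pairing a slice of the offer list with a full replica
# of the product list.

def repartition(key, offer_list, product_list, threshold):
    """Return a list of (key, offers, products) sub-groups whose estimated
    work each stays at or below the threshold."""
    work = len(offer_list) * len(product_list)
    if work <= threshold:
        return [(key, offer_list, product_list)]
    # largest offer-slice size that keeps each sub-group under the threshold
    chunk = max(1, threshold // len(product_list))
    return [(key, offer_list[i:i + chunk], product_list)
            for i in range(0, len(offer_list), chunk)]

small = repartition("k", ["o1", "o2"], ["p1"], threshold=10)   # unchanged
big = repartition("k", [f"o{i}" for i in range(8)],
                  ["p1", "p2", "p3"], threshold=10)            # split in 3
```

Replicating the (usually much smaller) product list is what lets the offer slices proceed independently on different machines.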


Skew & Scheduling 24
4M offers, 256 partitions; # nodes determined at runtime
 Grouping by category: 1 hour 50 min
 Grouping by category and 1 top attribute: 1 hour 40 min
 Grouping by category and 1 top attribute + dynamic repartitioning: 35 min
[Charts: parallel jobs per machine over time (min) for each strategy]

Outline 25  Product-Offer Matching: Application in E-Commerce  Data Skew  Parallel Implementation with DryadLINQ  Algorithm  Three distributed join strategies  Results  Conclusion

Speedup 26 [Chart: sequential vs. parallel runtimes] 2h18m → 2.5 min (55x); 8h06m → 4.4 min (110x); 44h → 25 min (106x); ~78h → 63 min (74x)

Scaling [Chart: runtime for 1M, 4M, and 7M offers]

Conclusion 28
 Parallelization of large-scale data processing: multiple join strategies
 Dynamically estimate skew
 Repartition data based on skew
 Mix of high-level and low-level primitives => can implement custom strategies
 Result: 4 min for 1M offers (9 hours before); 25 min for 7M offers (>3 days before)

29 Thank you! Questions?

30 Backup Slides

Multi-threaded Matching 31
 Each node is assigned a set of matching jobs
 Each job is split among N threads on a single node
 Each thread gets K/N offers (K = #offers in the current job)
 Each thread computes a maximum match for its offers
Multi-core parallelization gave a 3.5x improvement on one category, 1.5x on average (fork-join overhead)
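The per-node threading above can be sketched like this; `best_match` is a hypothetical stand-in for the real per-category scorer, and the toy example scores by numeric closeness just to exercise the split.

```python
# Sketch of multi-threaded matching: one job's K offers split across N
# threads, each finding the best product for its share.
from concurrent.futures import ThreadPoolExecutor

def match_job(offers, products, n_threads, best_match):
    """Return {offer: best matching product}; each thread handles
    roughly K/N offers."""
    step = -(-len(offers) // n_threads)  # ceil(K / N) offers per thread
    slices = [offers[i:i + step] for i in range(0, len(offers), step)]
    results = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        worker = lambda sl: {o: best_match(o, products) for o in sl}
        for partial in pool.map(worker, slices):
            results.update(partial)
    return results

# Toy scorer: the "best" product is the numerically closest one.
closest = lambda offer, products: min(products, key=lambda p: abs(p - offer))
out = match_job([1, 9, 40], [2, 8, 50], n_threads=2, best_match=closest)
```

Because each offer's maximum is independent, the slices need no coordination beyond the final merge, which is why the slide attributes the limited average speedup to fork-join overhead rather than contention.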

Product-Offer Matching System – Online matching phase  Input: offer strings from different merchants (pre-categorized); a database of products  Output: best matching product(s) for each offer string

Online Shopping 33  Over 77% of the U.S. population is now online  80% are expected to make an online purchase in the next 6 months  U.S. online retail spending surpassed $140B in 2010 and is on track to surpass $250B by 2014  80% of online purchases start with search  Search is a key gateway into this huge market

Product Consideration Set Selection 34
 Select a set of products to match to an offer (the offer’s consideration set)
 Baseline: all products in the category the offer was classified to
 Too many offer-product comparisons (~10K)
 Too few actually relevant (10-100)
 Large skew in the distribution of products and offers across categories

Product Consideration Set Selection 35  Algorithmically reduce the size of the consideration set by selecting products that match the offer on the top attributes*  Ex: for an offer with attribute [brand name] and value [“Canon”], only select products with this value. * Top attributes = attributes with the highest weight in the learned per-category matching models
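This filtering step can be sketched directly; the attribute names and product records below are illustrative, not the real catalog schema.

```python
# Sketch of consideration-set selection: keep only products that agree
# with the offer on every top-weighted attribute the offer has a value for.

def consideration_set(offer, products, top_attrs):
    """Filter a category's products down to those matching the offer
    on the given top attributes."""
    def agrees(product):
        return all(offer.get(a) is None or product.get(a) == offer[a]
                   for a in top_attrs)
    return [p for p in products if agrees(p)]

products = [
    {"id": 1, "brand": "Canon", "line": "Powershot"},
    {"id": 2, "brand": "Nikon", "line": "Coolpix"},
    {"id": 3, "brand": "Canon", "line": "EOS"},
]
# As in the slide's example: an offer with brand "Canon" keeps only
# the Canon products.
kept = consideration_set({"brand": "Canon"}, products, top_attrs=["brand"])
```

An offer that lacks a value for a top attribute simply does not constrain the set on that attribute, which keeps the filter safe for sparsely annotated offers.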

Three Join() Operations
1. Broadcast Join – join the big table of offers and products with the small table of category attributes
- Broadcast the small table
2. Distributed Join – regular partitioned join on the table of product attributes and annotated products
- Hash-partition both input sets based on the product-ID key
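The two joins above can be contrasted in a small pure-Python sketch. The data shapes are assumptions (a big table already split into partitions, a small table as a dict), and CRC32 is an illustrative choice of deterministic hash.

```python
# Broadcast join: every partition of the big table joins locally against
# its own full copy of the small table. Distributed join: both inputs are
# hash-partitioned on the key so matching rows meet on the same partition.
import zlib

def broadcast_join(big_partitions, small_table):
    """big_partitions: list of [(key, row), ...]; small_table: {key: row}."""
    return [[(k, row, small_table[k]) for k, row in part if k in small_table]
            for part in big_partitions]

def hash_partition(rows, num_partitions):
    """Deterministically route (key, row) pairs by key."""
    parts = [[] for _ in range(num_partitions)]
    for k, row in rows:
        parts[zlib.crc32(str(k).encode()) % num_partitions].append((k, row))
    return parts

attrs = {"cam": "top-attrs:brand,model", "shoe": "top-attrs:brand,material"}
big = [[("cam", "offer1"), ("tv", "offer2")], [("shoe", "offer3")]]
joined = broadcast_join(big, attrs)
```

Broadcasting avoids moving the big table at all, which is why it suits the small category-attribute table, while the product-ID join repartitions both sides.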

DryadLINQ Parallelization [Diagram: 1. Get Data (Product DB, Offer DB, tagger) → 2. Group → 3. Join & Match, repeated for each category (Category 1, Category 2, …)]

DryadLINQ Parallelization – 1. Get Data
 Data sources: Lincoln-Master-PA, Cosmos, Lincoln tagger
 Offers: get offer attributes based on category; load the Lincoln Tagger; extract numeric and string attributes from the offer title; extract candidate model numbers
 Products: get product attributes based on category; get attributes from the database; annotate product titles; merge based on precedence
 Job = annotating 1 title → millions of jobs
Example: “CGA-S002 Panasonic Lumix DMC-FZ5S 8.2Mp Digital Camera” → Brand Name: Panasonic, Resolution: 8.2, Product Line: Lumix; “CGA-S002 Panasonic Lumix DMC-FZ5S Digital Camera” → MN: CGAS002, MN: DMCFZ5S

DryadLINQ Parallelization – 2. Group
 Group offers and products based on keys: top weighted attributes in each category
 For each offer/product, select values for the top attributes (multiple values for each attribute)
 Extract only the numeric part of the model for the model-number attribute key
 Ex: “4362_Canon_30”, “4458_Nike_Cotton”
 Job distribution on the cluster nodes is based on this key → K groups
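The composite-key construction above can be sketched as follows. Which attributes count as model numbers is an assumption passed in by the caller; for those, only the numeric part of the value is kept, as the slide describes.

```python
# Sketch of grouping-key construction: "<categoryId>_<value>_<value>...".
import re

def group_key(category_id, attr_values, model_attrs=()):
    """Build a key like "4362_Canon_30" from a category id and
    (attribute, value) pairs for the top attributes."""
    parts = [str(category_id)]
    for name, value in attr_values:
        if name in model_attrs:
            value = re.sub(r"\D", "", value)  # keep digits only
        parts.append(str(value))
    return "_".join(parts)

k1 = group_key(4362, [("brand", "Canon"), ("size", 30)])
k2 = group_key(4458, [("brand", "Nike"), ("material", "Cotton")])
k3 = group_key(77, [("model", "DMC-FZ5S")], model_attrs=("model",))
```

Because the key is a plain string, it can be hashed deterministically to assign each group's join-and-match job to a cluster node.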

DryadLINQ Parallelization – 3. Join & Match
 LINQ join: join offer and product groups on keys: categoryId + attribute value(s), e.g. “4362_Canon_30”
 Match offers to products within each group: compute the best match for each offer in the set
 Score = similarity feature vector * weights
 Strict equality for strings; within a threshold for numerics; string-alignment scorer for model numbers
 Multi-threaded implementation
 Apply the matching function on each group
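The per-group scoring rules above can be sketched like this (the model-number string-alignment scorer is omitted; the tolerance and weight values are assumptions, not the learned ones):

```python
# Sketch of the similarity features: strict equality for string attributes,
# a tolerance window for numeric ones; score = feature vector dot weights.

def feature_vector(offer, product, numeric_attrs, tol=0.5):
    """One 0/1 similarity feature per attribute both records share."""
    feats = {}
    for a in offer.keys() & product.keys():
        if a in numeric_attrs:
            feats[a] = 1.0 if abs(offer[a] - product[a]) <= tol else 0.0
        else:
            feats[a] = 1.0 if offer[a] == product[a] else 0.0
    return feats

def match_score(feats, weights):
    """Similarity feature vector dotted with the learned weights."""
    return sum(weights.get(a, 0.0) * v for a, v in feats.items())

offer = {"brand": "Panasonic", "resolution": 8.2}
product = {"brand": "Panasonic", "resolution": 8.0, "color": "silver"}
fv = feature_vector(offer, product, numeric_attrs={"resolution"})
s = match_score(fv, {"brand": 2.0, "resolution": 1.5})
```

The numeric tolerance lets a resolution of 8.2 match a catalog value of 8.0, while string attributes such as brand must agree exactly.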

DryadLINQ Parallelization [Dataflow diagram: data input tables → annotate offers, annotate products → group products → join → match; stages execute as parallel jobs]