eBay Marketplaces, Ming Ma, June 27th, 2013.

Presentation transcript:

Overview
Hadoop at eBay Marketplaces
Availability study
Opportunities ahead

Big eBay Marketplaces
120+ million active users
300+ million search queries every single day
350+ million items available

Data Sets
Inventory data
  – Product listings, catalogue, quantity etc.
Transactional data
  – Buying, returning etc.
User behavioral data
  – Click stream, comments, suggestions, user activities etc.
Customer profiles
  – Buyer, seller, partner information etc.
Machine data
  – Logs, application data etc.

Hadoop at eBay Marketplaces (timeline)
2007: single-digit nodes
2009: Search, 10s of nodes
2010: shared cluster, 100s of nodes, 1000s+ cores, PB, CDH
(year not given): shared clusters, 1000s of nodes, 10,000+ cores, 10s of PB, Wilma (0.20)
2012: shared clusters, 1000s of nodes, 10,000+ cores, 10s of PB
2013: shared clusters, 4k+ nodes, 40,000+ cores, 50s of PB, HDP

Shared vs. Dedicated Clusters
Shared clusters
  – 10s of PB and 10s of thousands of slots per cluster
  – Run HDP 1.2
  – Used primarily for analytics of user behavior and inventory
  – Mix of production and ad-hoc jobs
  – Mix of MR, Hive, Pig, Cascading etc.
  – Hadoop and HBase security enabled
Dedicated clusters
  – Very specific use cases like index building
  – Tight SLAs for jobs (on the order of minutes)
  – Immediate revenue impact
  – Usually smaller than our shared clusters, but still big (100s of nodes…)

Job Distribution by Type

Use Case Examples
Cassini, full re-write of eBay’s search engine:
  – Use MR to build full and incremental near-real-time indexes
  – Data for indexing is stored in HBase for efficient updates and random reads
  – Strong SLAs
  – Run on dedicated clusters
Related and similar items recommendations:
  – Use transactional data, click stream data, search index, etc.
  – Production MR jobs on a shared cluster
Analytics dashboard:
  – Run Mobius MR jobs to join click stream data and transactional data
  – Store summary data in HBase
  – Web application to query HBase

eBay Hadoop Data Platform
Data Ingest: Extract, Load, Validate, Transform
Clients: Java, Scala, Pig, Hive, Cascading, Mobius
Hadoop: Behavioral, Transactional, Inventory
Metadata: Metastore, Type System, Service API
Data Access: Java POJO, Pig UDF, Hive UDF
Tools: ETL, Monitor, Metadata Mgmt, Data Catalog, User Mgmt

Platform Innovation
Many reliability improvements
New security features
  – Multi-realm support
  – Encryption
  – HTTPS in Hadoop 1
Hadoop 2.0
  – MR 1 and YARN binary compatibility
Automation for operations
  – Machine decommission and re-commission process
Data and user management
  – Metadata management
  – User account provisioning

Overview
Hadoop at eBay
Availability study
Next steps

Case study – defective applications
HBase: A test app created heavy write load
  – Test app used all region server RPC threads
  – All RPCs are blocked by region flush
  – RPC requests from production HBase MR job timed out
HDFS: An app created lots of small files inside map tasks
  – NN RPC queue length spiked
  – DN heartbeat RPC can’t be processed
  – HDFS replication storm

Case study – platform bugs
Hadoop:
  – DFSClient.LeaseChecker thread leak in job tracker -> bi-weekly JT restart
  – dfs.datanode.balance.bandwidthPerSec set to 200MB -> big performance impact
JVM:
  – Leap second bug -> all clusters were down at the same time
  – GC setting -> NN full GC happened regularly
OS:
  – “Divide by zero” in CentOS and RH 6.1 -> machine reboot

Case study – cluster maintenance
Code rollout:
  – NN SPOF
  – RPC compatibility between old and new versions
Hadoop configuration change:
  – Likely required Hadoop JVM restart
  – Rolling restart has impact on job latency
  – Datanode rolling restart caused HBase region servers to exit
Machine re-commission:
  – Hadoop version drift
  – OS configuration bug reappeared

Metrics
Definition:
  – Availability = MTBF / (MTBF + MDT), where MTBF is the mean time between failures and MDT is the mean down time
  – Down time includes planned maintenance
Measurement:
  – Synthetic transaction approach
  – Run a regular canary word count MR job
  – Canary job times out in X minutes
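The definition and the canary measurement can be illustrated with a small Python sketch. The function names, the sample durations, and the idea of estimating availability as the fraction of canary runs that finish within the timeout are assumptions made for illustration; the slides do not say how eBay aggregates canary results.

```python
def availability(mtbf, mdt):
    """Availability = MTBF / (MTBF + MDT); both arguments use the same time unit."""
    return mtbf / (mtbf + mdt)


def canary_ok(duration_minutes, timeout_minutes):
    """A canary run counts as a success if it finished within the timeout.

    A duration of None stands for a run that failed outright (assumption).
    """
    return duration_minutes is not None and duration_minutes <= timeout_minutes


def observed_availability(durations, timeout_minutes):
    """Estimate availability as the fraction of successful canary runs."""
    results = [canary_ok(d, timeout_minutes) for d in durations]
    return sum(results) / len(results)


# Hypothetical numbers: 99 hours between failures, 1 hour of down time.
print(availability(99.0, 1.0))
# Hypothetical canary durations in minutes, with a 10-minute timeout.
print(observed_availability([2, 3, None, 12], 10))
```

Because down time includes planned maintenance, a maintenance window during which the canary times out lowers the observed figure just as an unplanned outage would.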

More about metrics
Availability != MTTR (mean time to recover)
  – MTTR is more important for applications like Cassini index build
What is considered “available”?
  – Performance degradation
  – % of live slave nodes
  – Other entry points such as Web UI
  – Core data set availability
  – Multi-tenancy scenario
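A short worked example shows why availability and MTTR are different measures. The outage figures below are invented for illustration: one cluster has a single 10-hour outage and another has ten 1-hour outages over the same 1000-hour period.

```python
def summarize(outage_hours, period_hours):
    """Return (availability, mean time to recover) for a list of outage lengths."""
    downtime = sum(outage_hours)
    availability = (period_hours - downtime) / period_hours
    mttr = downtime / len(outage_hours)
    return availability, mttr


# Hypothetical outage logs over a 1000-hour period.
one_long_outage = summarize([10.0], 1000.0)
many_short_outages = summarize([1.0] * 10, 1000.0)

print(one_long_outage)
print(many_short_outages)
```

Both clusters lose the same 10 hours, so they report the same availability, but the time to recover differs by a factor of ten. A job with an SLA measured in minutes, such as an index build, is affected mainly by how long each outage lasts.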

Ways to improve availability
Automation
  – Use Puppet and daemontools
  – Monitor system health
Redundancy
  – Namenode HA
  – Hot standby region server
Isolation
  – HDFS federation
  – Region server grouping
Congestion control
  – RPC congestion control, HADOOP-9640
  – Apply to both HDFS and HBase
Features to enable “no downtime maintenance”
  – Dynamic configuration update
  – RPC compatibility
  – Better ways to do rolling restart
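The idea behind RPC congestion control can be sketched as a queue that serves callers in turn, so that one heavy caller cannot starve the others. The class below is a minimal illustration only: `RoundRobinCallQueue`, the user names, and the call labels are hypothetical, and this is not the implementation from HADOOP-9640.

```python
from collections import OrderedDict, deque


class RoundRobinCallQueue:
    """Keeps one FIFO queue per user and serves users in round-robin order."""

    def __init__(self):
        self._queues = OrderedDict()  # user -> deque of pending calls

    def put(self, user, call):
        """Enqueue a call on behalf of a user."""
        self._queues.setdefault(user, deque()).append(call)

    def take(self):
        """Return the next (user, call) pair, or None if nothing is pending."""
        if not self._queues:
            return None
        user, pending = next(iter(self._queues.items()))
        call = pending.popleft()
        # Move the user to the back of the rotation if it still has calls.
        del self._queues[user]
        if pending:
            self._queues[user] = pending
        return user, call


queue = RoundRobinCallQueue()
for i in range(3):
    queue.put("test-app", f"write-{i}")  # a heavy caller floods the server
queue.put("prod-job", "read-0")          # a production call arrives last

print(queue.take())
print(queue.take())
```

With a single FIFO queue the production call would wait behind all three writes; here it is served second, which is the kind of protection the defective-application case study calls for.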

Overview
Hadoop at eBay
Availability study
Next steps

Opportunities ahead
More automation
Availability and scalability
  – Hadoop 2.0
  – HBase fast recovery time
Multi-tenancy
  – Run production jobs with strong SLAs in big shared clusters
  – QoS in HDFS and HBase
New scenarios
  – Interactive analysis with SQL
  – Direct Hadoop access from dev machines