© Hortonworks Inc. 2011 MapReduce over snapshots HBASE-8369 Enis Soztutar Enis [at] apache [dot] Page 1.

Presentation transcript:

Page 1: MapReduce over snapshots (HBASE-8369)
Enis Soztutar, enis [at] apache [dot]

Page 2: About Me
In the Hadoop space since 2007
Committer and PMC member in Apache HBase and Hadoop
Working at Hortonworks as a Member of Technical Staff
Architecting the Future of Big Data

Page 3: Snapshots
Currently, a snapshot is a set of reference files together with some metadata
A table snapshot can contain:
– Table descriptor
– List of regions
– References to files in the regions
– References to WALs for the region servers
The current snapshot implementation is flush-based:
– It forces a flush on all regions, so that in-memory data is written to disk
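The snapshot contents listed on this slide can be pictured as a small immutable data structure. The following is only a conceptual sketch in plain Python, not HBase's on-disk format or API; the class names, field names, and file paths are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegionManifest:
    """One region in the snapshot (hypothetical model)."""
    region_name: str
    # References (paths) to store files; the snapshot does not copy the data.
    hfile_refs: Tuple[str, ...]


@dataclass(frozen=True)
class TableSnapshot:
    """Metadata plus file references, as listed on the slide (hypothetical model)."""
    name: str
    table_descriptor: str
    regions: Tuple[RegionManifest, ...]
    wal_refs: Tuple[str, ...] = ()

    def referenced_files(self) -> List[str]:
        """Flatten every store-file reference across all regions."""
        return [ref for region in self.regions for ref in region.hfile_refs]


# Made-up example values, for illustration only.
snapshot = TableSnapshot(
    name="example_snapshot",
    table_descriptor="example_table: families=[f1, f2, f3]",
    regions=(
        RegionManifest("region-a", ("f1/hfile-1", "f2/hfile-2")),
        RegionManifest("region-b", ("f1/hfile-3",)),
    ),
)
```

Freezing the dataclasses mirrors the property the talk relies on later: once taken, a snapshot does not change, so readers can use it without coordinating with writers.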

Page 4: MR over Snapshots
The idea is to run scans on the client side, bypassing the region servers
Use snapshots, since they are immutable
Similar to short-circuit HDFS reads
TableSnapshotInputFormat works similarly to TableInputFormat
TableMapReduceUtil has methods to configure the job

Page 5: Deployment Options
HBase online:
– Take a snapshot while HBase is running
– Run the MR job over the snapshot
HBase offline:
– Take a snapshot while HBase is running
– Export the snapshot to a different HDFS using ExportSnapshot
– Run the MR job over the snapshot, with or without HBase running

Page 6: TableSnapshotInputFormat
Gets a Scan representing the query
Restores the snapshot to a temporary directory
For each region in the snapshot:
– Determine whether the region should be scanned (whether it falls between the scan start row and stop row)
– Create one split per region in the scan range (this determines the number of map tasks)
– Each RecordReader opens the region (HRegion) as HRegionServer does
– An internal RegionScanner is used to run the scan
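The region-selection step above can be sketched as an interval-overlap test. This is a minimal model in plain Python, not the HBase implementation: the Region class, the function names, and the example keys are hypothetical. It assumes half-open key ranges, with an empty key meaning "unbounded", which is the usual HBase convention for region boundaries and scan start/stop rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a region's key range."""
    name: str
    start_key: bytes  # inclusive; b"" means start of table
    end_key: bytes    # exclusive; b"" means end of table


def overlaps(region: Region, scan_start: bytes, scan_stop: bytes) -> bool:
    """True if [start_key, end_key) intersects [scan_start, scan_stop)."""
    starts_before_scan_ends = scan_stop == b"" or region.start_key < scan_stop
    ends_after_scan_starts = region.end_key == b"" or region.end_key > scan_start
    return starts_before_scan_ends and ends_after_scan_starts


def compute_splits(regions: List[Region],
                   scan_start: bytes = b"",
                   scan_stop: bytes = b"") -> List[Region]:
    """One split per region in the scan range; each split becomes one map task."""
    return [r for r in regions if overlaps(r, scan_start, scan_stop)]


# Made-up table with three regions covering the whole key space.
regions = [
    Region("r1", b"", b"g"),
    Region("r2", b"g", b"p"),
    Region("r3", b"p", b""),
]

full_scan = compute_splits(regions)                # all three regions
narrow_scan = compute_splits(regions, b"h", b"k")  # only r2
```

In this model a full-table scan yields one map task per region, while a scan restricted to a narrow row range only opens the regions that can contain matching rows.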

Page 7: API

Page 8: Timeline
Will (hopefully) be committed to trunk within the next week or so
There is interest in bringing this to the 0.94 and 0.96 code bases as well
Will ship in HDP-2.1, which will be based on the 0.96 line

Page 9: Security Aspects
The HBase user owns the files in the filesystem
Snapshot files are also owned by the HBase user
The MapReduce job must be able to read the snapshot files plus the actual data files
HDFS only has POSIX-like permissions based on user/group/other:
– The user running the MR job has to be either the HBase user or have group permissions
– HDFS does not have ACLs, so there is no easy way to grant read access at the filesystem layer
Idea: similar to the current short-circuit implementation, we can implement a file descriptor (FD) transfer:
– The user submits jobs under her own user credentials
– The job asks the HBase daemons to open the files and pass back a handle / token
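The user/group/other constraint described above can be illustrated with a small permission check. This is a sketch of POSIX-style read permission evaluation in plain Python, not HDFS code; the FileStatus class, the can_read function, and the user and group names are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file's ownership and mode bits."""
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o640


def can_read(status: FileStatus, user: str, user_groups: Set[str]) -> bool:
    """POSIX-style check: exactly one class (owner, group, other) applies."""
    if user == status.owner:
        return bool(status.mode & 0o400)
    if status.group in user_groups:
        return bool(status.mode & 0o040)
    return bool(status.mode & 0o004)


# A data file owned by the HBase user, readable by owner and group only.
hfile = FileStatus(owner="hbase", group="hbase", mode=0o640)

hbase_ok = can_read(hfile, "hbase", {"hbase"})           # the owner can read
alice_denied = can_read(hfile, "alice", {"users"})       # falls into "other"
alice_in_group = can_read(hfile, "alice", {"users", "hbase"})  # group read bit
```

With only these three permission classes, the job's user either is the HBase user or must be added to the owning group, which is the limitation that motivates the FD-transfer idea on the slide.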

Page 10: Performance
ScanTest:
– Scan: open a scanner, do a full table scan
– SnapshotScan: open a client-side scanner, do a full table scan
– ScanMR: parallel full table scan from MR
– SnapshotScanMR: do a full table scan
8 region servers, 6 disks each
HBase trunk
Hadoop-2.2 (HDP )
Load data with IntegrationTestBulkLoad:
– Evenly distributed rows, created as bulk-loaded HFiles
3 column families
Number of store files per region varies: 3, 6, 9, and 12 (1, 2, 3, and 4 files per store)
Data sizes: 6.6G, 13.2G, 19.8G, 26.4G
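The figures on this slide are internally consistent, which a few lines of arithmetic can confirm. The sketch below assumes that each column family in a region is one store, and that the four data sizes correspond, in order, to the four files-per-store settings; the second point is a reading of the slide, not something it states. Sizes are kept in tenths of a gigabyte to avoid floating-point rounding.

```python
COLUMN_FAMILIES = 3                           # from the slide
FILES_PER_STORE = [1, 2, 3, 4]                # from the slide
DATA_SIZES_TENTHS_OF_G = [66, 132, 198, 264]  # 6.6G, 13.2G, 19.8G, 26.4G

# One store per column family per region, so store files multiply out.
store_files_per_region = [COLUMN_FAMILIES * n for n in FILES_PER_STORE]

# If the sizes line up with the settings, each extra file per store
# adds the same amount of data.
size_per_step = [
    size // n for size, n in zip(DATA_SIZES_TENTHS_OF_G, FILES_PER_STORE)
]
```

Under that reading, the data set grows by 6.6G for every additional file per store, so the test varies the data volume and the number of store files together.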

Page 11: Scan speed

Page 12: API
We do not want to limit snapshot scanning to MapReduce only
Allow client-side scanners over snapshot files

Page 13: ResultScanner is the main scan API

Page 14: API (caution: not final yet)

Page 15: To the future and beyond
HBASE-8691: High-Throughput Streaming Scan API
Can we bypass region servers without taking snapshots?
Bypass memstore data, or stream memstore data, but read directly from HFiles
Secure reading from snapshots
Keep up with the updates at –

Page 16: Thanks
Questions?
Enis Söztutar, enis [ at ] apache [dot]