HBase Coprocessor to Index Columns into ElasticSearch Cluster Dibyendu Bhattacharya Architect – Big Data Analytics HappiestMinds.

Presentation transcript:

HBase Coprocessor to Index Columns into ElasticSearch Cluster Dibyendu Bhattacharya Architect – Big Data Analytics HappiestMinds

About HappiestMinds: Next Gen IT consultancy company launched Aug. Head office in Bangalore, India, with offices in USA, UK, Canada, Australia and Singapore. Core focus on disruptive technologies like Big Data/Analytics, Cloud, Mobile and Social. Raised USD 45M Series A funding from prominent VCs, Intel Capital, Canaan Partners and founders. Client Globally, Employees. About Myself: Dibyendu is a Big Data Architect at HappiestMinds, where he is involved in architecting and developing solutions on a Hadoop-based analytics and search platform. In the past few years, he has worked on complex data analytics projects that use Hadoop, HBase, and real-time analytics. Before HappiestMinds, he worked at EMC, FairIsaac, Cisco, IBM and others.

This Presentation: explores the design and the challenges HappiestMinds faced while implementing a storage and search infrastructure for a library procurement system in which records for books, documents and artifacts are stored in Apache HBase. After a bulk insert of book records into HBase, the Elasticsearch index is built offline using MapReduce, but in certain use cases the records need to be re-indexed in Elasticsearch using RegionObserver coprocessors.

Storing and Indexing Book Records from Publishers and Libraries. Components: Publisher/Library Data, HDFS, HBase Cluster, ElasticSearch Cluster. Data Pre-Processing: data ingestion to Hadoop. Data Loading: MapReduce bulk data upload to the HBase table. (3) Data Indexing: MapReduce incremental data indexing to ElasticSearch; only part of the document is indexed. (4) User Search: the user searches data, the search engine displays results, and a request for the full data is fetched from HBase. (5, 5a, 5b) User Update: the user updates the HBase record, and the update propagates to the search cluster.

HBase Write Path

HBase Storage Layout: Region Server

HBase Put Request

Here Come the Coprocessors: The idea of HBase coprocessors was inspired by Google's Bigtable coprocessors. HBase coprocessors are an addition to the data-manipulation toolset and were introduced as a feature in HBase in the release. With coprocessors, we can push arbitrary computation out to the HBase nodes hosting the data. Coprocessors can be loaded globally on all tables and regions hosted by the region server, or the administrator can specify, on a per-table basis, which coprocessors should be loaded on all regions of a table.

Coprocessor Classes and Interfaces. The Coprocessor interface: all user code must implement it. The CoprocessorEnvironment interface: retains state across invocations. The CoprocessorHost interface: ties the state and the user code together.

Observer Coprocessors. There are two types of coprocessor: observers, which are like triggers in conventional databases, and endpoints, which are dynamic RPC endpoints that resemble stored procedures. Observer coprocessors provide callback functions (hooks) for every explicit API method. MasterObserver hooks into the HMaster API. RegionObserver hooks into region-related operations. WALObserver hooks into write-ahead log operations.

RegionObserver Coprocessor: Put(). RegionObserver provides hooks for data manipulation events such as Get, Put, Delete and Scan. There is an instance of a RegionObserver coprocessor for every table region, and the scope of the observations it can make is constrained to that region.
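
The trigger-like behaviour of the Put hooks can be illustrated with a small model. This is only a sketch in plain Java and does not use the HBase API: `ToyObserver`, `ToyRegion` and `IndexingObserver` are hypothetical names invented here, and the real RegionObserver callbacks take HBase-specific arguments that are not reproduced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a RegionObserver: callbacks around a put.
interface ToyObserver {
    void prePut(String row, String value);
    void postPut(String row, String value);
}

// Hypothetical stand-in for one table region. Each region owns its observers,
// so an observer only ever sees writes made to that region.
class ToyRegion {
    private final Map<String, String> rows = new HashMap<>();
    private final List<ToyObserver> observers = new ArrayList<>();

    void register(ToyObserver observer) {
        observers.add(observer);
    }

    void put(String row, String value) {
        for (ToyObserver o : observers) {
            o.prePut(row, value);   // runs before the write is applied
        }
        rows.put(row, value);
        for (ToyObserver o : observers) {
            o.postPut(row, value);  // runs after the write is applied
        }
    }

    String get(String row) {
        return rows.get(row);
    }
}

// Observer that records what it would forward to a search index after each put.
class IndexingObserver implements ToyObserver {
    final List<String> indexed = new ArrayList<>();

    @Override
    public void prePut(String row, String value) {
        // Nothing to do before the write in this sketch.
    }

    @Override
    public void postPut(String row, String value) {
        indexed.add(row + "=" + value);
    }
}
```

Registering an `IndexingObserver` on a `ToyRegion` and calling `put` leaves one entry in `indexed` per write, which is the point at which a real coprocessor would call the search cluster.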

RegionObserver Coprocessor: Get()

Let us see what ElasticSearch is

Distributed Search Engine: ElasticSearch. A distributed, highly available, REST-based search engine built on top of Lucene. Designed to speak JSON (JSON in, JSON out). For each index you can specify the number of shards (each index has a fixed number of shards) and the number of replicas (each shard can have zero or more replicas, and this number can be changed dynamically).
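
One way to see why the shard count is fixed per index: a document is routed to a shard by hashing its routing value (by default the document id) modulo the number of primary shards. The sketch below assumes `String.hashCode()` as a stand-in for the hash function Elasticsearch actually uses, and `ShardRouting` is a hypothetical name.

```java
// Minimal model of hash-based shard routing (not the Elasticsearch implementation).
class ShardRouting {
    // Stand-in hash: Elasticsearch applies its own hash function to the routing value.
    static int shardFor(String docId, int numberOfShards) {
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }
}
```

With this model the same id can land on a different shard once the shard count changes, so existing documents would no longer be found where the router looks for them. Replicas are copies of a shard and do not affect routing, which is why their number can change freely.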

ElasticSearch: Automatic Discovery. The discovery module is responsible for discovering nodes within the cluster and for electing the master node. The master node maintains the global cluster state and reacts when nodes join or leave the cluster by reassigning shards.

ElasticSearch : Talking to Cluster

ElasticSearch : Nodes are Different

The idea is to perform indexing into ElasticSearch from HBase coprocessors.

We need a Java Client. Use the ElasticSearch Transport Client: the Transport Client connects remotely to an ElasticSearch cluster. It does not join the cluster, but simply gets one or more initial transport addresses and communicates with them in round-robin fashion on each action (though most actions will probably be "two hop" operations).

And Index with Transport Client…

But this approach has a problem. The client has no knowledge of the ElasticSearch cluster. Indexing takes two hops. There is no fault-tolerance mechanism if a transport address is down. An HBase region server can host hundreds of regions and would therefore hold hundreds of transport clients. Solution: use the ElasticSearch Node Client. A client node does not hold any index data but has knowledge of the complete cluster. Use HBASE-6505 to share one Node Client across the regions of a RegionServer.

HBASE-6505: RegionCoprocessorEnvironment provides a getSharedData() method, which returns a ConcurrentMap. The map is held by the RegionCoprocessorHost as a weak reference (in a special map with strongly referenced keys and weakly referenced values) and held strongly by the RegionEnvironment. That way, if the coprocessor is blacklisted, the coprocessor's environment is removed and any shared data is immediately available for garbage collection. This shared data is per RegionServer. As long as at least one region observer or endpoint is active, the shared data is not garbage collected and can be used to share state between the remaining coprocessors of the same class.
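
The sharing pattern this enables can be sketched in plain Java. The code below is a model under stated assumptions, not HBase or Elasticsearch code: `ToySearchClient`, `ToyRegionCoprocessor` and the key `"search-client"` are hypothetical, and the `shared` parameter simply plays the role of the map returned by getSharedData(). The weak-reference behaviour is not modelled.

```java
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical placeholder for an expensive search client (not the Elasticsearch API).
class ToySearchClient {
    // Counts how many clients were ever constructed, to make sharing observable.
    static final AtomicInteger created = new AtomicInteger();

    ToySearchClient() {
        created.incrementAndGet();
    }
}

// Hypothetical per-region coprocessor: one instance exists for every region.
class ToyRegionCoprocessor {
    static final String CLIENT_KEY = "search-client"; // assumed key name

    private ToySearchClient client;

    // Called once per region when the coprocessor starts.
    // The first region to arrive creates the client; later regions reuse it.
    void start(ConcurrentMap<String, Object> shared) {
        client = (ToySearchClient) shared.computeIfAbsent(CLIENT_KEY, k -> new ToySearchClient());
    }

    ToySearchClient client() {
        return client;
    }
}
```

When the shared map is a `java.util.concurrent.ConcurrentHashMap`, `computeIfAbsent` runs the factory at most once per key, so hundreds of regions starting concurrently would still end up with a single client in this model.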

Shared Node Client across Regions

The Final Problem: Concurrency Control. HBase solves it using MVCC (Multi-Version Concurrency Control): updates are implemented not by deleting an old piece of data and overwriting it with a new one, but by marking the old data as obsolete and adding a newer version. ElasticSearch uses OCC (Optimistic Concurrency Control): it assumes that multiple transactions can complete without affecting each other, so transactions proceed without locking the data resources they affect. Before committing, each transaction verifies that no other transaction has modified its data. If the check reveals conflicting modifications, the committing transaction rolls back.
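
The optimistic check can be modelled in a few lines. The sketch below mimics externally supplied version numbers, where a write is accepted only if its version is greater than the one already stored. `ToyVersionedIndex` is a hypothetical class, and it returns `false` on a conflict where a real Elasticsearch client would report a version conflict error.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of optimistic concurrency control with caller-supplied versions.
class ToyVersionedIndex {
    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> docs = new HashMap<>();

    // Accepts the write only if the supplied version is newer than the stored one.
    // No lock is held between a client's read and its later write.
    synchronized boolean index(String id, String doc, long externalVersion) {
        Long current = versions.get(id);
        if (current != null && externalVersion <= current) {
            return false; // conflict: an equal or newer version is already indexed
        }
        versions.put(id, externalVersion);
        docs.put(id, doc);
        return true;
    }

    synchronized String get(String id) {
        return docs.get(id);
    }

    synchronized long version(String id) {
        return versions.getOrDefault(id, 0L);
    }
}
```

A stale writer that still carries version 1 after version 2 has been indexed is rejected, and the stored document is left untouched.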

Let's See a Conflict: Search and Update (diagram). Panel 1: HBase, ES, clients C1 and C2; versions V1, V1 (M/R). Panel 2: HBase, ES, clients C1 and C2; versions V1, V2 (update success), V2 (CP), V1 (M/R); conflict.

One More Conflict: Search and Update (diagram). Panel 1: HBase, ES, clients C1 and C2; versions V1, V1 (M/R). Panel 2: HBase, ES, clients C1 and C2; versions V1, V2 (M/R); conflict.

The bottom line: a search-and-update should succeed only when the ElasticSearch version and the HBase version are the same at the time of the update.

Solution. 1. Data load from the source to HBase inserts a document with a Put call. 2. The postPut coprocessor performs incrementColumnValue on a version column.

Solution. 3. The same version number is propagated to ElasticSearch during MapReduce-based bulk indexing; ElasticSearch supports externally supplied version numbers. 4. Steps 1-3 repeat for any new data upload. 5. During search and update, the client performs a checkAndPut() call. 5i. The client performs a search and gets the version number from ElasticSearch. 5ii. The client constructs a Put with new version = old version + 1. 5iii. The client performs checkAndPut, which checks for the old version number before doing the Put. 5iv. The postCheckAndPut coprocessor is invoked to propagate the successful Put to the search cluster. 5v. After this step, the version number of the HBase column and the ElasticSearch version are equal.
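
The steps above can be walked through with a small in-memory model. This is a sketch under stated assumptions rather than the presenter's code: `ToyBookStore` and its methods are hypothetical, both stores are plain maps, and the atomicity that checkAndPut gives on a single HBase row is approximated with `synchronized`.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the versioning scheme: an "HBase" side and a "search" side.
class ToyBookStore {
    private final Map<String, String> hbaseValue = new HashMap<>();
    private final Map<String, Long> hbaseVersion = new HashMap<>();
    private final Map<String, Long> searchVersion = new HashMap<>();

    // Steps 1-2: a Put stores the record, then the post-put hook increments the version column.
    synchronized void load(String row, String value) {
        hbaseValue.put(row, value);
        hbaseVersion.merge(row, 1L, Long::sum);
    }

    // Step 3: bulk indexing copies the HBase version number into the search index.
    synchronized void bulkIndex() {
        searchVersion.putAll(hbaseVersion);
    }

    // Step 5i: the client reads the version it sees in the search index.
    synchronized long searchVersion(String row) {
        return searchVersion.getOrDefault(row, 0L);
    }

    // Steps 5ii-5v: write only if HBase still holds the version the client saw,
    // then propagate the new version to the search index.
    synchronized boolean searchAndUpdate(String row, long versionSeen, String newValue) {
        if (hbaseVersion.getOrDefault(row, 0L) != versionSeen) {
            return false; // another update got in first; the client must search again
        }
        long newVersion = versionSeen + 1;
        hbaseValue.put(row, newValue);
        hbaseVersion.put(row, newVersion);
        searchVersion.put(row, newVersion); // role of the post-checkAndPut hook
        return true;
    }

    synchronized long hbaseVersion(String row) {
        return hbaseVersion.getOrDefault(row, 0L);
    }

    synchronized String value(String row) {
        return hbaseValue.get(row);
    }
}
```

If two clients both read version 1 from the search side, the first `searchAndUpdate` succeeds and moves both stores to version 2, while the second is rejected because HBase no longer holds version 1.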

Solution

Thanks