Analysis of Lucene Index on HBase in an HPC Environment


Analysis of Lucene Index on HBase in an HPC Environment
Prerna Shraff, Anand Hegde

Concept
BigTable: a compressed, high-performance database system built on GFS, the Chubby lock service, SSTable, etc.
HBase: the Hadoop database. It is open source, distributed, versioned, and column-oriented, and it is modeled after BigTable.
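To make "versioned" and "column-oriented" concrete, the following is a minimal in-memory sketch of the BigTable-style data model, in which a value is addressed by row, column, and timestamp. The class name VersionedTable and the sample row and column names are hypothetical, and this toy does not use or imitate the real HBase client API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a BigTable-style map: (row, column, timestamp) -> value."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def row_keys(self):
        """Row keys in sorted order, mirroring the sorted-map layout."""
        return sorted({row for row, _ in self._cells})


table = VersionedTable()
table.put("book2", "text:title", 1, "Second book")
table.put("book1", "text:title", 1, "Draft title")
table.put("book1", "text:title", 2, "Final title")
latest = table.get("book1", "text:title")
older = table.get("book1", "text:title", timestamp=1)
```

Here `latest` is the value written at timestamp 2, while `older` is the value that was current at timestamp 1.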

Outline
Data-intensive computing requires storage solutions for huge amounts of data.
The requirement is to host very large tables on clusters of commodity hardware.
HBase helps fulfill this requirement by providing BigTable-like capabilities on top of Hadoop.

The Idea
Existing work in this field includes an experiment using a Lucene index on HBase in an HPC environment (Xiaoming Gao, Vaibhav Nachankar, Judy Qiu).
Our aim is to expand the scope of that project and to evaluate its performance in terms of additional parameters.

Architecture

Implemented solution
The solution uses an inverted index built with Apache Lucene.
A forward index maps a document to its terms: doc1 -> “cloud”.
An inverted index maps a term to the documents that contain it: “cloud” -> doc1.
Apache Lucene, which supports full-text search, was used to implement the inverted indices.
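The difference between the two index directions can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene itself: the function names, the simple tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a simplistic analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(docs):
    """Forward index: document id -> list of terms in that document."""
    return {doc_id: tokenize(text) for doc_id, text in docs.items()}


def build_inverted_index(forward):
    """Inverted index: term -> set of document ids containing the term."""
    inverted = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted)


def search(inverted, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    "doc1": "Cloud storage for big data",
    "doc2": "HPC cluster and cloud computing",
}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
hits = search(inverted, "cloud computing")
```

With the inverted index, a full-text query only needs to look up the rows for its terms instead of scanning every document.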

Implemented Design
The existing design has separate tables for book images, book texts, and Lucene indices.
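One way to picture the three-table layout is the sketch below, which models each table as a dictionary keyed by row key. The slides do not give the schema, so everything here is an assumption for illustration: the column names ("content:text", "content:image", "frequencies:<book id>"), the choice of the term as the index row key, and the stored term frequency are all hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase alphanumeric term occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_tables(books):
    """Split books into three separate tables: texts, images, and index.

    books: {book_id: {"text": str, "image": bytes}}
    Each table maps a row key to a dict of "family:qualifier" -> value.
    """
    text_table, image_table, index_table = {}, {}, {}
    for book_id, book in books.items():
        text_table[book_id] = {"content:text": book["text"]}
        image_table[book_id] = {"content:image": book["image"]}
        # Assumed index layout: one row per term, one column per book.
        for term, freq in term_frequencies(book["text"]).items():
            index_table.setdefault(term, {})["frequencies:" + book_id] = freq
    return text_table, image_table, index_table


def lookup(index_table, term):
    """Return {book_id: frequency} for a term, or an empty dict if it is absent."""
    row = index_table.get(term.lower(), {})
    return {qualifier.split(":", 1)[1]: freq for qualifier, freq in row.items()}


# Hypothetical sample data.
books = {
    "book1": {"text": "cloud cloud storage", "image": b"\x89PNG"},
    "book2": {"text": "cloud computing", "image": b"\xff\xd8"},
}
text_table, image_table, index_table = build_tables(books)
cloud_hits = lookup(index_table, "cloud")
```

Keeping the index in its own table means a term lookup reads one index row and never touches the larger text or image rows.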

System Implementation

Initial Analysis
The experiment was performed on the Alamo HPC cluster of FutureGrid.
It was conducted with 5 books.
Total terms evaluated: 8263

Initial Data Analysis

Initial Data Analysis

Proposed Work
To test with a larger number of data sets.
To test on different clusters, such as India on FutureGrid.
To test with different numbers of HDFS data nodes.
To test with more client nodes and different numbers of client queries.

Obstacles we may face
Hardware differs from cluster to cluster, and performance will differ accordingly.
Problems may occur when the number of data nodes is increased or decreased.
Important items to consider are the switching capacity of the device, the number of systems connected, and the uplink capacity.
Finding an appropriate number of data sets.

References
HBase: http://hbase.apache.org/book.html#ops_mgt
BigTable: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/bigtable-osdi06.pdf
Xiaoming Gao, Vaibhav Nachankar, Judy Qiu. Experiment using Lucene Index on HBase in an HPC Environment.

Thank You