Supporting Large-scale Social Media Data Analyses with Customizable Indexing Techniques on NoSQL Databases.

Slides:



Advertisements
Similar presentations
Inner Architecture of a Social Networking System Petr Kunc, Jaroslav Škrabálek, Tomáš Pitner.
Advertisements

© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Spark: Cluster Computing with Working Sets
Presented by Vigneshwar Raghuram
Supporting Queries and Analyses of Large- Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL databases Thesis Proposal.
Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
A Social blog using MongoDB ITEC-810 Final Presentation Lucero Soria Supervisor: Dr. Jian Yang.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HADOOP ADMIN: Session -2
Project By: Anuj Shetye Vinay Boddula. Introduction Motivation HBase Our work Evaluation Related work. Future work and conclusion.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Systems analysis and design, 6th edition Dennis, wixom, and roth
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
IT The Relational DBMS Section 06. Relational Database Theory Physical Database Design.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
Appraisal and Data Mining of Large Size Complex Documents Rob Kooper, William McFadden and Peter Bajcsy National Center for Supercomputing Applications.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Hadoop Basics -Venkat Cherukupalli. What is Hadoop? Open Source Distributed processing Large data sets across clusters Commodity, shared-nothing servers.
Introduction to Hadoop and HDFS
VLDB2012 Hoang Tam Vo #1, Sheng Wang #2, Divyakant Agrawal †3, Gang Chen §4, Beng Chin Ooi #5 #National University of Singapore, †University of California,
Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
Data storing and data access. Plan Basic Java API for HBase – demo Bulk data loading Hands-on – Distributed storage for user files SQL on noSQL Summary.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu.
Large-scale Incremental Processing Using Distributed Transactions and Notifications Daniel Peng and Frank Dabek Google, Inc. OSDI Feb 2012 Presentation.
© Pearson Education Limited, Chapter 13 Physical Database Design – Step 4 (Choose File Organizations and Indexes) Transparencies.
Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
SALSASALSA Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data 1 Thesis Defense Xiaoming Gao Advisor: Prof. Judy Qiu 01/21/2015.
Spatial Tajo Supporting Spatial Queries on Apache Tajo Slideshare Shorten URL : goo.gl/j0VLXpgoo.gl/j0VLXp.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Indexes and Views Unit 7.
By Vaibhav Nachankar Arvind Dwarakanath.  HBase is an open-source, distributed, column- oriented and sorted-map data storage.  It is a Hadoop Database;
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
Database Indexing 1 After this lecture, you should be able to:  Understand why we need database indexing.  Define indexes for your tables in MySQL. 
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
CSC590 Selected Topics Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Supporting Queries and Analyses of Large- Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL databases Xiaoming Gao,
Nov 2006 Google released the paper on BigTable.
NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.
Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.
Integrating Big Data into the Computing Curricula 02/2015 Achmad Benny Mutiara
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Apache Accumulo CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
1 Divya Jain Oct 10 th, 2014 Big Data Products: Where do I start?
CSCI-235 Micro-Computers in Science Databases. Database Concepts Data is any unorganized text, graphics, sounds, or videos A database is a collection.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Storage and File Organization
CS 405G: Introduction to Database Systems
Indexing Structures for Files and Physical Database Design
CS122B: Projects in Databases and Web Applications Winter 2017
Storage and Indexes Chapter 8 & 9
CLOUDERA TRAINING For Apache HBase
Physical Database Design and Performance
Database Implementation Issues
Database Management System
Big DATA.
Presentation transcript:

Supporting Large-scale Social Media Data Analyses with Customizable Indexing Techniques on NoSQL Databases

Large data size: - TBs/PBs of historical data, 10s of millions of streaming social updates per day - Fine-grained social update records, mostly written-once-read-many Focus on data subsets about specific social events/activities - Require efficient queries about social events Workflows contain algorithms of different computation/communication patterns: - Require dynamic adoption of different processing frameworks: Hadoop, Harp, Giraph, etc. Motivation - characteristics of social media data analysis

Motivation – addressing the requirements NoSQL databases as possible storage solution: - Scalable storage and access speed for fine-grained data records - Varied level of indexing support - Index structures not customizable enough for queries in different applications Our contribution: - A general customizable indexing framework on NoSQL databases - Extended analysis stack based on YARN - Queries and post-query analysis algorithms as building blocks for workflows - Usage of customized index structures in both queries and analysis algorithms

Customizable indexing framework over NoSQL databases Abstract index structure - A sorted list of index keys. - Each key associated with multiple entries sorted by unique entry IDs. - Each entry contains multiple additional fields. - Index configuration file defines what to contain in keys and entries Field2 Entry ID Field1 Field2 Entry ID Field1 Field2 key1 … Field2 Entry ID Field1 Field2 key2 Entry ID Field1 Field2 Entry ID Field1 Field2 Entry ID Field1 Field2 key3 Entry ID Field1 Field2 key4 tweets text full-text textIndex {source-record}.id {source-record}.created_at users snameIndex iu.pti.hbaseapp.truthy.UserSnameIndexer

Demonstration of Customizability Sorted single dimensional index … … tweet id Multi-dimensional index time Inverted index for text retrieval applications doc id frequency doc id frequency enjoy doc id frequency great Multi-dimensional index involving both text and non-text fields tweet id time tweet id time #occupy tweet id time #oscar … … Suitable for social queries about hashtags etc. with time constraints Other possible extensions: multi-key index similar to MongoDB, geospatial indexes using geohashes as keys

Implementation on HBase Mapping between abstract structure and NoSQL storage units: -HBase designed for real-time updates of single column values – efficient access of individual index entries -Data storage with HDFS – reliable and scalable index data storage -Region split and dynamic load balancing – scalable index read/write speed -Data stored under the hierarchical order of - efficient range scan of index keys and entry IDs Field2 Entry ID Field1 Field2 Entry ID Field1 Field2 enjoy Field2 Entry ID Field1 Field2 great entries enjoy Filed2 Text Index Table Entry ID Field1 Filed2 Entry ID Field1 Filed great Filed2 Entry ID Field1 Filed2 Text Index

Analysis stack based on YARN Building blocks: - Parallel query evaluation using Hadoop MapReduce - Related hashtag mining algorithm using Hadoop MapReduce: - Meme daily meme frequency using MapReduce over index tables - Parallel force-directed graph layout algorithm using Twister (Harp) iterative MapReduce

Evaluation with Truthy Applications Scalable indexing: Two months’ data loading for varied cluster sizeOne day’s data loading for fixed cluster size at 8 Queries and analysis algorithms: Scalability of iterative graph layout algorithm on Twister