Storage Systems for Managing Voluminous Data

Presentation transcript:

Storage Systems for Managing Voluminous Data
CS455: Introduction To Distributed Systems
Meghana Santhapur, Pratyusha Reddy Degapudi, Rasika Warade
Department Of Computer Science

WHY IS THIS PROBLEM IMPORTANT?
- The Big Data universe is beginning to explode
- Large volumes of data must be stored and managed efficiently
- A storage system has to be selected for each particular use case
So where do we see this explosion?
Source: http://www.nbnnews.com/NBN/issues/2011-12-05/Sales+and+Marketing/index.html

PROBLEM CHARACTERIZATION

               April 2009                    Current
Total          15 billion photos             65 billion photos
               60 billion images             260 billion images
               1.5 PB                        20 PB
Upload rate    220 million photos per week   1 billion photos per week
               25 TB                         60 TB
Serving rate   550,000 images per sec        1 million images per sec

Photo storage systems: Block storage? File storage? Object storage?
Source: Facebook Engineering research group
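
A few derived numbers make the figures above easier to compare. The sketch below only divides the quoted totals; it assumes decimal units (1 PB = 10^15 bytes), and the dictionary layout and the helper name `derived` are illustrative choices, not part of the original slides.

```python
# Figures copied from the slide; decimal units are an assumption.
PB = 10**15
TB = 10**12

snapshots = {
    "April 2009": {"photos": 15e9, "images": 60e9, "bytes": 1.5 * PB,
                   "weekly_photos": 220e6, "weekly_bytes": 25 * TB},
    "Current":    {"photos": 65e9, "images": 260e9, "bytes": 20 * PB,
                   "weekly_photos": 1e9, "weekly_bytes": 60 * TB},
}

def derived(s):
    """Return simple ratios computed from one column of the slide."""
    return {
        # stored images per uploaded photo
        "images_per_photo": s["images"] / s["photos"],
        # average stored image size in kilobytes
        "avg_image_kb": s["bytes"] / s["images"] / 1e3,
        # average weekly upload volume per photo in kilobytes
        "avg_upload_kb": s["weekly_bytes"] / s["weekly_photos"] / 1e3,
    }

for name, s in snapshots.items():
    print(name, derived(s))
```

Both columns give four stored images per photo, and the average stored image is small (tens of kilobytes), which is why per-file metadata cost matters so much at this scale.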

TRADE-OFF SPACE FOR SOLUTIONS
- Thousands of files in each directory cost more than 10 disk operations per image read
- Directory size reduced to hundreds of images, bringing this down to 3 disk operations
- File handles cached in the photo servers
- File handles of every image kept in memcache
- Caches and disk operations did not decrease, leaving overhead on metadata
OBJECT STORAGE

DOMINANT APPROACHES
Can you find a needle in a haystack?
FACEBOOK HAYSTACK STORE LAYOUT
- It maintains an in-core (in-memory) index for all photos
- This eliminated unnecessary metadata
Source: Facebook Engineering research group
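
The idea of an in-memory index over one large file can be sketched in a few lines. This is a toy illustration under stated assumptions, not Facebook's implementation: the class name `TinyHaystack`, the `put`/`get` methods, and the plain dictionary index are all hypothetical, and real concerns such as checksums, deletion, and index recovery are left out.

```python
import os
import tempfile

class TinyHaystack:
    """Toy append-only store: many photos in one file, index held in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # photo_id -> (offset, size)
        open(path, "ab").close()   # create the store file if it is missing

    def put(self, photo_id, data):
        """Append the photo bytes and remember where they landed."""
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        """Look up the location in memory, then do one seek and one read."""
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)

def demo():
    """Store two photos in a temporary file and read the second one back."""
    with tempfile.TemporaryDirectory() as d:
        store = TinyHaystack(os.path.join(d, "haystack.dat"))
        store.put(1, b"first-photo-bytes")
        store.put(2, b"second")
        return store.get(2)
```

Because the lookup never touches the file system's own directory or inode metadata for each photo, a read costs a single disk access once the store file is open.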

DOMINANT APPROACHES (Contd…)
MObStor/DORA
- DORA provides the backend service to MObStor
- Elimination of metadata from object storage
- Use of data locators
HBase
- Runs on top of the Hadoop framework
- Vector data is converted to Well-Known Binary (WKB) and Well-Known Text (WKT)
- Can handle spatial images
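
To make the WKB and WKT conversion concrete, here is a minimal sketch that encodes a single 2D point both ways using only the Python standard library. It does not use any HBase API; the function names are hypothetical, and only the Point geometry type is handled.

```python
import struct

def point_to_wkt(x, y):
    """Well-Known Text: a human-readable string."""
    return f"POINT ({x} {y})"

def point_to_wkb(x, y):
    """Well-Known Binary for a 2D point.

    Layout: 1 byte for byte order (1 = little-endian), a 4-byte unsigned
    geometry type (1 = Point), then x and y as 8-byte doubles.
    """
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(blob):
    """Decode a WKB point, honoring the byte-order flag in the first byte."""
    endian = "<" if blob[0] == 1 else ">"
    _, geom_type, x, y = struct.unpack(endian + "BIdd", blob)
    if geom_type != 1:
        raise ValueError("only Point geometries are handled in this sketch")
    return x, y

wkb = point_to_wkb(1.0, 2.0)
print(point_to_wkt(1.0, 2.0), len(wkb), wkb_to_point(wkb))
```

The binary form is compact and fixed-width for points, which suits a byte-oriented store, while the text form is convenient for debugging and interchange.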

INSIGHTS GLEANED
Object Storage
- Ignores the file system and puts everything in a bucket
- A lot faster than file systems
- Performance does not degrade as the cluster grows
Haystack Store
- The old NFS infrastructure was replaced because it carried more file system metadata
- Allows storage of multiple photos in a single file with less metadata
MObStor/DORA
- Supported high request rates for operational storage
- Added features to do object storage on cheaper systems

WHAT THE PROBLEM SPACE WOULD LOOK LIKE IN THE FUTURE

TRADE-OFF SPACE AND SOLUTIONS IN THE FUTURE
- Currently, object storage is the smartest solution for voluminous data
- Scaling in a less expensive way suggests open-source programs
- Storage systems may support both object and block storage, or perhaps file storage as well
- Metadata can be further reduced by using a dynamic data structure to locate the servers responsible for the data
Which is the world's largest haystack possible?
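
The slides do not say which dynamic data structure they have in mind for locating servers. One common option is a consistent-hash ring, shown below purely as an assumption: the class name `HashRing`, the server names, and the `vnodes` parameter are all hypothetical. The point of the sketch is that a client can compute the responsible server from the object ID alone, with no per-object location metadata.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping object IDs to server names."""

    def __init__(self, servers, vnodes=64):
        # Each server is placed on the ring at several pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-1 as an integer position on the ring.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def locate(self, object_id):
        """Return the first server clockwise from the object's hash."""
        i = bisect.bisect_right(self._points, self._hash(object_id))
        return self._ring[i % len(self._ring)][1]

ring = HashRing(["s1", "s2", "s3"])
print(ring.locate("photo-42"))
```

When a server is added, only the objects that now fall to the new server change location, so the structure can grow with the cluster without a central lookup table.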