Big Data Platforms - History and Motivations

Big Data Platforms - History and Motivations
2016.05
Jae Hyung Kim, Ph.D. Candidate, Department of Computer Science, Yonsei University

Lab Vision and Research Area
The Data Engineering Lab works on data processing system technologies for modern hardware and on bioinformatics based on data mining.
Dataware: data-centric systems spanning hardware and software.
Storing and processing data using novel memory and storage: tiering data among DRAM, NVRAM, SSD, and HDD.
Cooperation between software and storage: in-storage processing, and migrating recovery to the storage software layer.
Distributed processing on faster networks: data placement and task scheduling for the Hadoop stack on 10G networks.
Bioinformatics: systems biology studies; developing tools for bio-data analysis.
Our vision is to optimize typical data processing and management technologies for modern hardware and big data management. We also aim to research computational methods for omics data analysis and high-throughput biological data analysis.

Data Management with Modern Hardware
Data management technologies with modern hardware: efficient page layout and file organization; query processing and index structures; column data store technologies.
Big data management on modern hardware: boosting Hadoop performance using NVRAM and SSD; distributed graph processing; optimizing Hadoop on 10G networks.
Data processing in solid state drives: SSDs guaranteeing ACID properties; in-storage processing (filtering records in the SSD).
Dataware technologies: Flash SSD, PRAM, RDBMS, NVRAM, NVDIMM, SQL-on-Hadoop, NVMe, PCIe interface, NoSQL.
Modern hardware: 10G networks, distributed processing, graph-parallel computation, GPU, multi-core CPUs.
Publications: VLDB, ICDE, CIKM, Information Systems, etc.
Projects: SKTelecom, LG electronics, KISTI, etc.
Patents: 11 applied (4 international), 5 issued.

Database, Data Mining, Bioinformatics
Methods: network biology, graph theory, machine learning, data integration.
Systems biology studies on various biological data: microarrays, protein abundance, literature data, clinical data, somatic mutation data.
Research goal: disease analysis and functional genomics by a computational approach.
Developing tools for bio-data: analysis and visualization tools for various bio-data.
Publications (~2016): Nucleic Acids Research, Bioinformatics, PLoS One, Information Sciences, ISMB, Informatics Sciences Molecular biosystems, Journal of Biomedical Informatics, Computer Methods and Programs in Biomedicine, etc.

Index: Introduction; RDBMS vs Big Data Platforms; Growing Big Data Platforms

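DB Market Size and Outlook: the domestic (Korean) RDBMS market is forecast at about KRW 600 billion in 2017. The figure includes DB license revenue and maintenance revenue only.

DB Market Size and Outlook: domestic (Korean) DB market share in 2013

Global DB market size: USD 50 billion in 2017 (≈ KRW 60 trillion). The figure includes DB license revenue and maintenance revenue only.

DB Market Size and Outlook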

Introduction

Introduction: History & Motivations (RDBMS)

Introduction: History & Motivations (cont’d)
Diagram labels: users, shared data, concurrent access, handling failures.

Introduction: Transaction
A transaction is a powerful abstraction that forms the “interface contract” between an application program and a transactional server.
Application lifecycle: Program Start -> Begin Transaction -> ... -> Commit Transaction -> Program End.
The transaction boundary runs from Begin Transaction to Commit Transaction.
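The lifecycle above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the `account` table, the `transfer` and `balances` helpers, and the amounts are hypothetical and do not come from the slides. The connection is opened with `isolation_level=None` so that the begin and commit steps are written out explicitly.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` inside one explicit transaction."""
    conn.execute("BEGIN")  # Begin Transaction
    conn.execute(
        "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
    )
    conn.execute(
        "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
    )
    conn.execute("COMMIT")  # Commit Transaction


def balances(conn):
    """Return {account id: balance} for every row."""
    return dict(conn.execute("SELECT id, balance FROM account"))


# Program Start: hypothetical in-memory database with two accounts.
# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundary is exactly the BEGIN ... COMMIT pair in transfer().
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])

transfer(conn, "a", "b", 30)
print(balances(conn))
# Program End
```

Everything between `BEGIN` and `COMMIT` is what the application asks the server to treat as one unit; statements outside that pair run as separate single-statement units here.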

Introduction: Transaction (cont’d)
The core requirement on a DBMS is ACID guarantees for the set of operations in the same transaction. This requires:
a concurrency control component to guarantee the isolation properties of transactions, for both committed and aborted transactions;
a recovery component to guarantee the atomicity and durability of transactions.
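Atomicity, one of the ACID guarantees named above, can be observed in a small sqlite3 sketch. The schema, the `CHECK` constraint, and the `transfer_atomic` helper are assumptions made for illustration. The credit is applied first and the debit second, so that the failure happens after one update has already succeeded and the rollback has something to undo.

```python
import sqlite3


def transfer_atomic(conn, src, dst, amount):
    """Credit `dst`, then debit `src`; undo both if either step fails."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        # Violates the CHECK constraint when `src` has too little money.
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # the earlier credit is undone as well
        return False


def balances(conn):
    """Return {account id: balance} for every row."""
    return dict(conn.execute("SELECT id, balance FROM account"))


# Hypothetical schema: balances may never go negative.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])

ok = transfer_atomic(conn, "a", "b", 500)
print(ok, balances(conn))
```

The sketch shows only the all-or-nothing behaviour of an aborted transaction. Isolation between concurrent transactions and durability across crashes are the job of the concurrency control and recovery components and are not exercised here.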

Introduction: RDBMS Architecture – Heavy!!!
Clients send requests to the database server.
Layers of the database server, from top to bottom: Language and Interface Layer; Query Decomposition and Optimization Layer; Query Execution Layer; Access Layer; Storage Layer.
Requests are served by request execution threads, to facilitate disk I/O parallelism between different requests.
The Storage Layer performs data access on the database.

Introduction: RDBMS Architecture – How data is stored
A database usually has a certain amount of preallocated disk space, which consists of one or more extents.
Each extent is a range of pages that are contiguous on disk.
Page: 1) the minimum unit of data transfer between disk and main memory; 2) the unit of caching in memory.
Slot = a page number + a slot number.
A page number -> a disk number + a physical address on disk, obtained by looking up an entry in an extent table and adding a relative offset.
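The page-number translation described above can be sketched as a lookup plus an offset. The `Extent` fields, the 8 KB `PAGE_SIZE`, and the sample table are hypothetical; the slides do not specify the extent-table layout.

```python
from bisect import bisect_right
from dataclasses import dataclass

PAGE_SIZE = 8192  # assumed page size in bytes


@dataclass(frozen=True)
class Extent:
    """Hypothetical extent-table entry: a run of pages contiguous on one disk."""
    first_page: int     # page number of the first page in the extent
    num_pages: int      # how many contiguous pages the extent holds
    disk: int           # disk number
    start_address: int  # physical byte address of the first page on that disk


def locate_page(extents, page_no):
    """Translate a page number into (disk number, physical address).

    `extents` must be sorted by `first_page`.
    """
    starts = [e.first_page for e in extents]
    i = bisect_right(starts, page_no) - 1  # last extent starting at or before page_no
    if i < 0 or page_no >= extents[i].first_page + extents[i].num_pages:
        raise KeyError(f"page {page_no} is not in any extent")
    e = extents[i]
    # Relative offset of the page inside its extent, added to the extent's base.
    return e.disk, e.start_address + (page_no - e.first_page) * PAGE_SIZE


extent_table = [
    Extent(first_page=0, num_pages=100, disk=0, start_address=0),
    Extent(first_page=100, num_pages=50, disk=1, start_address=4096),
]
print(locate_page(extent_table, 101))
```

Because every extent is contiguous on disk, the table only needs one entry per extent rather than one per page.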

Introduction: RDBMS Computational Model – Page model
Requests -> processing of pages (read or write).
The ACID properties of transactions are page based: concurrency control and recovery should be based on the page model.
※ The details of how data is manipulated within the local variables of the executing programs are mostly irrelevant.
Parallelized transaction execution: the transaction t = r(x)r(y)r(z)w(u)w(x) can also be drawn as a partial order over the steps r(x), r(y), r(z), w(u), w(x).
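The relation between the sequential form of t and a partial order over its steps can be sketched as follows. The edges in `EDGES` are an assumption for illustration, since the figure's actual arrows are not recoverable from the transcript. The code checks whether a given sequence of steps is compatible with the partial order and counts how many sequential executions the partial order allows.

```python
from itertools import permutations

STEPS = ["r(x)", "r(y)", "r(z)", "w(u)", "w(x)"]

# Hypothetical precedence edges (a, b), meaning "a must happen before b".
EDGES = [("r(x)", "w(x)"), ("r(y)", "w(u)"), ("r(z)", "w(u)")]


def respects(order, edges):
    """True if every edge (a, b) has a placed before b in `order`."""
    position = {step: i for i, step in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)


def compatible_orders(steps, edges):
    """All sequential executions that are compatible with the partial order."""
    return [p for p in permutations(steps) if respects(p, edges)]


# The sequential transaction t = r(x) r(y) r(z) w(u) w(x) from the slide.
print(respects(STEPS, EDGES))
print(len(compatible_orders(STEPS, EDGES)))
```

Steps that are not connected by an edge, such as the three reads here, may run in any order or in parallel, which is the point of modelling a transaction as a partial order.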

Introduction: Needs for huge data from Google
More than 15,000 commodity-class PCs.
Multiple clusters distributed worldwide.
Thousands of queries served per second.
One query reads hundreds of MB of data.
One query consumes tens of billions of CPU cycles.
Google stores dozens of copies of the entire Web!
Conclusion: Google needs a large, distributed, highly fault-tolerant file system -> a traditional DBMS cannot tolerate this workload.

RDBMS vs Big Data Platforms

RDBMS vs Big Data Platforms: Problems of RDBMS
RDBMS clustering incurs a transaction maintenance cost and a data copy cost -> performance does not increase as expected.

RDBMS vs Big Data Platforms: Problems of RDBMS – Scale-up vs Scale-out (cost perspective)
Scale-up: Intel Xeon E5-2697 v3 (Haswell-EP), Socket 2011-v3, 14 cores, 28 threads, 64(32)-bit, 2.6GHz, DDR4, 40 PCI-Express lanes. Price: KRW 3,400,000.
Scale-out: Intel Core i5-6600 (6th generation, Skylake), Socket 1151, DDR4 / DDR3L, 64-bit, quad-core, 4 threads, 3.3GHz, Intel HD 530, 16 PCI-Express lanes. Price: KRW 250,000.
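The cost argument can be made concrete with a little arithmetic on the two prices and core counts quoted on the slide. This is a rough sketch: it compares CPU list prices only and ignores clock speed, memory, motherboards, networking, and power, none of which the slide quantifies.

```python
# Prices (KRW) and core counts as quoted on the slide.
XEON_E5_2697_V3 = {"price_krw": 3_400_000, "cores": 14}
CORE_I5_6600 = {"price_krw": 250_000, "cores": 4}


def price_per_core(cpu):
    """CPU price divided by its number of cores."""
    return cpu["price_krw"] / cpu["cores"]


# How many commodity CPUs fit in the budget of one server-class CPU?
chips_for_same_budget = XEON_E5_2697_V3["price_krw"] // CORE_I5_6600["price_krw"]
cores_for_same_budget = chips_for_same_budget * CORE_I5_6600["cores"]

print(round(price_per_core(XEON_E5_2697_V3)))  # KRW per core, scale-up
print(round(price_per_core(CORE_I5_6600)))     # KRW per core, scale-out
print(chips_for_same_budget, cores_for_same_budget)
```

On these numbers a core costs roughly four times as much on the scale-up part, and the same budget buys 13 commodity CPUs with 52 cores in total instead of 14 cores.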

RDBMS vs Big Data Platforms: Google File System
The beginning of the big data platforms; it influenced Hadoop.
Chunk: analogous to a block, except larger (typically 64MB).
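With fixed-size chunks, finding the chunk that holds a given byte of a file is plain integer arithmetic. The sketch below assumes the typical 64MB chunk size from the slide; the function names `locate` and `chunks_for_range` are hypothetical and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the typical chunk size quoted on the slide


def locate(offset):
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    return divmod(offset, CHUNK_SIZE)


def chunks_for_range(offset, length):
    """Chunk indexes touched by reading `length` > 0 bytes starting at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)


print(locate(CHUNK_SIZE + 5))
print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))
```

A large chunk size keeps the number of chunks per file small, so the amount of per-file location metadata stays small as well.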

RDBMS vs Big Data Platforms: Google File System – Read Algorithm (1/2)

RDBMS vs Big Data Platforms: Google File System – Read Algorithm (2/2)

RDBMS vs Big Data Platforms: Google File System – Write Algorithm (1/4)

RDBMS vs Big Data Platforms: Google File System – Write Algorithm (2/4)

RDBMS vs Big Data Platforms: Google File System – Write Algorithm (3/4)

RDBMS vs Big Data Platforms: Google File System – Write Algorithm (4/4)

RDBMS vs Big Data Platforms: Hadoop – HDFS + MapReduce
Figure label: 128MB file (e.g. /data/hdfs/block1) on the local filesystem.

RDBMS vs Big Data Platforms: Hadoop – HDFS + MapReduce (Computational Model)
Figure label: on the local filesystem.
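The MapReduce computational model can be illustrated with a single-process word count. This is a sketch of the map, shuffle, and reduce phases in plain Python, not the Hadoop API: the function names and the input records are made up for the example, and a real job would run map and reduce tasks on many machines over blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in order over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


print(run_job(["big data", "Big platforms"]))
```

Each call to `map_phase` depends only on its own record and each call to `reduce_phase` only on one key's values, which is what lets a framework spread both phases across a cluster.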

Growing Big Data Platforms

Growing Big Data Platforms: Gartner’s hype cycle 2012

Growing Big Data Platforms: Gartner’s hype cycle 2013

Growing Big Data Platforms: Gartner’s hype cycle 2014

Growing Big Data Platforms: Gartner’s hype cycle 2015
Big data has dropped from the cycle; big data is now in practice.

Emerging Hardware

Emerging Hardware: History of Memory

Emerging Hardware: All flash array

Emerging Hardware: All flash array

Emerging Hardware: NVRAM

Emerging Hardware: NVDIMM

Q&A Thank you