Big Data Platforms - History and Motivations (2016.05) Jae Hyung Kim, Ph.D. Candidate, Department of Computer Science, Yonsei University
Lab Vision and Research Area
The Data Engineering Lab works on both data processing system technologies for modern hardware and bioinformatics based on data mining.
Dataware: data-centric systems spanning HW and SW. Storing and processing data using novel memory and storage. Tiering data among DRAM, NVRAM, SSD and HDD. Cooperation between SW and storage: in-storage processing and migrating recovery to the storage SW layer. Distributed processing on faster networks: data placement and task scheduling for the Hadoop stack on 10G networks.
Bioinformatics: systems biology studies. Developing tools for bio-data analysis.
Our vision is to optimize typical data processing and management technologies for modern hardware and big data management. We also aim to research various computational methods for omics data analysis and high-throughput biological data analysis.
Data Management with Modern Hardware
Data Management Technologies with Modern Hardware: efficient page layout and file organization; query processing and index structures; column data store technologies.
Big Data Management on Modern Hardware: boosting Hadoop performance using NVRAM and SSD; distributed graph processing; optimizing Hadoop on 10G networks.
Data Processing in Solid State Drives: SSD guaranteeing ACID properties; in-storage processing (filtering records in SSD).
Dataware Technologies (keywords): Flash SSD, PRAM, RDBMS, NVRAM, NVDIMM, SQL-on-Hadoop, NVMe, PCIe interface, NoSQL, 10G networks, distributed processing, graph-parallel computation, GPU, multi-core CPUs, modern hardware.
Publications: VLDB, ICDE, CIKM, Information Systems, etc.
Projects: SKTelecom, LG Electronics, KISTI, etc.
Patents: 11 applied (4 international) patents, 5 issued patents.
Database, Data Mining, Bioinformatics
Methods: network biology, graph theory, machine learning, data integration.
Systems Biology Studies. Various biological data: microarrays, protein abundance, literature data, clinical data, somatic mutation data.
Research Goal: disease analysis and functional genomics by computational approaches.
Developing Tools for Bio-data: analysis and visualization tools for various bio-data.
Publications (~2016): Nucleic Acids Research, Bioinformatics, PLoS One, Information Sciences, ISMB, Informatics Sciences, Molecular BioSystems, Journal of Biomedical Informatics, Computer Methods and Programs in Biomedicine, etc.
Index: Introduction; RDBMS vs Big Data Platforms; Growing Big Data Platforms
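DB Market Size and Outlook: outlook for the Korean RDBMS market: about 600 billion KRW in 2017 (DB license and maintenance revenue only)
DB Market Size and Outlook: Korean DB market share in 2013
Global DB Market Size: 50 billion USD (≒ 60 trillion KRW) in 2017 (DB license and maintenance revenue only)
DB Market Size and Outlook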
Introduction
Introduction History & Motivations RDBMS
Introduction History & Motivations (cont’d) Users and Shared Data: Concurrent Access, Handling Failures
Introduction Transaction: a powerful abstraction concept that forms the “interface contract” between an application program and a transactional server. Application lifecycle: Program Start, Begin Transaction, ..., Commit Transaction, Program End. The transaction boundary runs from Begin Transaction to Commit Transaction.
Introduction Transaction (cont’d) The core requirement on a DBMS is ACID guarantees for the set of operations in the same transaction. This calls for a concurrency control component to guarantee the isolation properties of transactions, for both committed and aborted transactions, and a recovery component to guarantee the atomicity and durability of transactions.
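The transaction boundary and the atomicity guarantee can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and a hypothetical two-row account table; the table, the transfer function, and the amounts are illustrative and not from the slides. A transfer either commits both updates or, on failure, rolls back so that neither is visible.

```python
import sqlite3

def make_bank():
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN / COMMIT / ROLLBACK statements below mark the boundary explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")  # hypothetical data
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one transaction."""
    conn.execute("BEGIN")  # transaction boundary starts
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        rows = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchall()
        if rows[0][0] < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("COMMIT")  # transaction boundary ends: both updates become durable
        return True
    except Exception:
        conn.execute("ROLLBACK")  # abort: the debit above is undone (atomicity)
        return False

def balances(conn):
    return [row[0] for row in conn.execute("SELECT balance FROM account ORDER BY id")]
```

Calling transfer(conn, 1, 2, 30) on a fresh database commits and leaves the balances at 70 and 30; a later transfer(conn, 1, 2, 500) aborts and leaves them unchanged.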
Introduction RDBMS Architecture – Heavy!!! Clients send requests to the database server. Database server layers: Language and Interface Layer; Query Decomposition and Optimization Layer; Query Execution Layer; Access Layer; Storage Layer. Request execution threads: to facilitate disk I/O parallelism between different requests. Data accesses go to the database.
Introduction RDBMS Architecture – How data is stored. A database usually has a certain amount of preallocated disk space that consists of one or more extents. Each extent is a range of pages that are contiguous on disk. A page is 1) the minimum unit of data transfer between disk and main memory and 2) the unit of caching in memory. Slot address = a page number + a slot number. A page number is translated into a disk number + a physical address on disk by looking up an entry in an extent table and adding a relative offset.
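The page-number translation described on this slide can be sketched as follows. The 8 KB page size, the Extent fields, and the two-entry extent table are assumptions made for illustration; real systems differ in layout and naming.

```python
from bisect import bisect_right
from typing import NamedTuple

PAGE_SIZE = 8192  # bytes per page; an assumed value, not from the slides

class Extent(NamedTuple):
    first_page: int  # first logical page number covered by this extent
    num_pages: int   # number of contiguous pages in the extent
    disk_no: int     # disk that holds the extent
    start_addr: int  # physical byte address of the extent's first page

# Hypothetical extent table, sorted by first_page.
EXTENT_TABLE = [
    Extent(first_page=0, num_pages=100, disk_no=0, start_addr=0),
    Extent(first_page=100, num_pages=50, disk_no=1, start_addr=1_048_576),
]

def page_to_physical(page_no, table=EXTENT_TABLE):
    """Translate a page number into (disk number, physical address)."""
    starts = [e.first_page for e in table]
    i = bisect_right(starts, page_no) - 1  # last extent starting at or before page_no
    if i < 0:
        raise KeyError(f"page {page_no} is not allocated")
    extent = table[i]
    relative = page_no - extent.first_page  # page offset within the extent
    if relative >= extent.num_pages:
        raise KeyError(f"page {page_no} is not allocated")
    return extent.disk_no, extent.start_addr + relative * PAGE_SIZE
```

A record is then found by first locating its page this way and then using the slot number inside that page.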
Introduction RDBMS Computational Model – Page model. Requests are processed as reads or writes of pages. To provide the ACID properties of transactions, concurrency control and recovery should be based on the page model. ※ The details of how data is manipulated within the local variables of the executing programs are mostly irrelevant. A transaction can be a total order, e.g. t = r(x)r(y)r(z)w(u)w(x), or, with parallelized transaction execution, a partial order over the same operations r(x), r(y), r(z), w(u), w(x).
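A small sketch can make the partial-order view concrete. The exact edges of the slide's figure are not recoverable from the text, so the order used here (every read precedes every write, with reads and writes otherwise unordered) is an assumption. The check encodes the usual page-model rule that two operations on the same item must be ordered when at least one is a write; it does not verify acyclicity.

```python
from itertools import product

# Operations of t = r(x) r(y) r(z) w(u) w(x) as (action, item) pairs.
OPS = [("r", "x"), ("r", "y"), ("r", "z"), ("w", "u"), ("w", "x")]

# Assumed partial order: each read precedes each write.
ORDER = {(a, b) for a, b in product(OPS, repeat=2) if a[0] == "r" and b[0] == "w"}

def transitive_closure(pairs):
    """Return the transitive closure of a set of (before, after) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

def is_valid_page_model_order(ops, order):
    """True if every pair on the same item with at least one write is ordered."""
    closure = transitive_closure(order)
    for a, b in product(ops, repeat=2):
        same_item_conflict = a != b and a[1] == b[1] and "w" in (a[0], b[0])
        if same_item_conflict and (a, b) not in closure and (b, a) not in closure:
            return False
    return True

def is_linear_extension(sequence, order):
    """True if the total order `sequence` respects every pair in `order`."""
    position = {op: i for i, op in enumerate(sequence)}
    return all(position[a] < position[b] for a, b in order)
```

Under this assumed order the three reads may run in parallel, and the sequential form r(x)r(y)r(z)w(u)w(x) is one of its linear extensions.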
Introduction Needs for huge data from Google: more than 15,000 commodity-class PCs; multiple clusters distributed worldwide; thousands of queries served per second; one query reads 100s of MB of data; one query consumes 10s of billions of CPU cycles; Google stores dozens of copies of the entire Web! Conclusion: need a large, distributed, highly fault-tolerant file system. A traditional DBMS cannot tolerate this.
RDBMS vs Big Data Platforms
RDBMS vs Big Data Platforms Problems of RDBMS: RDBMS clustering; transaction; maintenance cost; data copy cost; performance does not increase as expected
RDBMS vs Big Data Platforms Problems of RDBMS: Scale-up vs Scale-out (cost perspective). Intel Xeon E5-2697 V3 (Haswell-EP): Intel (socket 2011-V3) / tetradeca (14) core / 28 threads / 64(32)-bit / 2.6GHz / DDR4 / 40 PCI-Express lanes: 3,400,000 KRW. Intel Core i5 6th generation 6600 (Skylake): Intel (socket 1151) / DDR4 / DDR3L / 64-bit / quad core / 4 threads / 3.3GHz / Intel HD 530 / 16 PCI-Express lanes: 250,000 KRW.
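A rough back-of-the-envelope comparison of the two list prices follows. It is only a sketch: it counts CPU cores per KRW and ignores motherboards, memory, networking, power, and per-core performance differences, all of which matter in practice. The dictionary layout and function names are illustrative.

```python
# Prices and core counts as quoted on the slide.
XEON = {"name": "Xeon E5-2697 V3", "price_krw": 3_400_000, "cores": 14}
I5 = {"name": "Core i5-6600", "price_krw": 250_000, "cores": 4}

def price_per_core(cpu):
    """KRW paid per physical core."""
    return cpu["price_krw"] / cpu["cores"]

def chips_for_budget(budget_krw, cpu):
    """How many whole CPUs of this type the budget buys."""
    return budget_krw // cpu["price_krw"]

if __name__ == "__main__":
    budget = XEON["price_krw"]  # scale-up: one Xeon
    n = chips_for_budget(budget, I5)  # scale-out: as many i5 chips as fit
    print(f"{XEON['name']}: {price_per_core(XEON):,.0f} KRW per core")
    print(f"{I5['name']}: {price_per_core(I5):,.0f} KRW per core")
    print(f"Same budget buys {n} x {I5['name']} = {n * I5['cores']} cores")
```

By this crude measure, the price of one 14-core Xeon buys 13 quad-core chips, or 52 cores, which is the cost argument for scale-out.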
RDBMS vs Big Data Platforms Google File System: the beginning of the big data platforms; influenced Hadoop. Chunk: analogous to a block, except larger (typically 64MB)
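Because chunks have a fixed size, a client can turn a byte offset within a file into a chunk index with simple arithmetic before asking the master where that chunk lives. The sketch below shows only that arithmetic, assuming the typical 64MB chunk size; the function names are illustrative and it does not model the master or chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the typical GFS chunk size

def locate(offset):
    """Map a byte offset in a file to (chunk index, offset within that chunk)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, CHUNK_SIZE)

def chunks_for_range(offset, length):
    """List the chunk indexes touched by reading `length` bytes from `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

For example, a read that starts at byte offset 60MB and spans 10MB crosses a chunk boundary and touches chunks 0 and 1.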
RDBMS vs Big Data Platforms Google File System Read Algorithm (1/2)
RDBMS vs Big Data Platforms Google File System Read Algorithm (2/2)
RDBMS vs Big Data Platforms Google File System Write Algorithm (1/4)
RDBMS vs Big Data Platforms Google File System Write Algorithm (2/4)
RDBMS vs Big Data Platforms Google File System Write Algorithm (3/4)
RDBMS vs Big Data Platforms Google File System Write Algorithm (4/4)
RDBMS vs Big Data Platforms Hadoop HDFS + MapReduce: each 128MB block is stored as a file (e.g. /data/hdfs/block1) on the local filesystem
RDBMS vs Big Data Platforms Hadoop HDFS + MapReduce (Computational Model) On Local Filesystem
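The MapReduce computational model can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle, and reduce steps using word count; it is not the Hadoop API, and the function names and input splits are hypothetical.

```python
from collections import defaultdict

def map_fn(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    intermediate = []
    for split in splits:  # in Hadoop, each split would be handled by a separate map task
        intermediate.extend(map_fn(split))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())
```

In a real cluster the map tasks run near the blocks they read and the shuffle moves data across the network; here everything runs in one process to show the data flow only.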
Growing Big Data Platforms
Growing Big Data Platforms
Growing Big Data Platforms Gartner’s hype cycle 2012
Growing Big Data Platforms Gartner’s hype cycle 2013
Growing Big Data Platforms Gartner’s hype cycle 2014
Growing Big Data Platforms Gartner’s hype cycle 2015: big data dropped from the cycle; big data is now in practice
Emerging Hardware
Emerging H/Ws History of Memory
Emerging H/Ws All flash array
Emerging H/Ws All flash array
Emerging H/Ws NVRAM
Emerging H/Ws NVDIMM
Q&A Thank you