XLDB 2010 (Extremely Large Databases) conference summary
Dawid Wójcik
CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

About XLDB conference
Conference format
–Invitational workshop (industry & science)
–Main conference
Participants
–Industry: oil/gas (Chevron, Exxon, others), financial (Visa, NY exchange, banking), medical/bioinformatics, others including big names like eBay, Yahoo, Facebook, Amazon, IBM, EMC, HP, …; RDBMS vendors (Oracle, MS, Teradata)
–Science: many laboratories and science institutes, representatives of different science projects in astronomy, astrophysics, bioinformatics; open-source communities, …

XLDB topics
Main topics and challenges
–How to store PBs of data and retrieve them efficiently
  HEP community – 10-15 PB/year now
  Astronomy – 10 PB/year
    Large Synoptic Survey Telescope (in 2017)
  Use of SSDs
–How to analyze the data
  Unpredictable query load (real-time vs. offline processing)
  Full scans preferred over index access for some data (astronomical pixel data, genome, …)
  Complicated algorithms for data processing
    Use of GPUs for offloading
  Stream processing
–SciDB, benchmarking

XLDB questions
Data management
–File systems vs. (R)DBMSs
–Scientific tools and data formats
–Online data and historical data challenges
  Millisecond latency vs. PB analysis
Data processing
–How to build efficient processing systems
  2nd Amdahl's Law – number of bits of I/O per second per instruction per second (see the worked formula below)
–Parallel processing
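The slide quotes Amdahl's second law only in words; written out as a balance ratio (the notation is mine, not the speaker's), it reads:

\[
  \text{Amdahl number} \;=\; \frac{\text{I/O rate [bits/s]}}{\text{instruction rate [instructions/s]}},
  \qquad \text{Amdahl number} \approx 1 \ \text{for a balanced system.}
\]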

XLDB solutions
Different scientific tools and data formats
–ROOT, FTOOLS, DS9
–dCache, CASTOR, Xrootd
–netCDF, HDF5, FITS, xtc
SSDs used for data and caching
Clusters with Amdahl number = 1 for under $40k (18 GB/s of I/O) – see the back-of-the-envelope check below
–Test – histogram of 544 million objects from 1.2 TB of data – SQL executes in ~100 s
SQL query offloading to GPUs
Scalable shared-nothing MySQL (Facebook)
–memcached, flashcache
Different RDBMS systems – most run Oracle, MSSQL or MySQL (industry also runs MySQL)
Move the processing to the data
Hadoop (Yahoo and many others in industry and science!)
–MapReduce and extreme parallel processing
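A quick back-of-the-envelope check of the "Amdahl number = 1" cluster quoted above. Only the 18 GB/s and 1.2 TB figures come from the slide; the CPU-side requirement simply follows from the definition given earlier, so this is a sketch of the arithmetic rather than data from the talk.

import math

# Assumption: Amdahl number = (bits of I/O per second) / (instructions per second).
io_bandwidth_bytes_per_s = 18e9                       # 18 GB/s aggregate I/O, figure from the slide
io_bandwidth_bits_per_s = io_bandwidth_bytes_per_s * 8

# For a balanced cluster (Amdahl number ~= 1) the CPUs must sustain about as many
# instructions per second as the storage delivers bits per second:
required_instr_per_s = io_bandwidth_bits_per_s / 1.0
print(f"Balanced CPU throughput: {required_instr_per_s:.2e} instructions/s")   # ~1.4e11

# Time to stream the 1.2 TB test data set at full I/O bandwidth:
scan_time_s = 1.2e12 / io_bandwidth_bytes_per_s
print(f"Full scan of 1.2 TB: ~{math.ceil(scan_time_s)} s")   # ~67 s, consistent with the ~100 s SQL run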

Hadoop
Open-source project for reliable, scalable, distributed computing
–Subprojects: Chukwa (monitoring), HBase (DB with Bigtable-like capabilities), HDFS (cluster file system), Hive (parallel SQL data warehouse), MapReduce (see the sketch below), Pig (high-level data-flow language), ZooKeeper (a high-performance coordination service)
–Many use cases across different domains (industry and science)
–Super parallel computing (60 seconds for a 1 TB sort with 1500 nodes – year 2009)
–Yahoo – 3.7 PB of data processed daily, 120 TB of daily event data processed, >4000 nodes, 16 PB raw disk space
  Streaming analytics
  Warehouse solution
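To make the map-reduce model named above concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer are plain scripts that read lines and emit tab-separated key/value pairs. The sample input and the word-count task are illustrative only, not taken from the talk; a tiny local driver stands in for the framework's sort/shuffle so the sketch runs on its own.

from itertools import groupby

# Mapper: read raw text lines, emit one "word<TAB>1" pair per word.
def mapper(lines):
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

# Reducer: receive key-sorted pairs (as after Hadoop's shuffle) and sum counts per word.
def reducer(sorted_lines):
    for word, group in groupby(sorted_lines, key=lambda kv: kv.split("\t", 1)[0]):
        total = sum(int(kv.split("\t", 1)[1]) for kv in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # Local stand-in for the Hadoop framework: map, sort (shuffle), reduce.
    sample = ["move the processing to the data", "the data is large"]
    shuffled = sorted(mapper(sample))
    for line in reducer(shuffled):
        print(line)

On a real cluster the same two functions would be submitted through the Hadoop Streaming jar (hadoop jar hadoop-streaming*.jar -mapper ... -reducer ... -input ... -output ...), with HDFS supplying the input splits and the framework performing the sort between the two phases.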

Hadoop - HDFS
Hadoop Distributed File System (HDFS)
–Primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations (a toy illustration of the block/replica idea follows below).
–HDFS used as storage layer for the CMS Tier-2 site at Nebraska – replaced dCache (see this link)
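The block/replica behaviour described above can be illustrated with a small toy model. This is only a sketch of the idea (fixed block size, round-robin placement on distinct nodes); it is not how HDFS actually chooses replica locations, which is rack-aware.

# Toy illustration of the HDFS idea: split a file into fixed-size blocks and
# place each block's replicas on distinct nodes.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the HDFS default block size at the time
REPLICATION = 3                 # default replication factor

def place_blocks(file_size_bytes, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a mapping block_index -> list of nodes holding a replica."""
    n_blocks = -(-file_size_bytes // block_size)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        # pick `replication` distinct nodes, rotating through the cluster
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

if __name__ == "__main__":
    cluster = [f"node{i:02d}" for i in range(8)]
    layout = place_blocks(file_size_bytes=1 * 1024**3, nodes=cluster)   # a 1 GB file -> 16 blocks
    for block, replicas in layout.items():
        print(f"block {block}: {replicas}")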

Hadoop – how to analyze data
Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm)
Hadoop core – Java programming required
Hive – SQL-like interface:
  -- create a partitioned table, load a local file into one partition,
  -- then aggregate into another table
  CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  LOAD DATA LOCAL INPATH './files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds=' ');
  FROM invites a
    INSERT OVERWRITE TABLE events
    SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
Pig – Pig Latin language:
  -- load a tab-separated log, lower-case the query field, extract the hour, build n-grams
  raw = LOAD 'mylog.log' USING PigStorage('\t') AS (user, time, query);
  clean = FOREACH raw GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
  houred = FOREACH clean GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
  ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
  ...
  STORE res2 INTO '/tmp/tutorial-join-results' USING PigStorage();

XLDB summary
XLDB – a very interesting conference, bringing science and industry together
Many questions asked and issues raised
Still not many success stories (mostly from industry)
–SKA (Square Kilometre Array) may change that – see:
SciDB – a solution for science DB projects?
–See
Most presentations from XLDB 2010:
–