Plans for the renovation of the Post Mortem infrastructure

Plans for the renovation of the Post Mortem infrastructure
TE-MPE-TM 82, 15/09/2016
Matthias Pöschl – TE-MPE-MS

Agenda
- Current Post Mortem Architecture
- Shortcomings
- Improvements
- New Post Mortem Storage
- New File Format
- Collaboration with CALS 2.0
- Shared Data and Infrastructure
- New Post Mortem Architecture
- Conclusion

Current Architecture

Shortcomings
- Direct user access to the underlying filesystem and file format
- Outdated data collection stack
- Manual load balancing
- Very limited horizontal scaling
- Unfit for future use cases with strict time constraints

Improvements
- Update the storage technology
- Allow for dynamic load balancing
- Update the data collection stack and file format
- User access only through a REST API

New Post Mortem Storage
Benchmark of Ceph, MongoDB and GlusterFS, using real Post Mortem data (all of 2015).
Constraints:
- The technology has to handle drive and node failures, avoid data inconsistencies and degrade gracefully
- Three replicas of each object have to be stored
- A write is acknowledged only after all three copies have been written (see the sketch below)
- Adding more nodes should increase capacity and throughput by a known factor ("linear scaling")
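To make the replication constraint concrete, here is a minimal sketch of a client write using the python-rados bindings, assuming a pool named "postmortem" that has been configured with three replicas; the pool and object names are invented for illustration:

```python
# Minimal sketch using python-rados; pool and object names are hypothetical.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("postmortem")  # pool assumed created with size=3
try:
    # RADOS acknowledges the write only once every replica in the acting set
    # (three, given the pool's replication factor) has persisted the object.
    ioctx.write_full("pm-event-2015-06-21-001", b"<avro-encoded PM event>")
finally:
    ioctx.close()
    cluster.shutdown()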

New Post Mortem Storage
GlusterFS performed best in the read-only and write-only benchmarks, very closely followed by Ceph. Ceph showed the best mixed-workload performance:

Technology   Time to complete   Objects/s   Throughput
NFS          26.3 h              340         8,018 KB/s
Ceph         20.4 h              438        12,154 KB/s
MongoDB      23.5 h              381         8,973 KB/s
GlusterFS    20.7 h              432        10,187 KB/s

(Speaker note: explain the test bench, servers, HDD RPMs, …)

New Post Mortem Storage
Meeting with Dan van der Ster and Herve Rousseau from IT-ST-FDO to discuss Ceph:
- IT has had good experiences with Ceph
- Biggest Ceph test so far was a ~30 PB cluster
- IT runs its VMs on top of Ceph
- Not a single byte lost in 5+ years of operation
- IT sees no problem in using it for Post Mortem
- IT is willing to provide support and assistance

New Post Mortem Storage
Test with a test Ceph cluster provided by IT:

Benchmark    Time to complete   Objects/s   Throughput
Import       12.9 h              345         8,135 KB/s
Read          3.5 h             1269        30,080 KB/s
Read/Write   16.8 h              795        12,552 KB/s

(Speaker note: explain the variations in the graph -> file size)

New file format
- Aiming to use Apache Avro as the file format, both for storage and for serving the users
- "Raw" RDA data will still be stored for safety
- Avro offers many useful features (sketched below):
  - Partial data retrieval (specific signals from a dump)
  - Efficient and fast compression
  - Self-describing schema
  - Libraries for almost every programming language
  - Direct conversion to JSON
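As a sketch of how a Post Mortem dump could be written and read back with Avro, using the fastavro Python library; the schema and field names are invented for illustration:

```python
# Hedged sketch with fastavro; schema and field names are hypothetical.
from io import BytesIO
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "PMEvent",
    "fields": [
        {"name": "device",    "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "signals",   "type": {"type": "map", "values": "double"}},
    ],
})

buf = BytesIO()
# codec="deflate" enables Avro's built-in block compression.
writer(buf, schema, [{
    "device": "BLM.A.67L1",
    "timestamp": 1442300000000,
    "signals": {"loss_rs01": 0.42},
}], codec="deflate")

buf.seek(0)
# The schema travels inside the file, so no external definition is needed.
for record in reader(buf):
    print(record["signals"])  # pick out only the signals of interest
```

Because each file embeds its schema, the same blob can also be converted to JSON on the fly when serving users.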

Collaboration with CALS 2.0
- The Logging Service team is updating their whole infrastructure
- The Oracle cluster will be replaced with Hadoop

Collaboration with CALS 2.0
The new CALS design will allow Big Data queries, e.g. "Show me the biggest deviations from the mean of a certain BLM in sector 67 in the year 2015 when dumping at 6.5 TeV".
Idea: feed the Logging Service the high-resolution Post Mortem data to make these kinds of queries even more valuable for the users.
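A rough sketch of what such a query might look like against the CALS 2.0 Hadoop store with Spark; the data path, column names and filter values are all invented for illustration, and reading Avro assumes the spark-avro package is on the classpath:

```python
# Purely illustrative PySpark sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blm-deviations").getOrCreate()

blm = (spark.read.format("avro")            # requires the spark-avro package
       .load("/cals/blm_readings/2015")
       .where((F.col("sector") == "67") & (F.col("beam_energy_tev") == 6.5)))

mean_loss = blm.agg(F.avg("loss_value")).first()[0]

(blm.withColumn("deviation", F.abs(F.col("loss_value") - F.lit(mean_loss)))
    .orderBy(F.col("deviation").desc())
    .limit(10)
    .show())
```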

Collaboration with CALS 2.0
Meetings with Chris Roderick, Jakub Wozniak and Marcin Sobieszek from BE-CO-DS to evaluate common use of technologies and services, as well as details of data ingestion. Results:
- CALS might also use Avro for their data ingestion
- Possible shared use of a Kafka cluster (sketch below)
- Shared data storage is not (yet) possible
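A minimal sketch of what ingestion into a shared Kafka cluster could look like from the Post Mortem side, using the kafka-python client; the broker address and topic name are assumptions:

```python
# Hedged sketch with kafka-python; broker and topic names are hypothetical.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="pm-kafka.cern.ch:9092",
    acks="all",  # wait until all in-sync replicas have the record
)

def publish_pm_event(avro_bytes: bytes) -> None:
    """Ship one Avro-encoded Post Mortem event to the shared topic."""
    producer.send("postmortem-events", value=avro_bytes)

# e.g. publish_pm_event(buf.getvalue()) with the fastavro buffer from above
producer.flush()
```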

New Post Mortem Architecture

Conclusion
- A Ceph cluster for Post Mortem data seems feasible and suitable to tackle future use cases
- No common storage for CALS and PM (yet):
  - Different timing constraints and data sizes
  - Different preferred storage technologies
- But: data and infrastructure can be shared between both systems, providing a good trade-off for the users

Conclusion
- Avro allows efficient data storage, convenience for the users and easy integration with CALS 2.0
- All data access goes through a REST API, serving uncompressed JSON or compressed Avro (illustrated below)
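For illustration, a client round-trip against such a REST API might look like the following; the endpoint URL, route and media types are assumptions, not the final interface:

```python
# Hypothetical client sketch; URL, route and media types are invented.
import requests

BASE = "https://pm-api.cern.ch/v1"  # assumed endpoint

# Uncompressed JSON, convenient for interactive exploration:
event = requests.get(f"{BASE}/events/12345",
                     headers={"Accept": "application/json"}).json()

# Compressed Avro, efficient for bulk consumers:
blob = requests.get(f"{BASE}/events/12345",
                    headers={"Accept": "application/avro"}).content
# ...decode blob with an Avro library such as fastavro
```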