Data Analytics and Hadoop Service in IT-DB. Visit of Cloudera, April 19th, 2016. Luca Canali (CERN) for IT-DB.

Presentation transcript:


Data Life Cycle

[Diagram: cluster nodes (Node 1, Node 2, ..., Node n) connected by an interconnect]
- Queries run in parallel on the cluster nodes
- The shared-nothing architecture allows scaling for high capacity and throughput on commodity hardware
- Example of Oracle RAC deployed with shared storage
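The idea of queries running in parallel on cluster nodes can be illustrated with a small scatter-gather sketch. This is not how Impala, Spark or any other engine on the slide is implemented. It only shows the pattern: each "node" owns its own shard of the data, computes a partial result locally, and a coordinator merges the partials. The function names, the round-robin sharding and the use of threads to stand in for nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Spread rows over n_nodes shards round-robin (hypothetical scheme)."""
    shards = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        shards[i % n_nodes].append(row)
    return shards


def local_sum_and_count(shard):
    """Work done by one node, using only the data it owns."""
    return sum(value for _, value in shard), len(shard)


def parallel_average(rows, n_nodes=4):
    """Scatter the aggregation to every shard, then merge the partial results."""
    shards = partition(rows, n_nodes)
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum_and_count, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    rows = [(i, float(i)) for i in range(100)]
    print(parallel_average(rows, n_nodes=4))
```

Because each shard is processed independently, adding nodes adds both capacity and throughput, which is the scaling property the slide refers to.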

Hadoop Service at CERN IT
- 3 production clusters + 1 QA cluster: ~100 nodes in total
- Notable items in the tech stack (CDH):
  - HBase, MapReduce, Pig, Hive, Spark, Impala
  - Kafka, Flume, Sqoop
  - Parquet, Avro
  - Hue

In the following: examples of projects we are working on with the user community and developers. Our goals: help with implementation, provide support, and drive platform evolution.

ATLAS EventIndex
- Repository of events
- Uses HBase for fast lookup of events
- Size: ~40 TB/year
- Uses HDFS sequence (Map) files
- In production, and also being developed and evolved

Accelerator Log System
- Currently in Oracle: ~400 TB
- New version being developed on Hadoop
- Prototype ingesting ~200 GB/day
- Kafka + Gobblin -> Parquet
- Access: Impala + Spark

Analytics for the Future Circular Collider (FCC)
- Accelerator logging data from Oracle
- Copy to Hadoop
- Read with Impala
- Front end: Hue
- This project is also using Oracle BDD

Industrial Controls
- WinCC (Siemens) currently archiving into Oracle (~30 TB)
- Project to offload queries to Hadoop
- Hybrid solution: new data in Oracle, archive read with Impala
- Data movement with Sqoop
- Submitted a Sqoop patch improving performance for writing into Parquet
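The hybrid solution (new data in Oracle, archive read with Impala) implies that a time-range query has to be split at some cutoff between the two stores. The sketch below shows one way such routing could look. The backend labels `impala_archive` and `oracle_recent`, the `route_query` function and the single fixed cutoff are hypothetical and are not taken from the project.

```python
from datetime import datetime


def route_query(start, end, cutoff):
    """Split the half-open range [start, end) into per-backend sub-ranges.

    Data older than `cutoff` is assumed to live in the Hadoop archive
    (read with Impala); data from `cutoff` onwards is assumed to be in Oracle.
    """
    if start >= end:
        raise ValueError("start must be before end")
    plan = []
    if start < cutoff:
        plan.append(("impala_archive", start, min(end, cutoff)))
    if end > cutoff:
        plan.append(("oracle_recent", max(start, cutoff), end))
    return plan


if __name__ == "__main__":
    cutoff = datetime(2016, 1, 1)
    for step in route_query(datetime(2015, 12, 1), datetime(2016, 2, 1), cutoff):
        print(step)
```

A query that falls entirely on one side of the cutoff touches only one backend, so recent-data queries keep their existing Oracle path while historical scans are offloaded.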

Monitoring
- Monitoring dashboards
  - In production
  - For IT, WLCG
- New generation applications
  - Moving from relational DBs
  - Use lambda architecture
  - Stream: Flume + Spark Streaming
  - Batch: Spark jobs
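As a rough illustration of the lambda architecture mentioned for the new generation applications, the sketch below keeps an append-only event log, a batch view recomputed from the full log, and a speed view covering only events that arrived since the last batch run. Queries merge the two views. It is a toy in-memory model, not the Flume and Spark pipeline itself, and the `LambdaCounter` class and its method names are invented for this example.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: counts events per host."""

    def __init__(self):
        self.master = []              # append-only log of all events
        self.batch_view = Counter()   # recomputed from the full log
        self.speed_view = Counter()   # events not yet covered by a batch run

    def ingest(self, host):
        """Streaming path: record the event and update the speed view."""
        self.master.append(host)
        self.speed_view[host] += 1

    def run_batch(self):
        """Batch path: rebuild the batch view from scratch, reset the speed view."""
        snapshot = len(self.master)
        self.batch_view = Counter(self.master[:snapshot])
        # Anything that arrived after the snapshot stays in the speed view.
        self.speed_view = Counter(self.master[snapshot:])

    def query(self, host):
        """Serving layer: merge the batch and speed views."""
        return self.batch_view[host] + self.speed_view[host]


if __name__ == "__main__":
    lc = LambdaCounter()
    for h in ["node1", "node1", "node2"]:
        lc.ingest(h)
    lc.run_batch()
    lc.ingest("node1")
    print(lc.query("node1"), lc.query("node2"))
```

The batch path can be slow and simple because it always recomputes from the full log, while the speed path only has to cover the short window since the last batch run.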

Challenges
- Real-time analytics
  - Currently batch processing or ad-hoc solutions
- Integration between components
  - Access control, resource management/security for Impala, Spark, HBase
- Integration with legacy systems and data ingestion
  - Issue: missing support for complex data types in Impala and Kudu
- Operational issues
  - Learn how to run critical services on Hadoop
  - Example: backups and data preservation

Testing at Scale
- Use cases from controls and physics
  - See openlab project proposal
- Higher scale and throughput than what has been done with our clusters so far
  - Ingestion of 1M changes/sec
  - Processing of 1 PB of physics data with ~1000 cores
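To give a feel for these targets, the following back-of-the-envelope calculation converts them into daily and per-core figures. The ingestion rate, data volume and core count come from the slide. The per-core scan rate of 100 MB/s is purely an assumed value for illustration (not a measurement), and 1 PB is taken as 10^15 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
CHANGES_PER_SEC = 1_000_000
PHYSICS_DATA_BYTES = 10**15      # 1 PB, decimal units assumed
CORES = 1000

changes_per_day = CHANGES_PER_SEC * SECONDS_PER_DAY
bytes_per_core = PHYSICS_DATA_BYTES // CORES


def scan_hours(total_bytes, cores, mb_per_sec_per_core):
    """Hours for a full scan if every core sustains the given rate.

    The rate is an assumption supplied by the caller, not a benchmark result.
    """
    bytes_per_sec = cores * mb_per_sec_per_core * 10**6
    return total_bytes / bytes_per_sec / 3600


if __name__ == "__main__":
    print(f"changes per day: {changes_per_day:,}")
    print(f"data per core:   {bytes_per_core / 10**12:.0f} TB")
    # 100 MB/s per core is a hypothetical rate chosen for illustration.
    print(f"full scan:       {scan_hours(PHYSICS_DATA_BYTES, CORES, 100):.1f} h")
```

Under these assumptions each core has to process about 1 TB, and sustained ingestion amounts to 86.4 billion changes per day.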