Hadoop IT Services. Hadoop Users Forum, CERN, October 7th, 2015. CERN IT-D*

Presentation transcript:


Hadoop
  A framework for large-scale data processing
  Distributed storage and processing
  Shared-nothing architecture: scales horizontally
  Optimized for high throughput on sequential data access
  [Diagram: nodes 1 to X, each with its own memory, CPU, and disks, connected by an interconnect network]
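The shared-nothing idea on this slide can be illustrated with a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names, the round-robin partitioning, and the word-count task are assumptions chosen for illustration. Each simulated node works only on its own slice of the data, and partial results are merged at the end.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_nodes):
    """Deal records round-robin so each simulated node owns its own slice."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_partition(partition):
    """Local work on one node: count words using only that node's data."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Combine two partial results; the only step that crosses node boundaries."""
    return left + right


def word_count(records, num_nodes=4):
    partitions = split_into_partitions(records, num_nodes)
    # On a real cluster these calls would run in parallel, one per node.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


lines = ["hadoop stores data", "hadoop processes data", "data scales out"]
print(word_count(lines, num_nodes=2))
```

Because no node needs another node's data until the final merge, adding nodes only changes how the input is split, which is what "scales horizontally" refers to.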

How Hadoop Can Help You
  Parallel processing of large amounts of data
  Perform analytics on a big scale
  Dealing with diverse data: structured, semi-structured, unstructured
  ‘Cold’ storage / archives
  Performance is usually suboptimal for:
    Random reads and real-time access
    ‘Small’ datasets

There are already interesting use cases
  WLCG grid monitoring (data transfers etc.)
  ATLAS events indexing
  CASTOR log aggregation
  Data warehousing
  Logging/time series data
  IT monitoring

Hadoop Service in IT
  Set up and run the infrastructure
  Provide consultancy
  Build the community
  Joint work of IT-DB and IT-DSS

Hadoop Clusters in IT (Oct 2015)
  lxhadoop (22 nodes)
    general purpose cluster (mainly used by ATLAS)
    stable software setup
    recent hardware
  analytix (56 nodes)
    for analysis of monitoring data
    varied hardware specifications
    the biggest in terms of number of nodes
  hadalytic (17 nodes)
    general purpose cluster with additional services
    recent hardware

Many Configuration Options
  Hadoop is a platform
    Many components and key decisions in the implementation
    Rapidly evolving field
  Examples
    Data access: domain-specific language or SQL
    Many components and data formats
    Data loading and unloading tools

Currently available components
  HDFS: Hadoop Distributed File System
  HBase: NoSQL columnar store
  YARN: cluster resource manager
  MapReduce
  Hive: SQL
  Pig: scripting
  Flume: log data collector
  Sqoop: data exchange with RDBMS
  ZooKeeper: coordination
  Impala: SQL
  Spark: large-scale data processing

Software version policy
  Align to CDH distributions
  [Table: versions of CDH, HDFS, HBase, Hive, Pig, Spark, Impala, and Sqoop on lxhadoop (22 nodes), analytix (56 nodes), and hadalytic (17 nodes); the version numbers are not preserved in this transcript]

Maintenance activities
  Actions: upgrades to a newer CDH
  Frequency: typically twice a year
  Impact: downtime of 1-3 hours
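As a rough worked example of what this maintenance policy implies, the sketch below turns the figures on the slide into a yearly availability estimate. It assumes exactly two upgrades per year, one outage per upgrade, and no other downtime; the function name is hypothetical and the result covers planned maintenance only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def planned_availability(upgrades_per_year, downtime_hours_per_upgrade):
    """Fraction of the year the service is up, counting only planned upgrades."""
    downtime = upgrades_per_year * downtime_hours_per_upgrade
    return 1.0 - downtime / HOURS_PER_YEAR


best = planned_availability(2, 1)   # two upgrades of 1 hour each
worst = planned_availability(2, 3)  # two upgrades of 3 hours each
print("planned downtime: 2 to 6 hours per year")
print(f"availability from planned maintenance: {worst:.4%} to {best:.4%}")
```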

Recent activities (last 3 months)
  Hadoop tutorials during the summer
  Deployment of the Cloudera Impala component
  Monitoring of hanging HBase region servers
  Self-service Oracle2Hadoop integration (work in progress)
  Building a database of users’ data sources

Contact points
  Service is available in SNOW
    SE: Hadoop Service
    FE: Hadoop Components
    FE: Hadoop Core
  E-group:
  Show up at the Wednesday meetings
    Analytic Working Group
    Hadoop User Forum

How to Learn More
  Hadoop tutorials at CERN, summer 2015
    Introduction to Hadoop (architecture, HDFS, MapReduce, Spark)
    SQL on Hadoop (Hive, Impala)
    NoSQL on Hadoop (HBase)
  We plan to do more/repeats in the future

Future plans
  Infrastructure
    HDFS backups
    Rolling upgrades
    Support from Cloudera?
  Users community
    Write a Knowledge Base (SNOW)
  New features/technology testing
    Kudu: a new columnar store from Cloudera
    Tachyon: in-memory file system