Hadoop IT Services. Hadoop Users Forum, CERN, October 7th, 2015. CERN IT-D*


1 Hadoop IT Services. Hadoop Users Forum, CERN, October 7th, 2015. CERN IT-D*

2 Hadoop: a framework for large-scale data processing. Distributed storage and processing. Shared-nothing architecture that scales horizontally. Optimized for high throughput on sequential data access. [Diagram: Node 1 to Node X, each with its own memory, CPU, and disks, connected by an interconnect network.]
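
As a rough illustration of the shared-nothing model on this slide, the sketch below mimics a map/shuffle/reduce word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample data are made up for illustration, and the "nodes" are simply list entries processed one after another rather than in parallel.

```python
from collections import defaultdict


def map_phase(block):
    """Runs against one node's local block: emit (word, 1) for every word."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_outputs, num_reducers):
    """Route each key to exactly one reducer partition, chosen by hash,
    so that all values for a given word end up in the same place."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Runs once per reducer partition: sum the counts for each word."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(blocks, num_reducers=2):
    # In a real cluster each block would be mapped on the node that stores it;
    # here the blocks are processed sequentially to keep the sketch simple.
    mapped = [map_phase(block) for block in blocks]
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_phase(partition))
    return result


# Hypothetical input: each inner list stands for the data held by one node.
blocks = [
    ["hadoop stores data in blocks", "blocks live on many nodes"],
    ["nodes process local blocks"],
]
counts = word_count(blocks)
```

Because no map task reads another node's block and each key is reduced in exactly one partition, adding nodes adds capacity without shared state, which is the property the slide calls horizontal scaling.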

3 How Hadoop Can Help You: parallel processing of large amounts of data; analytics on a big scale; dealing with diverse data (structured, semi-structured, unstructured); 'cold' storage and archives. Performance is usually suboptimal for random reads and real-time access, and for 'small' datasets.

4 There are already interesting use cases of Hadoop @CERN: WLCG grid monitoring (data transfers etc.); ATLAS events indexing; CASTOR log aggregation; data warehousing; logging/time series data; IT monitoring.

5 Hadoop Service in IT: set up and run the infrastructure; provide consultancy; build the community. Joint work of IT-DB and IT-DSS.

6 Hadoop Clusters in IT (Oct 2015). lxhadoop (22 nodes): general purpose cluster (mainly used by ATLAS), stable software setup, recent hardware. analytix (56 nodes): for analysis of monitoring data, varied hardware specifications, the biggest in terms of number of nodes. hadalytic (17 nodes): general purpose cluster with additional services, recent hardware.

7 Many Configuration Options. Hadoop is a platform: many components and key decisions in the implementation; a rapidly evolving field. Examples: data access via a domain-specific language or SQL; many components and data formats; data loading and unloading tools.

8 Currently available components: HDFS (Hadoop Distributed File System); HBase (NoSQL columnar store); YARN (cluster resource manager); MapReduce; Hive (SQL); Pig (scripting); Flume (log data collector); Sqoop (data exchange with RDBMS); ZooKeeper (coordination); Impala (SQL); Spark (large-scale data processing).

9 Software version policy: align to CDH distributions. Clusters: lxhadoop (22 nodes), analytix (56 nodes), hadalytic (17 nodes). Versions by component:
CDH:    5.1.0, 5.4.2
HDFS:   2.3.0, 2.6.0
HBase:  0.98.1, 1.0.0
Hive:   0.12.0, 1.1.0
Pig:    0.12.0
Spark:  1.0.0, 1.3.0
Impala: --, 2.2.0
Sqoop:  1.4.4, 1.4.5

10 Maintenance activities. Actions: upgrades to a newer CDH. Frequency: typically twice a year. Impact: downtime of 1-3 hours.
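
To put the figures on this slide in perspective, the short calculation below turns "twice a year" and "1-3 hours" into a yearly downtime and an availability fraction. It is a back-of-the-envelope sketch that assumes planned upgrades are the only source of downtime, which the slide does not claim; the function name `availability` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(upgrades_per_year, downtime_hours):
    """Fraction of the year the service is up, counting only planned upgrades."""
    return 1 - (upgrades_per_year * downtime_hours) / HOURS_PER_YEAR


# Figures taken from the slide: two upgrades a year, 1 to 3 hours each.
best = availability(upgrades_per_year=2, downtime_hours=1)
worst = availability(upgrades_per_year=2, downtime_hours=3)

print(f"yearly planned downtime: {2 * 1}-{2 * 3} hours")
print(f"availability: {worst:.2%} to {best:.2%}")
```

Under that assumption the planned maintenance costs between 2 and 6 hours a year, which corresponds to roughly 99.93% to 99.98% availability.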

11 Recent activities (last 3 months): Hadoop tutorials during the summer; deployment of the Cloudera Impala component; monitoring of hanging HBase region servers; self-service Oracle2Hadoop integration (work in progress); building a database of users' data sources.

12 Contact points. Service is available in SNOW: SE: Hadoop Service; FE: Hadoop Components; FE: Hadoop Core. E-group: it-analytics-wg@cern.ch. Show up at the Wednesday meetings: Analytic Working Group, Hadoop User Forum.

13 How to Learn More. Hadoop tutorials at CERN, summer 2015:
Introduction to Hadoop (Architecture, HDFS, MapReduce, Spark): https://indico.cern.ch/event/404527/
SQL on Hadoop (Hive, Impala): https://indico.cern.ch/event/434650/
NoSQL on Hadoop (HBase): https://indico.cern.ch/event/442004/
We plan to do more sessions and repeats in the future.

14 Future plans. Infrastructure: HDFS backups; rolling upgrades; support from Cloudera? Users community: write a Knowledge Base (SNOW). New features/technology testing: Kudu, a new columnar file system from Cloudera; Tachyon, an in-memory file system.

