
Data Analytics and Hadoop Service in IT-DB. Visit of Cloudera - April 19th, 2016. Luca Canali (CERN) for IT-DB.


1 Data Analytics and Hadoop Service in IT-DB. Visit of Cloudera - April 19th, 2016. Luca Canali (CERN) for IT-DB

2 Data Life Cycle Management @ CERN, Luca Canali

3 [Diagram: Node 1, Node 2, ... Node n connected by an interconnect]
- Queries run in parallel on the cluster nodes
- The shared-nothing architecture scales to high capacity and throughput on commodity hardware
- Example of Oracle RAC deployed with shared storage
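The idea of queries running in parallel on the cluster nodes can be illustrated with a minimal scatter/gather sketch. This is not taken from the presentation: the function names are hypothetical, and threads stand in for cluster nodes purely for illustration. Each "node" owns its own partition of the data, computes a partial aggregate independently, and only the small partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Assign every row to exactly one node (round-robin, for simplicity)."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def local_sum(node_rows):
    """Work done independently on one node: a partial aggregate."""
    return sum(node_rows)


def parallel_sum(rows, n_nodes=4):
    """Scatter rows to nodes, aggregate locally in parallel, gather partials."""
    shards = partition(rows, n_nodes)
    # Threads are an assumption standing in for separate cluster nodes.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, shards))
    # Only the per-node partial sums cross the "interconnect".
    return sum(partials)
```

Because no node needs another node's partition to compute its partial result, adding nodes adds both storage capacity and scan throughput, which is the scaling property the slide refers to.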

4 Hadoop Service at CERN IT
- 3 production + 1 QA cluster: ~100 nodes in total
- Notable items in the tech stack (CDH):
  - HBase, MapReduce, Pig, Hive, Spark, Impala
  - Kafka, Flume, Sqoop
  - Parquet, Avro
  - Hue

5 In the following: examples of projects we are working on with the user community and developers. Our goals: help with implementation, provide support, and drive platform evolution.

6 ATLAS EventIndex
- Repository of events
- Uses HBase for fast lookup of events
- Size: ~40 TB/year
- Uses HDFS sequence (Map) files
- In production and also being developed/evolved

7 Accelerator Log System
- Currently in Oracle, ~400 TB
- New version being developed on Hadoop
- Prototype ingesting ~200 GB/day
- Kafka + Gobblin -> Parquet
- Access: Impala + Spark

8 Analytics for the Future Circular Collider (FCC)
- Accelerator logging data from Oracle
- Copy to Hadoop
- Read with Impala
- Front end: Hue
- This project is also using Oracle BDD

9 Industrial Controls
- WinCC (Siemens) currently archiving into Oracle (~30 TB)
- Project to offload queries to Hadoop
- Hybrid solution: new data in Oracle and archive read with Impala
- Data movement with Sqoop
- Submitted a Sqoop patch improving performance for writing into Parquet
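One way to picture the hybrid solution (new data in Oracle, archive read with Impala) is a small router that splits a time-range query at an archive cutoff. This is only a sketch under stated assumptions: the cutoff date, the function name, and the backend labels are hypothetical and do not describe the actual CERN implementation.

```python
from datetime import date

# Hypothetical cutoff: rows older than this are assumed to have been
# offloaded to Hadoop, while newer rows are assumed to remain in Oracle.
ARCHIVE_CUTOFF = date(2016, 1, 1)


def route_query(start, end, cutoff=ARCHIVE_CUTOFF):
    """Split the half-open range [start, end) into per-backend sub-ranges.

    Returns a list of (backend, start, end) tuples, oldest first.
    """
    if start >= end:
        return []  # empty range, nothing to query
    if end <= cutoff:
        return [("impala", start, end)]  # entirely archived data
    if start >= cutoff:
        return [("oracle", start, end)]  # entirely recent data
    # The range straddles the cutoff: query both systems and concatenate.
    return [("impala", start, cutoff), ("oracle", cutoff, end)]
```

For example, `route_query(date(2015, 11, 1), date(2016, 2, 1))` yields one Impala sub-range ending at the cutoff and one Oracle sub-range starting at it, so that no row is read twice.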

10 Monitoring
- Monitoring dashboards
  - In production
  - For IT, WLCG
- New generation applications
  - Moving from relational DBs
  - Use lambda architecture
  - Stream: Flume + Spark Streaming
  - Batch: with Spark jobs
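The lambda architecture mentioned on this slide combines a batch layer, which periodically recomputes views from the full event history, with a speed layer that covers events arriving since the last batch run. The following is a minimal in-memory sketch of that pattern, not the Flume/Spark pipeline itself. The class and method names are hypothetical, and it assumes for simplicity that no events arrive while a batch run is in progress.

```python
from collections import Counter


class LambdaCounter:
    """Counts events per host using a batch view plus a speed view."""

    def __init__(self):
        self.master = []             # append-only log of all events
        self.batch_view = Counter()  # recomputed from the full log
        self.speed_view = Counter()  # events seen since the last batch run

    def ingest(self, host):
        """Stream path: append to the log and update the speed view."""
        self.master.append(host)
        self.speed_view[host] += 1

    def run_batch(self):
        """Batch path: rebuild the view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, host):
        """Serving path: merge the batch and speed views at read time."""
        return self.batch_view[host] + self.speed_view[host]
```

Queries stay up to date between batch runs because the speed view fills the gap, while the periodic full recomputation keeps the long-term result independent of any errors accumulated in the incremental path.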

11 Challenges
- Real-time analytics
  - Currently batch processing or ad-hoc solutions
- Integration between components
  - Access control, resource management/security for Impala, Spark, HBase
  - Integration with legacy systems and data ingestion
  - Issue: missing support for complex data types in Impala and Kudu
- Operational issues
  - Learn how to run critical services on Hadoop
  - Example: backups and data preservation

12 Testing at Scale
- Use cases from controls and physics
  - See openlab project proposal
- Higher scale and throughput than what has been done with our clusters so far
  - Ingestion of 1M changes/sec
  - Processing of 1 PB of physics data with ~1000 cores
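To get a feel for the targets on this slide, a back-of-the-envelope calculation helps. The sketch below uses decimal units (1 PB = 10^15 bytes). The per-core scan rate in the usage note is a hypothetical figure chosen only for illustration and is not a number from the presentation.

```python
PB = 10**15            # decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def bytes_per_core(total_bytes, cores):
    """Data volume each core must process if work is spread evenly."""
    return total_bytes / cores


def scan_hours(total_bytes, cores, bytes_per_sec_per_core):
    """Wall-clock hours for one full pass at a given per-core scan rate."""
    seconds = total_bytes / (cores * bytes_per_sec_per_core)
    return seconds / 3600


def changes_per_day(changes_per_sec):
    """Daily event count for a sustained ingestion rate."""
    return changes_per_sec * SECONDS_PER_DAY
```

With these definitions, `bytes_per_core(PB, 1000)` is 10^12 bytes, i.e. 1 TB per core, and `changes_per_day(1_000_000)` is 86.4 billion changes per day. Under the assumed scan rate of 100 MB/s per core, `scan_hours(PB, 1000, 100e6)` comes to roughly 2.8 hours for a single pass over the data.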

