Published by Adam Allen. Modified over 8 years ago.
1
Grid Technology
CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
DBCF GT
Our experience with NoSQL and MapReduce technologies
Fabio Souto
IT Monitoring Working Group, 19th September 2011
2
Outline
Objective
Big data technologies
Technologies reviewed
Deployed infrastructure
Current status
Lessons learned
3
Problem and goal
The SAM infrastructure for WLCG
  – monitors 400 sites and ~2,000 services daily
  – receives and stores ~600,000 metric results daily
  – computes statuses and hourly availabilities for services and sites
SWAT is a system to gather information about the configuration of worker nodes (WNs)
Massive data generation makes storage, search, sharing, analytics and visualization difficult
Objective: proof of concept using big data technologies
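The hourly availability computation mentioned for SAM can be illustrated with a small sketch. It assumes, purely for illustration, that availability is the fraction of OK metric results per service per hour; the slides do not describe the actual SAM algorithm, and the function name, tuple layout, and host name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_availability(results):
    """Return {(service, hour_start): fraction of OK results}.

    `results` is an iterable of (service, iso_timestamp, status) tuples.
    Assumption: availability = OK results / all results in that hour.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    for service, timestamp, status in results:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        key = (service, hour.isoformat())
        total[key] += 1
        if status == "OK":
            ok[key] += 1
    return {key: ok[key] / total[key] for key in total}


# Hypothetical sample data, not taken from SAM.
sample = [
    ("ce01.example.org", "2011-09-19T10:05:00", "OK"),
    ("ce01.example.org", "2011-09-19T10:35:00", "CRITICAL"),
    ("ce01.example.org", "2011-09-19T11:05:00", "OK"),
]
availability = hourly_availability(sample)
```

At ~600,000 results per day a single process like this would probably cope; the interest of the proof of concept lies in how the same grouping scales out when the data grows.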
4
Big Data Technologies
NoSQL databases
  – Not relational. Schema free.
  – Distributed
  – High availability
MapReduce
  – Framework for processing huge datasets on clusters of computers
  – Takes advantage of data locality: moving the computation is more efficient than moving the data
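The MapReduce model can be sketched in a single Python process as a map step, a grouping (shuffle) step, and a reduce step. This is only an illustration of the programming model, not Hadoop code: the names `mapper`, `reducer`, and `run_mapreduce` and the sample records are hypothetical, and a single process cannot show data locality, which comes from the framework scheduling map tasks on the nodes that hold the data.

```python
from collections import defaultdict


def mapper(record):
    """Emit one ((site, status), 1) pair per metric result."""
    site, _metric, status = record
    yield (site, status), 1


def reducer(key, values):
    """Sum the counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical metric results: (site, metric name, status).
records = [
    ("SITE-A", "job-submit", "OK"),
    ("SITE-A", "job-submit", "CRITICAL"),
    ("SITE-A", "storage-read", "OK"),
    ("SITE-B", "job-submit", "OK"),
]
counts = run_mapreduce(records, mapper, reducer)
```

Because each mapper call sees one record and each reducer call sees one key, both steps can be spread across many machines without changing the functions themselves.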
5
Technologies reviewed
NoSQL databases: ~140 different solutions; we focused on:
  – MongoDB
      – No durability (at the time of the study)
  – Cassandra
      – No single point of failure
      – Big and responsive community
Apache Hadoop
  – De facto standard for big data
  – Framework for data-intensive applications
  – Used to write MapReduce jobs for Cassandra
6
Technologies reviewed II
Hive and Pig
  – Ease the complexity of writing MapReduce
  – Initially not considered
      – Less efficient than pure Hadoop
  – Independent of the data source
      – We can change to HBase easily
  – Hive: SQL-like syntax
  – Pig: data flow language
      – Not Turing complete (no loops, if-else…), but can be embedded into Python code
      – Custom functions can be written in Python or Java
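As an illustration of a custom Pig function written in Python, the sketch below maps a metric status to a numeric score. The function name, the schema string, and the scoring rule are hypothetical. It assumes that Pig's Jython support supplies an `outputSchema` decorator when the file is registered as a UDF; the fallback definition exists only so the file also runs as plain Python.

```python
try:
    # Assumption: Pig provides this name when the file is loaded as a Jython UDF.
    outputSchema
except NameError:
    # Fallback so the module can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("score:int")
def status_to_score(status):
    """Return 1 for an OK status, 0 for anything else, None for missing input."""
    if status is None:
        return None
    return 1 if status.strip().upper() == "OK" else 0
```

In a Pig script such a file would typically be loaded with a statement along the lines of `REGISTER 'udfs.py' USING jython AS myfuncs;` (the file and namespace names are placeholders) and then called as `myfuncs.status_to_score(status)` inside a `FOREACH ... GENERATE`.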
7
Technologies reviewed III
Hue
  – Set of Django apps to interact with Hadoop
OpenTSDB
  – Open source time series database
  – Lack of flexibility
Oozie
  – Job scheduler and workflow engine for Hadoop
8
Other Tools
Msg-consume2db inserter:
  – WLCG messaging infrastructure -> NoSQL
sql2nosql-sync
  – SAM Oracle DB -> NoSQL
9
Actual infrastructure
Deployed infrastructure
10
Actual infrastructure
11
Current status
SAM
  – DONE: running infrastructure that reads messaging and SAM data and launches Pig jobs to calculate availability
  – TODO:
      – Results tuning
      – Web interface to visualize the results
      – JSON/XML API to extract results
      – Unit testing
SWAT
  – Early stage of development (~6 days)
  – Data collection
12
Lessons learned
Use an abstraction layer on top of Hadoop
  – Writing pure MapReduce Hadoop apps is difficult and time-consuming
Choose a solution with a responsive community:
  – The technology is at an early stage (unresolved bugs, undocumented functions), so you will need to get in touch with developers/users
Big data needs a big platform
13
Lessons learned
Must keep up to date. New companies, technologies and tools are emerging
  – Twitter real-time Hadoop about to be released
  – Cascalog, a Hadoop data mining language
  – Big data distributions: Cloudera, DataStax, MapR…
14
Questions?