BIG DATA APPLICATIONS & ANALYTICS: LOOKING AT INDIVIDUAL HPC-ABDS SOFTWARE LAYERS. Geoffrey Fox, BigDat 2015, January 26, 2015.

Big Data Applications & Analytics: Looking at Individual HPC-ABDS Software Layers
Geoffrey Fox, BigDat 2015: International Winter School on Big Data, Tarragona, Spain, January 26-30, 2015
School of Informatics and Computing and Digital Science Center, Indiana University Bloomington

Using the HPC-ABDS Software Stack: Cloud Computing Software

There are a lot of Big Data and HPC software systems. The challenge: manage an environment that offers these different components.

USING HPC-ABDS LAYERS I
1) Message Protocols: This layer is unlikely to be directly visible in many applications, as it is used in the "underlying system". Thrift and Protobuf have similar functionality and are used to build messaging protocols between the components (services) of a system.
2) Distributed Coordination: Zookeeper is likely to be used in many applications, as it is how one achieves consistency in distributed systems, especially in overall control logic and metadata. It is, for example, used in Apache Storm to coordinate distributed streaming data input with multiple servers ingesting data from multiple sensors (see the ZooKeeper sketch after this slide). JGroups is less commonly used and is very different: it builds secure multicast messaging with a variety of transport mechanisms.
3) Security & Privacy I: Security and privacy is of course a huge area, present implicitly or explicitly in all applications. It covers authentication and authorization of users and the security of running systems. On the Internet there are many authentication systems, with sites often allowing you to use Facebook, Microsoft, Google, etc. credentials. InCommon, operated by Internet2, federates research and higher-education institutions in the United States with identity management and related services.
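
To make layer 2 concrete, here is a minimal sketch of distributed coordination with ZooKeeper using the kazoo Python client; the connection string, znode paths, and payload are placeholders rather than details of any specific deployment.

    # Minimal ZooKeeper coordination sketch (layer 2) using the kazoo client.
    # The host address, znode paths, and payload below are hypothetical.
    import time
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Register this worker as an ephemeral, sequential znode: it disappears
    # automatically if the process dies, so the rest of the system keeps a
    # consistent view of which workers are alive.
    zk.ensure_path("/app/workers")
    zk.create("/app/workers/worker-", value=b"host-a", ephemeral=True, sequence=True)

    # React to membership changes, e.g. to rebalance streaming input as in Storm.
    @zk.ChildrenWatch("/app/workers")
    def on_workers_changed(children):
        print("live workers:", children)

    time.sleep(2)
    zk.stop()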

USING HPC-ABDS LAYERS II
3) Security & Privacy II: LDAP is a simple (key-value) database forming a set of distributed directories that record properties of users and resources according to the X.500 standard. It allows secure management of systems. OpenStack Keystone is a role-based authorization and authentication environment used in OpenStack private clouds.
4) Monitoring: Ambari is aimed at installing and monitoring Hadoop systems. Nagios and Ganglia are similar system monitors with the ability to gather metrics and produce alerts. Inca is a higher-level system allowing user reporting of the performance of any subsystem. Essentially all systems use monitoring, but most users do not add custom reporting.
5) IaaS Management from HPC to hypervisors: These technologies underlie all applications. The classic technology OpenStack manages virtual machines and associated capabilities such as storage and networking; a minimal authentication-and-listing sketch follows this slide. The commercial clouds have their own solutions, and it is possible to move machine images between these different environments. As a special case there is "bare metal", i.e. the null hypervisor. The DevOps technology Docker is playing an increasing role as a Linux container.
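
As one hedged illustration of layer 5 (and of Keystone from layer 3), the sketch below authenticates against an OpenStack cloud and lists its virtual machines using the openstacksdk library; the endpoint, project, and credentials are placeholders.

    # Layer-5 IaaS sketch: authenticate via Keystone and list VMs with openstacksdk.
    # The auth URL, project, and credentials are hypothetical placeholders.
    import openstack

    conn = openstack.connect(
        auth_url="https://cloud.example.edu:5000/v3",
        project_name="demo",
        username="demo",
        password="secret",
        user_domain_name="Default",
        project_domain_name="Default",
    )

    # Keystone issues a token behind the scenes; the compute (Nova) API is then queried.
    for server in conn.compute.servers():
        print(server.name, server.status)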

USING HPC-ABDS LAYERS III
6) DevOps: This describes technologies and approaches that automate the deployment and installation of software systems, and it underlies "software-defined systems". At IU we integrate tools together in Cloudmesh: Libcloud, Cobbler, Chef, Docker, Slurm, Ansible, Puppet, Celery. We saw Docker earlier in layer 5 on the last slide.
7) Interoperability: This covers both standards and interoperability libraries for services (Whirr), compute (OCCI), and virtualization and storage (CDMI).
8) File systems: One will use files in most applications, but the details may not be visible to the user. You may instead interact with data at the level of a data management system or an object store such as OpenStack Swift or Amazon S3 (see the object-store sketch after this slide). Most science applications are organized around files; commercial systems work at a higher level.
9) Cluster Resource Management: You will certainly need cluster management in your application, although often this is provided by the system and not explicit to the user. Yarn from Hadoop is gaining in popularity, while Slurm is a basic HPC system, as are Moab, SGE and OpenPBS; Condor is also well known for scheduling Grid applications. Mesos is similar to Yarn and is also becoming popular. Many systems are in fact collections of clusters, as in data centers or grids. These require management and scheduling across many clusters; the latter is termed meta-scheduling.
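
As a sketch of layers 7 and 8, the example below uses Apache Libcloud (one of the Cloudmesh tools listed above) to talk to an S3-style object store through a provider-neutral API; the credentials, container name, and file paths are placeholders, and changing the provider constant would be the starting point for targeting OpenStack Swift instead.

    # Object-store sketch (layers 7/8) using Apache Libcloud's provider-neutral API.
    # Credentials, container name, and file paths are hypothetical.
    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver

    S3Driver = get_driver(Provider.S3)   # Provider.OPENSTACK_SWIFT would target Swift
    driver = S3Driver("ACCESS_KEY", "SECRET_KEY")

    container = driver.get_container("my-research-data")
    driver.upload_object("results.csv", container, "experiments/results.csv")

    for obj in container.list_objects():
        print(obj.name, obj.size)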

USING HPC-ABDS LAYERS IV
10) Data Transport: Globus Online, or GridFTP, is the dominant system in the HPC community, but this area is often not highlighted, as an application often only starts after the data has made its way to the disk of the system to be used. Simple HTTP protocols are used for small data transfers, while the largest ones use the "FedEx/UPS" solution of transporting disks between sites.
11) A) File management, B) NoSQL, C) SQL: This is a critical area for nearly all applications, as it captures file, object, NoSQL and SQL data management. The many entries in this area testify to the variety of problems (graphs, tables, documents, objects) and to the importance of efficient solutions. Just a little while ago this area was dominated by SQL databases and file managers.
12) In-memory databases & caches / Object-relational mapping / Extraction Tools: This is another important area addressing two points: firstly, conversion of data between formats, and secondly, enabling caching to put as much processing as possible in memory (see the caching sketch after this slide). This is an important optimization, and Gartner has highlighted the area in several recent hype charts with In-Memory DBMS and In-Memory Analytics.
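
A minimal sketch of the layer-12 caching idea, assuming a Redis server running locally: hot lookup data is kept in memory so the SQL/NoSQL backend is only hit on a cache miss. The key scheme and the fetch_from_db callback are hypothetical.

    # In-memory caching sketch (layer 12) with Redis; server location and key
    # naming are assumptions for illustration only.
    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379)

    def get_user_profile(user_id, fetch_from_db):
        key = f"profile:{user_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)               # fast path: served from memory
        profile = fetch_from_db(user_id)            # slow path: hits the database
        cache.setex(key, 300, json.dumps(profile))  # keep it in memory for 5 minutes
        return profile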

USING HPC-ABDS LAYERS V
13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI. This describes the different communication models used by the systems in layers 13 and 14 below. Results may be very sensitive to choices here, as there are big differences between disk-based and point-to-point (no disk) communication, as in Hadoop versus Harp (MPI), and in the latencies exhibited by publish-subscribe systems. I always recommend pub-sub systems like ActiveMQ or RabbitMQ for messaging.
14) A) Basic programming model and runtime, SPMD, MapReduce, MPI; B) Streaming: A very important layer defining the cloud (HPC-ABDS) programming model. It includes Hadoop and related tools: Spark, Twister, Stratosphere, Hama (iterative MapReduce); Giraph, Pregel, Pegasus (graphs); Storm, S4, Samza (streaming); and Tez (workflow) with Yarn integration. Most applications use something here! (A Spark word-count sketch follows this slide.)
15) A) High-level Programming: Components at this level are not required but are very interesting, and we can expect great progress both in improving them and in using them. Pig and Sawzall offer data-parallel programming models; Hive, HCatalog, Shark, MRQL, Impala and Drill support SQL interfaces to MapReduce, HDFS and object stores.
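
As a hedged sketch of the layer-14 programming model, here is the classic MapReduce-style word count expressed in Spark's Python API; the input path and the local master setting are placeholders.

    # MapReduce-style word count (layer 14) in PySpark; input file is hypothetical.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "WordCount")

    counts = (sc.textFile("input.txt")                       # one record per line
                .flatMap(lambda line: line.split())          # map: emit words
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))            # reduce: sum the counts

    print(counts.take(10))
    sc.stop()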

USING HPC-ABDS LAYERS VI
15) B) Frameworks: This is exemplified by Google App Engine and Azure (when it was called PaaS), but now there are many "integrated environments".
16) Application and Analytics: This is the "business logic" of the application and where you find machine learning algorithms such as clustering (see the MLlib sketch after this slide). Mahout, MLlib and MLbase are in Apache for Hadoop and Spark processing; R is a central library from the statistics community. There are many other important libraries, of which we mention those in deep learning (CompLearn, Caffe), image processing (ImageJ), bioinformatics (Bioconductor) and HPC (ScaLAPACK and PETSc). You will nearly always need these or other software at this level.
17) Workflow-Orchestration: This layer implements orchestration and integration of the different parts of a job. These can be specified by a directed data-flow graph and often take the simple pipeline form illustrated in "access pattern" 10, discussed later. This field was advanced significantly by the Grid community, and the systems are quite similar in functionality, although their maturity and ease of use can differ considerably. The interface is either visual (linking programs as bubbles with data flow) or an XML or program (e.g. Python) script.
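
To illustrate layer 16, the sketch below runs k-means clustering with Spark MLlib, one of the libraries named on this slide; the feature file and the choice of k are placeholders.

    # Layer-16 analytics sketch: k-means clustering with Spark MLlib.
    # The input path and k=3 are illustrative assumptions.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[2]", "ClusteringExample")

    # Each input line is assumed to hold space-separated numeric features.
    points = (sc.textFile("features.txt")
                .map(lambda line: [float(x) for x in line.split()]))

    model = KMeans.train(points, k=3, maxIterations=20)
    print("cluster centers:", model.clusterCenters)
    sc.stop()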