CLARA Micro-services architecture for distributed data analytics

Slides:



Advertisements
Similar presentations
Distributed components
Advertisements

A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
Thomas Jefferson National Accelerator Facility Page 1 Clas12 Reconstruction and Analysis Framework V. Gyurjyan S. Mancilla.
Thomas Jefferson National Accelerator Facility Page 1 Clas12 Reconstruction and Analysis Framework V. Gyurjyan S. Mancilla.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
A Data Handling System for Modern and Future Fermilab Experiments Robert Illingworth Fermilab Scientific Computing Division.
Distributed Systems Architectures Chapter 12. Objectives  To explain the advantages and disadvantages of different distributed systems architectures.
Distributed Systems Architectures. Topics covered l Client-server architectures l Distributed object architectures l Inter-organisational computing.
Platform as a Service (PaaS)
AuraPortal Cloud Helps Empower Organizations to Organize and Control Their Business Processes via Applications on the Microsoft Azure Cloud Platform MICROSOFT.
Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.
Fan Engagement Solution
Platform as a Service (PaaS)
PROTECT | OPTIMIZE | TRANSFORM
Chapter 13: I/O Systems Modified by Dr. Neerja Mhaskar for CS 3SH3.
Chapter 9: The Client/Server Database Environment
Smart Building Solution
Introduction to Distributed Platforms
Connected Maintenance Solution
Metis Data Science Meetup:
Status and Challenges: January 2017
Chapter 14 Big Data Analytics and NoSQL
Open Source distributed document DB for an enterprise
Spark Presentation.
Smart Building Solution
Connected Maintenance Solution
Cherwell Service Management is an IT Service Management Solution that Makes it Easier for Users to Capitalize on Power of Microsoft Azure MICROSOFT AZURE.
Wonderware Online Cost-Effective SaaS Solution Powered by the Microsoft Azure Cloud Platform Delivers Industrial Insights to Users and OEMs MICROSOFT AZURE.
The Client/Server Database Environment
The Client/Server Database Environment
Chapter 18 MobileApp Design
The Improvement of PaaS Platform ZENG Shu-Qing, Xu Jie-Bin 2010 First International Conference on Networking and Distributed Computing SQUARE.
Using Microsoft Azure, Crowdnetic Launches Innovative Lending Gateway Platform That Connects Borrowers to Alternative Lenders MICROSOFT AZURE SOLUTION.
University of Technology
CLARA Based Application Vertical Elasticity
NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.
Replication Middleware for Cloud Based Storage Service
Running on the Powerful Microsoft Azure Platform,
Distributed Systems CS
Oscar AP by Massive Analytic: A Precognitive Analytics Platform for Effortless Data-Driven Decisions. Now Available in Azure Marketplace MICROSOFT AZURE.
Designed for Big Data Visual Analytics, Zoomdata Allows Business Users to Quickly Connect, Stream, and Visualize Data in the Microsoft Azure Platform MICROSOFT.
Yellowfin: An Azure-Compatible Business Intelligence Platform That Connects People with Their Data for Better Decision Making MICROSOFT AZURE APP BUILDER.
Oracle Architecture Overview
DeFacto Planning on the Powerful Microsoft Azure Platform Puts the Power of Intelligent and Timely Planning at Any Business Manager’s Fingertips Partner.
Accelerate Your Self-Service Data Analytics
Chapter 17 Parallel Processing
ExaO: Software Defined Data Distribution for Exascale Sciences
MyAppFree, Powered by Microsoft Azure, Lets Global Users Discover and Download Tested and Handpicked Windows Apps and Games for Free MICROSOFT AZURE ISV.
Lecture 1: Multi-tier Architecture Overview
Web Application Architectures
EnMS Polska Builds energyBIS on Microsoft Azure to Ensure a Scalable and Secure Energy Efficiency Monitoring and Management System MICROSOFT AZURE ISV.
CS703 - Advanced Operating Systems
Characteristics of Reconfigurable Hardware
Architectures of distributed systems Fundamental Models
Software models - Software Architecture Design Patterns
Architectures of distributed systems Fundamental Models
Web Application Architectures
Last.Backend is a Continuous Delivery Platform for Developers and Dev Teams, Allowing Them to Manage and Deploy Applications Easier and Faster MICROSOFT.
Distributed Systems CS
Physics data processing with SOA
Introduction To Distributed Systems
CLARA . What’s new? CLAS Collaboration Meeting. March 6, 2019
Architectures of distributed systems Fundamental Models
Web Application Architectures
Chapter 13: I/O Systems.
Design.
DIBBs Brown Dog BDFiddle
Presentation transcript:

CLARA Micro-services architecture for distributed data analytics Vardan Gyurjyan (gurjyan@jlab.org)

Overview Scientific data challenges Modern software solutions Micro-services architecture Flow-based reactive programming How this applies to TMD/GPD extraction project

Science Data 3V Expansion Unprecedented growth of data Volumes Data at rest is growing exponentially. EOS: 1Exabyte in 10 years. LHC: 12-14 Petabyte/year. In 2023 400Petabyte/year. SKA: 22Exabyte/year in 2023 Velocities Data acquisition rates, as well as new data producing devices are in rise. LHC: 1.5GByte/sec SKA: 700Terabyte/sec Varieties Plethora of data formats, data structures and data types

That is a good news We need more data to confirm and/or generate a new knowledge. We need more diverse data. Correlating multiple data sources can lead to interesting insights of all. We need more and easy access to the data. We need to communicate data not the publication.

Then we realized… Existing scientific data processing architectures will have difficulties handling future data volumes and data distribution. DOE Exascale Initiative Commercial data processing engines are well advanced (Apache Hadoop, Spark, Storm, etc.). Can we adopt them for our needs?

Data/software characteristics Yes but… Data/software characteristics Science data Social media data Origin Distributed with computing Distributed with no computing Format Diverse Quasi uniform Processing unit File Processing steps Multiple Limited (map, reduce) Language Old/trusted Modern Heritage code Yes No

Say NO to data-format racism ! We need a new approach…. To store, access and process data Optimize data migration Bring application to data (as much as it is possible) Data location and format agnosticism Data instream processing File based processing In stream processing Say NO to data-format racism ! We all are bytes

CS architectures to help Micro-services architecture (SOA) Flow based reactive programming (FBP)

Micro-services Micro-services architecture Software application is composed of interlocking software bricks called services. Services are linked to each other by the transient data. Applications execute according to the rules of data flow instead of sequential series of instructions (lines of code), written to perform a required algorithm. Services migrate to build an application close to data. Handful of software bricks can be used to build variety of data processing applications Mobile and talkative

Micro-services: Advantages Application is made of components that communicate data Small, simple and independent Easier to understand and develop Less dependencies Faster to build and deploy Reduced develop-deploy-debug cycle Easy to migrate to data Scales independently Independent optimizations Improves fault isolation Eliminates long term commitment to a single technology stack. Easy to embrace new technologies

FBP Glue between micro-services How they connect to achieve computational goals Makes application truly distributed A B B reacts on message Named inputs Sync/Async message passing { A calls B } v.s. { A message B }

FBP: Advantages Concurrency and parallelism are natural. Code can be distributed between cores and across networks. Data flow networks are natural for representing process. Message passing gets rid of the problems associated with shared memory and locks. Data flow programs are more extensible than traditional programs. Fault tolerant

CLARA implements SOA and FBP Application is defined as a network of loosely coupled processes, called services. Services exchange data across predefined connections by message passing, where connections are specified externally to the services. Services can be requested from different data processing applications. Loose coupling of services makes polyglot data access and processing solutions possible (C++, Java, Python, Fortran) Services communicate with each other by exchanging the data quanta. Thus, services share the same understanding of the transient data, hence the only coupling between services.

CLARA architecture Orchestration Layer gateway security Service Layer DPE Local Registration DPE Local Registration SC SC SC S SC S FE Registration S S S S gateway security DPE Local Registration DPE Local Registration SC SC SC S SC S S S S S Service Layer Service Bus (xMsg)

CLARA Service Service Engine Service Container Service Container SE One request at a time Multiple simultaneous requests C++ engine : https://claraweb.jlab.org/clara//docs/quickstart/cpp.html Java engine : https://claraweb.jlab.org/clara//docs/quickstart/java.html Python engine: https://claraweb.jlab.org/clara//docs/quickstart/python.html

CLARA Data Processing Environment Service Container 1 Service Container 2 SE1 SE SE SE SE1 SEn SE1 Registration Admin Service Container n SE1 SEn Data processing Environment

Transient data quanta Data driven, data centric design. The focus is on transient data event modifications. Advantage over algorithm driven design is that a much greater ignorance of the data processing code is allowed (loose coupling).

Application Examples S1 S2 S3 S4 S1 + S2 + S3 + S4; S3 + S5 + S6; S5 F1 F2 S1 + S2 + S3; while( S3 == ”needs calibration") { F1 + F2 + S3; } S3 + S4 + S6; S1 S2 S3 S4 S6

CLARA Distributed Data Provisioning Data Center 1 (e.g. JLAB) Data Center 1 Distributed Client Layer Data Bus N (e.g. EBCMD grid) Data Bus 1 (e.g. raw data) Data Bus 1 Service Bus (CLARA services) Data Bus N Data communications Service Bus User Front-End Data Center 2 User Front-End Data Center 2 CLARA FE hosts: Meta-data Database Data Server Application Server Web Server (JNLP support) Data Visualization Server Application Design, Service upgrade CLARA Cloud Data Center N Data Center N

CLARA Distributed File Storage System CDFS Master DPE Heartbeat/2sec. Available local SSD/HHD Available memory Stored Files File name Status: Active Passive-hot Passive-cold Passive-ready-remove File Metadata DB Key = File Name Value= DPE Info Metadata of the Dataset to be processed Control via SSH Stage Promote Demote Remove

Experimental and Simulated Data CLARA Data Bus Glossary Data processing services. Scaling up to Cn cores. Data passage/link between services Service control messages Service configuration Knowledge feedback IO services In memory data queue (ring-buffer) Computing resource boundaries Data-processing environment CLARA DPE CLAS12 Cn 1 CLARA DPE Farm NodeN EBCMD Table R CLARA Data Bus 1 Cn CLAS12 Simulation and Reconstruction Multi-Node Application CLARA DPE MC S1 MC SN EBCMD Extraction Service 1 Cn EC HT TOF TT PID DST Cloud Computing Resources EC Event Generator HT Cloud NodeN TOF CLAS12 Reconstruction Multi-Node Application Assumptions TT STAT CLARA Data Bus EBCMD Extraction Service JLAB Computing Resources reconstruction program Some other experiment σR Calculation Service σB Calculation Service SF Calculation Service GPD-TMD Calculation Service Simulated data Reconstruction Model Parameterization PID W Cn 1 CLARA DPE CLARA DPE Other Experiment and/or Research Facility Computing Resources CLARA DPE R σR Extraction Service σB Extraction Service SF Extraction Service GPD-TMD Extraction Service STAT EBCMD Extraction Service W W EBCMD W W CLARA DPE Assumptions CLARA DPE Cooked, DST EBCMD Table EBCMD Table EBCMD Table EBCMD Table Reconstructed Data CLARA Data Bus

Conclusion CLARA can be used to build distributed TMD/GPD extraction, analysis, collaboration, and data provisioning system. Data streaming Data migration optimization Heritage code reuse Correlational and predictive studies Increased collaboration First public web server 1993 at CERN: http://info.cern.ch/hypertext/WWW/TheProject.html Here goes the list of TMD/GPD CLARA based data servers…, the first WDW data and application server URLs