Petr Škoda, Jakub Koza Astronomical Institute Academy of Sciences

Presentation transcript:

VO-CLOUD upgrade
Integration of Spark, Jupyter and HDFS in a UWS-driven cloud service
Petr Škoda, Jakub Koza
Astronomical Institute, Academy of Sciences, Ondřejov, Czech Republic
Supported by grant COST LD-15113 of the Ministry of Education, Youth and Sports of the Czech Republic and by the National Science Foundation of China
IVOA Interoperability meeting, GWS Session 1, Shanghai, China, 16th May 2017

Concept of scientific "CLOUD"
  ITERATIVE REPEATING of the SAME computation (workflow)
  Machine learning (of emission line profiles of LAMOST)
  LARGE stable INPUT data + small changing PARAMS
  Many runs on the SAME data (tuning required)
  Graphical visualization from postprocessed output (text) files
  Using a WWW browser: supercomputing on a PDA/mobile
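
The "large stable input + small changing params" pattern can be sketched as a parameter sweep over data that is loaded only once. This is a toy illustration, not VO-CLOUD code: `run_experiment`, the parameter names and the data are all hypothetical placeholders.

```python
from itertools import product


def run_experiment(data, learning_rate, grid_size):
    """Placeholder computation standing in for one machine-learning run."""
    return sum(data) * learning_rate / grid_size


# The large input is loaded once and reused for every run.
data = [1.0, 2.0, 3.0]  # toy stand-in for a big, stable data set

# Only the small parameter sets change between runs (hypothetical names).
grid = {"learning_rate": [0.1, 0.5], "grid_size": [2, 4]}
names = list(grid)

results = {}
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    results[values] = run_experiment(data, **params)
```

Each key of `results` is one parameter combination, so tuning amounts to comparing the entries without touching the input again.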

VO-CLOUD Architecture
  Distributed engine
  MASTER (frontend)
    Database of users and their experiments
    Visualization
    Scheduling
    Load balancing
  SHARED DATA STORAGE - controlled access (Big Data)
  WORKERS (backend)
    Computation [+ output for visualization]
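
The split between a scheduling master and computing workers can be illustrated with a small sketch. The slides do not describe the actual VO-CLOUD scheduler, so this assumes a simple least-loaded policy; `Worker`, `pick_worker` and `assign` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical record the master keeps for each backend worker."""
    name: str
    max_jobs: int          # how many jobs the worker may run at once
    running_jobs: int = 0  # jobs currently assigned to it

    @property
    def load(self) -> float:
        # Fraction of the worker's capacity that is in use.
        return self.running_jobs / self.max_jobs


def pick_worker(workers):
    """Return the least-loaded worker with free capacity, or None."""
    free = [w for w in workers if w.running_jobs < w.max_jobs]
    if not free:
        return None
    return min(free, key=lambda w: w.load)


def assign(workers):
    """Assign one job and return the chosen worker's name (None if all are full)."""
    worker = pick_worker(workers)
    if worker is None:
        return None  # a real master would keep the job queued
    worker.running_jobs += 1
    return worker.name
```

For example, with `workers = [Worker("a", 2), Worker("b", 4)]`, repeated calls to `assign(workers)` spread jobs in proportion to capacity and return `None` once both workers are full.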

Sources of Spectra
  Getting spectra + store (restricted access – big files)
  Files
    UPLOAD from a given local directory (recursive)
    DOWNLOAD by HTTP + index, FTP (recursive)
  VOTable
    UPLOAD VOTable (e.g. prepared in TOPCAT - meta)
    REMOTE VOTable
  SSAP query + Accref + DataLink + SODA
  SAMP control - send to SPLAT
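
As an illustration of the SSAP query step, the sketch below builds a `queryData` request URL from the standard SSAP parameters `REQUEST`, `POS`, `SIZE` and `FORMAT`. The service address is a made-up placeholder, and how VO-CLOUD itself issues the query is not shown in the slides.

```python
from urllib.parse import urlencode


def ssap_query_url(service_url, ra_deg, dec_deg, size_deg, fmt="application/fits"):
    """Build an SSAP queryData URL for a region around (ra, dec)."""
    params = {
        "REQUEST": "queryData",
        "POS": f"{ra_deg},{dec_deg}",  # position in decimal degrees
        "SIZE": size_deg,              # diameter of the search region in degrees
        "FORMAT": fmt,                 # requested spectrum format
    }
    separator = "&" if "?" in service_url else "?"
    return service_url + separator + urlencode(params)


# Hypothetical service endpoint, used only for illustration.
url = ssap_query_url("http://example.org/ssap", 10.5, -2.25, 0.1)
```

The response to such a query is a VOTable whose access reference (Accref) column points at the individual spectra to download.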

Machine Learning of BIG Archive

VO-CLOUD deployment

SOM Worker example

VO-CLOUD spectra visualisation
  Big Data visualization (thousands of spectra)
  Implemented in Python using Matplotlib
  Can visualise multiple selected spectra files
  Figure is generated on the server side and then transferred to the client
  User can pan, zoom and export to different formats
  Uses WebSocket for server-client communication

Hadoop cluster infrastructure
  HDFS – Distributed filesystem used as storage for Hadoop/Spark jobs
    NameNode – One process per cluster. Holds information about all files saved in HDFS.
    DataNode – One process per cluster node. Stores individual file data blocks.
  YARN – Resource manager and scheduler for Hadoop/Spark jobs
    ResourceManager – Main process managing resources and scheduling jobs. One process per cluster.
    NodeManager – Process executing assigned work on each node. One process per cluster node.
  Problem of millions of small files (FITS) – Apache AVRO (Sequence files)
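
Because the NameNode tracks every file, millions of small FITS files are expensive, and the usual remedy is to pack them into a few large container files. The sketch below shows only the idea, using a hand-rolled length-prefixed record layout from the standard library. It is not the Avro or SequenceFile format, and the file names and contents are placeholders.

```python
import io
import struct


def pack_files(files, out):
    """Write (name, payload) pairs into one stream as length-prefixed records."""
    for name, payload in files:
        name_bytes = name.encode("utf-8")
        # Two big-endian unsigned 32-bit lengths, then the name and the data.
        out.write(struct.pack(">II", len(name_bytes), len(payload)))
        out.write(name_bytes)
        out.write(payload)


def unpack_files(stream):
    """Yield (name, payload) pairs back from a stream written by pack_files."""
    header_size = struct.calcsize(">II")
    while True:
        header = stream.read(header_size)
        if not header:
            break  # clean end of the container
        if len(header) < header_size:
            raise ValueError("truncated record header")
        name_len, data_len = struct.unpack(">II", header)
        name = stream.read(name_len).decode("utf-8")
        payload = stream.read(data_len)
        if len(payload) < data_len:
            raise ValueError("truncated record payload")
        yield name, payload


# Toy stand-ins for small FITS files (contents are placeholders, not real FITS).
spectra = [("spec_0001.fits", b"SIMPLE  =  T"), ("spec_0002.fits", b"")]
container = io.BytesIO()
pack_files(spectra, container)
container.seek(0)
restored = list(unpack_files(container))
```

A real deployment would use a splittable, schema-aware container such as Avro so that Spark can read the records in parallel.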

Hadoop cluster deployment

Spark Worker in VO-CLOUD
  Accepts JSON configuration
  Downloads requested files from the Master server to HDFS
  Executes the spark-submit script using implicit parameters or parameters present in the JSON configuration
  Awaits completion of the spark-submit script
  Downloads requested files from HDFS
  Master FILESYSTEM – NFS emulation for HDFS
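
The step from JSON configuration to a `spark-submit` call might look roughly like the sketch below. The JSON field names (`script`, `spark`, `conf`, `arguments`) and the default values are assumptions made for illustration, since the slides do not give the worker's real schema or its implicit parameters. Only the standard `spark-submit` options `--master`, `--deploy-mode`, `--executor-memory` and `--conf` are used.

```python
import json

# Hypothetical defaults standing in for the worker's "implicit parameters".
DEFAULTS = {"master": "yarn", "deploy_mode": "cluster", "executor_memory": "2g"}


def build_spark_submit(config_json):
    """Turn a JSON job description into a spark-submit argument list."""
    config = json.loads(config_json)
    # Values given in the JSON override the defaults.
    options = {**DEFAULTS, **config.get("spark", {})}
    command = [
        "spark-submit",
        "--master", options["master"],
        "--deploy-mode", options["deploy_mode"],
        "--executor-memory", options["executor_memory"],
    ]
    # Extra Spark properties are passed as repeated --conf key=value pairs.
    for key, value in sorted(config.get("conf", {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(config["script"])        # application to run
    command += config.get("arguments", [])  # arguments for the application
    return command


example = '{"script": "som.py", "spark": {"executor_memory": "4g"}, "arguments": ["hdfs:///input"]}'
command = build_spark_submit(example)
```

The resulting list could be handed to `subprocess.run(command, check=True)`, which blocks until the job finishes and so matches the "awaits completion" step.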

JupyterHub
  Consists of Proxy, Hub and individual Jupyter Notebook server instances
  One Jupyter Notebook server instance per authenticated user
  JupyterHub runs as a Docker container
  JupyterHub spawns additional Docker containers – one for each Jupyter Notebook server instance
  Users are isolated from each other and from the host system itself

JupyterHub authentication
  VO-CLOUD generates a temporary token and passes it to the user's web browser
  Web browser uses the username and token for JupyterHub authentication
  JupyterHub uses the vocloud-authenticator package for authentication
  Authenticator asks VO-CLOUD whether the token is valid for the given username
  JupyterHub spawns a new Docker container with a new Jupyter Notebook server instance for the user
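
The token handshake above can be sketched as a small in-memory store that issues short-lived tokens and answers validity questions. This is an assumption-laden illustration, not the vocloud-authenticator code: `TokenStore`, its methods and the 60-second lifetime are hypothetical.

```python
import hmac
import secrets
import time


class TokenStore:
    """Hypothetical in-memory store of short-lived login tokens."""

    def __init__(self, lifetime_seconds=60, clock=time.monotonic):
        self.lifetime = lifetime_seconds
        self.clock = clock
        self._tokens = {}  # username -> (token, expiry time)

    def issue(self, username):
        """Create a temporary token for the user (the VO-CLOUD side)."""
        token = secrets.token_urlsafe(32)
        self._tokens[username] = (token, self.clock() + self.lifetime)
        return token

    def validate(self, username, token):
        """Answer the authenticator's question: is this token valid for this user?"""
        entry = self._tokens.get(username)
        if entry is None:
            return False
        expected, expires_at = entry
        if self.clock() > expires_at:
            del self._tokens[username]  # expired tokens are discarded
            return False
        # Constant-time comparison avoids leaking token contents through timing.
        return hmac.compare_digest(expected.encode(), token.encode())
```

In this sketch the web application would call `issue` when the user opens the notebook link, and the authenticator would call `validate` (over HTTP in a real deployment) before JupyterHub spawns the container.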

JupyterHub deployment

JupyterHub example

Conclusions
  VO-CLOUD is now a very powerful machine learning environment capable of visualizing Big Data
  It can spawn jobs on a remote Spark cluster
  Provides a sandbox for playing with big data in a Jupyter notebook ON A BIG SERVER (memory, CPU, GPU)
  but still missing important capabilities
    AVRO is not part of the Spark worker
    Uses Docker, but so far is not deployable as a Docker (Compose) setup
  Combines Java EE + Python (+ Scala)
  Focused on machine learning of 1D vectors (spectra, time series) employing VO technology and protocols

Source Code: https://github.com/vodev/vocloud