HDF Data in the Cloud
The HDF Team


HDF Data in the Cloud. The HDF Team. Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different from those of traditional data storage systems because 1) the data are accessed over HTTP and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal: enabling collaboration while protecting data producers and users from disruption as their data and applications move to the cloud.

The Landsat Experience. The most significant satellite-data-in-the-cloud experience in the United States comes from the U.S. Geological Survey, which migrated its archive of Landsat data to Amazon Web Services during 2014. The average time to process an image decreased from 375 seconds to 75 seconds because only the 3 bands needed were downloaded instead of all 11+. Across 72,000 images, this saved 21,600,000 seconds, or 250 days, of total processing time.

[chart: per-image processing time in seconds, 2014-2016, before and after the migration; graph by Drew Bollinger (@drewbo19) at Development Seed]

Old queries            18,000
New queries            72,000
Old time (seconds)     375
New time (seconds)     75
Difference (seconds)   300
Time saved (seconds)   21,600,000
Time saved (days)      250

High-performance subsetting has been a cornerstone of the HDF5 experience for decades. HDF5 supports extraction of only the metadata and data users need, whether that is selected bands or subsets along up to 32 dimensions (space, time, band, ...). Our goal is to continue this tradition with high-performance, large-scale analysis from the desktop, the organizational data center, or the cloud.
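The time saved follows directly from the table: each of the 72,000 new queries saved 300 seconds (375 - 75), so

72,000 images x 300 s/image = 21,600,000 s, and 21,600,000 s / 86,400 s per day = 250 days.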

Flexible Data Structures / Stable Access

[diagram: existing analysis and visualization applications and new cloud-native applications sit on the HDF5 library (C, Fortran, Java, Python); the HDF5 Virtual File Driver and the Highly Scalable Data Service connect the library to cloud storage; example data organizations over lat/lon/time include maps, chunks, and rods, with metadata blocks; arrows labeled "Data Migration / Evolution" and "Stability"]

The HDF5 library, shown as the box in the upper left of this slide, supports existing commercial and open-source analysis and visualization applications written in many languages. The HDF Group directly supports C, C++, Fortran, and Java, while other communities support Python, Julia, R, and many other languages.

The data in HDF5 files can be organized in many ways to improve performance for expected use cases. This slide shows two end-member organizations (maps: single lat/lon slices for each time; rods: single pixels across all times) that support mapping and time-series studies respectively, plus compromise 3D chunks that work well for ad hoc subsets. Current HDF5 users do not need to know the specifics of the data organization to access data: the library lets users access data organized in any way with the same application code, although performance will vary.

Our goal is to keep the analysis and visualization applications the same as data in any organization move to the cloud. We will accomplish this using virtual file drivers (VFDs) that plug into the library to support different storage architectures. This approach has been used in HDF5 for many years to support specialized file systems in high-performance computing, and we are now applying that experience to access data in object stores. We are also developing new tools, like the Highly Scalable Data Service, and new interfaces, like the RESTful API (not shown here), to support access to data distributed across object stores using on-demand processing. Our approach protects existing investments in code and tools, the expensive parts of user systems, while allowing data to migrate and evolve. We are also working to support new cloud-native applications and tools.

Flexible Data Location and Storage

[diagram: the same stack as the previous slide, with the HDF5 Virtual File Driver and the Highly Scalable Data Service providing access to local files, private clouds, and public clouds]

The HDF Group is developing library plug-ins and tools for accessing cloud data organized to support any analysis need or use case. Some data providers prefer to store entire files in their native organization as single objects in the cloud and to access the data from those files. Other data providers prefer to split a file into smaller pieces, typically datasets or chunks, and to access the data from those pieces. We expect that, in the end, most data providers will use a mix of these two strategies to support diverse users and use cases.

Current HDF5 users do not need to know the specifics of the data organization to access data. Our cloud strategy will allow users to access data organized in any way and stored in any storage system with the same application code, although performance will vary. Our approach protects existing investments in code and tools, the expensive parts of user systems, while allowing data to migrate and evolve. We are also working to support new cloud-native applications and tools.

Python Alternatives for the netCDF API

[diagram: xarray offers an optimized API on top of two netCDF paths, h5netcdf (a Python netCDF API built on h5py) and netcdf4-python (built on netcdf-C over the HDF5 C library); swapping h5py for h5pyd routes the same calls over HDF REST to the Highly Scalable Data Server and the HDF5 data behind it]

Client/Server Architecture: Data Access Options

[diagram: C/Fortran applications, Python applications, web applications, and command-line tools reach HDF services through community conventions and the HDF5 library (via the REST Virtual Object Layer or the S3 Virtual File Driver), through h5py/h5pyd and the REST API, or through a browser]

Client SDKs for Python and C are drop-in replacements for the libraries used with local files. Clients do not know the details of the data structures or the storage system, and no significant code change is needed to access local or cloud-based data.

HDF5 data providers and users write and access HDF5 data in many different programming languages and many different architectures, and the same diversity will continue as data move to the cloud. We currently support a number of access options.

C and Fortran applications will continue to use community conventions (e.g., HDF-EOS, netCDF, NeXus, BAG, ASDF, ...) and the HDF5 library to access data. Two library plug-ins, the REST Virtual Object Layer (VOL) and the S3 Virtual File Driver (VFD), support these users in different ways depending on their needs. Our growing community of Python users accesses HDF5 data using the open-source h5py package; they can replace that package with h5pyd, which has identical function calls, and access data in the cloud through the new REST API. The REST API can also support users who prefer accessing data through a web browser.

These access options all hide the details of the data storage from users, supporting our goal of data access that is independent of data organization and storage architecture, and protecting data producers and users from disruption as data move to the cloud.
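To make the h5py-to-h5pyd swap concrete, here is a minimal sketch; the file name, domain path, dataset name, and endpoint are illustrative placeholders, not real services:

# Local access with h5py
import h5py

with h5py.File("wind.h5", "r") as f:               # local file (illustrative name)
    subset = f["windspeed"][0:10, :, :]            # read only the slices needed

# Cloud access with h5pyd: identical calls, different import and location
import h5pyd

with h5pyd.File("/shared/wind.h5", "r",            # server-side domain (illustrative)
                endpoint="http://hsds.example.org") as f:
    subset = f["windspeed"][0:10, :, :]            # same application code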

Collaboration

[diagram: programs, projects, teams, and individuals combining chunks from multiple files; cases labeled A through D]

HDF Cloud will enable users to access and analyze the data they need to answer new questions that require distributed datasets from many sources.

The research group on the right is accessing many different chunks from the same original file in one case (A) and combining data from one file with chunks from another in case (B). The group on the left is accessing a single chunk from an original file (C) to answer a local question or develop a model, and then applying that model to multiple chunks from separate datasets (D).

Cloud Optimized HDF

A Cloud Optimized HDF file is a regular HDF file, aimed at being hosted on an HTTP file server, with an internal organization that enables efficient access patterns for expected use cases in the cloud. Cloud Optimized HDF leverages the ability of clients to read just the data in a file they need, and it localizes the metadata to reduce the time it takes to understand the file structure. HDF Cloud enables range gets for files or data collections with hundreds of parameters, including geolocation information, as sketched below.
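As a minimal illustration of what such a range get looks like in practice (the object URL is a placeholder, and this assumes the server honors the standard HTTP Range header, as S3-style stores do):

import requests

url = "https://example-bucket.s3.amazonaws.com/data/example.h5"  # placeholder URL

# Fetch only the first 8 KiB, where a cloud-optimized file keeps its
# superblock and packed metadata, instead of downloading the whole file
resp = requests.get(url, headers={"Range": "bytes=0-8191"})
assert resp.status_code == 206       # 206 Partial Content: server honored the range
header_bytes = resp.content          # the client parses file metadata from this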

Metadata and Data Options

[diagram: several layouts of metadata (grey) and data blocks in a file or object store, labeled A, B, and C]

Kita enables many options for organizing data and metadata. The first option shows a single user accessing an existing HDF5 file on their desktop; here the metadata (grey) are distributed through the file (not necessarily as tidily as they look in the diagram). The second option shows access to the same file, unchanged, in the cloud; the change in location is handled in the HDF5 library by the S3 Virtual File Driver (VFD). The third option shows the same data with the metadata separated and/or centralized in the file; in either case the goal is to enable the metadata to be read in a single access, and in some cases the metadata may be stored or cached on the processing machine. This option typically requires an optimization step when the file is migrated to the cloud. Note that data in the cloud can be accessed by the individual, by others on the team, or, where appropriate, by users outside the team. The final option shows the file sharded into metadata (grey) and data (white) objects; in this case the original file no longer exists, and access is through the Highly Scalable Data Server, h5pyd, or the RESTful HDF5 API.
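A minimal sketch of the S3 VFD path in the second option, assuming an HDF5/h5py build that includes the read-only S3 ("ros3") driver; the URL and dataset name are placeholders:

import h5py

# Open an unchanged HDF5 file directly from object storage; each read
# below turns into HTTP range gets against the object
with h5py.File("https://example-bucket.s3.amazonaws.com/data/example.h5",
               "r", driver="ros3") as f:
    print(list(f.keys()))             # walking metadata triggers small reads
    slab = f["temperature"][0]        # only the chunks for this slice are fetched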

Sustainable Open Source Projects

"We should hold ourselves accountable to the goal of building sustainable open projects, and lay out a realistic and hard-nosed framework by which the investors with money (the tech companies and academic communities that depend on our toolkits) can help foster that sustainability. To be clear, in my view it should not be the job of (e.g.) Google to figure out how to contribute to our sustainability; it should be our job to tell them how they can help us, and then follow through when they work with us."

[chart: developer effort over a project's lifetime]

Titus Brown, "A framework for thinking about Open Source Sustainability?" http://ivory.idyll.org/blog/2018-oss-framework-cpr.html

Interactive Wind Data from HDF Cloud

The HDF Group, the U.S. National Renewable Energy Laboratory (NREL), and the Amazon Web Services open data team have worked together to test HDF Cloud with a large collection of wind data from a mesoscale weather forecast model (WRF). The data were restructured to improve access and migrated to the cloud, and an interactive web visualization tool was built by an intern at NREL.

Links on the slide: National Renewable Energy Lab Wind Data (the web application), the Amazon Web Services blog, and more HDF Cloud information.
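As a hedged illustration of how a client might reach such a collection with h5pyd (the endpoint, domain, dataset name, and indices here are illustrative placeholders and should be checked against the links above):

import h5pyd

# Open the wind data domain on an HSDS endpoint (names illustrative)
with h5pyd.File("/nrel/wtk-us.h5", "r",
                endpoint="https://developer.nrel.gov/api/hsds") as f:
    ws = f["windspeed_100m"]          # hypothetical dataset name
    series = ws[:, 600, 1200]         # full time series at one grid point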

Architecture for the Highly Scalable Data Service

Distributing computing over a collection of processors that can grow and shrink as needed is one of the principal benefits of moving data-access systems to the cloud. The HDF Group developed the Highly Scalable Data Service (HSDS) to help users take advantage of this critical benefit. Data files are split into datasets and chunks and distributed throughout the data store in a number of "buckets", each of which is managed by a specific data node. Incoming requests are balanced across a number of service nodes, each of which accesses part of the original datasets. This approach can recruit large numbers of nodes when necessary for large-scale analytics.

As shown on the previous slide, HSDS can be accessed in many ways. The most developed and tested is the Python package h5pyd, an extension of h5py optimized to use the new RESTful API for HDF that was implemented specifically for data in the cloud. As data move to the cloud, users replace the h5py package with h5pyd and the data are accessed without any changes to the application. Users can also build web applications directly on top of the REST API.

Legend:
- Client: any user of the service
- Load balancer: distributes requests to service nodes
- Service nodes: process requests from clients (with help from data nodes)
- Data nodes: each responsible for a partition of the object store
- Object store: base storage service (e.g., AWS S3)
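For users building directly on the REST API, a hedged sketch of two basic requests follows; the endpoint and domain are placeholders, and the exact routes should be confirmed against the RESTful HDF5 white paper linked later in this deck:

import requests

endpoint = "http://hsds.example.org"        # placeholder service endpoint
domain = "/shared/wind.h5"                  # placeholder server-side domain

# Fetch the domain description, which includes the UUID of the root group
resp = requests.get(endpoint + "/", params={"domain": domain})
resp.raise_for_status()
root_uuid = resp.json()["root"]

# List the links (members) of the root group
links = requests.get(f"{endpoint}/groups/{root_uuid}/links",
                     params={"domain": domain}).json()["links"]
print([link["title"] for link in links])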

Cloud Optimized HDF

- HDF5 (require v1.10?)
- Use chunking for datasets larger than 1 MB
- Use "brick style" chunk layouts (enable slicing along any dimension)
- Use readily available compression filters
- Pack metadata at the front of the file (optimal for the S3 VFD)
- Provide sizes and locations of chunks in the file
- Compressed variable-length data is supported

Many communities optimize HDF5 by creating specialized data models specific to their needs, along with conventions for writing data using those models. As cloud usage increases and The HDF Group continues to explore cloud access options, we are identifying approaches to writing HDF5 files that improve performance in the cloud. Data providers who plan to serve files from the cloud can take advantage of what we have learned to optimize data access for their users, as sketched below.
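A minimal sketch of several of these guidelines with h5py; the paged file-space strategy shown is one way to pack metadata together (it assumes HDF5 1.10.1+ and a recent h5py), and the dataset name, shape, and chunk sizes are illustrative:

import h5py
import numpy as np

data = np.random.rand(128, 256, 256).astype("f4")   # illustrative (time, y, x) cube

# fs_strategy="page" groups metadata into fixed-size pages so a reader
# can pick it up with a few range gets from the front of the file
with h5py.File("cloud_optimized.h5", "w",
               fs_strategy="page", fs_page_size=4096) as f:
    f.create_dataset(
        "temperature",
        data=data,
        chunks=(16, 128, 128),    # "brick style" ~1 MiB chunks: slicing on any axis
        compression="gzip",       # a readily available compression filter
    )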

Why HDF in the Cloud

- Cost-effective infrastructure
  - Pay for what you use vs. pay for what you may need
  - Lower overhead: no hardware setup, network configuration, etc.
- Benefit from cloud-based technologies
  - Elastic compute: scale compute resources dynamically
  - Object-based storage: low cost, built-in redundancy
- Community platform
  - Enables interested users to bring their applications to the data
  - Share data among many users

This slide summarizes some of the important reasons for migrating HDF5 data archives to the cloud.

More Information

- h5serv: https://github.com/HDFGroup/h5serv (documentation: http://h5serv.readthedocs.io/)
- h5pyd: https://github.com/HDFGroup/h5pyd
- RESTful HDF5 white paper: https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf
- Blogs:
  - https://hdfgroup.org/wp/2015/04/hdf5-for-the-web-hdf-server/
  - https://hdfgroup.org/wp/2015/12/serve-protect-web-security-hdf5/
  - https://www.hdfgroup.org/2017/04/the-gfed-analysis-tool-an-hdf-server-implementation/

Please follow these links for more details.

HDF5 Community Support

- Documentation, tutorials, FAQs, examples: https://portal.hdfgroup.org/display/support
- HDF-Forum, a mailing list and archive, great for specific questions: https://portal.hdfgroup.org/display/support/Community
- Helpdesk email, for issues with software and documentation: help@hdfgroup.org

Please follow these links for more details.