Big Earth Data Cloud Service Platform: Architecture & Service


Big Earth Data Cloud Service Platform: Architecture & Service
Xuebin CHI (chi@sccas.cn)
Computer Network Information Center, Chinese Academy of Sciences
ISGC2019, 2019-04-04

Outline
- Background
- Computing Facilities & Storage System
- Data Management and Data Infrastructure
- Computing Engines & Data Analysis Services
- Cloud Service Catalog and Portal
- Conclusion

CASEarth Program
- A CAS Strategic Pioneer Research and Development Program
- Total investment of almost 1.8 billion RMB (almost 250 million euro) over 5 years (2018-2022)
- Components: CASEarth Satellite, CASEarth Cloud platform, Digital Earth platform
- Aims to build the International Big Earth Data Science Center:
  - Building leading-edge big Earth data infrastructure
  - Accelerating big-data-driven scientific discovery
  - Providing decision-support services for government
Diagram labels: Census data, Aviation data, Remote-sensing data, Navigation data, Monitoring data; Big Earth Data Cloud; BioONE, Beautiful China, DBAR, Tri-poles, Ocean, Digital Earth; CAS Projects, National Projects; Technical Innovation, Scientific Discovery, Macro Decision, Social Benefits

Cloud Service Platform
- Many legacy edge systems
- Multi-discipline data and applications
- Edge computing + cloud computing
- Data can be transferred and shared on demand
- Computing capacity can be shared on demand
- Data analysis methods and algorithms can be shared
- Cross-disciplinary discovery can be supported
Diagram labels: Digital Earth; Big Earth Data Cloud Service Platform; Biodiversity; BioONE; Ocean; DeBar; Sequences; Ecosystem

Big Challenges
- How to make data findable, accessible and usable?
  - Data flows automatically from the source to target applications
  - Heterogeneous data is integrated and processed
- How to make cyberinfrastructure and computing facilities easy to share among multiple applications and users?
  - Software-defined deployment for specific applications
  - Autonomous, elastic scale-out
  - Invisible to scientists
- How to share scientific models, big data analysis methods and algorithms?
  - e.g. a pre-trained machine learning model can be reused by multiple applications

High-level architecture of Cloud Service Platform
- Digital Earth
- Cloud service portal: computing service, storage service, data access service, analysis service, subject-oriented service, ...
- Big Earth data software stack (middleware & software): data management, computing engines, analysis engines, visualization
- Earth data pool: social statistics data, thematic data products, research data, satellite remote sensing data, navigation and positioning data, aerial monitoring data, ground survey data
- Infrastructure: high-performance computing, high-throughput computing, massive storage system, network


Hybrid Solution
- Cloud service platform
- Special-purpose computing system for big Earth data
- China National High Performance Computing Environment

A special-purpose computing system for big Earth data
- Integrates HPC, big data and cloud computing in a hybrid architecture
- 1 Pflops HPC
- ≥ 10000 CPU cores, supporting 10000 VMs
- ≥ 35 PB available storage space
- High-speed data exchange network
- GPU acceleration
- Unified authentication, administration and portal

Data Flow Path
Diagram labels: Supercomputing; Cloud Computing; File storage system; Object storage system

China National High Performance Computing Environment
- 2 operating centers (Beijing / Hefei)
- 19 sites
- Portal with micro-service architecture
- Application-oriented global scheduling & prediction
- Resource evaluation standard & comprehensive evaluation index


Key Components of Data Infrastructure
- Data Portal: unified data access interface; findable, accessible, usable
- DataBank: remote sensing data pool; on-demand computing and analysis
- Data Fabric: dynamic aggregation of, and access to, distributed data sources
- Data Repository: online data publishing & sharing; citable, evaluable
- DataBox: analysis-oriented remote sensing data management
- DataStore: object storage, SQL, NoSQL, file system

DataStore
- Multi-mode storage: object + SQL + NoSQL + file system
- Figure: object storage system architecture & stress test

Data Repository
- Long-term storage, sharing and discovery of research data
- Online data upload
- Self-management, publishing on demand
- Unique identification; citable, evaluable
Diagram labels: Store & Manage; Create; Upload; Publish
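The create, upload and publish steps listed for the repository can be pictured as a small state machine. This is a minimal sketch under assumptions: `Deposit`, `State` and the `local:` identifier prefix are illustrative names, and a random UUID stands in for whatever persistent identifier the real repository mints.

```python
import uuid
from enum import Enum


class State(Enum):
    CREATED = "created"
    UPLOADED = "uploaded"
    PUBLISHED = "published"


class Deposit:
    """One data set moving through create -> upload -> publish."""

    def __init__(self, title):
        self.title = title
        self.state = State.CREATED
        self.files = []
        self.identifier = None   # assigned only on publication

    def upload(self, filename):
        if self.state is State.PUBLISHED:
            raise RuntimeError("published deposits are immutable")
        self.files.append(filename)
        self.state = State.UPLOADED

    def publish(self):
        if self.state is not State.UPLOADED:
            raise RuntimeError("upload at least one file before publishing")
        # uuid4 is a placeholder; a real repository would mint a registered PID.
        self.identifier = f"local:{uuid.uuid4()}"
        self.state = State.PUBLISHED
        return self.identifier


deposit = Deposit("Example field survey")   # hypothetical data set
deposit.upload("plots.csv")
pid = deposit.publish()
print(deposit.state.value, pid)
```

Assigning the identifier only at publication mirrors the "publish on demand" and "unique identification, citable" points: until then the owner can keep managing the deposit privately.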

Data Management for Research Projects
- A data management cloud service for data produced by research projects
- Covers the data life cycle: data management plan, data upload, data curation and publication

DataBox & DataBank
- Efficient search and access for PB-scale remote sensing (RS) data

DataBox: a spatio-temporal data management engine
- Reduces the processing time of traditional image analysis by calibrating, pre-computing known extents, aligning pixels and storing metadata in a cell lattice structure, which makes the data analysis-ready
- Components:
  - DboxStorage: I/O middleware
  - DBoxDataset: GDAL driver
  - DBoxMapServer: map serving engine
  - DBoxCache: distributed cache
  - DBoxMR: real-time scheduling
Diagram labels: Ceph cluster; MongoDB cluster (mongos); DboxStorage; Local Cache; GDAL & DBoxDataset; DBoxCache; DBoxMapServd; Python3 Dboxio API; DBoxWebServd; Task Queue; DBoxMR; DBoxTaskServd; Workers
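As a rough illustration of the cell-lattice idea, the sketch below maps a longitude/latitude to a fixed-size grid cell and files scene metadata under every cell that a scene's extent touches. The 1-degree cell size, the class and function names and the scene extents are all assumptions for illustration, not DataBox's actual layout.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # assumed cell size in degrees; not taken from the slides


def cell_of(lon, lat, cell_deg=CELL_DEG):
    """Return the (column, row) lattice cell containing a point."""
    return (math.floor((lon + 180.0) / cell_deg),
            math.floor((lat + 90.0) / cell_deg))


def cells_covering(west, south, east, north, cell_deg=CELL_DEG):
    """Yield every lattice cell touched by a bounding box."""
    c0, r0 = cell_of(west, south, cell_deg)
    c1, r1 = cell_of(east, north, cell_deg)
    for col in range(c0, c1 + 1):
        for row in range(r0, r1 + 1):
            yield (col, row)


class LatticeIndex:
    """Maps lattice cells to the IDs of scenes whose extent touches them."""

    def __init__(self):
        self._cells = defaultdict(set)

    def add_scene(self, scene_id, west, south, east, north):
        for cell in cells_covering(west, south, east, north):
            self._cells[cell].add(scene_id)

    def scenes_at(self, lon, lat):
        return sorted(self._cells.get(cell_of(lon, lat), set()))


index = LatticeIndex()
index.add_scene("scene-A", 102.5, 24.0, 104.2, 25.5)   # hypothetical extents
index.add_scene("scene-B", 110.0, 30.0, 111.0, 31.0)
print(index.scenes_at(103.23087483, 24.531609336))
```

Because the cell for a point is computed arithmetically, a lookup touches one dictionary entry instead of testing every scene extent, which is the kind of saving that pre-computed extents aim for.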

Data Portal
- Browse, search, access, download and visualize data
- Data linking & data recommendation
- Search by keywords & categories
- Hybrid search by keywords and geographical coverage
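Hybrid search by keywords and geographical coverage amounts to combining a keyword match with a bounding-box overlap test. The sketch below shows one way to express that; the `DatasetRecord` type and the catalog entries are made up for illustration and say nothing about how the portal itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    title: str
    keywords: frozenset
    bbox: tuple  # (west, south, east, north) in degrees


def bbox_intersects(a, b):
    """True when two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def hybrid_search(records, keyword, region):
    """Keep records that carry the keyword AND overlap the region."""
    kw = keyword.lower()
    return [r.title for r in records
            if kw in r.keywords and bbox_intersects(r.bbox, region)]


CATALOG = [  # made-up entries for illustration
    DatasetRecord("Yunnan land cover", frozenset({"land cover", "landsat"}),
                  (97.5, 21.1, 106.2, 29.3)),
    DatasetRecord("Arctic sea ice extent", frozenset({"sea ice"}),
                  (-180.0, 60.0, 180.0, 90.0)),
    DatasetRecord("Tibetan Plateau land cover", frozenset({"land cover"}),
                  (73.0, 26.0, 105.0, 40.0)),
]

print(hybrid_search(CATALOG, "land cover", (100.0, 22.0, 104.0, 25.0)))
```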

FAIR Data
- Make each data set Findable, Accessible, Interoperable and Reusable
- PID
- Citation
- Data linkage
- Data recommendation
- APIs for machines


Multiple Data Processing Engines Based on Virtualization and Caching Technology
- Container and virtualization technology packages the computing engine (CE) logic for rapid and hybrid deployment
- The distributed cache addresses data persistence and improves the performance of massive data processing
- Engines: CE for images; CE for multi-dimensional spatial data; computing engine for time-series data; MPP DB with spatial computing extension
- MPP DB with spatial computing extension:
  - MPP DB for PostGIS
  - Parallel spatial aggregation functions
  - Index for spatial objects
  - Apache HAWQ (PostgreSQL 8.2)
  - MPP DB on cloud: centralized storage, multi-tenancy
- Distributed Hierarchical Cache (DHC):
  - APIs: KV, file system and object interfaces
  - RDMA over IB
  - Local cache: data management for the local cache
  - IM cache: distributed in-memory KV cache
  - Local FS: data persistency based on the local file system (SSD + HD)
  - Persistent storage: HDFS, S3, Swift, Ceph, Lustre, NFS, etc.
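The layering described for the Distributed Hierarchical Cache (in-memory cache, local cache, persistent storage) can be illustrated with a read-through lookup that tries the fastest tier first. This is a single-process sketch under assumptions: plain dictionaries stand in for every tier, the LRU eviction policy and the `TieredCache` name are illustrative, and nothing here reflects DHC's real interfaces.

```python
from collections import OrderedDict


class TieredCache:
    """Read-through cache: in-memory tier, then local tier, then backing store."""

    def __init__(self, backing_store, memory_capacity=2):
        self.memory = OrderedDict()      # stands in for the in-memory KV cache
        self.local = {}                  # stands in for the local SSD/HD cache
        self.backing = backing_store     # stands in for HDFS, S3, Ceph, ...
        self.memory_capacity = memory_capacity
        self.hits = {"memory": 0, "local": 0, "backing": 0}

    def _remember(self, key, value):
        """Insert into the memory tier, evicting the least recently used key."""
        self.memory[key] = value
        self.memory.move_to_end(key)
        if len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)

    def get(self, key):
        if key in self.memory:
            self.hits["memory"] += 1
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.local:
            self.hits["local"] += 1
            value = self.local[key]
        else:
            self.hits["backing"] += 1
            value = self.backing[key]    # raises KeyError if truly missing
            self.local[key] = value      # keep a copy on the local tier
        self._remember(key, value)
        return value


store = {"tile-1": b"aa", "tile-2": b"bb", "tile-3": b"cc"}
cache = TieredCache(store, memory_capacity=2)
for name in ["tile-1", "tile-2", "tile-1", "tile-3", "tile-2"]:
    cache.get(name)
print(cache.hits)
```

The hit counters make it easy to see how often a request had to fall through to the slowest tier, which is the cost the hierarchy is meant to reduce.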

MPP DB with Spatial Computing Extension: Performance Test
Test data request:
- Output format: netCDF
- Start date: 2015-01-15 00:00
- End date: 2018-05-24 12:00
- Parameter(s): Temperature
- Vertical level(s): Ground or water surface
- Product(s): 3-hour Forecast
- Link: https://rda.ucar.edu/#dsrqst/JIAN295398/index.html
Data sizes: 4.1 GB compressed netCDF; 13 GB netCDF; 12 GB TIFF; 23 GB loaded into the database

                 | Original | Split
Records          | 14845    | 2116206
Size per record  | 253x205  | 20x20
Query time       | 109.5 s  | 112.8 s

Optimized query time: 4.3 s

Queries:
  SELECT avg(ST_Value(rast, ST_Point(103.23087483, 24.531609336))) FROM test
  SELECT avg(ST_Value(rast, ST_Point(103.23087483, 24.531609336))) AS value FROM test_tile WHERE ST_Intersects(ST_Point(103.23087483, 24.531609336), bounding) = true

Optimized performance: sped up by 23 times
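The second query above restricts the scan to tiles whose `bounding` column intersects the point. The sketch below is a plain-Python analogue of that idea, not PostGIS: it splits a synthetic raster into 20x20 tiles, tags each with a pixel bounding box, and reads pixels only from the tile that contains the requested point. Treating 253x205 as 253 columns by 205 rows, and all function names, are assumptions.

```python
TILE = 20  # tile edge in pixels, matching the 20x20 split on the slide


def split_into_tiles(raster, tile=TILE):
    """Cut a 2-D list into tiles, each tagged with its pixel bounding box."""
    tiles = []
    rows, cols = len(raster), len(raster[0])
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1, c1 = min(r0 + tile, rows), min(c0 + tile, cols)
            pixels = [row[c0:c1] for row in raster[r0:r1]]
            tiles.append({"bounding": (r0, c0, r1, c1), "pixels": pixels})
    return tiles


def value_at(tiles, row, col):
    """Return the pixel value and how many tiles had their pixels read."""
    tiles_read = 0
    for t in tiles:
        r0, c0, r1, c1 = t["bounding"]
        if not (r0 <= row < r1 and c0 <= col < c1):
            continue              # cheap bounding-box test, pixels untouched
        tiles_read += 1
        return t["pixels"][row - r0][col - c0], tiles_read
    raise IndexError("point outside raster")


# Synthetic raster: 205 rows x 253 columns (assumed orientation).
raster = [[r * 1000 + c for c in range(253)] for r in range(205)]
tiles = split_into_tiles(raster)
value, tiles_read = value_at(tiles, 117, 200)
print(len(tiles), value, tiles_read)
```

Only the bounding boxes are compared for the tiles that do not match, so pixel data is decoded for a single small tile per lookup. A database would normally go one step further and index the bounding column so that the non-matching rows are not visited at all.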

Computing Engine for Time-Series Data
- Second-level task distribution and startup
- Container-enabled
- Average latency, image size and startup time are better than Apache Spark and Apache Flink

System                                | Average latency (ms) | Image size | Startup time (s)
Spark Streaming (KVM)                 | 351.90               | 5 GB       | ~60
Spark Streaming (Docker)              | 416.76               | 1.2 GB     | ~6
Flink (KVM)                           | 129.83               |            |
Flink (Docker)                        | 35.57                | 800 MB     |
Computing Engine for Time-Series Data | 28.42                | 100 MB     | ~2

Figure: architecture
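To put the latency figures listed above side by side, the short script below expresses each baseline's average latency as a multiple of the time-series engine's 28.42 ms. The numbers are copied from the slide; the function name and dictionary layout are only for illustration.

```python
# Average latency in ms, copied from the figures listed above.
LATENCY_MS = {
    "Spark Streaming (KVM)": 351.90,
    "Spark Streaming (Docker)": 416.76,
    "Flink (KVM)": 129.83,
    "Flink (Docker)": 35.57,
}
ENGINE_MS = 28.42  # Computing Engine for Time-Series Data


def latency_ratios(baselines, engine_ms):
    """Return each baseline's latency as a multiple of the engine's latency."""
    return {name: round(ms / engine_ms, 2) for name, ms in baselines.items()}


for name, ratio in latency_ratios(LATENCY_MS, ENGINE_MS).items():
    print(f"{name}: {ratio}x")
```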

EarthDataMiner
- Online interactive data analysis environment
- Users write mining and analysis code (Python) against the data processing and analysis function APIs provided by the system

Architecture of EarthDataMiner

Web IDE for EarthDataMiner
- A prototype has been developed
- Users can write data analysis code (Python) online
- A set of basic data processing and analysis function APIs is provided

Algorithm & Model Library
- More than 20 algorithms have been developed and are offered as a cloud service: FaaS (Function as a Service)
Diagram labels: Data; Algorithm; Model
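Offering algorithms as functions behind a service usually comes down to a registry that maps a published name to a callable. The sketch below shows that pattern in process; the `register` and `invoke` helpers and the two toy algorithms are hypothetical and are not part of the platform's library.

```python
ALGORITHMS = {}  # name -> callable; a stand-in for the algorithm library


def register(name):
    """Decorator that publishes a function under a service name."""
    def wrap(func):
        if name in ALGORITHMS:
            raise ValueError(f"algorithm already registered: {name}")
        ALGORITHMS[name] = func
        return func
    return wrap


def invoke(name, **params):
    """Look up a registered algorithm and call it with keyword parameters."""
    try:
        func = ALGORITHMS[name]
    except KeyError:
        raise LookupError(f"unknown algorithm: {name}") from None
    return func(**params)


@register("mean")
def mean(values):
    return sum(values) / len(values)


@register("threshold_mask")
def threshold_mask(values, threshold):
    return [1 if v >= threshold else 0 for v in values]


print(sorted(ALGORITHMS))
print(invoke("threshold_mask", values=[0.1, 0.7, 0.4], threshold=0.4))
```

Calling by name with keyword parameters keeps the caller independent of where the function runs, so the same request shape could later be sent to a remote endpoint.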

Integrated with DataBank
- Upload models, select data and produce data products by issuing commands


Cloud Service: Category
- Infrastructure as a Service: compute, storage, networking; HPC, EMR, ECS, etc.
- Data Management & Sharing: publishing, integration, discovery, accessing, sharing
- Processing & Analysis: processing engines for CASEarth; online big Earth data analysis
- Applications: domain research achievements; specialized application services
- Service Registration & Sharing: open registration for services; universal discovery of services

Cloud Service: Infrastructure as a Service
- Integrates HPC, cloud computing and cloud storage as a unified whole
- Online application & on-demand rapid deployment
- Examples: atmospheric circulation simulation; remote sensing image processing

Cloud Service: Data Management & Sharing
- Data discovery & access for both web users and shell users
- Data reproduction supported
- Data publishing & publication online
- Hybrid integration mode: centralized & distributed
- Unique identification, intelligible
- Shell data access for workspace

Cloud Service: Processing & Analysis
- Specialized processing engines and an analysis platform for Earth studies
- Processing engines are applied for and used online
- Accelerated querying and computing of remote sensing data
- EarthDataMiner: online code editing, code management, task management, map service

Cloud Service: Applications
- BioONE: integrates research achievements of CASEarth projects: BioONE (biodiversity), One Belt One Road, Tri-polar, Ocean
- DataBank: a specialized application service for CASEarth
  - Ready-to-use remote sensing image data
  - High-efficiency RS data engine
  - Querying, accessing, computing
Figure labels: DBAR; DataBank

Conclusion
- An integrated environment based on supercomputing and cloud computing technology is crucial for big-Earth-data-driven discovery and decision support
- The Big Earth Data Cloud Service Project will be a good exploration of how to integrate computing power, algorithms and data to accelerate scientific discovery
- Just beginning; a long way to go

Thank you very much for your attention!
Thanks to my colleagues Jianhui Li, Yining Zhao, etc.