Astronomical Data Processing & Workflow Scheduling in cloud


Astronomical Data Processing & Workflow Scheduling in Cloud -- Big-Data Oriented Research. Zhao Qing, Tianjin University of Science & Technology (天津科技大学). China-VO, Shanghai, 17 May 2017. Hello, everyone, I am a teacher from ...

Team Introduction. Tianjin University of Science & Technology -- a young group in the China-VO family. Today I would like to share with you the work we are researching. I am very glad to get your advice, which would be very important to us.

Contents: 1. Big-data oriented astronomical data processing based on Hadoop & Spark (footprint generation; cross-match). 2. Task scheduling strategies of astronomical workflows in cloud. This is the outline. One line of our research uses high-performance computing technologies such as Hadoop and Spark to tackle astronomical applications, especially data-intensive ones: first footprint generation, then cross-match. The other line studies task scheduling strategies for astronomical workflows in cloud, since the scientific workflow is one of the most common application models in astronomy.

1. Astronomical data processing based on Hadoop & Spark: Footprint Generation. Sky coverage is one of the most important pieces of information about astronomical observations. Applications: intersections, unions, and other logical operations based on the geometric coverage of sky regions, as well as cross-match. The footprint is represented as a multi-order coverage (MOC) HEALPix map, generated on the Hadoop & Spark platforms. Since the catalogs are large, parallel methods based on Hadoop and Spark have been developed and tested.

Footprint generation based on Spark. This is the basic flow of the Spark-based footprint generation. It involves iterative operations, so the experimental results on Spark are better than on Hadoop; larger-scale experiments are coming soon. Data: 2MASS, 12.6 GB, 41,067,000 records. Environment: dual-core nodes with 4 GB memory, Spark 2.0.2, Hadoop 2.7.3.
node number    4      8
time (s)     138     69
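The map-then-union shape of that flow can be sketched in plain Python with no Spark dependency. This is only an illustrative sketch: the flat pixel grid below stands in for real HEALPix indexing (e.g. healpy's ang2pix), and pixel_id, footprint, and n_side are names invented for the example.

```python
def pixel_id(ra, dec, n_side=64):
    """Map (ra, dec) in degrees onto a coarse flat-grid pixel index.
    A real implementation would use HEALPix indexing (e.g. healpy.ang2pix);
    this flat grid only illustrates the per-record 'map' step."""
    col = int(ra / 360.0 * n_side) % n_side
    row = min(int((dec + 90.0) / 180.0 * n_side), n_side - 1)
    return row * n_side + col

def footprint(records, n_side=64):
    """'Map' every catalog record to a pixel, then 'reduce' by set union --
    the same shape as a Spark map + distinct pipeline."""
    return {pixel_id(ra, dec, n_side) for ra, dec in records}

# Toy catalog: three sources, two of which fall in the same pixel.
catalog = [(10.0, 20.0), (10.1, 20.1), (200.0, -45.0)]
cover = footprint(catalog)
```

In the real pipeline each record is mapped to a HEALPix cell at some order, and the per-partition pixel sets are unioned by a distinct/reduce step to yield the MOC.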

Hadoop-based cross-match. This is the astronomical cross-match based on Hadoop. Step 1: data distribution (1 Map + 1 Reduce). Step 2: distance calculation (1 Map).
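A rough pure-Python sketch of those two steps (not the actual Hadoop jobs; the declination-stripe partitioning, the 0.5-degree zone height, and the 1-arcsecond match radius are assumptions made for this example):

```python
import math
from collections import defaultdict

def zone(dec, height=0.5):
    """Step 1 'map': assign a record to a declination stripe, so that
    candidate pairs land in the same or a neighbouring partition."""
    return int((dec + 90.0) / height)

def ang_dist_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = (math.sin(d1) * math.sin(d2)
         + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def cross_match(cat_a, cat_b, radius=1.0 / 3600):
    """Step 1 'reduce': group catalog B by zone.  Step 2 'map': for each
    A record, test only its own and the neighbouring zones."""
    zones = defaultdict(list)
    for rec in cat_b:
        zones[zone(rec[1])].append(rec)
    matches = []
    for ra, dec in cat_a:
        z = zone(dec)
        for rb in zones[z - 1] + zones[z] + zones[z + 1]:
            if ang_dist_deg(ra, dec, rb[0], rb[1]) <= radius:
                matches.append(((ra, dec), rb))
    return matches

# One sub-arcsecond pair and one far-away source.
matches = cross_match([(10.0, 20.0)], [(10.0, 20.0001), (50.0, -10.0)])
```

Partitioning before the distance step is what keeps the pairwise comparison local: each map task only ever sees candidates from adjacent stripes rather than the full catalog.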

Experimental results. Data: SDSS, 100,106,811 records; 2MASS, 470,992,970 records. We used 4, 8, 16, 32, and 64 PCs for these experiments and obtained a good speedup effect.
node number    4     8    16    32    64
time (s)     273   136    69    38    25
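A quick check of the scaling implied by the table (plain arithmetic on the published numbers; speedup is measured relative to the 4-node run):

```python
# Times from the table above; speedup is relative to the 4-node run.
nodes = [4, 8, 16, 32, 64]
times = [273, 136, 69, 38, 25]
speedup = [times[0] / t for t in times]
# Parallel efficiency: speedup divided by the growth in node count.
efficiency = [s * nodes[0] / n for s, n in zip(speedup, nodes)]
```

Doubling from 4 to 8 nodes almost exactly halves the time (273 s to 136 s), and efficiency stays near 90% up to 32 nodes, which is the good speedup effect referred to above.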

Spark-based cross-match integrated with footprint generation. At present we only use some Spark programming techniques to improve its performance. The application can be modeled as a typical rich bag-of-tasks (rich-BoT) workflow, since it consists of many batches of independent tasks. In the future we will also apply the results of our scientific workflow scheduling research to optimize its performance further.

2. Task scheduling strategies of astronomical workflows in cloud: China-VO and Alibaba Cloud. The cloud China-VO will provide to users: data, software, and computing resources. [Figure: an example workflow DAG with tasks t1-t5 and data items d1-d8.] One of the big events for China-VO this year is the cooperation with Alibaba Cloud. The China-VO cloud will provide not only data and software but also computing resources for varied astronomical applications. Since the scientific workflow is one of the most commonly used application models in astronomy, its task scheduling in cloud is a valuable research topic.

What is worthy of concern: running efficiency (high performance), rental cost (low cost), and energy consumption (low energy consumption). How do we achieve these goals? Through data placement, resource allocation, and task scheduling. [Figure: tasks t1-t5 and data items d1-d8 mapped onto cloud resources after users upload data and launch the application.] There are multiple optimization objectives; to achieve them, we research strategies for data placement, resource allocation, and task scheduling in cloud.
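As a toy illustration of how task scheduling interacts with these objectives, here is a minimal list-scheduling sketch in Python. It is not the method from our papers: the 5-task DAG, the runtimes, and the earliest-idle-VM placement rule are all invented for the example, and the data-placement, cost, and energy terms are deliberately omitted.

```python
def list_schedule(tasks, deps, runtimes, n_vms=2):
    """Minimal list scheduling: repeatedly take a ready task (all
    predecessors finished) and place it on the VM that is idle earliest."""
    done, order, finish = set(), [], {}
    vm_free = [0.0] * n_vms          # time at which each VM becomes idle
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t not in done and all(d in done for d in deps.get(t, []))]
        t = min(ready)               # deterministic tie-break by name
        vm = min(range(n_vms), key=lambda i: vm_free[i])
        start = max(vm_free[vm],
                    max((finish[d] for d in deps.get(t, [])), default=0.0))
        finish[t] = start + runtimes[t]
        vm_free[vm] = finish[t]
        done.add(t)
        order.append((t, vm))
    return order, max(finish.values())   # (placements, makespan)

# An invented 5-task workflow: t1 fans out to t2/t3, which join at t4, then t5.
deps = {"t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"], "t5": ["t4"]}
runtimes = {"t1": 2, "t2": 3, "t3": 1, "t4": 2, "t5": 1}
order, makespan = list_schedule(list(runtimes), deps, runtimes)
```

A real cloud scheduler would weigh data-transfer time, VM rental cost, and energy in the placement rule rather than VM idle time alone, which is exactly where the multi-objective strategies below come in.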

Characteristics of astronomical workflow applications: data-intensive & compute-intensive; rich-BoT structures; task execution times difficult to estimate; complex network structure; heterogeneous machines. Our contributions: 1. task and data clustering based on data correlation; 2. cloud environment modeling and a heuristic-rule-based task scheduling method; 3. dynamic multi-layer deadline decomposition; 4. multi-objective optimization.

Main publications:
A new energy-aware task scheduling method for data-intensive applications in the cloud. Journal of Network and Computer Applications, 2016, 59:14-27. (SCI: WOS:000367491600003)
A Data Placement Algorithm for Data Intensive Applications in Cloud. International Journal of Grid and Distribution Computing, 2016, 9(2):145-156. (EI: 20161002063730)
A data placement strategy for data-intensive scientific workflows in cloud. 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2015), 928-934, 4-7 May 2015. (EI: 20153701270757)
Heuristic Data Placement for Data-Intensive Applications in Heterogeneous Cloud. Journal of Electrical & Computer Engineering, 2016, 2016(13):1-8. (EI)
Qing Zhao, Haonan Dai, Congcong Xiong, Peng Wang. Heuristic Data Layout for Heterogeneous Cloud Data Centers. 2015 International Symposium on Information Technology Convergence, 13-15 October 2015.
Qing Zhao, Jizhou Sun, Ce Yu, Jian Xiao, Chenzhou Cui, Xiao Zhang. Improved parallel processing function for high-performance large-scale astronomical cross-matching. Transactions of Tianjin University, 2011, 17(1):62-67. (EI: 20112013983867)
These are our main publications.

Qing Zhao, Congcong Xiong. An Improved Data Layout Algorithm Based on Data Correlation Clustering in Cloud. 2014 International Symposium on Information Technology Convergence, 2014.
Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, Jian Xiao. A paralleled large-scale astronomical cross-matching function. 9th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2009), 604-614, 8-11 June 2009. (EI: 20093912332041)
Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Jian Xiao. Big data oriented paralleled astronomical cross-match. Journal of Computer Application, 2010, 30(8):2056-2059.
Qing Zhao, Jizhou Sun, Jian Xiao, Ce Yu, Chenzhou Cui, Xu Liu, Ao Yuan. Distributed astronomical cross-match based on MapReduce. Journal of Computer Application Research, 2010, 27(9):3322-3325.
Thank you! We need your advice so that we can better understand what the VO needs most and what we can do better for the VO in the future.