Many-SC Project Runtime Environment (RTE) CSAP Lab 2014/10/28.


Slide 1: Many-SC Project Runtime Environment (RTE), CSAP Lab, 2014/10/28

Slide 2: 1st Year Progress Overview (Computer Systems and Platforms Lab)

Literature survey
- many-core OSes: Barrelfish, Corey, Exokernel, Harmony, Tessellation, fOS
- DVFS for multi-/many-core architectures

Random diagnostics generator
- implementation of the WE signal
- implementation of predicate routing
- testing (CSIM done, binary/RTL under way)

Many-SC RTE design
- services and functions provided
- resource management (cores, memory, power, …)
- communication
- requirements: H/W and application level (OpenCL)

Many-SC RTE implementation
- on existing H/W (Tilera, SCC)
- on the simulator (simulator team)

Many-SC power management
- DVFS on many-core architectures (voltage/frequency-island aware)
- combining DVFS with OS-level process migration
- implemented on real H/W (Intel SCC)

Slide 3: The Many-SC RTE Framework

Overview (figure: the RTE prototype runs on a host CPU/host machine; a Static Scheduler takes a many-core architecture description, a list of target applications, and application profiles from offline and online profilers, and produces a scheduling result mapping applications, e.g. App1 and App2, to tiles on the many-core H/W)

Many-core H/W description (e.g., Many-SC, Tilera, AMD)
- processor description: cluster = a set of tiles, tile = a set of cores, core = a computing unit, NoC = interconnection network
- memory description: each memory controller

Application profiles
- minimal requirements (cores, memory)
- performance profiles
- memory access latencies

Target application list
- applications given
- application categorization (e.g., the seven dwarfs)

Work progress (~2014/10)
- architecture/application description
- scheduling algorithms
- offline profiling tools
- the prototype interacts with Tilera; further improvements needed

Slide 4: The Static Scheduler

Scheduling assumptions
- an application is restricted to one cluster
- an application's address space is fixed to one memory controller
- no support for memory-contention modeling (at the moment)

Inter-cluster scheduling: cluster (and memory controller) packing
- round robin (the current scheduler focuses on intra-cluster scheduling)
- more elaborate cluster assignment planned for 2014/11

(figure: Applications 1 to 5 packed onto Clusters 1 to 4)
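The round-robin cluster packing above admits a short sketch. This is a hypothetical stand-in for the inter-cluster step, not the RTE's actual code; the application and cluster names are made up:

```python
def round_robin_pack(apps, clusters):
    """Pack applications onto clusters (and thereby onto their memory
    controllers) in round-robin order, wrapping when the clusters run out."""
    return {app: clusters[i % len(clusters)] for i, app in enumerate(apps)}

# Five applications onto four clusters: the fifth wraps back to cluster 1.
mapping = round_robin_pack(
    ["app1", "app2", "app3", "app4", "app5"],
    ["cluster1", "cluster2", "cluster3", "cluster4"])
print(mapping["app5"])  # cluster1
```

Since each application is restricted to one cluster, this single assignment also fixes its memory controller.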

Slide 5: Intra-cluster Scheduling Algorithms

Fairly dividing scheduler
- divides the cluster's core/memory resources evenly among the running applications

Brute-force greedy scheduler
- allocates each core to the application whose performance benefits most from that core

Hybrid scheduler
- combines the fairly dividing allocator with brute-force scheduling
- starting from the fairly divided allocation, adjacent applications exchange resources over iterations, in a simulated-annealing-like heuristic

Slide 6: Scheduling Scenarios

Target applications (OpenMP), categorized by dwarf:
- matrix multiplication (dense/sparse linear algebra)
- FFT (spectral methods)
- molecular dynamics (N-body methods)
- image blurring (structured grids)
- Monte Carlo approximation (MapReduce)

Performance profiles obtained with the offline profiler.

Slide 7: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur)

Slide 8: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur, molecular dynamics, Monte Carlo)

Slide 9: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix 1, matrix 2, molecular dynamics, Monte Carlo)

Slide 10: Scheduler Evaluation

For scheduling parallel applications on many cores, space-sharing scheduling outperforms Linux's time-sharing scheduling.

Further improvements needed
- application-aware profiling
  - the current performance model uses only the number of tiles and the average memory access latency (i.e., Manhattan distance to the memory controller)
  - need to consider tile interconnection patterns and the routing network
  - need to consider memory-contention-aware scheduling
- dynamic resource management
  - reallocate cores to an application during its lifetime as other applications finish
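The latency term of the current performance model, the average Manhattan distance from an application's tiles to its memory controller, can be sketched as follows. The coordinates are illustrative; the real profiler measures latencies on the Tilera mesh:

```python
def avg_manhattan_latency(tiles, mem_ctrl):
    """Average Manhattan (grid) distance from each allocated tile to the
    application's memory controller; the model uses this distance as a
    proxy for average memory access latency on the mesh NoC."""
    mx, my = mem_ctrl
    return sum(abs(x - mx) + abs(y - my) for x, y in tiles) / len(tiles)

# A 2x2 block of tiles next to a controller at mesh corner (0, 0).
print(avg_manhattan_latency([(0, 1), (1, 1), (0, 2), (1, 2)], (0, 0)))  # 2.0
```

As the slide notes, this proxy ignores interconnection patterns, routing, and contention, which is exactly why application-aware profiling is listed as future work.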

Slide 11: Conclusion and Outlook

Current status (2014/10)
- static scheduling framework, including architecture and application descriptions
- several static scheduling algorithms implemented and under test
  - greedy and heuristic algorithms
  - target applications categorized according to the seven dwarfs

Static scheduler (2014/11; 1st year)
- comparison of the heuristic schedulers with an optimal scheduler
- inter-cluster scheduling
- memory-contention-aware scheduling

Dynamic scheduler (2nd year)
- dynamic resource allocation algorithms and policies
- interaction with the application runtime (dynamic resource management)
- move the RTE framework onto the Many-SC simulator

Slide 12: Thank you. Questions?

Slide 13: Backup Slides

Slide 14: Architecture Description

Example: Many-SC prototype

<architecture name="many-sc" topology="mesh" clusters="4" tiles="48"
              cores="96" memories="1" memsize="4096">
  ……
</architecture>
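The child elements of the description are elided on the slide, so the sketch below reads only the attributes actually shown, using Python's standard XML parser; the self-closing form of the element is an assumption made to keep the snippet self-contained:

```python
import xml.etree.ElementTree as ET

# Attributes copied from the Many-SC prototype description above.
desc = ('<architecture name="many-sc" topology="mesh" clusters="4" '
        'tiles="48" cores="96" memories="1" memsize="4096"/>')
arch = ET.fromstring(desc)

# 96 cores on 48 tiles (2 cores/tile), grouped into 4 clusters of 12 tiles.
tiles, cores, clusters = (int(arch.get(k)) for k in ("tiles", "cores", "clusters"))
print(arch.get("topology"), cores // tiles, tiles // clusters)  # mesh 2 12
```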

Slide 15: Application Profiling Example

<specification architecture="many-sc" mintiles="1" maxtiles="12"
               minmemory="100" maxmemory="1000" path="./apps/matrix">
  ……
</specification>

Slide 16: Scheduling Algorithms (cont'd)

Fairly dividing allocator

// fairly divide the tiles among the applications
foreach (app)
    app->reserved = fairly divided tile number;

// start from the application with the maximum
// memory-controller proximity benefit (m_prior)
foreach (app) {
    tile = cluster->GetMemoryClosestIdleTile(app->GetMemoryController());
    while (app->allocated < app->reserved) {
        app->allocate(tile);
        // find the next tile for clustering;
        // consider colored flags, degrees, and depths
        tile = FindNextTile(app->tilePool, CLUSTERING);
    }
}
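The first step, dividing a cluster's tiles fairly, can be made runnable as below. The remainder handling is an assumption; the pseudocode on the slide does not say how leftover tiles are assigned when the count does not divide evenly:

```python
def fair_divide(num_tiles, apps):
    """Give every application floor(num_tiles / len(apps)) tiles and hand
    the remainder out one tile at a time, front of the list first."""
    base, extra = divmod(num_tiles, len(apps))
    return {app: base + (1 if i < extra else 0) for i, app in enumerate(apps)}

# 12 tiles in a cluster shared by 5 applications: shares are 3, 3, 2, 2, 2.
print(fair_divide(12, ["matrix", "fft", "blur", "md", "mc"]))
```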

Slide 17: Scheduling Algorithms (cont'd)

Brute-force greedy scheduler

// start from the tile that has the maximum
// memory-controller proximity benefit
while (idle core exists) {
    tile = cluster->GetMemoryClosestIdleTile();
    // pick the application greedily
    app = PeekAppThatHasMaximumScoreUpWith(tile);
    app->allocate(tile);
}
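The greedy loop can be sketched with a marginal-gain score. The diminishing-returns score function below is invented for illustration and is not the profiler's actual performance model:

```python
def greedy_schedule(tiles, apps, score):
    """Give each tile (assumed pre-sorted, memory-closest first) to the
    application whose score gains the most from one more tile."""
    alloc = {app: 0 for app in apps}
    for _tile in tiles:
        best = max(apps, key=lambda a: score(a, alloc[a] + 1) - score(a, alloc[a]))
        alloc[best] += 1
    return alloc

# Invented profile: performance scales as weight * sqrt(tile count),
# so marginal gains shrink as an application accumulates tiles.
weight = {"fft": 2.0, "blur": 1.0}
score = lambda app, n: weight[app] * n ** 0.5
print(greedy_schedule(range(3), ["fft", "blur"], score))  # {'fft': 2, 'blur': 1}
```

With three tiles, fft takes the first tile outright, but its shrinking marginal gain lets blur win the second, illustrating why greedy allocation does not simply starve the lighter application.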

Slide 18: Scheduling Algorithms (cont'd)

Hybrid scheduler: combines the fairly dividing allocator with brute-force scheduling

FairDivideScheduler();  // run algorithm 1 first

do {
    // compute the tile-exchange performance benefit
    // for each pair of adjacent applications
    map<Benefit, pair<App, App>> pairs;
    pairs = ComputeTileExchangeBenefits(allApps);

    // select the pair of adjacent applications with the largest benefit
    pair<App, App> selectedPair = pairs.get(maxBenefit);
    badApp  = selectedPair->first;
    goodApp = selectedPair->second;

    // reallocate one tile from badApp to goodApp
    ExchangeOneTile(badApp, goodApp);
} while (benefit is enough);
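A simplified, runnable version of the exchange loop is sketched below. Two aspects of the slide's algorithm are deliberately dropped for brevity: the annealing-like acceptance of occasional bad moves, and the restriction to adjacent applications; this sketch is plain hill climbing over all pairs, with the same invented sqrt score as before:

```python
def hybrid_refine(alloc, score, min_benefit=1e-9):
    """Starting from a fair division, repeatedly move one tile from the
    application that loses the least to the one that gains the most,
    as long as the best net benefit stays positive."""
    while True:
        best = None
        for bad in alloc:
            if alloc[bad] <= 1:
                continue  # never strip an application's last tile
            for good in alloc:
                if good == bad:
                    continue
                gain = score(good, alloc[good] + 1) - score(good, alloc[good])
                loss = score(bad, alloc[bad]) - score(bad, alloc[bad] - 1)
                if best is None or gain - loss > best[0]:
                    best = (gain - loss, bad, good)
        if best is None or best[0] <= min_benefit:
            return alloc
        _, bad, good = best
        alloc[bad] -= 1
        alloc[good] += 1

# Invented profile, as on slide 17: performance = weight * sqrt(tiles).
weight = {"fft": 3.0, "blur": 1.0}
score = lambda app, n: weight[app] * n ** 0.5
print(hybrid_refine({"fft": 2, "blur": 2}, score))  # {'fft': 3, 'blur': 1}
```

Starting from the fair 2/2 split, one tile migrates to the heavier fft application and the loop then stops, since any further move would lose more than it gains.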

Slide 19: (figure: Applications 1 to 5 mapped onto Clusters 1 to 4, as on slide 4)

Slide 20: A Scheduling Result (cont'd)

(figure: scheduling results for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur)

Slide 21: A Scheduling Result (cont'd)

(figure: scheduling results for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur)

Slide 22: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur, molecular dynamics, Monte Carlo)

Slide 23: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix 1, matrix 2, molecular dynamics, Monte Carlo)

Slide 24: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: apps 1 to 4)

Slide 25: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix, FFT, blur, molecular dynamics, Monte Carlo)

Slide 26: A Scheduling Result (cont'd)

(figure: scheduling results and benchmark results on Tilera for the fairly dividing, brute-force greedy, and hybrid schedulers; workload: matrix 1, matrix 2, molecular dynamics, Monte Carlo)