“Early Estimation of Cache Properties for Multicore Embedded Processors” ISERD ICETM 2015 Bangkok, Thailand May 16, 2015.


Presenter: Dr. Abu Asaduzzaman, Assistant Professor
Prepared by: Mr. Kishore K. Chidella, PhD Student
Computer Architecture and Parallel Programming Laboratory (CAPPLab)
Department of Electrical Engineering and Computer Science (EECS)
Wichita State University (WSU), USA

Outline
■ Introduction
  - Embedded systems with multicore processors
  - Pros and cons due to cache
■ Background and Motivation
  - Impact of cache on performance and power consumption
  - Optimized cache improves the performance-to-power ratio
■ Proposed Cache Modeling Strategy
  - Multicore architecture for embedded systems
  - Work-flow diagram
■ Experimental Results
■ Discussion
Questions? Any time!

Authors
■ Kishore K. Chidella, PhD Student
  - EECS Department, Wichita State University (WSU), USA
■ Muhammad F. Mridha, Assistant Professor
  - CSE Department, University of Asia Pacific (UAP), Bangladesh
■ Abu Asaduzzaman, Assistant Professor
  - EECS Department, Wichita State University (WSU), USA
  - Director, Computer Architecture and Parallel Programming Laboratory (CAPPLab)

Introduction
■ Multicore Embedded Systems
  - Future embedded systems should have multicore processors.
  - Currently available single-core simulation techniques are not adequate for designing multicore embedded systems [1-4].
  - Software applications use more and more threads to take advantage of the available cores [5-8].
  - Multicore processors are frequently deployed with multilevel cache memories [9].
  - Executing threads in parallel to achieve the best performance in such a multicore system is difficult because the threads share cache.
  - Design methodologies for complex embedded systems need support from early estimation techniques.

Background and Motivation
■ Some Early Work
  - The technical challenges of integrating homogeneous and heterogeneous multiple cores in embedded systems are elucidated in [1].
  - However, [1] does not provide a viable way to make early estimates for future embedded systems designs.
  - According to the experimental results published in [4], cache parameters and application code size have an impact on total power consumption and mean delay per task.
  - That approach is not focused on designing embedded systems and does not cover cache locking.

Background and Motivation (continued)
■ Some Early Work
  - Issues related to cache locking at level-1 and level-2 caches are discussed in [11, 12].
  - In [14], various algorithms for selecting a set of instructions to be locked in cache are compared.
  - Cache locking may improve performance.
  - Locking the entire level-1 cache (100% of the cache size) is not efficient for some applications, especially when the data to be locked is smaller than the cache.
  - Worst-case performance with locked caches may degrade with large cache lines due to cache pollution [12].

Background and Motivation (continued)
■ Some Early Work
  - These techniques were developed for single-core systems and are not suitable for contemporary multicore embedded systems.
  - They are also not useful for estimating power consumption, a crucial design factor for embedded systems.
  - Therefore, an early estimation technique for evaluating cache properties in multicore embedded systems is required.


Proposed Cache Modeling Strategy
■ Multicore Cache Organization
  - Level-1: private, split into I1 and D1
  - Level-2: private or shared, unified
  - Level-3: optional (or shared)
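The cache organization above can be written down as a small configuration record. The following is only an illustrative sketch: the class name `CacheHierarchy`, its fields, and the `cache_count` helper are hypothetical and do not come from the presentation or from VisualSim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheHierarchy:
    """Hypothetical description of the cache organization on the slide."""
    cores: int
    l1_split: bool = True    # Level-1 is private and split into I1 and D1
    l2_shared: bool = True   # Level-2 is unified; it may be private or shared
    has_l3: bool = False     # Level-3 is optional (shared when present)

    def cache_count(self) -> int:
        """Number of distinct cache instances in the modeled system."""
        l1 = self.cores * (2 if self.l1_split else 1)  # private per core
        l2 = 1 if self.l2_shared else self.cores
        l3 = 1 if self.has_l3 else 0
        return l1 + l2 + l3


# A quad-core system with split private L1 and one shared unified L2.
quad_shared_l2 = CacheHierarchy(cores=4)
```

For the quad-core case, `quad_shared_l2.cache_count()` counts four I1 caches, four D1 caches, and one shared level-2 cache.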

Proposed Cache Modeling Strategy (continued)
■ Cache Locking
  - Private first-level cache?
  - Shared last-level cache?
  - Entire locking or partial/way locking?

Proposed Cache Modeling Strategy (continued)
■ Work-Flow
  - Master core
    - Select jobs
    - Assign jobs
    - Pre-load cache memory
    - Mean delay; total power
  - Core x
    - Select cache size
    - Lock? (Yes or No)
    - Assign task
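The work-flow above can be sketched as a short control loop. This is a minimal sketch under stated assumptions, not the authors' VisualSim model: the round-robin job assignment, the `CoreConfig` record, and the `simulate_task` callback (which stands in for the simulator and returns a delay and a power value for one task) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class CoreConfig:
    cache_size_kb: int  # "Select cache size"
    lock: bool          # "Lock? (Yes or No)"


def run_master(
    jobs: Sequence[str],
    core_configs: Sequence[CoreConfig],
    simulate_task: Callable[[str, CoreConfig], Tuple[float, float]],
) -> Tuple[float, float]:
    """Assign jobs to cores and return (mean delay per task, total power)."""
    # Master core: select and assign jobs (round-robin is an assumption).
    assignments: List[List[str]] = [[] for _ in core_configs]
    for index, job in enumerate(jobs):
        assignments[index % len(core_configs)].append(job)

    delays: List[float] = []
    total_power = 0.0
    for config, tasks in zip(core_configs, assignments):
        for task in tasks:
            # Core x: run the task with its cache size and lock decision.
            # Cache pre-loading is assumed to happen inside simulate_task.
            delay, power = simulate_task(task, config)
            delays.append(delay)
            total_power += power

    mean_delay = sum(delays) / len(delays) if delays else 0.0
    return mean_delay, total_power
```

The two returned values correspond to the outputs named on the slide, mean delay and total power; any real measurement would have to come from the simulator behind `simulate_task`.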

Simulation
■ Simulation Tool
  - The VisualSim tool is used to develop the modeling platform.
■ Applications to Run the Simulation Program
  - FFT (Fast Fourier Transform)
  - GIF (Graphics Interchange Format)
  - JPEG (Joint Photographic Experts Group)
  - MPEG-3 (Moving Picture Experts Group)
  - MPEG-4
  - Here, FFT is the smallest application (with code size 2.34 KB) and MPEG-4 is the biggest application (with code size KB).

Input / Output Parameters
■ Inputs
  - Number of cores: 4 (fixed)
  - I1 / D1 size (KB): 2 / 2 (fixed)
  - Line size (bytes): 128 (fixed)
  - Associativity level (n-way): 8 (fixed)
  - CL2 cache size (KB): 32, 64, 128, 256, or 512
  - Locked CL2 cache size (%): 0.0, 12.5, 25.0, 37.5, 50.0
■ Outputs
  - Mean delay per task
  - Total power consumption
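The input parameters above define a grid of 5 CL2 sizes by 5 locked percentages, and the fixed line size and associativity determine the geometry of each CL2 configuration. The sketch below enumerates that grid. The function names are hypothetical, and expressing the locked portion as a number of ways assumes way locking, which the slides raise as an option but do not confirm.

```python
from itertools import product

LINE_SIZE_BYTES = 128   # fixed input from the slide
ASSOCIATIVITY = 8       # fixed input from the slide (8-way)
CL2_SIZES_KB = [32, 64, 128, 256, 512]
LOCKED_PERCENTAGES = [0.0, 12.5, 25.0, 37.5, 50.0]


def cl2_geometry(size_kb: int, locked_pct: float) -> dict:
    """Derive line, set, and locked-capacity figures for one CL2 setting."""
    num_lines = size_kb * 1024 // LINE_SIZE_BYTES
    num_sets = num_lines // ASSOCIATIVITY
    return {
        "size_kb": size_kb,
        "locked_pct": locked_pct,
        "lines": num_lines,
        "sets": num_sets,
        "locked_kb": size_kb * locked_pct / 100,
        # Assumption: locking is applied per way; 12.5% of 8 ways is 1 way.
        "locked_ways": round(ASSOCIATIVITY * locked_pct / 100),
    }


def sweep() -> list:
    """All CL2 size / locked-percentage combinations to simulate."""
    return [cl2_geometry(s, p) for s, p in product(CL2_SIZES_KB, LOCKED_PERCENTAGES)]
```

With an 8-way cache, each listed locked percentage is a whole number of ways (0, 1, 2, 3, or 4), which is one reason a way-locking interpretation fits these inputs.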


Experimental Results
■ Shared L2 Cache Size
  - JPEG behaves almost like GIF, and MPEG-3 behaves almost like MPEG-4.
  - For CL2 cache sizes from 32 KB to 128 KB, mean delay per task and total power consumption for MPEG-4 decrease significantly when the cache size is increased and/or locking goes from 0% to 25%.
  - The impact of shared CL2 on power consumption is more significant than its impact on delay.

Experimental Results (continued)
■ Shared L2 Cache Size
  - For GIF, mean delay per task and total power consumption decrease with 25% locking only at a CL2 cache size of 32 KB.
  - For FFT, neither CL2 cache size nor locking has a positive impact on mean delay per task or total power consumption.
  - Increasing the CL2 size beyond 128 KB has no positive impact (it consumes more power without reducing the delay).

Experimental Results (continued)
■ Shared L2 Cache Locking
  - Cache locking at shared CL2 has a more significant impact on mean delay per task and total power consumption for large applications (like MPEG-4) than for small applications (like FFT).
  - According to the shared CL2 cache locking results, the optimal performance (delay)/power ratio is obtained with 25% cache locking for all the workloads.

Conclusions
■ A simulation methodology is presented for early estimation of effective cache properties (parameters and locked cache size) for multicore embedded systems.
■ A quad-core system with shared CL2 is simulated using FFT, GIF, JPEG, MPEG-3, and MPEG-4 workloads.
■ Although both mean delay per task and total power consumption decrease when the shared CL2 cache size is increased and/or cache locking is applied, the impact of shared CL2 on power consumption is more significant than its impact on delay.

Thank You! Questions?
Contact: Abu Asaduzzaman, CAPPLab