Causing Incoherencies: Parallel Sorting Algorithms and a Study of L1 and L2 Cache Behaviors on Multi-Core Architecture
Presented by Mohammed Alharbi, Mohammed Aleisa, Bakri Awaji
Instructor: Prof. Gita Alaghband

Outline
Motivation
Our implementation in detail
Contributions
Experiments
Evaluation
Related work
Conclusion
Challenges
What did we learn?

Motivation
Our motivation in this project is to cause incoherence by simulating three sorting algorithms (bubble sort, quicksort, insertion sort) in parallel. But the big question is: why would we want to cause incoherencies? Coherence is needed to meet an architectural assumption held by the software designer. The bad program design identified by this project demonstrates what happens when that coherence assumption is ignored. Using multiple cores of a processor effectively can cause coherence problems, so we need to learn how the architecture reacts to shared data on a multi-core processor.

Motivation (Important Questions)
How do we use the sorting algorithms to demonstrate our project in such detail? The sorting algorithms simply provide the necessary traces and overlapping reads and writes to cause coherence issues.
Why are we choosing sorting algorithms instead of applications? Many applications use sorting algorithms of various kinds; a sorting algorithm can be an entire application (a utility); and sorting algorithms provide a clear source of coherence issues when executed in parallel on the same data.

Motivation (Important Questions)
The sorting algorithms are usually presented as sequential algorithms; why are they related to caches and multicore architecture? Sorting algorithms are frequently executed on multicore architectures and make heavy use of caches. The algorithms, again, simply provide functional traces that result in coherence issues.
Why are we choosing these sorting algorithms? They are commonly known and applied, and they provide opportunities to examine incoherence.
How will we count misses and hits? Misses are counted as compulsory (we never had the data to begin with), conflict (two blocks fight over the same slot in a direct-mapped architecture), and coherence (updates from another core that invalidate blocks). A hit can be either a read hit or a write hit.
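As a concrete illustration of the miss categories above, a direct-mapped cache's misses can be classified roughly as follows. The class and method names are our own sketch, not the project's actual simulator code.

```python
class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines   # tag stored per line, None = empty
        self.ever_seen = set()            # blocks referenced at least once
        self.invalidated = set()          # blocks knocked out by another core

    def classify_access(self, addr):
        """Return 'hit' or the kind of miss for an access at addr."""
        block = addr // self.block_size
        index = block % self.num_lines
        if self.lines[index] == block:
            return "hit"
        # Miss: decide the category before refilling the line.
        if block not in self.ever_seen:
            kind = "compulsory"           # never had the data to begin with
        elif block in self.invalidated:
            kind = "coherence"            # lost the copy to an invalidation
            self.invalidated.discard(block)
        else:
            kind = "conflict"             # evicted by a same-index block
        self.ever_seen.add(block)
        self.lines[index] = block
        return kind

    def snoop_invalidate(self, addr):
        """Another core wrote this block: drop our copy."""
        block = addr // self.block_size
        if self.lines[block % self.num_lines] == block:
            self.lines[block % self.num_lines] = None
            self.invalidated.add(block)
```

For example, a first access to address 0 is a compulsory miss, a repeat is a hit, and a re-access after a snooped invalidation counts as a coherence miss.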

Our Implementation in Detail
We simulate three sorting algorithms (bubble sort, quicksort, insertion sort) to study:
Causing incoherence in the L1 caches by applying a coherence (invalidate) policy with either a write-through or a write-back policy.
Measuring read hits, write hits, coherence misses, conflict misses, and compulsory misses.
Input: a long data array, for example:

Our Implementation in Detail
We simulate two different sorting algorithms, for example bubble sort vs. quicksort, in parallel on two cores with the same array, using the write-through policy with the invalidation policy. L2 is shared. The figures show the two algorithms fighting over the same data.

Our Implementation in Detail
[Figure: two cores, each running one algorithm (bubble sort on core 1, quicksort on core 2) with a private L1 cache, connected through a bus snooper to a shared L2 cache and RAM.]
For example, causing incoherence with the write-through policy in the invalidation case: the two algorithms run at the same time on the same array. While one core searches, the other swaps (e.g., 73 becomes 37). The writing core updates with the write-through policy and sends a broadcast to update the data in the other cores, or a request to invalidate that data.

Analysis (Scenario of Invalidation with Write-Through)
For example, we apply the bubble sort algorithm on core 1 and the quicksort algorithm on core 2. The array is placed first in main memory. The whole array, which fits in two blocks, is then sent from main memory to the L2 cache, and each L1 cache receives the same blocks. In the first data access, for example, quicksort on core 2 is searching while bubble sort on core 1 is swapping, meaning it wants to write; so core 1 updates the array's value in the L2 cache and then in main memory using the write-through policy. After that, core 1 sends a broadcast request on the snooping bus to invalidate the same data in the other core. Core 2 then takes a read miss on that data and must refresh it from L2. This means a cache coherence event happens on every such data access. The two algorithms fight over the same array, which causes duplicated data, lost data, wrong sorting, and flushed copies.
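The write-through-plus-invalidate round trip described in this scenario can be sketched as follows. The Core and Bus classes and their fields are illustrative assumptions, not the simulator's real structure.

```python
class Bus:
    def __init__(self, memory):
        self.memory = memory      # shared level (L2 / main memory): addr -> value
        self.cores = []

    def invalidate(self, addr, source):
        # Snooping: every other core drops its copy of this address.
        for core in self.cores:
            if core is not source:
                core.l1.pop(addr, None)

class Core:
    def __init__(self, name, bus):
        self.name = name
        self.l1 = {}              # private L1 contents: addr -> value
        self.bus = bus
        self.read_misses = 0

    def read(self, addr):
        if addr not in self.l1:   # never loaded, or invalidated by the peer
            self.read_misses += 1
            self.l1[addr] = self.bus.memory[addr]   # refill from shared level
        return self.l1[addr]

    def write(self, addr, value):
        self.l1[addr] = value
        self.bus.memory[addr] = value               # write-through
        self.bus.invalidate(addr, source=self)      # broadcast invalidate

memory = {0: 7, 1: 3}
bus = Bus(memory)
bubble, quick = Core("bubble", bus), Core("quick", bus)
bus.cores = [bubble, quick]

quick.read(0)          # quicksort caches the value (compulsory miss)
bubble.write(0, 3)     # bubble sort swaps: write-through + invalidate
quick.read(0)          # quicksort's copy is gone: coherence read miss
```

Every write by one core forces a miss on the other core's next read of that address, which is exactly the coherence traffic the experiments count.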

Contribution
Bubble sort algorithm
Quicksort algorithm
Insertion sort algorithm
Trace

Contribution
Bubble Sort: compares the numbers in pairs from left to right, exchanging them when necessary; the first number is compared to the second, and if it is larger they are exchanged.

Contribution Bubble Sort
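The compare-and-swap passes described above can be sketched in a few lines. This is a generic bubble sort, not necessarily the exact variant the project traced.

```python
def bubble_sort(a):
    n = len(a)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):          # compare pairs left to right
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]   # exchange when out of order
                swapped = True
        if not swapped:
            break                            # a full pass with no swaps: sorted
    return a
```

The many swaps in the inner loop are the writes that, in the parallel experiments, trigger the invalidation broadcasts.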

Contribution
Quick Sort: given an array of n elements (e.g., integers):
If the array contains only one element, return.
Else:
- Pick one element to use as the pivot.
- Partition the elements into two sub-arrays: elements less than or equal to the pivot, and elements greater than the pivot.
- Quicksort the two sub-arrays.
- Return the results.

Contribution Quick Sort :
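The pivot-and-partition steps above map to a short recursive sketch. This copying version is for clarity; the variant actually traced in a cache study would presumably be in-place.

```python
def quick_sort(a):
    if len(a) <= 1:
        return a                                  # base case: 0 or 1 elements
    pivot, rest = a[0], a[1:]                     # pick one element as pivot
    smaller = [x for x in rest if x <= pivot]     # elements <= pivot
    larger = [x for x in rest if x > pivot]       # elements > pivot
    return quick_sort(smaller) + [pivot] + quick_sort(larger)
```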

Contribution Insertion Sort :
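Since this slide shows the algorithm only as a figure, here is a standard insertion sort sketch for reference (our own code, not the project's):

```python
def insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]                    # next element to insert
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]           # shift larger elements one slot right
            j -= 1
        a[j + 1] = key                # drop the key into its place
    return a
```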

Our Experiments
Simulating different sorting algorithms in parallel on two cores with the same data array to study cache behavior:
Case 0: bubble sort vs. insertion sort
Case 1: bubble sort vs. quicksort
Case 2: quicksort vs. insertion sort

Our Milestones
Case 1: bubble sort vs. quicksort (write through, invalidation): done
Case 2: insertion sort vs. bubble sort (write through, invalidation): done
Case 3: quicksort vs. insertion sort (write through, invalidation): done
Case 4: insertion sort vs. bubble sort (write back, invalidation): in progress
Case 5: bubble sort vs. quicksort (write back, invalidation): in progress

Our Experiments
In our experiment we studied the behavior of the higher- and lower-level caches. We measured the hit and miss rates in the cache. These measurements reflect the incoherence that occurred as a result of applying the invalidation policy.

Our Experiments (Parameters)
We used the same parameters for all cases:
The input data: the same array
Coherence policy: invalidate
Cache size: 64 bytes
Block size: 32 bytes
Number of cores: 2

Our Experiments
The data type: a one-dimensional array of 64 bytes. The trace file is generated by the code. Trace file sizes: bubble sort = 126 KB, insertion sort = 62 KB, quicksort = 23 KB.
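With the parameters above (a 64-byte cache and 32-byte blocks, i.e., two lines, assuming a direct-mapped cache, which is our assumption), an address maps to a block, a cache line, and a byte offset like this:

```python
BLOCK_SIZE = 32
CACHE_SIZE = 64
NUM_LINES = CACHE_SIZE // BLOCK_SIZE    # 2 lines

def map_address(addr):
    block = addr // BLOCK_SIZE          # which memory block
    index = block % NUM_LINES           # which cache line (direct-mapped)
    offset = addr % BLOCK_SIZE          # byte within the block
    return block, index, offset
```

For example, address 0 and address 64 land in different blocks (0 and 2) that both map to line 0, which is where conflict misses come from.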

Our Experiments (Result)
Measuring the coherence miss rate for all cases on 2 cores.
Coherence misses = coherence write misses + coherence read misses
Bubble sort vs. insertion sort: 1574
Bubble sort vs. quicksort: 352
Quicksort vs. insertion sort: 675

Our Experiments (Chart)
Measuring the coherence miss rate for all cases on 2 cores.

Our Experiments (Analysis)
This figure shows the incoherence in our simulator. How? The incoherence happened because of the invalidation policy: both algorithms fight over the same data. As the chart shows, the coherence miss rate is highest in the first case. Why? The bubble sort's ordering of the array can be helpful or wasteful for the insertion sort in the same case, which increases the data accesses. Likewise, the insertion sort's behavior can increase or decrease the bubble sort's number of sorting iterations, because of the wrong sorting caused by the fight over the same data.

Our Experiments (Result)
Measuring the read coherence miss rate and write coherence miss rate for all cases on 2 cores.

Our Experiments (Chart)
Measuring the read coherence miss rate and write coherence miss rate for all cases on 2 cores.

Our Experiments (Analysis)
This figure breaks the previous coherence-miss figure down into write coherence misses and read coherence misses for each case. As we can see, write coherence misses are higher than read coherence misses. Why? Because of the incoherence caused by invalidation, each algorithm did a lot of swapping, and both algorithms fight over the same data.

Our Experiments (Result)
Measuring hit and miss rates with write-through for all cases using invalidation.
All cases (hit/miss), write through:
Bubble vs. Insertion
Bubble vs. Quick
Quick vs. Insertion

Our Experiments (Chart)
Measuring hit and miss rates with write-through for all cases using invalidation.

Our Experiments (Analysis)
Measuring cache performance: the hit rates in all cases are greater than the miss rates. The reason is that write hits occur often, because of the swapping operations. The highest hit rate is shown by the bubble sort: this algorithm has more comparing and swapping operations than the other sorting algorithms, and it is not an efficient algorithm.
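The hit and miss rates being compared can be computed from raw counts as follows; the numbers in the usage lines are illustrative, not measured results from the project.

```python
def rates(hits, misses):
    """Return (hit rate, miss rate) from raw access counts."""
    total = hits + misses
    return hits / total, misses / total

# Illustrative example of a swap-heavy run where write hits dominate:
hit_rate, miss_rate = rates(hits=900, misses=100)
```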

Our Experiments (Result)
Measuring hit and miss rates with write-through on each core for each case using invalidation (to show the impact of the fight over the same data at the higher level).
Bubble sort vs. insertion sort: core 0 (bubble sort), core 1 (insertion sort), hit/miss
Bubble sort vs. quicksort: core 0 (bubble sort), core 1 (quicksort), hit/miss
Quicksort vs. insertion sort: core 0 (quicksort), core 1 (insertion sort), hit/miss

Our Experiments (Charts)
Measuring hit and miss rates with write-through on each core for each case using invalidation.

Our Experiments (Analysis)
These figures show the high rates of hits and misses in the cache for each case. This happens because both algorithms fight over the same data at the higher level. The hit and miss rates reflect the incoherence caused by invalidation. The first algorithm's ordering of the array can be helpful or wasteful for the second algorithm in the same case; it can reduce or increase the data accesses. The first algorithm's behavior can also increase or decrease the second algorithm's number of sorting iterations, because of the wrong sorting caused by the fight over the same data.

Evaluation
In our evaluation we studied the impact of varying parameters on cache optimization:
Different block sizes with a constant cache size, to measure the coherence misses and the hit and miss rates in both levels of cache.
Different cache sizes with a constant block size, to measure the conflict misses in both levels of cache.

Evaluation (Block Size)
Coherence miss rate for all cases: block size = 32, cache size = 64, compared with block size = 64, cache size = 64.

Evaluation (Analysis)
We increased the block size and held the cache size constant, with write-through, for all cases. Why? To study the impact on causing incoherence. These two figures show that coherence misses increase as we increase the block size. Why? With larger blocks, the same data is shared more, so invalidation is applied many more times, which shows the fight over the data. Block size is a parameter that strongly impacts the cache.
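One way to see why larger blocks raise coherence misses: invalidation acts at block granularity, so a bigger block makes two cores' disjoint byte ranges collide in the same block (false sharing). A hypothetical sketch, with an assumed access pattern:

```python
def blocks_invalidated(write_addrs, read_addrs, block_size):
    """Count how many of the reader's blocks the writer's stream invalidates."""
    written = {a // block_size for a in write_addrs}   # blocks the writer touches
    reading = {a // block_size for a in read_addrs}    # blocks the reader caches
    return len(written & reading)                      # shared blocks get invalidated

writes = [0, 8, 16, 24]      # core 0 swaps within the first 32 bytes
reads = [32, 40, 48, 56]     # core 1 works on the next 32 bytes
small = blocks_invalidated(writes, reads, block_size=32)   # disjoint blocks
large = blocks_invalidated(writes, reads, block_size=64)   # one shared block
```

With 32-byte blocks the two cores never touch the same block; with 64-byte blocks every write by core 0 invalidates the block core 1 is reading.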

Evaluation (Block size)

Evaluation (Analysis)
We increased the block size while holding the cache size constant, with write-through, for all cases. Why? To study the impact of this change on cache optimization. These two figures show the same hit rate but increased misses as we increased the block size, because of the increase in invalidation messages.

Evaluation (Block size)

Evaluation (Analysis)
We increased the block size while holding the cache size constant, with write-through, for specific cases. Why? To study the impact of this change on the higher-level (L1) optimization. These two figures show differing hits and increased misses as we increased the block size. Here we studied each algorithm's behavior: the bubble sort's hits are high because it has many more cycles than the others, which is clearly seen in its trace file. Block size is a parameter that strongly impacts the L1 cache.

Evaluation (Conflict Misses vs. Cache Size)

Evaluation (Analysis)
These two figures show that when the cache size increases, the conflict miss rate decreases. Why? In figure 1, the block size is 32 bytes and the L1 cache size equals the block size, so the conflict miss rate increases because the cache fits only one block. In figure 2, the block size is again 32 bytes but the L1 cache size is twice the block size, so the conflict miss rate decreases because the cache fits more blocks.
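The single-block thrashing described for figure 1 can be reproduced with a tiny direct-mapped model; the ping-pong access trace here is our own assumption, chosen to make the effect visible.

```python
def conflict_misses(trace, cache_size, block_size):
    """Count misses for an address trace on a direct-mapped cache."""
    num_lines = cache_size // block_size
    lines = [None] * num_lines          # tag per line
    misses = 0
    for addr in trace:
        block = addr // block_size
        index = block % num_lines
        if lines[index] != block:       # miss: line holds another block
            misses += 1
            lines[index] = block
    return misses

trace = [0, 32, 0, 32, 0, 32]          # ping-pong between blocks 0 and 1
one_line = conflict_misses(trace, cache_size=32, block_size=32)   # thrashes
two_lines = conflict_misses(trace, cache_size=64, block_size=32)  # both fit
```

With a 32-byte cache every access misses (the two blocks evict each other); doubling the cache to two lines leaves only the two compulsory misses.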

Related Work
The effect of false sharing on parallel algorithm performance depends on many factors, such as block size, access pattern, and coherence policies [9]. The impact of false sharing is a main performance factor when comparing traditional coherence policies with the optimal policy that uses the new merge facility [9].

Conclusion
Coherence is needed to meet an architectural assumption held by the software designer. Flushing data an extra time prevents data loss and duplicated data and improves cache performance. Invalidation messages increase when we increase the block size with a static cache size.

Future Work
Using the update coherence policy with write-through and write-back.
Executing the algorithms in parallel on more cores and counting the false sharing.
Using larger data sizes.

Challenges
Clarifying the project idea to the class.
It was our first time simulating caches in software.
The implementation required a large effort in a short time.
Write-back and the quicksort algorithm were quite complicated.
We read a lot of papers to find work related to our project, because our area is broad.

What Did We Learn?
How the architecture reacts with the software.
How to take a small, specific feature and build a big research project from it.
From the comprehensive questions in the lab assignments, we learned how to analyze our simulation's performance.

References
[1] Prabhu, Gurpur M. "Computer Architecture Tutorial."
[2] Gita Alaghband (2014). CSC 5593 Graduate Computer Architecture, Lecture 2.
[3] Shaaban, Muhammed A. "EECC 550 Winter 2010 Home Page." RIT.
[4] Guanjun Jiang, Du Chen, Binbin Wu, Yi Zhao, Tianzhou Chen, Jingwei Liu. "CMP Thread Assignment Based on Group Sharing L2 Cache." Eighth International Conference on Embedded Computing (SCALCOM-EMBEDDEDCOM '09), pp. 298-303, Sept. 2009.
[5] Kruse and Ryba (2001). Mergesort and Quicksort.
[6] Wei Zhang (2010). Multicore Architecture.
[7] J. Hennessy, D. Patterson. Computer Architecture: A Quantitative Approach (4th ed.). Morgan Kaufmann, 2011.
[8] D. Patterson, J. Hennessy. Computer Organization and Design (5th ed.). Morgan Kaufmann, 2011.
[9] W. Bolosky and M. Scott. "False Sharing and Its Effect on Shared Memory Performance." In Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, September 1993.