Paul Cockshott Glasgow


SICSA Benchmarks in C

First Serial Experiments
Prior to any parallelisation it is advisable to set up a good sequential version first. I was initially doubtful that this challenge would provide an effective basis for parallelisation, because it seemed such a simple problem. Intuitively it is likely to be of linear, or at worst log-linear, complexity, and for such problems, especially ones involving text files, the time taken to read in the file and print out the results can easily come to dominate the total run time. If a problem is disk bound, there is little advantage in expending effort to run it on multiple cores.

Algorithm
The algorithm is four pass:
1. Read the file into a buffer.
2. Produce a tokenised version of the buffer.
3. Build the hash table and prefix trees.
4. If the concordance is to be printed out, traverse the trees, printing the word sequences in the format suggested by the Haskell version. If sorted output is wanted, pipe through Linux sort.

[Diagram: the encoded file feeding a hash table whose entries point to prefix-tree nodes, each node carrying left, down, and right links.]

Serial results
Machine was a 2.6 GHz Intel, gcc, no optimisation.

Language | File size (bytes) | OS      | Time (s)
Haskell  | 3580              | Windows | 0.82
C        | 3580              | Windows | 0.03
Haskell  | 4792092           | Windows | timeout (>2 hrs)
C        | 4792092           | Windows | 3.67
C        | 4792092           | Linux   | 2.68

Conclusions from the serial test
As the summary above shows, the C version:
- is significantly faster than the Haskell version. This is not surprising, as one would expect C to be more efficient;
- appears to have a lower complexity order than the Haskell version. This indicates that the Haskell version is not a good starting point: its run time is dominated by the time to output the results to file.
The test files provided in the initial proposal were too short to give a realistic estimate of performance. Linux implementations run substantially faster than Windows on the same processor and gcc.

Parallel Experiments
As a first parallel experiment, a dual-core version of the C programme was produced using the pthreads library; it was tested on the same dual-processor machine as the original serial version of the algorithm. A second parallel version used a simple shell script:

./l1concordance WEB.txt 4 P 1 0 >WEB0.con&
./l1concordance WEB.txt 4 P 1 1 >WEB1.con
wait
cat WEB1.con >>WEB0.con

Parallel timings

Mechanism | OS      | Threads used | Time (s) | Sorted output | Opt level
serial    | Windows | 1            | 3.67     | no            |
pthreads  | Windows | 2            | 5.63     | no            |
serial    | Linux   | 1            | 2.68     | no            |
pthreads  | Linux   | 2            | 2.26     | no            |
pthreads  | Linux   | 2            | 2.45     | yes           | 3
shell &   | Linux   | 2            | 1.93     | yes           | 3

Conclusions for the 2-core machine
There was no gain from multithreading on Windows. It looks as if the pthreads library under Windows simply multiplexes the threads on a single core rather than using both. On Linux there was a small gain from multithreading: about 17% faster in elapsed time using 2 cores. Since a large part of the program execution is spent writing the results out, this poses a challenge to multicore; the parallel version adopted the strategy of letting each thread write its part of the results to a different file. The best performance came from the oldest approach: classic Unix shell scripts along with C.

SCC 48-core machine

Configuration                        Elapsed time (s)
1 core doing the full concordance    26.17
1 core doing half the concordance    13.48
1 core doing 1/8 of the concordance   5.59
2 cores doing half each              49.0
8 cores doing 1/8 each               36.0
32 cores doing 1/32 each             34.0
host processor doing it all           1.03
host processor using both cores       0.685

# shell script run on the host to run the concordance on 32 SCC cores
rm /shared/stdout/*
pssh -t 800 -h hosts32 -o /shared/stdout /shared/sccConcordance32
cat /shared/stdout/* | sort > WEB.con

Why did it not work on the SCC?
- Too much I/O.
- The communication path to disk for the individual cores is poor.
- The bandwidth of the PCI link to the chip is low.
- All I/O has to go through the host anyway.

Proposed New Benchmarks
- N-body problem: 1024 bodies under gravitational attraction.
- Image blurring: 1024 by 1024 pixel, 24-bit colour image.
- Mandelbrot set: image size 2048, 8-bit colour resolution.