SICSA Concordance Challenge: Using Groovy and the JCSP Library Jon Kerridge

Software Environment
Groovy
– A Java-based scripting language
– Direct support for Lists and Maps
– Executes on a standard JVM
JCSP
– A CSP-based library for Java
– Process definitions are independent of how the system will be executed
– Enables multicore parallelism
– Enables parallelism over a distributed system with TCP/IP interconnect
– Executes on a standard JVM
A set of Groovy helper classes has been created to permit easier access to the JCSP library
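
As a flavour of the style these helper classes permit, here is a minimal sketch (not taken from the talk) of two processes run in parallel over a one-to-one channel. The PAR helper and the CSProcess style follow the code shown later in these slides; the import paths and the Channel.one2one() factory are assumptions about the JCSP/helper packaging.

import org.jcsp.lang.*     // core JCSP classes (assumed package)
import org.jcsp.groovy.*   // Groovy helpers such as PAR (assumed package)

// Writes the integers 1..10 to its output channel end
class Producer implements CSProcess {
  def ChannelOutput outChannel
  void run() {
    for ( i in 1..10) outChannel.write(i)
  }
}

// Reads ten values from its input channel end and prints them
class Consumer implements CSProcess {
  def ChannelInput inChannel
  void run() {
    for ( i in 1..10) println "received ${inChannel.read()}"
  }
}

def chan = Channel.one2one()   // assumed channel factory
new PAR( [ new Producer( outChannel: chan.out()),
           new Consumer( inChannel: chan.'in'()) ] ).run()
// note: 'in' is quoted because it is a Groovy keyword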

Hardware Environment
A network of multicore PCs
– Network is 100 Mbit/s
Each processor has
– Intel Core 2 Duo E8400 operating at 3.0 GHz
– 2 cores
– 2 threads (no hyper-threading)
– FSB 1333 MHz
– L2 cache 6 MB
– 2 GB memory

Why Use a Distributed System?
– Regardless of the application, more than one processing node may be needed
– Better to start with an inherently parallel distributed design
– Bolting on distributed parallelism afterwards is always very difficult
– Scalability
– Enables easier overlapping of operations, particularly file I/O

Architecture
A Read File process connected to a set of Worker processes
– There can be any number of workers; in these experiments 4, 8 and 12
– Bi-directional CSP channel communication in a Client-Server design

Primary Design Criteria
Ensure all data structures are separable in some parameter
– N in this case
– Reduces contention for memory access
– Hence easier to parallelise
Keep loops simple
– Easier to parallelise

Read File Process
Reads parameters
– input file name, N value, minimum number of repetitions to be output
– number of workers and block size
Operation
– Reads the input file and tokenises it into space-delimited words
– Forms a block of such words, ensuring an overlap of N-1 words between blocks (sketched below)
– Sends a block to each worker in turn
– Merges the final partial concordance of each worker and writes the final concordance to an output file
– This merge step will be removed in the final version
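
A minimal sketch of the overlapping block formation follows; it is illustrative rather than the talk's actual code, and inputFileName, blockSize and N stand for the parameters read above.

// Tokenise the input file into space-delimited words, then cut the word
// list into blocks that share N-1 words with their successor, so that no
// sequence of length up to N straddles a block boundary unseen.
def words = new File(inputFileName).text.split(/\s+/).toList()
def blocks = []
int step = blockSize - (N - 1)            // each new block starts N-1 words early
for (int start = 0; start < words.size(); start += step) {
  int end = Math.min(start + blockSize, words.size())
  blocks << words.subList(start, end)
  if (end == words.size()) break          // last (possibly short) block
}
// blocks can now be dealt out to the workers in turn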

Initial Experiments
The relationship between block size and the number of workers governs how much processing can be overlapped with the initial file input
It was discovered that a block size of 6144 gave the best performance for 4 or 8 workers
– Provided the only work undertaken was
– removal of punctuation and
– the initial calculation of the equivalent integer value for each word

Worker – Initial Phase
Reads input blocks from the Read File process
– Removes punctuation, saving the result as bare words
– Calculates the integer equivalent value for each word by summing its ASCII characters
– This is also the N = 1 sequence value
– These operations are overlapped with input and with the same processing in each worker
For each block
– Calculate the integer value for each sequence of length 2 up to N by adding word values, and store it in a Sequence list
The integer values generated by this processing will produce duplicate values for different words and different sequences
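
The talk does not show this computation, but a sketch of the idea, under the assumption that blockWords holds one block's bare words, is:

// Integer value of a single word: the sum of its character (ASCII) codes.
// This is also that word's N = 1 sequence value.
def wordValue = { String w ->
  int v = 0
  for (c in w.toCharArray()) v += (int) c
  v
}

def values = blockWords.collect { wordValue(it) }   // N = 1 values
def sequenceValues = [:]                            // n -> list of sequence values
for (n in 2..N) {
  sequenceValues[n] = []
  for (p in 0..(values.size() - n)) {
    // value of the length-n sequence starting at word p: sum of n word values
    sequenceValues[n] << values.subList(p, p + n).sum()
  }
}

As the slide notes, different words or sequences can collide on the same integer value, which is why the maps built next keep the actual word strings alongside.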

Worker – Local Map Generation
For each Sequence in each Block
– Produce a Map keyed by the sequence value, where each entry is itself a Map from the corresponding word strings to the places where that word string is found in the input file
– Save this in a structure that is indexed by N, where each element contains a list of the Maps produced above
For each worker, produce a composite Map combining the individual Maps
– Save this in a structure indexed by N
– This is the Concordance for this worker
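
A sketch of the nested structure (illustrative names, not the talk's code): the outer map is keyed by sequence value, and each entry maps the word strings having that value to their positions in the file.

// concordance[n] : sequence value -> [ word string -> list of positions ]
def concordance = [:]
for (n in 1..N) concordance[n] = [:]

// record one occurrence of a length-n sequence at a given file position
def record = { int n, int value, String sequence, int position ->
  def byWords = concordance[n].get(value, [:])   // Groovy get-with-default
  byWords.get(sequence, []) << position          // inserts list and appends
}

record(2, 657, 'the lord', 3412)   // hypothetical example values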

Worker – Merge Phase
For each of the N partial Concordances
– Sort the integer keys into descending order
– For each key in the Nth partial Concordance
– Send the corresponding Map Entry to the Reader
– The Map Entry contains a Map of the word sequences and their locations within the file
This will be modified in the final version, which overlaps the merge / output phase
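
Continuing the sketch above, the descending-order scan might look like this, where workerToReader is a hypothetical name for the worker's channel end back to the Reader:

// For each N value, emit this worker's partial concordance entries
// with the integer keys in descending order.
for (n in 1..N) {
  def partial = concordance[n]
  partial.keySet().sort { a, b -> b <=> a }.each { key ->
    workerToReader.write( [n, key, partial[key]] )   // one Map Entry per message
  }
}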

Worker – Parallelisation
Each Worker can be parallelised by N
Data structures indexed by N can be written to in parallel
– Provided each element of the parallel only accesses a single value of N
– Access to any shared structures is read only
Thus, depending on the number of available Threads (T) in the Worker's processor, each of these operations can be carried out in parallel
Thus the design is scalable in N and T

Parallelising the Worker's Join

def localNPrimaryMapList = []   // holds concordance for each N value
for ( n in 1..N) localNPrimaryMapList[n] = new PrimaryKeyMap()
for ( s in 0..<startIndexes.size()) {
  /* sequential version
  for ( n in 1..N) {
    defs.initPrimaryMap( localNPrimaryMapList[n],
                         localEqualWordMapListN[n][s])
  }
  */
  def procNet = (1..N).collect { n ->
    new InitialJoiner( primaryMap: localNPrimaryMapList[n],
                       otherMap: localEqualWordMapListN[n][s]) }
  new PAR(procNet).run()
}

InitialJoiner – Process Definition

class InitialJoiner implements CSProcess {
  // this is a non-standard CSP process as it has no channels!
  // it relies on the fact that the primaryMap can be written to by this
  // process exclusively
  def primaryMap
  def otherMap

  void run() {
    defs.initPrimaryMap( primaryMap, otherMap)
  }
}

Creating Equal Block Maps – Sequential

def localEqualWordMapListN = []   // contains an element for each N value
for ( i in 1..N) localEqualWordMapListN[i] = []
def maxLength = BL - N
for ( WordBlock wb in wordBlocks) {
  // sequential version that iterates through the sequenceBlockList
  for ( SequenceBlock sb in wb.sequenceBlockList) {  // one sb for each value of N
    def length = maxLength
    def sequenceLength = sb.sequenceList.size()
    if ( sequenceLength < maxLength) length = sequenceLength
    def equalMap = defs.extractEqualValues( length, wb.startIndex,
                                            sb.sequenceList)
    def equalWordMap = defs.extractUniqueSequences( equalMap, sb.Nvalue,
                                                    wb.startIndex, wb.bareWords)
    localEqualWordMapListN[sb.Nvalue] << equalWordMap
  }
}

Creating Equal Block Maps – Parallel

def localEqualWordMapListN = []   // contains an element for each N value
for ( i in 1..N) localEqualWordMapListN[i] = []
def maxLength = BL - N
for ( WordBlock wb in wordBlocks) {
  def procNet = (1..N).collect { n ->
    new ExtractEqualMaps( n: n,
                          maxLength: maxLength,
                          startIndex: wb.startIndex,
                          sequenceList: wb.sequenceBlockList[n-1].sequenceList,
                          words: wb.bareWords,
                          localMap: localEqualWordMapListN[n]) }
  new PAR(procNet).run()
}

ExtractEqualMaps – Process Definition

class ExtractEqualMaps implements CSProcess {
  def n
  def maxLength
  def startIndex
  def sequenceList
  def words
  def localMap

  void run() {
    def length = maxLength
    def sequenceLength = sequenceList.size()
    if ( sequenceLength < maxLength) length = sequenceLength
    def equalMap = defs.extractEqualValues( length, startIndex, sequenceList)
    def equalWordMap = defs.extractUniqueSequences( equalMap, n,
                                                    startIndex, words)
    localMap << equalWordMap
  }
}

Parallelisation Effect
The presented results have the parallel version of InitialJoiner deployed in both versions
The effect of the previous parallelisation is immediately observable in the results
– Worker Style 1 has the sequential version of creating the Equal Maps
– Worker Style 2 has the parallel version of creating the Equal Maps
The system does allow the user to choose whether to output sequences that occur only once
– All results presented do NOT output a sequence if it occurs only once

Results (times in msecs) – Bible

Worker Style | Workers | N | Worker Distribute | Worker Equal | Worker Join | Worker Merge | Worker Total | Reader Distribute | Reader Merge | Reader Total | Output File Size (KB)
1 | 4 | 3 | 3,749 | 138,263 | 1,147 | 19,263 | 162,421 | 3,045 | 159,205 | 162,250 | 17,616
1 | 8 | 3 | …,389 | 69,… | … | …,702 | 95,383 | 2,997 | 92,269 | 95,266 | 17,616
2 | 4 | 3 | 3,046 | 53,600 | 1,030 | 18,647 | 76,323 | 2,592 | 73,701 | 76,293 | 17,616
2 | 8 | 3 | …,629 | 27,… | … | …,758 | 54,543 | 3,761 | 50,159 | 53,920 | 17,616
2 | 8 | 6 | 4,245 | 65,481 | 1,291 | 53,736 | 124,753 | 3,308 | 121,186 | 124,494 | 25,810
2 | 12 | 2 | …,209 | 11,… | … | …,756 | 29,957 | 4,790 | 23,807 | 28,597 | 12,708
2 | 12 | 3 | 4,750 | 17,… | … | …,008 | 44,034 | 4,026 | 39,423 | 43,449 | 17,616
2 | 12 | 4 | …,870 | 25,… | … | …,945 | 62,044 | 4,042 | 57,393 | 61,435 | 21,868
2 | 12 | 5 | …,030 | 34,… | … | …,030 | 82,003 | 4,089 | 77,928 | 82,017 | 23,957
2 | 12 | 6 | 5,057 | 43,544 | 1,048 | 53,247 | 102,896 | 4,041 | 98,287 | 102,328 | 25,810

Commentary – Worker Equal Speedup
Speedup is T_slower / T_faster
For N = 3, Workers = 4 and 8
– Speedup of Worker Style 2 (parallel) over Worker Style 1 (sequential)
– W = 4: 2.58 and W = 8: 2.52
– Solely due to the parallelisation of Extract Equal Maps using the available threads (2)
For N = 3 and Workers = 4, 8 and 12
– Speedup due to additional workers:

            W = 8   W = 12
    W = 4   …       …
    W = 8           …

Commentary – Overall Merge Effects
For N = 3
– The Merge time is very similar across worker counts
– This demonstrates that the Reader is the bottleneck
Merge Parallelisation
– There is an option here to parallelise more by undertaking merges in parallel
Worker Total Time speedup:

            W = 8   W = 12
    W = 4   …       …
    W = 8           …

Overlapped Merge / Output Architecture
Each Worker connects to one Merge process per N value (Merge N = 1, Merge N = 2, Merge N = 3), with the Reader coordinating

Commentary on Revised Architecture
The workers output each of the N primary maps in parallel to the respective Merge process
– Each worker has N processes that output the entries in each primary key map in descending sorted order
– There is one Merge process per N value
– Each Merge process writes its own file
When the worker has finished
– It sends a message to the Reader informing it of termination
– This enables calculation of the overall time
The architecture implements the CSP Client-Server design pattern, thereby guaranteeing freedom from deadlock
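
A skeleton of one such Merge process is sketched below, assuming each worker sends its entries in descending key order as [n, key, mapEntry] messages followed by a null as an end-of-stream marker; the field names and wiring are illustrative rather than the talk's code.

import org.jcsp.lang.*   // for CSProcess and ChannelInput (assumed package)

// One Merge process per N value: a k-way merge over the worker streams,
// always taking the largest key on offer and writing to its own file.
class Merge implements CSProcess {
  def List inputs       // one ChannelInput per worker
  def String fileName   // each Merge process writes its own file

  void run() {
    def heads = inputs.collect { it.read() }   // first entry from each worker
    def writer = new File(fileName).newPrintWriter()
    while (heads.any { it != null }) {
      int best = -1                            // worker offering the largest key
      heads.eachWithIndex { h, i ->
        if (h != null && (best < 0 || h[1] > heads[best][1])) best = i
      }
      writer.println heads[best]               // emit the winning entry
      heads[best] = inputs[best].read()        // advance that worker's stream
    }
    writer.close()
  }
}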

Results – Overlapped Merge (msecs)

Worker Style | Workers | N | Worker Distribute | Worker Equal | Worker Join | Worker Merge | Worker Total | Reader Distribute | Reader Merge | Reader Total | Output File Size (KB) | Input
2 | 12 | 3 | 4,750 | 17,… | … | …,008 | 44,034 | 4,026 | 39,423 | 43,449 | 17,616 | Bible
2 | 12 | 6 | 5,057 | 43,544 | 1,048 | 53,247 | 102,896 | 4,041 | 98,287 | 102,328 | 25,810 | Bible
3 | 12 | 3 | 2,969 | 18,… | … | …,902 | 32,319 | 2,731 | 29,501 | 32,232 | 6,297 (N = 1 file) | Bible
3 | 12 | 6 | 3,202 | 44,342 | 1,082 | 15,239 | 63,866 | 2,715 | 61,049 | 63,809 | 6,297 (N = 1 file) | Bible
3 | 12 | 6 | 1,338 | 17,… | … | …,625 | 27,361 | 1,140 | 26,162 | 27,302 | 2,044 (N = 1 file) | WaD

Speedup Calculations
Worker 2 to Worker 3 Merge Speedup (Bible)
– N = 6: 3.49
– N = 3: …
Speedup on Input File – Compare Bible to WaD
– Overall times (W = 12, N = 6):

            Words     Time
    Bible   802,300   63,809
    WaD     268,500   27,302
    Ratio   2.99      2.34

Conclusion
Utilisation of access to shared memory needs to be considered when designing the algorithm
– This was done from the outset with the choice of data structures
The parallelisation of sequential sections is relatively straightforward
– Provided there are no memory access violations between parallel processes
– The JCSP library made this particularly easy
The resulting system is scalable in
– The number of Workers
– The value of N and the number of available Threads
– 31 threads used in this implementation
The creation of Equal Maps needs to be further parallelised

Further Work
The School has recently installed a new multi-node system
– 18 nodes, each with a dual quad-core hyper-threading processor
– 16 threads in each node
– Hence N = 16 can be undertaken in one pass
– 16 GB memory
– 250 GB local disk
– Gigabit Ethernet communications infrastructure
It's obvious what I shall be doing!