GEM: A Framework for Developing Shared-Memory Parallel GEnomic Applications on Memory Constrained Architectures
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
ICPP 2015, Beijing, China

Motivation

– Sequencing costs are decreasing, and the available data is increasing!
– Parallel processing is inevitable!

[Figure slide: sequencing-cost and data-growth trends; the "Adapted from" attributions are elided in the transcript.]

New Trends in Computational Technologies

Besides sequencing technologies, computational technologies are also developing fast.
New technology trend: many cores, but limited memory per core.
A prominent example: Intel Xeon Phi (MIC) architectures
– 61 cores and 16 GB of memory in the 7200 series
– Many advantages: SIMD vector operations, compatibility with CPUs
Challenges:
– Load balancing
– Memory over-consumption and disk thrashing
– I/O contention

Proposed Middleware System

We propose GEM for developing shared-memory parallel genomic applications on memory-constrained many-core architectures.
– Runs on the MIC but is not designed specifically for it: it does not use MIC-specific features such as the 512-bit SIMD instruction set.
– Supports two execution models, similar to other middleware systems for genomic data processing (GATK and PAGE).

Inter-dependent Processing of Genomes

[Figure slide: diagram of the inter-dependent execution model.]

Independent Processing of Genomes

[Figure slide: diagram of the independent execution model.]

Load Tasks

– Read data chunks from the disk
– Generate the Genome Matrix (GM) data structure, in one of two layouts (sketched below):
  – Locus-based Genome Matrix
  – Sequence-based Genome Matrix
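The slides do not show the genome matrix layout itself; purely as an illustration, here is a minimal sketch of what the two organizations might look like. All type and field names are hypothetical:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of the two genome-matrix organizations.

// One aligned base observed at a locus.
struct LocusEntry {
    char    base;     // A, C, G, T, or N
    uint8_t quality;  // base quality score
};

// Locus-based GM: indexed by position; each cell holds the pileup of
// bases covering that locus (natural for per-locus analyses such as SNP calling).
struct LocusBasedGM {
    int64_t region_start;                         // first locus of the region
    std::vector<std::vector<LocusEntry>> pileup;  // pileup[i]: bases at region_start + i
};

// Sequence-based GM: indexed by read; each cell holds one whole
// aligned read (natural for per-genome statistics).
struct AlignedRead {
    int64_t     position;  // leftmost mapping position
    uint16_t    flags;     // SAM flags
    std::string bases;     // read sequence
};

struct SequenceBasedGM {
    std::vector<AlignedRead> reads;
};
```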

Enhancements on Load Tasks

To decrease the memory requirement:
– Selective Loading: the SAM format has 11 data fields, and many types of analyses do not need all of them. The user declares which fields are needed for processing, and we keep only those.
– Compact Storage: we modify the Samtools libraries to decrease the number of bits used for certain fields. For example, the number of alignments at a particular locus practically never exceeds 2^16 - 1 in a single genomic dataset, so we use 16 bits instead of a full integer for that field (sketched below).

To decrease the overhead of load tasks:
– Finding a specific region in genomic data is time-consuming: too many load tasks increase the overhead, while too few can damage the load balance.
– Subchunking: each load task fills multiple GMs.
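A minimal sketch of the compact-storage idea, assuming a hypothetical per-locus record; the actual modified Samtools structures are not shown in the slides:

```cpp
#include <cstdint>

// Wasteful layout: a 32-bit int per field even when the value range is small.
struct LocusStatsWide {
    int32_t depth;                   // alignments covering this locus
    int32_t num_A, num_C, num_G, num_T;
};  // 20 bytes per locus

// Compact layout: depth practically never exceeds 2^16 - 1 in one
// dataset, so 16-bit counters suffice.
struct LocusStatsCompact {
    uint16_t depth;
    uint16_t num_A, num_C, num_G, num_T;
};  // 10 bytes per locus -- half the footprint

static_assert(sizeof(LocusStatsCompact) <= sizeof(LocusStatsWide) / 2,
              "compact layout should halve the per-locus footprint");
```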

Map & Reduce Tasks

Map tasks:
– Defined by the user
– Take a genome matrix as input

Intermediate results:
– The user can optionally define a combine function to reduce memory consumption
– The user chooses where to keep them: in memory or on disk

Reduce tasks:
– Defined by the user
– Take a list of intermediate results
– Intermediate results must be removed from memory by the user
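A sketch of what the user-facing interface might look like; all names are assumptions, since the slides state only that map and reduce tasks are user-defined and what their inputs are:

```cpp
#include <list>
#include <memory>

// Hypothetical interfaces for GEM's user-defined tasks.
struct GenomeMatrix;         // filled by load tasks
struct IntermediateResult;   // produced by map tasks

class GemApplication {
public:
    // Map task: consumes one genome matrix, emits an intermediate result.
    virtual std::unique_ptr<IntermediateResult>
    map(const GenomeMatrix& gm) = 0;

    // Optional combine: merges two intermediate results in place to
    // cut memory consumption before the reduce phase.
    virtual void combine(IntermediateResult& into,
                         const IntermediateResult& from) {}

    // Reduce task: consumes a list of intermediate results; the user
    // is responsible for removing them from memory afterwards.
    virtual void reduce(std::list<std::unique_ptr<IntermediateResult>>& results) = 0;

    virtual ~GemApplication() = default;
};
```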

Scheduling Scheme

– Load tasks increase memory consumption by loading data into memory.
– Map tasks decrease memory consumption by removing genome matrices from memory.
– If we assigned load tasks to all cores, I/O contention would increase and memory could be over-consumed.

Our goal is to schedule the tasks such that:
– Memory is not over-consumed
– Concurrently running map and load tasks are balanced
– Load balance is maintained

We use the following thresholds (see the sketch after this list):
– Maximum number of concurrently running load tasks
– Maximum number of genome matrices in memory
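A minimal sketch of how a master might apply these two thresholds when handing tasks to idle workers; the control flow is inferred from the slide, and the constants are illustrative, not GEM's:

```cpp
#include <cstddef>

// Illustrative thresholds; the paper tunes these on a training set.
constexpr size_t kMaxLoadTasks = 2;   // concurrently running load tasks
constexpr size_t kMaxLoadedGMs = 6;   // genome matrices resident in memory

enum class Task { Load, Map, Wait };

// Decide what an idle worker should do next.
Task schedule(size_t running_loads, size_t loaded_gms,
              size_t ready_gms /* loaded and awaiting a map task */) {
    if (ready_gms > 0)
        return Task::Map;                 // draining memory has priority
    if (running_loads < kMaxLoadTasks && loaded_gms < kMaxLoadedGMs)
        return Task::Load;                // safe to bring more data in
    return Task::Wait;                    // both thresholds hit: idle briefly
}
```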

Scheduling Example

[Figure slide: a master assigns load and map tasks to workers W1-W3, which fill a genome matrix container whose slots cycle through the states Empty, Loading, Available, and Being Processed. Example thresholds: max. 2 load tasks, max. 6 GMs, max. 100 elements per GM.]
– If there is no empty GM, a load task can temporarily process a GM.
– When a load task has loaded all the data in a region, a map task is assigned if there is no available GM.
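The container states on the slide suggest a small per-GM state machine; a sketch under that assumption, with state names taken from the slide and transitions inferred:

```cpp
// States a genome-matrix slot moves through, as labeled on the slide.
enum class GMState {
    Empty,          // slot is free for a load task
    Loading,        // a load task is filling it
    Available,      // fully loaded, waiting for a map task
    BeingProcessed  // a map task is consuming it
};

// Inferred transitions: Empty -> Loading (load task claims the slot),
// Loading -> Available (region fully read), Available -> BeingProcessed
// (map task assigned), BeingProcessed -> Empty (matrix freed).
GMState next(GMState s) {
    switch (s) {
        case GMState::Empty:          return GMState::Loading;
        case GMState::Loading:        return GMState::Available;
        case GMState::Available:      return GMState::BeingProcessed;
        case GMState::BeingProcessed: return GMState::Empty;
    }
    return GMState::Empty;  // unreachable
}
```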

Sample Implementation with GEM

Parameters defined by the user:
– Execution model: independent
– Selective loading: base sequences
– Where to keep intermediate results: in memory
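A sketch of how such a configuration might be expressed in code; the option names are hypothetical, since the slide lists the parameters but not GEM's configuration syntax:

```cpp
// Hypothetical configuration object mirroring the slide's parameters.
struct GemConfig {
    enum class Model { Independent, InterDependent };
    enum class ResultPlacement { InMemory, OnDisk };

    Model           execution_model = Model::Independent;
    ResultPlacement intermediate    = ResultPlacement::InMemory;

    // Selective loading: keep only the SAM fields this analysis needs.
    bool load_base_sequences = true;
    bool load_qualities      = false;
    bool load_flags          = false;
};
```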

Experiments

Architecture: Xeon Phi SE10P
– Number of cores: 61
– Processor speed: 1.1 GHz
– Memory: 8 GB

Applications:
– SNP Calling: closely follows VarScan's algorithm
– Locus-based Statistical Analysis (LSA): a simplified version of GATK's DepthOfCoverage tool
– Statistical Analysis per Genome (SAG): performs various statistical analyses for each genome separately (such as counting the sequences in a given list of genomic regions, or the occurrences of each nucleotide base)

Parameter configuration (based on executions performed on a training set):
– Maximum number of load tasks: 40 (decreased to 20 when the input size is 20 files, due to I/O contention)
– Inter-dependent processing (SNP Calling and LSA): region length 12M; maximum size of genome matrices 2400
– Independent processing (SAG): region length 64M; maximum size of genome matrices 800

Parallel Scalability

[Figure slide: speedup over the basic method for the three applications; GEM's scalability is 14.4x, 15.4x, and 12.4x across the three benchmarks.]

Comparison with Other Middleware Systems

– Architecture: CPU with 8 cores and 12 GB memory
– Applications: two GATK tools, CountBase and CountLoci

[Figure slide: execution time of GATK, PAGE, and GEM with varying data sizes.]

Summary

We developed a middleware system for building parallel genomic applications on memory-constrained many-core architectures. It:
– Decreases the memory requirements of tasks
– Prevents over-consumption of memory
– Decreases I/O contention
GEM achieves good scalability and outperforms GATK and PAGE.

Thank you!