Introduction to Research 2007
Ashok Srinivasan, Florida State University
www.cs.fsu.edu/~asriniva


Introduction to Research 2007
Ashok Srinivasan, Florida State University

Recent collaborators
- V. Aggarwal, J. Kolhe, L. Ji, M. Mascagni, H. Nymeyer, and Y. Yu (Florida State University)
- S. Kapoor (IBM Austin)
- S. Namilae (Oak Ridge National Lab)
- M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, and R. Sharma (Sri Sathya Sai University, India)
- N. Chandra (University of Nebraska at Lincoln)

Research support
- Funding: DoD, FSU, NSF
- Computer time: IBM, NCSA, NERSC, ORNL

Outline
- Research Areas
  - Computational Nanotechnology
  - Computational Biology
  - High Performance Computing on Multicore Processors
- Potential Research Topics
- Graduate Courses

Research Areas
High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software
- Current topics: Computational Nanotechnology, Computational Biology, HPC on Multicore Processors
- New topics: Dynamic Data-Driven Applications
- Old topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression

Importance of Parallel Computing
- Makes feasible products based on a more fundamental understanding of science
  - Examples: nanotechnology, medicine
- Increasing relevance to industry
  - In 1993, fewer than 30% of the top 500 supercomputers were commercial
  - Now, over 50% are commercial: finance and insurance, medicine, aerospace and automobiles, telecom, oil exploration, even shoes (Nike), potato chips, and toys!

Architectural Trends
- Massive parallelism
  - 10K-processor systems will be commonplace
  - The large end already has over 100K processors
- Single-chip multiprocessing
  - All processors will be multicore
  - Heterogeneous multicore processors: the Cell used in the PS3, Intel's 80-core prototype
  - Processors with hundreds of cores are already commercially available
- Distributed environments, such as the Grid
- But it is hard to get good performance on these systems

Computational Nanotechnology
- Example application: the carbon nanotube (CNT)
  - Can span 23,000 miles without failing under its own weight
  - 100 times stronger than steel
  - Lighter than a feather
  - Conducts heat better than diamond
- Computations are used to understand materials at the atomic scale, so that better materials can be designed
  - Easier than experimentation at the nanometer scale

CNT Tensile Test
- Pull the CNT at constant speed
  - Determine material properties from the force-displacement response
- Computational difficulties
  - Time step size ~ 10^-15 seconds, while the desired time range is much larger
  - A million time steps are required just to reach 10^-9 s
  - ~500 hours of computing for ~40K atoms using GROMACS
  - MD therefore uses an unrealistically large pulling speed: 1 to 10 m/s instead of realistic speeds on the order of 10^-5 m/s
  - Results at unrealistic speeds are unrealistic!
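The time-step arithmetic above can be checked directly. This sketch assumes the figures quoted on the slide (a 10^-15 s step and ~500 wall-clock hours per nanosecond for ~40K atoms in GROMACS):

```python
# Back-of-the-envelope check of the slide's time-step numbers.
time_step = 1e-15      # MD time step: one femtosecond, in seconds
target = 1e-9          # one nanosecond of simulated time

steps = target / time_step
print(f"steps to reach 1 ns: {steps:.0e}")   # ~1e6, i.e. a million steps

# At ~500 wall-clock hours per nanosecond, every factor of 10 in
# simulated time costs another factor of 10 in computing hours.
hours_per_ns = 500
print(f"hours for 1 microsecond: {hours_per_ns * 1e3:.0e}")
```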

Difficulty with Parallelization
- Results on scalable codes: even the best cannot efficiently reduce the wall-clock time per iteration below ~10 ms
- If we want to simulate to a millisecond with a 1 fs time step, that is 10^12 iterations, i.e. about 10^10 s ≈ 300 years of computing
- Even if we scaled to 10 μs per iteration, it would take about 4 months
(Scaling results shown: NAMD, 327K-atom ATPase with PME, Blue Gene, IPDPS 2006; NAMD, 92K-atom ApoA1 with PME, Blue Gene, IPDPS 2006; IBM Blue Matter, 43K-atom Rhodopsin, Blue Gene, tech report 2005; Desmond, 92K-atom ApoA1, SC 2006)
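A quick sanity check of the "300 years" estimate, under the slide's assumptions (1 fs step, 1 ms of simulated time, 10 ms of wall clock per iteration at best):

```python
# Verify the scaling argument with explicit arithmetic.
step = 1e-15      # simulated seconds per MD iteration (1 fs)
target = 1e-3     # goal: one millisecond of simulated time
per_iter = 10e-3  # best achievable wall clock per iteration: 10 ms

iters = target / step                            # 1e12 iterations
years = iters * per_iter / (3600 * 24 * 365)
print(f"{iters:.0e} iterations -> {years:.0f} years")   # ~317 years

# Even at 10 microseconds per iteration:
months = iters * 10e-6 / (3600 * 24 * 30)
print(f"{months:.1f} months")                    # ~3.9 months
```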

Data-Driven Time Parallelization
- Each processor simulates a different time interval
- The initial state is obtained by prediction, using prior data (except for processor 0)
- Verify whether the prediction for the end state is close to that computed by MD
- Prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results
- If the time interval is sufficiently large, then the communication overhead is small
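The predict-and-verify idea can be sketched with toy dynamics. Everything here is illustrative: `simulate_md` and `predict` are hypothetical stand-ins, not the actual implementation, and the serial loop stands in for concurrent MPI ranks:

```python
# Sketch of data-driven time parallelization: each worker simulates one
# time interval starting from a *predicted* state; an interval is
# accepted only if the MD result at the previous interval's end is
# close to the predicted start of this interval.

def simulate_md(state, interval):
    """Stand-in for an accurate (expensive) MD run over one interval."""
    return state + interval          # toy dynamics: state grows linearly

def predict(t, prior_results):
    """Stand-in for a predictor fitted to a database of prior runs."""
    return prior_results.get(t, 0.0)

def time_parallel(n_intervals, dt, prior, tol=1e-3):
    # In a real run each interval is its own processor; we loop serially.
    starts = [0.0] + [predict(i * dt, prior) for i in range(1, n_intervals)]
    ends = [simulate_md(s, dt) for s in starts]        # done in parallel
    for i in range(1, n_intervals):
        if abs(ends[i - 1] - starts[i]) > tol:   # verification failed:
            return i                             # restart from interval i
    return n_intervals                           # all intervals accepted

# With a perfect predictor, every interval is verified in one sweep:
perfect = {i * 1.0: i * 1.0 for i in range(1, 8)}
print(time_parallel(8, 1.0, perfect))   # 8
```

With a useless predictor (empty database), only the first interval survives verification, illustrating why the approach depends on good prior data.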

Results
- Speedup (CNT with 1000 atoms, Xeon/Myrinet cluster; experiments at v = 1 m/s, predicted using v = 10 m/s data)
  - Red line: ideal speedup
  - Blue: v = 0.1 m/s
  - Green: a different predictor
- Validation: compare the stress-strain response
  - Blue: exact results
  - Red: time-parallel results
  - Green: direct prediction

Computational Biology
- Data-driven time parallelization in the AFM simulation of proteins
- An order of magnitude improvement in performance by combining conventional and data-driven time parallelization, demonstrated with the protein Titin

High Performance Computing on Multicore Processors
Cell Architecture
- A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store
- Shared 512 MB to 2 GB main memory; the SPEs access it via DMA
- Peak speeds of 204.8 Gflops in single precision and ~14.6 Gflops in double precision across the SPEs
- 204.8 GB/s EIB bandwidth, 25.6 GB/s to main memory
- Two Cell processors can be combined to form a Cell blade with globally shared memory
(Figure: DMA put times, and memory-to-memory copy using the SPE local store vs. memcpy by the PPE)
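One way to see why DMA management dominates Cell programming: compare peak compute against memory bandwidth. The figures below are assumed peak values for the Cell (204.8 Gflop/s single precision across the SPEs, 25.6 GB/s to main memory), not measurements:

```python
# Ratio of peak compute to memory bandwidth for the Cell (assumed peaks).
peak_flops = 204.8e9   # single-precision flop/s across 8 SPEs (assumed)
mem_bw = 25.6e9        # bytes/s to main memory (assumed)

flops_per_byte = peak_flops / mem_bw
# 8 flops must be performed per byte fetched just to keep the SPEs busy,
# which is why DMA into local store must be overlapped with computation.
print(flops_per_byte)  # 8.0
```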

Cell MPI Results
- PE: consider the SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension
- DIS: in step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i
- MPI_Barrier timing: Cell with the PE algorithm compared against Xeon/Myrinet, NEC SX-8, and SGI Altix BX2 (table, times in μs)
- Broadcast bandwidth (figure)
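The two barrier patterns above can be sketched as per-rank partner schedules. Both finish in log2(p) steps; the helper names are illustrative, not from the intra-Cell MPI code, and p is assumed to be a power of two:

```python
# Partner schedules for the two barrier algorithms, for p SPUs.

def pe_partners(rank, p):
    """Pairwise exchange (hypercube): in each step, exchange with the
    neighbor along one dimension, i.e. flip one bit of the rank."""
    steps = p.bit_length() - 1          # log2(p) for a power of two
    return [rank ^ (1 << i) for i in range(steps)]

def dis_partners(rank, p):
    """Dissemination: in step i, send to rank + 2^i and receive from
    rank - 2^i (mod p)."""
    steps = p.bit_length() - 1
    return [((rank + 2**i) % p, (rank - 2**i) % p) for i in range(steps)]

print(pe_partners(0, 8))    # [1, 2, 4]
print(dis_partners(0, 8))   # [(1, 7), (2, 6), (4, 4)]
```

Pairwise exchange needs p to be a power of two; dissemination works for any p, at the cost of non-symmetric send/receive partners.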

Potential Research Topics
- Computational Biology
  - Data-driven time parallelization
  - Markov state modeling
  - Other topics
- Dynamic Data-Driven Applications
  - Combining simulations and experiments in superplastic forming
- High Performance Computing on Multicore Processors
  - Algorithms and libraries on the Cell processor, e.g., sorting, linear algebra
  - Good software cache / code overlaying implementations
- Other possible new directions
  - Applications in history, linguistics, medicine, etc.

Graduate Courses
- Parallel Computing, Spring 2008
  - MPI and OpenMP programming on traditional parallel machines
  - Threaded programming on multicore processors
  - Parallel algorithms
- Advanced Algorithms, Fall 2008
  - Approximation algorithms for NP-hard problems
  - Randomized algorithms
  - Cache-aware algorithms