HOCT: A Highly Scalable Algorithm for Training Linear CRF on Modern Hardware
Presented by Tianyuan Chen

Introduction Conditional random fields (CRFs) are probabilistic models for segmenting and labeling sequential data. They offer many advantages over other sequence models, but they suffer from one main drawback: training demands very large computational resources and may take days or even weeks. To address this inefficiency, we present our training algorithm, HOCT (Highly Optimized CRF Trainer).

Main Idea We found that existing training algorithms fail to make effective use of modern hardware. Our main idea is to leverage features of modern hardware to accelerate CRF training. To the best of our knowledge, this is the first study to exploit modern computer architectures to accelerate CRF training.

Related Work Several methods have been proposed to address the inefficiency of CRF training. Their common motivation is to reduce computation time by approximating the result of exact inference. Unlike these methods, we improve CRF training performance without affecting the final results. There are also several studies that exploit modern computer architectures to improve algorithm performance.

Our Methods We improved the performance of CRF training through the following approaches: – We improved the cache performance of CRF training by leveraging the software prefetching support of modern processors. – We used SIMD instructions to increase the data parallelism of CRF training. – We improved training performance by letting our algorithm manage disk operations itself.

Prefetching Training a CRF model requires frequent access to large matrices whose sizes range from tens to hundreds of megabytes. The access pattern to these matrices is effectively random, so large numbers of cache misses occur. Modern processors provide software prefetch instructions that allow data to be moved into the cache before it is actually used. Used properly, prefetching can hide much of the cache miss latency by overlapping it with other computation. In our algorithm, while performing computations on some data from the matrices, we prefetch into the cache the data that will be accessed in the near future, thereby hiding the cache miss latency.
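A minimal sketch of this idea in C, assuming a random access pattern driven by an index array; the function name, index array, and prefetch distance are illustrative assumptions, not the paper's actual kernel:

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 8  /* how many iterations ahead to prefetch (tunable) */

/* Sum matrix elements selected by a random index array, issuing a
 * software prefetch for the element we will touch a few iterations
 * from now so the miss latency overlaps with current computation.
 * __builtin_prefetch is the GCC/Clang prefetch-hint builtin. */
double sum_indexed(const double *matrix, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&matrix[idx[i + PREFETCH_DISTANCE]],
                               0 /* read */, 1 /* low temporal locality */);
        sum += matrix[idx[i]];
    }
    return sum;
}
```

The prefetch distance must be tuned: too small and the data has not arrived by the time it is used; too large and it may be evicted before use.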

Prefetching [figure slide]

SIMD SIMD stands for “Single Instruction, Multiple Data”. CPUs with SIMD support can perform basic operations on several data elements in parallel, and most modern processors provide such instructions. CRF training involves many operations on large vectors, such as addition, subtraction, and dot products. In our algorithm, we used SIMD instructions to accelerate these computations on large vectors.
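A minimal sketch of a SIMD dot product in C using SSE2 intrinsics (the function name is hypothetical; the paper's actual kernels are not shown in the slides):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Dot product of two double vectors, processing two elements per
 * instruction with 128-bit SSE2 registers; a scalar loop handles
 * the tail when n is odd. */
double dot_simd(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);          /* load two doubles */
        __m128d vb = _mm_loadu_pd(&b[i]);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb)); /* multiply-accumulate */
    }
    double partial[2];
    _mm_storeu_pd(partial, acc);                   /* horizontal reduce */
    double sum = partial[0] + partial[1];
    for (; i < n; i++)                             /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}
```

Wider instruction sets (e.g. AVX with 256-bit registers) follow the same pattern with four doubles per operation.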

Memory-Disk Management For large tasks, the memory cost of training is substantial because of the huge number of features. When the memory requirement exceeds the size of physical memory, the OS uses disk space as an extension of physical memory, which degrades performance drastically. In our algorithm, HOCT manages the memory-disk traffic itself: while performing computations, it writes to disk the data that will not be used in the near future and reads it back into a buffer when it is needed. This strategy improves training efficiency substantially.
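A minimal sketch of such application-managed spilling in C; the swap-file layout, block size, and function names are hypothetical, since the slides do not show HOCT's actual buffering policy:

```c
#include <stdio.h>
#include <stddef.h>

/* Write a block of doubles to a fixed offset in a swap file, so that
 * data not needed soon can be evicted from memory deliberately
 * rather than by OS paging. */
int spill_block(const double *block, size_t count, long block_id, FILE *swap)
{
    if (fseek(swap, block_id * (long)(count * sizeof(double)), SEEK_SET) != 0)
        return -1;
    return fwrite(block, sizeof(double), count, swap) == count ? 0 : -1;
}

/* Read a previously spilled block back into an in-memory buffer
 * just before it is needed again. */
int reload_block(double *buffer, size_t count, long block_id, FILE *swap)
{
    if (fseek(swap, block_id * (long)(count * sizeof(double)), SEEK_SET) != 0)
        return -1;
    return fread(buffer, sizeof(double), count, swap) == count ? 0 : -1;
}
```

The swap file would be opened once in binary read/write mode (e.g. fopen("hoct.swap", "w+b")). The key difference from OS paging is that the algorithm knows its own access pattern, so it can evict exactly the blocks that will not be touched soon.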

Memory-Disk Management [figure slide]

Final Experimental Result [figure slide]

Thank you!