Accelerating the Dynamic Time Warping Distance Measure Using Logarithmic Arithmetic
Joseph Tarango, Eamonn Keogh, Philip Brisk

Motivation
100% fatality rate if left untreated
– Influx of fluid raises the heart muscle's perfusion threshold
– Heart starves for oxygen and stops pumping blood
Easy to treat
– Puncture pericardium and drain fluid
Hard to detect
– People are not (yet?) born with integrated sensors
– Stringent real-time constraints between onset and death

Pulsus Paradoxus
[Figure: Respiration and PPG (photoplethysmographic) traces; panels: Normal, Pulsus Paradoxus]
Pulse shows interference from respiration
Under pericardial tamponade, inhalation reduces the heart's ability to pump blood
Real-time detection is computationally tractable on a bedside device at the hospital
We need more efficient solutions for real-time monitoring

Time Series (Formal Definition)
Ordered sequence of data points
– $T = (t_1, t_2, \ldots, t_m)$
In the online context, consider a subsequence
– $T_{i,k} = (t_i, t_{i+1}, \ldots, t_{i+k})$
[Figure: time series T with candidate $C = T_{i,k}$ and query Q]

Time Series Similarity
[Figure panels: Euclidean Distance (ED); Dynamic Time Warping (DTW)]

DTW
Conceptual idea:
– Enumerate all possible warping paths
– Choose the one of minimum cost
Implementation:
– Dynamic programming computes an optimal solution in quadratic time
[Figure: warping between candidate C and query Q]
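The dynamic-programming formulation on this slide can be sketched as follows. This is a minimal illustration, not the optimized software the talk starts from: the function name `dtw_distance` is hypothetical, the per-point cost is assumed to be the squared difference, and no warping constraint is applied.

```python
import math

def dtw_distance(q, c):
    """Unconstrained DTW with squared point cost (assumed cost function)."""
    n, m = len(q), len(c)
    # cost[i][j] = cheapest warping path aligning q[:i] with c[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            # extend the best of the three admissible predecessor cells
            cost[i][j] = d + min(cost[i - 1][j],      # repeat c[j-1]
                                 cost[i][j - 1],      # repeat q[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

The two nested loops fill an (n+1) by (m+1) table once, which is where the quadratic running time comes from.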

The Case for DTW
“… similarity search is the bottleneck for virtually all time series data mining algorithms.” [SIGKDD 2012]
“After an exhaustive literature search of more than 800 papers [PVLDB 2008], we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments.” [SIGKDD 2012]
“We can exactly search under DTW much faster than the current state-of-the-art Euclidean distance search algorithms.” [SIGKDD 2012]

Objective and Contribution
Design application-specific DTW processor with HW acceleration
– Performance
– Energy consumption
Start with highly optimized DTW software [SIGKDD 2012]
– Double-precision floating-point arithmetic written in C
Prior work [CODES-ISSS 2013]
– DTW processor derived from SIGKDD software
This talk: DTW processor using logarithmic number systems (LNS)
– Higher performance
– Reduced energy consumption
– Reduced area

Logarithmic Number System (LNS)
Represent X as log X
The good news
– $\log(XY) = \log X + \log Y$ (fixed-point +)
– $\log(X/Y) = \log X - \log Y$ (fixed-point -)
– $\log(X^n) = n \log X$ (fixed-point *)
– $\log(X^{1/n}) = (1/n) \log X$ (fixed-point /)
The bad news
– $\log(X \pm Y) = \log X + \log(1 \pm 2^{\log Y - \log X})$ (ROM)
– Conversion to/from LNS (log/exp)
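The identities on this slide can be checked numerically with a small sketch. All function names here are hypothetical, values are restricted to positive reals (sign handling is omitted), logarithms are assumed to be base 2, and the correction term that the slide assigns to a ROM lookup is simply computed with `math.log2`.

```python
import math

def to_lns(x):
    """Encode a positive real as its base-2 logarithm."""
    return math.log2(x)

def from_lns(lx):
    """Decode an LNS value back to a real number."""
    return 2.0 ** lx

def lns_mul(lx, ly):
    # log(XY) = log X + log Y: multiplication becomes addition
    return lx + ly

def lns_div(lx, ly):
    # log(X/Y) = log X - log Y: division becomes subtraction
    return lx - ly

def lns_add(lx, ly):
    # log(X + Y) = log X + log(1 + 2^(log Y - log X)), with X the larger operand
    hi, lo = max(lx, ly), min(lx, ly)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

def lns_sub(lx, ly):
    # log(X - Y) = log X + log(1 - 2^(log Y - log X)); only defined for X > Y here
    if lx <= ly:
        raise ValueError("this sketch only handles X > Y")
    return lx + math.log2(1.0 - 2.0 ** (ly - lx))
```

For example, `from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))` recovers 15 up to floating-point rounding. Multiplication and division reduce to one addition or subtraction, while addition and subtraction need the nonlinear correction term.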

LNS Operators
Based on work by F. de Dinechin and J. Detrey [Asilomar 2003, 2005; ASAP 2005; DSD 2005; JMM 2006]

Z-Normalization
Arithmetic Mean [SIGKDD 2012, CODES-ISSS 2013]
Geometric Mean (Good for LNS)
[Figure: query Q and candidate C under each normalization]
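A minimal sketch of the two quantities named on this slide. `z_normalize` is the conventional arithmetic-mean form (subtract the mean, divide by the standard deviation). The slide does not spell out how the geometric mean enters the LNS normalization, so `geometric_mean_log2` only illustrates the one property that can be stated with certainty: in the log domain, the geometric mean of positive values is an arithmetic mean of their logarithms. Both function names are hypothetical.

```python
import math

def z_normalize(x):
    """Conventional z-normalization: zero mean, unit (population) std."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(var)
    if std == 0.0:
        # a constant sequence has no spread; map it to all zeros
        return [0.0] * n
    return [(v - mean) / std for v in x]

def geometric_mean_log2(x):
    """log2 of the geometric mean of positive values.

    Equals the arithmetic mean of the log2 values, so on LNS-encoded
    data it needs only additions and one division by a constant.
    """
    return sum(math.log2(v) for v in x) / len(x)
```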

Bounding Warp Paths and LB_Keogh
Sakoe-Chiba Band
– $U_i = \max(q_{i-r} : q_{i+r})$
– $L_i = \min(q_{i-r} : q_{i+r})$
[Figure: query Q with upper envelope U and lower envelope L, compared against candidate C]
If LB_Keogh > threshold, then DTW > threshold
No match ==> no need to compute DTW
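The envelope definitions above translate directly into code. This sketch assumes equal-length sequences, a band half-width `r` measured in samples, and squared distances; the names `envelope` and `lb_keogh` are illustrative rather than taken from the authors' software.

```python
def envelope(q, r):
    """Upper and lower envelopes of q under a Sakoe-Chiba band of half-width r."""
    n = len(q)
    upper, lower = [], []
    for i in range(n):
        # clip the window q[i-r .. i+r] to the valid index range
        window = q[max(0, i - r): min(n, i + r + 1)]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

def lb_keogh(q, c, r):
    """Squared LB_Keogh: cost of the parts of c that fall outside q's envelope."""
    upper, lower = envelope(q, r)
    total = 0.0
    for ci, u, l in zip(c, upper, lower):
        if ci > u:
            total += (ci - u) ** 2
        elif ci < l:
            total += (l - ci) ** 2
        # points inside the envelope contribute nothing
    return total
```

Because the bound costs a single pass over the candidate, a candidate whose bound already exceeds the threshold can be discarded without running the quadratic DTW computation.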

Early Abandoning, Reordering and Reversing the Query/Candidate
Stop as soon as you exceed the threshold
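Early abandoning can be sketched for the Euclidean case as below. The function name, the squared-threshold convention, and the optional `order` argument (a hypothetical way to visit points in a different order, as the slide title suggests) are assumptions for illustration.

```python
import math

def early_abandon_ed(q, c, threshold_sq, order=None):
    """Squared Euclidean distance, or math.inf once the running sum exceeds threshold_sq."""
    indices = order if order is not None else range(len(q))
    total = 0.0
    for i in indices:
        total += (q[i] - c[i]) ** 2
        if total > threshold_sq:
            # the sum can only grow, so the final distance must exceed the threshold
            return math.inf
    return total
```

Since every term is non-negative, visiting the points expected to contribute most first tends to trigger the abandon earlier without changing the result for candidates that are not abandoned.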

Early Abandoning DTW
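One way to apply the same idea to DTW is to check, after each row of the dynamic-programming table, whether every cell inside the band already exceeds the threshold. The sketch below assumes equal-length sequences, a Sakoe-Chiba band of half-width `r`, and squared costs; `dtw_early_abandon` is a hypothetical name, and the optimized software referenced in the talk may use a tighter abandoning test than this one.

```python
import math

def dtw_early_abandon(q, c, r, threshold_sq):
    """Banded DTW that returns math.inf as soon as no path can stay under threshold_sq."""
    n = len(q)  # assumes len(c) == n
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        lo, hi = max(1, i - r), min(n, i + r)
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        # every warping path crosses row i inside the band, and costs never decrease
        if min(curr[lo:hi + 1]) > threshold_sq:
            return math.inf
        prev = curr
    return prev[n]
```

Keeping only two rows also reduces memory from quadratic to linear in the sequence length.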

Cascading Lower Bounds
– LB_KimFL: features A and D, O(1) time
– LB_Kim: features A, B, C, D, O(n) time
[Figure: tightness of lower bound; features A, B, C, D marked on the time series]
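A cascade tries the cheapest bound first and falls through to the exact distance only when no bound prunes the candidate. The sketch below assumes that features A and D denote the first and last points (as the FL suffix suggests), that sequences have at least two points, and that distances are squared. The names `lb_kim_fl` and `cascade` are hypothetical, and the bounds and exact distance are passed in as plain callables.

```python
import math

def lb_kim_fl(q, c):
    """O(1) bound: any warping path must align first with first and last with last.

    Assumes at least two points per sequence, so the two terms are distinct cells.
    """
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def cascade(q, c, threshold_sq, bounds, exact):
    """Apply lower bounds from cheapest to most expensive, then the exact measure."""
    for bound in bounds:
        if bound(q, c) > threshold_sq:
            return math.inf  # pruned: exact distance cannot be under the threshold
    return exact(q, c)
```

A call such as `cascade(q, c, t, [lb_kim_fl, my_lb_keogh], my_dtw)` would run the O(1) test, then a linear-time bound, and only then the quadratic computation; `my_lb_keogh` and `my_dtw` stand for whatever implementations are available.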

Experimental Platform
Xilinx EK-V6-ML605-G
MicroBlaze Processor
– 1 core, 100 MHz
– Integer divider
– 64-bit multiplier
– 2048-bit branch target cache
[Table: Cache Configuration]

ISE I/O Interface
MicroBlaze operates on 32-bit data
– Double-precision FP / LNS use 64-bit data
– 2 cycles to transfer each operand to/from the ISE
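The two-cycle transfer follows from moving a 64-bit operand over a 32-bit datapath as two words. The sketch below only illustrates that split in software; the word order (high word first) and all function names are assumptions, not a description of the actual hardware interface.

```python
import struct

def split_double(x):
    """Split a 64-bit double into (high, low) 32-bit words."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits >> 32, bits & 0xFFFFFFFF

def join_double(hi, lo):
    """Reassemble a double from its two 32-bit words."""
    bits = (hi << 32) | lo
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x

def transfer_cycles(num_operands, cycles_per_operand=2):
    """Cycles spent moving 64-bit operands, at 2 cycles each as stated on the slide."""
    return num_operands * cycles_per_operand
```

Under this accounting, an operation with two inputs and one output spends `transfer_cycles(3)`, or 6 cycles, on data movement alone.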

Software Profile
Four instruction set extensions [CODES-ISSS 2013]
– ISE-Norm (Normalization)
– ISE-DTW (DTW)
– ISE-ACCUM (Accumulation)
– ISE-ED (Euclidean Distance)

FP vs. LNS Operators and ISEs: Latency
[Figure: latency of FP vs. LNS. ALU ops: ADD/SUB, MUL, DIV. ISEs: ISE-Norm, ISE-DTW, ISE-Accum, ISE-ED]
LNS operator latency is dominated by data transfer overhead
FP operator latency is dominated by the operator

FP vs. LNS Operators and ISEs: Area (FPGA Resources)
[Figure: LUT FFs, Slice LUTs, Slice Regs. ALU ops: ADD/SUB, MUL, DIV. ISEs: ISE-Norm, ISE-DTW, ISE-Accum, ISE-ED]
LNS operators are significantly smaller

Speedup (Normalized to Baseline MicroBlaze)
gcc at optimization level -O3 used for all experiments
FP ISE operators are pipelined
LNS-based ISEs offer higher performance than FP ISEs

Energy Consumption (Joules)
[Figure: Baseline, Baseline + FPU, Baseline + FP ISEs, Baseline + LNS ISEs]
gcc -O3 used in all experiments reported here

Conclusion and Future Work
LNS vs. floating-point instruction set extensions for DTW processor
– Faster (8.7x vs. 4.9x)
– More energy efficient (8.5x vs. 4.7x)
– Cheaper (FP ISEs are 3.6x larger than LNS)
Future work
– Vary the precision of arithmetic operators
– Scale up the system: more candidates, more queries, more cores (more ISEs? shared ISEs? etc.)