Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen
Renmin University of China, Tsinghua University, North Carolina State University, ETH Zürich

ABSTRACT
Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable direct document analytics on compressed data. We present how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical, grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and present a set of guidelines, along with assistant software modules, that help developers apply compression-based direct processing effectively. Experiments show that our proposed techniques save 90.8% storage space and 77.5% memory usage, while speeding up data processing significantly: by 1.6X on sequential systems and 2.2X on distributed clusters, on average.

CHALLENGES
Effectively materializing the concept of compression-based direct processing on document analytics faces a number of challenges. These challenges center on the tension between reuse of results across nodes and the overhead of saving and propagating those results. Reuse avoids repeated processing of repeated content, but it requires the computation results to be saved in memory and propagated through the graph. The key to effective compression-based direct processing is to maximize reuse while minimizing the overhead. Specific challenges include unit sensitivity, order sensitivity, parallelism barriers, and data attributes.

SOLUTION TECHNIQUES
- Adaptive traversal order and information for propagation
- Compression-time indexing
- Double compression
- Load-time coarsening
- Two-level table with depth-first traversal
- Coarse-grained parallel algorithm and automatic data partitioning
- Double-layered bit vector for footprint minimization

KEY IDEA
A compression example with Sequitur:
(a) Original data: a b c a b d a b c a b d a b a
(b) Sequitur compressed data (rules):
    R0 → R1 R1 R2 a
    R1 → R2 c R2 d
    R2 → a b
(c) DAG representation: R0 refers to R1 (twice), R2, and a; R1 refers to R2 (twice), c, and d; R2 refers to a and b.
(d) Numerical representation: a: 0, b: 1, c: 2, d: 3, R0: 4, R1: 5, R2: 6
(e) Compressed data in numerical form:
    4 → 5 5 6 0
    5 → 6 2 6 3
    6 → 0 1

A DAG from Sequitur for "a b c a b d a b c a b d a b a", and a postorder traversal of the DAG for counting word frequencies; each edge in the CFG propagates a rule's local counts to its parent:
- Level 3, R2 (a b): <a,1>, <b,1>
- Level 2, R1 (R2 c R2 d): R2 appears twice, giving <a,1×2> = <a,2>, <b,1×2> = <b,2>, plus <c,1>, <d,1>
- Level 1, R0 (R1 R1 R2 a): R1 appears twice, R2 once, plus one literal a, giving <a,2×2+1+1> = <a,6>, <b,2×2+1> = <b,5>, <c,1×2> = <c,2>, <d,1×2> = <d,2>
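The postorder computation above can be expressed compactly in code. The following is a minimal sketch, not the authors' implementation: it assumes the numerical Sequitur form shown in (e), with terminals 0-3 and rules 4-6 (root 4), and uses illustrative names.

# Minimal sketch of compression-based direct processing for word count,
# assuming the numerical Sequitur representation above; names are illustrative.

from collections import Counter

def count_words(rules, root, is_terminal):
    """Postorder traversal of the Sequitur DAG: each rule's word-frequency
    table is computed once and reused wherever the rule is referenced."""
    memo = {}  # rule id -> Counter of word frequencies within that rule

    def visit(rule_id):
        if rule_id in memo:
            return memo[rule_id]          # reuse: repeated content is counted once
        counts = Counter()
        for sym in rules[rule_id]:
            if is_terminal(sym):
                counts[sym] += 1          # a literal word occurrence
            else:
                counts += visit(sym)      # fold in the subrule's table
        memo[rule_id] = counts
        return counts

    return visit(root)

# Example: "a b c a b d a b c a b d a b a" compressed as in (e).
rules = {4: [5, 5, 6, 0], 5: [6, 2, 6, 3], 6: [0, 1]}
words = {0: "a", 1: "b", 2: "c", 3: "d"}
freq = count_words(rules, root=4, is_terminal=lambda s: s in words)
print({words[s]: n for s, n in freq.items()})  # {'a': 6, 'b': 5, 'c': 2, 'd': 2}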
GUIDELINES
Guideline I: Try to minimize the footprint of the data propagated across the graph during processing.
Guideline II: Traversal order is essential for efficiency. It should be selected to suit both the analytics task and the input datasets.
Guideline III: A coarse-grained distributed implementation is preferred, especially when the input dataset exceeds the memory capacity of one machine. Data partitioning for load balance should be considered, but with caution if it requires splitting a file, especially for unit-sensitive or order-sensitive tasks.
Guideline IV: For order-sensitive tasks, consider depth-first traversal and a two-level table design. The former helps the system conform to the word appearance order, while the latter helps with result reuse.
Guideline V: For analytics problems with unit sensitivity, consider a double-layered bitmap if unit information needs to be passed across the CFG (see the sketch after these guidelines).
Guideline VI: Double compression and coarsening help reduce space and time costs, especially when the dataset consists of many files. They also allow the thresholds to be determined empirically (e.g., through decision trees).
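To illustrate Guideline V, the following is a minimal sketch of one possible double-layered bit vector for recording which units (e.g., files) a rule appears in. It assumes fixed-size second-layer segments; the class and method names are illustrative, not the authors' implementation.

# Minimal sketch of a double-layered bit vector (Guideline V), assuming
# fixed-size second-layer segments; names are illustrative.

class DoubleLayeredBitmap:
    """Tracks which units (e.g., files) a grammar rule appears in.

    A first-layer bit is set only if at least one bit is set in the
    corresponding second-layer segment, so sparse memberships skip most
    second-layer storage and scanning."""

    def __init__(self, num_units, segment_size=64):
        self.segment_size = segment_size
        num_segments = (num_units + segment_size - 1) // segment_size
        self.first_layer = [False] * num_segments
        self.second_layer = [None] * num_segments  # allocated lazily

    def set(self, unit_id):
        seg, off = divmod(unit_id, self.segment_size)
        if self.second_layer[seg] is None:
            self.second_layer[seg] = [False] * self.segment_size
            self.first_layer[seg] = True
        self.second_layer[seg][off] = True

    def test(self, unit_id):
        seg, off = divmod(unit_id, self.segment_size)
        return self.first_layer[seg] and self.second_layer[seg][off]

    def units(self):
        """Enumerate set units, skipping empty segments via the first layer."""
        for seg, occupied in enumerate(self.first_layer):
            if not occupied:
                continue
            base = seg * self.segment_size
            for off, bit in enumerate(self.second_layer[seg]):
                if bit:
                    yield base + off

The first layer lets traversal and merging skip segments with no set bits, which keeps the footprint of unit information propagated across the CFG small for sparse unit sets, in line with Guideline I.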