Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights. Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄

Presentation transcript:

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights
Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄
†Renmin University of China, ⋄Tsinghua University, #North Carolina State University, ⋆ETH Zürich

Motivation
Every day, 2.5 quintillion bytes of data are created; 90% of the data in the world today has been created in the last two years alone [1].
How can we perform efficient document analytics when data are extremely large?
[1] What is Big Data? https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Outline
Introduction: Motivation & Example; Compression-Based Direct Processing; Challenges
Guidelines and Techniques: Solution Overview; Guidelines
Evaluation: Benchmarks; Results
Conclusion

Motivation
How can we perform efficient document analytics when data are extremely large?
Challenge 1, SPACE: large space requirement
Challenge 2, TIME: long processing time

Motivation
Observation: using a hash table to check for redundant content in the Wikipedia dataset shows content that can be reused (REUSE).
Can we perform analytics directly on compressed data?

Our Idea
Compression-based direct processing. The Sequitur algorithm meets our requirement.
(a) Original data: a b c a b d a b c a b d a b a
(b) Sequitur compressed data (rules): R0 → R1 R1 R2 a; R1 → R2 c R2 d; R2 → a b
(c) DAG representation: R0: R1 R1 R2 a; R1: R2 c R2 d; R2: a b (each edge links a rule to a rule it references)
(d) Numerical representation: a: 0, b: 1, c: 2, d: 3, R0: 4, R1: 5, R2: 6
(e) Compressed data in numerical ID: 4 → 5 5 6 0; 5 → 6 2 6 3; 6 → 0 1
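The correspondence between panels (a), (b), (d) and (e) can be checked with a few lines of Python. This is only an illustrative sketch: the names `RULES`, `expand` and `to_numeric` are made up for this example and are not part of Sequitur or CompressDirect, and the grammar is typed in by hand rather than produced by running Sequitur.

```python
# Grammar from the slide: each rule name maps to its right-hand side.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def expand(symbol, rules):
    """Recursively expand a symbol into the list of words it stands for."""
    if symbol not in rules:          # a plain word, not a rule
        return [symbol]
    words = []
    for child in rules[symbol]:
        words.extend(expand(child, rules))
    return words


def to_numeric(rules, words):
    """Give words the first IDs and rules the following ones, then encode the rules."""
    ids = {word: i for i, word in enumerate(words)}
    for name in rules:
        ids[name] = len(ids)
    encoded = {ids[name]: [ids[s] for s in rhs] for name, rhs in rules.items()}
    return ids, encoded


original = " ".join(expand("R0", RULES))
ids, encoded = to_numeric(RULES, ["a", "b", "c", "d"])
print(original)
print(encoded)
```

Expanding R0 reproduces the original input, and the encoded rules match the numerical form shown in panel (e).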

Double benefits
Rules: R0: R1 R1 R2 a; R1: R2 c R2 d; R2: a b
Challenge 1, Space: content that appears more than once is stored only once.
Challenge 2, Time: content that appears more than once is computed only once.

Optimization
We can make it more compact. Some applications do not need to keep the sequence, so in each rule we may remove the sequence information too and keep only a weight per symbol:
R0: R1 ×2, R2 ×1, a ×1
R1: R2 ×2, c ×1, d ×1
R2: a ×1, b ×1
This further saves storage space and computation time.
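As a rough illustration of this compaction, the sketch below collapses each rule body into symbol weights with `collections.Counter`. The function name `to_weighted` is hypothetical, and the sketch only applies to order-insensitive applications, since the sequence information is discarded.

```python
from collections import Counter

# Grammar from the earlier slide, with the order of symbols preserved.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def to_weighted(rules):
    """Drop the order inside each rule and keep one weight per distinct symbol."""
    return {name: Counter(rhs) for name, rhs in rules.items()}


weighted = to_weighted(RULES)
for name, weights in weighted.items():
    print(name, dict(weights))
```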

Example: Word Count
Legend: i = step number; <w,i> = word table entry; arrows show the CFG relation and the direction of information propagation.
Step 1, R2: a b gives <a,1>, <b,1>.
Step 2, R1: R2 ×2, c, d gives <a, 1×2> = <a,2>, <b, 1×2> = <b,2>, <c,1>, <d,1>.
Step 3, R0: R1 ×2, R2, a gives <a, 2×2 + 1 + 1> = <a,6>, <b, 2×2 + 1> = <b,5>, <c, 1×2> = <c,2>, <d, 1×2> = <d,2>.
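The three propagation steps can be sketched as a memoized bottom-up pass over the weighted rules. This is a minimal sketch under stated assumptions: `WEIGHTED` and `word_count` are illustrative names, the recursion would need to be made iterative for deep grammars, and it is not the CompressDirect implementation.

```python
from collections import Counter

# Weighted rules from the optimization slide: symbol -> number of occurrences.
WEIGHTED = {
    "R0": {"R1": 2, "R2": 1, "a": 1},
    "R1": {"R2": 2, "c": 1, "d": 1},
    "R2": {"a": 1, "b": 1},
}


def word_count(weighted, root="R0"):
    """Compute each rule's word table once and reuse it wherever the rule is referenced."""
    tables = {}

    def visit(rule):
        if rule in tables:                      # already computed: reuse the table
            return tables[rule]
        table = Counter()
        for symbol, weight in weighted[rule].items():
            if symbol in weighted:              # sub-rule: scale its table by the weight
                for word, count in visit(symbol).items():
                    table[word] += count * weight
            else:                               # plain word
                table[symbol] += weight
        tables[rule] = table
        return table

    return visit(root), tables


counts, tables = word_count(WEIGHTED)
print(dict(counts))
```

The table for R2 is computed once even though R2 is referenced from both R0 and R1, which is where the time saving comes from.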

Challenges
Unit sensitivity: how to organize data.
Parallelism barriers: how to perform parallelism on large datasets.
Order sensitivity: how to accommodate the order for applications that are sensitive to the order.
Data attributes: how to utilize the attributes of datasets.
Also listed: reuse of results across nodes; overhead in saving and propagating.

Solution Overview
Challenges: unit sensitivity; parallelism barriers; reuse of results across nodes; overhead in saving and propagating; order sensitivity; data attributes.
Solution techniques: adaptive traversal order and information to propagate; double-layered bit vector for footprint minimization; coarse-grained parallel algorithm and automatic data partition; compression-time indexing; double compression; two-level table with depth-first traversal; load-time coarsening.

Data Attributes
Problem: how to utilize the attributes of datasets. The best traversal order may depend on the data attributes of the input.
[Table: traversal order chosen by average size and number of files. Entries as extracted: Postorder, ≤2860; >2860, Preorder using 2levBitMap, ≤800; Preorder using regular BitMap, >800.]
Guideline I: minimize the footprint size.
Guideline II: traversal order is essential for efficiency.

Parallelism Barriers
Problem: how to perform parallelism on large datasets.
[Figure: input files f1, f2, f3, f4, f5, f6, … are divided into partitions (Partition 1, Partition 2; f5 appears as f5:part1, f5:part2, f5:part3), which map to RDD 1, RDD 2 and RDD 3; SparkContext passes them through pipe() to a C++ program, producing the final results.]
Guideline III: a coarse-grained distributed implementation is preferred.

Order Sensitivity (detailed in paper)
Problem: some applications are sensitive to the order.
[Figure: a root node holding segments for file0, file1 and file2, made of words (w1, w2, w4, …), splitters (spt1, spt2) and rule references (R3, …); rules R1 to R5 with their own words and sub-rules; one Global Sequence Table and several Local Sequence Tables.]
Guideline IV: depth-first traversal and a two-level table design.
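To give a feel for why depth-first traversal suits order-sensitive tasks, the sketch below walks a grammar depth-first with an explicit stack and streams words in their original order, then counts adjacent word pairs. It is a deliberately naive illustration: it reuses the small grammar from the Sequitur example rather than the rules in this slide's figure, the names `words_in_order` and `sequence_count` are hypothetical, and it omits the global and local sequence tables, so it does not reuse results across repeated rules the way the approach in the paper does.

```python
from collections import Counter

# Grammar from the Sequitur example slide (order inside each rule is kept).
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def words_in_order(rules, root="R0"):
    """Depth-first walk with an explicit stack; yields words in original order."""
    stack = [iter(rules[root])]
    while stack:
        symbol = next(stack[-1], None)
        if symbol is None:            # current rule body is exhausted
            stack.pop()
        elif symbol in rules:         # descend into the referenced rule
            stack.append(iter(rules[symbol]))
        else:                         # plain word
            yield symbol


def sequence_count(rules, n=2, root="R0"):
    """Count every run of n consecutive words without materializing the full text."""
    window, counts = [], Counter()
    for word in words_in_order(rules, root):
        window.append(word)
        if len(window) > n:
            window.pop(0)
        if len(window) == n:
            counts[tuple(window)] += 1
    return counts


pairs = sequence_count(RULES)
print(pairs.most_common())
```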

Unit Sensitivity (detailed in paper)
Problem: how to organize data.
[Figure: a root node holding segments for file0, file1 and file2, with rules R1 to R8 below it. Level 1 consists of a bit array (bit0, bit1, bit2, bit3, …) and a pointer array (P0, P1, NULL, P3, …). Level 2 consists of N-bit vectors (bit0, bit1, …, bit N-1), one per non-NULL pointer.]
Guideline V: use a double-layered bitmap if unit information needs to be passed across the CFG.
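The two-level layout in the figure can be approximated as follows. This is a sketch based only on what the slide shows (a level-1 bit array, a pointer array with NULL entries, and N-bit level-2 vectors); the class name `TwoLevelBitmap`, its methods and the default block size are assumptions made for illustration, and the actual library is not written in Python.

```python
class TwoLevelBitmap:
    """Marks which units (for example, files) a rule appears in.

    Level 1 keeps one bit and one pointer per block of N units.
    Level 2 holds an N-bit vector only for blocks that contain a set bit.
    """

    def __init__(self, num_units, block_bits=64):
        self.block_bits = block_bits
        num_blocks = (num_units + block_bits - 1) // block_bits
        self.level1 = [False] * num_blocks   # level-1 bit array
        self.blocks = [None] * num_blocks    # pointer array; None plays the role of NULL

    def set(self, unit):
        block, offset = divmod(unit, self.block_bits)
        if self.blocks[block] is None:       # allocate the level-2 vector lazily
            self.blocks[block] = 0
            self.level1[block] = True
        self.blocks[block] |= 1 << offset

    def test(self, unit):
        block, offset = divmod(unit, self.block_bits)
        if not self.level1[block]:
            return False
        return bool((self.blocks[block] >> offset) & 1)

    def units(self):
        """Yield all marked units in increasing order, skipping empty blocks."""
        for block, present in enumerate(self.level1):
            if not present:
                continue
            for offset in range(self.block_bits):
                if (self.blocks[block] >> offset) & 1:
                    yield block * self.block_bits + offset

    def allocated_blocks(self):
        return sum(self.level1)


bitmap = TwoLevelBitmap(num_units=1000)
bitmap.set(3)
bitmap.set(130)
print(list(bitmap.units()), bitmap.allocated_blocks())
```

With 1000 units and two marked files, only two of the sixteen level-2 vectors are allocated, which is the footprint saving the guideline is after when most rules occur in few units.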

Short Summary for Six Guidelines
Data attributes challenge: Guideline II
Parallelism barriers: Guideline III
Order sensitivity: Guideline IV
Unit sensitivity: Guideline V
General insights and common techniques: Guideline I and Guideline VI

Benchmarks
Six benchmarks: Word Count, Inverted Index, Sequence Count, Ranked Inverted Index, Sort, Term Vector
Five datasets: 580 MB ~ 300 GB
Two platforms: single node; Spark cluster (10 nodes on Amazon EC2)

Time Savings
CompressDirect yields a 2X speedup, on average, over direct processing.

Space Savings
CompressDirect achieves an 11.8X compression ratio, even more than gzip does.
Compression ratio = original size / compressed data size

Conclusion
We presented our method, compression-based direct processing; how the concept can be materialized on Sequitur; the major challenges; the guidelines; and our library, CompressDirect, which helps further ease the required development effort.

Thanks! Any questions? Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄ †Renmin University of China ⋄Tsinghua University #North Carolina State University ⋆ETH Zürich