Succinct: Enabling Queries on Compressed Data

Presentation on theme: "Succinct: Enabling Queries on Compressed Data"— Presentation transcript:

1 Succinct: Enabling Queries on Compressed Data
Shreya

2 Current Query Methods
Data scans: the file is loaded into memory and scanned (as in column-oriented stores); low memory, high latency. Index-based scans: the file is preprocessed and stored in memory with indices; high memory, low latency. MapReduce-style scans: no preprocessing work.

3 Succinct
Stores only a compressed representation of the file in memory, so it can hold more data before resorting to slow storage. Queries need no additional indexes, no data scans, and no decompression (except for the data actually accessed).

4 Succinct Interface
A simple interface for storing, retrieving, and querying flat (unstructured) files. Append: add new data. Extract: random access that returns an uncompressed buffer (for example, start at offset 0 and collect every 10th element). Count: number of occurrences of a string. Search(f, "ab"): return the offsets of every occurrence of that substring. Range search: search within a lexicographic range. Wildcard search: match on a prefix and a suffix.
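To make the interface concrete, here is a naive, uncompressed reference for the query semantics above; the FlatFileStore class and its exact signatures are illustrative assumptions, and Succinct itself answers these queries from the compressed representation instead.

```python
# Naive, uncompressed reference for the flat-file query semantics.
# Class name and signatures are illustrative assumptions.
class FlatFileStore:
    def __init__(self):
        self.files = {}                        # filename -> bytes

    def append(self, f, data: bytes):
        self.files[f] = self.files.get(f, b"") + data

    def extract(self, f, offset, length):      # random access into the file
        return self.files[f][offset:offset + length]

    def search(self, f, s: bytes):             # offsets of every occurrence of s
        buf, hits, i = self.files[f], [], self.files[f].find(s)
        while i != -1:
            hits.append(i)
            i = buf.find(s, i + 1)
        return hits

    def count(self, f, s: bytes):              # number of occurrences of s
        return len(self.search(f, s))

store = FlatFileStore()
store.append("f", b"abracadabra")
print(store.search("f", b"ab"))   # [0, 7]
print(store.count("f", b"ab"))    # 2
print(store.extract("f", 3, 4))   # b'acad'
```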

5 Extensions for semi-structured data
Given a collection of records (key, avpList), where avpList = ((attribute1, value1), ..., (attributeN, valueN)). The avpList is encoded into Succinct's data representation (flat files), so the flat-file queries can be used to analyze semi-structured data. Each attribute is separated by a delimiter unique to that attribute; Succinct maintains a mapping from attribute to delimiter and a mapping from key to the offset in the flat file where the corresponding avpList is encoded. Applications can also query individual attributes; for instance, a search for string val along attribute A2 is executed as search(val•) using the Succinct API, where • is A2's delimiter, and returns every key whose attribute A2 value matches val. This allows Succinct to support Dynamo- and Bigtable-style interfaces.
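A rough sketch of the encoding idea, assuming one unique delimiter byte per attribute; the delimiter values, attribute names, and the encode helper are made up for illustration.

```python
# Sketch: encode (key, avpList) records into one flat buffer with a unique
# delimiter per attribute. Delimiter bytes and helper names are illustrative.
DELIM = {"A1": b"\x01", "A2": b"\x02", "A3": b"\x03"}    # attribute -> delimiter

def encode(records):
    buf, key_to_offset = b"", {}
    for key, avplist in records:
        key_to_offset[key] = len(buf)                    # key -> offset of its avpList
        for attr, val in avplist:
            buf += val.encode() + DELIM[attr]            # value terminated by attr's delimiter
    return buf, key_to_offset

records = [
    ("k1", [("A1", "x"), ("A2", "berkeley"), ("A3", "19")]),
    ("k2", [("A1", "y"), ("A2", "stanford"), ("A3", "21")]),
]
buf, offsets = encode(records)

# Searching attribute A2 for "berkeley" becomes a substring search for the
# value followed by A2's delimiter (i.e., search(val + delim) on the flat file).
needle = b"berkeley" + DELIM["A2"]
print(needle in buf, offsets)    # True {'k1': 0, 'k2': 14}
```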

6 Existing Techniques
Classical search techniques are usually based on tries or suffix trees, which may require 10-20x the input size in memory. The Burrows-Wheeler Transform (BWT) and suffix arrays reduce this, but still require about 5× more memory than the input size. FM-indexes and Compressed Suffix Arrays are two memory-efficient alternatives; Succinct adapts compressed suffix arrays.

7 Compressed Suffix Arrays
Array of Suffixes (AoS): an array containing all suffixes of the input file in lexicographically sorted order, requiring O(n²) bits for an input of n characters. AoS2Input maps each AoS index to the offset in the input where that suffix begins, O(n log n) bits. Input2AoS is the inverse mapping from input offsets to AoS indices, O(n log n) bits.
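A small sketch that makes these structures concrete, assuming the whole input fits in memory and sorting suffixes directly (a real construction would use a linear-time suffix-array algorithm):

```python
# Sketch: suffix array (AoS represented by input offsets) plus the two mappings.
# Sorting suffixes directly is slow; shown only to make the structures concrete.
text = "banana$"

# AoS2Input[i] = input offset where the i-th lexicographically smallest suffix begins
aos2input = sorted(range(len(text)), key=lambda off: text[off:])

# Input2AoS[off] = rank (AoS index) of the suffix starting at input offset off
input2aos = [0] * len(text)
for rank, off in enumerate(aos2input):
    input2aos[off] = rank

for rank, off in enumerate(aos2input):
    print(rank, off, text[off:])
# 0 6 $
# 1 5 a$
# 2 3 ana$
# 3 1 anana$
# 4 0 banana$
# 5 4 na$
# 6 2 nana$
```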

8 NextCharIdx
NextCharIdx[i] points to the AoS index of the suffix obtained by dropping the first character of suffix i. This also allows AoS2Input to keep fewer samples: an unsampled value is recovered by following NextCharIdx until a sampled entry is reached and subtracting the number of hops (for example, reaching a sampled AoS2Input value of 6 after 2 hops gives 6 - 2 = 4).
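A toy sketch of that lookup, assuming AoS2Input entries whose value is a multiple of α = 2 are the sampled ones; the layout is simplified and the end-of-file wraparound case is glossed over.

```python
# Toy sketch: recover an unsampled AoS2Input value by hopping through NextCharIdx.
# Built from the uncompressed arrays; real Succinct stores only the samples.
text = "banana$"
n = len(text)
aos2input = sorted(range(n), key=lambda off: text[off:])
input2aos = [0] * n
for rank, off in enumerate(aos2input):
    input2aos[off] = rank

# NextCharIdx[i] = AoS index of the suffix that starts one character later
nextcharidx = [input2aos[(aos2input[i] + 1) % n] for i in range(n)]

ALPHA = 2   # assumed sampling rate: keep AoS2Input values that are multiples of ALPHA
sampled = {i: v for i, v in enumerate(aos2input) if v % ALPHA == 0}

def lookup_aos2input(i):
    hops = 0
    while i not in sampled:          # hop until a sampled entry is reached
        i = nextcharidx[i]
        hops += 1
    return sampled[i] - hops         # e.g. sampled value 6 after 2 hops -> 6 - 2 = 4

assert all(lookup_aos2input(i) == aos2input[i] for i in range(n))
```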

9 Succinct Data Representation
Uses a more space-efficient AoS2Input and Input2AoS via sampling by value: only values that are multiples of α are stored, scaled down by α to save space. Uses a more space-efficient representation for NextCharIdx, based on skewed wavelet trees.
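A tiny sketch of what sampling by value stores, assuming α = 4 and plain Python lists in place of the bit-packed structures Succinct actually uses:

```python
# Sketch: sampling AoS2Input by value with alpha = 4. Only values that are
# multiples of alpha are kept, scaled down by alpha, plus a bitmap marking
# which AoS indices hold a sampled value.
ALPHA = 4
aos2input = [6, 5, 3, 1, 0, 4, 2]                # toy uncompressed array (earlier sketch)

is_sampled = [v % ALPHA == 0 for v in aos2input]              # one bit per entry
samples = [v // ALPHA for v in aos2input if v % ALPHA == 0]   # scaled-down sampled values

print(is_sampled)   # [False, False, False, False, True, True, False]
print(samples)      # [0, 1]
# A sampled value at AoS index i is recovered as samples[rank of i among sampled] * ALPHA;
# unsampled values are recovered via NextCharIdx hops, as on the previous slide.
```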

10 Queries on Compressed Data
Binary search over the compressed representation is inefficient because it executes searches on the entire AoS2Input array, and each step requires computing the suffix for the corresponding AoS index. Succinct's query algorithm instead takes advantage of the 2D NextCharIdx representation, in which columns are the unique characters and rows are length-t strings: the column gives a character and the row gives the t characters that follow it. This yields a 2.3× speed-up on average and a 19× speed-up in the best case. For rangesearch(f, str1, str2), Succinct finds the smallest AoS index whose suffix starts with string str1 and the largest AoS index whose suffix starts with string str2. For wildcardsearch(f, prefix, suffix, dist), it first finds the offsets of all prefix and suffix occurrences, and returns all combinations such that the difference between the suffix and prefix offsets is positive and no larger than dist.
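A small sketch of the wildcardsearch combination step, assuming the prefix and suffix offsets have already been obtained from two ordinary search calls:

```python
# Sketch: combine prefix and suffix occurrence offsets for wildcardsearch.
# Offsets are assumed to come from search(f, prefix) and search(f, suffix).
def wildcard_combine(prefix_offsets, suffix_offsets, dist):
    matches = []
    for p in prefix_offsets:
        for s in suffix_offsets:
            gap = s - p
            if 0 < gap <= dist:       # suffix must follow the prefix within dist characters
                matches.append((p, s))
    return matches

# e.g. prefix found at offsets [2, 40], suffix at offsets [10, 90], dist = 20
print(wildcard_combine([2, 40], [10, 90], 20))   # [(2, 10)]
```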

11 Succinct Multistore Design
A write-friendly multi-store design that chains multiple individual stores: LogStore for new data, SuffixStore as an intermediate store, and SuccinctStore for queries on compressed data.
LogStore: new data is appended here; queries execute via in-memory data scans, sped up by an inverted index that supports fast fine-grained updates. Scans remain acceptable because the LogStore holds only a small fraction of the entire dataset. One option would be to let cores concurrently execute read and write requests on a single shared partition, exploiting parallelism by assigning each query to one of the cores; however, concurrent writes scale poorly and require complex techniques to maintain data-structure integrity. LogStore therefore uses overlapping partitions, each annotated with the starting and ending offset of the data it "owns", and appends new data only to the most recent partition. An inverted index maps short strings (default length 3) to their locations in the partition (a sketch of such an index appears after this slide).
SuffixStore: an intermediate store that supports bulk appends, aggregating larger amounts of data before compression. The inverted-index approach does not scale to SuffixStore data sizes because of its high memory footprint, so SuffixStore falls back to existing techniques: an uncompressed AoS2Input array (§3) with search queries executed via binary search. As shown in Figure 3, it also stores the original data, which allows random access for comparisons during binary search as well as for extract queries; these queries are fast since AoS2Input is uncompressed. Bulk appends from LogStore are executed at partition granularity, with the entire LogStore data constituting a single SuffixStore partition; AoS2Input is constructed per partition so that existing partitions do not have to be updated when new data is appended.
SuccinctStore: supports queries on compressed data.
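A minimal sketch of a LogStore-style inverted index, assuming the default indexed string length of 3 and a plain Python dict; the real LogStore data structures are more elaborate.

```python
# Sketch: inverted index from length-3 strings (3-grams) to their offsets in a
# LogStore partition. Appends update the index in a fine-grained way; searches
# use the index to narrow the in-memory scan.
from collections import defaultdict

class LogStorePartition:
    NGRAM = 3                                # default indexed string length from the slide

    def __init__(self):
        self.data = b""
        self.index = defaultdict(list)       # 3-gram -> offsets in this partition

    def append(self, new_bytes: bytes):
        start = len(self.data)
        self.data += new_bytes
        # index only the 3-grams created by this append (fine-grained update)
        for off in range(max(0, start - self.NGRAM + 1), len(self.data) - self.NGRAM + 1):
            self.index[self.data[off:off + self.NGRAM]].append(off)

    def search(self, s: bytes):
        # candidate offsets from the first 3-gram, then verify against the data
        if len(s) >= self.NGRAM:
            candidates = self.index.get(s[:self.NGRAM], [])
        else:
            candidates = range(len(self.data))
        return [off for off in candidates if self.data[off:off + len(s)] == s]

p = LogStorePartition()
p.append(b"abracadabra")
p.append(b"cadabra")
print(p.search(b"cad"))   # [4, 11]
```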

12 SuccinctStore
The sampling rate can be chosen: a smaller sampling rate means less memory but higher latency. The string length for the rows of the 2D NextCharIdx representation can also be specified. Tuning these parameters allows SuccinctStore to avoid going to disk and to handle overloaded partitions under skewed workloads.

13 Succinct Architecture
Central coordinator: handles membership management (maintaining a list of active servers in the system by having each server send periodic heartbeats) and data management, which includes maintaining an up-to-date collection of pointers to quickly locate the desired data during query execution: pointers from file offsets to partitions, and pointers from partitions to the machines holding those partitions.
Set of storage servers: one holds the first two stores (LogStore and SuffixStore), the rest hold the SuccinctStore.
Query handlers (QHs): the interface applications use to connect to the servers and the coordinator. A QH takes in a query and sends it either to a single server (for extract and append queries) or to all the other servers (for count and search queries). The coordinator can resolve which server holds the data, but each QH also keeps a cache so the coordinator does not become a bottleneck. When a query goes to all servers, the QH sends it to all other QHs, which run it in parallel, and then aggregates the results. Using QHs as redirectors avoids overloading the coordinator; it raises scalability concerns, but since Succinct already saves servers elsewhere, this is acceptable.
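A rough sketch of the query-handler routing described above; the QueryHandler class, the coordinator.locate call, and the per-server extract/search methods are assumed names, not Succinct's actual API.

```python
# Sketch: query-handler routing. Extract (and append) go to the single server
# owning the data; count/search fan out to all servers and results are aggregated.
class QueryHandler:
    def __init__(self, coordinator, servers):
        self.coordinator = coordinator
        self.servers = servers          # all storage servers (or their QHs)
        self.cache = {}                 # offset -> server, to avoid asking the coordinator

    def _server_for(self, offset):
        server = self.cache.get(offset)
        if server is None:
            server = self.coordinator.locate(offset)   # assumed coordinator call
            self.cache[offset] = server
        return server

    def extract(self, offset, length):
        return self._server_for(offset).extract(offset, length)   # single server

    def search(self, s):
        results = []
        for server in self.servers:     # fan out (in practice, in parallel), then aggregate
            results.extend(server.search(s))
        return sorted(results)
```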

14 Data Transformation between Stores
For a LogStore partition (< 250 MB), AoS2Input can be constructed on the server itself using an efficient linear-time algorithm. SuffixStore aggregates data across multiple partitions before transforming it into SuccinctStore; transforming SuffixStore data into a SuccinctStore partition requires a merge sort of the AoS2Input arrays of the individual SuffixStore partitions (a simplified sketch follows). This is currently done in the background, but a server could be dedicated to just this process. The design also has to handle (1) coordinator failure, (2) data failure and recovery, and (3) adding new servers to an existing cluster.
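A simplified sketch of the merge step, assuming each partition's raw data is available for suffix comparisons and glossing over what happens to suffixes at partition boundaries; merge_aos2input is a made-up helper name.

```python
# Sketch: merge per-partition AoS2Input arrays into one array over the
# concatenated data. Suffixes are compared as partition-local strings,
# which glosses over partition-boundary handling.
import heapq

def merge_aos2input(partitions):
    """partitions: list of (text, aos2input) pairs, each aos2input sorted for its text."""
    base, streams = 0, []
    for text, aos2input in partitions:
        # (suffix, global offset), already lexicographically sorted within a partition
        streams.append([(text[off:], base + off) for off in aos2input])
        base += len(text)
    return [off for _, off in heapq.merge(*streams)]   # k-way merge sort of the streams

p1, p2 = "banana", "bandana"
a1 = sorted(range(len(p1)), key=lambda o: p1[o:])
a2 = sorted(range(len(p2)), key=lambda o: p2[o:])
print(merge_aos2input([(p1, a1), (p2, a2)]))
```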

15 Evaluation
Compared against MongoDB and Cassandra for queries using secondary indexes, HyperDex using hyperspace hashing, and an industrial columnar store DB-X using in-memory data scans. Used two multi-attribute record datasets from Conviva: smallVal and largeVal. Ran on Amazon EC2 instances with 15 GB RAM and 4 cores each, in the no-failure scenario for each system. Used YCSB to generate workloads and query frequencies.

16 Results: Memory
MongoDB and Cassandra fit roughly 10-11× less data than Succinct because they store secondary indexes in addition to the input data. HyperDex not only stores large metadata, it stores 126x more.

17 Results: Throughput
(In the throughput charts, MongoDB is shown in red and Cassandra in purple.)
Workload A: small record sizes. Succinct has higher throughput; MongoDB's routing server is the bottleneck and Cassandra executes off disk. When record sizes are large, Succinct achieves slightly lower throughput due to the increase in its extract latency. When MongoDB and Cassandra no longer fit in memory, Succinct clearly performs better. Workload B: append queries. The others suffer: MongoDB's throughput drops because its indexes are updated on each write; for Succinct, the LogStore is the bottleneck (everything is written there); Cassandra sees minimal reduction.

18 Results: Throughput
Workload C: search workloads. Cassandra requires scanning and has low throughput. MongoDB, with the fewer attributes in smallVal, has high throughput because caching is more effective; that does not help for largeVal, where its throughput decreases. Succinct achieves higher throughput since it operates in memory, and its throughput reduces only minimally. Workload D: 5% appends; MongoDB and Cassandra become even worse because they are updating indexes.

19 Results: Latency
Latency is compared against systems that use indexes (MongoDB, Cassandra) and systems that perform data scans along with metadata (HyperDex, DB-X), on a single machine. Only when the entire data fits in memory do MongoDB and Cassandra achieve comparable or better latency, and their latency gets worse as record sizes grow. For writes, they need to update indexes upon each write, leading to higher latency. For search, MongoDB achieves good latency since it performs a binary search over an in-memory index, which is similar in complexity to Succinct's search algorithm; Cassandra has high search latencies due to much less efficient utilization of available memory. For the data-scan systems, the comparison covers get and search: HyperDex is comparable on get, but its search latency is higher because of the high memory footprint of its metadata and off-disk queries; DB-X is columnar and not optimized for get.

