Efficient Common Items Extraction from Multiple Sorted Lists
Wei Lu, Cuitian Rong, Jinchuan Chen, Xiaoyong Du, Gabriel Fung, Xiaofang Zhou
Renmin University of China; The University of Queensland
Outline
–Problem Statement & Motivation
–MergeSkip & MergeESkip
–Experiments
Problem Statement
Given a set of sorted lists, each containing no duplicates, our objective is to efficiently identify the items that appear in every list.
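As a point of reference, the problem statement can be written down directly with sets. This is only a sketch of the expected output, not one of the algorithms in the talk; the function name `common_items_reference` is hypothetical, and it ignores the fact that the inputs are sorted.

```python
def common_items_reference(lists):
    """Return the items present in every list, in ascending order.

    Set-based reference only: it does not exploit the sort order,
    so it is useful for checking results rather than for speed.
    """
    if not lists:
        return []
    # set.intersection accepts any iterables as the other operands
    return sorted(set(lists[0]).intersection(*lists[1:]))


example = [[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 5, 8, 9]]
print(common_items_reference(example))  # [3, 5, 9]
```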
Motivation
Index Join
–R1(X, Y1) ⋈ R2(X, Y2) ⋈ … ⋈ Rn(X, Yn)
–Where an index is created on X in each relation
Information Retrieval
–Identify the documents that contain a given set of words
–Where documents are pre-tokenized into words and an inverted index maps each word to a list of document identifiers
Existing Approach
–ScanAll
Limitation of ScanAll
–Every item in every list must be accessed until one of the lists is exhausted
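The slides do not spell ScanAll out, so the following is a minimal sketch assuming it is the usual multi-way merge scan: compare the current heads of all lists, report a match when they are equal, and advance one item at a time. The name `scan_all` is hypothetical.

```python
def scan_all(lists):
    """Intersect sorted, duplicate-free lists by stepping one item at a time."""
    if not lists:
        return []
    pos = [0] * len(lists)          # current position in each list
    common = []
    # Stop as soon as any list is exhausted
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        heads = [lst[p] for p, lst in zip(pos, lists)]
        smallest = min(heads)
        if smallest == max(heads):  # all heads agree: a common item
            common.append(smallest)
        # Advance every list whose head is the current minimum by one item
        pos = [p + 1 if h == smallest else p for p, h in zip(pos, heads)]
    return common


print(scan_all([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 5, 8, 9]]))  # [3, 5, 9]
```

Because positions only ever move forward by one, every item before the stopping point is touched, which is the limitation noted above.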
MergeSkip
Observation
–Let minValue be the minimum value of a list
–Let maxMinValue be the maximum among the minValues of all lists
–Items with values less than maxMinValue cannot be common items, whichever list they are in
MergeSkip example (figure): maxMinValue = 80
How can we jump to the right position in each list?
–Use binary search (figure example: maxMinValue = 80)
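The observation and the binary-search jump can be combined as in the sketch below. It is one possible reading of MergeSkip, not the authors' code: the name `merge_skip` is hypothetical, and Python's standard `bisect_left` stands in for the binary search over the remaining part of each list.

```python
from bisect import bisect_left


def merge_skip(lists):
    """Intersect sorted, duplicate-free lists, skipping items below maxMinValue."""
    if not lists:
        return []
    pos = [0] * len(lists)
    common = []
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        heads = [lst[p] for p, lst in zip(pos, lists)]   # minValue of each list
        max_min_value = max(heads)
        if min(heads) == max_min_value:
            # Every list currently starts with the same item
            common.append(max_min_value)
            pos = [p + 1 for p in pos]
            continue
        # Jump each list to its first item >= maxMinValue (binary search
        # restricted to the part of the list from the current position on)
        pos = [bisect_left(lst, max_min_value, p) for p, lst in zip(pos, lists)]
    return common


print(merge_skip([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 5, 8, 9]]))  # [3, 5, 9]
```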
What happens if the lists are similar? Can binary search bring any benefit?
–No: the target is then usually close to the current position, but binary search still works over the whole remainder of the list
Modified Binary Search
Time complexity
–O(log k), where k is the number of items searched in the list
Motivation
–Make the cost depend on the number of items searched, rather than on the length from the current position to the end of the list
–Iteratively check the item at position (current position + 2^i)
Modified Binary Search example (figure)
–Starting from the current position, check the item at position (current position + 2^i)
–If the value of the checked item is less than maxMinValue, the search moves on; in the example, the item with value 3 is accessed
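The probing scheme at positions current position + 2^i matches what is commonly called exponential (or galloping) search. The sketch below assumes that reading: it doubles the step until it reaches an item that is not less than the target, then finishes with an ordinary binary search inside the last interval. The name `gallop_to` and its signature are hypothetical.

```python
from bisect import bisect_left


def gallop_to(lst, start, target):
    """Return the first index >= start whose item is >= target.

    Returns len(lst) if no such item exists. Probes start + 2^i for
    i = 0, 1, 2, ... and then binary-searches the last interval, so the
    cost grows with the distance skipped rather than with len(lst) - start.
    """
    n = len(lst)
    if start >= n or lst[start] >= target:
        return start
    lo = start                      # invariant: lst[lo] < target
    step = 1
    while start + step < n and lst[start + step] < target:
        lo = start + step
        step *= 2
    hi = min(start + step, n)       # the answer lies in (lo, hi]
    return bisect_left(lst, target, lo + 1, hi)


items = [1, 3, 5, 7, 9, 11, 13, 15]
print(gallop_to(items, 0, 10))  # 5, the index of 11
```

A function like this could replace the plain binary search in a skip step when consecutive targets tend to be close together.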
Limitation of MergeSkip
–During an iteration, maxMinValue is not refined, even when a skip lands on a larger value
MergeESkip
Motivation
–Refine maxMinValue at each step
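The slides only state that maxMinValue is refined at each step, so the following is a sketch of one way to do that, not the authors' implementation. It visits the lists in a fixed cyclic order, skips the selected list to the current bound, and raises the bound immediately whenever the skip lands on a larger item. The name `merge_eskip` and the counter `agree` are hypothetical; `bisect_left` is used for the skip, and a modified binary search could be substituted.

```python
from bisect import bisect_left


def merge_eskip(lists):
    """Intersect sorted, duplicate-free lists, refining the bound after every skip."""
    n = len(lists)
    if n == 0 or any(len(lst) == 0 for lst in lists):
        return []
    pos = [0] * n
    common = []
    max_min_value = lists[0][0]     # current bound, always an item of some list
    agree = 1                       # lists whose current item equals the bound
    i = 1 % n                       # next list to visit, in cyclic order
    while True:
        lst = lists[i]
        if agree == n:
            # All lists sit on the bound: report it and step past it
            common.append(max_min_value)
            pos[i] += 1
        else:
            # Skip this list to its first item >= the bound
            pos[i] = bisect_left(lst, max_min_value, pos[i])
        if pos[i] == len(lst):
            return common           # one list is exhausted
        if agree < n and lst[pos[i]] == max_min_value:
            agree += 1
        else:
            max_min_value = lst[pos[i]]   # refine the bound right away
            agree = 1
        i = (i + 1) % n


print(merge_eskip([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 5, 8, 9]]))  # [3, 5, 9]
```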
Further Discussion of MergeESkip
Which list should be selected next?
–The performance can differ depending on the choice
Several strategies
–Selection in token-ring order
–Random selection
–Selection by size of each list
–Selection by statistical information
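To make the choice of the next list concrete, the sketch below passes the selection strategy in as a function. Everything here is an assumption for illustration: the names `intersect_with_strategy`, `token_ring`, `random_choice` and `shortest_remaining`, the strategy signature, and the interpretation of "size" as the remaining length of a list. Selection by statistical information is not sketched because the slides give no detail on it.

```python
import random
from bisect import bisect_left


def token_ring(pending, last, lists, pos):
    """Pick the pending list that follows `last` in cyclic order."""
    n = len(lists)
    return min(pending, key=lambda j: (j - last - 1) % n)


def random_choice(pending, last, lists, pos):
    """Pick a pending list uniformly at random."""
    return random.choice(sorted(pending))


def shortest_remaining(pending, last, lists, pos):
    """Pick the pending list with the fewest items left (assumed meaning of 'size')."""
    return min(pending, key=lambda j: len(lists[j]) - pos[j])


def intersect_with_strategy(lists, pick_next=token_ring):
    n = len(lists)
    if n == 0 or any(len(lst) == 0 for lst in lists):
        return []
    pos = [0] * n
    common = []
    bound = lists[0][0]             # plays the role of maxMinValue
    agreeing = {0}                  # lists whose current item equals the bound
    last = 0
    while True:
        if len(agreeing) == n:
            common.append(bound)    # every list sits on the bound
            i = pick_next(set(range(n)), last, lists, pos)
            pos[i] += 1             # step past the reported item
        else:
            i = pick_next(set(range(n)) - agreeing, last, lists, pos)
            pos[i] = bisect_left(lists[i], bound, pos[i])
        if pos[i] == len(lists[i]):
            return common
        if lists[i][pos[i]] == bound:
            agreeing.add(i)
        else:
            bound = lists[i][pos[i]]    # refine the bound
            agreeing = {i}
        last = i


data = [[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 5, 8, 9]]
for strategy in (token_ring, random_choice, shortest_remaining):
    print(strategy.__name__, intersect_with_strategy(data, strategy))
```

All strategies return the same result; they differ only in how many items end up being inspected, which is the performance difference the slide refers to.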
Experimental Evaluation
Synthetic datasets
–Normal distribution: different mean, same variance
–Normal distribution: same mean, different variance
DBLP dataset
–10 lists
–The length of each list ranges from 81,000 to 150,000
Algorithms
–MergeAll algorithm
–MergeSkip algorithm
–MergeESkip algorithm
Synthetic dataset (figure)
DBLP dataset (figure; axis label: len)
Effect of Different Data Distributions
Parameters: number of lists = 4; length of each list = 1M
Effect of the Number of Lists
Parameters: mean = 0; variance = 100; length of each list = 1M
Effect of the Size of Lists
Parameters: mean = 0; variance = 100; number of lists = 4
Thanks!