Efficient Common Items Extraction from Multiple Sorted Lists Wei Lu, Cuitian Rong, Jinchuan Chen, Xiaoyong Du, Gabriel Fung, Xiaofang Zhou Renmin University.

Presentation transcript:

Efficient Common Items Extraction from Multiple Sorted Lists Wei Lu, Cuitian Rong, Jinchuan Chen, Xiaoyong Du, Gabriel Fung, Xiaofang Zhou Renmin University of China The University of Queensland

Outline
–Problem Statement & Motivation
–MergeSkip & MergeESkip
–Experiments

Problem Statement
Given a set of sorted lists, each containing no duplicates, our objective is to efficiently identify the items that appear in every list.

Motivation
Index Join
–R1(X, Y1) ⋈ R2(X, Y2) ⋈ … ⋈ Rn(X, Yn)
–where an index is created on X of each relation
Information Retrieval
–Identify documents that contain a given set of words
–where documents are pre-tokenized into words, and an inverted list is exploited to map each word to a list of document identifiers
Existing Approach
–ScanAll

Limitation
–Every item of every list must be accessed until one of the lists is exhausted
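The slides that illustrate the existing approach did not survive in this transcript, so the following is only a minimal sketch of a scan-everything baseline, assuming it behaves like a plain multi-way merge that advances one item at a time. The function name `scan_all` is hypothetical.

```python
def scan_all(lists):
    """Intersect sorted, duplicate-free lists by stepping one item at a time.

    Assumed baseline, not taken from the slides: no item is ever skipped.
    """
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    pos = [0] * len(lists)
    common = []
    # Stop as soon as any list is exhausted.
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        heads = [lst[p] for p, lst in zip(pos, lists)]
        largest = max(heads)
        if all(h == largest for h in heads):
            # Every list shows the same value: it is a common item.
            common.append(largest)
            pos = [p + 1 for p in pos]
        else:
            # Move each lagging list forward by exactly one item.
            pos = [p + 1 if h < largest else p for p, h in zip(pos, heads)]
    return common
```

Because each step advances a cursor by at most one item, the work grows with the total number of items visited, which is the limitation stated above.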

MergeSkip
Observation
–Let minValue be the minimum value of a list
–Let maxMinValue be the maximum among the minValues of all lists
–Items with values less than maxMinValue, in any list, cannot be common items
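The observation can be checked on a small example. This is a sketch with made-up data (chosen so that maxMinValue is 80); the helper names `max_min_value` and `prune_below` are hypothetical, and the linear filter is used only to show which items are ruled out, not how they should be skipped.

```python
def max_min_value(lists):
    """Largest of the per-list minimums; each list is sorted, so its minimum is lst[0]."""
    return max(lst[0] for lst in lists)


def prune_below(lists):
    """Drop every item smaller than maxMinValue; such items cannot be common."""
    bound = max_min_value(lists)
    return [[x for x in lst if x >= bound] for lst in lists]


lists = [[10, 80, 90], [80, 85], [5, 20, 80, 100]]
bound = max_min_value(lists)    # the minimums are 10, 80 and 5
remaining = prune_below(lists)  # 10, 5 and 20 are discarded
```

An item below maxMinValue is smaller than the smallest item of at least one list, so it cannot appear in that list.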

How can we jump to the right position in each list?
–Using binary search
–In the illustrated example, maxMinValue: 80
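A minimal sketch of the jump, assuming Python's standard `bisect` module stands in for the binary search; `jump_to` is a hypothetical helper name.

```python
from bisect import bisect_left


def jump_to(lst, start, bound):
    """Return the first index >= start whose item is >= bound.

    Returns len(lst) when no such item exists, i.e. the list is exhausted.
    """
    return bisect_left(lst, bound, start)
```

For instance, `jump_to([5, 20, 80, 100], 0, 80)` lands on index 2 without touching 5 or 20 one by one. The search range is always the whole remainder of the list, from `start` to the end.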

What will happen if the lists are similar? Can binary search bring any benefit?
–No

Modified Binary Search
The time complexity
–log(k), where k is the number of searched items in the list
Motivation of Modified Binary Search
–decrease the number of searched items, so that the cost depends on it rather than on the length from the current position to the end of the list
–iteratively check the item at position (current position + 2^i)
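Probing at current position + 2^i matches what is commonly called exponential or galloping search. The sketch below is one way to realize it, assuming that the doubling phase is followed by an ordinary binary search inside the last bracket; the name `gallop_to` and that finishing step are assumptions, not taken from the slides.

```python
from bisect import bisect_left


def gallop_to(lst, start, bound):
    """Return the first index >= start whose item is >= bound."""
    n = len(lst)
    if start >= n or lst[start] >= bound:
        return start
    lo = start  # last probed index known to hold a value < bound
    step = 1    # probe offsets 1, 2, 4, ... (that is, 2**i)
    while start + step < n and lst[start + step] < bound:
        lo = start + step
        step *= 2
    hi = min(start + step, n)  # the target index lies in (lo, hi]
    return bisect_left(lst, bound, lo + 1, hi)
```

When the target is k positions ahead, the doubling phase needs about log2(k) probes and the final bracket is no wider than k, so a short skip stays cheap even in a very long list.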

Current Position
Check the item at the position, current position
Else If value of the item is less than maxMinValue then
item, with value 3, is accessed;

Limitation of MergeSkip
–At each iteration, maxMinValue is not refined.
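To make the limitation concrete, here is a sketch of a MergeSkip-style loop reconstructed from the observation and the jump step. It is an assumption about how the pieces fit together, not the authors' code; plain `bisect_left` is used for the jump, and the name `merge_skip` is hypothetical.

```python
from bisect import bisect_left


def merge_skip(lists):
    """Intersect sorted, duplicate-free lists, skipping up to maxMinValue per round."""
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    pos = [0] * len(lists)
    common = []
    while True:
        heads = [lst[p] for lst, p in zip(lists, pos)]
        bound = max(heads)  # maxMinValue, computed once per iteration
        if min(heads) == bound:
            # All heads agree: report the item and step past it everywhere.
            common.append(bound)
            pos = [p + 1 for p in pos]
        else:
            # Every list jumps to the same bound, even if an earlier
            # jump in this round already revealed a larger value.
            pos = [bisect_left(lst, bound, p) for lst, p in zip(lists, pos)]
        if any(p >= len(lst) for lst, p in zip(lists, pos)):
            return common
```

Within one iteration `bound` stays fixed, so a list that lands on a value larger than `bound` cannot help the remaining lists skip further until the next iteration.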

MergeESkip
Motivation
–maxMinValue will be refined at each step
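One way to refine maxMinValue at each step is to visit one list at a time and raise the bound as soon as a jump overshoots it. The sketch below assumes the lists are visited in a fixed circular order and uses plain `bisect_left` for the jump; the name `merge_eskip` and these choices are assumptions, since the slides that illustrate the algorithm are missing from the transcript.

```python
from bisect import bisect_left


def merge_eskip(lists):
    """Intersect sorted, duplicate-free lists, refining the bound after every jump."""
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    k = len(lists)
    pos = [0] * k
    common = []
    bound = lists[0][0]  # current candidate value (maxMinValue)
    matched = 0          # consecutive lists whose current item equals bound
    i = 0
    while True:
        lst = lists[i]
        pos[i] = bisect_left(lst, bound, pos[i])
        if pos[i] == len(lst):
            return common  # one list is exhausted, nothing more can match
        if lst[pos[i]] == bound:
            matched += 1
            if matched >= k:
                # All k lists contain bound: report it and move on.
                common.append(bound)
                pos[i] += 1
                if pos[i] == len(lst):
                    return common
                bound = lst[pos[i]]
                matched = 1
        else:
            # The jump overshot: the larger value becomes the new bound at once.
            bound = lst[pos[i]]
            matched = 1
        i = (i + 1) % k  # circular visiting order (an assumption)
```

Compared with keeping the bound fixed for a whole round, each later list in the same round can jump to the larger bound directly.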

Further Discussion of MergeESkip
Which list should be selected next?
–The performance can differ depending on the choice
Several strategies
–Selection in a token-ring manner
–Random selection
–Selection by size of each list
–Selection by statistical information

Experimental Evaluation
Synthetic datasets
–Normal distribution: different mean, same variance; same mean, different variance
DBLP dataset
–10 lists
–Length of each list is from 81,000 to 150,000
Algorithms
–MergeAll algorithm
–MergeSkip algorithm
–MergeESkip algorithm

Synthetic dataset (figure)

DBLP dataset (figure; label: len)

Effect of Different Data Distribution
Parameters: the number of lists = 4; the length of each list = 1M

Effect of the Number of lists Parameters: mean = 0; variance = 100; the length of each list = 1M

Effect of Size of Lists Parameters: mean = 0; variance = 100; the number of lists = 4

Thanks!