Efficient Ranking of Keyword Queries Using P-trees


Efficient Ranking of Keyword Queries Using P-trees Fei Pan, Imad Rahal, Yue Cui, William Perrizo Computer Science Department North Dakota State University Fargo, ND

Outline The keyword ranking problem; the P-tree technology; the EIN-ring approach; an efficient ranking algorithm using P-trees; summary.

Introduction Keyword ranking is the process of ordering the documents that best match a given query defined by a finite number of keywords. For our purposes, the query is also viewed as a mini-document. Similarity between documents is based entirely on their contents.

Introduction P-RANK proceeds by finding the k nearest neighbors (kNN) of a given query (viewed as a document) using the EIN-ring approach (described later). The system then returns a weighted list of matching documents (the weighting scheme is discussed later).

Introduction Motivation: the number of text documents keeps increasing, including medical articles, research publications, e-mails, news reports (e.g. Reuters) and others. Access to this information is in great demand.

Introduction (cont.) Text has no explicit structure, unlike other data (e.g. relational databases). The Vector Space Model was proposed by Salton (1975) for text. Each document is represented as a vector whose dimensions are the terms in the initial document collection. The query is also represented as a document vector in the same space.

The P-tree technology A P-tree is a tree-like data structure that stores numeric (and categorical) relational data in vertical bit format: each attribute is split into bits, and each bit position is represented by a P-tree. The next example shows the conversion of columns into P-trees (next slide). All columns are converted to binary first.

The P-tree technology (cont.) Construction of Basic Peano Count trees
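
The vertical bit decomposition described above can be illustrated with a minimal sketch. It assumes uncompressed bit vectors stored as Python integers, which is only the first stage of the idea: the quadrant-based compression that makes a real Peano Count tree is not shown, and the name `to_bit_slices` is hypothetical.

```python
def to_bit_slices(column, width):
    """Split a column of non-negative ints into `width` vertical bit vectors.

    slices[i] is a Python int used as a bitmask over rows: bit r of
    slices[i] is set when row r of the column has bit i set.
    This is an uncompressed stand-in for a P-tree (assumption).
    """
    slices = [0] * width
    for row, value in enumerate(column):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


column = [5, 2, 7, 0]  # binary: 101, 010, 111, 000
slices = to_bit_slices(column, 3)
for i in reversed(range(3)):
    # Print each bit position as a string with row 0 on the left.
    print(f"bit {i}:", format(slices[i], "04b")[::-1])
```

Each printed string is one bit position read down the column, which is the form on which the logical operations of the later slides act.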

The P-tree technology (cont.) Basic P-tree Operations

Predicates using P-trees There are five basic types of predicates: $x > c$, $x \ge c$, $x \le c$, $x < c$ and $x = c$, where $c$ is a bound value. We derive five propositions, one for each predicate, using P-trees.

The Formula Propositions Proposition for predicate $P_{x>c}$: briefly, for lower bound $c = (b_m \ldots b_i \ldots b_1)_2$. Proposition for predicate $P_{x \le c}$: similarly, for upper bound $c = (b_m \ldots b_i \ldots b_1)_2$.

The Examples Calculation of $P_{x>c}$ for $P_{x > (4)_{10}} = P_{x > (100)_2}$. Crude method: $P_{x > (100)_2} = (P_3 \wedge P_2) \vee (P_3 \wedge P_2' \wedge P_1)$. Our formula: $P_{x > (100)_2} = P_3 \wedge (P_2 \vee P_1)$.
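
A short sketch may help check the nested form of this example. It assumes uncompressed bit vectors held in Python integers in place of real P-trees, and the helper names are hypothetical. Bits are numbered from 0 at the least significant end, so the slide's $P_3$, $P_2$, $P_1$ correspond to `slices[2]`, `slices[1]`, `slices[0]`. The rule used (AND where the bit of c is 1, OR where it is 0, working up from the lowest bit) is one standard way to evaluate a greater-than comparison bit by bit, and it reduces to the slide's formula for c = 4.

```python
def to_bit_slices(column, width):
    """Uncompressed vertical bit vectors; bit r of slices[i] is row r's bit i."""
    slices = [0] * width
    for row, value in enumerate(column):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def mask_greater_than(slices, c):
    """Row mask for value > c, built from the lowest bit upward."""
    if c >> len(slices):
        return 0  # c does not fit in the bit width, so no value exceeds it
    result = 0  # nothing is greater once every bit has been matched exactly
    for i, p in enumerate(slices):
        if (c >> i) & 1:
            result = p & result  # bit of c is 1: the row needs a 1 here too
        else:
            result = p | result  # bit of c is 0: a 1 here already wins
    return result


def rows(mask):
    """Decode a row mask into a sorted list of row numbers."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]


column = list(range(8))  # row r holds the value r
slices = to_bit_slices(column, 3)
print(rows(mask_greater_than(slices, 4)))  # → [5, 6, 7]
```

For c = 4 the loop computes `slices[2] & (slices[1] | slices[0])`, using three logical operations where the crude form uses more.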

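The Examples (cont.) Calculation of $P_{x \le c}$. Crude method: $P_{x_j \le (70)_{10}} = P_{x_j \le (01000110)_2} = (P_7' \wedge P_6') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2 \wedge P_1')$. Our formula: $P_{x_j \le (70)_{10}} = P_{x_j \le (01000110)_2} = P_7' \wedge (P_6' \vee (P_5' \wedge (P_4' \wedge (P_3' \wedge (P_2' \vee (P_1' \vee P_0'))))))$.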
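
The nested formula for the upper bound can be sketched the same way. This is an illustration under stated assumptions: plain Python integers stand in for P-trees, the function names are hypothetical, and bit 0 is the least significant bit as in this slide. Complemented bit vectors are combined with AND where the bit of c is 0 and with OR where it is 1, which for c = 70 yields the same nesting as the formula above.

```python
def to_bit_slices(column, width):
    """Uncompressed vertical bit vectors; bit r of slices[i] is row r's bit i."""
    slices = [0] * width
    for row, value in enumerate(column):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def mask_less_equal(slices, c, n_rows):
    """Row mask for value <= c, built from the lowest bit upward."""
    all_rows = (1 << n_rows) - 1
    if c < 0:
        return 0  # no non-negative value is below zero
    if c >> len(slices):
        return all_rows  # c exceeds the bit width, so every row qualifies
    result = all_rows  # equality on every bit still satisfies <=
    for i, p in enumerate(slices):
        not_p = all_rows & ~p  # complement restricted to existing rows
        if (c >> i) & 1:
            result = not_p | result  # bit of c is 1: a 0 here is already smaller
        else:
            result = not_p & result  # bit of c is 0: the row must have a 0 here
    return result


def rows(mask):
    """Decode a row mask into a sorted list of row numbers."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]


column = [0, 69, 70, 71, 128, 255, 64]
slices = to_bit_slices(column, 8)
print(rows(mask_less_equal(slices, 70, len(column))))  # → [0, 1, 2, 6]
```

The loop touches each bit vector once, so the cost is one complement and one AND or OR per bit position regardless of the value of c.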

The EIN-ring Approach Definition of Equal Interval Neighborhood Ring (EIN-ring)

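The EIN-ring Approach (cont.) Figure: a ring around the center $x$ with inner radius $r$ and outer radius $r+\epsilon$, containing the points $X$ selected by $P_{x-r-\epsilon < X \le x+r+\epsilon}$ but not by $P_{x-r < X \le x+r}$. The ring mask is $P_{x-r-\epsilon < X \le x+r+\epsilon} \wedge P'_{x-r < X \le x+r}$.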
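
The ring mask can be sketched for a single integer attribute as follows. The sketch assumes that the outer interval is widened by a step written here as `eps`, that intervals are half-open as in the subscripts above, and that Python integers stand in for P-trees; all function names are hypothetical.

```python
def to_bit_slices(column, width):
    """Uncompressed vertical bit vectors; bit r of slices[i] is row r's bit i."""
    slices = [0] * width
    for row, value in enumerate(column):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def mask_less_equal(slices, c, n_rows):
    """Row mask for value <= c using one AND or OR per bit position."""
    all_rows = (1 << n_rows) - 1
    if c < 0:
        return 0
    if c >> len(slices):
        return all_rows
    result = all_rows
    for i, p in enumerate(slices):
        not_p = all_rows & ~p
        result = (not_p | result) if (c >> i) & 1 else (not_p & result)
    return result


def mask_interval(slices, n_rows, low, high):
    """Row mask for low < value <= high."""
    return mask_less_equal(slices, high, n_rows) & ~mask_less_equal(slices, low, n_rows)


def ein_ring(slices, n_rows, x, r, eps):
    """Rows inside the outer interval but outside the inner one."""
    outer = mask_interval(slices, n_rows, x - r - eps, x + r + eps)
    inner = mask_interval(slices, n_rows, x - r, x + r)
    return outer & ~inner


def rows(mask):
    """Decode a row mask into a sorted list of row numbers."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]


column = [10, 12, 13, 15, 7, 5, 20]
slices = to_bit_slices(column, 5)
print(rows(ein_ring(slices, len(column), x=10, r=2, eps=3)))  # → [2, 3, 4]
```

With x = 10, r = 2 and eps = 3, the inner interval (8, 12] holds the values 10 and 12, the outer interval (5, 15] adds 13, 15 and 7, and the ring keeps only those three rows.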

EIN-ring Formulation

The Examples Calculation of $P_{c_1 < x \le c_2}$: crude method versus our formula.

Efficient Ranking Algorithm Simplified Prototype of the Data Model

Efficient Ranking Algorithm The value in each dimension is a measurement called Term Frequency by Inverse Document Frequency (TFxIDF). TF(t) is the number of times term t occurs in a document; it is a local weight measuring the importance of the term to the document. IDF is log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents; it is a global weight measuring the uniqueness of the term. Normalization is then applied to account for differing document sizes: term t might occur 10 times in a 100-page document but only 9 times in a 1-page document, and t is clearly more strongly associated with the second document. All representations are converted to P-trees.
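
The weighting above can be sketched in a few lines. The slide does not say which normalization is used, so this example assumes the simplest one, dividing the term count by the document length; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term / document length) * log(N / Nt)
    The division by document length is an assumed normalization.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append({
            term: (tf / length) * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return vectors


docs = [
    ["gene", "protein", "gene"],
    ["protein", "cell"],
    ["protein"],
]
for vector in tfidf_vectors(docs):
    print(vector)
```

In this toy collection "protein" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to ranking, while "gene" and "cell" keep positive weights.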

Efficient Ranking Algorithm Ranking algorithm using the EIN-ring. Step 1. Consider each weight in the W1, W2 and W3 sections of Table 1 as a dimension of the search space X, giving a space of (k+2) dimensions. Step 2. Let $x_{start}$ be the point in X having the largest weight values, i.e. $x_{start\text{-}i} = \max(W_i)$. Step 3. Combine $P_{min}$ and $P_{max}$, where $P_{min} = P_{Y < x_{start\text{-}i} - r_i}$ and $P_{max} = P_{X < x_{start\text{-}i} - r_i - \epsilon}$.
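
One possible reading of the steps above is a sweep that starts at the largest weight and moves outward one equal-width interval at a time, collecting documents until enough have been found. The sketch below shows that reading for a single weight dimension only. It uses plain comparisons where P-RANK would use P-tree predicate masks, and the function name and the parameters `k` and `eps` are assumptions made for illustration.

```python
def rank_by_interval_sweep(weights, k, eps):
    """Return up to k row indices, highest weights first.

    Rows are gathered interval by interval, starting at the maximum weight
    and stepping down by eps, so low-weight rows are never examined once
    k rows have been collected.
    """
    if not weights or k <= 0:
        return []
    if eps <= 0:
        raise ValueError("eps must be positive")
    upper = max(weights)  # starting point: the largest weight
    lowest = min(weights)
    ranked = []
    while len(ranked) < k and upper >= lowest:
        lower = upper - eps
        # Rows whose weight falls in the half-open interval (lower, upper].
        ring = [row for row, w in enumerate(weights) if lower < w <= upper]
        ring.sort(key=lambda row: weights[row], reverse=True)
        ranked.extend(ring)
        upper = lower  # move to the next interval
    return ranked[:k]


weights = [90, 10, 50, 85, 30]
print(rank_by_interval_sweep(weights, k=3, eps=20))  # → [0, 3, 2]
```

The first interval (70, 90] yields rows 0 and 3, the interval (50, 70] is empty, and (30, 50] yields row 2, at which point three documents have been found and the sweep stops.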

Experiment Results We compared P-RANK with a scan-based ranking approach on MEDLINE data from KDD Cup 2002 Task 2, using three size groups, denoted DB1, DB2 and DB3, which contain 1,000, 8,000 and 15,000 documents, respectively.

Architecture of P-RANK

Conclusion We describe the architecture, implementation and evaluation of the P-RANK system for extracting evidence of gene products from text documents, e.g. biomedical papers. Our contributions in this paper include a new efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulation.

Future directions Accuracy; gene alignment and matching.

Thanks!