Efficient Ranking of Keyword Queries Using P-trees

Similar presentations
Chapter 5: Introduction to Information Retrieval

Multimedia Database Systems
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Ch 4: Information Retrieval and Text Mining
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Chapter 5: Information Retrieval and Web Search
Advanced Multimedia Text Classification Tamara Berg.
Data Mining on Streams  We should use runlists for stream data mining (unless there is some spatial structure to the data, of course, then we need to.
Performance Improvement for Bayesian Classification on Spatial Data with P-Trees Amal S. Perera Masum H. Serazi William Perrizo Dept. of Computer Science.
AN OPTIMIZED APPROACH FOR kNN TEXT CATEGORIZATION USING P-TREES Imad Rahal and William Perrizo Computer Science Department North Dakota State University.
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING
Clustering Analysis of Spatial Data Using Peano Count Trees Qiang Ding William Perrizo Department of Computer Science North Dakota State University, USA.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Bit Sequential (bSQ) Data Model and Peano Count Trees (P-trees) Department of Computer Science North Dakota State University, USA (the bSQ and P-tree technology.
Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo.
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science,
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees) Qin Ding, Qiang Ding, and William Perrizo Computer Science Department North.
Efficient OLAP Operations for Spatial Data Using P-Trees Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding William Perrizo North Dakota State University.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
TEMPLATE DESIGN © Predicate-Tree based Pretty Good Protection of Data William Perrizo, Arjun G. Roy Department of Computer.
Fast Kernel-Density-Based Classification and Clustering Using P-Trees Anne Denton Major Advisor: William Perrizo.
Efficient Equal Interval Neighborhood Ring (P-trees technology is patented by NDSU)
The Universality of Nearest Neighbor Sets in Classification and Prediction Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera and Tingda.
Our Approach  Vertical, horizontally horizontal data vertically)  Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees.
1 Centroid Based multi-document summarization: Efficient sentence extraction method Presenter: Chen Yi-Ting.
Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.
Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.
Accelerating Multilevel Secure Database Queries using P-Tree Technology Imad Rahal and Dr. William Perrizo Computer Science Department North Dakota State.
Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,
1 CS 430: Information Discovery Lecture 5 Ranking.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
Vertical Data 2 In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name,
Clustering Microarray Data based on Density and Shared Nearest Neighbor Measure CATA’06, March 23-25, 2006 Seattle, WA, USA Ranapratap Syamala, Taufik.
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Efficient Quantitative Frequent Pattern Mining Using Predicate Trees Baoying Wang, Fei Pan, Yue Cui William Perrizo North Dakota State University.
Vertical Set Square Distance Based Clustering without Prior Knowledge of K Amal Perera,Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University.
P Left half of rt half ? false  Left half pure1? false  Whole is pure1? false  0 5. Rt half of right half? true  1.
Automated Information Retrieval
Best pTree organization? level-1 gives te, tf (term level)
Item-Based P-Tree Collaborative Filtering applied to the Netflix Data
Decision Tree Classification of Spatial Data Streams Using Peano Count Trees Qiang Ding Qin Ding * William Perrizo Department of Computer Science.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
CSC317 Greedy algorithms; Two main properties:
Clustering of Web pages
Decision Tree Induction for High-Dimensional Data Using P-Trees
Yue (Jenny) Cui and William Perrizo North Dakota State University
Proximal Support Vector Machine for Spatial Data Using P-trees1
North Dakota State University Fargo, ND USA
PTrees (predicate Trees) fast, accurate , DM-ready horizontal processing of compressed, vertical data structures Project onto each attribute (4 files)
3. Vertical Data LECTURE 2 Section 3.
Vertical K Median Clustering
Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and.
Representation of documents and queries
The Multi-hop closure theorem for the Rolodex Model using pTrees
The P-tree Structure and its Algebra Qin Ding Maleq Khan Amalendu Roy
Presentation transcript:

Efficient Ranking of Keyword Queries Using P-trees
Fei Pan, Imad Rahal, Yue Cui, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND

Outline
The Document Keyword Ranking problem
The Predicate-tree (P-tree) technology
The EIN-ring approach
An efficient ranking algorithm using P-trees
Summary

Introduction
Keyword ranking is the process of ordering the documents that best match a given query defined by a finite number of keywords.
For our purposes, the query is also viewed as a mini-document.
Similarity between documents is based entirely on their contents.

Motivation
Massive increases in the number of text documents: medical articles, research publications, e-mails, news reports (e.g., Reuters), and others. Access to the information in these libraries is in great demand.
P-RANK proceeds by finding the k nearest neighbors (kNN) of the query (viewed as a document) using EIN-rings; the system then returns a weighted list of matching documents (the weighting scheme is discussed later).

Text has very little of the explicit structure typically found in other data (e.g., a relational database).
The Vector Space Model was proposed by Salton (1975) for text: each document is represented as a vector whose dimensions are the terms in the initial document collection.
Excluded are stop-list words with very high support (e.g., "the", "a", ...) and expected terms (e.g., "blood" in a corpus of articles addressing blood infections).
Typically "term" means "stems" plus "important phrases"; some systems use n-grams (raw sequences of characters). "Case folding" (converting all characters to a common case) is typically done.
The query is also represented as a document vector in the same space.

The P-tree technology
A P-tree is a tree data structure that stores numeric (and categorical) relational data in a vertical bit format by splitting each attribute into bits and representing each bit position by a P-tree.
All columns are converted to binary first: numeric attributes by bit position, categorical attributes by a bit map for each category. The next slide shows the conversion of columns into P-trees.
(P-tree technology is patent pending at North Dakota State University.)

Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice, using a predicate, into a basic P-tree (e.g., compress bit slice R11 into P11 using the universal predicate pure1).
Current practice in data mining structures the data into horizontal records that are then scanned vertically; the P-tree approach instead stores the data vertically and processes it horizontally.
The example relation R(A1, A2, A3, A4), used throughout these notes, has four 3-bit attributes and therefore twelve bit slices R11, R12, R13, ..., R41, R42, R43 (Rij is the j-th bit of attribute Ai):

R(A1  A2  A3  A4)
  010 111 110 001
  011 111 110 000
  010 110 101 001
  010 111 101 111
  101 010 001 100
  010 010 001 101
  111 000 001 100
  111 000 001 100

The 1-dimensional P-tree version of R11, called P11, is built by recording the truth of the predicate "pure1" (is this segment entirely 1-bits?) in a tree, recursively on halves, until purity is achieved:
1. Whole slice pure1? false, record 0.
2. Left half pure1? false, record 0. It is pure (pure0), so this branch ends.
3. Right half pure1? false, record 0.
4. Left half of the right half pure1? false, record 0.
5. Right half of the right half pure1? true, record 1.
6. Left half of the left half of the right half pure1? true, record 1.
7. Right half of the left half of the right half pure1? false, record 0. It is pure (pure0), so this branch ends.

Basic P-trees are then ANDed horizontally to answer queries. For example, to count occurrences of the tuple (7, 0, 1, 4) = (111 000 001 100)2, compute
P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43
and read the count off the resulting tree; here the count is 2 (the last two tuples of R match).
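The construction is easy to sketch in code. The following is a minimal illustration, not the authors' implementation; the names build_ptree and root_count are ours:

```python
# Build a 1-D "pure1" P-tree: record whether each segment is entirely
# 1-bits, recursing on halves until a segment is pure (all 0s or all 1s).

def build_ptree(bits):
    """Return ('pure', bit, length) for a pure segment,
    or ('node', left, right) for a mixed one."""
    if all(b == 1 for b in bits):
        return ('pure', 1, len(bits))
    if all(b == 0 for b in bits):
        return ('pure', 0, len(bits))
    mid = len(bits) // 2
    return ('node', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree):
    """Number of 1-bits represented by the compressed tree."""
    if tree[0] == 'pure':
        return tree[1] * tree[2]
    return root_count(tree[1]) + root_count(tree[2])

# Bit slice R11 from the example relation (high-order bits of A1):
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
print(build_ptree(r11))              # left half collapses to one pure0 node
print(root_count(build_ptree(r11)))  # -> 3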

2-Dimensional Pure1 P-trees (AKA Peano-trees; they use the Peano space-filling-curve concept)
A node is 1 iff its quadrant is purely 1-bits. E.g., take a bit file (from, say, the high-order bit of one band of a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000
Arrange it in spatial raster order as an 8x8 grid, then run-length compress it into a quadrant tree using Peano order.
[The slide shows the 8x8 raster and the resulting quadrant tree.]
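By analogy with the 1-D case, a toy quadrant-tree builder might look as follows. This is a sketch under the assumption that the input is a 2^n x 2^n grid of bits; build_qtree is our own name:

```python
# A 2-D pure1 P-tree (quadrant tree) node is 1 iff its quadrant is
# purely 1-bits, 0 iff purely 0-bits; mixed quadrants recurse on their
# four sub-quadrants in Peano (Z) order: NW, NE, SW, SE.

def build_qtree(grid):
    flat = [b for row in grid for b in row]
    if all(b == 1 for b in flat):
        return 1                              # pure1 quadrant
    if all(b == 0 for b in flat):
        return 0                              # pure0 quadrant
    n = len(grid) // 2
    quads = ([row[:n] for row in grid[:n]],   # NW
             [row[n:] for row in grid[:n]],   # NE
             [row[:n] for row in grid[n:]],   # SW
             [row[n:] for row in grid[n:]])   # SE
    return [build_qtree(q) for q in quads]
```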

Logical operations on P-trees are used to get counts of any pattern. The slide illustrates two operand P-trees together with their AND and OR results.
The P-tree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the corresponding result node is pure0, so, e.g., only quadrant 2 of the operands may need to be loaded to AND Ptree1, Ptree2, etc. The more operands there are in the AND, the greater the benefit of this shortcut (more pure0 nodes).
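The shortcut can be made concrete on the quadrant-tree encoding sketched above (again a toy version; ptree_and is our own name):

```python
# AND of two compressed quadrant trees: a pure0 operand node yields a
# pure0 result node without descending into the other operand at all.

def ptree_and(a, b):
    if a == 0 or b == 0:
        return 0                      # shortcut: pure0 dominates AND
    if a == 1:
        return b                      # pure1 is the identity for AND
    if b == 1:
        return a
    kids = [ptree_and(x, y) for x, y in zip(a, b)]
    if all(k == 0 for k in kids):     # re-compress pure results
        return 0
    if all(k == 1 for k in kids):
        return 1
    return kids
```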

3-Dimensional Pure1 P-trees are a natural choice (and may produce better compression) if the data is naturally 3-D (e.g., solids).

Other useful predicate trees
There are five basic types of predicates, i.e., x > c, x ≥ c, x ≤ c, x < c, and x = c, where c is a bound value, and there is a very useful proposition for each of the corresponding P-trees.
Proposition for the predicate tree Px>c, where c = (bm ... bi ... b1)2:
  Px>c = Pm opm (Pm-1 opm-1 ( ... P1 )),
where Pi is the i-th basic P-tree and opi = AND if bi = 1, OR otherwise.
For the predicate tree Px≤c:
  Px≤c = P'm opm (P'm-1 opm-1 ( ... P'1 )),
where opi = AND if bi = 0, OR otherwise.

e.g., calculation of Px>c:
  Crude method: Px>(4)10 = Px>(100)2 = (P3 ∧ P2) ∨ (P3 ∧ P2' ∧ P1)
  Our formula:  Px>(4)10 = Px>(100)2 = P3 ∧ (P2 ∨ P1)
e.g., calculation of Px≤c:
  Crude: Px≤(70)10 = Px≤(01000110)2 = (P7' ∧ P6') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1 ∧ P0')
  Ours:  Px≤(70)10 = Px≤(01000110)2 = P7' ∧ (P6' ∨ (P5' ∧ (P4' ∧ (P3' ∧ (P2' ∨ (P1' ∨ P0'))))))
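The inside-out evaluation of the Px>c formula can be sketched as follows. This is a toy version on uncompressed bit vectors, with each basic P-tree standing in as a Python list; p_greater is our own name:

```python
# Evaluate P_{x>c} = Pm op_m (Pm-1 op_m-1 ( ... P1)) from the inside out:
# op_i is AND when bit b_i of c is 1, OR when it is 0.

def p_greater(P, c, m):
    """Bit mask of rows whose m-bit value exceeds c.
    P[i] is the bit slice for bit position i (P[0] = least significant)."""
    n = len(P[0])
    acc = [0] * n                    # empty innermost accumulator
    for i in range(m):               # low-order bit first
        if (c >> i) & 1:             # op_i = AND when b_i = 1
            acc = [p & a for p, a in zip(P[i], acc)]
        else:                        # op_i = OR when b_i = 0
            acc = [p | a for p, a in zip(P[i], acc)]
    return acc

# A1 column of the earlier example relation: values 2, 3, 2, 2, 5, 2, 7, 7
P = [[0, 1, 0, 0, 1, 0, 1, 1],       # bit 0 (low)
     [1, 1, 1, 1, 0, 1, 1, 1],       # bit 1
     [0, 0, 0, 0, 1, 0, 1, 1]]       # bit 2 (high)
print(p_greater(P, 4, 3))            # A1 > 4 -> [0, 0, 0, 0, 1, 0, 1, 1]
```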

Equal Interval Neighborhood Ring (EIN-ring) approach
The EIN-ring of center x, radius r, and width ε is the set of points X lying within distance r + ε of x but not within distance r. Its P-tree mask is obtained by ANDing the interval predicate tree for the outer disk with the complement of the one for the inner disk:
  P(x, r, r+ε) = P(x−r−ε < X ≤ x+r+ε) ∧ P'(x−r < X ≤ x+r)
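As a sketch, the ring mask is just "outer interval AND NOT inner interval". Plain Python masks below stand in for the interval predicate trees built on the previous slides; in_interval and ein_ring are our own names:

```python
def in_interval(values, lo, hi):
    """Bit mask of rows with lo < v <= hi (half-open interval predicate)."""
    return [1 if lo < v <= hi else 0 for v in values]

def ein_ring(values, x, r, eps):
    """Rows within distance r+eps of x but not within distance r."""
    outer = in_interval(values, x - r - eps, x + r + eps)
    inner = in_interval(values, x - r, x + r)
    return [o & (1 - i) for o, i in zip(outer, inner)]

print(ein_ring([1, 3, 5, 7, 9], x=5, r=1, eps=2))  # -> [0, 1, 0, 1, 0]
```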

Efficient Ranking Algorithm
The dimension values are the measurements called Term Frequency times Inverse Document Frequency (TFxIDF).
TF(t): how many times term t occurs in a document. It is a local weight measuring the importance of the term to the document, normalized to document size: term t might occur 10 times in a 100-page document but 9 times in a 1-page document, and clearly t is more associated with the second document.
IDF: log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents. It is a global weight measuring the uniqueness of the term (the reciprocal of its support).
All representations are converted to P-trees; a sketch of the weighting follows.
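A minimal sketch of this TFxIDF weighting (our own helper, not the paper's code; tfidf_weights is an assumed name):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: TFxIDF} dict per doc."""
    N = len(docs)
    df = Counter()                    # N_t: number of docs containing t
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(N / df[t])
                    for t in tf})
    return out
```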

Efficient Ranking Algorithm (continued)
[The slide shows a table of relative term frequencies TF1 ... TFk with term weights (IDF*TF) such as .025, .09, .034, .012, .003, .03, .22, .02, .01; section weights (e.g., experiment_section = high, educ_section = low) and document weights (e.g., CATA = hi, Science = low) are not used in this analysis.]
The data is converted to P-trees and then ranked using P-kNN. P-kNN ranks according to similarity rings emanating from the query (as a document) until at least k documents are found; each ring is a simple EIN-ring calculation, as in the sketch below.
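A sketch of the ring-expansion idea. The distances computed directly here stand in for the P-tree EIN-ring counts that make the real algorithm scan-free; p_knn and euclid are our own names:

```python
def euclid(u, v):
    """Euclidean distance between two sparse {term: weight} vectors."""
    keys = set(u) | set(v)
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in keys) ** 0.5

def p_knn(query, docs, k, eps):
    """Grow rings of width eps around the query until at least k docs
    fall inside, then return those docs ranked by distance."""
    dists = [(i, euclid(query, d)) for i, d in enumerate(docs)]
    max_d = max(d for _, d in dists)
    r = eps
    while True:
        hits = [(i, d) for i, d in dists if d <= r]
        if len(hits) >= k or r > max_d:
            return sorted(hits, key=lambda pair: pair[1])
        r += eps                      # expand to the next EIN-ring
```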

Experiment Results
We compared P-RANK with scan-based ranking on MEDLINE data (ACM KDD Cup 2002, task 2).
Three size groups, denoted DB1, DB2, and DB3, contain 1,000, 8,000, and 15,000 documents, respectively.

Architecture of P-RANK

Conclusion and future directions
We described the P-RANK method for extracting evidence, for example of the products of genes, from text documents, e.g., biomedical papers.
Our contributions in this paper include a new efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulations.