Published by Julia Joan Lawson. Modified over 9 years ago.
1
An Efficient Inverted Index Technique for XML Documents using RDBMS
Prepared by Devrim Yıldırım
Original paper by Chiyoung Seo
2
Contents
- Introduction
- Containment query
- Extending the inverted index
- Containment query processing
- Experiments
- Conclusion & Future work
3
Introduction
- XML IR (Information Retrieval) systems need a new inverted index technique to process containment queries efficiently
- There are two ways to store the inverted index and process containment queries for XML documents: using an RDBMS or using an IR engine
- The previous approach has two serious problems:
  - There is a significant performance difference between the RDBMS implementation and the IR implementation
  - The number of join operations required increases in proportion to the path length of a query
4
Introduction
Advantages of using an RDBMS
- Storing XML documents together with their inverted index in an RDBMS makes it easy to build an integrated XML IR system
- The storage and processing power of the database can be used to process queries over XML documents
- XML query processors that support containment queries are easy to build on top of an RDBMS
- No additional costs are incurred
5
Containment query
A sample XML document (text content only):
  Data on the Web
  Abiteboul Serge
  Buneman Peter
  Suciu Dan
  This is book mainly mentions Semistructured data and XML
Basic containment queries
- Indirect containment query
  Ex1: /books//title
  Ex2: /books//author//'Abiteboul'
- Direct containment query
  Ex1: //book/author/family
  Ex2: /books/book/summary/keyword/'XML'
- Tight containment query
  Ex: //given='Peter'
- k-Proximity containment query
  Ex (k=3): Distance("Data","Web") ≤ 3
6
Extending inverted index(1)
Traditional Inverted Index
Text: This is a text. A text has many words. Words are made from letters.
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 (one per word, in order)
  Text words   Occurrences
  text         4, 6, ...
  many         8, ...
  words        9, 10, ...
  made         12, ...
  letters      14, ...
  ...
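The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_inverted_index` and the stop-word list are assumptions chosen so that the output matches the vocabulary shown on the slide, and words are lowercased so that "words" and "Words" share one entry.

```python
import re
from collections import defaultdict


def build_inverted_index(text, stop_words=frozenset()):
    """Map each lowercased word to the 1-based word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"[A-Za-z]+", text), start=1):
        word = word.lower()
        if word not in stop_words:
            index[word].append(position)
    return dict(index)


TEXT = "This is a text. A text has many words. Words are made from letters."
# Hypothetical stop-word list (not given on the slide).
STOP_WORDS = frozenset({"this", "is", "a", "has", "are", "from"})

index = build_inverted_index(TEXT, STOP_WORDS)
for term, occurrences in index.items():
    print(term, occurrences)
```

Every word, including the stop words, still consumes a position, which is why "text" is recorded at 4 and 6 rather than at 1 and 2.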
7
Extending inverted index(2)
Previous approach
Dealer.xml (text content): Billie Brown Chevy 1999 Chevrolet Caprice white $7500 …
E-INDEX(term, document, begin, end, level)
  (Dealer,1,1,24,0) (Name,1,2,6,1) (Car,1,7,23,1) (Desc,1,8,22,2) (Year,1,9,11,3) (Model,1,12,15,3) (Color,1,16,18,3) (Price,1,19,21,3) …
T-INDEX(term, document, position, level)
  (Billie,1,3,2) (Brown,1,4,2) (Chevy,1,5,2) (1999,1,10,4) (Chevrolet,1,13,4) …
Notation
  Element EL1: (EL1,D1,S1,E1,L1)
  Text T1: (W2,D2,P2,L2), T2: (W3,D3,P3,L3)
- Indirect containment relationship: I) D1 = D2, II) S1 < P2, III) E1 > P2
- Direct containment relationship: I) D1 = D2, II) S1 < P2, III) E1 > P2, IV) L1 = L2 - 1
- Tight containment relationship: I) D1 = D2, II) S1 = P2 - 1, III) E1 = P3 + 1
- k-Proximity containment relationship: I) D2 = D3, II) P3 - P2 ≥ 0, III) P3 - P2 ≤ k
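The four relationships can be read as predicates over E-INDEX and T-INDEX tuples. The sketch below is illustrative only: the class and function names are assumptions, the comparison operators follow the usual begin/end containment reading (element begins before the word and ends after it), and the lower bound of the proximity test is assumed to be non-strict.

```python
from typing import NamedTuple


class Element(NamedTuple):  # one E-INDEX tuple
    term: str
    doc: int
    begin: int
    end: int
    level: int


class Text(NamedTuple):  # one T-INDEX tuple
    term: str
    doc: int
    position: int
    level: int


def contains_indirectly(e, t):
    # Same document, and the word lies strictly inside the element's span.
    return e.doc == t.doc and e.begin < t.position < e.end


def contains_directly(e, t):
    # Indirect containment plus a parent-child level difference of exactly 1.
    return contains_indirectly(e, t) and e.level == t.level - 1


def contains_tightly(e, first, last):
    # The element's content is exactly the words from `first` to `last`.
    # Checking the document of `last` as well is an added assumption.
    return (e.doc == first.doc == last.doc
            and e.begin == first.position - 1
            and e.end == last.position + 1)


def within_k(t2, t3, k):
    # t3 occurs at most k positions after t2 in the same document.
    return t2.doc == t3.doc and 0 <= t3.position - t2.position <= k


dealer = Element("Dealer", 1, 1, 24, 0)
name = Element("Name", 1, 2, 6, 1)
model = Element("Model", 1, 12, 15, 3)
billie = Text("Billie", 1, 3, 2)
chevy = Text("Chevy", 1, 5, 2)
chevrolet = Text("Chevrolet", 1, 13, 4)

print(contains_directly(model, chevrolet))    # Model is the parent of Chevrolet
print(contains_indirectly(dealer, chevrolet)) # Dealer is an ancestor of Chevrolet
print(contains_tightly(name, billie, chevy))  # Name holds exactly Billie..Chevy
print(within_k(billie, chevy, 3))             # two positions apart
```

Each predicate compares positions with inequalities, which is what forces non-equality joins when the same logic is expressed in SQL.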
8
Extending inverted index(3)
Problems
- The number of join operations increases in proportion to the path length of a containment query
- There is always a join operation between large relations
- The join predicates are non-equality predicates, so only nested loop joins can be used
Ex: "/dealer/car/desc/model/'Chevrolet'" requires four join operations, since there are four containment relationships (path length = 4)
9
Extending inverted index(4)
Our approach
Path(path, pathID)
  (/Dealer,0) (/Dealer/Name,1) (/Dealer/Car,2) (/Dealer/Car/Desc,3) (/Dealer/Car/Desc/Year,4) (/Dealer/Car/Desc/Model,5) (/Dealer/Car/Desc/Color,6) (/Dealer/Car/Desc/Price,7) …
PathIndex(pathID, docID, start, end)
  (0,1,1,24) (1,1,2,6) (2,1,7,23) (3,1,8,22) …
Term(term, termID)
  (Billie,0) (Brown,1) (Chevy,2) (1999,3) (Chevrolet,4) (Caprice,5) (white,6) ($7500,7) …
TermIndex(termID, docID, pathID, position)
  (0,1,1,3) (1,1,1,4) (2,1,1,5) (3,1,4,10) (4,1,5,13) …
Dealer.xml (text content): Billie Brown Chevy 1999 Chevrolet Caprice white $7500 …
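One way to see how the four relations arise is to number every tag and every word of the document in order and record full paths instead of element names. The sketch below is an assumption-laden illustration: the markup of `DEALER_XML` is reconstructed from the begin/end positions in the tuples above (the slide's own markup is not preserved), the function name `build_relations` is hypothetical, and the regex tokenizer only handles attribute-free tags.

```python
import re

# Reconstructed markup (assumption): element names and nesting follow the Path tuples.
DEALER_XML = (
    "<Dealer><Name>Billie Brown Chevy</Name>"
    "<Car><Desc><Year>1999</Year><Model>Chevrolet Caprice</Model>"
    "<Color>white</Color><Price>$7500</Price></Desc></Car></Dealer>"
)

# A token is an opening tag, a closing tag, or a whitespace-delimited word.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<>\s]+)")


def build_relations(xml, doc_id):
    path_ids, term_ids = {}, {}      # Path and Term relations as dicts
    path_index, term_index = [], []  # PathIndex and TermIndex tuples
    stack = []                       # (path, start position, slot in path_index)
    for position, match in enumerate(TOKEN.finditer(xml), start=1):
        closing, tag, word = match.groups()
        if word is not None:
            # A word is indexed under the path of its enclosing element.
            term_id = term_ids.setdefault(word, len(term_ids))
            term_index.append((term_id, doc_id, path_ids[stack[-1][0]], position))
        elif not closing:
            # Opening tag: extend the current path and reserve a PathIndex slot.
            path = (stack[-1][0] if stack else "") + "/" + tag
            path_ids.setdefault(path, len(path_ids))
            stack.append((path, position, len(path_index)))
            path_index.append(None)
        else:
            # Closing tag: the element's span is now known.
            path, start, slot = stack.pop()
            path_index[slot] = (path_ids[path], doc_id, start, position)
    return path_ids, path_index, term_ids, term_index


path, path_index, term, term_index = build_relations(DEALER_XML, doc_id=1)
print(path_index[:4])
print(term_index[:5])
```

Under these assumptions the generated tuples agree with the ones listed on the slide, for example (4,1,5,13) for "Chevrolet" under /Dealer/Car/Desc/Model.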
10
Extending inverted index(5)
Advantages
- Indirect and direct containment queries need 2 joins; tight and k-proximity containment queries need 3 joins
- Join operations happen between small relations
- Equality join predicates
Ex: "/dealer/car/desc/model/'Chevrolet'"
(I) Extract the pathID P1 of "/dealer/car/desc/model" from Path and the termID T1 of "Chevrolet" from Term
(II) Then, from TermIndex, extract the tuples whose pathID and termID equal P1 and T1, respectively
11
Containment query processing(1)
Previous approach
/dealer/car/desc/model/'Chevrolet'
  select chevrolet.docno
  from Elements dealer, Elements car, Elements desc, Elements model, Texts chevrolet
  where dealer.term = 'dealer' and car.term = 'car' and desc.term = 'desc'
    and model.term = 'model' and chevrolet.term = 'chevrolet'
    -- "dealer" contains "car" directly (parent-child relationship)
    and dealer.docno = car.docno
    and dealer.begin < car.begin and dealer.end > car.end
    and dealer.level = car.level - 1
    …
    -- 2 more self-joins on the "Elements" table
    -- "model" contains "Chevrolet" directly
    and model.docno = chevrolet.docno
    and model.begin < chevrolet.wordno and model.end > chevrolet.wordno
    and model.level = chevrolet.level - 1
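To make the cost of this formulation concrete, the following sketch runs a fully written-out version of the query on SQLite through Python's `sqlite3` module. Several details are assumptions rather than the slide's own code: the columns are renamed `begin_pos` and `end_pos` and the aliases are prefixed because `begin`, `end`, and `desc` are SQL keywords, terms are stored in lowercase, the two elided self-joins are filled in with the same direct-containment pattern, and the helper `directly_contains` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Elements (term TEXT, docno INTEGER, begin_pos INTEGER,
                           end_pos INTEGER, level INTEGER);
    CREATE TABLE Texts (term TEXT, docno INTEGER, wordno INTEGER, level INTEGER);
""")
# Tuples taken from the E-INDEX and T-INDEX examples, with lowercased terms.
conn.executemany("INSERT INTO Elements VALUES (?, ?, ?, ?, ?)", [
    ("dealer", 1, 1, 24, 0), ("name", 1, 2, 6, 1), ("car", 1, 7, 23, 1),
    ("desc", 1, 8, 22, 2), ("year", 1, 9, 11, 3), ("model", 1, 12, 15, 3),
    ("color", 1, 16, 18, 3), ("price", 1, 19, 21, 3),
])
conn.executemany("INSERT INTO Texts VALUES (?, ?, ?, ?)", [
    ("billie", 1, 3, 2), ("brown", 1, 4, 2), ("chevy", 1, 5, 2),
    ("1999", 1, 10, 4), ("chevrolet", 1, 13, 4),
])


def directly_contains(parent, child):
    """SQL predicate: element alias `parent` is the parent of element alias `child`."""
    return (f"{parent}.docno = {child}.docno "
            f"AND {parent}.begin_pos < {child}.begin_pos "
            f"AND {parent}.end_pos > {child}.end_pos "
            f"AND {parent}.level = {child}.level - 1")


QUERY = f"""
SELECT t.docno
FROM Elements e_dealer, Elements e_car, Elements e_desc, Elements e_model, Texts t
WHERE e_dealer.term = 'dealer' AND e_car.term = 'car' AND e_desc.term = 'desc'
  AND e_model.term = 'model' AND t.term = 'chevrolet'
  AND {directly_contains('e_dealer', 'e_car')}
  AND {directly_contains('e_car', 'e_desc')}
  AND {directly_contains('e_desc', 'e_model')}
  AND e_model.docno = t.docno AND e_model.begin_pos < t.wordno
  AND e_model.end_pos > t.wordno AND e_model.level = t.level - 1
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

The query touches the Elements relation four times and the Texts relation once, and every containment step adds range comparisons, so the work grows with each additional step in the path.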
12
Containment query processing(2)
Our approach
/dealer/car/desc/model/'Chevrolet'
  select TI.docID
  from Term T, TermIndex TI, Path P
  where T.term = 'Chevrolet'
    and P.path = '/dealer/car/desc/model'
    and TI.termID = T.termID
    and TI.pathID = P.pathID
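The same lookup can be tried on SQLite through Python's `sqlite3` module. This is a sketch under assumptions: only the three relations the query touches are created, the tuples come from the earlier Path, Term, and TermIndex examples, and paths are stored in lowercase so that they match the literal used in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Path (path TEXT, pathID INTEGER);
    CREATE TABLE Term (term TEXT, termID INTEGER);
    CREATE TABLE TermIndex (termID INTEGER, docID INTEGER,
                            pathID INTEGER, position INTEGER);
""")
# Paths are lowercased here (assumption) to match the query literal.
conn.executemany("INSERT INTO Path VALUES (?, ?)", [
    ("/dealer", 0), ("/dealer/name", 1), ("/dealer/car", 2),
    ("/dealer/car/desc", 3), ("/dealer/car/desc/year", 4),
    ("/dealer/car/desc/model", 5), ("/dealer/car/desc/color", 6),
    ("/dealer/car/desc/price", 7),
])
conn.executemany("INSERT INTO Term VALUES (?, ?)", [
    ("Billie", 0), ("Brown", 1), ("Chevy", 2), ("1999", 3),
    ("Chevrolet", 4), ("Caprice", 5), ("white", 6), ("$7500", 7),
])
conn.executemany("INSERT INTO TermIndex VALUES (?, ?, ?, ?)", [
    (0, 1, 1, 3), (1, 1, 1, 4), (2, 1, 1, 5), (3, 1, 4, 10), (4, 1, 5, 13),
])

rows = conn.execute("""
    SELECT TI.docID
    FROM Term T, TermIndex TI, Path P
    WHERE T.term = 'Chevrolet'
      AND P.path = '/dealer/car/desc/model'
      AND TI.termID = T.termID
      AND TI.pathID = P.pathID
""").fetchall()
print(rows)
```

Both joins are equality joins against the small Term and Path relations, and their number stays at two however long the path in the query is.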
13
Experiments(1)
Experimental environment
- 4Table method: maps our four inverted indexes into four relations and processes containment queries in the RDBMS
- 4BTree method: stores our four inverted indexes in the IR engine using B+trees and processes containment queries in the IR engine
- 2Table method: maps the two inverted indexes of the previous work into two relations and processes containment queries in the RDBMS
- 2BTree method: stores the two inverted indexes of the previous work in the IR engine using B+trees and processes containment queries in the IR engine
- 4Table, 2Table methods: Oracle 8.1.7, JDBC 2.0, 1.4 GHz Pentium IV, 768 MB main memory, Windows 2000 Professional
- 4BTree, 2BTree methods: IR engine written in Java, BerkeleyDB library B+tree, 1.4 GHz, 768 MB main memory, Windows 2000 Professional
14
Experiments(2)
XML datasets & size of relational tables
  Dataset   Companies   Cars    DBLP   Shakespeare
  Size      5 MB        19 MB   8 MB   81 MB
Size of relational tables
- 4-Relations (Term, TermIndex, Path, PathIndex), sizes as listed: 110 MB, 26 KB, 55 MB, 25 MB
- 2-Relations (Texts, Elements), sizes as listed: 138 MB, 77 MB
15
Experiments(3)
Experimental results
- Figure: 2Table/4Table performance ratios
- Figure: Variation of the execution time with the path length of a query
16
Experiments(4)
Experimental results
- Figure: 4BTree/4Table performance ratios
- Figure: 2BTree/4Table performance ratios
17
Conclusion & Future work
Conclusion
- We suggested a novel inverted index technique
- Our RDBMS implementation is comparable in performance to our IR implementation
- Our RDBMS implementation significantly outperforms both the RDBMS and the IR implementations of the previous approach
Future work
- How to integrate XML documents and their inverted indexes into one schema structure in an RDBMS
- Comparing our approach with the indexing techniques offered internally by commercial RDBMSs