1 Extending PRIX for Similarity-based XML Query Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao.


2 Agenda
- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm: Indexing, Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion

3 System Architecture Introduction


5 Query Expansion (I)
An example:
- Tags in a sample query: {title, Praveen Rao, information retrieval}
- Keywords: {title, Praveen, Rao, information, retrieval}
- Keyword extensions: {{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
- Valid keyword extensions: {{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
(Continued on the next slide)

6 Query Expansion (II)
- Tag extensions: {{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}
- Valid tag extensions: {{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}
- Query expansions:
  1. { {title}, {Praveen Rao}, {modern information retrieval} }
  2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} }
  ……
- Valid queries: { {title}, {Praveen Rao}, {modern information retrieval} }
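The step from keyword extensions to tag extensions in the example looks like a Cartesian product: one alternative is picked per keyword of the tag. A minimal Python sketch under that reading follows. The function name `expand_tag` is hypothetical, and the later filtering into "valid" tag extensions (which seems to depend on what actually occurs in the document collection) is not modelled.

```python
from itertools import product


def expand_tag(keyword_extensions):
    """Return every combination that picks one alternative per keyword."""
    return [list(combo) for combo in product(*keyword_extensions)]


# Valid keyword extensions for the tag "information retrieval" (from the slide).
extensions = [["data", "entropy", "information"], ["retrieval", "recovery"]]

for candidate in expand_tag(extensions):
    print(candidate)
```

For this tag the sketch produces six candidates, matching the six two-keyword tag extensions listed on the slide.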

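7 Semantic Similarity Computation
Similarity between query q and one of its extensions q'
- t: tag in query q
- t': tag in query q'
- n: number of tags in q
- m: number of keywords in tag t
- Keyword-level score for keywords k_i and k_i': 1 if k_i = k_i'; α in the remaining case [the range of α and its condition are cut off in the source]
[Formula for the overall similarity not preserved in the transcript]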


9 Indexing: Prix (PRüfer sequences for Indexing Xml)

10 Indexing: Prix (PRüfer sequences for Indexing Xml)
- AD-Label (Ancestor-Descendant)
- Indexing structure in DB
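To make the indexing idea concrete, here is a small Python sketch of a Prüfer-style transformation of a labelled tree. It assumes nodes are numbered in postorder and that the leaf with the smallest number is deleted repeatedly, recording its parent's label (a labelled sequence, here `lps`) and its parent's number (a numbered sequence, here `nps`). The slides do not spell the construction out, so the tuple-based tree encoding and the function name are illustrative assumptions.

```python
def prufer_sequences(tree):
    """tree is (label, [children]); returns (lps, nps) as two lists."""
    labels = {}    # postorder number -> tag label
    parent = {}    # postorder number -> parent's postorder number
    pending = {}   # postorder number -> children not yet deleted

    def visit(node):
        label, children = node
        child_ids = [visit(child) for child in children]
        node_id = len(labels) + 1      # children are numbered before the parent
        labels[node_id] = label
        pending[node_id] = len(child_ids)
        for child_id in child_ids:
            parent[child_id] = node_id
        return node_id

    visit(tree)
    alive = set(labels)
    lps, nps = [], []
    while len(alive) > 1:
        # Delete the smallest-numbered leaf and record its parent.
        leaf = min(n for n in alive if pending[n] == 0)
        p = parent[leaf]
        lps.append(labels[p])
        nps.append(p)
        alive.remove(leaf)
        pending[p] -= 1
    return lps, nps


# Hypothetical document tree: A has children B and E; B has children C and D.
doc = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
print(prufer_sequences(doc))
```

Because only parents are recorded, leaf labels such as C, D and E do not appear in `lps`, which is the gap that the dummy-node refinement later in the deck addresses.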

11 Query Processing
Procedure:
- Filtering
  - Based on subsequence matching
  - O(n*n*m), where n is the number of nodes in the document and m is the number of nodes in the query
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency

12 Subsequence Matching
- Definition
- Example:
  * Good results: media, mult, mm, ted, tia, etc.
- Why does it work?
- Not sufficient on its own; further refinements are needed
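The filtering step can be illustrated with a plain subsequence test. The sketch below returns the leftmost positions at which a query sequence occurs, in order but not necessarily contiguously, inside a document sequence. The example string "multimedia" is an assumption: it is not in the transcript, but every "good result" listed on the slide is a subsequence of it. A real filter would enumerate all matches rather than only the leftmost one.

```python
def match_subsequence(query, document):
    """Return leftmost match positions of query in document, or None."""
    positions = []
    start = 0
    for symbol in query:
        try:
            index = document.index(symbol, start)
        except ValueError:
            return None            # symbol does not occur after the last match
        positions.append(index)
        start = index + 1
    return positions


for candidate in ["media", "mult", "mm", "ted", "tia", "dam"]:
    print(candidate, match_subsequence(candidate, "multimedia"))
```

The same function works on lists of tag labels, since both `str.index` and `list.index` accept a start offset.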

13 Refinement #1
Concept of Dummy Nodes
- PRIX offers only a partial match
- Solution: extend PRIX to the leaf level
- Example:

14 Refinement #2
Connected vs. Disconnected
- Definition
- How to check it?
- If not connected, then what?
- Solution: apply a penalty
- Example (Disconnected by Gap):
- Example (Disconnected by Unknown):

15 Refinement #3
Checking for Gap Consistency
- Gap consistency depends on the gaps of the Prüfer sequence
- How to check it?
- Determines whether the query tree is a subset of the search domain

16 Refinement #4
Checking for Frequency Consistency
- Frequency consistency depends on gap consistency and on the occurrences in the NPS
- How to check it?
- Determines whether the query tree is an exact match in the search domain
- If not frequency consistent, then what?
- Solution: apply a penalty
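The slides name the two consistency checks without defining them. The Python sketch below follows the definitions as commonly stated for PRIX, which is an assumption here rather than something the transcript confirms. Gap consistency: adjacent differences in the matched document numbers and in the query numbers have the same sign (zero only with zero), and the query gap is never larger than the document gap. Frequency consistency: two positions carry equal numbers in the document exactly when they do in the query. Both inputs are numbered sequences of equal length.

```python
def gap_consistent(doc_nps, query_nps):
    """Assumed rule: same-sign gaps, and |query gap| <= |document gap|."""
    if len(doc_nps) != len(query_nps):
        return False
    for i in range(len(query_nps) - 1):
        doc_gap = doc_nps[i] - doc_nps[i + 1]
        query_gap = query_nps[i] - query_nps[i + 1]
        same_sign = ((doc_gap > 0) == (query_gap > 0)
                     and (doc_gap < 0) == (query_gap < 0))
        if not same_sign or abs(query_gap) > abs(doc_gap):
            return False
    return True


def frequency_consistent(doc_nps, query_nps):
    """Assumed rule: equal entries in one sequence iff equal in the other."""
    if len(doc_nps) != len(query_nps):
        return False
    n = len(query_nps)
    return all(
        (doc_nps[i] == doc_nps[j]) == (query_nps[i] == query_nps[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


print(gap_consistent([2, 7, 7], [1, 3, 3]))        # gaps -5, 0 vs -2, 0
print(frequency_consistent([3, 5, 5], [2, 2, 3]))  # repeats in different places
```

In the approach described on the slide, a failed check does not discard the candidate; it triggers a penalty instead.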

17 Structure Similarity
- Calculations are based on edit distances, which are translated into penalty values
- Each mismatched node in the structure carries a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty
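One possible reading of the penalty scheme, sketched in Python: the "dot product" is taken between a 0/1 mismatch indicator per query node and that node's penalty (subtree size + 1), and the worst case is every node mismatching. Turning the normalized penalty into a similarity as 1 minus the ratio is also an assumption, as is the function name; the slides give no formula.

```python
def structure_similarity(subtree_sizes, mismatched):
    """subtree_sizes[i]: subtree size of query node i; mismatched[i]: bool."""
    penalties = [size + 1 for size in subtree_sizes]
    # Dot product of the mismatch indicator vector with the penalty vector.
    penalty = sum(p * int(m) for p, m in zip(penalties, mismatched))
    worst = sum(penalties)             # assumed worst case: all nodes mismatch
    return 1.0 - penalty / worst if worst else 1.0


# Hypothetical query with four nodes; only the second one fails to match.
print(structure_similarity([0, 0, 2, 0], [False, True, False, False]))
```

Under this reading a mismatch near the root costs more than one at a leaf, because the whole subtree below it is affected.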

18 Structural Similarity #1: Connectivity

19 Structural Similarity #2: Gap Similarity

20 Structural Similarity #3: Frequency Similarity


22 Rank returned XML patterns
Similarity(q, q'') = Semantic_sim(q, q') * Structure_sim(q', q'')
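The ranking rule multiplies the two scores, so it can be sketched in a few lines of Python. The candidate tuples and their score values below are made up for illustration. Here q is the original query, q' an expanded query, and q'' a pattern returned for q'.

```python
def rank(candidates):
    """candidates: iterable of (pattern, semantic_sim, structure_sim)."""
    scored = [(pattern, sem * struct) for pattern, sem, struct in candidates]
    # Highest combined similarity first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical returned patterns with their two component scores.
results = [("p1", 1.0, 0.5), ("p2", 0.8, 0.9), ("p3", 0.5, 0.5)]
for pattern, score in rank(results):
    print(pattern, round(score, 3))
```

Because the scores are multiplied, a pattern that scores zero on either component drops to the bottom regardless of the other component.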

23 Advantages of the approach
PRIX indexing:
- Faster
- Captures all structural information
Similarity based:
- Structure similarity
- Semantic similarity

24 Limitations and Extensions
Limitation of PRIX:
- Ordering of nodes
- We need to handle it in query extension
[Figure: example trees over the nodes a, b and c]

25 Limitations and Extensions
More limitations of PRIX:
- It is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences
- It is therefore difficult to define the similarity accurately
However:
- Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on those sequences, is a promising direction

26 Limitations and Extensions
Limitations of semantic similarity:
- Too many similar results
However:
- We consider semantic similarity together with structure information
In a broad sense:
- Structure similarity
- Semantic similarity
- Syntax similarity
- Similarity information from co-occurrences of keywords
- Similarity information from user feedback
- Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)