XML Information Retrieval Hui Fang Department of Computer Science University of Illinois at Urbana-Champaign Some slides are borrowed from Norbert Fuhr’s XML Tutorial.

Outline
XML basics
Research topics
XML IR
–Tasks
–Retrieval methods
–Clustering XML documents

XML standards

Basic XML
Hierarchical document format for information exchange on the WWW
Self-describing data (tags)
Nested element structure having a root
Element data can have
–Attributes
–Sub-elements
(Slides from Jayavel Shanmugasundaram)

Example XML document (callouts on the slide mark an attribute and an element)
Text content: Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW; Richard Belew; San Diego; 92093

Tree structure of XML documents
book (id=“rbelew”)
  author
    name
      First name: Richard
      Last name: Belew
    address
      city: San Diego
      Zip code: 92093
  title: Finding….
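
The tree above can be written out as markup and parsed back into the same hierarchy. The following is a minimal Python sketch using the standard library; the element names `firstname`, `lastname`, `city` and `zip` are assumptions derived from the tree labels, not names taken from the original slide.

```python
import xml.etree.ElementTree as ET

# Element names below are assumed from the labels in the tree diagram.
BOOK_XML = """
<book id="rbelew">
  <author>
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
</book>
"""

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element in the subtree."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

root = ET.fromstring(BOOK_XML)
book_id = root.get("id")         # the attribute on the root element
leaves = dict(leaf_paths(root))  # sub-elements flattened to root-to-leaf paths
```

Each key of `leaves` is a root-to-leaf path such as `/book/author/name/lastname`, which is the kind of structural context the retrieval methods later in the deck rely on.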

Basic XML standard does not deal with …
Standardization of element names → XML namespaces
Structure of element content → XML DTDs
Data types of element content → XML Schema

XML namespace
Provides a method to avoid element name conflicts
Example content: Apples, Bananas; GPA Table

XML namespace (Cont.)
Provides a method to avoid element name conflicts
Example content: Apples, Bananas; GPA Table
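
A name conflict of this kind can be sketched in Python as follows. This is an illustration only: the two namespace URIs, the prefixes `h` and `g`, and the inner element names (`tr`, `td`, `name`) are assumptions, since the markup of the slide example is not available.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs and the inner element names are hypothetical.
DOC = """
<root xmlns:h="http://example.org/html" xmlns:g="http://example.org/grades">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <g:table><g:name>GPA Table</g:name></g:table>
</root>
"""

NS = {"h": "http://example.org/html", "g": "http://example.org/grades"}
root = ET.fromstring(DOC)

# The same local name "table" is resolved separately in each namespace.
html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
grade_names = [n.text for n in root.findall("g:table/g:name", NS)]

# Internally the parser keeps the two elements apart by expanding the prefix.
table_tags = [child.tag for child in root]
```

After parsing, the two `table` elements carry different expanded names, so a query for one never returns the other.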

XML Document Type Definition
Defines the document structure with a list of legal elements
Example content: Tove; Jani; Reminder; Have a rest!

Research Topics related to XML

Research Topics
IR areas
–Retrieval models
–Query languages
–…
DB areas
–Query languages
–System architecture
–Applying relational DB technology to XML data
–Streaming XML
–XML query processing
–XML indexing and compression
–……

XML IR

INEX: Initiative for the Evaluation of XML Retrieval
Documents: 12,107 articles in XML format
Queries: 30 content-only (CO); 30 content-and-structure (CAS)
Relevance assessments: by participating groups
Participants: 36 active groups in 2003

CO search task
Document as hierarchical structure of nested elements
Type of elements is not considered
Query refers to content only
Query syntax as in standard text retrieval
Task: find the smallest subtree (element) satisfying the query
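
The "smallest subtree" idea can be sketched with a Boolean simplification: an element is returned when its subtree contains every query term and none of its children does. Real CO systems rank elements rather than filter them; the function names and the sample document here are hypothetical.

```python
import xml.etree.ElementTree as ET

def text_of(element):
    """All text in the subtree rooted at `element`, lower-cased."""
    return " ".join(element.itertext()).lower()

def smallest_matches(root, terms):
    """Elements whose subtree contains every term while no child subtree does."""
    terms = [t.lower() for t in terms]

    def contains_all(el):
        words = text_of(el).split()
        return all(t in words for t in terms)

    result = []

    def visit(el):
        if not contains_all(el):
            return
        matching_children = [c for c in el if contains_all(c)]
        if matching_children:
            for child in matching_children:
                visit(child)      # a more specific match exists below
        else:
            result.append(el)     # this is the most specific match

    visit(root)
    return result

# Hypothetical two-section article.
DOC = ("<article><sec><p>augmented reality in surgery</p></sec>"
       "<sec><p>reality</p><p>augmented</p></sec></article>")
root = ET.fromstring(DOC)
hits = smallest_matches(root, ["augmented", "reality"])
hit_tags = [el.tag for el in hits]
```

In the first section a single paragraph satisfies the query, so the paragraph is returned; in the second section the terms are spread over two paragraphs, so the section itself is the smallest satisfying element.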

Example of CO Topic
Title: augmented reality and medicine
Description: How virtual (or augmented) reality can contribute to improve the medical and surgical practice.
Narrative: In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery).
Keywords: augmented virtual reality medicine surgery improve computer assisted aided image

CAS search task
Queries contain explicit references to the XML structure, by restricting
–The context of interest: target element
–The context of certain search concepts: (concept, context) pairs

Example of CAS topic
article non-monotonic reasoning bdy/sec hdr//yr -calendar belief revision
Retrieve all articles from the years that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries.
non-monotonic reasoning belief revision

XML Retrieval Methods XIRQL –XML query languages with IR-related features Language models JuruXML

XIRQL(I)
CO approaches:
–Split document text into disjoint nodes
–Index nodes separately
–Aggregate indexing weights for higher-level elements (subtrees)

Index nodes as units for term weighting
Application of known indexing functions (e.g. tf*idf)

Index nodes for relevance-oriented search
Figure: example document tree (document class="H.3.3"; author John Smith; title XML Retrieval; chapters and sections with headings "Introduction", "XML Query Lang. XQL", "Syntax" and "Examples"; section text "We describe syntax of XQL")
Q1: syntax ∧ example
Q2: XQL

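Combining weights … by disjunction
Q1: syntax ∧ example
Q2: XQL
Indexing weights in the figure: 0.5 example, 0.8 XQL, 0.7 syntax (section1, section2); 0.3 XQL (chapter)
Weights at chapter level: 0.5 example, 0.7 syntax, XQL: 0.8 + 0.3 - 0.8*0.3 = 0.86
Q1 at chapter level: 0.7*0.5 = 0.35
Need to return most specific element satisfying the query!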

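Combining weights … with augmentation weight
Q2: XQL
Indexing weights in the figure: 0.5 example, 0.8 XQL, 0.7 syntax (section1, section2); 0.3 XQL (chapter)
Weights at chapter level (section weights scaled by an augmentation weight of 0.6): 0.30 example, 0.42 syntax, XQL: 0.48 + 0.3 - 0.48*0.3 = 0.636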
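
The arithmetic on the two "Combining weights" slides can be checked with a few lines of Python. This is a sketch under stated assumptions: term weights are treated as independent probabilities (so disjunction is a + b - ab and conjunction is a product), 0.8 XQL is taken to belong to a section below the chapter, and the augmentation weight 0.6 is inferred from the chapter-level values 0.30 = 0.6*0.5 and 0.42 = 0.6*0.7 rather than stated on the slide.

```python
def prob_or(a, b):
    """Disjunction of two weights, assuming independent events."""
    return a + b - a * b

def prob_and(a, b):
    """Conjunction of two weights, assuming independent events."""
    return a * b

def augment(weight, factor):
    """Down-weight a term weight when it is propagated to the parent."""
    return weight * factor

AUGMENTATION = 0.6    # assumed: inferred from 0.30/0.5 and 0.42/0.7
SECTION_XQL = 0.8     # weight of "XQL" in a section
CHAPTER_XQL = 0.3     # weight of "XQL" in the chapter's own text

# Q1 (syntax AND example) evaluated at chapter level, no augmentation.
q1_chapter = prob_and(0.7, 0.5)

# Q2 (XQL) at chapter level, without and with augmentation.
q2_plain = prob_or(SECTION_XQL, CHAPTER_XQL)
q2_augmented = prob_or(augment(SECTION_XQL, AUGMENTATION), CHAPTER_XQL)

# With augmentation the section outranks the chapter for Q2.
section_ranks_first = SECTION_XQL > q2_augmented
```

Without augmentation the chapter scores about 0.86 for Q2 and would outrank the more specific section (0.8); with the assumed factor it drops to about 0.636, so the section is ranked first, which is the behaviour the first slide asks for.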

XIRQL(II)
CAS approaches
–Extension of XQL by
Weighting and ranking
Data types with vague predicates
Structural relativism

XQL Expressions
Path condition
–search for single elements: heading
–parent-child: chapter/heading
–ancestor-descendant: chapter//section
–document root: /book/*
Filter wrt. structure: //chapter[heading]
Filter wrt. content: [… $and$ author=“John Smith”]
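
The path conditions above have close counterparts in the limited XPath subset of Python's `xml.etree.ElementTree`, which is used here as a stand-in because no XQL engine is assumed to be available. The sample document is hypothetical, and the paths are written relative to a `library` root element instead of starting with `/`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = """
<library>
  <book>
    <author>John Smith</author>
    <chapter>
      <heading>Introduction</heading>
    </chapter>
    <chapter>
      <heading>XML Query Lang. XQL</heading>
      <section><heading>Syntax</heading></section>
      <section><heading>Examples</heading></section>
    </chapter>
  </book>
  <book>
    <author>Jane Doe</author>
    <chapter>
      <section><heading>Notes</heading></section>
    </chapter>
  </book>
</library>
"""
root = ET.fromstring(DOC)

all_headings = root.findall(".//heading")                 # single elements
chapter_headings = root.findall("book/chapter/heading")   # parent-child
chapter_sections = root.findall("book/chapter//section")  # ancestor-descendant
book_children = root.findall("book/*")                    # all children of book
chapters_with_heading = root.findall(".//chapter[heading]")  # filter on structure
smith_books = root.findall("book[author='John Smith']")      # filter on content
```

These are exact Boolean matches; XIRQL's contribution, described on the following slides, is to replace them with weighted and vague matching.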

Data types with vague predicates
Compares two values of a specific data type
–E.g. near, broader, narrower
Returns (probabilistic) matching value
–E.g. “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago” → Ernst Olbrich, Darmstadt, 1899
P(Olbrich ≈ Ulbrich)=0.8 (phonetic similarity)
P( )=0.9 (numeric similarity)
P(Darmstadt ≈ Frankfurt)=0.7 (geographic distance)

Semantic Relativism Drop distinction attribute/element: ~author searches for attribute or element Generalize to data types: #personname searches for attribute/elements of specific data type

Language models Generate language models for each node in the tree Combine the children language models using linear interpolation Use EM approach to train the linear interpolation parameters

Element-specific language models ---CO Approaches

Higher-level nodes: mixture of language models
Query: dog and cat
Figure label: 0.5
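
A parent node's language model as a linear interpolation of its children's models can be sketched as follows. The child texts and the fixed weights of 0.5 are illustrative assumptions; the slides state that the interpolation parameters are trained with EM, which this sketch does not do, and no smoothing is applied.

```python
from collections import Counter

def unigram_lm(text):
    """Maximum-likelihood unigram model of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(models, lambdas):
    """Mixture P(w) = sum_i lambda_i * P_i(w); the lambdas must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    mixed = Counter()
    for model, lam in zip(models, lambdas):
        for word, prob in model.items():
            mixed[word] += lam * prob
    return dict(mixed)

def query_likelihood(model, query_terms):
    """Probability of generating all query terms from the model."""
    prob = 1.0
    for term in query_terms:
        prob *= model.get(term, 0.0)
    return prob

# Hypothetical child nodes of one parent element.
title_lm = unigram_lm("dog")
body_lm = unigram_lm("cat cat dog fish")

parent_lm = interpolate([title_lm, body_lm], [0.5, 0.5])
score_parent = query_likelihood(parent_lm, ["dog", "cat"])
score_title = query_likelihood(title_lm, ["dog", "cat"])
```

The title alone cannot generate "cat" and scores zero, while the parent mixture covers both terms, which is why higher-level elements can match queries that none of their children match individually.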

Type-specific language models --- CAS approaches

“Return components of type x where it has component y that contains the query term w”
e.g. return documents where the title contains the word “bird”
e.g. return documents where the body’s first section contains the word “dog”

JuruXML
Element-specific indexing + vector space model:
–Transform query into a set of (term, path) conditions
–Vague matching of path conditions
–Modified cosine similarity as retrieval function

JuruXML(1) ---Transform Query

JuruXML(2) ---Vague matching of path conditions

JuruXML(3) ---Retrieval function
Standard cosine similarity
–$w_Q(t_i)$: query term weight of term $t_i$
–$w_D(t_i)$: indexing weight of term $t_i$ in the document
Modified cosine similarity
–$w_Q(t_i, c_i^Q)$: query term weight of pair $(t_i, c_i^Q)$
–$w_D(t_i, c_i^D)$: indexing weight of pair $(t_i, c_i^D)$ in the document
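
One way to combine (term, context) weights into a cosine-style score is sketched below. The `context_resemblance` function is an assumption chosen for illustration (full credit for identical paths, partial credit when the query path is an ordered subsequence of the document path); the slides only state that path conditions are matched vaguely, so this should not be read as the exact JuruXML formula.

```python
import math

def context_resemblance(query_path, doc_path):
    """Assumed vague path match in [0, 1]; 0.0 when the paths are incompatible."""
    q = query_path.strip("/").split("/")
    d = doc_path.strip("/").split("/")
    remaining = iter(d)
    # True when the steps of q occur in d in the same order.
    if all(step in remaining for step in q):
        return (1 + len(q)) / (1 + len(d))
    return 0.0

def norm(weights):
    """Euclidean length of a weight vector stored as a dict."""
    return math.sqrt(sum(w * w for w in weights.values()))

def modified_cosine(query, doc):
    """query, doc: dicts mapping (term, path) to a weight."""
    score = 0.0
    for (q_term, q_path), q_weight in query.items():
        for (d_term, d_path), d_weight in doc.items():
            if q_term == d_term:
                score += q_weight * d_weight * context_resemblance(q_path, d_path)
    denominator = norm(query) * norm(doc)
    return score / denominator if denominator else 0.0

# Hypothetical query and documents.
query = {("xml", "article/title"): 1.0}
exact_doc = {("xml", "article/title"): 1.0}
nested_doc = {("xml", "article/sec/title"): 1.0}

exact_score = modified_cosine(query, exact_doc)
nested_score = modified_cosine(query, nested_doc)
```

A document that contains the term in exactly the requested context scores highest, and one that contains it in a longer but compatible path still receives partial credit.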

JuruXML(4) ---Alternative approach (Merging contexts)
For each query term $(t_i, c_i^Q)$, treat all matched document terms $(t_i, c_j^D)$ equally from the user perspective.
Define a weight function $w(c_i^Q)$
–E.g.

Clustering XML documents

Document similarity
Document representation: document → N-dimensional vector
–N = # document features
–Feature sets
Text only
Tags only
Text + Tags
Feature weighting in the document vector
Similarity measure --- vector similarity
–E.g. cosine measure
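
The three feature sets can be compared on a toy pair of documents. This sketch uses raw counts as feature weights, which is an assumption (the slides leave the weighting scheme open), and the two sample documents are hypothetical.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

def features(xml_text, mode="text+tags"):
    """Bag of features for one document: words, tag names, or both."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    if mode in ("text", "text+tags"):
        for chunk in root.itertext():
            counts.update(re.findall(r"\w+", chunk.lower()))
    if mode in ("tags", "text+tags"):
        # Angle brackets keep tag features distinct from identical words.
        counts.update(f"<{el.tag}>" for el in root.iter())
    return counts

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

D1 = "<article><title>xml retrieval</title></article>"
D2 = "<book><title>xml retrieval</title></book>"

sim = {mode: cosine(features(D1, mode), features(D2, mode))
       for mode in ("text", "tags", "text+tags")}
```

The two documents have identical text but different root tags, so the text-only similarity is the highest, the tags-only similarity the lowest, and the combined feature set falls in between.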

Clustering methods
Hierarchical clustering:
–Main weakness: quadratic complexity
Partitional clustering:
–K-means
Linear time complexity
Simplicity of its algorithm

K-Means clustering algorithm
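
A minimal K-means sketch in Python is given below. It is a generic textbook version, not necessarily the variant shown on the slide: it uses Euclidean distance instead of a vector similarity such as cosine, and it takes the initial centroids as an explicit argument so that the result is deterministic.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def kmeans(points, initial_centroids, max_iter=100):
    """Return (assignment, centroids) after alternating the two K-means steps."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each iteration touches every point once per centroid, which is the source of the linear time complexity mentioned for partitional clustering.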

Measuring clustering quality
External quality: comparison of clusters with external classification
–Entropy: distribution of classes within clusters
–Purity: largest class in a cluster / cluster size
Internal quality: calculate average inter- and intra-cluster similarities
–Cohesiveness (overall similarity)
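
The two external measures can be computed from the class labels of the documents in each cluster. In this sketch the base-2 logarithm and the size-weighted average over clusters are assumptions following common convention; the slides give only the per-cluster definitions, and the sample clusters are hypothetical.

```python
import math
from collections import Counter

def purity(cluster_labels):
    """Share of the cluster taken by its largest class."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def entropy(cluster_labels):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = len(cluster_labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(cluster_labels).values())

def weighted_average(metric, clusters):
    """Average a per-cluster metric, weighting each cluster by its size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * metric(c) for c in clusters)

# Each inner list holds the true class of every document in one cluster.
clusters = [["a", "a", "a", "b"], ["b", "b"]]

overall_purity = weighted_average(purity, clusters)
overall_entropy = weighted_average(entropy, clusters)
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.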

Discussion
Text alone gives the best results
Text + tags: problem with weighting of tags vs. terms

Conclusion XML basics XML Retrieval Tasks and methods Clustering XML documents

Bayesian Networks

Context-dependent Retrieval
The score of an element is given by its RSV (Retrieval Status Value).
The RSV of a node depends on the RSVs of the nodes in its context (parent nodes).
Elements with the highest values are then presented to the user.

Bayesian Networks

Bayesian Networks(Cont.)