1 XML Information Retrieval Hui Fang Department of Computer Science University of Illinois at Urbana-Champaign Some slides are borrowed from Norbert Fuhr’s XML Tutorial.

2 Outline XML basics Research Topics XML IR –Tasks –Retrieval methods –Clustering XML documents

3 XML standards

4 Basic XML Hierarchical document format for information exchange in WWW Self describing data (tags) Nested element structure having a root Element data can have –Attributes –Sub-elements (Slides from Jayavel Shanmugasundaram )

5 Example XML document (attribute and element labelled): book “Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW”, author Richard Belew, address San Diego 92093

6 Tree structure of XML documents: book (id=“rbelew”) has children author and title (“Finding…”); author has children name (First name: Richard, Last name: Belew) and address (city: San Diego, Zip code: 92093)
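
The tree above can be rebuilt and walked with Python's standard library. This is only a sketch: the slide's original markup was lost in the transcript, so the tag names `firstname`, `lastname` and `zip` are assumptions, and `outline` is a helper written for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the book example; only the values and the
# id attribute come from the slides, the tag spellings are assumed.
BOOK_XML = """
<book id="rbelew">
  <author>
    <name><firstname>Richard</firstname><lastname>Belew</lastname></name>
    <address><city>San Diego</city><zip>92093</zip></address>
  </author>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
</book>
"""

def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, then any text."""
    text = (element.text or "").strip()
    attrs = "".join(f' {key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + attrs + (f": {text}" if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

book = ET.fromstring(BOOK_XML)
print("\n".join(outline(book)))
```

The root carries the attribute, and every value sits in a leaf element, which is the nested structure the slide describes.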

7 Basic XML standard does not deal with … Standardization of element names → XML namespaces Structure of element content → XML DTDs Data types of element content → XML schema

8 XML namespace Example content: Apples, Bananas; GPA Table, 80, 120 Provide a method to avoid element name conflicts

9 XML namespace(Cont.) Example content: Apples, Bananas; GPA Table, 80, 120 Provide a method to avoid element name conflicts
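
A small sketch of how namespaces resolve a name conflict, assuming the two examples both used a `table` element with different meanings. The tag names, prefixes and namespace URIs below are invented for illustration; only the values come from the slide.

```python
import xml.etree.ElementTree as ET

# Two different kinds of "table" in one document, kept apart by namespaces.
# The URIs and tag names are hypothetical.
DOC = """
<root xmlns:h="http://example.org/html" xmlns:f="http://example.org/furniture">
  <h:table><h:td>Apples</h:td><h:td>Bananas</h:td></h:table>
  <f:table><f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length></f:table>
</root>
"""
NS = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

root = ET.fromstring(DOC)
html_tables = root.findall("h:table", NS)        # only the markup table
furniture_tables = root.findall("f:table", NS)   # only the furniture table
cells = [td.text for td in html_tables[0].findall("h:td", NS)]
print(cells)
print(furniture_tables[0].findtext("f:name", namespaces=NS))
```

Each query sees only the `table` from its own namespace, so the two vocabularies can be mixed without ambiguity.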

10 XML Document Type Definition Define the document structure with a list of legal elements Example note content: Tove, Jani, Reminder, Have a rest!

11 Research Topics related to XML

12 Research Topics IR areas –Retrieval Models –Query Languages –… DB areas –Query Languages –System architecture –Apply relational DB technology to XML data –Streaming XML –XML Query Processing –XML indexing and compression –……

13 XML IR

14 INEX: Initiative for the Evaluation of XML Retrieval Documents: 12,107 articles in XML format Queries: 30 Content-only; 30 Content and structure Relevance Assessments: by participating groups Participants: 36 active groups in 2003

15 CO search task Document as hierarchical structure of nested elements Type of elements is not considered Query refers to content only Query syntax as in standard text retrieval Task: Find smallest subtree(element) satisfying the query
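
A minimal sketch of the "smallest subtree" idea, assuming plain Boolean matching (every query word must occur somewhere in the element's text) rather than the ranked retrieval that actual systems use. The sample document and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

def words_in(element):
    """All words in the element's subtree, lower-cased, as a set."""
    return set(" ".join(element.itertext()).lower().split())

def smallest_matches(element, terms):
    """Return the deepest elements whose subtree contains every query term."""
    if not terms <= words_in(element):
        return []                      # this subtree cannot satisfy the query
    deeper = [hit for child in element for hit in smallest_matches(child, terms)]
    return deeper or [element]         # prefer matching descendants over self

doc = ET.fromstring(
    "<article><sec id='s1'><p>augmented reality</p><p>in medicine</p></sec>"
    "<sec id='s2'><p>reality shows</p></sec></article>"
)
hits = smallest_matches(doc, {"augmented", "medicine"})
print([(hit.tag, hit.get("id")) for hit in hits])
```

Neither paragraph of the first section contains both words, so that section is the smallest answer; the enclosing article also matches but is less specific.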

16 Example of CO Topic Title: augmented reality and medicine Description: How virtual (or augmented) reality can contribute to improve the medical and surgical practice. Narrative: In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery). Keywords: augmented virtual reality medicine surgery improve computer assisted aided image

17 CAS search Task Queries contain explicit references to the XML structure, by restricting –The context of interest: target element –The context of certain search concepts: (content, context element) pairs

18 Example of CAS topic Target element: article Conditions: non-monotonic reasoning (bdy/sec); 1999 2000 (hdr//yr); -calendar; belief revision Description: Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries Keywords: non-monotonic reasoning, belief revision

19 XML Retrieval Methods XIRQL –XML query languages with IR-related features Language models JuruXML

20 XIRQL(I) CO Approaches : –Split document text into disjoint nodes –Index nodes separately –Aggregate indexing weights for higher- level elements (subtrees)

21 Index nodes as units for term weighting Application of known indexing functions (e.g. tf*idf)

22 Index nodes for relevance-oriented search [Figure: tree of an example document (class="H.3.3", author John Smith, title XML Retrieval) with chapter, section and heading elements (headings Introduction, XML Query Lang., Syntax, Examples; section text “We describe syntax of XQL”) and index nodes numbered 1 to 5] Q1: syntax ∧ example Q2: XQL

23 Combining weights … by disjunction Q1: syntax ∧ example Q2: XQL Section-level term weights (section1, section2): 0.5 example, 0.8 XQL, 0.7 syntax; chapter's own weight: 0.3 XQL. Aggregated at the chapter: 0.5 example, 0.7 syntax, 0.86 XQL (0.8+0.3-0.8*0.3=0.86). Q1 at the chapter: 0.7*0.5=0.35. Need to return most specific element satisfying the query!
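
The two calculations on this slide follow from treating term weights as probabilities of independent events. The sketch below reproduces the slide's numbers; the function names are chosen for this example.

```python
def p_or(a, b):
    """Probability that at least one of two independent events holds."""
    return a + b - a * b

def p_and(a, b):
    """Probability that two independent events both hold."""
    return a * b

# Q2: "XQL" occurs in a section (0.8) and in the chapter itself (0.3).
xql_at_chapter = p_or(0.8, 0.3)
# Q1: "syntax" (0.7) AND "example" (0.5), both visible only at chapter level.
q1_at_chapter = p_and(0.7, 0.5)

print(round(xql_at_chapter, 2), round(q1_at_chapter, 2))
```

Because the disjunction can only raise a weight, the chapter always scores at least as high as its sections, which is why the slide insists on returning the most specific element.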

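24 Combining weights … with augmentation weight Q2: XQL Section-level term weights (section1, section2): 0.5 example, 0.8 XQL, 0.7 syntax; chapter's own weight: 0.3 XQL; augmentation weight: 0.6. Aggregated at the chapter: 0.30 example, 0.42 syntax, 0.64 XQL (0.6*0.8=0.48; 0.48+0.3-0.48*0.3=0.636≈0.64).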
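
A short sketch of the same aggregation with an augmentation weight: each child weight is multiplied by 0.6 before being combined with the parent's own weight. The helper name `propagate` is an assumption made for this example.

```python
AUGMENTATION = 0.6  # augmentation weight shown on the slide

def propagate(child_weight, own_weight=0.0, aug=AUGMENTATION):
    """Down-weight a child's term weight, then OR it with the parent's own weight."""
    scaled = aug * child_weight
    return scaled + own_weight - scaled * own_weight

example_at_chapter = propagate(0.5)               # 0.6 * 0.5
syntax_at_chapter = propagate(0.7)                # 0.6 * 0.7
xql_at_chapter = propagate(0.8, own_weight=0.3)   # 0.48 OR 0.3

print(round(example_at_chapter, 2), round(syntax_at_chapter, 2), round(xql_at_chapter, 2))
```

With the augmentation weight, the section's 0.8 for "XQL" now beats the chapter's 0.64, so the more specific element ranks first.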

25 XIRQL(II) CAS approaches –Extension of XQL by Weighting and ranking Data types with vague predicates Structural relativism

26 XQL Expressions Path condition –search for single elements heading –parent-child: chapter/heading –ancestor-descendant: chapter//section –document root: /book/* Filter wrt. structure: //chapter[heading] Filter wrt. content: /document[@class=“H.3.3” $and$ author=“John Smith”]
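
XQL itself is not available in Python's standard library, but the limited XPath subset in `xml.etree.ElementTree` has close analogues for most of these expressions. The sketch below uses a hypothetical document modelled on the slides' example; the `$and$` filter is approximated by chaining two predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection wrapping one document like the one on the slides.
library = ET.fromstring("""
<library>
  <document class="H.3.3">
    <author>John Smith</author>
    <title>XML Retrieval</title>
    <chapter><heading>Introduction</heading></chapter>
    <chapter>
      <heading>XML Query Lang.</heading>
      <section><heading>Syntax</heading></section>
      <section><heading>Examples</heading></section>
    </chapter>
  </document>
</library>
""")

# parent-child: chapter/heading
chapter_headings = [h.text for h in library.findall(".//chapter/heading")]
# ancestor-descendant: chapter//section
nested_sections = library.findall(".//chapter//section")
# filter wrt. structure: //chapter[heading]
chapters_with_heading = library.findall(".//chapter[heading]")
# filter wrt. content: attribute test and child-text test, chained
by_content = library.findall("document[@class='H.3.3'][author='John Smith']")

print(chapter_headings, len(nested_sections), len(chapters_with_heading), len(by_content))
```

These are exact Boolean matches; the weighting and ranking extensions are what XIRQL adds on top of such path expressions.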

27 Data types with vague predicates Compares two values of a specific data-type –E.g. Near, broader, narrower Returns (probabilistic) matching value –E.g. “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago” → Ernst Olbrich, Darmstadt, 1899 P(Olbrich ≈ Ulbrich)=0.8 (phonetic similarity) P(1899 ≈ 1903)=0.9 (numeric similarity) P(Darmstadt ≈ Frankfurt)=0.7 (geographic distance)
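
The slide does not say how the three matching values are combined. Assuming the conditions are independent and joined by AND, as in the weight combination shown earlier, the overall match probability would be their product. This is a worked illustration under that assumption, not a statement of how XIRQL computes it.

```python
import math

# Matching values from the slide, one per vague predicate.
match_probabilities = {
    "name (phonetic similarity)": 0.8,
    "year (numeric similarity)": 0.9,
    "place (geographic distance)": 0.7,
}

# Assumption: independent conditions combined by conjunction.
overall = math.prod(match_probabilities.values())
print(round(overall, 3))
```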

28 Semantic Relativism Drop distinction attribute/element: ~author searches for attribute or element Generalize to data types: #personname searches for attribute/elements of specific data type

29 Language models Generate language models for each node in the tree Combine the children language models using linear interpolation Use EM approach to train the linear interpolation parameters
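
A minimal sketch of combining child language models by linear interpolation, assuming unigram maximum-likelihood models and fixed interpolation weights (the slides train these weights with EM, which is not shown here). The toy texts and function names are illustrative.

```python
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def interpolate(models, lambdas):
    """P(w) = sum_k lambda_k * P_k(w); the lambdas must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    vocabulary = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(lambdas, models))
        for word in vocabulary
    }

def query_likelihood(model, query):
    """Probability of generating every query word from the model."""
    probability = 1.0
    for word in query:
        probability *= model.get(word, 0.0)
    return probability

title = mle("dog stories".split())
body = mle("the cat chased the dog".split())
article = interpolate([title, body], [0.5, 0.5])   # parent node of title and body

print(query_likelihood(title, ["dog", "cat"]), query_likelihood(article, ["dog", "cat"]))
```

The title alone cannot generate "cat", so only the parent node, which mixes both children, gets a non-zero score for the query "dog and cat".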

30 Element-specific language models ---CO Approaches

31 Higher level nodes: mixture of language models Query: dog and cat 0.5

32 Type-specific language models --- CAS approaches

33 0.5 “Return components of type x where it has component y that contains the query term w” e.g. return documents where the title contains the word “bird” e.g. return documents where the body’s first section contains the word “dog”

34 Juru-XML Element-specific indexing+vector space model: –Transform query into set of (term,path)- conditions –Vague matching of path conditions –Modified cosine similarity as retrieval function

35 JuruXML(1) ---Transform Query

36 JuruXML(2) ---Vague matching of path conditions

37 JuruXML(3) ---Retrieval function Standard cosine similarity –w_Q(t_i): query term weight of term t_i –w_D(t_i): indexing weight of term t_i in the document Modified cosine similarity –w_Q(t_i, c_i^Q): query term weight of pair (t_i, c_i^Q) –w_D(t_i, c_i^D): indexing weight of pair (t_i, c_i^D) in the document
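
The formulas on this slide did not survive in the transcript. The sketch below shows one way a cosine over (term, context) pairs could work, assuming each matching pair is scaled by a context resemblance score. The resemblance function used here (query path must be a subsequence of the document path, scored by relative length) is an assumption for illustration, not necessarily the measure JuruXML uses.

```python
import math

def is_subsequence(short, long):
    """True if the steps of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(step in remaining for step in short)

def context_resemblance(query_ctx, doc_ctx):
    """Hypothetical vague path match in [0, 1]; 1.0 for identical paths."""
    q = query_ctx.strip("/").split("/")
    d = doc_ctx.strip("/").split("/")
    return (1 + len(q)) / (1 + len(d)) if is_subsequence(q, d) else 0.0

def norm(weights):
    return math.sqrt(sum(w * w for w in weights.values()))

def modified_cosine(query, doc):
    """query, doc: dicts mapping (term, context) -> weight."""
    dot = sum(
        wq * wd * context_resemblance(cq, cd)
        for (tq, cq), wq in query.items()
        for (td, cd), wd in doc.items()
        if tq == td
    )
    denominator = norm(query) * norm(doc)
    return dot / denominator if denominator else 0.0

query = {("xml", "article/title"): 1.0}
exact = {("xml", "article/title"): 1.0}
vague = {("xml", "article/bdy/sec/title"): 1.0}
other = {("xml", "book/author"): 1.0}
print(modified_cosine(query, exact), modified_cosine(query, vague), modified_cosine(query, other))
```

When every context matches exactly, the resemblance is 1 and the function reduces to the standard cosine over terms.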

38 JuruXML(4) ---Alternative approach (Merging contexts) For each query term (t_i, c_i^Q), treat all matched document terms (t_i, c_j^D) equally from the user perspective. Define a weight function w(c_i^Q) –E.g.

39 Clustering XML documents

40 Document similarity Document representation: document  N-dimensional vector –N= # document features –Feature sets Text only Tags only Text + Tags Feature weighting in the document vector Similarity measure--- vector similarity –E.g. cosine measure
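
A small sketch of the three feature sets and the cosine measure, assuming raw counts as feature weights (no tf*idf) and a `tag:` prefix to keep tag features apart from words. Both choices, and the two sample documents, are made up for this example.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def features(xml_text, use_text=True, use_tags=True):
    """Bag of features: words from the text and/or one 'tag:<name>' per element."""
    root = ET.fromstring(xml_text)
    bag = Counter()
    if use_tags:
        bag.update("tag:" + element.tag for element in root.iter())
    if use_text:
        bag.update(" ".join(root.itertext()).lower().split())
    return bag

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = "<article><title>xml retrieval</title></article>"
d2 = "<book><title>xml retrieval</title></book>"

text_only = cosine(features(d1, use_tags=False), features(d2, use_tags=False))
tags_only = cosine(features(d1, use_text=False), features(d2, use_text=False))
text_and_tags = cosine(features(d1), features(d2))
print(round(text_only, 2), round(tags_only, 2), round(text_and_tags, 2))
```

The two documents share all their words but only one of two tags, so the similarity depends directly on which feature set is chosen.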

41 Clustering methods Hierarchical clustering: –Main weakness: quadratic complexity Partitional clustering: –K-means Linear time complexity Simplicity of its algorithm

42 K-Means clustering algorithm
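
The slide's algorithm listing is not preserved in the transcript. As a reminder of the general technique, here is a minimal sketch of standard K-means, assuming squared Euclidean distance and random initial centroids drawn from the data; document clustering would typically use cosine similarity on document vectors instead.

```python
import random

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Return (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids from the data
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(column) / len(members) for column in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                     # converged
        centroids = new_centroids
    return assignment, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Each iteration touches every point once per centroid, which is the linear time behaviour in the number of documents that the previous slide contrasts with hierarchical clustering.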

43 Measuring clustering quality External quality: comparison of clusters with external classification –Entropy: distribution of classes within clusters –Purity: largest class in a cluster/cluster size Internal quality: calculate average inter- and intra-cluster similarities. –cohesiveness (overall similarity)
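
The two external measures can be computed per cluster from the true class labels of its members. A minimal sketch, assuming entropy is measured in bits; the sample labels are made up.

```python
import math
from collections import Counter

def purity(class_labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)

def entropy(class_labels):
    """Shannon entropy (in bits) of the class distribution within the cluster."""
    n = len(class_labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(class_labels).values())

cluster = ["sports", "sports", "sports", "politics"]
print(purity(cluster), round(entropy(cluster), 3))
```

A perfect cluster has purity 1 and entropy 0; a clustering-level score is usually the size-weighted average over all clusters.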

44 Discussion Text alone gives best results Text+tags: problem with weighting of tags vs. terms

45 Conclusion XML basics XML Retrieval Tasks and methods Clustering XML documents

46 Bayesian Networks

47 Context-dependent Retrieval The score of one element is given by RSV(Retrieval Status Value). RSV of node depends on RSVs of nodes in the context(parent nodes) Elements with highest values are then presented to the user.

48 Bayesian Networks

49 Bayesian Networks(Cont.)

