
1 XML Retrieval: A content-oriented perspective Mounia Lalmas Department of Computer Science Queen Mary, University of London

2 Outline  Part I - Content-oriented XML retrieval  Part II - Evaluating content-oriented XML retrieval

3 XML Retrieval: Motivation  XML is able to represent a mixture of “structured” information and “unstructured” text.  XML applications: digital libraries, content management.  XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.

4 XML Retrieval: DB and IR views  Data-centric view (DB) — XML as exchange format for structured data  Document-centric view (IR) — XML as format for representing the logical structure of documents  Now increasingly both views (DB+IR)

5 Data Centric XML Documents: Example — only the element text survives in this transcript: Thomas, Tassos, Christof.
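
The markup itself did not survive the transcription; a minimal sketch of what such a data-centric record might look like, with hypothetical element names, is:

    <students>
      <student><name>Thomas</name></student>
      <student><name>Tassos</name></student>
      <student><name>Christof</name></student>
    </students>

Here every element holds a single typed value, which is what makes the data-centric view amenable to database-style querying.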

6 Document Centric XML Documents: Example — only the text content survives in this transcript: Thomas. James Bond is the best student in the class. He scored 95 points out of 100. His presentation of Using Materialized Views in Data Warehouse was brilliant. Donald Duck is not a very good student. He scored 20 points…
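
A minimal sketch of how this might be marked up document-centrically, with hypothetical tag names and mixed text-and-element content (the hallmark of the document-centric view):

    <report author="Thomas">
      <para><student>James Bond</student> is the best student in the class. He scored
        <score>95</score> points out of 100. His presentation of <title>Using Materialized
        Views in Data Warehouse</title> was brilliant.</para>
      <para><student>Donald Duck</student> is not a very good student. He scored
        <score>20</score> points…</para>
    </report>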

7 Content-oriented XML retrieval  Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.  XML retrieval allows users to retrieve document components (elements) that are more focussed on their information needs, e.g. a chapter, a page, or several paragraphs of a book instead of an entire book.  The structure of documents is exploited to identify which document components to retrieve. Structure improves precision. Exploit visual memory.

8 Content-oriented XML retrieval [Figure: a book decomposed into chapters, sections and subsections, connected to the World Wide Web; the body text shown in the figure is placeholder filler.] XML retrieval allows users to retrieve document components that are more focussed, e.g. a subsection of a book instead of an entire book. SEARCHING = QUERYING + BROWSING

9 Focussed retrieval: Scientific Collection  Query: model checking aviation systems  Answer: one section in a workshop report

10 Focussed retrieval: Encyclopedia  Information need: volcanic eruption prediction  Answer: relatively small portion of the volcano topic

11 Focussed retrieval: Technical Manual  Query: segmentation fault windows services for unix  Answer: only a single paragraph in a long manual

12 XML: eXtensible Mark-up Language  Meta-language (user-defined tags) currently being adopted as the document format language by W3C  Used to describe content and structure (and not layout)  Grammar described in DTD (used for validation)  Example document (markup stripped in this transcript): Structured Document Retrieval, Smith, John, Introduction into SDR, …
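
The slide's example document lost its markup in the transcript above; a minimal sketch of an XML document with an inline DTD, using assumed element names, is:

    <!DOCTYPE book [
      <!ELEMENT book      (title, author, chapter+)>
      <!ELEMENT author    (lastname, firstname)>
      <!ELEMENT chapter   (title, paragraph*)>
      <!ELEMENT title     (#PCDATA)>
      <!ELEMENT lastname  (#PCDATA)>
      <!ELEMENT firstname (#PCDATA)>
      <!ELEMENT paragraph (#PCDATA)>
    ]>
    <book>
      <title>Structured Document Retrieval</title>
      <author><lastname>Smith</lastname><firstname>John</firstname></author>
      <chapter>
        <title>Introduction into SDR</title>
        <paragraph>…</paragraph>
      </chapter>
    </book>

A validating parser can check the document against the DTD, which is what the bullet on validation refers to.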

13 XML: eXtensible Mark-up Language  Use of XPath notation to refer to the XML structure chapter/title: title is a direct sub-component of chapter //title: any title chapter//title: title is a direct or indirect sub-component of chapter chapter/paragraph[2]: the second paragraph that is a direct sub-component of a chapter chapter/*: all direct sub-components of a chapter
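
A short Python sketch (using the lxml library; the document and element names are assumptions carried over from the previous example) showing how these XPath expressions select elements:

    from lxml import etree

    # Hypothetical example document (element names are assumptions).
    xml = """<book>
      <title>Structured Document Retrieval</title>
      <chapter>
        <title>Introduction into SDR</title>
        <paragraph>First paragraph …</paragraph>
        <paragraph>Second paragraph …</paragraph>
      </chapter>
    </book>"""
    book = etree.fromstring(xml)

    # //title: any title, at any depth.
    print([t.text for t in book.xpath("//title")])

    # chapter/title: a title that is a direct sub-component of a chapter
    # (evaluated relative to the book element here).
    print([t.text for t in book.xpath("chapter/title")])

    # chapter/paragraph[2]: the second paragraph that is a direct sub-component of a chapter.
    print([p.text for p in book.xpath("chapter/paragraph[2]")])

    # chapter/*: all direct sub-components of a chapter.
    print([e.tag for e in book.xpath("chapter/*")])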

14 XML Queries  Content-only (CO) queries  Standard IR queries, but here we are retrieving document components  “Wine tasting in Granada”  Structure-only queries  Usually not that useful from an IR perspective  “Paragraph containing a diagram next to a table”  Content-and-structure (CAS) queries  Put constraints on which types of components are to be retrieved, e.g. “Articles that contain sections about hotels in Granada, and that contain a picture of Alhambra, and return titles of these articles”  Where to look (support elements), what to return (target elements)
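
For illustration, the CAS example above might be expressed roughly as follows in the NEXI syntax used later in this talk (the tag names article, sec, fig and title are assumptions about the collection's markup, not the actual INEX DTD):

    //article[about(.//sec, hotels Granada) and about(.//fig, Alhambra)]//title

The about() clauses identify the support elements (where to look), and the final //title path identifies the target elements (what to return).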

15 Content-oriented XML retrieval Return document components at the right level of granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc), relevant to the user’s information need with regards to content and structure. SEARCHING = QUERYING + BROWSING

16 Right level of granularity: The challenge Query: wordnet information retrieval

17 (Simplified) Conceptual model [Figure: the classical IR indexing/retrieval pipeline extended with structure. Documents (content + structure) are indexed into an inverted file plus a structure index with statistics such as tf and idf; the query is formulated into a query representation; a retrieval function matches content and structure; the retrieval results present related components.]

18 Challenge 1: term weights. How to obtain document and collection statistics (e.g. tf, idf)? [Figure: an article tree (title, section 1, section 2) carrying element-level term weights such as 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring, matched against the query terms ?XML, ?retrieval, ?authoring.]

19 Challenge 2: augmentation weights. Which components contribute best to the content of “article”? [Figure: the same article tree, now with augmentation weights 0.5, 0.8 and 0.2 on the edges from the article to its title and two sections.]
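
A minimal sketch of the augmentation idea, assuming leaf elements carry term weights and each edge carries a factor that down-weights a child's contribution to its parent (the numbers are the illustrative ones from the figure, not real collection statistics, and this is not any particular INEX system's model):

    # Each element holds its own term weights; children are (augmentation_factor, element) pairs.
    # Augmentation factors down-weight a child's contribution when propagated to its parent.
    def element_score(element, query_terms):
        # Own contribution: sum of this element's weights for the query terms.
        score = sum(element.get("weights", {}).get(t, 0.0) for t in query_terms)
        # Propagated contribution from children, scaled by the augmentation factor.
        for factor, child in element.get("children", []):
            score += factor * element_score(child, query_terms)
        return score

    # Illustrative tree matching the figure.
    article = {
        "children": [
            (0.5, {"weights": {"XML": 0.9}}),                    # title
            (0.8, {"weights": {"XML": 0.5, "retrieval": 0.4}}),  # section 1
            (0.2, {"weights": {"XML": 0.2, "authoring": 0.7}}),  # section 2
        ]
    }

    print(element_score(article, ["XML", "retrieval", "authoring"]))

Choosing such factors, possibly per element type, is the estimation problem this challenge points at.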

20 Challenge 3: component weights. Which component type (tag) is a good retrieval unit? [Figure: the same article tree, with component weights 0.6, 0.4 and 0.5 attached to the element types.]

21 Challenge 4: overlapping elements. “Section 1” and “article” are both relevant to “XML retrieval”, so which one to return? [Figure: the same article tree, where both the article and section 1 match the query terms XML, retrieval and authoring.]
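
A minimal sketch of one common way to remove overlap from a result list (keep the best-scoring element on each path); this is an illustration rather than the strategy of any particular system, and the paths and scores are made up:

    # results: list of (xpath-like path, score); an element's path is a prefix of its descendants'.
    results = [
        ("/article[1]", 0.82),
        ("/article[1]/sec[1]", 0.91),
        ("/article[1]/sec[2]", 0.35),
    ]

    def remove_overlap(results):
        # Process best-first; drop any element that is an ancestor or descendant
        # of an element already selected.
        selected = []
        for path, score in sorted(results, key=lambda r: r[1], reverse=True):
            overlaps = any(path.startswith(p) or p.startswith(path) for p, _ in selected)
            if not overlaps:
                selected.append((path, score))
        return selected

    print(remove_overlap(results))  # keeps /article[1]/sec[1] and /article[1]/sec[2]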

22 Approaches … vector space model, probabilistic model, Bayesian network, language model, extending DB model, Boolean model, natural language processing, cognitive model, ontology, parameter estimation, tuning, smoothing, fusion, phrase, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief model, relevance feedback, divergence from randomness, machine learning

23 Content-oriented XML retrieval: Conclusion  Efficiency — Not just documents, but all their elements  Models — Statistics to be adapted or redefined — Combination of evidence  Users — What is focussed retrieval? — Do users really want elements?  Interface and presentation issues

24 Outline  Part I - Content-oriented XML retrieval  Part II - Evaluating content-oriented XML retrieval

25 Evaluation of XML retrieval: INEX Promote research and stimulate development of XML information access and retrieval, through  Creation of evaluation infrastructure and organisation of regular evaluation campaigns for system testing  Building of an XML information access and retrieval research community  Construction of test-suites Collaborative effort: participants contribute to the development of the collection Ends with a yearly workshop, in December, in Dagstuhl, Germany INEX has allowed a new community in XML information access to emerge, as shown by the number of publications (64, not final, in 2005; 37 in 2004; 13 in 2003).

26 INEX: Background  Sponsored by DELOS Network of Excellence for Digital Libraries under FP6 – IST programme  Mainly dependent on voluntary efforts  Coordination is distributed for tasks and tracks  Main institutions involved in coordination for 2005: University of Amsterdam, NL; University of Otago, NZ; University of Chile, CL; CWI, NL; Carnegie Mellon University, USA; IBM Research Lab, IL; University of Minnesota Duluth, USA; University of Paris 6, FR; Queensland University of Technology, AUS; University of California, Berkeley, USA; Royal School of LIS, DK; Queen Mary, University of London, UK; University of Duisburg-Essen, DE; INRIA-Rocquencourt, FR; Utrecht University, NL

27 INEX 2005 Participants 64 participants: 32 Europe; 12 N.America; 10 Asia; 5 Oceania; 5 other +3000 e-mails in 2005! Max-Planck-Institut fuer Informatik, Germany Information Studies, Royal School of LIS, Denmark University of California, Berkeley, USA Peking University, China University of Granada, Spain University of Amsterdam, The Netherlands University of Otago, New Zealand Queen Mary University of London, UK University of Toronto, Canada Utrecht University, The Netherlands City University London, UK University of Kaiserslautern, Germany INRIA-Rocquencourt, France University of Wollongong in Dubai IRIT - Toulouse, France RMIT University, Australia Ecoles des Mines de Saint-Etienne, France Queensland University of Technology, Australia University of Klagenfurt, Austria Fondazione Ugo Bordoni, Italy University of Tampere, Finland Carnegie Mellon University, USA Cornell University, USA University of Illinois at Urbana-Champaign, USA IBM Haifa Research Lab, Israel Ochanomizu University, Japan The Hebrew University of Jerusalem, Israel Laboratoire d’Informatique de Paris 6, France University of Minnesota Duluth, USA University of Rostock, Germany University of California, Los Angeles, USA University of Udine, Italy University of South-Brittany, France Nagoya University, Japan University of Waterloo, Canada Rutgers University, USA Kyungpook National University, Korea University of Chile, Chile Hiroshima City University, Japan University of Helsinki, Finland AT&T Labs-Research, USA Microsoft Research Lab Cambridge, UK University of Twente, The Netherlands Centre for Mathematics & Computer Science (CWI), NL University of Utah, USA University Duisburg-Essen, Germany University of Ostrava, Czech Republic Hong Kong Baptist University, Hong Kong University of Sheffield, UK Oslo University College, Norway L3S Research Center, Germany University of Michigan, USA CLIPS-IMAG Grenoble, France Wuhan University, China Nara Institute of Science and Technology, Japan Ritsumeikan University, Japan University of Tsukuba, Japan State University of Montes Claros, Montes Claros(MG), Brazil INRIA Sophia Antipolis Charles de Gaulle University - Lille 3 University of Siena, Italy Australian Research Council, Canberra, Australia University of Wollongong, Wollongong, Australia University of Padova, Italy

28 Test suite for evaluating retrieval performance Is your XML engine retrieving the relevant information, while at the same time avoiding returning irrelevant information?  Document collection  Topics reflecting realistic information needs  Retrieval tasks, stating what the XML search engine should return as answers  Relevance assessments, stating which elements are relevant to which topics  Metrics to measure retrieval effectiveness

29 INEX test suites  Documents ~500MB (+241 MB): 12,107 (16,819) articles in XML format from IEEE Computer Society journals and magazines; 8 million elements!  INEX 2002 60 topics, inex_eval metric  INEX 2003 66 topics, use of a subset of XPath, inex_eval and inex_eval_ng metrics  INEX 2004 75 topics, subset of the 2003 XPath subset (NEXI) Official metric: inex_eval Others: inex_eval_ng, XCG, t2i, ERR, PRUM, …  INEX 2005 87 topics, NEXI Official metric: XCG

30 INEX Topics  CO topic: open standards for digital video in distance learning  CAS topic: //article[about(.,'formal methods verify correctness aviation systems')]//sec//[about(.,'case study application model checking theorem proving')] — Candidate topics submitted by participants, must have some relevant elements, not too few and not too many — Selection process performed by INEX organisers

31 Retrieval tasks I  CO retrieval task, same as standard IR, but return elements: open standards for digital video in distance learning  +S retrieval task, where the user adds structural hints to the query to narrow down the number of returned elements //article//sec[about(.,open standards for digital video in distance learning)]  Three strategies: — Focussed strategy: assume that the user prefers a single element that is the most relevant. — Thorough strategy: assume that the user prefers all highly relevant elements. — Fetch and browse strategy: assume that the user is interested in highly relevant elements that are contained only within highly relevant articles.

32 Retrieval tasks II CAS retrieval task  where to look for the relevant elements (i.e. support elements)  what type of elements to return (i.e. target elements).  strict and vague interpretations applied to both support and target elements //article[about(.,'formal methods verify correctness aviation systems')]//sec//[about(.,'case study application model checking theorem proving')]

33 Relevance in XML retrieval Aim: the smallest component (specificity) that is highly relevant (exhaustivity). Specificity: extent to which a document component is focused on the information need, while being an informative unit. Exhaustivity: extent to which the information contained in a document component satisfies the information need. [Figure: an article with sections s1, s2, s3 and subsections ss1, ss2, whose parts are about “XML retrieval evaluation”, “XML retrieval” and “XML evaluation”.]

34 Relevance assessment task  Topics are assessed by the INEX participants  Use of an on-line interface  Completeness — Rules that force assessors to assess related elements — E.g. if an element is assessed relevant, its parent element and children elements must also be assessed — …  Consistency — Rules to enforce consistent assessments — E.g. the parent of a relevant element must also be relevant, although to a different extent — E.g. exhaustivity increases going up; specificity increases going down — …  Assessing a topic takes a week!  On average 2 topics per participant  Duplicate assessments (12 topics) in INEX 2004

35 % Agreement

36 Measuring effectiveness: Metrics  A research problem in itself!  Metrics  inex_eval - official INEX metric until 2004  inex_eval_ng  ERR (expected ratio of relevant units)  XCG (XML cumulative gain) - official INEX metric 2005  t2i (tolerance to irrelevance)  PRUM (Precision Recall with User Modelling)  HiXEval  …..
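
Most of these metrics share the underlying idea of accumulating graded gain down a ranked list; a simplified, generic cumulated-gain computation in that spirit (not the exact XCG definition, which additionally deals with overlap and near-misses) is:

    # ranking: gain values (graded relevance, e.g. derived from exhaustivity/specificity)
    # in the order the system returned elements.
    def cumulated_gain(ranking):
        cg, total = [], 0.0
        for gain in ranking:
            total += gain
            cg.append(total)
        return cg

    def normalised_cg(ranking, ideal_ranking):
        # Divide by the gain an ideal ordering would have accumulated at each rank.
        cg = cumulated_gain(ranking)
        icg = cumulated_gain(sorted(ideal_ranking, reverse=True))
        return [c / i if i > 0 else 0.0 for c, i in zip(cg, icg)]

    print(normalised_cg([1.0, 0.0, 0.5], [1.0, 0.5, 0.0]))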

37 What is the problem? Relevance propagates up!  ~26,000 relevant elements on ~14,000 relevant paths  Propagated assessments: ~45%  Increase in size of relevant elements: ~182%

38 Precision-Recall-based metric and Overlap [Figure: results for simulated runs.]

39 Overlap in results: Official INEX 2004 results for CO topics (1500 retrieved elements)
Rank  System (run)                                           Avg Prec   % Overlap
  1.  IBM Haifa Research Lab (CO-0.5-LAREFIENMENT)           0.1437     80.89
  2.  IBM Haifa Research Lab (CO-0.5)                        0.1340     81.46
  3.  University of Waterloo (Waterloo-Baseline)             0.1267     76.32
  4.  University of Amsterdam (UAms-CO-T-FBack)              0.1174     81.85
  5.  University of Waterloo (Waterloo-Expanded)             0.1173     75.62
  6.  Queensland University of Technology (CO_PS_Stop50K)    0.1073     75.89
  7.  Queensland University of Technology (CO_PS_099_049)    0.1072     76.81
  8.  IBM Haifa Research Lab (CO-0.5-Clustering)             0.1043     81.10
  9.  University of Amsterdam (UAms-CO-T)                    0.1030     71.96
 10.  LIP6 (simple)                                          0.0921     64.29

40 Final words  Challenging research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate!  INEX 2006 document collection — Wikipedia (XML English) document collection: full texts, marked up in XML, of about 1,900,000 articles — 228,546 categories, totaling +100 Gigabytes (10 Gigabytes without pictures) — 3000 different tags; an article has on average 500 XML nodes and an average depth of 5.  Additional tracks in 2006 — interactive, heterogeneous collection, document mining, relevance feedback, natural language query processing, multimedia, XML entity search

41 Thank you

