1 Intelligente und effiziente Suche auf semistrukturierten Daten Gerhard Weikum

Slides:



Advertisements
Similar presentations
Improved TF-IDF Ranker
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
IRDM WS Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query Processing 3.3 Index Access Scheduling.
Top-k Query Evaluation with Probabilistic Guarantees By Martin Theobald, Gerald Weikum, Ralf Schenkel.
DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)
Information Retrieval in Practice
Search Engines and Information Retrieval
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Xyleme A Dynamic Warehouse for XML Data of the Web.
CSE 574 – Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
The Informative Role of WordNet in Open-Domain Question Answering Marius Paşca and Sanda M. Harabagiu (NAACL 2001) Presented by Shauna Eggers CS 620 February.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining - revision Martin Russell.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
1 The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking Anja Theobald and Gerhard Weikum University of the Saarland Saarbrücken,
Top- K Query Evaluation with Probabilistic Guarantees Martin Theobald, Gerhard Weikum, Ralf Schenkel Presenter: Avinandan Sengupta.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Information Retrieval in Practice
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Teaching Metadata and Networked Information Organization & Retrieval The UNT SLIS Experience William E. Moen School of Library and Information Sciences.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Ontology Alignment/Matching Prafulla Palwe. Agenda ► Introduction  Being serious about the semantic web  Living with heterogeneity  Heterogeneity problem.
Search Engines and Information Retrieval Chapter 1.
1 The BT Digital Library A case study in intelligent content management Paul Warren
DELIS Kickoff, March 18-19, Outline 1 Overriding Goals & Structure 2 Background on P2P 3 Background on Search Engines 4 P2P SE Architecture 5 Research.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
VLDB ´04 Top-k Query Evaluation with Probabilistic Guarantees Martin Theobald Gerhard Weikum Ralf Schenkel Max-Planck Institute for Computer Science SaarbrückenGermany.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
1 BINGO! and Daffodil: Personalized Exploration of Digital Libraries and Web Sources Martin Theobald Max-Planck-Institut für Informatik Claus-Peter Klas.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Querying Structured Text in an XML Database By Xuemei Luo.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
1/28 Efficient Top-k Queries for XML Information Retrieval Gerhard Weikum Joint work with Ralf Schenkel.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Algorithmic Detection of Semantic Similarity WWW 2005.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
1 Information Retrieval LECTURE 1 : Introduction.
NRA Top k query processing using Non Random Access Only sequential access Only sequential accessAlgorithm 1) 1) scan index lists in parallel; 2) 2) consider.
Information Retrieval
Achieving Semantic Interoperability at the World Bank Designing the Information Architecture and Programmatically Processing Information Denise Bedford.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
BINGO!: Bookmark-Induced Gathering of Information Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum University of the Saarland Germany.
SEMI-STRUCTURED DATA (XML) 1. SEMI-STRUCTURED DATA ER, Relational, ODL data models are all based on schema Structure of data is rigid and known is advance.
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
Big Data: Every Word Managing Data Data Mining TerminologyData Collection CrowdsourcingSecurity & Validation Universal Translation Monolingual Dictionaries.
Efficient Top-k Querying over Social-Tagging Networks Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Xavier Parreira,
Data mining in web applications
Information Retrieval in Practice
Neighborhood - based Tag Prediction
Search Engine Architecture
Key Observation Theorem:
Information Retrieval
Introduction to Information Retrieval
Observations on DB & IR and Conclusions
Presentation transcript:

1 Intelligente und effiziente Suche auf semistrukturierten Daten Gerhard Weikum

2 Outline Motivation and Challenges XXL & XXL-light: IR on XML Data Role of Ontologies Ongoing and Future Work Efficient Evaluation of Top-k Queries

3 A Few Challenging Queries (on Web / Deep Web / Intranet / Personal Info) Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king? Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML? SB IR XML Who was the French woman that I met at the PC meeting where Paolo Atzeni was PC Chair? What are the most important results on large deviation theory? Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g? Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?

4 Homepage Title: Professor Name: Gerhard Weikum Address... City: SB Country: Germany Teaching:Research: What if the Semantic Web Existed and All Information Were in XML? Course Title: IR Description: Information retrieval... Syllabus... BookArticle... Project Title: Intelligent Search of XML Data Sponsor: German Science Foundation... Homepage Title: Lecturer Name: Ralf Schenkel Address: Max-Planck Institute for CS, Germany Teaching: Interests: Semistructured Data, IR Seminar Contents: Ranked Search... Literature Book Title: Statistical Language Models... Challenge: Diversity !

5 What if the Semantic Web Existed and All Information Were in XML? Play Title: Macbeth Personae... Persona: Duncan, King of Scotland PGroup Persona: Macbeth Persona: Banquo GrpDescr: Generals... Act... Title: Act I Scene... Title: Scene III Speech... Speaker: Macbeth Speech Line: So foul... Speaker: Third Witch Line: All hail, Macbeth, thou shalt be king hereafter! Challenge: Uncertainty !

6 What if the Semantic Web Existed and All Information Were in XML? Homepage Firstname: Sophie Lastname: Cluet Address: INRIA Rocquencourt, Le Chesnay, France Interests: XML,... Homepage Name: Antoinette Lagrange Address Hobbies: Painting,... Street: Rue de la Chimie 138 City: Paris Country: France Homepage Name: M.-C. Richard Address: Rue de Voltaire, Paris, France Biography:... mother of two children... International Conference... Challenge: Ambiguity ! Homepage Firstname: Maria Lastname: Sanchez Address: Main Street, Paris, Texas Gender: female

7 Our Research Agenda Web Applications Intranet Applications Deep Web Semantic Web Web Service Generator (UDDI, WSDL) BINGO! DB / Cache / Queue Mgr Craw- ler Focus Mgr Classi- fier Filter (XML...) Onto- logies Search Engine XXL Index Mgr Onto- logy Mgr Query Processor Ontology Service Portal Explorer HTML2XML Auto Annotation Materialography Portal Chamber of Crafts Portal... XXL-light (COMPASS Engine): Concept-oriented Unified Search on HTML / PDF / XML Data from Web / Portals / Files

8 Outline Motivation and Challenges XXL & XXL-light: IR on XML Data Role of Ontologies Ongoing and Future Work Efficient Evaluation of Top-k Queries

9 XML-IR Example (1) ETH Zürich Nat.-Techn. Fak. I Fachrichtung Informatik... Leistungsanalyse Warteschlangen... Sprachverarbeitung... Markovketten Uni Stuttgart Nat.-Techn. Fak. I Fachrichtung Informatik... Leistungsanalyse Warteschlangen... Sprachverarbeitung... Markovketten Uni Saarland Math & Engineering CS... Performance analysis... Queueing models.. Speech processing... Markov chains Uni: Uni Saarland Book Title: Stochastic... Author: R. Nelson Review:... Chapter on Markov chains School:... Dept:... CS... Teaching GradStudies Course: Speech processing School: Course: Performance analysis... Content:... Queueing models Lit:... Content:... Markov chains... Uni: Uni Stuttgart Inst: CS Course: Mobile Comm. Prerequisites:... Markov processes... Uni: Uni Augsburg Curriculum: E Commerce... Weekend: Data Mining... Dozent URL=... Inhalt... Semistructured data: elements, attributes, links organized as labeled graph

10 XML-IR Example (2) Uni: Uni Saarland Book Title: Stochastic... Author: R. Nelson Review:... Chapter on Markov chains School:... Dept:... CS Teaching GradStudies Course: Speech processing School: Course: Performance analysis... Content:... Queueing models Lit:... Content:... Markov chains... Uni: Uni Stuttgart Inst: CS Course: Mobile comm. Prerequisites:... Markov processes... Uni: Uni Augsburg Curriculum: E Commerce... Weekend: Data Mining... Outline:... statistical methods for classification Select U, C From Where Uni As U And U.#.School?.#.(Inst | Dept)+ As D And D Like „%CS%“ And D.#.Course As C And C.# Like „%Markov chain%“ Regular expressions over path labels Logical conditions over element contents +

11 Select U, C From Where Uni As U And U.#.School?.#.(Inst | Dept)+ As D And D Like „%CS%“ And D.#.Course As C And C.# Like „%Markov chain%“ XML-IR Example (3) Uni: Uni Saarland Book Title: Stochastic... Author: R. Nelson Review:... Chapter on Markov chains School:... Dept:... CS Teaching GradStudies Course: Speech processing School: Course: Performance analysis... Content:... Queueing models Lit:... Content:... Markov chains... Uni: Uni Stuttgart Inst: CS Course: Mobile comm. Prerequisites:... Markov processes... Uni: Uni Augsburg Curriculum: E Commerce... Weekend: Data Mining... Outline:... statistical methods for classification Uni As U Uni: U.#.School?.#.(Inst | Dept)+ As D Inst: School: Dept: D Like „%CS%“ CS D.#.Course As C Course: C.# Like „%Markov chain%“ Markov chains U, C

12 XML-IR Example (4) Uni: Uni Saarland Book Title: Stochastic... Author: R. Nelson Review:... Chapter on Markov chains School:... Dept:... CS Teaching GradStudies Course: Speech processing School: Course: Performance analysis... Content:... Queueing models Lit:... Content:... Markov chains... Uni: Uni Stuttgart Inst: CS Course: Mobile comm. Prerequisites:... Markov processes... Uni: Uni Augsburg Curriculum: E Commerce... Weekend: Data Mining... Outline:... statistical methods for classification Select U, C From Where Uni As U And U.# As D And D ~ „CS“ And D.#. ~ Course As C And C.# ~ „Markov chain“

13 XML-IR Example (5) Uni: Uni Saarland Book Title: Stochastic... Author: R. Nelson Review:... Chapter on Markov chains School:... Dept:... CS Teaching GradStudies Course: Speech processing School: Course: Performance analysis... Content:... Queueing models Lit:... Content:... Markov chains... Uni: Uni Stuttgart Inst: CS Course: Mobile comm. Prerequisites:... Markov processes... Uni: Uni Augsburg Curriculum: E Commerce... Weekend: Data Mining... Outline:... statistical methods for classification Select U, C From Where Uni As U And U.# As D And D ~ „CS“ And D.#. ~ Course As C And C.# ~ „Markov chain“ Dozent URL=... Inhalt... Result ranking of XML data based on semantic similarity

14 XML-IR Concepts Where clause: conjunction of restricted path expressions with binding of variables Example COMPASS (Concept-Oriented Multi-Format Portal-Aware Search System): simple, extensible core language – applicable to XML and HTML Select P, C, R From index Where ~professor As P And P = „Saarbruecken“ And P//~course = „Information Retrieval“ As C And P//~research = „~XML“ As R Elementary conditions on names and contents Semantic similarity conditions on names and contents Relevance scoring based on tf*idf similarity of contents, ontological similarity of names, probabilistic combination of conditions ~research = „~XML“ Query Semantics: query is a path/tree/graph pattern results are isomorphic paths/subtrees/subgraphs of the data graph Query Semantics: query is a pattern with relaxable conditions results are approximate matches to query with similarity scores

15 XML-IR Scoring Model Homepage Title: Professor Name: Gerhard Weikum Address City: SB Teaching: Research: Course Title: IR Description:... IR... XML IR... DB... IR Syllabus... Book Course Title: Ranked Search Contents:... IR... Book Title: Statistical Language Models Project Description:... XML... XSLT local score for elementary condition: based on tf*idf-style statistics for node or node context with score propagation global score for query:  local scores * compactness compactness of result: max{  node & edge weights | graph connecting matching nodes}  generalized MST (related to Steiner trees)

16 XML-IR Scoring Model Homepage Title: Professor Name: Gerhard Weikum Address City: SB Teaching: Research: Course Title: IR Description:... IR... XML IR... DB... IR Syllabus... Book Course Title: Ranked Search Contents:... IR... Book Title: Statistical Language Models Project Description:... XML... XSLT local score for elementary condition: based on tf*idf-style statistics for node or node context with score propagation global score for query:  local scores * compactness compactness of result: max{  node & edge weights | graph connecting matching nodes}  generalized MST (related to Steiner trees)

17 XML-IR Scoring Model Homepage Title: Professor Name: Gerhard Weikum Address City: SB Teaching: Research: Course Title: IR Description:... IR... XML IR... DB... IR Syllabus... Book Course Title: Ranked Search Contents:... IR... Book Title: Statistical Language Models Project Description:... XML... XSLT local score for elementary condition: based on tf*idf-style statistics for node or node context with score propagation global score for query:  local scores * compactness compactness of result: max{  node & edge weights | graph connecting matching nodes}  generalized MST (related to Steiner trees) Efficient score computation: heuristics work; advanced algorithms is open issue

18 Outline Motivation and Challenges XXL & XXL-light: IR on XML Data Role of Ontologies Ongoing and Future Work Efficient Evaluation of Top-k Queries

19 On Thesauri and Ontologies Thesaurus: repository („treasure“) of synonyms (and other relationships between words and concepts) Taxonomy: classification of concepts into groups (and trees of groups) Ontology: metaphysical study of the nature of being & existence Gazetteer: (geographical) dictionary of names XML schemas, DTDs, namespaces: syntactic conventions and standardized naming (plus typing info) Ontology (new definition): structured repository of knowledge with a description of concepts and relationships, possibly in the form of description logics formula Reasoning on Ontologies and Thesauri: Professor  Lecturer   hasStaff.Secretary Teaching  Course  Seminar  LabWork Professor  Academician Academician  Human Human  Carnivore...  logical inferences with sub-FOL calculus  transitive closures, shortest paths, etc. along generalizations poor man‘s ontology: pragmatic, rich, efficient  transitive closures, shortest paths, etc. along generalizations

20 Example WordNet woman, adult female – (an adult female person) => amazon, virago – (a large strong and aggressive woman) => donna -- (an Italian woman of rank) => geisha, geisha girl -- (...) => lady (a polite name for any woman)... => wife – (a married woman, a man‘s partner in marriage) => witch – (a being, usually female, imagined to have special powers derived from the devil)

21 Ontology Visualization

22 Ontology Graph An ontology graph is a directed graph with concepts (and their descriptions) as nodes and semantic relationships as edges (e.g., hypernyms). woman human body personality character lady witch nanny Mary Poppins fairy Lady Di heart... syn (1.0) hyper (0.9) part (0.3) mero (0.5) part (0.8) hypo (0.77) hypo (0.3) hypo (0.35) hypo (0.42) instance (0.2) instance (0.61) instance (0.1) Weighted edges capture strength of relationships  key for identifying closely related concepts

23 Statistics for Weighted Ontological Relations Dice coefficient: sim* (c1, cn) = Various correlation measures for sim(c1, c2): Gather statistics from large corpus or by (focused) Web crawl compute by (adaptation of) Dijkstra‘s shortest-path algorithm Transitive similarity: Jaccard coefficient: Conditional probabilites:

24 Benefits from Ontology Service Threshold-based query expansion Query keyword disambiguation Mapping of concept-value query conditions onto Deep-Web portals Ontology service accessible via SOAP or RMI Ontology filled with WordNet, geo gazetteer, focused crawl results, extracted tables & forms Support for automatic tagging of HTML and enhanced XML tags usefor for:

25 Query Expansion Threshold-based query expansion: substitute ~w by (c 1 |... | c k ) with all c i for which sim(w, c i )   „Old hat“ in IR; highly disputed for danger of topic dilution Approach to careful expansion: determine phrases from query or best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries) if uniquely mapped to one concept then expand with synonyms and weighted hyponyms

26 Query Expansion Example Title: International Organized Crime Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved. Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20],...}} From TREC 2004 Robust Track: sorted accesses in s. Results: 1.Interpol Chief on Fight Against Narcotics 2.Economic Counterintelligence Tasks Viewed 3.Supreme Procuratorate Work Report to NPC 4.Crime and Punishment in the PRC 5.SWITZERLAND CALLED SOFT ON CRIME... A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies.... Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials....

27 C++: I have implemented this method in Java, as a class that inherits methods from the socket class.... Keyword-to-Concept Mapping and Word Sense Disambiguation Approach for query keyword disambiguation: form contexts con(w) and con(c i ) for keyword w and potential target concepts c i  {c 1,..., c k } bag-of-words similarity sim(con(w), con(c)) based on cos or KL diff choose concept argmax c {sim(con(w), con(c))} Example: „Java class socket“ vs. „Java beach snorkeling“ Which concept should „Java“ be mapped to for query expansion? Note: unlike in LSI or pLSI, concepts are explicit, not latent! object-oriented programming language Java (3): a simple platform-independent class, category, family (4) class (9):... program... Java (1): an island q: Java class socket

28 What About Deep Web and Web Services? Mapping of concept-value query conditions onto Deep-Web portals: ~sheetmusic = „~flute“  instrument = (flute | piccolo | recorder)  category = reeds  style = ( classical | jazz | folk )... Observations: Deep Web has > hidden databases with > 500 billion (5*10 11 ) dynamic pages High „redundancy“ among query forms  enables exploitation of statistics

29 What About Deep Web and Web Services? Mapping of concept-value query conditions onto Deep-Web portals: ~sheetmusic = „~flute“  instrument = (flute | piccolo | recorder)  category = reeds  style = ( classical | jazz | folk )... Approach: crawl Web forms, tables, WSDL mine for multivariate correlations among N-tuples of (concept, value) pairs, examples: ((instrument, flute), (category, reeds)) ((make, Audi), (model, A4), (color, red)) ((title, *), (author, )) map query to target form/WSDL for max likelihood (taking into account ontological similarities)

30 Outline Motivation and Challenges XXL & XXL-light: IR on XML Data Role of Ontologies Ongoing and Future Work Efficient Evaluation of Top-k Queries

31 Top-k Query Processing with Scoring Naive QP algorithm: candidate-docs :=  ; for i=1 to z do { candidate-docs := candidate-docs  index-lookup(ti) }; for each dj  candidate-docs do {compute score(q,dj)}; sort candidate-docs by score(q,dj) descending; algorithm B+ tree on terms 17: : performance... z-transform... 52: : : : : : : : : : : : : 0.144: : : 0.6 index lists with (DocId, tf*idf) sorted by DocId Given: query q = t1 t2... tz with z (conjunctive) keywords similarity scoring function score(q,d) for docs d  D, e.g.: Find: top k results with regard to score(q,d) (e.g.:  i  q s i (d)) Google: > 10 mio. terms > 4 bio. docs > 2 TB index

32 “Fagin’s TA” (Fagin’01; Güntzer/Kießling/Balke; Nepal et al.) scan all lists L i (i=1..m) in parallel: consider dj at position pos i in Li; high i := s i (dj); if dj  top-k then { look up s (dj) in all lists L with  i; // random access compute s(dj) := aggr {s (dj) | =1..m}; if s(dj) > min score among top-k then add dj to top-k and remove min-score d from top-k; }; threshold := aggr {high | =1..m}; if min score among top-k  threshold then exit; m=3 aggr: sum k=2 f: 0.5 b: 0.4 c: 0.35 a: 0.3 h: 0.1 d: 0.1 a: 0.55 b: 0.2 f: 0.2 g: 0.2 c: 0.1 h: 0.35 d: 0.35 b: 0.2 a: 0.1 c: 0.05 f: 0.05 f: 0.75 a: 0.95 top-k: b: 0.8 but random accesses are expensive ! applicable to XML data: course ~ „Internet“ and ~topic = „performance“

33 TA-Sorted scan index lists in parallel: consider dj at position pos i in Li; E(dj) := E(dj)  {i}; high i := si(q,dj); bestscore(dj) := aggr{x1,..., xm) with xi := si(q,dj) for i  E(dj), high i for i  E(dj); worstscore(dj) := aggr{x1,..., xm) with xi := si(q,dj) for i  E(dj), 0 for i  E(dj); top-k := k docs with largest worstscore; threshold := bestscore{d | d not in top-k}; if min worstscore among top-k  threshold then exit; m=3 aggr: sum k=2 a: 0.55 b: 0.2 f: 0.2 g: 0.2 c: 0.1 h: 0.35 d: 0.35 b: 0.2 a: 0.1 c: 0.05 f: 0.05 top-k: candidates: f: 0.5 b: 0.4 c: 0.35 a: 0.3 h: 0.1 d: 0.1 f: ?  a: 0.95 h: ?  b: 0.8 d: ?  c: ?  g: ?  h: ?  d: ? 

34 Top-k Queries with Probabilistic Guarantees TA family of algorithms based on invariant (with sum as aggr) Relaxed into probabilistic invariant where the RV S i has some (postulated and/or estimated) distribution in the interval (0,high i ] f: 0.5 b: 0.4 c: 0.35 a: 0.3 h: 0.1 d: 0.1 a: 0.55 b: 0.2 f: 0.2 g: 0.2 c: 0.1 h: 0.35 d: 0.35 b: 0.2 a: 0.1 c: 0.05 f: 0.05 S1S1 S2S2 S3S3 Discard candidates with p(d) ≤  Exit index scan when candidate list empty high i

35 Probabilistic Threshold Test postulating uniform or Zipf score distribution in [0, high i ] compute convolution using LSTs use Chernoff-Hoeffding tail bounds or generalized bounds for correlated dimensions (Siegel 1995) fitting Poisson distribution (or Poisson mixture) over equidistant values: easy and exact convolution distribution approximated by histograms: precomputed for each dimension dynamic convolution at query-execution time with independent Si‘s or with correlated Si‘s engineering-wise histograms work best! 0 f 2 (x) 1 high 2 Convolution (f 2 (x), f 3 (x)) 2 0 δ(d) f 3 (x) high  cand doc d with 2  E(d), 3  E(d)

36 Prob-sorted Algorithm (Smart Variant) Prob-sorted (RebuildPeriod r, QueueBound b):... scan all lists Li (i=1..m) in parallel: …same code as TA-sorted… // queue management for all priority queues q for which d is relevant do insert d into q with priority bestscore(d); // periodic clean-up if step-number mod r = 0 then // rebuild; single bounded queue if strategy = Smart then for all queue elements e in q do update bestscore(e) with current high_i values; rebuild bounded queue with best b elements; if prob[top(q) can qualify for top-k] <  then exit; if all queues are empty then exit;

37 TA-sortedProb-sorted (smart) #sorted accesses2,263,652527,980 elapsed time [s] max queue size relative recall10.69 rank distance039.5 score error Performance Results for.Gov Queries on.GOV corpus from TREC-12 Web track: 1.25 Mio. docs (html, pdf, etc.) 50 keyword queries, e.g.: „Lewis Clark expedition“, „juvenile delinquency“, „legalization Marihuana“, „air bag safety reducing injuries death facts“ speedup by factor 10 at high precision/recall (relative to TA-sorted); aggressive queue mgt. even yields factor 100 at % prec./recall

38.Gov Expanded Queries on.GOV corpus with query expansion based on WordNet synonyms: 50 keyword queries, e.g.: „juvenile delinquency youth minor crime law jurisdiction offense prevention“, „legalization marijuana cannabis drug soft leaves plant smoked chewed euphoric abuse substance possession control pot grass dope weed smoke“ TA-sortedProb-sorted (smart) #sorted accesses22,403,49018,287,636 elapsed time [s] max queue size relative recall10.88 rank distance014.5 score error00.035

39 Performance Results for IMDB Queries on IMDB corpus (Web site: Internet Movie Database): movies, 1.2 Mio. persons (html/xml) 20 structured/text queries with Dice-coefficient-based similarities of categorical attributes Genre and Actor, e.g.: Genre  {Western}  Actor  {John Wayne, Katherine Hepburn}  Description  {sheriff, marshall}, Genre  {Thriller}  Actor  {Arnold Schwarzenegger}  Description  {robot} TA-sortedProb-sorted (smart) #sorted accesses1,003,650403,981 elapsed time [s] max queue size relative recall10.75 rank distance score error00.25

40 Outline Motivation and Challenges XXL & XXL-light: IR on XML Data Role of Ontologies Ongoing and Future Work Efficient Evaluation of Top-k Queries

41 Exploiting Collective Human Input for Collaborative Web Search - Beyond Relevance Feedback and Beyond Google - href links are human endorsements  PageRank, etc. Opportunity: online analysis of human input & behavior may compensate deficiencies of search engine Typical scenario for 3-keyword user query: a & b & c  top 10 results: user clicks on ranks 2, 5, 7 Challenge: How can we use knowledge about the collective input of all users in a large community?  top 10 results: user modifies query into a & b & c & d user modifies query into a & b & e user modifies query into a & b & NOT c  top 10 results: user selects URL from bookmarks user jumps to portal user asks friend for tips query logs, bookmarks, etc. provide human assessments & endorsements correlations among words & concepts and among documents

42 Concluding Remarks XML and Semantic Web are key assets, but by themselves not sufficient; we need to cope with diversity, incompleteness, and uncertainty  absolute need for ranked retrieval long-term goal: exploit the Web‘s potential for being the world‘s largest knowledge base view information organization and information search as dual views of the same problem combine techniques from DBS, IR, CL, AI, and ML need better theory about quality/efficiency tradeoffs as well as large-scale experiments