1 Intelligent and Efficient Search on Semistructured Data Gerhard Weikum
2 Outline
- Motivation and Challenges
- XXL & XXL-light: IR on XML Data
- Role of Ontologies
- Efficient Evaluation of Top-k Queries
- Ongoing and Future Work
3 A Few Challenging Queries (on Web / Deep Web / Intranet / Personal Info)
- Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king?
- Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML? (keywords: SB IR XML)
- Who was the French woman that I met at the PC meeting where Paolo Atzeni was PC Chair?
- What are the most important results on large deviation theory?
- Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g?
- Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?
4 What if the Semantic Web Existed and All Information Were in XML?
Homepage
  Title: Professor
  Name: Gerhard Weikum
  Address...
    City: SB
    Country: Germany
  Teaching:
    Course
      Title: IR
      Description: Information retrieval...
      Syllabus...
      Book, Article...
  Research:
    Project
      Title: Intelligent Search of XML Data
      Sponsor: German Science Foundation...
Homepage
  Title: Lecturer
  Name: Ralf Schenkel
  Address: Max-Planck Institute for CS, Germany
  Teaching:
    Seminar
      Contents: Ranked Search...
      Literature
        Book
          Title: Statistical Language Models...
  Interests: Semistructured Data, IR
Challenge: Diversity!
5 What if the Semantic Web Existed and All Information Were in XML?
Play
  Title: Macbeth
  Personae...
    Persona: Duncan, King of Scotland
    PGroup
      Persona: Macbeth
      Persona: Banquo
      GrpDescr: Generals...
  Act...
    Title: Act I
    Scene...
      Title: Scene III
      Speech...
        Speaker: Macbeth
        Line: So foul...
      Speech
        Speaker: Third Witch
        Line: All hail, Macbeth, thou shalt be king hereafter!
Challenge: Uncertainty!
6 What if the Semantic Web Existed and All Information Were in XML?
Homepage
  Firstname: Sophie
  Lastname: Cluet
  Address: INRIA Rocquencourt, Le Chesnay, France
  Interests: XML,...
Homepage
  Name: Antoinette Lagrange
  Address
    Street: Rue de la Chimie 138
    City: Paris
    Country: France
  Hobbies: Painting,...
Homepage
  Name: M.-C. Richard
  Address: Rue de Voltaire, Paris, France
  Biography:... mother of two children...
Homepage
  Firstname: Maria
  Lastname: Sanchez
  Address: Main Street, Paris, Texas
  Gender: female
International Conference...
Challenge: Ambiguity!
7 Our Research Agenda
[Architecture diagram]
Application areas: Web Applications, Intranet Applications, Deep Web, Semantic Web
BINGO!: Crawler, Focus Mgr, Classifier, Filter (XML...), DB / Cache / Queue Mgr, Ontologies
Search Engine XXL: Index Mgr, Ontology Mgr, Query Processor
Further components: Ontology Service, Web Service Generator (UDDI, WSDL), Portal Explorer, HTML2XML Auto Annotation
Portals: Materialography Portal, Chamber of Crafts Portal, ...
XXL-light (COMPASS Engine): Concept-oriented Unified Search on HTML / PDF / XML Data from Web / Portals / Files
8 Outline, next section: XXL & XXL-light: IR on XML Data
9 XML-IR Example (1)
Sample pages (German ones in translation):
  ETH Zürich and Uni Stuttgart: Faculty of Natural Sciences and Technology I, Department of Computer Science ... Performance analysis, Queueing ... Speech processing ... Markov chains
  Uni Saarland: Math & Engineering, CS ... Performance analysis ... Queueing models ... Speech processing ... Markov chains
Data graph:
  Uni: Uni Saarland; Book (Title: Stochastic..., Author: R. Nelson, Review: ... Chapter on Markov chains); School: ...; Dept: ... CS ...; Teaching; GradStudies; Course: Speech processing; School: Course: Performance analysis ... (Content: ... Queueing models; Lit: ...; Content: ... Markov chains ...)
  Uni: Uni Stuttgart; Inst: CS; Course: Mobile Comm. (Prerequisites: ... Markov processes ...)
  Uni: Uni Augsburg; Curriculum: E Commerce ...; Weekend: Data Mining ... (Lecturer URL=..., Content ...)
Semistructured data: elements, attributes, and links organized as a labeled graph
10 XML-IR Example (2)
Data graph as in Example (1); the Uni Augsburg branch also contains Outline: ... statistical methods for classification.
Query:
  Select U, C
  From ...
  Where Uni As U
  And U.#.School?.#.(Inst | Dept)+ As D
  And D Like "%CS%"
  And D.#.Course As C
  And C.# Like "%Markov chain%"
The query combines regular expressions over path labels with logical conditions over element contents.
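The idea of "regular expressions over path labels" can be illustrated with a small sketch. It is not the XXL implementation: it assumes that `#` stands for an arbitrary path of zero or more elements, and it translates the slide's expression `Uni.#.School?.#.(Inst | Dept)+` by hand into a Python regex over "/"-terminated label sequences. The helper name `matches` is hypothetical.

```python
import re

# Assumption: '#' matches any path of zero or more element labels.
ANY = r"(?:[^/]+/)*"

# Hand-written translation of  Uni.#.School?.#.(Inst | Dept)+
PATTERN = re.compile(rf"Uni/{ANY}(?:School/)?{ANY}(?:(?:Inst|Dept)/)+")

def matches(labels):
    """Return True if the label path (root first) matches the path expression."""
    path = "".join(label + "/" for label in labels)
    return PATTERN.fullmatch(path) is not None

print(matches(["Uni", "School", "Dept"]))    # path ends in Dept
print(matches(["Uni", "Inst"]))              # School is optional
print(matches(["Uni", "School", "Course"]))  # does not end in Inst or Dept
```

Terminating every label with "/" keeps the regex from matching a label prefix (for example `Inst` inside a longer label).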
11 XML-IR Example (3)
Same data graph as in Example (2). Each part of the query binds to parts of the data graph:
  Uni As U: the Uni: elements
  U.#.School?.#.(Inst | Dept)+ As D: the School:, Inst:, and Dept: elements
  D Like "%CS%": the content CS
  D.#.Course As C: the Course: elements
  C.# Like "%Markov chain%": the content Markov chains
  Select U, C: the resulting bindings of U and C
12 XML-IR Example (4)
Same data graph as in Example (2). Relaxed query with similarity conditions (~):
  Select U, C
  From ...
  Where Uni As U
  And U.# As D
  And D ~ "CS"
  And D.#.~Course As C
  And C.# ~ "Markov chain"
13 XML-IR Example (5)
Same data graph as in Example (2), additionally showing the German-labeled elements Dozent (lecturer) URL=... and Inhalt (content) ...
Query:
  Select U, C
  From ...
  Where Uni As U
  And U.# As D
  And D ~ "CS"
  And D.#.~Course As C
  And C.# ~ "Markov chain"
Result ranking of XML data based on semantic similarity
14 XML-IR Concepts
Where clause: conjunction of restricted path expressions with binding of variables
- Elementary conditions on names and contents
- Semantic similarity conditions on names and contents, e.g. ~research = "~XML"
- Relevance scoring based on tf*idf similarity of contents, ontological similarity of names, and probabilistic combination of conditions
Example:
  Select P, C, R
  From index
  Where ~professor As P
  And P = "Saarbruecken"
  And P//~course = "Information Retrieval" As C
  And P//~research = "~XML" As R
COMPASS (Concept-Oriented Multi-Format Portal-Aware Search System): simple, extensible core language, applicable to XML and HTML
Query semantics (exact): the query is a path/tree/graph pattern; results are isomorphic paths/subtrees/subgraphs of the data graph
Query semantics (with similarity): the query is a pattern with relaxable conditions; results are approximate matches to the query with similarity scores
17 XML-IR Scoring Model
Example data: Homepage (Title: Professor; Name: Gerhard Weikum; Address: City: SB; Teaching:; Research:); Course (Title: IR; Description: ... IR ... XML IR ... DB ... IR; Syllabus ...; Book); Course (Title: Ranked Search; Contents: ... IR ...; Book Title: Statistical Language Models); Project (Description: ... XML ... XSLT)
- Local score for an elementary condition: based on tf*idf-style statistics for a node or node context, with score propagation
- Global score for a query: local scores * compactness
- Compactness of a result: max{ node & edge weights | graph connecting matching nodes}, a generalized MST (related to Steiner trees)
- Efficient score computation: heuristics work; advanced algorithms are an open issue
18 Outline, next section: Role of Ontologies
19 On Thesauri and Ontologies
- Thesaurus: repository ("treasure") of synonyms (and other relationships between words and concepts)
- Taxonomy: classification of concepts into groups (and trees of groups)
- Ontology: metaphysical study of the nature of being & existence
- Gazetteer: (geographical) dictionary of names
- XML schemas, DTDs, namespaces: syntactic conventions and standardized naming (plus typing info)
- Ontology (new definition): structured repository of knowledge with a description of concepts and relationships, possibly in the form of description logic formulas
Reasoning on Ontologies and Thesauri:
- Example axioms relate Professor to Lecturer and hasStaff.Secretary, and Teaching to Course, Seminar, LabWork; a generalization chain leads from Professor to Academician to Human to Carnivore ...
- Logical inferences with a sub-FOL calculus
- "Poor man's ontology": transitive closures, shortest paths, etc. along generalizations; pragmatic, rich, efficient
20 Example WordNet
woman, adult female - (an adult female person)
  => amazon, virago - (a large strong and aggressive woman)
  => donna - (an Italian woman of rank)
  => geisha, geisha girl - (...)
  => lady - (a polite name for any woman)
  ...
  => wife - (a married woman, a man's partner in marriage)
  => witch - (a being, usually female, imagined to have special powers derived from the devil)
21 Ontology Visualization
22 Ontology Graph
An ontology graph is a directed graph with concepts (and their descriptions) as nodes and semantic relationships as edges (e.g., hypernyms).
[Figure: example graph with nodes woman, human, body, personality, character, lady, witch, nanny, Mary Poppins, fairy, Lady Di, heart, ... and weighted edges syn (1.0), hyper (0.9), part (0.3), mero (0.5), part (0.8), hypo (0.77), hypo (0.3), hypo (0.35), hypo (0.42), instance (0.2), instance (0.61), instance (0.1)]
Weighted edges capture the strength of relationships, which is key for identifying closely related concepts.
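23 Statistics for Weighted Ontological Relations
Gather statistics from a large corpus or by a (focused) Web crawl.
Various correlation measures for sim(c1, c2), with docs(c) denoting the set of documents that contain concept c:
  Dice coefficient: $\mathrm{sim}(c_1, c_2) = \frac{2\,|docs(c_1) \cap docs(c_2)|}{|docs(c_1)| + |docs(c_2)|}$
  Jaccard coefficient: $\mathrm{sim}(c_1, c_2) = \frac{|docs(c_1) \cap docs(c_2)|}{|docs(c_1) \cup docs(c_2)|}$
  Conditional probabilities: $P[c_1 \mid c_2] = \frac{|docs(c_1) \cap docs(c_2)|}{|docs(c_2)|}$
Transitive similarity: $\mathrm{sim}^*(c_1, c_n) = \max_{\text{paths } c_1, \ldots, c_n} \prod_{i=1}^{n-1} \mathrm{sim}(c_i, c_{i+1})$, computed by (an adaptation of) Dijkstra's shortest-path algorithm.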
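A minimal sketch of how Dijkstra's algorithm might be adapted to transitive similarity, assuming that the similarity of a path is the product of its edge weights and that all weights lie in (0, 1]. Under that assumption the product can only shrink along a path, so the usual greedy argument carries over with "largest product first" in place of "shortest distance first". The tiny graph below is hypothetical; the slide's figure does not say which nodes its weighted edges connect.

```python
import heapq

def transitive_similarity(graph, source):
    """graph: {concept: {neighbor: sim}} with sim in (0, 1].
    Returns {concept: best product of edge weights over all paths from source}."""
    best = {source: 1.0}
    heap = [(-1.0, source)]          # max-heap via negated similarity
    while heap:
        neg_sim, node = heapq.heappop(heap)
        sim = -neg_sim
        if sim < best.get(node, 0.0):
            continue                 # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = sim * weight
            if candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return best

# Hypothetical edges and weights, for illustration only.
graph = {
    "woman": {"lady": 0.77, "Lady Di": 0.2},
    "lady": {"Lady Di": 0.61},
}
sims = transitive_similarity(graph, "woman")
print(sims["Lady Di"])   # the two-step path via "lady" beats the direct edge
```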
24 Benefits from Ontology Service
Ontology service accessible via SOAP or RMI
Ontology filled with WordNet, geo gazetteer, focused crawl results, extracted tables & forms
Useful for:
- Threshold-based query expansion
- Support for automatic tagging of HTML and enhanced XML tags
- Query keyword disambiguation
- Mapping of concept-value query conditions onto Deep-Web portals
25 Query Expansion
Threshold-based query expansion: substitute ~w by (c_1 | ... | c_k) with all c_i for which sim(w, c_i) is at least a given threshold
"Old hat" in IR; highly disputed because of the danger of topic dilution
Approach to careful expansion:
- determine phrases from the query or from the best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
- if a phrase is uniquely mapped to one concept, then expand with synonyms and weighted hyponyms
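Threshold-based expansion can be sketched in a few lines. The function name `expand`, the similarity table, and the threshold value are assumptions for illustration; in the setting of the talk the similarities would come from the weighted ontology graph.

```python
def expand(term, similarities, threshold):
    """Replace ~term by the term itself plus all concepts whose similarity
    to the term reaches the threshold, each paired with its weight."""
    related = similarities.get(term, {})
    alternatives = [(c, s) for c, s in related.items() if s >= threshold]
    alternatives.sort(key=lambda cs: (-cs[1], cs[0]))   # strongest first
    return [(term, 1.0)] + alternatives

# Hypothetical similarity values.
similarities = {
    "woman": {"adult female": 1.0, "lady": 0.77, "nanny": 0.35, "witch": 0.3},
}
expanded = expand("woman", similarities, threshold=0.5)
print(expanded)
```

A higher threshold keeps the expansion close to the original term, which is one way to limit the topic dilution mentioned above.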
26 Query Expansion Example
From TREC 2004 Robust Track: ... sorted accesses in ... s.
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20],...}}
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Supreme Procuratorate Work Report to NPC
4. Crime and Punishment in the PRC
5. SWITZERLAND CALLED SOFT ON CRIME
...
Excerpts from result documents:
  A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies....
  Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials....
27 Keyword-to-Concept Mapping and Word Sense Disambiguation
Approach for query keyword disambiguation:
- form contexts con(w) and con(c_i) for keyword w and potential target concepts c_i ∈ {c_1, ..., c_k}
- compute bag-of-words similarity sim(con(w), con(c)) based on cosine or KL divergence
- choose the concept argmax_c {sim(con(w), con(c))}
Example: "Java class socket" vs. "Java beach snorkeling". Which concept should "Java" be mapped to for query expansion?
  q: Java class socket
  Sample text: C++: I have implemented this method in Java, as a class that inherits methods from the socket class....
  Candidate senses: Java (1): an island ...; Java (3): a simple platform-independent object-oriented programming language; class, category, family (4); class (9): ... program ...
Note: unlike in LSI or pLSI, concepts are explicit, not latent!
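The disambiguation step can be sketched with cosine similarity over bag-of-words contexts. The context word lists below are invented for illustration (real contexts would be built from the ontology's glosses and neighboring concepts), and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(keyword_context, concept_contexts):
    """Pick the concept whose context is most similar to the keyword's context."""
    query_bag = Counter(keyword_context)
    return max(concept_contexts,
               key=lambda c: cosine(query_bag, Counter(concept_contexts[c])))

# Hypothetical contexts con(c) for two senses of "Java".
concept_contexts = {
    "Java (island)": ["island", "indonesia", "beach", "sea", "volcano"],
    "Java (programming language)": ["object-oriented", "programming",
                                    "language", "class", "platform"],
}
print(disambiguate(["java", "class", "socket"], concept_contexts))
print(disambiguate(["java", "beach", "snorkeling"], concept_contexts))
```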
28 What About Deep Web and Web Services?
Mapping of concept-value query conditions onto Deep-Web portals:
  ~sheetmusic = "~flute" is mapped onto form fields such as instrument = (flute | piccolo | recorder), category = reeds, style = (classical | jazz | folk), ...
Observations:
- The Deep Web comprises hidden databases with more than 500 billion (5*10^11) dynamic pages
- High "redundancy" among query forms enables exploitation of statistics
29 What About Deep Web and Web Services?
Mapping of concept-value query conditions onto Deep-Web portals:
  ~sheetmusic = "~flute" is mapped onto form fields such as instrument = (flute | piccolo | recorder), category = reeds, style = (classical | jazz | folk), ...
Approach:
- crawl Web forms, tables, WSDL
- mine for multivariate correlations among N-tuples of (concept, value) pairs, for example:
    ((instrument, flute), (category, reeds))
    ((make, Audi), (model, A4), (color, red))
    ((title, *), (author, ))
- map the query to the target form/WSDL with maximum likelihood (taking into account ontological similarities)
30 Outline, next section: Efficient Evaluation of Top-k Queries
31 Top-k Query Processing with Scoring
Given: query q = t1 t2 ... tz with z (conjunctive) keywords, and a similarity scoring function score(q,d) for docs d ∈ D
Find: top k results with regard to score(q,d) (e.g.: Σ_{i ∈ q} s_i(d))
Naive QP algorithm:
  candidate-docs := ∅;
  for i=1 to z do { candidate-docs := candidate-docs ∪ index-lookup(ti) };
  for each dj ∈ candidate-docs do { compute score(q,dj) };
  sort candidate-docs by score(q,dj) descending;
[Figure: B+ tree on terms (e.g. performance, z-transform) pointing to index lists with (DocId, tf*idf) entries sorted by DocId]
Google: > 10 million terms, > 4 billion docs, > 2 TB index
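The naive algorithm translates almost line by line into Python. This is a sketch under two assumptions: the index is an in-memory dictionary rather than a B+ tree, and score(q,d) is the sum of the per-term scores. The document ids and scores in the example are made up.

```python
def naive_top_k(index, query_terms, k):
    """index: {term: [(doc_id, score), ...]}, each list sorted by doc_id."""
    # Step 1: union of all documents that contain at least one query term.
    candidates = set()
    for term in query_terms:
        candidates |= {doc_id for doc_id, _ in index.get(term, [])}
    # Step 2: score every candidate (sum of per-term scores, 0 if term is absent).
    per_term = [dict(index.get(term, [])) for term in query_terms]
    scored = [(sum(scores.get(doc_id, 0.0) for scores in per_term), doc_id)
              for doc_id in candidates]
    # Step 3: sort all candidates by descending score, then cut off after k.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical index lists.
index = {
    "performance": [(17, 0.3), (44, 0.4), (52, 0.1)],
    "z-transform": [(17, 0.2), (52, 0.6)],
}
top2 = naive_top_k(index, ["performance", "z-transform"], 2)
print(top2)
```

The cost is dominated by scoring and sorting every candidate, even though only k results are returned; this is the inefficiency that threshold algorithms address.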
32 "Fagin's TA" (Fagin '01; Güntzer/Kießling/Balke; Nepal et al.)
  scan all lists L_i (i=1..m) in parallel:
    consider d_j at position pos_i in L_i;
    high_i := s_i(d_j);
    if d_j ∉ top-k then {
      look up s_ν(d_j) in all lists L_ν with ν ≠ i;   // random access
      compute s(d_j) := aggr{ s_ν(d_j) | ν=1..m };
      if s(d_j) > min score among top-k
        then add d_j to top-k and remove min-score d from top-k; };
    threshold := aggr{ high_ν | ν=1..m };
    if min score among top-k ≥ threshold then exit;
Example (m=3, aggr: sum, k=2):
  L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
  L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
  L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
  top-k: a: 0.95, b: 0.8 (f: 0.75 is in the top-k early on and is replaced later)
But random accesses are expensive!
Applicable to XML data: course ~ "Internet" and ~topic = "performance"
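A compact sketch of the threshold algorithm on the slide's three example lists, assuming sum as the aggregation function. Random accesses are simulated with dictionaries, and a `seen` set replaces the slide's "d_j not in top-k" test so that no document is scored twice; both are simplifications for illustration.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one list of (doc, score) per dimension, sorted by descending score."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    high = [lst[0][1] for lst in lists]     # last score seen in each list
    seen = set()
    topk = []                               # min-heap of (total score, doc)
    for pos in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if pos >= len(lst):
                high[i] = 0.0               # list exhausted
                continue
            doc, score = lst[pos]
            high[i] = score
            if doc not in seen:
                seen.add(doc)
                total = sum(scores.get(doc, 0.0) for scores in lookup)
                if len(topk) < k:
                    heapq.heappush(topk, (total, doc))
                elif total > topk[0][0]:
                    heapq.heapreplace(topk, (total, doc))
        # No unseen document can score higher than the sum of the high_i values.
        if len(topk) == k and topk[0][0] >= sum(high):
            break
    return sorted(((doc, total) for total, doc in topk), key=lambda x: -x[1])

L1 = [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)]
L2 = [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)]
L3 = [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)]
result = threshold_algorithm([L1, L2, L3], k=2)
print(result)   # a and b, with scores of about 0.95 and 0.8
```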
33 TA-Sorted
  scan index lists in parallel:
    consider d_j at position pos_i in L_i;
    E(d_j) := E(d_j) ∪ {i};
    high_i := s_i(q,d_j);
    bestscore(d_j) := aggr{x_1, ..., x_m} with x_i := s_i(q,d_j) for i ∈ E(d_j), high_i for i ∉ E(d_j);
    worstscore(d_j) := aggr{x_1, ..., x_m} with x_i := s_i(q,d_j) for i ∈ E(d_j), 0 for i ∉ E(d_j);
    top-k := k docs with largest worstscore;
    threshold := max bestscore{ d | d not in top-k };
    if min worstscore among top-k ≥ threshold then exit;
Example (m=3, aggr: sum, k=2):
  L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
  L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
  L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
  top-k: a: 0.95, b: 0.8
  candidates (scores only partially known): f, h, d, c, g
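The sorted-access-only variant can be sketched as follows, again assuming sum as aggregation. One detail is added beyond the slide's pseudocode: the threshold also covers documents not yet seen in any list, whose bestscore is the sum of all high_i values. Recomputing all bounds in every round is wasteful and only done here for brevity.

```python
def ta_sorted(lists, k):
    """Top-k with sorted accesses only. Returns (top-k docs, scan depth)."""
    m = len(lists)
    high = [lst[0][1] for lst in lists]
    seen = {}                                   # doc -> {list index: score}, i.e. E(d)
    depth = max(len(lst) for lst in lists)
    for pos in range(depth):
        for i, lst in enumerate(lists):
            if pos < len(lst):
                doc, score = lst[pos]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0                   # list exhausted
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        topk, rest = ranked[:k], ranked[k:]
        # Best possible score of any document outside the current top-k,
        # including documents that have not been seen at all.
        threshold = max([best[d] for d in rest] + [sum(high)])
        if len(topk) == k and worst[topk[-1]] >= threshold:
            return topk, pos + 1
    return ranked[:k], depth

L1 = [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)]
L2 = [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)]
L3 = [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)]
docs, scan_depth = ta_sorted([L1, L2, L3], k=2)
print(docs, scan_depth)
```

On this example the scan goes deeper than the random-access variant needs to, which illustrates the trade-off: no random accesses, but more sorted accesses.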
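34 Top-k Queries with Probabilistic Guarantees
The TA family of algorithms is based on the invariant (with sum as aggr)
  $worstscore(d) = \sum_{i \in E(d)} s_i(d) \;\le\; s(d) \;\le\; \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i = bestscore(d)$
This is relaxed into the probabilistic invariant
  $p(d) := P[s(d) > \delta] = P\left[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \delta\right]$
where δ is the minimum worstscore among the current top-k and the RV S_i has some (postulated and/or estimated) distribution in the interval (0, high_i].
Example index lists:
  S1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
  S2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
  S3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
Discard candidates with p(d) ≤ ε
Exit the index scan when the candidate list is empty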
35 Probabilistic Threshold Test
- postulating a uniform or Zipf score distribution in [0, high_i]: compute the convolution using LSTs (Laplace-Stieltjes transforms); use Chernoff-Hoeffding tail bounds or generalized bounds for correlated dimensions (Siegel 1995)
- fitting a Poisson distribution (or Poisson mixture) over equidistant values: easy and exact convolution
- distribution approximated by histograms: precomputed for each dimension; dynamic convolution at query-execution time
- with independent S_i's or with correlated S_i's
Engineering-wise, histograms work best!
[Figure: densities f_2(x) on [0, high_2] and f_3(x) on [0, high_3] and their convolution, with the tail beyond δ(d), for a candidate doc d with 2 ∉ E(d), 3 ∉ E(d)]
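The histogram variant of the test can be sketched as a discrete convolution. The sketch assumes independent, uniformly distributed S_i, equal-width bins shared by all dimensions, and that each bin is represented by its upper edge (which overestimates the tail and therefore errs on the side of keeping candidates). All names and the numbers for high_2, high_3, δ(d), and ε are hypothetical.

```python
def uniform_bins(high, width):
    """Histogram of a uniform distribution on (0, high] with equal-width bins."""
    n = max(1, round(high / width))
    return [1.0 / n] * n

def convolve(p, q):
    """Distribution of the sum of two independent discrete random variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out

def tail_probability(histograms, width, delta):
    """Estimate P[sum of the S_i > delta]; bin j stands for the value (j+1)*width."""
    dist = histograms[0]
    for h in histograms[1:]:
        dist = convolve(dist, h)
    m = len(histograms)
    return sum(p for n, p in enumerate(dist) if (n + m) * width > delta)

# Candidate d with unseen dimensions 2 and 3 (hypothetical numbers).
width = 0.1
hists = [uniform_bins(0.2, width), uniform_bins(0.3, width)]   # high_2, high_3
p_d = tail_probability(hists, width, delta=0.35)               # delta(d) = 0.35
epsilon = 0.1
print(p_d, "discard" if p_d <= epsilon else "keep")
```

With correlated dimensions the independence assumption behind `convolve` no longer holds, which is where the generalized bounds mentioned above come in.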
36 Prob-sorted Algorithm (Smart Variant)
Prob-sorted (RebuildPeriod r, QueueBound b):
  ...
  scan all lists L_i (i=1..m) in parallel:
    ... same code as TA-sorted ...
    // queue management
    for all priority queues q for which d is relevant do
      insert d into q with priority bestscore(d);
    // periodic clean-up
    if step-number mod r = 0 then   // rebuild; single bounded queue
      if strategy = Smart then
        for all queue elements e in q do
          update bestscore(e) with current high_i values;
        rebuild bounded queue with best b elements;
        if prob[top(q) can qualify for top-k] < ε then exit;
    if all queues are empty then exit;
37 Performance Results for .Gov Queries
On the .GOV corpus from the TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts"

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    2,263,652    527,980
  elapsed time [s]
  max queue size
  relative recall     1            0.69
  rank distance       0            39.5
  score error

Speedup by a factor of 10 at high precision/recall (relative to TA-sorted); aggressive queue management even yields a factor of 100 at ...% precision/recall
38 .Gov Expanded Queries
On the .GOV corpus with query expansion based on WordNet synonyms
50 keyword queries, e.g.: "juvenile delinquency youth minor crime law jurisdiction offense prevention", "legalization marijuana cannabis drug soft leaves plant smoked chewed euphoric abuse substance possession control pot grass dope weed smoke"

                      TA-sorted     Prob-sorted (smart)
  #sorted accesses    22,403,490    18,287,636
  elapsed time [s]
  max queue size
  relative recall     1             0.88
  rank distance       0             14.5
  score error         0             0.035
39 Performance Results for IMDB Queries
On the IMDB corpus (Web site: Internet Movie Database): movies and 1.2 million persons (html/xml)
20 structured/text queries with Dice-coefficient-based similarities of the categorical attributes Genre and Actor, e.g.:
  Genre {Western} Actor {John Wayne, Katherine Hepburn} Description {sheriff, marshall}
  Genre {Thriller} Actor {Arnold Schwarzenegger} Description {robot}

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    1,003,650    403,981
  elapsed time [s]
  max queue size
  relative recall     1            0.75
  rank distance
  score error         0            0.25
40 Outline, next section: Ongoing and Future Work
41 Exploiting Collective Human Input for Collaborative Web Search
- Beyond Relevance Feedback and Beyond Google -
href links are human endorsements, which are the basis of PageRank, etc.
Opportunity: online analysis of human input & behavior may compensate for deficiencies of the search engine
Typical scenarios for a 3-keyword user query a & b & c:
- top 10 results: user clicks on ranks 2, 5, 7
- top 10 results: user modifies the query into a & b & c & d, into a & b & e, or into a & b & NOT c
- top 10 results: user selects a URL from bookmarks, jumps to a portal, or asks a friend for tips
Challenge: How can we use knowledge about the collective input of all users in a large community?
Query logs, bookmarks, etc. provide human assessments & endorsements, and thus correlations among words & concepts and among documents
42 Concluding Remarks
- XML and the Semantic Web are key assets, but by themselves not sufficient; we need to cope with diversity, incompleteness, and uncertainty
- absolute need for ranked retrieval
- long-term goal: exploit the Web's potential for being the world's largest knowledge base
- view information organization and information search as dual views of the same problem
- combine techniques from DBS, IR, CL, AI, and ML
- need better theory about quality/efficiency tradeoffs as well as large-scale experiments