Ralf Schenkel (joint work with Jens Graupmann and Gerhard Weikum): The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents


1 Ralf Schenkel (joint work with Jens Graupmann and Gerhard Weikum): The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents

2 Outline (VLDB 2005, Trondheim, Norway): Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Summary

3 Example query #1: Which professors from Saarbrücken do research on XML? Problem: different terminology in query and Web pages (e.g., "Director of Department 5 DBS & IS", "Professor at Saarland University"). Requirement: abstraction awareness.

4 Example query #2: Conferences about XML in Norway 2005? Problem: information is not present on a single page, but distributed across linked pages (e.g., a page "VLDB Conference 2005, Trondheim, Norway" linked to a "Call for Papers" page mentioning "…XML…"). Requirement: context awareness.

5 Example query #3: What are the publications of Max Planck? Problem: Max Planck should be an instance of the concept person, not of the concept institute. Requirement: concept awareness.

6 SphereSearch Concepts. Unified search for unstructured, semistructured, and structured data from heterogeneous sources: graph-based model, including links; annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries. Flexible yet simple abstraction-aware query language with context-aware scoring: compactness-based scores. Goal: increase recall and precision for hard queries on linked and heterogeneous data.

7 Some Related Work. Web query languages: e.g., W3QS [VLDB95], WebOQL [ICDE98], … Web IR with thesauri: e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], … XML IR: e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], … Information extraction: e.g., Lixto, KnowItAll, … Advanced Web graph IR: e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], …

8 Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work

9 Unifying Search on Heterogeneous Data. Sources (Web, intranet, databases, enterprise information systems, …) are converted to XML using heuristics and type-specific transformations.

10 Heuristic Transformation of HTML. Goal: transform layout tags to semantic annotations. Headlines: heading text (e.g., "Experiments", "Settings", "Results") annotates the text that follows it ("We evaluated...", "Our system..."). Patterns: text such as "Topic: XML" yields an element Topic with content XML. Further rules for tables, lists, …

11 (Almost) Generic XML Data Model. Example document text: "Gerhard Weikum IR Saarbrücken XML". Element nodes: (1) docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"; (2) docid=1, tag="Course", content="IR"; (3) docid=1, tag="Research", content="XML". Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction. Tags annotate content with the corresponding concept (e.g., person, location).

12 Information Extraction (IE). Example text: "The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm." Named Entity Recognition (NER): a named entity ~ abstract datatype or concept (location, person, …, IP address). NER is mature (out-of-the-box products, e.g., GATE/ANNIE) and extensible.

13 Unifying Search on Heterogeneous Data. Sources (Web, intranet, databases, enterprise information systems, …) are converted to XML using heuristics and type-specific transformations, then to annotated XML by annotating named entities with IE tools (e.g., GATE).

14 Annotation-Aware Data Model. Before annotation: (1) docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"; (2) docid=1, tag="Course", content="IR"; (3) docid=1, tag="Research", content="XML". Annotation with GATE: "Saarbrücken" is of type "location". After annotation: (1) docid=1, tag="Professor", content="Gerhard Weikum"; (2) docid=1, tag="Course", content="IR"; (3) docid=1, tag="Research", content="XML"; (4) docid=1, tag="location", content="Saarbrücken". Annotation introduces new tags.
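The effect of annotation on the data model can be illustrated with a minimal sketch. It assumes the annotator output is already available as (phrase, concept) pairs; it does not call GATE, and the function name `annotate` as well as the `id` and `parent` fields are hypothetical choices for this illustration, not part of SphereSearch.

```python
def annotate(nodes, entities):
    """Move each recognized phrase out of its element into a new element
    whose tag is the recognized concept (hypothetical helper)."""
    out = [dict(n) for n in nodes]              # copy, keep the input intact
    next_id = max(n["id"] for n in nodes) + 1
    for node in list(out):                      # snapshot: skip nodes added below
        for phrase, concept in entities:
            if phrase in node["content"]:
                # remove the phrase and normalize whitespace
                rest = node["content"].replace(phrase, " ")
                node["content"] = " ".join(rest.split())
                out.append({"id": next_id, "docid": node["docid"],
                            "tag": concept, "content": phrase,
                            "parent": node["id"]})   # "parent" is an assumption
                next_id += 1
    return out

nodes = [
    {"id": 1, "docid": 1, "tag": "Professor", "content": "Gerhard Weikum Saarbrücken"},
    {"id": 2, "docid": 1, "tag": "Course", "content": "IR"},
    {"id": 3, "docid": 1, "tag": "Research", "content": "XML"},
]
annotated = annotate(nodes, [("Saarbrücken", "location")])
for n in annotated:
    print(n["id"], n["tag"], repr(n["content"]))
```

With the slide's example, node 1 keeps "Gerhard Weikum" and a new node 4 with tag "location" carries "Saarbrücken".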

15 Data Model for Links

16 Architecture. Sources: Tourist Guide (XML), Hotel Website, Flight Schedule, SIGIR Website, EMail, Web Portal, Homepage Graupmann. Adapters: XML Adapter, Web Adapter, EMail Adapter, Web Portal Adapter. Annotators: IE Processor with Annotation Modules (DATE, PRICE, LOCATION, …). Search Engine with INDEX. Example annotations: Location=Salvador, Price=89 $, Date=15-18 August, Event=SIGIR, Location=Frankfurt, Time=13:15, Person=Schenkel, FROM=SIGIR, SUBJECT=Notification.

17 Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work

18 SphereSearch Queries. Extended keyword queries with: similarity conditions (~professor, ~Saarbrücken); concept-based conditions (person=Max Planck, location=Trondheim); grouping; join conditions. Ranked results with context-aware scoring.

19 Score Aggregation: SphereScore (context awareness). Local score s_L(e) for each element e (tf/idf, BM25, …). Sphere score: weighted aggregation of the local scores in the environment of an element. Rewards proximity of terms and compactness of the term distribution. (Figure: example element graph for the query terms "research" and "XML", with nodes labeled by their distance 1 or 2 from the scored element.)
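A minimal sketch of such an aggregation follows. The slide only says the aggregation is weighted; the damping factor `alpha`, the `radius` cutoff, and the function name `sphere_score` are assumptions made for illustration, with each element's local score weighted by alpha raised to its hop distance.

```python
from collections import deque

def sphere_score(graph, local, start, radius=2, alpha=0.5):
    """Sum the local scores of all elements within `radius` hops of `start`,
    weighting an element at distance d by alpha ** d (assumed weighting)."""
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:                                 # breadth-first traversal
        node = queue.popleft()
        d = dist[node]
        total += (alpha ** d) * local.get(node, 0.0)
        if d == radius:                          # do not leave the sphere
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = d + 1
                queue.append(neighbor)
    return total

# Hypothetical element graph: 1 - 2, 1 - 3, 3 - 4
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
# Hypothetical local scores: element 2 matches "research", element 4 matches "XML"
local = {2: 1.0, 4: 1.0}
print(sphere_score(graph, local, 1))   # 0.5 * 1.0 + 0.25 * 1.0 = 0.75
```

Element 1 has no matching content of its own but still receives a score because matching elements lie close to it, which is the proximity reward the slide describes.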

20 Similarity Conditions (abstraction awareness). Similarity conditions like ~professor, ~Saarbrücken are handled by query expansion with disambiguation. Thesaurus/ontology: concepts, relationships, and glosses from WordNet, gazetteers, Web forms & tables, Wikipedia; relationships are quantified by statistical co-occurrence measures. (Figure: ontology neighborhood of "professor" with the terms lecturer, teacher, educator, scholar, academic/academician/faculty member, scientist, researcher, investigator, mentor, intellectual, director, artist, alchemist, wizard, primadonna; one edge is labeled HYPONYM (0.7).) Expansion set: δ-exp(x) = {w | sim(x,w) > δ}. Local score, weighted max over all expansion terms: s_L(e, ~professor) = max over t ∈ δ-exp(professor) of sim(professor, t) * s_L(e, t).
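The expansion formula can be sketched in a few lines. The similarity table and local scores below are made-up illustration values, and treating the query term itself as part of its expansion with similarity 1 is an assumption not stated on the slide.

```python
def expand(term, sim, delta):
    """delta-exp(term): all words whose similarity to `term` exceeds delta."""
    expansion = {w for w, s in sim.get(term, {}).items() if s > delta}
    expansion.add(term)            # assumption: sim(x, x) = 1
    return expansion

def similarity_score(term, local_scores, sim, delta=0.5):
    """s_L(e, ~term) = max over t in delta-exp(term) of sim(term, t) * s_L(e, t)."""
    best = 0.0
    for t in expand(term, sim, delta):
        weight = 1.0 if t == term else sim[term][t]
        best = max(best, weight * local_scores.get(t, 0.0))
    return best

# Hypothetical similarities and local scores of one element e
sim = {"professor": {"lecturer": 0.7, "scholar": 0.6, "wizard": 0.2}}
local_scores = {"lecturer": 0.5, "wizard": 1.0}

print(sorted(expand("professor", sim, 0.5)))
print(similarity_score("professor", local_scores, sim))
```

Here "wizard" falls below the threshold δ = 0.5 and is ignored even though the element matches it strongly, so the score comes from "lecturer".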

21 Concept-based Conditions (concept awareness). Goal: exploit explicit (tags) and automatic annotations in documents. Example: the condition location=Trondheim (concept = location, value = Trondheim) matches an element e with docid=1, tag="location", content="Trondheim". Allows similarity and range queries for annotated concepts, like location~Trondheim or 1970<date<1980, with concept-specific distance measures. Local score: s_L(e, c=v) = score for concept-tag match + concept-specific score for value-content match.
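As a rough sketch of the two-part local score, the following code adds a tag-match score and a value-match score. The concrete scoring functions (`exact_match`, `year_in_range`) and the 0/1 score values are assumptions for illustration; the slide only states that the value score is concept-specific.

```python
def exact_match(value, content):
    """Value score for string-valued concepts: 1 on a case-insensitive match."""
    return 1.0 if value.lower() == content.lower() else 0.0

def year_in_range(bounds, content):
    """Value score for a range condition like 1970<date<1980 (years only)."""
    low, high = bounds
    return 1.0 if content.isdigit() and low < int(content) < high else 0.0

def concept_score(element, concept, value, value_score):
    """s_L(e, c=v) = concept-tag match score + value-content match score."""
    tag_score = 1.0 if element["tag"].lower() == concept.lower() else 0.0
    return tag_score + value_score(value, element["content"])

e1 = {"docid": 1, "tag": "location", "content": "Trondheim"}
e2 = {"docid": 1, "tag": "date", "content": "1975"}

print(concept_score(e1, "location", "Trondheim", exact_match))
print(concept_score(e2, "date", (1970, 1980), year_in_range))
```

A refined measure, such as a geographic distance for locations, would replace `exact_match` without changing `concept_score`.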

22 Query Groups. Goal: related terms should occur in the same context. Group conditions that relate to the same "entity": the flat query "professor teaching IR research XML" becomes "professor T(teaching IR) R(research XML)". A SphereScore is computed for each group. Find compact sets with one result for each group.

23 Scores for Query Results (context awareness). Query result R: one result per query group. Compactness ~ 1/size of a minimal spanning tree. (Figure: example graphs connecting candidate results A1, A2, B1, B2 and X with weighted edges.)
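The compactness measure can be sketched with Prim's algorithm over the pairwise distances between the chosen result elements. The distance values below are hypothetical (they are not read from the slide's figure), and how compactness is combined with the sphere scores into a final score is not shown on the slide, so it is left out here.

```python
import heapq

def mst_weight(nodes, weight):
    """Total edge weight of a minimum spanning tree (Prim's algorithm).
    `weight` maps frozenset({u, v}) to the distance between u and v."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    visited = {nodes[0]}
    heap = [(weight[frozenset((nodes[0], v))], v) for v in nodes[1:]]
    heapq.heapify(heap)
    total = 0.0
    while len(visited) < len(nodes):
        w, v = heapq.heappop(heap)
        if v in visited:
            continue                      # stale entry for a connected node
        visited.add(v)
        total += w
        for u in nodes:
            if u not in visited:
                heapq.heappush(heap, (weight[frozenset((v, u))], u))
    return total

def compactness(nodes, weight):
    """Compactness ~ 1 / size of a minimum spanning tree."""
    size = mst_weight(nodes, weight)
    return 1.0 / size if size > 0 else float("inf")

# Hypothetical distances between candidate results of groups A, X, B
weight = {
    frozenset(("A1", "X")): 1, frozenset(("X", "B1")): 2, frozenset(("A1", "B1")): 4,
    frozenset(("A2", "X")): 3, frozenset(("X", "B2")): 5, frozenset(("A2", "B2")): 6,
}
print(compactness(["A1", "X", "B1"], weight))   # MST size 3
print(compactness(["A2", "X", "B2"], weight))   # MST size 8
```

The first candidate set is more compact, so it would be preferred when the local scores are otherwise equal.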

24 Join Conditions. Goal: connect results of different query groups, e.g., A(research, XML) B(VLDB 2005 paper) A.person=B.person. Join conditions do not change the score for a node; they create a new link with a specific weight. Links are either precomputed or computed during query execution, dependent on database size and application. (Figure: nodes "research XML", "Ralf Schenkel", "2004", "2005", "R.Schenkel", "VLDB 2005" with join links weighted 1.0 and 0.9.)

25 Score for Join Conditions. Join condition A.T=B.S: for all nodes n1 with type T and n2 with type S, add an edge (n1,n2) with weight (sim(n1,n2))^-1, where sim(n1,n2) is the content-based similarity. (Figure: example graph over A, X, B with edge weights.)
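A minimal sketch of this edge construction follows. The token-based Jaccard similarity is only a stand-in for the content-based similarity, which the slide does not specify, and skipping pairs with similarity 0 (to avoid division by zero) is likewise an assumption.

```python
def jaccard(a, b):
    """Stand-in content similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def join_edges(nodes, type_t, type_s, sim=jaccard):
    """For all n1 with tag type_t and n2 with tag type_s, create an edge
    (n1, n2) with weight 1 / sim(n1, n2)."""
    edges = []
    for n1 in nodes:
        if n1["tag"] != type_t:
            continue
        for n2 in nodes:
            if n2["tag"] != type_s or n2 is n1:
                continue
            s = sim(n1["content"], n2["content"])
            if s > 0:                     # assumption: no edge for similarity 0
                edges.append((n1["id"], n2["id"], 1.0 / s))
    return edges

# Hypothetical annotated nodes for a condition like A.person=B.director
nodes = [
    {"id": 1, "tag": "person", "content": "Guy Ritchie"},
    {"id": 2, "tag": "director", "content": "Guy Ritchie"},
    {"id": 3, "tag": "director", "content": "Ridley Scott"},
]
print(join_edges(nodes, "person", "director"))
```

Because the edge weight is the inverse similarity, highly similar values yield short links, which keeps joined results compact under the spanning-tree measure.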

26 Outline: Where existing search engines fail; SphereSearch Concepts; Transformation and Annotation; Query Language and Scoring; Experimental Evaluation; Current and Future Work

27 Setup for Experiments. Three corpora: Wikipedia; extended Wikipedia with links to IMDB; extended DBLP corpus with links to homepages. 50 queries like: A(actor birthday 1970<date<1980) western; G(California,governor) M(movie); A(Madonna,husband) B(director) A.person=B.director. Opponent: keyword queries with a standard TF/IDF-based score, i.e., a "simplified Google". No existing benchmark (INEX, TREC, …) fits.

28 Incremental Language Levels: SSE-basic (keywords, SphereScores); SSE-CV (concept-based conditions); SSE-QG (query groups); SSE-Join (join conditions)

29 Experimental Results on Wikipedia

30 Experimental Results on Wiki++ and DBLP++. SphereScores are better than local scores. New SSE features nearly double precision.

31 Current and Future Work. Improve the graphical user interface. Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]. Deep Web search through automatic portal queries. Parameter tuning with relevance feedback. Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon).

32 Thank you!

