Indexing. The goal of indexing is to speed up online querying – Retrieval needs to be performed in milliseconds – Without an index, retrieval would require.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Chapter 5: Introduction to Information Retrieval
1 gStore: Answering SPARQL Queries Via Subgraph Matching Presented by Guan Wang Kent State University October 24, 2011.
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.
Information Retrieval in Practice
Search Engines and Information Retrieval
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. 1 The Architecture of a Large-Scale Web Search and Query Engine.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Xyleme A Dynamic Warehouse for XML Data of the Web.
Physical Database Monitoring and Tuning the Operational System.
Research Problems in Semantic Web Search Varish Mulwad ____________________________ 1.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Presented by Cathrin Weiss, Panagiotis Karras, Abraham Bernstein Department of Informatics, University of Zurich Summarized by: Arpit Gagneja.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Digital Object: A Virtual Online Storage Solution 598C Course Project Huajing Li.
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Rajashree Deka Tetherless World Constellation Rensselaer Polytechnic Institute.
Search Engines and Information Retrieval Chapter 1.
1 © Prentice Hall, 2002 Physical Database Design Dr. Bijoy Bordoloi.
Hexastore: Sextuple Indexing for Semantic Web Data Management
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Database Support for Semantic Web Masoud Taghinezhad Omran Sharif University of Technology Computer Engineering Department Fall.
Physical Database Design Chapter 6. Physical Design and implementation 1.Translate global logical data model for target DBMS  1.1Design base relations.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
Object Persistence Design Chapter 13. Key Definitions Object persistence involves the selection of a storage format and optimization for performance.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Problems in Semantic Search Krishnamurthy Viswanathan and Varish Mulwad {krishna3, varish1} AT umbc DOT edu 1.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009.
RDF languages and storages part 1 - expressivness Maciej Janik Conrad Ibanez CSCI 8350, Fall 2004.
User Profiling using Semantic Web Group members: Ashwin Somaiah Asha Stephen Charlie Sudharshan Reddy.
1 Information Retrieval LECTURE 1 : Introduction.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Spring 2004 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
RDF storages and indexes Maciej Janik September 1, 2005 Enterprise Integration – Semantic Web.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
Chapter 04 Semantic Web Application Architecture 23 November 2015 A Team 오혜성, 조형헌, 권윤, 신동준, 이인용.
CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Indexing Structures for Files and Physical Database Design
Text Based Information Retrieval
Physical Database Design and Performance
Map Reduce.
Implementation Issues & IR Systems
MR Application with optimizations for performance and scalability
Chapter 11: Indexing and Hashing
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Chapter 11: Indexing and Hashing
Presentation transcript:

Indexing

The goal of indexing is to speed up online querying – Retrieval needs to be performed in milliseconds – Without an index, retrieval would require streaming through the collection Computing static (query independent) ranking features – Example: PageRank Building specialized data structures (indexes)

DB vs. IR-style indexing The type of indexes you need depend on the query language to support – SPARQL is a highly-expressive SQL-like query language for experts – End-users are accustomed to keyword queries with very limited structure (see Pound et al. WWW2010) DB-style indexing – Support for SPARQL queries – Optimized for Read/Write workload IR-style indexing – Keyword-queries – Read workload

DB-style indexing Index RDF as data Triple stores: repositories for RDF data – Sesame, Mulgara, Virtuoso, Redland etc. – Most triple stores can rely on a traditional database backend e.g. MySQL Enjoy all the features of a mature database (query optimizations, compression, security, logging, replication etc.) But worse performance than native implementations Difference to databases – Schema may not be known in advance and subject to change – Heterogeneous data – Sparsity

Table layouts Schema-aware layouts – aka property tables Triple stores Vertical partitioning

Schema aware layout (a) – Classical database layout – One table per class and relation – Good choice if the schema is known in advance and stable – Lot’s of NULLs if the data is sparse Person idnameage Pid1“Alice”“21” Pid2“Joe”“63” Publication idtitleyear Pub1“The life of chimps” 2005 Pub2“My life”2002 Person idPublication id Pid1Pub1 Pid2Pub1 Pid2Pub2

Schema aware layout (b) Same as I(a) but one generic triples table for relations – Implemented in Jena, Oracle Person idnameage Pid1“Alice”“21” Pid2“Joe”“63” Publication idtitleyear Pub1“The life of chimps” 2005 Pub2“My life”2002 SubjectPredicateObject Pid1publishedPub1 Pid2publishedPub1 Pid2publishedPub2

Triple stores One table for all triples – Expensive self-joins Performance can be improved by – Indexing all or some combinations of columns – Dictionary encoding of URIs and strings SubjectPredicateObject Pid1name“Alice” Pid1age“21 Pid1typePerson Pid2name“Joe” Pid2age“63” Pid1publishedPub1 Pid2typePerson Pid2publishedPub1 Pid2publishedPub2 Pub1title“The life of Chimps Pub1year2005 Pid1typePublication Pub2title“My life” Pub2year2005

Resolving queries in a triple store SELECT ?name WHERE { ?x name Person. ?x name ?name } SELECT ?name WHERE { ?x name Person. ?x name ?name } SELECT B.object FROM triples AS A, triples as B WHERE A.subject= B.subject AND A.property= “type” AND A.object= “Person” AND B.predicate= “name”; SELECT B.object FROM triples AS A, triples as B WHERE A.subject= B.subject AND A.property= “type” AND A.object= “Person” AND B.predicate= “name”; SPARQL query SQL query

Vertical partitioning One table per property Nulls are not stored Easy to handle multi-valued properties Only need to read relevant properties Joins are mostly merge-joins Materializing paths can speed up frequent path patterns see Abadi, VLDB 2007 SubjectObject Pid1“Alice” Pid2“Joe” SubjectObject Pid1“21 Pid2“63” nameage SubjectObject Pid1Pub1 Pid2Pub1 Pid2Pub2 published

IR-style indexing Index data as text – Create virtual documents from data – One virtual document per subgraph, resource or triple typically: resource Key differences to Text Retrieval – RDF data is structured – Minimally, queries on property values are required

Virtual documents Single index – All words – Post-fixing Multiple indexes – Horizontal indexing – Vertical indexing

All words Treat the RDF document simply as a sequence of tokens Works quite well but no support for querying values for a particular property “Peter Mika”. “32”. “Barcelona”. “Peter Mika”. “32”. “Barcelona”. peter xmlns foaf name Peter Mika peter xmlns foaf age 32 peter w vcard ns Barcelona

Post-fixing This can be achieved without any special index structure by ”post-fixing” – Instead of the term Peter index the term Peter#foaf_name – Prefix queries are needed to search only for Peter ✗ Dictionary is number of unique terms per property – It works well when the number of properties are small Example: NER indexing with a small number of types – RDF has large number of properties: dictionary explodes the_name, the_title, the_address, the_org, ….

Horizontal index structure Two fields (indices): one for terms, one for properties For each term, store the property on the same position in the property index – Positions are required even without phrase queries Query engine needs to support the alignment operator ✓ Dictionary is number of unique terms + number of properties ✓ Occurrences is number of tokens * 2

Vertical index structure One field (index) per property Positions are not required – But useful for phrase queries Query engine needs to support fields ✓ Dictionary is number of unique terms ✓ Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance

Inverted index construction 1.Crawling 2.Data extraction, integration and reasoning 3.Collection parsing – language detection – tokenization – stemming, – stop word removal

Inverted index construction II. 4.For each partition of documents to be indexed – Build inverted-index in memory – Write out each partition 5.Merge blocks – Merging sorted lists Encoding and compression techniques reduce the size of the dictionary and posting lists – Dictionary needs to fit in memory – Minimize read time from disk

Indexing using MapReduce MapReduce is the perfect model for building inverted indices – Map creates (term, {doc1}) pairs – Reduce collects all docs for the same term: (term, {doc1, doc2…} – Sub-indices are merged separately Term-partitioned indices Build indices using any IR library – Katta project for Lucene

Implementation Mika. Distributed Indexing for Semantic Search. SemSearch Reality is more complex than the textbooks… – Hashed subject URIs using MG4J’s Minimal Perfect Hash hash function occupies only 307MB Fits in memory for now, only required for map – Implemented fields, positions: keys are (field, term) pairs, values are (docid, position) pairs – For each term, documents need to be indexed in increasing order of docid Secondary sort by making value part of the key Increased amount of data

Implementation Document frequency needs to be known when starting to write out the posting list – Introduced dummy occurrences, one for each document – Set docid to -1 to make sure dummy occurrences come first Memory problems with unbalanced executions (too many documents for a single term) Memory problems with large number of indices (index caching) Trade-offs between how much memory is left for our job vs. memory for the system itself

Example: Semplore Zhang et al. (IBM China and Shanghai JiaoTong Univ.): Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data Hybrid query capability – Added: virtual “keyword” concepts that can be combined with other formal concepts in ontology – Limited to tree-shaped queries with a single query target Efficient indexing by extending an IR engine (Lucene) – Transform RDF triples to artificial documents with fields and terms – Optimized by the IR engine (e.g. index compression)

Hybrid Query Capability Example: Find directors who have directed films that are American films and are about war, and have directed at least one romantic comedy starring both a Best Actor and a Best Actress winner. x FilmDirector y1y1 directs y4y4 “war” “romantic” AmericanFilm AND y2y2 starring Best Actor Academy Award Winner starring y3y3 Best Actress Academy Award Winner © 2007 IBM Corporation

Query Evaluation Example x FilmDirector y1y1 directs y4y4 “war” “romantic” AmericanFilm u y2y2 starring Best Actor Academy Award Winner starring y3y3 Best Actress Academy Award Winner s 1 =(type, FilmDirector) ⋈ (s 1, directs, s 2 ) s 2 =(text, “romantic”) ⋈ (s 2, starring, s 3 ) s 3 =(type, BestActor) s 2 = ⋈ (s 3, starring¯, s 2 ) DFS the tree and conduct the basic operations on AIS. © 2007 IBM Corporation

Index Structure: The Virtual Documents DocumentFieldTerm Concept C subConOfSuper concepts of C superConOfSub concepts of C textTerms in textual properties of C Relation R subRelOfSuper relations of R superRelOfSub relations of R textTerms in textual properties of R Individual i type Concepts that i belongs to subjOf All relations R that ( i, R, ?) exists objOf All relations R that (?, R, i ) exists text Terms in textual properties of i © 2007 IBM Corporation

Indexing the Artificial Documents Feed the artificial documents to existing IR engine Reuse existing IR engine to index the data The result index structure (logical) For a given subject and a relation, all the objects are stored as the subject’s artificial document’s positions. © 2007 IBM Corporation

Example: Sindice Oren et al.: Sindice.com: A Document-oriented Lookup Index for Open Linked DataSindice.com: A Document-oriented Lookup Index for Open Linked Data Crawling Linked Data and extracting microformats, RDFa from well-known sources – Implements politeness policies – Uses Yahoo BOSS to discover additional sources – Accepts pings through PingTheSemanticWeb.comPingTheSemanticWeb.com – Supports Semantic SitemapsSemantic Sitemaps Limited reasoning during indexing – ‘Sandboxed’ reasoning: the ontologies of each document are loaded separately into the reasoner – Inverse-Functional Properties (IFPs) such as addresses are identified and indexed in a separate index

Sindice Siren indexing engine Siren – Open source extension to Lucene – New field type “tuples” Index (s, p, o) + (o, rdfs:label, l) as (p, o, l) Tokenize URIs – Expressivity: queries where the subject is the variable, e.g. – (*, name, "renaud delbru") AND (*, workplace, deri) – Ranking “comparable to TF-IDF” without global features (such as PageRank) Efficiency – Example: 1.5B triples, 4 nodes, ~22 hours of indexing

Demo

Example: Sindice and Sig.ma Sindice is a Semantic Web Search engine – Indexing Linked Data and some microformat/RDFa data Uses Yahoo’s BOSS API to find more at query time – Lucene-based engine (Siren) Sig.ma: interactive interface for Sindice – Clean up search results by removing irrelevant results and untrustworthy sources – Share cleaned-up search result pages