1 Unique identifiers for the Web Zoltan Miklos Joint work with Gleb Skobeltsyn, Saket Sathe, Nicolas Bonvin, Philippe Cudré-Mauroux, Ekaterini Ioannou,

Presentation transcript:

1 Unique identifiers for the Web. Zoltan Miklos. Joint work with Gleb Skobeltsyn, Saket Sathe, Nicolas Bonvin, Philippe Cudré-Mauroux, Ekaterini Ioannou, Karl Aberer and others. EPFL.

2 Providing unique identifiers
[Diagram labels: Webpages, Documents, (Information extraction), Okkamization, Entities store. Query: “Barack Obama”. Response: (not captured in the transcript).]

3 Web search vs. entity search
Documents: web search indexes Web documents; entity search indexes entity profiles, each with a unique ID.
Ranking: web search uses PageRank; entity search uses OKKAM ranking.
Query: web search takes keywords, e.g. Barack Obama; entity search takes keywords plus attribute names, e.g. Barack Obama, Name:“Barack Obama”, geoLocation:Paris, firstName:Paris.
Query “semantic”: web search finds all relevant documents and puts the most relevant first; entity search finds the only relevant document or, if this is not possible, an ordered list of candidates (with confidence values).

4 Entities and Entity Requests
Entity profiles are collections of attribute-value pairs with an okkam-id.
Example profile:
◦ name : Albert Einstein
◦ affiliation : Institute for Advanced Study
◦ profession : physicist
◦ okkam-id :
Examples of entity requests:
◦ Q1 -- name=“Einstein” (AND) physicist
◦ Q2 -- Einstein (AND) physicist
◦ Q3 -- name=“Einstein” (AND) profession=“physicist”
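The profile and the three request forms on this slide can be sketched in Python. This is only an illustration under assumptions: a profile is a plain dictionary, requests use a bare `AND` separator, and matching is a case-insensitive substring test. The names `parse_request` and `matches` and the placeholder identifier are hypothetical and are not part of the OKKAM API.

```python
from typing import Optional

# Hypothetical in-memory profile; the okkam-id value is a placeholder,
# since the slide does not show the real identifier.
EINSTEIN = {
    "okkam-id": "placeholder-id",
    "name": "Albert Einstein",
    "affiliation": "Institute for Advanced Study",
    "profession": "physicist",
}

Term = tuple[Optional[str], str]  # (attribute or None, keyword)


def parse_request(request: str) -> list[Term]:
    """Split a request such as 'name="Einstein" AND physicist' into terms."""
    terms: list[Term] = []
    for part in request.split(" AND "):
        if "=" in part:
            attribute, value = part.split("=", 1)
            terms.append((attribute.strip(), value.strip().strip('"')))
        else:
            terms.append((None, part.strip()))
    return terms


def matches(profile: dict[str, str], terms: list[Term]) -> bool:
    """True if every term occurs under its attribute, or under any attribute if none is named."""
    for attribute, keyword in terms:
        if attribute is None:
            values = [v for k, v in profile.items() if k != "okkam-id"]
        else:
            values = [profile.get(attribute, "")]
        if not any(keyword.lower() in v.lower() for v in values):
            return False
    return True
```

Q1, Q2 and Q3 all match the example profile here; they differ only in how much attribute information the request carries.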

5 OKKAM Match & Store Process (OKKAM Match API)
Receive the entity request: Name=“Einstein” AND physicist
Convert the request and select a matching module.
Matching modules: Group Linkage, Generic Matching, Product Matching.
Module selection depends on:
◦ Entity type (inferred from attributes, or identified from the receiver)
◦ Required response time
◦ …

6 OKKAM Match & Store Process (OKKAM Match API)
Generation of the storage query for the request Name=“Einstein” AND physicist
The matching module creates the query for the OKKAM Store.
Possibility to overwrite the default implementation:
◦ Schema rewriting (internal object, or store query)
◦ Add attributes to values
◦ Complex query plan

7 OKKAM Match & Store Process (OKKAM Store API)
Query sent to the OKKAM Store index: name:einstein physicist
◦ Send the query to the index
◦ Query the distributed index
◦ Each server processes the query from the index and returns top-k results
◦ Aggregate the top-k results from each server
Result: top-k matches (IDs + scores)
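The aggregation step on this slide can be sketched as a merge of per-server result lists. This is a minimal illustration that assumes each server returns `(entity_id, score)` pairs and that an entity lives on exactly one server, so no duplicates need to be merged. The function name `aggregate_top_k` is hypothetical.

```python
import heapq


def aggregate_top_k(per_server_results, k):
    """Merge per-server lists of (entity_id, score) into one global top-k list."""
    all_hits = (hit for hits in per_server_results for hit in hits)
    # nlargest returns the k best hits, ordered from highest to lowest score.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])
```

Because every server already returns its own top-k, the coordinator only ever inspects k results per server, no matter how many entities each server holds.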

8 OKKAM Match & Store Process (OKKAM Store API)
Query: name:einstein physicist
◦ Top-k matches (IDs) come back from the index
◦ Entity profiles are requested from the OKKAM Store storage by their IDs for the top-k candidate matches
◦ Top-k matching candidates (entities) are obtained

9 OKKAM Match & Store Process (OKKAM Match API)
Request: Name=“Einstein” AND physicist
The matching module receives the matching candidates, then performs advanced matching to produce the final entities:
◦ Background knowledge
◦ Domain specific information
◦ Analyze inner-relationships
◦ Make another query
◦ …

10 OKKAM Match & Store Process (OKKAM Match API)
Request: Name=“Einstein” AND physicist
The matching module returns a ranked list with matching entities, using:
◦ Background knowledge
◦ Domain specific information
◦ Analyze inner-relationships
◦ Make another query
◦ …

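11 OKKAM Match & Store Process (OKKAM Store Index)
[Figure: inverted index. The terms einstein and physicist point to posting lists of entity IDs (the IDs shown include D1, D3, D5 and D9), organized by attribute: name, firmname, school, affiliation, …]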
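A toy version of such an index can be written in Python. This sketch assumes one posting list per keyword plus one per (attribute, keyword) pair; the actual layout of the OKKAM Store index is not specified in the slides, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(entities):
    """Map each term, and each (attribute, term) pair, to the set of entity IDs containing it."""
    index = defaultdict(set)
    for entity_id, profile in entities.items():
        for attribute, value in profile.items():
            for term in value.lower().split():
                index[term].add(entity_id)               # keyword posting list
                index[(attribute, term)].add(entity_id)  # attribute-qualified posting list
    return index
```

A keyword query intersects the plain posting lists, while a structured query such as name:einstein reads the attribute-qualified list directly.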

12 Scoring at the index level
OKKAMStore returns top-k candidate entities.
Scoring for keyword queries:
◦ Example: query Paolo. The entity with “name=Paolo” will be scored higher than the entity with “comment=Paolo leads OKKAM…”
Scoring for structured queries:
◦ Example: query name=Paolo. High score to the entity with “name=paolo” and low score to the entity with “location=paolo alto”
bu: boosting for unstructured queries. bu ~ popularity of the attribute used with the term t from the query q in the entity e.
bs(a): boosting for structured queries. bs(a) = 1 if the entity e contains the term t exactly with the attribute a from the query q.
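The two boosting factors can be illustrated with a small Python sketch. The slide gives neither the popularity values nor the value of bs(a) when the term appears under a different attribute, so the table `ATTRIBUTE_POPULARITY` and the constant `MISMATCH_BOOST` below are assumptions chosen only to reproduce the Paolo examples.

```python
# Hypothetical attribute popularity weights; the slides do not give actual values.
ATTRIBUTE_POPULARITY = {"name": 1.0, "location": 0.5, "comment": 0.2}
MISMATCH_BOOST = 0.1  # assumed bs(a) when the term occurs under another attribute


def boost_unstructured(entity, term):
    """bu: largest popularity among the attributes whose value contains the term."""
    weights = [
        ATTRIBUTE_POPULARITY.get(attribute, 0.1)
        for attribute, value in entity.items()
        if term.lower() in value.lower().split()
    ]
    return max(weights, default=0.0)


def boost_structured(entity, attribute, term):
    """bs(a): 1 if the term occurs under the queried attribute, a small boost if elsewhere."""
    if term.lower() in entity.get(attribute, "").lower().split():
        return 1.0
    elsewhere = any(term.lower() in v.lower().split() for v in entity.values())
    return MISMATCH_BOOST if elsewhere else 0.0
```

With these assumed weights, the keyword query Paolo ranks the entity with name=Paolo above the one that only mentions Paolo in a comment, and the structured query name=Paolo gives only a small boost to location=paolo alto.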

13 Identified Challenges → Achievements
Challenges:
◦ Huge number of entities that OKKAM needs to store and process
◦ A single algorithm for matching an entity description to the OKKAM entities does not exist [ENA+]

14 OKKAMstore distributed architecture
Conceptual principles:
◦ Document- (entity-) partitioned distributed index
◦ Distributed storage
Each server maintains:
◦ Storage (collection of entities): Entity read(OkkamID), write(Entity)
◦ Index (inverted index): Collection match(query)
[Figure: servers (replicas) A, B, C and D, each holding a collection of entities (E) in storage plus an inverted index.]
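The read, write and match operations named on this slide can be sketched as a small in-memory model. This is an illustration under assumptions: entities are routed to a server by a CRC32 hash of their ID, `match` requires every query term to be present, and replicas and index updates on overwrite are not modelled. The class names `Server` and `Cluster` are hypothetical.

```python
import zlib
from collections import defaultdict


class Server:
    """One partition: entity storage plus an inverted index over its own entities."""

    def __init__(self):
        self.storage = {}
        self.index = defaultdict(set)

    def write(self, entity):
        self.storage[entity["okkam-id"]] = entity
        for attribute, value in entity.items():
            if attribute != "okkam-id":
                for term in value.lower().split():
                    self.index[term].add(entity["okkam-id"])

    def read(self, okkam_id):
        return self.storage.get(okkam_id)

    def match(self, query):
        """Return the IDs of local entities that contain every query term."""
        postings = [self.index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()


class Cluster:
    """Routes reads and writes by ID and fans match() out to all partitions."""

    def __init__(self, n_servers):
        self.servers = [Server() for _ in range(n_servers)]

    def _home(self, okkam_id):
        # Assumed routing rule: hash of the ID modulo the number of servers.
        return self.servers[zlib.crc32(okkam_id.encode()) % len(self.servers)]

    def write(self, entity):
        self._home(entity["okkam-id"]).write(entity)

    def read(self, okkam_id):
        return self._home(okkam_id).read(okkam_id)

    def match(self, query):
        return set().union(*(s.match(query) for s in self.servers))
```

Because the index is partitioned by entity, a read touches one server while a match must ask all of them.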

15 Future work: manage mappings
The user in general does not know the set of available attributes.
◦ Example: the posting list for entity D1 stores fName: Paris, …, while the user query is firstName=Paris. A mapping firstName -> fName is needed.
Challenge: on-the-fly mappings are needed, but only mappings with very low computation costs (constant time) are realistic.
Strategy: create mapping candidates from the dataset, then adapt the mappings based on statistics (we don’t have good test data …).
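One way to meet the constant-time requirement is a precomputed lookup table, sketched below. The contents of `ATTRIBUTE_MAPPINGS` are invented for illustration (only firstName -> fName comes from the slide), and how the candidates would be derived from dataset statistics is left open, as it is in the slides.

```python
# Hypothetical mapping table, assumed to be built offline from dataset statistics.
ATTRIBUTE_MAPPINGS = {"firstname": "fName", "first_name": "fName", "surname": "lName"}


def rewrite_attribute(user_attribute):
    """Constant-time dictionary lookup; unknown attributes pass through unchanged."""
    return ATTRIBUTE_MAPPINGS.get(user_attribute.lower(), user_attribute)


def rewrite_query(query):
    """Rewrite the attribute of a query of the form attribute=value."""
    attribute, value = query.split("=", 1)
    return f"{rewrite_attribute(attribute.strip())}={value.strip()}"
```

The lookup cost does not depend on the number of stored entities, which is what makes it usable on the fly at query time.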

16 idMesh: Cudré-Mauroux et al., WWW’09
[Figure: source graph and entity graph. The sources S1 and S2 declare the statements c1: e1 = e2, c2: e1 = e3, c3: e1 ≠ e4, c4: e2 ≠ e4, c5: e2 = e4, c6: e3 = e4. In the entity graph, the entities e1 to e4 are connected by the links l12, l13, l14, l24 and l34.]

17 idMesh: Inferring the most-probable relations
We formulate a set of integrity constraints:
◦ P(l=equal) + P(l=non-equal) = 1, for link variables
◦ No cycle can contain exactly one non-equivalent link
We also define a trust framework and attach a trust variable to each source (which has the value 1 if all the relations declared by this source are correct).
With a probabilistic inference machinery based on graphical models (factor graphs), we compute the most probable values for the entity relations.
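The effect of the cycle constraint can be shown on a tiny example. The sketch below is not the factor-graph inference used by idMesh: it swaps in a brute-force enumeration over all link assignments, assumes independent prior probabilities per link, and ignores the trust variables. The function name and the prior values are hypothetical.

```python
from itertools import product


def most_probable_assignment(priors, cycles):
    """priors: {link: P(equal)}; cycles: tuples of link names forming a cycle.

    Returns the most probable {link: is_equal} assignment in which no cycle
    contains exactly one non-equal link.
    """
    links = sorted(priors)
    best, best_p = None, -1.0
    for values in product([True, False], repeat=len(links)):
        assignment = dict(zip(links, values))
        if any(sum(not assignment[l] for l in cycle) == 1 for cycle in cycles):
            continue  # exactly one non-equal link in a cycle: inconsistent
        p = 1.0
        for link, equal in assignment.items():
            p *= priors[link] if equal else 1.0 - priors[link]
        if p > best_p:
            best, best_p = assignment, p
    return best
```

For the cycle (l12, l24, l14) with assumed priors 0.9, 0.8 and 0.4, the constraint pushes l14 to equal even though its own prior leans towards non-equal, because e1 = e2 and e2 = e4 together imply e1 = e4. Enumeration is exponential in the number of links, so it only serves to illustrate the constraint.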

18 idMesh: further challenges
Entity graph can be very large
◦ ε-graph: represent only edges with confidence larger than (1-ε) or smaller than ε (even this is difficult to compute)
How to construct the graph if the entity profiles have different sets of attributes?
Large connected components in the graph, or large cycles
◦ Apply standard graph algorithms for finding maximal connected components
Problems with the dataset, e.g. the number of sources is low
◦ More advanced models
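The ε-graph filter and the connected-components step can be sketched as follows. The sketch assumes edge confidences are already available as a dictionary from entity pairs to P(equal), which the slide notes is itself hard to compute at scale. The function names are hypothetical, and union-find is one standard choice among several.

```python
def epsilon_graph(edges, epsilon):
    """Keep only edges whose confidence is above 1 - epsilon or below epsilon."""
    return {pair: p for pair, p in edges.items() if p > 1 - epsilon or p < epsilon}


def connected_components(nodes, edges):
    """Union-find over the given edges; returns a list of sets of nodes."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())
```

Dropping the uncertain middle range of edges both shrinks the graph and splits it into smaller components that can be processed separately.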

19 Thank you for your attention! New version (V2) will be online in July/August 2009