Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica.

1 Supporting the Automatic Construction of Entity Aware Search Engines
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre

2 Introduction
A huge number of web sites publish pages based on data stored in databases.
Each of these pages often contains information about a single instance of a conceptual entity.
[Figure: conceptual entity BasketballPlayer with attributes name, birthdate, college, weight, height]

3 Introduction http://www.nba.com/ http://sports.espn.go.com/

4 Introduction
We developed a system that, taking as input a small set of sample pages from distinct web sites, automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples.

5 Overall Approach
Given a set of sample pages:
- crawl the web sites of the sample pages to gather other pages offering the same type of information
- extract a set of keywords that describe the underlying entity
- do
  - launch web searches to find other sources with pages that contain instances of the target entity
  - analyze the results to filter out irrelevant pages
  - crawl the new sources to gather new pages
  while new pages are found
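The loop on this slide can be sketched in Python as follows. This is only an illustrative outline, not the authors' implementation: the function name `discover_pages`, the four callables passed in, and the `max_iterations` safety bound are all assumptions introduced here.

```python
from typing import Callable, Iterable, Set


def discover_pages(
    sample_pages: Iterable[str],
    crawl_site: Callable[[str], Set[str]],            # hypothetical: pages of the same site and type
    describe_entity: Callable[[Set[str]], Set[str]],  # hypothetical: keywords describing the entity
    search_web: Callable[[Set[str], Set[str]], Set[str]],  # hypothetical: candidate result pages
    is_relevant: Callable[[str, Set[str]], bool],     # hypothetical: filter for irrelevant results
    max_iterations: int = 10,                         # assumption: safety bound, not in the slides
) -> Set[str]:
    # Step 1: crawl the sites of the sample pages.
    pages: Set[str] = set()
    for sample in sample_pages:
        pages |= crawl_site(sample)

    # Step 2: extract the keywords that describe the underlying entity.
    description = describe_entity(pages)

    # Step 3: search, filter, crawl; repeat while new pages are found.
    for _ in range(max_iterations):
        candidates = search_web(pages, description)
        relevant = {p for p in candidates if is_relevant(p, description)}
        new_pages: Set[str] = set()
        for page in relevant:
            new_pages |= crawl_site(page)
        new_pages -= pages
        if not new_pages:
            break
        pages |= new_pages
    return pages
```

Passing the four steps in as callables keeps the control flow of the slide visible while leaving each step's actual technique open.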

7 Crawling the seed sites
Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same information.
A crawling algorithm scans the web site toward pages sharing the same structure as the input sample page.
The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names).
[Figure: a site is fed to the crawler, which outputs instance identifiers: Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …]

8 Crawler: intuition
Given a sample page, the system explores the site structure looking for pages that work as indexes to "similar" pages.
The similarity between pages is measured by analyzing their structure.
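The slides do not say which structural measure the crawler uses. As one possible illustration, the sketch below compares two pages by the Jaccard overlap of their root-to-element tag paths, using only Python's standard `html.parser`. The names `tag_paths` and `structural_similarity`, and the choice of Jaccard, are assumptions made here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class _PathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/div') seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets (assumed measure)."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)
```

Two player pages generated from the same template would score close to 1.0 under this measure, while a news page from the same site would share only the outermost paths.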

10 Extraction of the entity description
On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords.
These keywords usually appear in the page template.

11 Extraction of the entity description
[Figure: sample player page; terms shown include PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro, photos, Buy photo, E-mail]

12 Extraction of the entity description
For each known web site, we extract a set of keywords from its template.
The entity description is a set of keywords built by combining these sets.
We favour the more frequent terms.
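One simple way to combine per-site keyword sets while favouring frequent terms is to count in how many sites each term occurs and keep the top-ranked ones. The sketch below does that. The function name `build_entity_description`, the `top_k` cut-off and the case-insensitive matching are assumptions, not details given in the slides.

```python
from collections import Counter


def build_entity_description(site_keywords, top_k=5):
    """Rank terms by the number of sites whose template contains them.

    `site_keywords` is an iterable with one keyword set per web site.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for keywords in site_keywords:
        # Count each term at most once per site, ignoring case.
        counts.update({k.lower() for k in keywords})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

With three sites whose templates contain, say, Height and Weight everywhere but "photos" or "E-mail" only once, the site-specific terms fall to the bottom of the ranking.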

13 Template Extraction: intuition
To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens (inspired by Arasu & Garcia-Molina, SIGMOD 2003).

14 Template Extraction: intuition
page 1: Home Sport! Weight 97 Height 180 Profile The career...
page 2: Home Sport! Weight 136 Height 212 Profile Giant... Height...
Token paths shown: /html/body/div[3]/b, /html/body/div[4]/span
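The intuition can be illustrated with a much simplified frequency count: a token is treated as part of the template if it occurs at the same path in every page of the site. This is only a sketch under that assumption, not the algorithm of Arasu & Garcia-Molina; the function name, the `min_support` parameter and the paths in the sample data are hypothetical.

```python
from collections import Counter


def extract_template_terms(pages, min_support=1.0):
    """Return the (path, token) pairs found in at least `min_support` of the pages.

    Each page is a list of (path, token) pairs, where `path` locates the
    token in the page's HTML tree.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each (path, token) once per page
    threshold = min_support * len(pages)
    return {pair for pair, n in counts.items() if pages and n >= threshold}


# Sample data modelled on the two pages of the slide; the paths are made up.
page1 = [
    ("/html/body/div[1]/a", "Home"), ("/html/body/div[2]/h1", "Sport!"),
    ("/html/body/div[3]/b", "Weight"), ("/html/body/div[3]/span", "97"),
    ("/html/body/div[4]/b", "Height"), ("/html/body/div[4]/span", "180"),
    ("/html/body/div[5]/h2", "Profile"), ("/html/body/div[5]/p", "career"),
]
page2 = [
    ("/html/body/div[1]/a", "Home"), ("/html/body/div[2]/h1", "Sport!"),
    ("/html/body/div[3]/b", "Weight"), ("/html/body/div[3]/span", "136"),
    ("/html/body/div[4]/b", "Height"), ("/html/body/div[4]/span", "212"),
    ("/html/body/div[5]/h2", "Profile"), ("/html/body/div[5]/p", "Giant"),
    ("/html/body/div[5]/p", "Height"),
]
template = extract_template_terms([page1, page2])
template_terms = {token for _, token in template}
```

Keying on the path as well as the token separates the label "Height" in the template from the word "Height" that happens to appear in the free text of page 2.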

16 Launching searches on the Web
For each entity identifier, the system launches one search on the web to discover new target pages.
To focus the searches, the query includes the entity description.
Example query: Michael Jordan (identifier) + pts height weight min ast (entity description)
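Building such a query is a plain string concatenation. The sketch below shows it together with URL encoding through the standard `urllib.parse.quote_plus`. The helper names `build_query` and `encode_query` are hypothetical, and the slides do not say how the query is actually submitted to the search engine.

```python
from urllib.parse import quote_plus


def build_query(identifier, description):
    """Join one instance identifier with the entity description keywords."""
    return " ".join([identifier, *description])


def encode_query(query):
    """URL-encode the query so it can be placed in a search URL parameter."""
    return quote_plus(query)


entity_description = ["pts", "height", "weight", "min", "ast"]
query = build_query("Michael Jordan", entity_description)
```

One query is built per identifier, so a site crawl that yields thousands of athlete names leads to thousands of searches.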

17 We compute and check the template of each result.
The pages whose template contains terms that match the keywords of the entity description are considered instances of the entity.
- only a percentage of the terms is taken into account
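The slide leaves open how the percentage is applied. Under one possible reading, a result page is accepted when its template contains at least a given fraction of the entity description terms. The sketch below implements that reading only; the function name `matches_entity` and the default `min_fraction` are assumptions.

```python
def matches_entity(template_terms, description, min_fraction=0.5):
    """Accept a page if enough description terms occur in its template.

    Matching is case-insensitive. `min_fraction` is the share of the
    description terms that must be found (an assumed interpretation).
    """
    wanted = {t.lower() for t in description}
    if not wanted:
        return False
    found = wanted & {t.lower() for t in template_terms}
    return len(found) / len(wanted) >= min_fraction
```

Raising `min_fraction` makes the filter stricter, which would be expected to trade recall for precision.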

19 Experiments
We ran some experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:
- Basketball player
- Soccer player
- Hockey player
- Golf player
We chose the sport domain because it is easy to:
- interpret published data
- evaluate precision of results

20 Experiments: extracted entity descriptions
All the terms can reasonably represent attribute names for the corresponding player entity.

21 Experiments: using entity descriptions
[Plot: % of terms used in the filtering of Google results vs. recall and precision]
500 pages from 10 soccer web sites
Google returned about 15,000 pages distributed over 4,000 distinct web sites
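For reference, precision and recall over a set of accepted pages can be computed as below. The function name and the toy page identifiers in the usage note are illustrative only and are not data from the experiment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of page identifiers.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if the filter accepts four pages of which two are real player pages, and three player pages exist in total, precision is 2/4 and recall is 2/3.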

22 Experiments: pages found
“Hockey player” entity
- 2 iterations of the cycle
- > 12,000 pages found
- > 5,000 distinct instances

23 Related work
Our method is inspired by DIPRE (S. Brin, WebDB, 1998)
Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
- Typically rely on text classifiers to determine the relevance of the visited pages to the target topic
- There are analogies with our approach, but we look for pages containing instances of an entity
CIMPLE (A. Doan et al., SIGIR, 2006)
- Building a platform to support the information needs of a virtual community
- An expert is needed to provide relevant sources and design the E-R model of the domain of interest

24 Conclusions and future work
We populated an entity aware search engine for sport fans, using the facilities of Google Co-op: http://flint.dia.uniroma3.it/ (Demo section)
To improve the entity description, we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates.
We are investigating the usage of automatic wrapping techniques to extract, mine and integrate data from the web pages collected by the proposed approach.

25 Thank you!

26 Probabilistic studies on entity keywords
[Plots for 3, 5, 10 and 20 sources]

27 Forums

28 Experiments: pages found
