Supporting the Automatic Construction of Entity Aware Search Engines Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Dipartimento di Informatica.


Supporting the Automatic Construction of Entity Aware Search Engines
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre

Introduction
A huge number of web sites publish pages based on data stored in databases.
Each of these pages often contains information about a single instance of a conceptual entity.
(Figure: a BasketballPlayer entity with attributes name, birthdate, college, weight, height.)

Introduction

Introduction
We developed a system that:
- takes as input a small set of sample pages from distinct web sites
- automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples

Overall Approach
Given a bunch of sample pages:
- crawl the web sites of the sample pages to gather other pages offering the same type of information
- extract a set of keywords that describe the underlying entity
- do
  - launch web searches to find other sources with pages that contain instances of the target entity
  - analyze the results to filter out irrelevant pages
  - crawl the new sources to gather new pages
  while new pages are found
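
The loop on this slide can be sketched in Python as follows. This is only an illustrative outline, not the authors' implementation: the four callables (crawl_site, extract_keywords, search_web, is_relevant) are hypothetical placeholders for the steps detailed on the later slides, and max_iterations is an added safety bound that the slides do not mention.

```python
from typing import Callable, Iterable, Set


def discover_pages(
    sample_pages: Iterable[str],
    crawl_site: Callable[[str], Set[str]],
    extract_keywords: Callable[[Set[str]], Set[str]],
    search_web: Callable[[Set[str], Set[str]], Set[str]],
    is_relevant: Callable[[str, Set[str]], bool],
    max_iterations: int = 10,  # assumption: safety bound, not from the slides
) -> Set[str]:
    """Return every page gathered from the seed sites and from discovered sources."""
    # Step 1: crawl the sites of the sample pages.
    known: Set[str] = set()
    for page in sample_pages:
        known |= crawl_site(page)

    # Step 2: extract the keywords that describe the underlying entity.
    keywords = extract_keywords(known)

    # Step 3: do ... while new pages are found.
    for _ in range(max_iterations):
        candidates = search_web(known, keywords)
        relevant = {p for p in candidates if is_relevant(p, keywords)}
        new_pages: Set[str] = set()
        for page in relevant:
            new_pages |= crawl_site(page)
        new_pages -= known
        if not new_pages:
            break
        known |= new_pages
    return known
```

Passing the steps in as functions keeps the control flow separate from the crawling, searching and filtering details.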


Crawling the seed sites
Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same information.
A crawling algorithm scans the web site toward pages sharing the same structure as the input sample page.
The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names).
(Figure: site -> Crawler -> Instance Identifiers: Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …)

Crawler: intuition
Given a sample page, the system explores the site structure looking for pages that work as indexes to "similar" pages.
The similarity between pages is measured by analyzing their structure.
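
The slide does not say how structural similarity is computed. As one possible illustration, the sketch below compares the sets of root-to-element tag paths of two pages with a Jaccard ratio. Both the path representation and the Jaccard measure are assumptions made here, and the names tag_paths and structural_similarity are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. /html/body/div) seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets (assumed measure)."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)
```

Two player pages generated from the same template would score close to 1.0 under this measure, while an index page or a news article on the same site would score lower.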


Extraction of the entity description
On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords.
These keywords usually appear in the page template.

Extraction of the entity description
(Figure: sample player page whose labels include PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro, photos, Buy photo.)

Extraction of the entity description
For each known web site, we extract a set of keywords from its template.
The entity description is a set of keywords built by combining these sets.
We favour the more frequent terms.
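
A minimal sketch of this combination step, assuming "more frequent" means "appearing in the templates of more web sites". The function name build_entity_description, the size parameter and the lowercase normalization are illustrative choices, not details given in the slides.

```python
from collections import Counter


def build_entity_description(site_keywords, size=5):
    """Combine per-site template keyword sets, keeping the most frequent terms.

    site_keywords: one iterable of keywords per known web site.
    size: number of terms to keep (hypothetical parameter).
    """
    counts = Counter()
    for keywords in site_keywords:
        # Count each term at most once per site, ignoring case.
        counts.update({k.lower() for k in keywords})
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


sites = [
    {"Height", "Weight", "College"},
    {"height", "weight", "PPG"},
    {"Height", "Born"},
]
description = build_entity_description(sites, size=2)
```

Here "height" occurs on three sites and "weight" on two, so they are the two terms kept.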

Template Extraction: intuition
To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens (inspired by Arasu & Garcia-Molina, SIGMOD 2003).

Template Extraction: intuition
page 1: Home | Sport! | Weight 97 | Height 180 | Profile | The career...
page 2: Home | Sport! | Weight 136 | Height 212 | Profile | Giant... Height...
Example token paths: /html/body/div[3]/b, /html/body/div[4]/span
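
The two-page example can be turned into a small frequency-based sketch. It treats a token as part of the template when the same (path, token) pair occurs in a large enough fraction of the pages. The pairing of tokens with paths below is invented for illustration (the slide shows only two paths), min_fraction is a hypothetical parameter, and this is a simplification rather than the Arasu & Garcia-Molina algorithm.

```python
from collections import Counter


def template_tokens(pages, min_fraction=1.0):
    """Return tokens whose (path, token) pair occurs in enough pages.

    pages: list of pages, each a list of (path, token) pairs.
    min_fraction: share of pages that must contain the pair (assumption).
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each pair once per page
    threshold = min_fraction * len(pages)
    return {token for (path, token), n in counts.items() if n >= threshold}


# Hypothetical (path, token) pairs modelled on the two pages of the slide.
page1 = [
    ("/html/body/div[1]", "Home"), ("/html/body/div[2]", "Sport!"),
    ("/html/body/div[3]/b", "Weight"), ("/html/body/div[3]/span", "97"),
    ("/html/body/div[4]/b", "Height"), ("/html/body/div[4]/span", "180"),
    ("/html/body/div[5]/b", "Profile"), ("/html/body/div[5]/p", "career"),
]
page2 = [
    ("/html/body/div[1]", "Home"), ("/html/body/div[2]", "Sport!"),
    ("/html/body/div[3]/b", "Weight"), ("/html/body/div[3]/span", "136"),
    ("/html/body/div[4]/b", "Height"), ("/html/body/div[4]/span", "212"),
    ("/html/body/div[5]/b", "Profile"), ("/html/body/div[5]/p", "Giant"),
    ("/html/body/div[5]/p", "Height"),
]
template = template_tokens([page1, page2])
```

Keeping the path in the key means that the word "Height" inside the free-text profile of page 2 is not confused with the template label "Height".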


Launching searches on the Web
For each entity identifier, the system launches one web search to discover new target pages.
To focus the searches, the query includes the entity description.
Example query: Michael Jordan (identifier) + pts height weight min ast (entity description)
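
Query construction can be sketched as follows. Quoting the identifier as an exact phrase is an assumption made here, and build_queries is a hypothetical helper; the slide only shows the identifier followed by the description terms.

```python
def build_queries(identifiers, entity_description):
    """Build one search query per entity identifier.

    The identifier is quoted as an exact phrase (assumption) and followed
    by the entity description keywords.
    """
    suffix = " ".join(entity_description)
    return [f'"{identifier}" {suffix}' for identifier in identifiers]


queries = build_queries(
    ["Michael Jordan"],
    ["pts", "height", "weight", "min", "ast"],
)
```

Each query string would then be submitted to a general-purpose search engine.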

We compute and check the template of each result.
Pages whose template contains terms that match the keywords of the entity description are considered instances of the entity.
- Only a percentage of the terms is taken into account.
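
One reading of this filter, sketched below, accepts a result when at least a given fraction of the entity description terms appears in the page template. The function name matches_entity and the default threshold of 0.5 are assumptions, and the slides may apply the percentage differently.

```python
def matches_entity(template_terms, entity_description, min_fraction=0.5):
    """Return True if enough description terms occur in the page template.

    min_fraction: share of description terms that must match (assumption).
    """
    description = {t.lower() for t in entity_description}
    if not description:
        return False
    found = {t.lower() for t in template_terms} & description
    return len(found) / len(description) >= min_fraction


description = ["pts", "height", "weight", "min", "ast"]
player_page = matches_entity({"Height", "Weight", "PTS", "Team"}, description)
shop_page = matches_entity({"Price", "Weight"}, description)
```

The first page matches three of the five terms and is kept; the second matches only one and is discarded.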


Experiments
We ran some experiments to analyze the approach.
We focused on the sport domain, looking for pages containing data about the following entities:
- Basketball player
- Soccer player
- Hockey player
- Golf player
We chose the sport domain because it is easy to:
- interpret published data
- evaluate precision of results

Experiments: extracted entity descriptions
All the terms can reasonably represent attribute names for the corresponding player entity.

Experiments: using entity descriptions
% of terms (used in the filtering of Google results) vs recall & precision
500 pages from 10 soccer web sites
Google returned about pages distributed over distinct web sites
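
For reference, precision and recall of a filtered result set can be computed as below. This is the standard definition rather than the authors' evaluation code, and the page identifiers in the example are made up.

```python
def precision_recall(returned, relevant):
    """Standard precision and recall of a returned set against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 pages pass the filter, 3 pages are truly relevant.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Requiring a higher share of description terms in the filter would typically raise precision and lower recall.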

Experiments: pages found
"Hockey player" entity, 2 iterations of the cycle:
- > 12,000 pages found
- > 5,000 distinct instances

Related work
Our method is inspired by DIPRE (S. Brin, WebDB, 1998)
Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
- Typically rely on text classifiers to determine the relevance of the visited pages to the target topic
- Analogies, but we look for pages containing instances of an entity
CIMPLE (A. Doan et al., SIGIR, 2006)
- Building a platform to support the information needs of a virtual community
- An expert is needed to provide relevant sources and design the E-R model of the domain of interest

Conclusions and future work
We populated an entity aware search engine for sport fans, using the facilities of Google Co-op (Demo section).
To improve the entity description, we are working on a probabilistic model to dynamically compute a weight for the terms of the page templates.
We are investigating the usage of automatic wrapping techniques to extract, mine and integrate data from the web pages collected by the proposed approach.

Thank you!

Probabilistic studies on entity keywords
(Figure panels: 3 sources, 5 sources, 10 sources, 20 sources)

Forums

Experiments: pages found