Cimple: Building Community Portal Sites through Crawling & Extraction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management.


Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives, University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission

Administrivia
- By next Tuesday: a rough schedule and division of duties for your project
- Please read the Halevy et al. paper on Piazza

The Web Is Full of Special-Interest Portal Sites for Communities
- Academia
  - Certain bioinformatics topics; citations; etc.
- Medicine
  - WebMD
- Infotainment
  - Rotten Tomatoes, IMDB, fantasy football
- Business
  - enterprise intranets, tech support groups, lawyers
- CIA / homeland security
  - Intellipedia
- Some of these gather information from the Web

Cimple Wisconsin (+ Yahoo)
- Develops a general solution to community Web portals using extraction + integration + mass collaboration
[Figure: Cimple architecture. Data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) supply Web pages and text documents; an ER graph is built from them (e.g., Jim Gray, give-talk, SIGMOD-04); services over the graph: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Side labels: maintain and add more sources; mass collaboration.]

The Basic Ideas
- Architecture mainly consists of extractors and ER graphs
- The key challenge is to extract “good” information for the ER graphs, and to allow all of the information to be extended and repaired

Prototype System: DBLife
- Integrate data of the DB research community
- 1164 data sources
- Crawled daily, pages = 160+ MB / day

Data Integration
[Figure: Raghu Ramakrishnan; co-authors = A. Doan, Divesh Srivastava, ...]

Resulting ER Graph
[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]

Provide Services
- DBLife system

Mass Collaboration via Wiki

Issues Addressed by Cimple
- Cimple addresses challenges in
  1. Source selection
  2. Extraction and integration
  3. Detecting problems and providing feedback
  4. Mass collaboration

1. Source Selection
[Figure: the Cimple architecture diagram, repeated: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP), Web pages and text documents, ER graph (e.g., Jim Gray, give-talk, SIGMOD-04), services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary), maintain and add more sources, mass collaboration.]

Current Solutions vs. Cimple
- Current solutions: topic-specific crawlers
  - find all relevant data sources (e.g., using focused crawling, search engines)
  - maximize coverage
  - results in many “noisy” sources
- Cimple allows for incremental development and deployment
  - starts with a small set of high-quality “core” sources
  - incrementally adds more sources
    - only from “high-quality” places
    - or as suggested by users (mass collaboration)

Start with a Small Set of “Core” Sources
- Key observation: communities often follow an 80/20 rule
  - 20% of sources cover 80% of interesting activities
- An initial portal over these 20% is often already quite useful
- How do we select these 20%?
  - select as many sources as possible
  - then evaluate and select the most relevant ones
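The "evaluate, then keep the most relevant" step can be sketched as a simple top-fraction cut. This is only an illustration: the function name, the source names, and the relevance scores are hypothetical, and the slides do not say that Cimple uses a fixed cutoff.

```python
import math


def select_core_sources(relevance, fraction=0.2):
    """Keep the top `fraction` of sources by relevance score (at least one)."""
    keep = max(1, math.ceil(len(relevance) * fraction))
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:keep]


# Hypothetical relevance scores for five candidate sources.
candidates = {
    "dblp": 0.9,
    "sigmod-site": 0.7,
    "group-page": 0.4,
    "blog": 0.2,
    "forum": 0.1,
}
core = select_core_sources(candidates)
```

With five candidates and the default fraction, only the single highest-scoring source is kept; the scores themselves would come from a relevance measure such as the one on the next slide.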

Evaluate the Relevance of Sources
- Use PageRank + virtual links across entities + TF/IDF
[Figure: pages mentioning “... Gerhard Weikum” and “G. Weikum”]
See [VLDB-07a]
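A minimal sketch of the PageRank-plus-virtual-links idea, assuming that a "virtual link" connects two pages that mention the same entity. That reading, the toy graph, and both function names are assumptions for illustration; the TF/IDF component and the actual scoring in [VLDB-07a] are not reproduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing nodes]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
        rank = new_rank
    return rank


def add_virtual_links(links, mentions):
    """Link two pages in both directions when they mention a common entity."""
    extended = {page: list(targets) for page, targets in links.items()}
    pages = sorted(mentions)
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            if mentions[a] & mentions[b]:
                extended.setdefault(a, []).append(b)
                extended.setdefault(b, []).append(a)
    return extended


# Hypothetical toy data: hyperlinks between pages and entities each mentions.
hyperlinks = {"dblp": ["sigmod"], "sigmod": ["dblp"], "blog": []}
mentions = {
    "dblp": {"Gerhard Weikum"},
    "sigmod": {"Gerhard Weikum"},
    "blog": {"Jane Doe"},
}
scores = pagerank(add_virtual_links(hyperlinks, mentions))
```

In this toy graph the two pages that share an entity reinforce each other, so both end up ranked above the unconnected page.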

Add More Sources over Time
- Key observation: most important sources will eventually be mentioned within the community
  - so monitor certain “community channels” to find them
- Example message:
  Message type: conf. ann.
  Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
  Call for Participation: Workshop on "Management of Uncertain Data" in conjunction with VLDB
- Also allow users to suggest new sources
  - e.g., the Silicon Valley Database Society
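One way to picture monitoring a community channel is to count how many messages mention each not-yet-known host and propose the frequently mentioned ones. The sketch below assumes messages arrive as plain strings; the function name, the threshold, and the example hosts are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+")


def suggest_sources(messages, known_hosts, min_mentions=2):
    """Return unknown hosts mentioned in at least `min_mentions` messages."""
    counts = Counter()
    for text in messages:
        # Count each host at most once per message.
        hosts = {urlparse(url).netloc.lower() for url in URL_PATTERN.findall(text)}
        counts.update(h for h in hosts if h and h not in known_hosts)
    return [host for host, n in counts.most_common() if n >= min_mentions]


# Hypothetical mailing-list messages.
messages = [
    "Call for Participation: see http://example.org/mud for details",
    "Reminder: http://example.org/mud/dates and http://dblp.example/x",
    "Unrelated message without links",
]
suggestions = suggest_sources(messages, known_hosts={"dblp.example"})
```

Hosts suggested this way would still need the relevance check from the previous slide before being added.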

Summary: Source Selection
- Incremental approach:
  - start with highly relevant sources
  - expand carefully
  - minimize “garbage in, garbage out”
- Need a notion of source relevance
- Need a way to compute this

2. Extraction and Integration
[Figure: the Cimple architecture diagram, repeated: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP), Web pages and text documents, ER graph (e.g., Jim Gray, give-talk, SIGMOD-04), services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary), maintain and add more sources, mass collaboration.]

Extracting Entity Mentions
- Key idea: reasonable plan, then “patch”
- Reasonable basic plan:
  - collect person names, e.g., David Smith
  - generate variations, e.g., D. Smith, Dr. Smith, etc.
  - find occurrences of these variations
[Plan diagram: ExtractMbyName over Union(s1 … sn)]
- Works well, but can’t handle certain difficult spots
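The basic plan (collect names, generate variations, find occurrences) can be sketched with a small dictionary-based matcher. The set of variations and the function names below are illustrative assumptions, not the actual ExtractMbyName operator.

```python
import re


def name_variations(full_name):
    """Generate a few written forms of a 'First Last' name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        f"{first} {last}",      # David Smith
        f"{first[0]}. {last}",  # D. Smith
        f"Dr. {last}",          # Dr. Smith
        f"{last}, {first[0]}.", # Smith, D.
    }


def extract_mentions_by_name(text, dictionary):
    """Return (position, matched text, canonical name), sorted by position."""
    found = []
    for name in dictionary:
        for variant in name_variations(name):
            # Require non-word characters (or text boundaries) on both sides.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                found.append((m.start(), m.group(), name))
    return sorted(found)


text = "Talk by D. Smith; Dr. Smith also chairs."
mentions = extract_mentions_by_name(text, ["David Smith"])
```

Both "D. Smith" and "Dr. Smith" are found and mapped back to the dictionary entry "David Smith".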

Handling Difficult Spots
- Example
  - R. Miller, D. Smith, B. Jones
  - if “David Miller” is in the dictionary
  - will flag “Miller, D.” as a person name
- Solution: patch such spots with stricter plans
[Plan diagram: operators ExtractMbyName, FindPotentialNameLists, and ExtractMStrict over Union(s1 … sn)]
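A rough sketch of the patching idea: first detect spans that look like comma-separated name lists, then match only whole list items inside them. The regular expressions and function names are assumptions for illustration and are much simpler than what FindPotentialNameLists and ExtractMStrict would need in practice.

```python
import re

# An item such as "R. Miller": one initial followed by a capitalized surname.
INITIAL_NAME = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{INITIAL_NAME}(?:, {INITIAL_NAME})+")


def find_potential_name_lists(text):
    """Return (start, end) spans that look like 'F. Last, F. Last, ...'."""
    return [(m.start(), m.end()) for m in NAME_LIST.finditer(text)]


def extract_strict(text, span, dictionary):
    """Inside a name list, match only whole comma-separated 'F. Last' items."""
    start, end = span
    mentions = []
    for item in text[start:end].split(", "):
        initial, last = item[0], item.split()[-1]
        for name in dictionary:
            parts = name.split()
            if parts[0][0] == initial and parts[-1] == last:
                mentions.append((item, name))
    return mentions


text = "R. Miller, D. Smith, B. Jones"
dictionary = ["David Miller", "David Smith"]
spans = find_potential_name_lists(text)
strict = extract_strict(text, spans[0], dictionary)
```

The substring "Miller, D." does occur in the text, which is what misleads the looser plan; the strict pass reports only "D. Smith" as a mention of David Smith.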

Matching Entity Mentions
- Key idea: reasonable plan, then patch
- Reasonable plan
  - if mention names are the same (modulo some variation), match them
  - e.g., David Smith and D. Smith
[Plan diagram: operators MatchMbyName, Union, and Extract Plan over s1 … sn]
- Works well, but can’t handle certain difficult spots
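Matching "modulo some variation" can be sketched by reducing each mention to a normalized key and grouping on it. The key used here (first initial plus last name) is an assumption chosen for brevity; it ignores titles such as "Dr." and is not the actual MatchMbyName logic.

```python
from collections import defaultdict


def name_key(mention):
    """Normalize 'David Smith' and 'D. Smith' to the same key ('d', 'smith')."""
    tokens = mention.replace(".", "").split()
    return (tokens[0][0].lower(), tokens[-1].lower())


def match_by_name(mentions):
    """Group mentions whose normalized keys are equal."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())


groups = match_by_name(["David Smith", "D. Smith", "B. Jones"])
```

A key this coarse would also merge two different people who share an initial and a surname, which is the kind of difficult spot the next slide addresses.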

Handling Difficult Spots
- Estimate the semantic ambiguity of data sources
  - use social networking techniques related to cohesion of graphs [see ICDE-07a]
- Apply stricter matchers to more ambiguous sources
[Plan diagram: MatchMbyName over Union of Extract Plan on {s1 … sn} \ DBLP; MatchMStrict over Extract Plan on DBLP]
- Example of an ambiguous name, DBLP: Chen Li
  - 41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB
  - 38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
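One possible stricter matcher, sketched under the assumption that two same-name mentions should be merged only when they share at least one co-author. The first two records mirror the DBLP "Chen Li" entries on the slide; the third record is hypothetical and added only to show a merge. Estimating source ambiguity via graph cohesion is not sketched here.

```python
def match_strict(mentions):
    """Cluster same-name mentions only when they share a co-author.

    `mentions` is a list of (name, set of co-author names) pairs.
    """
    clusters = []
    for name, coauthors in mentions:
        merged = {"name": name, "coauthors": set(coauthors), "size": 1}
        remaining = []
        for cluster in clusters:
            if cluster["name"] == name and cluster["coauthors"] & merged["coauthors"]:
                # Same name and overlapping co-authors: fold into one cluster.
                merged["coauthors"] |= cluster["coauthors"]
                merged["size"] += cluster["size"]
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


mentions = [
    ("Chen Li", {"Bin Wang", "Xiaochun Yang"}),
    ("Chen Li", {"Ping-Qi Pan", "Jian-Feng Hu"}),
    ("Chen Li", {"Bin Wang"}),  # hypothetical extra mention for illustration
]
clusters = match_strict(mentions)
```

Name-only matching would put all three mentions in one cluster; here the second mention stays separate because it shares no co-author with the others.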

Summary: Extraction and Integration
- Most current solutions
  - try to find a single good plan, applied to all of the data
- Cimple solution: reasonable plan, then patch
- So the focus shifts to:
  - how to find a reasonable plan?
  - how to detect problematic data spots?
  - how to patch those?
- Need a notion of semantic ambiguity
- Different from the notion of source relevance

3. Detecting Problems and Making Corrections
[Figure: the Cimple architecture diagram, repeated: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP), Web pages and text documents, ER graph (e.g., Jim Gray, give-talk, SIGMOD-04), services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary), maintain and add more sources, mass collaboration.]

How to Detect Problems?
- After extraction and matching, build services
  - e.g., superhomepages
- Many such homepages contain minor problems, e.g.,
  - X graduated in
  - X chairs SIGMOD-05 and VLDB-05
  - X published 5 SIGMOD-03 papers
- Intuitively, something is semantically incorrect
- To fix this, build a Semantic Debugger
  - learns what is a normal profile for researcher, paper, etc.
  - alerts the builder to potentially buggy superhomepages
  - so corrections / feedback can be provided
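The slides do not say how the Semantic Debugger learns a normal profile. As one hedged illustration, the sketch below learns the mean and spread of "papers per venue per year" from trusted data and flags values that lie far outside it; the z-score rule, the threshold, and all data are assumptions.

```python
from statistics import mean, pstdev


def learn_profile(training_values):
    """Summarize what a 'normal' value looks like in trusted data."""
    return {"mean": mean(training_values), "std": pstdev(training_values)}


def flag_suspicious(profile, observed, threshold=3.0):
    """Return entities whose value is more than `threshold` std devs from the mean."""
    spread = profile["std"] or 1.0  # avoid division by zero on constant data
    return [
        entity
        for entity, value in observed.items()
        if abs(value - profile["mean"]) / spread > threshold
    ]


# Hypothetical counts of papers by one author at one venue in one year.
profile = learn_profile([0, 1, 1, 2, 0, 1, 1, 2])
alerts = flag_suspicious(profile, {"X": 5, "Y": 1})
```

A count of 5 is far outside the learned profile, so X's superhomepage would be surfaced to the builder for review.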

What Types of Feedback?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
  - e.g., no researcher has ever published 5 SIGMOD papers in a year
- Add more data
  - e.g., X was advised by Z
  - e.g., here is the URL of another data source
- Modify the underlying algorithm
  - e.g., pull out all data involving X; match using names and co-authors, not just names

How to Make Providing Feedback Very Easy?
- Extremely crucial in DBLife context
- If feedback can be provided easily
  - can get more feedback
  - can leverage the mass of users

How to Make Providing Feedback Very Easy?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
- Add more data
- Modify the underlying algorithm
Annotations on the slide:
- Provide form interfaces
- Provide a Wiki interface
- Critical but unsolved
- Unsolved: some recent interest on how to mass customize software

Summary: Detection and Feedback
- How to detect problems?
  - Semantic Debugger
- What types of feedback & how to easily provide them?
  - critical, largely unsolved
- What feedback would make most impact?
  - crucial in large-scale systems
  - need a notion of a Feedback Advisor
  - need a precise notion of system quality

4. Mass Collaboration
[Figure: the Cimple architecture diagram, repeated: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP), Web pages and text documents, ER graph (e.g., Jim Gray, give-talk, SIGMOD-04), services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary), maintenance and expansion, mass collaboration.]

Mass Collaboration: Voting
- Can be applied to numerous problems

Example: Matching
- Hard for machine, but easy for human
- Records to match:
  - Mouse for Dell laptop 200 series...
  - Dell X200; mouse at reduced price...
  - Dell laptop X200 with mouse...
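Voting on a match question can be sketched as a majority rule over user answers. The minimum number of votes and the tie handling below are assumptions; the slides do not specify how votes are aggregated.

```python
from collections import Counter


def majority_vote(votes, min_votes=3):
    """Return the winning answer, or None if there are too few votes or a tie."""
    if not votes or len(votes) < min_votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most common answers
    return ranked[0][0]


# Hypothetical answers to "do these two records describe the same product?"
decision = majority_vote([True, True, False])
```

Returning None leaves the question open so that more users can be asked before the portal commits to an answer.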

Mass Collaboration: Wiki
- Community wikipedia
  - built by machine + human
  - backed up by a structured database
[Figure labels: Data Sources, G, T, V1, V2, V3, W1, W2, W3, u1, V3’, W3’, T3’, M]

Mass Collaboration: Wiki
[Figure: successive machine and human versions of a wiki page for David J. DeWitt. Content ranges from “Professor; since 1976; Interests: Parallel Database” to “John P. Morgridge Professor, UW-Madison, since 1976; Interests: Parallel Database, Privacy”.]
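A minimal sketch of combining machine-extracted and human-edited page content, assuming that human edits win for single-valued fields and that list-valued fields are unioned. That precedence rule, the function name, and the field names are assumptions; the slide does not state how Cimple reconciles the two.

```python
def merge_profile(machine, human):
    """Combine a machine-extracted profile with human edits.

    Assumption: human values replace single-valued fields, and list-valued
    fields are unioned so that neither side's entries are lost.
    """
    merged = dict(machine)
    for field, value in human.items():
        if isinstance(value, list) and isinstance(merged.get(field), list):
            extra = [v for v in value if v not in merged[field]]
            merged[field] = merged[field] + extra
        else:
            merged[field] = value
    return merged


machine = {
    "name": "David J. DeWitt",
    "title": "Professor",
    "interests": ["Parallel Database"],
}
human = {"title": "John P. Morgridge Professor", "interests": ["Privacy"]}
profile = merge_profile(machine, human)
```

Keeping the machine and human inputs separate, and merging on demand, means a later re-extraction does not overwrite what users contributed.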

Summary: Mass Collaboration
- What can users contribute?
- How to evaluate user quality?
- How to reconcile inconsistent data?

Summary: Cimple
- A very interesting attempt to rethink Web crawling and information extraction
  - Based on a “best-effort” notion
- One of many concurrent efforts in that vein
  - “Dataspaces”
  - Simple building blocks, progressive refinement

Open Questions and Issues
- Incorporating uncertain data
- Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?
- How do we provide feedback? Particularly as there are many learners, data and feedback are very sparse
- Others?