Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives, University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission
Administrivia
By next Tuesday: a rough schedule and division of duties for your project
Please read the Halevy et al. paper on Piazza
The Web Is Full of Special-Interest Portal Sites for Communities
- Academia: certain bioinformatics topics; citations; etc.
- Medicine: WebMD
- Infotainment: Rotten Tomatoes, IMDB, fantasy football
- Business: enterprise intranets, tech support groups, lawyers
- CIA / homeland security: Intellipedia
Some of these gather information from the Web
Cimple
A project from Wisconsin (+ Yahoo) that develops a general solution to community Web portals using extraction + integration + mass collaboration
[Architecture figure: data sources (researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP, Web pages, text documents) feed an ER graph of entities such as Jim Gray and SIGMOD-04 linked by relationships like give-talk; the graph powers keyword search, SQL querying, question answering, browsing, mining, alerts/monitoring, and news summaries; mass collaboration maintains the system and adds more sources]
The Basic Ideas
The architecture mainly consists of extractors and ER graphs
The key challenge is to extract “good” information for the ER graphs, and to allow all of the information to be extended and repaired
Prototype System: DBLife
Integrates data of the DB research community
1,164 data sources, crawled daily; 160+ MB of pages per day
Data Integration
Example: Raghu Ramakrishnan, co-authors = A. Doan, Divesh Srivastava, ...
Resulting ER Graph
[Figure: an ER graph linking the paper “Proactive Re-optimization” (SIGMOD 2005) to Jennifer Widom, Shivnath Babu, David DeWitt, and Pedro Bizarro via edges such as coauthor, advise, write, PC-Chair, and PC-member]
Provide Services
[Screenshots of the DBLife system]
Mass Collaboration via Wiki
Issues Addressed by Cimple
Cimple addresses challenges in:
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration
1. Source Selection
Current Solutions vs. Cimple
Current solutions: topic-specific crawlers
- find all relevant data sources (e.g., using focused crawling, search engines)
- maximize coverage
- results in many “noisy” sources
Cimple allows for incremental development and deployment
- starts with a small set of high-quality “core” sources
- incrementally adds more sources, only from “high-quality” places or as suggested by users (mass collaboration)
Start with a Small Set of “Core” Sources
Key observation: communities often follow an 80/20 rule: 20% of the sources cover 80% of the interesting activities
An initial portal over these 20% is often already quite useful
How do we select these 20%? Select as many sources as possible, then evaluate them and select the most relevant ones
Evaluate the Relevance of Sources
Use PageRank + virtual links across entities + TF/IDF ...
[Figure: virtual links connect mentions of the same entity, e.g., “Gerhard Weikum” and “G. Weikum”, across sources]
See [VLDB-07a]
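The idea above can be sketched in code. This is an illustrative reconstruction, not the paper's actual algorithm: run plain power-iteration PageRank over a graph that contains both real hyperlinks and "virtual" links added between pages that mention the same entity. All function and variable names here are invented for the sketch.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iters=50):
    """Simple power-iteration PageRank over an adjacency dict {node: set(targets)}."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            outs = edges.get(u, ())
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / len(nodes)
        rank = new
    return rank

def with_virtual_links(hyperlinks, mentions):
    """Add bidirectional 'virtual' edges between pages that share an entity mention."""
    edges = defaultdict(set)
    for u, vs in hyperlinks.items():
        edges[u] |= set(vs)
    by_entity = defaultdict(set)
    for page, ents in mentions.items():
        for e in ents:
            by_entity[e].add(page)
    for pages in by_entity.values():
        for p in pages:
            edges[p] |= pages - {p}
    return dict(edges)
```

A page that is never hyperlinked can still gain rank through virtual links if it mentions well-known community entities, which is the point of the technique.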
Add More Sources over Time
Key observation: most important sources will eventually be mentioned within the community, so monitor certain “community channels” to find them
[Example DBworld message — Message type: conf. ann.; Subject: Call for Participation: VLDB Workshop on “Management of Uncertain Data”, in conjunction with VLDB]
Also allow users to suggest new sources, e.g., the Silicon Valley Database Society
Summary: Source Selection
Incremental approach: start with highly relevant sources, expand carefully, minimize “garbage in, garbage out”
Need a notion of source relevance, and a way to compute it
2. Extraction and Integration
Extracting Entity Mentions
Key idea: reasonable plan, then “patch”
A reasonable basic plan:
- collect person names, e.g., David Smith
- generate variations, e.g., D. Smith, Dr. Smith, etc.
- find occurrences of these variations
[Plan: ExtractMbyName over Union(s1 … sn)]
Works well, but can’t handle certain difficult spots
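The basic plan can be sketched as follows. The variation rules here are a minimal illustration (real systems use many more forms, titles, accents, etc.), and all identifiers are invented for the sketch:

```python
def variations(full_name):
    """Generate simple variants of 'First Last', e.g., 'D. Smith', 'Smith, D.'."""
    first, *rest = full_name.split()
    last = rest[-1] if rest else ""
    forms = {full_name}
    if last:
        forms |= {f"{first[0]}. {last}",
                  f"{last}, {first}",
                  f"{last}, {first[0]}."}
    return forms

def extract_mentions(doc, names):
    """Return (canonical name, matched variant) pairs found in the document."""
    hits = []
    for name in names:
        for v in variations(name):
            if v in doc:
                hits.append((name, v))
    return hits
```

This is exactly the kind of plan that works well in general but misfires on difficult spots, as the next slide shows.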
Handling Difficult Spots
Example: in the list “R. Miller, D. Smith, B. Jones”, if “David Miller” is in the dictionary, the basic plan will wrongly flag “Miller, D.” as a person name
Solution: patch such spots with stricter plans
[Plan: FindPotentialNameLists feeding ExtractMStrict, alongside ExtractMbyName over Union(s1 … sn)]
Matching Entity Mentions
Key idea: reasonable plan, then patch
A reasonable plan: mentions whose names are the same (modulo some variation) match, e.g., David Smith and D. Smith
[Plan: MatchMbyName over the extraction plan, Union(s1 … sn)]
Works well, but can’t handle certain difficult spots
Handling Difficult Spots
Estimate the semantic ambiguity of data sources, using social-network techniques related to the cohesion of graphs [see ICDE-07a]
Apply stricter matchers to more ambiguous sources
[Plan: MatchMStrict on DBLP; MatchMbyName on Union(s1 … sn) \ DBLP]
Example of ambiguity in DBLP, where “Chen Li” names more than one person:
- 41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB ...
- 38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation. ...
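One plausible form of a stricter matcher (an illustration, not Cimple's actual MatchMStrict) requires not just compatible names but also at least one shared co-author, which separates the two “Chen Li”s above:

```python
def names_compatible(a, b):
    """'Chen Li' vs 'C. Li': same last name, first names agree up to an initial."""
    af, al = a.rsplit(" ", 1)
    bf, bl = b.rsplit(" ", 1)
    return al == bl and (af == bf
                         or af.rstrip(".") == bf[0]
                         or bf.rstrip(".") == af[0])

def strict_match(m1, m2):
    """Mentions are dicts: {'name': str, 'coauthors': set}. Match only if
    names are compatible AND the mentions share a co-author."""
    return (names_compatible(m1["name"], m2["name"])
            and bool(m1["coauthors"] & m2["coauthors"]))
```

The co-author requirement trades recall for precision, which is the right trade on a highly ambiguous source.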
Summary: Extraction and Integration
Most current solutions try to find a single good plan applied to all of the data
The Cimple solution: a reasonable plan, then patches
So the focus shifts to: how to find a reasonable plan, how to detect problematic data spots, and how to patch them
Needs a notion of semantic ambiguity, which is different from the notion of source relevance
3. Detecting Problems and Making Corrections
How to Detect Problems?
After extraction and matching, build services, e.g., superhomepages
Many such homepages contain minor problems, e.g.:
- X graduated in ...
- X chairs SIGMOD-05 and VLDB-05
- X published 5 SIGMOD-03 papers
Intuitively, something is semantically incorrect
To fix this, build a Semantic Debugger:
- learns what a normal profile looks like for a researcher, paper, etc.
- alerts the builder to potentially buggy superhomepages, so corrections / feedback can be provided
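A minimal sketch of such a debugger, assuming profiles are flat dicts of numeric fields (the field names and the z-score threshold are invented for illustration): learn simple statistics from known-good profiles and flag profiles that deviate sharply.

```python
from statistics import mean, stdev

def learn_profile(good_profiles, field):
    """Learn (mean, stdev) of a numeric field over trusted profiles."""
    vals = [p[field] for p in good_profiles]
    return mean(vals), (stdev(vals) if len(vals) > 1 else 0.0)

def flag_anomalies(profiles, field, model, z=3.0):
    """Return names of profiles whose field is more than z stdevs from normal."""
    mu, sigma = model
    sigma = sigma or 1.0  # avoid division by zero on degenerate models
    return [p["name"] for p in profiles if abs(p[field] - mu) / sigma > z]
```

A real debugger would combine many such signals (and categorical constraints like “graduation precedes chairing”), but the alert-the-builder workflow is the same.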
What Types of Feedback?
Say that a certain data item Y is wrong
- Provide the correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge, e.g., no researcher has ever published 5 SIGMOD papers in a year
- Add more data, e.g., X was advised by Z, or here is the URL of another data source
- Modify the underlying algorithm, e.g., pull out all data involving X, or match using names and co-authors, not just names
How to Make Providing Feedback Very Easy?
Extremely crucial in the DBLife context
If feedback can be provided easily, we can get more feedback and leverage the mass of users
How to Make Providing Feedback Very Easy?
Say that a certain data item Y is wrong:
- Provide the correct value for Y, e.g., Y = SIGMOD-06 (via form interfaces)
- Add domain knowledge, add more data (via a Wiki interface)
- Modify the underlying algorithm (critical but unsolved: some recent interest in how to mass-customize software)
Summary: Detection and Feedback
How to detect problems? A Semantic Debugger
What types of feedback, and how to provide them easily? Critical, largely unsolved
What feedback would make the most impact? Crucial in large-scale systems; needs a notion of a Feedback Advisor and a precise notion of system quality
4. Mass Collaboration
Mass Collaboration: Voting Can be applied to numerous problems
Example: Matching
Hard for a machine, but easy for a human:
- “Mouse for Dell laptop 200 series...”
- “Dell X200; mouse at reduced price...”
- “Dell laptop X200 with mouse...”
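Voting on such match questions can be aggregated very simply. The sketch below (an illustration; real systems weight voters by estimated reliability) takes yes/no votes per candidate pair and decides by majority:

```python
from collections import Counter

def decide_matches(votes):
    """votes: list of (pair_id, 'yes'|'no') user votes.
    Returns {pair_id: bool} decided by simple majority (ties -> no match)."""
    tallies = {}
    for pair, vote in votes:
        tallies.setdefault(pair, Counter())[vote] += 1
    return {pair: c["yes"] > c["no"] for pair, c in tallies.items()}
```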
Mass Collaboration: Wiki
A community wikipedia built by machine + human, backed by a structured database
[Architecture figure: data sources feed a structured database (graph G, table T); the system generates views V1–V3 and wiki pages W1–W3; a user u1’s edit yields W3', V3', and T3', reconciled by a module M]
Mass Collaboration: Wiki
[Figure: a superhomepage for David J. DeWitt, with machine-extracted fields (Professor; since 1976; Interests: Parallel Database) interleaved with human edits (John P. Morgridge Professor; UW-Madison; Interests: Privacy)]
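One simple reconciliation policy for such pages (an assumption for illustration, not Cimple's actual mechanism) is field-level merging where human edits override machine-extracted values, while provenance records who supplied each field:

```python
def reconcile(machine, human):
    """Merge machine-extracted and human-edited fields (dicts field -> value).
    Human edits win per field; provenance records the origin of each value."""
    merged = dict(machine)
    provenance = {f: "machine" for f in machine}
    for field, value in human.items():
        merged[field] = value
        provenance[field] = "human"
    return merged, provenance
```

Keeping provenance matters: when the machine re-extracts a field on the next crawl, the system can avoid silently clobbering a human correction.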
Summary: Mass Collaboration What can users contribute? How to evaluate user quality? How to reconcile inconsistent data?
Summary: Cimple
A very interesting attempt to rethink Web crawling and information extraction
Based on a “best-effort” notion
One of many concurrent efforts in that vein (“Dataspaces”)
Simple building blocks, progressive refinement
Open Questions and Issues
Incorporating uncertain data
Can we really scale, and manage the complexity of, many, many extractors that might produce data of different quality?
How do we provide feedback, particularly when there are many learners and data and feedback are very sparse?
Others?