Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission

Administrivia
- By next Tuesday: a rough schedule and division of duties for your project
- Please read the Halevy et al. paper on Piazza

The Web Is Full of Special-Interest Portal Sites for Communities
- Academia
  - Certain bioinformatics topics; citations; etc.
- Medicine
  - WebMD
- Infotainment
  - Rotten Tomatoes, IMDB, fantasy football
- Business
  - Enterprise intranets, tech support groups, lawyers
- CIA / homeland security
  - Intellipedia
- Some of these gather information from the Web

Cimple Project @ Wisconsin (+ Yahoo)
- Develops a general solution to community Web portals using extraction + integration + mass collaboration
[Figure: architecture diagram. Data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP; Web pages, text documents) feed an ER graph (example: Jim Gray, give-talk, SIGMOD-04). Services on top: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Side labels: maintain and add more sources; mass collaboration.]

The Basic Ideas
- Architecture mainly consists of extractors and ER graphs
- The key challenge is to extract “good” information for the ER graphs, and to allow all of the information to be extended and repaired

Prototype System: DBLife
- Integrates data of the DB research community
- 1164 data sources
- Crawled daily: 11000+ pages = 160+ MB / day

Data Integration
- Example: Raghu Ramakrishnan, co-authors = A. Doan, Divesh Srivastava, ...

Resulting ER Graph
[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]

Provide Services
- DBLife system

Mass Collaboration via Wiki

Issues Addressed by Cimple
Cimple addresses challenges in:
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration

1. Source Selection
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration)]

Current Solutions vs. Cimple
- Current solutions: topic-specific crawlers
  - find all relevant data sources (e.g., using focused crawling, search engines)
  - maximize coverage
  - results in many “noisy” sources
- Cimple allows for incremental development and deployment
  - starts with a small set of high-quality “core” sources
  - incrementally adds more sources
    - only from “high-quality” places
    - or as suggested by users (mass collaboration)

Start with a Small Set of “Core” Sources
- Key observation: communities often follow the 80-20 rule
  - 20% of sources cover 80% of interesting activities
- An initial portal over these 20% is often already quite useful
- How do we select these 20%?
  - select as many sources as possible
  - then evaluate and select the most relevant ones

Evaluate the Relevance of Sources
- Use PageRank + virtual links across entities + TF/IDF ...
- See [VLDB-07a]
[Figure: example mentions “Gerhard Weikum” and “G. Weikum”]
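One way to picture the "PageRank + virtual links" idea is the sketch below. It is an illustration only, not the method of [VLDB-07a]: the functions `pagerank` and `add_virtual_links`, the page names, and the assumption that entity mentions are already normalized (so "G. Weikum" and "Gerhard Weikum" share one key) are all hypothetical, and the TF/IDF component is left out.

```python
from collections import defaultdict

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

def add_virtual_links(links, mentions):
    """Add links in both directions between pages that mention the same entity."""
    by_entity = defaultdict(set)
    for page, entities in mentions.items():
        for entity in entities:
            by_entity[entity].add(page)
    out = {u: list(vs) for u, vs in links.items()}
    for pages in by_entity.values():
        for p in pages:
            for q in pages:
                if p != q and q not in out.setdefault(p, []):
                    out[p].append(q)
    return out

# Hypothetical toy data: three pages, one real hyperlink.
links = {"dblp": ["weikum_home"], "weikum_home": [], "blog": []}
mentions = {"weikum_home": {"Gerhard Weikum"}, "blog": {"Gerhard Weikum"}, "dblp": set()}

plain = pagerank(links)
with_virtual = pagerank(add_virtual_links(links, mentions))
```

In this toy graph the page `blog` has no incoming hyperlinks, yet it gains rank once the shared mention connects it to `weikum_home`.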

Add More Sources over Time
- Key observation: most important sources will eventually be mentioned within the community
  - so monitor certain “community channels” to find them
- Example message:
    Message type: conf. ann.
    Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
    Call for Participation Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007 http://mud.cs.utwente.nl...
- Also allow users to suggest new sources
  - e.g., the Silicon Valley Database Society
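Monitoring a community channel could, in its simplest form, mean pulling URLs out of incoming messages and keeping the hosts that are not yet known sources. The sketch below assumes exactly that; `candidate_sources`, the regular expression, and the example.org hosts are hypothetical and not part of Cimple.

```python
import re
from urllib.parse import urlparse

# Deliberately loose URL pattern; trailing punctuation is trimmed afterwards.
URL = re.compile(r"https?://[^\s<>\"']+")

def candidate_sources(message, known_hosts):
    """Return hosts mentioned in the message that are not already known sources."""
    hosts = []
    for url in URL.findall(message):
        host = urlparse(url.rstrip(".,;)")).netloc.lower()
        if host and host not in known_hosts and host not in hosts:
            hosts.append(host)
    return hosts

# Hypothetical announcement text and known-source list.
message = ("Call for Participation: see http://workshop.example.org/cfp. "
           "Proceedings will be indexed at http://dblp.example.org/x")
new_hosts = candidate_sources(message, known_hosts={"dblp.example.org"})
```

A real deployment would still have to judge the relevance of each candidate before adding it, in line with the incremental approach.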

Summary: Source Selection
- Incremental approach:
  - start with highly relevant sources
  - expand carefully
  - minimize “garbage in, garbage out”
- Need a notion of source relevance
- Need a way to compute this

2. Extraction and Integration
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration)]

Extracting Entity Mentions
- Key idea: reasonable plan, then “patch”
- Reasonable basic plan:
  - collect person names, e.g., David Smith
  - generate variations, e.g., D. Smith, Dr. Smith, etc.
  - find occurrences of these variations
- Works well, but can’t handle certain difficult spots
[Figure: plan diagram. ExtractMbyName applied to the Union of sources s1 … sn]

Handling Difficult Spots
- Example
  - R. Miller, D. Smith, B. Jones
  - if “David Miller” is in the dictionary
  - will flag “Miller, D.” as a person name
- Solution: patch such spots with stricter plans
[Figure: plan diagram. Operators: ExtractMbyName, Union of s1 … sn, FindPotentialNameLists, ExtractMStrict]
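The "reasonable plan, then patch" idea for extraction can be sketched as follows. The function names echo the operator names on the slide, but the bodies are assumptions made for illustration: substring matching for the basic plan, a regular expression for detecting name lists, and exact item matching for the strict plan.

```python
import re

def name_variations(full_name):
    """Generate simple variations of a 'First Last' dictionary name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {full_name, f"{first[0]}. {last}", f"Dr. {last}", f"{last}, {first[0]}."}

def extract_by_name(text, dictionary):
    """Basic plan: flag a name if any of its variations occurs in the text."""
    return {n for n in dictionary if any(v in text for v in name_variations(n))}

# A name list is assumed to look like "R. Miller, D. Smith, B. Jones".
NAME_ITEM = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{NAME_ITEM}(?:, {NAME_ITEM})+")

def find_potential_name_lists(text):
    """Locate the difficult spots: comma-separated lists of 'X. Last' items."""
    return [m.group(0) for m in NAME_LIST.finditer(text)]

def extract_strict(name_list, dictionary):
    """Stricter plan: a name matches only a whole 'X. Last' list item."""
    items = set(name_list.split(", "))
    return {n for n in dictionary
            if f"{n.split()[0][0]}. {n.split()[-1]}" in items}

def extract_patched(text, dictionary):
    """Strict plan inside name lists, basic plan everywhere else."""
    found, rest = set(), text
    for name_list in find_potential_name_lists(text):
        found |= extract_strict(name_list, dictionary)
        rest = rest.replace(name_list, " ")
    return found | extract_by_name(rest, dictionary)

text = "Authors: R. Miller, D. Smith, B. Jones"
dictionary = {"David Miller", "David Smith"}
basic = extract_by_name(text, dictionary)
patched = extract_patched(text, dictionary)
```

On the slide's example the basic plan wrongly reports David Miller, because "Miller, D." appears across a list boundary; the patched plan reports only David Smith.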

Matching Entity Mentions
- Key idea: reasonable plan, then patch
- Reasonable plan
  - if mention names are the same (modulo some variation), declare a match
  - e.g., David Smith and D. Smith
- Works well, but can’t handle certain difficult spots
[Figure: plan diagram. MatchMbyName over the Union of the Extract Plan applied to s1 … sn]

Handling Difficult Spots
- Estimate the semantic ambiguity of data sources
  - use social networking techniques related to cohesion of graphs [see ICDE-07a]
- Apply stricter matchers to more ambiguous sources
[Figure: plan diagram. Operators: MatchMStrict, MatchMbyName, Union, Extract Plan over {s1 … sn} \ DBLP, Extract Plan over DBLP]
- Example of an ambiguous source, DBLP: Chen Li
    · · ·
    41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
    · · ·
    38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
    · · ·
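A minimal sketch of choosing a matcher per source is given below. The strict matcher here requires a shared co-author in addition to a name match; that rule, the per-source ambiguity scores, and the threshold are assumptions for illustration. The slides compute ambiguity with graph-cohesion techniques [see ICDE-07a], which this sketch does not attempt.

```python
def name_key(name):
    """Normalize a name to (first initial, last name), e.g. 'D. Smith' -> ('d', 'smith')."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())

def match_by_name(m1, m2):
    """Basic matcher: names agree modulo simple variation."""
    return name_key(m1["name"]) == name_key(m2["name"])

def match_strict(m1, m2):
    """Stricter matcher (assumed rule): names agree AND a co-author is shared."""
    shared = ({name_key(c) for c in m1["coauthors"]}
              & {name_key(c) for c in m2["coauthors"]})
    return match_by_name(m1, m2) and bool(shared)

def match(m1, m2, ambiguity, threshold=0.5):
    """Use the strict matcher if either mention comes from an ambiguous source."""
    strict = any(ambiguity.get(m["source"], 0.0) > threshold for m in (m1, m2))
    return match_strict(m1, m2) if strict else match_by_name(m1, m2)

# Hypothetical ambiguity scores; the two DBLP entries are from the slide.
ambiguity = {"DBLP": 0.9, "homepages": 0.1}
a = {"name": "Chen Li", "source": "DBLP", "coauthors": ["Bin Wang", "Xiaochun Yang"]}
b = {"name": "Chen Li", "source": "DBLP",
     "coauthors": ["Ping-Qi Pan", "Jian-Feng Hu"]}
c = {"name": "David Smith", "source": "homepages", "coauthors": []}
d = {"name": "D. Smith", "source": "homepages", "coauthors": []}
```

With these inputs the two "Chen Li" mentions match by name alone but are kept apart by the strict matcher, while the two Smith mentions from a low-ambiguity source are still merged by name.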

Summary: Extraction and Integration
- Most current solutions
  - try to find a single good plan, applied to all of the data
- Cimple solution: reasonable plan, then patch
- So the focus shifts to:
  - how to find a reasonable plan?
  - how to detect problematic data spots?
  - how to patch those?
- Need a notion of semantic ambiguity
  - different from the notion of source relevance

3. Detecting Problems and Making Corrections
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration)]

How to Detect Problems?
- After extraction and matching, build services
  - e.g., superhomepages
- Many such homepages contain minor problems, e.g.:
  - X graduated in 19998
  - X chairs SIGMOD-05 and VLDB-05
  - X published 5 SIGMOD-03 papers
- Intuitively, something is semantically incorrect
- To fix this, build a Semantic Debugger
  - learns what is a normal profile for researcher, paper, etc.
  - alerts the builder to potentially buggy superhomepages
  - so corrections / feedback can be provided
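As a rough illustration of "learn a normal profile, then alert on deviations", the sketch below learns the observed value range of each numeric attribute from records assumed to be clean and flags values outside it. The attribute names, the range-based notion of "normal", and the functions `learn_profile` and `debug` are assumptions; the slides do not describe how the Semantic Debugger is actually built.

```python
def learn_profile(clean_records):
    """Learn a (min, max) range per attribute from records assumed to be correct."""
    profile = {}
    for record in clean_records:
        for attr, value in record.items():
            lo, hi = profile.get(attr, (value, value))
            profile[attr] = (min(lo, value), max(hi, value))
    return profile

def debug(record, profile):
    """Return one alert per attribute whose value falls outside the learned range."""
    alerts = []
    for attr, value in record.items():
        if attr in profile:
            lo, hi = profile[attr]
            if not lo <= value <= hi:
                alerts.append(f"{attr}={value} outside normal range [{lo}, {hi}]")
    return alerts

# Hypothetical researcher records.
clean = [
    {"grad_year": 1985, "sigmod_papers_per_year": 1},
    {"grad_year": 2001, "sigmod_papers_per_year": 2},
    {"grad_year": 1996, "sigmod_papers_per_year": 0},
]
profile = learn_profile(clean)
suspect = {"grad_year": 19998, "sigmod_papers_per_year": 5}
alerts = debug(suspect, profile)
```

Both slide examples (a graduation year of 19998 and 5 SIGMOD papers in one year) fall outside the learned ranges, so the builder would be alerted to this superhomepage.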

What Types of Feedback?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
  - e.g., no researcher has ever published 5 SIGMOD papers in a year
- Add more data
  - e.g., X was advised by Z
  - e.g., here is the URL of another data source
- Modify the underlying algorithm
  - e.g., pull out all data involving X, and match using names and co-authors, not just names

How to Make Providing Feedback Very Easy?
- Extremely crucial in DBLife context
- If feedback can be provided easily
  - can get more feedback
  - can leverage the mass of users

How to Make Providing Feedback Very Easy?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
- Add more data
- Modify the underlying algorithm
Annotations on the slide:
- Provide form interfaces
- Provide a Wiki interface
- Critical but unsolved
- Unsolved: some recent interest on how to mass customize software

Summary: Detection and Feedback
- How to detect problems?
  - Semantic Debugger
- What types of feedback & how to easily provide them?
  - critical, largely unsolved
- What feedback would make most impact?
  - crucial in large-scale systems
  - need a notion of a Feedback Advisor
  - need a precise notion of system quality

4. Mass Collaboration
[Figure: Cimple architecture diagram (data sources, ER graph, services, maintenance and expansion, mass collaboration)]

Mass Collaboration: Voting
- Can be applied to numerous problems

Example: Matching
- Hard for machine, but easy for human
  - Mouse for Dell laptop 200 series...
  - Dell X200; mouse at reduced price...
  - Dell laptop X200 with mouse...
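Voting on a match question can be sketched as a simple majority rule. The function `majority_vote`, the minimum-vote threshold, and the tie handling are assumptions for illustration; the slides do not specify a voting scheme, and weighting users by quality is left open.

```python
from collections import Counter

def majority_vote(votes, min_votes=3):
    """Return the majority answer, or None if there are too few votes or a tie."""
    if len(votes) < min_votes:
        return None
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two leading answers: leave undecided
    return ranked[0][0]

# Hypothetical user answers to: "Do 'Dell X200; mouse at reduced price...'
# and 'Dell laptop X200 with mouse...' describe the same product?"
decision = majority_vote([False, False, True])
```

Returning `None` for ties or sparse votes keeps the question open until more users have weighed in.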

Mass Collaboration: Wiki
- Community wikipedia
  - built by machine + human
  - backed up by a structured database
[Figure: diagram with labels Data Sources, G, T, V1, V2, V3, W1, W2, W3, u1, V3’, W3’, T3’, M]

Mass Collaboration: Wiki
[Figure: successive machine and human versions of a wiki page for David J. DeWitt. Recoverable text of the fullest version: David J. DeWitt, John P. Morgridge Professor, UW-Madison since 1976, Interests: Parallel Database, Privacy]

Summary: Mass Collaboration
- What can users contribute?
- How to evaluate user quality?
- How to reconcile inconsistent data?

Summary: Cimple
- A very interesting attempt to rethink Web crawling and information extraction
- Based on a “best-effort” notion
- One of many concurrent efforts in that vein
  - “Dataspaces”
- Simple building blocks, progressive refinement

Open Questions and Issues
- Incorporating uncertain data
- Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?
- How do we provide feedback? Particularly as there are many learners, and data and feedback are very sparse
- Others?
