Cimple: Building Community Portal Sites through Crawling & Extraction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management.

1 Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives, University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission

2 Administrivia
- By next Tuesday: a rough schedule and division of duties for your project
- Please read the Halevy et al. paper on Piazza

3 The Web Is Full of Special-Interest Portal Sites for Communities
- Academia
  - Certain bioinformatics topics; citations; etc.
- Medicine
  - WebMD
- Infotainment
  - Rotten Tomatoes, IMDB, fantasy football
- Business
  - enterprise intranets, tech support groups, lawyers
- CIA / homeland security
  - Intellipedia
- Some of these gather information from the Web

4 Cimple Project @ Wisconsin (+ Yahoo)
[Architecture diagram. Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP; Web pages, text documents. ER graph example: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Side labels: maintain and add more sources; mass collaboration.]
- Develops a general solution to community Web portals using extraction + integration + mass collaboration

5 The Basic Ideas
- Architecture mainly consists of extractors and ER graphs
- The key challenge is to extract “good” information for the ER graphs, and to allow all of the information to be extended and repaired

6 Prototype System: DBLife
- Integrate data of the DB research community
- 1164 data sources
- Crawled daily, 11000+ pages = 160+ MB / day

7 Data Integration
[Example: Raghu Ramakrishnan, co-authors = A. Doan, Divesh Srivastava, ...]

8 Resulting ER Graph
[ER graph figure. Nodes: “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, SIGMOD 2005, David DeWitt, Pedro Bizarro. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]

9 Provide Services
- DBLife system

10 Mass Collaboration via Wiki

11 Issues Addressed by Cimple
- Cimple addresses challenges in
  1. Source selection
  2. Extraction and integration
  3. Detecting problems and providing feedback
  4. Mass collaboration

12 1. Source Selection
[Architecture diagram, repeated from slide 4]

13 Current Solutions vs. Cimple
- Current solutions: topic-specific crawlers
  - find all relevant data sources (e.g., using focused crawling, search engines)
  - maximize coverage
  - results in many “noisy” sources
- Cimple allows for incremental development and deployment
  - starts with a small set of high-quality “core” sources
  - incrementally adds more sources
    - only from “high-quality” places
    - or as suggested by users (mass collaboration)

14 Start with a Small Set of “Core” Sources
- Key observation: communities often follow the 80-20 rule
  - 20% of sources cover 80% of interesting activities
- An initial portal over these 20% is often already quite useful
- How do we select these 20%?
  - select as many sources as possible
  - then evaluate and select the most relevant ones

15 Evaluate the Relevance of Sources
- Use PageRank + virtual links across entities + TF/IDF...
[Figure labels: Gerhard Weikum, G. Weikum]
- See [VLDB-07a]
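A rough sketch of the idea on this slide, not the method from [VLDB-07a]: treat sources as graph nodes, add "virtual links" between sources that mention the same entity, and run plain PageRank over the result. The function names, the example graph, and the assumption that mentions are already resolved to one entity key are all hypothetical; the TF/IDF component is omitted.

```python
from collections import defaultdict

def add_virtual_links(links, mentions):
    """Return a copy of `links` with extra edges between sources that
    mention the same entity (hypothetical reading of 'virtual links')."""
    merged = {src: list(targets) for src, targets in links.items()}
    by_entity = defaultdict(set)
    for src, entities in mentions.items():
        for entity in entities:
            by_entity[entity].add(src)
    for sources in by_entity.values():
        for s in sources:
            for t in sources:
                if s != t and t not in merged.setdefault(s, []):
                    merged[s].append(t)
    return merged

def pagerank(links, damping=0.85, iterations=50):
    """Standard power-iteration PageRank over a dict: node -> list of targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank

# Hypothetical sources: "a" and "d" both mention the same researcher.
hyperlinks = {"a": ["b"], "c": ["b"], "b": []}
mentions = {"a": {"weikum"}, "d": {"weikum"}}
graph = add_virtual_links(hyperlinks, mentions)
ranks = pagerank(graph)
for source in sorted(ranks, key=ranks.get, reverse=True):
    print(source, round(ranks[source], 3))
```

Without the virtual link, source "d" would not be in the graph at all; with it, "d" is reachable from "a" and can be ranked.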

16 Add More Sources over Time
- Key observation: most important sources will eventually be mentioned within the community
  - so monitor certain “community channels” to find them
[Example message. Message type: conf. ann. Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data. Body: Call for Participation, Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007, http://mud.cs.utwente.nl...]
- Also allow users to suggest new sources
  - e.g., the Silicon Valley Database Society

17 Summary: Source Selection
- Incremental approach:
  - start with highly relevant sources
  - expand carefully
  - minimize “garbage in, garbage out”
- Need a notion of source relevance
- Need a way to compute this

18 2. Extraction and Integration
[Architecture diagram, repeated from slide 4]

19 Extracting Entity Mentions
- Key idea: reasonable plan, then “patch”
- Reasonable basic plan:
  - collect person names, e.g., David Smith
  - generate variations, e.g., D. Smith, Dr. Smith, etc.
  - find occurrences of these variations
[Plan diagram: ExtractMbyName over Union of s1 … sn]
- Works well, but can’t handle certain difficult spots
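A minimal sketch of the basic plan above, assuming a tiny name dictionary and exact substring matching. `name_variations` and `extract_mentions_by_name` are hypothetical names, not the ExtractMbyName operator itself, and the "Last, F." variation is an assumption beyond the two variations the slide lists.

```python
import re

def name_variations(full_name):
    """Generate a few surface forms for a 'First Last' name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        full_name,                 # David Smith
        f"{first[0]}. {last}",     # D. Smith
        f"Dr. {last}",             # Dr. Smith
        f"{last}, {first[0]}.",    # Smith, D. (assumed extra variation)
    }

def extract_mentions_by_name(text, dictionary):
    """Return (offset, surface form, dictionary name) for every occurrence."""
    mentions = []
    for person in dictionary:
        for variant in name_variations(person):
            for m in re.finditer(re.escape(variant), text):
                mentions.append((m.start(), m.group(), person))
    return sorted(mentions)

text = "Talk by D. Smith and Dr. Smith; David Smith attends."
for offset, surface, person in extract_mentions_by_name(text, ["David Smith"]):
    print(offset, surface, "->", person)
```

Plain substring matching is what makes this plan cheap and also what makes it fragile in the difficult spots mentioned on the slide.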

20 Handling Difficult Spots
- Example
  - R. Miller, D. Smith, B. Jones
  - if “David Miller” is in the dictionary
  - will flag “Miller, D.” as a person name
- Solution: patch such spots with stricter plans
[Plan diagram: operators ExtractMbyName, Union, FindPotentialNameLists, ExtractMStrict over sources s1 … sn]
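One way the patch could look, as a sketch only: first find spans that look like comma-separated lists of "F. Last" items, then match inside them item by item so a match can never straddle a comma. The regular expressions and function names are assumptions for illustration, not the actual FindPotentialNameLists / ExtractMStrict operators.

```python
import re

# Assumed item shape: one initial, a period, a space, a capitalized last name.
ITEM = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{ITEM}(?:, {ITEM})+")

def find_potential_name_lists(text):
    """Return (start, end) spans that look like lists of abbreviated names."""
    return [m.span() for m in NAME_LIST.finditer(text)]

def extract_strict(text, span, dictionary):
    """Match each list item as a whole against the dictionary."""
    start, end = span
    mentions = []
    offset = start
    for item in text[start:end].split(", "):
        initial, last = item[0], item[3:]
        for person in dictionary:
            parts = person.split()
            if parts[0][0] == initial and parts[-1] == last:
                mentions.append((offset, item, person))
        offset += len(item) + 2  # skip the item and the ", " separator
    return mentions

text = "Authors: R. Miller, D. Smith, B. Jones"
dictionary = ["David Miller", "David Smith"]
for span in find_potential_name_lists(text):
    print(extract_strict(text, span, dictionary))
```

On this input the strict plan reports only "D. Smith" for David Smith; the substring "Miller, D." that a looser plan would flag for David Miller is never considered because it crosses an item boundary.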

21 Matching Entity Mentions
- Key idea: reasonable plan, then patch
- Reasonable plan
  - mention names are the same (modulo some variation) → match
  - e.g., David Smith and D. Smith
[Plan diagram: MatchMbyName over Union of Extract Plan applied to s1 … sn]
- Works well, but can’t handle certain difficult spots
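A small sketch of name-based matching, assuming "same modulo variation" means same first initial and same last name. The key function is a hypothetical simplification that only handles "First Last" and "F. Last" forms; it is not the MatchMbyName operator.

```python
from collections import defaultdict

def name_key(mention):
    """Reduce a mention to (first initial, lowercased last name)."""
    tokens = mention.split()
    return (tokens[0][0].upper(), tokens[-1].lower())

def match_mentions_by_name(mentions):
    """Group mentions whose keys agree; each group is treated as one entity."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())

mentions = ["David Smith", "D. Smith", "Chen Li", "C. Li", "Dan Smith"]
for group in match_mentions_by_name(mentions):
    print(group)
```

Note that "Dan Smith" lands in the same group as "David Smith". That kind of over-merging is exactly the difficult spot a stricter matcher has to patch.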

22 Handling Difficult Spots
- Estimate the semantic ambiguity of data sources
  - use social networking techniques related to cohesion of graphs [see ICDE-07a]
- Apply stricter matchers to more ambiguous sources
[Plan diagram: operators MatchMStrict, MatchMbyName, Union, Extract Plan; inputs {s1 … sn} \ DBLP and DBLP]
[Example, DBLP: Chen Li]
  · · ·
  41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
  · · ·
  38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
  · · ·

23 Summary: Extraction and Integration
- Most current solutions
  - try to find a single good plan, applied to all of the data
- Cimple solution: reasonable plan, then patch
- So the focus shifts to:
  - how to find a reasonable plan?
  - how to detect problematic data spots?
  - how to patch those?
- Need a notion of semantic ambiguity
- Different from the notion of source relevance

24 3. Detecting Problems and Making Corrections
[Architecture diagram, repeated from slide 4]

25 How to Detect Problems?
- After extraction and matching, build services
  - e.g., superhomepages
- Many such homepages contain minor problems
  - e.g., X graduated in 19998
  - X chairs SIGMOD-05 and VLDB-05
  - X published 5 SIGMOD-03 papers
- Intuitively, something is semantically incorrect
- To fix this, build a Semantic Debugger
  - learns what is a normal profile for researcher, paper, etc.
  - alerts the builder to potentially buggy superhomepages
  - so corrections / feedback can be provided
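A toy sketch of the Semantic Debugger idea, assuming a "normal profile" is just the range of values seen in a set of trusted records. The attribute names and the range rule are hypothetical stand-ins; the slides do not say how the real debugger learns its profiles.

```python
def learn_profile(trusted_records):
    """Learn a (min, max) range per numeric attribute from trusted records."""
    profile = {}
    for record in trusted_records:
        for attr, value in record.items():
            low, high = profile.get(attr, (value, value))
            profile[attr] = (min(low, value), max(high, value))
    return profile

def flag_anomalies(record, profile):
    """Return the attributes whose values fall outside the learned range."""
    return sorted(
        attr
        for attr, value in record.items()
        if attr in profile
        and not (profile[attr][0] <= value <= profile[attr][1])
    )

# Hypothetical trusted superhomepage data.
trusted = [
    {"grad_year": 1985, "papers_same_venue_year": 1},
    {"grad_year": 2003, "papers_same_venue_year": 2},
]
profile = learn_profile(trusted)

suspect = {"grad_year": 19998, "papers_same_venue_year": 5}
print(flag_anomalies(suspect, profile))
```

The flagged attributes are only alerts for the portal builder; whether a value such as five papers at one venue in one year is truly wrong still needs human feedback.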

26 What Types of Feedback?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
  - e.g., no researcher has ever published 5 SIGMOD papers in a year
- Add more data
  - e.g., X was advised by Z
  - e.g., here is the URL of another data source
- Modify the underlying algorithm
  - e.g., pull out all data involving X; match using names and co-authors, not just names

27 How to Make Providing Feedback Very Easy?
- Extremely crucial in DBLife context
- If feedback can be provided easily
  - can get more feedback
  - can leverage the mass of users

28 How to Make Providing Feedback Very Easy?
- Say that a certain data item Y is wrong
- Provide correct value for Y, e.g., Y = SIGMOD-06
- Add domain knowledge
- Add more data
- Modify the underlying algorithm
[Callouts on the slide: Provide form interfaces; Provide a Wiki interface; Critical but unsolved; Unsolved: some recent interest on how to mass customize software]

29 Summary: Detection and Feedback
- How to detect problems?
  - Semantic Debugger
- What types of feedback & how to easily provide them?
  - critical, largely unsolved
- What feedback would make most impact?
  - crucial in large-scale systems
  - need a notion of a Feedback Advisor
  - need a precise notion of system quality

30 4. Mass Collaboration
[Architecture diagram, repeated from slide 4, with the side label “Maintenance and expansion”]

31 Mass Collaboration: Voting
- Can be applied to numerous problems

32 Example: Matching
- Hard for machine, but easy for human
[Example listings: “Mouse for Dell laptop 200 series...”; “Dell X200; mouse at reduced price...”; “Dell laptop X200 with mouse...”]
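A minimal sketch of how user votes on a match question could be combined, assuming simple majority voting with a minimum number of votes. The function name and the threshold are hypothetical; the slides do not specify the voting scheme.

```python
from collections import Counter

def majority_vote(votes, min_votes=3):
    """Return the winning answer, or None if there are too few votes or a tie."""
    if len(votes) < min_votes:
        return None
    (top, top_count), *rest = Counter(votes).most_common(2)
    if rest and rest[0][1] == top_count:
        return None  # tie between the two most common answers
    return top

# Hypothetical question: do two product listings describe the same item?
print(majority_vote([True, True, False]))
print(majority_vote([True, False]))
```

Returning None for undecided cases leaves room to ask more users instead of committing to a possibly wrong match.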

33 Mass Collaboration: Wiki
- Community wikipedia
  - built by machine + human
  - backed up by a structured database
[Diagram labels: Data Sources, G, T, M; V1, V2, V3; W1, W2, W3; u1; V3’, W3’, T3’]

34 Mass Collaboration: Wiki
[Figure with “Machine” and “Human” labels showing versions of a wiki page for David J. DeWitt. Recoverable text: David J. DeWitt, Professor; John P. Morgridge Professor, UW-Madison, since 1976; Interests: Parallel Database, Privacy.]

35 Summary: Mass Collaboration
- What can users contribute?
- How to evaluate user quality?
- How to reconcile inconsistent data?

36 Summary: Cimple
- A very interesting attempt to rethink Web crawling and information extraction
- Based on a “best-effort” notion
- One of many concurrent efforts in that vein
- “Dataspaces”
- Simple building blocks, progressive refinement

37 Open Questions and Issues
- Incorporating uncertain data
- Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?
- How do we provide feedback? Particularly as there are many learners, and data and feedback are very sparse
- Others?

