
1 Growing Parallel Paths for Entity-Page Retrieval
Tim Weninger, Cindy Xide Lin, and Jiawei Han
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL
Data and Information Systems Laboratory, University of Illinois Urbana-Champaign
Advanced Data Mining, May 4, 2010
Work Submitted to VLDB'10

2 Problem: Entity Page Retrieval
Given: Reference page

3 Problem: Entity Page Retrieval
…Can we find entity pages of the same type?

4 Problem: Entity Page Retrieval
…Can we find entity pages of the same type?

5 Definitions
Defn 1: Root-to-link path: ◊ - href X contains HTML - TABLE - TR1 - TD - href X
Defn 2: Parallel links: links that share a root-to-link path, i.e., lists of links
Defn 3: Intra-page parallel paths: ◊ - href C ǁ ◊ - href B; ◊ - href C ǁ ◊ - href X
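Definitions 1 and 2 can be illustrated with a short sketch. This is not the authors' implementation: it assumes a root-to-link path is simply the sequence of tag names from the document root down to an `<a href>` element, and it ignores sibling positions such as the row index. The names `LinkPathParser`, `parallel_links`, and the sample `PAGE` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Records the tag path from the document root to every <a href>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (root-to-link path, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.links.append((tuple(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def parallel_links(html):
    """Group hrefs by root-to-link path; keep groups of two or more links."""
    parser = LinkPathParser()
    parser.feed(html)
    groups = defaultdict(list)
    for path, href in parser.links:
        groups[path].append(href)
    return {path: hrefs for path, hrefs in groups.items() if len(hrefs) > 1}


# Hypothetical page: two links in a table (parallel) and one stray link.
PAGE = """
<html><body>
<table>
<tr><td><a href="/b">B</a></td></tr>
<tr><td><a href="/c">C</a></td></tr>
</table>
<p><a href="/about">About</a></p>
</body></html>
"""

if __name__ == "__main__":
    for path, hrefs in parallel_links(PAGE).items():
        print("-".join(path), hrefs)
```

Under these assumptions the two table links share the path html-body-table-tr-td-a and are reported as parallel, while the lone link inside the paragraph is not.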

6 Definitions
Defn 4: Inter-page parallel paths: ◊ - href C in Page A ǁ ◊ - href W in Page B
Defn 5: Parallel Web site paths: share intra-page or inter-page parallel paths across multiple pages

7 Properties of Parallel Paths
Prop. 1: Equal Path Length Property: Parallel paths must contain the same number of pages.
Prop. 2: Parallel Page Property: Testing whether two paths are parallel is equivalent to testing their respective pages pairwise.
Prop. 3: Equal Page Length Property: Parallel paths must have the same number of nodes across pages.

8 Properties of Parallel Paths
Prop. 4: Divergent Path Property: Parallel paths can extend through separate pages.
Prop. 5: Early Termination Property: The test of two paths can be terminated at the first occurrence of a dissimilar node.
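Properties 1, 2, 3, and 5 suggest a simple parallelism test, sketched below. The representation is an assumption made for illustration: a Web site path is a list of pages, each page is a tuple of node labels (its root-to-link path), and two nodes count as "similar" only when their labels are equal. The function names are hypothetical and the slides do not specify the actual similarity measure.

```python
def pages_parallel(page_a, page_b):
    """Compare two pages node by node."""
    # Prop. 3: parallel pages must have the same number of nodes.
    if len(page_a) != len(page_b):
        return False
    for node_a, node_b in zip(page_a, page_b):
        # Prop. 5: stop at the first dissimilar node.
        if node_a != node_b:
            return False
    return True


def paths_parallel(path_a, path_b):
    """Compare two Web site paths page by page."""
    # Prop. 1: parallel paths must contain the same number of pages.
    if len(path_a) != len(path_b):
        return False
    # Prop. 2: the path test reduces to tests on respective pages.
    # all() short-circuits, so later pages are skipped after a mismatch.
    return all(pages_parallel(a, b) for a, b in zip(path_a, path_b))


if __name__ == "__main__":
    list_page = ("html", "body", "table", "tr", "td", "a")
    nav_page = ("html", "body", "ul", "li", "a")
    print(paths_parallel([nav_page, list_page], [nav_page, list_page]))  # True
    print(paths_parallel([nav_page, list_page], [nav_page, nav_page]))   # False
    print(paths_parallel([nav_page], [nav_page, list_page]))             # False
```

Prop. 4 needs no code here: the test compares node labels only, so the two paths are free to pass through different pages.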

9 Finding Paths
Naive method: can be very costly
Growing parallel paths:
First find an example path
Then grow paths that are parallel to the example
Repeat with alternate paths
This makes magic happen

10 Repeating with Alternate Paths
k-shortest paths: do a k-shortest-path search and explore all of these paths
Removing links: after exploring a path, remove its edges from the graph
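The "removing links" option can be sketched as repeated shortest-path search on the site's link graph. This is only an illustration under stated assumptions: the graph is a dictionary mapping each page to the pages it links to, breadth-first search stands in for whatever path search the authors used, and `shortest_path`, `alternate_paths`, the `limit` parameter, and the toy `SITE` graph are all hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of pages or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in graph.get(page, ()):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None


def alternate_paths(graph, source, target, limit=10):
    """Find a path, remove its edges from the graph, and repeat."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {page: list(links) for page, links in graph.items()}
    found = []
    while len(found) < limit:
        path = shortest_path(remaining, source, target)
        if path is None or len(path) < 2:
            break
        found.append(path)
        # After exploring the path, delete its edges.
        for a, b in zip(path, path[1:]):
            remaining[a].remove(b)
    return found


# Hypothetical site graph with two routes to the same entity page.
SITE = {
    "home": ["people", "research"],
    "people": ["faculty"],
    "faculty": ["jiawei_han"],
    "research": ["data_mining"],
    "data_mining": ["jiawei_han"],
}

if __name__ == "__main__":
    for path in alternate_paths(SITE, "home", "jiawei_han"):
        print(" > ".join(path))
```

On this toy graph the first pass finds the route through people and faculty; once those edges are removed, the second pass is forced onto the route through research and data_mining, and the third pass finds nothing.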

11 Interpreting the Output
Side effect of repeating with alternate paths
Given: Jiawei Han
Result:
Jiawei Han 40
Cheng Zhai 38
Kevin Chang 38
Dan Roth 32
Vikram Adve 4
Roy Campbell 3
…

12 Interpreting the Output
Side effect of path finding
What do the link labels on the path tell us about the entity?
First path: People, Faculty, Jiawei Han, Personal Site
Second path: Research, Data Mining

13 Experiments
Top 25 CS departments in the US (according to US News): find all professors
United States Congress: find all senators, representatives, and committees
UIUC only: find all courses; find all research groups
Baseline: Google's "find similar" search (essentially TF-IDF-type ranking)

14 Results

15 Results

16 Results

17 Conclusions and Future Work
Given a reference page and an example entity type, we can retrieve all entity pages of the same type.
Implications: We can use this for information integration; search and retrieval can be enhanced.
Shortcomings: Most errors are due to incorrect list finding.

18 Questions?

