Published by Jade Sharp. Modified over 8 years ago.
1
PAIR project progress report
Yi-Ting Chou, Shui-Lung Chuang, Xuanhui Wang
2
Motivation
A large amount of information on the Web is distributed and unstructured.
Web IE: extract and organize such information into a structured format.
E.g., Person (name, contact (email, phone, address), research interests, …)
E.g., Book (title, authors, price, ISBN, …)
3
Example
Person (name, contact (email, phone, address), research interests, …)
Page 1, Page 2, Page 3, …
5
Motivation (cont.)
Direct Web IE is very hard, e.g., because the information is distributed and unstructured.
This project provides an instance-attribute retrieval engine as a step toward this problem.
In this project, we focus on personal information.
The attribute must be given (e.g., contact).
6
Flow Chart
[Diagram. Components: Name, Attribute, Page Collector, Attribute Expansion, Pages, Attribute*, Segment Tool, Trees, Retrieval, Rank List]
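The flow chart includes an Attribute Expansion step that turns the given attribute into an expanded form (Attribute*). The slides do not say how this is done. One possible reading, sketched here, is to expand an attribute into its sub-attributes using the Person schema from the Motivation slide. The table `SUB_ATTRIBUTES` and the function `expand_attribute` are hypothetical names introduced only for illustration.

```python
# Hypothetical expansion table, taken from the Person schema in the slides:
# Person (name, contact (email, phone, address), research interests, ...)
SUB_ATTRIBUTES = {
    "contact": ["email", "phone", "address"],
}


def expand_attribute(attribute):
    """Return the attribute followed by any known sub-attributes (Attribute*)."""
    key = attribute.strip().lower()
    return [key] + SUB_ATTRIBUTES.get(key, [])


print(expand_attribute("Contact"))
```

An attribute with no entry in the table is returned unchanged, so the retrieval step can treat expanded and unexpanded attributes the same way.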
7
Why a tree structure for page segmentation?
The parameter that controls the size of the leaf blocks is difficult to tune.
Our solution: score every node of the tree instead of only the leaf blocks, then select the appropriate node to rank.
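A minimal sketch of scoring every node rather than only the leaves. The scoring function is an assumption (the slides do not define one): it multiplies term coverage by term density, so that a node wins only if it contains most of the query terms without much unrelated text. `Node`, `score`, and `best_node` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str = ""                                  # text directly in this block
    children: list = field(default_factory=list)    # sub-blocks


def subtree_text(node):
    """Text of a node together with all of its descendants."""
    parts = [node.text] + [subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def score(node, terms):
    """Assumed score: (share of terms found) * (share of words that are terms)."""
    words = subtree_text(node).lower().split()
    if not words or not terms:
        return 0.0
    hits = [w for w in words if w in terms]
    coverage = len(set(hits)) / len(terms)
    density = len(hits) / len(words)
    return coverage * density


def all_nodes(node):
    """Yield the node and every descendant, parents first."""
    yield node
    for child in node.children:
        yield from all_nodes(child)


def best_node(root, terms):
    return max(all_nodes(root), key=lambda n: score(n, terms))


email = Node("email alice@example.org")
phone = Node("phone 555-0100")
contact = Node("", [email, phone])
bio = Node("research interests include information retrieval")
root = Node("", [contact, bio])

best = best_node(root, {"email", "phone"})
print(subtree_text(best))
```

In this toy page the internal `contact` node scores higher than either leaf (each leaf covers only one term) and higher than the root (which is diluted by the biography text), which is the situation the slide argues for.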
8
Current Progress
[Diagram. Components: Name, Attribute, Page Collector, Attribute Expansion, Pages, Attribute*, Segment Tool, Trees, Retrieval, Rank List]
9
The main idea of the project
1. Given a person name, first identify the pages that contain information about the person (multiple such pages exist on the Web).
2. Segment each page into semantically coherent blocks.
3. Given an attribute name, identify the most relevant blocks.
4. NLP techniques can be applied to extract noun phrases from the relevant blocks.
10
The progress so far
Currently, we focus on a single page.
1. Page segmentation, using VIPS, generates a tree structure for the page.
2. Given an attribute, match it with the most relevant “node” of the “tree”.
3. Present the rank list of the relevant blocks.
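Steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the segmentation tree is represented as nested `(text, children)` tuples rather than real VIPS output, and nodes are ranked by the fraction of their words that match the attribute terms. `flatten` and `rank_nodes` are hypothetical names.

```python
import re


def flatten(tree, path="0"):
    """Yield (path, text) for every node; a node's text includes its subtree."""
    text, children = tree
    parts = [text]
    descendants = []
    for i, child in enumerate(children):
        sub = list(flatten(child, f"{path}.{i}"))
        descendants.extend(sub)
        parts.append(sub[0][1])   # sub[0] is the child itself, with its full text
    yield path, " ".join(p for p in parts if p)
    yield from descendants


def rank_nodes(tree, attribute_terms, top_k=3):
    """Return the top_k (score, path, text) triples, best first."""
    terms = {t.lower() for t in attribute_terms}
    scored = []
    for path, text in flatten(tree):
        words = re.findall(r"[a-z0-9]+", text.lower())
        hits = sum(1 for w in words if w in terms)
        if hits:
            scored.append((hits / len(words), path, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


page = ("", [
    ("Contact", [("Email alice@example.org", []), ("Phone 555-0100", [])]),
    ("Research interests: information retrieval", []),
])

for node_score, path, text in rank_nodes(page, ["email", "phone"]):
    print(f"{node_score:.2f}  {path}  {text}")
```

Each entry in the rank list carries the node's path in the tree, so a user interface could highlight the corresponding block on the page.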
11
Demo
12
The remaining tasks
1. Improve the accuracy for a single page.
2. Extend to multiple pages:
INPUT: a person name (instead of a URL) and an attribute name.
OUTPUT: a rank list of the blocks.
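For the multi-page extension, one simple option is to rank blocks on each page separately and then merge the per-page lists into a single rank list. The sketch below assumes that block scores are comparable across pages, which the slides do not establish. The function `merge_rankings` and the example URLs are hypothetical.

```python
def merge_rankings(per_page_rankings, top_k=5):
    """Merge {url: [(score, block_text), ...]} into one list, best first."""
    merged = [
        (score, url, block)
        for url, ranked in per_page_rankings.items()
        for score, block in ranked
    ]
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged[:top_k]


per_page = {
    "http://example.org/home": [(0.9, "Email: alice@example.org"), (0.2, "News")],
    "http://example.org/cv": [(0.7, "Phone: 555-0100")],
}

for score, url, block in merge_rankings(per_page, top_k=2):
    print(score, url, block)
```

Keeping the source URL next to each block lets the final rank list show which page a piece of information came from.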
13
Issues for discussion
Possible problems with our method. E.g., how to effectively score and rank the “nodes” of the page “tree”?
Ways to improve and extend our method. E.g., how to combine it with NLP/Named-Entity Extraction on the retrieved blocks? E.g., how to deal with multiple pages and duplicated information?
Suggestions for evaluating our method. E.g., a user study; anything more?
The relation to Entity Retrieval?
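On the duplicated-information question, one possible starting point (not something the slides propose) is to drop near-duplicate blocks from the rank list using Jaccard similarity over word sets. The threshold of 0.8 and the function names are assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(ranked_blocks, threshold=0.8):
    """Keep a block only if it is not too similar to a higher-ranked kept block."""
    kept = []
    kept_tokens = []
    for block in ranked_blocks:
        t = tokens(block)
        if all(jaccard(t, seen) < threshold for seen in kept_tokens):
            kept.append(block)
            kept_tokens.append(t)
    return kept


ranked = [
    "Email: alice@example.org Phone: 555-0100",
    "email alice@example.org, phone 555-0100",
    "Research interests: information retrieval",
]
print(drop_near_duplicates(ranked))
```

Because the input is processed in rank order, the highest-ranked copy of a duplicated block is the one that survives.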