Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^


1 Towards Combining Web Classification and Web Information Extraction: a Case Study
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
*Hewlett-Packard Labs China
^Institute of Computing Technology, CAS

2 Web Content Analysis for Vertical Search
Web Classification: identify the target pages
Web Information Extraction: extract the metadata in the target pages
Web pages after crawling:
product pages: product name, model number, price …
course homepages: course title, ID, time, teacher …

3 OfCourse
Search engine for online courses
More than 60,000 courses from the top 50 universities in the US

4 Web Classification and Web Information Extraction
WC vs. WIE: two sequential and separate phases, which leads to error accumulation.
It is highly ineffective to use this decoupled strategy, attempting to do Web classification and Web information extraction in two separate phases. Specifically, since these two steps are separate, the errors in Web classification are propagated to Web information extraction and eventually accumulate to a high level. Therefore, the overall performance is upper-bounded by that of Web classification.
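The upper bound mentioned above can be illustrated with a little arithmetic. This is only a sketch under a simplifying assumption that is not stated on the slide: a metadata field is recovered only if WC accepts the page and WIE then extracts the field, and the two events are treated as independent. The function name and the recall values are hypothetical.

```python
def pipeline_recall(wc_recall: float, wie_recall: float) -> float:
    """End-to-end recall of a sequential WC -> WIE pipeline.

    Assumes (for illustration only) that WIE runs solely on pages accepted
    by WC and that the two stages err independently.
    """
    return wc_recall * wie_recall


# Hypothetical stage recalls, not figures from the presentation.
wc, wie = 0.90, 0.95
end_to_end = pipeline_recall(wc, wie)
print(f"WC recall={wc}, WIE recall={wie}, pipeline recall={end_to_end:.3f}")
# The pipeline can never do better than the classification stage alone.
print("bounded by WC:", end_to_end <= wc)
```

Even a perfect extractor (WIE recall of 1.0) leaves the pipeline at the WC recall, which is the sense in which the decoupled strategy is upper-bounded by classification.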

5 Contributions
Combine WC and WIE with a probabilistic model to achieve mutual enhancement.
After studying this problem, we find that these two steps are closely related in that information obtained from one step can greatly help the other step. On one hand, if a Web page is a course homepage, it usually contains a course title. On the other hand, the existence of some course metadata, in turn, indicates that the current Web page is a course homepage. This means that there is a forward dependency from Web classification to Web information extraction, and also a backward dependency from Web information extraction to Web classification.
In this paper, we propose a method to combine Web classification and information extraction and achieve mutual enhancement between these two operations. Rather than conducting these two steps separately and sequentially, our method uses a probabilistic graphical model to simultaneously detect the target Web pages and extract the metadata in them, through which we aim to improve both the recall and precision of Web classification and Web information extraction.

6 Motivating Examples (1)
WIE oracle: no course title
Lots of course-related terms on this page
WIE helps to improve the precision of WC

7 Motivating Examples (2)
WIE oracle: No Course Title / With Course Title
Few course-related terms on this page
WIE helps to improve the recall of WC

8 Problem Formulation (1)
Notation:
x: a given Web page
y: the class label of this page (indicating the type of the Web page for WC)
xi (i=1…k): a text DOM leaf node in the page x
yi (i=1…k): the class label of xi (indicating the type of the text node for WIE)
k: the number of text DOM leaf nodes in this page
This is a label assignment problem for both x and x1 … xk.

9 Problem Formulation (2)
Given a Web page x with k text DOM nodes x1 … xk, let y, y1 … yk be one possible label assignment for x, x1 … xk.
The label assignment problem is solved using the principle of Maximum A Posteriori (MAP).
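The slide's own formula did not survive in this transcript. Written out with the notation of the formulation above, the MAP principle would select the joint label assignment with the highest conditional probability; the starred symbols are introduced here only to name the chosen assignment.

```latex
(y^{*}, y_1^{*}, \dots, y_k^{*})
  = \arg\max_{y,\, y_1, \dots, y_k}
    P\bigl(y, y_1, \dots, y_k \mid x, x_1, \dots, x_k\bigr)
```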

10 The Graphical Model
Undirected graphical model for combining WC and WIE

11 The Graphical Model
Undirected graphical model for combining WC and WIE
Maximal clique on x and y

12 The Graphical Model
Undirected graphical model for combining WC and WIE
Maximal clique on each xi and yi; there are k such maximal cliques

13 The Graphical Model
Undirected graphical model for combining WC and WIE
Maximal clique on all label variables y, y1 … yk
January, 2009

14 Expressing the Conditional Probability
Adopting the form of CRFs
More explanation of the feature functions
Number of feature functions
Number of parameters
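The formula on this slide is not present in the transcript. One plausible CRF-style form, assuming one group of feature functions per clique type of the graphical model (the page clique on x and y, the k node cliques on xi and yi, and the clique over all labels), is sketched below. The feature functions f, g, h and the weights λ, μ, ν are names chosen here for illustration, not the authors' notation.

```latex
P(y, y_1, \dots, y_k \mid x, x_1, \dots, x_k)
  = \frac{1}{Z}
    \exp\Bigl(
      \sum_{a} \lambda_a f_a(x, y)
      + \sum_{i=1}^{k} \sum_{b} \mu_b\, g_b(x_i, y_i)
      + \sum_{c} \nu_c\, h_c(y, y_1, \dots, y_k)
    \Bigr),
\qquad
Z = \sum_{y',\, y'_1, \dots, y'_k}
    \exp\Bigl(
      \sum_{a} \lambda_a f_a(x, y')
      + \sum_{i=1}^{k} \sum_{b} \mu_b\, g_b(x_i, y'_i)
      + \sum_{c} \nu_c\, h_c(y', y'_1, \dots, y'_k)
    \Bigr)
```

Under this sketch the number of parameters equals the total number of feature functions across the three groups, since each feature function carries one weight.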

15 Parameter Learning

16 Model Inference with Constrained Output (1)
The challenge: the normalization factor in the conditional probability
Exact computation when the structure of the elements in the vector y is simple
Approximate computation otherwise (fully connected y, y1 … yk in our model)
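To see why the normalization factor is the bottleneck, the sketch below computes it by brute force over every joint assignment. It assumes binary page and node labels and uses a made-up scoring function (`toy_score`); neither comes from the presentation. The point is only that the number of terms grows exponentially with k when the labels are fully connected.

```python
import itertools
import math


def brute_force_log_z(score, k, page_labels=(0, 1), node_labels=(0, 1)):
    """Return (log Z, number of assignments) by enumerating every (y, y1..yk)."""
    scores = [
        score(y, ys)
        for y in page_labels
        for ys in itertools.product(node_labels, repeat=k)
    ]
    top = max(scores)  # subtract the max before exponentiating, for stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z, len(scores)


def toy_score(y, ys):
    """Hypothetical clique score: reward node labels that agree with the page label."""
    return 0.5 * sum(1 for yi in ys if yi == y)


for k in (2, 5, 10):
    log_z, n_terms = brute_force_log_z(toy_score, k)
    print(f"k={k}: {n_terms} assignments, log Z = {log_z:.3f}")
```

With two labels per variable the sum has 2^(k+1) terms, so a page with a few hundred text nodes is already far out of reach for exact enumeration.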

17 Model Inference with Constrained Output (2)
Use the domain knowledge to constrain the output label space:
A course homepage contains one and only one course title
A non-course homepage does not contain a course title
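The effect of these two constraints can be sketched in code. The example assumes binary labels (y = 1 for a course homepage, yi = 1 for a course-title node), which is a simplification made here, and `score` stands in for whatever unnormalized log-score the model assigns. Under the constraints only k + 1 assignments remain, so both the normalization factor and the MAP assignment can be computed exactly by enumeration.

```python
import math


def constrained_assignments(k):
    """Yield every (y, ys) allowed by the two domain constraints.

    y = 0: not a course homepage, so no node is a course title.
    y = 1: course homepage, so exactly one node is the course title.
    """
    yield 0, (0,) * k
    for i in range(k):
        yield 1, tuple(1 if j == i else 0 for j in range(k))


def map_and_posterior(score, k):
    """Exact MAP assignment and its probability over the constrained space."""
    candidates = list(constrained_assignments(k))
    scores = [score(y, ys) for y, ys in candidates]
    top = max(scores)
    z = sum(math.exp(s - top) for s in scores)  # exact: only k + 1 terms
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], math.exp(scores[best] - top) / z


# Hypothetical score that favors "course homepage, third node is the title".
def example_score(y, ys):
    return 2.0 if (y == 1 and ys[2] == 1) else 0.0


assignment, prob = map_and_posterior(example_score, k=3)
print(assignment, round(prob, 3))
```

The label space shrinks from exponential in k to linear in k, which is what makes exact inference tractable in this setting.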

18 Baseline Methods
Local training and separate inference: train the two classifiers for WC and WIE separately, then use them sequentially when predicting
Local training and joint inference: train the two classifiers separately, then use them jointly when predicting
Proposed method: joint learning and joint inference
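The difference between separate and joint inference over locally trained classifiers can be sketched as follows. This is one plausible reading, not the authors' implementation: `p_page` is assumed to be the WC classifier's probability that the page is a course homepage, `p_title[i]` the WIE classifier's probability that node i is the course title, and joint inference is assumed to multiply these local probabilities under the one-title constraint. All names and numbers are hypothetical.

```python
import math


def sequential_predict(p_page, p_title, threshold=0.5):
    """Separate inference: WC decides first; WIE runs only on accepted pages."""
    if p_page < threshold:
        return 0, None
    return 1, max(range(len(p_title)), key=p_title.__getitem__)


def joint_predict(p_page, p_title):
    """Joint inference: pick the best constrained assignment of page and node labels."""
    # Candidate "not a course homepage": page negative, every node a non-title.
    best = (0, None)
    best_score = (1.0 - p_page) * math.prod(1.0 - q for q in p_title)
    # Candidates "course homepage with node i as the single title".
    for i in range(len(p_title)):
        s = p_page * math.prod(
            q if j == i else 1.0 - q for j, q in enumerate(p_title)
        )
        if s > best_score:
            best, best_score = (1, i), s
    return best


# Few course-related terms (low p_page) but one node looks strongly like a title.
print(sequential_predict(0.4, [0.05, 0.95, 0.10]))
print(joint_predict(0.4, [0.05, 0.95, 0.10]))

# Many course-related terms (high p_page) but no node looks like a title.
print(sequential_predict(0.7, [0.05, 0.10, 0.02]))
print(joint_predict(0.7, [0.05, 0.10, 0.02]))
```

In the first case the sequential pipeline rejects the page before extraction can help, while joint inference lets the strong title evidence recover it. In the second case the sequential pipeline accepts the page and is forced to pick a title, while joint inference rejects it.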

19 Experimental Results

20 Conclusions and Discussion
Tasks that are inherently joint, such as WC and WIE, should be addressed using only one model.
However, this increases the complexity of the statistical model.
This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumption.
We have shown that tasks that are inherently joint should be addressed using a model and a technique that is joint, rather than breaking them up into multiple, independent sub-tasks (parsing: syntax and semantics; information extraction and classification; image analysis; etc.).

21 OfCourse Open search engine
Supports interactive addition of course data

22

23 Experimental Data
Positive data: 530 course homepages
Negative data: 1200 other web pages

