Published by Morgan Chase. Modified over 9 years ago.
1
An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites
Center for E-Business Technology, Seoul National University, Seoul, Korea
Nam, Kwang-hyun
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Tak-Lam Wong, Wai Lam, Tik-Shun Wong
The Chinese University of Hong Kong
SIGIR 2008
2
Copyright 2009 by CEBT
Contents
Introduction
Problem Definition
Model
Inference Method
Experimental Results
Conclusions
Discussion
IDS Lab Seminar - 2
3
Introduction: Motivation
(Source: http://www.superwarehouse.com)
(Source: http://www.crayeon3.com)
4
Introduction: Information Extraction
Prior knowledge about content
– Sensor resolution
Previously unseen attributes
– White balance, shutter speed, light sensitivity
Layout format
Mutual influence
5
Introduction: Attribute Normalization
Samples of extracted text fragments from a page:
– Cloudy, daylight, etc.
– What do they refer to?
A text fragment extracted from another page:
– white balance auto, daylight, cloudy, etc.
Attribute normalization
– Clusters text fragments that refer to the same attribute into the same group
– Better indexing for product search
– Easier understanding and interpretation
6
Introduction: Existing Works
Supervised wrapper induction
– Requires training examples.
– A wrapper learned from one Web site cannot be applied to other sites.
Template-independent extraction (Zhu et al., 2007)
– Cannot handle previously unseen attributes.
Unsupervised wrapper learning (Crescenzi et al., 2001)
– Extracted data are not normalized.
7
Introduction: Contributions
An unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites.
Can extract an unlimited number of product attributes (Dirichlet process).
Can visualize the semantic meaning of each product attribute.
8
Problem Definition (1)
A product domain, e.g., the digital camera domain
A set of reference attributes, e.g., “resolution”, “white balance”, etc.
A special element representing “not-an-attribute”
A collection of Web pages from any Web sites, each of which contains a single product
A text fragment is any piece of text from a Web page
9
Problem Definition (2)
White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom
Line separator
10
Problem Definition (3)
Information extraction:
Attribute normalization:
Joint attribute extraction and normalization:
Each text fragment x carries content information, layout information, target information, and attribute information, in that order.
e.g., x = (resolution 10,000,000 pixels, black and in small font size, 1, resolution)
11
Problem Definition (4)
“White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom”: T=1, A=“white balance”
“Cloudy, daylight”: T=1, A=“white balance”
“View larger image”: T=0, A=“not-an-attribute”
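The three labeled examples can be written down as a small data structure to make the roles of T and A concrete. This is only an illustrative sketch: the class name `TextFragment`, the helper names, and the layout strings are assumptions that do not appear in the slides, and the labels are given by hand here, whereas the framework infers them.

```python
from collections import defaultdict
from dataclasses import dataclass

NOT_AN_ATTRIBUTE = "not-an-attribute"


@dataclass(frozen=True)
class TextFragment:
    content: str    # content information: the words of the fragment
    layout: str     # layout information (values below are made up for illustration)
    target: int     # T: 1 if the fragment describes a product attribute, else 0
    attribute: str  # A: reference attribute name, or NOT_AN_ATTRIBUTE


fragments = [
    TextFragment(
        "White balance Auto, daylight, cloudy, tungsten, fluorescent, "
        "fluorescent H, custom",
        "table row", 1, "white balance"),
    TextFragment("Cloudy, daylight", "plain text", 1, "white balance"),
    TextFragment("View larger image", "hyperlink", 0, NOT_AN_ATTRIBUTE),
]


def extract(frags):
    """Information extraction: keep the fragments whose target T is 1."""
    return [f for f in frags if f.target == 1]


def normalize(frags):
    """Attribute normalization: group extracted fragments by attribute A."""
    groups = defaultdict(list)
    for f in extract(frags):
        groups[f.attribute].append(f.content)
    return dict(groups)


for name, texts in normalize(fragments).items():
    print(name, "<-", texts)
```

Both white balance fragments end up in one group even though they come from differently formatted pages, which is the outcome normalization aims for.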
12
Model
Graphical model (figure labels):
– Dirichlet process prior (infinite mixture model)
– N text fragments
– S different Web pages
– k-th component proportion
– Content information generation
– Target information generation
– A set of layout distributions
13
Generation Process
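The formulas of the generation process did not survive on this slide. As a rough illustration of how a Dirichlet process prior yields component proportions, the sketch below uses a truncated stick-breaking construction. The construction, the concentration parameter `alpha`, and the truncation level are assumptions chosen for the example and are not taken from the slides. The content, target, and layout generation steps are not modeled.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Draw mixture proportions from a truncated stick-breaking process.

    Each step breaks off a Beta(1, alpha) share of the remaining stick.
    The last component takes whatever is left, so the result sums to 1.
    """
    remaining = 1.0
    proportions = []
    for _ in range(truncation - 1):
        share = rng.betavariate(1.0, alpha)
        proportions.append(remaining * share)
        remaining *= 1.0 - share
    proportions.append(remaining)
    return proportions


rng = random.Random(0)

# Proportion of the k-th mixture component (one component per attribute).
pi = stick_breaking(alpha=1.0, truncation=10, rng=rng)

# Assign each of N = 20 hypothetical text fragments to a component.
assignments = rng.choices(range(len(pi)), weights=pi, k=20)

print([round(p, 3) for p in pi])
print(assignments)
```

A smaller `alpha` concentrates the mass on a few components, and a larger one spreads it over many, which is how the number of attributes can grow with the data instead of being fixed in advance.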
14
Generation Process
The joint probability for generating a particular text fragment given the model parameters
Inference is intractable (exact computation is infeasible in practice)
15
Variational Method
Finding the true distribution P is intractable.
Goal: design a tractable distribution Q such that Q is as close to P as possible.
Closeness is measured by the Kullback-Leibler (KL) divergence D(Q||P), which satisfies D(Q||P) ≥ 0.
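The equations on this slide were lost in extraction. For orientation, the standard identity linking the KL divergence to a lower bound on the log-likelihood is given below. The notation is assumed rather than taken from the slides: x stands for the observed text fragments and z for all hidden variables.

```latex
\begin{align}
D(Q \,\|\, P)
  &= \mathbb{E}_{Q}\!\left[\log \frac{Q(z)}{P(z \mid x)}\right] \\
  &= \mathbb{E}_{Q}[\log Q(z)] - \mathbb{E}_{Q}[\log P(x, z)] + \log P(x) \\
\log P(x)
  &= \underbrace{\mathbb{E}_{Q}[\log P(x, z)] - \mathbb{E}_{Q}[\log Q(z)]}_{\mathcal{L}(Q)}
     + D(Q \,\|\, P)
  \;\ge\; \mathcal{L}(Q)
\end{align}
```

Because log P(x) does not depend on Q, maximizing the bound L(Q) over a tractable family is the same as minimizing D(Q||P).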
16
Experiments
We conducted experiments on four different domains:
– Digital camera: 85 Web pages from 41 different sites
– MP3 player: 96 Web pages from 62 different sites
– Camcorder: 111 Web pages from 61 different sites
– Restaurant: 29 Web pages from the LA-Weekly Restaurant Guide
In each domain, we conducted 10 runs of experiments.
In each run, we randomly selected a Web page and used the attributes inside as prior knowledge.
17
Evaluation on Attribute Normalization
Baseline approach: agglomerative clustering
– Only considers the text content of text fragments
Evaluation metrics
– Recall (R)
– Precision (P)
– F1-measure (F)
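A content-only baseline of this kind can be sketched as follows. The slides do not say which similarity, linkage, or stopping rule the baseline used, nor how precision and recall were counted for clusters. The sketch therefore assumes Jaccard similarity over word sets, single-link merging with a threshold, and pairwise counting. The example fragments and gold labels are made up for illustration.

```python
from itertools import combinations


def tokens(text):
    """Lower-cased word set of a fragment (content only, no layout)."""
    return {w.strip(",.").lower() for w in text.split() if w.strip(",.")}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def agglomerative(fragments, threshold):
    """Single-link clustering: merge the two most similar clusters
    until no pair of clusters reaches `threshold`."""
    toks = [tokens(f) for f in fragments]
    clusters = [[i] for i in range(len(fragments))]

    def sim(c1, c2):
        return max(jaccard(toks[i], toks[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        if sim(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def pairwise_prf(clusters, gold):
    """Precision, recall, and F1 over pairs of fragments grouped together."""
    predicted = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    truth = {frozenset((i, j))
             for i, j in combinations(range(len(gold)), 2)
             if gold[i] == gold[j]}
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


fragments = [
    "white balance auto, daylight, cloudy",
    "cloudy, daylight",
    "resolution 10,000,000 pixels",
    "10,000,000 pixels",
    "view larger image",
]
gold = ["white balance", "white balance", "resolution", "resolution", "other"]

clusters = agglomerative(fragments, threshold=0.3)
print(clusters, pairwise_prf(clusters, gold))
```

Raising the threshold to 0.5 leaves "cloudy, daylight" unmerged, because word overlap alone is too weak to tie it to the white balance fragment. This is the kind of case where layout information could help.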
18
Results of Attribute Normalization
19
Visualize the Normalized Attributes
The top five weighted terms in the ten largest normalized attributes in the digital camera domain
20
Evaluation on Attribute Extraction
Surprisingly, in the restaurant domain, our framework achieves a performance (0.95 F1-measure) that is comparable to the supervised method (Muslea et al., 2001).
21
Conclusions
Developed an unsupervised framework aiming at simultaneously extracting and normalizing product attributes from Web pages collected from different sites.
Developed a graphical model to model the generation of text fragments in Web pages.
Showed that content and layout information can collaborate and improve both extraction and normalization performance under our model.
22
Discussion
Pros
– Good motivation and proposed solution
– Performance is good enough for practical use.
Cons
– Lacks explanation of the equations
– Some words are used incorrectly