Slide 1: Semi-automatic Product Attribute Extraction from Store Website
Yan Liu, Carnegie Mellon University, Sep 2, 2005
Slide 2: Example from Dick's Sporting Goods
[Screenshot of a product webpage, annotated with: free text, product name, description, features, structured data]
Slide 3: Applications
Direct applications:
- Product recommendation systems for customers
- Price estimates for auctions
- Sales amount prediction
More general applications:
- Document organization and prioritization
- Question answering
- And many more text mining tasks
Slide 4: Relationship with Previous Work
Information extraction: extracting from documents salient facts about prespecified types of events, entities, or relationships (different from information retrieval)
Previous work:
- Finite state machines
- Sliding windows
- Sequential models, such as HMMs or CRFs
- Association and clustering
Major challenges:
- Little training data
- Unclear attribute definitions
- Making better use of labeled and unlabeled data
- Making better use of user feedback via active learning
Slide 5: Outline
- Introduction
- General framework
- Detailed algorithms
- Experiment results
- Conclusion and discussion
Slide 6: General Framework
Two stages, attribute identification then name-value assignment, refined by user feedback.
Example:
- Input (free text): "9.68-lb total weight (4.4-kg)"
- Attribute identification (semi-supervised learning): marks "9.68-lb" and "4.4-kg" as attribute values
- Name-value assignment (statistical and grammatical association): weight: 9.68-lb, 4.4-kg; generalized patterns weight: CD-lb, weight: CD-kg, ...
- Feedback (active learning)
- Output: structured data
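The two-stage pipeline above can be sketched end to end. This is a minimal illustration rather than the talk's implementation: the unit regex and the nearby-name heuristic (`identify_attribute`, `assign_name`) are assumptions made only for this example.

```python
import re

def identify_attribute(text):
    # Hypothetical identification step: flag spans that look like
    # measurement values (e.g. "9.68-lb"). The real system uses
    # semi-supervised learning rather than a fixed regex.
    return re.findall(r"\d+(?:\.\d+)?-(?:lb|kg)", text)

def assign_name(text, values):
    # Hypothetical assignment step: associate the extracted values
    # with an attribute name occurring nearby in the same fragment.
    name = "weight" if "weight" in text else None
    return {name: values} if name else {}

text = "9.68-lb total weight (4.4-kg)"
structured = assign_name(text, identify_attribute(text))
# structured == {"weight": ["9.68-lb", "4.4-kg"]}
```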
Slide 7: Attribute Identification
Initial label acquisition:
- Template matching
- Knowledge database
Semi-supervised learning:
- Yarowsky's algorithm
- Co-training
- Co-EM
- Co-boosting
- Graph-based methods
Phrase identification:
- Statistical associations between adjacent words
- Heuristic grammatical rules
Slide 8: Attribute Identification (1): Initial Label Acquisition
Positive labels:
- Template matching: templates extracted from data with a special format (noisy data)
- Knowledge database: measure units (length, weight, volume, etc.), material, country, color
Negative labels:
- Partial stop word list
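A minimal sketch of acquiring initial labels from such resources; the unit lexicon and partial stop word list below (`UNITS`, `STOP`) are illustrative stand-ins, not the actual knowledge database used in the talk.

```python
# Hypothetical seed resources: a tiny measure-unit lexicon (positive
# evidence) and a partial stop word list (negative evidence).
UNITS = {"lb", "kg", "oz", "in", "cm"}
STOP = {"the", "of", "and", "with"}

def initial_label(token):
    """Assign a noisy initial label to a token, or None if unknown."""
    t = token.lower().strip(".,()")
    if t in UNITS or any(t.endswith("-" + u) for u in UNITS):
        return "positive"
    if t in STOP:
        return "negative"
    return None  # unlabeled: left for semi-supervised learning
```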
Slide 9: Attribute Identification (1): Semi-supervised Learning
Co-training [Blum & Mitchell, 1998; Collins & Singer, 1999]
- Separation into two views: contextual features and spelling features
- Two kinds of features: stemmed words (Porter stemmer) and POS tags (Brill's tagger)
- Algorithm pseudocode
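The co-training loop can be sketched roughly as follows. The toy frequency-voting learner (`FreqLearner`) and the shared labeled pool are simplifying assumptions; the actual algorithm follows Blum & Mitchell's per-view bootstrapping over contextual and spelling views.

```python
from collections import Counter

class FreqLearner:
    """Toy per-view learner: votes with the label most often seen per feature."""
    def fit(self, X, y):
        self.votes = {}
        for feats, label in zip(X, y):
            for f in feats:
                self.votes.setdefault(f, Counter())[label] += 1
        return self

    def predict_confident(self, feats):
        """Return (label, confidence) aggregated over this example's features."""
        total = Counter()
        for f in feats:
            total.update(self.votes.get(f, {}))
        if not total:
            return None, 0.0
        label, count = total.most_common(1)[0]
        return label, count / sum(total.values())

def co_train(view1, view2, seed_labels, unlabeled, rounds=5, threshold=0.8):
    """Each round, train one learner per view on the labeled pool; confident
    predictions on unlabeled examples are added as new labels."""
    labeled = dict(seed_labels)
    unlabeled = set(unlabeled)
    for _ in range(rounds):
        idx = sorted(labeled)
        y = [labeled[i] for i in idx]
        l1 = FreqLearner().fit([view1[i] for i in idx], y)
        l2 = FreqLearner().fit([view2[i] for i in idx], y)
        new = {}
        for i in unlabeled:
            for learner, view in ((l1, view1), (l2, view2)):
                label, conf = learner.predict_confident(view[i])
                if label is not None and conf >= threshold:
                    new[i] = label
                    break
        if not new:
            break
        labeled.update(new)
        unlabeled -= set(new)
    return labeled
```

The point of the two views: an example confidently labeled through its spelling view becomes training data that lets the contextual view generalize to spellings it has never seen, and vice versa.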
Slide 10: Attribute Identification (1): Phrase Identification
- Difference from chunking: label propagation is category dependent
- Statistical association measures: information gain, mutual information, Yule's statistic
- Example: "display team colors up to 12 inches" (shown before and after phrase grouping)
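One of the listed association measures, mutual information between adjacent words, can be sketched as follows; this is a minimal pointwise-MI version (the talk also uses information gain and Yule's statistic):

```python
import math
from collections import Counter

def adjacent_pmi(token_lists):
    """Pointwise mutual information for adjacent word pairs; a high score
    suggests the pair behaves as a phrase (e.g. "team colors")."""
    uni, bi = Counter(), Counter()
    n_uni = n_bi = 0
    for toks in token_lists:
        uni.update(toks)
        n_uni += len(toks)
        pairs = list(zip(toks, toks[1:]))
        bi.update(pairs)
        n_bi += len(pairs)
    return {
        (a, b): math.log2((c / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), c in bi.items()
    }
```

Adjacent words whose PMI exceeds a threshold can then be merged into a single phrase unit before attribute labeling.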
Slide 11: Name-value Assignment
Combination of three information sources:
- Semantic association: knowledge database; attribute name generation and pair assignment
- Grammatical association: parse tree (Minipar); attribute name/value generation
- Statistical association scores: Yule's statistic (category dependent); pair assignment
Other association sources: WordNet
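The statistical score named above, Yule's statistic (Yule's Q), is computed from a 2x2 co-occurrence table; a minimal sketch:

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 contingency table:
    a = both terms present, b = first only, c = second only, d = neither.
    Ranges from -1 (perfect negative) to +1 (perfect positive association)."""
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0
```

A candidate attribute name that frequently co-occurs with a candidate value within a category yields Q near 1, supporting the pair assignment.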
Slide 12: User Feedback: Clustering-based Active Learning
Goals:
- Novel attribute identification
- Merging and splitting attributes
- Better use of labeled examples
Clustering algorithm:
- Sparse data problem
- Multiple clustering algorithms
Cluster selection:
- Within-cluster coherence
- Novelty-based measurement
Slide 13: User Feedback: Clustering Algorithm
Latent semantic indexing (LSI) [Deerwester et al., 1990]:
- Singular value decomposition of the term-document matrix
- Maps words into hidden semantic concepts
- Similarity measure: cosine similarity
Clustering algorithms (using CLUTO):
- K-means
- Bisecting K-means
- Agglomerative (single, complete, and average linkage)
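The LSI step can be sketched with a plain SVD. Assumptions for this sketch: a raw-count term-document matrix and a rank-k projection, with no term weighting such as tf-idf.

```python
import numpy as np

def lsi_doc_vectors(term_doc, k=2):
    """SVD of the term-document matrix; returns one k-dimensional
    concept-space vector per document (scaled by singular values)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T

def cosine(u, v):
    """Cosine similarity between two concept-space vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Documents sharing vocabulary land close together in concept space even before clustering, which mitigates the sparse data problem mentioned on the previous slide.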
Slide 14: User Feedback: Cluster Selection
Novelty concepts:
- Major difference from previous tasks: supervised novelty detection is difficult
- Tradeoff between novelty and relevancy, recently studied by the IR community [Carbonell and Goldstein, 1998; Zhang et al., 2003; Zhai et al., 2004]
Cluster selection criterion using maximal marginal relevance (MMR)
Similarity measures: cosine similarity, KL-divergence
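The MMR selection criterion can be sketched as a greedy loop. The `relevance` and `similarity` callables below are placeholders; the talk's concrete choices are cosine similarity or KL-divergence over cluster representations.

```python
def mmr_select(candidates, relevance, similarity, lam=0.7, k=3):
    """Greedy maximal marginal relevance: trade off relevance to the task
    against redundancy with already-selected items (here, clusters)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda near 1 the selection is pure relevance ranking; lowering lambda pushes the user's limited labeling time toward novel clusters.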
Slide 15: Outline
- Introduction
- General framework
- Detailed algorithms
- Experiment results
- Conclusion and discussion
Slide 16: Experiment Setup
Dataset: free text extracted from product descriptions on the store website, subsets from two categories:
- Football (largest category): 52,339 entries, 2,926 predicted feature-value pairs
- Tennis (medium category): 3,840 entries, 419 predicted feature-value pairs
Evaluation measures:
- Direct evaluation: precision on feature-value pairs
- Indirect evaluation in other applications (e.g., recommender systems)
Slide 17: Experiment Results: Examples by Step
- Initial label acquisition
- Semi-supervised learning
- Phrase identification
- Semantic association
- Grammatical association
- Statistical association scores
Slide 18: Experiment Results: Human Feedback
- Sample files (link to file)
- Total labeling time of 5 minutes
- Identified concepts: color, graphics, logo, design, fit, size, pocket, pad, set, adjustment, attachment, construction, strap
- Examples by active learning
Slide 19: Experiment Results: Precision on Most Frequent Feature-value Pairs
- The 600 most frequent pairs
- Each pair assigned one of 5 labels: fully correct, incorrect name, incorrect value, incorrect association, nonsense
- Human labeling took approximately 6 hours (thanks to Katharin and Marko)
- Results
Slide 20: Conclusion
Product attribute identification is a difficult task:
- Little training data: make use of labeled and unlabeled data via semi-supervised learning
- Unclear attribute definitions: identify novel attributes via active learning
Contribution: a framework combining active learning with semi-supervised learning
Slide 21: Text Learning Techniques
Text processing:
- Stemming (Porter stemmer)
- POS tagging (Brill's tagger)
- Text chunking and parsing (Minipar)
- Word semantics (WordNet, dependency-based thesaurus)
- Latent semantic indexing (SVDPack)
Machine learning:
- Semi-supervised learning (co-training)
- Active learning (MMR)
- Classification (C4.5 decision tree, FOIL)
- Clustering (K-means, CLUTO)
- Information theory and statistical associations (information gain, Yule's statistic)
Slide 22: Future Work
- Associations of product attributes across categories or websites
- More effective active learning algorithms
- Graphical models with application to information extraction
Slide 23: Questions?