Published by Brayan Brasseur. Modified over 10 years ago.
1
DATESO, April 14th 2005
Multimedia Information extraction from HTML product catalogues
Martin Labský (1), Vojtěch Svátek (1), Pavel Praks (2), Ondřej Šváb (1)
{labsky, svatek, xsvao06}@vse.cz, pavel.praks@vsb.cz
rainbow.vse.cz
(1) Dept. of Information and Knowledge Engineering, Prague University of Economics
(2) Dept. of Applied Mathematics, Technical University of Ostrava
2
Agenda
– Information Extraction from Internet
– Annotation using Hidden Markov Models
– Extracting images
– Instance composition guided by ontology
– Bicycle search application
3
IE from Internet
Motivation
– Semantic and structured search over large document collections
Requirements
– Identify relevant documents
– Perform automatic IE
documents are semi-structured, have heterogeneous layouts and formatting
searching for objects of type Bicycle in price range €500 - €900
find structures (name, price, equipment)
4
Our approach to IE
[Pipeline diagram: Acquire new document (HTML) -> Preprocessing (token sequence w_1 w_2 ... w_n) -> Annotation using HMMs (w_3 w_4 labeled name, w_6 w_7 labeled price, w_9 labeled picture) -> Instance extraction (Bicycle offer with name, price, picture)]
5
Relevant documents
6
Agenda
– Information Extraction from Internet
– Annotation using Hidden Markov Models
– Extracting images
– Instance composition guided by ontology
– Bicycle search application
7
Preprocessing
HTML cleanup
– conversion to valid XHTML
Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted
Formatting tags
– attributes removed
– several rules matching common constructions (add-to-basket form, choose-amount button)
Images
– baseline: all images treated as a single token
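Two of the preprocessing steps above (dropping tag attributes and mapping every image to one token) can be sketched with the Python standard library. This is only an illustration under assumptions: the token name `IMG`, the class `Preprocessor` and the function `preprocess` are hypothetical, and the XHTML cleanup, block filtering and form-matching rules from the slide are not reproduced.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "IMG"  # hypothetical name for the single image token


class Preprocessor(HTMLParser):
    """Turns HTML into a flat token list: bare tags, words, image tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Baseline: every image becomes the same token.
            self.tokens.append(IMAGE_TOKEN)
        else:
            # Attributes are dropped; only the tag name is kept.
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag != "img":  # a self-closed <img/> must not add an end tag
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words become individual tokens.
        self.tokens.extend(data.split())


def preprocess(html):
    parser = Preprocessor()
    parser.feed(html)
    parser.close()
    return parser.tokens


example = '<p class="x">TREK <b style="y">Session</b> <img src="a.jpg"></p>'
tokens = preprocess(example)
```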
8
Preprocessing – example
TREK Session 77 ( 2005 ) OUR PRICE £ 3000. 00 - - Select Size - - 15. 5 17. 5 19
TREK Session 77 (2005) OUR PRICE £3000.00 -- Select Size -- 15.5 17.5 19
0 1 2 3 4 5
9
Document modeling using HMMs
Generative model
Document = [w_1 c_1] [w_2 c_2]
P([w_1 c_1] [w_2 c_2]) = P(c_1) P(c_2 | c_1) P(w_1 | c_1) P(w_2 | c_2)
c_1 c_2 = argmax_{i,j} P([w_1 c_i] [w_2 c_j])
[Diagram: class states c_1 and c_2 connected by transition probabilities P(c_2 | c_1) and P(c_1 | c_2); each state emits words with lexical probabilities P(w_1 | c_1), P(w_1 | c_2). Both are estimated from training data (frequencies).]
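The argmax over class sequences is usually computed with Viterbi decoding rather than by enumerating all sequences. The sketch below assumes a first-order (bigram) class model, as in the two-word formula on this slide; the function name `viterbi` and the toy probabilities in the usage lines are made up for illustration and are not taken from the talk.

```python
def viterbi(words, classes, start_p, trans_p, emit_p):
    """Return the most probable class sequence and its joint probability.

    start_p[c]    ~ P(c_1 = c)
    trans_p[a][b] ~ P(c_i = b | c_{i-1} = a)   (transition prob.)
    emit_p[c][w]  ~ P(w | c)                   (lexical prob.)
    """
    if not words:
        return [], 1.0
    # best[c] = (probability, path) of the best path ending in class c
    best = {c: (start_p[c] * emit_p[c].get(words[0], 0.0), [c]) for c in classes}
    for w in words[1:]:
        new_best = {}
        for c in classes:
            # Pick the best predecessor class for c.
            prob, path = max(
                ((best[p][0] * trans_p[p][c], best[p][1]) for p in classes),
                key=lambda t: t[0],
            )
            new_best[c] = (prob * emit_p[c].get(w, 0.0), path + [c])
        best = new_best
    prob, path = max(best.values(), key=lambda t: t[0])
    return path, prob


# Toy model with invented numbers, two classes and two words.
classes = ["name", "price"]
start_p = {"name": 0.8, "price": 0.2}
trans_p = {"name": {"name": 0.5, "price": 0.5},
           "price": {"name": 0.3, "price": 0.7}}
emit_p = {"name": {"TREK": 0.9, "3000": 0.1},
          "price": {"TREK": 0.1, "3000": 0.9}}
path, prob = viterbi(["TREK", "3000"], classes, start_p, trans_p, emit_p)
```

A production decoder would work with log probabilities to avoid underflow on long documents; plain products are kept here to match the slide's formula.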
10
HMM Structure
States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected
Class trigram model
– P(name | name_prefix, name)
Variations
– word n-gram models for lexical probabilities of target states: P(w_i | w_{i-1}, name)
– state substructures instead of single target states, learned by EM
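A class trigram probability such as P(name | name_prefix, name) can be estimated by relative frequency from labeled class sequences. The sketch below is a minimal illustration: the function name `train_class_trigrams`, the `<s>` padding symbol and the absence of smoothing are assumptions, not details given in the talk.

```python
from collections import Counter


def train_class_trigrams(sequences):
    """Estimate P(c | a, b) by relative frequency (no smoothing)."""
    tri, bi = Counter(), Counter()
    for seq in sequences:
        # Pad so the first two classes also have a two-class history.
        padded = ["<s>", "<s>"] + list(seq)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1  # counts histories that are followed by a class

    def prob(c, a, b):
        """P(c | a, b), where (a, b) are the two preceding classes."""
        return tri[(a, b, c)] / bi[(a, b)] if bi[(a, b)] else 0.0

    return prob


# Invented training sequences of state labels.
training = [
    ["background", "name_prefix", "name", "name", "name_suffix"],
    ["name_prefix", "name", "name_suffix"],
]
p = train_class_trigrams(training)
p_name = p("name", "name_prefix", "name")  # P(name | name_prefix, name)
```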
11
Agenda
– Information Extraction from Internet
– Annotation using Hidden Markov Models
– Extracting images
– Instance composition guided by ontology
– Bicycle search application
12
Extracting Images
Baseline
– every image represented by the same token
– HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)
Use image classifier to preprocess images
– classifies into 3 classes: Pos, Neg, Unk
– before HMM annotation, each image occurrence in document is substituted by its class
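The substitution step can be pictured as a pass over the token stream that swaps each image occurrence for a class-specific token. In this sketch the token layout (`("IMG", src)` tuples), the output token names and the function `substitute_images` are hypothetical, and the classifier is a stand-in lookup table rather than the classifier described in the talk.

```python
def substitute_images(tokens, classify):
    """Replace every ("IMG", src) token by IMG_POS, IMG_NEG or IMG_UNK."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple) and tok[0] == "IMG":
            label = classify(tok[1])  # expected to return "Pos", "Neg" or "Unk"
            out.append("IMG_" + label.upper())
        else:
            out.append(tok)
    return out


# Stand-in classifier: a fixed table, unknown images map to "Unk".
known = {"trek77.jpg": "Pos", "logo.gif": "Neg"}
tokens = ["TREK", "Session", ("IMG", "trek77.jpg"), ("IMG", "logo.gif"), ("IMG", "x.png")]
annotatable = substitute_images(tokens, lambda src: known.get(src, "Unk"))
```

Compared with the baseline, the HMM then sees three distinct image symbols, so its lexical probabilities can differ between likely product pictures and likely decoration.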
13
Image Classification – Features
Image size
– estimated 2-dimensional normal distribution N_C(x, y) from a set of 1000 unique bicycle images
– estimated decision threshold (1-feature binary classifier) using held-out set of 150 images (60% positive)
Image similarity
– latent semantic similarity [Praks 2004]: sim(I_1, I_2)
– estimated decision threshold for 1-feature binary classifier
Does the image repeat in document?
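The image-size feature can be sketched as follows: fit a 2-dimensional normal distribution to (width, height) pairs of known product images, score a new image by its density, and choose a cut-off on a held-out set. The function names, the sample sizes and the use of held-out accuracy as the selection criterion are assumptions; the slides do not say how the threshold was chosen.

```python
import math


def fit_gaussian_2d(sizes):
    """Return mean (mx, my) and covariance (sxx, sxy, syy) of (width, height) pairs."""
    n = len(sizes)
    mx = sum(w for w, _ in sizes) / n
    my = sum(h for _, h in sizes) / n
    sxx = sum((w - mx) ** 2 for w, _ in sizes) / (n - 1)
    syy = sum((h - my) ** 2 for _, h in sizes) / (n - 1)
    sxy = sum((w - mx) * (h - my) for w, h in sizes) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def density(size, mean, cov):
    """Bivariate normal density N(x, y) at the given (width, height)."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    dx, dy = size[0] - mx, size[1] - my
    det = sxx * syy - sxy ** 2
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    m2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m2) / (2 * math.pi * math.sqrt(det))


def pick_threshold(scored):
    """scored: (density, is_positive) pairs from a held-out set.

    Returns the candidate threshold with the highest accuracy
    (assumed criterion), predicting positive when density >= threshold.
    """
    best_t, best_acc = 0.0, -1.0
    for t, _ in scored:
        acc = sum((d >= t) == pos for d, pos in scored) / len(scored)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Invented training sizes in pixels.
mean, cov = fit_gaussian_2d([(300, 200), (320, 210), (280, 190), (310, 205), (290, 200)])
typical = density((300, 201), mean, cov)
icon = density((16, 16), mean, cov)
```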
14
Image Classification
Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: N_C(x, y), sim_C(I), repeats(I)
Performance of binary classifiers
– 10-fold cross-validation, document-level folds
15
Annotation Results
Combined ternary classifier
– outputs Pos, Unk, Neg
– decision list based on predictions of all 3 single-feature ternary classifiers
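A decision list is an ordered set of rules where the first matching rule decides the output. The slides do not give the actual rules, so the ordering and conditions below are invented purely to show the shape of such a combiner; `combine` and its argument names are hypothetical.

```python
def combine(size_pred, sim_pred, repeat_pred):
    """Combine three ternary predictions ("Pos", "Neg", "Unk") into one.

    The rules are illustrative assumptions, not the rules used in the talk.
    """
    # Rule 1 (assumed): an image flagged by the repetition feature is rejected.
    if repeat_pred == "Neg":
        return "Neg"
    # Rule 2 (assumed): agreement of size and similarity on Pos wins.
    if size_pred == "Pos" and sim_pred == "Pos":
        return "Pos"
    # Rule 3 (assumed): agreement of size and similarity on Neg wins.
    if size_pred == "Neg" and sim_pred == "Neg":
        return "Neg"
    # Default: the combiner abstains.
    return "Unk"


label = combine("Pos", "Pos", "Unk")
```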
16
Agenda
– Information Extraction from Internet
– Annotation using Hidden Markov Models
– Extracting images
– Instance composition guided by ontology
– Bicycle search application
17
Instance Composition
[Diagram: Document annotated by HMM + Presentation ontology -> Instance extraction algorithm -> Instances (xml) -> Sesame RDF repository]
18
Domain ontology
Presentation ontology
19
Instance extraction algorithm
Sequentially parses annotated document
Adds annotated attributes to working instance WI
If adding an attribute would cause an inconsistency, an empty working instance is created. The old working instance is saved only if it is consistent.
    WI = empty_instance;
    while (more_attributes) {
      A = next_attribute;
      if (cannot_add (WI, A)) {
        if (consistent (WI)) {
          store (WI);
        }
        WI = empty_instance;
      }
      add (WI, A);
    }
http://eso.vse.cz/~labsky/cgi-bin/client/
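The loop above can be made runnable once `cannot_add` and `consistent` are given concrete meanings. In this sketch both are stand-ins: an attribute cannot be added when its maximum cardinality is already reached, and an instance is consistent when it has all required attributes. The real checks are driven by the presentation ontology, which is not reproduced here. The final flush after the loop is also an addition of this sketch; the slide's pseudocode ends without storing the last working instance.

```python
def is_consistent(instance, required):
    """Stand-in check: all required attributes are present."""
    return all(attr in instance for attr in required)


def extract_instances(attributes, max_card, required):
    """Group a stream of (attribute, value) pairs into instances.

    max_card maps an attribute name to its maximum cardinality (default 1).
    """
    instances = []
    wi = {}  # working instance: attribute name -> list of values
    for attr, value in attributes:
        # cannot_add: the cardinality limit for this attribute is reached.
        if len(wi.get(attr, [])) >= max_card.get(attr, 1):
            if is_consistent(wi, required):
                instances.append(wi)
            wi = {}
        wi.setdefault(attr, []).append(value)
    # Assumption: store the last working instance if it is consistent.
    if wi and is_consistent(wi, required):
        instances.append(wi)
    return instances


# Invented annotated attribute stream.
stream = [
    ("name", "TREK Session 77"), ("price", "3000.00"), ("picture", "trek.jpg"),
    ("name", "GT Avalanche"), ("price", "450.00"),
    ("price", "9.99"),
]
offers = extract_instances(stream, max_card={}, required={"name", "price"})
```

In the example the trailing lone price starts a new working instance that lacks a name, so it is discarded rather than stored.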
20
Agenda
– Information Extraction from Internet
– Annotation using Hidden Markov Models
– Extracting images
– Instance composition guided by ontology
– Bicycle search application
21
Bicycle search application, powered by Sesame RDF DB
http://rainbow.vse.cz:8000/sesame/
22
Future work
Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)
Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute orderings
– use approximate name matching to identify instances
Improve search interface
– approximate name matching (word and char edit distance)
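Word-level and character-level edit distance can share one implementation, since Levenshtein distance works on any sequence. The sketch below is a generic textbook version offered as an illustration of the planned approximate name matching; the function name and the sample product names are invented, and nothing here reflects how the authors intended to implement it.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


# Character-level distance between two spellings of one word.
char_dist = edit_distance("Session", "Sesion")
# Word-level distance between two product names.
word_dist = edit_distance("TREK Session 77".split(), "TREK Sesion 77 2005".split())
```

A matcher could combine the two levels, for example by treating two words as equal at the word level when their character distance is small.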
23
Thank you!
rainbow.vse.cz