1
Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley, Dennis Ng, Li Xu
Brigham Young University
2
Problem: Recognizing Applicable Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent
3
A Conceptual Modeling Solution
4
Car-Ads Ontology
Car [->object];
Car [0:0.975:1] has Year [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.908:1] has Model [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
Car [0:0.8:1] has Price [1:*];
PhoneNr [1:*] is for Car [1:1.15:*];
Year matches [4] constant
  { extract "\d{2}";
    context "([^\$\d]|^)[4-9]\d,[^\d]";
    substitute "^" -> "19"; },
…
End;
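To make the Year rule concrete, here is a minimal Python sketch of one possible reading of it. The interpretation is an assumption, not the authors' extraction engine: the context expression is taken to locate a candidate, the extract expression to pull the two-digit value out of that candidate, and the substitution "^" -> "19" to prepend the century. The function name `extract_years` and the sample ad text are hypothetical.

```python
import re

# Patterns copied from the Year rule in the ontology above.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text: str) -> list[str]:
    """Return four-digit years found via the context/extract/substitute rule.

    Assumed semantics (not confirmed by the slides): match the context,
    extract the two digits inside it, then prepend "19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group(0))
        if value:
            years.append("19" + value.group(0))
    return years


# Hypothetical ad text: "96," is a year candidate, "$4,500" is not,
# because the context pattern rejects a preceding "$" or digit.
print(extract_years("Honda Accord 96, auto, $4,500"))
```

Note that the context pattern requires a comma directly after the two digits, so a year followed by a space alone would not be picked up by this rule as transcribed.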
5
Recognition Heuristics
– H1: Density
– H2: Expected Values
– H3: Grouping
6
H1: Density
Document 1: Car Ads
Document 2: Items for Sale or Rent
7
Car Ads
– Number of Matched Characters: 626
– Total Number of Characters: 2048
– Density: 0.306
Items for Sale or Rent
– Number of Matched Characters: 196
– Total Number of Characters: 2671
– Density: 0.073
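The density figures above follow from dividing matched characters by total characters. A minimal Python sketch, where the function name `density` is illustrative and not taken from the slides:

```python
def density(matched_chars: int, total_chars: int) -> float:
    """Fraction of a document's characters matched by the ontology."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


# Counts taken from the slide.
print(f"Car Ads: {density(626, 2048):.3f}")                 # 0.306
print(f"Items for Sale or Rent: {density(196, 2671):.3f}")  # 0.073
```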
8
H2: Expected Values

            Document 1:   Document 2:
            Car Ads       Items for Sale or Rent
Year        3             1
Make        2             0
Model       3             0
Mileage     1             1
Price       1             0
Feature     15            0
PhoneNr     3             4
9
H2: Expected Values

            OV      D1    D2
Year        0.98    16    6
Make        0.93    10    0
Model       0.91    12    0
Mileage     0.45    6     2
Price       0.80    11    8
Feature     2.10    29    0
PhoneNr     1.15    15    11

D1: 0.996
D2: 0.567

[Figure: vectors OV, D1, D2]
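The slide does not name the measure behind 0.996 and 0.567. As an assumption, the sketch below treats it as the cosine of the angle between the OV column and each document's count column, which reproduces both figures from the table. The function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Rows: Year, Make, Model, Mileage, Price, Feature, PhoneNr (from the table).
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]

print(f"D1: {cosine_similarity(OV, D1):.3f}")  # 0.996
print(f"D2: {cosine_similarity(OV, D2):.3f}")  # 0.567
```

Because the cosine ignores vector length, a document with many records and one with few can score the same as long as the proportions among object sets are similar.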
10
H3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …
Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price …
11
H3: Grouping

Car Ads
Group                       Distinct 1-Max
Year Make Model             3
Price Year Model Year       3
Make Model Mileage Year     4
Model Mileage Price Year    4
…
Grouping: 0.875

Sale Items
Group                       Distinct 1-Max
Year Mileage                2
Mileage Year Price          3
Year Price Year             2
Price                       1
…
Grouping: 0.500

Expected Number in a Group = Ave = 4 (for our example)

Grouping = (Sum of Distinct 1-Max in each Group) / (Number of Groups × Expected Number in a Group)
Car Ads: (3+3+4+4) / (4 × 4) = 0.875
Sale Items: (2+3+2+1) / (4 × 4) = 0.500
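The grouping arithmetic can be sketched in a few lines of Python. The group contents below are copied as they appear in this transcript and cover only the four groups shown for each document; the function name `grouping_score` is illustrative.

```python
def grouping_score(groups: list[list[str]], expected_per_group: int) -> float:
    """Sum of distinct 1-max object sets per group, divided by
    (number of groups x expected number in a group)."""
    if not groups or expected_per_group <= 0:
        raise ValueError("need at least one group and a positive expectation")
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * expected_per_group)


car_ads = [
    ["Year", "Make", "Model"],
    ["Price", "Year", "Model", "Year"],
    ["Make", "Model", "Mileage", "Year"],
    ["Model", "Mileage", "Price", "Year"],
]
sale_items = [
    ["Year", "Mileage"],
    ["Mileage", "Year", "Price"],
    ["Year", "Price", "Year"],
    ["Price"],
]

print(grouping_score(car_ads, 4))     # 0.875
print(grouping_score(sale_items, 4))  # 0.5
```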
12
Combining Heuristics
Decision-Tree Learning Algorithm C4.5
– (H1, H2, H3, Positive)
– (H1, H2, H3, Negative)
Training Set
– 20 positive examples
– 30 negative examples (some purposely similar, e.g. classified ads)
Test Set
– 10 positive examples
– 20 negative examples
13
Car Ads: Rule & Results
Precision: 100%
Recall: 91%
Accuracy: 97%
– Harmonic Mean
– 2/(1/Precision + 1/Recall)
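The harmonic-mean formula quoted above can be written as a small helper. This is only a sketch of the formula itself; the function name `harmonic_mean` and the zero-handling convention are assumptions.

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """2 / (1/precision + 1/recall), with both values given as fractions.

    Returns 0.0 when either input is zero (an assumed convention),
    since the formula is undefined there.
    """
    if precision <= 0 or recall <= 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# A perfect classifier scores 1.0; a weak precision or recall pulls the mean down.
print(harmonic_mean(1.0, 1.0))
print(harmonic_mean(0.5, 1.0))
```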
14
False Negative
15
Obituaries
16
Obituaries: Rule & Results
Precision: 91%
Recall: 100%
Accuracy: 97%
17
False Positive: Missing Person Report
18
Universal Rule
Precision: 84%
Recall: 100%
Accuracy: 93%
19
Additional and Future Work
Other Approaches
– Naïve Bayes [McCallum96] (accuracy near 90%)
– Logistic Regression [Wang01] (accuracy near 95%)
– Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
More Extensive Testing
– Similar documents (motorcycles, wedding announcements, …)
– Accuracy drops to near 87%
– Naïve Bayes drops to near 77%
– Others … ?
Other Types of Documents
– XML Documents
– Forms and the Hidden Web
– Tables
20
Summary
Objective: Automatically Recognize Document Applicability
Approach:
– Conceptual Modeling
– Recognition Heuristics: Density, Expected Values, Grouping
Result: Accuracy Near 95%
www.deg.byu.edu