Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley, Dennis Ng, Li Xu
Brigham Young University
Problem: Recognizing Applicable Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent
A Conceptual Modeling Solution
Car-Ads Ontology
Car [->object];
Car [0:0.975:1] has Year [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.908:1] has Model [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
Car [0:0.8:1] has Price [1:*];
PhoneNr [1:*] is for Car [1:1.15:*];
Year matches [4]
  constant { extract "\d{2}";
             context "([^\$\d]|^)[4-9]\d,[^\d]";
             substitute "^" -> "19"; },
  …
End;
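The Year data frame above can be read as a three-step rule: find a two-digit year in the stated context, extract the two digits, and prefix "19". A minimal Python sketch of that reading follows. The two regular expressions are copied from the slide; the function name `extract_years` is hypothetical, and the way the three clauses are chained together is an assumption about how the extraction engine applies them.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Hypothetical helper: return the four-digit years recognized in text."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # Within the matched context, pull out the two-digit value.
        two_digits = EXTRACT.search(ctx.group(0)).group(0)
        # substitute "^" -> "19": insert "19" at the start of the value.
        years.append(re.sub(r"^", "19", two_digits, count=1))
    return years


if __name__ == "__main__":
    print(extract_years("Honda Accord 97, low miles, $4500"))
```

Note that the context pattern rejects a two-digit number preceded by "$", so a price such as "$95, firm" is not mistaken for a year.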
Recognition Heuristics
H1: Density
H2: Expected Values
H3: Grouping
H1: Density
Document 1: Car Ads
Document 2: Items for Sale or Rent
Car Ads
–Number of Matched Characters: 626
–Total Number of Characters: 2048
–Density: 0.306
Items for Sale or Rent
–Number of Matched Characters: 196
–Total Number of Characters: 2671
–Density: 0.073
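The density figures are consistent with dividing the number of matched characters by the total number of characters. A minimal sketch of that arithmetic, using the counts from the slide (the function name `density` is hypothetical):

```python
def density(matched_chars, total_chars):
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars == 0:
        return 0.0
    return matched_chars / total_chars


if __name__ == "__main__":
    print(round(density(626, 2048), 3))  # Car Ads
    print(round(density(196, 2671), 3))  # Items for Sale or Rent
```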
H2: Expected Values
Document 1: Car Ads
Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2: Items for Sale or Rent
Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4
H2: Expected Values
           OV      D1    D2
Year       0.975   3     1
Make       0.925   2     0
Model      0.908   3     0
Mileage    0.45    1     1
Price      0.8     1     0
Feature    2.1     15    0
PhoneNr    1.15    3     4
D1: D2:
[Figure: vectors ov, D1, and D2]
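The vector diagram suggests comparing each document's count vector with the ontology vector OV. A minimal sketch follows, assuming the comparison is the cosine of the angle between the two vectors; the slide itself does not name the measure. The OV entries are the averages from the Car-Ads ontology, the D1 and D2 entries are the counts listed for the two documents, and `cosine_similarity` is a hypothetical helper.

```python
import math

# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
OV = [0.975, 0.925, 0.908, 0.45, 0.8, 2.1, 1.15]  # ontology averages
D1 = [3, 2, 3, 1, 1, 15, 3]                       # Car Ads counts
D2 = [1, 0, 0, 1, 0, 0, 4]                        # Items for Sale or Rent counts


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity(OV, D1))
    print(cosine_similarity(OV, D2))
```

Under this assumption the car-ads document scores noticeably closer to OV than the items-for-sale document, because cosine ignores document length and compares only the proportions of the counts.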
H3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …
Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price …
H3: Grouping
Expected Number in Group = Ave = 4 (for our example)
Grouping = (Sum of Distinct 1-Max in each Group) / (Number of Groups × Expected Number in a Group)
Car Ads
Year Make Model Price Year Model Year Make Model Mileage Year Model Mileage Price Year …
Grouping:
Sale Items
Year Mileage Mileage Year Price Year Price Year Price …
Grouping: 0.500
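The grouping measure can be sketched as follows: cut the sequence of recognized 1-max object sets into consecutive groups of the expected size, count the distinct object sets in each group, and divide the total by the number of groups times the group size. This is a minimal sketch under two assumptions that the slide does not state: groups are taken in document order without overlap, and an incomplete trailing group is dropped. The function name `grouping_measure` is hypothetical, and the sequences below are only the visible prefixes from the slide (the originals are truncated with "…").

```python
def grouping_measure(sequence, group_size):
    """Average fraction of distinct 1-max object sets per group of group_size."""
    # Consecutive, non-overlapping groups; an incomplete last group is dropped
    # (an assumption, not something the slide specifies).
    groups = [
        sequence[i:i + group_size]
        for i in range(0, len(sequence) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * group_size)


# Visible prefixes of the two sequences shown on the slide.
CAR_ADS = ("Year Make Model Price Year Model Year Make "
           "Model Mileage Year Model Mileage Price Year").split()
SALE_ITEMS = "Year Mileage Mileage Year Price Year Price Year Price".split()

if __name__ == "__main__":
    print(grouping_measure(CAR_ADS, 4))
    print(grouping_measure(SALE_ITEMS, 4))
```

On the visible sale-items prefix, each complete group of four holds only two distinct object sets, which gives 0.5 and agrees with the 0.500 shown on the slide.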
Combining Heuristics
Decision-Tree Learning Algorithm C4.5
–(H1, H2, H3, Positive)
–(H1, H2, H3, Negative)
Training Set
–20 positive examples
–30 negative examples (some purposely similar, e.g., classified ads)
Test Set
–10 positive examples
–20 negative examples
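C4.5 chooses its tests by gain ratio, and for continuous attributes such as H1, H2, and H3 it considers threshold splits. The sketch below illustrates only that split-selection step on (H1, H2, H3, label) tuples; it is not a full C4.5 implementation (no recursion, no pruning). The four training tuples are made-up illustrative values, not data from the experiments, and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(rows, feature, threshold):
    """Gain ratio of the test rows[i][0][feature] <= threshold."""
    left = [label for x, label in rows if x[feature] <= threshold]
    right = [label for x, label in rows if x[feature] > threshold]
    if not left or not right:
        return 0.0
    n = len(rows)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy([label for _, label in rows]) - remainder
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain / split_info


def best_split(rows, n_features=3):
    """Return (gain_ratio, feature_index, threshold) of the best single test."""
    best = (0.0, None, None)
    for f in range(n_features):
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            ratio = gain_ratio(rows, f, t)
            if ratio > best[0]:
                best = (ratio, f, t)
    return best


# Made-up (H1, H2, H3) heuristic values, for illustration only.
TRAINING = [
    ((0.31, 0.90, 0.81), "Positive"),
    ((0.28, 0.85, 0.75), "Positive"),
    ((0.07, 0.47, 0.50), "Negative"),
    ((0.12, 0.60, 0.55), "Negative"),
]

if __name__ == "__main__":
    print(best_split(TRAINING))
```

On this toy data a single threshold on H1 separates the two classes, so the first test a tree learner would pick is on feature index 0.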
Car Ads: Rule & Results
Precision: 100%
Recall: 91%
Accuracy: 97%
–Harmonic Mean
–2/(1/Precision + 1/Recall)
False Negative
Obituaries
Obituaries: Rule & Results
Precision: 91%
Recall: 100%
Accuracy: 97%
False Positive: Missing Person Report
Universal Rule
Precision: 84%
Recall: 100%
Accuracy: 93%
Additional and Future Work
Other Approaches
–Naïve Bayes [McCallum96] (accuracy near 90%)
–Logistic Regression [Wang01] (accuracy near 95%)
–Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
More Extensive Testing
–Similar documents (motorcycles, wedding announcements, …)
–Accuracy drops to near 87%
–Naïve Bayes drops to near 77%
–Others … ?
Other Types of Documents
–XML Documents
–Forms and the Hidden Web
–Tables
Summary
Objective: Automatically Recognize Document Applicability
Approach:
–Conceptual Modeling
–Recognition Heuristics: Density, Expected Values, Grouping
Result: Accuracy Near 95%