Filtering Multiple-Record Web Documents Based on Application Ontologies
Presenter: L. Xu
Advisor: D. W. Embley
Examples
D1: Car
D2: Item for Sale or Rent
Car Ontology
Car[->object];
Car[ ] has Year;
Car[ ] has Make;
Car[ ] has Model;
Car[ ] has Mileage;
Car[ *] has Feature;
Car[ ] has Price;
PhoneNr is for Car[ *];
Year matches [4]
constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },.
End;
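The Year rule above can be sketched in Python as follows. This is only one reading of the rule, not the ontology system's actual implementation: it assumes the context pattern locates a candidate, the extract pattern pulls the two digits out of that context, and the substitution of "^" by "19" means prefixing the century. The function name extract_years is hypothetical.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in the given context.

    Assumption: substitute "^" -> "19" is read as "prefix the value with 19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group(0))
        if value:
            years.append("19" + value.group(0))
    return years
```

Under this reading, a two-digit number preceded by a dollar sign is treated as a price rather than a year, because the context pattern rejects it.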
Filtering Heuristics
H1: Density
H2: Expected-values
H3: Grouping
H1: Density
Car
– Total Number of Characters: 2048
– Number of Matched Characters: 626
– Density: 626 / 2048 ≈ 0.306
Item for Sale or Rent
– Total Number of Characters: 2671
– Number of Matched Characters: 196
– Density: 196 / 2671 ≈ 0.073
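The density arithmetic can be checked with a short Python sketch. It assumes density is the number of matched characters divided by the total number of characters; for the second document the counts are read as 196 matched out of 2671 total, the only reading consistent with the stated density of 0.073. The function name density is hypothetical.

```python
def density(matched_chars, total_chars):
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


car = density(626, 2048)      # car document
other = density(196, 2671)    # item for sale or rent document
print(round(car, 3), round(other, 3))
```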
H2: Expected-values
Object sets compared: Year, Make, Model, Mileage, Price, Feature, PhoneNr
[Table: OV, D1, and D2 values for each object set; values not recoverable]
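The slide's table did not survive, so the comparison itself is not visible here. One common way to compare a vector of expected values with a document's observed vector is cosine similarity, as in the vector space model; whether the slide uses exactly this measure is an assumption. The sketch below shows the measure only, with hypothetical names and no data from the slide.

```python
import math

# Object sets listed on the slide; both vectors must follow this order.
OBJECT_SETS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]


def cosine_similarity(expected, observed):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(expected) != len(observed):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(expected, observed))
    norm_e = math.sqrt(sum(x * x for x in expected))
    norm_o = math.sqrt(sum(y * y for y in observed))
    if norm_e == 0 or norm_o == 0:
        return 0.0
    return dot / (norm_e * norm_o)
```

A document whose per-object-set counts are proportional to the expected values scores 1.0 regardless of its length, which is why a ratio-based measure is a natural candidate for this heuristic.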
H3: Grouping
– Year: 2000; Year: 1989; Make: Subaru; Model: SW; Nr of Distinct "One Max" Object: 3
– Price: 1900; Year: 1998; Model: Elantra; Year: ; Nr of Distinct "One Max" Object: 3
Grouping Factor is:
– Year: 1999; Year: 1998; Year: 1960; Mileage: ; Nr of Distinct "One Max" Object: 2
– Mileage: ; Year: 1940; Price: ; Year: ; Nr of Distinct "One Max" Object: 3
Grouping Factor is: 0.5
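The per-group counts on this slide can be reproduced with a small Python sketch. It assumes that extracted values are taken in document order, cut into consecutive groups of four, and that each group is scored by its number of distinct object sets; the group size of four and the normalization in grouping_factor (sum of counts over the maximum possible) are assumptions, and both function names are hypothetical. The groups visible on the slide may be incomplete, so only the per-group counts are compared with the slide.

```python
from typing import List, Sequence


def distinct_per_group(labels: Sequence[str], group_size: int) -> List[int]:
    """Count distinct object sets in each consecutive full group of labels."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    full = len(labels) - len(labels) % group_size  # ignore a trailing partial group
    return [len(set(labels[i:i + group_size])) for i in range(0, full, group_size)]


def grouping_factor(labels: Sequence[str], group_size: int) -> float:
    """Assumed normalization: sum of distinct counts / (number of groups * group size)."""
    counts = distinct_per_group(labels, group_size)
    if not counts:
        return 0.0
    return sum(counts) / (len(counts) * group_size)


first = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year"]
second = ["Year", "Year", "Year", "Mileage", "Mileage", "Year", "Price", "Year"]
print(distinct_per_group(first, 4), distinct_per_group(second, 4))
```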
Combining Heuristics
Decision tree learning algorithm C4.5
– Learning task: suitability
– Performance measure: accuracy
– Training experience: human-classified documents
Training set
– 20 positive examples (from 10 geographical regions of the US)
– 30 negative examples
Test set
– 10 positive examples
– 20 negative examples
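How a decision tree learner picks a numeric threshold for one heuristic can be sketched in a few lines of Python. This is a simplification, not C4.5 itself: it scores candidate thresholds by plain information gain (C4.5 uses gain ratio and has its own rule for placing the threshold), it handles a single feature, and the function names are hypothetical. The usage values are made up for illustration and are not the training data from the slide.

```python
import math
from typing import List, Optional, Tuple


def entropy(labels: List[bool]) -> float:
    """Binary entropy, in bits, of a list of YES/NO labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_threshold(values: List[float], labels: List[bool]) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) for the split value <= threshold with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([label for _, label in pairs])
    best_thr, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal values
        thr = (low + high) / 2
        left = [label for value, label in pairs if value <= thr]
        right = [label for value, label in pairs if value > thr]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_thr, best_gain = thr, base - remainder
    return best_thr, best_gain


# Made-up heuristic values and human labels, for illustration only.
values = [0.125, 0.25, 0.375, 0.625, 0.75, 0.875]
labels = [False, False, False, True, True, True]
print(best_threshold(values, labels))
```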
Generated Rules
Car application
– H2 <= : NO
– H2 > : YES
Obituary application
– H2 <= : NO
– H2 >
– | H1 <= : NO
– | H1 > : YES
Universal rule
– H3 <= 0.625
– | H1 <= 0.369: NO
– | H1 > 0.369
– | | H2 <= : NO
– | | H2 > : YES
– H3 > 0.625: YES
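The universal rule reads as a nested conditional, sketched here in Python. The H3 and H1 thresholds are the two values that survive on the slide, assumed to apply to both branches of their binary splits; the H2 threshold is missing from the slide, so it is left as a required parameter (h2_threshold, a hypothetical name) rather than guessed.

```python
H3_THRESHOLD = 0.625  # from the slide (H3 > 0.625: YES)
H1_THRESHOLD = 0.369  # from the slide (H1 <= 0.369: NO)


def universal_rule(h1, h2, h3, h2_threshold):
    """Return True (YES, suitable) or False (NO) following the tree on the slide.

    h2_threshold is not given on the slide and must be supplied by the caller.
    """
    if h3 > H3_THRESHOLD:
        return True
    if h1 <= H1_THRESHOLD:
        return False
    return h2 > h2_threshold
```

Note that a high grouping factor alone is enough for acceptance, while density and expected values are consulted only when grouping is weak.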
Experiment Results
Car application
– accuracy: 96.7%
– precision: 100%
– recall: 91%
Obituary application
– accuracy: 96.7%
– precision: 91%
– recall: 100%
Universal rule
– accuracy: 93.4%
– precision: 84%
– recall: 100%
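The three measures can be computed from a confusion matrix as in the Python sketch below. The slides do not report the underlying counts; the counts in the usage lines are an assumption, chosen as one confusion matrix that fits a test set of 10 positive and 20 negative documents and reproduces the reported obituary figures. The function name metrics is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one document is required")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Assumed counts: all 10 positives accepted, 1 of 20 negatives wrongly accepted.
accuracy, precision, recall = metrics(tp=10, fp=1, fn=0, tn=19)
print(f"{accuracy:.1%} {precision:.0%} {recall:.0%}")
```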
False Drop Example
False Positive Example
Summary
Objective: Automatically filter multiple-record web documents.
Approach: Filtering heuristics
– Density
– Expected-values
– Grouping
Result: ~95% accuracy