Conceptual-Model-Based Web Data Extraction by Example. Yuanqiu (Joe) Zhou, Data Extraction Group, Brigham Young University. Sponsored by NSF.
Motivation: Data-rich websites are abundant. The conceptual-model-based methodology is resilient. The “by example” approach is user-friendly.
“By Example” Approach: Web users specify the desired information by creating a form. Users collect sample pages from the Web. An ontology generator learns the task by analyzing the form and the sample pages. User interaction may be needed to improve or complete the ontology.
Architecture (diagram): Data Frame Libraries; User Created Form; GUI; Sample Pages; Ontology Generator; Extraction Engine; Target Pages; Populated Database; Extraction Ontology.
Sample Web Page and User Created Form (figure). Form: Digital Camera with fields Brand, Model, CCD Resolution, Image Resolution, Optical Zoom, Digital Zoom. Visible values: Canon, PowerShot G x.
Extraction Ontology: Relationship Set and Constraints; Extraction Patterns; Keywords; Context Expressions.
Relationship Set and Constraints
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
Callouts: Primary Object Name (DigitalCamera); Other Objects’ Names (Brand, Model, CCDResolution, ImageResolution, OpticalZoom, DigitalZoom); Participation Constraints ([0:1], [1:*]).
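The declarations above follow a regular shape, so they could plausibly be derived from the labels of the user-created form. The following is a minimal sketch of that idea, not the ontology generator itself: the function name `relationship_sets`, the rule that removes spaces from labels, and the default constraints `0:1` and `1:*` are assumptions chosen to reproduce the slide.

```python
def relationship_sets(primary, fields, primary_card="0:1", field_card="1:*"):
    """Render declarations in the notation used on the slide (hypothetical helper)."""

    def object_name(label):
        # Assumption: object names are form labels with spaces removed,
        # e.g. "CCD Resolution" -> "CCDResolution".
        return "".join(label.split())

    primary_name = object_name(primary)
    # The primary object is marked with "[-> object]".
    lines = [f"{primary_name} [-> object];"]
    # One binary relationship set per form field.
    for label in fields:
        lines.append(
            f"{primary_name} [{primary_card}] has {object_name(label)} [{field_card}];"
        )
    return lines


form_fields = [
    "Brand", "Model", "CCD Resolution",
    "Image Resolution", "Optical Zoom", "Digital Zoom",
]
declarations = relationship_sets("Digital Camera", form_fields)
print("\n".join(declarations))
```

The defaults are only placeholders; a real generator would presumably infer the participation constraints from the sample pages rather than fix them in advance.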
Extraction Patterns: Data Frame Libraries provide Lexicons, a Synonym Dictionary, and Regular Expressions. Extraction patterns taken from the Data Frame Libraries: Lexicons for Brand and Model; Regular Expressions for numbers and image resolution.
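As an illustration of how a lexicon might be turned into extraction patterns, here is a small sketch. It assumes each lexicon entry becomes a word-bounded regular expression, which is the form the Brand patterns take in the extraction ontology. The list `BRAND_LEXICON` holds the five brands named there; the helper names are hypothetical.

```python
import re

# The five brand names that appear in the Brand rule of the ontology.
BRAND_LEXICON = ["Nikon", "Canon", "Olympus", "Minolta", "Sony"]


def lexicon_patterns(entries):
    # One word-bounded pattern per entry, e.g. r"\bNikon\b".
    return [re.compile(r"\b" + re.escape(entry) + r"\b") for entry in entries]


def find_lexicon_hits(text, entries):
    """Return lexicon entries found in text, in order of appearance."""
    hits = []
    for pattern in lexicon_patterns(entries):
        hits.extend((m.start(), m.group()) for m in pattern.finditer(text))
    return [value for _, value in sorted(hits)]


sample = "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD"
print(find_lexicon_hits(sample, BRAND_LEXICON))  # ['Nikon']
```

The word boundaries keep an entry from matching inside a longer word, so "Sony" would not be reported for a token such as "Sonya".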
Extraction Patterns (from Data Frame Libraries)
CCDResolution matches [20]
  constant{ extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
Sample text:
Features a high-quality 4.0 Megapixel Resolution CCD
The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
3 effective megapixel
Keywords
CCDResolution matches [20]
  constant{ extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
Sample text:
Features a high-quality 4.0 Megapixel Resolution CCD
The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
3 effective megapixel
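The extract pattern alone matches any short number, so keywords are what tie a number to CCD resolution. The sketch below shows one way this could work, under two assumptions that the slides do not state: the candidate closest to a keyword wins, and matching is case-insensitive (the third sample writes "megapixel" in lower case). The function name `ccd_resolution` and the warranty sentence in the usage example are made up for illustration.

```python
import re

# Patterns copied from the CCDResolution rule.
EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [
    re.compile(p, re.IGNORECASE)  # assumption: case-insensitive keywords
    for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")
]


def ccd_resolution(text):
    """Pick the candidate number nearest to any keyword (assumed heuristic)."""
    candidates = list(EXTRACT.finditer(text))
    keyword_hits = [m for k in KEYWORDS for m in k.finditer(text)]
    if not candidates or not keyword_hits:
        return None

    def gap(candidate):
        # Character distance between the candidate and the closest keyword.
        return min(
            max(k.start() - candidate.end(), candidate.start() - k.end(), 0)
            for k in keyword_hits
        )

    return min(candidates, key=gap).group()


print(ccd_resolution("Features a high-quality 4.0 Megapixel Resolution CCD"))  # 4.0
print(ccd_resolution("2 year warranty, 3.34 Megapixel CCD"))                   # 3.34
```

Note that "995" in the Nikon sample is never a candidate, because the pattern allows only one digit before the optional decimal part.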
Context Expressions
OpticalZoom matches [10]
  constant{ extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
Sample text:
3.5x optical zoom (2.5x digital)
a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
optical 3X /digital 6X zoom
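A context expression can be read as a two-step match: first find the wider context (a number followed by "x"), then extract only the number from it. The sketch below applies that reading to the three samples. It is an interpretation, not the extraction engine: it assumes case-insensitive matching (the samples contain "3X" and "Optical") and assumes the context closest to the keyword is chosen; `optical_zoom` is a hypothetical name.

```python
import re

FLAGS = re.IGNORECASE  # assumption: the samples mix "x"/"X" and "optical"/"Optical"
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", FLAGS)
EXTRACT = re.compile(r"\b\d(\.\d)?", FLAGS)
KEYWORD = re.compile(r"\boptical\b", FLAGS)


def optical_zoom(text):
    """Extract the zoom factor from the context match nearest to the keyword."""
    contexts = list(CONTEXT.finditer(text))
    keyword_hits = list(KEYWORD.finditer(text))
    if not contexts or not keyword_hits:
        return None

    def gap(context):
        # Character distance between the context match and the closest keyword.
        return min(
            max(k.start() - context.end(), context.start() - k.end(), 0)
            for k in keyword_hits
        )

    best = min(contexts, key=gap)
    # The extract pattern is applied inside the context match only,
    # so the trailing "x" is dropped from the value.
    return EXTRACT.search(best.group()).group()


print(optical_zoom("3.5x optical zoom (2.5x digital)"))  # 3.5
print(optical_zoom("optical 3X /digital 6X zoom"))       # 3
```

In the first sample both "3.5x" and "2.5x" satisfy the context expression; the keyword is what separates the optical value from the digital one.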
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant{ extract "\bNikon\b";},
  { extract "\bCanon\b";},
  { extract "\bOlympus\b";},
  { extract "\bMinolta\b";},
  { extract "\bSony\b";};
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant{ extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant{ extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
  { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant{ extract "\b\d"; context "\b\d(x)\b"; };
  keyword "\boptical\b";
end;
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant{ extract "\bNikon\b";},
  { extract "\bCanon\b";},
  { extract "\bOlympus\b";},
  { extract "\bMinolta\b";},
  { extract "\bSony\b";};
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant{ extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant{ extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
  { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant{ extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
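The two versions of the ontology differ only in the OpticalZoom patterns: one accepts a single digit before the "x", the other also accepts a decimal part. The sketch below compares the two on the sample "3.5x optical zoom (2.5x digital)". It assumes the extract pattern is applied inside each context match and that matching is case-insensitive. The decimal-aware patterns are taken in the form given on the Context Expressions slide, and `extract_in_context` is a hypothetical helper.

```python
import re


def extract_in_context(context_pattern, extract_pattern, text):
    """Return the value extracted from every context match in text."""
    context = re.compile(context_pattern, re.IGNORECASE)
    extract = re.compile(extract_pattern, re.IGNORECASE)
    values = []
    for hit in context.finditer(text):
        inner = extract.search(hit.group())
        if inner:
            values.append(inner.group())
    return values


sample = "3.5x optical zoom (2.5x digital)"

# Integer-only patterns: the context can only start at the digit after the dot.
integer_only = extract_in_context(r"\b\d(x)\b", r"\b\d", sample)

# Decimal-aware patterns: the whole factor is captured.
decimal_aware = extract_in_context(r"\b\d(\.\d)?(x)\b", r"\b\d(\.\d)?", sample)

print(integer_only)   # ['5', '5']
print(decimal_aware)  # ['3.5', '2.5']
```

The integer-only patterns do not fail outright on this page; they return a wrong value, which is the kind of error that a check against sample pages can expose.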
Results (Same Site)
Results (Different Site)
Summary and Future Work: The example indicates that the approach is feasible. Some open questions remain to be explored.