Conceptual-Model-Based Web Data Extraction by Example Yuanqiu (Joe) Zhou Data Extraction Group Brigham Young University Sponsored by NSF.

1 Conceptual-Model-Based Web Data Extraction by Example Yuanqiu (Joe) Zhou Data Extraction Group Brigham Young University Sponsored by NSF

2 Motivation
- Data-rich websites in abundance
- Conceptual-model-based methodology is resilient
- “By Example” approach is user-friendly

3 “By Example” Approach
- Web users specify the desired information by creating a form
- Users collect sample pages from the Web
- An ontology generator learns the task by analyzing the form and the sample pages
- User interaction may be needed to improve or complete the ontology

4 Architecture
[Diagram; labels: Data Frame Libraries, User Created Form, GUI, Sample Pages, Ontology Generator, Extraction Ontology, Extraction Engine, Target Pages, Populated Database]

5 [Figure; labels: Sample Web Page, User Created Form]
Digital Camera
  Brand: Canon
  Model: PowerShot G2
  CCD Resolution: 4.0
  Image Resolution: 2272 x 1074
  Optical Zoom: 3
  Digital Zoom: 2

6 Extraction Ontology
- Relationship Set and Constraints
- Extraction Patterns
- Keywords
- Context Expressions

7 Relationship Set and Constraints
- Primary Object Name
- Other Objects’ Names
- Participation Constraints

DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
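As a rough illustration of what the participation constraints express, the sketch below parses the declarations and checks an extracted record against them. It assumes that `DigitalCamera [0:1] has Brand [1:*]` means a camera carries at most one Brand value; the function names `parse_relationship_sets` and `violations` are hypothetical and are not part of the extraction system described in the slides.

```python
import re

# Matches lines such as: DigitalCamera [0:1] has Brand [1:*];
DECLARATION = re.compile(
    r"(\w+)\s*\[(\d+):(\d+|\*)\]\s+has\s+(\w+)\s*\[(\d+):(\d+|\*)\];"
)

def parse_relationship_sets(source):
    """Map each object-set name to the (min, max) participation of the primary object.

    A max of None stands for '*' (unbounded). Assumed reading of the notation.
    """
    constraints = {}
    for m in DECLARATION.finditer(source):
        _primary, low, high, name, _obj_low, _obj_high = m.groups()
        constraints[name] = (int(low), None if high == "*" else int(high))
    return constraints

def violations(record, constraints):
    """List object sets whose number of extracted values falls outside the bounds."""
    bad = []
    for name, (low, high) in constraints.items():
        count = len(record.get(name, []))
        if count < low or (high is not None and count > high):
            bad.append(name)
    return bad
```

Under this reading, a record with two Model values for a single camera would be flagged, while a record with no Brand at all would pass because the lower bound is 0.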

11 Extraction Patterns
- Data Frame Libraries
  - Lexicons
  - Synonym Dictionary
  - Regular Expressions
- Extraction Pattern (from Data Frame Libraries):
  - Lexicons for Brand and Model
  - Regular expressions for numbers and image resolution
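One way to picture how a lexicon and a regular expression both end up as extraction patterns is sketched below. The brand names are the ones that appear in the ontology later in the deck; the helper `lexicon_pattern` is a hypothetical name, and how the actual data frame libraries store lexicons is not stated in the slides.

```python
import re

# Lexicon entries taken from the Brand rule shown later in the deck.
BRAND_LEXICON = ["Nikon", "Canon", "Olympus", "Minolta", "Sony"]

def lexicon_pattern(entries):
    """Compile one alternation that matches any lexicon entry as a whole word."""
    # Longest entries first so a longer name wins over a shorter prefix of it.
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")

BRAND = lexicon_pattern(BRAND_LEXICON)

# Regular expression for image resolution, as written in the ontology.
IMAGE_RESOLUTION = re.compile(r"\b\d{4}(\s)?(x)(\s)?\d{4}\b")
```

For example, `BRAND.findall("The new Nikon Coolpix 995")` yields the single brand name, and `IMAGE_RESOLUTION.search("2272 x 1074")` matches the whole resolution string.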

12 Extraction Patterns (Data Frame Libraries)
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel

CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
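To see what the extract pattern accepts on the three sample phrases, the expression can be tried directly with Python's `re` module. This is only a sketch of the pattern's behavior; the function name `ccd_values` is hypothetical, and the original engine may use a different regex dialect.

```python
import re

# The extract expression from the CCDResolution rule: one digit,
# optionally followed by a decimal point and one or two digits.
CCD_VALUE = re.compile(r"\b\d(\.\d{1,2})?\b")

def ccd_values(text):
    """Return every substring of text that the extract pattern matches."""
    return [m.group(0) for m in CCD_VALUE.finditer(text)]

samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]
for sample in samples:
    print(ccd_values(sample))
```

The word boundaries on both sides keep the model number 995 from being picked up, since no single digit of it stands alone as a word.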

13 Keywords
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel

15 Keywords
- Features a high-quality 4.0 Megapixel Resolution CCD
- The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
- 3 effective megapixel

CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
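The role of keywords can be sketched as a disambiguation step: when several numbers match the extract pattern, prefer the one nearest a keyword. The nearest-keyword rule and case-insensitive matching are assumptions made for this illustration (the third sample phrase writes "megapixel" in lowercase); the slides do not say how the engine weighs keywords. `pick_by_keyword` is a hypothetical name.

```python
import re

VALUE = re.compile(r"\b\d(\.\d{1,2})?\b")
# Keywords from the CCDResolution rule, matched case-insensitively (assumption).
KEYWORD = re.compile(r"\bMegapixel\b|\bCCD\b|\bResolution\b", re.IGNORECASE)

def pick_by_keyword(text):
    """Return the candidate value closest to any keyword, or None."""
    keyword_positions = [m.start() for m in KEYWORD.finditer(text)]
    if not keyword_positions:
        return None  # no keyword nearby, so no evidence the number is a CCD resolution
    best = None
    for m in VALUE.finditer(text):
        distance = min(abs(m.start() - k) for k in keyword_positions)
        if best is None or distance < best[0]:
            best = (distance, m.group(0))
    return best[1] if best else None
```

With a phrase such as "Includes 2 batteries and a 4.0 Megapixel CCD", both 2 and 4.0 match the extract pattern, and the keyword proximity selects 4.0.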

16 Context Expressions
- 3.5x optical zoom (2.5x digital)
- a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
- optical 3X /digital 6X zoom

OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
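A possible reading of the context expression is that it locates the value together with its surrounding marker (the trailing "x"), and the extract expression then pulls out only the number. The sketch below follows that reading and picks the context match nearest the keyword "optical". Case-insensitive matching is assumed so that "3X" is handled, and `optical_zoom` is a hypothetical name.

```python
import re

CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)  # value plus trailing "x"
EXTRACT = re.compile(r"\b\d(\.\d)?")                       # the value alone
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)

def optical_zoom(text):
    """Return the zoom factor nearest the keyword 'optical', or None."""
    keyword = KEYWORD.search(text)
    if keyword is None:
        return None
    contexts = list(CONTEXT.finditer(text))
    if not contexts:
        return None
    # Choose the context match closest to the keyword, then strip the "x".
    nearest = min(contexts, key=lambda m: abs(m.start() - keyword.start()))
    return EXTRACT.match(nearest.group(0)).group(0)
```

On "3.5x optical zoom (2.5x digital)" this selects 3.5 rather than 2.5, because the first factor sits closer to the keyword.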

17 Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
           { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d"; context "\b\d(x)\b"; };
  keyword "\boptical\b";
end;

19 Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
           { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
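To make the flow from ontology to populated record concrete, here is a minimal sketch of an extraction pass over one product description. It is not the Extraction Engine from the architecture slide: the dictionary layout, the nearest-keyword selection, the case-insensitive matching, and the name `extract_record` are all assumptions chosen for illustration, and the sample text is made up from the values on the sample form.

```python
import re

# Hypothetical in-memory form of the ontology rules shown above.
ONTOLOGY = {
    "Brand": {"extract": r"\b(?:Nikon|Canon|Olympus|Minolta|Sony)\b"},
    "CCDResolution": {
        "extract": r"\b\d(\.\d{1,2})?\b",
        "keyword": r"\b(?:Megapixel|CCD|Resolution)\b",
    },
    "ImageResolution": {
        "extract": r"\b\d{4}(\s)?(x)(\s)?\d{4}\b",
        "keyword": r"\b(?:Resolution|Image)\b",
    },
    "OpticalZoom": {
        "extract": r"\b\d(\.\d)?",
        "context": r"\b\d(\.\d)?(x)\b",
        "keyword": r"\boptical\b",
    },
}

def extract_record(text, ontology=ONTOLOGY):
    """Return a dict with one extracted value per object set found in text."""
    record = {}
    for name, rule in ontology.items():
        # Candidates come from the context expression when there is one.
        outer = re.compile(rule.get("context", rule["extract"]), re.IGNORECASE)
        candidates = list(outer.finditer(text))
        if not candidates:
            continue
        if "keyword" in rule:
            keywords = [k.start() for k in re.finditer(rule["keyword"], text, re.IGNORECASE)]
            if not keywords:
                continue  # rule has keywords but none occur in the text
            chosen = min(candidates, key=lambda m: min(abs(m.start() - k) for k in keywords))
        else:
            chosen = candidates[0]
        # Apply the extract expression inside the chosen candidate.
        inner = re.match(rule["extract"], chosen.group(0), re.IGNORECASE)
        if inner:
            record[name] = inner.group(0)
    return record

text = "Canon PowerShot G2: 4.0 Megapixel CCD, image resolution 2272 x 1074, 3x optical zoom"
print(extract_record(text))
```

Model and DigitalZoom are left out of the sketch because the slides show no rules for them.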

21 Results (Same Site)

22 Results (Different Site)

23 Summary and Future Work
- The example indicates that the approach is feasible
- Some open questions need to be explored

