1
Generating Data-Extraction Ontologies By Example
Joe Zhou
Data Extraction Group
Brigham Young University
2
2 Background
The World Wide Web contains a huge amount of useful information.
Web data extraction is necessary for querying the data of interest.
Most wrappers generate extraction patterns based on delimiters or HTML tags, so they are source-dependent.
The BYU ontology-based technique is resilient.
3
3 Problem and Solution
The BYU Onto approach requires ontology experts to generate data-extraction ontologies for the domains that interest ordinary users.
A principal effort of our research is to automate the ontology-generation process as much as possible.
We developed a system, OntoByE (Ontology By Example), to generate data-extraction ontologies semi-automatically.
4
4 Extraction Ontology
Object sets, Relationship sets and Constraints
Data frames for Lexical Object Sets
5
5 Extraction Ontology
Object sets, Relationship sets and Constraints
  Digital Camera [0:1] Brand [1:*]
  Digital Camera [0:1] Model [1:*]
  Digital Camera [0:1] CCD Resolution [1:*]
  Digital Camera [0:3] Image Resolution [1:*]
  Digital Camera [0:1] Zooms [1:*]
  Digital Camera [0:1] Weight [1:*]
  Digital Camera [0:1] Dimensions [1:*]
  Digital Camera [0:1] Price [1:*]
  Digital Camera [0:1] LCD Size [1:*]
  Zooms [0:1] Optical Zoom [1:*]
  Zooms [0:1] Digital Zoom [1:*]
  Dimensions [0:1] Width [1:*]
  Dimensions [0:1] Depth [1:*]
  Dimensions [0:1] Height [1:*]
Data frame for Digital Zoom
  Value Phrase
    Value Expression: \d(\.\d)?
    Left Context Expression:
    Right Context Expression: (\s)?(x|X)
  Keyword Phrase
    Keyword Expression: Digital\sZoom|digital\szoom
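As a rough illustration of how the four expressions of the Digital Zoom data frame could work together, here is a minimal Python sketch. The `DataFrame` class and its `extract` method are hypothetical names, not part of OntoByE, and the rule "accept values only when a keyword occurs in the text" is an assumption made for the example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class DataFrame:
    """Hypothetical container for the expressions shown on the slide."""
    value: str
    left_context: str
    right_context: str
    keyword: str

    def extract(self, text: str) -> List[str]:
        # Assumption: a value is only accepted if a keyword appears in the text.
        if not re.search(self.keyword, text):
            return []
        # A named group isolates the value from groups inside the contexts.
        pattern = f"{self.left_context}(?P<value>{self.value}){self.right_context}"
        return [m.group("value") for m in re.finditer(pattern, text)]


# Expressions copied from the slide; the left context is empty.
digital_zoom = DataFrame(
    value=r"\d(\.\d)?",
    left_context="",
    right_context=r"(\s)?(x|X)",
    keyword=r"Digital\sZoom|digital\szoom",
)

print(digital_zoom.extract("ISO 50 Digital Zoom - 4.1 x Shooting Modes"))
```

The right context keeps a bare number such as the 50 in "ISO 50" from being picked up, and the keyword check keeps an optical zoom value from being mistaken for a digital one.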
6
6 OntoByE System Overview and Architecture
[Architecture diagram; labels: Forms; User Interface; Sample Pages; Marked Pages; Data Frame Library; Ontology Generator; Extraction Ontology; Extraction Engine; Target Pages; Populated Database]
7
7 OntoByE – User Interface
8
8 Form Editor – Basic Form Elements
9
9 Form Editor – Nesting Forms
10
10 Form Editor – Creating Forms for Digital Camera Application
11
11 Training Web Document Preparation
12
12 Ontology Generator – Workflow
[Workflow diagram; labels: Users; User-defined Forms; Marked HTML Pages; Data Frame Library; Form Analyzer; Context Phrase Locator; Data Frame Matcher; Keyword and Context Expression Recognizer; Data Frame Editor; Object Sets, Relationship Sets and Constraints; Data Frames; Extraction Ontology]
13
13 Ontology Generator – Form Analyzer
Sample Form
Object and Relationship Sets and Constraints
  BaseForm [0:1] A [1:*]
  BaseForm [0:3] B [1:*]
  BaseForm [0:*] C [1:*]
  BaseForm [0:3] D1 [1:*] D2 [1:*] D3 [1:*]
  BaseForm [0:*] E1 [1:*] E2 [1:*] E3 [1:*]
14
14 Ontology Generator – Form Analyzer
Digital Camera application Forms
Object and Relationship Sets and Constraints
  Digital Camera [0:1] Brand [1:*]
  Digital Camera [0:1] Model [1:*]
  Digital Camera [0:1] CCD Resolution [1:*]
  Digital Camera [0:3] Image Resolution [1:*]
  Digital Camera [0:1] Zooms [1:*]
  Digital Camera [0:1] Weight [1:*]
  Digital Camera [0:1] Dimensions [1:*]
  Digital Camera [0:1] Price [1:*]
  Digital Camera [0:1] LCD Size [1:*]
  Zooms [0:1] Optical Zoom [1:*]
  Zooms [0:1] Digital Zoom [1:*]
  Dimensions [0:1] Width [1:*]
  Dimensions [0:1] Depth [1:*]
  Dimensions [0:1] Height [1:*]
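The relationship-set notation on this slide can be read mechanically. The sketch below parses one such line into object sets with their participation constraints; it assumes each bracket `[min:max]` constrains the object set named just before it, with `*` meaning unbounded. The names `Participation` and `parse_relationship_set` are invented for the illustration and say nothing about how the Form Analyzer itself is implemented.

```python
import re
from typing import List, NamedTuple, Optional


class Participation(NamedTuple):
    object_set: str
    minimum: int
    maximum: Optional[int]  # None stands for "*" (no upper bound)


# One object-set name followed by its [min:max] bracket.
PART = re.compile(r"(.+?)\s*\[(\d+):(\d+|\*)\]\s*")


def parse_relationship_set(line: str) -> List[Participation]:
    """Split a line such as 'Zooms [0:1] Digital Zoom [1:*]' into its parts."""
    return [
        Participation(name.strip(), int(lo), None if hi == "*" else int(hi))
        for name, lo, hi in PART.findall(line)
    ]


for part in parse_relationship_set("Digital Camera [0:3] Image Resolution [1:*]"):
    print(part)
```

Under this reading, a Digital Camera takes part in the Image Resolution relationship at most three times, and the same function handles the n-ary lines of the sample form (for example `BaseForm [0:3] D1 [1:*] D2 [1:*] D3 [1:*]`).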
15
15 Ontology Generator – Context Phrase Locator
Context Phrase 1: 400, ISO 200, ISO 50 Digital Zoom - 4.1 x Shooting Modes - Frame movie mode
Context Phrase 2: x Digital Zoom - 4 x Camera Flash - Pop-up flash Red Eye R
Context Phrase 3: 3.2X digital zoom PictBridge compatibl
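The context phrases above look like fixed-size windows of page text around each marked value. A minimal sketch of that idea follows, assuming the marked value is known by its character offsets in the page text; the window size and the names `ContextPhrase` and `locate_context` are assumptions, since the slide does not state how the locator chooses its boundaries.

```python
from typing import NamedTuple


class ContextPhrase(NamedTuple):
    left: str
    value: str
    right: str


def locate_context(page_text: str, start: int, end: int, window: int = 40) -> ContextPhrase:
    """Return up to `window` characters on each side of page_text[start:end]."""
    left = page_text[max(0, start - window):start]
    right = page_text[end:end + window]
    return ContextPhrase(left, page_text[start:end], right)


text = "ISO 200, ISO 50 Digital Zoom - 4.1 x Shooting Modes - Frame movie mode"
start = text.index("4.1")
print(locate_context(text, start, start + 3, window=10))
```

Keeping the left part, the value, and the right part separate makes it straightforward to look for keywords and context expressions on either side of the value afterwards.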
16
16 Ontology Generator – Data Frame Matcher
Data Frame Matching Heuristics:
  Number of matched data
Data Frame Ranking Heuristics:
  Number of matched data
  Keywords and/or Contexts Matching
  Order of Specialization/Generalization
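Two of these heuristics can be sketched in a few lines of Python: count how many user-marked values each library data frame matches, then break ties in favor of the more specialized frame. The value expressions and the specialization order in `LIBRARY` are assumptions written for this example (the slides name the data frames but do not give their expressions), and the keyword/context heuristic is left out.

```python
import re
from typing import List

# Hypothetical library, listed from most specialized to most general.
# The regular expressions are approximations, not OntoByE's own definitions.
LIBRARY = [
    ("SingleDigit", r"\d"),
    ("SmallPositiveInteger", r"[1-9]\d?"),
    ("SmallPositiveReal", r"\d{1,2}(\.\d{1,2})?"),
    ("Integer", r"-?\d+"),
    ("RealNumber", r"-?\d+(\.\d+)?"),
]


def rank_data_frames(marked_values: List[str]) -> List[str]:
    """Rank data frames by matched-value count, then by specialization."""
    scored = []
    for specificity, (name, value_expr) in enumerate(LIBRARY):
        hits = sum(1 for v in marked_values if re.fullmatch(value_expr, v))
        if hits:
            scored.append((-hits, specificity, name))
    return [name for _, _, name in sorted(scored)]


print(rank_data_frames(["4.1", "4", "3.2"]))  # digital zoom samples
print(rank_data_frames(["3", "4", "3"]))      # optical zoom samples
```

With these assumed expressions, the digital zoom samples rank SmallPositiveReal first and the optical zoom samples rank SingleDigit first.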
17
17 Ontology Generator – Keyword and Context Recognizer
Context Phrase 1: 400, ISO 200, ISO 50 Digital Zoom - 4.1 x Shooting Modes - Frame movie mode
Context Phrase 2: x Digital Zoom - 4 x Camera Flash - Pop-up flash Red Eye R
Context Phrase 3: 3.2X digital zoom PictBridge compatibl
Left Context Expression:
Right Context Expression: (\s)?(x|X)
Keywords: Digital\sZoom|digital\szoom
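One plausible way to arrive at expressions like these from the three context phrases is sketched below. It is a simplification under stated assumptions, not the recognizer's actual algorithm: the right context is taken to be the first token after each marked value, accepted only if all samples agree up to letter case, and keywords are taken to be case variants of the form field label (here "Digital Zoom") that occur in the phrases. Both function names are hypothetical.

```python
import re
from typing import List


def right_context_expr(rights: List[str]) -> str:
    """Build a right-context expression from the text following each value."""
    tokens, spaced = [], []
    for right in rights:
        stripped = right.lstrip()
        spaced.append(len(stripped) < len(right))  # was there leading whitespace?
        tokens.append(stripped.split(None, 1)[0] if stripped else "")
    # Give up unless every sample has the same token, ignoring case.
    if not all(tokens) or len({t.lower() for t in tokens}) != 1:
        return ""
    space = r"\s" if all(spaced) else (r"(\s)?" if any(spaced) else "")
    variants = list(dict.fromkeys(tokens))  # distinct, in first-seen order
    return space + "(" + "|".join(re.escape(t) for t in variants) + ")"


def keyword_expr(label: str, phrases: List[str]) -> str:
    """Collect the case variants of `label` found in the phrases."""
    finder = re.compile(re.escape(label), re.IGNORECASE)
    seen = dict.fromkeys(m for p in phrases for m in finder.findall(p))
    return "|".join(r"\s".join(re.escape(w) for w in v.split(" ")) for v in seen)


phrases = [
    "400, ISO 200, ISO 50 Digital Zoom - 4.1 x Shooting Modes - Frame movie mode",
    "x Digital Zoom - 4 x Camera Flash - Pop-up flash Red Eye R",
    "3.2X digital zoom PictBridge compatibl",
]
# Text to the right of the marked values 4.1, 4 and 3.2.
rights = [" x Shooting Modes", " x Camera Flash", "X digital zoom"]
print(right_context_expr(rights))
print(keyword_expr("Digital Zoom", phrases))
```

On these samples the sketch reproduces the two expressions shown on the slide. The left context stays empty because the text before the values has no token common to all three phrases.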
18
18 Ontology Generator – Data Frame Editor
19
19 Extraction Ontology
20
20 Experimental Preparation
Selected two domains of interest
  Digital Camera Application and Apartment Rental Application
Constructed an initial data frame library
  Integer (any integer value), SmallPositiveInteger (from 1 to 99), SingleDigit (from 0 to 9), RealNumber (any real value), SmallPositiveReal (from 0.01 to 99.99), Date, Email, PhoneNumber, and Price
Created application-dependent forms for each application
Collected 5 sample pages from different web sites for each domain
Marked desired data on sample pages
21
21 Experimental Results – Digital Camera Application
22
22 Experimental Results – Digital Camera Application
Object Set        Matching Data Frame  Left Context Expression  Right Context Expression  Keywords
Brand             *                    -                        -                         -
Model             *                    -                        -                         -
CCD Resolution    SmallPositiveReal    -                        \s(Megapixel|MegaPixel)   -
Image Resolution  *                    -                        -                         -
Optical Zoom      SingleDigit          -                        (\s)?(x|X)                (Optical\sZoom|optical\szoom)
Digital Zoom      SmallPositiveReal    -                        (\s)?(x|X)                (Digital\sZoom|digital\szoom)
Weight            SmallPositiveReal    -                        \s(oz|Oz)                 (Weight|weight)
Width             SmallPositiveReal    -                        \s(in)                    (Width|width)
Depth             SmallPositiveReal    -                        \s(in)                    (Depth|depth)
Height            SmallPositiveReal    -                        \s(in)                    (Height|height)
Price             Price                ($)(\s)?                 -                         (Price|price)
LCD Size          SmallPositiveReal    -                        (")                       LCD
Note:
  * Application-dependent lexicons not in the initial data frame library
  - Not available from sample pages
23
23 Experimental Results – Apartment Rental Application
  Apartment Rental [0:1] Apt Type [1:*]
  Apartment Rental [0:1] Bedroom Number [1:*]
  Apartment Rental [0:1] Bathroom Number [1:*]
  Apartment Rental [0:1] Gender Requirement [1:*]
  Apartment Rental [0:1] Date Available [1:*]
  Apartment Rental [0:1] Monthly Rate [1:*]
  Apartment Rental [0:1] Deposit [1:*]
  Apartment Rental [0:*] Features [1:*]
  Apartment Rental [0:1] Contact Phone [1:*]
  Apartment Rental [0:1] Contact Person [1:*]
  Apartment Rental [0:1] Furnished Condition [1:*]
  Apartment Rental [0:1] Utility [1:*]
24
24 Experimental Results – Apartment Rental Application
25
25 Experimental Results – Apartment Rental Application
Object Set           Matching Data Frames  Left Context Expression  Right Context Expression  Keywords
Apt Type             *                     -                        -                         -
Bedroom Number       SingleDigit           -                        -                         (Bedroom|bdrm)
Bathroom Number      SmallPositiveReal     -                        -                         (Bathroom|bath)
Furnished Condition  *                     -                        -                         -
Gender Requirement   *                     -                        -                         -
Utility              *                     -                        -                         -
Features             *                     -                        -                         -
Contact Phone        PhoneNumber           -                        -                         (Contact|contact)
Contact Person       *                     -                        -                         -
Monthly Rate         Price                 ($)(\s)?                 -                         -
Deposit              Price                 ($)(\s)?                 -                         (Deposit|deposit)
Date Available       Date                  -                        -                         Available
Note:
  * Application-dependent lexicons not in the initial data frame library
  - Not available from sample pages
26
26 Experimental Observations – Strengths of OntoByE
OntoByE provides a friendly and intuitive interface that helps ordinary users describe data of interest without exposing them to abstract ontology concepts.
With a small initial data frame library and a small set of sample pages, OntoByE works well at finding and suggesting appropriate existing data frames for object sets with application-independent values.
OntoByE successfully recognizes possible keywords and contexts for user-marked data from sample pages and helps users create new data frames with those keywords and contexts.
27
27 Experimental Observations – Limitations of OntoByE
OntoByE's ability to find or construct data frames is limited by the scope and quality of its prior knowledge.
The accuracy and completeness of the keyword and context expressions it constructs are limited by the number and representativeness of user samples.
Constructing value expressions for application-dependent data frames requires users to know how to write regular expressions.
28
28 Conclusion
We implemented a user-friendly interface that lets ordinary users take advantage of our ontology-based web data-extraction approach.
We developed a framework for interacting with ordinary users to generate ontologies by example.
Our experiments demonstrate that OntoByE works well for generating ontologies with the assistance of limited prior knowledge.
As prior knowledge expands over time, OntoByE should achieve better performance.
29
29 Future Work
Have OntoByE learn to build application-dependent lexicons for users' applications.
Improve the sub-components of the back-end ontology generator, e.g., the Context Phrase Locator.
30
30 The end