Ontology-Based Information Extraction and Structuring
Stephen W. Liddle† (School of Accountancy and Information Systems, Brigham Young University),
Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith
Research funded in part by †Faneuil Research and ‡Novell, Inc.
Copyright 1998
Motivation
• Database-style queries are effective
  – Find red cars, 1993 or newer, < $5,000
      Select * From Car
      Where Color = “red” And Year >= 1993 And Price < 5000
• The Web is not a database
  – Uses keyword search
  – Retrieves documents, not records
  – Even assuming we have a range operator: “red” and (1993 to 1998) and (1 to 5000)
Solutions
• Web query languages
• Wait for XML to emerge
  – Interoperation/standards?
  – XML query language?
• Wrappers
  – Hand-written or semi-automatically generated parsers
  – Specific to the source site, subject to change
Our Approach
• Automatic wrapper generation
• Based on an application ontology
  – Augmented conceptual model
  – Defines constants, keywords, and their relationships
• Best for:
  – Narrow ontological breadth
  – Data-rich documents
Car-Ad Ontology
• Object-Relationship Model + Data Frames
• Graphical view (see Figures 2 & 3 of paper): Car connected to Year, Make, Model, Mileage, Price, Feature, PhoneNr, and Extension, with participation constraints such as 0..1, 1..*, and 0..*
• Textual view (excerpt):
      Car [0:1] has Year [1:*];
      Year {regexp[2]: “\d{2} : \b’\d{2}\b”, … };
      Car [0:1] has Make [1:*];
      Make {regexp[10]: “\bchev\b”, “\bchevy\b”, … };
      Car [0:1] has Model [1:*];
      Model {…};
      Car [0:1] has Mileage [1:*];
      Mileage {regexp[8]: “\b[1-9]\d{1,2}k”, “[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]” }
              {context: “\bmiles\b”, “\bmi\.”, “\bmi\b”};
      Car [0:*] has Feature [1:*];
      Feature {regexp[20]:
          -- Colors
          “\baqua\s+metallic\b”, “\bbeige\b”, …
          -- Transmission
          “(5|6)\s*spd\b”, “auto : \bauto(\.|,)”,
          -- Accessories
          “\broof\s+rack\b”, “\bspoiler\b”, … };
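As a rough illustration only (not the OSM data-frame language or the authors' code), a data frame can be thought of as a descriptor name plus value patterns plus context-keyword patterns. The Java sketch below renders the Mileage frame above this way; the class and field names are hypothetical.

    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch of a data frame: value patterns plus context keywords.
    // Class and field names are illustrative, not the authors' implementation.
    public class DataFrame {
        final String name;                   // e.g. "Mileage"
        final List<Pattern> valuePatterns;   // recognize candidate constants
        final List<Pattern> keywordPatterns; // recognize nearby context keywords

        DataFrame(String name, List<Pattern> valuePatterns, List<Pattern> keywordPatterns) {
            this.name = name;
            this.valuePatterns = valuePatterns;
            this.keywordPatterns = keywordPatterns;
        }

        static DataFrame mileage() {
            return new DataFrame("Mileage",
                List.of(Pattern.compile("\\b[1-9]\\d{1,2}k"),
                        Pattern.compile("[1-9]\\d?,\\d{3}")),
                List.of(Pattern.compile("\\bmiles\\b"),
                        Pattern.compile("\\bmi\\."),
                        Pattern.compile("\\bmi\\b")));
        }
    }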
Fixed Processes (see Figure 1 of paper)
• Ontology Parser: takes the application ontology; produces constant/keyword matching rules, a list of objects, relationships, and constraints, and a database scheme
• Constant/Keyword Recognizer: applies the matching rules to an unstructured document; produces a data-record table
• Database-Instance Generator: combines the data-record table with the object/relationship/constraint list and the database scheme; produces a populated database
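To make the data flow concrete, here is a minimal Java sketch of the three fixed processes as interfaces. The type and method names are hypothetical; the actual system was implemented in Perl, C++, Lex/Yacc, and Java and need not be structured this way.

    import java.util.List;

    // Hypothetical interfaces for the three fixed processes of Figure 1.
    interface OntologyParser {
        ParsedOntology parse(String ontologyText);  // rules + constraints + scheme
    }
    interface ConstantKeywordRecognizer {
        List<DataRecordRow> recognize(ParsedOntology ontology, String document);
    }
    interface DatabaseInstanceGenerator {
        List<String> generateInserts(ParsedOntology ontology, List<DataRecordRow> rows);
    }

    // Intermediate artifacts named after the boxes in Figure 1.
    record ParsedOntology(List<String> matchingRules,
                          List<String> relationshipsAndConstraints,
                          List<String> databaseScheme) {}
    record DataRecordRow(String descriptor, String value, int start, int end) {}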
Ontology Parser (step 1 of 3)
• Input: application ontology
• Outputs:
  – Constant/keyword matching rules:
        Make : \bchev\b …
        KEYWORD(Mileage) : \bmiles\b
        KEYWORD(Mileage) : \bmi\. …
  – Database scheme:
        create table Car ( Car integer, Year varchar(2), … );
        create table CarFeature ( Car integer, Feature varchar(10) );
        …
  – List of objects, relationships, and constraints:
        Object: Car; …
        Car: Year [0:1];
        Car: Make [0:1]; …
        CarFeature: Car [0:*] has Feature [1:*];
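The mapping from constraints to scheme follows the cardinalities: attributes that are functional with respect to Car ([0:1]) become columns of the Car table, while the nonfunctional Feature ([0:*]) gets its own table. A small Java sketch of that mapping, under made-up names and with fixed column widths chosen only for illustration:

    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative sketch (not the authors' generator): build the two
    // create-table statements from the car-ad ontology's cardinalities.
    public class SchemeSketch {
        static String objectTable(String object, List<String> functionalAttrs) {
            String cols = functionalAttrs.stream()
                    .map(a -> a + " varchar(20)")
                    .collect(Collectors.joining(", "));
            return "create table " + object + " ( " + object + " integer, " + cols + " );";
        }
        static String nonfunctionalTable(String object, String attr) {
            return "create table " + object + attr
                 + " ( " + object + " integer, " + attr + " varchar(20) );";
        }
        public static void main(String[] args) {
            System.out.println(objectTable("Car",
                List.of("Year", "Make", "Model", "Mileage", "Price", "PhoneNr", "Extension")));
            System.out.println(nonfunctionalTable("Car", "Feature"));
        }
    }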
Constant/Keyword Recognizer (step 2 of 3)
• Input: unstructured document plus the constant/keyword matching rules
      '97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
• Output: data-record table — Descriptor | String | Position (start/end)
      Year|97|1|3
      Make|CHEV|5|8
      Model|Cavalier|10|17
      Feature|Red|20|22
      Feature|5 spd|25|29
      Mileage|7,000|37|41
      KEYWORD(Mileage)|miles|43|47
      Price|11,995|108|114
      PhoneNr|566-3800|146|153
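A Java sketch of such a recognizer, assuming a tiny, made-up subset of the matching rules above (the PhoneNr rule in particular is invented for the example): run each rule's regular expression over the document and emit descriptor|string|start|end rows. Offsets depend on the exact document text, so they will not match the slide's numbers exactly.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of a constant/keyword recognizer; not the authors' code.
    public class Recognizer {
        public static void main(String[] args) {
            String ad = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. "
                      + "Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800";
            // A small, illustrative subset of matching rules.
            Map<String, Pattern> rules = Map.of(
                "Year", Pattern.compile("(?<=')\\d{2}\\b"),
                "Make", Pattern.compile("\\bCHEV\\b", Pattern.CASE_INSENSITIVE),
                "Mileage", Pattern.compile("[1-9]\\d?,\\d{3}"),
                "KEYWORD(Mileage)", Pattern.compile("\\bmiles\\b"),
                "PhoneNr", Pattern.compile("\\d{3}-\\d{4}"));
            List<String> rows = new ArrayList<>();
            rules.forEach((descriptor, pattern) -> {
                Matcher m = pattern.matcher(ad);
                while (m.find()) {
                    // end position is inclusive, matching the slide's convention
                    rows.add(descriptor + "|" + m.group() + "|" + m.start() + "|" + (m.end() - 1));
                }
            });
            rows.forEach(System.out::println);
        }
    }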
Database-Instance Generator (step 3 of 3)
• Input: data-record table plus the object/relationship/constraint list and the database scheme
• Output: populated database
      insert into Car values(1001, “97”, “CHEV”, “Cavalier”, “7,000”, “11,995”, “566-3800”)
      insert into CarFeature values(1001, “Red”)
      insert into CarFeature values(1001, “5 spd”)
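A Java sketch of this step for the single ad above (names and surrogate key are illustrative, not the authors' code): the functional attributes each contribute one value to the Car row, and each Feature yields its own CarFeature row.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: build insert statements from recognized (descriptor, value) pairs.
    public class InstanceGenerator {
        public static void main(String[] args) {
            int carId = 1001;  // surrogate key, chosen arbitrarily here
            Map<String, String> functional = new LinkedHashMap<>();
            functional.put("Year", "97");
            functional.put("Make", "CHEV");
            functional.put("Model", "Cavalier");
            functional.put("Mileage", "7,000");
            functional.put("Price", "11,995");
            functional.put("PhoneNr", "566-3800");
            List<String> features = List.of("Red", "5 spd");

            String carValues = functional.values().stream()
                    .map(v -> "\"" + v + "\"")
                    .reduce((a, b) -> a + ", " + b).orElse("");
            System.out.println("insert into Car values(" + carId + ", " + carValues + ")");
            for (String f : features) {
                System.out.println("insert into CarFeature values(" + carId + ", \"" + f + "\")");
            }
        }
    }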
Heuristics
• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation
Keyword Proximity
• “11,995” matches both the Price and Mileage patterns; the Mileage keyword “miles” (43–47) is adjacent to “7,000” (37–41), so “7,000” is taken as the mileage
• Data-record table:
      Year|97|2|3
      Make|CHEV|5|8
      Model|Cavalier|10|17
      Feature|Red|20|22
      Feature|5 spd|25|29
      Mileage|7,000|37|41
      KEYWORD(Mileage)|miles|43|47
      Price|11,995|101|106
      Mileage|11,995|101|106
      PhoneNr|566-3800|140|147
• Source ad:
      '97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
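A Java sketch of the proximity idea (illustrative only, not the authors' heuristic code): among the Mileage candidates, keep the one whose span lies closest to a KEYWORD(Mileage) occurrence.

    import java.util.Comparator;
    import java.util.List;

    // Sketch of the keyword-proximity heuristic.
    public class KeywordProximity {
        record Hit(String value, int start, int end) {}

        static Hit closestToKeyword(List<Hit> candidates, List<Hit> keywords) {
            return candidates.stream()
                .min(Comparator.comparingInt((Hit c) -> keywords.stream()
                    .mapToInt(k -> Math.min(Math.abs(k.start() - c.end()),
                                            Math.abs(c.start() - k.end())))
                    .min().orElse(Integer.MAX_VALUE)))
                .orElseThrow();
        }

        public static void main(String[] args) {
            List<Hit> mileageCandidates = List.of(new Hit("7,000", 37, 41),
                                                  new Hit("11,995", 101, 106));
            List<Hit> mileageKeywords = List.of(new Hit("miles", 43, 47));
            // prints 7,000 -- the value nearest the "miles" keyword
            System.out.println(closestToKeyword(mileageCandidates, mileageKeywords).value());
        }
    }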
Subsumed/Overlapping Constants
• “CHEV” (5–8) is subsumed by “CHEVROLET” (5–13); the longer, subsuming match is the one kept
• Data-record table:
      Make|CHEV|5|8
      Make|CHEVROLET|5|13
      Model|Cavalier|15|22
      Feature|Red|25|27
      Feature|5 spd|30|34
      Mileage|7,000|42|46
      KEYWORD(Mileage)|miles|48|52
      Price|11,995|101|106
      Mileage|11,995|101|106
      PhoneNr|566-3800|140|147
• Source ad:
      '97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
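A Java sketch of discarding subsumed constants (illustrative names): drop any hit whose span is contained in a strictly longer hit for the same descriptor, so “CHEV” gives way to “CHEVROLET”.

    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch of the subsumed-constants heuristic.
    public class Subsumption {
        record Hit(String descriptor, String value, int start, int end) {}

        static List<Hit> dropSubsumed(List<Hit> hits) {
            return hits.stream()
                .filter(h -> hits.stream().noneMatch(other ->
                    other != h
                    && other.descriptor().equals(h.descriptor())
                    && other.start() <= h.start() && h.end() <= other.end()
                    && (other.end() - other.start()) > (h.end() - h.start())))
                .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Hit> hits = List.of(
                new Hit("Make", "CHEV", 5, 8),
                new Hit("Make", "CHEVROLET", 5, 13));
            dropSubsumed(hits).forEach(h -> System.out.println(h.value()));  // CHEVROLET
        }
    }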
Functional Relationships
• Car has at most one Mileage ([0:1]); since Mileage is already filled by “7,000” (via keyword proximity), the second Mileage candidate “11,995” is rejected and that string fills Price instead
• Data-record table and source ad: same as on the Keyword Proximity slide
Nonfunctional Relationships
• Car [0:*] has Feature [1:*] is nonfunctional, so both “Red” and “5 spd” are kept as features of the same car
• Data-record table and source ad: same as on the Keyword Proximity slide
First Occurrence Without Constraint Violation
• Two phone numbers appear (566-3800 and 566-3802); PhoneNr is functional for Car, so the first occurrence that violates no constraint, 566-3800, is the one accepted
• Data-record table:
      Year|97|2|3
      Make|CHEV|5|8
      Model|Cavalier|10|17
      Feature|Red|20|22
      Feature|5 spd|25|29
      Mileage|7,000|37|41
      KEYWORD(Mileage)|miles|43|47
      Price|11,995|101|106
      Mileage|11,995|101|106
      PhoneNr|566-3800|140|147
      PhoneNr|566-3802|149|156
• Source ad:
      '97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800, 566-3802
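A Java sketch of this last heuristic (illustrative, not the authors' code): walk the candidates in document order and accept one only if the functional attribute it would fill is still empty.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of "first occurrence without constraint violation".
    public class FirstOccurrence {
        record Hit(String descriptor, String value, int start) {}

        static Map<String, String> assign(List<Hit> hitsInDocumentOrder) {
            Map<String, String> assigned = new HashMap<>();
            for (Hit h : hitsInDocumentOrder) {
                assigned.putIfAbsent(h.descriptor(), h.value());  // later 566-3802 is skipped
            }
            return assigned;
        }

        public static void main(String[] args) {
            List<Hit> hits = List.of(
                new Hit("PhoneNr", "566-3800", 140),
                new Hit("PhoneNr", "566-3802", 149));
            System.out.println(assign(hits).get("PhoneNr"));  // 566-3800
        }
    }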
Recall & Precision
• Source ad: '97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
• N = number of facts in source
• C = number of facts declared correctly
• I = number of facts declared incorrectly
• Recall = C / N (of the facts available, how many did we find?)
• Precision = C / (C + I) (of the facts retrieved, how many were relevant?)
Experimental Results: Car Ads (Salt Lake Tribune; see Table 1 of paper)
• Tuning set: 100 documents; Test set: 116 documents

      Attribute    Recall %   Precision %
      Year           100         100
      Make            97         100
      Model           82         100
      Mileage         90         100
      Price          100         100
      PhoneNr         94         100
      Extension       50         100
      Feature         91          99
Trouble Spots
• Unbounded sets
  – Missed: MERC, Town Car, 98 Royale
  – Could use a lexicon of makes and models
• Unspecified variation in lexical patterns
  – Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – Could adjust lexical patterns
• Misidentification of attributes
  – Classified AUTO in AUTO SALES as automatic transmission
  – Could adjust exceptions in lexical patterns
• Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – Could look for spelling variations and common typos
Contributions
• Fully automatic technique for wrapper generation
• Uses syntactic, not semantic, constant-recognition techniques
• Adapts readily to different unstructured document formats
• Good precision & recall ratios
• Implemented (Perl, C++, Lex/Yacc, Java)
Limitations
• Works best for data-rich documents and narrow ontological domains
• Ontology creation is still manual
  – Requires a domain expert
  – Trained in our conceptual model & tools
Future Work
• Graphical ontology editor
• Improve automatic record-boundary recognition
  – Make suitable for broader domains (obituaries, university catalogs, etc.)
• Improve heuristics
  – Use a declarative language
  – Employ more of OSM’s rich constraints
Future Work (cont.)
• Add operations to data frames
  – General constraints
  – Canonical representations
  – Inferred information
• Develop ontology libraries
• Finish porting to 100% Java
• Incorporate learning/feedback
• Ontology-enabled agents
Our Web Site
• I have a demo on my laptop
• Can download from our Web site
• BYU Data Extraction Group: http://osm7.cs.byu.edu/deg (see Reference 13 of paper)
Other Domains
• Job Listings
• Obituaries
• University Course Catalogs
Job Listings Results (Los Angeles Times; see Table 2 of paper)
• Tuning set: 50 documents; Test set: 50 documents

      Attribute    Recall %   Precision %
      Degree         100         100
      Skill           74         100
      Contact        100         100
      Email           91          83
      Fax             91         100
      Voice           79          92
Obituaries Results (Salt Lake Tribune; see our forthcoming ER’98 paper for details)
• Tuning set: ~40 documents; Test set: 38 documents

      Attribute            Recall %   Precision %
      Deceased Name          100         100
      Age                     91          95
      Birth Date             100          97
      Death Date              94         100
      Funeral Date            92         100
      Funeral Address         96          96
      Funeral Time            97         100
      Interment Address      100         100
      Viewing                 93          96
      Viewing Date            70         100
      Viewing Address         76         100
      Beginning Time          88         100
      Ending Time             90         100
      Relationship            81          93
      Relative Name           88          71
Obituaries Results (Arizona Daily Star; see our forthcoming ER’98 paper for details)
• Tuning set: ~40 documents; Test set: 90 documents

      Attribute            Recall %   Precision %
      Deceased Name          100         100
      Age                     86          98
      Birth Date              96          96
      Death Date              84          99
      Funeral Date            96          93
      Funeral Address         82          82
      Funeral Time            92          87
      Interment Address      100         100
      Viewing                 97         100
      Viewing Date           100         100
      Viewing Address         95         100
      Beginning Time          93          96
      Ending Time             95         100
      Relationship            92          97
      Relative Name           95          74