Extracting and Structuring Web Data

Slides:

Advertisements

Similar presentations

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:

Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998.

Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.

Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.

Dialogue – Driven Intranet Search Suma Adindla School of Computer Science & Electronic Engineering 8th LANGUAGE & COMPUTATION DAY 2009.

1 What Do You Want— Semantic Understanding? (You’ve Got to be Kidding) David W. Embley Brigham Young University Funded in part by the National Science.

Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.

Search Engines and Information Retrieval

Extracting and Structuring Web Data D.W. Embley*, D.M Campbell †, Y.S. Jiang, Y.-K. Ng, R.D. Smith, Li Xu Department of Computer Science S.W. Liddle ‡

Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.

6/11/20151 A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model Department of Computer.

Xyleme A Dynamic Warehouse for XML Data of the Web.

Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March, 2003 Funded by National.

Data Frames Version 3 Proposal. Data Frames Version 2 Year matches [2] constant { extract "\d{2}"; context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5, { extract.

Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.

A Fully Automated Object Extraction System for the World Wide Web a paper by David Buttler, Ling Liu and Calton Pu, Georgia Tech.

Recognizing Ontology-Applicable Multiple-Record Web Documents David W. Embley Dennis Ng Li Xu Brigham Young University.

6/17/20151 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University.

BYU 2003BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.

Extracting Data Behind Web Forms Stephen W. Liddle David W. Embley Del T. Scott, Sai Ho Yau Brigham Young University Presented by: Helen Chen.

DLLS Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale Funded by:

ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.

Ontology-Based Information Extraction and Structuring Stephen W. Liddle † School of Accountancy and Information Systems Brigham Young University Douglas.

Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.

Extracting and Structuring Web Data David W. Embley Department of Computer Science Brigham Young University D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith.

1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

From OSM-L to JAVA Cui Tao Yihong Ding. Overview of OSM.

DASFAA 2003BYU Data Extraction Group Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University Funded.

UFMG, June 2002BYU Data Extraction Group Automating Schema Matching for Data Integration David W. Embley Brigham Young University Funded by NSF.

Filtering Multiple-Record Web Documents Based on Application Ontologies Presenter: L. Xu Advisor: D.W.Embley.

Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources Cui Tao March, 2002 Founded by NSF.

BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.

January 2004 ADC’ What Do You Want— Semantic Understanding? (You’ve Got to be Kidding) David W. Embley Brigham Young University Funded in part by.

Record-Boundary Discovery in Web Documents D.W. Embley, Y. Jiang, Y.-K. Ng Data-Extraction Group* Department of Computer Science Brigham Young University.

An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System Alan Wessman Brigham Young University Based on research supported.

BYU Data Extraction Group Funded by NSF1 Brigham Young University Li Xu Source Discovery and Schema Mapping for Data Integration.

Extracting and Structuring Web Data D.W. Embley*, D.M Campbell †, Y.S. Jiang, Y.-K. Ng, R.D. Smith Department of Computer Science S.W. Liddle ‡, D.W.

Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,

Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March 31, 2004 Funded by National.

Ground-truthing Obituaries. Project Overview Untapped sources – Obituaries: hundreds of millions – Problem: how to cost-effectively extract Extraction.

Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.

Stephen W. Liddle, PhD Academic Director, Rollins Center for Entrepreneurship & Technology Professor, Information Systems Department Marriott School, Brigham.

Cross-Language Hybrid Keyword and Semantic Search David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Joseph S. Park, Andrew Zitzelberger Brigham Young.

Search Engines and Information Retrieval Chapter 1.

Avoid using attributes? Some of the problems using attributes: Attributes cannot contain multiple values (child elements can) Attributes are not easily.

Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.

Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.

6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

August 17, 2005Question Answering Passage Retrieval Using Dependency Parsing 1/28 Question Answering Passage Retrieval Using Dependency Parsing Hang Cui.

Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.

Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.

Extracting and Structuring Web Data

Web Data Extraction Based on Partial Tree Alignment

What Do You Want—Semantic Understanding?

Introduction The amount of Web data has increased dramatically

Social Knowledge Mining

Martin Rajman, EPFL Switzerland & Martin Vesely, CERN Switzerland

CS 430: Information Discovery

Automating Schema Matching for Data Integration

Family History Technology Workshop

Chapter 7: Transformations

Extraction Rule Creation by Text Snippet Examples

Data Pre-processing Lecture Notes for Chapter 2

Chapter 12 Analyzing Semistructured Decision Support Systems

Presentation transcript:

Extracting and Structuring Web Data D.W. Embley*, D.M Campbell†, Y.S. Jiang, Y.-K. Ng, R.D. Smith, Li Xu Department of Computer Science S.W. Liddle‡ School of Accountancy and Information Systems Marriott School of Management D.W. Lonsdale Department of Linguistics Brigham Young University Provo, UT, USA Funded in part by *Novell, Inc., †Ancestry.com, Inc., ‡Faneuil Research, and the National Science Foundation (NSF).

Information Exchange Source Target

GOAL Query the Web like we query a database Example: Get the year, make, model, and price for 1987 or later cars that are red or white. Year Make Model Price ----------------------------------------------------------------------- 97 CHEVY Cavalier 11,995 94 DODGE 4,995 94 DODGE Intrepid 10,000 91 FORD Taurus 3,500 90 FORD Probe 88 FORD Escort 1,000

PROBLEM The Web is not structured like a database. Example: <html> <head> <title>The Salt Lake Tribune Classifieds</title> </head> … <hr> <h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888</h4> </html>

Making the Web Look Like a Database Web Query Languages Treat the Web as a graph (pages = nodes, links = edges). Query the graph (e.g., Find all pages within one hop of pages with the words “Cars for Sale”). Wrappers Parse page to extract attribute-value pairs, form records, and either insert them into a database or filter them wrt a query. Write parser by hand. Use machine learning to discover how to parse a site. Develop an application-specific, site-independent ontology to parse a site. Query the database or present the filtered result.

Automatic Wrapper Generation for unstructured record documents, rich in data and narrow in ontological breadth Application Ontology Web Page Ontology Parser Record Extractor Constant/Keyword Matching Rules Record-Level Objects, Relationships, and Constraints Database Scheme Unstructured Record Documents Constant/Keyword Recognizer Database-Instance Generator Populated Database Data-Record Table

Application Ontology: Object-Relationship Model Instance Year Price Make Mileage Model Feature PhoneNr Extension Car has is for 1..* 0..1 0..* Car [-> object]; Car [0..1] has Model [1..*]; Car [0..1] has Make [1..*]; Car [0..1] has Year [1..*]; Car [0..1] has Price [1..*]; Car [0..1] has Mileage [1..*]; PhoneNr [1..*] is for Car [0..1]; PhoneNr [0..1] has Extension [1..*]; Car [0..*] has Feature [1..*];

Application Ontology: Data Frames Make matches [10] case insensitive constant { extract “chev”; }, { extract “chevy”; }, { extract “dodge”; }, … end; Model matches [16] case insensitive { extract “88”; context “\bolds\S*\s*88\b”; }, Mileage matches [7] case insensitive { extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; }, keyword “\bmiles\b”, “\bmi\b “\bmi.\b”; ...

Relationships, and Constraints Ontology Parser create table Car ( Car integer, Year varchar(2), … ); create table CarFeature ( Feature varchar(10)); ... Make : chevy … KEYWORD(Mileage) : \bmiles\b ... Application Ontology Parser Constant/Keyword Matching Rules Record-Level Objects, Relationships, and Constraints Database Scheme Object: Car; ... Car: Year [0..1]; Car: Make [0..1]; … CarFeature: Car [0..*] has Feature [1..*];

Record Extractor <html> … <h4> ‘97 CHEVY Cavalier, Red, 5 spd, … </h4> <hr> <h4> ‘89 CHEVY Corsica Sdn teal, auto, … </h4> …. </html> Unstructured Record Documents Record Extractor Web Page … ##### ‘97 CHEVY Cavalier, Red, 5 spd, … ‘89 CHEVY Corsica Sdn teal, auto, … ...

Constant/Keyword Recognizer ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 Descriptor/String/Position(start/end) Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 Constant/Keyword Recognizer Unstructured Record Documents Matching Rules Data-Record Table

Heuristics Keyword proximity Subsumed and overlapping constants Functional relationships Nonfunctional relationships First occurrence without constraint violation

Keyword Proximity Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 D = 2 D = 52 '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Subsumed/Overlapping Constants Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Functional Relationships Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Nonfunctional Relationships Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

First Occurrence without Constraint Violation Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Database-Instance Generator Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr|566-3800|136|143 PhoneNr|566-3888|148|155 Database-Instance Generator Data-Record Table Record-Level Objects, Relationships, and Constraints Database Scheme Populated insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “556-3800”) insert into CarFeature values(1001, “Red”) insert into CarFeature values(1001, “5 spd”)

Recall & Precision N = number of facts in source C = number of facts declared correctly I = number of facts declared incorrectly (of facts available, how many did we find?) (of facts retrieved, how many were relevant?)

Results: Car Ads Salt Lake Tribune Recall % Precision % Year 100 100 Make 97 100 Model 82 100 Mileage 90 100 Price 100 100 PhoneNr 94 100 Extension 50 100 Feature 91 99 Training set for tuning ontology: 100 Test set: 116

Car Ads: Comments Unbounded sets missed: MERC, Town Car, 98 Royale could use lexicon of makes and models Unspecified variation in lexical patterns missed: 5 speed (instead of 5 spd), p.l (instead of p.l.) could adjust lexical patterns Misidentification of attributes classified AUTO in AUTO SALES as automatic transmission could adjust exceptions in lexical patterns Typographical errors “Chrystler”, “DODG ENeon”, “I-15566-2441” could look for spelling variations and common typos

Results: Computer Job Ads Los Angeles Times Recall % Precision % Degree 100 100 Skill 74 100 Email 91 83 Fax 100 100 Voice 79 92 Training set for tuning ontology: 50 Test set: 50

Obituaries (A More Demanding Application) Multiple Dates Our beloved Brian Fielding Frost, age 41, passed away Saturday morning, March 7, 1998, due to injuries sustained in an automobile accident. He was born August 4, 1956 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Jord- dan (9), Travis (8), Bryce (6); parents, three brothers, Donald Glade (Lynne), Kenneth Wesley (Ellen), … Funeral services will be held at 12 noon Friday, March 13, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7 p.m. Thurs- day at Wasatch Lawn Mortuary, 3401 S. Highland Drive, and at the Stake Center from 10:45-11:45 a.m. Names Family Relationships Multiple Viewings Addresses

Obituary Ontology (partial)

Data Frames Lexicons & Specializations Name matches [80] case sensitive constant { extract First, “\s+”, Last; }, … { extract “[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?”, Last; }, lexicon { First case insensitive; filename “first.dict”; }, { Last case insensitive; filename “last.dict”; }; end; Relative Name matches [80] case sensitive { extract First, “\s+\(“, First, “\)\s+”, Last; substitute “\s*\([^)]*\)” -> “”; } ...

Keyword Heuristics Singleton Items RelativeName|Brian Fielding Frost|16|35 DeceasedName|Brian Fielding Frost|16|35 KEYWORD(Age)|age|38|40 Age|41|42|43 KEYWORD(DeceasedName)|passed away|46|56 KEYWORD(DeathDate)|passed away|46|56 BirthDate|March 7, 1998|76|88 DeathDate|March 7, 1998|76|88 IntermentDate|March 7, 1998|76|98 FuneralDate|March 7, 1998|76|98 ViewingDate|March 7, 1998|76|98 ...

Keyword Heuristics Multiple Items … KEYWORD(Relationship)|born … to|152|192 Relationship|parent|152|192 KEYWORD(BirthDate)|born|152|156 BirthDate|August 4, 1956|157|170 DeathDate|August 4, 1956|157|170 IntermentDate|August 4, 1956|157|170 FuneralDate|August 4, 1956|157|170 ViewingDate|August 4, 1956|157|170 RelativeName|Donald Fielding|194|208 DeceasedName|Donald Fielding|194|208 RelativeName|Helen Glade Frost|214|230 DeceasedName|Helen Glade Frost|214|230 KEYWORD(Relationship)|married|237|243 ...

Results: Obituaries Arizona Daily Star Recall % Precision % DeceasedName* 100 100 Age 86 98 BirthDate 96 96 DeathDate 84 99 FuneralDate 96 93 FuneralAddress 82 82 FuneralTime 92 87 … Relationship 92 97 RelativeName* 95 74 *partial or full name Training set for tuning ontology: ~ 24 Test set: 90

Results: Obituaries Salt Lake Tribune Recall % Precision % DeceasedName* 100 100 Age 91 95 BirthDate 100 97 DeathDate 94 100 FuneralDate 92 100 FuneralAddress 96 96 FuneralTime 97 100 … Relationship 81 93 RelativeName* 88 71 *partial or full name Training set for tuning ontology: ~ 12 Test set: 38

“Open” Problems Record-Boundary Detection Record Reconfiguration Page/Ontology Matching Form Interfaces Rapid Ontology Construction, Evolution, and Improvement

Record-Boundary Detection: High Fan-Out Heuristic <html> <head> <title>The Salt Lake Tribune … </title> </head> <body bgcolor=“#FFFFFF”> <h1 align=”left”>Domestic Cars</h1> … <hr> <h4> ‘97 CHEVY Cavalier, Red, … </h4> <h4> ‘89 CHEVY Corsica Sdn … </h4> </body> </html> html head body title h1 … hr h4 hr h4 hr ...

Record-Boundary Detection: Record-Separator Heuristics Identifiable “separator” tags Highest-count tag(s) Interval standard deviation Ontological match Repeating tag patterns Example: … <hr> <h4> ‘97 CHEVY Cavalier, Red, 5 spd, <i>only 7,000 miles</i> on her. Asking <i>only $11,995</i>. … </h4> <h4> ‘89 CHEV Corsica Sdn teal, auto, air, <i>trouble free</i>. Only $8,995 … </h4> ...

Record-Boundary Detection: Consensus Heuristic Certainty is a generalization of: C(E1) + C(E2) - C(E1)C(E2). C denotes certainty and Ei is the evidence for an observation. Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries) Correct Tag Rank Heuristic 1 2 3 4 IT 96% 4% HT 49% 33% 16% 2% SD 66% 22% 12% OM 85% 12% 2% 1% RP 78% 12% 9% 1%

Record-Boundary Detection: Results Heuristic Success Rate IT 95% HT 45% SD 65% OM 80% RP 75% Consensus 100% 4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application

Record Reconfiguration: Problems Encountered factored split joined interspersed off-page

Record Reconfiguration: Proposed Solution Maximize a Record-Recognition Measure Improvements: Split joined records Distribute factored values Link off-page information Join split records Discard interspersed records

Record Reconfiguration: Use Record-Recognition Measures based on Vector Space Modeling VSM VSM Measures Ontology Vector <f1, … fn> Document Vector <f1, … fn> DV Cosine Vector Length OV

Record Reconfiguration: Test Set Characteristics 30 pre-selected documents Characteristics 8 contained only regular car ads. 13 contained inside-boundary joined car ads all with inside-boundary factored values. 1 contained outside-boundary factored values. 13 contained interspersed non-car-ads. None contained off-page or split ads.

Record Reconfiguration: Results Correctly reconfigured 91% (of 304) 36 false drops 11 car ads improperly discarded (value-recognition problem) 25 car ads improperly reconfigured 20 ads with identical phone numbers on every 5th 5 inside-boundary ads not split (missing years & makes; not all models recognized) Correctly discarded 94% (of 47) 3 false positives all snowmobile ads Correctly produced 97% (of 1,077)

Page/Ontology Matching: Recognition Heuristics Density Heuristic Lots of constant and keyword matches. Total matched characters / total characters Expected-Values Heuristic Recognized constants appear in expected frequencies. VSM cosine measure Grouping Heuristic Recognized constants are grouped as expected. Number of distinct “one-max” values: ordered & grouped by expected size of group

Page/Ontology Matching: Machine-Learned Combined Heuristic A heuristic triple (H1, H2, H3) represents a document. Training set: 20 positive examples; 30 negative. C4.5 machine-learning algorithm produced decision trees. Car Ads Obituaries Universal H2 no yes >0.88 H1 no yes H2 >0.68 >0.22 H3 no yes H2 H1 >0.63 >0.37

Page/Ontology Matching: Results Car Ads Rule 1 correctly matched 97% (of 11 positive and 19 negative) One false negative: “ROBERTS FORDCHRYSLERPLYMOUTHJEEPUSED CARS99 Plymouth Breeze $12,99599 Plymouth Neon $11,99599 Ford … theHomer Adams Pkwy., Alton466-7220” Obituaries Rule 2 correctly matched 97% (of 10 positive and 20 negative) One false positive: Missing People Singleton Obituaries – marginal for famous people

Form Interfaces

Form Interfaces: Questions What’s the best way to automate retrieval of data behind Web forms? Can we match a form to an ontology? Can we learn how to fill in a form for a given (ontology) query? Is it reasonable to try to retrieve all the data behind a form? Can we automatically: Fill in Web forms? Extract information behind forms? Screen out error messages and inapplicable Web pages? Eliminate duplicate data?

Rapid Ontology Construction, Evolution, and Improvement Is it possible to (semi)automate the construction of an application ontology? Can we build tools to help users create an ontology? (yes) Can we assemble components from a knowledgebase of ontological components? Can we use machine learning? Can we use automated construction techniques to aid in ontology evolution and improvement?

Conclusions Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically. Recall and Precision results are encouraging. Car Ads: ~ 94% recall and ~ 99% precision Job Ads: ~ 84% recall and ~ 98% precision Obituaries: ~ 90% recall and ~ 95% precision (except on names: ~ 73% precision) Resolution of Problems Record-Boundary Detection – excellent if records nicely separated Record Reconfiguration – excellent for known patterns Page/Ontology matching – excellent for multiple-record documents Open Problems Extraction of data behind forms Rapid ontology construction http://www.deg.byu.edu/