Automatically Generated DAML Markup for Semistructured Documents William Krueger, Jonathan Nilsson, Tim Oates, Tim Finin Supported by DARPA contract F30602-00-2-0591.

DAML and the Semantic Web
The most efficient way for machines to understand the semantics of the vast amount of information on the web is to add semantic markup to that information.
DAML (DARPA Agent Markup Language) is one existing semantic markup language.

The Problem
Semantically marking up large amounts of data by hand is far too time-consuming.
We use machine learning techniques to automate the task.

An Excerpt From a Talk Announcement
The International Computer Science Institute is pleased to present a talk:
"Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge"
Michael Kleinschmidt
Medical Physics Group, University of Oldenburg, Germany
Who is the speaker?

An Excerpt From a Talk Announcement (Solution)
The International Computer Science Institute is pleased to present a talk:
"Automatic Classification of Acoustic Signals Based on Psychoacoustic and Neurophysiological Knowledge"
Michael Kleinschmidt
Medical Physics Group, University of Oldenburg, Germany

Outline
Talk Ontology
Hierarchical Wrapper Induction
Contributions
Experimental Results
Future Considerations

Our Talk Ontology
An ontology is the hierarchically organized vocabulary used to semantically mark up information sources.
The root of our talk ontology is Talk.
The ontological children of Talk include elements such as Talk:Title and Talk:BeginTime.
The element Talk:BeginTime has its own ontological children, Talk:BeginTime:Hour and Talk:BeginTime:Minute.
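
As a rough illustration of the hierarchy described on this slide, the element names can be held in a nested mapping and expanded into colon-separated paths. The dictionary layout and the `element_paths` helper are assumptions made for this sketch; only the element names come from the slide, and the real ontology contains more elements than are shown here.

```python
# Hypothetical in-memory view of part of the talk ontology:
# each element name maps to a dictionary of its ontological children.
TALK_ONTOLOGY = {
    "Talk": {
        "Title": {},
        "BeginTime": {"Hour": {}, "Minute": {}},
    }
}


def element_paths(tree, prefix=""):
    """Yield colon-separated element names, each parent before its children."""
    for name, children in tree.items():
        path = f"{prefix}:{name}" if prefix else name
        yield path
        yield from element_paths(children, path)


paths = list(element_paths(TALK_ONTOLOGY))
print(paths)
```

Listing parents before children mirrors the order in which a hierarchical extractor would segment a document: first the whole talk, then each embedded field.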

Advantages of a Hierarchy
Using a hierarchical data model, we can break documents up into embedded segments.
When learning rules for the speaker’s first name, for example, we only have to consider the speaker segment of each document.

Wrappers
A wrapper is the set of rules used to extract data, along with the code required to perform the extraction.

The STALKER Algorithm
STALKER is a hierarchical wrapper induction algorithm developed at ISI.
We use a modified STALKER algorithm to extract information from a source.
The extracted information, along with a DAML ontology, can then be used to create markup for the source.

Defining Rule, Landmark, Token, etc.
A token is an elementary piece of text.
–Lowercase words, HTML tags, numbers, alphanumeric words, symbols, etc.
A landmark is a sequence of one or more consecutive tokens.
A rule clause contains one landmark and is one of two types: SkipTo or SkipUntil.
A rule is an ordered list of rule clauses.
–It can be applied either forward or backward.
–It is used to locate both the beginning and the end of an information field.
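
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the tokenizer regex, the function names, and the sample text are assumptions. The sketch also assumes the usual STALKER reading of the two clause types, where SkipTo stops just after its landmark and SkipUntil stops just before it.

```python
import re


def tokenize(text):
    """Split text into HTML tags, alphanumeric words, and single symbols."""
    return re.findall(r"<[^>]+>|[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def find_landmark(tokens, landmark, start):
    """Return the index of the first occurrence of landmark at or after start."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None


def apply_rule(tokens, rule):
    """Apply an ordered list of (kind, landmark) clauses in the forward direction.

    Returns the token position reached, or None if some clause fails to match.
    Assumption: SkipTo consumes its landmark, SkipUntil stops in front of it.
    """
    pos = 0
    for kind, landmark in rule:
        i = find_landmark(tokens, landmark, pos)
        if i is None:
            return None
        pos = i + len(landmark) if kind == "SkipTo" else i
    return pos


tokens = tokenize("<b>Speaker:</b> Michael Kleinschmidt")
rule = [("SkipTo", ["Speaker", ":"]), ("SkipTo", ["</b>"])]
start = apply_rule(tokens, rule)
print(tokens[start])
```

A backward rule could be handled the same way by applying the clauses to the reversed token list.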

Rule Disjunction
Because our system is based on a sequential covering algorithm, a rule disjunction is learned for each tag.
A rule disjunction is an ordered set of rules that are applied in order when placing a tag.
–The first rule of that set to match in the document is used to place the tag.
Keep in mind that what is learned for each tag is a rule disjunction of one or more rules.
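
A rule disjunction can be pictured as a first-match-wins loop over its rules. In this sketch a rule is simplified to a function that returns a token position or None; `skip_to` and `place_tag` are hypothetical names, and the token lists are invented for illustration.

```python
def skip_to(token):
    """Hypothetical one-clause rule: the position just after the first `token`."""
    def rule(tokens):
        return tokens.index(token) + 1 if token in tokens else None
    return rule


def place_tag(tokens, disjunction):
    """Try each rule in order and use the first one that matches."""
    for rule in disjunction:
        pos = rule(tokens)
        if pos is not None:
            return pos
    return None


disjunction = [skip_to("Speaker"), skip_to("by")]
print(place_tag(["Speaker", "Hans", "Cycon"], disjunction))
print(place_tag(["Talk", "by", "Hans", "Cycon"], disjunction))
```

Because sequential covering learns each later rule only from the documents the earlier rules failed to cover, the order of the rules matters.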

Example of a rule matching

Refining a Rule
A rule initially contains a single token.
–The token is taken from the tokens immediately adjacent to the target data item.
–Examples: SkipTo(SYMBOL) or SkipUntil(John)
–Then, either a landmark is added to the rule or a token is added to one of the existing landmarks.

Refining a Rule
Example: SkipTo(SYMBOL) can become:
–SkipTo(be SYMBOL)
–SkipTo(speaker) SkipTo(SYMBOL)
–etc.
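
The two kinds of refinement can be enumerated mechanically. The sketch below treats a rule as a list of landmarks (each a list of tokens) and ignores the SkipTo/SkipUntil distinction for brevity. The `refine` function, the choice to extend landmarks at either end, and the choice to allow a new landmark at any position are assumptions made for illustration.

```python
def refine(rule, candidate_tokens):
    """Return every rule obtained by one refinement step.

    A step either adds a token to one end of an existing landmark
    or inserts a new one-token landmark somewhere in the rule.
    """
    refined = []
    for tok in candidate_tokens:
        for i, landmark in enumerate(rule):
            refined.append(rule[:i] + [[tok] + landmark] + rule[i + 1:])
            refined.append(rule[:i] + [landmark + [tok]] + rule[i + 1:])
        for i in range(len(rule) + 1):
            refined.append(rule[:i] + [[tok]] + rule[i:])
    return refined


candidates = refine([["SYMBOL"]], ["be", "speaker"])
for candidate in candidates:
    print(candidate)
```

Among the candidates are `[["be", "SYMBOL"]]` and `[["speaker"], ["SYMBOL"]]`, which correspond to SkipTo(be SYMBOL) and SkipTo(speaker) SkipTo(SYMBOL) on the slide.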

Refining a Rule
After a rule is refined, the best candidate rule is chosen and is determined to be either perfect or imperfect.
The best candidate rule has the greatest number of matches on the remaining training documents.
–Early and failed matches are preferred over late matches.
–If the best candidate is perfect, it is returned; otherwise it is refined again.

Keeping a Rule
We want to keep rules that have perfect accuracy on the training documents.
–No negative matches, where the rule being evaluated misplaces a tag in a training document
–No false positive matches, where the rule places a tag for a data item in a training document in which that data item does not exist
When a rule keeps being refined without becoming “perfect”, it eventually reaches a refinement limit and is returned as is.
–The rule in this case is probably not very useful.
–This case is infrequent.
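
One way to read the two conditions above is as a single check over the training examples: a rule may fail to match a document, but whenever it does place a tag, the tag must be in the right spot, and it must place nothing where the data item is absent. The `is_perfect` function, the `skip_to` helper, and the example documents are hypothetical.

```python
def skip_to(token):
    """Hypothetical one-clause rule: the position just after the first `token`."""
    def rule(tokens):
        return tokens.index(token) + 1 if token in tokens else None
    return rule


def is_perfect(rule, examples):
    """examples: (tokens, target) pairs; target is None if the item is absent."""
    for tokens, target in examples:
        placed = rule(tokens)
        # A non-match (None) is allowed; a wrong or spurious placement is not.
        if placed is not None and placed != target:
            return False
    return True


examples = [
    (["Speaker", "George", "Smith"], 1),
    (["Talk", "by", "Ann"], 2),
    (["Speaker", "TBA"], None),  # no speaker name to tag in this document
]
print(is_perfect(skip_to("by"), examples))
print(is_perfect(skip_to("Speaker"), examples))
```

Here `skip_to("Speaker")` is rejected because it places a tag in the third document, where the data item does not exist.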

General Overview of Our Improvements
Minimum Refinements
Rule Score
Refinement Window
Wildcards
In the upcoming examples, we often explain how each of these improvements is useful in finding a begin tag for an ontology element; the usefulness for end tags is similar.

Minimum Refinements
Forces rules to be refined some minimum number of times
We typically use a minimum of 5 refinements.

Minimum Refinements Example
Consider the rule SkipTo(George).
Suppose this rule is perfect on the training documents.
In general, this rule would be very ineffective at finding the speaker’s first name.
We would force this rule to be refined further so that it might ultimately have greater coverage over all documents and reflect the structure of the domain.
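
The effect of a minimum refinement count on the learning loop can be sketched as follows. The loop structure, the function and parameter names, and the `max_refinements` value of 20 are assumptions; the slides state only that a minimum (typically 5) is enforced and that some upper limit exists. The toy usage replaces real rules with integers so that the control flow is easy to follow.

```python
def learn_rule(rule, refine, is_perfect, score,
               min_refinements=5, max_refinements=20):
    """Refine `rule` until it is perfect, but never fewer than
    `min_refinements` times and never more than `max_refinements` times.

    Returns the final rule and the number of refinements performed.
    """
    done = 0
    while done < max_refinements:
        if done >= min_refinements and is_perfect(rule):
            break
        candidates = refine(rule)
        if not candidates:
            break
        rule = max(candidates, key=score)  # keep the best candidate
        done += 1
    return rule, done


# Toy stand-ins: a "rule" is an integer and every rule counts as perfect.
result = learn_rule(0, refine=lambda r: [r + 1],
                    is_perfect=lambda r: True, score=lambda r: r)
print(result)
```

Even though the toy rule is perfect from the start, it is still refined five times, which is the behavior the SkipTo(George) example calls for.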

Rule Score
Utilizes an evaluation set of documents
Decides whether forward rules or backward rules are better for a particular tag based on their performance on the evaluation set

Rule Score Example
What should we do when forward and backward rules disagree on the location of a tag?
We test the forward and backward rules on a set of evaluation documents that were not used during training.
If the forward rules have a better score on the evaluation set, they are stored as the rules for placing that tag; otherwise the backward rules are stored.
Requires additional marked-up documents
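
The selection step can be sketched as a comparison of two scores on held-out documents. The scoring function used here (the fraction of evaluation documents on which the tag lands in the correct position) is an assumption, since the slides do not define the score; the rule functions and documents are toy stand-ins.

```python
def score(place, evaluation_set):
    """Fraction of evaluation documents where `place` puts the tag correctly."""
    correct = sum(1 for tokens, target in evaluation_set
                  if place(tokens) == target)
    return correct / len(evaluation_set)


def choose_direction(forward, backward, evaluation_set):
    """Keep the forward rules only if they score strictly better."""
    if score(forward, evaluation_set) > score(backward, evaluation_set):
        return "forward"
    return "backward"


# Toy rules: the forward rule always answers 1, the backward rule answers
# the position of the last token.
forward = lambda tokens: 1
backward = lambda tokens: len(tokens) - 1

evaluation_set = [
    (["a", "b", "c"], 1),
    (["a", "b"], 1),
    (["a", "b", "c", "d"], 1),
]
print(choose_direction(forward, backward, evaluation_set))
```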

Refinement Window
Only consider the n tokens closest to a tag when refining a rule.
We typically use n = 10.

Refinement Window Example
Consider the tag Talk:Title.
Its ontological parent is Talk, the entire talk announcement.
Without a refinement window, many irrelevant tokens would be considered when learning rules for the title.
At worst, some irrelevant tokens would actually be used in a rule.
Such a rule would not generalize well.
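
A refinement window amounts to a slice of the token list. The sketch below assumes a forward rule for a begin tag, so the window is taken to be the n tokens immediately before the tag position; the slides do not say on which side of the tag the window lies, and the function name is hypothetical.

```python
def refinement_window(tokens, tag_pos, n=10):
    """Return the n tokens immediately preceding the tag position."""
    return tokens[max(0, tag_pos - n):tag_pos]


# Toy document of 30 numbered tokens with the tag placed before token 25.
tokens = [f"t{i}" for i in range(30)]
window = refinement_window(tokens, 25)
print(window)
```

Only the tokens in `window` would be offered as candidates during refinement, so tokens far from the title never enter a rule.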

Wildcards
Both domain-dependent and domain-independent; can be used in place of tokens
Allow us to better generalize a document’s structure
Examples: MONTH, NUMBER, HTML_TAG, etc.

Wildcards Example
Consider the tags Talk:Date:Month and Talk:Date:DayOfWeek.
We might start with the rule SkipUntil(INITIAL_CAP_WORD) for finding the month, but this rule would match the day of the week as well.
By virtue of the wildcard MONTH, we can use the rule SkipUntil(MONTH) to accurately locate the month.
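
Wildcards can be modeled as named predicates over tokens. The predicate definitions below are assumptions chosen to reproduce the month example on this slide; the wildcard names come from the slides, but how the system actually recognizes each class is not stated.

```python
import re

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}

# Hypothetical wildcard definitions: name -> test on a single token.
WILDCARDS = {
    "MONTH": lambda t: t in MONTHS,
    "NUMBER": lambda t: t.isdigit(),
    "HTML_TAG": lambda t: re.fullmatch(r"<[^>]+>", t) is not None,
    "INITIAL_CAP_WORD": lambda t: t.isalpha() and t[:1].isupper(),
}


def matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def skip_until(tokens, pattern):
    """Return the position of the first token matching the pattern."""
    for i, tok in enumerate(tokens):
        if matches(pattern, tok):
            return i
    return None


tokens = ["Thursday", ",", "March", "27"]
print(skip_until(tokens, "INITIAL_CAP_WORD"))
print(skip_until(tokens, "MONTH"))
```

The general wildcard stops at the day of the week, while MONTH skips past it to the month.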

Marked Up by a Human <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> New Developments in Still Image and Video Compression./daml/trainfile1.daml While the demand on quality of digital images and videos increases,.... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. 2002 March 27 Thursday 3 30 00 2002 March 27 Thursday 4 30 00 Hans L. Cycon FHTW Berlin, University of Applied Sciences hcycon@fhtw-berlin.de

Marked Up by our Basic System <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences./daml/trainfile1.daml While the demand on quality of digital images and videos increases,.... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. 00 3:30-4 30-4:30 00 Hans L hcycon@fhtw-berlin.de

Marked Up by our Full System <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:daml="http://www.daml.org/2000/10/daml-ont#" xmlns="http://daml.umbc.edu/ontologies/talk-ont#" xmlns:time="http://daml.umbc.edu/ontologies/calendar-ont#"> New Developments in Still Image and Video Compression Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences./daml/trainfile1.daml While the demand on quality of digital images and videos increases,.... We will also show a real-time IP video conference system based on earlier versions of these wavelet codecs. 2002 March 27 Thursday 3 30 00 2002 March 27 Thursday 4 30 00 Hans L. Cycon Prof. Hans L. Cycon FHTW Berlin, University of Applied Sciences hcycon@fhtw-berlin.de

Experimental Setup
3 Domains
–UC Berkeley, UCSB, and ITTALKS
6 Systems
–Basic, Min Refine, Score, Refinement Window, Wildcards, and Full
10 Partitions
20/20/20 split
–Training/Evaluation/Testing Sets
recall = number of correctly extracted data items divided by the total number of data items in the documents
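
The recall definition translates directly into code. In this sketch the hand-marked and extracted items for one document are held in dictionaries keyed by ontology element; that representation and the sample values are invented for illustration and are not experimental data.

```python
def recall(extracted, gold):
    """Correctly extracted data items divided by all data items in the documents."""
    correct = sum(1 for tag, value in gold.items()
                  if extracted.get(tag) == value)
    return correct / len(gold)


# Hypothetical hand-marked items and system output for a single document.
gold = {
    "Talk:Title": "New Developments",
    "Talk:Date:Month": "March",
    "Talk:BeginTime:Hour": "3",
    "Talk:BeginTime:Minute": "30",
}
extracted = {
    "Talk:Title": "New Developments",
    "Talk:Date:Month": "March",
    "Talk:BeginTime:Hour": "3",
    "Talk:BeginTime:Minute": "00",  # wrong value, so it does not count
}
print(recall(extracted, gold))
```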

Average Recall Over All Tags
[Charts: UC Berkeley Domain, UCSB Domain]

Performance Improvements on Individual Tags UC Berkeley Domain

Performance Improvements on Individual Tags UCSB Domain

Conclusion
Our system extends the state-of-the-art algorithm STALKER.
Our system performs DAML markup on talk announcements.
It can trivially be extended to different markup languages and different domains.
A working implementation of everything described here exists!

Future Considerations
Active Learning: select training documents that yield rules with the greatest possible coverage
Cardinality Issues: ontology elements that appear in lists
Linguistic Information: use a system like Aerotext to preprocess the documents
Google API: check to see if our tag placement “makes sense”

Acknowledgements
This work was supported in part by the Defense Advanced Research Projects Agency under contract F30602-00-2-0591 as part of the DAML program (http://daml.org/).
It was also supported by a Northrop Grumman Fellowship.
