1 On Embedding Machine-Processable Semantics into Documents. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University, Dayton, OH-45435, USA
2 Talk Outline: Background and Motivation (Why?); Goals (What?); Details (How?); Conclusions
3 Background and Motivation
4 [Diagram labels: Heterogeneous Doc. (Spec.), Defn. Rep.] Content Extraction: formalize the document using a controlled vocabulary.
5 Problems with this approach to content extraction: Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability. Manual extraction from the spec (from scratch) for each use is labor-intensive, time-consuming, and prone to typographical errors.
6 Observation: Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, to the controlled vocabulary. So, explore techniques to maintain the correspondence between a spec fragment and its formalization.
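One way to picture this correspondence is a record that stores a formalized value together with the character span of the spec phrase it came from. The sketch below is only an illustration in Python: the `Extraction` class, its field names, and the sample spec sentence are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """A formalized fact plus the span of spec text it was derived from."""
    name: str     # controlled-vocabulary term (hypothetical naming)
    value: float
    unit: str
    start: int    # offset of the first character of the source phrase
    end: int      # offset one past the last character of the source phrase

    def source(self, spec: str) -> str:
        """Return the spec fragment that justifies this extraction."""
        return spec[self.start:self.end]


# Invented spec sentence, used only to exercise the record.
spec = "Tensile strength shall be at least 165 ksi for sheet 0.50 mm and under."
phrase = "165 ksi"
start = spec.index(phrase)
fact = Extraction("TensileStrength", 165.0, "ksi", start, start + len(phrase))
```

Because the span travels with the value, `fact.source(spec)` can always recover the wording behind an extracted number, which is the traceability that separate archiving loses.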
7 Goal
8 General Problem: Embed domain-specific mark-up (annotations) into a human-sensible document to make explicit the semantics of “content” text and complex data, and to augment an interpretation in a modular fashion. Document text: human-comprehensible. Semantic mark-up: machine-processable.
9 Details (How?)
10 Nature of Specs: Semi-structured. Heterogeneous (text, tables, images). Constrained technical vocabulary. Available as MS Word documents.
11 Pre-processing Spec: Abstract content from the spec document by removing display-oriented information: save text; save tabular data, preserving grid layout; retain links to images; … Note: the “Save As text” option in MS Word is inadequate.
12 Heterogeneous Document
13 XML generated by Majix
14 ASCII Output
15 Annotating Pre-processed Spec (Embedding Machine-Processable Semantics): Recognizing and tagging text using the controlled vocabulary (a by-product of document indexing and semantic search). Tagging tabular data to make its semantics explicit: same grid layout, but different interpretation and dependencies based on headings. Explore: the XML-based programming language Water for defining data and its behavior (semantics).
16 Locating Controlled Vocabulary Terms
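Locating controlled-vocabulary terms can be approximated with plain pattern matching. The following Python sketch is an assumption-laden illustration, not the tool used in the talk: the `tag_terms` function, the `<term id="...">` tag shape, and the three-entry vocabulary are all made up for the example.

```python
import re


def tag_terms(text: str, vocabulary: dict[str, str]) -> str:
    """Wrap each vocabulary phrase found in text in a <term> tag."""
    # Try longer phrases first so a multi-word term is not split up.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): concept for phrase, concept in vocabulary.items()}

    def wrap(match: re.Match) -> str:
        concept = lookup[match.group(0).lower()]
        # The original wording is kept; only the annotation is added.
        return f'<term id="{concept}">{match.group(0)}</term>'

    return pattern.sub(wrap, text)


# Hypothetical vocabulary mapping surface phrases to concept identifiers.
vocabulary = {
    "tensile strength": "TensileStrength",
    "yield strength": "YieldStrength",
    "thickness": "Thickness",
}
tagged = tag_terms("Tensile strength varies with thickness.", vocabulary)
```

The human-readable sentence survives unchanged inside the tags, so the annotated text stays comprehensible while the concept identifiers become machine-processable.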
17 Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145
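Read as a function, the table maps a thickness to a pair of strengths. The Python sketch below encodes the three rows shown; it assumes that each range excludes its lower bound and includes its upper bound (so that "0.50 and under" covers exactly 0.50), and that the first range starts at 0. The names `ROWS` and `strengths_for` are illustrative, and the error text is borrowed from the processing-code slide.

```python
# (lower, upper, tensile strength in ksi, yield strength in ksi)
ROWS = [
    (0.00, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def strengths_for(thickness_mm: float) -> tuple[int, int]:
    """Return (tensile, yield) strength for a thickness in mm."""
    for lower, upper, tensile, yield_strength in ROWS:
        # Assumed convention: lower bound exclusive, upper bound inclusive.
        if lower < thickness_mm <= upper:
            return tensile, yield_strength
    raise ValueError("TABLE: out of range error occurred")
```

Under these assumptions a 0.6 mm sheet falls in the second row, and any thickness above 1.50 mm raises the out-of-range error instead of silently returning a value.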
18 Example of Tagged Table
Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi)
table. 0.50 and under 165 155
table. 0.50 - 1.00 160 150
table. 1.00 - 1.50 155 145
table....
19 Example of Processing Code /> <set rows= table.rows. />/> …
20 (cont’d) ….1/> temp..0/> /> > table.rows..2 />
21 (cont’d) … fluid. <try /> TensileStrength > "TABLE: out of range error occurred"
22 Water: XML-based OO scripting language. Facilitates creating Web Services. Run methods remotely via a web browser. Generalizes dynamic typing to constraint checking: conformance of actuals to formals.
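The idea of checking that actual arguments conform to constraints on the formal parameters can be imitated outside Water. The sketch below is a Python analogy only and says nothing about Water's real syntax or runtime; the `constrained` decorator and the `describe` function are hypothetical names introduced for the illustration.

```python
import functools
import inspect


def constrained(**constraints):
    """Attach a predicate to each named parameter and check it on every call."""
    def decorate(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Match actual arguments to formal parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, check in constraints.items():
                actual = bound.arguments[name]
                if not check(actual):
                    raise ValueError(f"{name}={actual!r} violates its constraint")
            return func(*args, **kwargs)

        return wrapper
    return decorate


# The formal parameter accepts only thicknesses covered by the example table.
@constrained(thickness_mm=lambda t: 0 < t <= 1.5)
def describe(thickness_mm):
    return f"sheet of {thickness_mm} mm"
```

A plain type check would accept any number here; the constraint narrows the accepted values to a range, which is the sense in which constraint checking generalizes dynamic typing.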
23 Pros and cons. Encoding improvement: the amount of tagging can be controlled by suitably delimiting the table data and annotating it with a corresponding “string-processing” method. Master copy update: changes to the spec require manual modification of the archived annotated version. Irregular tables in specs: different units, etc.
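The encoding improvement can be sketched as follows: rather than tagging every cell, the table body is kept as one delimited block of text and a single string-processing routine interprets each row. This Python sketch assumes a `|` column delimiter and a first range that starts at 0; the helper names `parse_range` and `parse_block` are hypothetical.

```python
import re


def parse_range(cell: str) -> tuple[float, float]:
    """Turn '0.50 and under' or '0.50 - 1.00' into (lower, upper)."""
    cell = cell.strip()
    under = re.fullmatch(r"([\d.]+)\s+and under", cell)
    if under:
        return 0.0, float(under.group(1))  # assumed lower bound of 0
    between = re.fullmatch(r"([\d.]+)\s*[-–]\s*([\d.]+)", cell)
    if between:
        return float(between.group(1)), float(between.group(2))
    raise ValueError(f"unrecognized thickness range: {cell!r}")


def parse_block(block: str) -> list[tuple[float, float, int, int]]:
    """Interpret a delimited table block, one row of text per line."""
    rows = []
    for line in block.strip().splitlines():
        thickness, tensile, yield_strength = line.split("|")
        rows.append((*parse_range(thickness), int(tensile), int(yield_strength)))
    return rows


# The block keeps the grid layout of the spec; only the delimiter is added.
BLOCK = """
0.50 and under | 165 | 155
0.50 - 1.00 | 160 | 150
1.00 - 1.50 | 155 | 145
"""
rows = parse_block(BLOCK)
```

The trade-off is the one the slide names: less mark-up in the document, at the price of a parser that must anticipate irregular cells such as differing units.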
24 Some Related Work. Microsoft Smart Tags: recognize “controlled” words in Office 2003 documents and associate a predefined list of actions with each occurrence. SHOE. Table data in a declarative (logic) language.
25 Prolog rendition
% strengthTableRow(Lower, Upper, TensileStrength, YieldStrength)
strengthTableRow(   0, 0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
% ...
% A row applies when Thickness lies above its lower bound and at or below its upper bound.
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L < Thickness, U >= Thickness.
thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).
?- thicknessToYieldStrength(0.6, YS).
26 Conclusions
27 A Step towards the Holy Grail: Ultimately, enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.