Sample Talks for Organizational Hints Krishnaprasad Thirunarayan Department of Computer Science and Engineering Wright State University Dayton, OH-45435.


Overall R&D Agenda

Develop semi-automatic techniques for information extraction and retrieval that enable man and machine to complement each other in the assimilation of semi-structured, heterogeneous documents => Semantic Web Technologies.

A Modular Approach to Document Indexing and Semantic Search

Goal (What?)
Background and Motivation (Why?)
Implementation Details (How?)
Evaluation and Applications (Why?)
Conclusions

Goal

Develop a modular approach to improving the effectiveness of searching documents for information.
Reuse and integrate mature software components.

Background and Motivation

Improve recall using information implicit in the English language.
Improve precision and recall using domain-specific information implicit in the document collection.
Assist manual content extraction by mapping document phrases to controlled vocabulary terms (domain library).
  NSF-SBIR Phases I and II with Cohesia Corp.

Enable extensions
  Spell check the input query.
  Organize search results through grouping.
    Improve precision through sense disambiguation.
Enable experimentation
  Investigate the empirical relationship between the significant eigenvalues in the Singular Value Decomposition (SVD) and the number of document clusters, using benchmarks.

Implementation Details (How?)

Tools Used

Apache’s Lucene APIs: a high-performance Java text search engine library with smart indexing strategies.
WordNet and the Java WordNet Library
NIST and MathWorks’ Java Matrix package (JAMA) for LSI
Domain-specific controlled vocabulary for Materials and Process Specs

Jazzy, a Java open-source spell checker
MEDLINE dataset
20-Newsgroups dataset
Reuters newswire stories datasets
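The spell-check extension for input queries can be illustrated with a small sketch. The talk uses Jazzy, a Java library; the version below is only a stand-in built on Python's standard `difflib` module, and the vocabulary, the `suggest` and `correct_query` helpers, and the 0.7 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary; a real system would draw it from the document index.
VOCABULARY = ["tensile", "strength", "yield", "thickness", "aluminum"]

def suggest(word, vocabulary=VOCABULARY, limit=3):
    """Return the word itself if known, else up to `limit` close vocabulary terms."""
    lowered = word.lower()
    if lowered in vocabulary:
        return [lowered]
    # The cutoff of 0.7 is an assumed similarity threshold, not a tuned value.
    return difflib.get_close_matches(lowered, vocabulary, n=limit, cutoff=0.7)

def correct_query(query):
    """Replace each query word by its best suggestion, keeping unknown words as-is."""
    corrected = []
    for word in query.split():
        candidates = suggest(word)
        corrected.append(candidates[0] if candidates else word)
    return " ".join(corrected)

print(correct_query("tensil strenght"))
```

Words with no sufficiently close vocabulary term are passed through unchanged, so the corrector never discards part of the user's query.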

Architecture of Content-based Indexing and Semantic Search Engine

[Figure: architecture diagram. Components: Configurer, Document Indexer, Inverted Document Index, LSA Term Matrix, Searcher, Query Modifier, Highlighter, WordNet, Domain Library, DL Term Locator, Inverted DLIndex. Inputs: document collection, user query. Result: output.]
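The Document Indexer, Inverted Document Index, and Searcher components can be pictured with a minimal sketch. The system described in the talk builds on Lucene; the plain-Python version below is only an illustration of the inverted-index idea, and the sample documents and function names are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document collection.
DOCS = {
    "spec1": "Tensile strength of aluminum alloy sheet",
    "spec2": "Yield strength of steel plate",
    "spec3": "Heat treatment of aluminum alloy",
}
INDEX = build_inverted_index(DOCS)
print(sorted(search(INDEX, "aluminum alloy")))
```

A production index would also store term positions and frequencies, which is what makes highlighting and ranking possible; this sketch keeps only document membership.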

Evaluation and Application (Why?)

Enhanced search illustrating wildcard pattern and synonym expansion
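As a rough illustration of what a query modifier does with wildcard patterns and synonyms, the sketch below expands each query term into a list of alternatives. The synonym table is a hypothetical stand-in for WordNet, the vocabulary is invented, and `fnmatch` from the Python standard library replaces whatever wildcard matching the actual system performs.

```python
import fnmatch

# Hypothetical synonym table standing in for WordNet synsets.
SYNONYMS = {
    "strength": ["durability", "toughness"],
    "sheet": ["plate"],
}

# Hypothetical vocabulary of indexed terms.
VOCABULARY = ["tensile", "tension", "temper", "strength", "toughness", "plate", "sheet"]

def expand_term(term, vocabulary, synonyms):
    """Expand one query term via wildcard matching or synonym lookup."""
    if any(ch in term for ch in "*?"):
        # Wildcard pattern: keep every vocabulary term that matches it.
        return sorted(fnmatch.filter(vocabulary, term))
    # Plain term: keep it and append its synonyms, if any.
    return [term] + synonyms.get(term, [])

def expand_query(query, vocabulary=VOCABULARY, synonyms=SYNONYMS):
    """Return one list of alternatives per query term."""
    return [expand_term(t, vocabulary, synonyms) for t in query.lower().split()]

print(expand_query("tens* strength"))
```

Each inner list can then be searched as a disjunction, with the lists themselves combined conjunctively.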

Matching DL Items; DL Term and its location in the document
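Locating domain library (DL) terms in a document amounts to finding each controlled-vocabulary phrase and recording where it occurs. The sketch below shows one simple way to do that with regular expressions; the term list, the sample text, and the `locate_dl_terms` function are assumptions for illustration, not the system's actual matcher.

```python
import re

# Hypothetical controlled-vocabulary (domain library) terms.
DOMAIN_LIBRARY = ["tensile strength", "yield strength", "heat treatment"]

def locate_dl_terms(text, terms=DOMAIN_LIBRARY):
    """Return (term, start, end) for each vocabulary term found in the text."""
    matches = []
    for term in terms:
        # Allow any run of whitespace between the words of a multi-word term.
        pattern = r"\b" + r"\s+".join(map(re.escape, term.split())) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((term, m.start(), m.end()))
    # Report matches in document order.
    return sorted(matches, key=lambda item: item[1])

TEXT = "Minimum Tensile  Strength and yield strength after heat treatment."
for term, start, end in locate_dl_terms(TEXT):
    print(term, start, end, repr(TEXT[start:end]))
```

The character offsets are what a highlighter would need in order to mark each matched phrase in the displayed document.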

Example illustrating skippable group

LSI and Clustering

Exploring the relationship between the number of significant eigenvalues and the number of document clusters
  20-Mini-Newsgroup dataset: 2000 postings, 20 groups
  Reuters Newswire Stories dataset: used 2000 stories at a time, 70 topics
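For reference, the quantities involved can be written out as follows. This is a standard formulation of the SVD of a term-document matrix $A$, not taken from the slides; in particular, the relative threshold $\tau$ used to decide which values count as "significant" is an assumption, since the talk does not state its criterion. The eigenvalues in question are those of $A^{\mathsf{T}}A$, whose square roots are the singular values of $A$.

```latex
\begin{align*}
A &= U \Sigma V^{\mathsf{T}}, \qquad
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
  \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
\sigma_i &= \sqrt{\lambda_i\!\left(A^{\mathsf{T}} A\right)} \\
A_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf{T}}
  \qquad \text{(rank-$k$ approximation used by LSI)} \\
k(\tau) &= \left|\{\, i : \sigma_i \ge \tau \, \sigma_1 \,\}\right|,
  \qquad 0 < \tau \le 1 \quad \text{(assumed significance criterion)}
\end{align*}
```

Under this reading, the experiment compares $k(\tau)$ with the known number of groups or topics in each benchmark.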

Conclusions

Useful assistance for manual content extraction from materials and process specs, given the controlled vocabulary.
In the future, this framework/infrastructure can be used for experiments with expressive and context-aware search.

Formalizing and Querying Heterogeneous Documents with Tables

Goal (What?)
Background and Motivation (Why?)
Implementation Details (How?)
Evaluation and Applications (Why?)
Conclusions

Goal

Define, embed, and use metadata in semi-structured documents containing tables.
Content-oriented, domain-specific metadata for a human-sensible document:
  Makes the semantics of complex data explicit.
  Enables augmentation of an interpretation in a modular fashion.

Heterogeneous Document

Background and Motivation

Embedding metadata improves traceability, thereby facilitating:
  Content extraction
  Verification
  Update

Implementation Details (How?)

XML Technology

Document-Centric View: XML is used to annotate documents for use by humans in the realm of document processing and content extraction.
Data-Centric View: XML is used as a text-based format for information exchange/serialization in the context of Web Services.

Basic idea behind our approach

Unify the two views by using XML elements to materialize the abstract syntax and, together with XML attributes and XML element definitions, formalize the content.
  Key advantage: minimizes maintenance of additional data structures to relate the original document to its formalization.

Two Concrete Implementations

Use the Web Services language Water, which amalgamates XML technology with programming language concepts.
Use the XML/XSLT infrastructure.

Water-based approach

Each annotation reflects the semantics of the text fragment it encloses.
The annotated data can be interpreted by viewing it as a function/procedure call in Water.
The correspondence between formal parameters and actual arguments is position-based.
The semantics of an annotation is defined separately in Water, as a method definition in a class.
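The idea of reading an annotation as a call whose arguments bind by position can be sketched without Water itself. The Python below is only an analogy: the `StrengthTable` class, its `row` method, and the tuple encoding of annotations are hypothetical, chosen to show a method definition kept separate from the annotated data.

```python
class StrengthTable:
    """Hypothetical stand-in for a class that gives an annotation its meaning."""

    def __init__(self):
        self.rows = []

    def row(self, min_thickness, max_thickness, tensile, yield_strength):
        """Formal parameters are bound to actual arguments purely by position."""
        self.rows.append({
            "min_thickness": min_thickness,
            "max_thickness": max_thickness,
            "tensile": tensile,
            "yield_strength": yield_strength,
        })

def interpret(table, annotation):
    """Treat an annotation (name, positional arguments) as a method call."""
    name, args = annotation
    getattr(table, name)(*args)

# Annotated data, encoded here as (method name, positional arguments).
ANNOTATIONS = [
    ("row", (0.0, 0.50, 165, 155)),
    ("row", (0.50, 1.00, 160, 150)),
]

table = StrengthTable()
for annotation in ANNOTATIONS:
    interpret(table, annotation)
print(table.rows[1]["tensile"])
```

Because binding is positional, the annotated data stays compact, but reordering the values would silently change their meaning.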

Example Table

Thickness (mm) | Tensile Strength (ksi) | Yield Strength (ksi)
0.50 and under | – | –

Example of Tagged Table Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi) table and under table table table....

Example of Processing Code /> <set rows= table.rows. />/> …

XML/XSLT-based approach

Each annotation reflects the semantics of the text fragment it encloses.
To make the annotated data XML-compliant, dummy attributes such as one, two, three, etc. are introduced.
The correspondence between formal attribute and actual value is name-based.
The semantics is defined separately, by interpreting XML elements and their XML attributes via XSLT.

Example of Tagged Table

<tableSchema one="Thickness(min)" two="Thickness(max)" three="Tensile Strength" four="Yield Strength"/>
...
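Name-based correspondence can be illustrated by pairing each dummy attribute of a row with the column named under the same attribute in `tableSchema`. In the sketch below only the `tableSchema` element follows the slide; the enclosing `table` element, the `tableRow` elements, and the `read_rows` function are hypothetical, and the row values are borrowed from the Prolog rendition in this talk. Python's `xml.etree.ElementTree` is used here in place of XSLT.

```python
import xml.etree.ElementTree as ET

# tableSchema follows the slide; <table> and <tableRow> are hypothetical.
XML = """
<table>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableRow one="0" two="0.50" three="165" four="155"/>
  <tableRow three="160" one="0.50" four="150" two="1.00"/>
</table>
"""

def read_rows(xml_text):
    """Pair each row attribute with the schema column of the same name."""
    root = ET.fromstring(xml_text.strip())
    schema = root.find("tableSchema").attrib
    rows = []
    for row in root.findall("tableRow"):
        # Attribute order in the row is irrelevant: the lookup is by name.
        rows.append({schema[key]: float(value) for key, value in row.attrib.items()})
    return rows

ROWS = read_rows(XML)
print(ROWS[1]["Tensile Strength"])
```

The second row lists its attributes in a different order to show that, unlike position-based binding, the interpretation does not change.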

XSLT stylesheets can be used to:
  Query: perform table look-ups.
  Transform: change units of measure, such as from standard SI units to FPS units and vice versa.
  Format: display the table in HTML form.
  Extract: recover the original table.
  Verify: check static semantic constraints on table data values.
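Three of these operations (query, transform, verify) can be sketched over the strength-table rows that appear in the Prolog rendition of this talk. The code is plain Python rather than XSLT, so it shows the intent of each operation and not a stylesheet. It assumes each band covers low < thickness <= high (suggested by the "0.50 and under" row), and the specific constraints checked by `verify` are illustrative assumptions.

```python
KSI_TO_MPA = 6.894757  # 1 ksi = 1000 psi, approximately 6.894757 MPa

# (min thickness, max thickness, tensile strength in ksi, yield strength in ksi)
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]

def query_yield_strength(thickness, rows=ROWS):
    """Query: look up the yield strength (ksi) for a given thickness."""
    for low, high, _tensile, yield_strength in rows:
        if low < thickness <= high:
            return yield_strength
    return None  # thickness falls outside every band

def transform_to_mpa(rows=ROWS):
    """Transform: convert both strength columns from ksi to MPa."""
    return [
        (low, high, round(tensile * KSI_TO_MPA, 1), round(ys * KSI_TO_MPA, 1))
        for low, high, tensile, ys in rows
    ]

def verify(rows=ROWS):
    """Verify: check assumed static constraints and return a list of problems."""
    problems = []
    for i, (low, high, tensile, ys) in enumerate(rows):
        if not low < high:
            problems.append(f"row {i}: empty thickness band")
        if ys > tensile:
            problems.append(f"row {i}: yield strength exceeds tensile strength")
        if i > 0 and rows[i - 1][1] != low:
            problems.append(f"row {i}: band does not start where row {i - 1} ends")
    return problems

print(query_yield_strength(0.6))
print(verify())
```

Keeping the three operations as separate functions mirrors the idea of separate stylesheets applied to the same annotated table.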

Evaluation and Application (Why?)

Advantage

Only the tabular data in each document is annotated.
The annotation definition is factored out as background knowledge. Thus, the semantics of each table type is specified just once, outside the document, and is reused across different documents containing similar tables.

Disadvantage

Both avenues require mature tool support for widespread adoption.
For example, develop an MS FrontPage-like interface in which the master document is the annotated form, the user explicitly interacts with and edits only a view of the annotated document (for readability), and an export-as-XML feature generates a well-formed XML document.

Prolog rendition

% strengthTableRow(MinThickness, MaxThickness, TensileStrength, YieldStrength)
strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...

% A thickness selects the row whose band satisfies L < Thickness =< U.
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L < Thickness, U >= Thickness.

thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).

thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).

?- thicknessToYieldStrength(0.6, YS).

Conclusion and Future Work

Develop a catalog of predefined tables, specify them using Semantic Web formalisms (such as RDF and OWL), and map tabular data into a set of predefined tables, possibly qualified.
Develop techniques for manual mapping of complex tables into simpler ones:
  To provide semantics to data.
  To improve traceability.
  To facilitate automatic manipulation.

Tailor and improve IE and IR techniques developed in the context of text processing for Semantic Web documents (such as those in XML or RDF), benefiting from the additional support of ontologies (such as those in OWL).

Our Related Publications

K. Thirunarayan, A. Berkovich, and D. Sokol, An Information Extraction Approach to Reorganizing and Summarizing Specifications, In: Information and Software Technology Journal, Vol. 47, Issue 4, pp ,
K. Thirunarayan, On Embedding Machine-Processable Semantics into Documents, In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 7, pp , July 2005.

Holy Grail

Ultimately, develop principles, techniques, and tools to author and extract the human-readable and machine-comprehensible parts of a document hand in hand, and to keep them side by side.