September 25, 2006 NASA Feasibility Study Status Update.



NASA Milestones
A. Feasibility study to identify the NASA document types - report - May 31, 2006
B. Form identification and template development - template set - Aug 31, 2006
C. Enhance classification algorithm for two specific classes - software packaged - Oct 31, 2006
D. Process study for inter-organizational collections - configuration software - Dec 1, 2006
E. Enhance engine to recognize two major classes - software packaged - Dec 15, 2006
F. Evaluation of extraction process - report - Feb 28, 2006

Form Identification and Template Development: August 31 Deliverable

Form Identification and Template Development: August 31 Deliverable (Demo)

Active Tasks for Future NASA Milestones
- Standard intermediate representation of the scanned document (IDM)
- Design classification algorithm

Independent Document Model (IDM)
Platform-independent document model
Motivation
- Dramatic XML schema change between OmniPage 14 and 15
- Ties the template engine to a stable specification
- Protects from linking directly to a specific OCR product
- Allows us to include statistics for enhanced feature usage
  - Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
  - Supports Pointpage detection and classification
Use XSLT 2.0 stylesheets to transform
- Supporting a new OCR schema only requires generating a new XSLT stylesheet; the engine does not change
- Chain a series of stylesheets to add functionality (CleanML)
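As a rough illustration of the statistics named above, the following Python sketch computes values labelled avgDocFontSize, avgPageFontSize, and wordCount from a toy in-memory document. The input layout (pages as lists of word and font-size pairs) and the function name `idm_statistics` are assumptions made for this example; they are not the actual IDM schema, and avgDocWordCount is omitted because the slides do not define it.

```python
from statistics import mean

def idm_statistics(pages):
    """Compute simple document statistics.

    pages: list of pages, each a list of (word, font_size) pairs.
    This layout is hypothetical, chosen only for illustration.
    """
    all_sizes = [size for page in pages for _, size in page]
    return {
        # Total number of words in the document
        "wordCount": len(all_sizes),
        # Mean font size over every word in the document
        "avgDocFontSize": mean(all_sizes) if all_sizes else 0.0,
        # Mean font size per page, in page order
        "avgPageFontSize": [
            mean(size for _, size in page) if page else 0.0 for page in pages
        ],
    }

pages = [
    [("Title", 18), ("Page", 12)],
    [("Body", 10), ("text", 10)],
]
stats = idm_statistics(pages)
```

Statistics like these could then be compared per page against the document average, for example to spot pages set in unusually large type.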

IDM Usage
- Each incoming XML schema requires a specific XSLT 2.0 stylesheet
- Resulting IDM doc is used for "form-based" templates
- IDM is transformed into CleanML for "non-form" templates
[Diagram: OmniPage 14 XML Doc, OmniPage 15 XML Doc, and Other OCR Output XML Doc pass through docTreeModelOmni14.xsl, docTreeModelOmni15.xsl, and docTreeModelOther.xsl respectively to produce the IDM XML Doc, which feeds Form Based Extraction; docTreeModelCleanML.xsl turns the IDM XML Doc into the CleanML XML Doc, which feeds Non Form Extraction.]
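The selection and chaining of stylesheets described here can be sketched in Python as a lookup that returns the ordered list of stylesheets for one document. The stylesheet file names come from the slide; the dictionary keys and the function `stylesheet_chain` are hypothetical, and the sketch only plans the chain rather than running an XSLT 2.0 processor.

```python
# Stylesheet names are taken from the slide; the keys are hypothetical labels.
STYLESHEETS = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
CLEANML_STYLESHEET = "docTreeModelCleanML.xsl"

def stylesheet_chain(ocr_source, form_based):
    """Return the stylesheets to apply, in order, for one OCR output.

    Raises KeyError for an OCR source with no registered stylesheet,
    since each incoming schema needs its own stylesheet.
    """
    chain = [STYLESHEETS[ocr_source]]       # OCR XML -> IDM
    if not form_based:
        chain.append(CLEANML_STYLESHEET)    # IDM -> CleanML for non-form templates
    return chain

form_chain = stylesheet_chain("omnipage14", form_based=True)
non_form_chain = stylesheet_chain("omnipage15", form_based=False)
```

Under this arrangement, supporting a new OCR product means adding one entry to the table, which mirrors the point that the engine itself does not change.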

Classification Algorithm
Two approaches:
- Classification (switching) based on image classification
- Post-hoc classification via validation

Post-hoc classification via validation
- Attempt metadata extraction with all plausible templates
- Validate each result set, assigning confidence scores
  - Field-specific validation rules may combine
    - statistical models derived for each field of
      - text length
      - % of words from English dictionary
      - % of phrases from knowledge base prepared for that field
    - pattern matching
- Select the metadata set with the highest confidence score
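The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's validator: the rules in `RULES`, the simple averaging in `confidence`, and the toy templates are all hypothetical, and a fixed word-count range stands in for the statistical models mentioned in the slide.

```python
import re

def length_score(text, lo, hi):
    """1.0 if the word count is within [lo, hi], else 0.0 (stand-in for a statistical model)."""
    n = len(text.split())
    return 1.0 if lo <= n <= hi else 0.0

def pattern_score(text, pattern):
    """1.0 if the whole text matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, text) else 0.0

# Hypothetical field-specific validation rules.
RULES = {
    "title": lambda t: length_score(t, 2, 30),
    "date": lambda t: pattern_score(t, r"[A-Z][a-z]+ \d{1,2}, \d{4}"),
}

def confidence(metadata):
    """Average the per-field scores; a missing field scores 0."""
    scores = [rule(metadata.get(field, "")) for field, rule in RULES.items()]
    return sum(scores) / len(scores)

def classify_post_hoc(document, templates):
    """Run every template, validate each result set, and keep the best one.

    templates: dict mapping a template name to a function(document) -> metadata dict.
    """
    results = {name: extract(document) for name, extract in templates.items()}
    best = max(results, key=lambda name: confidence(results[name]))
    return best, results[best], confidence(results[best])

document = "Validation of Extracted Metadata\nSeptember 12, 2006"
templates = {
    # Toy extractors: one reads title then date, the other the reverse.
    "report": lambda d: {"title": d.splitlines()[0], "date": d.splitlines()[1]},
    "memo": lambda d: {"title": d.splitlines()[1], "date": d.splitlines()[0]},
}
best_name, best_metadata, best_score = classify_post_hoc(document, templates)
```

Here the winning template effectively classifies the document: the layout whose extracted fields validate best is taken to be the document's class.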

Sample set of extracted metadata bindings:
Steven J. Zeil Old Dominion University Technical Report September 12, 2006 Validation of Extracted Metadata A lengthy discussion of techniques for validating metadata is

Validation template customized for the collection:
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">


Annotated version of the metadata bindings:
Steven J. Zeil <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report September 12, 2006 Validation of Extracted Metadata A lengthy discussion of techniques for validating metadata is
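To show how such annotations might be consumed downstream, the Python sketch below lists the fields whose confidence falls under a threshold. The slide's XML is truncated, so `SAMPLE` is a hypothetical well-formed fragment: the `metadata` wrapper, the `author` element and its score, and the 0.5 threshold are assumptions. Only the `organization` attributes follow the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the truncated annotated output.
SAMPLE = """<metadata>
  <author confidence="0.95">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report</organization>
</metadata>"""

def low_confidence_fields(xml_text, threshold=0.5):
    """Return (tag, confidence, warning) for each field scoring below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for field in root:
        # Treat a missing confidence attribute as 0 so the field gets reviewed.
        score = float(field.get("confidence", "0"))
        if score < threshold:
            flagged.append((field.tag, score, field.get("warning")))
    return flagged

flagged = low_confidence_fields(SAMPLE)
```

A list like this could be routed to a human reviewer, leaving high-confidence fields untouched.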