Concepts, Semantics and Syntax in E-Discovery David Eichmann Institute for Clinical and Translational Science The University of Iowa

Concepts, Semantics and Syntax in E-Discovery David Eichmann Institute for Clinical and Translational Science The University of Iowa

Our Approach
- Analyze the human-generated metadata available for document collections for organizational and individual interactions
- Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
- Explore the concept space generated by the previous step and its correspondence to Boolean predicate specification in discovery

Our Target Corpus
- The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
- Derived from the tobacco master settlement agreement
- Comprises 6,910,192 ‘documents’
  - Or more properly the OCR output from those documents
- Two merged XML tag sets of metadata, with overlapping content

Metadata Entity Frequencies

            Occurrences
Entity      Total        Distinct   Avg/Entity  Avg/Doc.
Bates       9,476,794    8,054,…    …           …
Category    13,594,…     …          …           …
Doctype     18,359,644   2,501      7,…         …
Prodbox     6,830,993    6,306      1,…         …

Metadata Entity Frequencies

            Occurrences
Entity      Total        Distinct   Avg/Entity  Avg/Doc.
Attendee    65,691,473   49,375     1,…         …
Brand       26,498,001   155,…      …           …
Copied      8,775,307    322,…      …           …

Metadata Entity Frequencies

              Occurrences
Org. Entity   Total        Distinct   Avg/Entity  Avg/Doc.
Author        8,742,976    149,…      …           …
Mentioned     31,406,753   883,…      …           …
Receiving     8,262,496    63,…       …           …

Metadata Entity Frequencies

                Occurrences
Person Entity   Total        Distinct   Avg/Entity  Avg/Doc.
Author          11,128,029   875,…      …           …
Mentioned       34,683,289   1,938,…    …           …
Receiving       23,427,415   455,…      …           …
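The slides do not define the table columns, so the following is only a sketch under assumptions: Avg/Entity is taken to be Total divided by Distinct, and Avg/Doc. to be Total divided by the number of documents. The function name `entity_frequencies` and the `(docid, value)` row shape are hypothetical.

```python
from collections import Counter

def entity_frequencies(rows, n_docs):
    """Summarize one metadata entity type.

    rows   -- iterable of (docid, value) pairs (assumed shape)
    n_docs -- number of documents in the collection
    """
    counts = Counter(value for _, value in rows)
    total = sum(counts.values())      # every occurrence of the entity
    distinct = len(counts)            # number of different values seen
    return {
        "total": total,
        "distinct": distinct,
        # Assumed definitions; the slides do not spell these out.
        "avg_per_entity": total / distinct if distinct else 0.0,
        "avg_per_doc": total / n_docs if n_docs else 0.0,
    }

# Toy data, not taken from the corpus.
toy_rows = [("d1", "MEMO"), ("d1", "REPORT"), ("d2", "MEMO")]
toy_stats = entity_frequencies(toy_rows, n_docs=2)
```

Under the first assumption, the Doctype row would give 18,359,644 / 2,501, a value a little over 7,000.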

Database Schema
- We map the XML structure to a set of relational database tables
- Non-recurring fields are collected in a table named ‘document’
  - docid
  - title
  - description
  - OCR text
- Recurring elements each get a table
  - docid
  - value
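A minimal sketch of this mapping, using SQLite and the standard-library XML parser. The XML tag names (`record`, `ocr`, `author`, `mentioned`) and the sample record are assumptions made for illustration; they are not the actual IIT CDIP tag set, and the slides do not say which database was used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; tag names are illustrative only.
SAMPLE_XML = """
<record>
  <docid>doc001</docid>
  <title>Inhalation study report</title>
  <description>Quarterly summary</description>
  <ocr>Text recovered by OCR</ocr>
  <author>REININGHAUS,W</author>
  <mentioned>WALK,RA</mentioned>
  <mentioned>ROEMER,E</mentioned>
</record>
"""

SINGLE_FIELDS = ("title", "description", "ocr")  # non-recurring -> 'document'
RECURRING = ("author", "mentioned")              # recurring -> one table each

def create_schema(conn):
    conn.execute(
        "CREATE TABLE document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr_text TEXT)"
    )
    for name in RECURRING:
        # Each recurring element gets its own (docid, value) table.
        conn.execute(
            f"CREATE TABLE {name} (docid TEXT REFERENCES document(docid), value TEXT)"
        )

def load_record(conn, xml_text):
    root = ET.fromstring(xml_text)
    docid = root.findtext("docid")
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?, ?)",
        (docid, *(root.findtext(field) for field in SINGLE_FIELDS)),
    )
    for name in RECURRING:
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?)",
            [(docid, el.text) for el in root.findall(name)],
        )
    return docid

conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE_XML)
```

Keeping every recurring element in a narrow two-column table means frequency and co-occurrence questions reduce to simple GROUP BY and self-join queries on `docid`.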

Identifying an Individual

                  # of Occurrences as
Person            Attendee   Author   Receiver   Mention
REININGHAUS, W    189,380    23,880   32,764     16,152
REININGHAUS       7, ,9742,837
REININGHAUS, B    1962
REININGHAUS, R

How Many Reininghaus?
- Reininghaus,R
- Reininghaus,W
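One way to frame the question is to group the name strings by surname and list which fuller forms a bare surname could refer to. This is a hedged sketch, not the method used in the talk; `parse_name`, `group_variants`, and `candidates` are hypothetical helpers, and the "SURNAME, INITIALS" format is assumed from the examples on the slides.

```python
from collections import defaultdict

def parse_name(raw):
    """Split 'SURNAME, INITIALS' into a normalized (surname, initials) pair."""
    surname, _, rest = raw.partition(",")
    return surname.strip().upper(), rest.strip().upper().replace(".", "")

def group_variants(names):
    """Map each surname to the set of initials seen with it ('' = none given)."""
    groups = defaultdict(set)
    for raw in names:
        surname, initials = parse_name(raw)
        groups[surname].add(initials)
    return groups

def candidates(raw, groups):
    """Return the fuller name forms that a metadata string could refer to."""
    surname, initials = parse_name(raw)
    if initials:
        return [f"{surname},{initials}"]
    # A bare surname is ambiguous: it may be any known individual.
    return [f"{surname},{i}" for i in sorted(groups[surname]) if i]

names = ["REININGHAUS, W", "REININGHAUS", "REININGHAUS, B", "Reininghaus,R"]
groups = group_variants(names)
ambiguous = candidates("REININGHAUS", groups)
```

Resolving which candidate a bare surname actually denotes needs further evidence, such as the co-mention patterns examined on the following slides.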

Co-mention Connections

Reininghaus               Walk
Person          Count     Person          Count
WALK,RA         3,871     REININGHAUS,W   3,871
ROEMER,E        3,716     ROEMER,E        2,883
HAUSSMANN,HJ    3,293     HAUSSMANN,HJ    2,799
TEWES,F         2,784     HACKENBERG,U    2,360

Co-mention Connections

Reininghaus               Roemer
Person          Count     Person          Count
WALK,RA         3,871     REININGHAUS,W   3,716
ROEMER,E        3,716     WALK,RA         2,883
HAUSSMANN,HJ    3,293     HACKENBERG,U    2,623
TEWES,F         2,784     HAUSSMAN,HJ     2,573

Co-mention Connections

Reininghaus               Haussmann
Person          Count     Person          Count
WALK,RA         3,871     REININGHAUS,W   3,293
ROEMER,E        3,716     WALK,RA         2,799
HAUSSMANN,HJ    3,293     ROEMER,E        2,573
TEWES,F         2,784     VONCKEN,P       2,323
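Tables like these can be produced by counting, for each pair of people, the documents in which both are mentioned. The sketch below assumes the input is a list of `(docid, person)` rows, as in a 'mentioned' table; the function name and the toy rows are hypothetical and the counts are not from the corpus.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_mention_counts(mentions):
    """mentions: iterable of (docid, person) pairs.

    Returns {person: Counter({other_person: shared_document_count})}.
    """
    by_doc = defaultdict(set)
    for docid, person in mentions:
        by_doc[docid].add(person)   # a person counts once per document

    counts = defaultdict(Counter)
    for people in by_doc.values():
        for a, b in combinations(sorted(people), 2):
            counts[a][b] += 1       # the relation is symmetric,
            counts[b][a] += 1       # so record it in both directions
    return counts

# Toy rows for illustration only.
toy_mentions = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"), ("d2", "ROEMER,E"),
    ("d3", "REININGHAUS,W"), ("d3", "ROEMER,E"),
]
toy_counts = co_mention_counts(toy_mentions)
top_for_walk = toy_counts["WALK,RA"].most_common(1)
```

The symmetry matches the slides: the Reininghaus and Walk pairing shows the same count (3,871) in both of their tables.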

Co-mention Affiliations

Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO

Semantics and Structure
- Our analysis of content involves the following phases:
  - Lexical analysis
  - Sentence boundary detection
  - Named entity recognition
  - Sentence parsing
  - Relationship extraction
- The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
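To illustrate how OCR noise can affect even an early phase, here is a deliberately naive sentence boundary detector. It is an assumption-laden toy (a single regular expression; the rule and the example strings are invented) and not the detector used in the work described.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
# and an upper-case letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

clean = split_sentences("The study ended. Results follow.")
# A period misread as a comma hides the boundary entirely.
noisy = split_sentences("The study ended, Results follow.")
```

A single misrecognized punctuation mark merges two sentences into one, and the longer, malformed sentence is then what the parser receives.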

CDIP Parse Tree Complexity

Clean Text Parse Tree Complexity

Next Steps
- Experiment with custom lexical analysis of the OCR
  - Start with simple white space detection
  - Construct a lexicon and look for out-of-band vocabulary as OCR error candidates
  - Rewrite the analyzer to support OCR error correction
- Run sentence boundary detection and parsing on the full corpus
- Generate entity relationships using our question answering framework
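The first two lexical-analysis steps could be prototyped roughly as follows: split on white space, build a lexicon from text assumed to be clean, and flag tokens outside it as candidate OCR errors. All function names, the punctuation set, and the example strings are assumptions for illustration.

```python
from collections import Counter

_PUNCT = ".,;:!?()[]\"'"

def tokenize(text):
    """Whitespace tokenization, dropping surrounding punctuation and case."""
    tokens = (t.strip(_PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]

def build_lexicon(clean_texts, min_count=1):
    """Collect tokens seen at least min_count times in trusted text."""
    counts = Counter(tok for text in clean_texts for tok in tokenize(text))
    return {tok for tok, n in counts.items() if n >= min_count}

def ocr_error_candidates(ocr_text, lexicon):
    """Tokens absent from the lexicon; pure numbers are not flagged."""
    return [t for t in tokenize(ocr_text)
            if t not in lexicon and not t.isdigit()]

lexicon = build_lexicon(["The study was completed in March."])
flagged = ocr_error_candidates("The stucly was cornpleted in March 1984.", lexicon)
```

Tokens flagged this way are only candidates: proper names and domain terms missing from the lexicon will be flagged too, so a correction step would still need to filter them.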

And Beyond That…
- Return to the document images and analyze document layout
- Regenerate OCR to include token coordinates
- Use our PDF structure extraction framework to generate logical document structure
- Generate a set of document models based upon similar layout
- Use the document models to map OCR text to metadata elements

For Example