1
Concepts, Semantics and Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science
The University of Iowa
2
Our Approach
Analyze the human-generated metadata available for document collections for organizational and individual interactions
Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery
3
Our Target Corpus
The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
Derived from the tobacco master settlement agreement
Comprises 6,910,192 ‘documents’
Or more properly the OCR output from those documents
Two merged XML tag sets of metadata, with overlapping content
4
Metadata Entity Frequencies
          Occurrences
Entity           Total   Distinct  Avg/Entity  Avg/Doc.
Bates        9,476,794  8,054,075        1.18      1.40
Category    13,594,494         74  183,709.38      2.00
Doctype     18,359,644      2,501    7,340.92      2.70
Prodbox      6,830,993      6,306    1,083.25      1.01
5
Metadata Entity Frequencies
          Occurrences
Entity           Total   Distinct  Avg/Entity  Avg/Doc.
Attendee    65,691,473     49,375    1,330.46      9.68
Brand       26,498,001    155,350      170.57      3.90
Copied       8,775,307    322,294       27.22      1.29
6
Metadata Entity Frequencies
             Occurrences
Org. Entity         Total   Distinct  Avg/Entity  Avg/Doc.
Author          8,742,976    149,641       58.43      1.29
Mentioned      31,406,753    883,285       35.56      4.63
Receiving       8,262,496     63,625      129.86      1.22
7
Metadata Entity Frequencies
               Occurrences
Person Entity         Total   Distinct  Avg/Entity  Avg/Doc.
Author           11,128,029    875,292       12.71      1.64
Mentioned        34,683,289  1,938,310       17.89      5.11
Receiving        23,427,415    455,404       51.44      3.45
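The Avg/Entity column in these tables is consistent with Total divided by Distinct (for Category, 13,594,494 / 74 ≈ 183,709.38). The following is a minimal sketch of how one such row might be computed from (docid, value) pairs. The function name `entity_stats` and the `n_docs` parameter are assumptions for illustration; the slides do not state which document count was used as the denominator for Avg/Doc.

```python
def entity_stats(rows, n_docs):
    """Summarize one metadata entity type.

    rows   -- iterable of (docid, value) pairs, one per occurrence
    n_docs -- number of documents to average over (an assumption: the
              slides do not say which denominator was used)
    """
    values = [value for _docid, value in rows]
    total = len(values)
    distinct = len(set(values))
    return {
        "total": total,
        "distinct": distinct,
        "avg_per_entity": round(total / distinct, 2) if distinct else 0.0,
        "avg_per_doc": round(total / n_docs, 2) if n_docs else 0.0,
    }


# Toy data, not drawn from the corpus.
toy_rows = [
    ("d1", "MEMO"), ("d1", "REPORT"),
    ("d2", "MEMO"), ("d3", "MEMO"), ("d4", "REPORT"),
]
toy_stats = entity_stats(toy_rows, n_docs=4)
```

With the toy data there are 5 occurrences of 2 distinct values across 4 documents, so the averages are 2.5 per entity and 1.25 per document.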
8
Database Schema
We map the XML structure to a set of relational database tables
Non-recurring fields are collected in a table named ‘document’: docid, title, description, OCR text
Recurring elements each get a table: docid, value
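A schema of this shape could be sketched as follows, using Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the authors' actual schema: the database engine, the column types, the `ocr_text` column name, and the recurring-element table names in `RECURRING` are all hypothetical.

```python
import sqlite3

# Hypothetical names for the recurring-element tables; the slides only say
# that each recurring element gets its own (docid, value) table.
RECURRING = ["author_person", "mentioned_person", "brand"]


def create_schema(conn, recurring=RECURRING):
    # One row per document for the non-recurring fields.
    conn.execute(
        "CREATE TABLE document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr_text TEXT)"
    )
    for name in recurring:
        # Table names cannot be bound as SQL parameters, so validate them.
        if not name.isidentifier():
            raise ValueError(f"unsafe table name: {name!r}")
        conn.execute(
            f"CREATE TABLE {name} ("
            "docid TEXT NOT NULL REFERENCES document(docid), "
            "value TEXT NOT NULL)"
        )
        conn.execute(f"CREATE INDEX {name}_value_idx ON {name}(value)")


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute(
    "INSERT INTO document VALUES (?, ?, ?, ?)",
    ("d1", "Memo", "", "sample OCR text"),
)
conn.executemany(
    "INSERT INTO mentioned_person VALUES (?, ?)",
    [("d1", "REININGHAUS,W"), ("d1", "WALK,RA")],
)
mentioned = conn.execute(
    "SELECT value FROM mentioned_person WHERE docid = ? ORDER BY value",
    ("d1",),
).fetchall()
```

Keeping each recurring element in its own narrow table means a document with many mentioned persons simply has many rows, and counts per value reduce to a GROUP BY on a single indexed column.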
9
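Identifying an Individual
                  # of Occurrences as
Person            Attendee  Author  Receiver  Mention
REININGHAUS, W     189,380  23,880    32,764   16,152
REININGHAUS          7,337     200     1,974    2,837
REININGHAUS, B    1962 (digits run together in the source; column breaks unknown)
REININGHAUS, R    1714412 (digits run together in the source; column breaks unknown)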
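The name variants in this table suggest a first step toward identifying an individual: split each metadata string into a surname and whatever follows the comma, then group by surname to expose the candidate variants. The sketch below assumes the "SURNAME, GIVEN" convention seen on the slide; `parse_name` and `group_variants` are hypothetical helpers, and grouping by surname alone is only a starting point, not a resolution method.

```python
from collections import defaultdict


def parse_name(raw):
    """Split 'SURNAME, GIVEN' into (surname, given), both upper-cased.

    'given' keeps only the letters after the comma, so 'W', 'W.' and ' w'
    all compare equal. A bare surname yields an empty 'given'.
    """
    surname, _sep, rest = raw.partition(",")
    given = "".join(ch for ch in rest.upper() if ch.isalpha())
    return surname.strip().upper(), given


def group_variants(names):
    """Map each surname to the set of 'given' parts seen with it."""
    groups = defaultdict(set)
    for raw in names:
        surname, given = parse_name(raw)
        groups[surname].add(given)
    return groups


variants = group_variants(
    ["REININGHAUS, W", "REININGHAUS", "REININGHAUS, B", "Reininghaus,R"]
)
```

Here all four strings land in one surname group with the given parts "W", "B", "R" and the empty string, which is exactly the ambiguity the slide raises: the bare surname could belong to any of the three.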
10
How Many Reininghaus? Reininghaus,R Reininghaus,W
11
Co-mention Connections
Reininghaus              Walk
Person          Count    Person          Count
WALK,RA         3,871    REININGHAUS,W   3,871
ROEMER,E        3,716    ROEMER,E        2,883
HAUSSMANN,HJ    3,293    HAUSSMANN,HJ    2,799
TEWES,F         2,784    HACKENBERG,U    2,360
12
Co-mention Connections
Reininghaus              Roemer
Person          Count    Person          Count
WALK,RA         3,871    REININGHAUS,W   3,716
ROEMER,E        3,716    WALK,RA         2,883
HAUSSMANN,HJ    3,293    HACKENBERG,U    2,623
TEWES,F         2,784    HAUSSMANN,HJ    2,573
13
Co-mention Connections
Reininghaus              Haussmann
Person          Count    Person          Count
WALK,RA         3,871    REININGHAUS,W   3,293
ROEMER,E        3,716    WALK,RA         2,799
HAUSSMANN,HJ    3,293    ROEMER,E        2,573
TEWES,F         2,784    VONCKEN,P       2,323
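Co-mention counts like these are symmetric (Reininghaus with Walk and Walk with Reininghaus both show 3,871), which is what one would expect if each count is the number of documents mentioning both people. A minimal sketch under that assumption follows; the slides do not describe the actual computation, and `comention_counts` and `top_partners` are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import combinations


def comention_counts(rows):
    """Count documents in which each unordered pair of persons co-occurs.

    rows -- iterable of (docid, person) pairs, e.g. from a 'mentioned' table.
    """
    people_by_doc = defaultdict(set)
    for docid, person in rows:
        people_by_doc[docid].add(person)

    counts = Counter()
    for people in people_by_doc.values():
        # Sorting makes (a, b) a canonical key for the unordered pair.
        for a, b in combinations(sorted(people), 2):
            counts[(a, b)] += 1
    return counts


def top_partners(counts, person, n=4):
    """Return the n most frequent co-mentioned partners of one person."""
    partners = Counter()
    for (a, b), count in counts.items():
        if a == person:
            partners[b] = count
        elif b == person:
            partners[a] = count
    return partners.most_common(n)


# Toy data, not drawn from the corpus.
toy_mentions = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"), ("d1", "ROEMER,E"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"),
    ("d3", "REININGHAUS,W"), ("d3", "ROEMER,E"), ("d3", "WALK,RA"),
    ("d4", "REININGHAUS,W"), ("d4", "WALK,RA"),
]
pair_counts = comention_counts(toy_mentions)
reininghaus_top = top_partners(pair_counts, "REININGHAUS,W")
```

Enumerating pairs is quadratic in the number of people per document, so for documents with long mention lists a database self-join or a cap on list length may be more practical.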
14
Co-mention Affiliations
Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO
15
Semantics and Structure
Our analysis of content involves the following phases:
Lexical analysis
Sentence boundary detection
Named entity recognition
Sentence parsing
Relationship extraction
The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
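To make the impact of OCR noise on the first two phases concrete, here is a deliberately naive sketch of sentence boundary detection and lexical analysis using regular expressions. It is not the analyzer used in this work; the rules, the function names, and the sample strings are all assumptions chosen for illustration.

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at . ! or ? followed by whitespace
    and then an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive lexer: runs of letters, runs of digits, or single symbols."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", sentence)


# Invented examples: the second simulates OCR confusing l/1, o/0 and ./,
clean = "The study was completed. Results were sent to Cologne."
noisy = "The study was comp1eted , Results were sent t0 Co1ogne."

clean_sentences = split_sentences(clean)
noisy_sentences = split_sentences(noisy)
noisy_tokens = tokenize("Co1ogne")
```

On the clean string the splitter finds two sentences. On the noisy string the misread period hides the boundary, so only one sentence is found, and the lexer breaks the misread word into three tokens ("Co", "1", "ogne"). Each error is different in kind, and each would propagate into entity recognition and parsing.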
16
CDIP Parse Tree Complexity
17
Clean Text Parse Tree Complexity
18
Next Steps
Experiment with custom lexical analysis of the OCR
Start with simple whitespace detection
Construct a lexicon and look for out-of-band vocabulary as OCR error candidates
Rewrite the analyzer to support OCR error correction
Run sentence boundary detection on, and parse, the full corpus
Generate entity relationships using our question answering framework
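The whitespace and lexicon steps above can be sketched as follows. This is one possible reading of the plan, not the authors' implementation: it assumes the lexicon is built from some trusted text, that tokens are compared case-insensitively with surrounding punctuation stripped, and that any token absent from the lexicon is merely a candidate error. All function names are hypothetical.

```python
from collections import Counter

PUNCT = ".,;:!?()[]\"'"


def whitespace_tokens(text):
    """Simplest possible lexical analysis: split on runs of whitespace."""
    return text.split()


def normalize(token):
    """Strip surrounding punctuation and fold case before lookup."""
    return token.strip(PUNCT).lower()


def build_lexicon(trusted_texts, min_count=1):
    """Collect normalized tokens seen at least min_count times."""
    counts = Counter(
        normalize(tok)
        for text in trusted_texts
        for tok in whitespace_tokens(text)
    )
    return {word for word, n in counts.items() if word and n >= min_count}


def ocr_error_candidates(ocr_text, lexicon):
    """Count tokens that do not appear in the lexicon."""
    candidates = Counter()
    for tok in whitespace_tokens(ocr_text):
        word = normalize(tok)
        if word and word not in lexicon:
            candidates[word] += 1
    return candidates


# Invented sample strings, not drawn from the corpus.
lexicon = build_lexicon(["The smoke study was completed in Cologne."])
candidates = ocr_error_candidates(
    "The sm0ke study was comp1eted in Cologne.", lexicon
)
```

On the sample, the two misread words ("sm0ke" and "comp1eted") are flagged once each. Legitimate rare words, names, and codes would also be flagged, so frequency counts over the whole corpus would likely be needed to separate real vocabulary from OCR noise.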
19
And Beyond That…
Return to the document images and analyze document layout
Regenerate OCR to include token coordinates
Use our PDF structure extraction framework to generate logical document structure
Generate a set of document models based upon similar layout
Use the document models to map OCR text to metadata elements
20
For Example