Concepts, Semantics and Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science, The University of Iowa


1 Concepts, Semantics and Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science, The University of Iowa

2 Our Approach
• Analyze the human-generated metadata available for document collections for organizational and individual interactions
• Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
• Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery

3 Our Target Corpus
• The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
• Derived from the tobacco master settlement agreement
• Comprises 6,910,192 ‘documents’
  • Or more properly the OCR output from those documents
• Two merged XML tag sets of metadata, with overlapping content

4 Metadata Entity Frequencies

                        Occurrences
Entity      Total         Distinct     Avg/Entity    Avg/Doc.
Bates        9,476,794    8,054,075          1.18        1.40
Category    13,594,494           74    183,709.38        2.00
Doctype     18,359,644        2,501      7,340.92        2.70
Prodbox      6,830,993        6,306      1,083.25        1.01

5 Metadata Entity Frequencies

                        Occurrences
Entity      Total         Distinct     Avg/Entity    Avg/Doc.
Attendee    65,691,473       49,375      1,330.46        9.68
Brand       26,498,001      155,350        170.57        3.90
Copied       8,775,307      322,294         27.22        1.29

6 Metadata Entity Frequencies

                          Occurrences
Org. Entity   Total         Distinct     Avg/Entity    Avg/Doc.
Author         8,742,976      149,641         58.43        1.29
Mentioned     31,406,753      883,285         35.56        4.63
Receiving      8,262,496       63,625        129.86        1.22

7 Metadata Entity Frequencies

                            Occurrences
Person Entity   Total         Distinct     Avg/Entity    Avg/Doc.
Author          11,128,029      875,292         12.71        1.64
Mentioned       34,683,289    1,938,310         17.89        5.11
Receiving       23,427,415      455,404         51.44        3.45
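The Avg/Entity column in the frequency tables appears to be Total divided by Distinct. The short check below reproduces the slide 4 figures under that assumption; the helper name `avg_per_entity` is ours, not the authors'. Avg/Doc. is left out because the slides do not state which document count serves as its denominator.

```python
# Rows copied from slide 4: (entity, total occurrences, distinct values).
rows = [
    ("Bates", 9_476_794, 8_054_075),
    ("Category", 13_594_494, 74),
    ("Doctype", 18_359_644, 2_501),
    ("Prodbox", 6_830_993, 6_306),
]


def avg_per_entity(total: int, distinct: int) -> float:
    """Mean number of occurrences per distinct entity value."""
    return total / distinct


for name, total, distinct in rows:
    # Two decimal places, matching the precision used on the slides.
    print(f"{name:<10}{avg_per_entity(total, distinct):>14,.2f}")
```

A very high Avg/Entity (Category, Doctype) marks a small controlled vocabulary, while a value near 1 (Bates) marks an identifier that is nearly unique per occurrence.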

8 Database Schema
• We map the XML structure to a set of relational database tables
• Non-recurring fields are collected in a table named ‘document’
  • docid
  • title
  • description
  • OCR text
• Recurring elements each get a table
  • docid
  • value
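A minimal sketch of this mapping, using Python's built-in sqlite3 module as a stand-in for whatever database the authors used. The column types, the column name `ocr_text`, the index, and the choice of `author` as the example recurring element are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Non-recurring fields: one row per document.
    CREATE TABLE document (
        docid       TEXT PRIMARY KEY,
        title       TEXT,
        description TEXT,
        ocr_text    TEXT
    );
    -- One table per recurring element; 'author' is a hypothetical example.
    CREATE TABLE author (
        docid TEXT NOT NULL REFERENCES document(docid),
        value TEXT NOT NULL
    );
    CREATE INDEX author_value_idx ON author(value);
    """
)

# Made-up sample record, only to show the one-to-many shape.
conn.execute(
    "INSERT INTO document VALUES (?, ?, ?, ?)",
    ("d1", "Memo", "Sample record", "ocr text"),
)
conn.executemany(
    "INSERT INTO author VALUES (?, ?)",
    [("d1", "REININGHAUS,W"), ("d1", "WALK,RA")],
)

authors = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM author WHERE docid = ? ORDER BY value", ("d1",)
    )
]
print(authors)
```

Keeping each recurring element in its own (docid, value) table makes frequency counts and joins between element types plain SQL.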

9 Identifying an Individual

                   # of Occurrences as
Person             Attendee    Author    Receiver    Mention
REININGHAUS, W      189,380    23,880      32,764     16,152
REININGHAUS           7,337       200       1,974      2,837
REININGHAUS, B1962
REININGHAUS, R1714412

10 How Many Reininghaus?
• Reininghaus,R
• Reininghaus,W

11 Co-mention Connections

Reininghaus               Walk
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,871
ROEMER,E        3,716     ROEMER,E         2,883
HAUSSMANN,HJ    3,293     HAUSSMANN,HJ     2,799
TEWES,F         2,784     HACKENBERG,U     2,360

12 Co-mention Connections

Reininghaus               Roemer
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,716
ROEMER,E        3,716     WALK,RA          2,883
HAUSSMANN,HJ    3,293     HACKENBERG,U     2,623
TEWES,F         2,784     HAUSSMAN,HJ      2,573

13 Co-mention Connections

Reininghaus               Haussmann
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,293
ROEMER,E        3,716     WALK,RA          2,799
HAUSSMANN,HJ    3,293     ROEMER,E         2,573
TEWES,F         2,784     VONCKEN,P        2,323
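Co-mention counts like those on slides 11 to 13 could be derived from the (docid, value) rows of a recurring-element table. The sketch below is one way to do it, not the authors' implementation: the input rows are invented toy data, and `co_mention_counts` and `top_partners` are hypothetical helper names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (docid, person) rows, shaped like a 'mentioned' table.
mentions = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"), ("d1", "ROEMER,E"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"),
    ("d3", "ROEMER,E"), ("d3", "WALK,RA"),
]


def co_mention_counts(rows):
    """Count the documents in which each unordered pair of people co-occurs."""
    by_doc = {}
    for docid, person in rows:
        by_doc.setdefault(docid, set()).add(person)
    counts = Counter()
    for people in by_doc.values():
        # Sorting gives each pair one canonical (a, b) key.
        for a, b in combinations(sorted(people), 2):
            counts[(a, b)] += 1
    return counts


def top_partners(counts, person, n=4):
    """Return the n people most often co-mentioned with `person`."""
    partners = Counter()
    for (a, b), c in counts.items():
        if a == person:
            partners[b] += c
        elif b == person:
            partners[a] += c
    return partners.most_common(n)


counts = co_mention_counts(mentions)
print(top_partners(counts, "REININGHAUS,W"))
```

Because pairs are unordered, the counts are symmetric, which matches the slides: Walk appears 3,871 times in the Reininghaus list and Reininghaus appears 3,871 times in the Walk list.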

14 Co-mention Affiliations

Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO

15 Semantics and Structure
• Our analysis of content involves the following phases:
  • Lexical analysis
  • Sentence boundary detection
  • Named entity recognition
  • Sentence parsing
  • Relationship extraction
• The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
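To make the OCR effect concrete, here is a deliberately naive sketch of the first two phases only. The regular expressions are stand-ins chosen for illustration, not the authors' analyzers, and the noisy sentence is invented to mimic common OCR confusions ("m" read as "rn", "o" as "0", a period as a comma).

```python
import re

# Words (with optional apostrophe part), numbers, or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:[.,]\d+)*|[^\sA-Za-z\d]")
# Split on whitespace that follows ., ! or ? and precedes a capital letter.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Naive sentence boundary detection."""
    return [s.strip() for s in SENT_END_RE.split(text) if s.strip()]


def tokenize(sentence):
    """Naive lexical analysis."""
    return TOKEN_RE.findall(sentence)


clean = "The study was completed. Results were sent to Cologne."
noisy = "The study was cornpleted , Results were sent t0 Cologne."

print(len(split_sentences(clean)), "sentences in clean text")
print(len(split_sentences(noisy)), "sentence in noisy text")
print(tokenize("sent t0 Cologne."))
```

One misread period merges two sentences, and one misread letter splits a word into two tokens. Errors like these feed into every later phase, which is consistent with the slide's point that each phase is affected, sometimes differently.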

16 CDIP Parse Tree Complexity

17 Clean Text Parse Tree Complexity

18 Next Steps
• Experiment with custom lexical analysis of the OCR
  • Start with simple white space detection
  • Construct a lexicon and look for out-of-vocabulary tokens as candidate OCR errors
  • Rewrite the analyzer to support OCR error correction
• Run sentence boundary detection and parsing over the full corpus
• Generate entity relationships using our question answering framework
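The lexicon step could look roughly like the sketch below. The slides do not say how the lexicon would be built, so a frequency threshold over the corpus itself is an assumption, as are the function names, the `min_count` parameter, and the toy corpus.

```python
from collections import Counter


def whitespace_tokens(text):
    """Simple white space tokenization."""
    return text.split()


def normalize(token):
    """Lower-case a token and strip surrounding punctuation."""
    return token.strip(".,;:!?()[]\"'").lower()


def build_lexicon(texts, min_count=2):
    """Treat any normalized token seen at least min_count times as known."""
    counts = Counter()
    for text in texts:
        for tok in whitespace_tokens(text):
            norm = normalize(tok)
            if norm:
                counts[norm] += 1
    return {word for word, c in counts.items() if c >= min_count}


def oov_candidates(text, lexicon):
    """Return tokens whose normalized form falls outside the lexicon."""
    out = []
    for tok in whitespace_tokens(text):
        norm = normalize(tok)
        if norm and norm not in lexicon:
            out.append(tok)
    return out


# Invented toy corpus; "srnoke" imitates an OCR misreading of "smoke".
corpus = [
    "the smoke study was completed",
    "the smoke study was delayed",
    "the srnoke study was completed",
]
lexicon = build_lexicon(corpus)
print(oov_candidates(corpus[2], lexicon))
print(oov_candidates(corpus[1], lexicon))
```

The second call flags "delayed", a legitimate but rare word, so the output is best read as a list of candidates for review or correction rather than confirmed errors.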

19 And Beyond That…
• Return to the document images and analyze document layout
  • Regenerate OCR to include token coordinates
  • Use our PDF structure extraction framework to generate logical document structure
• Generate a set of document models based upon similar layout
• Use the document models to map OCR text to metadata elements

20 For Example


