Download presentation
Presentation is loading. Please wait.
Published byDarlene Harris Modified over 9 years ago
1
India Research Lab Auto-grouping Emails for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM Research – India*IBM Software Group
2
India Research Lab | Outline of the Talk eDiscovery Process A new way of eDiscovery Review: Group Level Review Creating Syntactic Groups Creating Semantic Groups Experiments and Conclusion
3
India Research Lab | eDiscovery Process Discovery: Process in pre-trial phase - Produce relevant information eDiscovery: FRCP 2006 amendment - Produce relevant Electronically Stored Information (ESI) Emails, chats, word docs, presentations etc. Huge volumes of ESI - Process is expensive - 60% of cases warrant some form of eDisovery - 4.8 billion dollars industry in 2011
4
India Research Lab | eDiscovery Process High cost due to review stage - Lawsuit between Clinton administration and tobacco companies (U.S. Vs. Philip Morris) Apply Text Mining Techniques to reduce high costs involved in eDiscovery Process
5
India Research Lab | Named entity annotator Language Annotator Signature Annotator Architecture of eDiscovery Review Systems
6
India Research Lab | Group Level Review Review groups of documents that are “related” instead of individual documents - Mark whole group as responsive/unresponsive or privileged - Efficient and consistent - Syntactically Similar Documents Automated messages, Near and exact duplicates - Semantically Similar Documents Threads, semantic categories
7
India Research Lab | Detecting Syntactic Groups: Automated Messages
8
India Research Lab | Detecting Near Duplicates S1: I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in.ibm.com in case of any need xyz@in.ibm.com S2: I am away from 26/7/2011 to 31/7/2011. Please mail abc@us.ibm.com in case of any need abc@us.ibm.com Notion of Similarity: Resemblance Use fingerprinting (Rabin) instead of actual chunks.
9
India Research Lab | Efficient Detection of Near Duplicates For a document of length n words there would be - n-K+1 chunks with a window size of K It suffices to keep for each document a relatively small fixed size signature Let S n be the set of permutations of [n] And let P be chosen uniformly at random over S n
10
India Research Lab | Signature Annotator In practice choosing the permutations randomly is hard Use a set of n one-to- one functions f i and keep only the smallest value for each f i Keep only j lowest significant bits for each value
11
India Research Lab | Discovering Automated Messages Generating groups of near duplicate – Index Based Clustering - For each document d in index I do If d is not covered - Let S = {S 1, S 2, …, S n } be the signature of document d - D = Query(I, atleast(S,k)) - For each document d’ in D d’ is covered Discovering Groups of Automated Messages - Automated Messages, Group of bulk emails, Group of forward emails Use MD5 to detect bulk emails. Emails with one segment are automated messages
12
India Research Lab | Detecting Semantic Groups: Email Threads A tree like structure A link denotes that the child node was written as a reply to the parent node. Capture the context in which an email was written
13
India Research Lab | Detecting Email Threads Meta data based methods - Headers are not consistently used Content of old mail remains in the new mail - A segment contains text of only one communication An email e i contains e j iff e i approximately contains all the segment of e j
14
India Research Lab © 2007 IBM Corporation Method for Thread Detection Email Segment Generator (ESG) –Creates segments of it where each segment contains content of only one email. Segment Signature Generator (SSG): –Generates a signature for a segment Use near duplicate signatures For practical implementation, we limit on the number of segment signatures (N) that can be associated with an email, e.g. 20 segments.
15
India Research Lab © 2007 IBM Corporation Method: Processing at Indexing Time w1w1 w2w2 wnwn Word index ESG SSG Meta index Signature index
16
India Research Lab © 2007 IBM Corporation Method: Processing at Query Time q Word index w1w1 w2w2 wnwn Meta indexSignature index Generating Candidate Thread Set Use Signature Of First Segment
17
India Research Lab | Detecting Email Threads Given a Candidate Thread Set - Identify the email with only root segment - An email e c is child of an email e p if e c minimally contains e p
18
India Research Lab | Creating Semantic Categories Focus Categories - Documents that are likely to be responsive - Legal Content, Financial Communication, Intellectual Property - High recall Filter Categories - Documents that are likely to be unresponsive - Bulk emails, Private communication, Jokes - High precision
19
India Research Lab | Creating Semantic Categories Email Segmentation Pattern based annotation: Use System T based method Consolidation - Each concept is independent - Apply additional constraints over concepts
20
India Research Lab | Experiments – Near Duplicate Detection Enron Corpus - 517K emails from 150 users Measuring precision - Manually evaluated near duplicate set for 500 queries - With more bits precision is 100% even with 40% similarity threshold Only 33.3 % emails are unique
21
India Research Lab | Experiments – Email Thread Detection No ground truth for threads Subject approximation Method: Based on “Re:”, “Fw:” etc in subject Manually verified the results of thread for our method and subject approximation method - The union of correct emails in thread for both approaches is treated as ground truth.
22
India Research Lab | Experiments – Semantic Group Ground truth: Sampled 2200 emails using generic keywords and then manually labeled
23
India Research Lab | Conclusions We developed a framework that allow group level review of documents We developed methods for finding syntactic groups such as automated messages for creating groups We developed methods for finding email threads and semantic groups We showed significant reduction in the review time by using the group level review and integrated the proposed techniques with IBM Infosphere eDiscovery Analyzer product
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.