Analysis 360: Blurring the line between EDA and PC Andrea Gibson, Product Director, Kroll Ontrack March 27, 2014.



Discussion Overview
• Pushing the Boundaries of Early Data Analysis (EDA)
• Examining Traditional EDA Tools
• Leveraging Predictive Coding (PC) for Analysis
• Using PC in an EDA Environment

Pushing the Boundaries of EDA

EDA | an acronym worth defining
• Early Data Analysis (EDA) aids fact-finding and narrows the data scope by helping attorneys understand their datasets
  »Triage data into critical and non-critical groupings
  »Identify and reduce the number of key players
  »Test search terms
  »Identify critical case arguments
  »Categorize documents as efficiently as possible for production
• A true methodology: technology fuels human decisions

Traditional EDA | Overview
[Workflow diagram] Stages: Identify, Collect & Process; Analysis; Export to Review Platform; Import & Perform Early Analysis; Document Review
Tasks shown in the diagram:
  »Filter, Search, Cluster, Processing
  »Ensure portability of groups and tags
  »Ensure production/search capabilities of review platform
  »Search, Tag, Redact
  »Log, Route, Report
  »Test, QC

Where does Predictive Coding fit in?
[Workflow diagram with a "Predictive Coding!" callout] Stages: Identify, Collect & Process; Analysis; Export to Review Platform; Import & Perform Analysis; Document Review
Tasks shown in the diagram:
  »Filter, Search, Cluster, Processing
  »Ensure portability of groups and tags
  »Ensure production/search capabilities of review platform
  »Test, QC
  »Search, Tag, Redact
  »Log, Route, Report

Traditional EDA | How efficient is it?
[Workflow diagram with a "Predictive Coding!" callout] Stages: Identify, Collect & Process; Analysis; Export to Review Platform; Import & Perform Analysis; Review
Tasks shown in the diagram:
  »Filter, Search, Cluster
  »Ensure portability of groups and tags
  »Ensure production/search capabilities of review platform
  »Search, Tag, Redact
  »Log, Route, Report
  »Test, QC
The Bermuda Triangle of ediscovery
  »PC is massively underused
  »The tools used during analysis and review overlap substantially
  »Pointless inefficiencies are created by jockeying data between two standalone platforms

EDA + Review | Could it look like this?
[Workflow diagram] Stages: Identify, Collect & Process; Analyze and Review
Tasks shown in the diagram:
  »Process, PC, Filter, Search, Cluster, Test, QC, Route, Report, Tag

Examining Traditional EDA Tools

Keyword Search & Concept Search
Keyword Search
  »Uses search terms and Boolean operators (&, or, not) to retrieve documents that contain those exact terms
  »Standard practice
  »Generally accepted in the courts
  »Example: “baseball & field”
Concept Search
  »Technology alternative
  »Allows reviewers to find documents with similar conceptual terms even if they do not contain exact search terms
  »Seldom used for filtering; increasingly used for review
  »Example: “baseball” → diamond, MLB, hit, out
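
As a rough illustration of the exact-term behavior described on this slide, the sketch below runs the “baseball & field” query as a Boolean AND over three made-up documents. The documents and the helper names `tokens` and `matches_all` are hypothetical, and real review platforms handle stemming, phrases and operators far more thoroughly.

```python
import re

def tokens(text):
    """Lowercase a document and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all(text, terms):
    """Boolean AND: True only if every term appears as an exact token."""
    return set(terms) <= tokens(text)

# Hypothetical documents, for illustration only.
docs = {
    "doc1": "The baseball field was flooded after the storm.",
    "doc2": "He hit a home run at the diamond last night.",
    "doc3": "Baseball season opens in April.",
}

hits = [name for name, text in docs.items()
        if matches_all(text, ["baseball", "field"])]
print(hits)  # ['doc1']
```

Note that doc2 is about baseball but contains neither search term, which is the gap concept search is meant to close.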

Topic Grouping & Language Identification
Topic Grouping
  »Documents automatically grouped by theme without human input (figure label: Finance)
Language Identification
  »Identify all languages in a document
  »Used to group and sort documents for review by multilingual reviewers

Threading & Near Deduplication
Threading
  »Identifies and groups conversations based on content
  [Figure labels: Start-Point, RE:, FWD:, End-Point]
Near Deduplication
  »Reviewers can quickly identify and compare documents that are very similar to one another but are not exact duplicates
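
One common way to approximate “very similar but not exact duplicates” is Jaccard similarity over word shingles. The sketch below is only an illustration of that general idea, not the method any particular platform uses; the two sample sentences and the 3-word shingle size are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical near-duplicates that differ by one word.
a = "please review the attached contract before friday"
b = "please review the attached contract before monday"

similarity = jaccard(a, b)
print(round(similarity, 2))  # 0.67
```

A platform would typically flag pairs above some tuned threshold as near-duplicates; the right threshold depends on the data set.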

Finding a Common Thread
• At their cores, these tools help attorneys learn more about their data
  »Does PC fit the bill?
[Figure: Analytical Tools (Topic Group, Key Word Search, Language ID, Dedupe, Threading, Concept Search) alongside Predictive Coding]

Leveraging PC for Analysis

Predictive Coding for Production

Predictive Coding For Analysis
• PC has been praised for its ability to reduce the number of documents manually reviewed during first pass
• But at least three critical components of PC empower attorneys with unrivaled knowledge about their case:
  »Prioritization
  »Categorization
  »Active Learning

The Prioritization Component
[Figure: document counts, Responsive vs. Non-responsive]
• Learns from reviewer decisions and escalates documents based on two binary categories
  »Responsive or nonresponsive
  »Works based on modest amount of learning
• Increases the ratio of responsive documents that get routed to reviewers

The Prioritization Component
• How does this help attorneys analyze their case?
  »When attorneys ‘check out’ documents to review, they are seeing those documents most likely to be responsive
  »For the same reasons this speeds up production, attorneys who put eyes on these richly relevant documents will know more about their case earlier, driving arguments and filling knowledge gaps
  »It runs in the background; you don’t need to carve into billable hours to test keywords
[Figure labels: Request batch, Entire Corpus]
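
The “request batch” idea can be sketched as a simple ranking: hand reviewers the not-yet-reviewed documents with the highest predicted responsiveness. The scores, document IDs and the `next_batch` helper below are hypothetical; an actual PC system would produce the scores from its trained classifier.

```python
def next_batch(scores, reviewed, batch_size=2):
    """Return the unreviewed documents most likely to be responsive."""
    pending = [doc for doc in scores if doc not in reviewed]
    pending.sort(key=lambda doc: scores[doc], reverse=True)
    return pending[:batch_size]

# Hypothetical predicted-responsiveness scores (0.0 to 1.0).
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.6, "d4": 0.8}
reviewed = {"d2"}  # already checked out and coded

batch = next_batch(scores, reviewed)
print(batch)  # ['d4', 'd3']
```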

The Categorization Component
• Learns from trainer decisions and suggests coding on multiple categories for an entire collection of documents
• Assigns a predicted responsiveness score
• Improves speed and quality of categorization decisions
[Figure labels: Responsive, Non-responsive, Privileged; example scores shown: 75%, 67% and 89% predicted]

• How does this help attorneys analyze their case?
  »Allows attorneys to segregate data at user-defined predicted responsiveness ratings after modest training
  »Empowers attorneys to route certain categories of documents (e.g. “hot” docs) to certain sub-groups within the team
[Figure: Post Round One Categorization Results (65% cutoff); axis: % likelihood to be responsive, 0% to 100%; groups of 1,427 docs and 9,522 docs; sample message “To: Brief-writer Bryan, Re: Good Luck on the first draft!”]
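
Segregating data at a user-defined cutoff amounts to splitting documents by predicted score. The sketch below uses the 65% cutoff from the slide; the document names, scores and the `split_at_cutoff` helper are made up for illustration.

```python
def split_at_cutoff(scores, cutoff=0.65):
    """Split documents into those at or above the cutoff and those below it."""
    above = [doc for doc, score in scores.items() if score >= cutoff]
    below = [doc for doc, score in scores.items() if score < cutoff]
    return above, below

# Hypothetical predicted likelihoods of responsiveness.
scores = {"memo": 0.91, "invoice": 0.40, "email": 0.65, "newsletter": 0.12}

likely_responsive, the_rest = split_at_cutoff(scores)
print(likely_responsive)  # ['memo', 'email']
print(the_rest)           # ['invoice', 'newsletter']
```

The first group could then be routed to a particular sub-team, such as whoever is drafting the brief.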

The Active Learning Component
• Key component of any true PC solution
  »Automatically escalates focus documents for training (as opposed to just handpicked, or just randomly selected training documents)
• Focus Documents:
  »Come from grey areas in the classifier because the machine is currently uncertain whether they are responsive or not responsive
  »Ideal candidates to improve machine learning
  »Not random, but queried
[Figure: predicted-responsiveness scale in 10% steps, from non-responsive to responsive]
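
Picking documents from the classifier’s grey area is often done with uncertainty sampling: choose the documents whose predicted score is closest to 50%. This is a minimal sketch of that general technique, not a description of any vendor’s implementation; the scores and the `select_focus_documents` helper are hypothetical.

```python
def select_focus_documents(scores, k=2):
    """Return the k documents whose scores are closest to 0.5 (most uncertain)."""
    return sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:k]

# Hypothetical predicted-responsiveness scores.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

focus = select_focus_documents(scores)
print(focus)  # ['b', 'd']
```

Documents "a" and "c" are skipped because the machine is already fairly sure about them, so coding them would teach it little.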

The Active Learning Component
• How does this help attorneys analyze their case?
  »Introduces attorneys to the documents on the fringe of relevancy
    –These could be case-changing documents that the machine just doesn’t know enough about yet
  »Most effective way to boost metrics and improve results between early training rounds
    –Reduces false positives; improves accuracy of machine’s concept of relevancy
[Figure: Precision and Recall across training rounds TR 1 and TR 2]

Additional Efficiencies
• Production
  »Can easily transition into production, whether leveraging PC or not
    –Most practical form of PC for EDA
• Reporting
  »Even if just one or two training rounds are performed, metrics will show where you stand
    –In this vein, no other EDA tool comes close to PC’s automatic reporting
    –There’s a reason courts often ask for recall and precision: these indicate whether your understanding of the data set is accurate
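
For reference, precision is the share of documents predicted responsive that truly are responsive, and recall is the share of truly responsive documents that the prediction found. The sketch below computes both from two sets of document IDs; the IDs are invented for illustration.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from sets of document IDs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical IDs: what the machine predicted vs. what reviewers confirmed.
predicted_responsive = {1, 2, 3, 4}
actually_responsive = {2, 3, 4, 5, 6}

precision, recall = precision_recall(predicted_responsive, actually_responsive)
print(precision, recall)  # 0.75 0.6
```

Here three of the four predicted documents are correct (75% precision), but only three of the five responsive documents were found (60% recall).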

Additional Efficiencies
• Other ECA tools complement predictive coding
  »Predictive coding requires reviewing a few thousand documents in training
    –Most PC solutions also come equipped with all other EDA tools available
    –This helps you navigate the training set as well as the documents during review
• Intra-team quality control
  »Can compare reviewer-machine agreement rates side-by-side
  »Identify points of disagreement and inconsistency

Additional Efficiencies
• The small case conundrum
  »The analytical value from PC is greater where the subject-matter expert who trains the system is the same attorney who forms case strategy
    –This is most likely true in small-medium cases where one attorney may be in charge of a case through trial
  »The production value from using PC to aid review is greater where high upfront costs can be recouped by applying the machine’s logic to a large number of documents
    –Traditionally, this has been true only in large cases

Additional Efficiencies
• This is all changing
• The “portfolio approach” to ediscovery
  »Pay yearly for PC (and everything that preceded it) in all your cases for a data hosting fee (process on the vendor’s side)
    –Upload on day one, train on day one, see a list of documents ranked by relevancy on day one

Using PC in an EDA Environment

Overview
• It’s not that crazy
  »EDA tools let you learn more about your data; so does PC
  »Many of the tools discussed today (e.g. de-duplication, concept searching) already exist in standalone “PC solutions”
• Aggressive culling via keywords can have an impact on training in PC
• Any search strategy must be well designed according to the matter at hand
• The producing party has substantial deference in conducting its search

Pre-PC Keyword Cull?
• In re Biomet
  »Defendant’s search strategy: 19.5 million documents → Keyword → 3 million documents → PC → Production
  »Plaintiffs argued: the defendant should have used PC on the whole 19.5 million document corpus; the keywords tainted the training. We want joint review of training docs.
  »Court held: defendant’s search was reasonable

Parting Thoughts
• There are many ways to learn about data
  »Different tools on the same belt; multi-modal search
• Solutions are emerging that offer all of these tools in one location
  »No more data jockeying
  »More information for better decisions
• Quality control is essential whenever you use one of these tools to remove documents from production