RANDOM SAMPLING PRACTICAL APPLICATIONS IN eDISCOVERY

WELCOME

• Thank you for joining
• Numerous diverse attendees
• Today’s topic and presenters
• Question submission for later response
• You will receive slides, recording and survey tomorrow
• Coming up next month – SharePoint webinar

SPEAKERS

• Matthew Verga – Director, Content Marketing and eDiscovery Strategy

TODAY’S TOPICS

• Sampling’s opaque ubiquity and the dark ages of discovery
• Finding out what’s in a new dataset, aka estimating prevalence
• Finding out how good a search string is, aka testing classifiers
• Finding out how good your reviewers are, aka quality control
• Finding out how much stuff you missed, aka measuring elusion

SAMPLING’S OPAQUE UBIQUITY AND THE DARK AGES OF DISCOVERY

INTRODUCTION

• The topic of “sampling” comes up constantly
  – when referring to collections,
  – early case assessment,
  – and review
    o both human
    o and technology-assisted
• Before review software incorporated sophisticated sampling tools, practitioners were taking samples manually

INTRODUCTION, CONT.

• When I started out, nearly 8 years ago, the best wisdom:
  – Included iterative testing of search strings by partners or senior attorneys, who would informally sample the results of each revised search string to inform their next revision
  – Suggested employing a 3-pass document review process with successively more senior attorneys performing each pass:
    o The first pass reviewed everything
    o The second pass re-reviewed a random 10% sample
    o And the third pass re-reviewed a random 5% sample

INTRODUCTION, CONT.

• But, but, but:
  – Why is a search that returns more documents than expected invalid?
  – How many search results are enough to sample?
  – Why re-review 10% and 5%?
  – What’s the basis?
    o (Answer Key: It’s not necessarily; Not just however many “feels right”; Mostly because it’s what was done before; There isn’t much of one)
• Turns out, law schools need to add statistics courses

FINDING OUT WHAT’S IN A NEW DATASET, AKA ESTIMATING PREVALENCE

ESTIMATING PREVALENCE

• Finding out what’s in a new, unknown dataset
• Prevalence
  – Prevalence is the portion of a dataset that is relevant to a particular information need
  – For example, if one third of a dataset was relevant in a case, the prevalence of relevant materials would be 33%
  – Always known at the end of a document review project
• Why estimate it at the beginning?

ESTIMATING PREVALENCE, CONT.

• Knowing the prevalence of relevant materials can guide the selection of culling and review techniques to be employed
  – (It can also provide a measuring stick for overall progress)
• Knowing the prevalence of different subclasses of materials can guide decisions about resource allocation and prioritization
  – (e.g., associates vs. contract attorneys vs. LPO)
• Knowing the prevalence of specific features facilitates more accurate estimation of project costs
  – (e.g., volume to review, volume to redact, volume to privilege log, etc.)

ESTIMATING PREVALENCE, CONT.

• Estimating the prevalence of one or more features of a new, unknown dataset is fundamentally valuable because it provides discovery intelligence for data-driven decision making, replacing gut feelings and anecdotes with data and knowledge

ESTIMATING PREVALENCE, CONT.

• Now that we know why estimating prevalence can be valuable, how do we do it?
• Steps:
  – Step 1: Identify your sampling frame
  – Step 2: Determine your needed sample size
  – Step 3: Take and review your simple random sample
  – Step 4: Calculate your prevalence estimate

ESTIMATING PREVALENCE, CONT.

• Step 1: Identify your sampling frame
• Generally, the same pool that would be subjected to review:
  – A pool with system files removed (de-NISTed)
  – A pool with documents outside of any applicable date range removed
  – A pool that has been de-duplicated
  – A pool to which any other obvious, objective culling criteria have been applied
    o (e.g., court-mandated keyword or custodian filtering)

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size
• The sample size you should take depends on:
  – The strength of the measurement you wish to take
  – The size of your sampling frame
  – The prevalence of relevant material within the dataset
• Let’s look at how each affects sample size

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size, cont.
  – The strength of the measurement you want to take
• Expressed through two values: confidence level and confidence interval
  – Confidence level
    o How certain you are about the results you get
    o How many times out of 100 a repeated sample would yield a result within the stated interval
    o Typically 90%, 95%, or 99%
  – Confidence interval
    o How precise your results are
    o How much uncertainty there is in your results
    o Typically between +/-2% and +/-5%

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size, cont.
  – The strength of the measurement you want to take
• The higher the confidence level sought, the larger the sample size that will be required
• The narrower the confidence interval sought, the larger the sample size that will be required

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size, cont.
  – The size of your sampling frame
• The larger the sampling frame, the larger the sample size needed – but only up to a point
  – Required sample size eventually levels off
  – The sample size needed for 1,000,000 documents is not much larger than the sample size needed for 100,000 documents
  – Potential for cost savings compared to old methods

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size, cont.
  – The prevalence of relevant material within the dataset
• Required sample size shrinks as the assumed prevalence moves away from 50% in either direction
  – When taking a sample to estimate prevalence, prevalence is unknown
  – So, the most conservative option (i.e., the one resulting in the largest sample size) should be used, which is 50%
  – Most sampling calculators default to this; on some it is not variable

ESTIMATING PREVALENCE, CONT.

• Step 2: Determine your sample size, cont.
• An Example:
  – Sampling frame of 1,000,000 documents
  – Desired strength of 95% confidence, +/-2%
  – Assumed prevalence of 50%
• Resulting Sample Size: 2,396 Documents
• Sampling calculators are integrated into most review tools and are also available online
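For readers who want to check the arithmetic themselves, here is a minimal sketch, assuming the standard normal-approximation sample-size formula with finite population correction (the formula most online sampling calculators implement); it reproduces the 2,396 figure above and also illustrates the leveling-off effect noted two slides earlier:

```python
import math

Z = {0.90: 1.6449, 0.95: 1.9600, 0.99: 2.5758}  # two-sided z-scores by confidence level

def sample_size(population, confidence=0.95, interval=0.02, prevalence=0.50):
    """Required simple-random-sample size, normal approximation with
    finite population correction."""
    z = Z[confidence]
    n0 = (z ** 2) * prevalence * (1 - prevalence) / interval ** 2  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))             # finite population correction

print(sample_size(1_000_000))  # -> 2396, matching the example above
print(sample_size(100_000))    # -> 2345, barely smaller despite 10x fewer documents
```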

ESTIMATING PREVALENCE, CONT.

• Step 3: Take and review your simple random sample
• Simple random sample: one in which every document has an equal chance of being selected
  – Most modern review programs include sampling tools
  – Spreadsheet programs can generate lists of random numbers
• Ensure the highest quality review possible, as any errors in the review of the sample will be effectively amplified in the estimations based on that review
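If your review platform lacks a built-in sampling tool, drawing a simple random sample takes only a few lines; a minimal sketch, where the ID range and seed value are placeholders for your actual document identifiers and protocol:

```python
import random

random.seed(20160127)                  # fix the seed so the sample is reproducible and documentable
doc_ids = list(range(1, 1_000_001))    # placeholder: IDs of the documents in your sampling frame
sample = random.sample(doc_ids, 2396)  # every document has an equal chance of being selected
```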

ESTIMATING PREVALENCE, CONT.

• Step 4: Calculate your prevalence estimate
• Resulting Sample Size: 2,396 Documents (95%, +/-2%)
  – If review finds 599/2,396 (i.e., 25%) are relevant, then…
    o You would have 95% confidence that the overall prevalence of relevant documents is between 23% and 27%
    o You would have 95% confidence that between 230,000 and 270,000 of the 1,000,000-document sampling frame are relevant
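A sketch of the arithmetic behind those numbers; note it applies the planned +/-2% interval around the observed rate, as the slide does, which is slightly conservative because the exact interval at an observed 25% is a bit narrower than one planned at an assumed 50%:

```python
frame_size = 1_000_000
sample_size = 2_396
relevant_in_sample = 599
interval = 0.02                                 # the +/-2% the sample was sized for

p_hat = relevant_in_sample / sample_size        # 0.25 -> 25% point estimate
low, high = p_hat - interval, p_hat + interval  # 23% .. 27%

print(f"Estimated prevalence: {p_hat:.0%} ({low:.0%} to {high:.0%})")
print(f"Estimated relevant docs: {low*frame_size:,.0f} to {high*frame_size:,.0f}")
# -> 230,000 to 270,000 of the 1,000,000-document frame
```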

ESTIMATING PREVALENCE, CONT.

• Beyond relevant documents, you could measure the prevalence of:
  – Privileged documents
  – Documents requiring redaction
  – Documents presenting HIPAA or FOIA issues
  – Documents requiring other special handling
• You can use this information to:
  – Guide the selection of culling and review techniques
  – Provide a measuring stick for overall progress
  – Guide decisions about resource allocation
  – Estimate project costs more accurately

FINDING OUT HOW GOOD A SEARCH STRING IS, AKA TESTING CLASSIFIERS

TESTING CLASSIFIERS

• Finding out how good a search string is
• Classifiers
  – Tools, mechanisms, or processes by which documents are classified into categories like responsive/non-responsive or privileged/non-privileged
  – The tools, mechanisms, and processes employed could include:
    o Keyword or Boolean searches
    o Individual human reviewers
    o Overall team review processes
    o Machine categorization by latent semantic indexing
    o Predictive coding by probabilistic latent semantic analysis

TESTING CLASSIFIERS, CONT.

• What’s the value of testing classifiers?
• Testing classifiers, like estimating prevalence, is a source of discovery intelligence to guide data-driven decision making
  – First, testing classifiers can improve the selection and iterative refinement of search strings and other classifiers
  – Second, testing provides a stronger basis for argument for or against particular search strings or classifiers during pre-trial proceedings

TESTING CLASSIFIERS, CONT.

• The efficacy of classifiers is expressed through two values: recall and precision
  – Recall is a measurement of how much of the material sought was returned by the classifier
    o e.g., if 250,000 relevant documents exist and a search returns 125,000 of them, it has a recall of 50%, finding half of what is sought
    o Under-inclusiveness
  – Precision is a measurement of how much unwanted material is returned by a classifier
    o e.g., if a search returns 150,000 documents of which only 50,000 are relevant, it has a precision of just 33%, having returned 100,000 unwanted items
    o Over-inclusiveness

TESTING CLASSIFIERS, CONT.

• Testing classifiers before applying them to a full dataset requires the creation of a control set against which they can be tested
• A control set is a representative sample of the full dataset that has already been classified by the best reviewers possible so that it can function as a gold standard
• The classifier is run against the control set, and its classifications are compared against the experts’

TESTING CLASSIFIERS, CONT.

• Example
  – A simple random sample of 2,396 documents has been reviewed by subject matter experts for relevance, to function as a control set
  – A search string proposed by plaintiff is run against the control set
  – It returns 1,200 of the 2,396 documents, a mix of documents deemed relevant and not relevant by the subject matter experts
  – How do you sort out the recall and precision?

TESTING CLASSIFIERS, CONT.

                            Relevant (SMEs)    Not Relevant (SMEs)
  Returned (P. Search)      400 (TP)           800 (FP)
  Not Returned (P. Search)  600 (FN)           596 (TN)

• There are 400 true positives
  – Deemed relevant by SMEs and returned by P. Search
• There are 596 true negatives
  – Deemed not relevant by SMEs and not returned by P. Search
• There are 800 false positives
  – Deemed not relevant by SMEs but returned by P. Search
• There are 600 false negatives
  – Deemed relevant by SMEs but not returned by P. Search

TESTING CLASSIFIERS, CONT.

• Calculating recall:
  – Recall is the percentage of all relevant documents correctly returned
  – 400 (TP) out of 1,000 (TP + FN) = .40 = 40%
  – The P. Search has recall of 40%
• Calculating precision:
  – Precision is the percentage of the returned documents that are relevant
  – 400 (TP) out of 1,200 (TP + FP) = .33 = 33%
  – The P. Search has precision of 33%
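The same computation as a minimal code sketch, using the confusion-matrix counts from the example above:

```python
# Confusion-matrix counts from the control set example above
tp, fp, fn, tn = 400, 800, 600, 596

recall = tp / (tp + fn)     # share of all relevant documents the search returned
precision = tp / (tp + fp)  # share of returned documents that are actually relevant

print(f"Recall:    {recall:.0%}")     # -> 40%
print(f"Precision: {precision:.0%}")  # -> 33%
```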

FINDING OUT HOW GOOD YOUR REVIEWERS ARE, AKA QUALITY CONTROL

QUALITY CONTROL

• Finding out how good your reviewers are
• Accuracy and error rate in document review
  – Accuracy is the measure of how many reviewer determinations are correct
  – Error rate is the measure of how many reviewer determinations are incorrect
    o Together, accuracy and error rate should total 100%
• Like testing classifiers, quality control involves the comparison of one set of classifications (the initial reviewer’s) to another (those of the more senior reviewer performing quality control review)

QUALITY CONTROL, CONT.

• Lot acceptance sampling:
  – Methodology employed in pharmaceutical manufacturing, military contract fulfillment, and other high-volume, quality-focused processes
• How it works:
  – A sampling protocol and maximum acceptable error rate are established
  – Each lot has a random sample taken from it for testing
  – If the acceptable error rate is exceeded, the entire lot is rejected without further evaluation
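In document review terms, the accept/reject decision might look like the following minimal sketch; the function name, the qc_review hook, and the 50-document / 5% defaults are illustrative assumptions, not prescribed values:

```python
import random

def check_batch(doc_ids, first_pass_calls, qc_review,
                sample_size=50, max_error_rate=0.05):
    """Lot acceptance sampling for one review batch: draw a random sample,
    have a senior reviewer (qc_review) re-call each sampled document, and
    reject the whole batch if the observed error rate exceeds the threshold.

    first_pass_calls maps doc_id -> initial reviewer's determination;
    qc_review(doc_id) returns the QC reviewer's determination (hypothetical hook).
    """
    sampled = random.sample(doc_ids, min(sample_size, len(doc_ids)))
    errors = sum(1 for d in sampled if first_pass_calls[d] != qc_review(d))
    error_rate = errors / len(sampled)
    decision = "accept" if error_rate <= max_error_rate else "reject"
    return decision, error_rate
```

A rejected batch would then go back for complete re-review, and rejections can be tallied per reviewer or team, as described below.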

QUALITY CONTROL, CONT.

• In the context of document review, the “lot” can correspond to individual reviewer batches or to aggregations, such as an individual reviewer’s total completed review for each day
• Statistics on batch rejection can be tracked by reviewer, by team, by source material, or by other useful properties
• Choosing not to acknowledge or measure the error rate in a document review project does not mean that it does not exist

FINDING OUT HOW MUCH STUFF YOU MISSED, AKA MEASURING ELUSION

MEASURING ELUSION

• Finding out how much stuff you missed
• Elusion
  – Elusion is a measure of how much relevant material was left behind
  – Typically referenced in the context of predictive coding and the portion of a dataset left unreviewed by a technology-assisted review process
• An elusion measurement can also be thought of as an estimation of prevalence specific to the remainder pile

MEASURING ELUSION, CONT.

• Example
  – Even after targeted collection and processing, 1,000,000 documents remain
  – Use keyword and Boolean searches to cull that 1,000,000 to 250,000 for human document review
  – In addition to documenting the culling techniques employed and their rationales, an elusion/prevalence measurement can be taken

MEASURING ELUSION, CONT.

• Example, cont.
  – For a remainder pile of 750,000 documents
  – Reviewing a simple random sample of just 16,229 documents
  – Will allow you to measure the elusion
    o (aka the prevalence of relevant materials in the remainder)
  – With a 99% confidence level
  – And a confidence interval of +/-1%

MEASURING ELUSION, CONT.

• Example, cont.
  – Assuming, hypothetically, that your review found 5% of that sample of the remainder to be relevant
    o You would then be able to say with 99% confidence
    o That between 4% and 6% of the remainder is relevant
  – (aka 30,000–45,000 documents out of 750,000)
  – You would also learn about what types of missed materials exist
    o To develop a method of targeted retrieval
    o To demonstrate low utility of retrieval
  – (i.e., by showing the material is duplicative or of low probative value relative to the additional expense required for retrieval)
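A sketch tying the elusion example together, using the same normal-approximation sample-size formula sketched earlier (z = 2.5758 for 99% confidence); as in the prevalence example, the +/-1% planning interval is applied conservatively around the observed rate:

```python
import math

def sample_size(population, z, interval, prevalence=0.50):
    """Normal-approximation sample size with finite population correction."""
    n0 = (z ** 2) * prevalence * (1 - prevalence) / interval ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

remainder = 750_000
print(sample_size(remainder, z=2.5758, interval=0.01))  # -> 16229, as in the example

observed_rate = 0.05                                 # 5% of the sample found relevant
low, high = observed_rate - 0.01, observed_rate + 0.01
print(f"Elusion: {low:.0%} to {high:.0%} of the remainder "
      f"({low*remainder:,.0f} to {high*remainder:,.0f} documents)")
# -> 4% to 6% of the remainder (30,000 to 45,000 documents)
```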

REVIEW AND PREVIEW

REVIEW

• Estimating prevalence
  – Discovery intelligence
  – Four steps
    o Identify sampling frame
    o Determine sample size (measurement strength)
    o Take and review simple random sample
    o Calculate prevalence estimate
• Testing classifiers
  – Evaluated for recall and precision
    o Recall is how under-inclusive
    o Precision is how over-inclusive
  – Must be tested against a control set

REVIEW, CONT.

• Quality control of human document review
  – Evaluated for accuracy and error rate
  – Lot acceptance sampling
  – Acceptable error rate
• Measuring elusion
  – Elusion is the prevalence of relevant materials in the remainder
  – Measured the same way as estimating prevalence
  – Useful both for process validation and for defensibility

ENGAGE WITH MODUS

• Modus follow-up coming tomorrow with slides, recording, survey and invite to next webinar
• Visit us online for more information on the next webinar, January 27th at 1:00 PM EST, entitled:
  – “Meeting the Microsoft SharePoint Discovery Challenge”
• Make sure to visit our website for valuable white papers, blogs and other information
  – White Paper: “Practical Applications of Random Sampling in eDiscovery”

Thank you!