Published by Stewart Lucas. Modified over 9 years ago.
1
RANDOM SAMPLING PRACTICAL APPLICATIONS IN eDISCOVERY
2
WELCOME Thank you for joining Numerous diverse attendees Today’s topic and presenters Question submission for later response You will receive slides, recording and survey tomorrow Coming up next month – SharePoint webinar
3
SPEAKERS Matthew Verga –Director, Content Marketing and eDiscovery Strategy
4
TODAY’S TOPICS Sampling’s opaque ubiquity and the dark ages of discovery Finding out what’s in a new dataset, aka estimating prevalence Finding out how good a search string is, aka testing classifiers Finding out how good your reviewers are, aka quality control Finding out how much stuff you missed, aka measuring elusion
5
SAMPLING’S OPAQUE UBIQUITY AND THE DARK AGES OF DISCOVERY
6
INTRODUCTION The topic of “sampling” comes up constantly –when referring to collections, –early case assessment –and review o both human o and technology-assisted Before review software incorporated sophisticated sampling tools, practitioners were taking samples manually
7
INTRODUCTION, CONT. When I started out, nearly 8 years ago, the best wisdom: –Included iterative testing of search strings by partners or senior attorneys, who would informally sample the results of each revised search string to inform their next revision –Suggested employing a 3-pass document review process with successively more senior attorneys performing each pass: o The first pass reviewed everything o The second pass re-reviewed a random 10% sample o And the third pass re-reviewed a random 5% sample
8
INTRODUCTION, CONT. But, but, but: –Why is a search that returns more documents than expected invalid? –How many search results are enough to sample? –Why re-review 10% and 5%? –What’s the basis? o (Answer Key: It’s not necessarily; Not just however many “feels right”; Mostly because it’s what was done before; There isn’t much of one) Turns out, law schools need to add statistics courses
9
FINDING OUT WHAT’S IN A NEW DATASET, AKA ESTIMATING PREVALENCE
10
ESTIMATING PREVALENCE Finding out what’s in a new, unknown dataset Prevalence –Prevalence is the portion of a dataset that is relevant to a particular information need –For example, if one third of a dataset was relevant in a case, the prevalence of relevant materials would be 33% –Always known at the end of a document review project Why estimate it at the beginning?
11
ESTIMATING PREVALENCE, CONT. Knowing the prevalence of relevant materials can guide the selection of culling and review techniques to be employed –(It can also provide a measuring stick for overall progress) Knowing the prevalence of different subclasses of materials can guide decisions about resource allocation and prioritization –(e.g., associates vs. contract attorneys vs. LPO) Knowing the prevalence of specific features facilitates more accurate estimation of project costs: –(e.g., volume to review, volume to redact, volume to privilege log, etc.)
12
ESTIMATING PREVALENCE, CONT. Estimating prevalence of one or more features of a new, unknown dataset is fundamentally valuable because it provides discovery intelligence for data-driven decision making, replacing gut-feelings and anecdotes with data and knowledge
13
ESTIMATING PREVALENCE, CONT. Now that we know why estimating prevalence can be valuable, how do we do it? Steps: –Step 1: Identify your sampling frame –Step 2: Determine your needed sample size –Step 3: Take and review your simple random sample –Step 4: Calculate your prevalence estimate
14
ESTIMATING PREVALENCE, CONT. Step 1: Identify your sampling frame Generally, the same pool that would be subjected to review: –A pool with system files removed (de-NISTed) –A pool with documents outside of any applicable date range removed –A pool that has been de-duplicated –A pool to which any other obvious, objective culling criteria have been applied o (e.g., court mandated key word or custodian filtering)
15
ESTIMATING PREVALENCE, CONT. Step 2: Determine your sample size The sample size you should take depends on: –The strength of the measurement you wish to take –The size of your sampling frame –The prevalence of relevant material within the dataset Let’s look at how each affects sample size
16
ESTIMATING PREVALENCE, CONT. Step 2: Determine your sample size, cont. –The strength of the measurement you want to take Expressed through two values: confidence level and confidence interval –Confidence level o How certain you are about the results you get o How many times out of 100 a sample taken the same way would produce a range that captures the true value o Typically 90%, 95%, or 99% –Confidence interval (margin of error) o How precise your results are o How much uncertainty there is in your results o Typically between +/-2% and +/-5%
17
ESTIMATING PREVALENCE, CONT. Step 2: Determine your sample size, cont. –The strength of the measurement you want to take The higher the confidence level sought, the larger the sample size that will be required The narrower the confidence interval sought, the larger the sample size that will be required
18
ESTIMATING PREVALENCE, CONT.
19
Step 2: Determine your sample size, cont. –The size of your sampling frame The larger the sampling frame, the larger the sample size needed – but only up to a point –Required sample size eventually levels off –The sample size needed for 1,000,000 documents is not much larger than the sample size needed for 100,000 documents –Potential for cost savings compared to old methods
20
ESTIMATING PREVALENCE, CONT.
21
Step 2: Determine your sample size, cont. –The prevalence of relevant material within the dataset Sample size decreases as prevalence increases or decreases from 50% –When taking a sample to estimate prevalence, prevalence is unknown –So, the most conservative option (i.e., resulting in the largest sample size) should be used, which is 50% –Most sampling calculators default to this; on some it cannot be changed
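The three factors above can be tied together in one pair of formulas. This is the standard normal-approximation sample size for a proportion with a finite population correction; the slides do not say which formula their calculator uses, so treat it as an assumption. The symbols are labels chosen here: z is the critical value for the confidence level (about 1.645, 1.960 and 2.576 for 90%, 95% and 99%), E is the confidence interval expressed as a fraction (0.02 for +/-2%), p is the assumed prevalence, and N is the size of the sampling frame.

```latex
n_0 = \frac{z^{2}\, p\,(1-p)}{E^{2}},
\qquad
n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}
```

A higher confidence level raises z and a narrower interval lowers E, and both increase n_0. The product p(1-p) peaks at p = 0.5, which is why 50% is the conservative choice. As N grows, n approaches n_0, which is why the required sample size levels off for large sampling frames.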
22
ESTIMATING PREVALENCE, CONT.
23
Step 2: Determine your sample size, cont. An Example: –Sampling frame of 1,000,000 –Desired strength of 95%, +/-2% –Assumed prevalence of 50% Resulting Sample Size: 2,396 Documents Sampling calculators are integrated into most review tools and are also available online
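The example above can be checked with a few lines of Python. This is a minimal sketch that assumes the usual normal-approximation formula with a finite population correction; the function name and parameters are illustrative and do not come from any particular review tool.

```python
import math
from statistics import NormalDist


def sample_size(frame_size, confidence=0.95, margin=0.02, prevalence=0.5):
    """Documents to sample in order to estimate a proportion.

    Assumes the normal approximation plus a finite population correction.
    """
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively unlimited sampling frame.
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    # Shrink it to account for the finite sampling frame.
    n = n0 / (1 + (n0 - 1) / frame_size)
    return math.ceil(n)


print(sample_size(1_000_000))                   # the slide's example
print(sample_size(100_000))                     # a frame one tenth the size
print(sample_size(1_000_000, prevalence=0.10))  # lower assumed prevalence
```

Under these assumptions the first call reproduces the 2,396 documents shown on the slide. The second call shows how little the requirement changes for a much smaller frame, and the third shows how it drops when the assumed prevalence moves away from 50%.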
24
ESTIMATING PREVALENCE, CONT. Step 3: Take and review your simple random sample Simple random sample: one in which every document has an equal chance of being selected –Most modern review programs include sampling tools –Spreadsheet programs can generate lists of random numbers Ensure the highest quality review possible, as any errors in the review of the sample will be effectively amplified in the estimations based on that review
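When a review tool's built-in sampler is not available, a simple random sample can be drawn with Python's standard library. The sketch below assumes documents are identified by consecutive control numbers from 1 to 1,000,000; that numbering and the seed value are hypothetical. Recording the seed makes the draw repeatable, which can help when documenting the process.

```python
import random


def simple_random_sample(doc_ids, sample_size, seed=None):
    """Draw a sample without replacement; every document is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(doc_ids, sample_size)


# Hypothetical sampling frame: control numbers 1 through 1,000,000.
frame = range(1, 1_000_001)
sample = simple_random_sample(frame, 2396, seed=20150127)

print(len(sample))         # number of documents drawn
print(sorted(sample)[:5])  # a few of the selected control numbers
```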
25
ESTIMATING PREVALENCE, CONT. Step 4: Calculate your prevalence estimate Resulting Sample Size: 2,396 Documents (95%, +/-2%) –If review finds 599/2,396 (i.e., 25%) are relevant, then… o You would have 95% confidence that the overall prevalence of relevant documents is between 23% and 27% o You would have 95% confidence that between 230,000 and 270,000 of the 1,000,000 document sampling frame are relevant
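The calculation in Step 4 can be sketched as follows. The slide applies the planned +/-2% interval directly, which is the worst case at 50% prevalence. This sketch instead recomputes the margin from the observed proportion using a normal approximation with a finite population correction, which is an assumption made here and not something the slides prescribe. For 599 of 2,396 it yields a slightly tighter range that sits inside the slide's 23% to 27%.

```python
import math
from statistics import NormalDist


def prevalence_estimate(relevant_in_sample, sample_size, frame_size,
                        confidence=0.95):
    """Return (point estimate, low, high) as fractions of the sampling frame."""
    p = relevant_in_sample / sample_size
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((frame_size - sample_size) / (frame_size - 1))
    margin = z * math.sqrt(p * (1 - p) / sample_size) * fpc
    return p, max(0.0, p - margin), min(1.0, p + margin)


p, low, high = prevalence_estimate(599, 2396, 1_000_000)
print(f"prevalence: {p:.1%} (range {low:.1%} to {high:.1%})")
print(f"documents: {low * 1_000_000:,.0f} to {high * 1_000_000:,.0f}")
```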
26
ESTIMATING PREVALENCE, CONT. Beyond relevant documents, you could measure the prevalence of: –Privileged documents –Documents requiring redaction –Documents presenting HIPAA or FOIA issues –Documents requiring other special handling You can use this information to: –Guide the selection of culling and review techniques –Provide a measuring stick for overall progress –Guide decisions about resource allocation –Estimate project costs more accurately
27
FINDING OUT HOW GOOD A SEARCH STRING IS, AKA TESTING CLASSIFIERS
28
TESTING CLASSIFIERS Finding out how good a search string is Classifiers –Tools, mechanisms or processes by which documents are classified into categories like responsive/nonresponsive or privileged/non-privileged –The tools, mechanisms, and processes employed could include: o Keyword or Boolean searches o Individual human reviewers o Overall team review processes o Machine categorization by latent semantic indexing o Predictive coding by probabilistic latent semantic analysis
29
TESTING CLASSIFIERS, CONT. What’s the value of testing classifiers? Testing classifiers, like estimating prevalence, is a source of discovery intelligence to guide data-driven decision making –First, testing classifiers can improve the selection and iterative refinement of search strings and other classifiers –Second, testing provides a stronger basis for argument for or against search strings or classifiers during pre-trial proceedings
30
TESTING CLASSIFIERS, CONT. The efficacy of classifiers is expressed through two values: recall and precision –Recall is a measurement of how much of the material sought was returned by the classifier o e.g., if 250,000 relevant documents exist and a search returns 125,000 of them, it has a recall of 50%, finding half of what is sought o Low recall indicates under-inclusiveness –Precision is a measurement of how much of the material returned by a classifier is actually wanted o e.g., if a search returns 150,000 documents of which only 50,000 are relevant, it has a precision of just 33%, having returned 100,000 unwanted items o Low precision indicates over-inclusiveness
31
TESTING CLASSIFIERS, CONT. Testing classifiers before applying them to a full dataset requires the creation of a control set against which they can be tested A control set is a representative sample of the full dataset that has already been classified by the best reviewers possible so that it can function as a gold standard The classifier is run against the control set, and its classifications are compared against the experts’
32
TESTING CLASSIFIERS, CONT. Example –A simple random sample of 2,396 documents has been reviewed by subject matter experts for relevance to function as a control set –A search string proposed by plaintiff is run against the control set –It returns 1,200 out of the 2,396 documents, a mix of documents deemed relevant and not relevant by the subject matter experts –How do you sort out the recall and precision?
33
TESTING CLASSIFIERS, CONT.
                            Relevant (SMEs)    Not Relevant (SMEs)
Returned (P. Search)              400                  800
Not Returned (P. Search)          600                  596
There are 400 true positives –Deemed relevant by SME and returned by P. Search There are 596 true negatives –Deemed not relevant by SME and not returned by P. Search There are 800 false positives –Deemed not relevant by SME but returned by P. Search There are 600 false negatives –Deemed relevant by SME but not returned by P. Search
34
TESTING CLASSIFIERS, CONT. Calculating recall: –Recall is the percentage of all relevant documents correctly returned –400 (TP) out of 1000 (TP+FN) =.40 = 40% –The P. Search has recall of 40% Calculating precision: –Precision is the percentage of the returned documents that are relevant –400 (TP) out of 1200 (TP +FP) =.33 = 33% –The P. Search has precision of 33%
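The same arithmetic can be written as a small Python helper, using the counts from the control set example. The function name is illustrative only.

```python
def recall_precision(tp, fp, fn):
    """Compute recall and precision from control-set counts."""
    recall = tp / (tp + fn)     # share of all relevant documents returned
    precision = tp / (tp + fp)  # share of returned documents that are relevant
    return recall, precision


# Counts from the plaintiff's search example.
recall, precision = recall_precision(tp=400, fp=800, fn=600)
print(f"recall={recall:.0%}, precision={precision:.0%}")
# prints "recall=40%, precision=33%"
```

True negatives do not enter either formula, so a search can score well on both measures only by returning relevant documents, not by leaving irrelevant ones behind.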
35
FINDING OUT HOW GOOD YOUR REVIEWERS ARE, AKA QUALITY CONTROL
36
QUALITY CONTROL Finding out how good your reviewers are Accuracy and error rate in document review –Accuracy is the measure of how many reviewer determinations are correct –Error rate is the measure of how many reviewer determinations are incorrect o Together, accuracy and error rate should total 100% Like testing classifiers, quality control involves the comparison of one set of classifications (the initial reviewer’s) to another (that of the more senior reviewer performing quality control review)
37
QUALITY CONTROL, CONT. Lot acceptance sampling: –Methodology employed in pharmaceutical manufacturing, military contract fulfillment, and other high-volume, quality-focused processes How it works: –A sampling protocol and a maximum acceptable error rate are established –Each lot has a random sample taken from it for testing –If the acceptable error rate is exceeded, the entire lot is rejected without further evaluation
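The lot acceptance steps above can be sketched as a short Python routine. Everything named here is hypothetical: `is_error` stands in for the quality control reviewer's judgment that the first-pass coding was wrong, and the sample size and 5% threshold are placeholders that a real protocol would set deliberately.

```python
import random


def accept_lot(lot, is_error, sample_size, max_error_rate, seed=None):
    """Sample a lot and decide whether it passes quality control.

    Returns (accepted, observed_error_rate).
    """
    if not lot:
        raise ValueError("cannot sample an empty lot")
    rng = random.Random(seed)
    sample = rng.sample(lot, min(sample_size, len(lot)))
    errors = sum(1 for doc in sample if is_error(doc))
    error_rate = errors / len(sample)
    # The whole lot is rejected when the sampled error rate is too high.
    return error_rate <= max_error_rate, error_rate


# Hypothetical batch of 500 documents in which every 50th one was miscoded.
batch = list(range(1, 501))
miscoded = set(range(50, 501, 50))
accepted, rate = accept_lot(batch, lambda doc: doc in miscoded,
                            sample_size=50, max_error_rate=0.05, seed=7)
print(accepted, f"{rate:.1%}")
```

A rejected lot would typically go back for full re-review, and the returned error rate can be logged per reviewer or per batch.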
38
QUALITY CONTROL, CONT. In the context of document review, the “lot” can correspond to individual reviewer batches or to aggregations, such as an individual reviewer’s total completed review for each day Statistics on batch rejection can be tracked by reviewer, by team, by source material, or by other useful properties Choosing not to acknowledge or measure the error rate in a document review project does not mean that it does not exist
39
FINDING OUT HOW MUCH STUFF YOU MISSED, AKA MEASURING ELUSION
40
MEASURING ELUSION Finding out how much stuff you missed Elusion –Elusion is a measure of how much relevant material was left behind –Typically referenced in the context of predictive coding and the portion of a dataset left unreviewed by a technology-assisted review process An elusion measurement can also be thought of as an estimation of prevalence specific to the remainder pile
41
MEASURING ELUSION, CONT. Example –Even after targeted collection and processing, 1,000,000 documents –Use key word and Boolean searches to cull that 1,000,000 to 250,000 for human document review –In addition to documenting the culling techniques employed and their rationales, an elusion/prevalence measurement can be taken
42
MEASURING ELUSION, CONT. Example, cont. –For a remainder pile of 750,000 documents –Reviewing a simple random sample of just 16,229 documents –Will allow you to measure the elusion o (aka the prevalence of relevant materials in the remainder) –With a 99% confidence level –And a confidence interval of +/-1%
43
MEASURING ELUSION, CONT. Example, cont. –Assuming, hypothetically, that your review of that sample found 5% of that sample of the remainder to be relevant o You would then be able to say with 99% confidence o That between 4% and 6% of the remainder is relevant –(aka 30,000-45,000 documents out of 750,000) –You would also learn about what types of missed materials exist o To develop a method of targeted retrieval o To demonstrate low utility of retrieval –(i.e., by showing material is duplicative or of low probative value relative to additional expense required for retrieval)
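Converting the elusion percentage into a document count is simple arithmetic, sketched below. The function name is illustrative, and the sketch applies the planned +/-1% interval directly, as the slide does.

```python
def elusion_range(sample_rate, margin, remainder_size):
    """Project a sampled elusion rate onto the whole remainder pile.

    Returns (low, high) as whole document counts.
    """
    low = max(0.0, sample_rate - margin)
    high = min(1.0, sample_rate + margin)
    return round(low * remainder_size), round(high * remainder_size)


# 5% of the sample was relevant, measured at +/-1%, remainder of 750,000.
print(elusion_range(0.05, 0.01, 750_000))
# prints (30000, 45000)
```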
44
REVIEW AND PREVIEW
45
REVIEW Estimating prevalence –Discovery intelligence –Four steps o Identify sampling frame o Determine sample size (measurement strength) o Take and review simple random sample o Calculate prevalence estimate Testing classifiers –Evaluated for recall and precision o Recall is how under-inclusive o Precision is how over-inclusive –Must be tested against control set
46
REVIEW, CONT. Quality control of human document review –Evaluated for accuracy and error rate –Lot acceptance sampling –Acceptable error rate Measuring elusion –Elusion is the prevalence of relevant materials in the remainder –Measured the same way as estimating prevalence –Useful both for process validation and for defensibility
47
ENGAGE WITH MODUS Modus email coming tomorrow with slides, recording, survey and invite to next webinar Visit us online at http://discovermodus.com/webinars/ for more information on next webinar on January 27th at 1:00PM EST entitled: –“Meeting the Microsoft SharePoint Discovery Challenge” Make sure to visit our website for valuable white papers, blogs and other information at www.discovermodus.com –White Paper: “Practical Applications of Random Sampling in eDiscovery” Thank you!