The Next Frontier in TAR: Choose Your Own Algorithm
LegalTech New York 2017
Presented by:
- Dr. David Grossman, Georgetown University
- Tara Emory, Esq., PMP, Director of Consulting for Driven, Inc.
Introduction
- What’s inside TAR?
- The Role of the Algorithm
- Leveraging Algorithms to Improve TAR
- Discovery Process Implications
- Q&A
What’s inside TAR?
- Workflow
- Software
- Algorithm
What’s inside TAR?
- One software product = one algorithm
- Attempts to compare algorithms and products do not isolate workflow vs. software vs. algorithm
- Other variables include the document set and the nature of what you need to find
Prior work
- TAR vs. keywords and manual review
- Teams with different workflows and software on same issues with same documents
- Industry tests of one software vs. another on same issues and same documents
- Different algorithms on same issues with same documents
Next Frontier: Different algorithms on same documents, compared to same algorithms on different documents
The Role of the Algorithm
Prior work
- TAR vs. keywords and manual review
- Teams with different workflows and software on same issues with same documents
- Industry tests of one software vs. another on same issues and same documents
- Different algorithms on same issues with same documents
Next Frontier: Which algorithms work best for different types of cases?
The Role of the Algorithm
TREC Issue | Winner at 15% | Recall at 15%
201 | XGBoost CV - Binary | 92
202 | LSI | 98
203 | Logistic Regression - log TF-IDF and LSI log TF-IDF | 97
207 | | 94
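The slide does not define "recall at 15%". A minimal sketch of one common reading, assuming it means the share of all responsive documents found once the top-ranked 15% of the collection has been reviewed. The function name, scores, and labels below are hypothetical and are not the TREC data.

```python
def recall_at_cutoff(scores, labels, fraction=0.15):
    """Recall after reviewing the top `fraction` of documents ranked by score.

    Assumption: labels are 1 (responsive) or 0 (non-responsive).
    """
    total_responsive = sum(labels)
    if total_responsive == 0:
        raise ValueError("no responsive documents in the labeled set")
    # Number of documents reviewed at the cutoff (at least one).
    n_reviewed = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:n_reviewed])
    return found / total_responsive


# Hypothetical collection of 20 documents, 4 of them responsive.
scores = [0.9, 0.8, 0.7, 0.6] + [i / 100 for i in range(16)]
labels = [1, 1, 1, 0] + [0] * 15 + [1]

# Reviewing the top 15% (3 documents) finds 3 of the 4 responsive documents.
print(recall_at_cutoff(scores, labels))
```

Under this reading, two algorithms can be compared on the same issue by ranking the same documents with each and computing the metric at the same cutoff.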
Leveraging Algorithms to Improve TAR
- No “best” algorithm for all cases
- Success of different algorithms varies by:
  - Size of document set
  - Prevalence of responsiveness in set
  - Amount of review appropriate for case
  - Availability of best examples to train
  - Broad vs. narrow topics
  - Single vs. many issues
Discovery Process Implications
- How should attorneys adapt to new understandings of TAR algorithms?
- How does an attorney judge what is reasonable?
- Should algorithm selection be included in discovery negotiations?
- Could this be another point of disagreement between opposing parties?
Questions
Supplement
Workflow
- Sampling (in some workflows)
- Seed set
- Training
- Validation (in some workflows)
Technologies: LSI/LSA
- Latent Semantic Indexing/Analysis
- Find relationships between:
  - Words - words
  - Words - topics
  - Topics - documents
- Map into semantic space
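The mapping into a semantic space can be illustrated with a small, self-contained sketch. It is not taken from the presentation. The six toy documents are hypothetical, raw word counts stand in for a weighted term-document matrix, and the top components are found by power iteration on the document-by-document Gram matrix (the document side of a truncated SVD) rather than by a library SVD routine.

```python
import math


def term_doc_counts(docs):
    """Return one word-count vector per document (documents x terms)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return [[doc.split().count(word) for word in vocab] for doc in docs]


def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * mv[i] for i in range(n))
        pairs.append((eigenvalue, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def lsi_doc_vectors(docs, k=2):
    """Coordinates of each document in a k-dimensional latent space."""
    rows = term_doc_counts(docs)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
            for i in range(len(docs))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical documents: three about billing, three about a lunch event.
docs = ["contract price", "price invoice", "invoice payment",
        "lunch menu", "menu party", "party lunch"]
raw = term_doc_counts(docs)
latent = lsi_doc_vectors(docs, k=2)

print(cosine(raw[0], raw[2]))        # no shared words in the raw space
print(cosine(latent[0], latent[2]))  # close together in the latent space
print(cosine(latent[0], latent[3]))  # different topic, near zero
```

The first and third documents share no words, yet they land close together in the latent space because both co-occur with the vocabulary of the second document. That is the words-to-topics-to-documents relationship the slide describes.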
Logistic Regression
- Like linear regression, except that we fit an S-shaped curve instead of a straight line
- The fitted curve gives the probability of relevance
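A minimal sketch of the curve-fitting idea, assuming a single hypothetical feature (how often a key term appears in a document) and plain gradient descent. The data and parameter values are illustrative only, and production tools use many features and more robust solvers.

```python
import math


def sigmoid(z):
    """The S-shaped curve that turns a score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(relevant) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical training data: key-term count per document, and whether
# a reviewer marked the document relevant (1) or not (0).
xs = [0, 1, 1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0 + b)   # probability of relevance with no key terms
p_high = sigmoid(w * 6 + b)  # probability of relevance with six key terms
print(p_low, p_high)
```

Unlike a straight line, the fitted curve stays between 0 and 1, so its output can be read directly as a probability of relevance and used to rank documents for review.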
Technologies: Bayesian Probability
- Bayesian Probability/Naïve Bayes
- Probabilistic
- Identifies the probability that a word contributes to a document matching a category, based on examples
- Each word contributes independently to the likelihood
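The independence assumption can be sketched as follows. This is an illustrative multinomial Naïve Bayes with add-one smoothing, not any vendor's implementation, and the training examples and category names are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (text, category) pairs coded by a reviewer."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, category in examples:
        doc_counts[category] += 1
        counts = word_counts.setdefault(category, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab


def log_scores(text, model):
    """Log of prior times per-word likelihoods, one score per category."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Each word adds its own term: the independence assumption.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[category] = score
    return scores


def classify(text, model):
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


# Hypothetical coded examples.
examples = [
    ("merger pricing agreement", "responsive"),
    ("pricing discussion with competitor", "responsive"),
    ("lunch menu friday", "non-responsive"),
    ("office party friday lunch", "non-responsive"),
]
model = train_naive_bayes(examples)
print(classify("competitor pricing", model))
print(classify("friday lunch", model))
```

Because each word contributes a separate additive term to the log score, training reduces to counting words per category, which is why this family of methods is fast even on large document sets.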
Technologies: SVM
- Support Vector Machine
- Process for making binary decisions
- Documents are mapped based on word counts, expressed as a percentage of the words in the document
- As the user identifies responsive and non-responsive examples, a dividing line is determined
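A rough sketch of the dividing-line idea, assuming a linear SVM trained by sub-gradient descent on the hinge loss. The two tracked terms, the training documents, and the learning-rate and regularization values are all hypothetical, and real tools use far larger vocabularies and dedicated solvers.

```python
def term_fractions(text, terms):
    """Map a document to each tracked term's share of the document's words."""
    words = text.lower().split()
    if not words:
        return [0.0] * len(terms)
    return [words.count(term) / len(words) for term in terms]


def train_linear_svm(examples, lr=0.1, reg=0.01, epochs=200):
    """examples: list of (feature_vector, label) with label +1 or -1."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Example is inside the margin: move the line away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(x, w, b):
    """+1 for the responsive side of the dividing line, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical tracked terms and reviewer-coded examples.
TERMS = ["pricing", "lunch"]
coded = [
    ("pricing pricing contract", 1),
    ("pricing update", 1),
    ("lunch lunch menu", -1),
    ("team lunch", -1),
]
examples = [(term_fractions(text, TERMS), label) for text, label in coded]
w, b = train_linear_svm(examples)

print(predict(term_fractions("pricing terms", TERMS), w, b))
print(predict(term_fractions("lunch order", TERMS), w, b))
```

Each newly coded example that falls inside the margin nudges the dividing line, which mirrors how the boundary is refined as the user identifies more responsive and non-responsive documents.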
Other Lexical Techniques
- Rely on linguists and dictionaries
- Linguists serve as experts and work with attorneys
- Deconstruct language into parts of speech
- Determine classification rules for responsiveness and non-responsiveness based on keywords
- May or may not involve machine learning