Current Methodologies for Supervised Machine Learning in E-Discovery
Bill Dimm, Text Analytics Forum 2018, November 8, 2018.
Since supervised machine learning gained court acceptance for use in e-discovery six years ago, best practices have evolved. This talk describes the special circumstances of e-discovery and the best approaches currently in use. How robust is the Continuous Active Learning (CAL) approach? How much impact does the choice of seed documents have? What are SCAL and TAR 3.0?
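The CAL workflow the abstract refers to can be sketched in miniature. Everything below is an illustrative assumption, not Dimm's or Cormack & Grossman's actual implementation: a toy corpus where a document is relevant iff it contains the word "fraud", a naive per-word relevance scorer in place of a real classifier, and a crude stopping rule. The shape of the loop is the point: seed, train, rank, review the top batch, retrain, repeat.

```python
import random

# Toy collection: each document is a few words; a document is "relevant"
# iff it contains "fraud". True labels stand in for the human reviewer's
# coding decisions. All names and numbers here are illustrative.
random.seed(0)
vocab = ["fraud", "invoice", "payment", "lunch", "meeting",
         "travel", "memo", "budget", "report", "agenda"]
corpus = []
for doc_id in range(200):
    words = random.sample(vocab, k=3)
    corpus.append((doc_id, " ".join(words), "fraud" in words))

def train(examples):
    """Laplace-smoothed per-word relevance weights from reviewed docs."""
    rel, tot = {}, {}
    for _, text, label in examples:
        for w in set(text.split()):
            tot[w] = tot.get(w, 0) + 1
            rel[w] = rel.get(w, 0) + int(label)
    return {w: (rel[w] + 1) / (tot[w] + 2) for w in tot}

def score(text, weights):
    # Unseen words get a neutral 0.5 weight.
    return sum(weights.get(w, 0.5) for w in set(text.split()))

# Seed with a single keyword hit, then iterate: train, rank the unreviewed
# docs, review the top batch, and stop when a whole batch yields nothing.
reviewed = {}  # doc_id -> reviewer's label
seed_doc = next(d for d in corpus if "fraud" in d[1])
reviewed[seed_doc[0]] = seed_doc[2]

BATCH = 10
while True:
    weights = train((i, corpus[i][1], lab) for i, lab in reviewed.items())
    remaining = [d for d in corpus if d[0] not in reviewed]
    if not remaining:
        break
    remaining.sort(key=lambda d: score(d[1], weights), reverse=True)
    batch = remaining[:BATCH]
    for doc_id, text, label in batch:
        reviewed[doc_id] = label  # simulated human review
    if not any(label for _, _, label in batch):
        break  # naive stopping rule; real CAL stopping criteria are subtler

recall = sum(reviewed.values()) / sum(d[2] for d in corpus)
```

Because the highest-scoring documents are reviewed first and the model is retrained continuously, CAL tends to be robust to the initial seed; the "weak seed" and "wrong seed" slides that follow probe exactly how far that robustness goes.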
E-Discovery
Supervised Machine Learning Predictive Coding
Technology-Assisted Review
TAR Acceptance
My analysis (from the draft of my book) of the 2009 TREC Legal Track data that Grossman & Cormack analyzed.
E-Discovery Considerations
Need to hit a certain recall
Near-duplicates
Variable data
Dirty data
Small part of a document can be critical
Non-standard word usage
Prevalence can be low
Difficulty getting data for experiments
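The first consideration, hitting a certain recall, is typically validated by sampling. A minimal sketch with hypothetical audit numbers: estimate recall from the relevant documents in a random control sample and attach a Wilson score interval, which behaves reasonably at the small relevant-document counts that low prevalence produces.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Hypothetical audit: a random sample of 400 documents contains 20
# relevant ones (5% prevalence); the review had flagged 17 of those 20.
found, relevant_in_sample = 17, 20
recall = found / relevant_in_sample            # point estimate: 0.85
low, high = wilson_interval(found, relevant_in_sample)
```

Note how wide the interval is with only 20 relevant documents in the sample: roughly 0.64 to 0.95. This is why low prevalence makes recall certification expensive; a defensible bound requires sampling enough documents to capture a meaningful number of relevant ones.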
Toy Example for Workflows
Jump out to animations after explaining this figure.
Weak Seed
Wrong Seed
Disjoint Relevance
References
Maura Grossman and Gordon Cormack, "Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review," Richmond Journal of Law and Technology, XVII, 2011.
William Dimm, Predictive Coding: Theory & Practice [draft, December 9, 2015], Appendix A.
William Dimm, "TAR 3.0 Performance," Clustify Blog, January 28, 2016.
Gordon Cormack and Maura Grossman, "Scalability of Continuous Active Learning for Reliable High-Recall Text Classification," Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016.
William Dimm, "The Single Seed Hypothesis," Clustify Blog, April 25, 2015.