Current Methodologies for Supervised Machine Learning in E-Discovery
Bill Dimm
Text Analytics Forum 2018
November 8, 2018

Since supervised machine learning gained court acceptance for use in e-discovery six years ago, best practices have evolved. This talk describes the special circumstances of e-discovery and the best approaches currently in use. How robust is the Continuous Active Learning (CAL) approach? How much impact does the choice of seed documents have? What are SCAL and TAR 3.0?
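For orientation, here is a minimal sketch of the CAL loop the abstract refers to. This is not the implementation used in the talk: it assumes TF-IDF features, a logistic-regression relevance model, a fixed batch size and review budget, and a simulated reviewer (oracle_label). All names are illustrative.

# Minimal sketch of a Continuous Active Learning (CAL) loop -- a hedged
# illustration, not the implementation behind the talk. Assumptions:
# TF-IDF features, logistic regression, a simulated reviewer
# (oracle_label), and a fixed review budget.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(docs, oracle_label, seed_ids, batch_size=100, budget=2000):
    """Repeatedly train on everything reviewed so far, then send the
    top-scoring unreviewed documents to review, until the budget is spent."""
    X = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)
    rng = np.random.default_rng(0)
    reviewed = {int(i): oracle_label(i) for i in seed_ids}  # doc id -> 0/1
    while len(reviewed) < min(budget, len(docs)):
        ids = sorted(reviewed)
        y = [reviewed[i] for i in ids]
        if len(set(y)) < 2:
            # Only one class so far (e.g., a single relevant seed): fall
            # back to random selection until both classes are represented.
            order = rng.permutation(len(docs))
        else:
            model = LogisticRegression(max_iter=1000).fit(X[ids], y)
            order = np.argsort(-model.predict_proba(X)[:, 1])
        batch = [int(i) for i in order if int(i) not in reviewed][:batch_size]
        for i in batch:
            reviewed[i] = oracle_label(i)  # human review, simulated here
    return reviewed

In practice the loop runs until a recall target is met rather than until a fixed budget is spent; the budget here just keeps the sketch finite.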
E-Discovery
Supervised Machine Learning = Predictive Coding = Technology-Assisted Review (TAR)
TAR Acceptance
My analysis (from the draft of my book) of the same 2009 TREC Legal Track data that Grossman & Cormack analyzed.
E-Discovery Considerations
- Need to hit a certain recall
- Near-duplicates
- Variable data
- Dirty data
- Small part of document can be critical
- Non-standard word usage
- Prevalence can be low
- Difficulty getting data for experiments
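Two of the bullets above, recall targets and low prevalence, combine in a way worth making concrete. The arithmetic below is a hedged illustration with invented numbers, not figures from the talk.

# Illustrative arithmetic only; the collection size, prevalence, and
# recall target below are invented for the example.
def recall(true_pos, false_neg):
    """Recall = fraction of relevant documents that were found."""
    return true_pos / (true_pos + false_neg)

N = 1_000_000        # assumed collection size
prevalence = 0.01    # assumed: 1% of documents are relevant
target = 0.75        # assumed recall target
relevant = int(N * prevalence)       # 10,000 relevant docs exist
must_find = int(target * relevant)   # 7,500 of them must be found
print(must_find, recall(must_find, relevant - must_find))  # 7500 0.75

At 1% prevalence, finding 7,500 relevant documents by random sampling would mean reviewing roughly 750,000, which is why effective prioritization matters so much.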
Toy Example for Workflows
(Presenter note: jump out to animations after explaining this figure.)
Weak Seed
Wrong Seed
Disjoint Relevance
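The Weak Seed, Wrong Seed, and Disjoint Relevance slides can be read as perturbation experiments on the starting point of a CAL run. As a hedged sketch of how such an experiment might be scripted, the code below reuses cal_review() from the earlier sketch; docs, labels, and the seed ids are assumed placeholders, and the "wrong seed" case forces the seed's label to relevant to simulate a coding error.

# Sketch of a seed-sensitivity experiment (illustrative only). Reuses
# cal_review() from the earlier CAL sketch; `docs` and `labels` are
# assumed to exist as a document list and 0/1 ground-truth list.
def recall_at_budget(docs, labels, seed_id, wrong_seed=False, budget=2000):
    def oracle(i):
        if wrong_seed and i == seed_id:
            return 1  # simulate a non-relevant seed miscoded as relevant
        return labels[i]
    reviewed = cal_review(docs, oracle, [seed_id], budget=budget)
    found = sum(labels[i] for i in reviewed)
    return found / sum(labels)  # recall against true labels

# Placeholder document ids -- purely illustrative.
print("strong seed:", recall_at_budget(docs, labels, seed_id=7))
print("weak seed:  ", recall_at_budget(docs, labels, seed_id=42))
print("wrong seed: ", recall_at_budget(docs, labels, seed_id=99, wrong_seed=True))

Comparing recall at a fixed budget across seed choices is one way to probe the Single Seed Hypothesis cited in the references.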
References
- Maura Grossman and Gordon Cormack, "Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review," Richmond Journal of Law and Technology, Vol. XVII, 2011.
- William Dimm, Predictive Coding: Theory & Practice, draft of December 9, 2015, Appendix A.
- William Dimm, "TAR 3.0 Performance," Clustify Blog, January 28, 2016.
- Gordon Cormack and Maura Grossman, "Scalability of Continuous Active Learning for Reliable High-Recall Text Classification," Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016.
- William Dimm, "The Single Seed Hypothesis," Clustify Blog, April 25, 2015.