e-Discovery through Text Mining

e-Discovery through Text Mining: Fraud Detection Example
Sergei Ananyan, Ph.D., Megaputer Intelligence Inc.

What is e-Discovery?
Electronic discovery is the process in which electronic data is sought, located, secured, and searched with the intent of using it as evidence in a legal case.

Electronic evidence
- Documents are increasingly produced and stored electronically
- Corporate litigation involves the production and analysis of electronic evidence
- Litigation might involve different parties: Company vs. Company, Government vs. Company, Person vs. Company

Who uses e-Discovery systems?
[Diagram: an E-Discovery System used by the Document Analyst, Litigation Support Manager, Attorney, Opposing Legal Team, and Court]

Analytics & Reporting Text Mining

Old approach to text analysis
Data analysts perform searches based on:
- Key words and phrases with proximity
- Date ranges
- Known relevant documents (seeking similar documents)

Typical example
- A US federal agency is investigating a mortgage fraud case against a major bank
- It subpoenas all documents matching the words:
  Apprais* w/25: correct*, target, increas*, chang*, second, …
  Pric* w/25: change*, increas*, rais*, …
- It receives over 3,000,000 matching documents
- This agency division has 4 data analysts and 3 attorneys to work on the case
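A query such as `Apprais* w/25: correct*, increas*` can be read as: a word starting with "apprais" occurring within 25 words of a word starting with "correct" or "increas". The sketch below illustrates that reading in Python. The function name, the tokenization rule, and the treatment of every term as a prefix are assumptions made for illustration, not the query engine used in the case.

```python
import re

def proximity_match(text, anchor_prefix, near_prefixes, window=25):
    """Return True if a word starting with anchor_prefix occurs within
    `window` words of a word starting with any of near_prefixes."""
    words = re.findall(r"[a-z']+", text.lower())
    anchor_pos = [i for i, w in enumerate(words) if w.startswith(anchor_prefix)]
    near_pos = [i for i, w in enumerate(words)
                if any(w.startswith(p) for p in near_prefixes)]
    return any(abs(a - n) <= window for a in anchor_pos for n in near_pos)

# Made-up sample sentences, for illustration only.
doc = "Please increase the appraisal so that it matches the target price."
hit = proximity_match(doc, "apprais", ["correct", "increas", "chang"])
miss = proximity_match("The appraisal is attached.", "apprais", ["correct", "increas"])
```

Matching of this kind is broad by design, which is consistent with the very large number of documents returned.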

Time for document analysis: 3 million docs
- Manual analysis: 3 min per document = 20 docs per hour = 40K docs per year, so 75 years to check 3M docs
- Text Mining: 2 months for one analyst. DONE!
- Text Mining delivers results 450 times faster!
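The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes about 2,000 working hours per analyst per year, a value not stated on the slide but one that reproduces its 40K docs per year.

```python
DOCS = 3_000_000
MINUTES_PER_DOC = 3
WORK_HOURS_PER_YEAR = 2_000  # assumption: roughly 40 h/week for 50 weeks

docs_per_hour = 60 // MINUTES_PER_DOC                 # 20 docs per hour
docs_per_year = docs_per_hour * WORK_HOURS_PER_YEAR   # 40,000 docs per year
manual_years = DOCS / docs_per_year                   # years for one analyst

TEXT_MINING_MONTHS = 2                                # from the slide
speedup = manual_years * 12 / TEXT_MINING_MONTHS      # ratio of elapsed time
```

Under that assumption, `manual_years` comes out at 75 and `speedup` at 450, matching the slide.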

Encountered challenges
- Overwhelming number of documents
- Primarily irrelevant documents
- Repetitive documents
- Numerous typos
- Missing information about communicating parties

Where can Text Mining help?
- Data normalization
- Parsing and aggregating data from disparate formats
- Cleansing data
- Feature extraction
- Data analysis
- Deep linguistic parsing (context based)
- Searching for patterns

Use text/data mining techniques
- Language detection
- Spell-checking / correction
- Deep linguistic parsing
- Part of speech detection (context based)
- Chunker: detect noun phrases, verb phrases, etc.
- Semantic dictionaries
- Auto-categorization (Pattern Detection Language)
- Entity Extraction
- Clustering
- Latent Semantic Analysis
- De-duplication
- Inverse frequency analysis
- Social Network Analysis

Possible Analysis Scenarios
Let us consider different scenarios:
- We can formulate the patterns we are searching for
- We have a collection of documents with relevant evidence
- We have a list of relevant custodians
- We know only the time interval when the problem occurred
- We don’t know anything except the keywords documents should contain

If we know relevant patterns
Write patterns in a special language that captures:
- Proximity (terms, sentences and paragraphs)
- Part of speech information
- Semantic similarities
- Negations
- Density of terms

If we know relevant documents
- Need to search for similar documents
- Use Latent Semantic Analysis or similar techniques
- Identify custodians associated with relevant documents
- Find additional features of potential interest associated with these custodians
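To illustrate the idea of ranking documents by similarity to a known relevant one, here is a minimal Python sketch. It uses plain TF-IDF cosine similarity as a simpler stand-in for Latent Semantic Analysis, which would add a dimensionality-reduction step on top. All function names and sample texts are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for counts in tokenized for term in counts)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "the appraisal value was raised to meet the target",  # known relevant
    "please raise the appraisal to the target value",
    "lunch menu for the team meeting on friday",
]
vecs = tfidf_vectors(docs)
# Similarity of each candidate document to the known relevant one.
scores = [cosine(vecs[0], v) for v in vecs[1:]]
```

The second document shares most of its vocabulary with the known relevant one and scores higher than the unrelated third document.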

Know only custodians & time range
- Search for unique features of their communications with others
- Train the system on all available data
- Reveal anomalous terms & phrases
- Example: “fruit language”, where “lemon” stands for a kickback:
  “For this property we received from XYZ a lemon worth over 3M.”
  “They gave us significant lemons on both these transactions.”
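One simple way to surface anomalous terms such as "lemon" is to compare how often a custodian uses each word against a background corpus. The Python sketch below ranks terms by a smoothed frequency ratio. The scoring rule and the sample messages are assumptions for illustration and are not the method used by the tool described in the slides.

```python
import re
from collections import Counter

def term_counts(messages):
    return Counter(w for m in messages for w in re.findall(r"[a-z]+", m.lower()))

def anomalous_terms(custodian_msgs, background_msgs, top=3):
    """Rank terms by how much more often the custodian uses them than
    the background corpus does (add-one smoothed frequency ratio)."""
    cust, back = term_counts(custodian_msgs), term_counts(background_msgs)
    cust_total, back_total = sum(cust.values()), sum(back.values())
    vocab = len(set(cust) | set(back))

    def ratio(term):
        p_cust = (cust[term] + 1) / (cust_total + vocab)
        p_back = (back[term] + 1) / (back_total + vocab)
        return p_cust / p_back

    return sorted(cust, key=ratio, reverse=True)[:top]

# Made-up messages loosely modelled on the slide's example.
custodian = [
    "For this property we received a lemon worth over 3M",
    "They gave us a lemon on this property",
]
background = [
    "For this property we received a report worth reading",
    "They gave us a loan on this property",
    "The appraisal is over 3M",
]
top_terms = anomalous_terms(custodian, background, top=1)
```

In this toy data, "lemon" is the only custodian term absent from the background corpus, so it ranks first.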

Know only the problem time range
- Look for spikes in communications for all people
- Sudden changes in topics discussed
- Spikes in unusual lexicon terms
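A spike in communications can be approximated as a day whose message count lies far above the average. The following Python sketch flags such days with a z-score threshold. The threshold of 2.0 and the daily counts are illustrative assumptions.

```python
from statistics import mean, stdev

def find_spikes(daily_counts, threshold=2.0):
    """Return indices of days whose message count exceeds the mean by
    more than `threshold` sample standard deviations."""
    mu, sigma = mean(daily_counts), stdev(daily_counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_counts) if (c - mu) / sigma > threshold]

# Hypothetical messages-per-day series for one custodian.
counts = [12, 9, 11, 10, 13, 8, 10, 11, 9, 60, 12, 10]
spike_days = find_spikes(counts)
```

The same function could be applied to counts of a particular term per day to look for spikes in unusual lexicon.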

Know only theme & keywords
- Clustering of topics
- Analysis of pairwise communications
- Unusual clusters & lexicon
- Group pairs of people with similar lexicon
- Gather ideas for further investigation

Data preparation
- Remove definitely irrelevant documents: junk mail, mass broadcasts, magazine articles (post-factum documents)
- Split email chains into individual messages
- Eliminate full and near duplicates
- Reconstruct email addresses
- Find and adaptively correct misspellings

Reconstruct & extract features
Extract fields of interest:
- Date
- To, From, CC and BCC
- Subject
- Names of people, companies and organizations
- Addresses
- Telephone numbers
- Custom entities: SSN, drug names with dosage, frequency, application mode, etc.

Networks of related custodians
- Reveal & graphically present networks of people exchanging relevant documents
- Social Network Analysis performed on email communications
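As a rough illustration of Social Network Analysis on email, the Python sketch below builds a directed graph of who wrote to whom and ranks people by their number of distinct correspondents. The message structure, field names, and the degree-based ranking are assumptions for illustration. Real analyses often use richer centrality measures.

```python
from collections import Counter

def communication_graph(messages):
    """Count messages per directed (sender, recipient) pair."""
    edges = Counter()
    for msg in messages:
        for rcpt in msg["to"]:
            edges[(msg["from"], rcpt)] += 1
    return edges

def most_connected(edges, top=1):
    """Rank people by number of distinct correspondents (undirected degree)."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    ranked = sorted(neighbors, key=lambda p: (-len(neighbors[p]), p))
    return ranked[:top]

# Hypothetical parsed messages (sender and recipient list only).
messages = [
    {"from": "todd", "to": ["dorothy", "lisa"]},
    {"from": "dorothy", "to": ["todd"]},
    {"from": "lisa", "to": ["todd"]},
    {"from": "todd", "to": ["dorothy"]},
]
edges = communication_graph(messages)
hub = most_connected(edges)
```

The edge counts can be fed to a graph-drawing tool to present the network visually.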

Present selected documents
- Obtain a small collection of highly relevant documents
- Summarize key findings in easy-to-comprehend interactive web reports
- Provide drill-down to original documents
- Have important patterns in text highlighted in the drill-down documents
- Export collections of marked-up relevant documents

Case Description
- Data: 3,000,000 documents from a mortgage company, primarily email notes
- Objectives: detect signatures of potential fraud and abuse; identify and visualize involved individuals

E-Discovery Methodology
Step 1. Prepare and normalize data
Step 2. Cleanse data
Step 3. Extract entities of interest: $ amounts, loan #s, postal addresses, etc.
Step 4. Pattern Analysis: search for text patterns representing fraud and abuse
Step 5. Who is involved? Visualize networks of communications of identified suspects

Data analysis scenario

Step 1. Data Preparation and Normalization

Data Preparation Objectives
- Remove non-email documents
- Reconstruct email addresses
- Convert chains of email responses found in one email into collections of individual messages
- Parse documents into structured fields: From, To, CC, BCC; Subject; Date; Email body
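The step of converting an email chain into individual, structured messages can be sketched with Python's standard `email` package. The separator string, the sample chain, and its dates are assumptions for illustration. Real chains use many different quoting conventions and would need more robust splitting.

```python
from email import message_from_string

SEPARATOR = "-----Original Message-----"  # assumed chain delimiter

def split_chain(raw):
    """Split one email chain into individual messages and parse each
    into structured fields."""
    messages = []
    for part in raw.split(SEPARATOR):
        msg = message_from_string(part.strip())
        messages.append({
            "from": msg["From"],
            "to": msg["To"],
            "subject": msg["Subject"],
            "date": msg["Date"],
            "body": msg.get_payload().strip(),
        })
    return messages

# Made-up chain; addresses follow the pattern shown in the slides.
chain = """From: Todd.Seal@homesite.com
To: Dorothy.Koen@homesite.com
Subject: RE: appraisal
Date: Tue, 02 May 2006 10:15:00 -0500

Please see below.

-----Original Message-----
From: Dorothy.Koen@homesite.com
To: Lisa.Simpson@homesite.com
Subject: appraisal
Date: Mon, 01 May 2006 16:40:00 -0500

Can you check the value?
"""
parsed = split_chain(chain)
```

Each element of `parsed` is a dictionary with the From, To, Subject, Date, and body fields of one message.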

Parsing Original Documents
- Email 1: Todd.Seal@homesite.com, Dorothy.Koen@homesite.com, Lisa.Simpson@homesite.com
- Email 2: Todd.Seal@homesite.com, Dorothy.Koen@homesite.com, Lisa.Simpson@homesite.com
- …
- Email 3+

Reconstructing and Parsing

3M Chains Parsed into 5.6M Emails

Step 2. Data Cleansing

Data Cleansing Objectives
- Identify and correct misspellings
- Identify duplicates and near duplicates
- Remove magazine articles

Auto-SpellChecker: misspelled words
Automatically identified & corrected over 600,000 misspellings
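A very simple form of automatic spelling correction replaces an unknown word with its closest entry in a known lexicon. The Python sketch below uses `difflib.get_close_matches` from the standard library. The lexicon and the 0.8 cutoff are assumptions for illustration, and this is not the adaptive method of the tool shown in the slides.

```python
import difflib

# Assumed domain lexicon; a real one would be built from the corpus.
LEXICON = ["appraisal", "mortgage", "property", "increase", "value"]

def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Replace an unknown word with its closest lexicon entry, if any."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

corrected = [correct_word(w) for w in ["apraisal", "morgage", "value", "lemon"]]
```

Words with no sufficiently close lexicon entry are left unchanged, so unusual but meaningful terms are not overwritten.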

Detect Duplicates and Near-duplicates
Automatically eliminated over 1,000,000 duplicates
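Near-duplicate detection is commonly illustrated with word shingles and Jaccard similarity: two documents are near duplicates if most of their overlapping word sequences coincide. The Python sketch below follows that approach. The shingle size, the 0.7 threshold, and the sample texts are assumptions, and the slides do not state which technique was actually used.

```python
import re

def shingles(text, k=3):
    """Set of k-word shingles from normalized text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def drop_near_duplicates(docs, threshold=0.7):
    """Keep the first document of each group of near-duplicates."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

# Made-up documents: the second differs from the first by one word.
docs = [
    "Please raise the appraisal value for the Elm Street property to 450,000.",
    "FYI: please raise the appraisal value for the Elm Street property to 450,000",
    "The closing date for the Oak Avenue loan moved to Friday.",
]
unique_docs = drop_near_duplicates(docs)
```

This pairwise comparison grows quadratically with the number of documents, so a collection of millions would need an indexing scheme on top of it.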

Remove Magazine Articles

Step 3. Extract Entities: Multiple Valuation Homes?

Entity Extraction Objectives
- Extract standard and custom entities of potential interest: names of people and companies; postal addresses and phones; currency amounts and loan numbers, etc.
- Find documents discussing different values of the same home
- Remove discussions of revenue and salaries

Extract Names of People & Companies
Automatically extract standard entities

Extract $ Amounts and Loan #s
Extract standard and custom entities
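Dollar amounts and loan numbers follow regular surface patterns, so a first approximation can be written with regular expressions. In the Python sketch below, the accepted formats (for example `$450,000`, `$1.2M`, `Loan #1234567`) and the 6 to 12 digit loan-number length are assumptions for illustration. The actual entity definitions used in the case are not given in the slides.

```python
import re

# Assumed amount formats: "$450,000", "$1.2M", "$300K".
AMOUNT_RE = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?[KkMm]?\b")
# Assumed loan formats: "Loan #1234567", "loan no. 1234567", "loan number 1234567".
LOAN_RE = re.compile(r"\bloan\s*(?:#|no\.?|number)\s*(\d{6,12})\b", re.IGNORECASE)

def extract_entities(text):
    """Return the dollar amounts and loan numbers found in text."""
    return {
        "amounts": AMOUNT_RE.findall(text),
        "loans": LOAN_RE.findall(text),
    }

# Made-up note for illustration.
note = "Loan #1234567: first appraisal came in at $410,000, revised to $450,000."
entities = extract_entities(note)
```

Two different amounts attached to one loan number, as in this note, is the kind of signal the later steps look for.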

Extract Notes w/Multiple Home Prices

Remove Discussions of Revenue & Salary

Different Valuations for the Same Home
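Once amounts and property identifiers have been extracted, finding homes discussed at more than one value reduces to a grouping step. The Python sketch below assumes the extraction has already produced (property, amount) pairs. The record format and the sample values are hypothetical.

```python
from collections import defaultdict

def homes_with_multiple_values(records):
    """records: iterable of (property_key, amount) pairs taken from
    extracted entities. Returns properties quoted at more than one
    distinct value, with those values sorted."""
    values = defaultdict(set)
    for key, amount in records:
        values[key].add(amount)
    return {k: sorted(v) for k, v in values.items() if len(v) > 1}

# Hypothetical extracted records.
records = [
    ("12 Elm St", 410_000),
    ("12 Elm St", 450_000),
    ("7 Oak Ave", 300_000),
    ("7 Oak Ave", 300_000),
]
flagged = homes_with_multiple_values(records)
```

Only the property with two distinct valuations is flagged. A repeated identical value is not.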

Step 4. Discover Signatures of Fraud and Abuse

Taxonomy: Distribution of Topics

Taxonomy-based Categorization

Taxonomy Results: Value Opinions

Step 5. Who is involved? Social Network Analysis

People Discussing Multiple Values of Homes

Benefits of Text Mining
- Dramatic savings in time and resources
- Smaller teams of investigators can complete large projects
- Elimination of tedious manual work
- Better precision: focus only on relevant documents
- Increased recall: find unexpected patterns of terms
- Convincing and consistent presentation of results
- Stronger case / defense position
- Preventative measures become possible

Questions?
Call or email: (812) 330-0110, info@megaputer.com
1600 W Bloomfield Rd, Suite E, Bloomington, IN 47404, USA