Published by Diana Thompson. Modified over 6 years ago.
1
Data Mining Personally Identifiable Information (PII)
Monday, June 22, 2015, 1:00 pm ET
Sandra Serkes, President & CEO, Valora Technologies, Inc.
2
AGENDA for Data Mining Private Identification Information (PII)
What is PII?
  How is it different from PHI or other sensitive information?
  Where does PII live? How can I find it?
  What do I do once I find PII? What PII management obligations are there?
How does data mining help identify PII?
  Why data mine? What is driving this practice? Why is it new/hot/etc? Who is doing this & why?
  How do you data mine documents? What does that mean? What about attachments & versions?
  What are the typical techniques to do data mining? How does it work?
How to get started on a PII data mining project
  The Basics You Need to Know
  Important Terms & Concepts
  Typical PII Data Mining Workflow
  Tips & Tools
  Things to Watch Out For

Valora Technologies
Bedford, MA software firm specializing in machine-assisted document processing capabilities (aka analytics)
World experts in the automated analysis, indexing, mining and presentation of documents, data & content
20 staff, 200+ clients, 1,500,000+ pages every week
Customers: corporate legal departments, government agencies, and their professional advisory colleagues (law firms & consultancies)
Target market: those who wish to harness and profit from the 2.5 quintillion bytes of document & content data being created each day, aka “Big Data”
Objective: to overtake traditional information repository creation (manual data entry), management, analysis (search, review) and workflow (retention, production, routing) with high quality, low cost, scalable technology & best practices in analytics.
  Provide cost competitive document analytics solutions in the United States
  Provide efficient, world-class, targeted solutions to data, document & content utilization problems
“The power of Big Data is the story about the ability to compete and win with few resources and limited dollars” Forbes, March 2012 (this is Valora’s story, too)
3
What is all this?
Personally Identifiable Information (aka Private Identification Information) is:
  Information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context.
Protected Health Information (aka Private Health Information or Personal Health Information) is:
  Information, medical history, test and laboratory results, insurance information and other data that is collected by a health care professional.
Sensitive Information (aka Trade Secrets or Classified Information) is:
  Information that is protected against unwarranted disclosure. Access to sensitive information should be safeguarded. Protection of sensitive information may be required for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations.
4
Pop Quiz! PII, PHI or SI?
Patent Application
Home address
Secret formula
Layoff List (pink slip)
Blood type
Credit Card Number
Cell phone number
Household Income
Age of a minor
Employee ID
5
Where is PII likely to show up?
Forms & Applications
Employee Information (HR)
Supplier Information (Procurement)
Customer Information (Sales & Marketing)
Litigation & Investigations
Records Retention
6
Does your organization know what exactly your data says?
7
Why should we care about PII?
“We know what’s in our data, but we aren’t dealing with it.”
VERSUS
“We don’t know what’s in our data”
Courts, shareholders, consumers, government agencies, watchdog groups, media spotlight reports and more are demanding responsible data management (aka Information Governance).
8
Why Data Mine Corporate Documents?
“Document” is a loose term here. Really, it means any structured or unstructured form of textual or metadata content. Voicemails, tweets, texts, websites, audio & video files, receipts and transactions are all “documents” as far as data analytics is concerned.
Litigation
  Doc Review & Productions
  Investigations
Compliance
  Legal, regulatory & ethics
  Financial & investor
  Health & safety
Business Intelligence
Information Governance
  Management & control
  Cost savings
  Exposure mitigation
We data mine documents to learn where they are & what they say. Ultimately, we gain management and control over the contents, obligations, storage, access, retrieval, use and exposure of our information.
9
Who is Data Mining Documents for PII?
Large multi-national corporations
  Sometimes litigation or investigation collections, legal hold
  Sometimes part of larger Information Governance initiative
  Sometimes part of compliance and/or retention strategies
  Typically happening at the departmental level
Corporate Advisory (Outside Counsel & Consultants)
  Looking to assist clients in the items above
  Often part of business process re-engineering or IG engagements
Health Care, Financial Services & Insurance
  Analysis for HIPAA compliance
  Analysis of mortgages, stock trades, tax forms, and other financial transactions
  Analysis to de-identify documents for aggregate data mining purposes
10
Data Mining PII is a Good Example of Information Governance
Universal Issue
Involves several key IG problems:
  Storage/hosting
  Content analysis & classification
  Sub-Context – terms, provisions, obligations & stipulations
  Administration, management & maintenance
Elements of Backfile and Day Forward records management
Typically a mix of paper and ESI documents
Signatures & affirmations play a key role
PII Management is a hot button issue with real budgets available
  Investor & media attention
  Customer concerns
  Risk & compliance danger zone
Predecessor to Big Data mining
11
How a computer identifies PII with data mining (analytics)
Clear PII: SSN
Likely PII (“warning sign”)
Not PII: Interest Rate
Clear PII: Home Phone Number
Implied classification: Active PII, needs protection & redaction
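
As a rough illustration of the callouts above, the sketch below uses Python regular expressions to flag an SSN and a phone number as clear PII, flag a date as likely PII (a "warning sign"), and leave an interest rate alone. The patterns, the labels, and the find_pii name are assumptions made for this example; they are not Valora's actual rules, and a production ruleset would need far broader patterns and validation.

```python
import re

# Hypothetical patterns for illustration only (US-style formats).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.])\d{3}[-.]\d{4}(?!\d)")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def find_pii(text):
    """Return (label, confidence, matched_text) tuples found in text."""
    hits = []
    for m in SSN_RE.finditer(text):
        hits.append(("SSN", "clear", m.group()))
    for m in PHONE_RE.finditer(text):
        hits.append(("PHONE", "clear", m.group()))
    # A bare date might be a birth date, so it is only a warning sign.
    for m in DATE_RE.finditer(text):
        hits.append(("DATE", "likely", m.group()))
    return hits


sample = "SSN: 123-45-6789. Home phone: (781) 555-0100. Interest rate: 4.25%."
for label, confidence, value in find_pii(sample):
    print(label, confidence, value)
```

The interest rate produces no hit because none of the patterns match a percentage. Keeping a confidence level on each hit is what allows a later stage to treat "clear" and "likely" findings differently.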
12
How a computer classifies a contract with data mining (analytics)
Agreement Type
Authors/Parties
Author Validation & Contact Info
Key Provisions
Contract Date
Contract Term
No Survivorship Clauses
Implied classification: Active contract
13
How a computer classifies an attachment or exhibit with data mining (analytics)
Date Format = US
DocType = Patent Application
Date = 10/18/2007
Author = Patent Authors, Author City, Author Country
Assignee = RIM
Tone = Neutral to slightly positive
Embedded Graphic with Title
Other Data
Capturable Data Elements:
  Patent Number
  Filing Date
  Key Phrases & Terms
  Managing PTO
  Implied/Attached Docs
  Bar Code Present
  And many more . . .
Up to 160 unique attributes. And counting!
14
Additional Info Data Mining Determines
What type of document is this?
  Email vs. contract vs. employment application, etc.
Who executed this? Who are the parties?
  Authors, Recipients, Copyees
What are the key content areas?
  How similar to other/past provisions?
  Which provisions are most popular?
What attachments and exhibits are part of this record or file?
  What does their association imply?
What is the scope of the PII?
  Internal or external PII?
  Financial? Health? Personal? Company-Sensitive?
  Special conditions
What other context can be inferred?
  Can someone put the pieces together?
  What obligations? What risk?
What workflow is needed for this document?
  Exception Handling & Escalation
  Approval process
  Obsolescence planning, scheduled deletion
What predictions about future PII activity? What trends?
15
The importance of Context – is it PII?
Single instances of “likely” PII are not necessarily PII, unless and until they can be linked to a specific person
  Compare the following:
SSN is always a ringer; it is a unique ID
PII problems compound quickly
  Not just in single instances, but across populations
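
The point about context can be sketched in code. In the hypothetical example below, an SSN is always reported as PII, while an age is only escalated from "likely" to "PII" when something that looks like a person's name appears nearby. The naive "First Last" name pattern, the 40-character window, and the assess function are all assumptions chosen for illustration; real systems would use much stronger name detection.

```python
import re

# Illustrative patterns only; the name pattern is a deliberately naive guess.
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
AGE_RE = re.compile(r"\bage \d{1,3}\b", re.IGNORECASE)


def assess(text, window=40):
    """Return (matched_text, status) pairs, where status is 'PII' or 'likely'."""
    findings = []
    name_positions = [m.start() for m in NAME_RE.finditer(text)]
    # An SSN identifies a person on its own, so no linkage is required.
    for m in SSN_RE.finditer(text):
        findings.append((m.group(), "PII"))
    # An age only becomes PII once it can be tied to a nearby name.
    for m in AGE_RE.finditer(text):
        linked = any(abs(m.start() - pos) <= window for pos in name_positions)
        findings.append((m.group(), "PII" if linked else "likely"))
    return findings


print(assess("The typical applicant is age 42."))
print(assess("Jane Doe, age 42, applied in March."))
```

The same "age 42" is treated differently in the two sentences because only the second one places it next to a name (a made-up one here).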
16
When does PII become a problem?
The existence of PII isn’t a problem per se
It becomes a problem when:
  It is exposed to others (knowingly or unknowingly)
  It needs to be produced
  It needs to be evaluated
There is an ongoing ethical obligation to treat such information properly
Single points of data can be connected to others to build a composite picture (connect-the-dots)
Force your group to evaluate:
  What would happen in a data breach? What would be exposed?
  How quickly could you recover?
  What can you do now to mitigate expense & crisis later?
17
Think you are immune?
Data Breaches Last Year (2014)
  28%, a record high
  15 incidents per week
  Hacking is #1 cause (28%)
  675 million records compromised since 2005
  Every 3 seconds there is a new victim
18
What to do once you’ve reached the “it’s a problem” stage
Use analytics (or “safe” brute force labor) to:
  Cull out documents no longer needed (Retention Schedule)
    Mark documents heading for removal
  Build a broad dashboard of your data contents
    Population Analysis is beyond just a Data Map
    What do you have where, and in what concentrations?
  Analyze & assess documents into 3 categories: critical, nervous & safe
    ID specifics for each document type, person & information class (build your Ruleset)
  Determine who should/should not have access “at rest”
  Determine presentation of data to different parties, particularly in production/presentation scenarios
    Consider selective redactions with on/off toggle
  Run Analytics to ID, mark & redact offending information
  Host documents in an environment that can utilize and maintain existing information (Backfile), as well as proactively analyze new material entering the system (Day Forward)
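
The three-category assessment and the population view can be sketched as follows. This is a minimal illustration, assuming each document has already been tagged with the PII labels found in it; which labels count as "critical" or "nervous" is an invented ruleset for the example, and the triage and population_summary names are hypothetical.

```python
from collections import Counter

# Assumed ruleset: which PII labels push a document into which category.
CRITICAL_LABELS = {"SSN", "CREDIT_CARD"}
NERVOUS_LABELS = {"DOB", "PHONE", "ADDRESS"}


def triage(labels):
    """Classify one document as 'critical', 'nervous' or 'safe'."""
    found = set(labels)
    if found & CRITICAL_LABELS:
        return "critical"
    if found & NERVOUS_LABELS:
        return "nervous"
    return "safe"


def population_summary(docs):
    """Count documents per category; docs maps document id -> list of labels."""
    return Counter(triage(labels) for labels in docs.values())


docs = {
    "hr/offer_letter.pdf": ["SSN", "ADDRESS"],
    "sales/contact_sheet.xlsx": ["PHONE"],
    "marketing/brochure.pdf": [],
}
print(population_summary(docs))
```

Counting categories across the whole collection, rather than inspecting documents one at a time, is one simple way to answer "what do you have where, and in what concentrations?"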
19
AutoRedaction Defined
[REDACTED]
Automated redaction of “offending” text or phrases
  Software performs the redaction based on Rules
Multiple choice presentation
  Image, text or both
  Solid Black, Black with white writing, Translucent Yellow, Translucent Gray
Available for all kinds of information
  List provided or “derived” from tags
  Ex: SSN, DOB, Name, Age, Address, Account Number, Product Name/ID…
Unlimited redactions in a single document
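
A minimal sketch of rule-based redaction on plain text is shown below. It is not the PowerHouse AutoRedaction implementation; the RULES table, the auto_redact function, and its parameters are assumptions for illustration. The enabled argument stands in for an on/off toggle per rule, and labeled switches between a plain marker and one that names what was removed (loosely analogous to "black with white writing").

```python
import re

# Hypothetical rule table: label -> pattern for text to be redacted.
RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def auto_redact(text, enabled=("SSN", "DOB"), labeled=True):
    """Replace every match of each enabled rule with a redaction marker."""
    for label in enabled:
        marker = f"[REDACTED {label}]" if labeled else "[REDACTED]"
        text = RULES[label].sub(marker, text)
    return text


record = "SSN 123-45-6789, DOB 1/2/1980"
print(auto_redact(record))
print(auto_redact(record, enabled=("SSN",), labeled=False))
```

Because each rule is applied with a global substitution, the number of redactions in a single document is not limited. Image-level redaction (drawing boxes over page images) would need a separate mechanism that this text-only sketch does not cover.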
20
What kind of redaction makes sense?
Should redactions be visible: always, sometimes or never?
Does someone need to approve or override system redactions?
21
What is Data Visualization? (aka Data Presentation)
Simple visual representation of relationships and patterns in document data
Common examples
  Graph sales over time
  Distribution by ethnicity
  Word Clouds & Heat Maps
  USA Today-style graphics
Use of charts, graphs, dashboards, animation and sound to help convey important connections
“Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context.” - TechTarget
22
Data Visualization examples you already know
23
Data Visualization of Document Data Mining
BlackCat Screen Shots
24
Data Mining with Redactions (PowerHouse QCUI Main Screen)
Screen capture showing details of PowerHouse QCUI (Quality Control User Interface).
PH AutoCoder has previously filled in these fields
Quality Control environment customized for each matter.
Full record-by-record and field-by-field display, as well as many automated tools to improve throughput & efficiency.
QC Analysts edit content right in the document view.
PH AutoRedaction of private identification information (PII).
25
Understanding the Basics of a Contracts Data Mining Project - Vocabulary
Important Terms
  Backfile
  Day Forward
  Custodians & Paths
  Rules & Confidences
  Exception Handling
  Data flow model
  Maintenance & Tuning
  Identical Duplicates & Near Duplicates
  Data Visualization
  Taxonomy
  Data Obsolescence
  AutoRedaction
26
Typical PII Data Mining Project Workflow
Typical Phase 1: reduced scope, offsite/offline backfile, limited users
  Project design & scoping
  Budgetary approval
  Initial data transfer
  Rules creation & testing
  Limited access operation
Typical Phase 2: full scope, onsite full backfile, full users
  Full data transfer/access
  Full rules creation & testing
  Complete backfile processing
  Full backfile operation
Typical Phase 3: full scope, onsite full backfile + day forward, full users
  Integration with live systems
  Full Day Forward operation
  Admin & User Training
  Ongoing Maintenance
27
Tips & Tools for Getting Your PII Data Mining Project Started
Start at the departmental level
  Identify 3 critical pain points in that department’s document usage/management/etc.
  Ex: classifying & managing departing old/inherited documents; creating standardized PII management terms; or identifying PII exposure areas
  Pick a department that is already one of the critical buy-in parties: legal, procurement or marketing
Start with a financially & logistically palatable Phase 1
  Examples (< $30K, 1-5 parties affected, 20% of ultimate work spec)
  Keep onsite system installations to a minimum
  Work on Backlog first – before Day Forward (new) files
Have an end point in mind
  Where/How will PII ultimately be stored?
  What is the ending file structure?
  How will new documents revise the existing taxonomy?
Remember that PII is a cross-population issue, not just single documents
  Effectively all file types & purposes
Have a project champion
  The litigation matter has the senior partner. Who is driving the PII data mining project?
  Who is the point person, and internal advocate?
28
Things to Watch Out For
Be prepared for “surprise” content
  Most organizations hold onto key information forever. Be prepared for defunct companies, groups, policies, provisions, etc.
  Files may be anything
  Document content may say anything
  Suggestion: start with some basic rules about what’s in/out for the analysis population before the project “officially” starts.
  Make sure you know what the obligations are once certain types of document content & patterns of behavior are made apparent.
Data mining & management can have “Big Brother” overtones
  Content oversight makes people nervous.
  Suggestion: Share rules & classification criteria with those concerned enough to ask about it. Be transparent.
Once senior mgmt learns about the ability to monitor, track & predict behavior, they will want regular reporting on these topics
  Suggestion: Make sure your analysis & classification tools include easy reporting & monitoring of system behavior & usage patterns
29
More Things to Watch Out For
System needs & priorities will change over time
  Unlike discovery, which has a fixed time window for document collection, PII data mining typically endures forever
  What is acceptable today may not be tomorrow, when other concerns dominate or additional material is added
  Suggestion: make sure your systems & workflow are flexible enough to add or delete processing stages, adapt rule confidences, and grow with your needs.
  Look for systems that have a “tuning” component.
Don’t forget other related content stores
  Stored contracts and agreements and attachments
  Field office documents
Remember: PII is probably lurking in many of the documents that your organization has likely kept for several decades or more.
  (Organization is easier to swallow than deletion/removal.)
31
Typical Problems Valora Solves
Legal/Litigation/eDiscovery Problems
  Too many documents to review, cull & produce by hand
  Cost-effective alternative solutions to contract attorney & offshore labor “armies”
  Missing, poor, or ineffective metadata
  Re-unitization, organization, indexing & redacting of documents
  Bridging multi-language document populations to English
Records Management Problems
  Help automate defensible deletion efforts for IG
  Organize & control loose documents on shared drives, desktops, networks & devices
  Eliminate expensive and information-poor storage options
  Serve as automated intake for multiple content generation sources
Business Intelligence Problems
  Organize & control decades of contracts & agreements
  Provide brand integrity/protection data mining of public/private documents
  Forecast & trending of topics, people & locations over time
  Loose, shared files analysis & control
Health Care Problems
  Heavy expense & time converting hardcopy medical records to EMRs/EHRs
  Cannot keep up with fax server data collection
  Cost effective alternative solutions to “armies” of temp data entry coders
32
Valora Technologies, Inc.
Thank You!
For More Information:
Valora Technologies, Inc.
101 Great Road, Suite 220
Bedford, MA 01730