Preventing Information Leaks in Email Vitor Text Learning Group Meeting Jan 18, 2007 – SCS/CMU.

Presentation transcript:

Preventing Information Leaks in Email
Vitor
Text Learning Group Meeting, Jan 18, 2007 – SCS/CMU

Outline
1. Motivation
2. Idea and method: leak criteria, text-based baselines, cross-validation, network features
3. Results
4. Finding real leaks in the Enron data
5. Predicting real leaks in the Enron data: smoothing the leak criteria
6. Related work
7. Conclusions

Information Leaks
• What's being leaked? Credit card numbers, new product information, Social Security numbers, software pre-release versions, business strategy, health records, etc.
• Multi-million dollar industry (ILDP): Information Leakage Detection and Prevention (from Wikipedia); anonymity and privacy of data.

Information Leaks Using Email
• Hard to estimate, but according to PortAuthority Technologies: (figure: how data is being leaked)

Leaks make good headlines. Just google it…
• California Power-Buying Data Disclosed in Misdirected Email
• Leaked email exposes MS charity as PR exercise
• Bush Glad FEMA Took Blame for Katrina, According to Leaked Email

More leaks in the headlines
• Dell leaked email shows channel plans: direct threat haunts dealers. A leaked email reveals Dell wants to get closer to UK resellers.
• Business group say Liberals handled leaked email badly.
• Is Leaked Email a SCO-Microsoft Connection?
• “Leaked email may be behind Morgan Stanley's Asia economist's sudden resignation”

Detecting Email Leaks
• Goal: detect emails accidentally sent to the wrong person.
• Idea: generate artificial leaks. Leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
• Method: look for outliers.
• Email leak: a message accidentally sent to the wrong person.

Avoiding Expensive Errors
• Method
  - Create simulated/artificial recipients.
  - Build a model for (msg.recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list).
  - Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
  - Rank potential outliers; detect the outlier and warn the user based on confidence.
• Ranking: Rec_6, Rec_2, …, Rec_K, Rec_5, ordered by P(rec_t) from most likely outlier to least likely outlier.
• P(rec_t) = probability that recipient t is an outlier given the message text and the other recipients in the message.

Method: Flowchart of the Leak Detection Application
• Components: User, User's Database, New Composed Message, Privacy Policy Module, Language Models, Leak Classifier, List of Potential Leaks.
• User's database: all messages sent and received by the user.
• One module produces the simulated outliers. It simulates the most common types of mistake that can cause a leak, for instance addresses with the same initial letters or addresses with very close spelling.
• The leak classifier is trained with real data combined with simulated outliers.

Leak Criteria: How to Generate (Artificial) Outliers
• Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
• In this paper, we adopted the 3g-address criterion: on each trial, one of the message recipients is randomly chosen and an outlier is generated according to the 3g-address rule; otherwise (Else), an address book entry is randomly selected.
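The generation step can be sketched roughly as follows. This is a minimal sketch, not the authors' exact rule: the slide's 3g-address rule did not survive in the transcript, so the code assumes that a "3g-address" look-alike is an address-book entry sharing its first three characters with the chosen recipient (the flowchart slide mentions "addresses with the same initial letters"). The function names and the `rng` parameter are hypothetical.

```python
import random


def three_gram_candidates(address, address_book):
    """Address-book entries that share the first three characters with `address`.

    Assumption: this is what "3g-address" means; the slide does not spell it out.
    """
    prefix = address[:3].lower()
    return [a for a in address_book
            if a != address and a.lower().startswith(prefix)]


def simulate_leak(recipients, address_book, rng):
    """Pick one simulated leak recipient for a message, or None if impossible.

    One true recipient is chosen at random. If it has look-alike entries that
    are not already recipients, one of them becomes the leak; otherwise (the
    "Else" branch on the slide) a random address-book entry is used.
    """
    seed = rng.choice(recipients)
    lookalikes = [a for a in three_gram_candidates(seed, address_book)
                  if a not in recipients]
    if lookalikes:
        return rng.choice(lookalikes)
    pool = [a for a in address_book if a not in recipients]
    return rng.choice(pool) if pool else None


if __name__ == "__main__":
    book = ["john.smith@x.com", "john.doe@x.com", "mary@x.com", "zed@x.com"]
    print(simulate_leak(["john.smith@x.com"], book, random.Random(0)))
```

The simulated address is then appended to the true recipient list, giving the classifier one known outlier per training message.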

Dataset: Enron Email Collection
• Why?
  - Large: thousands of messages
  - Natural email, not lists
  - Real work environment
  - Free
  - No privacy concerns
  - More than 100 users (with sent + received messages)

Enron Data Preprocessing 1
• Set up a realistic temporal split: for each user, the 10% most recent sent messages are used as the test set.
• All users had their address books extracted: the list of all recipients in the sent messages.
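The two preprocessing steps above might look like this in code. It is a sketch under stated assumptions: the `Message` record, its field names, and the rounding of the 10% cut are hypothetical, and the slide does not say whether the address book is built from training messages only or from all sent messages, so the helper simply uses whatever list it is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    timestamp: int      # any sortable send time (hypothetical field)
    recipients: tuple   # addresses the message was sent to
    text: str           # subject and body


def temporal_split(sent, test_fraction=0.10):
    """Split a user's sent messages so the most recent ones form the test set."""
    ordered = sorted(sent, key=lambda m: m.timestamp)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def address_book(sent):
    """All distinct recipients that appear in the given sent messages."""
    return sorted({r for m in sent for r in m.recipients})
```

Splitting by time rather than at random keeps the evaluation realistic: the model never sees messages written after the ones it is tested on.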

Enron Data Preprocessing 2
• ISI version of Enron: repeated messages and inconsistencies removed.
• Main Enron addresses disambiguated, using the list provided by Corrada-Emmanuel from UMass.
• Bag-of-words: messages were represented as the union of the BOW of the body and the BOW of the subject.
• Some stop words were removed.
• Self-addressed messages were removed.

Experiments: Using Textual Features Only
• Three baseline methods:
  - Random: rank recipient addresses randomly.
  - Cosine or TfIdf centroid: create a “TfIdf centroid” for each user in the address book. A user1-centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to the cosine similarity between the test message and each centroid.
  - Knn-30: given a test message, get the 30 most similar messages in the training set. Rank according to the “sum of similarities” of a given user on the 30-message set.
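The centroid and Knn baselines can be sketched in plain Python as below. The slides do not give the exact TfIdf weighting, so the formula here (raw term count times `log(N/df) + 1`) is an assumption, as are all function names. In both baselines a recipient with a low score is a poor textual match for the message, which makes it the more likely outlier.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """docs: list of token lists. Returns {term: idf} (assumed smoothing: +1)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def vectorize(tokens, idf):
    """Sparse TfIdf vector; terms unseen in training are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_scores(train_vecs, train_recipients, query):
    """Cosine between the query and each recipient's summed TfIdf vector."""
    centroids = defaultdict(Counter)
    for vec, recips in zip(train_vecs, train_recipients):
        for r in recips:
            centroids[r].update(vec)  # element-wise sum of weights
    return {r: cosine(query, c) for r, c in centroids.items()}


def knn_scores(train_vecs, train_recipients, query, k=30):
    """Sum of similarities per recipient over the k most similar messages."""
    sims = sorted(((cosine(query, v), i) for i, v in enumerate(train_vecs)),
                  reverse=True)[:k]
    scores = defaultdict(float)
    for sim, i in sims:
        for r in train_recipients[i]:
            scores[r] += sim
    return dict(scores)
```

A typical use is to fit `fit_idf` on the training messages, vectorize the composed message with the same `idf`, and sort the message's recipients by ascending score.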

Experiments: Using Textual Features Only
• Leak prediction results over 10 trials. On each trial, a different set of outliers is generated.

Network Features
• How frequently a recipient was addressed
• How these recipients co-occurred in the training set

Using Network Features
1. Frequency features
  - Number of received messages (from this user)
  - Number of sent messages (to this user)
  - Number of sent + received messages
2. Co-occurrence features
  - Number of times a user co-occurred with all other recipients. “Co-occur” means that two recipients were addressed in the same message in the training set.
3. Max3g features
  - For each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature. Scores come from the CV10 procedure. Leak-recipient scores are likely to be smaller than their highest 3g-address score.
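The frequency and co-occurrence features can be computed from simple counts, as in the sketch below. The data layout (recipient lists for sent messages, sender addresses for received ones) and every name in the code are assumptions made for illustration; the Max3g feature is left out because it depends on the cross-validated textual scores.

```python
from collections import Counter
from itertools import combinations


def build_counts(sent, received):
    """sent: list of recipient lists from the user's sent training messages.
    received: list of sender addresses from the user's received messages."""
    sent_to = Counter(r for recips in sent for r in set(recips))
    received_from = Counter(received)
    cooc = Counter()
    for recips in sent:
        # Count each unordered pair of recipients once per message.
        for a, b in combinations(sorted(set(recips)), 2):
            cooc[(a, b)] += 1
    return sent_to, received_from, cooc


def network_features(candidate, recipients, sent_to, received_from, cooc):
    """Features for one candidate recipient of a newly composed message."""
    others = [r for r in recipients if r != candidate]
    co = sum(cooc[tuple(sorted((candidate, o)))] for o in others)
    return {
        "sent": sent_to[candidate],
        "received": received_from[candidate],
        "sent_plus_received": sent_to[candidate] + received_from[candidate],
        "cooccurrence": co,
    }
```

An address that was rarely written to and never appeared together with the other recipients gets low values on all four features, which is the pattern expected of a leak.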

Combining Textual Features with Network Features: Cross-Validation
• Training
  - Use Knn-30 in a 10-fold cross-validation setting to get the “textual score” of each user for all training messages.
  - Turn each training example into |R| binary examples, where |R| is the number of recipients of the message: |R|-1 positive (the real recipients) and 1 negative (the leak-recipient).
  - Augment the “textual score” with network features.
  - Quantize features.
  - Train a classifier: VP5, a classification-based ranking scheme (VP5 = Voted Perceptron with 5 passes over the training set).
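A compact sketch of the last training steps follows: expanding one message into |R| binary examples and training a voted perceptron with 5 passes. It follows the standard voted-perceptron scheme (keep every intermediate weight vector together with the number of examples it survived), which may differ in detail from the authors' VP5 implementation. The explicit bias term, the feature layout, and all names are assumptions, and the quantization step is omitted.

```python
def expand_message(recipient_features, leak):
    """recipient_features: {address: [textual_score, network features...]}.
    Returns |R| examples: +1 for real recipients, -1 for the simulated leak."""
    return [(feats, -1 if addr == leak else +1)
            for addr, feats in recipient_features.items()]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_voted_perceptron(examples, epochs=5):
    """examples: list of (features, label), label in {+1, -1}.
    Returns a list of (weights, bias, survival_count) voters."""
    w, b, count = [0.0] * len(examples[0][0]), 0.0, 0
    voters = []
    for _ in range(epochs):
        for x, y in examples:
            if y * (dot(w, x) + b) <= 0:        # mistake: retire current vector
                if count:
                    voters.append((w[:], b, count))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                count = 1
            else:                               # correct: vector survives once more
                count += 1
    voters.append((w[:], b, count))
    return voters


def vote_score(voters, x):
    """Weighted vote; higher means "more like a real recipient"."""
    return sum(c * (1 if dot(w, x) + b > 0 else -1) for w, b, c in voters)
```

To rank the recipients of a new message, compute `vote_score` for each one and flag the lowest-scoring address as the most likely leak.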

Results: Textual+Network Features

Finding Real Leaks in Enron
• How can we find them? Grep for “mistake”, “sorry” or “accident”. We were looking for sentences like “Sorry. Sent this to you by mistake. Please disregard.”, “I accidentally send you this reminder”, etc.
• How many can we find? Dozens of cases. Unfortunately, most of these cases originated from non-Enron email addresses or from an Enron address that is not one of the 151 Enron users whose messages were collected.
• Our method requires a collection of sent (+ received) messages from a user. Only 150 Enron users.
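The keyword search described above could be done with a small script like the one below. It is only an illustration of the "grep" step: the regular expression, the sentence splitting, and the `(message_id, body)` input format are assumptions, and every hit still needs manual inspection to decide whether it reports a real leak.

```python
import re

# Matches "mistake", "mistakenly", "sorry", "accident", "accidentally", ...
LEAK_HINT = re.compile(r"\b(mistake\w*|sorry|accident\w*)\b", re.IGNORECASE)


def find_candidate_leaks(messages):
    """messages: iterable of (message_id, body).
    Returns (message_id, sentence) pairs for sentences containing a hint word."""
    hits = []
    for message_id, body in messages:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", body):
            if LEAK_HINT.search(sentence):
                hits.append((message_id, sentence.strip()))
    return hits
```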

Finding Real Leaks in Enron
• Found 2 good cases:
  1. Message germanyc/sent/930: 20 recipients, one of which is the leak.
  2. Message kitchen-l/sent items/497: 44 recipients, one of which is the leak.
• Prepared the training data accordingly (90/10 split), with no simulated leaks added.

Results: Finding Real Leaks in Enron
• Very disappointing!
• Reason: the two leak addresses were never observed in the training set.
• [Average rank, 100 trials]

“Smoothing” the Leak Generation
• Else: randomly select an address book entry.
• Generate a random email address NOT in the address book.
• Sampling from random unseen recipients with a given probability (the mixture parameter).
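One way to read the smoothing step is as a two-way mixture, sketched below. The parameter name `p_unseen` stands in for the mixture parameter, whose symbol was lost in the transcript; the `unseen_pool` argument (for example, addresses taken from other users' mail) is a hypothetical source of unseen addresses, since the slides do not say how they are generated. For brevity the known-address branch uses only the "randomly select an address book entry" fallback rather than the full 3g-address rule.

```python
import random


def smoothed_leak(recipients, address_book, unseen_pool, p_unseen, rng):
    """Pick a simulated leak, sometimes from addresses never seen by this user.

    With probability p_unseen the leak is an address that is NOT in the
    address book; otherwise it is a random address-book entry that is not
    already a recipient. Returns None when no candidate exists.
    """
    known = set(address_book)
    if rng.random() < p_unseen:
        candidates = [a for a in unseen_pool if a not in known]
    else:
        candidates = [a for a in address_book if a not in recipients]
    return rng.choice(candidates) if candidates else None
```

Training on some unseen leak addresses lets the classifier learn what a recipient with no history looks like, which is exactly the situation in the two real Enron cases.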

Some Results: Kitchen-l has 4 unseen addresses out of the 44 recipients, Germany-c has only one.

Mixture parameter:

Back to the simulated leaks:

What's Next
• Modeling: a better, more elegant model
• Server-side application: predict based on all users on the mail server; in companies, use information from all users; privacy issues
• Integration with cc-prediction

Related Work
• Privacy enforcement system: Boufaden et al. (CEAS-2005) used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.
• CC prediction: Pal & McCallum (CEAS-06). Counterpart problem: prediction of the most likely intended recipients of a message. One single user, limited evaluation, not public data.
• Expert finding in email: Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03), Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06), Soboroff, Craswell, de Vries (TREC-Enterprise …). Expert finding task on the W3C corpus.

Thanks! Questions? Comments? Ideas?

Role Data Leak: Bank Loses IPO
• Deutsche Bank has lost its spot among the underwriters of Hertz Global Holdings Inc.'s initial public offering after several emails discussing the $1.5 billion initial public offering were inadvertently sent by the bank. This security breach will not only affect them financially, but will no doubt weaken their ability to capture new business in the future. A simple data security policy within Protect Enterprise Suite would have stopped this leak from occurring.
• Source: Bloomberg.com

Cases of Malicious Leaks
• In October 2002, an email sent from Merrill Lynch to Standard & Poor's in which it requested an assessment of Commerzbank was leaked, causing the latter to issue a statement regarding its financial robustness.
• In October 2002, an internal Dell Computer document regarding its plan to enter the PDA market was leaked and posted on a French Web site.
• In February 2004, portions of the Windows 2000 and Windows NT 4 source code databases were leaked, apparently by one of its outsourcers for code development.
• In September 2004, a former helpdesk employee at Teledata Communications pleaded guilty to a scheme to steal and sell 30,000 consumer credit reports of the company's customers.
• In October 2004, confidential information about 145,000 American residents was leaked from identification and credential verification services provider ChoicePoint. The company registered $11.4 million in charges related to this incident.
• In December 2004, Apple filed a lawsuit against three members of its Apple Developer Connection network, who allegedly distributed a pre-release version of "Tiger," the company's next major Mac OS X release, through the P2P filesharing network BitTorrent.
• In June 2005 it was reported that 40 million credit cards of all brands had been hacked at credit card processor CardSystems, after files containing 239,00 account numbers were downloaded by criminals.

Sun Hung Kai leak
• Another classic example of a stupendous security breach, this time by Sun Hung Kai’s online brokerage operation, SHK Online.
• Hundreds of account holders received the following email, sent on April 4:
  Dear Client,
  We notice that there has been no securities trading activities in your account with SHK ONLINE (SECURITIES) LIMITED for a long time and the account is currently showing a ZERO balance. As part of our company’s regular account maintenance, we will classify such accounts as INACTIVE. Should the account remain in such INACTIVE status without any balance and/or securities trading activities by 4 May 2006, your account will be closed automatically without further notice.
  Should you wish to reactivate your account, please contact our customer service hotline at (852) [...] or email us at [...] as soon as possible for assistance.
  Regards,
  Customer Service Department
  SHK Online (Securities) Ltd
  Level 11, One Pacific Place, 88 Queensway, Hong Kong.
  Tel: (852) [...] Fax: (852) [...]
• “What’s wrong with this?” I hear you ask. And how do I know how many people received it?
• What is wrong is that the email addresses of the recipients were all congregated together in the TO: field. The recipients were not BCC’d, nor was the message sent newsletter-style using a mailer.