Preventing Information Leaks in Email Vitor Text Learning Group Meeting Jan 18, 2007 – SCS/CMU.


1 Preventing Information Leaks in Email Vitor Text Learning Group Meeting Jan 18, 2007 – SCS/CMU

2 Outline
1. Motivation
2. Idea and method: leak criteria, text-based baselines, cross-validation, network features
3. Results
4. Finding real leaks in the Enron data
5. Predicting real leaks in the Enron data: smoothing the leak criteria
6. Related work
7. Conclusions

3 Information Leaks
• What's being leaked? Credit card numbers, new product information, Social Security numbers, pre-release software versions, business strategy, health records, etc.
• Multi-million dollar industry: Information Leakage Detection and Prevention (ILDP), anonymity and privacy of data (from Wikipedia)

4 Information Leak using Email
• Hard to estimate, but according to PortAuthority Technologies: [Chart: how data is being leaked]

5 Email leaks make good headlines. Just google it…
• California Power-Buying Data Disclosed in Misdirected E-Mail
• Leaked email exposes MS charity as PR exercise
• Bush Glad FEMA Took Blame for Katrina, According to Leaked Email

6 More email leaks in the headlines
• Dell leaked email shows channel plans - Direct threat haunts dealers - A leaked email reveals Dell wants to get closer to UK resellers.
• Business group say Liberals handled leaked email badly.
• Is Leaked eMail a SCO-Microsoft Connection?
• “Leaked email may be behind Morgan Stanley's Asia economist's sudden resignation”

7 Detecting Email Leaks
• Email leak: an email accidentally sent to the wrong person
• Goal: detect emails accidentally sent to the wrong person
• Idea: generate artificial leaks. Email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
• Method: look for outliers.

8 Avoiding Expensive Email Errors
• Method
  - Create simulated (artificial) email recipients.
  - Build a model for (msg.recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list).
  - Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
  - Rank potential outliers: detect the outlier and warn the user based on confidence.
• P(rec_t) = probability that recipient t is an outlier, given the message text and the other recipients in the message.
• [Figure: recipients Rec_6, Rec_2, …, Rec_K, Rec_5 ranked by P(rec_t), from most likely outlier to least likely outlier]

9 Method: flowchart of the leak detection application
• [Flowchart components: User; User's Email Database; New Composed Message; Privacy Policy Module; Language Models; Leak Classifier; List of Potential Leaks]
• User's Email Database: all messages sent and received by the user.
• The leak classifier is trained with real email data combined with simulated outliers.
• One module produces the simulated outliers. It simulates the most common types of mistake that can cause a leak, for instance email addresses with the same initial letters, or addresses with very close spelling.

10 Leak Criteria: how to generate (artificial) outliers
• Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
• In this paper, we adopted the 3g-address criteria: on each trial, one of the msg recipients is randomly chosen and an outlier is generated according to:
  [Flowchart: steps 1, 2, 3 illustrated with the address Marina.wang@enron.com] Else: randomly select an address book entry
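The leak-generation step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the slide's flowchart did not survive in the transcript, so the definition of the 3g-address list used here (address book entries sharing the first three characters with the chosen recipient) is an assumption, and the function names `three_gram_candidates` and `simulate_leak` are hypothetical. Only the fallback ("else: randomly select an address book entry") comes directly from the slide.

```python
import random


def three_gram_candidates(address, address_book):
    """Hypothetical 3g-address list: other entries sharing the first 3 characters."""
    prefix = address.lower()[:3]
    return [a for a in address_book
            if a.lower() != address.lower() and a.lower().startswith(prefix)]


def simulate_leak(recipients, address_book, rng=random):
    """Pick one true recipient at random and return a simulated leak address.

    Returns None when every address book entry is already a recipient.
    """
    target = rng.choice(recipients)
    already = {r.lower() for r in recipients}
    candidates = [a for a in three_gram_candidates(target, address_book)
                  if a.lower() not in already]
    if not candidates:
        # Fallback stated on the slide: randomly select an address book entry.
        candidates = [a for a in address_book if a.lower() not in already]
    if not candidates:
        return None
    return rng.choice(candidates)


book = ["marina.wang@enron.com", "mark.taylor@enron.com", "john.doe@enron.com"]
leak = simulate_leak(["marina.wang@enron.com"], book, random.Random(0))
```

With this toy address book, the only entry sharing the prefix "mar" is mark.taylor@enron.com, so that address is returned as the simulated leak.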


12 Dataset: Enron Email Collection
• Why?
  - Large: thousands of messages
  - Natural email, not email lists
  - Real work environment
  - Free
  - No privacy concerns
  - More than 100 users (with sent+received msgs)

13 Enron Data Preprocessing 1
• Set up a realistic temporal split: for each user, the 10% most recent sent messages are used as the test set.
• Every user's address book was extracted: the list of all recipients in that user's sent messages.
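A small sketch of this preprocessing step, under stated assumptions: each message is a dictionary with hypothetical `date` and `recipients` keys, and the 10% test size is obtained by rounding (the slide does not say how fractions are handled). The function names are illustrative only.

```python
def temporal_split(sent_messages, test_fraction=0.10):
    """Split one user's sent messages into (train, test), newest messages as test."""
    ordered = sorted(sent_messages, key=lambda m: m["date"])
    n_test = round(len(ordered) * test_fraction)  # rounding rule is an assumption
    split = len(ordered) - n_test
    return ordered[:split], ordered[split:]


def extract_address_book(sent_messages):
    """Address book: every recipient seen in the given sent messages."""
    book = set()
    for message in sent_messages:
        book.update(address.lower() for address in message["recipients"])
    return sorted(book)


# Toy data: ten messages with integer "dates" 1..10.
messages = [{"date": d, "recipients": [f"user{d % 3}@enron.com"]} for d in range(1, 11)]
train, test = temporal_split(messages)
address_book = extract_address_book(messages)
```

Splitting by time rather than at random keeps the evaluation realistic: the model only sees messages that were sent before the ones it is tested on.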

14 Enron Data Preprocessing 2
• ISI version of Enron: repeated messages and inconsistencies removed
• Main Enron addresses disambiguated, using the list provided by Corrada-Emmanuel from UMass
• Bag-of-words: messages were represented as the union of the BOW of the body and the BOW of the subject
• Some stop words removed
• Self-addressed messages were removed

15 Experiments: using Textual Features only
• Three baseline methods:
  - Random: rank recipient addresses randomly.
  - Cosine or TfIdf Centroid: create a “TfIdf centroid” for each user in the Address Book. The centroid of user1 is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to the cosine similarity between the test message and each centroid.
  - Knn-30: given a test msg, get the 30 most similar msgs in the training set. Rank each user according to the “sum of similarities” of the messages addressed to that user within the 30-msg set.
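The two non-random baselines can be sketched in plain Python as below. This is an illustrative reading of the slide, not the original implementation: the exact TfIdf weighting is not given, so the `log(N/df) + 1` form is an assumption, and all function names are hypothetical. Recipients with the lowest score are treated as the most likely outliers.

```python
import math
from collections import Counter, defaultdict


def compute_idf(token_lists):
    """Inverse document frequency; the +1 smoothing is an assumption."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    n = len(token_lists)
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_scores(test_tokens, train, idf):
    """TfIdf centroid: sum of the training vectors addressed to each user."""
    centroids = defaultdict(lambda: defaultdict(float))
    for tokens, recipients in train:
        vector = tfidf(tokens, idf)
        for recipient in recipients:
            for term, weight in vector.items():
                centroids[recipient][term] += weight
    query = tfidf(test_tokens, idf)
    return {r: cosine(query, c) for r, c in centroids.items()}


def knn_scores(test_tokens, train, idf, k=30):
    """Knn: sum of similarities of the k nearest training messages, per recipient."""
    query = tfidf(test_tokens, idf)
    sims = [(cosine(query, tfidf(tokens, idf)), recipients) for tokens, recipients in train]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, recipients in top:
        for recipient in recipients:
            scores[recipient] += sim
    return dict(scores)


def rank_outliers(recipients, scores):
    """Most likely outlier first: lowest textual score."""
    return sorted(recipients, key=lambda r: scores.get(r, 0.0))


train = [(["gas", "pipeline", "contract"], ["alice"]),
         (["gas", "price", "contract"], ["alice"]),
         (["lunch", "friday", "pizza"], ["bob"])]
idf = compute_idf([tokens for tokens, _ in train])
message = ["gas", "contract", "pipeline"]
ranking = rank_outliers(["alice", "bob"], knn_scores(message, train, idf, k=2))
```

In this toy example the message about gas contracts resembles what was previously sent to alice and shares nothing with what was sent to bob, so bob is ranked as the most likely outlier.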

16 Experiments: using Textual Features only Email Leak Prediction Results: Prec@1 in 10 trials. On each trial, a different set of outliers is generated
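For reference, Prec@1 can be computed as sketched below, together with the Average Rank measure that the deck reports later for the real-leak experiments. The sketch assumes each ranking lists the recipients of one test message with the most likely outlier first; the function names are hypothetical.

```python
def prec_at_1(rankings, true_leaks):
    """Fraction of test messages whose top-ranked recipient is the true leak."""
    hits = sum(1 for ranking, leak in zip(rankings, true_leaks)
               if ranking and ranking[0] == leak)
    return hits / len(true_leaks)


def average_rank(rankings, true_leaks):
    """Mean 1-based position of the true leak in each ranking."""
    ranks = [ranking.index(leak) + 1 for ranking, leak in zip(rankings, true_leaks)]
    return sum(ranks) / len(ranks)


rankings = [["x", "a", "b"], ["a", "x", "b"]]
leaks = ["x", "x"]
p1 = prec_at_1(rankings, leaks)      # 0.5
avg = average_rank(rankings, leaks)  # 1.5
```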

17 Network Features
• How frequently a recipient was addressed
• How these recipients co-occurred in the training set

18 Using Network Features
1. Frequency features
  - Number of received messages (from this user)
  - Number of sent messages (to this user)
  - Number of sent+received messages
2. Co-occurrence features
  - Number of times a user co-occurred with all other recipients. To co-occur means “two recipients were addressed in the same message in the training set”.
3. Max3g features
  - For each recipient R, find Rm (= the address with the max score from the 3g-address list of R), then use score(R) - score(Rm) as a feature. Scores come from the CV10 procedure. Leak-recipient scores are likely to be smaller than their 3g-address highest score.
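A rough sketch of how these three feature groups might be computed. The data layout (lists of recipient lists for sent mail, a list of sender addresses for received mail), the function names, and the default of 0.0 when a recipient has no scored 3g-address neighbours are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_network_counts(sent_recipient_lists, received_senders):
    """Count per-address frequencies and pairwise co-occurrences in training mail."""
    sent_to = Counter()
    cooccur = Counter()
    for recipients in sent_recipient_lists:
        unique = sorted(set(recipients))
        sent_to.update(unique)
        cooccur.update(combinations(unique, 2))  # each unordered pair once per message
    received_from = Counter(received_senders)
    return sent_to, received_from, cooccur


def network_features(candidate, other_recipients, sent_to, received_from, cooccur):
    """Frequency and co-occurrence features for one candidate recipient."""
    co = sum(cooccur[tuple(sorted((candidate, other)))]
             for other in other_recipients if other != candidate)
    sent = sent_to[candidate]
    received = received_from[candidate]
    return {"sent": sent, "received": received,
            "sent_plus_received": sent + received, "cooccurrence": co}


def max3g_feature(candidate, three_g_list, scores):
    """score(R) - score(Rm), where Rm is the best-scoring address in R's 3g-address list."""
    rivals = [scores[a] for a in three_g_list if a in scores]
    if not rivals:
        return 0.0  # assumption: no scored neighbour means no evidence either way
    return scores.get(candidate, 0.0) - max(rivals)


sent = [["a", "b"], ["a", "b", "c"], ["a"]]
received = ["a", "c", "c"]
sent_to, received_from, cooccur = build_network_counts(sent, received)
features = network_features("a", ["b", "c"], sent_to, received_from, cooccur)
```

A strongly negative Max3g value suggests that a similarly spelled address would have been a much better fit for the message, which is the pattern expected of a leak-recipient.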

19 To combine textual features with network features: Crossvalidation
• Training
  - Use Knn-30 in a 10-fold cross-validation setting to get the “textual score” of each user for all training messages.
  - Turn each training example into |R| binary examples, where |R| is the number of recipients of the message: |R|-1 positive (the real recipients) and 1 negative (the leak-recipient).
  - Augment the “textual score” with network features.
  - Quantize features.
  - Train a classifier: VP5, a classification-based ranking scheme (VP5 = Voted Perceptron with 5 passes over the training set).
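The training recipe can be sketched as follows. This is a simplified illustration under several assumptions: the feature vector is just a textual score plus one network count, the quantization step is omitted, and the voted perceptron follows the standard Freund and Schapire formulation rather than any specific code used for the talk. All names are hypothetical.

```python
def expand_message(recipient_features, leak):
    """Turn one message into |R| binary examples: +1 real recipients, -1 the leak."""
    return [(features, -1 if recipient == leak else 1)
            for recipient, features in recipient_features.items()]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_voted_perceptron(X, y, passes=5):
    """Return a list of (weights, bias, votes), one entry per intermediate perceptron."""
    w, b, votes = [0.0] * len(X[0]), 0.0, 0
    machines = []
    for _ in range(passes):
        for x, label in zip(X, y):
            if label * (dot(w, x) + b) <= 0:
                # Mistake: store the current perceptron with its survival count.
                machines.append((w[:], b, votes))
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                votes = 1
            else:
                votes += 1
    machines.append((w, b, votes))
    return machines


def vote_score(machines, x):
    """Weighted vote; lower values mean 'more likely a leak'."""
    return sum(votes * (1 if dot(w, x) + b > 0 else -1) for w, b, votes in machines)


# Toy message: features are [textual score, number of messages sent to the address].
message = {"alice": [0.9, 5.0], "bob": [0.8, 3.0], "mallory": [0.1, 0.0]}
examples = expand_message(message, leak="mallory")
X = [features for features, _ in examples]
y = [label for _, label in examples]
vp5 = train_voted_perceptron(X, y, passes=5)
ranking = sorted(message, key=lambda r: vote_score(vp5, message[r]))
```

Sorting recipients by the vote score turns the binary classifier into a ranker, which matches the "classification-based ranking scheme" wording on the slide.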

20 Results: Textual+Network Features

21 Finding Real Leaks in Enron
• How can we find them? Grep for “mistake”, “sorry” or “accident”. We were looking for sentences like “Sorry. Sent this to you by mistake. Please disregard.”, “I accidentally send you this reminder”, etc.
• How many can we find? Dozens of cases. Unfortunately, most of these cases originated from non-Enron email addresses, or from an Enron email address that is not one of the 151 Enron users whose messages were collected.
• Our method requires a collection of sent (+received) messages from a user. Only 150 Enron users.

22 Finding Real Leaks in Enron
• Found 2 good cases:
  1. Message germanyc/sent/930: 20 recipients, leak is alex.perkins@
  2. Message kitchen-l/sent items/497: 44 recipients, leak is rita.wynne@
• Prepared training data accordingly (90/10 split), with no simulated leak added

23 Results: Finding Real Leaks in Enron
• Very disappointing!!
• Reason: alex.perkins@ and rita.wynne@ were never observed in the training set!
• [Table: Prec@1 and Average Rank, 100 trials]

24 “Smoothing” the leak generation
• With probability α: generate a random email address NOT in the Address Book (sampling from random unseen recipients).
• Otherwise, generate the outlier as before: [Flowchart: steps 1, 2, 3 illustrated with the address Marina.wang@enron.com] Else: randomly select an address book entry
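A minimal sketch of the smoothed generator, assuming the mixture works as a simple coin flip: with some probability (called `alpha` here, a name chosen for illustration) an address never seen in the address book is produced, otherwise the unsmoothed criteria are applied. The random address format, the `example.com` domain, and the `base_criteria` callback are all hypothetical.

```python
import random
import string


def random_unseen_address(address_book, rng, domain="example.com"):
    """Generate a random address that is guaranteed not to be in the address book."""
    known = {a.lower() for a in address_book}
    while True:
        local = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        address = f"{local}@{domain}"
        if address not in known:
            return address


def smoothed_leak(recipients, address_book, alpha, base_criteria, rng=random):
    """With probability alpha return an unseen address, else defer to base_criteria."""
    if rng.random() < alpha:
        return random_unseen_address(address_book, rng)
    return base_criteria(recipients, address_book, rng)


def pick_from_book(recipients, address_book, rng):
    """Stand-in for the unsmoothed criteria: any address book entry not already used."""
    return rng.choice([a for a in address_book if a not in recipients])


book = ["marina.wang@enron.com", "mark.taylor@enron.com"]
rng = random.Random(0)
unseen = smoothed_leak(["marina.wang@enron.com"], book, 1.0, pick_from_book, rng)
seen = smoothed_leak(["marina.wang@enron.com"], book, 0.0, pick_from_book, rng)
```

Training on some never-seen addresses addresses the failure reported for the real Enron leaks, where the leaked recipients had never been observed in the training set.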

25 Some Results: Kitchen-l has 4 unseen addresses out of the 44 recipients, Germany-c has only one.

26 Mixture parameter α:


28 Back to the simulated leaks:

29 What's next
• Modeling: a better, more elegant model
• Email server-side application
  - Predict based on all users on the mail server
  - In companies, use info from all email users
  - Privacy issues
• Integration with cc-prediction

30 Related Work
• Email Privacy Enforcement System: Boufaden et al. (CEAS-2005) used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.
• CC Prediction: Pal & McCallum (CEAS-06). Counterpart problem: prediction of the most likely intended recipients of an email msg. One single user, limited evaluation, not public data.
• Expert finding in Email: Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03), Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06), Soboroff, Craswell, de Vries (TREC-Enterprise 2005-06-07…). Expert finding task on the W3C corpus.

31 Thanks! Questions? Comments? Ideas?


33 http://www.workshare.com/company/blog/default.aspx?1=1&postid=18&title=Data-Leak:-Bank-Loses-IPO-Role
Data Leak: Bank Loses IPO
• Deutsche Bank has lost its spot among the underwriters of Hertz Global Holdings Inc.'s initial public offering after several e-mails discussing the $1.5 billion initial public offering were inadvertently sent by the bank. This security breach will not only affect them financially, but will no doubt weaken their ability to capture new business in the future. A simple data security policy within Protect Enterprise Suite would have stopped this leak from occurring.
• Source: Bloomberg.com

34 http://hrwatch.counciloned.com/06&0705/Email.htm

35 Cases of Malicious Leaks
• In October 2002, an email sent from Merrill Lynch to Standard & Poor's in which it requested an assessment of Commerzbank was leaked, causing the latter to issue a statement regarding its financial robustness.
• In October 2002, an internal Dell Computer document regarding its plan to enter the PDA market was leaked and posted on a French Web site.
• In February 2004, portions of the Windows 2000 and Windows NT 4 source code databases were leaked, apparently by one of its outsourcers for code development.
• In September 2004, a former helpdesk employee at Teledata Communications pleaded guilty to a scheme to steal and sell 30,000 consumer credit reports of the company's customers.
• In October 2004, confidential information about 145,000 American residents was leaked from identification and credential verification services provider ChoicePoint. The company registered $11.4 million in charges related to this incident.
• In December 2004, Apple filed a lawsuit against three members of its Apple Developer Connection network, who allegedly distributed a pre-release version of "Tiger," the company's next major Mac OS X release, through the P2P filesharing network BitTorrent.
• In June 2005 it was reported that 40 million credit cards of all brands had been hacked at credit card processor CardSystems, after files containing 239,00 account numbers were downloaded by criminals.

36 http://flagrantharbour.com/?p=206
Sun Hung Kai email leak
• Another classic example of a stupendous security breach, this time by Sun Hung Kai’s online brokerage operation, SHK Online.
• Hundreds of account holders received the following email, sent on April 4:
  Dear Client,
  We notice that there has been no securities trading activities in your account with SHK ONLINE (SECURITIES) LIMITED for a long time and the account is currently showing a ZERO balance. As part of our company’s regular account maintenance, we will classify such accounts as INACTIVE. Should the account remain in such INACTIVE status without any balance and/or securities trading activities by 4 May 2006, your account will be closed automatically without further notice.
  Should you wish to reactivate your account, please contact our customer service hotline at (852) 2822 5001 or email us at enquiry@shkonline.com as soon as possible for assistance.
  Regards,
  Customer Service Department
  SHK Online (Securities) Ltd
  Level 11, One Pacific Place, 88 Queensway, Hong Kong.
  Tel: (852) 2822-5001 Fax: (852) 2822-5998
• “What’s wrong with this?” I hear you ask. And how do I know how many people received it?
• The wrong is that the email addresses of the recipients were all congregated together in the TO: field. The recipients were not BCC’d nor were they sent this newsletter style using a mailer.

