Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung.


1 Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung

2 Alias Definition
Aliases of names
  – Dubya = G.W. Bush
  – Usama = Osama
  – G.W. Bush = the President
  – Osama bin Laden = the Emir, the Prince
Misspelled words
  – Unintentional (typos)
  – Intentional: mortgage = m0rtg@ge (spam)

3 In What Context Do Aliases Occur?
Newspaper articles
Web pages
Spam emails
Any collection of text

4 Link Data Set
A way to represent the context
Composed of a set of names and a set of links
  – Names are extracted from the text
  – Several names can refer to the same entity (“Dubya” and “G.W. Bush”)
  – A link is a collection of names and represents a relationship among them

5 Example
“Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan.”
Links extracted:
  – (Osama bin Laden, Ayman al-Zawahri, al-Qaeda)
  – (Pakistan, Osama bin Laden)
  – (Afghanistan, Osama bin Laden)

6 Graph Representation
[Figure: graph with nodes Osama, al-Qaeda, Ayman, Pakistan, Afghanistan]
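A minimal sketch of how the links from the example slide could be stored and turned into a graph. The slide's figure is not preserved, so connecting every pair of names that share a link is an assumption, and `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

# Links from the example slide; each link is a collection of co-occurring names.
links = [
    ("Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"),
    ("Pakistan", "Osama bin Laden"),
    ("Afghanistan", "Osama bin Laden"),
]

def build_graph(links):
    """Return an adjacency map: name -> set of names sharing at least one link.

    Assumption: a link is expanded into edges between all pairs of its names.
    """
    graph = defaultdict(set)
    for link in links:
        for a, b in combinations(set(link), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

graph = build_graph(links)
print(sorted(graph["Osama bin Laden"]))
```

Keeping the original links alongside the graph matters, because the semantic measures later in the talk count how often names co-occur, which a plain adjacency set does not record.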

7 Advantages
A link data set is easy for computers to process
It mimics the way intelligence communities gather data

8 Alias Detection
Given two names in a link data set, are they aliases (i.e., do they refer to the same entity)?
How do we measure their alias-ness?
Semi-supervised learning

9 Orthographic Measures
String edit distance (SED)
  – Minimum number of insertions, deletions, and substitutions required to transform one name into the other
  – SED(Osama, Usama) = 2
  – SED(Osama, Bush) = 7
  – These two values correspond to counting a substitution as a deletion plus an insertion (cost 2)
  – Intuitive measure
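The definition above can be sketched with the standard dynamic-programming recurrence. The `sub_cost` parameter is an assumption added for illustration: the slide's values (2 and 7) are reproduced when a substitution costs 2, while unit-cost substitutions give 1 and 5.

```python
def string_edit_distance(a, b, sub_cost=1):
    """Edit distance between a and b using a rolling one-row table.

    Insertions and deletions cost 1; a substitution costs sub_cost.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama", sub_cost=2))  # 2
print(string_edit_distance("Osama", "Bush", sub_cost=2))   # 7
print(string_edit_distance("Osama", "Usama"))              # 1
print(string_edit_distance("Osama", "Bush"))               # 5
```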

10 Some Orthographic Measures
String edit distance
Normalized string edit distance
Discretized string edit distance

11 Semantic Measures
But what about aliases such as “the Prince” and “Osama”?
Define the friends of Osama as the names that occur in the same links as Osama
From the link data set, count how often each friend co-occurs with Osama
Intuition: the friends of the Prince look like the friends of Osama
Treat each friends list as a probability vector
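Counting friends can be sketched as follows. The links here are invented for illustration, and `friends` and `to_distribution` are hypothetical helper names; counting each friend once per shared link is an assumption about how occurrences are tallied.

```python
from collections import Counter

def friends(name, links):
    """Count how many links each other name shares with `name`."""
    counts = Counter()
    for link in links:
        members = set(link)
        if name in members:
            counts.update(members - {name})
    return counts

def to_distribution(counts):
    """Turn raw friend counts into a probability vector (name -> probability)."""
    total = sum(counts.values())
    return {friend: n / total for friend, n in counts.items()}

# Hypothetical links, not taken from the thesis data.
links = [
    ("Osama", "al-Qaeda"),
    ("Osama", "al-Qaeda", "CNN"),
    ("Osama", "Islam"),
    ("Bush", "CNN"),
]

osama_friends = friends("Osama", links)
print(osama_friends)
print(to_distribution(osama_friends))
```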

12 Example of Friends
[Figure: Osama's friends with co-occurrence counts: al-Qaeda 10, Islam 5, CNN 2]

13 Comparing Two Friends Lists
Friend      Osama   The Prince
al-Qaeda    10      2
Islam       5       0
CNN         2       8
Music       0       50

14 Some Semantic Measures
Dot product: 10 * 2 + 2 * 8 = 36
Normalized dot product
Common friends: 2 (CNN, al-Qaeda)
KL distance
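A hedged sketch of the four measures. The counts are one reading of the friends-list slide, inferred from the dot product 10 * 2 + 2 * 8 and the two common friends. The slide gives no formulas for the normalized dot product or the KL distance, so cosine normalization and additive smoothing with `epsilon` are assumptions.

```python
import math

# Friend counts as inferred from the slide (assumption about which count belongs to which name).
osama = {"al-Qaeda": 10, "Islam": 5, "CNN": 2}
prince = {"al-Qaeda": 2, "Music": 50, "CNN": 8}

def dot_product(p, q):
    return sum(count * q[name] for name, count in p.items() if name in q)

def normalized_dot_product(p, q):
    """Assumption: normalize by the vector lengths (cosine similarity)."""
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot_product(p, q) / (norm_p * norm_q)

def common_friends(p, q):
    return len(set(p) & set(q))

def kl_distance(p, q, epsilon=1e-3):
    """KL divergence of p from q; epsilon smoothing (an assumption) avoids log(0)."""
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + epsilon * len(vocab)
    q_total = sum(q.values()) + epsilon * len(vocab)
    total = 0.0
    for name in vocab:
        pi = (p.get(name, 0) + epsilon) / p_total
        qi = (q.get(name, 0) + epsilon) / q_total
        total += pi * math.log(pi / qi)
    return total

print(dot_product(osama, prince))     # 36
print(common_friends(osama, prince))  # 2
print(normalized_dot_product(osama, prince))
print(kl_distance(osama, prince))
```

KL divergence is not symmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ; which direction the thesis uses is not stated on the slide.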

15 Classifier
We have a link data set
We have some measures of alias-ness
We can easily hand-pick some examples of aliases
Let’s build a classifier!

16 Classifier Training Set
Positive examples: hand-picked pairs of names in the link data set that are known aliases
Negative examples: randomly picked pairs of names from the same link data set
Calculate the measures for every pair and insert them as attributes into the training set
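The construction above can be sketched as follows. `build_training_set` and the two toy measures are hypothetical stand-ins, and skipping random pairs that happen to be known aliases is an assumption the slide does not state.

```python
import random
from itertools import combinations

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (pair, attributes, label) rows: label 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    positives = [tuple(sorted(pair)) for pair in alias_pairs]
    known = set(positives)
    # Assumption: random pairs that are known aliases are excluded from the negatives.
    candidates = [p for p in combinations(sorted(names), 2) if p not in known]
    negatives = rng.sample(candidates, min(n_negative, len(candidates)))
    rows = []
    for label, pairs in ((1, positives), (0, negatives)):
        for a, b in pairs:
            rows.append(((a, b), [measure(a, b) for measure in measures], label))
    return rows

# Toy measures standing in for the orthographic and semantic attributes.
def length_difference(a, b):
    return abs(len(a) - len(b))

def same_first_letter(a, b):
    return int(a[0].lower() == b[0].lower())

names = ["Osama", "Usama", "Dubya", "G.W. Bush", "Pakistan", "CNN"]
alias_pairs = [("Osama", "Usama"), ("Dubya", "G.W. Bush")]
training_set = build_training_set(
    names, alias_pairs, [length_difference, same_first_letter], n_negative=4
)
for row in training_set:
    print(row)
```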

17 Classifier Example:

18 Classifier: Cross-Validation
Experimented with decision trees, k-nearest neighbors, naïve Bayes, support vector machines, and logistic regression
Logistic regression performed the best
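For readers unfamiliar with logistic regression, here is a minimal pure-Python sketch trained by batch gradient descent. It is not the implementation used in the thesis; the learning rate, epoch count, and the toy attribute rows (edit distance, common friends) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def alias_score(x, weights, bias):
    """Estimated probability that the pair with attributes x is an alias pair."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical attribute rows: [string edit distance, number of common friends].
rows = [[1, 3], [2, 4], [7, 0], [6, 1]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(rows, labels)
print(alias_score([1, 4], weights, bias), alias_score([7, 0], weights, bias))
```

The score is a probability between 0 and 1, which is what makes the later ranking step possible.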

19 Prediction
Given a query name in the link data set with known aliases
Pair the query name with ALL other names
Calculate the attributes for every pair
Run each pair through the classifier and obtain a score (how likely is the pair to be an alias pair?)

20 Example

21 Prediction
Use the score to sort the pairs from most likely to least likely to be aliases
See where the true aliases lie in the sorted list and produce a ROC curve
Evaluate the classifier based on the ROC curve
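The prediction step can be sketched as a ranking function. `rank_candidates` is a hypothetical helper, and `toy_score` (character-set overlap) merely stands in for the trained classifier's score.

```python
def rank_candidates(query, names, score):
    """Pair the query with every other name and sort by descending score."""
    candidates = [name for name in names if name != query]
    return sorted(candidates, key=lambda name: score(query, name), reverse=True)

def toy_score(a, b):
    """Stand-in for the classifier: Jaccard overlap of the lowercase character sets."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

names = ["Osama", "Usama", "Bush", "Pakistan"]
true_aliases = {"Usama"}

ranked = rank_candidates("Osama", names, toy_score)
is_alias = [name in true_aliases for name in ranked]
print(ranked)    # ['Usama', 'Pakistan', 'Bush']
print(is_alias)  # [True, False, False]
```

The `is_alias` list, in ranked order, is exactly the input that the ROC construction walks through.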

22 Summary
Training: true alias pairs (excluding the query name) and random pairs → calculate attributes → train logistic regression
Prediction: query name → calculate attributes → run classifier → ROC curve

23 ROC Curve
Start from (0,0) on the graph
Go down the sorted list
If the name on the list is a true alias, move up one unit along y
If the name on the list is not a true alias, move right one unit along x
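The walk described above translates directly into code. `roc_points` is a hypothetical name, and the input is the ranked list expressed as booleans (True where the name is a true alias).

```python
def roc_points(ranked_is_alias):
    """Trace the unnormalized ROC curve: up for a true alias, right otherwise."""
    x = y = 0
    points = [(0, 0)]
    for is_alias in ranked_is_alias:
        if is_alias:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points

# A perfect ranking puts all true aliases first, so the curve goes straight up, then right.
print(roc_points([True, True, True, False, False, False]))
print(roc_points([True, False, True]))
```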

24 Perfect ROC Example
[Figure: ROC plot with both axes running from 0 to 3]

25 ROC Example
[Figure: ROC plot with both axes running from 0 to 3]

26 ROC: Normalize
[Figure: ROC plot with both axes rescaled to run from 0 to 1]
Balances positive and negative examples
Area under curve (AUC) = 5/9
Makes it possible to average multiple curves
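A sketch of normalization, AUC, and curve averaging. The slide's actual ranking is not recoverable, so the raw points below are a hypothetical ranking (alias, non-alias, alias, non-alias, non-alias, alias) chosen because it reproduces the stated AUC of 5/9. Averaging curves vertically at fixed x positions is also an assumption, and `normalize` assumes at least one true alias and one non-alias.

```python
def normalize(points):
    """Scale raw ROC points so both axes run from 0 to 1."""
    max_x, max_y = points[-1]
    return [(x / max_x, y / max_y) for x, y in points]

def auc(points):
    """Area under a piecewise-linear curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def average_curves(curves, grid):
    """Average several normalized curves vertically at the given x positions."""
    def value_at(points, x):
        return max(y for px, y in points if px <= x)
    return [(x, sum(value_at(c, x) for c in curves) / len(curves)) for x in grid]

# Hypothetical raw curves with 3 true aliases and 3 non-aliases each.
raw = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]
perfect = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]

curve = normalize(raw)
print(auc(curve))                 # approximately 5/9
print(auc(normalize(perfect)))    # 1.0
averaged = average_curves([curve, normalize(perfect)], [0, 0.5, 1])
print(averaged)
```

Because every normalized curve spans the unit square, curves from queries with different numbers of aliases can be compared and averaged on the same axes.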

27 Empirical Results
Tested on one web page link data set and two spam link data sets
Hand-picked aliases for each set

28 Empirical Results
Choose an alias from the set of hand-picked aliases as the query name
Build the classifier from the other hand-picked aliases, i.e. those that are not aliases of the query name
Run prediction and obtain the ROC curve
Repeat for each alias in the set of hand-picked aliases
Average all ROC curves on the normalized axes

29 Evaluation
We want to know how significant each group of attributes is
Train one classifier with only the orthographic attributes
Train another with only the semantic attributes
Train a third with both sets of attributes
Compare the ROC curves and the area under the curve (AUC)

30 Terrorist Data Set
Manually extracted from public web pages
News and articles related to terrorism
Names mentioned in the articles are subjectively linked
Used 919 alias pairs for training

31 Web Page Chart

32 Spam Data Set
Collection of spam emails
HTML tags are filtered out
All words are converted to tokens, with white space as the boundary
Common tokens are filtered out (e.g. “the”, “a”)
Each email represents a link
Each link contains the tokens from the corresponding email
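The preprocessing steps above can be sketched as follows. The tag regex, the stopword list, lowercasing, and dropping repeated tokens are all assumptions for illustration; the thesis's actual filter lists are not given on the slide.

```python
import re

# Hypothetical stopword list; the slide only mentions "the" and "a" as examples.
STOPWORDS = {"the", "a", "of", "on", "your"}
TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag and comment matcher

def email_to_link(text, stopwords=STOPWORDS):
    """Turn one email body into a link: a tuple of distinct, filtered tokens."""
    without_tags = TAG_RE.sub("", text)
    link, seen = [], set()
    for token in without_tags.split():  # white space is the token boundary
        token = token.lower()
        if token not in stopwords and token not in seen:
            seen.add(token)
            link.append(token)
    return tuple(link)

print(email_to_link("Save <b>thousands</b> of dollars on the home of your dreams"))
print(email_to_link("Ref<!--x-->ina<i></i>nce today"))
```

Removing tags before tokenizing means a word that a spammer has split with invisible markup is rejoined into a single token.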

33 Example
Subject: Mortgage rates as low as 2.95%
Ref ina nce to day to as low as 2. 95% Sa ve thou sa nds of dol l ars or b uy the ho me of yo ur dr eams!
Filtered to: (mortgage, rates, low, refinance, today, save, thousands, dollars, home, dreams)

34 Spam I Chart

35 Spam II Chart

36 Conclusion
Orthographic measures work well
Semantic measures are sometimes better and sometimes worse than orthographic measures
Combining both gives the best results
Future work includes adding other measures, such as a phonetic string edit distance
Larger question: many aliases to many names

