Named Entity Extraction: A tool for tracking Internet censorship
Tony Espinoza, Leif A Guillermo, Ronnie Garduño, Veronika Strnadova, Jed Crandall


1 Named Entity Extraction: A tool for tracking Internet censorship Tony Espinoza, Leif A Guillermo, Ronnie Garduño, Veronika Strnadova, Jed Crandall This material is based upon work supported by the National Science Foundation under Grant No. IIS/REU/0755462

2 Outline ● Motivation/Incentive ● Goal ● Strategy ● Results??? ● New Strategy ● Results ● End

3 Part of a Whole This project is part of a larger effort to develop tools for tracking and understanding keyword-based Internet censorship (in China, for now).

4 Named Entity Extraction

5 First Approach ● Parse Wikipedia ● Label All Links ● Segment Every Article ● Associate features with every word/phrase ● Do Magic (Maximum Entropy toolkit) ● Search through output to find most likely legal path
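The "associate features with every word/phrase" step can be sketched as below. This is a minimal illustration, not the authors' code: the feature names are borrowed from the toolkit output shown on slide 9, but their exact meaning there is not stated, so reading the bare number as a token length and the other names as dictionary-membership and punctuation flags is an assumption. `DICTIONARY`, `token_features`, and `context_features` are hypothetical names.

```python
import string

# Hypothetical stand-in for a dictionary of ordinary (non-name) words.
DICTIONARY = {"the", "of", "is", "a", "in", "city", "capital"}


def token_features(token, prefix=""):
    """Describe one token with a few string-valued features."""
    word = token.strip(string.punctuation).lower()
    has_punct = any(ch in string.punctuation for ch in token)
    return [
        f"{prefix}{len(token)}",  # assumed meaning: token length
        f"{prefix}{'has_punctuation' if has_punct else 'no_punctuation'}",
        f"{prefix}{'is_dict_term' if word in DICTIONARY else 'is_not_dict_term'}",
    ]


def context_features(tokens, i):
    """Features for token i plus its immediate left and right neighbours."""
    feats = []
    if i > 0:
        feats += token_features(tokens[i - 1], "prev_")
    feats += token_features(tokens[i])
    if i + 1 < len(tokens):
        feats += token_features(tokens[i + 1], "next_")
    return feats
```

Each token's feature list, paired with its label, would then form one training line for a maximum entropy classifier.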

6 Link Extraction Named Entity Extraction ≈ Link Extraction. Well, for us anyway.
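Treating link extraction as a proxy for named entity extraction means every token of a Wikipedia article can be labeled automatically from its link markup. The sketch below shows one way that labeling might look, using the five label names that appear in the toolkit output on slide 9. The regular expression and the function `label_tokens` are illustrative assumptions, and real wikitext (templates, nested markup, files) needs far more care than this.

```python
import re

# Matches simple [[Target]] and [[Target|anchor text]] links only.
LINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def label_tokens(wikitext):
    """Split wikitext on whitespace and label each token by link position."""
    labeled = []
    pos = 0
    for match in LINK_RE.finditer(wikitext):
        # Plain text before this link.
        for tok in wikitext[pos:match.start()].split():
            labeled.append((tok, "NOT_LINK"))
        # For [[Target|anchor]], keep only the visible anchor text.
        words = match.group(1).split("|")[-1].split()
        if len(words) == 1:
            labeled.append((words[0], "COMPLETE_LINK"))
        else:
            for k, tok in enumerate(words):
                if k == 0:
                    label = "BEGINNING_LINK"
                elif k == len(words) - 1:
                    label = "END_LINK"
                else:
                    label = "MIDDLE_LINK"
                labeled.append((tok, label))
        pos = match.end()
    # Plain text after the last link.
    for tok in wikitext[pos:].split():
        labeled.append((tok, "NOT_LINK"))
    return labeled
```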

7 Maximum Entropy Rundown “Choose a model consistent with all the facts, but otherwise as uniform as possible.” (Berger et al., “A Maximum Entropy Approach to Natural Language Processing”)

8 Why Use Maximum Entropy Maximum entropy is used because the resulting model assumes the least about the data set beyond what the observed features require.
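For reference, the principle can be written out in the notation of the Berger et al. paper, where $\tilde{p}$ is the empirical distribution of the training data, $f_i$ are feature functions, and $\lambda_i$ are learned weights. The slides do not say which formulation the toolkit implements, so this is the textbook conditional form rather than a description of the authors' setup.

```latex
% Choose the most uniform conditional model p(y|x) ...
p^{*} = \arg\max_{p} \; H(p), \qquad
H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)

% ... subject to matching the observed feature expectations ("the facts"):
\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y)
  = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \quad \text{for all } i

% The solution has exponential form:
p^{*}(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x,y) \Big), \qquad
Z(x) = \sum_{y} \exp\Big( \sum_i \lambda_i f_i(x,y) \Big)
```

With no constraints at all, the maximizer is the uniform distribution over labels, which is the sense in which the model "assumes the least".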

9 Maximum Entropy Toolkit Output
For context: prev_7 prev_is_not_dict_term 1 is_not_dict_term next_1 next_has_punctuation
NOT_LINK[0.8590] COMPLETE_LINK[0.0227] BEGINNING_LINK[0.0419] MIDDLE_LINK[0.0347] END_LINK[0.0416]
For context: prev_1 prev_is_not_dict_term 1 has_punctuation next_2 next_no_punctuation next_is_dict_term
NOT_LINK[0.8606] COMPLETE_LINK[0.0239] BEGINNING_LINK[0.0417] MIDDLE_LINK[0.0320] END_LINK[0.0418]
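Because each token gets its own label distribution, picking the top label per token can produce an impossible sequence (for example, a BEGINNING_LINK followed directly by NOT_LINK). The "most likely legal path" step from the approach slides can be sketched as a Viterbi-style search. The legality rules below are an assumption inferred from the label names, not something the slides spell out, and `best_legal_path` is a hypothetical name.

```python
import math

# Assumed legality rules, inferred from the label names:
OUTSIDE = {"NOT_LINK", "COMPLETE_LINK", "END_LINK"}        # no link left open
OPENERS = {"NOT_LINK", "COMPLETE_LINK", "BEGINNING_LINK"}  # legal when no link is open
INSIDE_NEXT = {"MIDDLE_LINK", "END_LINK"}                  # legal while a link is open


def is_legal(prev, cur):
    """True if label `cur` may directly follow label `prev`."""
    return cur in OPENERS if prev in OUTSIDE else cur in INSIDE_NEXT


def best_legal_path(dists):
    """dists: one {label: probability} dict per token.

    Returns the highest-probability label sequence that starts and ends
    outside a link and only uses legal transitions.
    """
    if not dists:
        return []
    # score[label] = best log-probability of a legal path ending in label.
    score = {lab: math.log(p) for lab, p in dists[0].items()
             if lab in OPENERS and p > 0}
    back = []
    for dist in dists[1:]:
        new_score, pointers = {}, {}
        for cur, p in dist.items():
            if p <= 0:
                continue
            cands = [(s, prev) for prev, s in score.items() if is_legal(prev, cur)]
            if cands:
                best_s, best_prev = max(cands)
                new_score[cur] = best_s + math.log(p)
                pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    finals = {lab: s for lab, s in score.items() if lab in OUTSIDE}
    if not finals:
        raise ValueError("no legal labeling exists")
    label = max(finals, key=finals.get)
    path = [label]
    for pointers in reversed(back):  # follow back-pointers to the first token
        label = pointers[label]
        path.append(label)
    return path[::-1]
```

Log-probabilities are summed rather than multiplying raw probabilities so that long articles do not underflow.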

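10 Measuring Output
True Negative (tn) = correctly labeled “Not Link”
True Positive (tp) = correctly labeled “Link”
False Positive (fp) = labeled “Link” but actually “Not Link”
False Negative (fn) = labeled “Not Link” but actually “Link”
Specificity = tn / (tn + fp)
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
Accuracy = (tp + tn) / (tp + fp + fn + tn)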
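The four outcome counts can be tallied by comparing predicted labels with the labels derived from Wikipedia. The sketch below assumes that any label other than NOT_LINK counts as "Link" for scoring purposes; the slides do not say how the five-way labels were collapsed, and `confusion_counts` is a hypothetical name.

```python
def confusion_counts(gold, predicted):
    """Count tp/tn/fp/fn, treating every label except NOT_LINK as 'Link'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for g, p in zip(gold, predicted):
        gold_link = g != "NOT_LINK"
        pred_link = p != "NOT_LINK"
        if gold_link and pred_link:
            counts["tp"] += 1
        elif not gold_link and not pred_link:
            counts["tn"] += 1
        elif pred_link:
            counts["fp"] += 1  # predicted a link where there is none
        else:
            counts["fn"] += 1  # missed a real link
    return counts
```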

11 The Problem
True Positives: 1238
True Negatives: 14481
False Negatives: 1466
False Positives: 2815
Specificity = 0.8372456059204441
Precision = 0.305452751048606
Recall = 0.4578402366863905
Accuracy = 0.78595
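The figures on this slide follow directly from the four counts and the formulas on slide 10. A short check, with `link_metrics` as a hypothetical helper name:

```python
def link_metrics(tp, tn, fp, fn):
    """Apply the slide 10 formulas to raw outcome counts."""
    return {
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts reported on slide 11 (20,000 labeled tokens in total).
results = link_metrics(tp=1238, tn=14481, fp=2815, fn=1466)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

The accuracy looks respectable only because most tokens are not links. Precision of about 0.31 means that roughly two of every three tokens the model marked as links were wrong, and recall of about 0.46 means that more than half of the real link tokens were missed.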

12 Next Approach First approach Second approach

13 Second Approach ● Parse Wikipedia (Improve) ● Label All Names as Names ● Segment Every Article ● Associate features with every word/phrase ● Do Magic (Maximum Entropy toolkit) ● Search through output to find most likely legal path

14 Desired Results

15 Thank you Jed, Terran, Rafael, Andy, Amy, Dean, and Peiyou Song

16 Questions? Any questions about the presentation or censorship in China?

