Named Entity Extraction: A tool for tracking Internet censorship
Tony Espinoza, Leif A Guillermo, Ronnie Garduño, Veronika Strnadova, Jed Crandall
This material is based upon work supported by the National Science Foundation under Grant No. IIS/REU/0755462.

Outline
● Motivation/Incentive
● Goal
● Strategy
● Results???
● New Strategy
● Results
● End

Part of a Whole
This project is part of a larger effort to develop tools for tracking and understanding keyword-based Internet censorship (in China, for now).

Named Entity Extraction

First Approach
● Parse Wikipedia
● Label All Links
● Segment Every Article
● Associate features with every word/phrase
● Do Magic (Maximum Entropy toolkit)
● Search through output to find most likely legal path
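The "Label All Links" and "Segment Every Article" steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: it assumes plain `[[target|shown text]]` wiki link markup and whitespace tokenization, and the function name `label_tokens` is hypothetical. The five label names are the ones that appear in the toolkit output later in the slides.

```python
import re

# Matches simple, non-nested wiki links such as [[China]] or [[Beijing|the capital]].
LINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def label_tokens(wikitext):
    """Return (token, label) pairs for whitespace-separated tokens.

    Assumption: a one-word link is COMPLETE_LINK; a multi-word link is
    BEGINNING_LINK, zero or more MIDDLE_LINK, then END_LINK.
    """
    labeled = []
    pos = 0
    for match in LINK_RE.finditer(wikitext):
        # Text before the link is not part of any link.
        for tok in wikitext[pos:match.start()].split():
            labeled.append((tok, "NOT_LINK"))
        # For piped links, keep only the text shown to the reader.
        words = match.group(1).split("|")[-1].split()
        if len(words) == 1:
            labeled.append((words[0], "COMPLETE_LINK"))
        else:
            for i, tok in enumerate(words):
                if i == 0:
                    label = "BEGINNING_LINK"
                elif i == len(words) - 1:
                    label = "END_LINK"
                else:
                    label = "MIDDLE_LINK"
                labeled.append((tok, label))
        pos = match.end()
    # Remaining text after the last link.
    for tok in wikitext[pos:].split():
        labeled.append((tok, "NOT_LINK"))
    return labeled


if __name__ == "__main__":
    for token, label in label_tokens("The [[Great Wall of China]] is in [[China]]."):
        print(token, label)
```

Labeled tokens like these would then be paired with per-token features and handed to the maximum entropy toolkit as training data.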

Link Extraction
Named Entity Extraction ≈ Link Extraction
Well, for us anyway.

Maximum Entropy Rundown
“choose a model consistent with all the facts, but otherwise as uniform as possible.”
Berger et al., “A Maximum Entropy Approach to Natural Language Processing”

Why Use Maximum Entropy
Maximum entropy is used because, among all models consistent with the observed data, it picks the one that assumes the least about everything else.
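For reference, the standard formulation from Berger et al. can be written out as follows. The symbols are the usual ones from that paper, not taken from these slides: x is the context of a word, y is its label, the f_i are binary feature functions, the λ_i are learned weights, and p̃(x) is the empirical distribution of contexts. The model maximizes conditional entropy subject to matching the observed feature expectations, and the solution has an exponential form:

```latex
H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)

p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
```

With no features at all, every λ_i term vanishes and p(y | x) is uniform over the labels, which is the "as uniform as possible" part of the quote.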

Maximum Entropy Toolkit Output
For context: prev_7 prev_is_not_dict_term 1 is_not_dict_term next_1 next_has_punctuation
NOT_LINK[0.8590] COMPLETE_LINK[0.0227] BEGINNING_LINK[0.0419] MIDDLE_LINK[0.0347] END_LINK[0.0416]
For context: prev_1 prev_is_not_dict_term 1 has_punctuation next_2 next_no_punctuation next_is_dict_term
NOT_LINK[0.8606] COMPLETE_LINK[0.0239] BEGINNING_LINK[0.0417] MIDDLE_LINK[0.0320] END_LINK[0.0418]
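The "search through output to find most likely legal path" step can be sketched as a Viterbi-style dynamic program over per-token distributions like the two above. The slides do not spell out which label sequences count as legal, so the rules below are assumptions: a link must open with BEGINNING_LINK, may continue with MIDDLE_LINK, and must close with END_LINK, and a sequence may neither start nor end inside a link. The names `is_legal` and `best_legal_path` are hypothetical.

```python
import math

LABELS = ["NOT_LINK", "COMPLETE_LINK", "BEGINNING_LINK", "MIDDLE_LINK", "END_LINK"]

# Assumed legality rules (not stated in the slides).
INSIDE = {"BEGINNING_LINK", "MIDDLE_LINK"}                  # a link is still open
CAN_START = {"NOT_LINK", "COMPLETE_LINK", "BEGINNING_LINK"}  # valid first labels
CAN_END = {"NOT_LINK", "COMPLETE_LINK", "END_LINK"}          # valid last labels


def is_legal(prev, cur):
    """True if label `cur` may directly follow label `prev`."""
    if prev in INSIDE:
        return cur in {"MIDDLE_LINK", "END_LINK"}
    return cur in CAN_START


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def best_legal_path(token_probs):
    """token_probs: one dict per token mapping every label to its probability."""
    if not token_probs:
        return []
    neg_inf = float("-inf")
    scores = {lab: (_log(token_probs[0][lab]) if lab in CAN_START else neg_inf)
              for lab in LABELS}
    backpointers = []
    for probs in token_probs[1:]:
        new_scores, back = {}, {}
        for cur in LABELS:
            # Best-scoring previous label from which `cur` is a legal next step.
            best_score, best_prev = max(
                (scores[prev], prev) for prev in LABELS if is_legal(prev, cur)
            )
            new_scores[cur] = best_score + _log(probs[cur])
            back[cur] = best_prev
        scores = new_scores
        backpointers.append(back)
    # The path must not end with a link still open.
    last = max(CAN_END, key=lambda lab: scores[lab])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # The two distributions from the toolkit output above.
    slide_output = [
        {"NOT_LINK": 0.8590, "COMPLETE_LINK": 0.0227, "BEGINNING_LINK": 0.0419,
         "MIDDLE_LINK": 0.0347, "END_LINK": 0.0416},
        {"NOT_LINK": 0.8606, "COMPLETE_LINK": 0.0239, "BEGINNING_LINK": 0.0417,
         "MIDDLE_LINK": 0.0320, "END_LINK": 0.0418},
    ]
    print(best_legal_path(slide_output))
```

The point of decoding jointly rather than taking the top label per token is that a greedy choice can yield an impossible sequence, such as BEGINNING_LINK followed directly by NOT_LINK.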

Measuring Output
True Negative = Correctly labeled “Not Link”
True Positive = Correctly labeled “Link”
Specificity = tn / (tn + fp)
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
Accuracy = (tp + tn) / (tp + fp + fn + tn)

The problem
True Positives: 1238
True Negatives: 14481
False Negatives: 1466
False Positives: 2815
Specificity = 0.8372456059204441
Precision = 0.305452751048606
Recall = 0.4578402366863905
Accuracy = 0.78595
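As a quick check, the four measures can be recomputed from the confusion counts on this slide using the formulas from the "Measuring Output" slide. This is a small sketch; the function name `confusion_metrics` is hypothetical and there is no guard against zero denominators.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the four measures from raw confusion-matrix counts."""
    return {
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


if __name__ == "__main__":
    # Counts reported on the slide (20000 tokens in total).
    metrics = confusion_metrics(tp=1238, tn=14481, fp=2815, fn=1466)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

The accuracy looks respectable only because most tokens are "Not Link". Precision of about 0.31 means that roughly two of every three tokens labeled as part of a link are wrong, which is the problem this slide points at.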

Next Approach First approach Second approach

Second Approach
● Parse Wikipedia (Improve)
● Label All Names as Names
● Segment Every Article
● Associate features with every word/phrase
● Do Magic (Maximum Entropy toolkit)
● Search through output to find most likely legal path

Desired Results

Thank you Jed, Terran, Rafael, Andy, Amy, Dean, and Peiyou Song

Questions?
Any questions about the presentation or censorship in China?
