INTRODUCTION TO ARTIFICIAL INTELLIGENCE Truc-Vien T. Nguyen Lab: Named Entity Recognition.

1 INTRODUCTION TO ARTIFICIAL INTELLIGENCE Truc-Vien T. Nguyen Lab: Named Entity Recognition

2 Download
– Slides: http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
– Software: http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar

3 Natural Language Processing (NLP)
Main purpose of NLP
– Build systems able to analyze, understand, and generate the languages that humans use naturally
Tasks involved
– Automatic Summarization
– Information Extraction
– Speech Recognition
– Machine Translation
– …

4 Information Extraction (1)
Mapping of texts into a fixed structure representing the key information
[Figure: news articles (News 1, News 2, News 3) are each mapped to a form (Form 1, Form 2, Form 3) with the fields WHO, WHAT, and WHEN]

5 Information Extraction (2)
Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.
EVENT: leave job
– Person: Sam Brown
– Position: executive vice president
– Company: Hupplewhite Inc.
EVENT: start job
– Person: Harry Jones
– Position: executive vice president
– Company: Hupplewhite Inc.

6 Entity and Relation
Entity
– An object in the world
– Ex. President Bush was in Washington today
– Types: Person, Organization, Location, GPE
Relation
– A relationship between two entities
– Ex. LocatedIn(“Bush”, “Washington”)
– Types: LocatedIn, Family, Employment
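As a rough illustration of the entity and relation definitions on this slide, the sketch below models them as two small Python data classes. The class names, field names, and the `render` helper are hypothetical, and typing "Washington" as GPE is an assumption; the slide does not say which type applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An object in the world, with its surface text and entity type."""
    text: str
    type: str


@dataclass(frozen=True)
class Relation:
    """A named relationship between two entities."""
    name: str
    arg1: Entity
    arg2: Entity


def render(relation: Relation) -> str:
    """Format a relation in the Name("arg1", "arg2") style used on the slide."""
    return f'{relation.name}("{relation.arg1.text}", "{relation.arg2.text}")'


# Entities from "President Bush was in Washington today".
bush = Entity("Bush", "Person")
washington = Entity("Washington", "GPE")  # assumed type, not stated on the slide
located_in = Relation("LocatedIn", bush, washington)

print(render(located_in))  # LocatedIn("Bush", "Washington")
```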

7 Named Entity Recognition
– Subtask of information extraction
– Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
Example
– James Clarke (Person), director of ABC company (Organization)

8 CoNLL2003 shared task (1)
English and German languages
4 types of NEs:
– LOC Location
– MISC Names of miscellaneous entities
– ORG Organization
– PER Person
Training set for developing the system
Test data for the final evaluation

9 CoNLL2003 shared task (2)
Data
– Columns separated by a single space
– One word per line
– An empty line after each sentence
– Tags in IOB format
An example
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
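A minimal sketch of reading this data format in Python, assuming four space-separated columns (word, POS tag, chunk tag, NE tag) as in the slide's example. The function names `read_sentences` and `entity_spans` are hypothetical and are not part of the lab material. The span extraction starts a new entity on any B- tag, or on an I- tag whose type differs from the entity currently open, which covers the example above where entities begin with I-.

```python
from typing import List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NE tag)

SAMPLE = """\
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""


def read_sentences(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ne = line.split()
        current.append((word, pos, chunk, ne))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity type, entity text) pairs from the NE tag column."""
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    etype = None
    for word, _pos, _chunk, tag in sentence:
        prefix, _, label = tag.partition("-")
        starts_new = tag != "O" and (prefix == "B" or label != etype)
        # Close the open entity when we leave it or a new one begins.
        if (tag == "O" or starts_new) and words:
            spans.append((etype, " ".join(words)))
            words, etype = [], None
        if tag != "O":
            words.append(word)
            etype = label
    if words:
        spans.append((etype, " ".join(words)))
    return spans


sentences = read_sentences(SAMPLE)
print(entity_spans(sentences[0]))  # [('ORG', 'Milan'), ('PER', 'George Weah')]
```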

10 CoNLL2003 shared task (3)
English     precision  recall   F
[FIJZ03]    88.99%     88.54%   88.76%
[CN03]      88.12%     88.51%   88.31%
[KSNM03]    85.93%     86.21%   86.07%
[ZJ03]      86.13%     84.88%   85.50%
---------------------------------------------------
[Ham03]     69.09%     53.26%   60.15%
baseline    71.91%     50.90%   59.61%
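The F column in these results is the harmonic mean of precision and recall (F with β = 1). The short sketch below recomputes it for two rows of the table; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values (in percent) copied from the table above.
for name, p, r in [("[FIJZ03]", 88.99, 88.54), ("baseline", 71.91, 50.90)]:
    print(f"{name}: F = {f_measure(p, r):.2f}%")
# [FIJZ03]: F = 88.76%
# baseline: F = 59.61%
```

Because the harmonic mean is pulled toward the lower of the two values, the baseline's low recall (50.90%) keeps its F well below its precision (71.91%).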

11 Dataset
Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
– Development set: 223,706 tokens
– Test set: 90,556 tokens
English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
Mention Detection: ACE 2005
– 599 documents

12 CRF++ (1)
– Can redefine feature sets
– Written in C++ with STL
– Fast training based on LBFGS for large-scale data
– Low memory usage in both training and testing
– Encoding/decoding in practical time
– Available as open source software
http://crfpp.googlecode.com/svn/trunk/doc/index.html

13 CRF++ (2)
Uses Conditional Random Fields (CRFs)
CRF methodology: use statistically correlated features and train them discriminatively
Simple, customizable, open source implementation for segmenting/labeling sequential data
Can define
– unigram/bigram features
– relative positions (window size)

14 Template basic
An example:
He       PRP  B-NP
reckons  VBZ  B-VP
the      DT   B-NP  << CURRENT TOKEN
current  JJ   I-NP
account  NN   I-NP

Template           Expanded feature
%x[0,0]            the
%x[0,1]            DT
%x[-1,0]           reckons
%x[-2,1]           PRP
%x[0,0]/%x[0,1]    the/DT
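The expansion rule can be read as: %x[row,col] picks the value in column col of the token at offset row from the current token. The sketch below imitates that rule in Python to reproduce the slide's expanded features. It is an illustration only, not CRF++ code: the `expand` function is hypothetical, and the out-of-range placeholder is a simplification of how CRF++ handles positions beyond the sentence boundary.

```python
import re

# The example sentence from the slide: (word, POS tag, chunk tag) per token.
ROWS = [
    ("He", "PRP", "B-NP"),
    ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),  # current token in the slide's example (index 2)
    ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"),
]

# Matches macros of the form %x[row,col]; row may be negative.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template: str, rows, position: int) -> str:
    """Replace each %x[row,col] with the value at (position + row, col)."""
    def lookup(match: re.Match) -> str:
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_OUT_OF_RANGE"  # hypothetical placeholder, not CRF++'s own

    return MACRO.sub(lookup, template)


for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, ROWS, position=2))
```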

15 A Case Study
– Installing CRF++
– Data for training and test
– Making the baseline
– Training CRF++ on the
  – NER dataset: English CoNLL2003, Italian EVALITA
  – Mention classification: ACE 2005 dataset
– Annotating the test corpus with CRF++
– Evaluating results
– Exercise

16 Installing CRF++
First, ssh compute-0-x where x=1..10
Unpack the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
Enter the lab--NER directory
– Unpack the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
– Enter the CRF++-0.54 directory
– Run ./configure
– Run make

17 Training/Classification (1)
Notations
– xxx: train_it.dat / train_en.dat / train_mention.dat
– nnn: it.model / en.model / mention.model
– yyy: test_it.dat / test_en.dat / test_mention.dat
– zzz: test_it.tagged / test_en.tagged / test_mention.tagged
– ttt: test_it.eval / test_en.eval / test_mention.eval
Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data

18 Training/Classification (2)
Enter the CRF++-0.54 directory
Training
  ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
Classification
  ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
Evaluation
  perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
See the results
  cat ../corpus/ttt
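To make the xxx/nnn/yyy/zzz/ttt placeholders concrete, the sketch below fills them in for one dataset and returns the four command lines as strings. It assumes the spacing lost in the transcript is restored as `./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn` and so on, and that the file names follow the notation from the previous slide. The function `lab_commands` is hypothetical; it only builds the strings and does not run anything.

```python
from typing import Dict

DATASETS = {"it", "en", "mention"}  # Italian NER, English NER, mention detection


def lab_commands(dataset: str, template: str = "template_4") -> Dict[str, str]:
    """Build the lab's shell commands for one dataset, to be run from CRF++-0.54."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    train = f"../corpus/train_{dataset}.dat"     # xxx
    model = f"../models/{dataset}.model"         # nnn
    test = f"../corpus/test_{dataset}.dat"       # yyy
    tagged = f"../corpus/test_{dataset}.tagged"  # zzz
    report = f"../corpus/test_{dataset}.eval"    # ttt
    return {
        "train": f"./crf_learn ../templates/{template} {train} {model}",
        "classify": f"./crf_test -m {model} {test} > {tagged}",
        "evaluate": f"perl ../eval/conlleval.pl {tagged} > {report}",
        "show": f"cat {report}",
    }


for step, command in lab_commands("en").items():
    print(f"{step}: {command}")
```

Keeping the commands as plain strings makes it easy to compare them against the slide before pasting them into the shell.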

19 THANKS I used material from – Text Processing II: Bernardo Magnini – Lab Text Processing II: Roberto Zanoli

