INTRODUCTION TO ARTIFICIAL INTELLIGENCE Truc-Vien T. Nguyen Lab: Named Entity Recognition
Natural Language Processing (NLP) Main purpose of NLP – Build systems able to analyze, understand and generate the languages that humans use naturally Involved Tasks – Automatic Summarization – Information Extraction – Speech Recognition – Machine Translation – …
Information Extraction (1) Mapping of texts into a fixed structure representing the key information [Figure: news articles (News 1, News 2, News 3), each mapped to a form (Form 1, Form 2, Form 3) with the fields WHO, WHAT, WHEN]
Information Extraction (2) Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones. EVENT: leave job Person: Sam Brown Position: executive vice president Company: Hupplewhite Inc. EVENT: start job Person: Harry Jones Position: executive vice president Company: Hupplewhite Inc.
Entity and Relation Entity – An object in the world – Ex. President Bush was in Washington today – Example: Person, Organization, Location, GPE Relation – A relationship between two entities – Ex. LocatedIn(“Bush”, “Washington”) – Example: LocatedIn, Family, Employment
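The entity and relation notions above can be held in two small record types. This is only an illustrative sketch: the class names `Entity` and `Relation`, their field names, and the choice of GPE as the type of "Washington" are assumptions made for the example, not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An object in the world, with its surface text and a type label."""
    text: str
    label: str  # e.g. Person, Organization, Location, GPE


@dataclass(frozen=True)
class Relation:
    """A typed relationship between two entities."""
    label: str  # e.g. LocatedIn, Family, Employment
    arg1: Entity
    arg2: Entity


# "President Bush was in Washington today"
bush = Entity("Bush", "Person")
washington = Entity("Washington", "GPE")  # assumed type; Location would also fit
located_in = Relation("LocatedIn", bush, washington)
```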
Named Entity Recognition – Subtask of information extraction – Locate and classify elements in text into predefined categories: names of persons, organizations, locations, time expressions, etc. Example – James Clarke (Person), director of ABC company (Organization)
CoNLL2003 shared task (1) English and German languages 4 types of NEs: – LOC Location – MISC Names of miscellaneous entities – ORG Organization – PER Person Training Set for developing the system Test Data for the final evaluation
CoNLL2003 shared task (2) Data – columns separated by a single space – A word for each line – An empty line after each sentence – Tags in IOB format An example
    Milan NNP B-NP I-ORG
    's POS B-NP O
    player NN I-NP O
    George NNP I-NP I-PER
    Weah NNP I-NP I-PER
    meet VBP B-VP O
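As a minimal sketch of how this format can be read, the code below splits the text into sentences at empty lines and groups consecutive tokens that share an entity type. The function names `read_conll` and `extract_entities` are hypothetical, and the grouping rule (a `B-` tag or a change of type starts a new entity) is one common reading of IOB tags rather than the shared task's official tooling.

```python
def read_conll(text):
    """Split CoNLL-style text into sentences; each token is a list of columns."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                  # an empty line closes the sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split())
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Return (type, text) pairs read from the last column (the NE tag)."""
    entities, words, etype = [], [], None
    for columns in sentence:
        word, tag = columns[0], columns[-1]
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != etype)
        if tag == "O" or starts_new:  # close the entity collected so far
            if words:
                entities.append((etype, " ".join(words)))
            words, etype = [], None
        if tag != "O":
            words.append(word)
            etype = label
    if words:
        entities.append((etype, " ".join(words)))
    return entities


sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""

sentences = read_conll(sample)
print(extract_entities(sentences[0]))
# [('ORG', 'Milan'), ('PER', 'George Weah')]
```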
CoNLL2003 shared task (3)
    English     precision   recall    F
    [FIJZ03]    88.99%      88.54%    88.76%
    [CN03]      88.12%      88.51%    88.31%
    [KSNM03]    85.93%      86.21%    86.07%
    [ZJ03]      86.13%      84.88%    85.50%
    [Ham03]     69.09%      53.26%    60.15%
    baseline    71.91%      50.90%    59.61%
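The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A short sketch for checking a row of the table follows; the function name `f_measure` is an assumption made for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First and last rows of the table: [FIJZ03] and the baseline.
print(round(f_measure(88.99, 88.54), 2))  # 88.76
print(round(f_measure(71.91, 50.90), 2))  # 59.61
```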
Dataset Italian NER-- Evalita PER/ORG/LOC/GPE – Development set: tokens – Test set: tokens English NER-- CoNLL PER/ORG/LOC/MISC – Training set: tokens – Development set: tokens – Test set: tokens Mention Detection-- ACE 2005 – 599 documents
CRF++ (1) Can redefine feature sets Written in C++ with STL Fast training based on L-BFGS, a quasi-Newton method for large-scale optimization Less memory usage both in training and testing Encoding/decoding in practical time Available as open source software
CRF++ (2) Uses Conditional Random Fields (CRFs) CRFs methodology: use statistically correlated features and train them discriminatively Simple, customizable, open source implementation for segmenting/labeling sequential data Can define – unigram/bigram features – relative positions (window size)
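To make the unigram/bigram and window-size options concrete, here is a sketch that writes out template lines for a given window. It assumes the CRF++ template-file convention in which unigram lines start with `U` plus an identifier and a lone `B` requests bigram features over adjacent output tags. The function `make_template` and its parameters are hypothetical, and the result is not the `template_4` file used later in the lab, whose contents are not shown in the slides.

```python
def make_template(window=2, columns=(0, 1), bigram=True):
    """Build CRF++-style template lines for tokens within +/- window rows."""
    lines = []
    index = 0
    for col in columns:                        # e.g. 0 = word, 1 = POS tag
        for offset in range(-window, window + 1):
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    if bigram:
        lines.append("B")                      # bigram over output tags
    return lines


print("\n".join(make_template(window=1, columns=(0,))))
```

A larger window adds more context per token at the cost of a larger feature set.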
Template basic An example:
    He PRP B-NP
    reckons VBZ B-VP
    the DT B-NP     << CURRENT TOKEN
    current JJ I-NP
    account NN I-NP

    Template            Expanded feature
    %x[0,0]             the
    %x[0,1]             DT
    %x[-1,0]            reckons
    %x[-2,1]            PRP
    %x[0,0]/%x[0,1]     the/DT
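The expansion rule is that `%x[row,col]` picks the value at a row offset relative to the current token and at an absolute column. The sketch below mimics that rule in Python. The function `expand` and the `_PAD_` placeholder for positions outside the sentence are assumptions made for illustration; CRF++ itself performs the expansion internally.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position):
    """Replace each %x[row,col] macro with the value it points to."""
    def lookup(match):
        row = position + int(match.group(1))   # relative row offset
        col = int(match.group(2))              # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_PAD_"                         # hypothetical out-of-range marker
    return MACRO.sub(lookup, template)


rows = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],                     # current token (position 2)
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, rows, 2))
```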
A Case Study Installing CRF++ Data for Training and Test Making the baseline Training CRF++ on the – NER dataset: English CoNLL2003, Italian EVALITA – Mention classification: ACE 2005 dataset Annotating the test corpus with CRF++ Evaluating results Exercise
Installing CRF++ First, ssh compute-0-x where x=1..10 Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz) Enter the lab--NER directory – Unzip the CRF tar.gz file (tar -xvzf CRF tar.gz) – Enter the CRF directory – Run ./configure – Run make
Training/Classification (1) Notations
    – xxx: train_it.dat / train_en.dat / train_mention.dat
    – nnn: it.model / en.model / mention.model
    – yyy: test_it.dat / test_en.dat / test_mention.dat
    – zzz: test_it.tagged / test_en.tagged / test_mention.tagged
    – ttt: test_it.eval / test_en.eval / test_mention.eval
Note that the test_it.dat already contains the right NE tags but the system is not using this information for tagging the data
Training/Classification (2) Enter the CRF directory
Training
    ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
Classification
    ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
Evaluation
    perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
See the results
    cat ../corpus/ttt
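The four commands are the same for each dataset; only the file names substituted for xxx, nnn, yyy, zzz and ttt change. As a sketch, the helper below fills in those names for the `it`, `en` and `mention` datasets and prints the resulting command lines without running them. The function `build_commands` is hypothetical, and the paths simply follow the notation of the previous slide.

```python
def build_commands(name, template="../templates/template_4"):
    """Return the train, tag, evaluate and display commands for one dataset."""
    train = f"../corpus/train_{name}.dat"       # xxx
    model = f"../models/{name}.model"           # nnn
    test = f"../corpus/test_{name}.dat"         # yyy
    tagged = f"../corpus/test_{name}.tagged"    # zzz
    report = f"../corpus/test_{name}.eval"      # ttt
    return [
        f"./crf_learn {template} {train} {model}",
        f"./crf_test -m {model} {test} > {tagged}",
        f"perl ../eval/conlleval.pl {tagged} > {report}",
        f"cat {report}",
    ]


for dataset in ("it", "en", "mention"):
    print(f"# {dataset}")
    for command in build_commands(dataset):
        print(command)
```

The printed lines are meant to be run from inside the CRF directory, as the slide instructs.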
THANKS I used material from – Text Processing II: Bernardo Magnini – Lab Text Processing II: Roberto Zanoli