Information Extraction Lecture 4 – Named Entity Recognition II CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.


Information Extraction Lecture 4 – Named Entity Recognition II CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS

Outline
Evaluation in more detail
Look at Information Retrieval
Return to Rule-Based NER
The CMU Seminar dataset
Issues in Evaluation of IE
Human Annotation for NER

Recall
Measure of how much of the relevant information the system has extracted (the coverage of the system).
Exact definition:
Recall = 1 if there are no possible correct answers
else: Recall = (# of correct answers given by the system) / (total # of possible correct answers in the text)
Slide modified from Butt/Jurafsky/Martin

Precision
Measure of how much of the information the system returned is correct (the accuracy of what was returned).
Exact definition:
Precision = 1 if no answers are given by the system
else: Precision = (# of correct answers given by the system) / (# of answers given by the system)
Slide modified from Butt/Jurafsky/Martin

Evaluation
Every system, algorithm or theory should be evaluated, i.e. its output should be compared to the gold standard (i.e. the ideal output).
Suppose we try to find scientists…
Algorithm output: O = {Einstein, Bohr, Planck, Clinton, Obama}
Gold standard: G = {Einstein, Bohr, Planck, Heisenberg}
Precision: what proportion of the output is correct?  |O ∩ G| / |O|
Recall: what proportion of the gold standard did we get?  |O ∩ G| / |G|
Slide modified from Suchanek
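
This set computation can be written directly in a few lines of Python. The sketch below is illustrative (the function and variable names are not from the lecture) and uses the edge-case convention of the two preceding slides, where precision and recall default to 1 when their denominators are empty:

```python
def precision(output, gold):
    """Fraction of the system output that is correct; 1.0 if nothing was output."""
    if not output:
        return 1.0
    return len(output & gold) / len(output)

def recall(output, gold):
    """Fraction of the gold standard that the system found; 1.0 if the gold standard is empty."""
    if not gold:
        return 1.0
    return len(output & gold) / len(gold)

system_output = {"Einstein", "Bohr", "Planck", "Clinton", "Obama"}
gold_standard = {"Einstein", "Bohr", "Planck", "Heisenberg"}

print(precision(system_output, gold_standard))  # |O ∩ G| / |O| = 3/5 = 0.6
print(recall(system_output, gold_standard))     # |O ∩ G| / |G| = 3/4 = 0.75
```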

Evaluation Why Evaluate? What to Evaluate? How to Evaluate? Slide from Giles

Why Evaluate?
Determine if the system is useful
Make comparative assessments with other methods/systems
– Who's the best?
Test and improve systems
Others: Marketing, …
Slide modified from Giles

What to Evaluate?
In Information Extraction, we try to match a pre-annotated gold standard
But the evaluation methodology is mostly taken from Information Retrieval
– So let's consider documents relevant to a search engine query for now
– We will return to IE evaluation later

Relevant vs. Retrieved Documents
[Figure: the set of relevant documents and the set of retrieved documents within all docs available – the set approach]
Slide from Giles

Contingency table of relevant and retrieved documents

                 Relevant       Not relevant
Retrieved        RetRel         RetNotRel        Retrieved = RetRel + RetNotRel
Not retrieved    NotRetRel      NotRetNotRel     Not retrieved = NotRetRel + NotRetNotRel

Relevant = RetRel + NotRetRel        Not relevant = RetNotRel + NotRetNotRel
Total # of documents available: N = RetRel + NotRetRel + RetNotRel + NotRetNotRel

Precision: P = RetRel / Retrieved, P ∈ [0,1]
Recall: R = RetRel / Relevant, R ∈ [0,1]
Slide from Giles

Contingency table of classification of documents

                          Actual condition
                          Present              Absent
Test result   Positive    tp                   fp  (type 1 error)
              Negative    fn  (type 2 error)   tn

present = tp + fn; positives = tp + fp; negatives = fn + tn
Total # of cases N = tp + fp + fn + tn

False positive rate α = fp / (negatives)
False negative rate β = fn / (positives)
Slide from Giles
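
As a minimal sketch, the two error rates can be computed from the four cell counts. Note an assumption: the code uses the conventional statistical definitions, where the false positive rate is taken over the actual negatives (fp + tn) and the false negative rate over the actual positives (tp + fn); the counts in the example call are made up for illustration.

```python
def error_rates(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """False positive rate (alpha, type 1) and false negative rate (beta, type 2)."""
    fpr = fp / (fp + tn)  # fraction of actual negatives wrongly flagged as positive
    fnr = fn / (tp + fn)  # fraction of actual positives that were missed
    return fpr, fnr

# Illustrative counts, not from the slides:
print(error_rates(tp=30, fp=10, fn=20, tn=940))  # (0.0105..., 0.4)
```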

Retrieval example
Documents available: D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Relevant: D1, D4, D5, D8, D10
Query to search engine retrieves: D2, D4, D5, D6, D8, D9

                 relevant   not relevant
retrieved
not retrieved
Slide from Giles

Retrieval example
Documents available: D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Relevant: D1, D4, D5, D8, D10
Query to search engine retrieves: D2, D4, D5, D6, D8, D9

                 relevant     not relevant
retrieved        D4, D5, D8   D2, D6, D9
not retrieved    D1, D10      D3, D7
Slide from Giles

Precision and Recall – Contingency Table

                 Relevant   Not relevant
Retrieved        w = 3      y = 3            Retrieved = w + y = 6
Not retrieved    x = 2      z = 2            Not retrieved = x + z = 4

Relevant = w + x = 5        Not relevant = y + z = 5
Total documents N = w + x + y + z = 10

Precision: P = w / (w + y) = 3/6 = 0.5
Recall: R = w / (w + x) = 3/5 = 0.6
Slide from Giles

Contingency table of relevant and retrieved documents

                 Relevant         Not relevant
Retrieved        RetRel = 3       RetNotRel = 3        Retrieved = RetRel + RetNotRel = 3 + 3 = 6
Not retrieved    NotRetRel = 2    NotRetNotRel = 2     Not retrieved = NotRetRel + NotRetNotRel = 2 + 2 = 4

Relevant = RetRel + NotRetRel = 3 + 2 = 5        Not relevant = RetNotRel + NotRetNotRel = 3 + 2 = 5
Total # of docs N = RetRel + NotRetRel + RetNotRel + NotRetNotRel = 10

Precision: P = RetRel / Retrieved = 3/6 = 0.5, P ∈ [0,1]
Recall: R = RetRel / Relevant = 3/5 = 0.6, R ∈ [0,1]
Slide from Giles
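
The whole worked example can be reproduced with set operations; the sketch below simply recomputes the four cells and the resulting precision and recall from the document IDs given on the slides:

```python
relevant  = {"D1", "D4", "D5", "D8", "D10"}
retrieved = {"D2", "D4", "D5", "D6", "D8", "D9"}
all_docs  = {f"D{i}" for i in range(1, 11)}      # D1 ... D10

ret_rel       = retrieved & relevant              # {D4, D5, D8}
ret_notrel    = retrieved - relevant              # {D2, D6, D9}
notret_rel    = relevant - retrieved              # {D1, D10}
notret_notrel = all_docs - retrieved - relevant   # {D3, D7}

precision = len(ret_rel) / len(retrieved)   # 3/6 = 0.5
recall    = len(ret_rel) / len(relevant)    # 3/5 = 0.6
print(precision, recall)
```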

What do we want?
Find everything relevant – high recall
Only retrieve what is relevant – high precision
Slide from Giles

Relevant vs. Retrieved
[Figure: the relevant and retrieved document sets within all docs]
Slide from Giles

Precision vs. Recall
[Figure: the relevant and retrieved document sets within all docs]
Slide from Giles

Retrieved vs. Relevant Documents
Very high precision, very low recall
Slide from Giles

Retrieved vs. Relevant Documents
High recall, but low precision
Slide from Giles

Retrieved vs. Relevant Documents
Very low precision, very low recall (0 for both)
Slide from Giles

Retrieved vs. Relevant Documents
High precision, high recall (at last!)
Slide from Giles

Why Precision and Recall?
Get as much of what we want while at the same time getting as little junk as possible.
Recall is the percentage of the relevant documents that were returned, out of all the relevant documents available.
Precision is the percentage of the returned documents that are relevant.
The desired trade-off between precision and recall is specific to the scenario we are in.
Slide modified from Giles

Relation to Contingency Table

                       Doc is relevant   Doc is NOT relevant
Doc is retrieved       a                 b
Doc is NOT retrieved   c                 d

Accuracy: (a + d) / (a + b + c + d)
Precision: a / (a + b)
Recall: a / (a + c)

Why don't we use accuracy for IR?
– (Assuming a large collection) most docs aren't relevant and most docs aren't retrieved, so the d cell dominates and inflates the accuracy value
Slide from Giles
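
The following small sketch illustrates the point with made-up numbers for a large collection (the figures are illustrative, not from the slides): a system that retrieves nothing at all still gets near-perfect accuracy because the d cell dominates, while recall exposes the failure.

```python
# Hypothetical collection: 1,000,000 documents, of which 100 are relevant.
# The system retrieves nothing.
a, b = 0, 0            # retrieved & relevant, retrieved & not relevant
c, d = 100, 999_900    # not retrieved & relevant, not retrieved & not relevant

accuracy  = (a + d) / (a + b + c + d)         # 0.9999  -- looks excellent
precision = a / (a + b) if (a + b) else 1.0   # 1.0 by the earlier "no answers given" convention
recall    = a / (a + c) if (a + c) else 1.0   # 0.0     -- nothing relevant was found
print(accuracy, precision, recall)
```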

CMU Seminars task
Given an email about a seminar, annotate:
– Speaker
– Start time
– End time
– Location

CMU Seminars - Example
Type: cmu.cs.proj.mt
Topic: Nagao Talk
Dates: 26-Apr-93
Time: 10: :00 AM
PostedBy: jgc+ on 24-Apr-93 at 20:59 from NL.CS.CMU.EDU (Jaime Carbonell)
Abstract: This Monday, 4/26, Prof. Makoto Nagao will give a seminar in the CMT red conference room am on recent MT research results.

Creating Rules
Suppose we observe "the seminar at 4 pm will [...]" in a training document
The processed representation will have access to the words and to additional knowledge
We can create a very specific rule for this time expression (the seminar's start time)
– And then generalize this by dropping constraints (as discussed previously)

For each rule, we look for:
– Support (training examples that match this pattern)
– Conflicts (training examples that match this pattern with no annotation, or a different annotation)
Suppose we see: "tomorrow at 9 am"
– The rule in our example applies!
– If there are no conflicts, we have a more general rule
Overall: we try to take the most general rules which don't have conflicts
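
A hedged sketch of this support/conflict bookkeeping is given below. The rule representation (a simple predicate over a token window) and all names are hypothetical, chosen only to illustrate the counting, not the rule formalism actually used in the lecture.

```python
def support_and_conflicts(rule, examples, label="stime"):
    """Count training examples that match the rule and carry the expected
    annotation (support) vs. those that match but are unannotated or
    annotated differently (conflicts)."""
    support, conflicts = 0, 0
    for ex in examples:
        if rule(ex["tokens"]):
            if ex.get("label") == label:
                support += 1
            else:
                conflicts += 1
    return support, conflicts

# A specific rule from "the seminar at 4 pm", generalized to "at <digit> am/pm":
rule = lambda toks: (len(toks) >= 3 and toks[0] == "at"
                     and toks[1].isdigit() and toks[2] in ("am", "pm"))

examples = [
    {"tokens": ["at", "4", "pm"], "label": "stime"},   # original training example
    {"tokens": ["at", "9", "am"], "label": "stime"},   # "tomorrow at 9 am"
    {"tokens": ["by", "Friday"],  "label": None},      # does not match the pattern
]
print(support_and_conflicts(rule, examples))  # (2, 0): more general, still no conflicts
```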

Returning to Evaluation This time, evaluation specifically for IE

False Negative in CMU Seminars
Gold standard test set: Starting from <stime>11 am</stime>
System marks nothing: Starting from 11 am
This is a false negative (which measure does this hurt?)

False Positive in CMU Seminars
Gold standard test set: ... followed by lunch at 11:30 am, and meetings ... (no start time annotated here)
System marks: ... at <stime>11:30 am</stime> ...
This is a false positive (which measure does this hurt?)

Mislabeled in CMU Seminars
Gold standard test set: at a different time - 6 pm, annotated with one label (e.g. as an end time)
System marks: 6 pm, but with a different label (e.g. as a start time)
Which measures are affected?
Note that this is different from Information Retrieval!

Partial Matches in CMU Seminars
Gold standard test set: ... at 5 pm, with a span such as "5 pm" annotated
System marks: ... at 5 pm, with an overlapping but not identical span
Then I get a partial match (worth 0.5)
Also different from Information Retrieval
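
One possible way to implement such IE-style scoring is sketched below: an exact span-and-label match is worth 1.0 and an overlapping span with the correct label is worth 0.5. The span representation and the matching policy are assumptions made for illustration; this is not the official CMU Seminars scorer.

```python
def overlaps(a, b):
    """True if two (start, end) character spans overlap."""
    return a[0] < b[1] and b[0] < a[1]

def score(system, gold):
    """system, gold: lists of (start, end, label) annotations.
    Exact matches earn 1.0, partial (overlapping, same label) matches earn 0.5."""
    credit, matched = 0.0, set()
    for s in system:
        exact = next((i for i, g in enumerate(gold) if i not in matched and g == s), None)
        if exact is not None:
            credit += 1.0
            matched.add(exact)
            continue
        partial = next((i for i, g in enumerate(gold)
                        if i not in matched and g[2] == s[2] and overlaps(g[:2], s[:2])), None)
        if partial is not None:
            credit += 0.5
            matched.add(partial)
    precision = credit / len(system) if system else 1.0
    recall    = credit / len(gold) if gold else 1.0
    return precision, recall

gold   = [(10, 14, "stime")]   # gold annotates the span "5 pm"
system = [(12, 14, "stime")]   # system marks only "pm": an overlapping, partial match
print(score(system, gold))     # (0.5, 0.5)
```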

Evaluation is a critical issue where there is still much work to be done
But before we can evaluate, we need a gold standard
Training IE systems
– A gold standard is a critical component for "learning" statistical classifiers
– The more data, the better the classifier
Can also be used for developing a handcrafted NER system
– Constant rescoring and coverage checks are very helpful
Necessary in both cases for evaluation

Annotator Variability
Differences in annotation are a significant problem
– Only some people are good at annotation
– Practice helps
Even good annotators can have a different understanding of the task
– For instance, if in doubt, should you annotate or not?
– (this is ~ a precision/recall tradeoff)
Effects of using gold standard corpora that are not well annotated
– Evaluations can return inaccurate results
– Systems trained on inconsistent data can develop problems which are worse than if the inconsistent training examples had simply been eliminated
Crowd-sourcing, which we will talk about later, has all of these same problems, even more strongly!

Slide sources – Some of the slides presented today were from C. Lee Giles, Penn State and Fabio Ciravegna, Sheffield

Conclusion
Last two lectures:
– Rule-based NER
– Learning rules for NER
– Evaluation
– Annotation

Thank you for your attention!