
Crowdsourcing Inference-Rule Evaluation
Naomi Zeichner, Jonathan Berant, Ido Dagan
Bar Ilan University @ ACL 2012

Outline
We address (1) Inference-Rule Evaluation, by (2) Crowdsourcing Rule Application Annotation, allowing us to (3) Empirically Compare Different Resources.

Inference Rules – an important component in semantic applications
Example (question answering): the rule "X brought up in Y → X raised in Y" lets a system answer the question "Where was Reagan raised?" from the text "Reagan was brought up in Dixon."
Example (information extraction): the rule "X work as Y → X hired as Y" lets a system fill a Hiring Event template (PERSON = Bob, ROLE = analyst) from the sentence "Bob worked as an analyst for Dell."
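To make the rule-application mechanics concrete, here is a minimal Python sketch of instantiating both sides of a rule by template substitution; the dictionary representation and the apply_rule helper are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' implementation): an inference rule as a
# pair of templates over variables X and Y, instantiated by string substitution.
rule = {"lhs": "X brought up in Y", "rhs": "X raised in Y"}

def apply_rule(rule, x, y):
    """Instantiate both sides of the rule with concrete arguments."""
    lhs = rule["lhs"].replace("X", x).replace("Y", y)
    rhs = rule["rhs"].replace("X", x).replace("Y", y)
    return lhs, rhs

lhs, rhs = apply_rule(rule, "Reagan", "Dixon")
print(lhs)  # Reagan brought up in Dixon
print(rhs)  # Reagan raised in Dixon
```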

Evaluation – what are the options?
1. Impact on an end task (QA, IE, RTE).
   Pro: this is what interests an inference-system developer.
   Con: such systems have many components addressing multiple phenomena, so it is hard to assess the effect of a single resource.
2. Judge rule correctness directly, e.g., "X reside in Y → X live in Y", "X reside in Y → X born in Y", "X criticize Y → X attack Y".
   Pro: theoretically the most intuitive option.
   Con: in practice it is hard to do and often results in low inter-annotator agreement.
3. Instance-based evaluation (Szpektor et al., 2007; Bhagat et al., 2007).
   Pro: simulates the utility of rules in an application and yields high inter-annotator agreement.

Instance-Based Evaluation – Decisions
Target: judge whether each rule application is valid or not.
  Rule: X teach Y → X explain to Y; LHS: Steve teaches kids; RHS: Steve explains to kids
  Rule: X resides in Y → X born in Y; LHS: He resides in Paris; RHS: He born in Paris
  Rule: X turn in Y → X bring in Y; LHS: humans turn in bed; RHS: humans bring in bed
Our goal: make this annotation robust and replicable.
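Each decision item can be thought of as a (rule, LHS phrase, RHS phrase) triple awaiting a validity judgement. The Python representation below, including all field names, is a hypothetical illustration rather than the paper's data format.

```python
# A hypothetical representation of one decision item; all field names are
# illustrative and not taken from the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuleApplication:
    rule: str                     # e.g. "X teach Y -> X explain to Y"
    lhs: str                      # LHS template filled with the extraction's arguments
    rhs: str                      # RHS template filled with the same arguments
    valid: Optional[bool] = None  # annotators' judgement, filled in later

example = RuleApplication(
    rule="X teach Y -> X explain to Y",
    lhs="Steve teaches kids",
    rhs="Steve explains to kids",
)
```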

Crowdsourcing
There is a recent trend of using crowdsourcing for annotation tasks. Previous work (Snow et al., 2008; Wang and Callison-Burch, 2010; Mehdad et al., 2010; Negri et al., 2011) focused on RTE text-hypothesis pairs and did not address the annotation and evaluation of rules.
Challenges: (1) simplify the annotation process, and (2) communicate the entailment decision to Turkers.

Outline (recap) – next: (2) Crowdsourcing Rule Application Annotation.

Simplify the Process – break annotation into two simple tasks
Task 1: Is a phrase meaningful? (e.g., "Steve teaches kids", "Steve explains to kids", "He resides in Paris", "He born in Paris", "humans turn in bed", "humans bring in bed")
Task 2: Judge whether one phrase is true given another (e.g., given "Steve teaches kids", is "Steve explains to kids" true? given "He resides in Paris", is "He born in Paris" true? given "they observe holidays", is "they celebrate holidays" true?)
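A minimal sketch of how the two tasks could be chained, assuming each crowd judgement is reduced to a boolean. The function names are placeholders; treating a meaningless LHS as a discarded item is an assumption, while counting a meaningless RHS as non-entailment follows the output statistics later in the deck.

```python
# A sketch of chaining the two tasks, assuming each crowd judgement is reduced
# to a boolean; is_meaningful and is_entailed are placeholder callables.
def annotate(applications, is_meaningful, is_entailed):
    results = []
    for app in applications:
        # Task 1: is each phrase meaningful on its own?
        if not is_meaningful(app.lhs):
            continue                      # assumption: drop items whose LHS is itself meaningless
        if not is_meaningful(app.rhs):
            results.append((app, False))  # meaningless RHS is counted as non-entailment
            continue
        # Task 2: given that the LHS phrase is true, is the RHS phrase true?
        results.append((app, is_entailed(app.lhs, app.rhs)))
    return results
```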

Communicate Entailment – gold-standard examples embedded in the task
1. Educating: "confusing" examples are used as gold, with feedback shown if Turkers get them wrong.
2. Enforcing: unanimous examples are used as gold to estimate Turker reliability.
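A minimal sketch of the enforcing idea: score each Turker by accuracy on gold items and filter unreliable workers. The data layout and the threshold in the usage comment are assumptions, not details from the paper.

```python
# A minimal sketch of the "enforcing" idea: score each Turker by accuracy on
# gold items; the data layout and the filtering threshold are assumptions.
def worker_reliability(answers, gold):
    """answers: {worker_id: {item_id: label}}, gold: {item_id: label}."""
    scores = {}
    for worker, judged in answers.items():
        gold_items = [item for item in judged if item in gold]
        if gold_items:
            correct = sum(judged[item] == gold[item] for item in gold_items)
            scores[worker] = correct / len(gold_items)
    return scores

# Usage idea: keep only workers above some accuracy threshold on gold items.
# reliable = {w for w, acc in worker_reliability(answers, gold).items() if acc >= 0.8}
```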

Communicate – Effect of Communication

                       Without   With
Agreement with gold     0.79     0.90
Kappa with gold         0.54     0.79
False-positive rate     18%      6%
False-negative rate     4%       5%

63% of the annotations were unanimous among the annotators and agreed with our own annotation.
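For reference, a sketch of Cohen's kappa for two binary annotations, the agreement statistic reported in the table above; the label lists in the example call are toy values.

```python
# A sketch of Cohen's kappa for two binary annotations; 'a' and 'b' are
# parallel lists of boolean judgements (toy values in the example call).
def cohen_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each annotator's marginal label distribution
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

print(cohen_kappa([True, True, False, False], [True, False, False, False]))  # 0.5
```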

Outline (recap) – next: (3) Empirically Comparing Different Resources.

Case Study – Data Set
We executed four entailment rule learning methods over a set of 1B extractions extracted by ReVerb (Fader et al., 2011), applied the learned rules to randomly sampled extractions to obtain 20,000 rule applications, and annotated each rule application with our framework.
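A sketch of how rule applications might be generated from extractions, assuming ReVerb-style (arg1, predicate, arg2) triples and rules given as (LHS predicate, RHS predicate) pairs; the representation, matching, and sampling details are assumptions, not the paper's actual pipeline.

```python
# A sketch of generating rule applications for annotation, assuming
# ReVerb-style (arg1, predicate, arg2) triples and rules given as
# (lhs_predicate, rhs_predicate) pairs; names and sampling details are
# illustrative, not the paper's actual pipeline.
import random

def generate_applications(extractions, rules, n_applications=20000, seed=0):
    applications = []
    for arg1, pred, arg2 in extractions:
        for lhs_pred, rhs_pred in rules:
            if pred == lhs_pred:  # this rule's left-hand side matches the extraction
                applications.append({
                    "rule": f"X {lhs_pred} Y -> X {rhs_pred} Y",
                    "lhs": f"{arg1} {pred} {arg2}",
                    "rhs": f"{arg1} {rhs_pred} {arg2}",
                })
    random.seed(seed)
    return random.sample(applications, min(n_applications, len(applications)))
```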

Case Study – Algorithm Comparison

Algorithm                          AUC
DIRT (Lin and Pantel, 2001)        0.40
Cover (Weeds and Weir, 2003)       0.43
BInc (Szpektor and Dagan, 2008)    0.44
Berant (Berant et al., 2010)       0.52
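As an illustration of how each method could be scored against the crowd labels, here is a ROC-AUC sketch: the probability that a positive rule application is ranked above a negative one. The paper's exact AUC computation may differ, and the scores and labels in the example are toy values.

```python
# A sketch of ranking-based evaluation with ROC AUC: the probability that a
# positive rule application is scored above a negative one. The paper's exact
# AUC computation may differ; the scores and labels below are toy values.
def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))  # 0.75
```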

Case Study – Output
Task 1: 1,012 applications had a meaningful LHS but a meaningless RHS (counted as non-entailment); 8,264 had both sides judged meaningful and were passed to Task 2.
Task 2: 2,447 positive entailment; 3,108 negative entailment.
Overall: 6,567 annotated rule applications, for $1000, in about a week.

Summary
We presented a framework for crowdsourcing inference-rule evaluation that simplifies instance-based evaluation and communicates the entailment decision to Turkers.
The proposed framework can be beneficial for resource developers and for inference-system developers.
The crowdsourcing forms and annotated extractions can be found at BIU NLP downloads: http://www.cs.biu.ac.il/~nlp/downloads
Thank you!

