Download presentation
Presentation is loading. Please wait.
Published byAnnabel Weaver Modified over 9 years ago
1
Ehud Reiter, Computing Science, University of Aberdeen1 NLG Evaluation, Commercial NLG Ehud Reiter (Abdn Uni and Arria/Data2text)
2
Ehud Reiter, Computing Science, University of Aberdeen2 Structure l Evaluation Concepts l Commercial NLG l Appendix: Designing a controlled ratings-based evaluation
3
Ehud Reiter, Computing Science, University of Aberdeen3 Purpose of Evaluation l What do we want to know? l Audience?
4
Ehud Reiter, Computing Science, University of Aberdeen4 Example: BabyTalk l Goal: Summarise clinical data about premature babies in neonatal ICU l Input: sensor data; records of actions/observations by medical staff l Output: multi-para texts, summarise »BT45: 45 mins data, for doctors »BT-Nurse: 12 hrs data, for nurses »BT-Family: 24 hrs data, for parents
5
Ehud Reiter, Computing Science, University of Aberdeen5 Babytalk: Neonatal ICU
6
Ehud Reiter, Computing Science, University of Aberdeen6 Babytalk Input: Sensor Data
7
Ehud Reiter, Computing Science, University of Aberdeen7 BT-Nurse text (extract) Respiratory Support Current Status Currently, the baby is on CMV in 27 % O2. Vent RR is 55 breaths per minute. Pressures are 20/4 cms H2O. Tidal volume is 1.5. SaO2 is variable within the acceptable range and there have been some desaturations. … Events During the Shift A blood gas was taken at around 19:45. Parameters were acceptable. pH was 7.18. CO2 was 7.71 kPa. BE was -4.8 mmol/L. …
8
Ehud Reiter, Computing Science, University of Aberdeen8 Babytalk eval: goals l Babytalk evaluation goals »Medics want to know if Babytalk summaries enhance patient outcome –Deploy Babytalk on ward and measure outcome (RCT) »Psychologists want to know if Babytalk texts are effective decision support tool –Controlled “off ward” study of decision effectiveness »Developers want to know how improve system –Qualitative feedback often most useful »Software house wants to know if profitable –Business model (costs and revenue)
9
Ehud Reiter, Computing Science, University of Aberdeen9 Which Goal? l Depends on your purpose »Publish academic NLG research paper? »Convince someone to use NLG system? »Improve an NLG system? »Pitch for venture capital?
10
Ehud Reiter, Computing Science, University of Aberdeen10 Types of NLG Evaluation l Task Performance l Human Ratings l Metric (comparison to gold standard) l Controlled vs Real-World
11
Ehud Reiter, Computing Science, University of Aberdeen11 Task-Performance Eval l Measure whether NLG system achieves its communicative goal »Typically helping user perform a task »Other possibilities, eg behaviour change l Evaluate in real-world or in controlled experiment
12
Ehud Reiter, Computing Science, University of Aberdeen12 Real world: STOP smoking l STOP system generates personalised smoking-cessation letters l Recruited 2553 smokers »Sent 1/3 STOP letters »Sent 1/3 fixed (non-tailored) letter »Sent 1/3 simple “thank you” letter l Waited 6 months, and compared smoking cessation rates between the groups
13
Ehud Reiter, Computing Science, University of Aberdeen13 Results: STOP l 6-Month cessation rate »STOP letter: 3.5% »Non-tailored letter: 4.4% »Thank-you letter: 2.6% l Note: »More heavy smokers in STOP group »Heavy smokers less likely to quit
14
Ehud Reiter, Computing Science, University of Aberdeen14 Negative result l Should be published! l Don’t ignore or “tweak stats” until you get the “right” answer l E Reiter, R Robertson, and L Osman (2003). Lessons from a Failure: Generating Tailored Smoking Cessation Letters. Artificial Intelligence 144:41-58.
15
Ehud Reiter, Computing Science, University of Aberdeen15 Controlled exper: BT45 l Babytalk BT-45 system (short reports) l Choose 24 data sets (scenarios) »From historical data (5 years old) l Created 3 presentations of each scenario »BT45 text, Human text, Visualisation l Asked 35 subjects (medics) to look at presentations and decide on intervention »In experiment room, not in ward »Compared intervention to gold standard l Computed likelihood of correct decision
16
Ehud Reiter, Computing Science, University of Aberdeen16 Results: BT45 l Correct decision made »BT45 text: 34% »Human text: 39% »Visualisation: 33% l Note: »BT45 texts mostly as good as human, but were pretty bad in scenario where target action was “no action” or “sensor error”
17
Ehud Reiter, Computing Science, University of Aberdeen17 Reference l F Portet, E Reiter, A Gatt, J Hunter, S Sripada, Y Freer, C Sykes (2009). Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. Artificial Intelligence 173:789-816 l M. van der Meulen, R. Logie, Y. Freer, C. Sykes, N. McIntosh, and J. Hunter, "When a graph is poorer than 100 words: A comparison of computerised natural language generation, human generated descriptions and graphical displays in neonatal intensive care," Applied Cognitive Psychology, vol. 24, pp. 77-89, 2008.
18
Ehud Reiter, Computing Science, University of Aberdeen18 Task-based evaluations l Most respected »Especially outwith NLG/NLP community l Expensive and time-consuming l Eval is of specific system, not generic algorithm or idea »Small changes to both STOP and BT45 would probably changes eval result
19
Ehud Reiter, Computing Science, University of Aberdeen19 Human Ratings l Ask human subjects to assess texts »Readability (linguistic quality) »Accuracy (content quality) »Usefulness l Can assess control/baseline as well l Usually use Likert scale »Strongly agree, agree, undecided, disagree, strongly disagree (5 pt scale)
20
Ehud Reiter, Computing Science, University of Aberdeen20 Real world: BT-Nurse l Deployed BT-Nurse on ward l Nurses used it on real patients »Both beginning and end of shift »Vetted to remove content that could damage care –No content removed l Nurses gave scores (3-pt scale) on each text »Understandable, accurate, helpful »Agree, neutral, disagree l Also free-text comments
21
Ehud Reiter, Computing Science, University of Aberdeen21 Results: BT-Nurse l Numerical results »90% of texts understandable »70% of texts accurate »60% of texts helpful »[no texts damaged care] l Many comments »More content »Software bugs »A few “really helped me” comments
22
Ehud Reiter, Computing Science, University of Aberdeen22 Reference l J Hunter, Y Freer, A Gatt, E Reiter, S Sripada, C Sykes, D Westwater (2011). BT-Nurse: Computer Generation of Natural Language Shift Summaries from Complex Heterogeneous Medical Data. Journal of the American Medical Informatics Association 18:621-624 l J Hunter, Y Freer, A Gatt, E Reiter, S Sripada, C Sykes (2012). Automatic generation of natural language nursing shift summaries in neonatal intensive care: BT-Nurse. Artificial Intelligence in Medicine 56:157–172
23
Ehud Reiter, Computing Science, University of Aberdeen23 Controlled exper: Sumtime l Marine weather forecasts l Choose 5 weather data sets (scenarios) l Created 3 presentations of each scenario »Sumtime text »human texts »Hybrid: Human content, SumTime language l Asked 73 subjects (readers of marine forecasts) to give preference »Each saw 2 of the 3 possible variants of a scenario »Most readable, most accurate, most appropriate
24
Ehud Reiter, Computing Science, University of Aberdeen24 Results: SumTime Question ST Human same p value SumTime vs. human texts More appropriate? 43% 27% 30% 0.021 More accurate? 51% 33% 15% 0.011 Easier to read? 41% 36% 23% >0.1 Hybrid vs. human texts More appropriate? 38% 28% 34% >0.1 More accurate? 45% 36% 19% >0.1 Easier to read? 51% 17% 33% <0.0001
25
Ehud Reiter, Computing Science, University of Aberdeen25 Reference l E Reiter, S Sripada, J Hunter, J Yu, and I Davy (2005). Choosing Words in Computer-Generated Weather Forecasts. Artificial Intelligence 167:137- 169.
26
Ehud Reiter, Computing Science, University of Aberdeen26 Human ratings evaluation l Probably most common type in NLG »Well accepted in academic literature l Easier/quicker than task-based »For controlled eval, may be able to use Amazon Mechanical Turk »Can answer questions which are hard to fit into a task-based evaluation –Can ask people to generalise
27
Ehud Reiter, Computing Science, University of Aberdeen27 Metric-based evaluation l Create a gold standard set »Input data for NLG system (scenarios) »Desired output text (usually human-written) –Sometimes multiple “reference” texts specified l Run NLG system on above data sets l Compare output to gold standard output »Various metrics, such as BLEU l Widely used in machine translation
28
Ehud Reiter, Computing Science, University of Aberdeen28 Example: SumTime input data day/hour wind-dir speed gust l 05/06SSW1822 l 05/09S1620 l 05/12S1417 l 05/15S1417 l 05/18SSE1215 l 05/21SSE1012 l 06/00VAR67
29
Ehud Reiter, Computing Science, University of Aberdeen29 Example: Gold standard l Reference 1 - SSW’LY 16-20 GRADUALLY BACKING SSE’LY THEN DECREASING VARIABLE 4-8 BY LATE EVENING l Reference 2 - SSW 16-20 GRADUALLY BACKING SSE BY 1800 THEN FALLING VARIABLE 4-8 BY LATE EVENING l Reference 3 - SSW 16-20 GRADUALLY BACKING SSE THEN FALLING VARIABLE 04-08 BY LATE EVENING Above written by three professional forecasters
30
Ehud Reiter, Computing Science, University of Aberdeen30 Metric evaluation example l SumTime output: »SSW 16-20 GRADUALLY BACKING SSE THEN BECOMING VARIABLE 10 OR LESS BY MIDNIGHT l Compare to Reference 1 »SSW’LY 16-20 GRADUALLY BACKING SSE’LY THEN DECREASING BECOMING VARIABLE 4-8 10 OR LESS BY LATE EVENING MIDNIGHT l Compute score using metric »edit distance, BLEU, etc
31
Ehud Reiter, Computing Science, University of Aberdeen31 Issues l Is SSW’LY better than SSW? »2 out of 3 reference texts use SSW »Good to have multiple reference texts l Is BY LATE EVENING better than BY MIDNIGHT? »User studies with forecast readers suggest BY MIDNIGHT is less ambiguous »Should ST be eval against human texts? –SumTime texts are better than human texts!
32
Ehud Reiter, Computing Science, University of Aberdeen32 Which Metric is Best? l Assess by validation study »Do “gold standard” eval of multiple systems –Task-performance or human ratings –Ideally evaluate 10 or more NLG systems l Which must have same inputs and target outputs »Also evaluate systems using metrics »Which metric correlates best with “gold standard” human evaluations? –Do any metrics correlate?
33
Ehud Reiter, Computing Science, University of Aberdeen33 Validation: result l NIST-5 (BLEU variant) is best predictor of human clarity (readability) judgements l No metric correlates with human accuracy judgements l E Reiter, A Belz (2009). An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems Computational Linguistics 35:529–558
34
Ehud Reiter, Computing Science, University of Aberdeen34 Metric-based evaluation l I seriously dislike »we don’t have strong evidence that metrics predict human ratings, let alone task perf »Also people can “game” the metrics l (my opinion) have distorted machine translation, summarisation »Communities forced to use poorly validated metrics for political/funding reasons »Not the way to do good science
35
Ehud Reiter, Computing Science, University of Aberdeen35 General issues l Validity »Are eval technique correlated with goal? –Do human ratings correlate with task performance? –BT: subjects did best with human text summaries, but preferred the visualisations »Psych: why do US universities use SAT? l Generalisability »Do results generalise (domains, genres, etc)? –ST: Can we generate good aviation forecasts? »Psych: intelligence tests don’t work on minorities
36
Ehud Reiter, Computing Science, University of Aberdeen36 Statistics l Be rigorous! »Non-parametric tests where appropriate »Multiple hypothesis corrections »Two-tailed p-values »Avoid post-hoc analyses l Medicine: strict stats needed »Are “significant” results replicated? »Only if stats are very rigorous
37
Ehud Reiter, Computing Science, University of Aberdeen37 Commercial NLG l Arria/Datatext: U Abdn spinout company »Monitoring equipment on oil platforms »weather forecasts »Agricultural information »Financial summaries
38
Ehud Reiter, Computing Science, University of Aberdeen38 Others l Narrative Science - Builds bespoke “automatic narrative generation” systems »Academic roots in computational creativity l Automated Insights - writes “insightful, personalized reports from your data” »Non-academic roots l Yseop - “Smart NLG” software that “writes like a human” »Chief scientist, Alain Kaeser did NLG in 1980s
39
Ehud Reiter, Computing Science, University of Aberdeen39 Others l Lots of small young startups, I lose track of them »OnlyBoth “Discovers New Insights from Data. Writes Them Up in Perfect English. All Automated” »InfoSentience “Developers of the Most Advanced Automated Narrative Generation Software” »Text-on (German) “Aus abstrakten Daten werden so Texte” l NLG projects at large companies. »INLG 2012 panel - Thomson-Reuters, Agfa »More secretive
40
Ehud Reiter, Computing Science, University of Aberdeen40 Common Themes l Almost all claim to generate narratives/stories from data l Financial reporting is most commonly mentioned use l Companies still quite small »Fewer than 100 employees, compared to 12,000 at Nuance or 400,000 at IBM »But large compared to earlier NLG companies »Also lots of them!
41
Ehud Reiter, Computing Science, University of Aberdeen41 Robojournalism l Computers write articles for newspapers »Sports, finance, weather l Lots of media attention »http://www.bbc.co.uk/news/technology- 34204052 (many others)http://www.bbc.co.uk/news/technology- 34204052 l K Dörr (2015), Mapping the Field of Algorithmic Journalism. To appear in Digital Journalism
42
Ehud Reiter, Computing Science, University of Aberdeen42 Appendix: Specifics l How perform a controlled ratings-based evaluation? l Example: weather forecasts
43
Ehud Reiter, Computing Science, University of Aberdeen43 Experimental Design l Hypotheses l Subjects l Material l Procedure l Analysis
44
Ehud Reiter, Computing Science, University of Aberdeen44 Hypotheses: before experiment l Define hypotheses, stats, etc in detail before the experiment is done »In medicine, expected to publish full experimental design beforehand –https://clinicaltrials.gov/https://clinicaltrials.gov/ »If multiple hypothesis, reduce p value for significance (discuss later) l Why?
45
Ehud Reiter, Computing Science, University of Aberdeen45 Post-hoc l Colleague once told me “I didn’t see a significant effect initially, so I just loaded the data into SPSS and tried all kinds of stuff until I saw something with p <.05” l What is wrong with this?
46
Ehud Reiter, Computing Science, University of Aberdeen46 Why is this bad? l Assume we test 10 variants of a hypothesis »“Sumtime more accurate than human” »“Hybrid more readable than human” »etc l Assume we use 10 different stat tests »Eg, normalise data in different ways l 100 tests »so we’ll see a (variant, stat) combination which is sig at p =.01, even if no genuine effect
47
Ehud Reiter, Computing Science, University of Aberdeen47 Hypotheses: SumTime l Hyp 1: Sumtime texts more appropriate than human texts l Hyp 2: Hybrid texts more readable than human texts l 2 hypotheses, so significant of p <.025 l Any other hypothesis post-hoc »Including “ST texts more accurate” »Not significant even though p = 0.011
48
Ehud Reiter, Computing Science, University of Aberdeen48 Subjects: Who are they l What subjects are needed »Language skills? Domain knowledge? Background? Age? Etc l Sometimes not very restrictive »General hypotheses about language »Mechanical Turk is good option l Sometimes want specific people »Eg, test reaction of users to a system –Babytalk-Family evaluated by parents who have babies in neonatal ICU
49
Ehud Reiter, Computing Science, University of Aberdeen49 How many subjects? l Can do a power calculation to determine subject numbers »https://en.wikipedia.org/wiki/Statistical_powerhttps://en.wikipedia.org/wiki/Statistical_power »Depends on expected effect size –More subjects needed for smaller effects »Typically looking for 50+ –Not a problem with Mturk –Can be real hassle if need subjects with specialised skills or backgrounds
50
Ehud Reiter, Computing Science, University of Aberdeen50 Recruitment of subjects l General subjects (easier) »Mechanical Turk »Local students l Specialised subjects (harder) »email lists, networks, conferences, … »Personal contacts
51
Ehud Reiter, Computing Science, University of Aberdeen51 Subjects: SumTime l Type: regular readers and users of marine weather forecasts l Recruitment: asked domain experts (working on project) to recruit via their networks and contacts l Number: wanted 50, got 72
52
Ehud Reiter, Computing Science, University of Aberdeen52 Material: scenarios l Usually start by choosing some scenarios (data sets) »Usually try to be representative and/or cover important cases »Random choice also possible
53
Ehud Reiter, Computing Science, University of Aberdeen53 Material: presentations l Typically prepare different presentations of each scenario »Output of NLG system(s) being evaluated »Control/baseline –Human-authored text? –Output of current best-performing NLG system? –Fixed (non-generated) text? »Depends on hypotheses
54
Ehud Reiter, Computing Science, University of Aberdeen54 Material: structure l For each scenario, subjects can see »One presentation »Some presentations »All presentations l Subjects should not know if a presentation is NLG or control!
55
Ehud Reiter, Computing Science, University of Aberdeen55 Material: Sumtime l Number of scenarios: 5 »Corpus texts written by 5 forecasters »First text written by forecaster after magic date »Wanted human/control texts from each of 5 l Number of presentations: 3 »SumTime (main) »Human (control) »Hybrid (of content-det vs microplan/real) l Structure: »Present pairs (2 out of 3) to each subject
56
Ehud Reiter, Computing Science, University of Aberdeen56 Procedure: What subject do l What questions asked »Readable, accurate, useful »Response: N-pt Likert scale, slider –https://en.wikipedia.org/wiki/Likert_scale l Order »Latin Square (Balanced) »Random l Payment?
57
Ehud Reiter, Computing Science, University of Aberdeen57 Latin Square Scenario 1Scenario 2Scenario 3 Subject 1SumTimeHumanHybrid Subject 2HybridSumTimeHuman Subject 3HumanHybridSumTime
58
Ehud Reiter, Computing Science, University of Aberdeen58 Procedure: Questions l Practice questions at beginning? l Fillers between questions we care about? l Especially important if we want timings
59
Ehud Reiter, Computing Science, University of Aberdeen59 Procedure: Ethics l Can doing experiment harm people? »BT-Nurse and patient care »If so, must present acceptable solution l Subjects can drop out at any time »Can NOT “pressure” them to stay if the want to quit experiment
60
Ehud Reiter, Computing Science, University of Aberdeen60 Procedure: Exclusion l When do we drop a subject from the experiment? »Incomplete responses? »Inconsistent responses? »Bizarre responses?
61
Ehud Reiter, Computing Science, University of Aberdeen61 Procedure: SumTime l Questions »Presented 2 variants »Which variant is: easiest to read; most accurate; most appropriate l Order not randomised l No payment l No practice or filler, no ethical issues l Excluded if less than 50% completed
62
Ehud Reiter, Computing Science, University of Aberdeen62 Statistics: Test l Principle: Likert scales are not numbers »Should not be averaged »Non-parametric test (Wilcoxon Signed Rank) l Practice »Often present average Likert score »Use parametric test, such as t-test »More or less works…. –But not if rigorous stats needed!
63
Ehud Reiter, Computing Science, University of Aberdeen63 Statistics: Normalisation l Some users are more generous than others l Some scenarios are harder than others l Potential bias »User X always “agree”, Y always “disagree” »X rates 10 SumTime texts and 1 corpus text »Y rates 1 SumTime text and 10 corpus texts l Use balanced design (Latin square) l Use linear model »Predicts score on user, scenario, presentation »Just look at presentation element
64
Ehud Reiter, Computing Science, University of Aberdeen64 Statistics: Multiple Hypoth l Bonferroni multiple hypothesis correction l Divide significance p value by number of hypotheses being tested »1 hypothesis: look for p <.05 »2 hypotheses: look for p < 0.025 »10 hypotheses: look for p > 0.005
65
Ehud Reiter, Computing Science, University of Aberdeen65 Statistics: SumTime l Test: Chi-square »Because users asked to state a preference between variants, did not give Likert score l Normalisation: not necessary »Less important with preferences –If user is asked whether A or B is better, doesnt matter how generous he is (“great” vs “poor”) l Multiple hypotheses: p < 0.025 »Because 2 hypotheses
66
Ehud Reiter, Computing Science, University of Aberdeen66 Questions
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.