Evaluating Summaries Automatically – a System Proposal
Paulo C. F. de Oliveira, Edson Wilson Torrens, Alexandre Cidral, Sidney Schossland, Evandro Bittencourt
University of Joinville, Joinville – SC – Brazil
{pc.oliveira,edson.wilson, alexandre.cidral,sidney.schossland,
Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusions
Introduction
- There is much debate in the automatic text summarisation literature about appropriate evaluation metrics for measuring the quality of summaries
- A series of conferences has attested to the importance of evaluation:
  - Text Retrieval Conference (TREC)
  - Message Understanding Conference (MUC)
  - TIPSTER SUMMAC Text Summarisation Evaluation
  - Document Understanding Conference (DUC)
  - Text Summarisation Challenge (TSC)
Introduction
- Difficulties in summary evaluation:
  - There is no clear definition of what constitutes a good summary
  - Each text can have more than one 'ideal' summary
  - Dependence on human intervention brings drawbacks: it is time-consuming, costly, and prone to subjective judgement and bias
- If one needs an evaluation method that is fast and free of these drawbacks, an automatic evaluation method could be the answer
Evaluation Method
- An automatic summary evaluation system called VERT (Valuation using Enhanced Rationale Technique)
  - VERT-C (χ² statistics): deals with content-bearing words in both the reference text and the candidate summary, using correlation analysis and the χ² statistic
  - VERT-F (n-gram matching): deals with the matching between sentences using a graph-theoretic method – the maximum bipartite matching problem (MBMP)
- Related metrics:
  - BLEU – BiLingual Evaluation Understudy, "blue" (Papineni et al 2001)
  - ROUGE – Recall-Oriented Understudy for Gisting Evaluation, "red" (Lin 2004)
Evaluation Method: VERT-C (chi-square statistics)
- Algorithm (Siegel and Castellan, 1988):
  - State the null hypothesis (H0): the summary is a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is the same as in its parent/full text
  - State the alternative hypothesis (H1): the summary is not a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is different from that of its parent/full text
Evaluation Method: VERT-C Algorithm (cont'd)
1. PRODUCE a word frequency list for the full text and the summary (observed frequencies)
2. NORMALISE the list (e.g. analyse, analysed, analysing → analys)
3. ARRANGE the list in an array with two columns (one for the full text, the other for the summary) – a contingency table
4. SUM UP the cell frequencies across the columns
5. COMPUTE the expected frequency of each cell from the marginal totals
6. COMPUTE the χ² statistic from the observed and expected frequencies
7. COMPUTE the significance of the statistic
8. COMPUTE the VERT-C score – this is VERT-C (sketched in code below)
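A minimal Python sketch of steps 1–6, assuming NLTK's Porter stemmer for the normalisation step and non-empty input texts; the function name and the mapping to a final score are illustrative, not the authors' implementation:

```python
from collections import Counter
from nltk.stem import PorterStemmer

def vert_c_chi2(full_text_tokens, summary_tokens):
    stem = PorterStemmer().stem
    # Steps 1-2: observed word frequencies, normalised by stemming
    ft = Counter(stem(w) for w in full_text_tokens)
    sm = Counter(stem(w) for w in summary_tokens)
    # Step 3: two-column contingency table (full text vs. summary)
    table = [(ft[w], sm[w]) for w in set(ft) | set(sm)]
    # Step 4: column totals and grand total
    col_ft = sum(o for o, _ in table)
    col_sm = sum(o for _, o in table)
    grand = col_ft + col_sm
    # Steps 5-6: expected cell frequencies and the chi-square statistic
    chi2 = 0.0
    for o_ft, o_sm in table:
        row = o_ft + o_sm
        e_ft = row * col_ft / grand
        e_sm = row * col_sm / grand
        chi2 += (o_ft - e_ft) ** 2 / e_ft + (o_sm - e_sm) ** 2 / e_sm
    # Steps 7-8 would turn this statistic (and its significance)
    # into the final VERT-C score.
    return chi2
```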
Evaluation Method: VERT-C contingency table example

Word             FT   Summary O_i (E_i)
spiers            8    5 (4.9)
affairs           6    4 (3.7)
ambassador        5    3 (3.1)
general           5    3 (3.1)
state             4    2 (2.5)
political         4    3 (2.5)
director          3    3 (1.8)
department        3    2 (1.8)
undersecretary    3    1 (1.8)
pakistan          2    2 (1.2)
appoint           2    1 (1.2)
turkey            2    1 (1.2)
embassy           2    1 (1.2)
london            2    1 (1.2)
charge            2    1 (1.2)
bahamas           2    1 (1.2)
secretary         2    1 (1.2)
Total            57   35
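As a worked example, the expected summary frequency for "spiers" follows from the table's marginals: E = (8 + 5) × 35 / (57 + 35) = 13 × 35 / 92 ≈ 4.9.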
Evaluation Method: VERT-F (n-gram matching) (Turian et al 2003)
[Figure: bipartite word matching between a reference text ("the dog saw the man") and a candidate text ("the man was seen by the dog")]
Evaluation Method: VERT-F (n-gram matching)
- Maximum Bipartite Matching Problem (MBMP) (Cormen et al 2001)
- Let G = (V, E) be a bipartite graph in which V can be partitioned into two sets V1 and V2 such that V = V1 ∪ V2
- A matching M on G is a subset of the edges (arcs) of G such that each vertex (node) in G is incident with no more than one edge in M
- A maximum matching is a matching of maximum cardinality, i.e. the largest subset of edges such that no two edges share an endpoint
Evaluation Method: Maximum Bipartite Matching Problem (MBMP)
[Diagram of a maximum bipartite matching]
Evaluation Method: Maximum Bipartite Matching Problem (MBMP)
- The Maximum Match Size (MMS) of a bitext is the size of any maximum matching for that bitext
- From the MMS we obtain precision P = MMS(C, R) / |C| and recall R = MMS(C, R) / |R|, where C is the candidate text and R the reference text
- In general F_β = (β² + 1)PR / (β²P + R); for β = 1 we will have VERT-F = 2PR / (P + R) – this is VERT-F (sketched in code below)
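A self-contained sketch of VERT-F under these definitions, assuming only unigram identity matches; the augmenting-path matcher and the function names are illustrative, not the authors' implementation:

```python
def mms(ref_words, cand_words):
    """Maximum match size via Kuhn's augmenting-path algorithm."""
    # Edge from candidate word i to reference word j when identical
    adj = [[j for j, r in enumerate(ref_words) if c == r]
           for c in cand_words]
    match_of_ref = [-1] * len(ref_words)  # ref position -> cand position

    def augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_of_ref[j] == -1 or augment(match_of_ref[j], seen):
                    match_of_ref[j] = i
                    return True
        return False

    return sum(augment(i, set()) for i in range(len(cand_words)))

def vert_f(ref_words, cand_words):
    m = mms(ref_words, cand_words)
    if m == 0:
        return 0.0
    precision = m / len(cand_words)
    recall = m / len(ref_words)
    # Harmonic mean: the beta = 1 case from the slide
    return 2 * precision * recall / (precision + recall)

ref = "the dog saw the man".split()
cand = "the man was seen by the dog".split()
print(vert_f(ref, cand))  # MMS = 4, P = 4/7, R = 4/5 -> F = 2/3
```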
Evaluation Method: VERT's Architecture
[Architecture diagram]
Evaluation of Evaluation
- Goal: investigate the performance of VERT
- The efficacy of an automatic evaluation metric must be assessed through 'correlation analysis'
- The automatic scores should correlate highly with human scores (Lin 2004, Mani 2000)
- Correlation analysis makes use of measures of correlation, which are 'descriptive statistical measures that represent the degree of relationship between two or more variables' (Sheskin 2000)
Evaluation of Evaluation
- Descriptive measures are also known as correlation coefficients
- Is it possible to determine whether one metric is 'better' than another through correlation analysis, e.g. ROUGE vs VERT?
- The answer is ranking correlation (Voorhees 2000): the rankings produced by a particular scoring (e.g. an evaluation metric) are more important than the scores themselves
- Kendall's tau (τ) correlation coefficient gives us this insight
- Kendall's τ depends on the number of inversions in the rank order of one variable when the other variable is ranked in order (a usage sketch follows)
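A minimal sketch of this kind of rank-correlation check using SciPy; the score values below are made up purely for illustration:

```python
from scipy.stats import kendalltau

# Per-system scores: one human score and one metric score per system
human_scores  = [0.61, 0.55, 0.48, 0.40, 0.33]  # e.g. NIST mean coverage
metric_scores = [0.70, 0.66, 0.52, 0.55, 0.31]  # e.g. VERT-F averages

tau, p_value = kendalltau(human_scores, metric_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```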
Evaluation of Evaluation: Data set
- DUC data, because it contains three years of human judgements (DUC 2001, 2002 and 2003)
[Table: DUC 2001 / DUC 2002 / DUC 2003 – summary type (single documents), summary length in words, number of systems, total number of summaries; cell values not preserved]
Evaluation of Evaluation
- Kendall's τ was computed between each system's average VERT-C and VERT-F scores and its respective mean coverage scores assigned by NIST assessors
[Table: Kendall's τ per DUC year for VERT-C vs humans and VERT-F vs humans; values not preserved]
Evaluation of Evaluation
- Why did VERT-F perform better than VERT-C? Because of the difference between the approaches:
  - VERT-F is based on matching all words between the reference and the candidate text
  - VERT-C is word-frequency based, i.e. only the most frequent (content-bearing) words in the reference and the candidate are considered, resulting in less similarity
Evaluation of Evaluation
- What about ROUGE and BLEU vs human scores?
[Table: Kendall's τ per DUC year for BLEU vs humans and ROUGE vs humans; values not preserved]
Conclusions
- An investigation into the power of our evaluation procedure was carried out, relying on correlation analysis
- VERT-F performed better than VERT-C, owing to the difference between the approaches
- VERT scores correlated highly and positively with human scores
- This is a significant achievement, because three years of human evaluation were used
- We found it worthwhile to use BLEU and ROUGE as baselines for a comparative evaluation, because ranking correlation is an attractive method thanks to its strong statistical background
Conclusions
- We found that ROUGE outperformed BLEU and VERT against human scores
- However, VERT-F performed similarly to ROUGE – the official metric used by NIST
- Comparative evaluation, i.e. evaluation of evaluation, is central to summary evaluation research
- The use of a mature discipline like statistics allows us to confirm that our evaluation experiments are significant and the results are consistent
- Our work contributed a solid advance to the state of the art
Future Work
- Future research can explore the application of VERT in machine translation, as with BLEU and ROUGE
- The development of a thesaurus module in VERT could handle linguistic variation such as paraphrasing
- VERT could be applied in different domains