
Slide 1: Evaluating Summaries Automatically – a system proposal
Paulo C. F. de Oliveira, Edson Wilson Torrens, Alexandre Cidral, Sidney Schossland, Evandro Bittencourt
University of Joinville, Joinville – SC – Brazil
{pc.oliveira, edson.wilson, alexandre.cidral, sidney.schossland, ebitt}@univille.net

Slide 2: Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusions

Slide 3: Introduction
- There is a lot of debate in the automatic text summarisation literature regarding appropriate evaluation metrics for measuring the quality of summaries
- A series of conferences has attested to the importance of evaluation:
  - Text Retrieval Conference (TREC)
  - Message Understanding Conference (MUC)
  - TIPSTER SUMMAC Text Summarisation Evaluation
  - Document Understanding Conference (DUC)
  - Text Summarisation Challenge (TSC)

Slide 4: Introduction
- Difficulties in summary evaluation:
  - There is no clear definition of what constitutes a good summary
  - Each text can have more than one 'ideal' summary
  - Dependence on human intervention brings drawbacks: it is time-consuming, costly, and subject to judgement bias
- If one needs an evaluation method that is fast and free of these drawbacks, an automatic evaluation method could be the answer

Slide 5: Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusion

Slide 6: Evaluation Method
- An automatic summary evaluation system called VERT (Valuation using Enhanced Rationale Technique)
- VERT-C (χ² statistics): deals with content-bearing words in both the reference text and the candidate summary, using correlation analysis and χ² statistics
- VERT-F (n-gram matching): deals with the matching between sentences using a graph-theory method, the maximum bipartite matching problem (MBMP)
- BLEU: BiLingual Evaluation Understudy, 'blue' (Papineni et al. 2001)
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation, 'red' (Lin 2004)

Slide 7: Evaluation Method – VERT-C (χ² statistics)
Algorithm (Siegel and Castellan, 1988):
- State the null hypothesis (H0): the summary is a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is the same as in its parent/full text
- State the alternative hypothesis (H1): the summary is not a good representation of its parent/full text, i.e. the distribution of content-bearing words in the summary is different from that in its parent/full text

Slide 8: Evaluation Method – VERT-C Algorithm (cont'd)
1. PRODUCE a word frequency list for the full text and for the summary (observed frequencies)
2. NORMALISE the list (e.g. analyse, analysed, analysing → analys)
3. ARRANGE the list in an array with 2 columns (one for the full text and one for the summary) – a contingency table
4. SUM UP the cell frequencies across the columns
5. COMPUTE the expected frequency E_i of each word in the summary column
6. COMPUTE (O_i − E_i)² / E_i for each word
7. COMPUTE the χ² statistic by summing these terms
8. COMPUTE the VERT-C score from the χ² statistic – this is VERT-C (a sketch follows below)
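A minimal sketch of these steps in Python, assuming the standard Pearson χ² computation over the word contingency table; the tokeniser and stemmer are placeholders, and because the slides do not show how the χ² statistic is mapped to the final VERT-C score, the sketch simply returns the statistic (steps 1-7):

```python
from collections import Counter
import re

def word_freqs(text, stem=lambda w: w):
    """Tokenise, lowercase and (optionally) stem; return a word frequency list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words)

def vert_c_chi2(full_text, summary, stem=lambda w: w):
    """Steps 1-7: contingency table of full-text vs summary counts, then chi-square."""
    ft, sm = word_freqs(full_text, stem), word_freqs(summary, stem)
    table = [(w, ft[w], sm[w]) for w in ft]       # rows indexed by full-text words
    ft_total = sum(f for _, f, _ in table)        # column totals (step 4)
    sm_total = sum(s for _, _, s in table)
    chi2 = 0.0
    for _, f_i, o_i in table:
        e_i = f_i * sm_total / ft_total           # expected summary frequency (step 5)
        chi2 += (o_i - e_i) ** 2 / e_i            # steps 6-7
    return chi2
```

With the figures from the contingency table on slide 9 ('spiers' occurs 8 times in the full text, and the column totals are 57 and 35), step 5 gives 8 × 35/57 ≈ 4.9, which matches the E_i shown there.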

Slide 9: Evaluation Method – VERT-C contingency table example

Word           | FT | SP10: O_i (E_i)
---------------|----|----------------
spiers         |  8 | 5 (4.9)
affairs        |  6 | 4 (3.7)
ambassador     |  5 | 3 (3.1)
general        |  5 | 3 (3.1)
state          |  4 | 2 (2.5)
political      |  4 | 3 (2.5)
director       |  3 | 3 (1.8)
department     |  3 | 2 (1.8)
undersecretary |  3 | 1 (1.8)
pakistan       |  2 | 2 (1.2)
appoint        |  2 | 1 (1.2)
turkey         |  2 | 1 (1.2)
embassy        |  2 | 1 (1.2)
london         |  2 | 1 (1.2)
charge         |  2 | 1 (1.2)
bahamas        |  2 | 1 (1.2)
secretary      |  2 | 1 (1.2)
Total          | 57 | 35

Slide 10: Evaluation Method – VERT-F (n-gram matching) (Turian et al. 2003)
[Figure: bipartite word-matching example between a reference text (words: man, the, saw, dog, the) and a candidate text ('man was seen by the dog'), with edges linking identical words]

Slide 11: Evaluation Method – VERT-F (n-gram matching)
- Maximum Bipartite Matching Problem (MBMP) (Cormen et al. 2001)
- Let G = (V, E) be a bipartite graph in which V can be partitioned into two sets V1 and V2 such that V = V1 ∪ V2. A matching M on G is a subset of the edges (arcs) of G such that each vertex (node) in G is incident with no more than one edge in M.
- A maximum matching is a matching of maximum size, i.e. the problem is to find the largest subset of edges such that no two edges share an endpoint (a sketch follows below)
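A minimal sketch of how such a matching can be computed between two word lists, assuming the classic augmenting-path (Kuhn) algorithm and an edge between every pair of identical words; this is an illustration, not the authors' implementation:

```python
def maximum_bipartite_matching(ref_words, cand_words):
    """Return the size of a maximum matching between reference and candidate words.

    Left nodes are reference-word positions, right nodes are candidate-word
    positions; an edge links two positions holding the same word.
    """
    # adjacency: for each reference position, the candidate positions it may match
    adj = [[j for j, c in enumerate(cand_words) if c == r] for r in ref_words]
    match_of_cand = [-1] * len(cand_words)  # candidate position -> matched ref position

    def try_augment(i, visited):
        for j in adj[i]:
            if j in visited:
                continue
            visited.add(j)
            # take j if it is free, or if its current partner can be re-matched elsewhere
            if match_of_cand[j] == -1 or try_augment(match_of_cand[j], visited):
                match_of_cand[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(ref_words)))

# Example with word lists like those on slide 10:
ref = "the man saw the dog".split()
cand = "the man was seen by the dog".split()
print(maximum_bipartite_matching(ref, cand))  # 4 word pairs matched: the, the, man, dog
```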

Slide 12: Evaluation Method – Maximum Bipartite Matching Problem (MBMP)

Slide 13: Evaluation Method – Maximum Bipartite Matching Problem (MBMP)
- The Maximum Match Size (MMS) of a bitext is the size of any maximum matching for that bitext
- Setting the weighting parameter to 1 in the score formula gives VERT-F (a reconstruction follows below)
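The score formula itself does not survive in this transcript. A plausible reconstruction, assuming VERT-F follows the GTM-style F-measure of Turian et al. (2003) with MMS-based precision and recall and equal weighting (an assumption, not confirmed by the slides):

```python
def vert_f(mms, ref_len, cand_len):
    """Assumed VERT-F: harmonic mean (F1) of MMS-based precision and recall."""
    precision = mms / cand_len  # fraction of candidate words that are matched
    recall = mms / ref_len      # fraction of reference words that are matched
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# With the slide-10 example (MMS = 4, reference length 5, candidate length 7):
# precision = 4/7, recall = 4/5, so the score is about 0.67
```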

Slide 14: Evaluation Method – VERT's Architecture

Slide 15: Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusion

Slide 16: Evaluation of Evaluation
- Investigate the performance of VERT
- The efficacy of an automatic evaluation metric must be assessed through correlation analysis: the automatic scores should correlate highly with human scores (Lin 2004, Mani 2000)
- Correlation analysis makes use of measures of correlation, which are 'descriptive statistical measures that represent the degree of relationship between two or more variables' (Sheskin 2000)

Slide 17: Evaluation of Evaluation
- Descriptive measures of correlation are also known as correlation coefficients
- Is it possible to determine through correlation analysis whether one metric is 'better' than another, e.g. ROUGE vs VERT?
- The answer is ranking correlation (Voorhees 2000): the rankings produced by a particular scoring (e.g. an evaluation metric) are more important than the scores themselves
- Kendall's Tau (τ) correlation coefficient gives us this insight: τ depends on the number of inversions in the rank order of one variable when the other variable is ranked in order (see the sketch below)
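A minimal illustration of that inversion-based idea, assuming the tie-free formulation in which τ is the difference between concordant and discordant pairs divided by the total number of pairs (scipy.stats.kendalltau gives the same value for tie-free data); the score lists here are hypothetical:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two score lists, by counting concordant/discordant pairs."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # a discordant pair is an 'inversion': the two variables rank it in opposite order
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-system metric scores vs human mean coverage scores
metric_scores = [0.42, 0.35, 0.51, 0.29, 0.47]
human_scores  = [0.60, 0.55, 0.71, 0.40, 0.66]
print(kendall_tau(metric_scores, human_scores))  # 1.0: both rank the systems identically
```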

Slide 18: Evaluation of Evaluation – Data set
- DUC data, because it contains 3 years of human judgements (DUC 2001, 2002 and 2003)

                        | DUC 2001    | DUC 2002    | DUC 2003
Summary type            | single docs | single docs | single docs
Summary length (words)  | 100         | 100         | 10
No. of systems          | 15          | 17          | 14
Total no. of summaries  | 3304        | 7359        | 8050

Slide 19: Evaluation of Evaluation
- Kendall's τ was computed between each system's average VERT-C and VERT-F scores and the corresponding mean coverage scores assigned by the NIST assessors

DUC  | VERT-C vs Humans | VERT-F vs Humans
2001 | 0.78             | 0.91
2002 | 0.52             | 0.89
2003 | 0.59             | 0.95

Slide 20: Evaluation of Evaluation
- Why did VERT-F perform better than VERT-C? Because of the difference between the two approaches:
  - VERT-F is based on matching all words between the reference and the candidate text
  - VERT-C is word-frequency based, i.e. only the most frequent (content-bearing) words in the reference and the candidate are considered, resulting in less measured similarity

Slide 21: Evaluation of Evaluation
- What about ROUGE and BLEU vs human scores?

DUC  | BLEU vs Humans | ROUGE vs Humans
2001 | 0.64           | 0.85
2002 | 0.54           | 0.99
2003 | 0.05           | 0.97

Slide 22: Evaluation of Evaluation

Slide 23: Overview
- Introduction
- Method's Description
- Evaluation of Evaluation
- Conclusions

Slide 24: Conclusions
- An investigation into the power of our evaluation procedure was carried out, relying on correlation analysis
- VERT-F performed better than VERT-C because of the difference between the two approaches
- VERT scores correlated highly and positively with human scores
- This is a significant achievement because 3 years of human evaluation were used
- We found it worthwhile to use BLEU and ROUGE as baselines for a comparative evaluation, because ranking correlation is an attractive method given its strong statistical background

Slide 25: Conclusions
- We found that ROUGE outperformed BLEU and VERT against human scores
- However, VERT-F performed similarly to ROUGE, the official metric used by NIST
- Comparative evaluation, i.e. evaluation of evaluation, is central to summary evaluation research
- The use of a mature discipline like statistics allows us to confirm that our evaluation experiments are significant and the results are consistent
- Our work contributed to a solid advance in the state of the art

Slide 26: Future Work
- Future research can explore the application of VERT to machine translation, as is done with BLEU and ROUGE
- The development of a thesaurus module in VERT could handle linguistic variation such as paraphrasing
- Application of VERT in different domains

