The ACL ARC Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
LREC 2008 (Marrakech, Morocco), 27-31 May 2008


1 The ACL ARC Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird 1, Robert Dale 2, Bonnie Dorr 3, Bryan Gibson 4, Mark T. Joseph 4, Min-Yen Kan 5, Dongwon Lee 6, Brett Powley 2, Dragomir R. Radev 4, Yee Fan Tan 5

2 http://acl.ldc.upenn.edu (ca. 2002)

3 http://www.aclweb.org/anthology-new

4 http://www.aclweb.org/anthology-new/P/P05
Screenshot callouts: PDF file; Booktitle; Basic Metadata; Some Detailed Metadata

5 Hot NLP Problems
Graphical Methods for NLP
  – Social network analysis
Text categorization
  – Sentence / Citation function
Sequence Labeling
  – Reference string parsing
Bayesian Models
  – Topic Models
Summarization
  – Survey Paper Generation

6 The Anthology as Corpus
Why use newswire?
  – Because our funding agencies want it
Let's build a corpus from our own publications!
  – Test domain adaptation techniques
  – Characterize what's special about scientific discourse
  – Help ourselves and others understand our research better
Start with the largest freely available NLP research archive

7 The Anthology Reference Corpus
Scholars have already been using scientific articles as input
But the datasets are largely disparate
Results are not comparable
Goal: unify such work by agreeing to work on a central dataset (à la TREC)

8 The ACL ARC
Consists of most articles available as of February 2006 that have extractable text
Papers: 10,921
Total references: 152,546
References to articles inside ACL ARC: 38,767 (25.4%)
References to articles outside ACL ARC: 113,779 (74.6%)
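
The percentages on this slide follow directly from the reference counts. A minimal sketch of the arithmetic, using only the figures reported above; the variable and function names are illustrative, and the per-paper average is a derived figure not shown on the slide:

```python
# Counts as reported on the slide (February 2006 snapshot of the ACL ARC).
papers = 10_921
total_references = 152_546
inside_arc = 38_767
outside_arc = 113_779


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# The two reference categories partition the total.
assert inside_arc + outside_arc == total_references

inside_pct = share(inside_arc, total_references)
outside_pct = share(outside_arc, total_references)

# Derived figure (not on the slide): average number of references per paper.
refs_per_paper = total_references / papers

print(f"inside: {inside_pct}%, outside: {outside_pct}%")
print(f"references per paper: {refs_per_paper:.2f}")
```

Roughly three in four references point outside the corpus, so any citation graph built from the ACL ARC alone covers only about a quarter of the cited literature.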

9 What's included now
Version 2008 03 25:
  – PDFs for all 10,921 articles
  – metadata tuples
  – Noisy text output extracted from the PDFs, using a non-OCR-based extractor (pdfbox)

10 The road ahead
1. Improve data quality
2. Establish subsets for smaller experiments
3. Build and release open-source tools
4. Enlarge coverage of newer materials
5. Release major revisions (infrequently)
Achieving the goals of the Linked Anthology Proposal

11 Future data (near-term)
Inter-document
  – Manually cleaned citation graph from the ACL Anthology Network
Intra-document
  – Citation to reference string matching
Document
  – Automatic keyphrase generation
  – OCR-based text extracted output (much cleaner)
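
To make the inter-document item concrete, here is a minimal sketch of a citation graph built from already-resolved reference lists. The paper IDs and data are hypothetical toy values, not drawn from the ACL Anthology Network, and the sketch assumes references have already been matched to corpus IDs (with `None` standing for a work outside the corpus):

```python
from collections import Counter

# Hypothetical toy data: each paper ID maps to the IDs its references resolve to.
# None marks a reference to a work outside the corpus.
references = {
    "A": ["B", "C", None],
    "B": ["C", None, None],
    "C": [None],
}


def citation_edges(refs):
    """Return (citing, cited) pairs for references that resolve inside the corpus."""
    return [
        (src, dst)
        for src, dsts in refs.items()
        for dst in dsts
        if dst in refs
    ]


edges = citation_edges(references)

# In-degree: how often each paper is cited from within the corpus.
in_degree = Counter(dst for _, dst in edges)

# Fraction of all references that stay inside the corpus.
total_refs = sum(len(dsts) for dsts in references.values())
inside_share = len(edges) / total_refs

print(edges)
print(in_degree.most_common())
```

In this toy graph, paper "C" is the most cited, and three of the seven references resolve inside the corpus.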

12 http://belobog.si.umich.edu/clair/anthology/index.cgi

13 Tools in development by partners
Automatic Reference Segmentation
  – ParsCit: Open-source reference string parser; also LREC 08
Automatic Survey Article Generation
  – iOpener: summarization of articles at different expertise levels
Automatic Reference-Article matching
  – Record Linkage: using web data to match articles
Citation Function Classification
  – What's the purpose of a citation?
Next big application
  – Your work here: please join us; this should be a community-wide effort
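
As a rough illustration of what reference segmentation produces, the sketch below splits a reference string into fields with a single regular expression. This is a deliberately simple rule-based stand-in, not ParsCit's method or API (ParsCit relies on a trained sequence labeler); the pattern, function name, and example string are all assumptions made up for illustration, and the pattern handles only an "Authors. Year. Title. Venue." layout:

```python
import re

# Toy pattern for one common layout: "Authors. Year. Title. Venue."
# Real reference strings vary far more than this pattern allows.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\.\s+"
    r"(?P<year>(?:19|20)\d{2})\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<venue>.+?)\.?$"
)


def parse_reference(reference: str):
    """Split a reference string into fields, or return None if the layout differs."""
    match = REFERENCE_PATTERN.match(reference.strip())
    return match.groupdict() if match else None


# Hypothetical reference string, invented for this example.
example = (
    "A. Author and B. Writer. 2005. A made-up title. "
    "In Proceedings of an Imaginary Workshop."
)
fields = parse_reference(example)
print(fields)
```

A rule like this breaks as soon as the year moves to the end of the entry or the title contains a period followed by a space, which is one reason the problem is usually treated as sequence labeling.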

14 Thank you!
http://acl-arc.comp.nus.edu.sg/
Web: “acl arc” home page
dAnth Digital Anthologies mailing list

