The ACL ARC Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
LREC 2008 (Marrakech, Morocco), 27-31 May 2008

Presentation transcript:

Slide 1: The ACL ARC Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird 1, Robert Dale 2, Bonnie Dorr 3, Bryan Gibson 4, Mark T. Joseph 4, Min-Yen Kan 5, Dongwon Lee 6, Brett Powley 2, Dragomir R. Radev 4, Yee Fan Tan

Slide 2: (ca. 2002)

Slide 4 (figure labels): PDF file, Booktitle, Basic Metadata, Some Detailed Metadata

Slide 5: Hot NLP Problems
- Graphical Methods for NLP
  - Social network analysis
- Text categorization
  - Sentence / Citation function
- Sequence Labeling
  - Reference string parsing
- Bayesian Models
  - Topic Models
- Summarization
  - Survey Paper Generation

Slide 6: The Anthology as Corpus
- Why use newswire?
  - Because our funding agencies want it
- Let's build a corpus from our own publications!
  - Test domain adaptation techniques
  - Characterize what's special about scientific discourse
  - Help ourselves and others understand our research better
- Start with the largest freely available NLP research archive

Slide 7: The Anthology Reference Corpus
- Scholars have already been using scientific articles as input
- But datasets largely disparate
- Results not comparable
- Goal: unify such work by agreeing to work on a central dataset (à la TREC)

Slide 8: The ACL ARC
Consists of most articles available as of February 2006 that have extractable text

Papers                                    10,921
Total References                         152,546
References to articles inside ACL ARC     38,767 (25.4%)
References to articles outside ACL ARC   113,779 (74.6%)
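
The percentages on this slide follow directly from the reported counts. A minimal sketch of that arithmetic is given here; the variable and function names are illustrative and not part of the original slides.

```python
# Counts as reported on the slide.
total_references = 152_546
inside_arc = 38_767
outside_arc = 113_779


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two categories should partition the total number of references.
counts_consistent = inside_arc + outside_arc == total_references
inside_pct = share(inside_arc, total_references)
outside_pct = share(outside_arc, total_references)

print(counts_consistent, inside_pct, outside_pct)
```

The inside and outside counts sum to the stated total, so the two shares add up to 100%.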

Slide 9: What's included now
Version :
- PDFs for all 10,921 articles
- metadata tuples
- Noisy extracted-text output from the PDFs
  - Using a non-OCR-based extractor (pdfbox)

Slide 10: The road ahead
1. Improve data quality
2. Establish subsets for smaller experiments
3. Build and release open-source tools
4. Enlarge coverage of newer materials
5. Release major revisions (infrequently)
Achieving the goals of the Linked Anthology Proposal

Slide 11: Future data (near-term)
- Inter-document
  - Manually cleaned citation graph from the ACL Anthology Network
- Intra-document
  - Citation to reference string matching
- Document
  - Automatic keyphrase generation
  - OCR-based extracted-text output (much cleaner)
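
To make "citation to reference string matching" concrete, the sketch below links in-text citations such as "(Smith and Lee, 2001)" to entries in a reference list by first-author surname and year. It is a toy heuristic under stated assumptions, not the method used by the ACL ARC partners: the reference strings, the citation pattern, and the function name `match_citations` are all invented for illustration.

```python
import re

# Hypothetical reference strings, invented for illustration only.
references = [
    "Smith, J. and Lee, K. 2001. A made-up paper on parsing. In Proc. of an example venue.",
    "Garcia, M. 2005. Another invented title. Example Journal 1(1).",
]

# Assumed citation style: "(Surname, 2001)", "(Surname and Other, 2001)",
# or "(Surname et al., 2001)". Real papers use many more variants.
CITATION = re.compile(
    r"\(([A-Z][A-Za-z-]+)(?: et al\.| and [A-Z][A-Za-z-]+)?,? (\d{4})\)"
)


def match_citations(text, refs):
    """Map each (surname, year) citation in text to indices of candidate references."""
    matches = {}
    for surname, year in CITATION.findall(text):
        matches[(surname, year)] = [
            i
            for i, ref in enumerate(refs)
            # A candidate must start with the first author's surname
            # and contain the cited year as a whole token.
            if ref.startswith(surname + ",") and re.search(r"\b" + year + r"\b", ref)
        ]
    return matches


body = (
    "Parsing has been studied before (Smith and Lee, 2001), "
    "as has tagging (Garcia, 2005) and more (Nobody, 1999)."
)
result = match_citations(body, references)
print(result)
```

A citation with no candidate, such as the made-up "(Nobody, 1999)", maps to an empty list, which is the case a manually cleaned resource would need to resolve.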

Slide 13: Tools in development by partners
- Automatic Reference Segmentation
  - ParsCit: Open-source reference string parser; also LREC 08
- Automatic Survey Article Generation
  - iOpener: summarization of articles at different expertise levels
- Automatic Reference-Article matching
  - Record Linkage: using web data to match articles
- Citation Function Classification
  - What's the purpose of a citation?
- Next big application
  - Your work here: please join us; this should be a community-wide effort

Slide 14: Thank you!
- Web: “acl arc” home page
- dAnth Digital Anthologies mailing list