A centre of expertise in data curation and preservation Subtitle here, if required Funded by: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike.

Presentation transcript:

a centre of expertise in data curation and preservation
Automated Genre Classification for Ingest and Appraisal Metadata
Yunhyong Kim and Seamus Ross
Digital Curation Centre & Humanities Advanced Technology and Information Institute, University of Glasgow, Glasgow, UK
{y.kim,
20th International CODATA Conference: October 2006, Beijing, China
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, visit nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

End objectives:
● To enable the automatic extraction of descriptive information for digital objects.
● To enable the automatic identification, selection and management of digital material.
● To create a network of relationships and contexts for information produced independently.

In this presentation we discuss automatic genre classification, that is, the automatic recognition of document types such as scientific papers, tables, theses etc.

Understanding data: why genre classification?
[Diagram labels: Scientific Data; Context; Scientific papers; Technical reports; Lab notes; Emails; Documentation; Reviews; Other data]

What is genre?
Sample research in genre:
● Biber, D.: Dimensions of Register Variation: a Cross-Linguistic Comparison. Cambridge University Press (1995).
● Karlgren, J. and Cutting, D.: Recognizing Text Genres with Simple Metric using Discriminant Analysis. Proc. 15th conf. Comp. Ling. Vol 2 (1994).
● Kessler, B., Nunberg, G., Schuetze, H.: Automatic Detection of Text Genre. Proc. 35th Ann. Meeting ACL (1997) 32-38.
● Rauber, A. and Müller-Kögler, A.: Integrating Automatic Genre Analysis into Digital Libraries. In: Fox, E.A., and Borgman, C.L. (eds.), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries 2001 (JCDL01), June, Roanoke, VA, pp. 1-10, ACM.
● Bagdanov, A. D., Worring, M.: Fine-Grained Document Genre Classification Using First Order Random Graphs. Proceedings of International Conference on Document Analysis and Recognition 2001 (2001) 79.
● Boese, E. S.: Stereotyping the web: genre classification of web documents. Master's thesis, Colorado State University (2005).
● Finn, A. and Kushmerick, N.: Learning to Classify Documents According to Genre. Journal of American Society for Information Science and Technology, 57 (11), 2006.

Genre
● Structure: existence of chapters, sections, links, references or images; layout and length of these objects and of the document etc.
● Function: whether it is intended to describe, to inform, to narrate, to be persuasive, to communicate, to instruct etc.
● Environment: creator, publishers, scientific community, management process etc.

Documents as dynamic entities
[Diagram labels: PDF document; Organism; Structure; ?; DNA; Document function; Survival function; Environment; Selection]

Properties that characterise documents
● Image (white space analysis)
● Style (length, average length of words, word frequency analysis, number of font changes, difference between largest and smallest font size)
● Language model (N-gram model)
● Semantics (proportion of objective nouns, argumentation structure etc.)
● Context (who created it, for whom, and where it is from)
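As an illustration of the "Style" property, the following minimal Python sketch computes three of the listed features (document length, average word length and word frequency) from plain text. The function name `style_features`, the tokenisation rule and the sample sentence are assumptions made for this example, not part of the presentation; the font-based features are omitted because they require layout information from the PDF itself.

```python
import re
from collections import Counter


def style_features(text: str) -> dict:
    """Compute a few surface style features of a plain-text document.

    Hypothetical helper: the tokenisation (runs of letters and apostrophes)
    is an assumption chosen for this sketch.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words)
    return {
        # Document length, measured in words.
        "length_in_words": n_words,
        # Average number of characters per word (0.0 for an empty document).
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # The three most frequent words with their counts.
        "top_words": counts.most_common(3),
    }


# Made-up sample text, used only to exercise the function.
features = style_features("The cat sat on the mat. The end.")
```

Features of this kind could be fed to any standard classifier; which classifier the authors used for the style features is not stated on the slide.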

Experiments
● Clustering documents
● Binary predictions in a pool of nineteen genres
➢ Retrieving Periodicals
➢ Retrieving Thesis
➢ Retrieving Scientific Articles
➢ Retrieving Business Reports
➢ Retrieving Forms
● Classification of five genres
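The binary prediction experiments can be pictured as relabelling one mixed pool several times, once per target genre. The Python sketch below shows only that relabelling step, under stated assumptions: the helper `binary_labels` and the five-document pool (including the "Email" label) are hypothetical, and a real pool would span all nineteen genres.

```python
# Target genres named on the slide.
TARGET_GENRES = [
    "Periodicals",
    "Thesis",
    "Scientific Article",
    "Business Report",
    "Forms",
]


def binary_labels(genres, target):
    """Relabel a multi-genre pool as 1 (target genre) or 0 (any other genre)."""
    return [1 if genre == target else 0 for genre in genres]


# Hypothetical pool of genre labels, invented for this example.
pool = ["Thesis", "Forms", "Periodicals", "Thesis", "Email"]

# One binary task per target genre, all drawn from the same pool.
tasks = {target: binary_labels(pool, target) for target in TARGET_GENRES}
```

Each entry of `tasks` is then a separate two-class problem, so a different feature type can be chosen for each target genre.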

Results I (Cluster)

Results II (Periodicals)
● image classifier (acc. 88.6%)
● style (acc %)
● language (acc %)

Results III (Scientific Article, Thesis)
[Figure panels: Scientific Article; Thesis]

Results IV (Business Report, Forms)
[Figure panels: Business Report; Forms]
Classification: five genres (language model)
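To make the idea of a language-model classifier concrete, here is a minimal Python sketch of one common approach: train a word-bigram model per genre and assign a document to the genre whose model gives it the highest log-probability. This is an assumption-laden illustration, not the authors' classifier: the value of N, the add-one smoothing, the class `BigramModel`, the function `classify` and the tiny training texts are all invented for the example.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into lower-cased word bigrams with start and end markers."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramModel:
    """Word-bigram counts for one genre (hypothetical helper class)."""

    def __init__(self, documents):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        for doc in documents:
            for first, second in bigrams(doc):
                self.bigram_counts[(first, second)] += 1
                self.context_counts[first] += 1
                self.vocab.update((first, second))

    def log_prob(self, text, vocab_size):
        """Log-probability of text under the model, with add-one smoothing."""
        total = 0.0
        for first, second in bigrams(text):
            numerator = self.bigram_counts[(first, second)] + 1
            denominator = self.context_counts[first] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Return the genre whose model scores the text highest."""
    # Use one shared vocabulary size so the scores are comparable.
    vocab_size = len(set().union(*(m.vocab for m in models.values())))
    return max(models, key=lambda genre: models[genre].log_prob(text, vocab_size))


# Invented training snippets, far too small for real use.
models = {
    "Thesis": BigramModel([
        "this thesis is submitted for the degree of doctor",
        "a thesis submitted in partial fulfilment",
    ]),
    "Forms": BigramModel([
        "please fill in your name and address",
        "name address date signature",
    ]),
}

predicted = classify("thesis submitted for the degree", models)
```

With these toy models, `predicted` is "Thesis", because the input shares several bigrams with the thesis training texts and none with the forms texts.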

Conclusions
● Different genres have different feature strengths.
● Retrieving selected genres using the feature types that are strong for them may perform better than a global analysis of all features to classify a large number of genres.
● Binary decisions divide the document space into groups that are less likely and more likely to contain a given genre type.

Future Work
● Improvement of the classifiers
➢ Extended image classifier
➢ Extended language model classifier
➢ Augmented stylistic classifier
● More classifiers
➢ Semantic classifier
➢ Contextual classifier
● Human labelling experiments
➢ Document retrieval exercise
➢ Re-labelling exercise

Errors for Periodicals, Thesis, Scientific Article: Confusion Matrix
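For readers unfamiliar with the term, a confusion matrix counts how often each actual genre was predicted as each genre, so the off-diagonal cells show the errors. The Python sketch below builds one for the three genres named on this slide; the function `confusion_matrix` is a hypothetical helper and the label lists are made up for illustration, so the numbers are not results from the presentation.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual genres, columns are predicted genres."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


labels = ["Periodicals", "Thesis", "Scientific Article"]

# Made-up labels for six documents, used only to exercise the function.
actual = ["Periodicals", "Periodicals", "Thesis", "Thesis",
          "Scientific Article", "Scientific Article"]
predicted = ["Periodicals", "Thesis", "Thesis", "Thesis",
             "Scientific Article", "Thesis"]

matrix = confusion_matrix(actual, predicted, labels)

# Accuracy is the diagonal (correct predictions) over all documents.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

In this toy data, one periodical and one scientific article are mistaken for a thesis, which appears as the non-zero entries in the "Thesis" column outside the diagonal.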