A centre of expertise in digital information management www.ukoln.ac.uk UKOLN is supported by: Approaches to automated metadata extraction : FixRep Project.

Presentation transcript:

Approaches to automated metadata extraction: FixRep Project
Emma Tonkin

Wouldn't it be nice if...
...computers could author our metadata for us, thus saving a lot of hassle?
Mechanical metadata extraction vs manual metadata input

But...
- Automated tools are fallible
- There's never quite enough information available
- Templates change, different domains have different standards
- In short, computers are often wrong
  - and so are people

- Hybrid approach:
  - Get what metadata you can
  - Ask the user to check and clean it if necessary
- Philosophy:
  - If the computer gets it wrong, we can fix it later
- The 'half a loaf' hypothesis

Wouldn’t it be nice if…
…computers could fix our metadata for us?
Or, more realistically, help us do this work for ourselves.

- All about ‘fixing it later’, doing what we can with what we have
- Automated metadata extraction + metadata consistency assessment
- Metadata generation, evaluation, characterisation: enabling metadata triage

1) Challenges in automated metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

Whatever can go wrong...
- PDFs can be:
  - Encrypted
  - Corrupted
  - Oddly encoded
  - An image file without embedded text
- Occurrence: ~3-6%
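The PDF failure modes listed above can be screened for before extraction is attempted. The following is a minimal sketch using only byte-level heuristics; the function name `triage_pdf` and the category labels are hypothetical, and the checks are assumptions for illustration, not the FixRep project's method. In particular, the font check can misfire on PDFs that store their objects in compressed object streams, so a real workflow would use a proper PDF parser.

```python
def triage_pdf(data: bytes) -> str:
    """Roughly classify raw PDF bytes. Heuristics only, not a real parser."""
    if not data.startswith(b"%PDF-"):
        return "corrupted"      # missing the standard PDF header
    if b"%%EOF" not in data[-1024:]:
        return "corrupted"      # missing the end-of-file marker near the end
    if b"/Encrypt" in data:
        return "encrypted"      # the trailer references an encryption dictionary
    if b"/Font" not in data:
        return "image-only"     # assumption: no fonts suggests scanned pages without text
    return "ok"
```

Files that come back as anything other than "ok" could be routed straight to manual entry rather than through automated extraction.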

Character sets
- Ligatures, accents and symbols may not always be extractable from PDFs
Image © Daniel Ullrich
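When ligatures and accents do survive extraction, they often arrive as compatibility characters or as a base letter plus a combining mark. A minimal sketch of cleaning such text with Unicode normalisation follows; the function name is hypothetical, and this addresses only characters that were extracted at all, not those lost by the PDF tool.

```python
import unicodedata


def normalise_extracted_text(text: str) -> str:
    # NFKC replaces compatibility characters such as the "fi" ligature
    # (U+FB01) with plain letters, and composes a base letter followed by
    # a combining accent into a single precomposed character.
    return unicodedata.normalize("NFKC", text)


print(normalise_extracted_text("\ufb01rst"))  # prints "first"
```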

Document formats/layouts
- Many possible formats
- Some formats not widely supported
- Document layouts vary widely, esp. by discipline

2) Manual metadata generation

Whatever can go wrong... (II)
- Function following form - interface
- Model adapted to suit unique user needs
- Data model incompletely supported
- Input validation issues
- Systematic error; typos; localisation; encoding; etc.
- Lots of past work in characterising manual input errors

3) Metadata extraction in brief

Image segmentation, templating & OCR

Working from text
- There are a number of possible states (e.g. title, author, affiliation, abstract)
- Directed graph with probabilities (a Markov chain), for example: Title -> Author -> Affil.
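The directed graph with probabilities described above can be written down as a transition table. This is a minimal sketch: the state names follow the slide, but every probability value is a made-up illustration, not a figure from the talk.

```python
# Hypothetical transition probabilities between document-header states.
TRANSITIONS = {
    "start": {"title": 1.0},
    "title": {"title": 0.7, "author": 0.3},
    "author": {"author": 0.6, "affiliation": 0.4},
    "affiliation": {"affiliation": 0.5, "abstract": 0.5},
    "abstract": {"abstract": 1.0},
}


def sequence_probability(states):
    """Probability of a state sequence under the Markov chain above."""
    prob = 1.0
    prev = "start"
    for state in states:
        # A transition missing from the table is treated as impossible.
        prob *= TRANSITIONS[prev].get(state, 0.0)
        prev = state
    return prob
```

Under these assumed values, the sequence title, author, affiliation has probability 1.0 x 0.3 x 0.4 = 0.12, while any sequence that starts with an author has probability 0.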

Hidden Markov Model
- We cannot directly see these states, only the words
- But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
- This may be expressed in terms of an HMM
- Bayesian statistics used across term appearance
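One standard way to guess the hidden states from the observed words is Viterbi decoding. The slides do not say which decoding method the project used, so the sketch below is an assumption. The toy model (`STATES`, `START_P`, `TRANS_P`, `EMIT_P`) and the `unseen` smoothing value are hypothetical illustrations.

```python
def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state sequence for the observed words."""
    # best[i][s] = (probability of the best path ending in s at word i, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], unseen), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0), p) for p in states
            )
            column[s] = (prob * emit_p[s].get(word, unseen), prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model with made-up probabilities.
STATES = ["title", "author"]
START_P = {"title": 0.9, "author": 0.1}
TRANS_P = {"title": {"title": 0.7, "author": 0.3}, "author": {"author": 1.0}}
EMIT_P = {
    "title": {"discovery": 0.4, "of": 0.3, "rules": 0.3},
    "author": {"peter": 0.5, "flach": 0.5},
}
```

In a real system the emission statistics would be gathered from already-labelled papers, which is what lets the model improve as the collection grows.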

Example parse
Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE...
Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
- Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection

4) Practical use as part of a repository deposit workflow

Aims
- Adaptation of existing interfaces
- Enhancing rather than rewriting
- Cross-platform, accessible interface
- Simple reusable REST API, metadata as DC/XML
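To make "metadata as DC/XML" concrete, here is a minimal sketch of serialising extracted fields as Dublin Core elements. The `metadata` wrapper element, the function name and the record layout are assumptions for illustration; the slides do not specify the actual response format of the FixRep API.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_dc_xml(record: dict) -> str:
    """Serialise {field: [values]} as simple Dublin Core XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, values in record.items():
        for value in values:
            # One repeated dc:<field> element per value.
            ET.SubElement(root, f"{{{DC_NS}}}{field}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_dc_xml({
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["Peter A. Flach", "Nicolas Lachiche"],
})
```

A deposit form could request this XML for an uploaded file and pre-fill its fields, leaving the user to check and correct them.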

Sample interfaces

Sample interfaces

Architecture

Using what we know...

5) A user study comparing manual and hybrid input

Question: “Do people accept ‘hybrid’ interfaces?”
Here’s one we did earlier…

Hypotheses
- Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
- The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
- User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
Measured: speed and accuracy of entering information manually versus hybrid entry and, qualitatively, user satisfaction

Results: Timing
- Hybrid faster under both conditions (summary of median times)

Results: Accuracy
- Tested against ground truth
- Keyword accuracy: the first keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top 5 cover 81% of all desired keywords.
- Manual metadata accuracy:
  - Few users use cut and paste
  - Capitalisation and punctuation frequently differ
  - Synonyms are accidentally substituted
- Hybrid closer to ground truth, and more complete, but results not clear-cut.
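Because manual entries often differ from the ground truth only in capitalisation and punctuation, a comparison usually needs a normalisation step first. The sketch below shows one possible scoring scheme, including a top-k keyword coverage measure; the slides do not describe how the study actually scored accuracy, so all function names and the normalisation rules are assumptions.

```python
import string


def normalise(value: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(value.lower().translate(table).split())


def field_matches(entered: str, truth: str) -> bool:
    """True if the values agree once case and punctuation are ignored."""
    return normalise(entered) == normalise(truth)


def top_k_coverage(suggested, desired, k):
    """Fraction of desired keywords found among the first k suggestions."""
    top = {normalise(s) for s in suggested[:k]}
    wanted = {normalise(d) for d in desired}
    return len(top & wanted) / len(wanted) if wanted else 0.0
```

Note that this treatment would not catch substituted synonyms, which need either manual review or a thesaurus.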

Qualitative results
- Most users preferred the hybrid mode
- Most perceived it to be faster than manual data entry
- Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches
- Both were of good quality

Discussion
- Results support hypotheses
- People prefer the hybrid interface, and found it more satisfying to use
- Accessibility issues exist, but can be overcome
- The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!

6) Towards metadata triage

MetRe prototype (2008)
- Characteristic classes of individual/systematic error highlighted
- N.B. local and general best practice.
- Uses: ranking, browsing, correcting systematic error
- Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
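Ranking occurrences and co-occurrences over harvested records can be sketched with simple counting. This is an illustration only: the record layout and function names are hypothetical, and the MetRe prototype's actual pattern detection is not described in the slides.

```python
from collections import Counter
from itertools import combinations


def rank_values(records, field):
    """Count how often each value of `field` occurs, most common first."""
    counts = Counter(v for r in records for v in r.get(field, []))
    return counts.most_common()


def rank_cooccurrences(records, field):
    """Count pairs of values that appear together in the same record."""
    pairs = Counter()
    for r in records:
        values = sorted(set(r.get(field, [])))
        pairs.update(combinations(values, 2))
    return pairs.most_common()
```

Rare values that closely resemble a common one (for instance, two spellings of the same author name) are candidates for the systematic errors the slide mentions.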


Issues
- Discipline/domain-specific issues
- Lots of information required to do this right (see metadata schema/terminology registry)
- Some APs present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)

Approach
- Generally dependent on heuristics over available data
- Powered by very specific functions (classifiers, validation, etc.)
- Potentially expensive, not always domain-independent

Future work
- More!
  - Data
  - Filters (input/output formats)
  - Methods
  - Evaluation
  - Service availability (mail me for announcements!)

Conclusion
- Metadata creation can be supported through software
- Specific problem sets in metadata triage
- Work continues in the FixRep project

Conclusion (II)
- Formal metadata extraction/evaluation
- Metadata review process
- Accessibility metadata
- Entity extraction (named entities, geographical, temporal [k-int!])
- Repository integration

Thanks! Comments/Questions?