Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics.

Presentation transcript:

Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics through public/industry/government partnerships.

Goals: Sponsor interdisciplinary projects that explore the integration of archival research data, user-contributed data, and technology to generate new forms of analysis and historical research engagement.

Digital Curation: “personalized access to information, as well as federation, preservation, data lifecycle stewardship, and analysis of large heterogeneous collections, information, and records.” (2015 NITRD Program Supplement to the President’s Budget)

Designing Scalable Cyberinfrastructure Services for Metadata Extraction in Billion-Record Archives

NSF: “Brown Dog”, a $10.5M NSF/DIBBs award ( ), the “super mutt” of software:
– UIUC/NCSA + UMD/DCIC
– GOALS: Design & test preservation services in the cloud: DAP & DTS

Creating a Big Data Observatory to:
– Provide access to big data training sets
– Accelerate the development of digital preservation services

Indigo

Peta-scale archival storage and analytics facility, powered by NetApp storage and Dell computing.

Open-source Indigo (NoSQL Apache Cassandra): used for long-term archival storage and preservation.

Data Cave

Apache Cassandra (originally developed at Facebook); users include Adobe, Best Buy, Cisco, Dell, Disney, eBay, FedEx, Netflix, Target, T-Mobile, Travelocity…
● Scalable to hundreds of petabytes and nodes
● No external file system (close control of data)
● P2P: no single point of failure; access is available from ANY node, and data will be delivered from the nearest node
● Data is automatically replicated & compressed
● Ability to store arbitrary data objects and associated ancillary data (metadata)
● Allows an organization to deposit data objects in a directory tree
● Deposition/update can trigger arbitrary actions through trigger/rules mechanisms
● Self-managing repositories

DCIC Big Record collection
● 100 million files
● 72 terabytes of data
● Content from over 150 federal agencies
● 1000s of file formats
● Diverse records: text, satellite images, spreadsheets, environmental data, photos, databases, etc.

Computational Finding Aids: Approaching Billion-Record Digital Archives
Gregory Jansen, Richard Marciano

Workflow for a Digital Object
[Diagram labels: PDF; Repository; Services. Metadata fields: File Name, Directory, File Size]

Text Format Conversion (PDF to TXT)
[Diagram labels: PDF; Repository; Services; TXT]

Now we have a full text index.
[Diagram labels: PDF; Repository; Services; TXT. Metadata fields: Full Text, File Name, Directory, File Size]
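To make the idea of a full text index concrete, here is a minimal sketch of an inverted index over converted text. It is a toy illustration only, not how Elasticsearch or Lucene are implemented; the function names `build_index` and `search` and the sample file names are assumptions made for this example.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical converted text, keyed by file name.
texts = {
    "a.pdf": "Annual soil survey",
    "b.pdf": "Soil budget memo",
}
index = build_index(texts)
hits = search(index, "soil survey")
```

A production search engine adds tokenization rules, relevance ranking and on-disk storage on top of this basic word-to-documents mapping.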

Optical Character Recognition (OCR) Extractor
[Diagram labels: PDF; Repository; Services; PNG; OCR. Metadata fields: OCR Text, File Name, Directory, File Size]

Format Recognition (Siegfried PRONOM Extractor)
[Diagram labels: PDF; Repository; Services; PNG; OCR; PUID; PDF. Metadata fields: Format, OCR Text, File Name, Directory, File Size]

Facial Recognition (Computer Vision Extractors)
[Diagram labels: PDF; Repository; Services; PNG; OCR; PUID; PDF; 6 Faces. Metadata fields: # Faces, Format, OCR Text, File Name, Directory, File Size]

Facial Recognition (Computer Vision Extractor)
[Diagram labels: PDF; Repository; Services; PNG; OCR; PUID; PDF; 6 Faces; 12 Eyes; 3 in Profile; 1 Close Up. Metadata fields: # Faces, # Eyes, # Close Ups, # Profiles, Format, OCR Text, File Name, Directory, File Size]

PDF Object Enhanced with Extracted Metadata
[Diagram labels: PDF; PUID; PDF; PNG; OCR; 6 Faces; 12 Eyes; 3 in Profile; 1 Close Up]
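The slides above show each extractor adding fields to the same object record. A minimal sketch of that accumulation pattern follows. It is not the Brown Dog API: the extractor functions, the `enrich` helper and the field names are hypothetical, and the format check is a toy test on the PDF magic bytes rather than a PRONOM signature match as a tool like Siegfried would perform.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Extractor = Callable[[bytes], Record]


def size_extractor(data: bytes) -> Record:
    """Report the object size in bytes."""
    return {"file_size": len(data)}


def format_extractor(data: bytes) -> Record:
    """Toy format check based only on the leading PDF magic bytes."""
    return {"format": "PDF" if data.startswith(b"%PDF-") else "unknown"}


def enrich(record: Record, data: bytes, extractors: List[Extractor]) -> Record:
    """Return a copy of the record with each extractor's fields merged in."""
    enriched = dict(record)
    for extractor in extractors:
        enriched.update(extractor(data))
    return enriched


# Hypothetical object: a starting record plus the raw bytes of the file.
base = {"file_name": "report.pdf", "directory": "/series-01"}
payload = b"%PDF-1.4 example bytes"
enriched = enrich(base, payload, [size_extractor, format_extractor])
```

Because `enrich` copies the record before merging, further extractors (OCR, computer vision) could be appended to the list without changing the earlier ones.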

DON’T PANIC

Elasticsearch + Kibana

Kibana:
● Free plugin for Elasticsearch
● Gives shape to an Elasticsearch index
● Write queries visually and interactively

Elasticsearch:
● Open-source scalable search engine based on Lucene

Lots of ways to explore the data
File Formats Concentric Pie Chart
Inner: MIME type
Outer: PRONOM PUID
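The two rings of such a chart are nested counts: files per MIME type, then files per PUID within each MIME type. The sketch below computes those counts locally with the standard library. The record layout, the `format_rings` name and the PUID strings are illustrative assumptions, not values taken from the collection.

```python
from collections import Counter, defaultdict


def format_rings(records):
    """Count records per MIME type (inner ring) and per PUID within it (outer ring)."""
    inner = Counter()
    outer = defaultdict(Counter)
    for rec in records:
        inner[rec["mimetype"]] += 1
        outer[rec["mimetype"]][rec["puid"]] += 1
    return inner, outer


# Illustrative records; the PUID strings are placeholders for this example.
records = [
    {"mimetype": "application/pdf", "puid": "fmt/18"},
    {"mimetype": "application/pdf", "puid": "fmt/19"},
    {"mimetype": "application/pdf", "puid": "fmt/18"},
    {"mimetype": "image/png", "puid": "fmt/11"},
]
inner, outer = format_rings(records)
```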

Charts can be added to dynamic dashboards

Arrangement can be used as a Facet
As you browse the hierarchy, the entire dashboard is redrawn to reflect the particular record group, series or folder under study. “Drill down” or zoom in and out of your collections.
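One way to express this kind of drill-down is to restrict every chart to documents whose path starts with the folder under study. The sketch below builds an Elasticsearch request body that combines a prefix filter with a terms aggregation. The field names `path` and `puid`, the example folder path and the `dashboard_request` helper are assumptions; the slides do not say how the dashboard is actually wired.

```python
import json


def dashboard_request(folder_prefix, top_n=10):
    """Build a request body that limits a format chart to one branch of the hierarchy."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {
            "bool": {
                # Assumes `path` is indexed as a keyword field.
                "filter": [{"prefix": {"path": folder_prefix}}]
            }
        },
        "aggs": {
            # Assumes `puid` is indexed as a keyword field.
            "formats": {"terms": {"field": "puid", "size": top_n}}
        },
    }


# Hypothetical record group / series path.
body = dashboard_request("/record-group-1/series-01/")
body_json = json.dumps(body)
```

Changing only the prefix re-scopes every aggregation in the request, which mirrors the way the dashboard is redrawn as you move through the arrangement.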

Make comparisons between neighbors
Significant Terms are based on full text. They are significant within the overall scope of the query. Significant Terms can be used to distinguish neighboring folders or documents.
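The intuition behind significant terms can be sketched as follows: a term is significant for a folder when it appears in a larger share of that folder's documents than of the collection as a whole. The scoring below (difference in rates multiplied by ratio of rates) is a simplification chosen for illustration and should not be read as Elasticsearch's exact implementation. The function name and sample documents are hypothetical.

```python
from collections import Counter


def significant_terms(foreground_docs, background_docs, top_n=3):
    """Rank terms that are more frequent in the foreground than in the background.

    background_docs is expected to include the foreground documents.
    """
    fg, bg = Counter(), Counter()
    for doc in foreground_docs:
        fg.update(set(doc.lower().split()))
    for doc in background_docs:
        bg.update(set(doc.lower().split()))

    scores = {}
    for term, fg_count in fg.items():
        if bg[term] == 0:
            continue  # background does not cover this term; skip it
        fg_rate = fg_count / len(foreground_docs)
        bg_rate = bg[term] / len(background_docs)
        if fg_rate > bg_rate:
            scores[term] = (fg_rate - bg_rate) * (fg_rate / bg_rate)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Two hypothetical neighboring folders.
folder_a = ["soil survey report", "soil sample analysis"]
folder_b = ["budget report", "budget memo analysis"]
everything = folder_a + folder_b

terms_a = significant_terms(folder_a, everything)
terms_b = significant_terms(folder_b, everything)
```

Words shared evenly by both folders, such as "report", drop out, which is what makes the remaining terms useful for telling neighbors apart.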

Summary: Indigo, Brown Dog & Elasticsearch
● Full text searching (from file conversion or OCR)
● Charts for any extracted data point:
– Image metrics: pixel count, pixel depth
– Computer vision: recognized shapes (humans!), image skewness, etc.
– Significant Terms
– File Formats
– File Sizes
● Compare neighboring folders (or series) against each other:
– Significant Terms
– Top formats
● Use a dashboard to zoom in and out of the arrangement

Richard Marciano
Professor, University of Maryland iSchool
Affiliate Professor, Computer Science
Director, Digital Curation Innovation Center (DCIC), University of Maryland

Bill Underwood
Research Faculty, University of Maryland iSchool
Digital Curation Innovation Center (DCIC), University of Maryland