Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics through public/industry/government partnerships.
Goals: Sponsor interdisciplinary projects that explore the integration of archival research data, user-contributed data, and technology to generate new forms of analysis and historical research engagement.
Digital Curation: “personalized access to information, as well as federation, preservation, data lifecycle stewardship, and analysis of large heterogeneous collections, information, and records.” (2015 NITRD Program Supplement to the President’s Budget)
Designing Scalable Cyberinfrastructure Services for Metadata Extraction in Billion-Record Archives
NSF: “Brown Dog”, a $10.5M NSF/DIBBs award, the “super mutt” of software:
– UIUC/NCSA + UMD/DCIC
– GOALS: Design & test preservation services in the cloud: DAP & DTS
Creating a Big Data Observatory to:
– Provide access to big data training sets
– Accelerate the development of digital preservation services
Indigo
Peta-scale archival storage and analytics facility, powered by NetApp storage and Dell computing
Open-source Indigo (NoSQL Apache Cassandra): used for long-term archival storage and preservation data
Users of Apache Cassandra (originally developed at Facebook): Adobe, Best Buy, Cisco, Dell, Disney, eBay, FedEx, Netflix, Target, T-Mobile, Travelocity…
● Scalable to hundreds of petabytes and nodes
● No external file system (close control of data)
● P2P: no single point of failure; access is available from ANY node and will be delivered from the nearest node
● Data is automatically replicated & compressed
● Ability to store arbitrary data objects and associated ancillary data (metadata)
● Allows an organization to deposit data objects in a directory tree
● Deposition/update can trigger arbitrary actions through trigger/rules mechanisms
● Self-managing repositories
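The idea that a deposit or update can trigger arbitrary actions can be sketched in plain Python. This is not Indigo's API: `Repository`, `on_deposit`, and `deposit` are hypothetical names, and the in-memory dictionary merely stands in for the Cassandra-backed store.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple

Action = Callable[[str, bytes], None]


class Repository:
    """Toy in-memory repository (hypothetical): path -> bytes plus deposit rules."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.rules: List[Tuple[str, Action]] = []

    def on_deposit(self, pattern: str, action: Action) -> None:
        """Register an action to run whenever a deposited path matches pattern."""
        self.rules.append((pattern, action))

    def deposit(self, path: str, data: bytes) -> None:
        """Store (or update) the object, then fire every matching rule."""
        self.objects[path] = data
        for pattern, action in self.rules:
            if fnmatchcase(path, pattern):
                action(path, data)


log: List[str] = []
repo = Repository()
# Rule: any deposited PDF is queued for text extraction.
repo.on_deposit("*.pdf", lambda path, data: log.append(f"extract text: {path}"))
repo.deposit("/agency/series1/report.pdf", b"%PDF-1.4")
repo.deposit("/agency/series1/photo.png", b"\x89PNG")
```

Only the PDF deposit matches the rule, so `log` ends up with a single entry while both objects are stored.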
DCIC Big Record collection
● 100 million files
● 72 terabytes of data
● Content from over 150 federal agencies
● 1000s of file formats
● Diverse records: text, satellite images, spreadsheets, environmental data, photos, databases, etc.
Computational Finding Aids Approaching Billion-Record Digital Archives
Gregory Jansen, Richard Marciano
Workflow for a Digital Object [diagram: PDF, Repository Services, File Name, Directory, File Size]
Text Format Conversion (PDF to TXT) [diagram: PDF, Repository Services, TXT]
Now we have a full text index. [diagram: PDF, Repository Services, Full Text, File Name, Directory, File Size, TXT]
Optical Character Recognition (OCR) Extractor [diagram: PDF, Repository Services, OCR Text, File Name, Directory, File Size, PNG, OCR]
Format Recognition (Siegfried PRONOM Extractor) [diagram: PDF, Repository Services, PUID PDF, Format, OCR Text, File Name, Directory, File Size, PNG, OCR]
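As a rough illustration of the format-recognition step, the sketch below pulls PRONOM PUIDs out of a Siegfried JSON report. It assumes the report shape produced by `sf -json` (a `files` list whose entries carry `matches` with `ns` and `id` keys). `SAMPLE` is a hand-written, abbreviated report rather than real output, and `puids_by_file` is a hypothetical helper name.

```python
import json
from typing import Dict, List

# Hand-written, abbreviated report in the assumed `sf -json` shape.
SAMPLE = """
{"files": [{"filename": "report.pdf", "filesize": 1024,
            "matches": [{"ns": "pronom", "id": "fmt/18",
                         "format": "Acrobat PDF 1.4 - Portable Document Format",
                         "mime": "application/pdf"}]}]}
"""


def puids_by_file(sf_json: str) -> Dict[str, List[str]]:
    """Map each file name to the PRONOM PUIDs reported for it."""
    report = json.loads(sf_json)
    result: Dict[str, List[str]] = {}
    for entry in report.get("files", []):
        result[entry["filename"]] = [
            match["id"]
            for match in entry.get("matches", [])
            if match.get("ns") == "pronom"  # keep only the PRONOM identifier
        ]
    return result


puids = puids_by_file(SAMPLE)
```

The resulting mapping is the kind of value that would be stored in the PUID field of the repository record.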
Facial Recognition (Computer Vision Extractors) [diagram: PDF, Repository Services, PUID PDF, # Faces, Format, OCR Text, File Name, Directory, File Size, PNG, OCR, 6 Faces]
Facial Recognition (Computer Vision Extractor) [diagram: PDF, Repository Services, PUID PDF, # Faces, # Eyes, # Close Ups, # Profiles, Format, OCR Text, File Name, Directory, File Size, PNG, OCR, 6 Faces, 12 Eyes, 3 in Profile, 1 Close Up]
PDF Object Enhanced with Extracted Metadata [diagram: PDF, PUID PDF, PNG, OCR, 6 Faces, 12 Eyes, 3 in Profile, 1 Close Up]
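The workflow on the preceding slides, in which each extractor adds fields to the same object record, might be sketched as follows. This is a minimal illustration, not the Brown Dog or Indigo implementation: `basic_properties`, `sniff_pdf`, and `enhance` are hypothetical names, and `sniff_pdf` is a deliberately crude stand-in for a real format extractor.

```python
import os
import tempfile
from typing import Callable, Dict, List

Metadata = Dict[str, object]
Extractor = Callable[[str], Metadata]


def basic_properties(path: str) -> Metadata:
    """File name, directory, and size: the fields known before any extraction."""
    return {
        "file_name": os.path.basename(path),
        "directory": os.path.dirname(path),
        "file_size": os.path.getsize(path),
    }


def sniff_pdf(path: str) -> Metadata:
    """Stand-in for real format recognition: checks only the PDF magic bytes."""
    with open(path, "rb") as handle:
        head = handle.read(5)
    return {"format": "PDF" if head == b"%PDF-" else "unknown"}


def enhance(path: str, extractors: List[Extractor]) -> Metadata:
    """Run each extractor in order and merge its fields into one record."""
    record: Metadata = {}
    for extractor in extractors:
        record.update(extractor(path))
    return record


# Demo on a tiny temporary file that only carries a PDF header.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "report.pdf")
    with open(sample, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    record = enhance(sample, [basic_properties, sniff_pdf])
```

Adding OCR or computer-vision output would mean appending further extractor functions to the list; the merged record is what would then be indexed for search.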
DON’T PANIC
Elasticsearch + Kibana
Kibana:
● Free plugin for Elasticsearch
● Gives shape to an Elasticsearch index
● Write queries visually and interactively
Elasticsearch:
● Open-source scalable search engine based on Lucene
Lots of ways to explore the data
File Formats, Concentric Pie Chart: inner ring is Mimetype, outer ring is PRONOM PUID
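A concentric pie chart of this kind corresponds to a terms aggregation with a nested sub-aggregation. The sketch below builds such a request body as a plain dictionary; the field names `mimetype` and `puid` are assumptions about the index mapping, and `format_breakdown_query` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def format_breakdown_query(mime_field: str = "mimetype",
                           puid_field: str = "puid",
                           size: int = 10) -> Dict[str, Any]:
    """Build a search body: inner ring by MIME type, outer ring by PUID."""
    return {
        "size": 0,  # return aggregation buckets only, no document hits
        "aggs": {
            "by_mimetype": {
                "terms": {"field": mime_field, "size": size},
                "aggs": {
                    "by_puid": {"terms": {"field": puid_field, "size": size}}
                },
            }
        },
    }


body = format_breakdown_query()
print(json.dumps(body, indent=2))
```

Each `by_mimetype` bucket in the response would carry its own `by_puid` buckets, which is the inner/outer structure the chart draws.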
Charts can be added to dynamic dashboards
Arrangement can be used as a Facet
As you browse the hierarchy, the entire dashboard is redrawn to reflect the particular record group, series, or folder under study. “Drill down” or zoom in and out of your collections.
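One way to picture arrangement as a facet: every dashboard query is restricted to documents whose path falls under the folder being browsed. The sketch below assumes each indexed document has a keyword field named `path` holding its place in the hierarchy; that field name and the helpers `scope_to_folder` and `zoom_out` are hypothetical.

```python
import copy
import posixpath
from typing import Any, Dict


def scope_to_folder(body: Dict[str, Any], folder: str,
                    path_field: str = "path") -> Dict[str, Any]:
    """Return a copy of body restricted to documents under folder.

    Any query already present in body is replaced by the folder filter.
    """
    scoped = copy.deepcopy(body)
    prefix = folder.rstrip("/") + "/"
    scoped["query"] = {"bool": {"filter": [{"prefix": {path_field: prefix}}]}}
    return scoped


def zoom_out(folder: str) -> str:
    """Move one level up the arrangement, stopping at the root."""
    return posixpath.dirname(folder.rstrip("/")) or "/"


dashboard = {"size": 0, "aggs": {"formats": {"terms": {"field": "puid"}}}}
scoped = scope_to_folder(dashboard, "/rg-1/series-2")
parent = zoom_out("/rg-1/series-2")
```

Re-running every chart's aggregation with the scoped body is what "redrawing the dashboard" amounts to in this sketch; calling `zoom_out` and re-scoping gives the view one level up.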
Make comparisons between neighbors
Significant Terms are based on full text. They are significant within the overall scope of the query. Significant Terms can be used to distinguish neighboring folders or documents.
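A neighbor comparison of this kind could be expressed with Elasticsearch's `significant_terms` aggregation, using one folder as the foreground set and its parent as the background. The sketch below only builds the request body. The field names `fulltext` and `path` are assumptions, `significant_terms_query` is a hypothetical helper, and choosing the parent folder as background is one possible reading of the slide, not necessarily what the presenters used. Depending on the Elasticsearch version and mapping, the `significant_text` aggregation may be the better fit for an analyzed full-text field.

```python
import posixpath
from typing import Any, Dict


def significant_terms_query(folder: str,
                            text_field: str = "fulltext",
                            path_field: str = "path",
                            size: int = 10) -> Dict[str, Any]:
    """Terms that distinguish folder from the other contents of its parent."""
    folder = folder.rstrip("/")
    parent = posixpath.dirname(folder).rstrip("/")
    return {
        "size": 0,
        # Foreground set: documents inside the folder under study.
        "query": {"prefix": {path_field: folder + "/"}},
        "aggs": {
            "distinctive_terms": {
                "significant_terms": {
                    "field": text_field,
                    "size": size,
                    # Background set: everything under the parent folder.
                    "background_filter": {"prefix": {path_field: parent + "/"}},
                }
            }
        },
    }


query = significant_terms_query("/rg-1/series-2/folder-a")
```

Running the same query for `folder-b` and comparing the two term lists would show what sets each neighbor apart.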
Summary: Indigo, Brown Dog & Elasticsearch
● Full text searching (from file conversion or OCR)
● Charts for any extracted data point:
   – Image metrics: pixel count, pixel depth
   – Computer vision: recognized shapes (humans!), image skewness, etc.
   – Significant Terms
   – File Formats
   – File Sizes
● Compare neighboring folders (or series) against each other:
   – Significant Terms
   – Top formats
● Use a dashboard to zoom in and out of the arrangement
Richard Marciano: Professor, University of Maryland iSchool; Affiliate Professor, Computer Science; Director, Digital Curation Innovation Center (DCIC), University of Maryland
Bill Underwood: Research Faculty, University of Maryland iSchool; Digital Curation Innovation Center (DCIC), University of Maryland