Published by Stephanie Hancock. Modified over 8 years ago.
1
Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics through public/industry/government partnerships.
Goals: Sponsor interdisciplinary projects that explore the integration of archival research data, user-contributed data, and technology to generate new forms of analysis and historical research engagement.
Digital Curation: “personalized access to information, as well as federation, preservation, data lifecycle stewardship, and analysis of large heterogeneous collections, information, and records.” (2015 NITRD Program Supplement to the President’s Budget)
3
Designing Scalable Cyberinfrastructure Services for Metadata Extraction in Billion-Record Archives
NSF: “Brown Dog”, a $10.5M NSF/DIBBs award (2013-2018), the “super mutt” of software:
–UIUC/NCSA + UMD/DCIC
–GOALS: Design & test preservation services in the cloud: DAP & DTS
Creating a Big Data Observatory to:
–Provide access to big data training sets
–Accelerate the development of digital preservation services
–http://go.illinois.edu/BrownDog
4
Indigo
Peta-scale archival storage and analytics facility, powered by: NetApp storage, Dell computing
Open-source Indigo (NoSQL Apache Cassandra): used for long-term archival storage and preservation data
Apache Cassandra (originally developed at Facebook), used by: Adobe, Best Buy, Cisco, Dell, Disney, eBay, FedEx, Netflix, Target, T-Mobile, Travelocity…
●Scalable to hundreds of petabytes and nodes
●No external file system (close control of data)
●P2P: no single point of failure; access is available from ANY node and will be delivered from the nearest node
●Data is automatically replicated & compressed
●Ability to store arbitrary data objects and associated ancillary data (metadata)
●Allows an organization to deposit data objects in a directory tree
●Deposition/update can trigger arbitrary actions through trigger/rules mechanisms
●Self-managing repositories
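The trigger/rules idea on this slide can be illustrated with a minimal sketch. This is not Indigo's API: `ToyRepository`, `on_deposit`, and `deposit` are hypothetical names, and the store is a plain in-memory dictionary standing in for the Cassandra-backed repository.

```python
from typing import Callable, Dict, List

# Hypothetical rule type: called with the object's path and its bytes.
Rule = Callable[[str, bytes], None]


class ToyRepository:
    """In-memory stand-in for a repository that fires rules on deposit."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.rules: List[Rule] = []

    def on_deposit(self, rule: Rule) -> None:
        # Register an action to run whenever an object is deposited or updated.
        self.rules.append(rule)

    def deposit(self, path: str, data: bytes) -> None:
        # Store the object under its directory-tree path, then run every rule.
        self.objects[path] = data
        for rule in self.rules:
            rule(path, data)


audit_log = []
repo = ToyRepository()
repo.on_deposit(lambda path, data: audit_log.append((path, len(data))))
repo.deposit("/agency/series/memo.txt", b"hello")
```

In a real deployment the registered action might queue the object for metadata extraction rather than append to a list.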
5
DCIC Big Record collection
●100 million files
●72 terabytes of data
●Content from over 150 federal agencies
●1000s of file formats
●Diverse records: text, satellite images, spreadsheets, environmental data, photos, databases, etc.
6
Computational Finding Aids Approaching Billion-Record Digital Archives
Gregory Jansen, Richard Marciano
7
Workflow for a Digital Object
Diagram labels: PDF; Repository; Services; File Name; Directory; File Size
8
Text Format Conversion (PDF to TXT)
Diagram labels: PDF; Repository; Services; TXT
9
Now we have a full text index.
Diagram labels: PDF; Repository; Services; Full Text; File Name; Directory; File Size; TXT
10
Optical Character Recognition (OCR) Extractor
Diagram labels: PDF; Repository; Services; OCR Text; File Name; Directory; File Size; PNG; OCR
11
Format Recognition (Siegfried PRONOM Extractor)
Diagram labels: PDF; Repository; Services; PUID; PDF 1.4.2; Format; OCR Text; File Name; Directory; File Size; PNG; OCR
12
Facial Recognition (Computer Vision Extractors)
Diagram labels: PDF; Repository; Services; PUID; PDF 1.4.2; # Faces; Format; OCR Text; File Name; Directory; File Size; PNG; OCR; 6 Faces
13
Facial Recognition (Computer Vision Extractor)
Diagram labels: PDF; Repository; Services; PUID; PDF 1.4.2; # Faces; # Eyes; # Close Ups; # Profiles; Format; OCR Text; File Name; Directory; File Size; PNG; OCR; 6 Faces; 12 Eyes; 3 in Profile; 1 Close Up
14
PDF Object Enhanced with Extracted Metadata
Diagram labels: PDF; PUID; PDF 1.4.2; PNG; OCR; 6 Faces; 12 Eyes; 3 in Profile; 1 Close Up
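The workflow in the preceding slides, where each extractor adds fields to the same object, can be sketched roughly as follows. This is a simplified illustration, not Brown Dog's actual interface: the extractor functions, the `enrich` helper, and the field names are all hypothetical, and the format check only looks at the PDF magic bytes rather than calling Siegfried.

```python
from typing import Callable, Dict, List

# Hypothetical extractor type: reads the record so far, returns new fields.
Extractor = Callable[[Dict[str, object]], Dict[str, object]]


def basic_file_info(record: Dict[str, object]) -> Dict[str, object]:
    # Derive file name, directory, and size from the path and raw bytes.
    directory, _, name = str(record["path"]).rpartition("/")
    return {
        "file_name": name,
        "directory": directory,
        "file_size": len(record["content"]),
    }


def format_recognition(record: Dict[str, object]) -> Dict[str, object]:
    # Stand-in for a real format identification tool: magic-byte check only.
    is_pdf = record["content"].startswith(b"%PDF-")
    return {"format": "PDF" if is_pdf else "unknown"}


def enrich(record: Dict[str, object], extractors: List[Extractor]) -> Dict[str, object]:
    # Run each extractor in order and merge its output into a copy of the record.
    enriched = dict(record)
    for extractor in extractors:
        enriched.update(extractor(enriched))
    return enriched


record = {"path": "agency/series/report.pdf", "content": b"%PDF-1.4 example bytes"}
enriched = enrich(record, [basic_file_info, format_recognition])
```

Further extractors (OCR, computer vision) would slot into the same list, each contributing its own fields such as OCR text or a face count.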
19
DON’T PANIC
20
Elasticsearch + Kibana
Kibana:
●Free plugin for Elasticsearch
●Gives shape to an Elasticsearch index
●Write queries visually and interactively
Elasticsearch:
●Open-source scalable search engine based on Lucene
21
Lots of ways to explore the data
File Formats: Concentric Pie Chart
●Inner: Mimetype
●Outer: PRONOM PUID
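A concentric chart like this one corresponds to a nested terms aggregation in the Elasticsearch query DSL: an outer bucket per mimetype, with PUID buckets inside each. The sketch below only builds the request body. The field names `mimetype` and `puid` and the aggregation names are assumptions about how the index might be mapped, not details taken from the presentation.

```python
import json


def format_breakdown_query(inner_field="mimetype", outer_field="puid", size=10):
    """Build a request body: one bucket per mimetype, PUID buckets inside each."""
    return {
        "size": 0,  # return aggregation buckets only, no individual hits
        "aggs": {
            "by_mimetype": {
                "terms": {"field": inner_field, "size": size},
                "aggs": {
                    "by_puid": {"terms": {"field": outer_field, "size": size}}
                },
            }
        },
    }


query = format_breakdown_query()
print(json.dumps(query, indent=2))
```

The inner ring of the chart maps to the `by_mimetype` buckets and the outer ring to the `by_puid` buckets nested under them.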
22
Charts can be added to dynamic dashboards
23
Arrangement can be used as a Facet
As you browse the hierarchy, the entire dashboard is redrawn to reflect the particular record group, series or folder under study. “Drill down” or zoom in and out of your collections.
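One way to picture this faceting is a filter that restricts every chart's aggregation to objects under the folder being browsed. The sketch below wraps an aggregation body in an Elasticsearch `prefix` filter. The `directory` field (assumed to be indexed as an exact-value keyword), the helper name, and the sample path are hypothetical.

```python
def scope_to_folder(aggs, folder, path_field="directory"):
    """Limit an aggregation request to objects whose path starts with `folder`."""
    return {
        "size": 0,  # buckets only, no individual hits
        "query": {"bool": {"filter": [{"prefix": {path_field: folder}}]}},
        "aggs": aggs,
    }


# Hypothetical chart definition: count files per format.
format_chart = {"formats": {"terms": {"field": "format"}}}

# Redraw the same chart for a whole record group, then for one series in it.
record_group_view = scope_to_folder(format_chart, "/rg-001/")
series_view = scope_to_folder(format_chart, "/rg-001/series-a/")
```

Zooming in or out then amounts to re-running the same aggregations with a longer or shorter path prefix.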
24
Make comparisons between neighbors
Significant Terms are based on full text. They are significant within the overall scope of the query. Significant Terms can be used to distinguish neighboring folders or documents.
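The idea behind significant terms can be shown with a small stand-alone sketch: a term scores highly when it appears in a larger share of one folder's documents than of the overall collection. The scoring below (absolute change times relative change in document frequency) is a simplified stand-in and should not be read as Elasticsearch's exact implementation; the sample folders are made up.

```python
from collections import Counter


def doc_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts


def significant_terms(foreground, background, top=3):
    """Rank foreground terms by how over-represented they are.

    `background` must include the foreground documents, so every
    foreground term has a non-zero background frequency.
    """
    fg = doc_frequency(foreground)
    bg = doc_frequency(background)
    scores = {}
    for term, fg_count in fg.items():
        fg_rate = fg_count / len(foreground)
        bg_rate = bg[term] / len(background)
        # Absolute change multiplied by relative change.
        scores[term] = (fg_rate - bg_rate) * (fg_rate / bg_rate)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


folder_a = ["budget report for wetlands survey", "wetlands survey field notes"]
folder_b = ["budget report for highway repair", "highway repair contract"]
all_docs = folder_a + folder_b

top_a = significant_terms(folder_a, all_docs, top=2)
top_b = significant_terms(folder_b, all_docs, top=2)
```

Terms shared evenly by both folders, such as "budget", score zero, which is why the ranking separates neighbors rather than just listing frequent words.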
26
Summary: Indigo, Brown Dog & Elasticsearch
●Full text searching (from file conversion or OCR)
●Charts for any extracted data point:
–Image metrics: pixel count, pixel depth
–Computer vision: recognized shapes (humans!), image skewness, etc.
–Significant Terms
–File Formats
–File Sizes
●Compare neighboring folders (or series) against each other:
–Significant Terms
–Top formats
●Use a dashboard to zoom in and out of the arrangement
27
Richard Marciano
Professor, University of Maryland iSchool
Affiliate Professor, Computer Science
Director, Digital Curation Innovation Center (DCIC), University of Maryland
marciano@umd.edu

Bill Underwood
Research Faculty, University of Maryland iSchool
Digital Curation Innovation Center (DCIC), University of Maryland

http://dcic.umd.edu