Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics.


1 Mission: Be a leader in the digital curation research and education fields, and foster interdisciplinary partnerships using Big Records and Archival Analytics through public/industry/government partnerships. Goals: Sponsor interdisciplinary projects that explore the integration of archival research data, user-contributed data, and technology to generate new forms of analysis and historical research engagement. Digital Curation: “personalized access to information, as well as federation, preservation, data lifecycle stewardship, and analysis of large heterogeneous collections, information, and records.” (2015 NITRD Program Supplement to the President’s Budget)

2

3 Designing Scalable Cyberinfrastructure Services for Metadata Extraction in Billion-Record Archives. NSF “Brown Dog”: a $10.5M NSF DIBBs award (2013-2018), the “super mutt” of software. Partners: UIUC/NCSA + UMD/DCIC. Goals: design and test preservation services in the cloud (DAP & DTS). Creating a Big Data Observatory to: provide access to big data training sets; accelerate the development of digital preservation services. http://go.illinois.edu/BrownDog

4 Indigo: peta-scale archival storage and analytics facility, powered by NetApp storage and Dell computing. Open-source Indigo (NoSQL, Apache Cassandra): used for long-term archival storage and preservation data Cave. Apache Cassandra (originally developed at Facebook): Adobe, Best Buy, Cisco, Dell, Disney, eBay, FedEx, Netflix, Target, T-Mobile, Travelocity… Features: scalable to hundreds of petabytes and nodes; no external file system (close control of data); P2P, so no single point of failure (access is available from ANY node and will be delivered from the nearest node); data is automatically replicated & compressed; ability to store arbitrary data objects and associated ancillary data (metadata); allows an organization to deposit data objects in a directory tree; deposition/update can trigger arbitrary actions through trigger/rules mechanisms; self-managing repositories.

5 DCIC big record collection: 100 million files; 72 terabytes of data; content from over 150 federal agencies; 1000s of file formats; diverse records: text, satellite images, spreadsheets, environmental data, photos, databases, etc.

6 Computational Finding Aids Approaching Billion-Record Digital Archives. Gregory Jansen, Richard Marciano

7 Workflow for a Digital Object: PDF, REPOSITORY, SERVICES. Metadata: File Name, Directory, File Size
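The three starting fields on this slide (file name, directory, file size) can be read straight from the file system. The following is a minimal sketch of that first step, not the project's actual ingest code; the function name and the dictionary keys are assumptions made for illustration.

```python
from pathlib import Path


def basic_metadata(path):
    """Return the three baseline fields shown on the slide for one file.

    The key names are hypothetical; the slide only names the fields.
    """
    p = Path(path)
    return {
        "file_name": p.name,
        "directory": str(p.parent),
        "file_size": p.stat().st_size,  # size in bytes
    }
```

Every later extractor in the workflow adds fields on top of a record like this one.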

8 Text Format Conversion (PDF to TXT): PDF, REPOSITORY, SERVICES, TXT

9 Now we have a full text index: PDF, REPOSITORY, SERVICES, TXT. Metadata: Full Text, File Name, Directory, File Size
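Once each document has a TXT rendition, a full text index maps words back to the documents that contain them. The sketch below shows the idea with a tiny in-memory inverted index; it is an illustration only (the deck uses Elasticsearch for this), and the function names and sample documents are made up.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical extracted texts, keyed by file name.
docs = {
    "a.txt": "Annual budget report",
    "b.txt": "Budget memo for the agency",
}
index = build_index(docs)
hits = search(index, "budget report")
```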

10 Optical Character Recognition (OCR) Extractor: PDF, REPOSITORY, SERVICES, PNG, OCR. Metadata: OCR Text, File Name, Directory, File Size

11 Format Recognition (Siegfried PRONOM Extractor): PDF, REPOSITORY, SERVICES, PNG, OCR, PUID (PDF 1.4.2). Metadata: Format, OCR Text, File Name, Directory, File Size
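Siegfried can emit its identification results as JSON, and the PUID on this slide is the `id` of a match in the `pronom` namespace. The sketch below pulls PUIDs out of such a report. It assumes the report layout (a `files` list whose entries carry `filename` and `matches`), and the sample report is hand-written for illustration rather than captured from a real run.

```python
import json


def pronom_ids(sf_json):
    """Map each file name in a Siegfried-style JSON report to its PUIDs."""
    report = json.loads(sf_json)
    result = {}
    for entry in report.get("files", []):
        result[entry["filename"]] = [
            match["id"]
            for match in entry.get("matches", [])
            if match.get("ns") == "pronom"
        ]
    return result


# Hand-written sample in the assumed report layout (not real tool output).
SAMPLE_REPORT = json.dumps({
    "files": [
        {
            "filename": "report.pdf",
            "matches": [
                {
                    "ns": "pronom",
                    "id": "fmt/18",
                    "format": "Acrobat PDF 1.4 - Portable Document Format",
                    "mime": "application/pdf",
                }
            ],
        }
    ]
})

puids = pronom_ids(SAMPLE_REPORT)
```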

12 Facial Recognition (Computer Vision Extractors): PDF, REPOSITORY, SERVICES, PNG, OCR, PUID (PDF 1.4.2), 6 FACES. Metadata: # Faces, Format, OCR Text, File Name, Directory, File Size

13 Facial Recognition (Computer Vision Extractor): PDF, REPOSITORY, SERVICES, PNG, OCR, PUID (PDF 1.4.2), 6 Faces, 12 Eyes, 3 in Profile, 1 Close Up. Metadata: # Faces, # Eyes, # Close Ups, # Profiles, Format, OCR Text, File Name, Directory, File Size

14 PDF Object Enhanced with Extracted Metadata: PDF, PUID (PDF 1.4.2), PNG, OCR, 6 Faces, 12 Eyes, 3 in Profile, 1 Close Up
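The enhanced object is the baseline record plus whatever each extractor returned. One simple way to picture that is to merge each extractor's output into the record under a prefixed key, as sketched below. The function, the key naming scheme, and the file name are assumptions; only the counts (6 faces, 12 eyes, 3 profiles, 1 close up) come from the slide.

```python
def enrich(base, extractions):
    """Return a copy of base with each extractor's fields added.

    Keys are prefixed with the extractor name so that two extractors
    cannot silently overwrite each other's fields.
    """
    enriched = dict(base)
    for extractor, fields in extractions.items():
        for key, value in fields.items():
            enriched[f"{extractor}.{key}"] = value
    return enriched


# Baseline record; the file name and directory are hypothetical.
base = {"file_name": "report.pdf", "directory": "rg1/series2"}

# Extractor outputs, using the counts shown on the slide.
extractions = {
    "vision": {"faces": 6, "eyes": 12, "profiles": 3, "close_ups": 1},
}

enhanced = enrich(base, extractions)
```

Because `enrich` copies the record, the same baseline can be re-enriched as new extractors become available.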

15

16

17

18

19 DON’T PANIC

20 Elasticsearch + Kibana. Kibana: ● Free plugin for Elasticsearch ● Gives shape to an Elasticsearch index ● Write queries visually and interactively. Elasticsearch: ● Open-source scalable search engine based on Lucene

21 Lots of ways to explore the data. File Formats: concentric pie chart. Inner: Mimetype. Outer: PRONOM PUID
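The concentric pie chart is a two-level count: files per mimetype on the inner ring, and files per PUID within each mimetype on the outer ring. The sketch below computes those counts in plain Python to show what the chart summarizes; Kibana derives them from the index itself, and the record field names and sample records here are assumptions.

```python
from collections import Counter, defaultdict


def format_breakdown(records):
    """Count files per mimetype (inner ring) and per PUID within each
    mimetype (outer ring)."""
    inner = Counter()
    outer = defaultdict(Counter)
    for rec in records:
        inner[rec["mimetype"]] += 1
        outer[rec["mimetype"]][rec["puid"]] += 1
    return inner, outer


# Hypothetical records; the field names are illustrative.
records = [
    {"mimetype": "application/pdf", "puid": "fmt/18"},
    {"mimetype": "application/pdf", "puid": "fmt/18"},
    {"mimetype": "application/pdf", "puid": "fmt/19"},
    {"mimetype": "image/png", "puid": "fmt/11"},
]

inner, outer = format_breakdown(records)
```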

22 Charts can be added to dynamic dashboards

23 Arrangement can be used as a facet. As you browse the hierarchy, the entire dashboard is redrawn to reflect the particular record group, series, or folder under study. “Drill down” or zoom in and out of your collections.
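Using arrangement as a facet amounts to restricting every chart on the dashboard to records whose path falls under the folder being browsed. A rough sketch of such a request body in the Elasticsearch query DSL follows. The field names `path`, `puid`, and `file_size` are assumptions about the index mapping, and `path` is assumed to be indexed as an exact (keyword) value.

```python
def dashboard_query(folder, top_n=10):
    """Build a search body that scopes aggregations to one folder.

    A trailing slash is forced on the prefix so that "rg1" does not
    also match a sibling such as "rg10".
    """
    prefix = folder.rstrip("/") + "/"
    return {
        "size": 0,  # return aggregations only, no individual hits
        "query": {"prefix": {"path": {"value": prefix}}},
        "aggs": {
            "formats": {"terms": {"field": "puid", "size": top_n}},
            "total_bytes": {"sum": {"field": "file_size"}},
        },
    }


body = dashboard_query("rg1/series2")
```

Changing only the `folder` argument re-scopes every aggregation, which is the "zoom in and out" behavior the slide describes.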

24 Make comparisons between neighbors. Significant Terms are based on full text. They are significant within the overall scope of the query. Significant Terms can be used to distinguish neighboring folders or documents.
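Elasticsearch exposes this as the `significant_terms` aggregation, which surfaces terms that are unusually frequent in the documents matched by the query compared with the rest of the index. The sketch below builds one request body per folder so that neighbors can be compared side by side. The field names `path` and `fulltext` are assumptions, and the full text field must be mapped in a way that permits aggregation (newer Elasticsearch versions also offer a `significant_text` aggregation for analyzed text).

```python
def significant_terms_query(folder, field="fulltext", top_n=10):
    """Build a search body asking for terms that distinguish one folder."""
    return {
        "size": 0,  # aggregations only
        "query": {"prefix": {"path": {"value": folder.rstrip("/") + "/"}}},
        "aggs": {
            "distinctive": {
                "significant_terms": {"field": field, "size": top_n}
            }
        },
    }


# One body per neighboring folder; the folder names are hypothetical.
neighbors = ["rg1/series2/folderA", "rg1/series2/folderB"]
bodies = {folder: significant_terms_query(folder) for folder in neighbors}
```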

25

26 Summary: Indigo, Brown Dog & Elasticsearch. Full text searching (from file conversion or OCR). Charts for any extracted data point: image metrics (pixel count, pixel depth); computer vision (recognized shapes (humans!), image skewness, etc.); Significant Terms; file formats; file sizes. Compare neighboring folders (or series) against each other: Significant Terms, top formats. Use a dashboard to zoom in and out of the arrangement.

27 Richard Marciano: Professor, University of Maryland iSchool; Affiliate Professor, Computer Science; Director, Digital Curation Innovation Center (DCIC), University of Maryland. Bill Underwood: Research Faculty, University of Maryland iSchool; Digital Curation Innovation Center (DCIC), University of Maryland. http://dcic.umd.edu marciano@umd.edu

