Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California, Berkeley Paul Watry Richard Marciano.


1 Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of Maryland

2 Thank you!
iRODS available via https://www.irods.org
Project web site: http://diggingintodata.web.unc.edu
Cheshire3 available via https://github.com/cheshire3
Special thanks to John Harrison & Jerome Fuselier (Liverpool), Chien-Yi Hou (UNC), Shreyas & Luis Aguilar (UCB)

3 Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Goals:
– Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
Data:
– Internet Archive Books Collection (with associated MARC where available): ~7.2 TB
– JSTOR: ~1 TB
– Context sources: SNAC Archival and Library Authority records
Tools:
– Cheshire3: fast open source XML search engine for storing and indexing XML books; used to extract GeoLocations and Persons from the indexed books
– iRODS: policy-driven distributed data storage
– Amazon S3 storage and EC2 computing
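The extraction goal on this slide can be illustrated with a small sketch. It is not the project's pipeline and does not use the Cheshire3 API: it assumes a toy XML book layout and two hypothetical lookup lists (PLACES, PERSONS) standing in for real gazetteer and authority sources, and simply reports which known names occur in the book text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup lists for illustration only; a real run would draw
# on reference sources such as gazetteers or authority records.
PLACES = {"Liverpool", "Berkeley", "San Francisco"}
PERSONS = {"John Muir", "Mark Twain"}


def find_known(names, text):
    """Return the subset of names that occur in text as whole phrases."""
    return {
        name
        for name in names
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    }


def extract_entities(xml_text):
    """Collect known place and person names from all text in an XML book."""
    root = ET.fromstring(xml_text)
    # Join every text node and collapse whitespace so that multi-word
    # names split across line breaks can still match.
    text = " ".join(" ".join(root.itertext()).split())
    return {
        "places": find_known(PLACES, text),
        "persons": find_known(PERSONS, text),
    }


if __name__ == "__main__":
    # Assumed toy layout: a <book> element containing <page> elements.
    book = (
        "<book><page>John Muir walked from San Francisco.</page>"
        "<page>He never saw Liverpool.</page></book>"
    )
    print(extract_entities(book))
```

A dictionary lookup like this only finds names already on the lists; statistical NLP taggers would be needed to discover unknown names.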

4 Current Version: iRODS and C3 on Amazon EC2 and S3
[Architecture diagram. Labels: Amazon S3 (Bucket 1, Bucket 2); iRODS (Cache Resources, iCAT, Rule Engines); Amazon EC2; Data Ingestion; Cheshire3 Indexing; Retrieval; Data Presentation]

8 Summary
Indexing and IR work very well in the Grid / Cloud environment, with the expected scaling behavior for multiple processes.
Still in progress:
– We are still collecting and processing the books collection from the Internet Archive
– We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC)
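The linking step mentioned above can be sketched as a normalized-key lookup. This is an assumption-laden illustration, not the project's method: the record identifiers and labels are hypothetical placeholders, and no GeoNames, VIAF, or SNAC service is queried.

```python
import unicodedata


def normalize(name):
    """Case-fold, strip diacritics and commas, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().replace(",", " ").split())


def build_index(records):
    """Map each normalized label to the set of record ids that carry it."""
    index = {}
    for rec_id, labels in records.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(rec_id)
    return index


def link(names, index):
    """Return candidate record ids (possibly none) for each extracted name."""
    return {name: sorted(index.get(normalize(name), set())) for name in names}


if __name__ == "__main__":
    # Hypothetical authority records: id -> known name variants.
    records = {
        "place-001": ["Zürich", "Zurich"],
        "person-001": ["Twain, Mark", "Mark Twain"],
    }
    index = build_index(records)
    print(link(["ZURICH", "mark twain", "Atlantis"], index))
```

Exact matching on normalized strings leaves ambiguous names (several ids for one key) unresolved; choosing among them would need the surrounding context.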

9 Current Digital Curation Projects
CI-BER (CyberInfrastructure for Billions of Electronic Records):
– Funded by NSF/NARA (2010-2013): ~$1M
– See: http://ci-ber.blogspot.com/
– Big data management project based on the integration of heterogeneous datasets:
a. Testbed collection of 100 million files and 50 terabytes of data with content from over 100 federal agencies
b. Open source collaborative geo-analytics prototype
c. “Citizen-led crowdsourcing” prototype
NSF: “Brown Dog”, a $10.5M NSF/DIBBs award (2013-2018), the “super mutt” of software:
– http://go.illinois.edu/BrownDog
– NCSA (Kenton McHenry) + CI-BER (Richard Marciano)
– Creating a Data Observatory to: provide access to big data training sets. Benefits: accelerate the development of digital curation algorithms and services
What if:
– Students could be embedded in a major NSF partnership?
– We used this implementation project as an opportunity to teach students practical digital curation skills?

10 Digital Curation Lab
Mission Statement: The Digital Curation Lab (DCL) aims to be a leader in the Digital Curation educational field, providing a model for other universities and laboratories around the world.
Vision: The Digital Curation Lab (DCL) will provide a real-world digital curation experience for students and professionals to experiment and innovate. The DCL will also be a key enabler of digital curation projects within the Washington, DC Metro area, and a source of cutting-edge research that will transform digital curation technologies, practices, and institutions.
Values: In order to achieve its vision and mission, the DCL needs to value:
– Innovative, applied research in the digital curation field.
– Practical and theory-based education.
– Enduring partnerships with on- and off-campus organizations, including alumni organizations.
– Transformational impact on technologies, institutions, and practice.

11 Digital Curation Laboratory (DCL)
1. DATA SHOP
2. OUTSIDE PARTNERS
3. ALUMNI HOME
4. LESSON PLAN & SOFTWARE REPOSITORY
5. INTERDISCIPLINARY COLLEGE RESOURCE
6. QUALITY TESTING SHOP
7. INNOVATION HUB
- stretching - rejuvenation - aging

