Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California, Berkeley Paul Watry Richard Marciano.

Presentation transcript:

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of Maryland

Thank you!
iRODS available via
Project web site available via
Special thanks to John Harrison & Jerome Fuselier (Liverpool), Chien-Yi Hou (UNC), Shreyas & Luis Aguilar (UCB)

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Goals:
– Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
Data:
– Internet Archive Books Collection (with associated MARC where available): ~7.2 TB
– JSTOR: ~1 TB
– Context sources: SNAC Archival and Library Authority records
Tools:
– Cheshire3: fast open source XML search engine for storing and indexing XML books; used for extracting GeoLocations and Persons from the indexed books
– iRODS: policy-driven distributed data storage
– Amazon S3 storage and EC2 computing
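The extraction goal on the slide above can be illustrated with a minimal sketch. It is not the project's Cheshire3 pipeline: it uses only the Python standard library, and the gazetteer, the person list, the sample XML, and the function names are all hypothetical stand-ins for the much larger reference data a real run would need.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical reference lists, invented for this illustration.
GAZETTEER = {"Liverpool", "Berkeley", "Hong Kong"}
KNOWN_PERSONS = {"Andrew Carnegie"}


def book_text(xml_text):
    """Concatenate all text nodes of an XML document into one normalized string."""
    root = ET.fromstring(xml_text)
    return " ".join(" ".join(root.itertext()).split())


def find_known(text, names):
    """Return the subset of `names` that occur in `text` as whole phrases."""
    found = set()
    for name in names:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.add(name)
    return found


def extract_entities(xml_text):
    """Return (places, persons) matched against the hypothetical reference lists."""
    text = book_text(xml_text)
    return find_known(text, GAZETTEER), find_known(text, KNOWN_PERSONS)


# Hypothetical sample record.
SAMPLE = (
    "<book><title>A Tour</title>"
    "<p>Andrew Carnegie travelled from Liverpool to Hong Kong.</p></book>"
)
places, persons = extract_entities(SAMPLE)
```

A dictionary lookup of this kind only finds names already present in the reference lists; NLP techniques such as those named on the slide would be needed to detect unlisted names.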

Current Version: iRODS and C3 on Amazon EC2 and S3
[Architecture diagram. Labels: Amazon S3, Bucket 1, Bucket 2, iRODS, Cache Resource, Amazon EC2, Data Ingestion, Cheshire3, Indexing, Retrieval, iCAT, Rule Engine, Data Presentation]

Summary
Indexing and IR work very well in the Grid / Cloud environment, with the expected scaling behavior for multiple processes.
Still in progress:
– We are still collecting and processing the books collection from the Internet Archive
– We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC)
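The linking step mentioned in the summary can be sketched as a normalized lookup against authority headings. This is an assumption-laden illustration, not the project's method: the records and the "example:" identifiers are hypothetical, and no real GeoNames, VIAF, or SNAC API is called.

```python
import re
import unicodedata


def normalize(name):
    """Fold case, accents, and punctuation; turn 'Last, First' into 'first last'."""
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^\w\s]", " ", unaccented).lower().split())


# Hypothetical authority records: identifier -> heading in inverted form.
AUTHORITY = {
    "example:0001": "Carnegie, Andrew",
    "example:0002": "Larson, Ray R.",
}

# Lookup table keyed by the normalized heading.
INDEX = {normalize(heading): ident for ident, heading in AUTHORITY.items()}


def link(name):
    """Return the identifier of the matching authority record, or None."""
    return INDEX.get(normalize(name))
```

Exact matching after normalization is the simplest possible design; it cannot separate two people who share a name, which is where contextual evidence such as dates or associated places would come in.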

Current Digital Curation Projects
CI-BER (CyberInfrastructure for Billions of Electronic Records):
– Funded by NSF/NARA ( ): ~$1M
– See:
– Big data management project based on the integration of heterogeneous datasets:
  a. Testbed collection of 100 million files and 50 terabytes of data with content from over 100 federal agencies
  b. Open source collaborative geo-analytics prototype
  c. “Citizen-led crowdsourcing” prototype
NSF: “Brown Dog”, a $10.5M NSF/DIBBs award ( ) -- the “super mutt” of software:
– NCSA (Kenton McHenry) + CI-BER (Richard Marciano)
– Creating a Data Observatory to: provide access to big data training sets
– Benefits: accelerate the development of digital curation algorithms and services
What if:
– Students could be embedded in a major NSF partnership?
– We used this implementation project as an opportunity to teach students practical digital curation skills?
Creating a data observatory

Digital Curation Lab
Mission Statement: The Digital Curation Lab (DCL) aims to be a leader in the digital curation educational field, providing a model for other universities and laboratories around the world.
Vision: The Digital Curation Lab (DCL) will provide a real-world digital curation experience for students and professionals to experiment and innovate. The DCL will also be a key enabler of digital curation projects within the Washington, DC Metro area, and a source of cutting-edge research that will transform digital curation technologies, practices, and institutions.
Values: To achieve its vision and mission, the DCL needs to value:
– Innovative, applied research in the digital curation field.
– Practical and theory-based education.
– Enduring partnerships with on- and off-campus organizations, including alumni organizations.
– Transformational impact on technologies, institutions, and practice.

Digital Curation Laboratory (DCL)
1. DATA SHOP
2. OUTSIDE PARTNERS
3. ALUMNI HOME
4. LESSON PLAN & SOFTWARE REPOSITORY
5. INTERDISCIPLINARY COLLEGE RESOURCE
6. QUALITY TESTING SHOP
7. INNOVATION HUB
[Diagram annotations: stretching, rejuvenation, aging]