Computational Research and Copyright John Unsworth BNN Future of the Academy Speaker Series MIT Faculty Club May 25, 2012.

Presentation transcript:

HATHI TRUST A Shared Digital Repository HathiTrust Research Center

Goals of the HTRC Maintain a repository of text mining algorithms, retrieval tools, derived data sets, and indices available for human and programmatic discovery. Be a user-driven resource, with an active advisory board, and a community model that allows users to share tools and results. Support interoperability across collections and institutions, through use of InCommon SAML identity. See also: -- a report prepared by the Illinois Center for Informatics Research in Science and Scholarship, on the experience of Google Digital Humanities grant recipients.

The HathiTrust Research Center The HathiTrust Research Center (HTRC) enables computational access for nonprofit and educational users to published works in the public domain. In the future, it will offer computational access to in-copyright works from the HathiTrust as well. The center will break new ground in the areas of text mining and non-consumptive research, allowing scholars to fully utilize content of the HathiTrust Library while observing the requirements of current U.S. copyright law.

HTRC Partners The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library and Google. The HTRC will help researchers meet the technical challenges of working with massive digital collections, by developing tools and cyberinfrastructure that enable advanced computational access to those collections.

Memos of Understanding Completed: IU/UIUC MOU, HT/IU/UIUC MOU, Google/UIUC MOU, Google/IU MOU. To be developed: HTRC-Researcher/Center MOU.

Executive Committee The HathiTrust Research Center is led by an Executive Management Team that includes: Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois Graduate School of Library and Information Science Beth Plale (Co-director and chair), Data To Insight Center director and professor in the School of Informatics and Computing at Indiana University Scott Poole, I-CHASS director and professor in the Department of Communication at the University of Illinois Robert McDonald, Indiana University Associate Dean of Libraries John Unsworth, Vice-Provost and CIO, Brandeis University

Advisory Board Cathy Blake, University of Illinois, Urbana-Champaign Beth Cate, Indiana University Greg Crane, Tufts University Laine Farley, California Digital Library Brian Geiger, University of California at Riverside David Greenbaum, University of California at Berkeley Fotis Jannidis, University of Würzburg, Germany Matthew Jockers, Stanford University Jim Neal, Columbia University Bill Newman, Indiana University Bethany Nowviskie, University of Virginia Andrey Rzhetsky, University of Chicago Pat Steele, University of Maryland Craig Stewart, Indiana University David Theo Goldberg, University of California at Irvine John Towns, National Center for Supercomputing Applications Madelyn Wessel, University of Virginia

Timeline: Phase 1 The primary areas of work in Phase 1 include architecting the core cyberinfrastructure for data analysis, deploying some general-purpose analytical tools, and prototyping end-user services, including an access portal, support center capabilities, and facilities for sharing and storing derived research data. In Phase 1, only the public domain works in the HathiTrust will be available to researchers, since the security framework and policies for working with copyrighted material will still be under development. The HTRC will deliver a demonstration system in June 2012.

Timeline: Phase 2 This phase, which will require significant funding, will involve development of an operational research center that will provide ongoing and up-to-date access to the HTRC research corpus and associated tools. Phase 2 will commence during the 18th month of the project, and its launch will depend on garnering resources during Phase 1 and on the sustainability plan that will be developed in Phase 1.

Current Collections HTRC currently has a 250,000-volume collection of non-Google digitized content and a 50,000-volume collection of content that IU libraries digitized. These collections reside in a cluster of three 4-core machines, each with 16 GB of RAM. About 2.8M volumes of Google-produced public domain material will shortly be added to the HTRC collections, now that the Google MOUs have been signed.

HTRC Access and Use Users will be able to access the HTRC through a portal or programmatically, through a Data API. The Data API cannot be used to download volumes, but it can be used to move data to a location where computation takes place. It can also be used to search Solr indexes and pass volume IDs to other services for access and computation. The target audience of the HTRC is non-profit and educational researchers. Authentication will depend on InCommon, a Shibboleth implementation that most HathiTrust institutions already support.
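The search-then-pass-IDs pattern described above can be sketched as follows. This is a minimal illustration, not the HTRC Data API itself: the endpoint address, the `id` and `title` field names, and the sample response are all assumptions. Only the standard Solr query parameters (`q`, `fl`, `rows`, `wt`) and the usual shape of a Solr JSON response are relied on.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the talk does not give the real Data API address.
SOLR_BASE = "https://htrc.example.org/solr/select"


def build_search_url(query, rows=10, base=SOLR_BASE):
    """Build a Solr select URL that requests only the ID field of each hit."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return base + "?" + urlencode(params)


def extract_volume_ids(solr_response):
    """Collect volume IDs from a parsed Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc["id"] for doc in docs if "id" in doc]


# A made-up response in the standard Solr JSON layout, so no network is needed.
sample_response = {
    "response": {
        "numFound": 2,
        "docs": [{"id": "vol.0001"}, {"id": "vol.0002"}],
    }
}

url = build_search_url("title:whale", rows=2)
volume_ids = extract_volume_ids(sample_response)
```

The list in `volume_ids` is what would be handed to another service for computation; the volume text itself never has to be downloaded by the caller.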

Architecture Solr Indexes: The HathiTrust and the HTRC both use Apache Solr to index the materials in their collections. The Solr index is accessed through the Data API layer. The Data API layer limits some access and does auditing, but otherwise is a pass-through to the Solr API. Volume Store: HTRC uses Apache Cassandra, a NoSQL data store cluster, to hold the volumes of digitized text. Volume- and page-level access to HTRC data is provided through the HTRC Data API. Each machine has 500 GB of disk, and the volumes are partitioned and replicated across the 3 Cassandra instances. Registry: IU is running a version of WSO2 Governance Registry, where applications are registered prior to running in the non-consumptive framework. The registry is also used as temporary storage for returned results.
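To make "partitioned and replicated across the 3 Cassandra instances" concrete, here is a toy placement function. It is a sketch under stated assumptions: the node names are hypothetical, the replication factor of 2 is not stated in the talk, and the hash-modulo scheme is a simplification that stands in for Cassandra's own token-ring partitioner.

```python
import hashlib

# Hypothetical names for the three machines mentioned in the talk.
NODES = ["node-a", "node-b", "node-c"]


def replica_nodes(volume_id, nodes=NODES, replication_factor=2):
    """Return the nodes that would hold copies of one volume.

    The volume ID is hashed to pick a primary node; further replicas go to
    the following nodes, wrapping around the list. The replication factor
    is an assumed value, not one given in the source.
    """
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    digest = hashlib.md5(volume_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


placement = {vid: replica_nodes(vid) for vid in ["vol.0001", "vol.0002", "vol.0003"]}
```

Because the placement depends only on the volume ID, any client can work out where a volume lives without consulting a central directory, which is the property that partitioned stores of this kind rely on.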

“Research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book.”

Non-consumptive Research One of HTRC’s unique challenges is support for non-consumptive research. This will entail bringing algorithms to data, and exporting results, and/or providing people with secure computational environments in which they can work with copyrighted materials without exporting them. Why is this worth doing? Because it enables a new art of information that can be used to make new kinds of arguments (and possibly to settle some old ones).

Non-Consumptive Research HTRC received funding from the Alfred P. Sloan Foundation for development of secure infrastructure on which to carry out execution of large-scale parallel tasks on copyrighted data using public compute resources such as FutureGrid or resources at NCSA. The high-level design uses a pool of VM images that run in a secure-capsule mode and are deployed onto compute resources. The team is working on a proof of concept deployment process with an OpenStack platform using Sigiri.
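The idea of "bringing algorithms to data, and exporting results" can be illustrated with a small sketch. Everything here is hypothetical (the function names, the toy corpus, the choice of word counting as the algorithm); it is not the Data Capsule or Sigiri design, only the general shape: the algorithm sees the full text, and the caller receives aggregate results only.

```python
import re
from collections import Counter


def word_counts(text):
    """Example algorithm: count lowercase alphabetic tokens in one volume."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def run_non_consumptive(volumes, algorithm, top_n=5):
    """Run an algorithm next to the data and export only aggregate counts.

    `volumes` maps volume IDs to full text. The text stays inside this
    function; the return value holds the top_n (word, count) pairs summed
    over all volumes, never the passages themselves.
    """
    totals = Counter()
    for text in volumes.values():
        totals.update(algorithm(text))
    return totals.most_common(top_n)


# Toy stand-in for a protected corpus.
corpus = {
    "vol.0001": "the whale and the sea",
    "vol.0002": "the sea and the sky",
}

result = run_non_consumptive(corpus, word_counts, top_n=2)
```

A real secure environment would also have to vet the submitted algorithm and the size of its output, since an arbitrary function could otherwise return the text verbatim; that policy layer is outside this sketch.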

Blacklight Developed at the University of Virginia, Blacklight is an open-source discovery interface. Blacklight supports faceted searches, a known need of researchers. We expect Blacklight to be a significant component of the public face of the HTRC. Blacklight is designed to support data that is both full text and bibliographic. Blacklight is built on Solr, the same technology that we already use to index the HTRC data.

Google DH study Google Digital Humanities Awards Recipient Interviews Report, prepared for the HathiTrust Research Center by Virgil E. Varvel Jr. and Andrea Thomer at the Center for Informatics Research in Science and Scholarship, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, in Fall 2011

Scope of the report 22 researchers who had received Google Digital Humanities grants were invited to debrief on their experience, in order to provide input to the design of the HTRC. Interviews were conducted by phone, in person, or by Skype, using a semi-structured interview protocol.

Findings of the report: OCR OCR quality is a significant issue; steps should be taken to improve OCR output where possible. OCR quality should be indicated in volume-level metadata. Scalability of scanned page images is necessary for human correction of OCR errors.
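One way a volume-level OCR quality indicator might be computed is the share of tokens that appear in a reference word list. The sketch below is a crude heuristic offered for illustration only: the report does not specify a method, and the function name, the tiny lexicon, and the `ocr_quality` metadata field are all assumptions.

```python
import re


def ocr_quality_score(text, lexicon):
    """Return the fraction of alphabetic tokens found in a reference lexicon.

    Digits and punctuation split tokens, so an OCR error such as "wha1e"
    shows up as unrecognized fragments and lowers the score.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if token in lexicon)
    return recognized / len(tokens)


# A deliberately tiny word list; a real one would be a full dictionary.
lexicon = {"the", "whale", "swam", "away"}

clean_score = ocr_quality_score("The whale swam away", lexicon)
noisy_score = ocr_quality_score("Tbe wha1e swam awav", lexicon)

# Hypothetical metadata record carrying the indicator at volume level.
volume_metadata = {"id": "vol.0001", "ocr_quality": round(noisy_score, 2)}
```

A dictionary-based score penalizes proper nouns, archaic spellings, and non-English text, so it would need to be read alongside language metadata rather than on its own.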

Other Findings of the Report Researchers would like better metadata about the languages included in texts, particularly in multi-lingual documents. Better metadata about language by sections within volumes would be helpful. Automatic language identification functions would be helpful, but human‐created metadata is preferred, particularly for documents with low OCR quality. For one researcher, the primary issue was retrieving the bibliographic records in usable form. It took 10 months to design the queries and get the data.

Matt Jockers, “The Nineteenth-Century Literary Genome” via Digital Humanities Specialist (aka Elijah Meeks)

Arguing with Data Data enables arguments based on quantitative and/or empirical data. Data still requires interpretation, and you can still make better and worse interpretations, and more or less compelling arguments. In addition to new kinds of arguments, you can make new kinds of mistakes, especially mistakes based on incomplete data or on an incomplete understanding of data.

Mistakes based on incomplete data

New kinds of arguments Ted Underwood is exploring the changing etymological basis of diction in English, over a 200-year period, especially the shift from words derived from German, to words derived from Latin, and back again.

Etymology and Style Ted Underwood, 2011 o English professors have a long, lively history of drawing specious conclusions from the “Latinate” or “Germanic” character of a particular writer’s style. o There is nevertheless good evidence that older words do predominate in informal, and especially spoken English. [Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): ] o Can we use this fact to trace broad changes of register in the history of written English?

The fundamental distinction is not Latinate/Germanic, but date of entry. French was the written language for 200 years; words that entered English before that point had to be used in the spoken language to survive. This includes “Latinate” words like “street” and “wall.”
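A measure based on date of entry rather than language of origin could be sketched as below. This is not Underwood's actual method or data: the first-attestation years are invented placeholders, the cutoff of 1150 is an assumed parameter (the slide gives no year), and the function name is hypothetical. A real study would draw dates from a historical dictionary.

```python
import re

# Placeholder years, invented for illustration only. They are NOT real
# etymological data; they just mark words as "early" or "late" entries.
FIRST_ATTESTED = {
    "street": 900,
    "wall": 900,
    "house": 900,
    "edifice": 1400,
    "residence": 1400,
}


def early_word_share(text, first_attested, cutoff_year):
    """Return the share of dated tokens that entered English before cutoff_year.

    Tokens with no entry in `first_attested` are ignored. Returns None when
    no token in the text has a known date.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    dated = [first_attested[token] for token in tokens if token in first_attested]
    if not dated:
        return None
    early = sum(1 for year in dated if year < cutoff_year)
    return early / len(dated)


share = early_word_share(
    "The house by the wall faced a grand edifice",
    FIRST_ATTESTED,
    cutoff_year=1150,  # assumed cutoff, not taken from the talk
)
```

Note that "street" and "wall" count as early here even though they are Latinate in origin, which is exactly the distinction the slide draws. Computed per volume and grouped by genre and publication date, a share like this is the kind of series that could show prose becoming more or less speechlike over time.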

To understand the significance of the result, it needs to be broken down by genre. Initial results suggest that fiction and nonfiction prose both become more formal (less like speech) in the 18c. Drama and poetry change little, although older, less formal, “speechlike” words always predominate in drama.

The Value of HTRC Ted’s investigation concerns historical trends: as such, it is reasonable to think that it might be interesting to extend beyond Can he do that? Only if he is given the data. Will researchers have this kind of computational access to copyrighted data? Only through some institutional affordance like HTRC. Institutions are risk-averse: in some sense, the most important infrastructure in HTRC is the MOU.