HathiTrust Research Center Architecture Overview. Robert H. McDonald, Executive Committee, HathiTrust Research Center (HTRC); Deputy Director, Data to Insight Center.

Presentation transcript:

HathiTrust Research Center Architecture Overview Robert H. McDonald Executive Committee-HathiTrust Research Center (HTRC) Deputy Director-Data to Insight Center Associate Dean-University Libraries Indiana University

Follow Along

HTRC Architecture Group
Indiana University: Beth Plale (Lead), Yiming Sun, Stacy Kowalczyk, Aaron Todd, Jiaan Zeng, Guangchen Ruan, Zong Peng, Swati Nagde
University of Illinois: J. Stephen Downie, Loretta Auvil, Boris Capitanu, Kirk Hess, Harriett Green

Presentation Overview
– Considerations for Current Architecture
– Architecture: Use Case Methodology
– Technical Overview
– UnCamp Sessions for Further Review

Main Case – Data Near Computation
– HT Volume Store (UM)
– HT Volume Store (IUPUI)
– HTRC Volume Store and Index (IUB)
– FutureGrid Computation Cloud
– XSEDE Compute Allocation
– IU Compute Allocation
– UIUC Compute Allocation

Non-Consumptive Research Paradigm
– No action or set of actions on the part of users, whether acting alone or in cooperation with other users, over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from that collection.
– The definition disallows collusion between users and accumulation of material over time.
– It differentiates a human researcher from a proxy, which is not a user. Users are human beings.

Amicus Brief and NCR
Jockers, Sag, Schultz – http://tinyurl.com/cy34hhr

Use Cases for Phase 1 Architecture
– Use Case #1: A previously registered, user-submitted algorithm is retrieved and run, producing a result set
– Use Case #2: HTRC applications/portal access (SEASR)
– Use Case #3: Blacklight Lucene/Solr faceted access
– Use Case #4: Direct programmatic access through the Secure Data API (right now only for UnCamp and open content)

HTRC Current Infrastructure
Servers
– 14 production-level quad-core servers, each with 16 – 32GB of memory and 250 – 500GB of local disk
– 6-node Cassandra cluster for volume store
– Ingest service and secure Data API access point
Storage (IU University Infrastructure)
– 13TB of 15,000 RPM SAS disk storage
– Increase up to 17TB by end of 2012
– 500TB available in late year 2 to year 3

Key Components of Architecture
– Portal Access
– Blacklight Access
– Agent
– Registry
– Secured Data API Access
– Solr Proxy

HTRC Architecture (diagram labels)
Data API access interface; Portal Access; Direct programmatic access (by programs running on HTRC machines); Security (OAuth2); Audit; Cassandra cluster volume store; Solr index; Algorithms; Result Sets; Meandre Workflows; Registry (WSO2); Compute resources; Storage resources; Agent; Job Submission; Collection building; Collections; Blacklight; Solr Proxy

HTRC Architecture: Portal Access
– HTRC Portal
– Blacklight App
– SEASR App
– Blacklight

HTRC Architecture: Agent
– HTRC Agent
– Job Submission
– Collection building

HTRC Architecture: HTRC Registry
– Algorithms
– Result Sets
– Meandre Workflows
– Registry (WSO2)
– Collections

HTRC Architecture: Secure Data API
– RESTful Web Service: language agnostic; clients don’t have to deal with Cassandra
– Simple OAuth2 authentication
– HTTP over SSL
– Audits client access
– Protected behind firewall, accessible only to authorized IPs
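To illustrate what "language agnostic" means for a client of the Secure Data API, here is a minimal Python sketch that builds (but does not send) an HTTPS request carrying an OAuth2 bearer token. The slides do not give the endpoint or parameter names, so the base URL, the `volumes` path, and the `volumeIDs` parameter are assumptions made only for illustration, as is the example volume ID.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL: the slides do not state the real endpoint.
DATA_API_BASE = "https://data-api.example.org"


def build_volume_request(token: str, volume_ids: list[str]) -> Request:
    """Build (but do not send) an HTTPS GET request for a set of volumes."""
    if not DATA_API_BASE.startswith("https://"):
        raise ValueError("the Data API is described as HTTP over SSL only")
    # "volumes" and "volumeIDs" are assumed names, not taken from the slides.
    query = urlencode({"volumeIDs": "|".join(volume_ids)})
    return Request(
        f"{DATA_API_BASE}/volumes?{query}",
        # OAuth2 access tokens are commonly sent as a Bearer credential.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


req = build_volume_request("example-token", ["mdp.39015012345678"])
print(req.full_url)
print(req.get_header("Authorization"))
```

The client only needs HTTP, SSL, and a token; it never talks to Cassandra directly, which is the point the slide makes.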

HTRC Architecture
– RFS distributed file system
– Solr proxy
– Solr service

NoSQL Methodology
– Currently HT content is stored using the Pairtree file system convention (CDL)
– Moving these files into a NoSQL store like Cassandra enabled HTRC to aggregate them into larger sets of files for use in retrieval
– Use of Cassandra enabled HTRC to share content over a commodity-based Cassandra cluster of virtual machines
– Originally investigated use of MongoDB, CouchDB, HBase and Cassandra
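For readers unfamiliar with Pairtree, the sketch below shows how the convention turns an identifier into a nested directory path, which is why content ends up spread across many small directories before aggregation. It follows the character rules of the published Pairtree convention as best understood here; the function names are hypothetical, and this is not HTRC's ingest code.

```python
# Single-character substitutions defined by the Pairtree convention.
_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})
# Visible ASCII characters that Pairtree hex-encodes as ^hh first.
_HEX_ENCODED = set('"*+,<=>?\\^|')


def clean_id(identifier: str) -> str:
    """Apply Pairtree identifier cleaning: hex-encode, then substitute."""
    out = []
    for ch in identifier:
        code = ord(ch)
        if ch in _HEX_ENCODED or code < 0x21 or code > 0x7E:
            # Encode each UTF-8 byte of the character as ^hh.
            out.extend(f"^{b:02x}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out).translate(_SUBSTITUTIONS)


def pairtree_path(identifier: str, shorty: int = 2) -> str:
    """Split the cleaned identifier into fixed-length directory names."""
    cleaned = clean_id(identifier)
    parts = [cleaned[i:i + shorty] for i in range(0, len(cleaned), shorty)]
    return "/".join(parts)


print(pairtree_path("ark:/13030/xt12t3"))
```

Each volume lives at the end of its own deep path, so reading many volumes means many directory traversals; storing volumes as rows in a store such as Cassandra avoids that walk.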

HTRC Solr Proxy + Solr Service
– Preserves all query syntax of the original Solr
– Prevents users from modifying the index
– Hides the host machine and port number HTRC Solr is actually running on
– Creates an audit log of requests
– Filters out “dangerous” requests to Solr
– Adds additional features to Solr, e.g. term vectors: provides a filtered term vector for words starting with a user-specified letter
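The proxy behaviour listed above can be sketched as a small routing function: it audits every request, forwards read-only queries with their query string untouched, and refuses everything else. The allow-list, the blocked parameter set, and the backend address are hypothetical choices for illustration; the slides do not say which requests HTRC treats as dangerous.

```python
import logging
from typing import Optional
from urllib.parse import parse_qs, urlsplit

audit_log = logging.getLogger("solr_proxy.audit")

# Hypothetical rules: only the read-only select handler is exposed.
ALLOWED_HANDLERS = {"/solr/select"}
# Solr parameters that can make the server fetch or route content elsewhere.
BLOCKED_PARAMS = {"stream.url", "stream.file", "stream.body", "qt", "shards"}
# Real host and port stay private to the proxy (assumed value).
BACKEND = "http://127.0.0.1:8983"


def route(user: str, request_uri: str) -> Optional[str]:
    """Return the backend URL to forward to, or None if the request is refused."""
    parts = urlsplit(request_uri)
    params = parse_qs(parts.query, keep_blank_values=True)
    allowed = (
        parts.path in ALLOWED_HANDLERS
        and not BLOCKED_PARAMS.intersection(params)
    )
    # Every request is audited, whether or not it is forwarded.
    audit_log.info("user=%s uri=%s allowed=%s", user, request_uri, allowed)
    if not allowed:
        return None
    # The original query string is passed through unchanged.
    return f"{BACKEND}{parts.path}?{parts.query}"


print(route("alice", "/solr/select?q=title:whale&rows=10"))
print(route("alice", "/solr/update?commit=true"))
```

An allow-list is used here rather than a deny-list of handlers, so that any handler added to Solr later is refused by default.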

Non-Consumptive Research: Secure Data Capsule
– Data Capsules VM Cluster: provide secure VM
– Scholars: Remote Desktop or VNC
– Submit secure capsule map/reduce Data Capsule images to FutureGrid
– Receive and review results
– FutureGrid Computation Cloud
– HTRC Volume Store and Index

Sessions for Further Review
– For more on Secure Data API: Tues Topic I/II (Yiming Sun)
– For more on Portal/SEASR: Tues Topic II (Loretta Auvil)
– For more on Portal/Blacklight: Tues Topic III (Stacy Kowalczyk)

Contact Information
Robert H. McDonald
– Chat – rhmcdonald on googletalk | skype
– Twitter Hashtag: #HTRC12