HTRC API Overview, Yiming Sun



HTRC Architecture
[Architecture diagram. Components shown: Data API; Portal access; Direct programmatic access (by programs running on HTRC machines); Security (OAuth2); Audit; Cassandra cluster volume store; Solr index shards; Token filter; Sentence tokenizer; Word position; Latent semantic analysis; Meandre flows; Registry (WSO2); Compute resources; Storage resources; Agent; Application submission; Collection building; Collections; Solr Proxy. The diagram carries a "We are here" marker.]

HTRC RESTful Services
Solr Proxy API
- Search full text and metadata
- Retrieve metadata
- Pass through read-only Solr requests
- Audit
Data API
- Retrieve volume and page data
- Authorization via OAuth2
- Communication over TLS (i.e. HTTPS)
- Audit

Why API? Why RESTful?
API
- Application Programming Interface
- The interface stays the same while the underlying implementation changes for improvement
- Clients of the API do not need to change
RESTful
- REpresentational State Transfer
- A style of web service using HTTP
- Lightweight, with little to no dependencies
- Language-agnostic

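Typical Flow
[Sequence diagram. Participants: User, Solr Proxy, Data API.]
1. User to Solr Proxy: search criteria
2. Solr Proxy to User: preliminary results
3. User to Solr Proxy: refined search criteria
4. Solr Proxy to User: final results
5. User: extract volumeIDs / pageIDs
6. User to Data API: volumeID or pageID list, with OAuth2 token
7. Data API to User: requested volume or page data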

Search indexed fields in Solr Proxy API
Search for an author named Dickens.
URL: q=author:dickens
  chinkapin.pti.indiana.edu:9994 - Solr proxy host and port number
  solr - Solr proxy service
  select - query
  qt=sharding - query type; it must be sharding, because the Solr index is sharded
  q=author:dickens - query string

Specify returned fields in Solr Proxy API
Use "fl" to specify the fields to return, e.g. return the title, volume ID, and author of books whose author is Shakespeare.
URL: qt=sharding&q=author:shakespeare&fl=title,id,author
  fl=title,id,author - return the title, id, and author fields

Specify number of rows returned from Solr Proxy API
By default, Solr returns only 10 entries. Use "rows" to specify a different number, e.g. return a maximum of 20 entries for books whose title contains the word "canterbury".
URL: qt=sharding&q=title:canterbury&rows=20
  rows=20 - return a maximum of 20 entries
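The three query parameters introduced so far (qt, q, fl, rows) can be assembled programmatically. The following is a minimal sketch, not the HTRC client: the base URL is a hypothetical placeholder (the full URL is not shown in these slides), and the function name build_solr_query_url is invented for illustration.

```python
from urllib.parse import quote, urlencode

# Hypothetical placeholder: the real Solr Proxy URL is not shown in the slides.
SOLR_SELECT_URL = "http://solr-proxy.example.org/solr/select"


def build_solr_query_url(query, fields=None, rows=None, base_url=SOLR_SELECT_URL):
    """Build a Solr Proxy query URL with URL-encoded parameters."""
    # qt=sharding is always sent, as the slides require.
    params = [("qt", "sharding"), ("q", query)]
    if fields:
        # "fl" takes a comma-separated list of field names.
        params.append(("fl", ",".join(fields)))
    if rows is not None:
        # "rows" overrides Solr's default of 10 returned entries.
        params.append(("rows", str(rows)))
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + "?" + urlencode(params, quote_via=quote)


url = build_solr_query_url("title:canterbury", fields=["title", "id", "author"], rows=20)
print(url)
```

Passing the parameters as a list of pairs keeps them in a predictable order, which makes the resulting URL easier to compare against the examples on the slides.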

Some common indexed fields*
Field name   Indexed  Stored  Explanation
id           Y        Y       Field for volume ID
ocr          Y        N       OCR field for full-text search
title        Y        Y       Field for book title
author       Y        Y       Author of a volume
isbn         Y        Y       The International Standard Book Number (ISBN) for a volume
issn         Y        Y       The International Standard Serial Number (ISSN) for a volume
callnumber   Y        Y       The call number for a volume
language     Y        N       Field for the languages in which a volume is written
topic        Y        N       Topic of the volume
genre        Y        N       Genre the volume belongs to
publishDate  Y        Y       Publish date of the volume
publisher    Y        Y       Publisher of the volume
* Complete list at the end of the slide deck.

Remember to URL encode query string
Web browsers and tools such as Blacklight perform URL encoding on your Solr query behind the scenes, so you may enter special characters such as colons and spaces directly:
URL: e:canterbury AND author:chaucer
If you are writing code to query Solr over an HTTP connection, you must URL encode your query strings yourself. Most languages include URL encoding libraries.
URL: e%3Acanterbury%20AND%20author%3Achaucer
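As one illustration of such a library, Python's standard urllib.parse module can do the encoding. This is a small sketch; the query string used here is an example chosen for illustration, and encode_query is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def encode_query(raw_query):
    """Percent-encode a raw Solr query string for use in a URL."""
    # safe="" forces ":" and "/" to be encoded too; spaces become %20.
    return quote(raw_query, safe="")


raw = "title:canterbury AND author:chaucer"
encoded = encode_query(raw)
print(encoded)

# Decoding restores the original query, so the encoding is lossless.
assert unquote(encoded) == raw
```

Note that quote encodes a space as %20, matching the style shown on the slide, whereas urllib.parse.quote_plus would produce "+" instead.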

Retrieve MARC records for specified volumes
URL:
  MARC - retrieve MARC records
  volumeIDs - parameter for the volume ID list
  The volume ID list is formed by joining individual volume IDs with "|" (bar, pipe):
  volumeIDs=uc2.ark:/13960/t9668bb2t|uc2.ark:/13960/t8kd1sn38
Again, the query string needs to be URL encoded:
  volumeIDs=uc2.ark%3A%2F13960%2Ft9668bb2t%7Cuc2.ark%3A%2F13960%2Ft8kd1sn38
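The joining and encoding steps can be sketched in a few lines of Python. The helper names below are hypothetical; only the "|" separator, the volumeIDs parameter name, and the two sample IDs come from the slide.

```python
from urllib.parse import quote


def join_volume_ids(volume_ids):
    """Join individual volume IDs into one list using "|" as the separator."""
    return "|".join(volume_ids)


def encode_volume_ids_param(volume_ids):
    """Return the URL-encoded volumeIDs parameter for a list of volume IDs."""
    # safe="" ensures ":", "/" and "|" are all percent-encoded.
    return "volumeIDs=" + quote(join_volume_ids(volume_ids), safe="")


sample_ids = ["uc2.ark:/13960/t9668bb2t", "uc2.ark:/13960/t8kd1sn38"]
volume_param = encode_volume_ids_param(sample_ids)
print(volume_param)
```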

What's returned
Streams back a ZIP file containing one MARC XML file for each volume.
marc.zip
  uc2.ark+=13960=t8kd1sn38.xml
  uc2.ark+=13960=t9668bb2t.xml

Retrieve volumes from Data API
Two request headers must be present:
- Content-type: application/x-www-form-urlencoded
- Authorization: Bearer oauth2-token
Request method:
- POST

Retrieve volumes from Data API
URL:
  silvermaple.pti.indiana.edu:25443 - service host and port number
  data-api - Data API service
  volumes - request for volumes
Request body: volumeIDs=
  volumeIDs - parameter listing the requested volumes
  The volume IDs are joined by "|": uc2.ark:/13960/t9668bb2t|uc2.ark:/13960/t8kd1sn38
  URL encoded: volumeIDs=uc2.ark%3A%2F13960%2Ft9668bb2t%7Cuc2.ark%3A%2F13960%2Ft8kd1sn38
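Putting the headers, method, and request body together, a volume request might be constructed as follows. This is only a sketch under stated assumptions: the endpoint URL is a hypothetical placeholder (the full URL is not shown in these slides), the token value is a dummy, and the request object is built but deliberately not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical placeholder: the real Data API URL is not shown in the slides.
DATA_API_VOLUMES_URL = "https://data-api.example.org/data-api/volumes"


def build_volumes_request(volume_ids, token, url=DATA_API_VOLUMES_URL):
    """Build (but do not send) a POST request for a list of volumes."""
    # The body carries the "|"-joined, URL-encoded volume ID list.
    body = "volumeIDs=" + quote("|".join(volume_ids), safe="")
    headers = {
        "Content-type": "application/x-www-form-urlencoded",
        "Authorization": "Bearer " + token,
    }
    return Request(url, data=body.encode("ascii"), headers=headers, method="POST")


volumes_request = build_volumes_request(
    ["uc2.ark:/13960/t9668bb2t", "uc2.ark:/13960/t8kd1sn38"],
    token="example-token",  # dummy value for illustration only
)
print(volumes_request.get_method(), volumes_request.full_url)
print(volumes_request.data.decode("ascii"))
```

To actually send the request, the object could be passed to urllib.request.urlopen, and the response body read as the bytes of a ZIP file.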

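What does Data API return?
Streams back a ZIP file in which each volume is a directory and each page of the volume is a text file in that directory, named with an 8-digit page sequence number.
volumes.zip
  uc2.ark+=13960=t8kd1sn38/
    (one .txt file per page, named with the 8-digit page sequence number)
  uc2.ark+=13960=t9668bb2t/
    (one .txt file per page, named with the 8-digit page sequence number)
  ...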
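A client could unpack such a ZIP into a per-volume dictionary of pages. The sketch below makes several assumptions that the slides do not confirm: that page files are UTF-8 text, that directories sit at the top level of the ZIP, and that directory names map back to volume IDs by replacing "+" with ":" and "=" with "/" (a simplification of the naming seen in the examples). The sample ZIP and its page file names are made up for the demonstration.

```python
import io
import zipfile


def directory_to_volume_id(directory_name):
    """Map a ZIP directory name back to a volume ID (simplified assumption)."""
    return directory_name.replace("+", ":").replace("=", "/")


def read_volumes_zip(zip_bytes):
    """Return {volume_id: {page_sequence: text}} from a volumes ZIP."""
    volumes = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip directory entries and top-level files such as ERROR.err.
            if name.endswith("/") or "/" not in name:
                continue
            directory_name, file_name = name.split("/", 1)
            page_sequence = file_name.rsplit(".", 1)[0]
            volume_id = directory_to_volume_id(directory_name)
            text = archive.read(name).decode("utf-8")  # encoding is assumed
            volumes.setdefault(volume_id, {})[page_sequence] = text
    return volumes


def make_sample_zip():
    """Build a tiny in-memory ZIP that imitates the described layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("uc2.ark+=13960=t9668bb2t/00000001.txt", "first page")
        archive.writestr("uc2.ark+=13960=t9668bb2t/00000002.txt", "second page")
    return buffer.getvalue()


sample_volumes = read_volumes_zip(make_sample_zip())
print(sorted(sample_volumes["uc2.ark:/13960/t9668bb2t"]))
```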

Retrieve volumes with pages concatenated
URL:
Request body: volumeIDs=uc2.ark%3A%2F13960%2Ft9668bb2t%7Cuc2.ark%3A%2F13960%2Ft8kd1sn38&concat=true
  concat=true - all pages of a volume are concatenated into a single text file

What does Data API return?
Streams back a ZIP file in which each volume is a single text file whose content is all pages of the volume concatenated together.
volumes.zip
  uc2.ark+=13960=t9668bb2t.txt
  uc2.ark+=13960=t8kd1sn38.txt

Retrieving pages
- Enables retrieval of selected pages instead of entire volumes
- A page ID is a volume ID followed by a comma-separated list of page sequences enclosed in square brackets: uc2.ark:/13960/t9668bb2t[2,4]
- Page IDs are also joined by "|" to form a pageID list

Retrieve pages from multiple volumes
Page ID list: uc2.ark:/13960/t9668bb2t[2,4]|uc2.ark:/13960/t8kd1sn38[5,3,1]
URL:
Request body: pageIDs=uc2.ark%3A%2F13960%2Ft9668bb2t%5B2%2C4%5D%7Cuc2.ark%3A%2F13960%2Ft8kd1sn38%5B5%2C3%2C1%5D
  pages - retrieve pages
  pageIDs - parameter for the pageID list
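The page ID syntax lends itself to a small builder. The following sketch uses hypothetical helper names; the bracket syntax, the "|" separator, the pageIDs parameter name, and the sample IDs are taken from the slides.

```python
from urllib.parse import quote


def make_page_id(volume_id, page_sequences):
    """Format one page ID: the volume ID followed by [seq,seq,...]."""
    return volume_id + "[" + ",".join(str(seq) for seq in page_sequences) + "]"


def encode_page_ids_param(pages_by_volume):
    """Return the URL-encoded pageIDs parameter.

    pages_by_volume is a list of (volume_id, [page sequences]) pairs.
    """
    page_id_list = "|".join(make_page_id(vol, seqs) for vol, seqs in pages_by_volume)
    # safe="" also encodes "[", "]", "," and "|".
    return "pageIDs=" + quote(page_id_list, safe="")


page_param = encode_page_ids_param([
    ("uc2.ark:/13960/t9668bb2t", [2, 4]),
    ("uc2.ark:/13960/t8kd1sn38", [5, 3, 1]),
])
print(page_param)
```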

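What does Data API return?
Streams back a ZIP file in which each volume is a directory and each requested page is a separate text file named with the 8-digit page sequence number.
pages.zip
  uc2.ark+=13960=t8kd1sn38/
    (one .txt file per requested page, named with the 8-digit page sequence number)
  uc2.ark+=13960=t9668bb2t/
    (one .txt file per requested page, named with the 8-digit page sequence number)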

Retrieve pages as a bag of words
URL:
Request body: pageIDs=uc2.ark%3A%2F13960%2Ft9668bb2t%5B2%2C4%5D%7Cuc2.ark%3A%2F13960%2Ft8kd1sn38%5B5%2C3%2C1%5D&concat=true
  concat=true - all requested pages are concatenated into a single file

What does Data API return?
Streams back a ZIP file containing a single text file named wordbag.txt, which is the concatenation of all requested pages.
pages.zip
  wordbag.txt

What is ERROR.err?
If Data API encounters an error before it has started streaming back data, the client receives an HTTP error response status code (e.g. 404, 500) and an explanation of the error in the response body.
If Data API encounters an error after it has started streaming back data as ZIP, it injects an ERROR.err file into the ZIP as the very last entry. This file contains information about the very first error encountered.
If ERROR.err is present, one or more requested items are missing from the ZIP.
pages.zip
  ERROR.err
  ...
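Because a streamed ZIP can arrive with an HTTP success status and still be incomplete, a client may want to check for ERROR.err before trusting the contents. The sketch below assumes the entry sits at the top level of the ZIP under exactly that name and that its content is text; the sample archive and the error message in it are made up for the demonstration.

```python
import io
import zipfile

ERROR_ENTRY_NAME = "ERROR.err"


def find_stream_error(zip_bytes):
    """Return the text of ERROR.err if present, otherwise None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if ERROR_ENTRY_NAME in archive.namelist():
            # The text encoding is an assumption; undecodable bytes are replaced.
            return archive.read(ERROR_ENTRY_NAME).decode("utf-8", errors="replace")
    return None


def make_zip(entries):
    """Build an in-memory ZIP from a list of (name, text) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in entries:
            archive.writestr(name, text)
    return buffer.getvalue()


complete_zip = make_zip([("wordbag.txt", "some words")])
partial_zip = make_zip([("wordbag.txt", "some words"), ("ERROR.err", "sample error text")])

print(find_stream_error(complete_zip))
print(find_stream_error(partial_zip))
```

A non-None result means at least one requested item is missing, so the caller can decide whether to retry or to work with the partial data.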

When and why do I get an error?
For several possible reasons:
- The requested item does not exist (e.g. as the result of a deletion notice)
- The request asks for more items than the policy allows
- The request is invalid (e.g. a malformed volumeID list)
- The backend server is overloaded or has crashed
* A list of possible errors is at the end of the slide deck.

OAuth2 Security Token
- Allows only authorized users to access HTRC data and resources
- Enforces auditing of activities
- Every request to Data API must bear a valid token, sent as the "Authorization" HTTP request header
URL: redentials&client_id= &client_secret=
  oauth2 - OAuth2 token provider
  grant_type=client_credentials - the only grant type we support at the moment
  client_id - your assigned client ID, must be URL encoded
  client_secret - your assigned client secret, must be URL encoded
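The token request parameters can be encoded with the same standard-library tools used for Solr queries. This is a sketch only: the function name is hypothetical, the credential values are dummies chosen to show the encoding of special characters, and the token endpoint URL itself is not shown in these slides and is therefore left out.

```python
from urllib.parse import quote, urlencode


def build_token_query(client_id, client_secret):
    """Return the URL-encoded query string for a client_credentials token request."""
    params = [
        ("grant_type", "client_credentials"),  # the only supported grant type
        ("client_id", client_id),
        ("client_secret", client_secret),
    ]
    # quote_via=quote encodes spaces as %20; other reserved characters become %XX.
    return urlencode(params, quote_via=quote)


# Dummy credentials for illustration only.
token_query = build_token_query("my id", "s3cr3t/+")
print(token_query)
```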

What does OAuth2 return and what do I do with the token?
Our OAuth2 token provider returns a JSON string that looks something like the following:
  {"access_token" : "aaabbbccdd", "expires_in" : "3600"}
The token must be set as the "Authorization" header in the HTTP request to Data API:
  Authorization: Bearer aaabbbccdd
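Turning the token response into a request header takes only a few lines. The sketch below assumes the response has exactly the two fields shown on the slide, with expires_in given as a string of seconds; the function name is hypothetical.

```python
import json


def parse_token_response(response_text):
    """Return (authorization_headers, lifetime_in_seconds) from the token JSON."""
    payload = json.loads(response_text)
    headers = {"Authorization": "Bearer " + payload["access_token"]}
    # The sample response gives expires_in as a string, so convert it.
    lifetime_seconds = int(payload["expires_in"])
    return headers, lifetime_seconds


sample_response = '{"access_token" : "aaabbbccdd", "expires_in" : "3600"}'
auth_headers, lifetime = parse_token_response(sample_response)
print(auth_headers, lifetime)
```

Keeping the lifetime alongside the header lets a client request a fresh token before the current one expires.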

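Flow of obtaining and using OAuth2 tokens
[Sequence diagram. Participants: User, OAuth2 token endpoint, Data API.]
1. User to OAuth2 token endpoint: userID + secret
2. OAuth2 token endpoint to User: access token + expire time
3. User to Data API: request + access token
4. Data API to OAuth2 token endpoint: is token valid?
5. OAuth2 token endpoint to Data API: affirmative
6. Data API to User: requested resources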

Summary
Solr Proxy
- Filters Solr queries to allow only read-only requests
- Always use qt=sharding
- Augmented functionality: retrieve MARC records
- All requests are HTTP GET requests
Data API
- Retrieves volumes and pages as ZIP
- Requires a valid OAuth2 token
- All requests are HTTP POST requests
- Request parameters go into the request body
OAuth2 token endpoint
- Currently supports only grant_type=client_credentials
- Returns JSON
URL encode your parameters!

References
- Solr 3.6 tutorial: files/tutorial.html
- Pairtree Specification
- OAuth2

Questions?

Indexed and stored Solr fields
Field name       Indexed  Stored  Explanation
id               Y        Y       Field for volume ID
ocr              Y        N       OCR field for full-text search
title            Y        Y       Field for book title
rights           Y        Y       Copyright for a volume, e.g. "1" for public domain. Please refer to
author           Y        Y       Author of a volume
authorStr        Y        Y       Author string, not tokenized, to facilitate faceting on author
oclc             Y        Y       The OCLC Control Number
rptnum           Y        Y
sdrnum           Y        Y
isbn             Y        Y       The International Standard Book Number (ISBN) for a volume
issn             Y        Y       The International Standard Serial Number (ISSN) for a volume
callnumber       Y        Y       The call number for a volume
sudoc            Y        Y       Superintendent of Documents number for a volume. Please refer to
language         Y        N       Field for the languages in which a volume is written
htsource         Y        Y       Field for the sources of a volume, e.g. "Indiana University"
era              Y        N       The era of the volume
geographic       Y        N       Brief geographic information for a volume, e.g. "pennsylvania"
country_of_pub   Y        N       The publication country of the book
countryOfPubStr  Y        Y       Country names, not tokenized, for better faceting
topic            Y        N       Topic of the volume
topicStr         Y        Y       Topic string, not tokenized, for better faceting
genre            Y        N       Genre the volume belongs to
publishDate      Y        Y       Publish date of the volume
publisher        Y        Y       Publisher of the volume

Data API error responses
Each entry lists the response status, the response body, and the reason.
400 Bad Request
  Response body: Missing required parameter volumeIDs
  Reason: The request for volumes does not have the volumeIDs query parameter.
400 Bad Request
  Response body: Missing required parameter pageIDs
  Reason: The request for pages does not have the pageIDs query parameter.
400 Bad Request
  Response body: Malformed volume ID list. Offending token: xxx
  Reason: The volumeID list contains tokens that are not valid volumeIDs and cannot be parsed.
400 Bad Request
  Response body: Malformed page ID list. Offending token: xxx
  Reason: The pageID list contains tokens that are not valid pageIDs and cannot be parsed.
400 Bad Request
  Response body: Request too greedy. Request violates Max Volumes Allowed xxx. Offending ID: zzz
  Reason: The request would touch and retrieve more volumes than the policy allows. In the response body, xxx is the limit set by the policy, and zzz is the first volumeID that exceeds the limit. Applicable if the policy is set on the server.
400 Bad Request
  Response body: Request too greedy. Request violates Max Total Pages Allowed xxx. Offending ID: zzz
  Reason: The request would touch and retrieve more pages than the policy allows. In the response body, xxx is the limit set by the policy, and zzz is the first pageID that exceeds the limit. Applicable if the policy is set on the server.
400 Bad Request
  Response body: Request too greedy. Request violates Max Pages Per Volume Allowed xxx. Offending ID: zzz
  Reason: The request would touch and retrieve more pages per volume than the policy allows. In the response body, xxx is the limit set by the policy, and zzz is the first ID that exceeds the limit. Applicable if the policy is set on the server.
404 Not Found
  Response body: Key not found. Offending key: xxx
  Reason: The request asks for a non-existent volumeID, or for a non-existent page sequence of a valid volume (e.g. page 100 of a volume with only 90 pages).
500 Internal Server Error
  Response body: Server too busy.
  Reason: Data API got a timeout exception while trying to retrieve data.
500 Internal Server Error
  Response body: Internal server error.
  Reason: Other exceptions occurred while Data API was trying to retrieve data.