
1. SCAPE PC Integration Plan
First SCAPE Developers' Workshop (SCAPEdev1)
Andy Jackson, The British Library
AIT, Vienna, 6th-7th June 2011

2. The Challenges
- Reproducible tool invocation across all contexts: CLI, Java, SOAP and REST
- Interoperable data formats and consistent semantics across contexts where required, so clients can call tools and understand outputs correctly
- Ease of development and deployment

3. Preservation Components (Tools)
- Characterization Components: "Tell me about this digital object..."
  - Does it contain known preservation risks?
  - Is it valid by the spec? Or my profile spec?
- Preservation Action Components: "Transform this digital object into this format..."
  - Repair links, or remove preservation risks
- Quality Assurance Components: "Assess the differences between these two objects..."
  - Assess the Preservation Actions

4. Tool Integration Roadmap
- Year one: focus on getting the Testbeds going
  - Deploying ad-hoc WSDL/SOAP web services while learning what we need
- Year two: start using Tool Specifications
  - Define how to run tools (CLI or pure Java), invoked locally or as RESTful services
- Year three: we'll see, based on year two

5. Year One Plan
- Taverna for the Testbeds
  - Loose integration via WSDL/SOAP
  - XML inputs and outputs can be managed easily
  - Workflows work without installing anything else
- ONB is hosting services at the moment
  - Deployed via Sven's Axis2 wrapper (on GitHub)
  - Not so pretty, but it works: exports parameters as ports so Taverna can show them
- Planning ahead
  - Working with Taverna External Tools and Components
  - Building shareable tool specifications

6. Tool Specifications
- A simple XML definition specifies the tool and how to invoke it to perform particular actions
- Based on the Taverna External Tools plugin:
  http://www.mygrid.org.uk/dev/wiki/display/developer/Calling+external+commands+from+Taverna

  <program name="bourne_shell_script" description="execute shell script" command="/bin/sh input">
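
The slide only shows the opening <program> element. A fuller specification might look like the sketch below; the <input> and <output> child elements are hypothetical illustrations of the idea, not the actual Taverna External Tools schema.

```xml
<!-- Hypothetical sketch: the child elements are illustrative, not the real schema -->
<program name="bourne_shell_script"
         description="execute shell script"
         command="/bin/sh input">
  <!-- the script text is delivered as a file named 'input' in the working directory -->
  <input name="input" type="file"/>
  <!-- whatever the script prints becomes the tool's output port -->
  <output name="stdout" type="text"/>
</program>
```

The point of keeping the spec this small is that it can be emailed or pushed to GitHub, then executed unchanged by any launcher that understands the format.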

7. Re-using Tool Specifications
- Makes sharing tool specs easy: email a tool spec to a colleague, share it on GitHub, etc.
- Allows more reproducible tool invocation: invoked from Taverna, the CLI or via REST through shared 'launcher' code (we only write the wrapper once)
- The invoker can add performance metrics automatically, and could add optional deep process analysis
- Process results can be shared

8. Interoperability: Tools & Components
- Taverna wants to standardize Components: hot-swap different implementations of the same action
- Planets had standardized actions with WSDL-based interface definitions
  - Too high-level: local first, please!
  - Too complex and non-extensible; tool wrapping was hard
- Let's bring the two together

9. CLI and Java Interfaces
- Extensible Java method signatures and CLI templates
  - e.g. Identify must accept at least a digital object, and return at least a URI
  - Extra parameters may exist, but must have sane defaults
- More flexible than in Planets, but tight enough that clients can call easily
- Coded for local data and/or streams: more constrained than 'vanilla' Taverna use
- Should align with Taverna Component efforts
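
A minimal Java sketch of what such an extensible signature could look like. The interface name, the parameter map, and the toy implementation are assumptions for illustration, not the actual SCAPE API:

```java
import java.net.URI;
import java.nio.file.Path;
import java.util.Map;

// Sketch of an extensible Identify contract (all names hypothetical).
interface Identify {
    // Minimal contract: a digital object in, a format URI out.
    // Extra, tool-specific parameters travel in the map.
    URI identify(Path digitalObject, Map<String, String> parameters);

    // Extra parameters must have sane defaults, so a client can call
    // the tool without knowing any of them.
    default URI identify(Path digitalObject) {
        return identify(digitalObject, Map.of());
    }
}

class ExtensionIdentify implements Identify {
    // Toy implementation: map the file extension to a made-up format URI.
    public URI identify(Path digitalObject, Map<String, String> parameters) {
        String name = digitalObject.getFileName().toString();
        String ext = name.contains(".")
                ? name.substring(name.lastIndexOf('.') + 1)
                : "unknown";
        return URI.create("http://example.org/format/" + ext);
    }
}
```

The default method is what keeps the interface "tight enough that clients can call easily": every implementation is callable with just the digital object.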

10. Standard, Extensible Interfaces
- Standard processes may include:
  - Identify, Characterize, Validate
  - Migrate/Transform/Convert
  - Compare, Assess
- We should document the logic on the SCAPE wiki
- Deployment helpers wrap this up to deploy in different contexts:
  - CLI invoker for local development and testing
  - JAX-RS RESTful service mapping
  - Also wrap benchmarking code around invocation
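
The deployment-helper idea can be sketched as a tiny CLI invoker that looks operations up by name and wraps benchmarking around every call. The operation names follow the slide; the class name and the placeholder handler bodies are hypothetical:

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of a CLI invoker: one shared launcher dispatches to whichever
// standard operation was requested, so each tool is wrapped only once.
class CliInvoker {
    // Placeholder handlers; real ones would call the wrapped tool.
    static final Map<String, Function<String, String>> OPERATIONS = Map.of(
            "identify", src -> "format-uri-for:" + src,
            "characterize", src -> "properties-of:" + src,
            "validate", src -> "validation-report-for:" + src);

    static String invoke(String operation, String source) {
        Function<String, String> op = OPERATIONS.get(operation);
        if (op == null) {
            throw new IllegalArgumentException("Unknown operation: " + operation);
        }
        long start = System.nanoTime();
        String result = op.apply(source);
        // Benchmarking is wrapped around the invocation, as on the slide.
        System.err.printf("%s took %d ns%n", operation, System.nanoTime() - start);
        return result;
    }
}
```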

11. Interoperability: Data Handling
- Planets defaulted to pass-by-value: cumbersome and brittle
- SCAPE will default to pass-by-reference (URI)
  - Leverage URI schemes to delegate issues like encoding and authentication to the transport layer
  - More modular design, leveraging standard transports
- Java/CLI will expect local files or streams; a wrapper layer handles retrieving items via URI
  - Separation of concerns: the wrapper could support e.g. HTTP(S), SMB/CIFS, FTP, SFTP/SCP, HBase URIs, etc.
- May modify or re-use JHOVE2 Source/Input objects
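
The wrapper layer can be sketched as below, assuming a hypothetical UriResolver class: the tool itself only ever sees a local stream, and this layer decides how each URI scheme is fetched. Only file: is handled here; a real wrapper would add the other transports:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the pass-by-reference wrapper (name hypothetical).
class UriResolver {
    static InputStream open(URI reference) throws IOException {
        if ("file".equals(reference.getScheme())) {
            // Local file: hand the tool a plain stream.
            return Files.newInputStream(Path.of(reference));
        }
        // HTTP(S), SFTP, HBase etc. would be delegated to transport-specific
        // handlers here; unsupported schemes fail loudly.
        throw new IOException("Unsupported URI scheme: " + reference.getScheme());
    }
}
```

This is the separation of concerns the slide describes: encoding and authentication live in the transport handler behind the URI, not in the tool.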

12. Interoperability: Data Formats
- The required arguments passed to tools will be standardized via Java/JAXB
  - e.g. a JHOVE2 property tree as the Characterization result, mapped to and from XML
- Some other concepts will also need standardization:
  - Service description for discovery (WADL?)
  - A declaration for the optional arguments
  - Format identifiers for supported input/output formats
- Passed through the TCC to review and disseminate
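
The property-tree-to-XML idea can be illustrated with a toy model. The real JHOVE2 model and a JAXB mapping are far richer; every name below is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a characterization property tree: each property has a
// URI identifier, an optional scalar value, and child properties.
class Property {
    final String uri;    // properties are identified by URI
    final String value;  // scalar value, or null for a container node
    final List<Property> children = new ArrayList<>();

    Property(String uri, String value) {
        this.uri = uri;
        this.value = value;
    }

    Property add(Property child) {
        children.add(child);
        return this;
    }

    // Hand-rolled serialization standing in for a JAXB mapping.
    String toXml() {
        StringBuilder sb = new StringBuilder();
        sb.append("<property uri=\"").append(uri).append('"');
        if (value != null) sb.append(" value=\"").append(value).append('"');
        if (children.isEmpty()) {
            sb.append("/>");
        } else {
            sb.append('>');
            for (Property c : children) sb.append(c.toXml());
            sb.append("</property>");
        }
        return sb.toString();
    }
}
```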

13. Interoperability: Sharing Concepts
- Common concepts shared on the SCAPE wiki
  - Tool interface definitions and data definitions, both linked to the source code and headed for the JavaDoc
- A SCAPE/OPF Registry
  - First understand what we really need for tool discovery and use, based on the initial integration plan; then mix in wider integration issues
  - Define only format identifiers, or do more?
  - Track and merge with the UDFR effort? Now or later?

14. CC Development Plan
- Develop FITS, DROID, file, etc.
  - For identification (including conflict resolution via FITS) and brief characterization
  - These do not support compound objects well
- Develop JHOVE2 modules
  - For deep characterization, profile analysis, etc.
  - Supports compound objects
- FITS as a JHOVE2 identification module?

15. CC Integrated Deployment
- CLI: FITS and JHOVE2 have CLI interfaces; wrap them as Tool Specs
- REST API: source URI in, properties out
- Property data to follow the JHOVE2 form
  - e.g. normalize output using the JHOVE2 property language
  - Properties have URIs, so an RDF approach is compatible

16. CC Validation Interface
- Format/profile validation: re-use the JHOVE2 assessment language for profile validation, if appropriate
- RESTful version: if we need Validation over REST, consider re-using the W3C Unicorn Validator interface
  http://code.w3.org/unicorn/wiki/Documentation/Observer

17. PA Integration Plan
- Develop as standalone tools: improving existing tools or making new ones
- Initially web services, as Sven has been doing
- CLI: wrap the standalone tool in a Tool Spec, which specifies input and output formats etc.
- REST: use a src parameter to pass the input and create a new resource
  - Return the result alone, or with a report, via content negotiation
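
The REST pattern above (src parameter in, new resource out, content negotiation for the report) might be exercised from a client like this. The endpoint URL is hypothetical, and the request is only built here, never sent:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a client building a migration request (names hypothetical).
class MigrateRequest {
    static HttpRequest build(String endpoint, URI src) {
        return HttpRequest.newBuilder()
                // The input is passed by reference via the src parameter.
                .uri(URI.create(endpoint + "?src=" + src))
                // Content negotiation: ask for the result with/without report.
                .header("Accept", "application/xml")
                // POST, because the service creates a new resource.
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```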

18. QA Integration Plan
- Develop standalone tools: improving existing tools or making new ones
- Re-use the JHOVE2 property language for comparative properties
- Re-use the JHOVE2 assessment language for profile validation?
- RESTful Compare interface: two URIs in (src1 and src2), properties out, re-using the JHOVE2 model
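
The Compare contract can be sketched with flat property maps standing in for the properties extracted from src1 and src2; the result here is simply the set of property names whose values differ. A real component would reuse the JHOVE2 property model rather than a flat map:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of a Compare component (names and shapes hypothetical).
class Compare {
    static Set<String> differingProperties(Map<String, String> a, Map<String, String> b) {
        // Consider every property name seen on either side.
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        Set<String> diff = new TreeSet<>();
        for (String k : keys) {
            // Objects.equals handles properties missing on one side (null).
            if (!java.util.Objects.equals(a.get(k), b.get(k))) diff.add(k);
        }
        return diff;
    }
}
```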

19. Repository Integration
- Some initial ideas

20. SCAPE Platform Repository Integration
- Given an existing repository of content, how do we process items on Hadoop?
- Two examples from the New York Times
- Three initial proposals

21. Hadoop & The New York Times
- 4 TB of TIFFs + OCR converted to 1.5 TB of PDFs
- 11 million articles in 24 hours on 100 EC2 nodes
- They found a problem, but EC2 is cheap enough that they could afford to run it twice
- Tools: JetS3t (open-source Java toolkit for S3), the iText PDF library, the Java Advanced Imaging package
- http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

22. NYT Project 2
"Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFF's. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files — all of it ready to be assembled into a TimesMachine. By leveraging the power of AWS and Hadoop, we were able to utilize hundreds of machines concurrently and process all the data in less than 36 hours."
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
http://open.blogs.nytimes.com/tag/hadoop/

23. PROPOSAL 1: Repository Caching Cluster
- Workflow driven from Hadoop:
  - The user passes a list of references to content in the repository
  - Hadoop downloads each item into HBase, returning an HBase URI
  - Hadoop processes the item as required, using the repository API to post any results back to the repository
  - The item remains cached in HBase until it is needed again; old items get bumped out if space runs low
- This would suit the BL's Digital Library System: the storage architecture is decoupled from processing
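
The "old items get bumped out if space runs low" behaviour is classic least-recently-used eviction, sketched here with an access-ordered LinkedHashMap standing in for the HBase cache; the class name and fixed capacity are illustrative only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the caching cluster's eviction policy (names hypothetical):
// once the cache of repository items is full, the least recently used
// item is bumped out to make room.
class ItemCache<V> extends LinkedHashMap<String, V> {
    private final int capacity;

    ItemCache(int capacity) {
        super(16, 0.75f, true);  // true = access order, i.e. LRU behaviour
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Evict the least recently used item when over capacity.
        return size() > capacity;
    }
}
```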

24. PROPOSAL 2: Preservation Service Farm
- The repository drives the workflow, but needs to invoke services on lots of content; the underlying tools may have varied OS needs
- The SCAPE Platform could spin up machines as needed, each providing RESTful endpoints that the repository can call: simple services or full workflows
- The repository POSTs data to the cluster and pulls the result back again
- Requires a complex all-in-one repository system, including workflow engines and triggers

25. PROPOSAL 3: Run the Repository on HBase
- A more radical but powerful option is to run the repository system on top of HBase
- Very scalable, with powerful content analysis and processing
- But hard work if the repository expects a traditional database

26. Development Infrastructure
- Working together

27. TCC, Calls etc.
- Mailing list: techie@list.scape-project.eu
- Record what we are working on in the wiki, so it is clear which codebases we are improving: http://wiki.opf-labs.org/display/SP/
- See the Developers' Guide
- Build Manager (IM)
- System Manager and central cluster (IM)

