Writing & Deploying Clowder Extractors. Max Burnette, ISDA. June 16, 2016.



- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A

Extractor Pipeline Overview
[Diagram: Clowder (Spaces, Collections, Datasets, Files; Mongo database) publishes to the RabbitMQ message bus through the "clowder" exchange, which routes to per-extractor queues (EXIF queue, Spectral queue, PlantCV queue, ...) consumed by extractor instances (EXIF Extractor, Spectral Extractor, PlantCV Extractor 1, PlantCV Extractor 2, ..., New Extractor).]

1 - Clowder event occurs
- new file uploaded
- file added to / removed from a dataset
- metadata added to a file/dataset
- triggered via UI or API

2 - Event message sent to RMQ
- includes a message type and content, e.g.
  - *.dataset.added
  - *.file.image.#
  - *.dataset.metadata.added
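Routing keys like these follow AMQP topic semantics: * matches exactly one dot-separated word, and # matches zero or more words. A small sketch of that matching logic (illustration only; RabbitMQ performs this matching internally):

```python
def topic_matches(binding, routing_key):
    """Check an AMQP-style topic binding against a routing key.
    '*' matches exactly one word; '#' matches zero or more words."""
    def match(bind, key):
        if not bind:
            return not key                       # both exhausted -> match
        if bind[0] == "#":
            # '#' may consume zero or more remaining words
            return any(match(bind[1:], key[i:]) for i in range(len(key) + 1))
        if key and (bind[0] == "*" or bind[0] == key[0]):
            return match(bind[1:], key[1:])      # consume one word
        return False
    return match(binding.split("."), routing_key.split("."))

# A binding like "*.file.image.#" catches image-file events regardless of prefix
print(topic_matches("*.file.image.#", "clowder.file.image.png"))   # True
print(topic_matches("*.file.image.#", "clowder.file.text.added"))  # False
```

This is why one extractor queue can listen for "a particular kind of message" while ignoring everything else on the exchange.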

3 - RMQ routes message to queue
- each queue corresponds to one named extractor (but multiple instances of an extractor can coexist)
- each queue listens for a particular kind of message

4 - Extractors listen to their queue
- multiple instances of an extractor can share one queue - scalable!
- messages accumulate in the queue until an extractor comes along to handle each message

5 - Extractor handles next message
- fetch the next message in the queue and potentially process based on it
- the message can include file IDs, dataset IDs, and other information
*PyClowder will make this easier!

6 - Send info back to Clowder
- outputs from the extractor, such as metadata and derived files, can then be sent back into Clowder
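Taken together, steps 1-6 boil down to a consume/process/acknowledge loop. A toy simulation with an in-memory queue (a sketch only; a real extractor consumes from RabbitMQ, and the message fields and handler shown here are illustrative):

```python
import queue

def run_extractor(messages, handler):
    """Drain a queue, process each message, and 'ack' every handled message."""
    results, acked = [], []
    while not messages.empty():
        msg = messages.get()            # step 5: fetch the next message
        output = handler(msg)           # process it
        if output is not None:
            results.append(output)      # step 6: would be sent back to Clowder
        acked.append(msg["id"])         # tell the broker the message is handled
    return results, acked

# Toy handler: only act on text files, mimicking an extractor's relevance check
def wordcount_handler(msg):
    if not msg["filename"].endswith(".txt"):
        return None
    return {"file_id": msg["id"], "words": len(msg["content"].split())}

q = queue.Queue()
q.put({"id": "f1", "filename": "notes.txt", "content": "hello clowder world"})
q.put({"id": "f2", "filename": "photo.png", "content": ""})
results, acked = run_extractor(q, wordcount_handler)
print(results)  # [{'file_id': 'f1', 'words': 3}]
print(acked)    # ['f1', 'f2']
```

Note that irrelevant messages are still acknowledged; otherwise they would sit in the queue and be redelivered.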


Required Software: Clowder, MongoDB, RabbitMQ, Java, PyClowder

Two installation approaches:
- Manual (install things individually, run them individually)
  - PROS: under-the-hood access to everything
  - CONS: more work
- Docker (start up the entire stack at once)
  - PROS: easy
  - CONS: things are slightly more obfuscated

Docker is the recommended option for those who will primarily be developing extractors.

Required Software (Manual)
- MongoDB: ./bin/mongod
- RabbitMQ: ./sbin/rabbitmq-server
- Clowder (requires Java): ./sbt run
- PyClowder: python setup.py install

Clowder Configuration (Manual)

Enabling RabbitMQ communication: the plugin is disabled by default in the conf/play.plugins configuration file. Override by creating a custom/play.plugins file and adding this line:

    9992:services.RabbitmqPlugin

Running the application:

    ./sbt run


Required Software (Docker)

In the Docker Quickstart Terminal:

    > docker-compose up
    > docker-machine ip

Clowder is then available at <docker-machine ip>:9000.

Docker Terminal

If you don't want to start a new Docker Quickstart Terminal each time, add this line to your profile (e.g. ~/.bash_profile):

    if which docker-machine >/dev/null; then eval "$(docker-machine env default)"; fi

This should make all subsequent new terminal sessions Docker-aware.

Clowder Configuration

Clowder should now be accessible at localhost:9000 (or <docker-machine ip>:9000 for Docker).

Clowder Configuration

Creating a local account: initially, Clowder will have no accounts and no configured mail server. To create an account:
1. Sign up for an account inside Clowder
2. The confirmation email will not actually be sent; instead, the activation URL will appear in the Clowder terminal output (use docker ps to find the container, then docker logs <container>)
3. Copy that URL into your browser to activate


Running a Sample Extractor

The wordcount extractor in the public PyClowder repository is a simple example that will process incoming text files and add metadata describing the content.
1. Navigate to pyclowder/sample-extractors/wordcount
   - if using Docker, change the RabbitMQ URL in config.py to the correct IP
   - python wordcount.py
2. Create a new Clowder dataset and upload a .txt file
   - Datasets > Create
   - Select Files > Upload
3. Check extractor output
4. Verify file metadata
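The heart of a wordcount-style extractor is tiny. A sketch of what the counting step might look like (illustrative only, not the actual repository code; the metadata key names are assumptions):

```python
def count_text(text):
    """Compute simple statistics to attach as file metadata."""
    lines = text.splitlines()
    words = text.split()
    return {"lines": len(lines), "words": len(words), "characters": len(text)}

print(count_text("hello clowder\nextractors are fun"))
# {'lines': 2, 'words': 5, 'characters': 32}
```

Everything else in the sample (connecting to RabbitMQ, downloading the file, posting the result) is handled by the PyClowder plumbing described in the next section.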


Extractor Basic Design

1. Connect with RabbitMQ. Check your extractor's queue for new messages.
2. Evaluate the message. Is the file relevant based on filename? Do you need to check metadata? If writing a dataset extractor, are all required files available? You can use the Clowder API to fetch more details about files/datasets if necessary.
3. Process the message. If the message is relevant, perform the key operations for your extractor. With PyClowder you will have the files available in a temporary location. In other languages you may need to download files or datasets manually if you need them.
4. Upload output data. Add new files to Clowder datasets, upload metadata, etc.
5. Notify RabbitMQ. Tell RabbitMQ the message is handled, so it doesn't get repeated.

Extractor Basic Design

1. Connect with RabbitMQ. 2. Evaluate the message. 3. Process the message. 4. Upload output data. 5. Notify RabbitMQ.

These steps can be done in any language that supports:
- HTTP requests
- JSON parsing
- RabbitMQ interaction (Java, .NET, Ruby, PHP, C++, Perl, and more)

We have created the PyClowder wrapper library for Python to simplify these steps. Absent other language requirements, this is the easiest path.
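As an illustration of the "HTTP requests + JSON" half, here is how a non-PyClowder extractor might build the request that posts metadata back to Clowder (a sketch: the /api/files/<id>/metadata endpoint path and the key query parameter are assumptions based on typical Clowder usage, not confirmed API):

```python
import json
from urllib import request

def metadata_request(clowder_url, file_id, metadata, api_key):
    """Build (but do not send) the POST that would attach metadata to a file.
    The endpoint path and 'key' parameter are assumptions, not confirmed API."""
    url = "%s/api/files/%s/metadata?key=%s" % (clowder_url, file_id, api_key)
    body = json.dumps(metadata).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = metadata_request("http://localhost:9000", "abc123",
                       {"words": 42}, "MY_API_KEY")
print(req.get_full_url())
# http://localhost:9000/api/files/abc123/metadata?key=MY_API_KEY
```

In a real extractor you would pass the request to urllib.request.urlopen() (or use a library like requests) after the processing step succeeds.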


PyClowder Library

1. Connect with RabbitMQ. -> connect_message_bus()
2. Evaluate the message. -> check_message()
3. Process the message. -> process_file()
4. Upload output data. / 5. Notify RabbitMQ. -> various utilities

config.py: define extractor name, file type, URLs, etc.

YOUR_SCRIPT.py:

    import extractors

    def main():
        setup()
        connect_message_bus()

    def check_message():
        ...

    def process_file():
        ...

sample: wordcount.py - config.py

config.py defines the extractor name, file type, URLs, etc. (when running under Docker, the RabbitMQ URL must use the Docker IP address). This extractor can be found in the public pyclowder repository; we will use it as our test extractor once the local testing environment is set up.

sample: wordcount.py - YOUR_SCRIPT.py

    import extractors

    def main():
        setup()
        connect_message_bus()

    def check_message():
        ...

    def process_file():
        ...


Writing an Extractor

1. Connect with RabbitMQ. -> connect_message_bus()
2. Evaluate the message. -> check_message()
3. Process the message. -> process_file()
4. Upload output data. / 5. Notify RabbitMQ. -> various utilities

config.py: define extractor name, file type, URLs, etc.
YOUR_SCRIPT.py: import extractors; main() calls setup() and connect_message_bus(); then define check_message() and process_file().

Inputs & Outputs

main()
- set up logging & set globals: setup()
- connect to the message bus: connect_message_bus()

check_message(parameters)
- evaluate the contents of the message, e.g. the list of file(s) that were added
- if you need access to the files, return True
  - this will download the files and pass pointers to process_file()
- if not, return "bypass"
  - this will pass along to process_file() without downloading files first
- if the message is irrelevant, return False

process_file(parameters)
- the file(s) themselves are directly accessible here (unless bypassed)
- metadata is also available
- for dataset extractors, files will automatically be unzipped for easy processing
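Putting check_message() and process_file() together, the decision logic might be sketched like this (pure-Python illustration only: a real extractor registers these callbacks through PyClowder's setup() and connect_message_bus(), and process_file() receives a temp file path rather than an inline "content" field):

```python
def check_message(parameters):
    """Decide whether this message is relevant and whether files are needed.
    Return True to download files, "bypass" to skip the download, False to ignore."""
    filename = parameters.get("filename", "")
    if filename.endswith(".txt"):
        return True    # we need the file contents, so ask for a download
    return False       # irrelevant message: ignore it

def process_file(parameters):
    """Compute metadata for a downloaded file. The 'content' field here is a
    stand-in for reading the temp file PyClowder would provide."""
    text = parameters["content"]
    return {"lines": len(text.splitlines()), "words": len(text.split())}

msg = {"filename": "notes.txt", "content": "one two\nthree"}
if check_message(msg) is True:
    print(process_file(msg))   # {'lines': 2, 'words': 3}
```

The three-way return value of check_message() is what lets an extractor skip expensive downloads when it only needs the message itself.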

Testing Extractors

1. Run your extractor script
2. In Clowder, upload a file of the appropriate type
3. Watch the script output
4. Verify that the desired metadata/new files are correctly being uploaded

For development, liberal use of logging is helpful if things are behaving unexpectedly. Call print(parameters) in check_message() and process_file() to make sure you are getting the information you expect.

Testing Extractors

If you aren't sure what message to use, you can set up a generic queue to test:
- Go to the RabbitMQ Management console at localhost:15672 (or <docker-machine ip>:15672 for Docker)
- Go to Queues, enter a name, and click Add queue
- In the queue, under Bindings, add routing key *.# from the clowder exchange
- Now all messages will arrive in this queue

Pending Additions

These are in the final stages of approval before deployment. Once merged, you may need to reinstall the newer version of PyClowder, or download and restart Clowder, respectively.

PyClowder
- Pull Request 13 will add PyClowder support for dataset extractors (including a sample extractor). (requests/13/overview)
  - Clone the git repo, switch to this branch, and reinstall if you want to use it immediately.

Clowder
- Pull Request 880 will add support for "metadata added" events (i.e. trigger extractors on metadata updates):
  https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/880/overview
- Pull Request 899 will add API support for maintaining only one instance of metadata per extractor, per file:
  https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/899/overview

- Extractor Pipeline Overview
- Setting up a test environment (Installing required software, Running sample extractor)
- Extractor Basic Design (PyClowder library)
- Writing an extractor (Handling inputs and outputs)
- Testing extractors
- Q/A

Max Burnette