Writing & Deploying Clowder Extractors
Max Burnette, ISDA
June 16, 2016
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Extractor Pipeline Overview
[Architecture diagram: Clowder (Spaces, Collections, Datasets, Files, backed by a Mongo database) publishes to the RabbitMQ message bus; the clowder exchange routes messages into per-extractor queues (EXIF queue, Spectral queue, PlantCV queue, ...), each consumed by one or more extractor instances (EXIF Extractor, Spectral Extractor, PlantCV Extractor 1, PlantCV Extractor 2, ...). A new extractor plugs into the pipeline the same way.]
Step 1 - A Clowder event occurs:
- new file uploaded
- file added to / removed from a dataset
- metadata added to a file or dataset
- triggered manually via UI or API
Step 2 - An event message is sent to RabbitMQ. The message includes a type (routing key) and content, e.g.:
- *.dataset.added
- *.file.image.#
- *.dataset.metadata.added
Step 3 - RabbitMQ routes the message to a queue:
- each queue corresponds to one named extractor, though multiple instances of that extractor can coexist
- each queue listens for a particular kind of message
Step 4 - Extractors listen to their queue:
- multiple instances of an extractor can share one queue, so processing is scalable
- messages accumulate in the queue until an extractor comes along to handle each one
Step 5 - The extractor handles the next message:
- it fetches the next message in its queue and decides whether to process based on it
- the message can include file IDs, dataset IDs, and other information
- PyClowder will make this step easier
Step 6 - Send info back to Clowder:
- outputs from the extractor, such as metadata and derived files, can then be sent back into Clowder
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Required Software
- Clowder (https://opensource.ncsa.illinois.edu/confluence/display/CATS/Clowder+Requirements)
- MongoDB
- RabbitMQ
- Java
- PyClowder

Two installation approaches:
- Manual (install and run each component individually)
  - PROS: under-the-hood access to everything
  - CONS: more work
- Docker (start the entire stack at once)
  - PROS: easy
  - CONS: things are slightly more obfuscated
Docker is the recommended option for those who will primarily be developing extractors.
Required Software (Manual)
(requirements: https://opensource.ncsa.illinois.edu/confluence/display/CATS/Clowder+Requirements)
- MongoDB: ./bin/mongod
- RabbitMQ: ./sbin/rabbitmq-server
- Clowder (requires Java): ./sbt run
- PyClowder: python setup.py install
Clowder Configuration (Manual)
Enabling RabbitMQ communication: the plugin is disabled by default in the conf/play.plugins configuration file. Override it by creating a custom/play.plugins file and adding this line:
9992:services.RabbitmqPlugin
Running the application:
> ./sbt run
Required Software (Docker)
(setup guide: https://opensource.ncsa.illinois.edu/bitbucket/projects/BD/repos/bd-extractor-template/browse/README.md)
Docker starts the whole stack (Clowder, MongoDB, RabbitMQ, Java, PyClowder) at once via docker-compose:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/browse/docker-compose.yml
From a Docker Quickstart Terminal:
> docker-compose up
> docker-machine ip
Clowder is then available at <docker-machine ip>:9000.
Docker Terminal
If you don't want to start a new Docker Quickstart Terminal each time, add these lines to your profile (e.g. ~/.bash_profile):

if which docker-machine >/dev/null; then
  eval "$(docker-machine env default)"
fi

This makes all subsequent new terminal sessions Docker-aware.
Clowder Configuration
Clowder should now be accessible at localhost:9000 (or <docker-machine ip>:9000 for Docker).
Clowder Configuration
Creating a local account: initially, Clowder has no accounts and no configured email server. To create an account:
1. Sign up for an account inside Clowder.
2. The activation email will not actually be sent; instead it appears in the Clowder terminal output (with Docker: > docker ps to find the container, then > docker logs <container>).
3. Copy the activation URL into your browser to activate the account.
Running a Sample Extractor
The wordcount extractor in the PyClowder repository is a simple example that processes incoming text files and adds metadata describing their content. It can be found in the public PyClowder repository:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/browse
1. Navigate to pyclowder/sample-extractors/wordcount (if using Docker, change the RabbitMQ URL in config.py to the correct IP) and run:
> python wordcount.py
Running a Sample Extractor
1. Navigate to pyclowder/sample-extractors/wordcount (if using Docker, change the RabbitMQ URL in config.py to the correct IP) and run:
> python wordcount.py
2. Create a new Clowder dataset and upload a .txt file (Datasets > Create, then Select Files > Upload).
3. Check the extractor output.
4. Verify the file metadata.
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Extractor Basic Design
1. Connect with RabbitMQ. Check your extractor's queue for new messages.
2. Evaluate the message. Is the file relevant based on filename? Do you need to check metadata? If writing a dataset extractor, are all required files available? You can use the Clowder API to fetch more details about files/datasets if necessary.
3. Process the message. If the message is relevant, perform the key operations for your extractor. With PyClowder the files are available in a temporary location; in other languages you may need to download files or datasets manually if you need them.
4. Upload output data. Add new files to Clowder datasets, upload metadata, etc.
5. Notify RabbitMQ. Acknowledge the message so it is not delivered again.
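Steps 2 through 4 can be sketched as pure logic, independent of any message bus. All names below are illustrative only, not the PyClowder API:

```python
def evaluate_message(message):
    """Step 2 (sketch): is this message relevant? Here, only .txt files are."""
    return message.get("filename", "").lower().endswith(".txt")

def handle_message(message, file_contents):
    """Steps 3-4 (sketch): process a relevant message and return the metadata
    that would be uploaded back to Clowder; None means 'just acknowledge'."""
    if not evaluate_message(message):
        return None  # irrelevant: step 5 would simply acknowledge the message
    return {
        "file_id": message["id"],
        "metadata": {"characters": len(file_contents)},
    }
```

For example, handle_message({"id": "abc", "filename": "notes.txt"}, "hello") returns {"file_id": "abc", "metadata": {"characters": 5}}, while a .png message returns None.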
Extractor Basic Design
These five steps can be done in any language that supports:
- HTTP requests
- JSON parsing
- RabbitMQ interaction (https://www.rabbitmq.com/devtools.html): Java, .NET, Ruby, PHP, C++, Perl, and more
We have created the PyClowder wrapper library for Python to simplify this. Absent other language requirements, this is the easiest path.
PyClowder Library
PyClowder maps the five steps onto library pieces: connect_message_bus() connects with RabbitMQ, check_message() evaluates the message, process_file() processes it, and various utilities help upload output data and notify RabbitMQ.
PyClowder Library
An extractor built on PyClowder consists of two files:

config.py - defines the extractor name, file type, URLs, etc.

YOUR_SCRIPT.py - follows this skeleton:

import extractors

def main():
    setup()
    connect_message_bus()

def check_message():
    ...

def process_file():
    ...
Sample: wordcount.py
config.py defines the extractor name, file type, URLs, etc.; when running under Docker, the RabbitMQ URL must use the Docker IP address. This extractor can be found in the public PyClowder repository, and we will use it as our test extractor:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/browse
Sample: wordcount.py
wordcount.py follows the same skeleton: it imports extractors, main() calls setup() and connect_message_bus(), and check_message() / process_file() hold the extractor logic.
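The core computation of a wordcount-style process_file() can be sketched with no Clowder dependencies at all (count_text is an illustrative name, not part of the sample extractor):

```python
def count_text(text):
    """Build wordcount-style metadata for the contents of a text file."""
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }
```

For example, count_text("hello world\nsecond line") returns {"lines": 2, "words": 4, "characters": 23}; the real extractor would attach such a dictionary to the file as metadata.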
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Writing an Extractor
Writing your own extractor starts from the same PyClowder skeleton: config.py defines the extractor name, file type, and URLs, and YOUR_SCRIPT.py implements main(), check_message(), and process_file().
Inputs & Outputs
main()
- set up logging & set globals: setup()
- connect to the message bus: connect_message_bus()
check_message(parameters)
- evaluate the contents of the message, e.g. the list of file(s) that were added
- if you need access to the files, return True: the files are downloaded and pointers are passed to process_file()
- if not, return "bypass": the message is passed along to process_file() without downloading files first
- if the message is irrelevant, return False
process_file(parameters)
- the file(s) themselves are directly accessible here (unless bypassed)
- metadata is also available
- for dataset extractors, files are automatically unzipped for easy processing
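As a sketch, a check_message() implementing the three return values might look like this (the parameter keys "filename" and "metadata_only" are assumptions for illustration, not a documented message schema):

```python
def check_message(parameters):
    filename = parameters.get("filename", "")
    # Irrelevant file types: tell PyClowder to skip this message entirely.
    if not filename.lower().endswith((".jpg", ".png")):
        return False
    # If only metadata is needed, skip the download and go straight to process_file().
    if parameters.get("metadata_only"):
        return "bypass"
    # Otherwise download the file(s) and pass pointers to process_file().
    return True
```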
Testing Extractors
1. Run your extractor script.
2. In Clowder, upload a file of the appropriate type.
3. Watch the script output.
4. Verify that the desired metadata/new files are correctly being uploaded.
During development, liberal use of logging helps when things behave unexpectedly: print(parameters) in check_message() and process_file() to make sure you are getting the information you expect.
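One way to do this is to dump the incoming parameters at debug level rather than with bare print() calls (a sketch; the shape of the parameters dict is whatever your messages actually carry):

```python
import json
import logging

logging.basicConfig(level=logging.DEBUG)

def check_message(parameters):
    # Pretty-print the whole message so you can see exactly what arrives.
    logging.debug("check_message got:\n%s",
                  json.dumps(parameters, indent=2, default=str))
    return True
```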
Testing Extractors
If you aren't sure what message type to use, you can set up a generic queue to test:
- Go to the RabbitMQ Management console: http://localhost:15672/#/ (or <docker-machine ip>:15672 for Docker)
- Go to Queues, enter a name, and click Add queue
- In the queue, under Bindings, add routing key *.# from the clowder exchange
- Now all messages will arrive in this queue.
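The routing key *.# catches everything because of AMQP topic semantics: * matches exactly one dot-separated word and # matches zero or more words. A small matcher illustrating those rules (a simplified sketch, not RabbitMQ's actual implementation):

```python
def topic_match(pattern, key):
    """Simplified AMQP topic matching: '*' = one word, '#' = zero or more words."""
    return _match(pattern.split("."), key.split("."))

def _match(pat, words):
    if not pat:
        return not words
    head, rest = pat[0], pat[1:]
    if head == "#":
        # '#' may consume zero or more of the remaining words.
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return _match(rest, words[1:])
    return False
```

So topic_match("*.#", "clowder.file.image.added") is True, which is why the generic queue sees every message on the exchange.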
Pending Additions
These are in the final stages of approval before deployment. Once merged, you may need to reinstall the newer version of PyClowder or download and restart Clowder, respectively.

PyClowder
- Pull Request 13 - adds PyClowder support for dataset extractors (including a sample extractor): https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/pull-requests/13/overview
  (clone the git repo, switch to this branch, and reinstall if you want to use it immediately)

Clowder
- Pull Request 880 - adds support for "metadata added" events (i.e. trigger extractors on metadata updates): https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/880/overview
- Pull Request 899 - adds API support for maintaining only one instance of metadata per extractor, per file: https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/899/overview
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A

Max Burnette
mburnet2@illinois.edu