Writing & Deploying Clowder Extractors
Max Burnette, ISDA
June 16, 2016
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Extractor Pipeline Overview
[Architecture diagram: Clowder (Spaces, Collections, Datasets, Files, backed by a Mongo database) publishes to the RabbitMQ message bus; the clowder exchange routes messages into per-extractor queues (EXIF queue, Spectral queue, PlantCV queue, ...), each consumed by one or more extractor instances (EXIF Extractor, Spectral Extractor, PlantCV Extractor 1, PlantCV Extractor 2, ...). A new extractor plugs into the pipeline the same way.]
Step 1 - A Clowder event occurs:
- new file uploaded
- file added to / removed from a dataset
- metadata added to a file or dataset
- triggered manually via UI or API
Step 2 - An event message is sent to RabbitMQ. The message includes a type (routing key) and content, e.g.:
- *.dataset.added
- *.file.image.#
- *.dataset.metadata.added
Step 3 - RabbitMQ routes the message to a queue:
- each queue corresponds to one named extractor, though multiple instances of that extractor can coexist
- each queue listens for a particular kind of message
Step 4 - Extractors listen to their queue:
- multiple instances of an extractor can share one queue, so processing is scalable
- messages accumulate in the queue until an extractor comes along to handle each one
Step 5 - The extractor handles the next message:
- it fetches the next message in its queue and decides whether to process based on it
- the message can include file IDs, dataset IDs, and other information
- PyClowder will make this step easier
Step 6 - Send info back to Clowder:
- outputs from the extractor, such as metadata and derived files, can then be sent back into Clowder
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Required Software
- Clowder (https://opensource.ncsa.illinois.edu/confluence/display/CATS/Clowder+Requirements)
- MongoDB
- RabbitMQ
- Java
- PyClowder

Two installation approaches:
- Manual (install and run each component individually)
  - PROS: under-the-hood access to everything
  - CONS: more work
- Docker (start the entire stack at once)
  - PROS: easy
  - CONS: things are slightly more obfuscated
Docker is the recommended option for those who will primarily be developing extractors.
Required Software (Manual)
(requirements: https://opensource.ncsa.illinois.edu/confluence/display/CATS/Clowder+Requirements)
- MongoDB: ./bin/mongod
- RabbitMQ: ./sbin/rabbitmq-server
- Clowder (requires Java): ./sbt run
- PyClowder: python setup.py install
Clowder Configuration (Manual)
Enabling RabbitMQ communication: the plugin is disabled by default in the conf/play.plugins configuration file. Override it by creating a custom/play.plugins file and adding this line:
9992:services.RabbitmqPlugin
Running the application:
> ./sbt run
Required Software (Docker)
(setup guide: https://opensource.ncsa.illinois.edu/bitbucket/projects/BD/repos/bd-extractor-template/browse/README.md)
Docker starts the whole stack (Clowder, MongoDB, RabbitMQ, Java, PyClowder) at once via docker-compose:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/browse/docker-compose.yml
From a Docker Quickstart Terminal:
> docker-compose up
> docker-machine ip
Clowder is then available at <docker-machine ip>:9000.
Docker Terminal
If you don't want to start a new Docker Quickstart Terminal each time, add these lines to your profile (e.g. ~/.bash_profile):

if which docker-machine >/dev/null; then
  eval "$(docker-machine env default)"
fi

This makes all subsequent new terminal sessions Docker-aware.
Clowder Configuration
Clowder should now be accessible at localhost:9000 (or <docker-machine ip>:9000 for Docker).
Clowder Configuration
Creating a local account: initially, Clowder has no accounts and no configured email server. To create an account:
1. Sign up for an account inside Clowder.
2. The activation email will not actually be sent; instead it appears in the Clowder terminal output (with Docker: > docker ps to find the container, then > docker logs <container>).
3. Copy the activation URL into your browser to activate the account.
Running a Sample Extractor
The wordcount extractor in the PyClowder repository is a simple example that processes incoming text files and adds metadata describing their content. It can be found in the public PyClowder repository:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/browse
1. Navigate to pyclowder/sample-extractors/wordcount (if using Docker, change the RabbitMQ URL in config.py to the correct IP) and run:
> python wordcount.py
Running a Sample Extractor
1. Navigate to pyclowder/sample-extractors/wordcount (if using Docker, change the RabbitMQ URL in config.py to the correct IP) and run:
> python wordcount.py
2. Create a new Clowder dataset and upload a .txt file (Datasets > Create, then Select Files > Upload).
3. Check the extractor output.
4. Verify the file metadata.
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Extractor Basic Design
1. Connect with RabbitMQ. Check your extractor's queue for new messages.
2. Evaluate the message. Is the file relevant based on filename? Do you need to check metadata? If writing a dataset extractor, are all required files available? You can use the Clowder API to fetch more details about files/datasets if necessary.
3. Process the message. If the message is relevant, perform the key operations for your extractor. With PyClowder the files are available in a temporary location; in other languages you may need to download files or datasets manually if you need them.
4. Upload output data. Add new files to Clowder datasets, upload metadata, etc.
5. Notify RabbitMQ. Acknowledge the message so it is not delivered again.
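Steps 2 through 4 can be sketched as pure logic, independent of any message bus. All names below are illustrative only, not the PyClowder API:

```python
def evaluate_message(message):
    """Step 2 (sketch): is this message relevant? Here, only .txt files are."""
    return message.get("filename", "").lower().endswith(".txt")

def handle_message(message, file_contents):
    """Steps 3-4 (sketch): process a relevant message and return the metadata
    that would be uploaded back to Clowder; None means 'just acknowledge'."""
    if not evaluate_message(message):
        return None  # irrelevant: step 5 would simply acknowledge the message
    return {
        "file_id": message["id"],
        "metadata": {"characters": len(file_contents)},
    }
```

For example, handle_message({"id": "abc", "filename": "notes.txt"}, "hello") returns {"file_id": "abc", "metadata": {"characters": 5}}, while a .png message returns None.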
Extractor Basic Design
These five steps can be done in any language that supports:
- HTTP requests
- JSON parsing
- RabbitMQ interaction (https://www.rabbitmq.com/devtools.html): Java, .NET, Ruby, PHP, C++, Perl, and more
We have created the PyClowder wrapper library for Python to simplify this. Absent other language requirements, this is the easiest path.
PyClowder Library
PyClowder maps the five steps onto library pieces: connect_message_bus() connects with RabbitMQ, check_message() evaluates the message, process_file() processes it, and various utilities help upload output data and notify RabbitMQ.
PyClowder Library
An extractor built on PyClowder consists of two files:

config.py - defines the extractor name, file type, URLs, etc.

YOUR_SCRIPT.py - follows this skeleton:

import extractors

def main():
    setup()
    connect_message_bus()

def check_message():
    ...

def process_file():
    ...
Sample: wordcount.py
config.py defines the extractor name, file type, URLs, etc.; when running under Docker, the RabbitMQ URL must use the Docker IP address. This extractor can be found in the public PyClowder repository, and we will use it as our test extractor:
https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/browse
Sample: wordcount.py
wordcount.py follows the same skeleton: it imports extractors, main() calls setup() and connect_message_bus(), and check_message() / process_file() hold the extractor logic.
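The core computation of a wordcount-style process_file() can be sketched with no Clowder dependencies at all (count_text is an illustrative name, not part of the sample extractor):

```python
def count_text(text):
    """Build wordcount-style metadata for the contents of a text file."""
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }
```

For example, count_text("hello world\nsecond line") returns {"lines": 2, "words": 4, "characters": 23}; the real extractor would attach such a dictionary to the file as metadata.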
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A
Writing an Extractor
Writing your own extractor starts from the same PyClowder skeleton: config.py defines the extractor name, file type, and URLs, and YOUR_SCRIPT.py implements main(), check_message(), and process_file().
Inputs & Outputs
main()
- set up logging & set globals: setup()
- connect to the message bus: connect_message_bus()
check_message(parameters)
- evaluate the contents of the message, e.g. the list of file(s) that were added
- if you need access to the files, return True: the files are downloaded and pointers are passed to process_file()
- if not, return "bypass": the message is passed along to process_file() without downloading files first
- if the message is irrelevant, return False
process_file(parameters)
- the file(s) themselves are directly accessible here (unless bypassed)
- metadata is also available
- for dataset extractors, files are automatically unzipped for easy processing
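As a sketch, a check_message() implementing the three return values might look like this (the parameter keys "filename" and "metadata_only" are assumptions for illustration, not a documented message schema):

```python
def check_message(parameters):
    filename = parameters.get("filename", "")
    # Irrelevant file types: tell PyClowder to skip this message entirely.
    if not filename.lower().endswith((".jpg", ".png")):
        return False
    # If only metadata is needed, skip the download and go straight to process_file().
    if parameters.get("metadata_only"):
        return "bypass"
    # Otherwise download the file(s) and pass pointers to process_file().
    return True
```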
Testing Extractors
1. Run your extractor script.
2. In Clowder, upload a file of the appropriate type.
3. Watch the script output.
4. Verify that the desired metadata/new files are correctly being uploaded.
During development, liberal use of logging helps when things behave unexpectedly: print(parameters) in check_message() and process_file() to make sure you are getting the information you expect.
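One way to do this is to dump the incoming parameters at debug level rather than with bare print() calls (a sketch; the shape of the parameters dict is whatever your messages actually carry):

```python
import json
import logging

logging.basicConfig(level=logging.DEBUG)

def check_message(parameters):
    # Pretty-print the whole message so you can see exactly what arrives.
    logging.debug("check_message got:\n%s",
                  json.dumps(parameters, indent=2, default=str))
    return True
```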
Testing Extractors
If you aren't sure what message type to use, you can set up a generic queue to test:
- Go to the RabbitMQ Management console: http://localhost:15672/#/ (or <docker-machine ip>:15672 for Docker)
- Go to Queues, enter a name, and click Add queue
- In the queue, under Bindings, add routing key *.# from the clowder exchange
- Now all messages will arrive in this queue.
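The routing key *.# catches everything because of AMQP topic semantics: * matches exactly one dot-separated word and # matches zero or more words. A small matcher illustrating those rules (a simplified sketch, not RabbitMQ's actual implementation):

```python
def topic_match(pattern, key):
    """Simplified AMQP topic matching: '*' = one word, '#' = zero or more words."""
    return _match(pattern.split("."), key.split("."))

def _match(pat, words):
    if not pat:
        return not words
    head, rest = pat[0], pat[1:]
    if head == "#":
        # '#' may consume zero or more of the remaining words.
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return _match(rest, words[1:])
    return False
```

So topic_match("*.#", "clowder.file.image.added") is True, which is why the generic queue sees every message on the exchange.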
Pending Additions
These are in the final stages of approval before deployment. Once merged, you may need to reinstall the newer version of PyClowder or download and restart Clowder, respectively.

PyClowder
- Pull Request 13 - adds PyClowder support for dataset extractors (including a sample extractor): https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/pull-requests/13/overview
  (clone the git repo, switch to this branch, and reinstall if you want to use it immediately)

Clowder
- Pull Request 880 - adds support for "metadata added" events (i.e. trigger extractors on metadata updates): https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/880/overview
- Pull Request 899 - adds API support for maintaining only one instance of metadata per extractor, per file: https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/clowder/pull-requests/899/overview
Agenda
- Extractor Pipeline Overview
- Setting up a test environment
  - Installing required software
  - Running sample extractor
- Extractor Basic Design
  - PyClowder library
- Writing an extractor
  - Handling inputs and outputs
- Testing extractors
- Q/A

Max Burnette
mburnet2@illinois.edu