
1 Nebula: a cloud-based back end for SETI@home
David P. Anderson, Kevin Luong
Space Sciences Lab, University of California, Berkeley

2 SETI@home
Observation
Signal detection
Signal storage
Back-end processing: RFI detection/removal, persistent signal detection, re-observation

3 Signal storage
Using SQL database (Informix)
Signal types: spike, Gaussian, triplet, pulse, autocorrelation
Database table hierarchy: tape → workunit group → workunit → result → (signal tables)

4 Pixelized sky position
HEALPix: Hierarchical Equal-Area Isolatitude Pixelization
~51M pixels; telescope beam is ~1 pixel
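A minimal sketch of what the pixelization step looks like with the healpy library. The resolution parameter nside=2048 is an assumption on my part (12 × 2048² ≈ 50M pixels, roughly the ~51M quoted above), not a value stated in the talk.

```python
# Sketch only: map a sky position to a HEALPix pixel index.
# NSIDE = 2048 is an assumed resolution (~50M pixels), not the project's documented value.
import healpy as hp

NSIDE = 2048

def sky_to_pixel(ra_deg, dec_deg):
    """Return the HEALPix pixel containing (RA, Dec), both in degrees."""
    return hp.ang2pix(NSIDE, ra_deg, dec_deg, lonlat=True)

print(sky_to_pixel(201.365, -43.019))  # example position
```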

5 Current back end: NTPCKR
As signals are added, mark pixels as "hot"
Score hot pixels (DB-intensive)
Do RFI removal on high-scoring pixels, flag them for re-scoring

6 Problems with current back end
Signal DB is large: 5 billion signals, 10 TB
Informix has limited speed
NTPCKR can't keep up with signal arrival; > 1 year to score all pixels
Labor-intensive, non-scalable

7 Impact on science
We haven't done scoring/reobservation in 10 years
We wouldn't find an ET signal if it were there
We don't have anything to tell volunteers
We don't have a basis for writing papers

8 Nebula goals
Short-term:
  RFI-remove and score all pixels in ~1 day for ~$100
  Stop doing sysadmin, start doing science: e.g. continuous reobservation, experimenting with the scoring algorithm
Long-term:
  Generality: include other signal sources (SERENDIP)
  Provide outside access to scoring, signals, raw data
General:
  Build expertise in clouds and big-data techniques
  Form relationships with cloud providers, e.g. Amazon

9 Design decisions
Use the Amazon cloud (AWS) for the heavy lifting: for bursty usage, clouds are cheaper than in-house hardware
Use flat files and the Unix filesystem: NoSQL DB systems don't buy us anything
Software: C++ for compute-intensive stuff (reusing existing code), Python for the rest

10 AWS features
Simple Storage Service (S3): disk storage by the GB/month, accessed over HTTP
Elastic Compute Cloud (EC2): VM hosting by the hour, various "node types"
Elastic Block Store (EBS): disk storage by the GB/month, attached to one EC2 node
(Diagram: S3, EC2, and EBS connected via HTTP and mounts, reachable from the Internet)

11 Interfaces to AWS
Web-based
Python APIs (see the sketch below):
  Boto3: interface to S3 storage
  Fabric: interface to EC2 nodes
(Diagram: script.py on a local host talking to AWS over HTTP)
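A minimal sketch of the two Python interfaces named on this slide, assuming nothing about the project's actual scripts; the bucket name, file names, and host address are placeholders.

```python
# Sketch only: Boto3 for S3, Fabric for running commands on EC2 nodes.
import boto3
from fabric import Connection

# Upload a local file to S3 (placeholder bucket and key).
s3 = boto3.client("s3")
s3.upload_file("signals.bin", "nebula-example-bucket", "pixels/signals.bin")

# Run a command on an EC2 node over SSH (placeholder host).
conn = Connection("ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com")
conn.run("ls /data/pixels")
```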

12 Nebula: the basic idea
Dump the SETI@home database to flat files
Upload the files to S3
Split the files by pixel (~80M files)
  remove RFI and redundant signals in the process
  do this in parallel on EC2 nodes
Score the pixels

13 Moving data from Informix to S3
Informix DB unload: 1-2 days
Nebula upload script (sketch below):
  use Unix "split" to make 2 GB chunks
  upload chunks in parallel (thread pool / queue approach, 8 threads)
  S3 automatically reassembles the chunks
Getting close to 1 Gb/s throughput
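A minimal sketch of the parallel chunk upload described above, written as an S3 multipart upload with a pool of 8 worker threads; this is an assumed implementation, and the bucket, key, and chunk file names are placeholders.

```python
# Sketch only: upload chunks made by Unix "split" as one multipart S3 object,
# 8 uploads at a time.  S3 reassembles the parts on completion.
import glob
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET, KEY = "nebula-example-bucket", "dump/signals.unl"   # placeholders
s3 = boto3.client("s3")

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
chunks = sorted(glob.glob("signals.unl.chunk.*"))           # output of "split"

def upload_part(numbered_chunk):
    part_number, path = numbered_chunk
    with open(path, "rb") as f:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_number,
                              UploadId=mpu["UploadId"], Body=f)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(upload_part, enumerate(chunks, start=1)))

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```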

14 Pixelization
Need to:
  divide TB-size files into ~16M files
  remove RFI and redundant signals
Can't do this sequentially:
  a process can only have 1024 open files
  it would take too long

15 Hierarchical pixelization
Level 1: split flat files 512 ways based on pixel
  convert from ASCII to binary
  remove redundant signals
Level 2: split level 1 files 256 ways
  result: 130K level 2 files
Level 3: split each level 2 file 512 ways
  remove RFI
(A sketch of the pixel-to-file mapping follows.)
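A minimal sketch of how a pixel index could map to the 512 / 256 / 512 hierarchy of split files. The even partition of the pixel range, the pixel count, and the path layout are all assumptions for illustration, not the project's actual scheme.

```python
# Sketch only: map a pixel index to level-1/2/3 bins with the fan-out above.
N_PIXELS = 12 * 2048 ** 2          # assumed total HEALPix pixel count (~50M)
FANOUT = (512, 256, 512)           # level 1, 2, 3 splits from the slide

def pixel_to_path(pixel):
    """Return a hypothetical hierarchical path like 'level3/017/201/433.bin'."""
    frac = pixel / N_PIXELS        # position of this pixel in [0, 1)
    bins = []
    for n in FANOUT:
        frac *= n
        b = min(int(frac), n - 1)
        bins.append(b)
        frac -= b                  # keep the remainder for the next level
    return "level3/{:03d}/{:03d}/{:03d}.bin".format(*bins)

print(pixel_to_path(12345678))
```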

16 Pixelization on EC2
Create N instances (t2.micro)
Create a thread per node
Create a queue of level 1 tasks
To run a task:
  get the input file from S3
  run the pixelize program
  upload output files to S3
  create next-level tasks
Keep going until all tasks are done
Kill the instances
(A sketch of the thread-per-node / task-queue pattern follows.)
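A minimal sketch of the thread-per-node, task-queue pattern described on this slide. The host names, command-line flags, and the decision to exit when the queue is momentarily empty are simplifying assumptions; the real script also enqueues next-level tasks as it goes.

```python
# Sketch only: one local thread per EC2 node, each pulling tasks from a queue
# and running the pixelize program on its node via Fabric.
import queue
import threading

from fabric import Connection

tasks = queue.Queue()
for i in range(512):               # level-1 split tasks
    tasks.put((1, i))

def worker(host):
    """Drive one EC2 node: pull tasks until the queue is empty."""
    conn = Connection(host)
    while True:
        try:
            level, index = tasks.get_nowait()
        except queue.Empty:
            return
        # Placeholder command: fetch input from S3, pixelize, upload output.
        conn.run(f"./pixelize --level {level} --index {index}")
        tasks.task_done()

hosts = [f"ec2-user@node{i}.example.com" for i in range(16)]   # placeholder hosts
threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
for t in threads:
    t.start()
for t in threads:
    t.join()
```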

17 Removing redundant signals
Old way: for each signal, walk up the chain of DB tables
New way (sketch below):
  create a bitmap file, indexed by result ID, saying whether the result is from a redundant tape
  memory-map this file
  given a signal, we can instantly see if it's redundant
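A minimal sketch of the memory-mapped bitmap lookup; the file name and the one-bit-per-result, LSB-first layout are assumptions, not the project's documented format.

```python
# Sketch only: constant-time redundancy check via a memory-mapped bitmap
# indexed by result ID (one bit per result, assumed LSB-first within a byte).
import mmap

f = open("redundant_results.bitmap", "rb")       # placeholder file name
bitmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def is_redundant(result_id):
    """True if the result came from a redundant tape."""
    return (bitmap[result_id >> 3] >> (result_id & 7)) & 1 == 1
```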

18 Pixel scoring
Assemble the signals in a disc centered at the pixel (sketch below)
Compute the probability that these are noise
Can be done independently for each pixel
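A minimal sketch of gathering the pixels in a disc around a target pixel with healpy; the nside value and the disc radius are assumptions for illustration, and the noise-probability computation itself is not shown.

```python
# Sketch only: find the pixels within an assumed radius of a center pixel.
import numpy as np
import healpy as hp

NSIDE = 2048                        # assumed resolution
DISC_RADIUS = np.radians(0.05)      # assumed disc radius, in radians

def pixels_in_disc(center_pixel):
    """Pixels whose signals are assembled when scoring center_pixel."""
    vec = hp.pix2vec(NSIDE, center_pixel)
    return hp.query_disc(NSIDE, vec, DISC_RADIUS)

# Each pixel can then be scored independently from the signals in its disc.
```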

19 Nebula scoring program
Same code as NTPCKR, modified to get signals from flat files instead of the DB
First try: remove all references to Informix
  this failed; too intertwined
Second try: keep Informix but don't use it

20 Parallelizing scoring
Need to score 16M pixels; use about 1K nodes
Want to minimize file transfers; reuse signal files on a node
Divide pixels into adjacent "blocks" of 4^n, say 1024 (sketch below)
Each block is a job (16K of them)
Each job loops over its pixels, fetches and caches files, and creates and uploads an output file of (pixel, score) pairs
Master script instantiates EC2 nodes, uses the thread/queue approach
  keeps nodes busy even if some pixels take longer than others
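A minimal sketch of the block partitioning: contiguous runs of 1024 (4^5) pixels become one job each, so adjacent pixels scored on the same node can reuse the same signal files. The pixel count is the one quoted above; everything else is an illustrative assumption.

```python
# Sketch only: divide the pixel range into contiguous blocks of 1024 pixels.
N_PIXELS = 16 * 1024 * 1024        # ~16M pixels to score
BLOCK_SIZE = 4 ** 5                # 1024 adjacent pixels per block

jobs = [range(start, start + BLOCK_SIZE)
        for start in range(0, N_PIXELS, BLOCK_SIZE)]
print(len(jobs))                   # 16384 jobs, i.e. the 16K on the slide

# Each job loops over its pixels, fetching and caching signal files so that
# neighboring pixels reuse them, then uploads one (pixel, score) output file.
```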

21 Nebula user interface
Configuration:
  AWS and Nebula config files
  check out and build the software
Scripts:
  s3_upload.py, s3_status.py, s3_delete.py
  pixelize.py
  score.py
Logging
Amazon accounting tools

22 Status
Mostly written and working
Code: seti_science/nebula
Doing performance and cost tests; I think we'll meet the goals
Design docs are on Google, readable to the ucb_seti_dev group

23 Future directions
Flat-file-centric architecture:
  assimilators write signals to flat files
  load into a SQL DB if needed
Amazon spot instances (auction pricing):
  instances are killed if the price goes above your bid
Amazon Elastic File System (upcoming):
  shared mountable storage, at a price
Incremental processing

