Prototyping a virtual filesystem for storing and processing petascale neural circuit datasets

Art Wetzel, Greg Hood and Markus Dittrich
National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center
awetzel@psc.edu, 412-268-3912, www.psc.edu and www.nrbsc.org

R. Clay Reid, Jeff Lichtman, Wei-Chung Allen Lee
Harvard Medical School; Allen Institute for Brain Science; Center for Brain Science, Harvard University

Davi Bock
HHMI Janelia Farm

David Hall and Scott Emmons
Albert Einstein College of Medicine

Jan 11, 2012

Connectomics Data Project Overview
Reconstructing brain circuits requires high-resolution electron microscopy over "long" distances == BIGDATA.

[Figure: EM ultrastructure scales. Vesicles ~30 nm diameter; a synaptic junction >500 nm wide with a ~20 nm cleft gap; a dendritic spine. Image: www.coolschool.ca/lor/BI12/unit12/U12L04.htm]
[Figure: semiconductor scales for comparison. Recent ICs have 32 nm features and 22 nm chips are being delivered; gate oxide 1.2 nm thick.]
A 10-Tvoxel dataset aligned by our group was an essential part of the March 2011 Nature paper with Davi Bock, Clay Reid and Harvard colleagues. We are now working on two datasets of ~100 TB each and expect to reach petabytes within 2-3 years.
The CS project is to implement and test a prototype virtual filesystem that addresses common problems associated with neural circuit and other massive datasets. The most important aim is to reduce unwanted data duplication as raw data are preprocessed for final analysis; the virtual filesystem does this by replacing redundant storage with on-the-fly computation. The second aim is to provide a convenient framework for efficient on-the-fly computation on multidimensional datasets within high-performance parallel computing environments using both CPU and GPGPU processing. The Filesystem in Userspace (FUSE) mechanism provides a natural implementation basis that will work across a variety of systems, and many existing FUSE codes serve as useful examples.
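To make the on-the-fly idea concrete, below is a minimal sketch written against the libfuse 2.x high-level API. It exposes a single read-only virtual file, /inverted, whose bytes are computed on demand from a backing raw file rather than stored anywhere. The backing path /data/raw.dat, the file name "inverted" and the per-byte intensity inversion are purely illustrative stand-ins for a real preprocessing step, not part of the project design.

    /* Minimal sketch: a FUSE filesystem that synthesizes file contents
       on the fly instead of storing a preprocessed copy. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* Hypothetical backing store holding the raw dataset. */
    static const char *backing = "/data/raw.dat";

    static int vfs_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
            return 0;
        }
        if (strcmp(path, "/inverted") == 0) {
            struct stat raw;
            if (stat(backing, &raw) == -1)
                return -errno;
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size = raw.st_size;  /* 1:1 transform keeps the size */
            return 0;
        }
        return -ENOENT;
    }

    static int vfs_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                           off_t off, struct fuse_file_info *fi)
    {
        (void) off; (void) fi;
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        fill(buf, ".", NULL, 0);
        fill(buf, "..", NULL, 0);
        fill(buf, "inverted", NULL, 0);
        return 0;
    }

    static int vfs_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path, "/inverted") != 0)
            return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY)
            return -EACCES;
        return 0;
    }

    /* read(): pull the requested byte range from the raw file and apply
       the transform on the fly; the processed version is never stored. */
    static int vfs_read(const char *path, char *buf, size_t size, off_t off,
                        struct fuse_file_info *fi)
    {
        (void) fi;
        if (strcmp(path, "/inverted") != 0)
            return -ENOENT;
        int fd = open(backing, O_RDONLY);
        if (fd == -1)
            return -errno;
        ssize_t n = pread(fd, buf, size, off);
        close(fd);
        if (n == -1)
            return -errno;
        for (ssize_t i = 0; i < n; i++)   /* toy 8-bit intensity inversion */
            buf[i] = (char)(255 - (unsigned char)buf[i]);
        return (int) n;
    }

    static struct fuse_operations vfs_ops = {
        .getattr = vfs_getattr,
        .readdir = vfs_readdir,
        .open    = vfs_open,
        .read    = vfs_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &vfs_ops, NULL);
    }

Built with "gcc -Wall vfs.c $(pkg-config fuse --cflags --libs) -o vfs" and mounted with "./vfs <mountpoint>", reading <mountpoint>/inverted yields the transformed data without a duplicate copy ever touching disk; a real implementation would substitute the actual preprocessing pipeline for the toy inversion.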
We would eventually like to have a flexible software framework that allows a combination of common pre-written and user-written application codes to operate together and take advantage of parallel CPU and GPGPU technologies.
This work will include multidimensional data structures that provide efficient random and sequential access, analogous to the 1D byte-stream representation provided by standard filesystems (see the sketch below). Students working on this project will have access to a parallel cluster that holds our large datasets along with the required compilers and other tools. Minimal end-to-end functionality with simple linear transforms can likely be achieved in about 8 weeks and then extended as time permits. Please contact Art Wetzel with any further questions: awetzel@psc.edu.
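As a sketch of the multidimensional access layout mentioned above: one common approach (an assumption here, not a committed design) stores the volume as a grid of small cubic chunks, so that locating a voxel means computing a chunk number plus an offset within the chunk, the 3D analogue of a filesystem turning a file offset into a block number plus an offset into the block. The 64-voxel chunk edge and one byte per voxel below are illustrative choices.

    /* Sketch: chunked 3-D voxel addressing. locate() maps an (x,y,z)
       coordinate to a chunk number and a byte offset within that chunk. */
    #include <stdint.h>
    #include <stdio.h>

    #define CHUNK 64   /* chunk edge length in voxels, illustrative */

    typedef struct { uint64_t nx, ny, nz; } Volume;   /* dims in voxels */
    typedef struct {
        uint64_t chunk_index;   /* which chunk holds the voxel */
        uint64_t offset;        /* byte offset in the chunk (1 B/voxel) */
    } VoxelAddr;

    static VoxelAddr locate(const Volume *v,
                            uint64_t x, uint64_t y, uint64_t z)
    {
        /* chunks per axis, rounded up for non-multiple dimensions */
        uint64_t cx = (v->nx + CHUNK - 1) / CHUNK;
        uint64_t cy = (v->ny + CHUNK - 1) / CHUNK;
        VoxelAddr a;
        /* chunks and in-chunk voxels are both laid out x-fastest */
        a.chunk_index = (z / CHUNK) * cy * cx
                      + (y / CHUNK) * cx
                      + (x / CHUNK);
        a.offset = (z % CHUNK) * CHUNK * CHUNK
                 + (y % CHUNK) * CHUNK
                 + (x % CHUNK);
        return a;
    }

    int main(void)
    {
        Volume v = { 4096, 4096, 1024 };          /* hypothetical volume */
        VoxelAddr a = locate(&v, 1000, 2000, 30);
        printf("chunk %llu, offset %llu\n",
               (unsigned long long) a.chunk_index,
               (unsigned long long) a.offset);
        return 0;
    }

The point of chunking is that voxels that are close in all three dimensions land in the same chunk, so both random access to a small 3D region and sequential sweeps stay efficient, whereas a flat 1D layout would scatter that same region across widely separated file offsets.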