Ten Years and Change: the MX data archive at ALS 8.3.1
Acknowledgements
ALS creator: Tom Alber
PRT head: Jamie Cate
Center for Structure of Membrane Proteins
Membrane Protein Expression Center II
Center for HIV Accessory and Regulatory Complexes
W. M. Keck Foundation
Plexxikon, Inc.
M D Anderson CRC
University of California Berkeley
University of California San Francisco
National Science Foundation
University of California Campus-Laboratory Collaboration Grant
Henry Wheeler
The Advanced Light Source is supported by the Director, Office of Science, Office of Basic Energy Sciences, Materials Sciences Division, of the US Department of Energy under contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.
ALS data collection history terabytes (uncompressed)
ALS data collection history images × 10^6
DVD data archive: 68 TB
DVD data archive
50 TB
Primary failure mode of DVDs
3000 files remain unrecoverable (~0.1%)
Which data go with which PDB?
260,000 images are called “test”
cell: – is within 5 Å and 5° of 16,000 PDBs
focusing on PDBs that credit ALS with data
44 of these didn’t actually collect data
64 collected data, but no credit
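The "within 5 Å and 5°" criterion above can be sketched as a simple per-parameter tolerance test. This is only an illustration: the function name `cell_within_tolerance` and the tuple layout `(a, b, c, alpha, beta, gamma)` are assumptions, and the sketch does not reduce cells or try axis permutations, which a real search would need.

```python
def cell_within_tolerance(cell, reference, max_length_diff=5.0, max_angle_diff=5.0):
    """Return True if every cell edge is within max_length_diff (Å) and
    every cell angle is within max_angle_diff (degrees) of the reference.

    Both cells are (a, b, c, alpha, beta, gamma) tuples (assumed layout).
    No cell reduction or axis re-ordering is attempted here.
    """
    lengths_ok = all(
        abs(x - y) <= max_length_diff for x, y in zip(cell[:3], reference[:3])
    )
    angles_ok = all(
        abs(x - y) <= max_angle_diff for x, y in zip(cell[3:], reference[3:])
    )
    return lengths_ok and angles_ok
```

With tolerances this loose, one measured cell can match many deposited cells, which fits the slide's count of 16,000 PDBs.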
Which data go with which PDB?
1. images from collected “near” edges
3. find “runs” of >10 images
4. unify multi-wedge sets
5. run labelit & XDS
6. >70% complete?
7. I/σ > 10
8. reduced cell vs PDB
1,604, , to 200+
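Step 3 above (finding "runs" of >10 images) might look roughly like the following. This is a minimal sketch, not the procedure actually used: it assumes file names of the form `prefix_NNN.img`, and the names `FRAME_RE` and `find_runs` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming convention: <prefix>_<frame number>.img
FRAME_RE = re.compile(r"^(?P<prefix>.+)_(?P<num>\d+)\.img$")


def find_runs(filenames, min_length=11):
    """Group frames by prefix and return (prefix, first, last) for every
    stretch of consecutive frame numbers at least min_length long.
    The default of 11 corresponds to "more than 10 images"."""
    frames = defaultdict(set)
    for name in filenames:
        match = FRAME_RE.match(name)
        if match:
            frames[match.group("prefix")].add(int(match.group("num")))

    runs = []
    for prefix in sorted(frames):
        numbers = sorted(frames[prefix])
        start = prev = numbers[0]
        for n in numbers[1:]:
            if n != prev + 1:
                # A gap in numbering closes the current run.
                if prev - start + 1 >= min_length:
                    runs.append((prefix, start, prev))
                start = n
            prev = n
        if prev - start + 1 >= min_length:
            runs.append((prefix, start, prev))
    return runs
```

Short stretches, such as a pair of "test" shots, fall below the threshold and are dropped, leaving only candidate datasets for the later steps.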
Unit Cell
[plot: best Rcryst after rigid-body refinement vs. RMS unit cell length deviation (Å); labeled points: 1hh7, M. TB CSOR, 1rb5, myoglobin]
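The slide does not say exactly how the RMS unit cell length deviation on this plot was computed. One plausible reading, shown here only as an assumption, is the root-mean-square of the differences between the three cell edges of a dataset and those of the deposited structure. The function name `rms_cell_length_deviation` is hypothetical.

```python
import math


def rms_cell_length_deviation(cell, reference):
    """RMS difference (Å) between the a, b, c edges of two unit cells.

    Assumed definition: sqrt(mean((a1-a2)^2, (b1-b2)^2, (c1-c2)^2)).
    Only the first three entries of each cell are used.
    """
    diffs = [x - y for x, y in zip(cell[:3], reference[:3])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```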
MAD/SAD datasets
[plot: Riso vs PDB deposit vs. best Rcryst after rigid-body refinement; regions labeled: Published, non-isomorphous, Unsolved?]
Responses to inquiries
“I have to find my old note book as I have no idea what that is.”
“I have changed jobs a few times since and am really far away from crystallography now.”
“Will see what I can find.”
“We solved it but never published it. Sorry!”
EGDA
Dec 01 19:45: egda46_*1_E#_###.img (1112 images, Se MAD)
Dec 02 15:10: egda27_*1_###.img (180, 1A, native?)
Dec 02 19:21: egdau1_*1_###.img (427, 8000eV (U?) SAD)
Dec 02 20:58: egdau1_*2_###.img (360, 8000eV (U?) SAD)
Jun 01 14:07: egda60_*1_###.img (360, Lutetium SAD)
“I think that these EGDA data sets are very likely some of xxx’s data sets, he was working on E.coli guanine deaminase, something he brought from yyy. No structure was ever published. James, xxx was unable to solve the structure from these data.”
E. coli guanine deaminase
~2.9 Å P
R = 0.32, Rfree = 0.39
PDB ID: ????
Responses to inquiries
“Thank you for your effort, it looks like the "elves" have brought an early Christmas present.... I think it is worth depositing of course, it would be a shame to do the difficult part and avoid the easiest.”
Metadata: can we rely on it?
Duquerroy, et al. (1994). "Lobster enolase crystallized by serendipity", Proteins: Struct., Funct., Bioinf. 18,
authors were after lobster arginine kinase
got enolase instead
arginine kinase structure still unknown
raw image: compresses 4.2x
just spots: compresses 337x
pixel-wise median across dataset: compresses 5x, but only one per dataset!
deviation from median in “non-spot” areas: compresses 3.5x
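The pixel-wise median and the deviation from it can be illustrated on tiny images held as nested lists. This is a sketch under stated assumptions, not the code behind the slides: the names `pixelwise_median` and `deviation_from_median` are hypothetical, every image is assumed to have the same shape, and spot masking is left out.

```python
from statistics import median


def pixelwise_median(images):
    """Median of each pixel position across all images in a dataset.
    Each image is a list of rows; all images must share one shape."""
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [median(img[r][c] for img in images) for c in range(cols)]
        for r in range(rows)
    ]


def deviation_from_median(image, background):
    """Per-pixel difference between one image and the median background."""
    return [
        [pixel - base for pixel, base in zip(row, base_row)]
        for row, base_row in zip(image, background)
    ]
```

Because a spot lands on a given pixel in only a few frames, the median keeps the smooth background and leaves the spot in the deviation image, which is why the two parts can be compressed separately.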
compressed ~50x after h264 of non-spot areas
difference between raw and compressed: compresses 5.2x
Lossy compression vs R/Rfree
[plot: R factor vs. compression ratio]
backblaze.com “pod” server
backblaze.com offers “unlimited storage” data backup for $5/month.
backblaze does not sell these “pods”, but “protocase.com” does.
Summary
saving data could double productivity
unit cell is not a good score
lossy compression: rallying cry?
backup vs archive
metadata: what do we really know?
Brief Summary
this is a lot of work.
who is going to pay for it?