Lessons Learned from Managing a Petabyte

Jacek Becla, Stanford Linear Accelerator Center (SLAC)
Daniel Wang, now University of California, Irvine; formerly SLAC

CIDR’05, Asilomar, CA

Roadmap
• Who we are
• Simplified data processing
• Core architecture and migration
• Challenges/surprises/problems
• Summary
Don’t miss the “lessons”; just look for the yellow stickers

Who We Are
• Stanford Linear Accelerator Center
  – DoE National Lab, operated by Stanford University
• BaBar
  – one of the largest High Energy Physics (HEP) experiments online
  – in production since 1999
  – over a petabyte of production data
• HEP
  – data-intensive science
  – statistical studies
  – needle-in-a-haystack searches

Simplified Data Processing
[Figure not reproduced]

A Typical Day in the Life (SLAC only)
• ~8 TB accessed in ~100K files
• ~7 TB in/out of tertiary storage
• 2-5 TB in/out of SLAC
• ~35K jobs complete
  – 2500 run at any given time
  – many long-running jobs (up to a few days)

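The daily figures above imply a few averages that are easy to work out. The sketch below is a back-of-envelope calculation only: it assumes decimal terabytes, treats the approximate slide numbers as exact, and assumes a steady state so that Little's law (concurrency = throughput × mean duration) applies. The function and key names are made up for illustration.

```python
TB = 10**12              # decimal terabyte; an assumption, the slide does not say TB vs TiB
MB = 10**6
SECONDS_PER_DAY = 86_400


def daily_averages(bytes_accessed=8 * TB, files_accessed=100_000,
                   jobs_completed=35_000, jobs_running=2_500):
    """Derive rough per-file, per-job and per-second averages from daily totals."""
    return {
        # average amount of data read from each file touched during the day
        "avg_mb_per_file": bytes_accessed / files_accessed / MB,
        # average amount of data read by each completed job
        "avg_mb_per_job": bytes_accessed / jobs_completed / MB,
        # Little's law: mean duration = concurrency / throughput (days -> hours)
        "mean_job_hours": jobs_running / jobs_completed * 24,
        # sustained aggregate read rate across all data servers
        "avg_read_rate_mb_s": bytes_accessed / SECONDS_PER_DAY / MB,
    }


if __name__ == "__main__":
    for name, value in daily_averages().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the averages come to about 80 MB per file, under two hours per job, and roughly 93 MB/s sustained. A mean job length of under two hours alongside jobs that run for days suggests a skewed distribution with many short jobs, although the slide gives no such breakdown.
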
Some of the Data-related Challenges
• Finding perfect snowflake(s) in an avalanche
• Volume: organizing data
• Dealing with I/O
  – sparse reads
  – random access
  – small object size: O(100) bytes
• Providing data for many tens of sites

More Challenges… Data Distribution
• ~25 sites worldwide produce data
  – many more use it
• Distribution pros/cons
  + keeps data close to users
  – makes administration tougher
  + works as a backup
[Sticker] Kill two birds with one stone: replicate for availability as well as backup

Core Architecture
• Mass Storage (HPSS)
  – tapes cost-effective & more reliable than disks
• 160 TB disk cache, 40+ data servers
• Database engine: ODBMS (Objectivity/DB)
  – scalable thin-dataserver, thick-client architecture
  – gives full control over data placement & clustering
  – ODBMS later replaced by a system built within HEP
• DB-related code hidden behind a transient-persistent wrapper
[Sticker] Consider all factors when choosing software and hardware

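The transient-persistent wrapper mentioned above can be pictured as follows. This is a minimal sketch, not BaBar's actual C++ design: `Track`, `PersistencyBackend`, `InMemoryBackend`, `JsonBackend` and `total_charge` are hypothetical names, and the two toy backends merely stand in for "the ODBMS" and "its replacement". The point it illustrates is that analysis code sees only transient objects, so the storage engine can be swapped without touching that code.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Track:
    """Transient object: the only type that analysis code ever sees."""
    momentum: float
    charge: int


class PersistencyBackend(ABC):
    """Wrapper interface; all storage-specific code lives behind it."""

    @abstractmethod
    def write(self, key: str, track: Track) -> None: ...

    @abstractmethod
    def read(self, key: str) -> Track: ...


class InMemoryBackend(PersistencyBackend):
    """Toy stand-in for one storage technology (stores tuples)."""

    def __init__(self):
        self._objects = {}

    def write(self, key, track):
        self._objects[key] = (track.momentum, track.charge)

    def read(self, key):
        momentum, charge = self._objects[key]
        return Track(momentum, charge)


class JsonBackend(PersistencyBackend):
    """Toy stand-in for a different storage technology (stores JSON text)."""

    def __init__(self):
        self._blobs = {}

    def write(self, key, track):
        self._blobs[key] = json.dumps({"p": track.momentum, "q": track.charge})

    def read(self, key):
        fields = json.loads(self._blobs[key])
        return Track(fields["p"], fields["q"])


def total_charge(backend: PersistencyBackend, keys):
    """Analysis code: depends on the wrapper interface, not on a backend."""
    return sum(backend.read(key).charge for key in keys)
```

Because `total_charge` is written against the interface only, it runs unchanged on either backend, which is the property that makes a later migration tractable.
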
Reasons to Migrate
• ODBMS not mainstream
  – true for HEP and elsewhere
  – long-term future uncertain
• Locked into certain OSes/compilers
• Unnecessary DB overhead
  – e.g. transactions for immutable data
• Maintenance at small institutes
• Monetary cost
[Sticker] Build a flexible system; be prepared for non-trivial changes. Bet on simplicity.

xrootd Data Server
• Developed in-house
  – now becoming the de facto HEP standard
• Numerous must-have features, some hard to add to the commercial server
  – deferral
  – redirection
  – fault tolerance
  – scalability
  – automatic load balancing
  – proxy server
[Sticker] Larger systems depend more heavily on automation

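Three of the features listed above (redirection, automatic load balancing and deferral) can be illustrated together. The sketch below is a toy model only and is not xrootd's protocol, API or implementation: `DataServer`, `Response`, `Redirector` and `open_with_stall` are hypothetical names, and "load" is assumed to be a simple count of active clients. A redirector sends the client to the least-loaded online server holding the file, and when no holder is available it tells the client to wait and retry, so the job stalls instead of failing.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataServer:
    name: str
    files: set
    load: int = 0        # assumed metric: number of active client connections
    online: bool = True


@dataclass
class Response:
    kind: str                      # "redirect", "wait" or "error"
    server: Optional[str] = None
    wait_seconds: int = 0


class Redirector:
    def __init__(self, servers, wait_seconds=30):
        self.servers = list(servers)
        self.wait_seconds = wait_seconds

    def locate(self, path):
        holders = [s for s in self.servers if path in s.files]
        if not holders:
            return Response("error")          # no server has this file at all
        candidates = [s for s in holders if s.online]
        if not candidates:
            # Deferral: the file exists but every holder is down, so ask the
            # client to come back later rather than failing the request.
            return Response("wait", wait_seconds=self.wait_seconds)
        best = min(candidates, key=lambda s: s.load)   # load balancing
        best.load += 1
        return Response("redirect", server=best.name)  # redirection


def open_with_stall(redirector, path, sleep=time.sleep, max_attempts=10):
    """Client side: stall and retry while the redirector answers 'wait'."""
    for _ in range(max_attempts):
        response = redirector.locate(path)
        if response.kind != "wait":
            return response
        sleep(response.wait_seconds)
    return Response("error")                  # gave up after max_attempts
```

The `sleep` parameter is injectable so that the stall can be simulated without real delays. From the job's point of view a server outage then shows up as a pause, not as a crash.
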
More Lessons… Challenges, Surprises, Problems
• Organizing & managing data
  – Divide into mutable & immutable, separate queryable data
    · immutable data is easier to optimize, replicate & scale
  – Decentralize metadata updates
    · contention happens in unexpected places
    · makes data mgmt harder
    · still need some centralization
• Fault tolerance
  – Large system likely to use commodity hardware
    · fault tolerance essential
[Sticker] Single technology likely not enough to efficiently manage petabytes

Challenges, Surprises, Problems (cont…)
• Main bottleneck: disk I/O
  – underlying persistency less important than one would expect
  – access patterns more important
    · must understand them to derandomize I/O
• Job mgmt/bookkeeping
  – better to stall jobs than to kill them
• Power, cooling, floor weight
• Admin
[Sticker] Hide disruptive events by stalling data flow

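One common way to "derandomize" I/O, once the access pattern is known, is to sort pending reads by offset and merge neighbours into larger sequential reads. The slides do not say which techniques BaBar actually used, so the sketch below is only an illustration of the general idea; `coalesce_reads` and `max_gap` are hypothetical names.

```python
def coalesce_reads(requests, max_gap=0):
    """Turn scattered (offset, length) reads into fewer sequential ones.

    Requests are sorted by offset; a request is merged into the previous
    range when it starts at most `max_gap` bytes after that range ends.
    A larger `max_gap` trades some wasted bytes for fewer disk seeks.
    """
    merged = []                                # list of [start, end) ranges
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # extend previous range
        else:
            merged.append([offset, end])              # start a new range
    return [(start, end - start) for start, end in merged]


if __name__ == "__main__":
    scattered = [(300, 100), (0, 100), (100, 50), (1000, 10)]
    print(coalesce_reads(scattered))               # exact merging only
    print(coalesce_reads(scattered, max_gap=150))  # also bridge small gaps
```

With objects of around 100 bytes, as noted earlier in the talk, even a modest `max_gap` can collapse many tiny reads into a handful of large ones, at the cost of reading some bytes that are never used.
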
On the Bleeding Edge Since Day 1
• Huge collection of interesting challenges…
  – Increasing address space
  – Improving server code
  – Tuning and scaling the whole system
  – Reducing lock collisions
  – Improving I/O
  – …many others
• In summary
  – we made it work (big success), but…
  – continuous improvements were needed for the first several years to keep up
[Sticker] When you push limits, expect many problems everywhere. Normal maxima are too small. Observe, refine, repeat.

Uniqueness of … Scientific Community
• Hard to convince the scientific community to use commercial products
  – BaBar: 5+ million lines of home-grown, complex C++
• Continuously looking for better approaches
  – system has to be very flexible
• Most data immutable
• Many smart people who can build almost anything
[Sticker] Specific needs of your community can impact everything, including the system architecture

DB-related Effort
• ~4-5 core DB developers since 1996
  – effort augmented by many physicists, students and visitors
• 3 DBAs
  – from the start of production until recently
  – fewer than 3 now
    · system finally automated and fault tolerant
[Sticker] Automation is the key to a low-maintenance, fault-tolerant system

Lessons Summary
• Kill two birds with one stone: replicate for availability as well as backup
• Consider all factors when choosing software and hardware
• When you push limits, expect many problems everywhere. Normal maxima are too small. Observe, refine, repeat.
• Specific needs of your community can impact everything, including the system architecture
• Automation is the key to a low-maintenance, fault-tolerant system
• Larger systems depend more heavily on automation
• Hide disruptive events by stalling data flow
• Single technology likely not enough to efficiently manage petabytes
• Organize data (mutable, immutable, queryable, …)
• Build a flexible system; be prepared for non-trivial changes. Bet on simplicity.

Petabyte Frontier: just a few highlights…
• How to cost-effectively back up a PB?
• How to provide fault tolerance with 1000s of disks?
  – RAID 5 is not good enough
• How to build a low-maintenance system?
  – “1 full-time person per 1 TB” does not scale
• How to store the data? (tape, anyone?)
  – consider all factors: cost, power, cooling, robustness
• …YES, there are “new” problems beyond “known problems scaled up”

The Summary
• Great success
  – ODBMS-based system, migration & 2nd generation
  – Some DoD projects are being built on ODBMS
• Lots of useful experience with managing (very) large datasets
  – Would not have been able to achieve all that with any RDBMS (today)
  – Thin-server, thick-client architecture works well
  – Starting to help astronomers (LSST) manage their petabytes