Cataloging, Replicating, and Managing LIGO Data on the Grid. Scott Koranda, UW-Milwaukee. Feb 4, 2005.


1 Cataloging, Replicating, and Managing LIGO Data on the Grid
Scott Koranda, UW-Milwaukee, skoranda@uwm.edu
On behalf of the LIGO Scientific Collaboration
MardiGras Conference, February 4, 2005, LSU

2 Laser Interferometer Gravitational-wave Observatory
LIGO is opening a new frontier in observational astrophysics
- Detect & use gravitational waves (GW) to observe the Universe, provide a more complete picture of the Cosmos.
- Complementary to radio/infrared/optical/X-ray/γ-ray astronomy
  - EM emitters not likely to be strong GW emitters & vice versa
- Detect & observe cataclysmic events leading to death of stars, birth of neutron stars & black holes
- Study Einstein’s theory of general relativity in the strong-field regime near massive compact objects, where GW are produced
LIGO is now observing, acquiring science data and in full analysis production

3 Who is LIGO?
LIGO = Laser Interferometer Gravitational-wave Observatory
LIGO = [institution logos; images not recovered]

4 What is the LSC?
LSC = LIGO Scientific Collaboration
LSC = [institution logos; images not recovered] + 28 (or more) other institutions

5 LIGO Data Challenges
Revealing the full science content of LIGO data is a computationally and data intensive challenge
- LIGO interferometers generate ~10 MB/s or almost 1 TB/day
Several classes of data analysis challenges require large-scale computational resources
- In general for analysis
  - FFT data segment
  - Choose template (based on physical parameters) and filter
  - Repeat again and again and again…
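The "almost 1 TB/day" figure follows directly from the quoted rate. A minimal arithmetic check, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the slide does not specify:

```python
# Back-of-the-envelope check of the data rate quoted on the slide.
# Assumption: decimal units; the slide does not say which convention it uses.
MB = 10**6
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_mb_per_s):
    """Return the volume in TB accumulated over one day at a constant rate."""
    return rate_mb_per_s * MB * SECONDS_PER_DAY / TB


print(f"{daily_volume_tb(10):.3f} TB/day")
```

At 10 MB/s this comes to roughly 0.86 TB/day, consistent with "almost 1 TB/day".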

6 One Adventure Story on the “Grid”
... an evolutionary tale of one physicist’s adventures
reveal some things we learned
- technical
- social (collaborations)

7 First Challenge
Early 2001 Bruce Allen asked me to “use some Grid tools to get LIGO data to UWM”
- LIGO data in HPSS archive at CIT (CACR)
- LIGO E7 data set 10’s of TBs (full frames at time)
- Data at UWM to go onto 296 disk partitions
- Make it fast!
- Make it robust!
  - If a disk at UWM dies the data should automatically reappear (quickly, of course!)
- And, make it fast!

8 Initial Prototype
Pulled together existing production tools
Globus GridFTP server and client
- Move data fast!
- globus-url-copy command-line client
- Multiple parallel data streams
- GSI authentication
- Tunable TCP windows and I/O buffers
Globus Replica Catalog for tracking what we have where
- Mappings of logical filenames to physical locations
- LDAP based
Python scripts as “glue” to hold it together

9 How did we do for E7?
Failure!
Low HPSS effective throughput
Globus Replica Catalog (LDAP) not up to the challenge
- Bad scaling past 10^5 filenames
Nothing fault tolerant or robust about this prototype
Still…
~2 TB came over the network
10 MB/s transfers once data out of HPSS (good sign)
Learned GridFTP would be a firm foundation
Feedback into Globus via GriPhyN and iVDGL

10 Second Prototype for S1
Limited task again: Replicate S1 data from Caltech to UWM
- S1 data spins at Caltech
Pull together pieces:
GridFTP
- Code against API to create custom, tightly-integrated client
- Use default but improved GridFTP server
- Cache open connections and fill the pipe with data
Plain-text catalogs!
- Simple flat files to keep track of what files are available in the “collection” and what we already have
- Begin storing file metadata like size, md5 checksums
- Added simple data verification on client side
Python as glue again
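The client-side verification described above can be sketched as follows. This is an illustration only, not the actual S1 code: the one-record-per-line "name size md5" catalog layout and all function names are assumptions.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_catalog_line(line):
    """Parse one 'name size md5' record (hypothetical flat-file layout)."""
    name, size, md5 = line.split()
    return name, int(size), md5


def verify_replica(path, expected_size, expected_md5):
    """Cheap size check first, then the more expensive checksum."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5


# Demo with a throwaway file standing in for a replicated frame file.
with tempfile.TemporaryDirectory() as workdir:
    payload = b"stand-in for frame data"
    record = f"example.gwf {len(payload)} {hashlib.md5(payload).hexdigest()}"
    name, size, md5 = parse_catalog_line(record)
    local_copy = os.path.join(workdir, name)
    with open(local_copy, "wb") as handle:
        handle.write(payload)
    print("intact copy verifies:", verify_replica(local_copy, size, md5))
    with open(local_copy, "ab") as handle:
        handle.write(b"!")  # simulate a damaged transfer
    print("damaged copy verifies:", verify_replica(local_copy, size, md5))
```

Checking the size before the checksum avoids reading the whole file when a transfer was obviously truncated.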

11 How did we do for S1?
So So…
- Did get all S1 locked data to UWM
- Plain-text catalogs don’t scale (not surprisingly)
- Not much fault tolerant or robust about the system
  - Needs administrative attention each and every day
Still…
- Did get all S1 locked data to UWM
- 10 MB/s transfer rate when system working
- GridFTP definitely firm foundation
- Verified integrity of replica using checksums and sizes
- Provide data catalog requirements to Globus team

12 A Lesson Learned
Too much focus early on data transfer rates
Moving data fast easier problem to solve
Real challenges for data replication
What data exists?
- How does a site learn about what data exists?
Where is it?
- How does a site learn about what data other sites have?
How does a site get the data it wants?
- What mechanism can be used to schedule and prioritize data for replication?
How do users find the data?
- In what ways will users try to find the data? What tools are necessary?
And of course, data should move fast… Is it here yet?

13 LIGO Data Replicator for S2
Replicate S2 RDS data from Caltech to MIT, PSU, UWM
- AEI, Cardiff added late
LIGO Data Replicator (LDR)
GridFTP server, customized clients
Globus Replica Location Service (RLS)
- Local Replica Catalog (LRC) maps logical filenames to URLs
- Replica Location Index (RLI) maps logical filenames to other LRCs
- RDBMS based (MySQL)
First attempt at real metadata catalog
- Use MySQL but with very naïve tables
Python Glue
LIGO Data Replicator → “Lightweight Data Replicator”
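The two-level lookup that the LRC and RLI provide can be modeled with plain dictionaries. The sketch below is a toy model of the idea, not the Globus RLS API; the class names, method names, site names, and URLs are all hypothetical.

```python
class LocalReplicaCatalog:
    """Per-site catalog: logical filename (LFN) -> URLs held at that site."""

    def __init__(self, site):
        self.site = site
        self._urls = {}

    def add(self, lfn, url):
        self._urls.setdefault(lfn, set()).add(url)

    def lookup(self, lfn):
        return sorted(self._urls.get(lfn, ()))

    def lfns(self):
        return set(self._urls)


class ReplicaLocationIndex:
    """Index: LFN -> names of the sites whose LRC knows about that LFN."""

    def __init__(self):
        self._sites = {}

    def update_from(self, lrc):
        for lfn in lrc.lfns():
            self._sites.setdefault(lfn, set()).add(lrc.site)

    def sites_for(self, lfn):
        return sorted(self._sites.get(lfn, ()))


def find_urls(lfn, index, catalogs):
    """Ask the index which sites hold the file, then ask each site's LRC."""
    urls = []
    for site in index.sites_for(lfn):
        urls.extend(catalogs[site].lookup(lfn))
    return urls


# Hypothetical sites and URLs, for illustration only.
site_a = LocalReplicaCatalog("site-a")
site_b = LocalReplicaCatalog("site-b")
site_a.add("file-1.gwf", "gsiftp://site-a.example.org/data/file-1.gwf")
site_b.add("file-1.gwf", "gsiftp://site-b.example.org/data/file-1.gwf")
site_b.add("file-2.gwf", "gsiftp://site-b.example.org/data/file-2.gwf")

index = ReplicaLocationIndex()
catalogs = {}
for catalog in (site_a, site_b):
    index.update_from(catalog)
    catalogs[catalog.site] = catalog

print(find_urls("file-1.gwf", index, catalogs))
```

The index stores only which sites know a filename, so the full URL mappings stay local to each site and only the much smaller "who has what" summary needs to be exchanged.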

14 How did we do for S2?
Better, but not quite there…
- Network transfer rate problems
- Smaller files lead to new problems/insights
- Naïve metadata table design limits performance & scalability
- Robustness and fault tolerance better, but not good enough
- Publishing is awkward
- No automatic data discovery
Still…
- Replication to MIT, AEI, Cardiff from both Caltech and UWM
- Replicated from UWM back to Caltech after disk lost
- LDRdataFindServer and LALdataFind expose data to users at UWM
  - A first attempt that goes too well…becomes a necessary feature!
- GridFTP continues to be solid foundation
- Globus RLS will be a firm foundation
  - There have been reliability issues, but all addressed

15 LDR for S3
Wish list by admins for LDR features/enhancements
- Better/easier installation, configuration ☺
- “Dashboard” for admins for insights into LDR state X
- More robustness, especially with RLS server hangs ☺
- API and templates for publishing X
- New schema for metadata tables X
- Transfer rate database X
- Latest version of Globus Replica Location Server (RLS) ☺
- Latest upgrades to GridFTP Server and API ☺
- Simple installation using Pacman ☺
Deploy at end of September 2003

16 How did we do for S3?
Still…
Replicated data to 4 sites with minimal latency for most of the extended run
Average LDR intervention time up to a few days (for the most part)
Deployment using Pacman a solid foundation
- But now there is yum and apt-get???
RLS statistics:
- ~6 million LFNs per LRC
- between 6 and 30 million PFNs per LRC
- network of 5 to 7 RLS servers all updating each other

17 Great Collaboration
ISI Globus team and LDR team
- close collaboration over RLS
- many performance issues solved
- new client API functions added
Collaboration challenges
- What the CS people had to “put up with”?
  - physicists more concerned about performance than new CS research ideas
  - irregular update schedule, based on experiment’s needs not CS needs
  - server performance statistics not a high priority
  - use cases change, sometimes daily

18 Great Collaboration
Collaboration Challenges
What the physics people had to “put up with”?
- tendencies for “throw it over the fence” approach
- lack of interest in some user/admin issues
- landscape shifts (to web services for example)
- Java...

19 Great Collaboration
Why did it work?
- Credit RLS developers with great listening
  - much effort into understanding LSC use case(s)
- Single points of contact between two groups
  - make 1 physicist and 1 CS responsible
  - ignore other “inputs”
- Good logging helped communicate state
- Regular face-2-face meetings
  - sounds simple
  - prevents useless tangents due to poor communication

20 What’s Next for S4/S5?
New metadata schema has to be top priority
Current schema makes queries to find data too slow
- More users demanding LSCdataFind/framequery capabilities and performance
Current metadata propagation is not scaling well
- Probably can’t even make it to S4, much less into S4
New metadata based on Globus MCS project
- We don’t want to be in the metadata catalog business, but have to be at this time
- We are making particular assumptions/choices in order to implement a propagation strategy
- Feedback our experience and requirements into Grid community

21 What’s Next for S4/S5?
Need to solve the “small file problem”
Bruce is right and we do still have to worry about replication rates
Trend is toward more but smaller files published into LDR
Plan is a “tar on the fly, move, untar” approach
New technologies make this attainable
- pyGlobus GridFTP-enabled server class
- New Globus GridFTP server base in beta
Proof of concept already done by IBM using Java CoG
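The "tar on the fly, move, untar" idea can be illustrated with Python's standard tarfile module in stream mode. This is a sketch of the packing and unpacking steps only: an in-memory buffer stands in for the network transfer, and the file names are hypothetical. It is not the pyGlobus or GridFTP implementation.

```python
import io
import tarfile


def pack(files):
    """Bundle many small payloads (name -> bytes) into one tar stream."""
    buffer = io.BytesIO()
    # "w|" writes a non-seekable stream, as one would over a network pipe.
    with tarfile.open(fileobj=buffer, mode="w|") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack(stream_bytes):
    """Read the tar stream sequentially and recover name -> bytes."""
    recovered = {}
    with tarfile.open(fileobj=io.BytesIO(stream_bytes), mode="r|") as archive:
        for member in archive:
            if member.isfile():
                recovered[member.name] = archive.extractfile(member).read()
    return recovered


# Hypothetical small files; one bundled transfer replaces three separate ones.
small_files = {
    "a.gwf": b"first small file",
    "b.gwf": b"second small file",
    "c.gwf": b"third small file",
}
bundle = pack(small_files)   # "tar on the fly"
received = bundle            # stand-in for "move"
print(unpack(received) == small_files)  # "untar"
```

Bundling trades per-file transfer overhead for a single larger stream, which is the motivation given on the slide for the small file problem.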

22 What’s Next for S4/S5?
Data discovery and automated filesystem watching
Admins do need to move data around in filesystem and have changes appear automatically in LDR
Publishing of existing data sets needs to be quicker and easier and automated
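One simple way to approximate automated filesystem watching is to take periodic snapshots of a directory tree and compare them. The sketch below assumes that polling approach, which the slide does not specify; the function names and file names are hypothetical.

```python
import os
import tempfile


def snapshot(root):
    """Map each file's path relative to root -> (size, modification time in ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            info = os.stat(full_path)
            state[os.path.relpath(full_path, root)] = (info.st_size, info.st_mtime_ns)
    return state


def diff(old, new):
    """Return sorted lists of added, removed, and changed relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(path for path in set(old) & set(new) if old[path] != new[path])
    return added, removed, changed


# Demo: an admin moves a file from one directory to another.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "old"))
    os.makedirs(os.path.join(root, "new"))
    with open(os.path.join(root, "old", "example.gwf"), "wb") as handle:
        handle.write(b"data")
    before = snapshot(root)
    os.rename(os.path.join(root, "old", "example.gwf"),
              os.path.join(root, "new", "example.gwf"))
    after = snapshot(root)
    added, removed, changed = diff(before, after)
    print("added:", added, "removed:", removed, "changed:", changed)
```

A moved file shows up as one removal plus one addition; matching the two by size or checksum would let a catalog update the location rather than republish the file.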

23 What’s Next for S4/S5?
Strong pressure to “open” LDR network for any user/files
I have resisted this initially
- Problem of replicating bulk “raw” data sets fundamentally different than many small sets of user files
- Users do crazy things!
- Undelete problem is hard
Have agreed to look in detail and try
- Need to make sure doesn’t derail LDR’s first mission of replicating LIGO/GEO S4/S5 data
- Need to broaden the discussion

24 Lightweight Data Replicator
[Architecture diagram; image not recovered. Labels: LDR; Metadata Service, Discovery Service, Replication Service; MySQL, RLS, GridFTP; GSI, SOAP, GSI]

25 [Diagram; image not recovered. Sites: LHO, CIT, MIT, LLO, UWM, PSU, AEI]
“Publish” data: Metadata Catalog
- H-RDS_R_L3-752653616-16.gwf, 1950187 bytes, Frame type RDS_R_L3, Run tag S3, Locked, …
- L-RDS_R_L3-752653616-16.gwf, 983971 bytes, Frame type RDS_R_L3, Run tag S3, Locked, …
What data do we want? Ask metadata catalog
- Collection: Instrument = ‘H’ AND frameType = ‘RDS_R_L3’ AND runTag = ‘S3’
Where can we get it? Ask URL catalog
- H-RDS_R_L3-752653616-16.gwf is available at LHO
Local Replica Catalog
- H-RDS_R_L3-752653616-16.gwf → gsiftp://ldas.ligo-wa.caltech.edu:15000/samrds/S3/L3/LHO/H-RDS_R_L3-7526/H-RDS_R_L3-752653616-16.gwf
- L-RDS_R_L3-752653616-16.gwf → gsiftp://ldas.ligo-la.caltech.edu:15000/samrds/S3/L3/LLO/L-RDS_R_L3-7526/L-RDS_R_L3-752653616-16.gwf
“I have URLs for files…”: URL Catalog
- What is URL for H-RDS_R_L3-752653616-16.gwf?
- gsiftp://ldas.ligo-wa.caltech.edu:15000/samrds/S3/L3/LHO/H-RDS_R_L3-7526/H-RDS_R_L3-752653616-16.gwf
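The walk-through above (select a collection from the metadata catalog, then resolve each file to a URL) can be sketched with in-memory stand-ins for both catalogs. The file names, sizes, attribute values, and URLs are taken from the slide; the dictionary layout, the function names, and the reading of the last two filename fields as a GPS start time and a duration in seconds are assumptions.

```python
from collections import namedtuple

FrameName = namedtuple("FrameName", "instrument frame_type gps_start duration")


def parse_frame_name(filename):
    """Split 'H-RDS_R_L3-752653616-16.gwf' into its four dash-separated fields."""
    stem, extension = filename.rsplit(".", 1)
    if extension != "gwf":
        raise ValueError(f"not a frame file: {filename}")
    instrument, frame_type, gps_start, duration = stem.split("-")
    return FrameName(instrument, frame_type, int(gps_start), int(duration))


# Stand-in for the metadata catalog (values from the slide).
METADATA = [
    {"name": "H-RDS_R_L3-752653616-16.gwf", "size": 1950187,
     "instrument": "H", "frameType": "RDS_R_L3", "runTag": "S3"},
    {"name": "L-RDS_R_L3-752653616-16.gwf", "size": 983971,
     "instrument": "L", "frameType": "RDS_R_L3", "runTag": "S3"},
]

# Stand-in for the URL catalog (mappings from the slide).
URLS = {
    "H-RDS_R_L3-752653616-16.gwf":
        "gsiftp://ldas.ligo-wa.caltech.edu:15000/samrds/S3/L3/LHO/"
        "H-RDS_R_L3-7526/H-RDS_R_L3-752653616-16.gwf",
    "L-RDS_R_L3-752653616-16.gwf":
        "gsiftp://ldas.ligo-la.caltech.edu:15000/samrds/S3/L3/LLO/"
        "L-RDS_R_L3-7526/L-RDS_R_L3-752653616-16.gwf",
}


def select(metadata, **criteria):
    """Return names of records whose attributes match every criterion."""
    return [record["name"] for record in metadata
            if all(record.get(key) == value for key, value in criteria.items())]


# Step 1: what data do we want? Step 2: where can we get it?
wanted = select(METADATA, instrument="H", frameType="RDS_R_L3", runTag="S3")
for name in wanted:
    print(parse_frame_name(name))
    print(URLS[name])
```

Keeping "what exists" (metadata) separate from "where it is" (URLs) mirrors the split between the metadata catalog and the replica catalogs shown on the slide.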

26 Looking ahead to S6...
Metadata Avalanche!
- 3 sites (LHO, LLO, GEO) for 6 month run
- ~40 million new pieces of metadata information at a minimum
- maximum could be order of magnitude higher
- How do we partition?
- Replication strategies?
- Performance?

27 LDR “like” replication service for GT4
Collaboration between Globus ISI and UW-M
Look for “preview technology” in GT4
- start with just a “replication service”
- send the service list of files to replicate
- later add more components like metadata service?

