Introduction to Data Management. Dr Jens Jensen, Head of Data Services Group, Leader of Storage and Data Management and Scientific Computing Dept, GridPP.

1 Introduction to Data Management Dr Jens Jensen Head of Data Services Group, Leader of Storage and Data Management and Scientific Computing Dept, GridPP STFC...

2 Scientific data management: – Large data volumes (10s of PB) – Distributed user base – Need for high performance transfers – Need for data security (or not) – Scalability
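To put "10s of PB" and "high performance transfers" in proportion, here is a back-of-envelope calculation. The 10 Gbit/s link speed and the `efficiency` parameter are illustrative assumptions, not figures from the slides.

```python
def transfer_days(volume_bytes: float, link_bits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move volume_bytes over a link.

    efficiency is the usable fraction of the nominal link rate
    (an assumed knob for protocol overhead and contention).
    """
    seconds = volume_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400  # seconds per day


PB = 10**15  # decimal petabyte

# Assumed scenario: 10 PB over a fully used 10 Gbit/s link.
print(f"{transfer_days(10 * PB, 10e9):.1f} days")  # 92.6 days
```

Even on an ideal link the transfer takes about three months, which is why distribution, parallelism and scalability appear on the same slide as the data volume.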

5 Data in “the Grid”? “The Grid” Data

6 Data in “the Cloud”? “The Cloud” Data

7 Transfer Protocols – GridFTP (http://www.ogf.org/documents/GFD.20.pdf) Aka “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf RFC 3820) – HTTP(S) – WebDAV (RFC 4918)

8 GridFTP – based on FTP Ancient protocol... RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985) Splitting control and data connection Extensions RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389, 5797 (FEAT)

9 Control connection: port 21 (FTP), 2811 (GridFTP) [Diagram: Client, Server] Data connections and firewalls (active vs passive mode (PASV))
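In passive mode the server tells the client, over the control connection, where to open the data connection. A minimal sketch of decoding that reply follows, using the RFC 959 format `227 ... (h1,h2,h3,h4,p1,p2)`. The function name and the sample address are hypothetical.

```python
import re


def parse_pasv_reply(reply: str) -> tuple[str, int]:
    """Extract (host, port) from an FTP 227 reply.

    RFC 959 encodes the endpoint as six decimal bytes:
    four for the IPv4 address, two for the port (high byte first).
    """
    m = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if m is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    nums = [int(x) for x in m.groups()]
    if any(n > 255 for n in nums):
        raise ValueError("each field must fit in one byte")
    host = ".".join(str(n) for n in nums[:4])
    port = nums[4] * 256 + nums[5]
    return host, port


# Illustrative reply (192.0.2.10 is a documentation address).
print(parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)"))
```

Because the data port is chosen at run time, a firewall has to allow either a configured port range or inspect the control channel, which is the firewall issue the slide refers to.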

10 (Grid)FTP - “3rd party copying”
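Third-party copying works because the client holds a control connection to each server and hands one server's passive endpoint to the other, so the data flows directly between the servers. The sketch below shows the command sequence for plain FTP only. The function name and paths are hypothetical, and GridFTP's security and mode negotiation are deliberately left out.

```python
import re


def third_party_plan(dest_pasv_reply: str, src_path: str,
                     dest_path: str) -> dict[str, list[str]]:
    """Commands the client sends on each control connection.

    dest_pasv_reply is the 227 reply already received from the
    destination server in response to PASV.
    """
    m = re.search(r"\((\d+(?:,\d+){5})\)", dest_pasv_reply)
    if m is None:
        raise ValueError(f"not a PASV reply: {dest_pasv_reply!r}")
    hostport = m.group(1)  # h1,h2,h3,h4,p1,p2, reused verbatim by PORT
    return {
        # Destination listens, then stores whatever arrives.
        "destination": ["PASV", f"STOR {dest_path}"],
        # Source connects to the destination's endpoint and sends the file.
        "source": [f"PORT {hostport}", f"RETR {src_path}"],
    }


plan = third_party_plan("227 Entering Passive Mode (192,0,2,10,195,80)",
                        "/data/run1.dat", "/store/run1.dat")
for server, commands in plan.items():
    print(server, commands)
```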

11 GridFTP – extensions to FTP GSI security (later RFC 3820) Striping (and EBLOCK mode) TCP buffer size control/negot.? Data channel authentication (DCAU)
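TCP buffer size matters because a single TCP stream can only fill a long-distance link if its window covers the bandwidth-delay product. The following worked calculation uses assumed figures (1 Gbit/s, 100 ms round trip, 4 MB buffers), and the helper names are hypothetical.

```python
import math


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bits_per_s * rtt_ms / 8000  # /8 bits, /1000 ms


def streams_to_fill(bandwidth_bits_per_s: int, rtt_ms: int,
                    buffer_bytes: int) -> int:
    """Parallel streams needed if each stream is capped at buffer_bytes."""
    return max(1, math.ceil(bdp_bytes(bandwidth_bits_per_s, rtt_ms) / buffer_bytes))


# Assumed WAN path: 1 Gbit/s with a 100 ms round-trip time.
print(f"{bdp_bytes(10**9, 100) / 1e6:.1f} MB")   # 12.5 MB
print(streams_to_fill(10**9, 100, 4_000_000))    # 4
```

Either the buffer is raised to roughly the bandwidth-delay product, or several streams share the load. Which of the two a given deployment prefers is a tuning decision, not something the slide states.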

12 The Grid.... Ad-hoc transfers between GridFTP endpoints Initial user ingest? scp? Hands on with GridFTP: uberftp (cf ftp)

13 Moving data in (and to, and from) the Grid “Manually,” with GridFTP Portals – e.g. NGS portal GlobusOnline FTS (as of 3.0, tbc)

14 The gLite grid – daily TLA dose EMI – European Middleware Initiative UMD – Unified Middleware Distribution EGI – European Grid Infrastructure IGE – Initiative for Globus in Europe NGI – National Grid Initiative

15 The gLite grid – component TLAs SE – Storage Element SRM – Storage Resource Manager LFC – LCG File Catalogue FTS – File Transfer Service BDII – Berkeley Database Information Index (LDAP)

16 [Diagram: LFC, SRM, GridFTP, BDII, Storage Element, FTS] SRM (OGF GFD.129) – control interface – support for “spaces” (reserved areas) – retention policies (replica, output, custodial) – access latencies (offline, nearline, online) – storage “type”: permanent, volatile. LFN – Logical File Name (optional), resolved by LFC into GUID – Globally Unique Identifier, resolved by LFC into SURL – Storage URL (or Site URL), resolved by SE into TURL – Transfer URL (e.g. gsiftp)
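The naming chain LFN, GUID, SURL, TURL can be pictured as a sequence of lookups. This is a toy model only: the class, the host name `se.example.org` and the scheme-swapping "SE" step are hypothetical stand-ins, not the LFC or SRM interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalogue:
    """In-memory stand-in for the catalogue's two mappings."""
    lfn_to_guid: dict[str, str] = field(default_factory=dict)
    guid_to_surls: dict[str, list[str]] = field(default_factory=dict)

    def register(self, lfn: str, guid: str, surl: str) -> None:
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls.setdefault(guid, []).append(surl)

    def replicas(self, lfn: str) -> list[str]:
        """LFN -> GUID -> all SURLs (one per replica)."""
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


def toy_turl(surl: str, protocol: str = "gsiftp") -> str:
    """Pretend SE step: a real SE chooses host, port and path itself."""
    _, sep, rest = surl.partition("://")
    if not sep:
        raise ValueError(f"not a URL: {surl!r}")
    return f"{protocol}://{rest}"


cat = ToyCatalogue()
cat.register("lfn:my_stuff", "guid:0000-example", "srm://se.example.org/dteam/foo.tmp")
for surl in cat.replicas("lfn:my_stuff"):
    print(surl, "->", toy_turl(surl))
```

The point of the indirection is that the logical name and GUID stay fixed while replicas (SURLs) and access protocols (TURLs) can change underneath.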

17 gLite - Summary of basic data commands: lcg-cp – copy to/from SE, or between SEs (no LFC); lcg-cr – copy file into SE, and register in LFC (guid); lcg-del – delete; lcg-rep – replicate

18 Exercises Lots of small files (10^5, 10^6) Large files (10^8-10^12) Migration Format migration, checksumming Who can copy data? Write/Modify?
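For the checksumming exercise, large files have to be checksummed in chunks, not read into memory at once. Here is a minimal sketch using Adler-32 from Python's `zlib`. The choice of Adler-32 and the chunk size are assumptions made for illustration.

```python
import io
import zlib
from typing import BinaryIO


def adler32_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> int:
    """Adler-32 of a binary stream, read chunk_size bytes at a time."""
    checksum = 1  # Adler-32 starting value
    while chunk := stream.read(chunk_size):
        checksum = zlib.adler32(chunk, checksum)
    return checksum


def copies_match(a: BinaryIO, b: BinaryIO) -> bool:
    """Compare source and copy after a migration by checksum."""
    return adler32_stream(a) == adler32_stream(b)


data = b"example payload " * 1000
print(f"{adler32_stream(io.BytesIO(data), chunk_size=4096):08x}")
print(copies_match(io.BytesIO(data), io.BytesIO(data)))
```

For a file on disk, pass `open(path, "rb")` where this example uses `io.BytesIO`. With 10^5 or more small files, the per-file open and lookup cost tends to dominate over the checksum itself.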

19 Exercises How is scientific data mgmt different? – How do research disciplines differ? – What are the interdisciplinary benefits? How grids and clouds differ...? Can we trust the grids/clouds? Who leads the way? HEP? Industry?

20 Storage Accounting - static Ongoing work... – Distributed storage systems – Temporary file copies created – Scheduled deletions – Inaccessible free spaces, reserved space – Filesystem/tape overheads – Timeliness and accuracy – Impact of compression

21 GridFTP today GridFTP – workhorse of WAN grid data (OGF standard) The need for GSI (non-TLS) Numerous LAN protocols... … moving towards more common standards? (eg HTTP)

22 lcg-cr --vo dteam -l lfn:my_stuff -d srm-dteam.gridpp.rl.ac.uk file://`pwd`/foo.tmp guid:921ac0b8-82aa-61dc-0192-6effece Subsequent access and replication is by GUID
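The command above can be assembled programmatically. This sketch uses only the options that appear on the slide (`--vo`, `-l`, `-d`). The helper names are hypothetical, and the assumption that the GUID is printed as a `guid:` token is taken from the slide's sample output, not from the tool's documentation.

```python
from pathlib import Path
from typing import Optional


def lcg_cr_argv(vo: str, lfn: str, dest_se: str, local_file: str) -> list[str]:
    """Argument list mirroring the slide's lcg-cr invocation."""
    file_url = Path(local_file).resolve().as_uri()  # file:///abs/path
    return ["lcg-cr", "--vo", vo, "-l", f"lfn:{lfn}", "-d", dest_se, file_url]


def find_guid(output: str) -> Optional[str]:
    """First whitespace-separated token starting with 'guid:', if any."""
    for token in output.split():
        if token.startswith("guid:"):
            return token
    return None


argv = lcg_cr_argv("dteam", "my_stuff", "srm-dteam.gridpp.rl.ac.uk", "foo.tmp")
print(" ".join(argv))
```

On a machine with the gLite client tools and a valid proxy, the list could be passed to `subprocess.run(argv, capture_output=True, text=True)` and the GUID read from `stdout` with `find_guid`.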

23 Data Security – Data security is like data security everywhere... – Except that the devil is in the detail – And the details are always different...

24 Data Security – Confidentiality Data In flight, or at rest The performance issue And the time issue Who can “activate” it? Data

25 Data Security – Availability LOCKSS again... clouds are good at this. Data Somebody already thought about the difficult stuff...? Liability, SLAs,...

26 Data Security – Availability DDoS – Intentional – Botnets – Unintentional

27 Referencing Data – DOIs for data: DONA (Digital Object Numbering Authority) – Granularity? – Licences, permissions – Implementing data policies

28 Cloud Data – Cost – Clouds are elastic – Elasticity is good for (rapid) growth – Or shrinkth – Elasticity can be expensive, though – Compared to “traditional” data centre – Or in-house (but don’t underestimate this!) – Different cost models (Hybrids!)

29 Infrastructure Security – End-to-end security – Authentication and authorisation – Developing a threat model – Protecting credentials – Usability of security – Anonymised??

30 Infrastructure – Federated identity and single sign-on – Integration with existing infrastructures – Accounting – Securely... – Anonymously? – And billing

31 The Role of Standards – Standards promote interoperation – And maturity (sometimes) – Interoperation solves problems – Sometimes – E.g. eggs and baskets – Standards peer reviewed

32 Other Data Services iRODS – “data grid” Successor to SRB Server side workflows: rules, microservices Safety Deposit Box – Commercial product from Tessella Data preservation

33 NGS data services NGS portal – https://portal.ngs.ac.uk/ http://www.ngs.ac.uk/tools/vbrowser Databases: Oracle, MySQL

34 EU Funded Data Projects EUDAT (www.eudat.eu) Collaborative iRODS based infrastructure Multidisciplinary, scalable, long tail SCIDIP-ES (earth science) www.scidip-es.eu SCAPE (www.scape-project.eu) PANDATA (neutron/synchrotron) pan-data.eu

35 New Stuff? More mature approach to clouds? CCN – Content Centric Networking RAID --> ECC, “object” storage

36 Exercises Lots of small files (10^5, 10^6) Large files (10^8-10^12) Migration Format migration, checksumming Who can copy data? Write/Modify?

37 Exercises How is scientific data mgmt different? – How do research disciplines differ? How much can be shared? – What are the interdisciplinary benefits? How grids and clouds differ...? Can we trust the grids/clouds? Who leads the way? HEP? Industry?

38 References www.ngs.ac.uk www.ogf.org UMD user guide https://edms.cern.ch/document/722398/ GridPP storage and data management group – http://www.gridpp.ac.uk/wiki/Grid_Storage

