Introduction to Data Management. Dr Jens Jensen, Head of Data Services Group, Leader of Storage and Data Management, Scientific Computing Dept, GridPP.

Presentation transcript:

Introduction to Data Management
Dr Jens Jensen
Head of Data Services Group, Leader of Storage and Data Management, Scientific Computing Dept, GridPP, STFC

Scientific data management:
– Large data volumes (10s of PB)
– Distributed user base
– Need for high-performance transfers
– Need for data security (or not)
– Scalability

Data in “the Grid”?
[Diagram labels: “The Grid”, Data]

Data in “the Cloud”?
[Diagram labels: “The Cloud”, Data]

Transfer Protocols
– GridFTP, aka “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf. RFC 3820)
– HTTP(S)
– WebDAV (RFC 4918)

GridFTP – based on FTP
– Ancient protocol: RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985)
– Splitting of control and data connections
– Extensions: RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389, 5797 (FEAT)

Control connection: port 21 (FTP), 2811 (GridFTP)
[Diagram labels: Client, Server]
Data connections and firewalls: active vs passive mode (PASV)
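The firewall problem with passive mode comes from the server choosing the data port at run time and announcing it on the control connection. As a minimal sketch of that step, the following parses a standard FTP `227` reply (the format from RFC 959) into a host and port. The function name and the sample reply are illustrative assumptions, not taken from the slides.

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an FTP 227 reply.

    The reply carries six decimal numbers h1,h2,h3,h4,p1,p2:
    the IPv4 address is h1.h2.h3.h4 and the data port is p1*256 + p2.
    """
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    h1, h2, h3, h4, p1, p2 = (int(group) for group in match.groups())
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


if __name__ == "__main__":
    # Hypothetical reply, made up for illustration.
    print(parse_pasv_reply("227 Entering Passive Mode (192,168,1,10,19,137)"))
```

Because the port is only known once this reply arrives, a firewall has to either open a port range for data connections or inspect the control connection.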

(Grid)FTP – “3rd party copying”

GridFTP – extensions to FTP
– GSI security (later RFC 3820)
– Striping (and EBLOCK mode)
– TCP buffer size control/negotiation?
– Data channel authentication (DCAU)
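TCP buffer size matters on wide-area links because a single stream can only keep the link full if its buffer holds at least the bandwidth-delay product. A rough sketch of that calculation follows. The function name and the example link figures are assumptions chosen for illustration, not values from the slides.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_seconds: float) -> int:
    """Bytes that must be in flight to keep a path of this bandwidth and RTT full."""
    if bandwidth_bits_per_s < 0 or rtt_seconds < 0:
        raise ValueError("bandwidth and RTT must be non-negative")
    # bits in flight = bandwidth * round-trip time; divide by 8 for bytes
    return round(bandwidth_bits_per_s * rtt_seconds / 8)


if __name__ == "__main__":
    # Hypothetical link: 1 Gbit/s with a 100 ms round-trip time.
    print(bandwidth_delay_product_bytes(1e9, 0.1))
```

A default buffer of a few tens of kilobytes is far below this figure on such a path, which is one motivation for both buffer tuning and parallel streams.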

The Grid...
– Ad-hoc transfers between GridFTP endpoints
– Initial user ingest? scp?
– Hands-on with GridFTP: uberftp (cf. ftp)

Moving data in (and to, and from) the Grid
– “Manually,” with GridFTP
– Portals, e.g. NGS portal
– GlobusOnline
– FTS (as of 3.0, tbc)

The gLite grid – daily TLA dose
EMI – European Middleware Initiative
UMD – Unified Middleware Distribution
EGI – European Grid Infrastructure
IGE – Infrastructure for Globus in Europe
NGI – National Grid Initiative

The gLite grid – component TLAs
SE – Storage Element
SRM – Storage Resource Manager
LFC – LCG File Catalogue
FTS – File Transfer Service
BDII – Berkeley Database Information Index (LDAP)

[Diagram labels: LFC, SRM, GridFTP, BDII, Storage Element, FTS]
SRM (OGF GFD.129)
– control interface
– support for “spaces” (reserved areas)
– retention policies (replica, output, custodial)
– access latencies (offline, nearline, online)
– storage “type”: permanent, volatile
LFN – Logical File Name (optional); resolved by LFC into
GUID – Globally Unique Identifier; resolved by LFC into
SURL – Storage URL (or Site URL); resolved by SE into
TURL – Transfer URL (e.g. gsiftp)
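The resolution chain above (LFN to GUID to SURL to TURL) can be sketched with a toy in-memory catalogue. Everything here is hypothetical: the class, the function names, the example GUID and host names are invented for illustration, and the final step merely swaps the URL scheme, whereas a real SE negotiates the TURL through SRM and may return a different host and path.

```python
from typing import Dict, List, Optional


class ToyFileCatalogue:
    """In-memory stand-in for a file catalogue (not the LFC API)."""

    def __init__(self) -> None:
        self._lfn_to_guid: Dict[str, str] = {}
        self._guid_to_surls: Dict[str, List[str]] = {}

    def register(self, guid: str, surl: str, lfn: Optional[str] = None) -> None:
        # One GUID may have several replicas (SURLs); the LFN is optional.
        self._guid_to_surls.setdefault(guid, []).append(surl)
        if lfn is not None:
            self._lfn_to_guid[lfn] = guid

    def guid_for(self, lfn: str) -> str:
        return self._lfn_to_guid[lfn]

    def surls_for(self, guid: str) -> List[str]:
        return list(self._guid_to_surls[guid])


def toy_turl_for(surl: str, protocol: str = "gsiftp") -> str:
    """Toy version of the SE step: keep host and path, change the scheme."""
    _scheme, separator, rest = surl.partition("://")
    if not separator:
        raise ValueError(f"not a URL: {surl!r}")
    return f"{protocol}://{rest}"


if __name__ == "__main__":
    catalogue = ToyFileCatalogue()
    catalogue.register("guid:example-0001",
                       "srm://se.example.org/dteam/foo.tmp",
                       lfn="lfn:my_stuff")
    guid = catalogue.guid_for("lfn:my_stuff")
    for surl in catalogue.surls_for(guid):
        print(guid, surl, toy_turl_for(surl))
```

Keeping the GUID as the stable key is what lets replicas be added or removed without changing the name users refer to.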

gLite – Summary of basic data commands
lcg-cp – Copy to/from SE, or between SEs (no LFC)
lcg-cr – Copy file into SE, and register in LFC (guid)
lcg-del
lcg-rep – Replicate

Exercises
– Lots of small files (10^5, 10^6)
– Large files ( )
– Migration
– Format migration, checksumming
– Who can copy data? Write/Modify?
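For the checksumming exercise, one possible starting point is a checksum computed in fixed-size chunks, so that large files never have to fit in memory. This sketch uses Adler-32 from Python's standard `zlib` module. The choice of algorithm and the function names are assumptions, since the slides do not name a checksum.

```python
import zlib
from typing import Iterable


def adler32_of_chunks(chunks: Iterable[bytes]) -> str:
    """Adler-32 over a sequence of byte chunks, as 8 hex digits."""
    checksum = 1  # Adler-32 starting value
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Adler-32 of a file, read chunk_size bytes at a time."""
    def read_chunks():
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    return
                yield chunk
    return adler32_of_chunks(read_chunks())


if __name__ == "__main__":
    # The result does not depend on how the data is split into chunks.
    print(adler32_of_chunks([b"Wiki", b"pedia"]))  # prints 11e60398
```

Comparing the value before and after a copy or migration detects corruption in transit, though it says nothing about who was allowed to make the copy.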

Exercises
How is scientific data mgmt different?
– How do research disciplines differ?
– What are the interdisciplinary benefits?
How do grids and clouds differ?
Can we trust the grids/clouds?
Who leads the way? HEP? Industry?

Storage Accounting – static
Ongoing work...
– Distributed storage systems
– Temporary file copies created
– Scheduled deletions
– Inaccessible free spaces, reserved space
– Filesystem/tape overheads
– Timeliness and accuracy
– Impact of compression

GridFTP today
– GridFTP: workhorse of WAN grid data (OGF standard)
– The need for GSI (non-TLS)
– Numerous LAN protocols...
– ...moving towards more common standards? (e.g. HTTP)

lcg-cr --vo dteam -l lfn:my_stuff -d srm-dteam.gridpp.rl.ac.uk file://`pwd`/foo.tmp
guid:921ac0b8-82aa-61dc effece
Subsequent access and replication is by GUID
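When the command above is driven from a script, it is safer to build it as an argument list than to paste strings into a shell. The sketch below only assembles the list and uses only the options that appear in the example (`--vo`, `-l`, `-d`). The helper name and the sample path are assumptions, and nothing is executed.

```python
import shlex
from typing import List


def build_lcg_cr_command(vo: str, lfn: str, dest_se: str, local_path: str) -> List[str]:
    """Assemble an lcg-cr argument list mirroring the example invocation."""
    if not local_path.startswith("/"):
        raise ValueError("local_path must be absolute to form a file:// URL")
    return [
        "lcg-cr",
        "--vo", vo,
        "-l", f"lfn:{lfn}",       # logical file name to register
        "-d", dest_se,            # destination storage element
        f"file://{local_path}",   # source file on the local machine
    ]


if __name__ == "__main__":
    command = build_lcg_cr_command(
        "dteam", "my_stuff", "srm-dteam.gridpp.rl.ac.uk", "/home/user/foo.tmp"
    )
    # Shown for inspection only; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```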

Data Security
– Data security is like data security everywhere...
– Except that the devil is in the detail
– And the details are always different...

Data Security – Confidentiality
– Data in flight, or at rest
– The performance issue
– And the time issue
– Who can “activate” it?

Data Security – Availability
– LOCKSS again... clouds are good at this.
– Somebody already thought about the difficult stuff...?
– Liability, SLAs, ...

Data Security – Availability
DDoS
– Intentional
– Botnets
– Unintentional

Referencing Data
– DOIs for data – DONA (Digital Object Numbering Authority)
– Granularity?
– Licences, permissions
– Implementing data policies

Cloud Data – Cost
– Clouds are elastic
– Elasticity is good for (rapid) growth
– Or shrinkth
– Elasticity can be expensive, though
– Compared to “traditional” data centre
– Or in-house (but don’t underestimate this!)
– Different cost models (Hybrids!)

Infrastructure Security
– End-to-end security
– Authentication and authorisation
– Developing a threat model
– Protecting credentials
– Usability of security
– Anonymised??

Infrastructure
– Federated identity and single sign-on
– Integration with existing infrastructures
– Accounting
– Securely...
– Anonymously?
– And billing

The Role of Standards
– Standards promote interoperation
– And maturity (sometimes)
– Interoperation solves problems
– Sometimes
– E.g. eggs and baskets
– Standards peer reviewed

Other Data Services
iRODS – “data grid”
– Successor to SRB
– Server-side workflows: rules, microservices
Safety Deposit Box
– Commercial product from Tessella
– Data preservation

NGS data services
– NGS portal
– Databases: Oracle, MySQL

EU Funded Data Projects
EUDAT
– Collaborative iRODS-based infrastructure
– Multidisciplinary, scalable, long tail
SCIDIP-ES (earth science)
SCAPE
PANDATA (neutron/synchrotron), pan-data.eu

New Stuff?
– More mature approach to clouds?
– CCN – Content Centric Networking
– RAID --> ECC, “object” storage
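The move from RAID to erasure-style codes keeps the same underlying idea: store redundancy computed from the data blocks so that a lost block can be rebuilt. As a hedged illustration only, the sketch below uses single XOR parity, the simplest such scheme (as in RAID 5), which tolerates exactly one lost block. Production object stores typically use stronger codes, and the function names here are invented for the example.

```python
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    result = bytearray(length)
    for block in blocks:
        for index, byte in enumerate(block):
            result[index] ^= byte
    return bytes(result)


def recover_missing(blocks: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single block marked None from the survivors and the parity."""
    survivors = [block for block in blocks if block is not None]
    if len(survivors) != len(blocks) - 1:
        raise ValueError("exactly one block must be missing")
    # XOR of all survivors and the parity cancels everything but the lost block.
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)
    print(recover_missing([data[0], None, data[2]], parity))  # prints b'efgh'
```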

Exercises
– Lots of small files (10^5, 10^6)
– Large files ( )
– Migration
– Format migration, checksumming
– Who can copy data? Write/Modify?

Exercises
How is scientific data mgmt different?
– How do research disciplines differ? How much can be shared?
– What are the interdisciplinary benefits?
How do grids and clouds differ?
Can we trust the grids/clouds?
Who leads the way? HEP? Industry?

References
– UMD user guide
– GridPP storage and data management group