Peta-Cache: Electronics Discussion I Gunther Haller

Presentation transcript:

Peta-Cache: Electronics Discussion I
Presentation: Ryan Herbst, Mike Huffer, Leonid Saphoznikov, Gunther Haller
Contact: Gunther Haller, haller@slac.stanford.edu, (650) 926-4257

Content
- Collection of information so the baseline is common
- A couple of slides on how it seems to be done presently
- Part 1, Option 3 (Tuesday): alternative architecture with skim builder and event server
- Part 2, Options 1 and 2 (Thursday):
  - Using Flash with minimum changes to the present architecture
  - Using Flash with more changes to the present architecture

Collection of Info (to be checked)
- Total number of BBR (BaBar) events: ~400 million
- Original dataflow events: 1/4 to 1/2 Mbyte -> reconstruction results in 16-kbyte raw events
- From the raw events (16 kbytes), generate:
  - Minis
  - Micros (~2.5 kbytes) (MC ~5 kbytes)
  - Tags (512 bytes)
- Raw (reconstructed) events: 400 million x 16 kbytes = ~6 Terabytes
- A fraction of them are needed most often
- Estimate: ~200 skims total
  - Large skims use 10-20% of the total number of events
  - The smallest use ~<1% of the total
  - Open question: how many of those are needed, and how frequently?
- Presently about 6-10 ms disk-access time
- What is the event analysis time?
  - Very preliminary benchmark for analysis with little processing, including decompression:
    - 35 ms if the data is in memory
    - 70 ms if the data is on disk
  - What is the number as a function of the number of users?
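As a back-of-envelope check of what these estimates imply for total volume and per-core event rate, here is a minimal Python sketch; all input numbers come from the figures above, and the constant names are purely illustrative.

```python
# Back-of-envelope dataset sizing using the estimates quoted above
# (all numbers are the slide's estimates, not measurements).
N_EVENTS    = 400e6          # total BBR (BaBar) events
RAW_EVENT_B = 16 * 1024      # reconstructed raw event size in bytes
MICRO_B     = 2.5 * 1024     # micro size in bytes
TAG_B       = 512            # tag size in bytes

print(f"raw:    {N_EVENTS * RAW_EVENT_B / 1e12:.1f} TB")   # ~6.6 TB (slide rounds to ~6 TB)
print(f"micros: {N_EVENTS * MICRO_B / 1e12:.1f} TB")       # ~1.0 TB
print(f"tags:   {N_EVENTS * TAG_B / 1e9:.0f} GB")          # ~205 GB

# Per-core event rate implied by the preliminary analysis benchmark:
T_MEM_MS, T_DISK_MS = 35, 70
print(f"events/s per core: {1000 / T_MEM_MS:.0f} (in memory) "
      f"vs {1000 / T_DISK_MS:.0f} (on disk)")
```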

More info
- Now: each type of *new* skim involves linking new code into the BaBar application (to zeroth order, a single image!), which entails a significant amount of time for creating a new build, testing, and worrying about the effect of its robustness on the rest of the operation, etc.
- What does the Redirector do? (check): it effectively maps an event *number* requested by the user to a file somewhere in the storage system (the combination of all the disks plus the tape system). Basically, if the user's xrootd server doesn't have an event, it finds out which server has the file on the disk it serves and then asks that server to make a *copy* of it onto its *own* disks. If no server has the file, then the user's xrootd server goes to the tape (HPSS) system and fetches it. (See the sketch below.)
- Xrootd server: ~$6k
- Disk: ~$2/Gbyte (including cooling, etc.)
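A minimal sketch of the copy-or-fetch-from-tape flow described above, assuming that description is correct (it is marked "check" on the slide); the class and method names (`Redirector`, `has_event`, `copy_from`, `fetch_from_hpss`) are illustrative stand-ins, not the real xrootd/HPSS interfaces.

```python
# Illustrative sketch of the lookup flow described above; class and method
# names are hypothetical, not the real xrootd/HPSS interfaces.
class Redirector:
    def __init__(self, servers):
        self.servers = servers          # xrootd servers, each with its own disks

    def locate(self, event_number):
        """Map an event *number* to the server holding the file, if any."""
        for server in self.servers:
            if server.has_event(event_number):
                return server
        return None

def get_event(client_server, redirector, event_number):
    # Serve from the client's own xrootd server if the file is already there.
    if client_server.has_event(event_number):
        return client_server.read(event_number)
    source = redirector.locate(event_number)
    if source is not None:
        # Another server has it: copy the file onto this server's own disks.
        client_server.copy_from(source, event_number)
    else:
        # No server has it: fall back to the tape (HPSS) system.
        client_server.fetch_from_hpss(event_number)
    return client_server.read(event_number)
```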

Raw Events - Tags - Skims
- Raw data (12 kbytes), micros (2.5 kbytes), tag data (512 bytes)
- A tag contains interesting characteristics of an event and a pointer to its raw data
- A skim is a collection of tags, with pointers to the raw data, or it can contain the raw data itself (e.g. all tags, all micros, etc.)
- Tag data (and sometimes micro data) is used to make skims: run over the tag data with a defined algorithm and produce a skim
- Pointer skim: a collection of tags with pointers to the raw data
- Deep skim: also contains the raw data (e.g. for transfer to a remote institution)
- About 200 different skims
- The time to get a new skim can be about 9 months due to process (formalism to make sure skims are of interest to many)
- One goal is to be able to produce new skims easily and quickly
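A small illustrative data model for tags and the two kinds of skim (pointer vs. deep); the class and field names are assumptions chosen only to make the distinction concrete, not anything taken from the actual BaBar software.

```python
# Illustrative data model for tags and skims as described above; field and
# class names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tag:
    event_number: int
    summary: bytes                 # ~512 bytes of "interesting characteristics"
    raw_pointer: str               # where the raw event lives in storage

@dataclass
class Skim:
    name: str
    tags: List[Tag]                        # always present: the selected tags
    raw_events: Optional[dict] = None      # deep skim only: event_number -> raw bytes

def build_pointer_skim(name, all_tags, predicate):
    """Run a selection algorithm over the tag data and keep only pointers."""
    return Skim(name=name, tags=[t for t in all_tags if predicate(t)])

def build_deep_skim(name, all_tags, predicate, read_raw):
    """Same selection, but also pull the raw events (e.g. for export offsite)."""
    tags = [t for t in all_tags if predicate(t)]
    return Skim(name=name, tags=tags,
                raw_events={t.event_number: read_raw(t.raw_pointer) for t in tags})
```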

Present Architecture (needs check)
- Skims: go through all tags (sometimes micros, but never raw events) on disk to make a collection of tags (which is a skim)
- The client broadcasts to the xrootd servers the first event to be fetched in a specific skim (see the sketch below)
  - If no xrootd server responds (i.e. time-out), then go to tape and put the event into an xrootd server
  - So the disks get populated with events which have been requested
  - If another xrootd server has the event, it is copied over
- Still read ~2-6 Tbytes/day from tape?
- Does the xrootd server really know what it has on its disks? Is the redirector involved for every event, to find where the event is from the event number the client supplies?
- What if a disk is full? How is new space recovered?
- Seems an inefficient use of space, since there are many copies
[Diagram: 40 sets (to be extended to 80 sets) of 4.8 TB mass storage (RAID-5, 3 TB usable) on Fibre-Channel behind xrootd servers; tape; 1-G Ethernet switch; redirector; clients (1, 2, or 4 cores), up to 1,500 cores in ~800 units?]
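A rough sketch of the broadcast-and-time-out lookup just described; the helper names (`has_event`, `disk_used`, `store`, `tape.read`) and the time-out value are assumptions, and the real xrootd client/server protocol differs in detail.

```python
# Sketch of the broadcast/time-out flow described above (hypothetical helper
# names; the real xrootd protocol differs in detail).
import concurrent.futures

LOOKUP_TIMEOUT_S = 5.0   # assumed time-out before falling back to tape

def locate_first_event(event_number, servers, tape):
    """Ask every xrootd server whether it holds the event; on time-out, or if
    nobody has it, stage the event from tape onto one of the servers."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {pool.submit(s.has_event, event_number): s for s in servers}
        try:
            for fut in concurrent.futures.as_completed(futures, timeout=LOOKUP_TIMEOUT_S):
                if fut.result():
                    return futures[fut]          # some server already holds the event
        except concurrent.futures.TimeoutError:
            pass                                 # nobody answered in time
    # No server has the event: go to tape (HPSS) and populate a server's disk.
    target = min(servers, key=lambda s: s.disk_used())
    target.store(event_number, tape.read(event_number))
    return target
```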

Present Architecture (needs check), cont'd
- In the end, all the tags and events a client wants end up on a single xrootd server
  - Why are all of those copied into one place? Because the client (or other clients) reuses them many times later?
- What is the xrootd server really doing?
  - It fetches the event from its disk and delivers it to the client
  - So select clients for a server which use the same skims?
- The client decompresses the event for analysis
[Diagram: same storage/server/client layout as the previous slide, with Fibre-Channel and 1-G Ethernet.]

Goals (really need requirements)
- Reduce the time/overhead to make a new skim
  - Goal: each user can make their own "skim" quickly
- Reduce the time to get data to the client
  - Presently the analysis time and the time to get the next event are not concurrent but sequential (see the sketch below):
    - To get the next event, the client queries all xrootd servers
    - If one has it, it is copied into its (for that analysis) xrootd server
    - If not, its xrootd server gets it from tape
  - The issue is the latency to get data from disk
- Minimize duplicate storage
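As an illustration of the sequential-vs-concurrent point above, here is a minimal double-buffering sketch that prefetches the next event while the current one is analyzed; `fetch_event` and `analyze` are stand-ins for the real I/O and analysis code, not part of the existing system.

```python
# Illustration of the sequential-vs-concurrent point above: prefetch the next
# event while the current one is being analyzed (simple double buffering).
from concurrent.futures import ThreadPoolExecutor

def process_skim(event_numbers, fetch_event, analyze):
    if not event_numbers:
        return
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch_event, event_numbers[0])
        for next_number in event_numbers[1:]:
            event = pending.result()                              # wait for current event
            pending = prefetcher.submit(fetch_event, next_number) # fetch the next one...
            analyze(event)                                        # ...while analyzing this one
        analyze(pending.result())                                 # last event

# With ~35 ms analysis and ~6-10 ms disk latency, overlapping hides most of the
# per-event fetch time instead of adding it to every event.
```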

Option with Skim Builder and Event Server (1)
- The skim builder and the event servers have access to all disks
- The client gives the skim builder the code to build a skim; the results are given back to the client
- The client broadcasts/multicasts to the event servers a request for the events in the list (name of the list and name of the client)
- The event server which best fits the criteria is selected (see the sketch after this slide)
  - Maybe it already has the list, maybe it is idle, maybe whichever has the most events in the list is selected; that is flexible (or, optionally, a specific event server can be locked to a client (direct IO))
- If an event is not in the cache, then after the initial latency of going to disk, get all the data
- The event server may multicast its events so they are delivered to all clients which requested any events in the list
- Clients don't have to get the events in the list in order
- Any client can get events at 100 Mbytes/s, independent of the number of clients
- The event server is a cache box (e.g. 16 Gbytes of memory)
- The pipe between the event servers and the storage is slower, but is averaged out by the cache
- A client could consume ~<50 events/sec; with e.g. 4,000 clients and 16-kbyte events, the average throughput is ~3.6 Gbytes/sec
[Diagram: tape and disk storage attached to a file system; skim builder(s) and event servers connected via Ethernet/PCI-E/etc. (optionally direct IO) to clients (1, 2, or 4 cores), up to 1,500 cores in ~800 units?]
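A minimal sketch of the flexible event-server selection and list multicast described above; the scoring policy, attribute names, and methods are assumptions made for illustration, not a specification of the proposed system.

```python
# Illustrative sketch of the flexible event-server selection described above;
# the scoring policy and attribute names are assumptions.
def select_event_server(event_list, servers, locked_server=None):
    """Pick the event server that best fits the request: a server locked to the
    client wins outright; otherwise prefer servers already caching many of the
    requested events, breaking ties in favor of lightly loaded servers."""
    if locked_server is not None:               # optional direct-IO lock
        return locked_server

    def score(server):
        cached = sum(1 for ev in event_list if ev in server.cache)  # overlap with request
        return (cached, -server.active_requests)                    # prefer idle on ties

    return max(servers, key=score)

def serve_list(server, event_list, subscribers):
    """Multicast each event to every client that asked for any event in the list."""
    for ev in event_list:
        data = server.cache.get(ev) or server.read_from_disk(ev)  # disk only on cache miss
        server.cache[ev] = data
        for client in subscribers:
            client.deliver(ev, data)           # clients need not consume in order
```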

Event Processing Center
[Block diagram: file-system fabric with disks and HPSS; pizza box as skim builder; switch(es); pizza boxes performing in/out protocol conversion; fabric to a sea of cores (event processing nodes).]

Pizza box block diagram
[Block diagram components: PCI Express x16 in and PCI Express out via PLX8532 bridges; PLX8508 switches (x4/x16 links); PPC 405 processor; 1 Gbyte of RLDRAM II; Xilinx XC4VFX40 FPGA.]