STACS: Storage Access Coordination of Tertiary Storage for High Energy Physics Applications
Arie Shoshani, Alex Sim, John Wu, Luis Bernardo*, Henrik Nordberg*, Doron Rotem*
Scientific Data Management Group, Computing Science Directorate, Lawrence Berkeley National Laboratory
* no longer at LBNL

STACS Outline
- Short High Energy Physics overview (of data handling problem)
- Description of the Storage Coordination System
- File tracking
- The Query Estimator (QE)
  - Details of the bit-sliced index
- The Query Monitor
  - coordination of “file bundles”
- The Cache Manager
  - tertiary storage queuing and tape coordination
  - transfer time for query estimation

STACS Optimizing Storage Management for High Energy Physics Applications
[Figure: Data Volumes for planned HENP experiments]
- STAR: Solenoidal Tracker At RHIC
- RHIC: Relativistic Heavy Ion Collider

STACS Particle Detection Systems
[Figures: STAR detector at RHIC; PHENIX at RHIC]

STACS Result of Particle Collision (event)

STACS Typical Scientific Exploration Process
- Generate large amounts of raw data
  - large simulations
  - collect from experiments
- Post-processing of data
  - analyze data (find particles produced, tracks)
  - generate summary data, e.g. momentum, no. of pions, transverse energy
  - Number of properties is large (50-100)
- Analyze data
  - use summary data as guide
  - extract subsets from the large dataset
    - Need to access events based on partial properties specification (range queries), e.g. ((0.1 6000)
  - apply analysis code

STACS Size of Data and Access Patterns
- STAR experiment
  - 10^8 events over 3 years
  - 1-10 MB per event: reconstructed data
  - events organized into 0.1 - 1 GB files
  - 10^15 bytes total size
  - 10^6 files, ~30,000 tapes (30 GB tapes)
- Access patterns
  - Subsets of events are selected by region in high-dimensional property space for analysis
  - 10,000 - 50,000 out of total of 10^8
  - Data is randomly scattered all over the tapes
- Goal: Optimize access from tape systems

STACS EXAMPLE OF EVENT PROPERTY VALUES
I event 1
I N(1) 9965   I N(2) 1192   I N(3) 1704
I Npip(1) 2443   I Npip(2) 551   I Npip(3) 426
I Npim(1) 2480   I Npim(2) 541   I Npim(3) 382
I Nkp(1) 229   I Nkp(2) 30   I Nkp(3) 50
I Nkm(1) 209   I Nkm(2) 23   I Nkm(3) 32
I Np(1) 255   I Np(2) 34   I Np(3) 24
I Npbar(1) 94   I Npbar(2) 12   I Npbar(3) 24
I NSEC(1) 15607   I NSEC(2) 1342
I NSECpip(1) 638   I NSECpip(2) 191
I NSECpim(1) 728   I NSECpim(2) 206
I NSECkp(1) 3   I NSECkp(2) 0
I NSECkm(1) 0   I NSECkm(2) 0
I NSECp(1) 524   I NSECp(2) 244
I NSECpbar(1) 41   I NSECpbar(2) 8
R AVpT(1) 0.325951   R AVpT(2) 0.402098
R AVpTpip(1) 0.300771   R AVpTpip(2) 0.379093
R AVpTpim(1) 0.298997   R AVpTpim(2) 0.375859
R AVpTkp(1) 0.421875   R AVpTkp(2) 0.564385
R AVpTkm(1) 0.435554   R AVpTkm(2) 0.663398
R AVpTp(1) 0.651253   R AVpTp(2) 0.777526
R AVpTpbar(1) 0.399824   R AVpTpbar(2) 0.690237
I NHIGHpT(1) 205   I NHIGHpT(2) 7   I NHIGHpT(3) 1   I NHIGHpT(4) 0   I NHIGHpT(5) 0
54 properties, as many as 10^8 events

STACS Opportunities for optimization
- Prevent / eliminate unwanted queries => query estimation (fast estimation index)
- Read only events qualified for a query from a file (avoid reading irrelevant events) => exact index over all properties
- Share files brought into cache by multiple queries => look ahead for files needed and cache management
- Read files from same tape when possible => coordinating file access from tape

STACS The Storage Access Coordination System (STACS)
[Architecture diagram. Components: User’s Application (open, read, close), Query Estimator (QE), Bit-Sliced index, Query Monitor (QM), Caching Policy Module, Cache Manager (CM), File Catalog (FC), Disk Cache. Labeled flows: query estimation / execution requests, file caching request, file caching, file purging.]

STACS A typical SQL-like Query
SELECT * FROM star_dataset WHERE 500<total_tracks<1000 & energy<3
-- The index will generate a set of files {F6: E4, E17, E44; F13: E6, E8, E32; …; F1036: E503, E3112} that the query needs
-- The files can be returned to the application in any order

STACS File Tracking (1)

STACS File Tracking (2)

STACS File Tracking
[Plot annotations: query1 start, query2 start, query3 start, all 3 queries]

STACS Typical Processing Flow
[Diagram: numbered message flow (steps 1-18) among User Code, Query Object, Query Estimator, Event Iterators, Query Monitor, Policy Module, Cache Manager, Index, File Catalog, and Local Disk. Recoverable step labels: 1 new Query; 3 execute; 4 request; 7 stage; 8 file info; 9 File Caching Request; 10 File Caching; 11 staged; 12 retrieve; 14 release; 15 purge; 16 purge; 17 purged; 18 done. Other labels: Quick Estimate, Execute, Full Estimate, whichFileToCache, FileID ToCache.]

STACS The Storage Access Coordination System (STACS)
[Architecture diagram, repeated: User’s Application, Query Estimator (QE), Bit-Sliced index, Query Monitor (QM), Caching Policy Module, Cache Manager (CM), File Catalog (FC), Disk Cache.]

STACS Bit-Sliced Index: used by Query Estimator
- Index size
  - property space: 10^8 events x 100 properties x 4 bytes = 40 GB
  - index requirements:
    - range queries, e.g. (10 < Np < 20) ^ (0.1 < AVpT < 0.2)
    - number of properties involved is small: 3-5
- Problem
  - how to organize property space index

STACS Indexing over all properties
- Multi-dimensional index methods
  - partitioning MD space (KD-trees, n-QUAD-trees, ...)
  - for high dimensionality, either fanout or tree depth is too large
  - e.g. symmetric n-QUAD-trees require 2^100 fanout
  - non-symmetric solutions are order dependent
[Figures: QUAD-tree, KD-tree]

STACS Partitioning property space
- One possible solution
  - partition property space into subsets
  - e.g. 7 dimensions at a time
- Performance
  - good for non-partial range queries (full hypercube)
  - bad if only a few of the dimensions in each partition are involved in the query
- S. Berchtold, C. Bohm, H. Kriegel, The Pyramid-Technique: Towards Breaking the Curse of Dimensionality, SIGMOD 1998
  - best for non-skewed (random) data
  - best for full hypercube queries
  - for partial range (e.g. 3 out of 100) close to sequential scan

STACS Bit-Sliced Index
- Solution: take advantage of the fact that the index only needs to be append-only
  - partition each property into bins (e.g. for 0<Np<300, have 20 equal size bins)
  - for each bin generate a bit vector
  - compress each bit vector (run length encoding)
[Figure: stacks of bit vectors for property 1, property 2, ..., property n]

STACS Run Length Compression
- Uncompressed: 0000000000001111000000000......0000001000000001111111100000000....000000
- Compressed: 12, 4, 1000, 1, 8, 1000
- Store very short sequences as-is
- Advantage: can perform AND, OR, COUNT operations on compressed data
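
The claim that AND and COUNT can run on compressed data can be illustrated with a small sketch. It assumes one simple encoding, a list of alternating run lengths that always starts with a run of 0s (possibly of length zero); the actual STACS encoding, including how very short sequences are stored as-is, is not given on the slide, and the function names are hypothetical.

```python
def rle_encode(bits):
    """Encode a string of '0'/'1' as alternating run lengths, starting with a run of 0s."""
    runs, current, count = [], "0", 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs


def rle_count(runs):
    """COUNT: number of 1 bits; runs at odd positions are runs of 1s."""
    return sum(runs[1::2])


def rle_and(a, b):
    """AND two run lists of equal total length without decompressing them."""
    out = [0]              # out[-1] is the run being built; even positions hold 0-runs
    ia = ib = 0
    ra, rb = a[0], b[0]    # bits still unread in the current run of a and of b
    while True:
        # Move past exhausted runs.
        while ra == 0 and ia + 1 < len(a):
            ia += 1
            ra = a[ia]
        while rb == 0 and ib + 1 < len(b):
            ib += 1
            rb = b[ib]
        if ra == 0 or rb == 0:
            break
        step = min(ra, rb)
        bit = (ia % 2) & (ib % 2)          # 1 only when both inputs are inside a 1-run
        if bit == (len(out) - 1) % 2:
            out[-1] += step                # same bit value: extend the current run
        else:
            out.append(step)               # bit value flips: start a new run
        ra -= step
        rb -= step
    return out


# The leading part of the slide's example: 12 zeros, 4 ones, 1000 zeros, 1 one.
example = rle_encode("0" * 12 + "1" * 4 + "0" * 1000 + "1")
```

Each loop step consumes a whole stretch over which both inputs are constant, so the cost grows with the number of runs rather than the number of events.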

STACS Bit-Sliced Index Advantages
- space for index very small - can fit in memory
- Need only touch properties involved in queries (vertical partitioning)
- Need only touch bins involved
- Query Estimation in memory only !! (min-max)
[Figure: bit vectors of the bins touched by a query]

STACS Inner Bins vs. Edge Bins
[Figure: two-dimensional range query over Range(x) and Range(y); bins that the query range only partly covers are marked as edge bins]
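
The binning scheme and the min-max estimate can be sketched as follows. This is a minimal illustration rather than the STACS implementation: bit vectors are held as uncompressed Python integers, bins are assumed to be half-open intervals, and all function and variable names are hypothetical. Inner bins contribute events that certainly qualify, edge bins contribute events that only might, which gives a lower and an upper bound on the hit count without reading the property values.

```python
from bisect import bisect_right


def build_bitmaps(values, edges):
    """Return one bit vector (a Python int) per bin; bit i is set if event i is in the bin.

    Bin b is assumed to cover the half-open interval [edges[b], edges[b + 1]).
    """
    bitmaps = [0] * (len(edges) - 1)
    for event, value in enumerate(values):
        b = bisect_right(edges, value) - 1
        if 0 <= b < len(bitmaps):
            bitmaps[b] |= 1 << event
    return bitmaps


def range_condition(bitmaps, edges, lo, hi):
    """Evaluate lo <= value < hi and return (sure, maybe) bit vectors.

    sure:  events in bins lying entirely inside the range (inner bins)
    maybe: events in bins the range only partly covers (edge bins)
    """
    sure = maybe = 0
    for b, bits in enumerate(bitmaps):
        left, right = edges[b], edges[b + 1]
        if lo <= left and right <= hi:
            sure |= bits
        elif left < hi and lo < right:
            maybe |= bits
    return sure, maybe


def count_bounds(conditions):
    """AND several (sure, maybe) conditions and return (min_count, max_count)."""
    if not conditions:
        raise ValueError("at least one condition is required")
    sure_all = possible_all = -1       # -1 acts as an all-ones mask for Python ints
    for sure, maybe in conditions:
        sure_all &= sure
        possible_all &= sure | maybe
    return bin(sure_all).count("1"), bin(possible_all).count("1")


# Six made-up events with two properties, named after the slides' Np and AVpT.
np_edges = [0, 10, 20, 30]
avpt_edges = [0.0, 0.1, 0.2, 0.3, 0.4]
np_bitmaps = build_bitmaps([5, 12, 18, 25, 14, 19], np_edges)
avpt_bitmaps = build_bitmaps([0.05, 0.15, 0.25, 0.12, 0.18, 0.31], avpt_edges)

avpt_cond = range_condition(avpt_bitmaps, avpt_edges, 0.1, 0.2)
aligned = count_bounds([range_condition(np_bitmaps, np_edges, 10, 20), avpt_cond])
with_edge = count_bounds([range_condition(np_bitmaps, np_edges, 12, 20), avpt_cond])
```

When the range boundaries fall on bin boundaries the two bounds coincide; otherwise only the events in edge bins need to be checked against the stored property values to get the exact answer.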

STACS [Figure: query processing with the index. Labels: range conditions; Bit-Sliced index (in memory), ~50-100 MB; events in bin1, events in bin2, ...; edge bins; events in edge bin; vertical partitions of the properties (on disk), ~20-40 GB; event list; output: file list, and events that qualify in each.]

STACS Experimental Results on Index
- Simulated dataset (hijing)
  - 10 million events
  - 70 properties

                       BSI                             Oracle
property space         2.5 GB                          3.5 GB
index size             280 MB (4 MB/property)          7 GB (100 MB/property)
index creation time    3 hours (~2.5 min / property)   47 hours (~40 min / property)

STACS Experimental Results on Index
- Run a “count” query (preliminary)
  - BSI
    - 1 property: 14 - 70 sec (depending on size of range)
    - 2 properties: 90 sec (both about half the range)
    - => linear with # of bins touched
  - Oracle
    - 1 property: comparable (counts only)
    - 2 properties: > 2 hours !
      - use one index, loop on table
    - => Need to tune Oracle
      - run “analyze” on indexes, choose policy
      - bitmap index - did not help
    - => After tuning: 12 min

STACS File Bundles: Multiple Event Components
[Figure: events e1 ... e9, each with a Component A and a Component B (e.g. Component A of event e1, Component B of event e1); files of Component A and files of Component B (File 1 - File 4)]
- File Bundles: (F1,F2: e1,e2,e3,e5), (F3,F2: e4,e7), (F3,F4: e6,e8,e9)
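
One way to read the bundle list above is as a grouping of qualifying events by the set of files that hold their components. The sketch below reproduces the slide's three bundles under an assumed event-to-file mapping (Component A in F1 or F3, Component B in F2 or F4); that mapping is inferred from the bundle list, not stated on the slide, and the names are hypothetical.

```python
from collections import defaultdict


def form_bundles(event_files):
    """Group events by the tuple of files that must be in cache together."""
    bundles = defaultdict(list)
    for event, files in event_files.items():
        bundles[tuple(files)].append(event)
    return dict(bundles)


# ASSUMED mapping: event -> (file holding Component A, file holding Component B).
event_files = {
    "e1": ("F1", "F2"), "e2": ("F1", "F2"), "e3": ("F1", "F2"),
    "e4": ("F3", "F2"), "e5": ("F1", "F2"), "e6": ("F3", "F4"),
    "e7": ("F3", "F2"), "e8": ("F3", "F4"), "e9": ("F3", "F4"),
}
bundles = form_bundles(event_files)
```

A file such as F2 or F3 then appears in more than one bundle, which is what makes the weight-based caching and purging policies worthwhile.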

STACS A typical SQL-like Query for Multiple Components
SELECT Vertices, Raw FROM star_dataset WHERE 500<total_tracks<1000 & energy<3
-- The index will generate a set of bundles {[F7, F16: E4, E17, E44], [F13, F16: E6, E8, E32], …} that the query needs
-- The bundles can be returned to the application in any order
-- Bundle: the set of files that need to be in cache at the same time

STACS File Weight Policy for Managing File Bundles
- File weight (bundle) = 1 if the file appears in the bundle, = 0 otherwise
- Initial file weight = SUM (all bundles for each query) over all queries
- Example:
  - query 1: file F_K appears in 5 bundles
  - query 2: file F_K appears in 3 bundles
  - Then, IFW(F_K) = 8

STACS File Weight Policy for Managing File Bundles (cont’d)
- Dynamic file weight: the file weight for a file in a bundle that was processed is decremented by 1
- Dynamic Bundle Weight

STACS How file weights are used for caching and purging
- Bundle caching policy
  - For each query, in turn, cache the bundle with the most files in cache
  - In case of a tie, select the bundle with the highest weight
  - Ensures that a bundle that includes files needed by other bundles/queries has priority
- File purging policy
  - No file purging occurs until space is needed
  - Purge the file not in use with the smallest weight
  - Ensures that files needed in other bundles stay in cache
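
The weight bookkeeping and the two policies can be sketched in a few lines. The transcript does not preserve the definition of the dynamic bundle weight, so the sketch assumes it is the sum of the current weights of the bundle's files; that choice, the tie-breaking by name, and all function names are assumptions made for illustration.

```python
from collections import Counter


def initial_file_weights(queries):
    """queries: {query_id: [bundle, ...]}, each bundle a tuple of file names.

    A file gains 1 for every bundle it appears in, summed over all queries.
    """
    weights = Counter()
    for bundles in queries.values():
        for bundle in bundles:
            weights.update(set(bundle))
    return weights


def bundle_processed(weights, bundle):
    """Dynamic file weight: decrement each file of a processed bundle by 1."""
    for f in set(bundle):
        weights[f] -= 1


def bundle_weight(weights, bundle):
    # ASSUMPTION: bundle weight taken as the sum of its files' current weights.
    return sum(weights[f] for f in bundle)


def pick_bundle(bundles, cache, weights):
    """Caching policy: most files already in cache, ties broken by highest weight."""
    return max(bundles, key=lambda b: (sum(f in cache for f in b),
                                       bundle_weight(weights, b)))


def pick_victim(cache, in_use, weights):
    """Purging policy: the cached file not in use with the smallest weight."""
    candidates = sorted(f for f in cache if f not in in_use)
    return min(candidates, key=lambda f: weights[f]) if candidates else None


queries = {
    "q1": [("F1", "F2"), ("F3", "F2")],
    "q2": [("F3", "F4"), ("F3", "F2")],
}
weights = initial_file_weights(queries)
next_bundle = pick_bundle(queries["q1"], {"F2"}, weights)
victim = pick_victim({"F1", "F2", "F3"}, {"F3"}, weights)
```

In this example both of q1's bundles have one file in cache, so the tie goes to the bundle whose files are also needed by q2, and the purge candidate is the file that no other bundle needs.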

STACS Other policies
- Pre-fetching policy
  - queries can request pre-fetching of bundles subject to a limit
  - Currently, limit set to two bundles
  - multiple pre-fetching useful for parallel processing
- Query service policy
  - queries serviced in Round Robin fashion
  - queries that have all their bundles cached and are still processing are skipped

STACS Managing the queues
[Figure: Query Queue, Bundle Set, File Set, Files Being Processed, Files in Cache]

STACS File Tracking of Bundles
[Plot annotations: Query 1 starts here; Query 2 starts here; bundle was found in cache; bundle shared by two queries; bundle (3 files) formed, then passed to query]

STACS Summary
- The key to managing bundle caching and purging policies is weight assignment
  - caching - based on bundle weight
  - purging - based on file weight
- Other file weight policies are possible
  - e.g. based on bundle size
  - e.g. based on tape sharing
- Proving which policy is best - a hard problem
  - can test in real system - expensive, need stand alone
  - simulation - too many parameters in query profile can vary: processing time, inter-arrival time, number of drives, size of cache, etc.
  - model with a system of queues - hard to model policies
  - we are working on the last two methods

STACS Queuing File Transfers
- The number of PFTPs to HPSS is limited
  - limit set by a parameter - NoPFTP
  - parameter can be changed dynamically
- CM is multi-threaded
  - issues and monitors multiple PFTPs in parallel
- All requests beyond the PFTP limit are queued
- File Catalog used to provide for each file
  - HPSS path/file_name
  - Disk cache path/file_name
  - File size
  - tape ID

STACS File Queue Management
- Goal
  - minimize tape mounts
  - still respect the order of requests
  - do not postpone unpopular tapes forever
- File clustering parameter - FCP
  - If the file at top of queue is in Tape i and FCP > 1 (e.g. 5) then up to 4 files from Tape i will be selected to be transferred next
  - then, go back to file at top of queue
- Parameter can be set dynamically
[Figure: file queue with F(Ti) at the top; positions 1 2 3 4 5]

STACS File Queue Management
- Goal
  - minimize tape mounts
  - still respect the order of requests
  - do not postpone unpopular tapes forever
- File clustering parameter - FCP
  - If the file at top of queue is in Tape i and FCP > 1 (e.g. 4) then up to 4 files from Tape i will be selected to be transferred next
  - then, go back to file at top of queue
- Parameter can be set dynamically
[Figure: file queue containing F1(Ti), F3(Ti), F2(Ti), F4(Ti); order of file service 1 2 3 4 5]
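
The clustering rule can be sketched as a reordering of the request queue. The slides leave open whether FCP counts the file at the top of the queue; the sketch assumes that it does, so the head file is followed by up to FCP - 1 later requests from the same tape. The function name and the (file, tape) representation are hypothetical.

```python
def service_order(queue, fcp):
    """Return the order in which queued files are served.

    queue: list of (file_name, tape_id) pairs in request order.
    fcp:   file clustering parameter; ASSUMED to count the head file, so up to
           fcp - 1 later requests on the same tape are pulled forward.
    """
    pending = list(queue)
    order = []
    while pending:
        head_file, head_tape = pending.pop(0)      # always serve the oldest request first
        order.append(head_file)
        extra = [item for item in pending if item[1] == head_tape][: max(fcp - 1, 0)]
        for item in extra:
            pending.remove(item)
            order.append(item[0])
    return order


requests = [("A", "T1"), ("B", "T2"), ("C", "T1"),
            ("D", "T3"), ("E", "T1"), ("F", "T2")]
plain_fifo = service_order(requests, 1)
clustered = service_order(requests, 3)
```

Because the loop always returns to the oldest pending request after each cluster, a rarely requested tape is delayed by at most one cluster per request ahead of it rather than postponed indefinitely.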

STACS File Caching Order for different File Clustering Parameters File Clustering Parameter = 1 File Clustering Parameter = 10

STACS Transfer Rate (Tr) Estimates
- Need Tr to estimate total time of a query
- Tr is the average over recent file transfers, measured from the time the PFTP request is made to the time the transfer completes. This includes:
  - mount time, seek time, read to HPSS RAID, transfer to local cache over network
- For dynamic network speed estimate
  - check total bytes for all files being transferred over small intervals (e.g. 15 sec)
  - calculate moving average over n intervals (e.g. 10 intervals)
- Using this, actual time in HPSS can be estimated
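
The dynamic network speed estimate amounts to a moving average over per-interval byte counts. A minimal sketch, assuming the caller samples the cumulative number of bytes received across all active transfers once per interval; the class and method names are hypothetical.

```python
from collections import deque


class NetworkSpeedEstimator:
    """Moving average of bytes transferred per interval over the last n intervals."""

    def __init__(self, interval_sec=15.0, n_intervals=10):
        self.interval_sec = interval_sec
        self.deltas = deque(maxlen=n_intervals)   # oldest interval drops out automatically
        self.last_total = None

    def sample(self, total_bytes):
        """Record the cumulative byte count observed at the end of an interval."""
        if self.last_total is not None:
            self.deltas.append(total_bytes - self.last_total)
        self.last_total = total_bytes

    def bytes_per_sec(self):
        """Average speed over the retained intervals, or None before two samples."""
        if not self.deltas:
            return None
        return sum(self.deltas) / (len(self.deltas) * self.interval_sec)


estimator = NetworkSpeedEstimator(interval_sec=15.0, n_intervals=3)
for total in (0, 150, 450, 600, 1500):
    estimator.sample(total)
speed = estimator.bytes_per_sec()
```

Subtracting the time implied by this network speed from the end-to-end transfer time is one way the time spent inside HPSS could then be estimated.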

STACS Dynamic Display of Various Measurements

STACS Query Estimate
- Given: transfer rate Tr.
- Given a query for which:
  - X files are in cache
  - Y files are in the queue
  - Z files are not scheduled yet
- Let s(file_set) be the total byte size of all files in file_set
- If Z = 0, then
  - QuEst = s(Y’) / Tr
- If Z ≠ 0, then
  - QuEst = (s(T) + q.s(Z)) / Tr, where q is the number of active queries
[Figure: file queue with entries F1(Y), F3(Y), F2(Y), F4(Y); label T]
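
The two cases of the estimate can be written as a small function. The transcript does not define the sets Y’ and T (they appear to come from the lost figure), so the sketch takes their file sizes as inputs rather than deriving them; the function and parameter names are hypothetical.

```python
def query_estimate(tr, y_prime_sizes, t_sizes, z_sizes, q):
    """Estimated remaining time of a query, in seconds.

    tr:            transfer rate Tr in bytes per second
    y_prime_sizes: byte sizes of the files in the set Y' (not defined in the transcript)
    t_sizes:       byte sizes of the files in the set T (not defined in the transcript)
    z_sizes:       byte sizes of the query's files that are not scheduled yet
    q:             number of active queries
    """
    if tr <= 0:
        raise ValueError("transfer rate must be positive")
    if not z_sizes:                                   # Z is empty
        return sum(y_prime_sizes) / tr
    return (sum(t_sizes) + q * sum(z_sizes)) / tr     # unscheduled bytes scaled by q


all_scheduled = query_estimate(10_000_000, [1_000_000_000, 500_000_000], [], [], q=4)
some_unscheduled = query_estimate(10_000_000, [], [1_000_000_000, 1_000_000_000],
                                  [500_000_000], q=4)
```

Scaling s(Z) by q reflects that the unscheduled files will have to share the transfer queue with the requests of the other active queries.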

STACS Reason for q.s(Z)
[Plots: 20 queries of length ~20 minutes launched 20 minutes apart: estimate pretty close. 20 queries of length ~20 minutes launched 5 minutes apart: estimate bad, requests accumulate in queue.]

STACS Error Handling
- 5 generic errors
  - file not found
    - return error to caller
  - limit PFTP reached, can’t login
    - re-queue request, try later (1-2 min)
  - HPSS error (I/O, device busy)
    - remove part of file from cache, re-queue
    - try n times (e.g. 3), then return error “transfer_failed”
  - HPSS down
    - re-queue request, try repeatedly till successful
    - respond to File_status request with “HPSS_down”
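
The five cases map naturally onto a small decision function. The sketch below only encodes the decisions listed on the slide; the enum, the action names, and the default delay of 90 seconds (chosen inside the stated 1-2 minute window) are hypothetical and not part of any STACS or HPSS API.

```python
from enum import Enum


class TransferError(Enum):
    FILE_NOT_FOUND = 1
    PFTP_LIMIT_REACHED = 2
    CANNOT_LOGIN = 3
    HPSS_ERROR = 4          # I/O error, device busy
    HPSS_DOWN = 5


def next_action(error, attempts, max_tries=3, retry_delay_sec=90):
    """Return (action, detail) for a failed transfer.

    attempts: number of tries made so far, including the one that just failed.
    """
    if error is TransferError.FILE_NOT_FOUND:
        return ("return_error", "file_not_found")
    if error in (TransferError.PFTP_LIMIT_REACHED, TransferError.CANNOT_LOGIN):
        return ("requeue_later", retry_delay_sec)
    if error is TransferError.HPSS_ERROR:
        if attempts >= max_tries:
            return ("return_error", "transfer_failed")
        return ("purge_partial_file_and_requeue", attempts)
    if error is TransferError.HPSS_DOWN:
        # Keep retrying; status requests are answered with "HPSS_down" meanwhile.
        return ("requeue_until_success", "HPSS_down")
    raise ValueError(f"unknown error: {error!r}")
```

Keeping the policy in one function like this is what lets the caller see only two outcomes, a delay or a final error, regardless of which transient failure occurred.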

STACS Summary
- HPSS Hierarchical Resource Manager (HRM)
  - insulates applications from transient HPSS and network errors
  - limits concurrent PFTPs to HPSS
  - manages queue to minimize tape mounts
  - provides file/query time estimates
  - handles errors in a generic way
- Same API can be used for any MSS, such as Unitree, Enstore, etc.

STACS Web pointers
- http://gizmo.lbl.gov/stacs
- http://gizmo.lbl.gov/~arie/download.papers.html -- to download papers
- http://gizmo.lbl.gov/stacs/stacs.slides/index.htm -- a STACS presentation
