Data Clustering Research in CMS. Koen Holtman, CERN/CMS and Eindhoven University of Technology. CHEP 2000, Feb 7-11, 2000.

Introduction
3-year Ph.D. project on ‘prototyping of CMS storage management’
– Focus on disk/tape based physics analysis, objects >= 1 KB
– Focus on scalability and (re)clustering
Clustering: placement of object data on physical storage media (disk, tape)
Reclustering: rearranging the clustering
– Clustering and reclustering are not specific to Objectivity or object databases

I/O risks for LHC 1
Risk of insufficient scalability
– I/O scalability issues have been studied (240 clients, 172 MB/s)
– Structure data into chunks (runs), use chunk-level subjobs
– ‘Private’ DBs
– Need a read-ahead optimization: ‘bursty sequential reading’

I/O risks for LHC 2
Risk of insufficient I/O performance
– MB/s needs for interactive physics analysis???
– MB/s in 2005: 50-200 GB/s sequential I/O on CERN CMS disk farm
– Random I/O can be a factor 10-100 slower!
– Well understood
Clustering is important
– Subdetector clustering
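The gap between sequential and random I/O can be illustrated with a back-of-the-envelope model. The sketch below assumes that every randomly read object costs one seek plus its own transfer time. The function name and the disk parameters (20 MB/s, 10 ms per seek) are hypothetical and are not taken from the talk.

```python
def random_read_slowdown(object_size_bytes, seq_rate_bytes_per_s, seek_time_s):
    """Return sequential throughput divided by random-read throughput.

    Assumed model: every randomly read object costs one seek plus its transfer time.
    """
    transfer_time_s = object_size_bytes / seq_rate_bytes_per_s
    return (seek_time_s + transfer_time_s) / transfer_time_s


# Hypothetical disk: 20 MB/s sequential rate, 10 ms per seek (illustrative values only).
for size in (1_000, 10_000, 100_000, 1_000_000):
    slowdown = random_read_slowdown(size, 20_000_000, 0.010)
    print(f"{size:>9} byte objects: random reading is {slowdown:.1f}x slower")
```

In this model the slowdown shrinks as objects get larger, because the fixed seek cost is spread over more transferred bytes. Real disks and file systems add effects (caching, read-ahead, request scheduling) that the model ignores.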

HEP problem
Main HEP problem: increasing selectivity over time degrades performance to that of random reading
Well understood by now
Solution: recluster ‘by hand’ (DSTs)
– By hand is good enough??
Issues: consistency, # of users, space, effort, on-demand reconstruction
Research on automatic reclustering

Disk reclustering
Developed on-the-fly + batch reclustering
– Dynamically recluster data based on observing new access patterns
– Implemented as ‘object store’ class
Keeps I/O efficiency on disk good enough, automatically
Supports on-demand reconstruction
Scaling...
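The idea of reclustering after observing an access pattern can be sketched with a toy model. This is not the object store class from the project. `ToyObjectStore`, `seeks_for`, and `recluster` are hypothetical names, and the model assumes that each contiguous run of wanted objects costs exactly one seek.

```python
class ToyObjectStore:
    """Toy model of an object store; the list order stands for placement on disk."""

    def __init__(self, object_ids):
        self.layout = list(object_ids)

    def seeks_for(self, wanted):
        """Count contiguous runs of wanted objects (one seek per run, by assumption)."""
        wanted = set(wanted)
        seeks = 0
        previous_was_read = False
        for oid in self.layout:
            is_read = oid in wanted
            if is_read and not previous_was_read:
                seeks += 1
            previous_was_read = is_read
        return seeks

    def recluster(self, wanted):
        """Move the observed selection to the front, keeping relative order."""
        wanted = set(wanted)
        hot = [oid for oid in self.layout if oid in wanted]
        cold = [oid for oid in self.layout if oid not in wanted]
        self.layout = hot + cold


store = ToyObjectStore(range(10))
selection = [1, 4, 7]              # a sparse selection, as produced by physics cuts
before = store.seeks_for(selection)
store.recluster(selection)
after = store.seeks_for(selection)
print(before, after)               # prints "3 1"
```

A job that repeats the same selection after reclustering reads one contiguous run instead of three scattered objects. A real system would also have to weigh the cost of rewriting data and the extra disk space against this gain.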

Tape (re)clustering 1
Clustering on tape: HENP GC
Cache filtering and chunk reclustering in a multiuser analysis system with disk and tape

Tape (re)clustering 2
Cache filtering yields factor 1-50 performance gain depending on workload parameters
Compensates to some extent for low clustering efficiency on tape
Chunk reclustering does not seem attractive: performance gains only for very small disk farm sizes
So risks remain large
Extension path...
[Figure: with cache filtering vs. no cache filtering]
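The slides do not define cache filtering in detail. The sketch below assumes it means that, when a chunk is staged from tape, only the objects a job actually selects are kept in the disk cache. The function `staged_bytes` and all sizes are hypothetical and only illustrate why the gain would depend on workload parameters such as selectivity.

```python
def staged_bytes(chunk_sizes, selected, filtering):
    """Bytes placed in the disk cache when one chunk is staged from tape.

    chunk_sizes maps object id -> size in bytes; selected is the set of wanted ids.
    Assumption: with filtering, only the selected objects are kept on disk.
    """
    if filtering:
        return sum(size for oid, size in chunk_sizes.items() if oid in selected)
    return sum(chunk_sizes.values())


# Hypothetical chunk: 100 objects of 1000 bytes each, 5 of them selected.
chunk = {oid: 1000 for oid in range(100)}
selected = {0, 10, 20, 30, 40}
print(staged_bytes(chunk, selected, filtering=False))  # prints 100000
print(staged_bytes(chunk, selected, filtering=True))   # prints 5000
```

Under this assumption a sparse selection leaves far more cache space for other chunks, while a job that selects every object gains nothing.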

Conclusions
Existing practice:
– Chunks/runs, subjobs, sequential access, subdetector-based clustering
In this project:
– Validated existing practice, detailed investigation of disk performance, scalability, read-ahead, disk reclustering, disk+tape system with cache filtering
Remaining risks:
– Don’t know how much I/O is needed, clustering efficiency on tape, WAN issues
– To investigate systems with large caching effects: access patterns needed
Design for large parameter space through simulation

Access patterns
Never know enough about access patterns
– Known: object sizes, increasing selectivity, full reconstruction
– We don’t know much about: user-level physics analysis
In systems with large caching effects, these parameters have a large effect on performance
Performance of a tape+disk based analysis system for various workload parameters
– Strategy: design over large parameter space (simulation)
– Strategy: investigate parameters and their importance
