Event Metadata Records as a Testbed for Scalable Data Mining
David Malon, Peter van Gemmeren (Argonne National Laboratory)


1. Abstract

At a data rate of 200 hertz, event metadata records ("TAGs," in ATLAS parlance) provide fertile ground for the development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise "data mining," but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
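The abstract's claim that a fixed, relatively simple TAG schema makes export to HDF5 straightforward can be illustrated with a minimal sketch. The sketch below uses Python with NumPy and h5py; the attribute names (RunNumber, EventNumber, NLooseMuon, MissingET), the file name tags.h5, and the example rows are assumptions for illustration only, not the actual ATLAS TAG schema or export code.

```python
# Minimal sketch: export flat TAG-like records to an HDF5 table.
# Attribute names and values are illustrative, not the real ATLAS TAG schema.
import numpy as np
import h5py

# A fixed, simple schema: one compound row type per event record.
tag_dtype = np.dtype([
    ("RunNumber",   np.uint32),
    ("EventNumber", np.uint64),
    ("NLooseMuon",  np.int32),
    ("MissingET",   np.float32),
])

# Stand-in for rows read from a POOL/ROOT TAG collection.
rows = np.array([
    (105200, 1, 2, 34.7),
    (105200, 2, 0, 12.1),
    (105200, 3, 1, 88.3),
], dtype=tag_dtype)

with h5py.File("tags.h5", "w") as f:
    # Chunked, extendable dataset so further events can be appended later.
    dset = f.create_dataset("tags", data=rows, maxshape=(None,), chunks=True)
    dset.attrs["source"] = "illustrative TAG export sketch"
```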
2. Details

[Figure 1: Data flow at ATLAS: Raw Data → Event Filter → Reconstruction → ESD → AOD merging → AOD → TAG production and TAG extract/copy → primary DPD making (D1PD ... DNPD), spanning Tier 0, Tier 1, and Tier 2.]

TAG scale is attractive as a data mining testbed
- 2 x 10^9 events per year (real data), plus an equivalent amount of Monte Carlo
  - more than 0.5 x 10^9 TAGs from Monte Carlo are expected within the next months
- also millions of TAGs from cosmic ray commissioning
  - their content is simpler and smaller than that of physics TAGs

Standard TAG processing provides opportunities for mining
Example: one-pass algorithms (a sketch follows this section)
- learn from the data as it goes by; no iterative processing
  - distributional characteristics, patterns
- can detect outliers and anomalies, and shifts in machine or detector behavior
  - not for discovery physics, but we have already detected software bugs by anomaly detection in TAG data
  - this is how network traffic is mined
- TAG processing provides several loci for nonintrusive insertion of stream mining tools
  - e.g., TAG upload or transfer

TAG content and mining potential
Example: association rules
- physics data contain known associations, e.g., between trigger and physics content
- testbed and tool validation: what associations are found by statistical tools without domain-specific knowledge?
Example: clustering (a k-means sketch also follows this section)
- we already cluster data when we do streaming, e.g., by trigger
- testbed and tool validation: if we treat, e.g., a subset of event attributes (say, N physics attributes, no trigger information) as points in N-dimensional space, what clusters are found by statistical tools?

First steps: a neutral data format (HDF5)
- the ATLAS TAG format is relatively simple and readily mapped to other technologies
  - a few TAG attributes depend upon run-dependent encoding, though, so using every attribute for potential mining poses challenges
  - but mining every attribute is not essential for most purposes
- current TAG storage technologies in ATLAS are ROOT via POOL Collections, Oracle via POOL, and MySQL via POOL (limited)
  - not well suited to off-the-shelf mining tools, or to porting to high-performance hardware
- HDF5 is a technology suite that makes possible the management of extremely large and complex data collections
- some data mining tools, both open-source and proprietary, can read HDF5 (e.g., Rattle, MathWorks)

TAG mining and BlueGene/P
- with TAGs in HDF5, we do not need to port LCG or ATLAS software to BlueGene in order to begin

First steps: statistical tools for distributional characteristics, outliers, anomalies
- run on each node, processing disjoint subsets of the data
  - no inter-process communication until an aggregation phase at the end
  - the aggregation stage can be done hierarchically, e.g., as a binary tree

Next steps: use large-scale parallelism for mining problems that have a large search space
- e.g., O(N^2) possible pair-wise associations among N attributes
- each node may process the entire sample
- the parallel I/O challenges are different from those for partitioned samples

Later steps: address mining problems that require inter-process communication in parallel implementations
- iterative process-exchange-process-exchange models, e.g., clustering and classification

We are only beginning.

[Figure 2: K-means clustering example.]
[Figure 3: BlueGene/P.]
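As a concrete illustration of the one-pass processing and end-of-job aggregation described above, here is a minimal sketch in Python. It reuses the illustrative tags.h5 file and MissingET attribute from the export sketch; the Welford-style running statistics and the three-sigma outlier cut are generic stand-ins for whatever statistical tools are actually used, not the authors' implementation.

```python
# Sketch: one-pass (streaming) statistics over a TAG attribute stored in HDF5,
# with per-node partial summaries merged in a final aggregation step.
# File name, dataset name, and attribute name are assumptions for illustration.
import h5py

class RunningStats:
    """Welford-style running mean/variance that supports merging partial results."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial summaries (can be arranged as a binary tree).
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        return self

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

def process_chunk(values):
    """One worker/node processes a disjoint chunk; no communication needed."""
    stats = RunningStats()
    for x in values:
        stats.update(float(x))
    return stats

with h5py.File("tags.h5", "r") as f:
    met = f["tags"]["MissingET"][:]  # assumed attribute name

# Pretend two nodes each saw half the data; merge the summaries at the end.
half = len(met) // 2
total = process_chunk(met[:half]).merge(process_chunk(met[half:]))

# Flag simple outliers: events far from the aggregated mean.
outliers = [x for x in met if abs(x - total.mean) > 3 * total.std]
print(f"mean={total.mean:.2f}  std={total.std:.2f}  outliers={len(outliers)}")
```

And, as a counterpart to the k-means clustering example (Figure 2), a sketch of treating a small subset of event attributes as points in N-dimensional space and asking an off-the-shelf tool what clusters it finds. scikit-learn's KMeans and the two chosen attributes are assumptions for illustration, not the tool or attribute set used by the authors.

```python
# Sketch: cluster events on N physics attributes (no trigger information)
# with an off-the-shelf k-means implementation; attribute choice is illustrative.
import h5py
import numpy as np
from sklearn.cluster import KMeans

with h5py.File("tags.h5", "r") as f:
    tags = f["tags"][:]

# Treat selected attributes as points in N-dimensional space (here N = 2).
points = np.column_stack([tags["MissingET"], tags["NLooseMuon"]]).astype(float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster sizes:", np.bincount(labels))
```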

