Event Metadata Records as a Testbed for Scalable Data Mining
David Malon, Peter van Gemmeren (Argonne National Laboratory)


1. Abstract

At a data rate of 200 hertz, event metadata records ("TAGs," in ATLAS parlance) provide fertile ground for the development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise "data mining," but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software.

A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities.

There is, further, a substantial literature on one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
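Because the TAG schema is fixed and flat, the export to HDF5 described above amounts to mapping each record onto a compound row type. A minimal sketch with h5py and NumPy; the attribute names (run_number, event_number, n_electrons, missing_et) are illustrative stand-ins, not the actual ATLAS TAG schema:

```python
# Sketch: exporting fixed-schema TAG-like records to a single HDF5 dataset.
# Attribute names and types here are hypothetical placeholders.
import numpy as np
import h5py

tag_dtype = np.dtype([
    ("run_number",   np.uint32),
    ("event_number", np.uint64),
    ("n_electrons",  np.uint16),
    ("missing_et",   np.float32),  # GeV
])

def export_tags(records, path):
    """Write an iterable of TAG tuples to one flat, chunked HDF5 dataset."""
    data = np.array(list(records), dtype=tag_dtype)
    with h5py.File(path, "w") as f:
        # A single chunked table keeps later partitioning across nodes simple.
        f.create_dataset("tags", data=data, chunks=True)
    return len(data)

def read_tags(path):
    """Read the full TAG table back as a NumPy structured array."""
    with h5py.File(path, "r") as f:
        return f["tags"][:]
```

A flat table like this is exactly what generic, domain-neutral mining tools expect, and it carries no dependence on LCG or ATLAS software.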
2. Details

[Figure 1: data flow at ATLAS — Raw Data → Event Filter → Reconstruction → ESD → AOD merging → AOD → TAG production and primary DPD making (D1PD ... DNPD), with TAG extract/copy, spanning Tier 0, Tier 1, and Tier 2.]

TAG scale is attractive as a data mining testbed:
- 2 x 10^9 events per year (real data), plus an equivalent amount of Monte Carlo; more than 0.5 x 10^9 TAGs from Monte Carlo are already expected in the next months
- also millions of TAGs from cosmic ray commissioning, whose content is simpler and smaller than that of physics TAGs

Standard TAG processing provides opportunities for mining.

Example: one-pass algorithms
- learn from data as they go by; no iterative processing
- capture distributional characteristics and patterns; can detect outliers and anomalies, and shifts in machine or detector behavior
- not for discovery physics, but we have already detected software bugs by anomaly detection in TAG data; this is how network traffic is mined
- TAG processing provides several loci for nonintrusive insertion of stream mining tools, e.g., TAG upload or transfer

TAG content and mining potential

Example: association rules
- physics data contain known associations, e.g., between trigger and physics content
- testbed and tool validation: what associations are found by statistical tools without domain-specific knowledge?

Example: clustering
- we already cluster data when we do streaming, e.g., by trigger
- testbed and tool validation: if we treat, e.g., a subset of event attributes (say, N physics attributes, no trigger information) as points in N-dimensional space, what clusters are found by statistical tools?
First steps: a neutral data format (HDF5)
- The ATLAS TAG format is relatively simple and readily mapped to other technologies. A few TAG attributes depend upon run-dependent encoding, though, so using every attribute for potential mining poses challenges; but mining every attribute is not essential for most purposes.
- Current TAG storage technologies in ATLAS are ROOT via POOL Collections, Oracle via POOL, and MySQL via POOL (limited). These are not well suited to the use of off-the-shelf mining tools, or to porting to high-performance hardware.
- HDF5 is a technology suite that makes possible the management of extremely large and complex data collections.
- Some data mining tools, both open-source and proprietary, can read HDF5 (e.g., Rattle, MathWorks tools).

TAG mining and BlueGene/P
With TAGs in HDF5, we do not need to port LCG or ATLAS software to BlueGene in order to begin.
- First steps: statistical tools for distributional characteristics, outliers, and anomalies. Run on each node, processing disjoint subsets of the data, with no inter-process communication until an aggregation phase at the end; the aggregation stage can be done hierarchically, e.g., as a binary tree.
- Next steps: use large-scale parallelism for mining problems that have a large search space, e.g., the O(N^2) possible pair-wise associations among N attributes. Here each node may process the entire sample, so the parallel I/O challenges differ from those for partitioned samples.
- Later steps: address mining problems whose parallel implementations require inter-process communication, i.e., iterative process-exchange models such as clustering and classification.

We are only beginning.

Figure 1: Data flow at ATLAS. Figure 2: K-means clustering example. Figure 3: BlueGene/P.
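The "disjoint subsets, then hierarchical aggregation" pattern can be sketched in a few lines: each node scans its own partition once, and the partial results are merged pairwise up a binary tree using the exact parallel-variance combination rule. Shown single-process for clarity; on BlueGene/P the merges would be message exchanges:

```python
# Sketch of per-node statistics with binary-tree aggregation, as described
# above. Single-process stand-in for the parallel pattern.

def partial_stats(values):
    """Per-node pass: (count, mean, sum of squared deviations)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return (n, mean, m2)

def merge(a, b):
    """Combine two partial results exactly, without re-scanning the data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

def tree_reduce(parts):
    """Binary-tree aggregation of per-partition statistics."""
    while len(parts) > 1:
        parts = [merge(parts[i], parts[i + 1]) if i + 1 < len(parts)
                 else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]
```

Because `merge` is exact, the tree-reduced result matches a single global pass, and the depth of the aggregation phase grows only logarithmically in the number of nodes.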