NIST Big Data Public Working Group Overview


NIST Big Data Public Working Group Overview
Volume 3, Use Cases and General Requirements
Geoffrey Fox, Indiana University
Piyush Mehrotra, NASA Ames
NIST Campus, Gaithersburg, Maryland, June 1, 2017

Volume 3, Use Cases and General Requirements: Document Scope
- Version 1 collected 51 big data use cases with a 26-feature template and used this to extract requirements to feed into the NIST Big Data Reference Architecture
- The version 2 template merges the version 1 General and Security & Privacy use case analyses
- Discussion at the first NIST Big Data meeting identified the need for patterns, which were proposed during version 2 work; the version 2 template incorporates new questions to help identify patterns
- Work with Vol 4 (SnP), Vol 6 (Big Data Reference Architecture), Vol 7 (Standards), and Vol 8 (Interfaces)

Volume 3, Use Cases and General Requirements: Version 1 Overview
- Gathered and evaluated 51 use cases from nine application domains
- Gathered input regarding Big Data requirements
- Analyzed and prioritized a list of challenging use-case-specific requirements that may delay or prevent adoption of Big Data deployments
- Developed a comprehensive list of generalized Big Data requirements
- Developed a set of features that characterized applications, used to compare different Big Data problems
- Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NBDRA

51 Detailed Use Cases: Version 1, Contributed July-September 2013
http://bigdatawg.nist.gov/usecases.php, 26 features recorded for each use case
- Government Operation (4): National Archives and Records Administration, Census Bureau
- Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
- Defense (3): Sensors, Image Surveillance, Situation Assessment
- Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
- Deep Learning and Social Media (6): Driving Car, Geolocating Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
- The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light Source Data
- Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
- Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
- Energy (1): Smart Grid

Version 1 Use Case Template
- Agreed in this form August 11, 2013
- Some clarification on Veracity vs. Data Quality added
- Request for a picture and summary: done by hand for version 1 but included in the version 2 template
- Early version 1 use cases did a detailed breakup of the workflow into multiple stages, which we want to restore but do not yet have an agreed format for

Size of Requirements Analysis
35 general requirements; 437 specific requirements (8.6 per use case, 12.5 per general requirement)
- Data Sources: 3 general, 78 specific
- Transformation: 4 general, 60 specific
- Capability (Infrastructure): 6 general, 133 specific
- Data Consumer: 6 general, 55 specific
- Security & Privacy: 2 general, 45 specific
- Lifecycle: 9 general, 43 specific
- Other: 5 general, 23 specific
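The per-use-case and per-requirement averages follow directly from the category totals above; a quick arithmetic check (category counts copied from this slide):

```python
# Specific-requirement counts per category, as listed on the slide.
specific = {
    "Data Sources": 78,
    "Transformation": 60,
    "Capability (Infrastructure)": 133,
    "Data Consumer": 55,
    "Security & Privacy": 45,
    "Lifecycle": 43,
    "Other": 23,
}
total_specific = sum(specific.values())      # 437 specific requirements
use_cases, general = 51, 35                  # 51 use cases, 35 general requirements

print(total_specific)                        # 437
print(round(total_specific / use_cases, 1))  # 8.6 per use case
print(round(total_specific / general, 1))    # 12.5 per general requirement
```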

Part of Property Summary Table

Classifying Use Cases into Patterns Labelled by Features
The Big Data Ogres built on the collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were recorded for each application. This information was combined with other studies including the Berkeley dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report. The Ogre analysis led to a set of 50 features, divided into four views, that could be used to categorize and distinguish between applications. The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View, or runtime features. We generalized this approach to integrate Big Data and simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64; these are called convergence diamonds and are split between the same four views.
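In practice this classification amounts to tagging each application with a set of facet features and grouping applications that share facets. A toy sketch (the feature tags are taken from the use-case slides later in this deck; the grouping helper is hypothetical):

```python
# Each use case is tagged with a set of facet features (see "Features of
# 51 Use Cases I/II" slides for the feature definitions).
use_cases = {
    "Census Survey": {"PP", "MRStat", "S/Q", "CF"},
    "Geospatial Analysis": {"PP", "GIS", "Classify", "Streaming"},
    "Genome in a Bottle": {"PP", "MR", "MRIter", "Classify", "Streaming"},
}

def with_facet(facet):
    """Return the use cases that exhibit a given facet, sorted by name."""
    return sorted(name for name, feats in use_cases.items() if facet in feats)

print(with_facet("Streaming"))  # ['Genome in a Bottle', 'Geospatial Analysis']
print(with_facet("PP"))         # all three: PP is near-universal
```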

7 Computational Giants of the NRC Massive Data Analysis Report
G1: Basic Statistics (later termed MRStat, as suitable for a simple MapReduce implementation)
G2: Generalized N-Body Problems
G3: Graph-Theoretic Computations
G4: Linear Algebraic Computations
G5: Optimizations, e.g. Linear Programming
G6: Integration (later called GML, Global Machine Learning)
G7: Alignment Problems, e.g. BLAST
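G1 (MRStat) is the simplest giant: the map stage emits per-partition partial statistics and the reduce stage is a plain commutative aggregation, which is why it fits classic MapReduce so naturally. A minimal illustrative sketch (not from the report) computing a histogram:

```python
from collections import Counter
from functools import reduce

def map_stage(records):
    # Emit a partial histogram (bin -> count) for one data partition.
    return Counter(value // 10 for value in records)

def reduce_stage(partials):
    # Merge partial histograms; the reduction is a simple sum of counts.
    return reduce(lambda a, b: a + b, partials, Counter())

partitions = [[3, 12, 14], [7, 25]]          # two data partitions
histogram = reduce_stage(map_stage(p) for p in partitions)
print(dict(histogram))                       # {0: 2, 1: 2, 2: 1}
```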

Features of 51 Use Cases I
- PP (26): "All" Pleasingly Parallel or Map-Only
- MR (18): Classic MapReduce (add MRStat below for the full count)
- MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical averages such as histograms and means
- MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
- Graph (9): Complex graph data structure needed in analysis
- Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
- Streaming (41): Data comes in incrementally and is processed this way
- Classify (30): Classification: divide data into categories
- S/Q (12): Index, Search and Query

Patterns (Ogres) Modelled on the 13 Berkeley Dwarfs
Dense Linear Algebra; Sparse Linear Algebra; Spectral Methods; N-Body Methods; Structured Grids; Unstructured Grids; MapReduce; Combinational Logic; Graph Traversal; Dynamic Programming; Backtrack and Branch-and-Bound; Graphical Models; Finite State Machines
The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases / kernels / patterns. Note the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model while spectral methods are a numerical method. There is no single comparison criterion, so multiple facets are needed!

Features of 51 Use Cases II
- CF (4): Collaborative Filtering for recommender engines
- LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
- GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can be called EGO, Exascale Global Optimization, with a scalable parallel algorithm
- Workflow (51): Universal
- GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
- HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
- Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
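GML algorithms such as clustering alternate a local (map-like) assignment step with a global (reduce-like) model update, repeated until convergence, which is why they pair with the MRIter rather than the classic MR feature. A toy 1-D k-means sketch illustrating the pattern (purely illustrative, not from the use cases):

```python
def kmeans_1d(points, centers, iters=10):
    """Toy iterative map/reduce: assign locally, update centers globally."""
    for _ in range(iters):
        # "Map": assign each point to its nearest center (local, parallel).
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # "Reduce": recompute centers from all assignments (the global step).
        centers = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centers)

print(kmeans_1d([1.0, 2.0, 10.0, 11.0], centers=[0.0, 5.0]))  # [1.5, 10.5]
```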

Government 3: Census Bureau Statistical Survey Response Improvement (Adaptive Design)
Application: Survey costs are increasing as survey response declines. The goal of this work is to use advanced "recommendation system techniques" that are open and scientifically objective, using data mashed up from several sources and historical survey para-data (administrative data about the survey), to drive operational processes in an effort to increase quality and reduce the cost of field surveys.
Current Approach: About a petabyte of data comes from surveys and other government administrative sources. Data can be streamed, with approximately 150 million records transmitted as field data streamed continuously during the decennial census. All data must be both confidential and secure, and all processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.
Futures: Analytics need to be developed that give statistical estimations with more detail, on a more near-real-time basis, for less cost. The reliability of estimated statistics from such "mashed up" sources still must be evaluated.
Features: PP, MRStat, S/Q, Index, CF. Streaming. Parallelism over government items (from people) and people viewing.

Defense 13: Cloud Large-Scale Geospatial Analysis and Visualization
Application: Need to support large-scale geospatial data analysis and visualization, with the number of geospatially aware sensors and the number of geospatially tagged data sources rapidly increasing.
Current Approach: Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Data types include imagery (various formats such as NITF, GeoTiff, CADRG) and vector data in various formats such as shape files, KML, and text streams. Object types include points, lines, areas, polylines, circles, and ellipses. Data accuracy is very important, with image registration and sensor accuracy relevant. Analytics include closest point of approach, deviation from route, point density over time, PCA, and ICA. Software includes a server with a geospatially enabled RDBMS and geospatial server/analysis software (ESRI ArcServer, Geoserver); visualization by ArcMap or browser-based tools.
Futures: Today's intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects. Critical issues are indexing, retrieval, and distributed analysis; visualization generation and transmission; visualization of data at the end of low-bandwidth wireless connections; data that is sensitive and must be completely secure in transit and at rest (particularly on handhelds); and geospatial data requiring unique approaches to indexing and distributed analysis.
Features: PP, GIS, Classification. Streaming. Parallelism over sensors and people accessing data.

Healthcare and Life Sciences 19: NIST Genome in a Bottle Consortium
Application: The NIST Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as reference materials, and develops methods to use these reference materials to assess the performance of any genome sequencing run.
Current Approach: The ~40 TB of NFS storage at NIST is full; there are also petabytes of genomics data at NIH/NCBI. Open-source sequencing bioinformatics software from academic groups (UNIX-based) is used on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
Futures: DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore's Law. Future data could include other 'omics' measurements, which will be even larger than DNA sequencing. Clouds have been explored.
Features: PP, MR, MRIter, Classification. Streaming. Parallelism over gene fragments at various stages.

Astronomy and Physics 38: Large Survey Data for Cosmology
Application: For DES (the Dark Energy Survey), the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA as well as NERSC for storage and "reduction". Here galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.
Current Approach: Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Infrastructure includes a Linux cluster, Oracle RDBMS server, Postgres PSQL, large-memory machines, standard Linux interactive hosts, and GPFS; HPC resources are used for simulations. Software includes standard astrophysics reduction packages as well as Perl/Python wrapper scripts and Linux cluster scheduling.
Futures: Techniques for handling Cholesky decomposition for thousands of simulations, with matrices of order 1M on a side, and parallel image storage would be important. LSST will generate 60 PB of imaging data and 15 PB of catalog data, and a correspondingly large (or larger) amount of simulation data, at over 20 TB of data per night. The new wide-angle 520-megapixel DECam camera is installed on the Victor M. Blanco Telescope in Chile.
Features: PP, MRIter, Classification. Streaming. Parallelism over stars and images.

Typical Big Data Pattern 2: Perform real-time analytics on data source streams and notify users when specified events occur
Software: Storm (Heron), Kafka, Hbase, Zookeeper
[Diagram: streamed data is fetched and posted to a repository; a user-specified filter identifies events; identified/selected events are posted and archived]
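The essence of pattern 2 can be prototyped without Storm or Kafka: a generator stands in for the stream, a predicate for the user-specified filter, and a callback for the notification. A minimal sketch (all names hypothetical):

```python
def event_filter(stream, predicate, notify):
    """Scan a stream: post every record, notify on records matching the filter."""
    archive, events = [], []
    for record in stream:
        archive.append(record)      # post all data to the repository
        if predicate(record):       # the user-specified filter
            events.append(record)   # archive the identified event
            notify(record)          # real-time notification
    return archive, events

alerts = []
stream = iter([{"temp": 20}, {"temp": 95}, {"temp": 30}])
archive, events = event_filter(stream, lambda r: r["temp"] > 90, alerts.append)
print(len(archive), events)  # 3 [{'temp': 95}]
```

In a production deployment the generator becomes a Kafka topic, the filter a Storm/Heron bolt, and the archive an HBase table, but the data flow is the same.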

Typical Big Data Pattern 5A: Perform interactive analytics on observational scientific data
Software: Grid or Many-Task software, Hadoop, Spark, Giraph, Pig; analysis code, Mahout, R, SPIDAL. Data storage: HDFS, Hbase, file collection; streaming Twitter data for social networking science
[Diagram: scientific data is recorded in the "field", locally accumulated with initial computing, then moved by direct transfer or batch transport to the primary analysis data system]
NIST examples include LHC, remote sensing, astronomy, and bioinformatics

Typical Big Data Pattern 10: Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
This can be used for science by adding data staging phases as in case 5A.
Software: Hadoop, Spark, Giraph, Pig. Data storage: HDFS, Hbase
[Diagram: an orchestration layer (workflow) runs a specified analytics pipeline: Analytic-1 → Analytic-2 → Analytic-3 (Visualize)]
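Stripped of the distributed machinery, pattern 10's orchestration layer is function composition with explicit stages; a data-staging phase (as in case 5A) is just another stage at the front of the pipeline. A minimal sketch (stage names hypothetical):

```python
def run_pipeline(data, stages):
    # Orchestration layer: apply each analytic stage to the previous output.
    for stage in stages:
        data = stage(data)
    return data

stage_data = lambda xs: [x for x in xs if x is not None]  # data staging (case 5A)
analytic_1 = lambda xs: [x * x for x in xs]               # transformation
analytic_2 = lambda xs: sum(xs)                           # aggregation/visual input

print(run_pipeline([1, None, 2, 3], [stage_data, analytic_1, analytic_2]))  # 14
```

A real workflow manager adds what this sketch omits: parallel branches, retries, provenance tracking, and staging data between distributed storage systems.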

Volume 3, Use Cases and General Requirements: Version 2 Opportunities for Contribution
- More use cases (roll up sleeves; budget an hour)
- Soliciting greater application domain diversity: smart cars (Smart X); large-scale utility IoT; geolocation applications involving people; energy, from discovery to generation; scientific studies involving human subjects at large scale; highly distributed use cases bridging multiple enterprises
- Compare different big data applications in needed architecture and interfaces
- Compare simulations and Big Data (use exascale supercomputers to process big data)
- Choose a domain and collect/analyze a set of related use cases; develop technology requirements for applications in that domain
- Feed lessons into version 3 of the template

Volume 3, Use Cases and General Requirements: Possible Version 3 Topics
- Identify gaps in use cases
- Develop plausible, semi-fictionalized use cases from industry reports, white papers, and academic project reports
- Identify important parameters for classifying systems
- Microservice use cases
- Map use cases to work in Vol 4 (SnP), Vol 6 (Big Data Reference Architecture), Vol 7 (Standards), and Vol 8 (Interfaces)
- Container-oriented use cases
- Forensic and provenance-centric use cases
- Review fitness of the BDRA to use cases