Slide 1: NIST Big Data Public Working Group Overview
Volume 3, Use Cases and General Requirements
Geoffrey Fox, Indiana University
Piyush Mehrotra, NASA Ames
NIST Campus, Gaithersburg, Maryland
June 1, 2017
Slide 2: Volume 3, Use Cases and General Requirements: Document Scope
Version 1 collected 51 big data use cases with a 26-feature template and used this to extract requirements to feed into the NIST Big Data Reference Architecture.
The version 2 template merges the version 1 General and Security & Privacy use case analyses.
Discussion at the first NIST Big Data meeting identified a need for patterns, which were proposed during the version 2 work; the version 2 template incorporates new questions to help identify patterns.
Work with Vol. 4 (Security and Privacy), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces).
Slide 3: Volume 3, Use Cases and General Requirements: Version 1 Overview
Gathered and evaluated 51 use cases from nine application domains.
Gathered input regarding Big Data requirements.
Analyzed and prioritized a list of challenging use-case-specific requirements that may delay or prevent adoption of Big Data deployments.
Developed a comprehensive list of generalized Big Data requirements.
Developed a set of features that characterize applications, used to compare different Big Data problems.
Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NBDRA.
Slide 4: 51 Detailed Use Cases, Version 1 (Contributed July-September 2013)
26 features recorded for each use case.
Government Operation (4): National Archives and Records Administration, Census Bureau
Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
Defense (3): Sensors, Image Surveillance, Situation Assessment
Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
Deep Learning and Social Media (6): Driving Car, Geolocate Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
The Ecosystem for Research (4): Metadata, Collaboration, Translation, Light Source Data
Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle II Accelerator in Japan
Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
Energy (1): Smart Grid
Slide 5: Version 1 Use Case Template
The template was agreed in this form in August.
Some clarification on Veracity vs. Data Quality was added.
The request for a picture and summary was handled by hand for version 1 but is included in the version 2 template.
Early version 1 use cases did a detailed breakup of the workflow into multiple stages, which we want to restore but for which there is not yet an agreed format.
Slide 6: Size of Requirements Analysis
35 general requirements; 437 specific requirements (8.6 per use case, 12.5 per general requirement).
Data Sources: 3 general, 78 specific
Transformation: 4 general, 60 specific
Capability (Infrastructure): 6 general, 133 specific
Data Consumer: 6 general, 55 specific
Security & Privacy: 2 general, 45 specific
Lifecycle: 9 general, 43 specific
Other: 5 general, 23 specific
Slide 7: Part of Property Summary Table
Slide 8: Classifying Use Cases into Patterns Labelled by Features
The Big Data Ogres were built on the collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were recorded for each application. This information was combined with other studies, including the Berkeley dwarfs, the NAS Parallel Benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report. The Ogre analysis led to a set of 50 features, divided into four views, that can be used to categorize and distinguish between applications. The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and the Processing View, or runtime features. We generalized this approach to integrate Big Data and simulation applications into a single classification, looking separately at Data and Model, with the total number of facets growing to 64; these are called convergence diamonds and are split between the same four views.
Slide 9: 7 Computational Giants of the NRC Massive Data Analysis Report
G1: Basic Statistics (termed MRStat later, as suitable for a simple MapReduce implementation)
G2: Generalized N-Body Problems
G3: Graph-Theoretic Computations
G4: Linear Algebraic Computations
G5: Optimizations, e.g., Linear Programming
G6: Integration (later called GML, Global Machine Learning)
G7: Alignment Problems, e.g., BLAST
Slide 10: Features of 51 Use Cases I
PP (26): "All" Pleasingly Parallel or Map Only
MR (18): Classic MapReduce (add MRStat below for the full count)
MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
MRIter (23): Iterative MapReduce or MPI (Spark, Twister)
Graph (9): Complex graph data structure needed in analysis
Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
Streaming (41): Data comes in incrementally and is processed this way
Classify (30): Classification: divide data into categories
S/Q (12): Index, Search and Query
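As a concrete illustration of the MRStat pattern (not part of the original slides), the sketch below computes a histogram: a map step emits (bucket, 1) pairs in a pleasingly parallel fashion, and a reduce step performs the simple per-key summation that characterizes MRStat. The sample data and bucket width are invented for the example.

```python
from collections import defaultdict

def map_phase(values, bucket_width=10.0):
    """Map step: emit (bucket, 1) for each value; each record is independent."""
    for v in values:
        yield (int(v // bucket_width), 1)

def reduce_phase(pairs):
    """Reduce step: a simple sum per key, the hallmark of MRStat."""
    counts = defaultdict(int)
    for bucket, count in pairs:
        counts[bucket] += count
    return dict(counts)

if __name__ == "__main__":
    data = [3.2, 14.8, 17.1, 22.5, 9.9, 41.0]   # made-up sample values
    histogram = reduce_phase(map_phase(data))
    print(histogram)   # {0: 2, 1: 2, 2: 1, 4: 1}
```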
Slide 11: Patterns (Ogres) Modelled on the 13 Berkeley Dwarfs
Dense Linear Algebra
Sparse Linear Algebra
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines
The Berkeley dwarfs and the NAS Parallel Benchmarks are perhaps the two best-known approaches to characterizing parallel computing use cases/kernels/patterns. Note that the dwarfs are somewhat inconsistent: for example, MapReduce is a programming model while spectral methods are a numerical method. There is no single comparison criterion, so multiple facets are needed!
Slide 12: Features of 51 Use Cases II
CF (4): Collaborative Filtering for recommender engines
LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, and large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can be called EGO, Exascale Global Optimization, with a scalable parallel algorithm
Workflow (51): Universal
GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
Slide 14: Government 3: Census Bureau Statistical Survey Response Improvement (Adaptive Design)
Application: Survey costs are increasing as survey response declines. The goal of this work is to use advanced "recommendation system techniques" that are open and scientifically objective, using data mashed up from several sources and historical survey paradata (administrative data about the survey), to drive operational processes in an effort to increase quality and reduce the cost of field surveys.
Current Approach: About a petabyte of data coming from surveys and other government administrative sources. Data can be streamed, with approximately 150 million records transmitted as field data streamed continuously during the decennial census. All data must be both confidential and secure, and all processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.
Futures: Analytics need to be developed that give statistical estimations with more detail, on a more near-real-time basis, for less cost. The reliability of estimated statistics from such "mashed up" sources still must be evaluated.
Patterns: PP, MRStat, S/Q, Index, CF. Parallelism over government items (from people) and people viewing.
Slide 15: Defense 13: Cloud Large Scale Geospatial Analysis and Visualization
Application: Need to support large-scale geospatial data analysis and visualization, with the number of geospatially aware sensors and the number of geospatially tagged data sources rapidly increasing.
Current Approach: Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Data types include imagery (in various formats such as NITF, GeoTIFF, CADRG) and vector data in various formats such as shapefiles, KML, and text streams. Object types include points, lines, areas, polylines, circles, and ellipses. Data accuracy is very important, with image registration and sensor accuracy relevant. Analytics include closest point of approach, deviation from route, point density over time, PCA, and ICA. Software includes a server with a geospatially enabled RDBMS and geospatial server/analysis software (ESRI ArcServer, GeoServer); visualization by ArcMap or browser-based visualization.
Futures: Today's intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects. Critical issues are indexing, retrieval, and distributed analysis; visualization generation and transmission; visualization of data at the end of low-bandwidth wireless connections; data that is sensitive and must be completely secure in transit and at rest (particularly on handhelds); and geospatial data's need for unique approaches to indexing and distributed analysis.
Patterns: PP, GIS, Classification, Streaming. Parallelism over sensors and people accessing data.
Slide 16: Healthcare and Life Sciences 19: NIST Genome in a Bottle Consortium
Application: The NIST/Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as Reference Materials, and develops methods to use these Reference Materials to assess the performance of any genome sequencing run.
Current Approach: The ~40 TB of NFS storage at NIST is full; there are also petabytes of genomics data at NIH/NCBI. Open-source sequencing bioinformatics software from academic groups (UNIX-based) is run on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
Futures: DNA sequencers can generate ~300 GB of compressed data per day, a volume that has increased much faster than Moore's Law. Future data could include other 'omics' measurements, which will be even larger than DNA sequencing. Clouds have been explored.
Patterns: PP, MR, MRIter, Classification, Streaming. Parallelism over gene fragments at various stages.
Slide 17: Astronomy and Physics 38: Large Survey Data for Cosmology
Application: For DES (the Dark Energy Survey), the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA as well as NERSC for storage and "reduction". Galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.
Current Approach: Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Infrastructure: Linux cluster, Oracle RDBMS server, PostgreSQL, large-memory machines, standard Linux interactive hosts, GPFS, and HPC resources for simulations. Software: standard astrophysics reduction software as well as Perl/Python wrapper scripts and Linux cluster scheduling.
Futures: Techniques for handling Cholesky decomposition for thousands of simulations, with matrices of order 1 million on a side, and for parallel image storage would be important. LSST will generate 60 PB of imaging data and 15 PB of catalog data, plus a correspondingly large (or larger) amount of simulation data, with over 20 TB of data per night.
(Photo: Victor M. Blanco Telescope, Chile, where the new wide-angle 520-megapixel DECam camera is installed.)
Patterns: PP, MRIter, Classification, Streaming. Parallelism over stars and images.
Slide 18: Typical Big Data Pattern 2
Perform real-time analytics on data source streams and notify users when specified events occur.
Software: Storm (Heron), Kafka, HBase, Zookeeper.
(Diagram: streaming data passes through a filter identifying events; identified events are posted to a repository and archived; users specify the filter, post selected events, and fetch streamed data.)
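As an illustrative sketch of this pattern (not part of the original slides), the code below uses the kafka-python client to consume a stream, apply a user-specified filter, and post matching events to another topic for archiving and notification. The broker address, topic names, and filter predicate are all hypothetical.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BROKER = "localhost:9092"          # hypothetical broker address
IN_TOPIC = "sensor-stream"         # hypothetical input topic
OUT_TOPIC = "identified-events"    # hypothetical output topic

def event_filter(record):
    """User-specified filter: flag readings above a made-up threshold."""
    return record.get("temperature", 0.0) > 75.0

def run_filter():
    consumer = KafkaConsumer(
        IN_TOPIC,
        bootstrap_servers=BROKER,
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )
    # Consume the stream, identify events, and post selected events downstream.
    for message in consumer:
        record = message.value
        if event_filter(record):
            producer.send(OUT_TOPIC, record)

if __name__ == "__main__":
    run_filter()
```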
Slide 19: Typical Big Data Pattern 5A
Perform interactive analytics on observational scientific data.
Software: Grid or Many Task software, Hadoop, Spark, Giraph, Pig; analysis code, Mahout, R, SPIDAL. Data storage: HDFS, HBase, file collections; streaming Twitter data for social networking science.
(Diagram: scientific data is recorded in the "field", accumulated locally with initial computing, then moved by batch transport or direct transfer to the primary analysis data system.)
NIST examples include the LHC, remote sensing, astronomy, and bioinformatics.
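A minimal PySpark sketch of such interactive analytics (an illustration, not from the slides): it loads a batch of observational records already transported to HDFS and computes a simple per-instrument summary. The HDFS path and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical HDFS path holding a transported batch of observations (CSV with header).
DATA_PATH = "hdfs:///data/field_observations/batch_001.csv"

spark = SparkSession.builder.appName("interactive-observation-analytics").getOrCreate()

# Load the batch into a DataFrame; the schema is inferred for this sketch.
df = spark.read.csv(DATA_PATH, header=True, inferSchema=True)

# Example interactive query: per-instrument record counts and mean signal value
# ("instrument" and "signal" are assumed column names).
summary = (
    df.groupBy("instrument")
      .agg(F.count("*").alias("n_records"), F.avg("signal").alias("mean_signal"))
)

summary.show()
spark.stop()
```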
Slide 20: Typical Big Data Pattern 10
Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager. This can be used for science by adding data staging phases, as in case 5A.
(Diagram: an orchestration layer (workflow) takes a specified analytics pipeline and runs Analytic-1, Analytic-2, and Analytic-3 (Visualize) over Hadoop, Spark, Giraph, Pig, etc., with data storage in HDFS and HBase.)
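As a small sketch of the orchestration idea, using only the Python standard library rather than any specific workflow manager, the code below stages data, runs two analytics in parallel, and then runs a final visualization step over their combined output. All stage names and logic are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def stage_data():
    """Data staging phase (as in case 5A): fetch or load the input batch."""
    return list(range(10))  # stand-in for staged records

def analytic_1(data):
    """First parallel transformation: sum of the records."""
    return sum(data)

def analytic_2(data):
    """Second parallel transformation: count of the records."""
    return len(data)

def analytic_3_visualize(results):
    """Final sequential step: 'visualize' by printing a summary."""
    print(f"sum={results['sum']}, count={results['count']}")

def run_pipeline():
    data = stage_data()                    # sequential staging phase
    with ThreadPoolExecutor() as pool:     # parallel analytics
        f1 = pool.submit(analytic_1, data)
        f2 = pool.submit(analytic_2, data)
        results = {"sum": f1.result(), "count": f2.result()}
    analytic_3_visualize(results)          # sequential visualization step

if __name__ == "__main__":
    run_pipeline()
```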
Slide 21: Volume 3, Use Cases and General Requirements: Version 2 Opportunities for Contribution
More use cases (roll up sleeves; budget an hour).
Soliciting greater application domain diversity: smart cars (Smart X), large-scale utility IoT, geolocation applications involving people, energy from discovery to generation, scientific studies involving human subjects at large scale, highly distributed use cases bridging multiple enterprises.
Compare different big data applications in terms of needed architecture and interfaces.
Compare simulations and Big Data (use exascale supercomputers to process big data).
Choose a domain and collect/analyze a set of related use cases; develop technology requirements for applications in that domain.
Feed lessons into version 3 of the template.
Slide 22: Volume 3, Use Cases and General Requirements: Possible Version 3 Topics
Identify gaps in use cases.
Develop plausible, semi-fictionalized use cases from industry reports, white papers, and academic project reports.
Identify important parameters for classifying systems.
Microservice use cases.
Map use cases to work in Vol. 4 (Security and Privacy), Vol. 6 (Big Data Reference Architecture), Vol. 7 (Standards), and Vol. 8 (Interfaces).
Container-oriented use cases.
Forensic and provenance-centric use cases.
Review fitness of the NBDRA to the use cases.